Human Data Science (HDS)

Agenda

14 March 2019
15:00 - 16:00
Sjoerd Groenmangebouw B1.09

Shiva Nadi: Text Classification and Named Entity Recognition for Online News Data

Thursday 14/03/2019 at 15:00 in room B1.09

The next MSDSlab meeting will be on Thursday, 14 March, and will be presented by Shiva Nadi of Utrecht University. She will give a brief overview of her work on text mining and present some of the challenges she is facing.

Abstract: Text Classification and Named Entity Recognition for Online News Data

News websites deliver information to millions of users every day. As information technology continues to develop, the amount of unstructured news data available to the social sciences keeps growing. Organizing this text and classifying it automatically remains a challenge for social science applications. Text classification assigns text to predefined categories, and automating the task with machine learning algorithms makes the process fast and efficient. This project investigates the classification of news text and proposes a model built on the Scikit-Learn Python package. Labeled data with predefined categories are fed to the machine learning algorithm for training. Because the text is noisy and high-dimensional, the model applies preprocessing steps to reduce the dimensionality of the text and extract features. The work also uses a Named Entity Recognition tool to extract related features from the text, such as the names of persons, organizations, and locations. During the testing phase, the algorithm is given unseen data and classifies them into categories based on the trained model.
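
The train-then-classify workflow described in the abstract can be illustrated with a small sketch. This is not the speaker's model: the abstract does not name the classifier or the preprocessing steps, so the example assumes a bag-of-words Naive Bayes classifier and uses only the Python standard library instead of Scikit-Learn to stay self-contained. The stop-word list, the function names (preprocess, train, classify), the headlines, and the categories are all hypothetical. Named Entity Recognition is not covered here.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; a real project would use a fuller one.
STOP_WORDS = {"the", "a", "an", "in", "of", "and", "to", "for", "on"}


def preprocess(text):
    """Lowercase, keep alphabetic tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def train(labeled_docs):
    """Count words per category from (text, category) pairs."""
    word_counts = defaultdict(Counter)   # category -> word frequencies
    doc_counts = Counter()               # category -> number of documents
    for text, category in labeled_docs:
        doc_counts[category] += 1
        word_counts[category].update(preprocess(text))
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocabulary


def classify(text, model):
    """Return the category with the highest smoothed log-probability."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for category in doc_counts:
        total_words = sum(word_counts[category].values())
        score = math.log(doc_counts[category] / total_docs)
        for token in preprocess(text):
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Laplace (add-one) smoothing avoids log(0) for unseen pairs.
            count = word_counts[category][token] + 1
            score += math.log(count / (total_words + len(vocabulary)))
        scores[category] = score
    return max(scores, key=scores.get)


# Toy training data with made-up headlines and categories.
TRAINING_DATA = [
    ("The team won the football match", "sports"),
    ("The striker scored a late goal", "sports"),
    ("Parliament passed the budget law", "politics"),
    ("The minister announced an election", "politics"),
]

model = train(TRAINING_DATA)
prediction = classify("A goal decided the football match", model)
print(prediction)
```

In this sketch, preprocessing shrinks the vocabulary before any counting happens, which mirrors the abstract's point about reducing dimensionality; the unseen headline is then scored against each category using only the word statistics gathered during training.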