Topic Modelling for News Article Categorization Using Latent Dirichlet Allocation: A Text Mining Approach
Abstract
An ever-increasing volume of news is published on the internet, making it impractical to categorize articles by theme manually and accurately. To address this challenge, this research investigates topic modelling as a means of classifying news articles into predefined categories, including politics, sports, technology, and entertainment. The corpus used for textual analysis comprises thousands of articles from different sources. Latent Dirichlet Allocation (LDA), an unsupervised learning algorithm, is employed to identify hidden topics in the articles, and the resulting topics are used to assign each article to the appropriate category. Preprocessing includes cleaning the text, splitting it into tokens, and transforming those tokens into vectors. Topic quality is assessed with coherence scores, and the contribution of vector representation methods such as TF-IDF and word embeddings to the model's performance is also discussed. The results suggest that topic modelling can classify articles accurately, but that it faces specific difficulties with articles that span several categories or whose content is unclear. The study shows that SERP analysis with the help of the unsupervised approach can improve indexing and facilitate navigation in news delivery services.
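As an illustration of the pipeline summarized in the abstract (cleaning, tokenization, and LDA topic inference), the following is a minimal sketch and not the implementation used in this study. It assumes a collapsed Gibbs sampler written with the Python standard library only; the function names (tokenize, fit_lda, dominant_topics), the stopword list, the four sample sentences, and the hyperparameters alpha and beta are all hypothetical choices made for the example. A real experiment on thousands of articles would more plausibly rely on an established library.

```python
import random
import re

# Illustrative stopword list (an assumption, not the list used in the study).
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "is", "for", "on", "with"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, and drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def fit_lda(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling on tokenized documents.

    Returns (doc_topic, topic_word, vocab), where doc_topic[d][k] counts the
    tokens of document d assigned to topic k, and topic_word[k][v] counts the
    assignments of vocabulary word v to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    vocab_size = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(num_topics)]
    topic_total = [0] * num_topics
    assignments = []

    # Start from a random topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)

    topics = range(num_topics)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                v = word_id[w]
                # Remove the token's current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][v] -= 1
                topic_total[k] -= 1
                # Unnormalized conditional probability of each topic.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][v] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in topics
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(topics, weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][v] += 1
                topic_total[k] += 1
    return doc_topic, topic_word, vocab

def dominant_topics(doc_topic):
    """Return the index of the most frequent topic for each document."""
    return [max(range(len(row)), key=row.__getitem__) for row in doc_topic]

# Made-up sample sentences, standing in for real news articles.
articles = [
    "The senate passed the election reform bill after a long debate",
    "The striker scored twice as the team won the league match",
    "Voters questioned the minister about the election campaign",
    "The coach praised the team after the final match of the season",
]
docs = [tokenize(a) for a in articles]
doc_topic, topic_word, vocab = fit_lda(docs, num_topics=2)
labels = dominant_topics(doc_topic)
for article, label in zip(articles, labels):
    print(label, article[:40])
```

The topic indices returned by such a sampler carry no names. Mapping a topic to a category such as politics or sports requires inspecting its most frequent words in topic_word, and on a corpus this small the grouping may vary with the random seed.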