Topic Modelling for News Article Categorization Using Latent Dirichlet Allocation: A Text Mining Approach

Authors

  • Manish Avishkar Dhatrak
  • Samarth Bharat Jadhav
  • Pawan Dilip Kudake
  • A. V. Brahmane

Abstract

An ever-increasing amount of news is produced on the internet, making it impractical to categorize articles by theme manually and accurately. To address this challenge, this research investigates topic modelling for classifying news articles into pre-defined categories, including politics, sports, technology, and entertainment. The dataset used for textual analysis comprises thousands of articles from different sources. Latent Dirichlet Allocation (LDA), an unsupervised learning algorithm, is employed to identify hidden topics in the articles. The topics obtained are then used to assign each article to the appropriate category. Preprocessing steps include cleaning the text, breaking it into tokens, and transforming these tokens into vectors. The quality of the topics is assessed using coherence scores, and the contribution of vector representation methods such as TF-IDF or word embeddings to the model's performance is also discussed. The results suggest that topic modelling can accurately classify articles, but that it poses specific difficulties for articles that contain topics from different categories or whose content is unclear. The study shows that SERP analysis with the help of the unsupervised approach can improve indexing and facilitate navigation in news delivery services.
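The pipeline outlined in the abstract (clean, tokenize, fit LDA, read off each article's dominant topic) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the library, hyperparameters, or dataset used. The sample articles, the stopword list, the function names `tokenize` and `fit_lda`, and the values of `alpha`, `beta`, and `iterations` are all assumptions chosen for illustration. The sketch uses collapsed Gibbs sampling, one common way to fit LDA, written with the Python standard library only.

```python
import random
import re

# Hypothetical stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"the", "a", "an", "in", "of", "and", "to", "for", "on", "is"}

def tokenize(text):
    """Lowercase the text, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

def fit_lda(docs, num_topics, alpha=0.1, beta=0.01, iterations=200, seed=0):
    """Fit LDA by collapsed Gibbs sampling over tokenized documents.

    Returns (theta, top_words): per-document topic proportions and the
    five highest-count words of each topic.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    word_id = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    doc_topic = [[0] * num_topics for _ in docs]          # counts per (doc, topic)
    topic_word = [[0] * V for _ in range(num_topics)]     # counts per (topic, word)
    topic_total = [0] * num_topics                        # tokens per topic
    assignments = []
    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(num_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][word_id[w]] += 1
            topic_total[k] += 1
        assignments.append(z_d)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                wid, k = word_id[w], assignments[d][i]
                # Remove the token's current assignment from the counts.
                doc_topic[d][k] -= 1
                topic_word[k][wid] -= 1
                topic_total[k] -= 1
                # Conditional probability of each topic for this token.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][wid] + beta) / (topic_total[t] + V * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][wid] += 1
                topic_total[k] += 1
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + num_topics * alpha)
         for t in range(num_topics)]
        for d, doc in enumerate(docs)
    ]
    top_words = [
        [vocab[i] for i in sorted(range(V), key=lambda i: -topic_word[t][i])[:5]]
        for t in range(num_topics)
    ]
    return theta, top_words

# Made-up sample articles, for illustration only.
articles = [
    "The team won the final match and the coach praised the players",
    "Parliament passed the budget after the minister defended the policy",
    "The striker scored a goal in the match for the team",
    "The minister announced a new policy in parliament",
]
docs = [tokenize(a) for a in articles]
theta, top_words = fit_lda(docs, num_topics=2)
# Dominant topic index for each article.
labels = [max(range(2), key=row.__getitem__) for row in theta]
```

The topics returned are unlabeled indices. Mapping them to named categories such as politics or sports is a separate step, typically done by inspecting `top_words`, and coherence scoring is not covered by this sketch.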

Published

2024-10-09

How to Cite

Avishkar Dhatrak, M., Bharat Jadhav, S., Dilip Kudake, P., & Brahmane, A. V. (2024). Topic Modelling for News Article Categorization Using Latent Dirichlet Allocation: A Text Mining Approach. Journal of Knowledge in Data Science and Information Management, 1(3), 11–24. Retrieved from https://matjournals.net/engineering/index.php/JoKDSIM/article/view/1003

Section

Articles