Transformer-Based Topic Modeling for Research Literature
DOI: https://doi.org/10.46610/JoSCCI.2025.v02i03.004

Keywords: BERTopic, Document categorization, HDBSCAN, Sentence-transformers, Topic modeling, Transformer embeddings, UMAP

Abstract
The exponential growth of the scientific literature demands scalable techniques to organize, condense, and evaluate textual data. This work presents a reproducible BERTopic-based pipeline that combines transformer-based sentence embeddings, UMAP for dimensionality reduction, HDBSCAN for density-aware clustering, and class-based TF-IDF for topic labelling. Evaluated on a multi-domain corpus of 10,462 arXiv abstracts spanning 2010 to 2023, the pipeline produces more coherent topics, stronger clustering metrics, and more interpretable labels than traditional baselines such as LDA and TF-IDF+KMeans. We provide rich visualizations (heatmap, dendrogram, distribution charts), quantify the improvements via silhouette and coherence scores, and discuss design decisions and constraints for real-world deployment.
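For concreteness, the sketch below shows how these components typically compose in the BERTopic API. It is a minimal illustration, not the paper's exact configuration: the embedding model (all-MiniLM-L6-v2), the UMAP and HDBSCAN hyperparameters, and the use of scikit-learn's 20 Newsgroups corpus as a stand-in for the arXiv abstracts are all assumptions.

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import silhouette_score
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

# Stand-in corpus; the paper instead uses 10,462 arXiv abstracts (2010-2023).
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data[:2000]

# 1. Transformer-based sentence embeddings (model choice is an assumption).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(docs, show_progress_bar=True)

# 2. UMAP reduces the embedding dimensionality before clustering.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine", random_state=42)

# 3. HDBSCAN performs density-aware clustering; documents it cannot place get label -1.
hdbscan_model = HDBSCAN(min_cluster_size=25, metric="euclidean", prediction_data=True)

# 4. BERTopic chains the stages and labels each cluster via class-based TF-IDF.
topic_model = BERTopic(embedding_model=embedder, umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs, embeddings)
print(topic_model.get_topic_info().head(10))  # top topics with their c-TF-IDF labels

# Clustering quality: silhouette score on the reduced space, excluding outliers.
labels = np.array(topics)
mask = labels != -1
print("silhouette:", silhouette_score(umap_model.embedding_[mask], labels[mask]))

With a real deployment, min_cluster_size would be tuned to the corpus size, and a topic-coherence measure would complement the silhouette score, mirroring the two metrics reported in the abstract.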
References
S. Suh, J. Choo, J. Lee, and C. K. Reddy, “L-EnsNMF: Boosted Local Topic Discovery via Ensemble of Nonnegative Matrix Factorization,” in 2016 IEEE 16th International Conference on Data Mining (ICDM), Dec. 2016, doi: https://doi.org/10.1109/icdm.2016.0059
D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, Mar. 2003, doi: https://doi.org/10.5555/944919.944937
S.-W. Kim and J.-M. Gil, “Research paper classification systems based on TF-IDF and LDA schemes,” Human-centric Computing and Information Sciences, vol. 9, no. 1, Aug. 2019, doi: https://doi.org/10.1186/s13673-019-0192-7
L. Williams, E. Anthi, L. Arman, and P. Burnap, “Topic Modelling: Going beyond Token Outputs,” Big Data and Cognitive Computing, vol. 8, no. 5, Art. no. 44, Apr. 2024, doi: https://doi.org/10.3390/bdcc8050044
T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” arXiv preprint arXiv:1301.3781, Sep. 2013. Available: https://arxiv.org/abs/1301.3781
J. Pennington, R. Socher, and C. Manning, “GloVe: Global Vectors for Word Representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014. Available: https://aclanthology.org/D14-1162/
P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching Word Vectors with Subword Information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, Dec. 2017, doi: https://doi.org/10.1162/tacl_a_00051
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), vol. 1, pp. 4171–4186, 2019, doi: https://doi.org/10.18653/v1/n19-1423
I. Beltagy, K. Lo, and A. Cohan, “SciBERT: A Pretrained Language Model for Scientific Text,” arXiv preprint arXiv:1903.10676, 2019. Available: https://arxiv.org/abs/1903.10676
N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” arXiv preprint arXiv:1908.10084, 2019. Available: https://arxiv.org/abs/1908.10084
M. Grootendorst, “BERTopic: Neural topic modeling with a class-based TF-IDF procedure,” arXiv preprint arXiv:2203.05794, Mar. 2022, doi: https://doi.org/10.48550/arXiv.2203.05794
A. P. Bhopale and A. Tiwari, “Transformer based contextual text representation framework for intelligent information retrieval,” Expert Systems with Applications, vol. 238, Art. no. 121629, Mar. 2024, doi: https://doi.org/10.1016/j.eswa.2023.121629
C.-H. Chang, J.-T. Tsai, Y.-H. Tsai, and S.-Y. Hwang, “LITA: An Efficient LLM-Assisted Iterative Topic Augmentation Framework,” Lecture Notes in Computer Science, pp. 449–460, 2025, doi: https://doi.org/10.1007/978-981-96-8170-9_35
D. Angelov and D. Inkpen, “Topic Modelling: Contextual Token Embeddings Are All You Need,” in Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 13528–13539, 2024, doi: https://doi.org/10.18653/v1/2024.findings-emnlp.790
E. Zosa and L. Pivovarova, “Multilingual and Multimodal Topic Modelling with Pretrained Embeddings,” arXiv preprint arXiv:2211.08057, 2022. Available: https://arxiv.org/abs/2211.08057
A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention Is All You Need,” arXiv preprint arXiv:1706.03762, 2017. Available: https://arxiv.org/abs/1706.03762
L. McInnes, J. Healy, and J. Melville, “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction,” arXiv preprint arXiv:1802.03426, 2018. Available: https://arxiv.org/abs/1802.03426
L. McInnes, J. Healy, and S. Astels, “hdbscan: Hierarchical density based clustering,” The Journal of Open Source Software, vol. 2, no. 11, p. 205, Mar. 2017, doi: https://doi.org/10.21105/joss.00205
S. L’Yi, B. Ko, D. Shin, Y. Min Cho, et al., “XCluSim: a visual analytics tool for interactively comparing multiple clustering results of bioinformatics data,” BMC Bioinformatics, vol. 16, no. S11, Aug. 2015, doi: https://doi.org/10.1186/1471-2105-16-s11-s5
Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, “Hierarchical Attention Networks for Document Classification,” Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, doi: https://doi.org/10.18653/v1/n16-1174