Transformer-Based Topic Modeling for Research Literature
DOI: https://doi.org/10.46610/JoSCCI.2025.v02i03.004

Keywords: BERTopic, Document categorization, HDBSCAN, Sentence-transformers, Topic modeling, Transformer embeddings, UMAP

Abstract
The exponential growth of the scientific literature demands scalable techniques to organize, condense, and evaluate textual data. This work presents a reproducible BERTopic-based pipeline that combines transformer-based sentence embeddings, UMAP for dimensionality reduction, HDBSCAN for density-aware clustering, and class-based TF-IDF for topic labelling. Evaluated on a multi-domain corpus of 10,462 arXiv abstracts spanning 2010 to 2023, the pipeline produces more coherent topics, stronger clustering metrics, and more interpretable labels than traditional baselines such as LDA and TF-IDF+KMeans. We provide rich visualizations (heatmap, dendrogram, distribution charts), quantify the improvements via silhouette and coherence scores, and discuss design decisions and constraints for real-world deployment.
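For concreteness, the sketch below shows how these components typically compose in the BERTopic API. It is a minimal illustration, not the paper's exact configuration: the embedding model (all-MiniLM-L6-v2), the UMAP and HDBSCAN hyperparameters, and the use of scikit-learn's 20 Newsgroups corpus as a stand-in for the arXiv abstracts are all assumptions.

import numpy as np
from sklearn.datasets import fetch_20newsgroups
from sklearn.metrics import silhouette_score
from sentence_transformers import SentenceTransformer
from umap import UMAP
from hdbscan import HDBSCAN
from bertopic import BERTopic

# Stand-in corpus; the paper instead uses 10,462 arXiv abstracts (2010-2023).
docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data[:2000]

# 1. Transformer-based sentence embeddings (model choice is an assumption).
embedder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = embedder.encode(docs, show_progress_bar=True)

# 2. UMAP reduces the embedding dimensionality before clustering.
umap_model = UMAP(n_neighbors=15, n_components=5, min_dist=0.0, metric="cosine", random_state=42)

# 3. HDBSCAN performs density-aware clustering; documents it cannot place get label -1.
hdbscan_model = HDBSCAN(min_cluster_size=25, metric="euclidean", prediction_data=True)

# 4. BERTopic chains the stages and labels each cluster via class-based TF-IDF.
topic_model = BERTopic(embedding_model=embedder, umap_model=umap_model, hdbscan_model=hdbscan_model)
topics, probs = topic_model.fit_transform(docs, embeddings)
print(topic_model.get_topic_info().head(10))  # top topics with their c-TF-IDF labels

# Clustering quality: silhouette score on the reduced space, excluding outliers.
labels = np.array(topics)
mask = labels != -1
print("silhouette:", silhouette_score(umap_model.embedding_[mask], labels[mask]))

With a real deployment, min_cluster_size would be tuned to the corpus size, and a topic-coherence measure would complement the silhouette score, mirroring the two metrics reported in the abstract.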
References
S. Suh, J. Choo, J. Lee, and C. K. Reddy, “L-EnsNMF: Boosted Local Topic Discovery via Ensemble of Nonnegative Matrix Factorization,” in 2016 IEEE 16th International Conference on Data Mining (ICDM), Dec. 2016, doi: https://doi.org/10.1109/icdm.2016.0059
D. M. Blei, A. Y. Ng, and M. I. Jordan, “Latent Dirichlet allocation,” Journal of Machine Learning Research, vol. 3, pp. 993–1022, Mar. 2003, doi: https://doi.org/10.5555/944919.944937
S.-W. Kim and J.-M. Gil, “Research paper classification systems based on TF-IDF and LDA schemes,” Human-centric Computing and Information Sciences, vol. 9, no. 1, Aug. 2019, doi: https://doi.org/10.1186/s13673-019-0192-7
L. Williams, E. Anthi, L. Arman, and P. Burnap, “Topic Modelling: Going beyond Token Outputs,” Big Data and Cognitive Computing, vol. 8, no. 5, Art. no. 44, Apr. 2024, doi: https://doi.org/10.3390/bdcc8050044
T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient Estimation of Word Representations in Vector Space,” arXiv preprint arXiv:1301.3781, Sep. 2013. Available: https://arxiv.org/abs/1301.3781
J. Pennington, R. Socher, and C. Manning, “GloVe: Global Vectors for Word Representation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014. Available: https://aclanthology.org/D14-1162/
P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching Word Vectors with Subword Information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, Dec. 2017, doi: https://doi.org/10.1162/tacl_a_00051
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), vol. 1, pp. 4171–4186, 2019, doi: https://doi.org/10.18653/v1/n19-1423
I. Beltagy, K. Lo, and A. Cohan, “SciBERT: A Pretrained Language Model for Scientific Text,” arXiv preprint arXiv:1903.10676, 2019. Available: https://arxiv.org/abs/1903.10676
N. Reimers and I. Gurevych, “Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks,” arXiv preprint arXiv:1908.10084, 2019. Available: https://arxiv.org/abs/1908.10084
M. Grootendorst, “BERTopic: Neural topic modeling with a class-based TF-IDF procedure,” arXiv preprint arXiv:2203.05794, Mar. 2022, doi: https://doi.org/10.48550/arXiv.2203.05794
A. P. Bhopale and A. Tiwari, “Transformer based contextual text representation framework for intelligent information retrieval,” Expert Systems with Applications, vol. 238, Art. no. 121629, Mar. 2024, doi: https://doi.org/10.1016/j.eswa.2023.121629
C.-H. Chang, J.-T. Tsai, Y.-H. Tsai, and S.-Y. Hwang, “LITA: An Efficient LLM-Assisted Iterative Topic Augmentation Framework,” Lecture Notes in Computer Science, pp. 449–460, 2025, doi: https://doi.org/10.1007/978-981-96-8170-9_35
D. Angelov and D. Inkpen, “Topic Modelling: Contextual Token Embeddings Are All You Need,” in Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 13528–13539, 2024, doi: https://doi.org/10.18653/v1/2024.findings-emnlp.790
E. Zosa and L. Pivovarova, “Multilingual and Multimodal Topic Modelling with Pretrained Embeddings,” arXiv preprint arXiv:2211.08057, 2022. Available: https://arxiv.org/abs/2211.08057
A. Vaswani, N. Shazeer, N. Parmar, et al., “Attention Is All You Need,” arXiv preprint arXiv:1706.03762, 2017. Available: https://arxiv.org/abs/1706.03762
L. McInnes, J. Healy, and J. Melville, “UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction,” arXiv preprint arXiv:1802.03426, 2018. Available: https://arxiv.org/abs/1802.03426
L. McInnes, J. Healy, and S. Astels, “hdbscan: Hierarchical density based clustering,” The Journal of Open Source Software, vol. 2, no. 11, p. 205, Mar. 2017, doi: https://doi.org/10.21105/joss.00205
S. L’Yi, B. Ko, D. Shin, Y. Min Cho, et al., “XCluSim: a visual analytics tool for interactively comparing multiple clustering results of bioinformatics data,” BMC Bioinformatics, vol. 16, no. S11, Aug. 2015, doi: https://doi.org/10.1186/1471-2105-16-s11-s5
Z. Yang, D. Yang, C. Dyer, X. He, A. Smola, and E. Hovy, “Hierarchical Attention Networks for Document Classification,” Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2016, doi: https://doi.org/10.18653/v1/n16-1174