Comprehensive Study on Digitization of Handwritten Documents using OCR and LLM

Authors

  • Priya Nandihal
  • John Charles JT
  • Alan Albuquerque
  • Akhil Pendyala
  • Chathur BR

Keywords:

Convolutional Neural Networks (CNN), Digitization, Document archiving, Historical records, Image processing, Large Language Models (LLMs), Optical Character Recognition (OCR), Text restoration

Abstract

Digitizing and preserving historical documents is a complex yet necessary task, often hindered by document degradation, fragmented content, and missing text. This study introduces a framework that combines image processing, Optical Character Recognition (OCR), and Large Language Models (LLMs) to address these issues. First, we apply image-processing methods, such as Convolutional Neural Network (CNN) based denoising, to degraded pages. The enhanced images yield significantly more accurate text extraction, making retrieval feasible even for material that would otherwise appear illegible and unrecoverable. Building on the extracted text, we use LLMs such as GPT and BERT, which are effective at reconstructing damaged passages and filling in missing content, so that the restored text is coherent and consistent with the original intent. This interdisciplinary approach combines manual restoration with automated methods, making the preservation of historical records easier and more consistent.

Our findings bear directly on historical research methodology, the preservation of cultural heritage, and modern archiving systems. By transforming fragmented physical records into accessible digital repositories, this work highlights the critical role technology plays in preserving historical records for future generations, and it points toward broader access to and engagement with our shared cultural heritage.
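
As a rough illustration of the three-stage flow described in the abstract (image enhancement, then OCR, then LLM-based restoration), the following Python sketch chains the stages as interchangeable functions. It is not the authors' implementation. The names `median_denoise` and `digitize` are hypothetical, and a plain 3x3 median filter stands in for the CNN denoiser purely so that the example runs without external libraries; the OCR and restoration stages are passed in as callables because the abstract does not specify them.

```python
from statistics import median
from typing import Callable, List

# Grayscale image as rows of pixel values, 0 (black) to 255 (white).
Image = List[List[int]]


def median_denoise(image: Image) -> Image:
    """Stand-in for the CNN denoiser (assumption): a 3x3 median filter.

    Pixels outside the image are handled by clamping to the nearest border.
    """
    height, width = len(image), len(image[0])
    result: Image = []
    for y in range(height):
        row = []
        for x in range(width):
            window = [
                image[min(max(y + dy, 0), height - 1)][min(max(x + dx, 0), width - 1)]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            ]
            row.append(int(median(window)))
        result.append(row)
    return result


def digitize(
    image: Image,
    denoise: Callable[[Image], Image],
    recognize: Callable[[Image], str],
    restore: Callable[[str], str],
) -> str:
    """Run the stages in order: enhance the image, extract text, restore gaps."""
    return restore(recognize(denoise(image)))
```

In practice, `recognize` would wrap an OCR engine and `restore` would wrap a call to a language model; keeping them as parameters lets each stage be swapped or evaluated independently.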

Published

2025-01-30

How to Cite

Priya Nandihal, Charles JT, J., Alan Albuquerque, Akhil Pendyala, & Chathur BR. (2025). Comprehensive Study on Digitization of Handwritten Documents using OCR and LLM. Journal of Information Technology and Sciences, 11(1), 17–23. Retrieved from https://matjournals.net/engineering/index.php/JOITS/article/view/1367
