Personally Identifiable Information Detection Using Natural Language Processing
Keywords:
DeBERTa v3, Hyperparameter, Large Language Model (LLM), Natural Language Processing (NLP), Personally Identifiable Information (PII)
Abstract
Data is now one of the most valuable assets in the world, and its value grows as technology advances. The need to protect personal information grows with it: when disclosure is not required, personal information should be kept private to prevent problems such as identity theft and financial loss. This report discusses an approach to refining the detection of Personally Identifiable Information (PII) in diverse text data using advanced Natural Language Processing (NLP) and Transformer models implemented in PyTorch. Beyond this primary objective, PII detection can also help organizations ensure compliance with data protection regulations. The methodology involves developing large language models such as DeBERTa v3 to distinguish between PII and non-PII in text data while remaining flexible enough to meet changing regulatory needs. Techniques such as hyperparameter tuning are applied to optimize model performance. The primary aim of this project is to contribute to data privacy protection by providing a complete and flexible solution for PII detection in diverse textual datasets.
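One common way to frame the distinction between PII and non-PII is token-level labeling, where each token receives a tag and contiguous tagged tokens are merged into PII spans. The sketch below illustrates only that merging step in plain Python. It assumes a BIO tagging scheme, which the abstract does not specify, and the function name `merge_pii_spans`, the label names `NAME` and `EMAIL`, and the example sentence are all hypothetical. In practice the labels would come from a fine-tuned model rather than being written by hand.

```python
from typing import List, Tuple


def merge_pii_spans(tokens: List[str], labels: List[str]) -> List[Tuple[str, str]]:
    """Merge BIO-tagged tokens into (entity_type, text) spans.

    Assumption: labels follow the BIO convention ("B-X" opens a span of
    type X, "I-X" continues it, "O" marks a non-PII token).
    """
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    for token, label in zip(tokens, labels):
        if label.startswith("B-"):
            # A new span starts; close any span that is still open.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type = label[2:]
            current_tokens = [token]
        elif label.startswith("I-") and current_type == label[2:]:
            # Continuation of the currently open span.
            current_tokens.append(token)
        else:
            # "O", or an "I-" tag that does not match the open span:
            # close the open span (if any) and treat the token as non-PII.
            if current_tokens:
                spans.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []

    # Close a span that runs to the end of the sequence.
    if current_tokens:
        spans.append((current_type, " ".join(current_tokens)))
    return spans


# Hypothetical, hand-labeled example (not model output).
tokens = ["Contact", "Jane", "Doe", "at", "jane@example.com", "."]
labels = ["O", "B-NAME", "I-NAME", "O", "B-EMAIL", "O"]
spans = merge_pii_spans(tokens, labels)
print(spans)
```

Tokens tagged `O` are treated as non-PII and never appear in the output, so downstream steps such as redaction only need to handle the returned spans.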