Evaluating AI Response Quality Using Prompt and Answer Context with Machine Learning
Keywords:
AI evaluation, CatBoost, Machine learning, Natural language processing (NLP), Prompt engineering
Abstract
The growing adoption of large language models (LLMs) has made the evaluation of AI-generated responses an important research area. This study assesses the quality of AI outputs with a machine learning-based approach. The dataset consists of prompt-response pairs; only English-language prompts were retained to keep the data consistent. Because the dataset contained no predetermined quality labels, a combination of AI-based evaluation and rule-based methods was used to derive the quality attributes used for classification: clarity, relevance, and accuracy. So that each response is judged in the context of its prompt, the prompt and response were merged into a single textual feature. Several machine learning models (SVM, Naive Bayes, Decision Tree, KNN, Gradient Boosting, and CatBoost) were trained to classify response quality and were compared on accuracy, precision, recall, and F1-score. The proposed system supports more systematic evaluation of AI responses and can help researchers, developers, and organizations build more reliable AI applications.
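The two mechanical steps named in the abstract, merging each prompt with its response and scoring a classifier on accuracy, precision, recall, and F1-score, can be illustrated with a minimal sketch. It is not the study's implementation: the function names, the `[SEP]` separator, the two-class labels, and the choice of macro averaging are all assumptions made for illustration, and the classifier itself is left out.

```python
# Minimal sketch, standard library only. All names below are hypothetical.

SEP = " [SEP] "  # assumed separator; the paper does not specify one


def merge_pair(prompt, response, sep=SEP):
    """Join a prompt and its response into a single text feature."""
    return f"{prompt.strip()}{sep}{response.strip()}"


def macro_scores(y_true, y_pred):
    """Return accuracy plus macro-averaged precision, recall, and F1."""
    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)

    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    n = len(labels)
    return {
        "accuracy": accuracy,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Toy usage: labels and predictions are made up, not taken from the dataset.
text = merge_pair("What is 2+2?", "The answer is 4.")
scores = macro_scores(
    ["good", "good", "poor", "poor"],
    ["good", "poor", "poor", "poor"],
)
```

In a full pipeline, the merged strings would be vectorized and passed to each candidate classifier, and the same four scores would be computed on a held-out split so the models can be compared on equal terms.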