Journal of Information Resources Management ›› 2025, Vol. 15 ›› Issue (3): 108-121. doi: 10.13365/j.jirm.2025.03.108


Evaluation of Prompt Fine-Tuning Data Efficacy in Large Language Models: A Focus on Data Quality

Liu Xiaohui, Ran Congjing, Liu Xingshen, Li Wang

  1. School of Information Management, Wuhan University, Wuhan 430072, China
  • Online: 2025-05-26  Published: 2025-06-16
  • About author: Liu Xiaohui, Ph.D. candidate, research interests include information resource management, data elements, and informetrics; Ran Congjing (corresponding author), professor, Ph.D., doctoral supervisor, research interests include intellectual property and big data governance, Email: rancongjing@whu.edu.cn; Liu Xingshen, Ph.D. candidate, research interests include data science, natural language processing, and intellectual property; Li Wang, Ph.D. candidate, research interests include data science and intellectual property.
  • Supported by:
    This paper is one of the research outcomes of the Major Project of the National Social Science Fund of China "Research on the Construction of a Security System for Big Data Sovereignty" (21&ZD169) and the Youth Project of the National Social Science Fund of China "Research on the Intelligent Discrimination and Recommendation of University Patent Quality Based on Knowledge Units" (23CTQ028).

Abstract: Breakthroughs in generative artificial intelligence have given rise to phenomenally influential large language models (LLMs) such as ChatGPT, posing unprecedented challenges to traditional methods of data utility assessment. In response, this study evaluates the utility of instruction-tuning data for LLMs by establishing a multi-dimensional assessment framework that integrates three key dimensions (complexity, usability, and diversity) and proposes a corresponding data utility evaluation function. Experiments on several publicly available instruction-tuning datasets show that the proposed approach measures data quality reasonably and effectively, and that the reasoning loss of LLMs fine-tuned on different datasets is highly consistent with the proposed evaluation metrics. This work is the first to employ reasoning loss directly as a measure of the quality of LLM instruction-tuning data, and it introduces the three dimensions of complexity, usability, and diversity to characterize "high-quality data". By proposing new quantitative metrics, this study offers theoretical and practical guidance for improving the quality of instruction-tuning data for large language models and for related research applications.
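The abstract names three dimensions (complexity, usability, and diversity) that feed a data utility evaluation function, but it does not give the functional form here. As a hedged illustration only, the sketch below aggregates simple per-example proxies for the three dimensions into a single weighted score; every proxy, helper name, and weight in it is a hypothetical assumption of this sketch, not the paper's actual method.

```python
"""Illustrative sketch of a three-dimension data utility score.

The paper's exact formulas are not reproduced in this abstract, so the
proxies below (length-based complexity, lexical-overlap usability,
type-token-ratio diversity) and the equal weights are assumptions.
"""

import math


def complexity_score(instruction: str) -> float:
    # Hypothetical proxy: instruction length in tokens on a log scale.
    return math.log1p(len(instruction.split()))


def usability_score(instruction: str, response: str) -> float:
    # Hypothetical proxy: Jaccard overlap between instruction and response
    # vocabularies; the paper may well use a model-based measure instead.
    inst = set(instruction.lower().split())
    resp = set(response.lower().split())
    if not inst or not resp:
        return 0.0
    return len(inst & resp) / len(inst | resp)


def diversity_score(dataset: list[tuple[str, str]]) -> float:
    # Hypothetical dataset-level proxy: type-token ratio over instructions.
    tokens = [t for inst, _ in dataset for t in inst.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def utility(dataset: list[tuple[str, str]], w=(1 / 3, 1 / 3, 1 / 3)) -> float:
    # Hypothetical utility function: weighted sum of mean per-example
    # complexity and usability plus dataset-level diversity.
    n = len(dataset)
    c = sum(complexity_score(i) for i, _ in dataset) / n
    u = sum(usability_score(i, r) for i, r in dataset) / n
    d = diversity_score(dataset)
    return w[0] * c + w[1] * u + w[2] * d


if __name__ == "__main__":
    demo = [
        ("Summarize the following paragraph about data governance.",
         "The paragraph argues that data governance requires clear ownership."),
        ("Translate 'data quality' into French.",
         "qualite des donnees"),
    ]
    print(f"utility = {utility(demo):.4f}")
```

In the setting the abstract describes, such a score would then be compared, for each candidate dataset, against the reasoning loss of an LLM fine-tuned on that dataset, for example via rank correlation, to check the consistency the authors report.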

Key words: Data elements, Data utility, Data quality, Large language models, Prompt fine-tuning data
