信息资源管理学报 ›› 2025, Vol. 15 ›› Issue (3): 108-121.doi: 10.13365/j.jirm.2025.03.108

• 研究论文 •

大语言模型场景下指令微调数据的效用评估研究——基于数据质量维度

刘晓慧 冉从敬 刘省身 李旺   

  1. 武汉大学信息管理学院,武汉,430072
  • 出版日期:2025-05-26 发布日期:2025-06-16
  • 作者简介:刘晓慧,博士研究生,研究方向为信息资源管理、数据要素、信息计量;冉从敬(通讯作者),教授,博士,博士生导师,研究方向为知识产权、大数据治理,Email:rancongjing@whu.edu.cn;刘省身,博士研究生,研究方向为数据科学、自然语言处理、知识产权;李旺,博士研究生,研究方向为数据科学、知识产权。
  • 基金资助:
    本文系国家社会科学基金重大项目“大数据主权安全保障体系建设研究”(21&ZD169),国家社会科学基金青年项目“基于知识元的高校专利质量智能判别及其推荐研究”(23CTQ028)的研究成果之一。

Evaluation of Instruction Fine-Tuning Data Utility in Large Language Models: A Focus on Data Quality

Liu Xiaohui, Ran Congjing, Liu Xingshen, Li Wang

  1. School of Information Management, Wuhan University, Wuhan, 430072
  • Online:2025-05-26 Published:2025-06-16
  • About author:Liu Xiaohui, Ph.D. candidate, research interests include information resource management, data elements, and informetrics; Ran Congjing (corresponding author), professor, Ph.D., doctoral supervisor, research interests include intellectual property and big data governance, Email: rancongjing@whu.edu.cn; Liu Xingshen, Ph.D. candidate, research interests include data science, natural language processing, and intellectual property; Li Wang, Ph.D. candidate, research interests include data science and intellectual property.
  • Supported by:
    This paper is one of the research outcomes of the Major Project of the National Social Science Fund of China "Research on the Construction of a Security System for Big Data Sovereignty" (21&ZD169) and the Youth Project of the National Social Science Fund of China "Research on the Intelligent Discrimination and Recommendation of University Patent Quality Based on Knowledge Units" (23CTQ028).

摘要: 生成式人工智能的突破性进展催生了ChatGPT等现象级大语言模型,对传统数据效用评估方法提出了全新挑战。为此,本研究针对大语言模型的指令微调数据效用评估问题,构建了一种融合复杂性、可用性和多样性三大维度的多维评估方法,并据此设计了全新的数据效用评估函数。基于7B中等参数规模模型的实验表明,该评估方法在多个公共指令微调数据集上能够合理、有效地衡量数据质量,且在不同数据集上微调的大语言模型的推理损失与所提评估指标呈现出高度一致性。本研究首次将推理损失直接用于衡量语言模型指令微调数据的质量,并针对大语言模型指令微调的特点,引入复杂性、可用性和多样性三大关键维度来界定“好数据”的特征。通过提出全新的定量度量指标,为进一步提升大语言模型指令微调数据质量及相关研究应用提供了重要的理论支撑与实践参考。

关键词: 数据要素, 数据效用, 数据质量, 大语言模型, 指令微调数据

Abstract: Breakthroughs in generative artificial intelligence have produced landmark large language models (LLMs) such as ChatGPT, posing unprecedented challenges to traditional data utility evaluation methods. In response, this study addresses the evaluation of instruction fine-tuning data utility for LLMs by constructing a multi-dimensional assessment method that integrates three key dimensions (complexity, usability, and diversity) and, on that basis, proposes a novel data utility evaluation function. Experiments with a mid-scale 7B-parameter model show that the proposed method measures data quality reasonably and effectively across multiple publicly available instruction fine-tuning datasets, and that the inference loss of LLMs fine-tuned on different datasets is highly consistent with the proposed evaluation metrics. This work is the first to use inference loss directly to measure the quality of instruction fine-tuning data for language models and, tailored to the characteristics of LLM instruction fine-tuning, introduces complexity, usability, and diversity as the three key dimensions that characterize "good data". The proposed quantitative metrics provide important theoretical support and practical reference for further improving the quality of instruction fine-tuning data for LLMs and for related research and applications.
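The abstract describes combining complexity, usability, and diversity scores into a data utility function and using inference loss as the quality signal. The sketch below is only an illustration under stated assumptions, not the paper's actual formulas, which are not given here: the function names `data_utility` and `mean_inference_loss` are hypothetical, the utility function is assumed to be an equal-weight linear combination of per-sample scores in [0, 1], and inference loss is taken as mean token-level cross-entropy.

```python
import math

# Hypothetical sketch: a weighted combination of the three dimension scores
# (complexity, usability, diversity), each assumed normalized to [0, 1].
# The paper's actual evaluation function may differ in form and weights.
def data_utility(complexity, usability, diversity, weights=(1/3, 1/3, 1/3)):
    w_c, w_u, w_d = weights
    return w_c * complexity + w_u * usability + w_d * diversity

# Mean inference loss as token-level cross-entropy on held-out reference
# tokens: a lower loss after fine-tuning on a dataset is read as evidence
# of higher-quality instruction data.
def mean_inference_loss(token_probs):
    """token_probs: probabilities the fine-tuned model assigns to each
    reference token of the held-out response."""
    return sum(-math.log(p) for p in token_probs) / len(token_probs)

print(round(data_utility(0.8, 0.9, 0.6), 4))       # equal-weight average
print(round(mean_inference_loss([0.5, 0.25]), 4))  # mean of -ln(0.5), -ln(0.25)
```

In this toy setup, a dataset's score would be the average `data_utility` over its samples, and the consistency claim in the abstract corresponds to higher-utility datasets yielding lower `mean_inference_loss` after fine-tuning.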

Key words: Data elements, Data utility, Data quality, Large language models, Instruction fine-tuning data
