信息资源管理学报 (Journal of Information Resources Management), 2025, Vol. 15, Issue (4): 129-143. DOI: 10.13365/j.jirm.2025.04.129

• Research Articles •

Research on Collaborative Training of Small and Large Language Models for Scientific Entity Extraction with Few-Shot Data

Liang Zhu 1,2, Liu Yinpeng 1,2, Shi Xiang 1,2, Huang Yong 1,2, Cheng Qikai 1,2

  1. School of Information Management, Wuhan University, Wuhan, 430072;
  2. Institute of Intelligence and Innovation Governance, Wuhan University, Wuhan, 430072
  • Online: 2025-07-26  Published: 2025-08-31
  • About the authors: Liang Zhu, Ph.D. candidate, whose research interests include information retrieval and data mining; Liu Yinpeng, Ph.D. candidate, whose research interests include text mining and document intelligence; Shi Xiang, Ph.D. candidate, whose research interests include text mining and document intelligence; Huang Yong, associate professor, Ph.D., whose research interests include text mining and scientometrics; Cheng Qikai (corresponding author), associate professor, Ph.D., whose research interests include text mining and information retrieval, Email: chengqikai@whu.edu.cn.
  • Supported by:
    This work is supported by the National Science and Technology Major Project for New-Generation Artificial Intelligence "Key Technologies Research and Development for a High-Reliability Sci-Tech Literature Intelligent Engine and Its Demonstration Application" (2023ZD0121502), the National Natural Science Foundation of China General Program project "Argumentation Logic Recognition of Scientific Proposition Texts Based on Machine Reading Comprehension" (72174157), and the National Natural Science Foundation of China Key Program project "Data- and Intelligence-Empowered Theoretical Transformation of Scientific Information Resources and Knowledge Management" (72234005).

Abstract: To address the high resource consumption, long processing times, and poor scalability that researchers face in scientific entity extraction tasks, this paper proposes a collaborative training framework that combines the respective strengths of large and small language models. Using four datasets from different domains (NCBI, BC4CHEMD, S800, and SCIERC), the method is shown to achieve results consistent with full-data fine-tuning in few-shot settings. The paper analyzes in depth the limitations of large language model prediction strategies in scientific entity extraction and systematically tests the performance of large and small models across different data scales over multiple rounds of collaborative training. In addition, from the dual perspectives of the small model's recognition strategy and training-data similarity, it examines why the proposed framework improves performance. The framework simultaneously exploits the cognitive strengths of large models and the low-cost, high-efficiency operation of small models, thereby better supporting efficient extraction of information from scientific literature in low-resource, few-shot environments.
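As a rough illustration of how such a collaborative loop can be organized (the paper's exact procedure may differ), the Python sketch below assumes a round-based scheme: the large model pseudo-labels an unlabeled pool, the small model is retrained on the gold few-shot set plus the pseudo-labels on which the two models agree, and still-disputed sentences are revisited in the next round. The names llm_annotate and train, and the agreement filter itself, are illustrative assumptions rather than the paper's published interface.

```python
"""Minimal sketch of a large/small model collaborative training loop
for few-shot scientific entity extraction. Hypothetical interface:
llm_annotate maps a tokenised sentence to BIO tags, and train builds
a small tagger from labelled examples."""

from typing import Callable, List, Tuple

Sentence = List[str]   # tokenised sentence
Tags = List[str]       # BIO tags, one per token
Example = Tuple[Sentence, Tags]
Tagger = Callable[[Sentence], Tags]


def collaborative_train(
    gold: List[Example],              # small gold few-shot set
    unlabeled: List[Sentence],        # cheap unlabeled pool
    llm_annotate: Tagger,             # large model used as an annotator
    train: Callable[[List[Example]], Tagger],  # trains the small model
    rounds: int = 3,
) -> Tagger:
    """Round-based distillation: keep only pseudo-labels on which the
    large and small models agree, then retrain the small model."""
    small = train(gold)               # round 0: gold few-shot data only
    pool = list(unlabeled)
    for _ in range(rounds):
        agreed: List[Example] = []
        disputed: List[Sentence] = []
        for sent in pool:
            tags = llm_annotate(sent)
            if small(sent) == tags:   # both models produce the same tags
                agreed.append((sent, tags))
            else:
                disputed.append(sent)
        small = train(gold + agreed)  # retrain on gold + agreed pseudo-labels
        pool = disputed               # revisit disagreements next round
    return small
```

In practice, train would fine-tune a compact pretrained token classifier and llm_annotate would prompt a large model for BIO tags; both are deliberately left abstract here so the sketch stays self-contained.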

Key words: Few-shot, Large language model, Model distillation, Scientific entities, Entity extraction

CLC number: