信息资源管理学报 (Journal of Information Resources Management), 2025, Vol. 15, Issue (4): 129-143. DOI: 10.13365/j.jirm.2025.04.129

• Research Articles •

Research on Collaborative Training of Small and Large Language Models for Scientific Entity Extraction with Few-Shot Data

Liang Zhu 1,2, Liu Yinpeng 1,2, Shi Xiang 1,2, Huang Yong 1,2, Cheng Qikai 1,2

  1. School of Information Management, Wuhan University, Wuhan, 430072;
  2. Institute of Intelligence and Innovation Governance, Wuhan University, Wuhan, 430072
  • Online: 2025-07-26  Published: 2025-08-31
  • About the authors: Liang Zhu, Ph.D. candidate, whose research interests include information retrieval and data mining; Liu Yinpeng, Ph.D. candidate, whose research interests include text mining and document intelligence; Shi Xiang, Ph.D. candidate, whose research interests include text mining and document intelligence; Huang Yong, associate professor, Ph.D., whose research interests include text mining and scientometrics; Cheng Qikai (corresponding author), associate professor, Ph.D., whose research interests include text mining and information retrieval, Email: chengqikai@whu.edu.cn.
  • Supported by:
    This work is supported by the National Science and Technology Major Project for New-Generation Artificial Intelligence "Key Technologies Research and Development for a High-Reliability Sci-Tech Literature Intelligent Engine and Its Demonstration Application" (2023ZD0121502), the National Natural Science Foundation of China General Program project "Argumentation Logic Recognition of Scientific Proposition Texts Based on Machine Reading Comprehension" (72174157), and the National Natural Science Foundation of China Key Program project "Data- and Intelligence-Empowered Theoretical Transformation of Scientific Information Resources and Knowledge Management" (72234005).

Abstract: To address the high resource consumption, long processing times, and poor scalability that researchers face in scientific entity extraction tasks, this paper proposes a collaborative training framework that combines the respective strengths of large and small language models. Using four datasets from different domains (NCBI, BC4CHEMD, S800, and SCIERC), the method is shown to achieve results consistent with full-data fine-tuning in few-shot settings. The paper analyzes in depth the limitations of large language model prediction strategies in scientific entity extraction and systematically tests the performance of large and small models across different data scales over multiple rounds of collaborative training. In addition, from the dual perspectives of the small model's recognition strategy and training-data similarity, it examines why the proposed framework improves performance. The framework simultaneously exploits the cognitive strengths of large models and the low-cost, high-efficiency operation of small models, thereby better supporting efficient extraction of information from scientific literature in low-resource, few-shot environments.
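As a rough illustration of how such a collaborative loop can be organized (the paper's exact procedure may differ), the Python sketch below assumes a round-based scheme: the large model pseudo-labels an unlabeled pool, the small model is retrained on the gold few-shot set plus the pseudo-labels on which the two models agree, and still-disputed sentences are revisited in the next round. The names llm_annotate and train, and the agreement filter itself, are illustrative assumptions rather than the paper's published interface.

```python
"""Minimal sketch of a large/small model collaborative training loop
for few-shot scientific entity extraction. Hypothetical interface:
llm_annotate maps a tokenised sentence to BIO tags, and train builds
a small tagger from labelled examples."""

from typing import Callable, List, Tuple

Sentence = List[str]   # tokenised sentence
Tags = List[str]       # BIO tags, one per token
Example = Tuple[Sentence, Tags]
Tagger = Callable[[Sentence], Tags]


def collaborative_train(
    gold: List[Example],              # small gold few-shot set
    unlabeled: List[Sentence],        # cheap unlabeled pool
    llm_annotate: Tagger,             # large model used as an annotator
    train: Callable[[List[Example]], Tagger],  # trains the small model
    rounds: int = 3,
) -> Tagger:
    """Round-based distillation: keep only pseudo-labels on which the
    large and small models agree, then retrain the small model."""
    small = train(gold)               # round 0: gold few-shot data only
    pool = list(unlabeled)
    for _ in range(rounds):
        agreed: List[Example] = []
        disputed: List[Sentence] = []
        for sent in pool:
            tags = llm_annotate(sent)
            if small(sent) == tags:   # both models produce the same tags
                agreed.append((sent, tags))
            else:
                disputed.append(sent)
        small = train(gold + agreed)  # retrain on gold + agreed pseudo-labels
        pool = disputed               # revisit disagreements next round
    return small
```

In practice, train would fine-tune a compact pretrained token classifier and llm_annotate would prompt a large model for BIO tags; both are deliberately left abstract here so the sketch stays self-contained.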

Key words: Few-shot, Large language model, Model distillation, Scientific entities, Entity extraction

CLC number: