Journal of Information Resources Management ›› 2024, Vol. 14 ›› Issue (5): 23-35. doi: 10.13365/j.jirm.2024.05.023

• Special Topic: Intelligent Information Processing of Ancient Books with Large Language Models •

基于大语言模型的《四库全书》自动分类研究

左亮1,2 赵志枭3 王东波3   

  1. 南京农业大学数字人文研究中心,南京,210095;
    2.南京邮电大学社会与人口学院、社会工作学院,南京,210023;
    3.南京农业大学信息管理学院,南京,210095
  • 出版日期:2024-09-26 发布日期:2024-10-15
  • 作者简介:左亮,博士,研究方向为科技史信息组织、历史文献、思想政治教育;赵志枭,硕士研究生,研究方向为数字人文、智能信息处理;王东波(通讯作者),教授,博士生导师,研究方向为数字人文、智能信息组织,Email:db.wang@njau.edu.cn。
  • 基金资助:
    本文系国家社会科学基金重大项目“中国古代典籍跨语言知识库构建及应用研究”(21&ZD331)的研究成果之一。

A Study on Automatic Categorization of the Siku Quanshu Based on a Large Language Model

Zuo Liang1,2 Zhao Zhixiao3 Wang Dongbo3   

  1. Digital Humanities Research Center, Nanjing Agricultural University, Nanjing 210095;
    2. School of Sociology and Population Studies, School of Social Work, Nanjing University of Posts and Telecommunications, Nanjing 210023;
    3. School of Information Management, Nanjing Agricultural University, Nanjing 210095
  • Online: 2024-09-26 Published: 2024-10-15
  • About the authors: Zuo Liang, Ph.D.; research interests: information organization in the history of science and technology, historical documents, and ideological and political education. Zhao Zhixiao, master's candidate; research interests: digital humanities and intelligent information processing. Wang Dongbo (corresponding author), professor and doctoral supervisor; research interests: digital humanities and intelligent information organization. Email: db.wang@njau.edu.cn.
  • Supported by:
    This paper is one of the research outcomes of the Major Project of the National Social Science Fund of China, "Research on the Construction and Application of Cross-Language Knowledge Base of Ancient Chinese Canonical Texts" (21&ZD331).

摘要: 在古籍研究掀起热潮以及古籍活化成为时代要求的背景下,古籍自动分类面临更高的要求。结合当下前沿的大语言模型,以《四库全书》史部和经部的25类语料作为输入语料,探究荀子古籍大语言系列模型在古籍自动分类上的分类效果。通过与其基座模型对比实验表明,荀子古籍大语言系列模型在古籍自动分类任务中具有明显优势,其中Xunzi-Baichuan2-7B大语言模型的优势最为显著,整体分类值达到96.90%;调整训练数据规模的实验表明,荀子古籍大语言模型仅需少量的数据就能够达到与基座模型相当的分类效果。因此,本研究提出的基于荀子古籍大语言模型的古籍自动分类模型,能够实现对古籍的高效细粒度分类,并为资源受限情境下的古籍分类开辟了新途径。

关键词: 《四库全书》, 分类模型, 荀子古籍大语言模型, 文本自动分类

Abstract: The surge of interest in ancient book research and the contemporary call to revitalise ancient books place higher demands on the automatic classification of ancient books. Drawing on current cutting-edge large language models, this study takes a corpus of 25 categories from the History (史部) and Classics (经部) sections of the Siku Quanshu as input and examines how well the Xunzi series of large language models for ancient books performs on automatic classification. Comparison experiments against the corresponding base models show that the Xunzi models have a clear advantage on this task; the Xunzi-Baichuan2-7B model shows the most significant advantage, with an overall classification F1 value of 96.90%. Experiments that vary the training data size further show that the Xunzi-Baichuan2-7B large language model needs only a small amount of data to achieve classification results comparable to those of the base model. The automatic classification model proposed in this study, based on the Xunzi large language models for ancient books, can therefore achieve efficient fine-grained classification of ancient books and opens up a new approach to classifying ancient books in resource-constrained settings.

Key words: Siku Quanshu, Classification models, Xunzi large language model, Automatic text classification
