Journal of Information Resources Management, 2024, Vol. 14, Issue (5): 23-35. doi: 10.13365/j.jirm.2024.05.023

A Study on Automatic Categorization of the Siku Quanshu Based on a Large Language Model

Zuo Liang (1, 2), Zhao Zhixiao (3), Wang Dongbo (3)

  1. Digital Humanities Research Center, Nanjing Agricultural University, Nanjing 210095;
  2. School of Sociology and Population Studies, School of Social Work, Nanjing University of Posts and Telecommunications, Nanjing 210023;
  3. School of Information Management, Nanjing Agricultural University, Nanjing 210095
  • Online: 2024-09-26; Published: 2024-10-15
  • About the authors: Zuo Liang, Ph.D.; research interests include information organization in the history of science and technology, historical documents, and ideological education. Zhao Zhixiao, master's candidate; research interests include digital humanities and intelligent information processing. Wang Dongbo (corresponding author), professor, Ph.D.; research interests include digital humanities and intelligent information organization; Email: db.wang@njau.edu.cn.
  • Supported by:
    This paper is a research output of the Major Project of the National Social Science Foundation of China, "Research on the Construction and Application of a Cross-Language Knowledge Base of Ancient Chinese Canonical Texts" (21&ZD331).

Abstract: Growing interest in ancient book research and the contemporary push to revitalize ancient books have raised the bar for their automatic classification. Drawing on current frontier large language models, this study evaluates how well the Xunzi series of large language models classifies ancient books, taking as input a corpus of 25 categories drawn from the History and Classics sections of the Siku Quanshu. Comparison experiments against the corresponding base models show that the Xunzi large language models for ancient books have a clear advantage on this task. Among them, the Xunzi-Baichuan2-7B model performs best, with an overall classification F1 value of 96.90%. In addition, experiments that vary the training data size show that Xunzi-Baichuan2-7B achieves classification results comparable to those of its base model with only a small amount of data. The automatic classification model proposed in this study, built on the Xunzi large language models for ancient books, can therefore achieve efficient fine-grained classification of ancient books and opens a new path for classifying ancient books in resource-constrained settings.

Key words: Siku Quanshu, classification models, Xunzi large language model, automatic text classification
