Journal of Information Resources Management ›› 2015, Vol. 5 ›› Issue (4): 24-31, 46. doi: 10.13365/j.jirm.2015.04.024

• Research Article •

Text Classification Algorithms Using the LDA Model: On the Comparison of the Applications on Webpages and eTexts Including Books and Journals

Li Xiangdong, Pan Lian

  • Received: 2014-09-11 Online: 2015-10-26 Published: 2015-10-26
  • About the authors: Li Xiangdong, PhD, associate professor; research interests: automatic classification and data mining; Email: xli_xiao@hotmail.com. Pan Lian, master's student; research interest: automatic classification.

Abstract:

This paper studies the main kinds of digital text in information resource management: web pages and the bibliographic or citation records of books and journals. The texts are modeled with a probabilistic topic model (LDA), and the classification results of three common algorithms are compared: KNN, the class-centroid vector method (Rocchio), and SVM. The aim is to characterize automatic classification in digital text resource management. Experiments show that, under the LDA model, the classification accuracy of all three algorithms generally reaches about 80%, and that in most cases SVM is roughly 0.7% to 22% more accurate than the other two algorithms. These conclusions may provide some basis for choosing a suitable classification algorithm when a digital text classification system uses LDA to model text.

Keywords: LDA, digital resources, bibliographic information, automatic classification, classification algorithms

CLC number:
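
As a rough illustration of the kind of comparison the abstract describes, the sketch below classifies documents from their topic distributions using KNN and the class-centroid (Rocchio-style) method, then reports accuracy for each. Everything specific here is an assumption made for illustration and is not taken from the paper: the three-topic vectors, the class labels, the use of cosine similarity, and k = 3. SVM is left out because it would normally come from a library rather than be written by hand.

```python
import math
from collections import Counter

# Hypothetical document-topic distributions (each sums to 1). They stand in
# for the per-document topic vectors that an LDA model would produce.
TRAIN = [
    ([0.80, 0.15, 0.05], "computing"),
    ([0.70, 0.20, 0.10], "computing"),
    ([0.10, 0.75, 0.15], "economics"),
    ([0.15, 0.70, 0.15], "economics"),
    ([0.05, 0.15, 0.80], "medicine"),
    ([0.10, 0.20, 0.70], "medicine"),
]
TEST = [
    ([0.75, 0.15, 0.10], "computing"),
    ([0.20, 0.65, 0.15], "economics"),
    ([0.10, 0.10, 0.80], "medicine"),
]


def cosine(u, v):
    """Cosine similarity between two vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def knn_predict(train, x, k=3):
    """Majority vote among the k training documents most similar to x."""
    ranked = sorted(train, key=lambda item: cosine(item[0], x), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def class_centroids(train):
    """Average the topic vectors of each class into one centroid per class."""
    sums, counts = {}, Counter()
    for vec, label in train:
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def centroid_predict(centroids, x):
    """Assign x to the class whose centroid is most similar to it."""
    return max(centroids, key=lambda label: cosine(centroids[label], x))


def accuracy(predict, test):
    """Fraction of test documents whose predicted label matches the true one."""
    hits = sum(predict(vec) == label for vec, label in test)
    return hits / len(test)


centroids = class_centroids(TRAIN)
print("KNN accuracy:", accuracy(lambda v: knn_predict(TRAIN, v), TEST))
print("Centroid accuracy:", accuracy(lambda v: centroid_predict(centroids, v), TEST))
```

On such cleanly separated toy data both methods classify every test document correctly, so the sketch shows only the mechanics and says nothing about the accuracy gaps reported in the abstract. In practice the topic vectors would come from a fitted LDA model, and an SVM would be trained on the same vectors for the third comparison.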