Journal of Information Resources Management ›› 2020, Vol. 10 ›› Issue (5): 23-29. doi: 10.13365/j.jirm.2020.05.023

• Special Topic: Research on Key Technologies and Methods for Humanities and Social Sciences Thematic Database Construction •

Research on Text Resource Classification of Humanities and Social Sciences Thematic Database Based on Deep Learning: Taking “XinHua Silkroad” Database and “One Belt One Road” Database as Examples

Shi Qin, Li Yang

  1. School of Information Management, Nanjing University, Nanjing, 210023
  • Online: 2020-09-26 Published: 2020-10-13
  • About the authors: Shi Qin, doctoral student, research interests: data management and knowledge services, Email: 13809072562@163.com; Li Yang, PhD, assistant researcher, research interests: emergency intelligence and information resource management.
  • Funding:
    This paper is one of the outcomes of the Major Project of the National Social Science Fund of China, "Research on the Standardized Management of Humanities and Social Sciences Thematic Database Construction" (18ZDA326).

Abstract: As digital transformation deepens, thematic databases in the humanities and social sciences continue to develop. Text resources are an important component of thematic database construction and are currently the main channel through which humanities and social sciences research acquires domain knowledge. Thematic text resources tend to share closely related topics, specialized and in-depth content, and similar features. To address these characteristics, this paper proposes a classification model for humanities and social sciences thematic text resources that builds on the Long Short-Term Memory (LSTM) model and integrates an attention mechanism. Word vectors convert the sample texts into numerical form, the LSTM model extracts semantic features, the attention mechanism highlights key phrases to optimize feature extraction, and Softmax produces the final thematic text classification result. The feasibility and superiority of the proposed model are verified on text resources crawled from the “XinHua Silkroad” database and the “One Belt One Road” thematic database. The results show that the model combining LSTM with the attention mechanism outperforms the other models on both long and short thematic text classification tasks.

Key words: Humanities and social sciences, Thematic database, Thematic text classification, Long Short-Term Memory network, Attention mechanism
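
The abstract describes a pipeline of word vectors, LSTM feature extraction, attention over the extracted features, and a Softmax output. The following is a minimal sketch of only the last two stages, assuming the LSTM has already produced one hidden vector per token. The abstract does not specify the attention form, so the scoring rule used here (a softmax over `v · tanh(h_t)`) is an assumption chosen because it is a common attention-pooling variant; the names `attention_pool`, `classify`, `v`, `class_weights`, and all numeric values are hypothetical placeholders, not the authors' parameters or data.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, v):
    """Score each time step as v . tanh(h_t), normalize, and sum the states."""
    scores = [
        sum(vi * math.tanh(hi) for vi, hi in zip(v, h))
        for h in hidden_states
    ]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states))
        for d in range(dim)
    ]
    return context, weights


def classify(context, class_weights, class_bias):
    """Linear output layer followed by softmax over the topic classes."""
    logits = [
        sum(w * c for w, c in zip(row, context)) + b
        for row, b in zip(class_weights, class_bias)
    ]
    return softmax(logits)


# Hypothetical stand-ins for LSTM outputs: 3 time steps, hidden size 2.
hidden_states = [[0.1, -0.2], [0.9, 0.8], [0.0, 0.1]]
v = [1.0, 1.0]                            # assumed attention vector
class_weights = [[1.0, 0.0], [0.0, 1.0]]  # assumed 2-class output layer
class_bias = [0.0, 0.0]

context, weights = attention_pool(hidden_states, v)
probs = classify(context, class_weights, class_bias)
print("attention weights:", weights)
print("class probabilities:", probs)
```

In this sketch the attention weights sum to one, so the pooled vector is a weighted average of the hidden states in which time steps with larger scores contribute more. That is one way to realize the "highlight key phrases" step mentioned in the abstract. In practice the hidden states would come from a trained LSTM and `v`, `class_weights`, and `class_bias` would be learned jointly with it.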

CLC number: