Journal of Information Resources Management ›› 2023, Vol. 13 ›› Issue (1): 129-139. doi: 10.13365/j.jirm.2023.01.129

• Research Article •

Automatic Classification of Product Review Texts Combining Short Text Extension and BERT

Li Xiangdong1,2 Sun Qianru1 Shi Jian1

  1. School of Information Management, Wuhan University;
  2. Center for Electronic Commerce Research and Development, Wuhan University, Wuhan, 430072
  • Online: 2023-01-26 Published: 2023-03-18
  • About the authors: Li Xiangdong, PhD, associate professor and graduate supervisor; research interests: automatic classification and text data mining; Email: xli_whu@whu.edu.cn. Sun Qianru, master's student; research interests: data mining and information analysis. Shi Jian, master's student; research interests: data mining and information analysis.

Abstract: Product reviews are short texts with informal wording. This study explores how to classify review texts automatically by product category and how to improve classification performance. A core word set is built from the training set using TF-IDF and an LDA model, and each short text is extended with features selected by Word2Vec similarity. The extended reviews are then classified with the Bidirectional Encoder Representations from Transformers (BERT) model, and comparative experiments are designed to test the effectiveness of the method. When the extended reviews are classified with BERT, the proposed method achieves an F1 value 2.1% higher than that obtained without extension and 0.9% higher than that obtained with extension based on HowNet similarity. The reasons for this effectiveness are analyzed in terms of basic principles, different word-similarity calculation methods, and word usage. The proposed method effectively improves classification performance when reviews are organized by product category, and it can be applied to the organization of e-commerce information and to research on related theories and methods.

Keywords: Product review texts, Short text, Feature extension, Word2Vec, BERT

CLC number:
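
The feature-extension step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state the exact extension rule, so the threshold rule below, the function names `cosine` and `extend_short_text`, the `threshold` value, and the toy 2-D vectors (standing in for trained Word2Vec embeddings) are all assumptions made for illustration. The core word set is taken as given here; in the paper it is built with TF-IDF and LDA.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extend_short_text(tokens, core_words, vectors, threshold=0.8):
    """Append every core word that is similar enough to some token of the text.

    Assumed rule (hypothetical): a core word is added when its cosine
    similarity to at least one original token reaches `threshold`.
    Tokens or core words without a vector are skipped.
    """
    extended = list(tokens)
    for core in core_words:
        if core in extended or core not in vectors:
            continue
        if any(
            t in vectors and cosine(vectors[t], vectors[core]) >= threshold
            for t in tokens
        ):
            extended.append(core)
    return extended


# Toy vectors standing in for Word2Vec embeddings (hypothetical values).
toy_vectors = {
    "phone": (1.0, 0.1),
    "battery": (0.9, 0.2),
    "screen": (0.95, 0.05),
    "shirt": (0.1, 1.0),
    "cotton": (0.2, 0.9),
}

# Core word set, assumed to come from the TF-IDF and LDA step.
core = ["phone", "screen", "shirt", "cotton"]

review = ["battery", "ok"]
extended = extend_short_text(review, core, toy_vectors)
print(extended)  # ['battery', 'ok', 'phone', 'screen']
```

In this sketch the short review gains the electronics-related core words but not the clothing-related ones. The extended token list would then be joined back into text and passed to a BERT classifier, which is not shown here.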