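The Mode of Automatically Crawling Web Data and its Open Source Solutions for Researchers

Journal of Information Resources Management, 2015, Vol. 5, Issue 2: 21-27. doi: 10.13365/j.jirm.2015.02.021

Zhang Tingting, Liu Kai, Wang Weijun

Received: 2014-09-02 Online: 2015-04-26 Published: 2015-04-26
About the authors: Zhang Tingting, master's student, research interests: user behavior and personalized services; Liu Kai, doctoral student, research interests: user behavior and big data analysis, personalized information services; Wang Weijun, professor and doctoral supervisor, research interests: information resources management, knowledge management and knowledge services, user behavior and e-commerce, Email: wangwj@mail.ccnu.edu.cn.
Funding:
This paper is one of the outcomes of the National Natural Science Foundation of China project "Research on SaaS Service Selection Optimization Based on User Preference Awareness" (71271099) and the Hubei Provincial Natural Science Foundation Innovation Group Key Project "Research on Cloud Computing-Based Knowledge Integration and Services" (2011CDA116).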

Abstract

Abstract:

In the Big Data era, scientific competition is a competition over data: obtaining high-quality data often determines the quality of research findings and even the success or failure of a project. However, the problem of automatic Web data crawling for researchers has not yet been studied systematically. This paper analyzes the basic patterns of data crawling and identifies four basic Web data crawling modes for researchers, together with their technical difficulties: the single-site static crawl mode, the cross-site static crawl mode, the single-site dynamic crawl mode, and the cross-site dynamic crawl mode. It also proposes two open-source solutions for researchers' automatic Web data crawling: building on an existing open-source crawler, and developing a custom crawler. Finally, it discusses the software architecture of each solution in detail and gives a basic code framework.

Key words: Researcher, Web crawler, Technical solution, Open source software
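The abstract distinguishes single-site from cross-site crawling but this page does not reproduce the paper's own code framework. The following is only a minimal sketch of that one distinction, under the assumption that "single-site" means following links whose host matches the starting page and "cross-site" means following links to other hosts. The names `LinkCollector` and `split_links`, the sample HTML, and the example.com URLs are all hypothetical and are not taken from the paper. It handles static HTML only and uses nothing beyond the Python standard library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a static HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def split_links(page_url, html):
    """Return (same_site, cross_site) lists of absolute http(s) links."""
    collector = LinkCollector()
    collector.feed(html)
    home = urlparse(page_url).netloc  # host of the starting page
    same_site, cross_site = [], []
    for href in collector.hrefs:
        absolute = urljoin(page_url, href)  # resolve relative links
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar
        if parts.netloc == home:
            same_site.append(absolute)
        else:
            cross_site.append(absolute)
    return same_site, cross_site


# Hypothetical page content, used only to exercise the function.
SAMPLE_HTML = """
<a href="/issue/2">Issue 2</a>
<a href="https://other.example.org/data">External data</a>
<a href="mailto:someone@example.com">Contact</a>
"""

same, cross = split_links("https://journal.example.com/index.html", SAMPLE_HTML)
print(same)
print(cross)
```

Under this assumption, a single-site crawler would enqueue only the first list, while a cross-site crawler would also enqueue the second. Pages whose content is produced by scripts after loading would need a different approach, which this sketch does not attempt.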

CLC number: TP311.5

Cite this article: Zhang Tingting, Liu Kai, Wang Weijun. The Mode of Automatically Crawling Web Data and its Open Source Solutions for Researchers[J]. Journal of Information Resources Management, 2015, 5(2): 21-27.

