Journal of Information Resources Management, 2015, Vol. 5, Issue 2: 21-27. doi: 10.13365/j.jirm.2015.02.021


The Mode of Automatically Crawling Web Data and its Open Source Solutions for Researchers

Zhang Tingting, Liu Kai, Wang Weijun

  • Received: 2014-09-02  Online: 2015-04-26  Published: 2015-04-26

Abstract:

In the Big Data era, the quantity and quality of data often determine the quality of research findings and the success of an entire project, and they are becoming a key factor in scientific competition. However, automatic web data crawling for research purposes has not yet been studied systematically in the academic literature. To address this gap, this paper analyzes the basic patterns in which researchers' web crawling needs arise and presents four basic crawling modes: single-site static crawling, cross-site static crawling, single-site dynamic crawling, and cross-site dynamic crawling. It then introduces two kinds of solutions built on open source architectures: existing open source crawlers and custom crawlers written by researchers themselves. Finally, it discusses in detail the software architecture and the basic code of each solution.

Key words: Researcher, Web crawler, Technical solution, Open source software
