Journal of Information Resources Management, 2015, Vol. 5, Issue 2: 21-27. doi: 10.13365/j.jirm.2015.02.021
Zhang Tingting, Liu Kai, Wang Weijun
Abstract:
In the Big Data era, the quantity and quality of data usually determine the quality of research findings and the success of the whole project, and they are becoming a key factor in scientific competition. However, there has not yet been systematic academic research on automatically crawling web data. To address this gap, this paper analyzes the basic patterns in which the need for web crawling arises and presents four basic web crawling modes for researchers: single-site static crawl mode, cross-site static crawl mode, single-site dynamic crawl mode, and cross-site dynamic crawl mode. It then introduces two kinds of solutions built on open-source architecture: existing open-source crawlers and researchers' own custom crawlers. Finally, it discusses in detail the software architecture and the basic code of each solution.
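The four modes named in the abstract can be read as the combinations of two independent choices: one site versus several, and static versus dynamic. The following Python sketch illustrates only that reading. The abstract does not define "static" and "dynamic", so the `dynamic` flag is left to the caller, and the names `CrawlMode` and `classify_crawl` are hypothetical and not taken from the paper.

```python
from enum import Enum
from urllib.parse import urlparse


class CrawlMode(Enum):
    """The four crawl modes listed in the abstract."""
    SINGLE_SITE_STATIC = "single-site static crawl mode"
    CROSS_SITE_STATIC = "cross-site static crawl mode"
    SINGLE_SITE_DYNAMIC = "single-site dynamic crawl mode"
    CROSS_SITE_DYNAMIC = "cross-site dynamic crawl mode"


def classify_crawl(seed_urls, dynamic):
    """Map a crawl task onto one of the four modes.

    Assumption: a task is "cross-site" when its seed URLs span more than
    one host. What counts as "dynamic" is not defined in the abstract,
    so the caller supplies that judgment as a boolean.
    """
    hosts = {urlparse(url).netloc.lower() for url in seed_urls}
    if not hosts or "" in hosts:
        raise ValueError("every seed URL must be absolute, e.g. http://host/path")

    cross_site = len(hosts) > 1
    if cross_site:
        return CrawlMode.CROSS_SITE_DYNAMIC if dynamic else CrawlMode.CROSS_SITE_STATIC
    return CrawlMode.SINGLE_SITE_DYNAMIC if dynamic else CrawlMode.SINGLE_SITE_STATIC
```

For example, `classify_crawl(["http://example.com/a", "http://example.org/b"], dynamic=False)` returns `CrawlMode.CROSS_SITE_STATIC`, because the two placeholder seed URLs sit on different hosts.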
Key words: Researcher, Web crawler, Technical solution, Open source software
CLC Number: TP311.5
Zhang Tingting, Liu Kai, Wang Weijun. The Mode of Automatically Crawling Web Data and its Open Source Solutions for Researchers[J]. Journal of Information Resources Management, 2015, 5(2): 21-27.
URL: http://jirm.whu.edu.cn/jwk3/xxzyglxb/EN/10.13365/j.jirm.2015.02.021
http://jirm.whu.edu.cn/jwk3/xxzyglxb/EN/Y2015/V5/I2/21