Journal of Networks, Vol 6, No 12 (2011), 1682-1689, Dec 2011
doi:10.4304/jnw.6.12.1682-1689

A Web Crawler System Design Based on Distributed Technology

Shaojun Zhong, Zhijuan Deng

Abstract


This paper presents a practical distributed web crawler architecture. A distributed cooperative crawling ("grasping") algorithm is proposed to coordinate page fetching across the crawler nodes. A large-scale web page store is devised by combining a log structure with a hash structure, so that it supports both a large volume of random accesses and the continual addition of newly crawled pages. Experimental results indicate that the distributed crawler achieves good performance, scalability, and load balance.
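As a rough illustration of the combined log and hash idea, the sketch below appends page bodies to a log file and keeps a hash index from URL to the body's position in that log. It is a minimal sketch under stated assumptions, not the authors' implementation: the class name `LogHashPageStore`, the in-memory `dict` index, and the single-file layout are all hypothetical.

```python
import os


class LogHashPageStore:
    """Hypothetical page store: append-only log file plus a hash index.

    New pages are appended to the end of the log (cheap sequential writes),
    while the index maps each URL to (offset, length) for random reads.
    """

    def __init__(self, path):
        self.path = path
        self.index = {}  # hash structure: url -> (offset, length)
        # Create the log file if it does not exist yet.
        open(path, "ab").close()

    def put(self, url, body):
        """Append a page body to the log and record where it landed."""
        with open(self.path, "ab") as log:
            log.seek(0, os.SEEK_END)
            offset = log.tell()
            log.write(body)
        # A re-crawled URL simply points at its newest copy in the log.
        self.index[url] = (offset, len(body))

    def get(self, url):
        """Random access: look up the offset, then read only that span."""
        entry = self.index.get(url)
        if entry is None:
            return None
        offset, length = entry
        with open(self.path, "rb") as log:
            log.seek(offset)
            return log.read(length)
```

In this sketch, adding a page never rewrites existing data, and a lookup costs one hash probe plus one seek, which matches the two requirements the abstract names. A real system would also need to persist the index and reclaim space from superseded copies; both are omitted here.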



Keywords


Search Engine, Web Crawler, Grasping Strategy, Distributed System




Journal of Networks (JNW, ISSN 1796-2056)

Copyright © 2006-2013 by ACADEMY PUBLISHER – All rights reserved.