Journal of Advances in Information Technology, Vol 3, No 1 (2012), 36-47, Feb 2012
doi:10.4304/jait.3.1.36-47

A Hybrid Revisit Policy For Web Search

Vipul Sharma, Mukesh Kumar, Renu Vig

Abstract


A crawler is a program that retrieves and stores pages from the Web, commonly for a Web search engine. A crawler often has to download hundreds of millions of pages in a short period of time and must constantly monitor and refresh the downloaded pages. Once it has downloaded a significant number of pages, it has to start revisiting them in order to refresh the local collection. Due to resource constraints, search engines usually have difficulty keeping the entire local repository synchronized with the Web. Given the size of the Web today and these inherent resource constraints, re-crawling too frequently wastes bandwidth, while re-crawling too infrequently lowers the quality of the search engine. In this paper, a hybrid approach is built that allows a web crawler to keep the retrieved pages “fresh” in the local collection. Towards this goal, the concepts of PageRank and Age of a web page are used. A higher PageRank means that more users are visiting the page and that the page has higher link popularity. The Age of a web page is a measure that indicates how outdated the local copy is. Using these two parameters, a hybrid approach is proposed that can identify important pages at an early stage of a crawl, so that the crawler revisits these important pages with higher priority.
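
The abstract does not state how the two parameters are combined, so the following is only a minimal sketch of one possible hybrid revisit ordering and not the authors' formula. It assumes a weighted sum of normalized PageRank and normalized Age. The names `Page`, `age`, `revisit_order`, the weight `alpha`, and the field `changed_at` are hypothetical; in particular, a real crawler does not know when the live page changed and would have to estimate it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    url: str
    page_rank: float             # link-popularity score, assumed to be precomputed
    last_crawled: float          # time at which the local copy was fetched
    changed_at: Optional[float]  # assumed known or estimated: when the live page
                                 # changed after that fetch; None if no change


def age(page: Page, now: float) -> float:
    """Return 0 while the local copy is current, else the time since the live page changed."""
    if page.changed_at is None or page.changed_at <= page.last_crawled:
        return 0.0
    return now - page.changed_at


def revisit_order(pages: List[Page], now: float, alpha: float = 0.5) -> List[Page]:
    """Sort pages so that the highest hybrid score is revisited first.

    The score is an assumed weighted sum:
    alpha * normalized PageRank + (1 - alpha) * normalized Age.
    """
    # Fall back to 1.0 to avoid division by zero when all values are 0.
    max_rank = max((p.page_rank for p in pages), default=0.0) or 1.0
    max_age = max((age(p, now) for p in pages), default=0.0) or 1.0

    def score(p: Page) -> float:
        return alpha * p.page_rank / max_rank + (1 - alpha) * age(p, now) / max_age

    return sorted(pages, key=score, reverse=True)


pages = [
    Page("a", page_rank=0.9, last_crawled=0.0, changed_at=None),
    Page("b", page_rank=0.3, last_crawled=0.0, changed_at=10.0),
    Page("c", page_rank=0.1, last_crawled=0.0, changed_at=80.0),
]
order = [p.url for p in revisit_order(pages, now=100.0)]
```

With `alpha = 0.5`, a moderately popular page whose local copy has been stale for a long time can outrank a popular page whose copy is still current. Setting `alpha = 1.0` reduces the ordering to PageRank alone.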







Journal of Advances in Information Technology (JAIT, ISSN 1798-2340)
