Tuesday, February 12, 2013

Web Crawlers: How Do They Work?


A web crawler, or spider, is a program that crawls the web starting from a given seed page. When the crawler is pointed at a page, it fetches all the links present on that page and pushes them onto a to_visit list, which is internally implemented as a stack. The crawler then pops a link off the stack, adds it to a list named visited, fetches that page, and pushes its links onto to_visit in turn. The crawler goes on like this until the to_visit stack is empty.
Step by step, the web spider does the following:
  1. Visit the given web page and get the source code of the page.
  2. Extract all the links present in the source code.
  3. Add the visited page's link to the list named "visited".
  4. Push the extracted links onto a stack named "to_visit".
  5. Pop a link from "to_visit" and repeat from step 1 until the stack "to_visit" becomes empty.
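The steps above can be sketched in a few lines of Python. This is a minimal sketch, not a full crawler: the `get_links` parameter stands in for whatever fetching and link-extraction strategy you choose, so the loop itself can be shown without any network code.

```python
def crawl(seed, get_links):
    """Stack-based crawl loop: visit pages depth-first from a seed page.

    `get_links(url)` is assumed to return the list of links found on a page.
    """
    to_visit = [seed]   # stack of links still to crawl
    visited = []        # links already crawled, in crawl order
    while to_visit:
        url = to_visit.pop()        # step 5: pop the next link
        if url in visited:
            continue                # skip pages we have already seen
        visited.append(url)         # step 3: record the visited page
        for link in get_links(url):     # steps 1-2: fetch and extract
            if link not in visited:
                to_visit.append(link)   # step 4: push new links
    return visited
```

Because `get_links` is a parameter, you can test the loop against a fake in-memory "web" (a dictionary of page-to-links) before wiring in real HTTP fetching.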
By understanding how a web crawler works, one gets to know a lot about core computer science concepts. You have a number of languages at your disposal to build a web crawler, but Python is the most popular choice, and for good reason. Python constructs are easy to understand because they read much like English, and Python is portable and extensible, i.e. platform independent. Suffice it to say that Google uses Python as a development language for many of its products.
Using Python, we can easily build a web crawler with an indexing feature in a couple of dozen lines. Keywords are mapped to their respective links and maintained in a dictionary, a built-in data structure provided by Python that stores each value mapped to its respective key. The dictionary can therefore store links (values) mapped to their respective keywords (keys), and when you search for a specific keyword (key), Python retrieves the links (values) associated with that key.
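A minimal sketch of such a keyword index follows, assuming each keyword maps to the list of page links on which it appears. The function names and URLs here are illustrative, not from any particular crawler.

```python
def add_to_index(index, keyword, url):
    """Record that `keyword` appears on the page at `url`."""
    links = index.setdefault(keyword, [])  # create the entry on first use
    if url not in links:                   # avoid duplicate links per keyword
        links.append(url)

def lookup(index, keyword):
    """Return the list of links indexed under `keyword` (empty if none)."""
    return index.get(keyword, [])
```

A usage example: build an empty dictionary, add a few pages, then look keywords up.

```python
index = {}
add_to_index(index, "python", "http://example.com/a")
add_to_index(index, "python", "http://example.com/b")
add_to_index(index, "crawler", "http://example.com/a")
print(lookup(index, "python"))
```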
Once you have developed a web spider in Python, you can easily modify it to suit your requirements. For example, you can tweak the code so that your crawler collects every ".mp3" link it encounters on a web page. You can also modify it to crawl the web searching for specific types of sites and index them, with their corresponding keywords, in the Python dictionary. All of this is achievable without much effort.
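The ".mp3" tweak mentioned above amounts to filtering extracted links by their extension before recording them. A hedged sketch, with an illustrative helper name:

```python
def filter_by_extension(links, ext=".mp3"):
    """Keep only the links whose URL ends with the given extension.

    Lower-casing makes the match case-insensitive, so "SONG.MP3" is kept too.
    """
    return [link for link in links if link.lower().endswith(ext)]
```

In the crawler, you would pass each page's extracted links through this filter and save the survivors, instead of (or in addition to) pushing them onto the stack.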
Learn how to build a web crawler using Python in a week, and pick up a lot of programming concepts along the way. See for yourself why Python is a favorite language of hackers around the globe. New to programming? Worry not: there is no better language to start with. Let's get started and learn Python, as a beginner, by building a web crawler.


