Build your own Web Crawler in Python


Have you ever wondered how search engines gather all that information from the internet? Learn how to build your own web crawler in Python and discover the power of web crawling!


Introduction

In today’s digital age, the internet is the backbone of all communication and commerce, and as a result, there is a wealth of information available on the web. Web crawling is the process of gathering information from the internet by traversing web pages and extracting information. Python is one of the most popular languages for web crawling because of its ease of use, readability, and numerous libraries. In this article, we will learn what a web crawler is, its brief history, and how to build a simple web crawler using Python.

What is a Web Crawler?

A web crawler is a program that automatically navigates through web pages, extracts information and stores it in a structured format such as a database or a text file. Web crawlers are also known as spiders or bots. The primary function of a web crawler is to collect data from the internet, which can be used for various purposes such as indexing websites for search engines, gathering data for research, or monitoring web content for changes.

Brief History of Web Crawling

Web crawling has been around since the early days of the internet. In 1993, Matthew Gray developed one of the first web crawlers, the World Wide Web Wanderer, which traversed web pages to measure the growth of the web and feed an early searchable index. In 1994, Brian Pinkerton released WebCrawler, the first search engine to offer full-text search of the pages it crawled. Since then, web crawling has evolved significantly, and there are numerous tools and libraries available for web crawling in various programming languages.

Breaking Down the Python Script

The following simple Python script demonstrates how to build a web crawler that extracts links and content from web pages. This script uses the requests library to make HTTP requests, the BeautifulSoup library to parse HTML, and the tqdm library to display a progress bar.

				
# wcrawler.py
# Web Crawler written in Python

import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
import sys

def get_links(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    links = []
    for link in soup.find_all('a'):
        links.append(link.get('href'))
    return links

def crawl_and_count(url):
    visited = set()   # pages already crawled
    to_crawl = [url]  # queue of pages waiting to be crawled
    with tqdm(total=5000) as pbar:
        while to_crawl:
            page = to_crawl.pop(0)
            if page in visited:
                continue
            visited.add(page)
            links = get_links(page)
            for link in links:
                if link and link.startswith(url) and link not in visited and link not in to_crawl:
                    to_crawl.append(link)
            if len(visited) > 5000:
                print("Number of pages crawled has exceeded 5000. Exiting program...")
                sys.exit()
            # Fetch the current page and print its post title, if one exists
            response = requests.get(page)
            soup = BeautifulSoup(response.content, 'html.parser', from_encoding=response.encoding)
            post_title = soup.find("h1", class_="post-title")
            if post_title:
                print(post_title.text.strip())
            pbar.update(1)
    print(f"\nNumber of pages crawled: {len(visited)}")

# Get website URL from user input
url = input("Enter website URL to crawl: ")

# Call the crawl_and_count function with the user-provided URL
crawl_and_count(url)

Import the required libraries

The script starts by importing the required libraries: requests, BeautifulSoup, tqdm, and sys.
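Of these, requests, BeautifulSoup (distributed as the beautifulsoup4 package), and tqdm are third-party libraries, while sys ships with Python. If the third-party packages are not already installed, they can usually be added with pip:

$ pip install requests beautifulsoup4 tqdm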

				
					import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
import sys
				
			

The get_links function

Next, let's examine the get_links function:

				
def get_links(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    links = []
    for link in soup.find_all('a'):
        links.append(link.get('href'))
    return links

The get_links function takes a URL as an argument, makes an HTTP request using the requests library, and parses the HTML content using the BeautifulSoup library. It then finds every <a> tag on the page with the find_all method, collects each tag's href attribute into a list, and returns that list.
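If you want to try get_links on its own before running the full crawler, a quick check might look like the following; the URL is just a placeholder, and the output depends entirely on the page you point it at:

# Assumes get_links from wcrawler.py is defined or imported
links = get_links("https://example.com/")
print(len(links), "links found")
print(links[:5])  # some entries may be None or relative paths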

The crawl_and_count function

Let's examine the crawl_and_count function:

				
def crawl_and_count(url):
    visited = set()   # pages already crawled
    to_crawl = [url]  # queue of pages waiting to be crawled
    with tqdm(total=5000) as pbar:
        while to_crawl:
            page = to_crawl.pop(0)
            if page in visited:
                continue
            visited.add(page)
            links = get_links(page)
            for link in links:
                if link and link.startswith(url) and link not in visited and link not in to_crawl:
                    to_crawl.append(link)
            if len(visited) > 5000:
                print("Number of pages crawled has exceeded 5000. Exiting program...")
                sys.exit()
            # Fetch the current page and print its post title, if one exists
            response = requests.get(page)
            soup = BeautifulSoup(response.content, 'html.parser', from_encoding=response.encoding)
            post_title = soup.find("h1", class_="post-title")
            if post_title:
                print(post_title.text.strip())
            pbar.update(1)
    print(f"\nNumber of pages crawled: {len(visited)}")

The crawl_and_count function takes a URL as an argument and initializes two collections: visited, a set that keeps track of all the pages that have already been crawled, and to_crawl, a list that acts as a queue of pages still to be visited. The function then enters a while loop, which continues until there are no more pages to crawl.
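One small implementation note: list.pop(0) shifts every remaining element, so it costs O(n) per call. That is fine for small crawls, but a collections.deque is the more idiomatic FIFO queue if you expect many pages; a minimal sketch with placeholder URLs:

from collections import deque

# Same first-in, first-out behaviour as the list-based queue in the script
to_crawl = deque(["https://example.com/"])
to_crawl.append("https://example.com/about")
page = to_crawl.popleft()  # O(1), unlike list.pop(0)
print(page)  # https://example.com/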

Functions and Loops

Inside the loop, the function pops the first URL from the to_crawl list and checks whether it has already been visited. If it has, the function continues to the next URL. If it hasn’t, the function adds it to the visited set, extracts all the links on the page using the get_links function, and appends them to the to_crawl list if they meet certain criteria. The criteria are:

  • The link starts with the same URL as the initial URL provided by the user.
  • The link has not already been visited.
  • The link is not already queued in the to_crawl list.

The function also checks whether the number of pages visited has exceeded 5000, in which case it prints a message and exits the program by calling sys.exit().
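One limitation worth knowing about: many sites use relative links such as /about, and those fail the startswith(url) check, so the crawler silently skips them. If you want to follow relative links, a common approach is to resolve each href against the current page with urllib.parse.urljoin before applying the criteria; a minimal sketch with placeholder values:

from urllib.parse import urljoin

page = "https://example.com/blog/"
hrefs = ["/about", "post-1/", "https://example.com/contact", None]  # sample href values

for href in hrefs:
    if not href:
        continue
    absolute = urljoin(page, href)  # resolve relative links against the current page
    print(absolute)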

Functions and Libraries

The function then makes a second HTTP request for the current page using the requests library and parses the HTML content with the BeautifulSoup library. It looks for the first <h1> tag with the class post-title and prints its text content if one exists. Finally, the function updates the progress bar using the tqdm library.
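The post-title class is specific to the site this script was originally written against; if the pages you crawl use different markup, you can change the selector or fall back to the page's <title> element. A minimal sketch of that fallback, using a small sample HTML string rather than a live page:

from bs4 import BeautifulSoup

html = "<html><head><title>Example page</title></head><body><h1>Hello</h1></body></html>"
soup = BeautifulSoup(html, "html.parser")

# Prefer an <h1 class="post-title">, fall back to the <title> tag when it is missing
heading = soup.find("h1", class_="post-title") or soup.find("title")
if heading:
    print(heading.text.strip())  # prints "Example page" for the sample HTML above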

After the while loop, the function prints the total number of pages visited.

The script then prompts the user to enter a URL, which is passed as an argument to the crawl_and_count function.

Run the Python script

To run the Python script used in this article, copy the code from the Breaking Down the Python Script section into a file; for the purposes of this tutorial, we are calling the script wcrawler.py. Making the file executable is optional when you launch it with the Python interpreter, but you can do so with chmod:

				
$ chmod +x wcrawler.py

Execute the script with the following command:

				
$ python wcrawler.py
Enter website URL to crawl: 
				
			

If there are no errors, you will be prompted to enter the URL of a site you’d like to crawl, and that’s it! You’ve successfully built your very own web crawler using Python!

Best Practices

When building a web crawler, there are several best practices to keep in mind:

  • Respect the website’s terms of service: Before crawling a website, make sure to read and understand its terms of service. Some websites prohibit web crawling, and violating their terms of service can result in legal action.
  • Use polite crawling behavior: Avoid overwhelming the website’s servers by limiting the number of requests per second and by setting a crawl delay (see the sketch after this list).
  • Use appropriate data storage: Store the data collected during the crawl in an appropriate format, such as a database or a text file.
  • Handle errors gracefully: Web crawling can encounter errors such as HTTP errors, DNS errors, and connection timeouts. It is important to handle these errors gracefully by retrying the request or skipping the page, as shown in the sketch after this list.
  • Test and debug: Test the web crawler thoroughly to ensure that it is working as expected. Use debugging tools such as print statements and logging to troubleshoot errors.
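As a concrete example, here is a minimal sketch of what polite, fault-tolerant fetching might look like. It is not part of the wcrawler.py script above; the crawl delay, timeout, retry count, and the helper name polite_get are illustrative assumptions you would tune for the sites you crawl:

import time
import requests

CRAWL_DELAY = 1.0   # seconds to wait between requests; tune to the site's guidelines
TIMEOUT = 10        # give up on a single request after 10 seconds

def polite_get(url, retries=2):
    """Hypothetical helper: fetch a URL with a crawl delay, a timeout, and simple retries."""
    for attempt in range(retries + 1):
        try:
            time.sleep(CRAWL_DELAY)                       # throttle our request rate
            response = requests.get(url, timeout=TIMEOUT)
            response.raise_for_status()                   # treat HTTP 4xx/5xx as errors
            return response
        except requests.RequestException as exc:
            print(f"Attempt {attempt + 1} failed for {url}: {exc}")
    return None  # caller can skip pages that never succeed

response = polite_get("https://example.com/")
if response is not None:
    print(f"Fetched {len(response.content)} bytes")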

Conclusion

Web crawling is a powerful tool for gathering information from the internet, and Python is an excellent language for building web crawlers. In this article, we learned what a web crawler is, its brief history, and how to build a simple web crawler using Python. We also discussed some best practices for building a web crawler, such as respecting the website’s terms of service, using polite crawling behavior, and handling errors gracefully.

By following these best practices, you can build a web crawler that is efficient, effective, and respectful of the websites you are crawling. Was this article helpful to you? If so, leave us a comment below and share!
