Have you ever wondered how search engines gather all that information from the internet? Learn how to build your own web crawler in Python and discover the power of web crawling!
In today’s digital age, the internet is the backbone of all communication and commerce, and as a result, there is a wealth of information available on the web. Web crawling is the process of gathering information from the internet by traversing web pages and extracting information. Python is one of the most popular languages for web crawling because of its ease of use, readability, and numerous libraries. In this article, we will learn what a web crawler is, its brief history, and how to build a simple web crawler using Python.
A web crawler is a program that automatically navigates through web pages, extracts information and stores it in a structured format such as a database or a text file. Web crawlers are also known as spiders or bots. The primary function of a web crawler is to collect data from the internet, which can be used for various purposes such as indexing websites for search engines, gathering data for research, or monitoring web content for changes.
Web crawling has been around since the early days of the web. In 1993, Matthew Gray developed one of the first web crawlers, the World Wide Web Wanderer, which traversed web pages to measure the growth of the web. In 1994, Brian Pinkerton developed WebCrawler, one of the first search engines to index the full text of the pages it crawled. Since then, web crawling has evolved significantly, and there are numerous tools and libraries available for web crawling in various programming languages.
The following simple Python script demonstrates how to build a web crawler that extracts links and content from web pages. This script uses the requests library to make HTTP requests, the BeautifulSoup library to parse HTML, and the tqdm library to display a progress bar.
# wcrawler.py
# Web Crawler written in Python
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
import sys


def get_links(url):
    # Fetch the page and collect the href of every <a> tag (entries may be None)
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    links = []
    for link in soup.find_all('a'):
        links.append(link.get('href'))
    return links


def crawl_and_count(url):
    visited = set()
    to_crawl = [url]
    with tqdm(total=5000) as pbar:
        while to_crawl:
            page = to_crawl.pop(0)
            if page in visited:
                continue
            visited.add(page)
            links = get_links(page)
            for link in links:
                # Only queue new links that stay under the starting URL
                if link and link.startswith(url) and link not in visited and link not in to_crawl:
                    to_crawl.append(link)
            if len(visited) > 5000:
                print("Number of pages crawled has exceeded 5000. Exiting program...")
                sys.exit()
            # Fetch the current page (not the starting URL) to read its title
            response = requests.get(page)
            soup = BeautifulSoup(response.content, 'html.parser', from_encoding=response.encoding)
            post_title = soup.find("h1", class_="post-title")
            if post_title:
                print(post_title.text.strip())
            pbar.update(1)
    print(f"\nNumber of pages crawled: {len(visited)}")


# Get website URL from user input
url = input("Enter website URL to crawl: ")
# Call the crawl_and_count function with the user-provided URL
crawl_and_count(url)
The script starts by importing the required libraries: requests, BeautifulSoup, tqdm, and sys.
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
import sys
Next, let’s examine the get_links function:
def get_links(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    links = []
    for link in soup.find_all('a'):
        links.append(link.get('href'))
    return links
The get_links function takes a URL as an argument, makes an HTTP request using the requests library, and parses the HTML content using the BeautifulSoup library. It then finds every anchor tag on the page using the find_all method and appends each tag’s href value to a list, which is returned.
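Because get_links returns href values exactly as they appear in the page, the list can contain relative paths, fragment-only links, mailto: addresses, and None. The following is a minimal sketch of one way to clean such a list before queuing it. The helper name normalize_links is hypothetical and not part of the original script; it relies only on the standard library's urllib.parse.

```python
from urllib.parse import urljoin, urldefrag, urlparse


def normalize_links(page_url, hrefs):
    """Turn raw href values into unique absolute http(s) URLs.

    Hypothetical helper: page_url is the page the hrefs were found on,
    hrefs is a list such as the one get_links returns.
    """
    seen = set()
    result = []
    for href in hrefs:
        if not href:
            continue  # skip None and empty strings
        # Resolve relative paths against the page, then drop any #fragment
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        if urlparse(absolute).scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, and similar schemes
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


raw = ["/about", "post-1#comments", None, "mailto:me@example.com",
       "https://example.com/about"]
print(normalize_links("https://example.com/blog/", raw))
```

With links normalized this way, the startswith check in the crawl loop would also match pages that the site links to with relative paths.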
Let’s examine the crawl_and_count function:
def crawl_and_count(url):
    visited = set()
    to_crawl = [url]
    with tqdm(total=5000) as pbar:
        while to_crawl:
            page = to_crawl.pop(0)
            if page in visited:
                continue
            visited.add(page)
            links = get_links(page)
            for link in links:
                # Only queue new links that stay under the starting URL
                if link and link.startswith(url) and link not in visited and link not in to_crawl:
                    to_crawl.append(link)
            if len(visited) > 5000:
                print("Number of pages crawled has exceeded 5000. Exiting program...")
                sys.exit()
            # Fetch the current page (not the starting URL) to read its title
            response = requests.get(page)
            soup = BeautifulSoup(response.content, 'html.parser', from_encoding=response.encoding)
            post_title = soup.find("h1", class_="post-title")
            if post_title:
                print(post_title.text.strip())
            pbar.update(1)
    print(f"\nNumber of pages crawled: {len(visited)}")
The crawl_and_count function takes a URL as an argument and initializes two collections: a set named visited and a list named to_crawl. The visited set keeps track of all the pages that have been visited, while the to_crawl list is a queue of the pages that still need to be visited, starting with the URL passed in. The function then enters a while loop, which continues until there are no more pages to crawl.
Inside the loop, the function pops the first URL from the to_crawl list and checks if it has already been visited. If it has, the function continues to the next URL. If it hasn’t, the function adds it to the visited set, extracts all the links on the page using the get_links function, and appends each one to the to_crawl list if it meets all of the following criteria: the link is not empty, it starts with the starting URL (so the crawl stays on the same site), it is not already in the visited set, and it is not already in the to_crawl list.
The function also checks if the number of pages visited has exceeded 5000, in which case it prints a message and exits the program by calling sys.exit().
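The queue-and-set logic described above can be tried without touching the network. The sketch below is an illustration under stated assumptions rather than the article's script: the function name crawl, the fetch_links parameter, and the small in-memory site dictionary are all hypothetical. It uses collections.deque so that removing the next page is cheap, and a second set so the "already queued" check does not scan a list.

```python
from collections import deque


def crawl(start_url, fetch_links, max_pages=5000):
    """Breadth-first crawl that returns pages in the order they were visited.

    fetch_links is any callable that maps a URL to a list of hrefs,
    for example the article's get_links function.
    """
    order = []               # pages visited, in order
    queued = {start_url}     # every URL that has ever been queued
    frontier = deque([start_url])
    while frontier and len(order) < max_pages:
        page = frontier.popleft()
        order.append(page)
        for link in fetch_links(page):
            # Same criteria as the script: non-empty, same site, not seen yet
            if link and link.startswith(start_url) and link not in queued:
                queued.add(link)
                frontier.append(link)
    return order


# A made-up four-page site standing in for real HTTP requests
site = {
    "https://example.com/": ["https://example.com/a", "https://example.com/b",
                             None, "https://other.org/"],
    "https://example.com/a": ["https://example.com/b", "https://example.com/"],
    "https://example.com/b": ["https://example.com/c"],
    "https://example.com/c": [],
}
print(crawl("https://example.com/", lambda u: site.get(u, [])))
```

Stopping the loop once max_pages is reached, instead of calling sys.exit(), lets the caller still use the pages collected so far.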
The function then makes another HTTP request using the requests library and parses the HTML content using the BeautifulSoup library. It finds the first h1 tag with the class name “post-title” and prints its text content if it exists. Finally, the function advances the progress bar by one using the tqdm library.
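To experiment with the title-extraction step on its own, it can help to run it against an HTML string instead of a live page. The sketch below is a standard-library stand-in for the soup.find("h1", class_="post-title") call, written with html.parser so it runs without extra installs; the class PostTitleParser and the function extract_post_title are hypothetical names, and the sample HTML is made up.

```python
from html.parser import HTMLParser


class PostTitleParser(HTMLParser):
    """Collect the text of the first <h1> whose class list has 'post-title'."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._parts = []
        self.title = None

    def handle_starttag(self, tag, attrs):
        if self.title is not None or tag != "h1":
            return  # already found a title, or not an <h1>
        classes = (dict(attrs).get("class") or "").split()
        if "post-title" in classes:
            self._in_title = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    def handle_endtag(self, tag):
        if self._in_title and tag == "h1":
            self.title = "".join(self._parts).strip()
            self._in_title = False


def extract_post_title(html):
    """Return the stripped post title, or None if the page has none."""
    parser = PostTitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


sample = '<h1 class="post-title"> Hello <em>World</em> </h1>'
print(extract_post_title(sample))
```

Sites that do not use a post-title class on their headings return None here, just as the script prints nothing for them.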
After the while loop, the function prints the total number of pages visited.
The script then prompts the user to enter a URL, which is passed as an argument to the crawl_and_count function.
To run the Python script used in this article, copy and paste the code (from the Breakdown a Python Script section) into a file and give it execute permissions. For the purposes of this tutorial, we are calling the script wcrawler.py.
$ chmod +x wcrawler.py
Execute the script with the following command:
$ python wcrawler.py
Enter website URL to crawl:
If there are no errors, you will be prompted to enter the URL of a site you’d like to crawl, and that’s it! You’ve successfully built your very own web crawler using Python!
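When building a web crawler, there are several best practices to keep in mind: respect the website’s terms of service, use polite crawling behavior, and handle errors gracefully.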
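As one illustration of polite crawling, a crawler can consult a site's robots.txt rules before each request. The sketch below uses the standard library's urllib.robotparser on a robots.txt string supplied directly, so it runs offline. The helper names build_robots_parser and plan_fetch, the user agent "example-crawler", and the sample rules are all assumptions made for the example; the original script performs no such check.

```python
from urllib.robotparser import RobotFileParser


def build_robots_parser(robots_txt):
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_fetch(parser, user_agent, url, default_delay=1.0):
    """Return (allowed, delay_seconds) for a URL.

    The delay comes from a Crawl-delay rule when one applies,
    otherwise from default_delay (an arbitrary assumed value).
    """
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent)
    seconds = float(delay) if delay is not None else default_delay
    return allowed, seconds


# Made-up rules for demonstration purposes
rules = "User-agent: *\nDisallow: /private/\nCrawl-delay: 2\n"
robots = build_robots_parser(rules)
print(plan_fetch(robots, "example-crawler", "https://example.com/blog/"))
print(plan_fetch(robots, "example-crawler", "https://example.com/private/x"))
```

In a crawl loop, a page would be requested only when allowed is true, with a time.sleep(seconds) pause between requests. Passing a timeout to requests.get and catching requests.RequestException around it is one way to keep a single failing page from stopping the whole crawl.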
Web crawling is a powerful tool for gathering information from the internet, and Python is an excellent language for building web crawlers. In this article, we learned what a web crawler is, its brief history, and how to build a simple web crawler using Python. We also discussed some best practices for building a web crawler, such as respecting the website’s terms of service, using polite crawling behavior, and handling errors gracefully.
By following these best practices, you can build a web crawler that is efficient, effective, and respectful of the websites you are crawling. Was this article helpful to you? If so, leave us a comment below and share!