In this tutorial, we will explain how to create a simple web crawler with Python.
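Warning - Crawling someone's website without permission may violate the site's terms of service and, in some jurisdictions, the law, and it can be treated as an attack. Make sure you have permission before you crawl a site.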
Note: The purpose of this tutorial is to explain the very basics of Web Crawlers. We are not going to create a complex application, implement models for entities or pages, or use any algorithms. We are going to explain how to fetch a web page, parse it, and output the relevant data.
Python is a modern programming language that is widely regarded as both straightforward and highly capable. One reason it is so popular among new developers is its library support, which lets you build many kinds of tools with little effort.
You'll find this tutorial easy to follow, because a basic Web Crawler takes only a few lines of code. Before we start building, here is a quick overview of what a crawler is and how it works.
What is a Web Crawler?
A Web Crawler is an internet bot that visits selected websites and gathers meaningful information from them. Here, "meaningful information" means whatever information the developer wants to collect.
There are good crawlers and bad crawlers. For example, Googlebot is a good crawler - it works with your permission (you need to take action in order to integrate it into your website), and it indexes your website so that it can be found on the Google search engine.
However, there are also bad Web Crawlers that are made to scrape different websites and generate leads.
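One common, though not universal, way for a site to state what crawlers may fetch is a robots.txt file. The sketch below uses Python's standard urllib.robotparser module to check a URL against a set of rules. The rules, the example.com URLs, and the "MyTutorialBot" user-agent name are invented for illustration and are not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, invented for this example.
SAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(rules_text, user_agent, url):
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_RULES, "MyTutorialBot", "https://example.com/products"))
print(is_allowed(SAMPLE_RULES, "MyTutorialBot", "https://example.com/private/data"))
```

With these made-up rules, the first check prints True and the second prints False. For a live site, the same class can download the rules itself through its set_url() and read() methods.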
How does a Web Crawler work?
The crawler is divided into 3 main parts:
Fetch module - an HTTP library to execute GET requests to a website.
Parse module - an HTML/XML parser library, used to turn the page into a hierarchical data structure from which we can extract data.
Output - In our application, we are going to print the results to the console. In an advanced application, the data can be forwarded to other services for analysis.
We start by sending an HTTP GET request to the target website. This returns an HTML page inside a response object.
Then we convert that HTTP response into a BeautifulSoup object, which allows us to navigate the HTML hierarchy. In the next step I will show how we can dig into an HTML hierarchy.
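The three parts can be pictured as three small functions chained together. This is only a rough sketch: the function names (fetch, parse, output, crawl), the 10-second timeout, and the choice to print the page title are assumptions made for illustration, not the tutorial's final code.

```python
import requests
from bs4 import BeautifulSoup


def fetch(url):
    """Fetch module: send an HTTP GET request and return the HTML text."""
    response = requests.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()               # raise on 4xx/5xx status codes
    return response.text


def parse(html, parser="lxml"):
    """Parse module: turn raw HTML into a navigable tree.

    The default "lxml" parser requires the lxml package to be installed.
    """
    return BeautifulSoup(html, parser)


def output(soup):
    """Output module: print something from the parsed page to the console."""
    title = soup.title.get_text(strip=True) if soup.title else "(no title)"
    print(title)


def crawl(url):
    """Run the three stages in order for a single page."""
    output(parse(fetch(url)))
```

Calling crawl("https://www.theswdeveloper.com/my-products") would run all three stages against the products page used in this tutorial.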
What are we going to scrape?
For this tutorial, I created a special Products page on my website.
You can find it at: https://www.theswdeveloper.com/my-products
This products page is a mockup of a typical products page of the kind you can find on most e-commerce websites. It shows a list of products, and each product has, among other details, a title and a price.
Parse an HTML page
Press F12 in the browser to open the developer tools. This allows us to inspect the code of the website.
Switch to the "Elements" tab and press Ctrl+F to open the search box.
Now we can search for HTML elements using CSS selectors.
Here is a great CSS selector reference - https://www.w3.org/TR/selectors-3/
Every HTML element has properties - a tag and, optionally, attributes.
We use these properties to search for HTML elements.
Let's examine the https://www.theswdeveloper.com/my-products page as an example:
All products on that page are located inside a div tag, each with h4 tags inside it. The following CSS selector expression matches all product titles: div h4:first-child. Paste it into the search box - it should locate all product titles.
Let's break it down:
div is a tag that is used to wrap a block of HTML elements.
h4 is a tag that is used for heading text.
We can chain tags, so div h4 matches any h4 nested inside a div. In our case this expression is not specific enough and returns a few unwanted results, so we narrow it down by requesting only an h4 that is the first child of its div.
Now that you understand it, it’s really simple to read the expression div h4:first-child.
We do the same in order to find the product's price, but this time we use the last h4 in that div: div h4:last-child
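To see how these selectors behave, here is a small sketch that runs them with BeautifulSoup's select method. The HTML snippet is a made-up stand-in loosely modelled on the description above; the real page's markup may differ. It uses Python's built-in "html.parser" so that it runs without extra parser packages.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, invented for this example.
SAMPLE_HTML = """
<html><body>
  <div>
    <h4>Laptop</h4>
    <h4>Great for work</h4>
    <h4>$999</h4>
  </div>
  <div>
    <h4>Phone</h4>
    <h4>Fits in a pocket</h4>
    <h4>$499</h4>
  </div>
</body></html>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# "div h4" matches every h4 nested inside a div, including unwanted ones.
all_h4 = [el.get_text() for el in soup.select("div h4")]
# ":first-child" keeps only an h4 that is the first element in its div.
titles = [el.get_text() for el in soup.select("div h4:first-child")]
# ":last-child" keeps only an h4 that is the last element in its div.
prices = [el.get_text() for el in soup.select("div h4:last-child")]

print(all_h4)
print(titles)
print(prices)
```

On this sample, div h4 returns all six headings, while the two narrower selectors return ['Laptop', 'Phone'] and ['$999', '$499'].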
First, we import the requests package. We use it to perform HTTP requests, in our case a GET request to https://www.theswdeveloper.com/my-products using the requests.get(url) method.
We save the result in the "response" object.
We use BeautifulSoup to parse the response into a navigable page object. It takes two parameters - the HTML content of the response and the parser type. I chose the "lxml" parser, which the BeautifulSoup documentation recommends for speed, but you can use any other supported parser.
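The fetch-and-parse step described above might look like the following minimal sketch. The function names load_page and build_soup, the 10-second timeout, and the fallback to Python's built-in "html.parser" when lxml is not installed are assumptions added for illustration.

```python
import requests
from bs4 import BeautifulSoup, FeatureNotFound

URL = "https://www.theswdeveloper.com/my-products"


def build_soup(html, parser="lxml"):
    """Parse HTML text with the requested parser.

    Falls back to the built-in "html.parser" if the requested parser
    (for example lxml) is not installed.
    """
    try:
        return BeautifulSoup(html, parser)
    except FeatureNotFound:
        return BeautifulSoup(html, "html.parser")


def load_page(url=URL):
    """Send a GET request and return the parsed page."""
    response = requests.get(url, timeout=10)  # timeout value is an assumption
    response.raise_for_status()               # raise on 4xx/5xx status codes
    return build_soup(response.text)
```

Calling load_page() with no arguments would fetch the products page and return a BeautifulSoup object ready for the select method.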
Now comes the logical part. We need to extract only the relevant parts of that HTML page.
In the previous section we learned how to locate specific HTML elements within a whole HTML page. We pass those same expressions as parameters to the BeautifulSoup select method.
Now we have 2 lists of HTML elements - products and prices.
We loop over these lists of HTML elements. In each iteration we read the element's text property (or call its get_text() method) to get the inner text of that element, and we print it to the console.
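As a rough illustration of this extraction loop, the sketch below runs the two selectors against a made-up HTML snippet instead of the live page. The snippet and the helper name inner_texts are assumptions for this example.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded products page.
SAMPLE_HTML = (
    "<div><h4>Laptop</h4><h4>$999</h4></div>"
    "<div><h4>Phone</h4><h4>$499</h4></div>"
)


def inner_texts(soup, selector):
    """Return the stripped inner text of each element matching the selector."""
    return [element.get_text(strip=True) for element in soup.select(selector)]


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
products = inner_texts(soup, "div h4:first-child")
prices = inner_texts(soup, "div h4:last-child")

# Print each list to the console, one item per line.
for product in products:
    print(product)
for price in prices:
    print(price)
```

If every product has exactly one title and one price, the two lists can also be walked together with zip(products, prices) to print each product next to its price.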
As mentioned at the beginning, an advanced crawler should send the output to a processing service for insights and actions.
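One simple way to prepare the output for another service, instead of printing it, is to pair the values into records and serialize them as JSON. This is a sketch under assumptions: the field names "product" and "price" and the sample values are invented, and no particular processing service is implied.

```python
import json


def to_records(products, prices):
    """Pair product names with prices as a list of dictionaries."""
    return [
        {"product": product, "price": price}
        for product, price in zip(products, prices)
    ]


def to_json(records):
    """Serialize the records to a JSON string."""
    return json.dumps(records, indent=2)


# Sample values invented for this example.
payload = to_json(to_records(["Laptop", "Phone"], ["$999", "$499"]))
print(payload)
```

The resulting string could then be written to a file or sent in the body of an HTTP request, depending on what the receiving service expects.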
That is all for this crawler guide. The build could go further, but it isn't possible to cover the whole topic in a single guide. Also, as I mentioned at the beginning, the purpose of this tutorial is to explain the very basics of Web Crawlers.
Now that you are familiar with the basics, you are ready to start exploring the more advanced concepts behind high-end crawlers.