In one of the previous articles, we learned what parsing is and looked at examples of obtaining data from HTML pages using Python.
In this guide, we continue to move in that direction and offer web scraping best practices and tips that will help you automatically extract data from most existing websites.
Obtaining data automatically may be prohibited by the terms of use of websites. We do not encourage violations of these terms, the rules specified in the robots.txt file, or any other applicable legal norms. Use the methods presented here only within permitted scenarios, and respect the policies of website owners.
By the way information is delivered, websites can be divided into two groups: static and dynamic. On static websites, all data is stored in the form of fixed HTML files that are kept on the server. Their contents do not change unless the developer modifies them. Dynamic websites, on the other hand, support real-time content generation and can load information from storage or, for example, from a database.
Usually, writing a script for a static site is easier, since the information is definitely located inside the HTML document, and you don’t need to look for additional requests.
The first thing a developer needs in order to quickly identify the source of the data is to learn how to use the browser's developer tools (DevTools). They exist in every browser and can be opened with the F12 key, Ctrl + Shift + I on Windows/Linux, or Command + Option + I on macOS.
At first, you will only need two tabs: Elements and Network. The first allows you to see the structure of the page and determine in which DOM element the data is located. The Network tab is needed for working with requests, which we will later copy.
The tabs are located at the top of the developer tools.
Most often, information reaches the page in one of two ways: either the server returns ready-made HTML markup that already contains the data, or the page's JavaScript requests the data separately (usually as JSON) and renders it on the client.
Below is an action algorithm recommended for starting work with any donor site (the site you want to extract data from):
Find a GET request with content type text/html, which the browser sends when the page is initialized.
To do this, go to the page from which you need to extract data. Open the web inspector on the Network tab. Clear the requests by clicking the trash bin icon to the left of the request search bar. Reload the page with Ctrl + R on Windows/Linux or Command + R on macOS. One of the first requests will be the needed GET request with content type text/html.
Click on the request you found. Then go to the Response tab. A preview mode of the server’s response will open. The page layout may appear broken; this is normal.
Try to find the required data visually in the preview mode. For example, the HTML markup of articles on Hostman is generated by the server. If you needed to automatically obtain the text of an article, then most of the work would already be done.
If you can’t find it visually, go to the HTML markup view mode of the server response (not to be confused with the Elements tab). Activate search within the response with Ctrl + F on Windows or Command + F on macOS. Enter an example of data that you know is definitely on the page (for instance, the developer knows that the article contains the phrase “configuring Nginx,” so that exact phrase can be searched). The browser will highlight the substring if matches are found.
Often, if the information is delivered by the server as HTML markup, selector names remain the same. For convenience, you can use the standard element search tool with the mouse: Ctrl + Shift + C on Windows or Cmd + Shift + C on macOS. Press the shortcut and select the element directly on the page. The browser will show the desired element, and its selectors can be conveniently transferred into your code.
If the required data is not present, proceed to the next step.
Find the requests that contain only JSON. This is easiest to do by filtering: click on the search bar for requests and enter the filter:
mime-type: application/json
Go through each request with the donor site’s domain and repeat the search for data, as in the previous step.
If no necessary data is found, then most likely you will need to resort to browser emulation to parse the information.
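If you do reach this point, a headless browser can render the page, including JavaScript-generated content, and hand the final HTML to your usual parser. Below is a minimal sketch using Selenium; the headless flag and the URL are illustrative assumptions, not part of the algorithm above:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com")  # illustrative URL
    html = driver.page_source  # fully rendered HTML, including JS-generated markup
finally:
    driver.quit()

# html can now be parsed as usual, for example with BeautifulSoup

Browser emulation is slower and heavier than direct requests, so it is worth using only when the data cannot be obtained in the previous steps.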
In most cases, along with the request, the browser sends request headers and cookies to the server. Headers transmit metadata that allows the server to understand what data format is being requested and how best to deliver it. Cookies store session information and user preferences. Thanks to this, the server forms a personalized response.
Without this data, the server may reject the request if it considers it insufficiently secure.
In most cases you do not need to assemble these headers and cookies by hand: DevTools lets you copy the request (right-click it → Copy → Copy as cURL) and convert it, for example with a cURL-to-code converter, into ready-made code for making the same request, and not only in Python; this works for any request. The resulting code contains dictionaries with headers, cookie data, JSON request parameters (json_data, if present), and everything necessary to fully duplicate the request made in the browser.
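For illustration, code generated this way typically looks roughly like the sketch below; the header values, cookie names, and the endpoint URL are placeholders rather than values from a real site:

import requests

# Placeholder values; code copied from DevTools will usually contain many more entries
cookies = {
    "session_id": "abc123",
}

headers = {
    "User-Agent": "Mozilla/5.0",
    "Accept": "application/json",
}

response = requests.get(
    "https://example.com/api/items",  # illustrative endpoint
    headers=headers,
    cookies=cookies,
    timeout=20,
)

data = response.json()  # if the endpoint returns JSON

Duplicating the browser's headers and cookies this way is usually enough for the server to treat the script's request like an ordinary browser request.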
Most often, scripts for parsing and automation are later uploaded to a remote server. A virtual environment creates a separate environment for the project and isolates its dependencies from system libraries. This helps to avoid version conflicts and reduces the risk of unexpected failures.
We explained more about virtual environments and how to create them in another article.
To quickly transfer the project to a server, provided you worked in a virtual environment on your local computer, first save the list of libraries with versions from pip into a requirements.txt file:
pip freeze > requirements.txt
If you just created a server on Ubuntu, you can use a universal script to install Python, a virtual environment, and all dependencies on a clean server.
First, transfer the project files (using the scp utility or the FTP protocol), go to the project directory, and paste the ready-made command into the terminal. At the beginning of the command, specify the required Python version in the PYVER variable, and then execute the command:
export PYVER=3.9 && sudo apt update && sudo apt upgrade -y && sudo apt install -y software-properties-common && sudo add-apt-repository ppa:deadsnakes/ppa -y && sudo apt update && sudo apt install -y python${PYVER} python${PYVER}-venv python${PYVER}-dev python3-pip && python${PYVER} -m venv venv && source venv/bin/activate && pip install --upgrade pip && [ -f requirements.txt ] && pip install -r requirements.txt
When developing a parser, it is important to provide an error handling mechanism. Network failures, changes in the HTML structure, or unexpected blocking by the site may lead to script failures.
Add retries for requests, timeouts, and a logging system for all actions and errors. This approach allows you to quickly detect problems, adjust parsing algorithms, and ensure the stability of the application even when the donor site changes.
In Python, you can use try / except / finally constructs, the logging library for logging, and timeouts for requests. With requests:

requests.get("https://hostman.com", timeout=20)

With aiohttp:

timeout = aiohttp.ClientTimeout(total=60, sock_connect=10, sock_read=10)
async with aiohttp.ClientSession(timeout=timeout) as session:
    async with session.get(url) as response:
        return await response.text()
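Putting these pieces together, a simple retry wrapper with logging might look like the sketch below; the retry count, backoff delays, and logging configuration are illustrative assumptions:

import logging
import time

import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("parser")

def fetch_with_retries(url, retries=3, timeout=20):
    """Request a page, retrying on network errors and logging every attempt."""
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()
            return response.text
        except requests.RequestException as error:
            logger.warning("Attempt %d/%d failed for %s: %s", attempt, retries, url, error)
            time.sleep(2 ** attempt)  # simple exponential pause between attempts
        finally:
            logger.debug("Finished attempt %d for %s", attempt, url)
    logger.error("Giving up on %s after %d attempts", url, retries)
    return None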
A generator in Python is a function (or expression) that yields elements one at a time, as they are needed, instead of building the entire result in memory at once.
Generators are especially convenient to use when developing a parsing script for the following reasons:
Lazy evaluation. Generators calculate and return data “on the fly,” which makes it possible to process large volumes of information without consuming significant amounts of memory. When parsing large files or web pages, this is critical: data is processed gradually, and only the current part is stored in memory, not the entire result at once.
Increased performance. Since elements are generated as needed, you can begin processing and transferring data (for example, to a database or a bot) before the entire dataset has been obtained. This reduces delays and allows you to react faster to incoming data.
Code organization convenience. Generators simplify the implementation of iterative processes, allowing you to focus on the parsing logic rather than managing iteration state. This is especially useful when you need to process a data stream and pass it to other parts of the system.
In the loop where the generator is used, it is convenient to initiate writing data to a database or, for example, sending notifications through a Telegram bot. Using generators makes the code more readable.
import requests
from bs4 import BeautifulSoup
class MyParser:
def __init__(self, url):
self.url = url
def parse(self):
"""
Generator that sequentially returns data
(for example, titles of elements on a page).
"""
response = requests.get(self.url)
if response.status_code != 200:
raise Exception(f"Failed to retrieve page, status: {response.status_code}")
soup = BeautifulSoup(response.text, "html.parser")
items = soup.select("div")
for item in items:
title = item.select_one("h1").get_text(strip=True) if item.select_one("h1") else "No title"
yield {
"title": title,
"content": item.get_text(strip=True)
}
if __name__ == "__main__":
parser = MyParser("https://example.com")
for data_item in parser.parse():
print(data_item["title"], "--", data_item["content"])
When parsing a large number of pages, a synchronous approach often becomes a bottleneck, since each request waits for the previous one to finish. Asynchronous libraries, such as aiohttp in Python, allow you to execute many requests simultaneously, which significantly speeds up data collection.
However, to avoid overloading both your application and the donor servers, it is important to properly regulate the request flow. This is where throttling, exponential backoff, and task queue techniques come into play.
Asynchronous requests. Create an asynchronous session with specified timeouts (for example, total timeout, connection timeout, and read timeout). This allows you to process many requests in parallel without blocking the main execution thread.
Throttling. To prevent excessive load on the donor server, it makes sense to limit the number of simultaneous requests. This can be done using semaphores or other concurrency control mechanisms (for example, asyncio.Semaphore), so as not to send requests faster than allowed.
Exponential backoff. If a request fails (for example, due to a timeout or temporary blocking), use an exponential backoff strategy. On each retry, the waiting interval increases (for example, 1 second, then 2, 4, 8…), which allows the server to recover and reduces the likelihood of repeated errors.
Task queues. Organizing queues (for example, with asyncio.Queue) helps manage a large flow of requests: first, a queue of URLs is formed, then requests are processed as "slots" become available for execution. This ensures an even distribution of load and stable operation of the parser; a queue-based sketch follows the example below.
import asyncio
import aiohttp
from aiohttp import ClientTimeout
# Limit the number of simultaneous requests
semaphore = asyncio.Semaphore(10)
async def fetch(session, url):
async with semaphore:
try:
async with session.get(url) as response:
return await response.text()
except Exception:
# Apply exponential backoff in case of error
for delay in [1, 2, 4, 8]:
await asyncio.sleep(delay)
try:
async with session.get(url) as response:
return await response.text()
except Exception:
continue
return None
async def main(urls):
timeout = ClientTimeout(total=60, sock_connect=10, sock_read=10)
async with aiohttp.ClientSession(timeout=timeout) as session:
tasks = [asyncio.create_task(fetch(session, url)) for url in urls]
results = await asyncio.gather(*tasks)
# Process the obtained data
for result in results:
if result:
print(result[:200]) # Print the first 200 characters of the response
# Example list of URLs for parsing
urls = ["http://hostman.com"] * 100
asyncio.run(main(urls))
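The example above regulates concurrency with a semaphore. The task-queue approach mentioned in the list can be sketched with asyncio.Queue roughly as follows; the number of workers and the example URL list are illustrative assumptions:

import asyncio
import aiohttp

async def worker(queue, session, results):
    # Each worker repeatedly takes a URL from the queue and fetches it
    while True:
        url = await queue.get()
        try:
            async with session.get(url) as response:
                results.append(await response.text())
        except Exception:
            results.append(None)
        finally:
            queue.task_done()

async def main(urls, num_workers=10):
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)

    results = []
    async with aiohttp.ClientSession() as session:
        workers = [asyncio.create_task(worker(queue, session, results)) for _ in range(num_workers)]
        await queue.join()  # wait until every queued URL has been processed
        for w in workers:
            w.cancel()  # stop the now-idle workers
    return results

# Example usage (illustrative URL list):
# asyncio.run(main(["https://hostman.com"] * 50))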
There are also recommendations that will help simplify a developer’s work:
Check if the donor site has a public API. Sometimes the task of writing a parsing algorithm has already been solved, and the site offers a convenient API that fully covers the required functionality.
Monitor changes in the site’s structure. Donor site developers may change the layout, which would require you to update the selectors of the elements used in your code.
Test function execution at every stage. Automated tests (unit tests, integration tests) help promptly detect issues related to site structure changes or internal code modifications.
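For example, a minimal unit test can pin down the parser's behavior on a fixed HTML fragment. The sketch below assumes that the MyParser class from the generator example above lives in a module named my_parser (a hypothetical name) and mocks the network call:

import unittest
from unittest.mock import Mock, patch

from my_parser import MyParser  # hypothetical module containing the generator example

class MyParserTest(unittest.TestCase):
    @patch("my_parser.requests.get")
    def test_parse_extracts_title(self, mock_get):
        # Feed the parser a fixed HTML fragment instead of a live page
        mock_get.return_value = Mock(status_code=200, text="<div><h1>Configuring Nginx</h1><p>Body</p></div>")
        items = list(MyParser("https://example.com").parse())
        self.assertEqual(items[0]["title"], "Configuring Nginx")

if __name__ == "__main__":
    unittest.main()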
We have systematized the information from this article so you can understand which parsing method to use when working with any donor site.
The universal parsing methods presented here form a reliable foundation for developing algorithms capable of extracting data from a wide variety of websites, regardless of the programming language chosen. Following these scraping best practices and tips allows you to build a flexible, scalable, and change-resistant algorithm. Such an approach not only helps to optimally use system resources but also ensures the ability to quickly integrate the obtained data with databases, messengers, or other external services.