Parsing is the automatic search for various patterns (based on pre-defined structures) in text data sources to extract specific information.
Although parsing is a broad term, it most commonly refers to the process of collecting and analyzing data from remote web resources.
In the Python programming language, you can create programs for parsing data from external websites using two key tools: a library for sending HTTP requests to remote servers and a library for parsing the HTML markup of the received responses.
However, data processing capabilities are not limited to just HTML documents.
Thanks to a wide range of external libraries in Python, you can organize parsing for documents of any complexity, whether they are arbitrary text, popular markup languages (e.g., XML), or even rare programming languages.
If there is no suitable parsing library available, you can implement it manually using low-level methods that Python provides by default, such as simple string searching or regular expressions. Although, of course, this requires additional skills.
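As a rough illustration of such a low-level approach, here is a minimal sketch that uses only the standard library's re module to pull a page title out of raw HTML (the hardcoded markup and the pattern are assumptions made purely for this example):

import re

# Raw HTML markup, hardcoded here instead of being downloaded from a server
html = "<html><head><title>Example page</title></head><body></body></html>"

# Search for the text between the opening and closing <title> tags
match = re.search(r"<title>(.*?)</title>", html)
if match:
    print(match.group(1))  # Example page

This works for trivial cases, but regular expressions quickly become fragile on real-world HTML, which is exactly why dedicated parsing libraries exist.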
This guide will cover how to organize parsers in Python. We will focus on extracting data from HTML pages based on specified tags and attributes.
We run all the examples in this guide using the Python 3.10.12 interpreter on a Hostman cloud server with Ubuntu 22.04 and pip 22.0.2 as the package manager.
Any document written in HTML consists of two types of tags:
Opening: Enclosed within less-than (<) and greater-than (>) symbols, e.g., <div>.
Closing: Enclosed within less-than (<) and greater-than (>) symbols with a forward slash (/), e.g., </div>.
Each tag can have various attributes, the values of which are written in quotes after the equal sign. Some commonly used attributes include:
href: A link to a resource, e.g., href="https://hostman.com".
class: The class of an object, e.g., class="surface panel panel_closed".
id: The identifier of an object, e.g., id="menu".
Each tag, with or without attributes, is an element (object) of the so-called DOM (Document Object Model) tree, which is built by practically any HTML interpreter (parser).
This builds a hierarchy of elements, in which nested tags are children of their parent tags.
For example, in a browser, we access elements and their attributes through JavaScript scripts. In Python, we use separate libraries for this purpose. The difference is that after parsing the HTML document, the browser not only constructs the DOM tree but also displays it on the monitor. Here is a simple HTML page as an example:
<!DOCTYPE html>
<html>
  <head>
    <title>This is the page title</title>
  </head>
  <body>
    <h1>This is a heading</h1>
    <p>This is a simple text.</p>
  </body>
</html>
The markup of this page is built from tags arranged in a hierarchical structure, without any attributes specified.
Such a document structure is more than enough to extract information: we can parse the data by reading the content between opening and closing tags.
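As a sketch of this idea, the standard library's html.parser module can walk through the tags of the page above and collect the text between them (the markup is hardcoded as a string here for brevity):

from html.parser import HTMLParser

# The simple page shown above, stored as a string for this sketch
html = """<!DOCTYPE html>
<html>
  <head><title>This is the page title</title></head>
  <body>
    <h1>This is a heading</h1>
    <p>This is a simple text.</p>
  </body>
</html>"""

class TextCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.current_tag = None
        self.texts = {}

    def handle_starttag(self, tag, attrs):
        self.current_tag = tag  # Remember which tag we are currently inside

    def handle_data(self, data):
        # Keep only non-empty text found inside the tags we care about
        if self.current_tag in ("title", "h1", "p") and data.strip():
            self.texts[self.current_tag] = data.strip()

collector = TextCollector()
collector.feed(html)
print(collector.texts)
# {'title': 'This is the page title', 'h1': 'This is a heading', 'p': 'This is a simple text.'}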
However, real website tags have additional attributes that specify both the specific function of the element and its special styling (described in separate CSS files):
<!DOCTYPE html>
<html>
  <body>
    <h1 class="h1_bright">This is a heading</h1>
    <p>This is simple text.</p>
    <div class="block" href="https://hostman.com/products/cloud-server">
      <div class="block__title">Cloud Services</div>
      <div class="block__information">Cloud Servers</div>
    </div>
    <div class="block" href="https://hostman.com/products/vps-server-hosting">
      <div class="block__title">VPS Hosting</div>
      <div class="block__information">Cloud Infrastructure</div>
    </div>
    <div class="block" href="https://hostman.com/services/app-platform">
      <div class="block__title">App Platform</div>
      <div class="block__information">Apps in the Cloud</div>
    </div>
  </body>
</html>
Thus, in addition to explicitly specified tags, the required information can be refined with specific attributes, extracting only the necessary elements from the DOM tree.
Web pages can be of two types:
Static: During the loading and viewing of the site, the HTML markup remains unchanged. Parsing does not require emulating the browser's behavior.
Dynamic: The HTML markup is modified by JavaScript during the loading and viewing of the site (as in single-page applications, SPAs). Parsing requires emulating the browser's behavior.
Parsing static websites is relatively simple—after making a remote request, the necessary data is extracted from the received HTML document.
Parsing dynamic websites requires a more complex approach. After making a remote request, both the HTML document itself and the JavaScript scripts controlling it are downloaded to the local machine. These scripts, in turn, usually perform several remote requests automatically, loading additional content and modifying the HTML document while viewing the page.
Because of this, parsing dynamic websites requires emulating the browser’s behavior and user actions on the local machine. Without this, the necessary data simply won’t load.
Most modern websites load additional content using JavaScript scripts in one way or another.
The variety of technical implementations of modern websites is so large that they can’t be classified as entirely static or entirely dynamic.
Typically, general information is loaded initially, while specific information is loaded later.
Most HTML parsers are designed for static pages. Systems that emulate browser behavior to generate dynamic content are much less common.
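For reference, browser emulation is usually handled by a separate tool such as Selenium, which this guide does not cover in detail. A minimal sketch, assuming Selenium and a Chrome driver are installed, might look like this:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # Run the browser without opening a window

driver = webdriver.Chrome(options=options)
driver.get("https://example.com")  # JavaScript on the page runs as in a normal browser

html = driver.page_source  # The fully rendered HTML, ready to be parsed
driver.quit()

The rendered markup stored in html can then be handed to any of the parsers described below.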
In Python, libraries (packages) intended for analyzing HTML markup can be divided into two groups:
Low-level processors: Compact, but syntactically complex packages with a complicated implementation that parse HTML (or XML) syntax and build a hierarchical tree of elements.
High-level libraries and frameworks: Large, but syntactically concise packages with a wide range of features to extract formalized data from raw HTML documents. This group includes not only compact HTML parsers but also full-fledged systems for data scraping. Often, these packages use low-level parsers (processors) from the first group as their core for parsing.
Several low-level libraries are available for Python:
lxml: A low-level XML syntax processor that is also used for HTML parsing. It is based on the popular libxml2 library, written in C.
html5lib: A Python library for HTML syntax parsing, written according to the HTML specification by WHATWG (the Web Hypertext Application Technology Working Group), which all modern browsers follow.
However, using high-level libraries is faster and easier: they have simpler syntax and a wider range of functions.
This guide will review several of these high-level libraries.
We can install all parsing libraries (as well as many other packages) in Python through the standard package manager, pip, which needs to be installed separately.
First, update the list of available repositories:
sudo apt update
Then, install pip using the APT package manager:
sudo apt install python3-pip -y
The -y flag will automatically confirm all terminal prompts during the installation.
To verify that pip was installed correctly, check its version:
pip3 --version
The terminal will display the pip version and the installation path:
pip 22.0.2 from /usr/lib/python3/dist-packages/pip (python 3.10)
As shown, this guide uses pip version 22.0.2.
The Requests package, which allows making HTTP requests to remote servers, is not part of the standard library, but it often comes preinstalled with the system Python (as on the Ubuntu server used in this guide). We will use it in the examples that follow.
However, in some cases it might not be installed. In that case, you can manually install requests via pip:
pip install requests
If the system already has it, you will see the following message in the terminal:
Requirement already satisfied: requests in /usr/lib/python3/dist-packages (2.25.1)
Otherwise, the command will add requests to the list of packages available for import in Python scripts.
To install BeautifulSoup version 4, use pip:
pip install beautifulsoup4
After this, the library will be available for import in Python scripts. However, BeautifulSoup delegates the actual parsing to a low-level HTML processor, so install the previously mentioned ones as well.
First, install lxml:
pip install lxml
Then install html5lib:
pip install html5lib
Later, you can specify one of these processors as the core parser for BeautifulSoup in your Python code.
Create a new file in your home directory:
nano bs.py
Add the following code:
import requests
from bs4 import BeautifulSoup

# Request to the website 'https://hostman.com'
response = requests.get('https://hostman.com')

# Parse the HTML content of the page using the 'html5lib' parser
page = BeautifulSoup(response.text, 'html5lib')

# Extract the title of the page
pageTitle = page.find('title')
print(pageTitle)
print(pageTitle.string)

print("")

# Extract all <a> links on the page
pageParagraphs = page.find_all('a')

# Print the content of the first 3 links (if they exist)
for link in pageParagraphs[:3]:
    print(link.string)

print("")

# Find all div elements with a class starting with 'socials--'
social_links_containers = page.find_all('div', class_=lambda c: c and c.startswith('socials--'))

# Collect the links from these divs
for container in social_links_containers:
    links = container.find_all('a', href=True)
    for link in links:
        href = link['href']
        # Ignore links related to Cloudflare's email protection
        if href.startswith('/cdn-cgi/l/email-protection'):
            continue
        print(href)
Now run the script:
python bs.py
This will produce the following console output:
<title>Hostman - Cloud Service Provider with a Global Cloud Infrastructure</title>
Hostman - Cloud Service Provider with a Global Cloud Infrastructure
Partners
Tutorials
API
https://wa.me/35795959804
https://twitter.com/hostman_com
https://www.facebook.com/profile.php?id=61556075738626
https://github.com/hostman-cloud
https://www.linkedin.com/company/hostman-inc/about/
https://www.reddit.com/r/Hostman_com/
Of course, instead of html5lib, you can specify lxml:
page = BeautifulSoup(response.text, 'lxml')
However, it is best to use html5lib as the processor. Unlike lxml, which is specifically designed for working with XML markup, html5lib has full support for modern HTML5 standards.
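To see the difference in practice, you can feed the same deliberately broken fragment to both processors; the exact output depends on the installed versions, but html5lib repairs the markup the way a browser would:

from bs4 import BeautifulSoup

broken = "<p>First<p>Second"  # Unclosed tags, as often found in real-world HTML

# Both processors build a valid tree, but they may repair the markup differently
print(BeautifulSoup(broken, "lxml").prettify())
print(BeautifulSoup(broken, "html5lib").prettify())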
Although the BeautifulSoup library has a concise syntax, it does not support browser emulation, meaning it cannot load dynamic content.
The Scrapy framework is implemented in a more object-oriented manner. In Scrapy, website parsing is based on three core entities:
Spiders: Classes that contain information about parsing details for specified websites, including URLs, element selectors (CSS or XPath), and page browsing mechanisms.
Items: Variables for storing extracted data; these are more structured forms of Python dictionaries with a special internal structure (a short sketch follows this list).
Pipelines: Intermediate handlers for extracted data that can modify items and interact with external software (such as databases).
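As a brief illustration of an item (the field names here are made up for the example), such a container is declared as a class whose attributes are scrapy.Field() objects and is then filled and read like a dictionary:

import scrapy

class PageItem(scrapy.Item):
    # Each field declares a named slot for a piece of extracted data
    title = scrapy.Field()
    url = scrapy.Field()

item = PageItem(title="Hostman", url="https://hostman.com")
print(item["title"])  # Accessed like a dictionary: Hostman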
You can install Scrapy through the pip package manager:
pip install scrapy
After that, you need to initialize a parser project, which creates a separate directory with its own folder structure and configuration files:
scrapy startproject parser
Now, you can navigate to the newly created directory:
cd parser
Check the contents of the current directory:
ls
It has a general configuration file and a directory with project source files:
parser scrapy.cfg
Move to the source files directory:
cd parser
If you check its contents:
ls
You will see both special Python scripts, each performing its function, and a separate directory for spiders:
__init__.py items.py middlewares.py pipelines.py settings.py spiders
Let's open the settings file:
nano settings.py
By default, most parameters are commented out with the hash symbol (#). For the parser to work correctly, you need to uncomment some of these parameters without changing the default values specified in the file (a sketch of the result follows the list below):
USER_AGENT
ROBOTSTXT_OBEY
CONCURRENT_REQUESTS
DOWNLOAD_DELAY
COOKIES_ENABLED
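Once uncommented, these lines might look roughly like this; the exact values in the generated settings.py depend on your Scrapy version, so keep whatever your file already contains:

# settings.py (fragment) -- values shown here are the template defaults on our setup
USER_AGENT = "parser (+http://www.yourdomain.com)"
ROBOTSTXT_OBEY = True
CONCURRENT_REQUESTS = 32
DOWNLOAD_DELAY = 3
COOKIES_ENABLED = False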
Each specific project will require a more precise configuration of the framework. You can find all available parameters in the official documentation.
After that, you can generate a new spider:
scrapy genspider hostmanspider hostman.com
After running the above command, the console should display a message about the creation of a new spider:
Created spider 'hostmanspider' using template 'basic' in module:
parser.spiders.hostmanspider
Now, if you check the contents of the spiders directory:
ls spiders
You will see the newly generated source files for the spider:
__init__.py __pycache__ hostmanspider.py
Let's open the script file:
nano spiders/hostmanspider.py
And fill it with the following code:
from pathlib import Path  # Package for working with files
import scrapy  # Package from the Scrapy framework

class HostmanSpider(scrapy.Spider):  # Spider class inherits from the Spider class
    name = 'hostmanspider'  # Name of the spider

    def start_requests(self):
        urls = ["https://hostman.com"]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        open("output", "w").close()  # Clear the content of the 'output' file
        someFile = open("output", "a")  # Open the 'output' file for appending
        dataTitle = response.css("title::text").get()  # Extract the page title from the server response using a CSS selector
        dataA = response.css("a").getall()  # Extract all <a> elements from the server response using a CSS selector
        someFile.write(dataTitle + "\n\n")
        for i in range(3):  # Write only the first three links
            someFile.write(dataA[i] + "\n")
        someFile.close()
You can now run the created spider with the following command:
scrapy crawl hostmanspider
Running the spider will create a file named output in the current directory. To view the contents of this file, you can use:
cat output
The content of this file will look something like this:
Hostman - Cloud Service Provider with a Global Cloud Infrastructure
<a href="/partners/" itemprop="url" class="body4 medium nd-link-primary"><span itemprop="name">Partners</span></a>
<a href="/tutorials/" itemprop="url" class="body4 medium nd-link-primary"><span itemprop="name">Tutorials</span></a>
<a href="/api-docs/" itemprop="url" class="body4 medium nd-link-primary"><span itemprop="name">API</span></a>
You can find more detailed information on extracting data using selectors (both CSS and XPath) in the official Scrapy documentation.
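For example, the same title extraction performed in the spider above can be written with either selector type:

# Inside the parse() method, these two calls return the same string
title_css = response.css("title::text").get()
title_xpath = response.xpath("//title/text()").get()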
Data parsing from remote sources in Python is made possible by two main components: a library for sending requests to remote servers and a library for analyzing (parsing) the received HTML markup.
Parsing libraries range from simple ones, suitable only for parsing static websites, to more complex ones that can emulate browser behavior and, consequently, parse dynamic websites.
In Python, the most popular tools for parsing static data are the BeautifulSoup library and the Scrapy framework, both covered in this guide.
These tools, much like JavaScript DOM methods (e.g., getElementsByClassName() or querySelectorAll() with CSS selectors), allow us to extract data (attributes and text) from the elements of the DOM tree of any HTML document.
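For instance, a search that a browser script would perform with getElementsByClassName() looks like this in BeautifulSoup (the class names are taken from the sample markup earlier in this guide, and the fragment is hardcoded for the example):

from bs4 import BeautifulSoup

html = '<div class="block"><div class="block__title">Cloud Services</div></div>'

page = BeautifulSoup(html, "html5lib")

# Equivalent of getElementsByClassName("block") in browser JavaScript
for element in page.find_all(class_="block"):
    print(element.find(class_="block__title").string)  # Cloud Services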