
Web Scraping with Python and Beautiful Soup

Hostman Team
Technical writer
Python
15.10.2024
Reading time: 5 min

Beautiful Soup is a Python library for extracting data from HTML and XML documents. It provides a simple, intuitive interface for parsing and processing such documents, which makes it useful for both web scraping and data analysis. In this article, we'll describe how to use Beautiful Soup in Python and explore several essential features of the library, including applying search filters, using built-in functions, and modifying the DOM tree. But first, let's cover some key terminology.

What is the DOM Tree?

The DOM (Document Object Model) is a hierarchical structure that represents the content of an HTML document as objects called nodes. Each node in the DOM tree has a type. For example, element nodes correspond to tags like <div>, <p>, or <a>, while text nodes contain the text displayed on the page.

The DOM tree allows access to the document's content, enabling developers to manipulate elements, attributes, and page content using programming languages. For example, the DOM lets you modify element content, add new elements, or remove existing ones.
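To make the distinction between element nodes and text nodes concrete, here is a minimal sketch that walks a small tree with Beautiful Soup. The inline HTML string and the describe() helper are hypothetical, introduced only for illustration.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical inline document, used only for illustration.
html = "<div><p>Hello, <a href='/docs'>docs</a>!</p></div>"
soup = BeautifulSoup(html, "html.parser")


def describe(node, depth=0):
    """Return one indented line per node: elements by tag name, text nodes by content."""
    lines = []
    for child in node.children:
        if isinstance(child, Tag):
            lines.append("  " * depth + f"element <{child.name}>")
            lines.extend(describe(child, depth + 1))
        elif isinstance(child, NavigableString):
            lines.append("  " * depth + f"text {str(child)!r}")
    return lines


tree_lines = describe(soup)
print("\n".join(tree_lines))
```

Each level of indentation in the printed result corresponds to one level of nesting in the tree.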

Installing Beautiful Soup

To install Beautiful Soup for Python 3, open the command line or terminal on your computer and enter the following command:

pip install beautifulsoup4

On Windows, you can open the command line by clicking the "Start" button, typing cmd in the search bar, and pressing Enter. On macOS, open Terminal from Applications > Utilities; on Linux, open the terminal from the applications menu or, on many distributions, press Ctrl + Alt + T. Then type the command pip install beautifulsoup4 and press Enter to begin the installation.

If you encounter an error like "pip" is not recognized as an internal or external command… on Windows, it may mean that the path to the pip executable is not set in the PATH variable. To fix this:

  1. Open Control Panel.

  2. Choose "System and Security" or "System" depending on your Windows version.

  3. Click on "Advanced system settings."

  4. Select "Environment Variables."

  5. Find the PATH variable and click "Edit."

  6. Add the path to the Scripts folder where the pip executable is located (usually C:\PythonXX\Scripts, where XX is your Python version). Add it at the end of the line, separated by a semicolon.

Be careful not to alter any other paths! After making these changes, click "OK" and close all windows. Then open a new command prompt (windows that were already open do not pick up the updated PATH), rerun the pip install beautifulsoup4 command, and it should work.
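Once the installation finishes, a quick way to confirm that the package is importable is a short smoke test like the one below. This is only a sketch; the one-line HTML string is an arbitrary example.

```python
import bs4
from bs4 import BeautifulSoup

# Print the installed version to confirm the import works.
print(bs4.__version__)

# Parse a trivial document with the parser that ships with Python.
smoke = BeautifulSoup("<p>ok</p>", "html.parser")
print(smoke.p.text)
```

If the import fails with ModuleNotFoundError, the package was most likely installed for a different Python interpreter than the one running the script.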

Searching Elements in the DOM

find_all() is the standard method in Beautiful Soup used to search for elements on a web page that match certain criteria. find_all() searches for all elements that meet the given parameters and returns them as a list of objects. For example, if you want to find all H2 tags in an HTML document, first specify the path to the document:

from bs4 import BeautifulSoup
with open("C:/Files/my_wiki_example.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

Now, perform the search:

for heading in soup.find_all('h2'):
    print(heading.text)

If there are headings, you will see output like:

Chapter 1
Chapter 2
Chapter 3
Chapter 4

If Beautiful Soup doesn't find anything, find_all() returns an empty list, so the loop simply prints nothing.
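If you don't have an HTML file at hand, you can try the same search on a string. The sketch below uses a small inline document (an assumption for illustration, not the article's my_wiki_example.html) and also shows what a search with no matches gives back: a result with no elements.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document standing in for a file on disk.
html = """
<h1>My First Wiki Doc</h1>
<h2>Chapter 1</h2>
<h2>Chapter 2</h2>
"""
soup = BeautifulSoup(html, "html.parser")

# Collect the text of every <h2> heading.
headings = [h.text for h in soup.find_all("h2")]
print(headings)

# There is no <h5> in the document, so the result contains no elements.
missing = soup.find_all("h5")
print(len(missing))
```

Because an unsuccessful search has length 0, a for loop over it runs zero times and needs no special handling.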

You can also search using regular expressions. Let’s try to print all headings from the page using:

import re
for heading in soup.find_all(re.compile('^h[1-6]')):
    print(heading.name + ' ' + heading.text.strip())

This will return:

h1 My First Wiki Doc
h2 Chapter 1
h2 Chapter 2
h2 Chapter 3
h2 Chapter 4

In these examples, find_all() receives only its first argument, the tag name. It also accepts an optional second argument, a dictionary of attributes to match (attrs). You can use either argument or both together for searching. The tag name can also be a list of tags:
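As a brief illustration of searching by tag name, by attributes, or by both, consider the sketch below. The inline HTML and the class names external and internal are made up for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; class names are illustrative only.
html = """
<a class="external" href="https://example.com">Example</a>
<a class="internal" href="/chapter-1">Chapter 1</a>
<span class="external">Not a link</span>
"""
soup = BeautifulSoup(html, "html.parser")

# Tag name only: every <a> element.
by_name = soup.find_all("a")

# Attributes only: any tag whose class includes "external".
by_attrs = soup.find_all(attrs={"class": "external"})

# Both together: only <a> elements whose class includes "external".
both = soup.find_all("a", {"class": "external"})

print(len(by_name), len(by_attrs), len(both))
```

Combining both arguments narrows the result to elements that satisfy every condition at once.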

for heading in soup.find_all(['h1', 'h2', 'h3']):
    print(heading.name + ' ' + heading.text.strip())

You’ll get:

h1 My First Wiki Doc
h2 Chapter 1
h2 Chapter 2
h2 Chapter 3
h2 Chapter 4

To count every tag in the document, pass True, which matches all tags, and take the length of the result:

len(soup.find_all(True))

To solve more complex searches, you can write a custom function. Here’s a simple example:

def big_lists(tag):
    return len(tag.contents) > 5 and tag.name == 'ul'
len(soup.find_all(big_lists))

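This code searches for unordered lists (<ul>) with more than five child nodes; in the example document, it finds one such list. Note that tag.contents includes text nodes, such as the whitespace between tags, as well as <li> elements, so the count of child nodes can be higher than the number of list items.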
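If you want the filter to count list items specifically, one possible variant counts only the direct <li> children. The function name big_lists_by_items and the inline HTML are hypothetical, chosen for this sketch.

```python
from bs4 import BeautifulSoup

# Hypothetical document with one short and one long list.
html = """
<ul id="short"><li>a</li><li>b</li></ul>
<ul id="long">
  <li>1</li><li>2</li><li>3</li>
  <li>4</li><li>5</li><li>6</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")


def big_lists_by_items(tag):
    """Match <ul> tags that have more than five direct <li> children."""
    return tag.name == "ul" and len(tag.find_all("li", recursive=False)) > 5


matches = soup.find_all(big_lists_by_items)
match_ids = [ul["id"] for ul in matches]
print(match_ids)

# For comparison: contents also counts the whitespace text nodes.
long_list = soup.find(id="long")
print(len(long_list.contents), len(long_list.find_all("li", recursive=False)))
```

Beautiful Soup calls the function once for every tag in the document and keeps the tags for which it returns True.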

Filtering Elements

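To filter elements using find_all(), you can search by specific attributes. Because class is a reserved word in Python, Beautiful Soup uses the class_ keyword argument for CSS classes. Let's search by class names and count the results: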

len(soup.find_all(class_='table'))  # Elements with class "table"
# Output: 18
len(soup.find_all(class_='image'))  # Elements with class "image"
# Output: 12
len(soup.find_all(class_='references'))  # Elements with class "references"
# Output: 5
len(soup.find_all(class_='redirect'))  # Elements with class "redirect"
# Output: 3

To limit the number of results, for instance to get only the first three tables, pass the limit argument:

soup.find_all(class_='table', limit=3)

The returned list might contain elements like:

<span class="table" id="Table1">Table 1</span>
<span class="table" id="Table2">Table 2</span>
<span class="table" id="Table3">Table 3</span>

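The class and limit filters can be tried on a small inline document as well. This is a sketch with made-up markup; note that an element with several classes still matches when any one of them equals the value you search for.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; ids and class names are illustrative only.
html = """
<span class="table" id="Table1">Table 1</span>
<span class="table wide" id="Table2">Table 2</span>
<span class="image" id="Image1">Image 1</span>
<span class="table" id="Table3">Table 3</span>
"""
soup = BeautifulSoup(html, "html.parser")

# Every element whose class list contains "table".
all_tables = soup.find_all(class_="table")
all_ids = [t["id"] for t in all_tables]

# Stop after the first two matches.
first_two = soup.find_all(class_="table", limit=2)
first_ids = [t["id"] for t in first_two]

print(all_ids)
print(first_ids)
```

With limit set, the search stops as soon as enough matches are collected, which can save time on large documents.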
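To search only among direct children, you can use:

len(soup.head.find_all('meta', recursive=False))

Setting recursive=False stops find_all() from searching all descendants and limits it to the element's direct children. The search starts from soup.head here because <meta> tags are normally children of <head>, not of <html>.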
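The effect of recursive=False is easiest to see by comparing it with the default behavior. The following sketch uses a small hypothetical page in which <meta> sits inside <head>, as it does in most documents.

```python
from bs4 import BeautifulSoup

# Hypothetical page used only to compare the two search modes.
html = """<html><head><meta charset="utf-8"><title>Demo</title></head>
<body><div><p>nested</p></div><p>top-level</p></body></html>"""
soup = BeautifulSoup(html, "html.parser")

# Default search: all descendants of <body>, at any depth.
all_paragraphs = soup.body.find_all("p")

# recursive=False: only the direct children of <body>.
direct_paragraphs = soup.body.find_all("p", recursive=False)

# <meta> is a child of <head>, not of <html>, so the starting point matters.
meta_under_html = soup.html.find_all("meta", recursive=False)
meta_under_head = soup.head.find_all("meta", recursive=False)

print(len(all_paragraphs), len(direct_paragraphs))
print(len(meta_under_html), len(meta_under_head))
```

The non-recursive search finds nothing when started one level too high, so choose the starting element to be the immediate parent of the tags you want.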

If you need only the first match, you can use find(). It returns the first matching element, or None if nothing matches. For instance, to find the first H2 heading:

soup.find('h2')
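Since find() gives back a single element rather than a list, it is worth checking for a missing result before using it. A minimal sketch, assuming a small inline document:

```python
from bs4 import BeautifulSoup

# Hypothetical inline document.
html = "<h1>My First Wiki Doc</h1><h2>Chapter 1</h2><h2>Chapter 2</h2>"
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first matching element.
first_h2 = soup.find("h2")
print(first_h2.text)

# When nothing matches, find() returns None, so guard before reading attributes.
missing = soup.find("h5")
subtitle = missing.text if missing is not None else "(no h5 heading)"
print(subtitle)
```

Skipping the None check is a common source of AttributeError when a page's structure differs from what the script expects.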

Modifying the DOM

Beautiful Soup also allows modifying the content of web documents. To change the text of an H1 heading:

title = soup.find('h1')
title.string = 'New Title'

For confirmation, you can check:

soup.find('h1')
# Output: <h1>New Title</h1>

You can also add a new attribute to the first paragraph:

paragraph = soup.find('p')
paragraph['data-id'] = '12345'

To delete a paragraph, use extract(), which removes the element from the tree and returns it:

paragraph = soup.find('p')
paragraph.extract()

Use prettify() and print() to check the modified document:

print(soup.prettify())
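The modification steps above can be combined into one runnable sketch. The inline HTML is a hypothetical stand-in for the article's example file.

```python
from bs4 import BeautifulSoup

# Hypothetical inline document.
html = "<h1>Old Title</h1><p>First paragraph.</p><p>Second paragraph.</p>"
soup = BeautifulSoup(html, "html.parser")

# Change the heading text.
soup.find("h1").string = "New Title"

# Add an attribute to the first paragraph.
first = soup.find("p")
first["data-id"] = "12345"
tagged = str(first)

# Remove the first paragraph; extract() returns the removed element.
removed = first.extract()

result = str(soup)
print(tagged)
print(result)
```

The removed element stays usable after extract(), so it can be inspected or inserted elsewhere in the tree.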

Other useful methods include:

  • append() to add content to an element,

  • insert() to insert content at a given position inside an element,

  • clear() to clear an element’s content without removing the tag,

  • decompose() to completely remove an element and its content from the DOM,

  • replace_with() to replace an element with another.
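The methods in this list can be sketched together on one small document. The markup and ids below are invented for the example; new_tag() is the Beautiful Soup call used here to create the elements being added.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; ids are illustrative only.
html = '<div id="box"><p id="a">one</p><p id="b">two</p><span>old</span></div>'
soup = BeautifulSoup(html, "html.parser")
box = soup.find(id="box")

# append(): add a new tag as the last child of the element.
new_p = soup.new_tag("p", attrs={"id": "c"})
new_p.string = "three"
box.append(new_p)

# insert(): place content at a given child position (0 = first).
intro = soup.new_tag("h2")
intro.string = "Intro"
box.insert(0, intro)

# clear(): empty the element but keep the tag itself.
soup.find(id="a").clear()

# decompose(): remove the element and everything inside it.
soup.find(id="b").decompose()

# replace_with(): swap one element for another.
strong = soup.new_tag("strong")
strong.string = "new"
soup.find("span").replace_with(strong)

result = str(soup)
print(result)
```

Unlike extract(), decompose() destroys the element, so use it only when the removed content is no longer needed.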

These methods allow flexible modifications to web documents. With a bit of practice, you'll be able to find elements in HTML documents and modify their content with ease.
