Useful Tips for Web Data Scraping

Hostman Team
Technical writer
Python
23.09.2025
Reading time: 12 min

In one of the previous articles, we learned what parsing is and looked at examples of obtaining data from HTML pages using Python.

In this guide, we continue to move in that direction and offer web scraping best practices and tips that will help you automatically extract data from most existing websites.

Obtaining data automatically may be prohibited by the terms of use of websites. We do not encourage violations of these terms, the rules specified in the robots.txt file, or any other applicable legal norms. Use the methods presented here only within permitted scenarios, and respect the policies of website owners.
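As a minimal sketch of respecting robots.txt, the standard-library urllib.robotparser module can check whether a URL is allowed before any request is made. The helper name is_allowed, the user agent "my-scraper", and the sample rules are assumptions made for illustration; in a real script the rules would be downloaded from the site.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_lines, user_agent, url):
    """Return True if the robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_lines)  # rules are passed in as a list of text lines
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt content, used here instead of a live download
SAMPLE_ROBOTS = [
    "User-agent: *",
    "Disallow: /private/",
]

if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/blog/"))
    print(is_allowed(SAMPLE_ROBOTS, "my-scraper", "https://example.com/private/data"))
```

A check like this covers only robots.txt; the terms of use of the site still have to be read separately.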

Tip 1. Learn to Work with DevTools

By the way information is delivered, websites can be divided into two groups: static and dynamic. On static websites, all data is stored in the form of fixed HTML files that are kept on the server. Their contents do not change unless the developer modifies them. Dynamic websites, on the other hand, support real-time content generation and can load information from storage or, for example, from a database.

Usually, writing a script for a static site is easier, since the information is definitely located inside the HTML document, and you don’t need to look for additional requests.

Working with the Web Inspector

The first thing a developer needs in order to identify the source of data quickly is to learn how to use the developer tools (DevTools). They exist in every browser and can be opened using the F12 key, or the combination Ctrl + Shift + I on Windows/Linux, or Command + Option + I on macOS.

At first, you will only need two tabs: Elements and Network. The first allows you to see the structure of the page and determine in which DOM element the data is located. The Network tab is needed for working with requests, which we will later copy.

The tabs are located at the top of the developer tools.

Most often, information reaches the site in two ways:

  1. In the HTML markup of the page. This happens if the data is added to the page during backend processing.
  2. In JSON format. Such data can be requested by the frontend both during page loading and after certain user actions on the page.
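In a script, the two delivery formats can usually be told apart by the Content-Type response header. The following is a rough sketch; the function name classify_content_type and the returned labels are assumptions chosen for this example.

```python
def classify_content_type(content_type):
    """Map a Content-Type header value to a parsing strategy."""
    # Drop parameters such as "; charset=utf-8" and normalize the case
    mime = content_type.split(";")[0].strip().lower()
    if mime == "text/html":
        return "html"
    if mime == "application/json" or mime.endswith("+json"):
        return "json"
    return "other"


if __name__ == "__main__":
    print(classify_content_type("text/html; charset=UTF-8"))
    print(classify_content_type("application/json"))
```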

Tip 2. Use a Ready-Made Algorithm to Start Working with Any Donor Site

Below is an action algorithm recommended for starting work with any donor site:

  1. Find a GET request with content type text/html, which the browser sends when the page is initialized.

    To do this, go to the page from which you need to extract data. Open the web inspector in the Network tab. Clear the requests by clicking on the trash bin icon to the left of the request search bar. Reload the page with Ctrl + R on Windows/Linux or Command + R on macOS. One of the first requests will be the needed GET request with content type text/html.

  2. Click on the request you found. Then go to the Response tab. A preview mode of the server’s response will open. The page layout may appear broken; this is normal.

  3. Try to find the required data visually in the preview mode. For example, the HTML markup of articles on Hostman is generated by the server. If you needed to automatically obtain the text of an article, then most of the work would already be done.

  4. If you can’t find it visually, go to the HTML markup view mode of the server response (not to be confused with the Elements tab). Activate search within the response with Ctrl + F on Windows or Command + F on macOS. Enter an example of data that you know is definitely on the page (for instance, the developer knows that the article contains the phrase “configuring Nginx,” so that exact phrase can be searched). The browser will highlight the substring if matches are found.

  5. Often, if the information is delivered by the server as HTML markup, the selectors you see in the Elements tab are the same as in the server response. For convenience, you can use the standard element picker: Ctrl + Shift + C on Windows or Command + Shift + C on macOS. Press the shortcut and select the element directly on the page. The browser will show the desired element, and its selectors can be transferred into your code.

  6. If the required data is not present, proceed to the next step.

  7. Find the requests that contain only JSON. This is easiest to do by filtering: click on the search bar for requests and enter the filter:

    mime-type:application/json

  8. Go through each request with the donor site’s domain and repeat the search for data, as in steps 3 and 4.

If no necessary data is found, then most likely you will need to resort to browser emulation to parse the information.
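The decision made in the steps above could be sketched as a small function. It assumes the response bodies were already saved as text (for example, copied from the Network tab); the name locate_data and the returned labels are hypothetical.

```python
def locate_data(phrase, html_body, json_bodies):
    """
    Report where a known phrase appears.

    html_body: text of the initial text/html response.
    json_bodies: mapping of request URL -> response text of JSON requests.
    """
    if phrase in html_body:
        return "html"
    for url, body in json_bodies.items():
        if phrase in body:
            return f"json:{url}"
    # Found neither in the markup nor in any JSON response
    return "browser emulation"


if __name__ == "__main__":
    html = "<html><body><div id='app'></div></body></html>"
    api = {"https://example.com/api/article": '{"text": "configuring Nginx"}'}
    print(locate_data("configuring Nginx", html, api))
```

JSON responses may escape non-ASCII characters, so an ASCII search phrase is the safer choice for this kind of check.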

Tip 3. Use Quick Export of Requests

In most cases, along with the request, the browser sends request headers and cookies to the server. Headers transmit metadata that allows the server to understand what data format is being requested and how best to deliver it. Cookies store session information and user preferences. Thanks to this, the server forms a personalized response.

Without this data, the server may reject the request if it considers it insufficiently secure.

Exporting a Request with cURL

This method allows you to export ready-made code for making a request, not only in Python. It works for any requests.

  1. Find the desired request in the web inspector.
  2. Right-click the request, then choose Copy and Copy as cURL. Now the request information is copied to your clipboard.
  3. Go to curlconverter.com, a Swiss Army knife for developers of parsing and automation scripts.
  4. Click Python in the programming language selection bar.
  5. Paste the copied request into the input field. You now have a ready-made code template with all request parameters, suitable for importing into your IDE.

The code contains dictionaries with headers, cookie data, JSON request parameters (json_data, if present), and everything necessary to fully duplicate the request made in the browser.
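For orientation, a converted request has roughly the following shape. All values here (the cookie, the headers, the endpoint URL, the payload) are hypothetical placeholders; a real export contains whatever your browser actually sent. The timeout argument and the wrapping function are additions made for this sketch.

```python
# Hypothetical values; a real export contains the ones your browser sent
cookies = {"session_id": "PLACEHOLDER"}

headers = {
    "Accept": "application/json",
    "User-Agent": "Mozilla/5.0",
}

json_data = {"page": 1}


def send_request():
    # Imported lazily so the dictionaries above can be inspected
    # even where the third-party requests package is not installed
    import requests

    response = requests.post(
        "https://example.com/api/items",  # hypothetical endpoint
        cookies=cookies,
        headers=headers,
        json=json_data,
        timeout=20,
    )
    response.raise_for_status()
    return response.json()
```

Keeping the dictionaries separate from the call makes it easy to remove headers one by one and find the minimal set the server actually requires.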

Tip 4. Use a Virtual Environment When Working with Python

Most often, scripts for parsing and automation are later uploaded to a remote server. A virtual environment creates a separate environment for the project and isolates its dependencies from system libraries. This helps to avoid version conflicts and reduces the risk of unexpected failures.

We explained more about virtual environments and how to create them in another article.

To quickly transfer the project to a server, provided you worked in a virtual environment on your local computer, first save the list of libraries with versions from pip into a file requirements.txt:

pip freeze > requirements.txt

If you just created a server on Ubuntu, you can use a universal script to install Python, a virtual environment, and all dependencies on a clean server.

First, transfer the project files (using the scp utility or the FTP protocol), go to the project directory, and paste the ready-made command into the terminal. At the beginning of the command, specify the required Python version in the variable PYVER and then execute the command:

export PYVER=3.9 && sudo apt update && sudo apt upgrade -y && sudo apt install -y software-properties-common && sudo add-apt-repository ppa:deadsnakes/ppa -y && sudo apt update && sudo apt install -y python${PYVER} python${PYVER}-venv python${PYVER}-dev python3-pip && python${PYVER} -m venv venv && source venv/bin/activate && pip install --upgrade pip && [ -f requirements.txt ] && pip install -r requirements.txt

Tip 5. Include Error Handlers in Your Algorithm

When developing a parser, it is important to provide an error handling mechanism. Network failures, changes in the HTML structure, or unexpected blocking by the site may lead to script failures.

Add retries for requests, timeouts, and a logging system for all actions and errors. This approach allows you to quickly detect problems, adjust parsing algorithms, and ensure the stability of the application even when the donor site changes.

In Python, you can use:

  • try, except, finally constructs;
  • the logging library for logging;
  • loops for retrying failed requests;
  • timeouts, for example:

requests:

import requests

requests.get("https://hostman.com", timeout=20)

aiohttp:

import aiohttp

async def fetch_text(url):
    timeout = aiohttp.ClientTimeout(total=60, sock_connect=10, sock_read=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        async with session.get(url) as response:
            return await response.text()
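The items listed above (try/except, logging, a retry loop) can be combined into one small helper. This is only a sketch under simple assumptions: the name with_retries, the default of three attempts, and the fixed delay are illustrative choices, not a prescribed implementation.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("parser")


def with_retries(action, attempts=3, delay=1.0, exceptions=(Exception,)):
    """Call action() up to `attempts` times, logging each failure."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except exceptions as error:
            logger.warning("Attempt %d/%d failed: %s", attempt, attempts, error)
            if attempt == attempts:
                raise  # out of retries: let the caller see the error
            time.sleep(delay)
```

A call could then look like with_retries(lambda: requests.get(url, timeout=20)), with the exceptions argument narrowed to the network errors you expect.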

Tip 6. Implement Your Parser as a Generator

In Python, a generator is a function that uses the yield keyword. Calling it returns an iterator that produces elements one at a time, only when they are requested.

Generators are especially convenient to use when developing a parsing script for the following reasons:

  • Lazy evaluation. Generators calculate and return data “on the fly,” which makes it possible to process large volumes of information without consuming significant amounts of memory. When parsing large files or web pages, this is critical: data is processed gradually, and only the current part is stored in memory, not the entire result at once.

  • Increased performance. Since elements are generated as needed, you can begin processing and transferring data (for example, to a database or a bot) before the entire dataset has been obtained. This reduces delays and allows you to react faster to incoming data.

  • Code organization convenience. Generators simplify the implementation of iterative processes, allowing you to focus on the parsing logic rather than managing iteration state. This is especially useful when you need to process a data stream and pass it to other parts of the system.
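Lazy evaluation is easy to observe in a small sketch. Here parse_pages is a hypothetical generator function that pretends to parse pages; the log list exists only to show which pages were actually processed.

```python
from itertools import islice


def parse_pages(page_numbers, log):
    """Generator function: yields one parsed item per page, on demand."""
    for number in page_numbers:
        log.append(number)  # record that this page was actually processed
        yield {"page": number, "title": f"Item {number}"}


if __name__ == "__main__":
    processed = []
    items = parse_pages(range(1, 1_000_001), processed)
    first_three = list(islice(items, 3))  # only three pages are processed
    print(first_three)
    print(processed)
```

Although the range covers a million pages, only the pages the consumer asks for are touched, and nothing runs at all until iteration starts.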

Example of Implementing a Parser as a Generator in Python

In the loop where the generator is used, it is convenient to initiate writing data to a database or, for example, sending notifications through a Telegram bot. Using generators makes the code more readable.

import requests
from bs4 import BeautifulSoup
 
class MyParser:
    def __init__(self, url):
        self.url = url
 
    def parse(self):
        """
        Generator that sequentially returns data
        (for example, titles of elements on a page).
        """
        response = requests.get(self.url, timeout=20)
        if response.status_code != 200:
            raise Exception(f"Failed to retrieve page, status: {response.status_code}")

        soup = BeautifulSoup(response.text, "html.parser")
        items = soup.select("div")

        for item in items:
            heading = item.select_one("h1")  # look up the heading only once
            title = heading.get_text(strip=True) if heading else "No title"
            yield {
                "title": title,
                "content": item.get_text(strip=True)
            }
 
if __name__ == "__main__":
    parser = MyParser("https://example.com")
    for data_item in parser.parse():
        print(data_item["title"], "--", data_item["content"])
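When the results go to a database, it is often convenient to write them in groups rather than one by one. A possible sketch is shown below; batched and save_batch are hypothetical helpers, and save_batch only prints instead of talking to a real database.

```python
def batched(items, size):
    """Group any iterable (including a generator) into lists of `size`."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch  # final, possibly shorter batch


def save_batch(rows):
    # Hypothetical stand-in for a database insert or a bot notification
    print(f"Saving {len(rows)} rows")


if __name__ == "__main__":
    for rows in batched(range(5), 2):
        save_batch(rows)
```

With the parser above, the loop would become for rows in batched(parser.parse(), 50): save_batch(rows), so memory usage stays bounded by the batch size.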

Tip 7. Use an Asynchronous Approach to Speed Up Processing a Large Number of Requests

When parsing a large number of pages, a synchronous approach often becomes a bottleneck, since each request waits for the previous one to finish. Asynchronous libraries, such as aiohttp in Python, allow you to execute many requests simultaneously, which significantly speeds up data collection.

However, to avoid overloading both your application and the donor servers, it is important to properly regulate the request flow. This is where throttling, exponential backoff, and task queue techniques come into play.

How It Works

  • Asynchronous requests. Create an asynchronous session with specified timeouts (for example, total timeout, connection timeout, and read timeout). This allows you to process many requests concurrently without blocking the event loop.

  • Throttling. To prevent excessive load on the donor server, it makes sense to limit the number of simultaneous requests. This can be done using semaphores or other concurrency control mechanisms (for example, asyncio.Semaphore), so as not to send requests faster than allowed.

  • Exponential backoff. If a request fails (for example, due to a timeout or temporary blocking), use an exponential backoff strategy. On each retry, the waiting interval increases (for example, 1 second, then 2, 4, 8…), which allows the server to recover and reduces the likelihood of repeated errors.

  • Task queues. Organizing queues (for example, with asyncio.Queue) helps manage a large flow of requests. First, a queue of URLs is formed, then requests are processed as “slots” become available for execution. This ensures an even distribution of load and stable operation of the parser.

Example of Implementation in Python Using aiohttp

import asyncio
import aiohttp
from aiohttp import ClientTimeout

# Limit the number of simultaneous requests
MAX_CONCURRENCY = 10
# Pauses before each retry (exponential backoff)
RETRY_DELAYS = [1, 2, 4, 8]

async def fetch(session, semaphore, url):
    # The first attempt runs immediately (delay 0), retries wait longer each time
    for delay in [0, *RETRY_DELAYS]:
        if delay:
            await asyncio.sleep(delay)
        try:
            # Hold a semaphore slot only while the request is in flight
            async with semaphore:
                async with session.get(url) as response:
                    # Treat HTTP errors (for example, 429 or 503) as failures too
                    response.raise_for_status()
                    return await response.text()
        except (aiohttp.ClientError, asyncio.TimeoutError):
            continue
    return None

async def main(urls):
    # Create the semaphore inside the running event loop
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)
    timeout = ClientTimeout(total=60, sock_connect=10, sock_read=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        tasks = [asyncio.create_task(fetch(session, semaphore, url)) for url in urls]
        results = await asyncio.gather(*tasks)
        # Process the obtained data
        for result in results:
            if result:
                print(result[:200])  # Print the first 200 characters of the response

# Example list of URLs for parsing
urls = ["https://hostman.com"] * 100

if __name__ == "__main__":
    asyncio.run(main(urls))
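The aiohttp example limits concurrency with a semaphore. The task-queue variant mentioned earlier could be sketched as follows. To keep the sketch runnable without network access, fake_fetch is a hypothetical stand-in for a real HTTP call, and the worker count of 3 is an arbitrary choice.

```python
import asyncio


async def fake_fetch(url):
    # Stand-in for a real HTTP call; swap in an aiohttp request in practice
    await asyncio.sleep(0.01)
    return f"response from {url}"


async def worker(queue, results):
    while True:
        url = await queue.get()
        try:
            results.append(await fake_fetch(url))
        except Exception as error:
            print(f"{url} failed: {error}")  # keep the worker alive
        finally:
            queue.task_done()  # mark the URL as handled either way


async def crawl(urls, worker_count=3):
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results = []
    workers = [asyncio.create_task(worker(queue, results)) for _ in range(worker_count)]
    await queue.join()  # wait until every queued URL has been processed
    for task in workers:
        task.cancel()  # workers loop forever, so stop them explicitly
    await asyncio.gather(*workers, return_exceptions=True)
    return results


if __name__ == "__main__":
    pages = [f"https://example.com/page/{i}" for i in range(7)]
    print(len(asyncio.run(crawl(pages))))
```

Here the number of workers plays the role of the "slots": at most worker_count requests are in flight, and new URLs can be added to the queue while the crawl is running.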

Recommendations for Developers

There are also recommendations that will help simplify a developer’s work:

  • Check if the donor site has a public API. Sometimes the task of writing a parsing algorithm has already been solved, and the site offers a convenient API that fully covers the required functionality.

  • Monitor changes in the site’s structure. Donor site developers may change the layout, which would require you to update the selectors of the elements used in your code.

  • Test function execution at every stage. Automated tests (unit tests, integration tests) help promptly detect issues related to site structure changes or internal code modifications.
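A test against a saved copy of the page is one simple way to notice layout changes early. The sketch below uses the standard-library html.parser module only to stay dependency-free; the TitleExtractor class, the extract_title function, and the fixture markup are all hypothetical.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collects the text of the first <h1> element."""

    def __init__(self):
        super().__init__()
        self._inside_h1 = False
        self.title = None

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and self.title is None:
            self._inside_h1 = True
            self.title = ""

    def handle_data(self, data):
        if self._inside_h1:
            self.title += data

    def handle_endtag(self, tag):
        if tag == "h1":
            self._inside_h1 = False


def extract_title(html):
    extractor = TitleExtractor()
    extractor.feed(html)
    return extractor.title.strip() if extractor.title is not None else None


# Saved copy of the markup the parser was written against (hypothetical)
FIXTURE = "<html><body><h1> Useful Tips </h1><p>Text</p></body></html>"


def test_extract_title():
    assert extract_title(FIXTURE) == "Useful Tips"
    # A layout change that drops the <h1> should be detected, not ignored
    assert extract_title("<html><body><h2>Renamed</h2></body></html>") is None
```

A function named test_extract_title is picked up automatically by pytest; refreshing the fixture from the live site from time to time shows whether the selectors still match.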

Checklist for Determining the Parsing Method

We have systematized the information from this article so you can understand which parsing method to use when working with any donor site.

Checklist

Conclusion

The universal parsing methods presented here form a reliable foundation for developing algorithms capable of extracting data from a wide variety of websites, regardless of the programming language chosen. Following these scraping best practices and tips allows you to build a flexible, scalable, and change-resistant algorithm. Such an approach not only helps to optimally use system resources but also ensures the ability to quickly integrate the obtained data with databases, messengers, or other external services.




Similar

Python

How to Use Python time.sleep()

Sometimes, while running a program, it’s necessary to pause: wait for data to load, give the user time to enter input, or reduce the load on the system. One of the simplest ways to achieve this in Python is with the time.sleep() function, which suspends program execution for a given interval. In this article, we’ll examine how time.sleep() works in Python, its features and alternatives, as well as possible errors. We’ll discuss when it’s appropriate to use it in multithreaded and asynchronous programs, and when it’s better to choose asyncio.sleep() or other waiting mechanisms. What is the sleep() Function in Python? Python's time.sleep() function can be used to freeze the current thread's execution for a specific period of time. The built-in time module in Python contains this function. This function in Python was added to the standard library to simplify creating pauses in code. It is located in the time module and is called time.sleep, allowing you to pause program execution for a specified number of seconds. In practice, sleep() is useful for pauses in test environments, delays between API requests, or intervals between sending messages. However, you should not confuse its use for system-level tasks, such as thread synchronization, with simply slowing down a script. If precise timing coordination or asynchronous I/O is needed, other tools are more suitable. How time.sleep() Works The time.sleep() function in Python pauses the current thread for the specified number of seconds. In a multithreaded scenario, other threads continue running, but the one where time.sleep() was called remains "frozen" for that interval. It’s important to note that time.sleep() blocks code execution at that point, delaying all subsequent operations. Ignoring this rule can lead to reduced performance or even a frozen user interface in desktop applications. 
When time.sleep() is Used Most often, time.sleep() is used in testing and debugging, when a short delay is needed—for example, to verify the correctness of an API response or wait for a server reply. It’s also used for step-by-step script execution, giving the user time to view information or enter data. In demonstrations, tutorials, and prototyping, time.sleep() helps simulate long-running processes, and when working with external services, it helps avoid penalties or blocks from too frequent requests. However, sleep() is not the only way to slow down code execution. Further in the article, we’ll review some alternatives. How to Use time.sleep() in Python You must import the time module before you can use time.sleep(). The required delay in seconds can then be passed as a parameter when calling time.sleep(). For a few seconds, this delay may be expressed as a floating-point number or as a whole number.  Basic Syntax of time.sleep() To call the time.sleep() function, first import the time module: import time time.sleep(5) In this example, the program will "sleep" for 5 seconds. The number passed to the function can be either an integer or a float, which allows sleeping for fractions of a second. Syntax: time.sleep(seconds) The time.sleep() function does not return any value. That means you cannot precisely determine how accurate the pause was—it simply suspends the current thread for the specified duration. Example: Delaying Code Execution Suppose you have a small script that prints messages with a 2-second interval. To add a delay in Python, just insert time.sleep(2): import time print("First message") time.sleep(2) print("Second message") time.sleep(2) print("Third message") When running this script, the user will see a 2-second pause between each message. That’s exactly how a delay in Python works using time.sleep(2). Parameters of time.sleep() The time.sleep() function accepts only one parameter, but it can be either an integer or a float. 
This adds flexibility when implementing delays in Python. Passing Values in Seconds Most examples of time.sleep() usage pass an integer representing seconds. For example: time.sleep(10) Here, the script pauses for 10 seconds. This is convenient when you need a long pause or want to limit request frequency. Using Fractions of a Second (Milliseconds) Sometimes you need to pause for a few milliseconds or fractions of a second. To do this, you can pass a floating-point number: time.sleep(0.5) This creates a half-second pause. However, because of OS and Python timer limitations, the delay may slightly exceed 500 milliseconds. For most tasks, this isn’t critical, but in high-precision real-time systems, specialized tools should be used instead. Alternative Ways to Pause in Python Although time.sleep() is the most popular and simplest way to create pauses, there are other methods that may be more suitable when waiting for external events or handling multiple threads. Let’s look at the most common alternatives. Using input() for Waiting The simplest way to pause in Python is by calling input(). It suspends execution until the user presses Enter or enters data. Example: print("Press Enter to continue...") input() print("Resuming program execution") While this feels like a pause, technically it’s not a timed delay. The program waits for user action, not a fixed interval. This method is rarely useful in automated scripts but can be handy in debugging or console utilities where a "pause on demand" is needed. Waiting with threading.Event() If you’re writing a multithreaded program, it can be more useful to use synchronization objects like threading.Event(). You can configure it to block a thread until a signal is received. 
Example: import threading event = threading.Event() def worker():     print("Starting work in thread")     event.wait()     print("Event received, resuming work") thread = threading.Thread(target=worker) thread.start() import time time.sleep(3) event.set() In this case, the thread is blocked until event.set() is called. You can still use time.sleep() to set a maximum pause, but unlike plain sleep(), this approach allows more flexible control. The thread can be "woken up" immediately without waiting for the full interval. asyncio.sleep() for Asynchronous Programs In asynchronous Python programming (asyncio module), asyncio.sleep() is used. Unlike time.sleep(), it doesn’t block the entire thread but only suspends the current coroutine, allowing the event loop to continue running other tasks. Example: import asyncio async def main(): print("Start async work") await asyncio.sleep(2) print("2 seconds passed, resuming") asyncio.run(main()) This is especially useful when you have multiple asynchronous functions that should run in parallel without interfering with each other. If you use regular time.sleep() in async code, it will block the entire event loop, causing other coroutines to wait too. Common Issues When Using time.sleep()  The time.sleep() function is simple, but misusing it can cause unexpected problems. It’s important to understand how it affects program execution so you don’t block important processes. Blocking the Main Thread The main feature of time.sleep() is that it blocks the thread where it was called. If you use it in the main thread of a GUI application (for example, Tkinter or PyQt), the interface will stop responding, creating a "frozen" effect. To avoid this, use time.sleep() only in separate threads or switch to asynchronous approaches like asyncio.sleep() for non-blocking delays. In GUI applications, it’s better to use timers (QTimer, after, etc.), which call functions at intervals without blocking the interface. 
Use in Multithreaded and Asynchronous Code In multithreaded code, time.sleep() can be called independently in each thread, but note that it doesn’t automatically release the Global Interpreter Lock (GIL). While other threads can still run during one thread’s sleep, in Python this depends on OS-level thread scheduling. In asynchronous code, time.sleep() should be used cautiously. If called inside an event loop (like with asyncio.run()), it blocks the entire loop, defeating the benefits of async programming. Instead, use asyncio.sleep(), which hands control back to the scheduler, letting other coroutines run in the background. Real-World Example of Using time.sleep() Imagine you’re writing a script to periodically poll an external API, which, according to its rules, must not be called more than once every 30 seconds. If requests are too frequent, the server may return errors or block your IP. Solution using time.sleep(): import time def poll_api(): print("Making API request...") def main(): while True: poll_api() time.sleep(30) if __name__ == "__main__": main() Here, after each request, we pause for 30 seconds with time.sleep(). This ensures no more than two requests per minute, respecting the limits. Async alternative: import asyncio async def poll_api(): print("Making API request...") async def main(): while True: await poll_api() await asyncio.sleep(30) if __name__ == "__main__": asyncio.run(main()) This version doesn’t block the entire program, allowing other requests or tasks to run in the same async environment. It’s more flexible and scalable. Conclusion Organizing pauses and delays is an important aspect of Python development. time.sleep() is the first and most obvious tool for this, but choosing between time.sleep(), asyncio.sleep(), and other methods should depend on your project’s architecture. In single-threaded scripts and console utilities, time.sleep() is perfectly fine, but for multithreaded and asynchronous applications, other mechanisms are better. 
Key recommendations: Use time.sleep() for short delays in tests, pauses between requests, or interface demonstrations. Don’t block the main thread of GUI applications to avoid a "frozen" interface. In async code, replace time.sleep() with asyncio.sleep() to keep the event loop efficient. In multithreaded programs, remember only the current thread pauses, but GIL affects scheduling. In special cases, use threading.Event() or input() to wait for events or user actions.
19 September 2025 · 8 min to read
Python

How to Delete Characters from a String in Python

When writing Python code, developers often need to modify string data. Common string modifications include: Removing specific characters from a sequence Replacing characters with others Changing letter case Joining substrings into a single sequence In this guide, we will focus on the first transformation—deleting characters from a string in Python. It’s important to note that strings in Python are immutable, meaning that any method or function that modifies a string will return a new string object with the changes applied. Methods for Deleting Characters from a String This section covers the main methods in Python used for deleting characters from a string. We will explore the following methods: replace() translate() re.sub() For each method, we will explain the syntax and provide practical examples. replace() The first Pyhton method we will discuss is replace(). It is used to replace specific characters in a string with others. Since strings are immutable, replace() returns a new string object with the modifications applied. Syntax: original_string.replace(old, new[, count]) Where: original_string – The string where modifications will take place old – The substring to be replaced new – The substring that will replace old count (optional) – The number of occurrences to replace (if omitted, all occurrences will be replaced) First, let’s remove all spaces from the string "H o s t m a n": example_str = "H o s t m a n" result_str = example_str.replace(" ", "") print(result_str) Output: Hostman We can also use the replace() method to remove newline characters (\n). example_str = "\nHostman\nVPS" print(f'Original string: {example_str}') result_str = example_str.replace("\n", " ") print(f'String after adjustments: {result_str}') Output: Original string: Hostman VPS String after adjustments: Hostman VPS The replace() method has an optional third argument, which specifies the number of replacements to perform. 
example_str = "Hostman VPS Hostman VPS Hostman VPS" print(f'Original string: {example_str}') result_str = example_str.replace("Hostman VPS", "", 2) print(f'String after adjustments: {result_str}') Output: Original string: Hostman VPS Hostman VPS Hostman VPS String after adjustments: Hostman VPS Here, only two occurrences of "Hostman VPS" were removed, while the third occurrence remained unchanged. We have now explored the replace() method and demonstrated its usage in different situations. Next, let’s see how we can delete and modify characters in a string using translate(). translate( The Python translate() method functions similarly to replace() but with additional flexibility. Instead of replacing characters one at a time, it allows mapping multiple characters using a dictionary or translation table. The method returns a new string object with the modifications applied. Syntax: original_string.translate(mapping_table) In the first example, let’s remove all occurrences of the $ symbol in a string and replace them with spaces: example_str = "Hostman$Cloud$—$Cloud$Service$Provider." print(f'Original string: {example_str}') result_str = example_str.translate({ord('$'): ' '}) print(f'String after adjustments: {result_str}') Output: Original string: Hostman$Cloud$—$Cloud$Service$Provider. String after adjustments: Hostman Cloud — Cloud Service Provider. To improve code readability, we can define the mapping table before calling translate(). This is useful when dealing with multiple replacements: example_str = "\nHostman%Cloud$—$Cloud$Service$Provider.\n" print(f'Original string: {example_str}') # Define translation table example_table = {ord('\n'): None, ord('$'): ' ', ord('%'): ' '} result_str = example_str.translate(example_table) print(f'String after adjustments: {result_str}') Output: Original string: Hostman%Cloud$—$Cloud$Service$Provider. String after adjustments: Hostman Cloud — Cloud Service Provider. 
re.sub() In addition to replace() and translate(), we can use regular expressions for more advanced character removal and replacement. Python's built-in re module provides the sub() method, which searches for a pattern in a string and replaces it. Syntax: re.sub(pattern, replacement, original_string [, count=0, flags=0]) pattern – The regular expression pattern to match replacement – The string or character that will replace the matched pattern original_string – The string where modifications will take place count (optional) – Limits the number of replacements (default is 0, meaning replace all occurrences) flags (optional) – Used to modify the behavior of the regex search Let's remove all whitespace characters (\s) using the sub() method from the re module: import re example_str = "H o s t m a n" print(f'Original string: {example_str}') result_str = re.sub('\s', '', example_str) print(f'String after adjustments: {result_str}') Output: Original string: H o s t m a nString after adjustments: Hostman Using Slices to Remove Characters In addition to using various methods to delete characters, Python also allows the use of slices. As we know, slices extract a sequence of characters from a string. To delete characters from a string by index in Python, we can use the following slice: example_str = "\nHostman \nVPS" print(f'Original string: {example_str}') result_str = example_str[1:9] + example_str[10:] print(f'String after adjustments: {result_str}') In this example, we used slices to remove newline characters. The output of the code: Original string:HostmanVPSString after adjustments: Hostman VPS Apart from using two slice parameters, you can also use a third one, which specifies the step size for index increments. For example, if we set the step to 2, it will remove every odd-indexed character in the string. Keep in mind that indexing starts at 0. 
Example: example_str = "Hostman Cloud" print(f'Original string: {example_str}') result_str = example_str[::2] print(f'String after adjustments: {result_str}') Output: Original string: Hostman CloudString after adjustments: HsmnCod Conclusion In this guide, we learned how to delete characters from a string in Python using different methods, including regular expressions and slices. The choice of method depends on the specific task. For example, the replace() method is suitable for simpler cases, while re.sub() is better for more complex situations.
23 August 2025 · 5 min to read
Python

Command-Line Option and Argument Parsing using argparse in Python

Command-line interfaces (CLIs) are one of the quickest and most effective means of interacting with software. They enable you to provide commands directly which leads to quicker execution and enhanced features. Developers often build CLIs using Python for several applications, utilities, and automation scripts, ensuring they can dynamically process user input. This is where the Python argparse module steps in. The argparse Python module streamlines the process of managing command-line inputs, enabling developers to create interactive and user-friendly utilities. As part of the standard library, it allows programmers to define, process, and validate inputs seamlessly without the need for complex logic. This article will discuss some of the most important concepts, useful examples, and advanced features of the argparse module so that you can start building solid command-line tools right away. How to Use Python argparse for Command-Line Interfaces This is how to use argparse in your Python script: Step 1: Import Module First import the module into your Python parser script: import argparse This inclusion enables parsing .py arg inputs from the command line. Step 2: Create an ArgumentParser Object The ArgumentParser class is the most minimal class of the Python argumentparser module's API. To use it, begin by creating an instance of the class: parser = argparse.ArgumentParser(description="A Hostman tutorial on Python argparse.") Here: description describes what the program does and will be displayed when someone runs --help. Step 3: Add Inputs and Options Define the parameters and features your program accepts via add_argument() function: parser.add_argument('filename', type=str, help="Name of the file to process") parser.add_argument('--verbose', action='store_true', help="Enable verbose mode") Here: filename is a mandatory option. --verbose is optional, to allow you to set the flag to make it verbose. 
Step 4: Parse User Inputs

Process the user-provided inputs by calling the parse_args() method:

    args = parser.parse_args()

This stores the command-line values as attributes of the args object for later use in your script.

Step 5: Access Processed Data

Read the parsed values from args, for example:

    print(f"File to process: {args.filename}")
    if args.verbose:
        print("Verbose mode enabled")
    else:
        print("Verbose mode disabled")

Example CLI Usage

Here are some ways to run this script.

File Processing Without Verbose Mode

    python3 file.py example.txt

File Processing With Verbose Mode

    python3 file.py example.txt --verbose

Display Help

To see which arguments the script accepts and their descriptions, use the --help option:

    python3 file.py --help

Common Examples of argparse Usage

Let's explore a few practical examples of the module.

Example 1: Adding Default Values

Optional inputs often need predefined values for smoother execution. With argparse, you can set a default that applies when the user doesn't provide a value. This script uses a default timeout of 30 seconds if you don't specify the --timeout option.

    import argparse

    # Create the argument parser
    parser = argparse.ArgumentParser(description="Demonstrating default argument values.")

    # Add an optional argument with a default value
    parser.add_argument('--timeout', type=int, default=30, help="Timeout in seconds (default: 30)")

    # Parse the arguments
    args = parser.parse_args()

    # Retrieve and print the timeout value
    print(f"Timeout value: {args.timeout} seconds")

Explanation

Importing Module: The script imports the argparse module.
Creating the ArgumentParser Instance: An ArgumentParser object is created with a description that briefly states the program's purpose. This description is displayed when the user runs the program with the --help option.
Including --timeout: The --timeout option is optional, as the -- prefix indicates. type=int converts the value given for --timeout to an integer. default=30 means that if the user does not supply a value, the timeout is 30 seconds. The help parameter adds a description of the argument, which also appears in the help output.
Parsing Process: The parse_args() method processes user inputs and makes them accessible as attributes of the args object. In our example, we read args.timeout and print its value.

Case 1: Default Value Used

If the --timeout option is not specified, the default value of 30 seconds is used:

    python3 file.py

Case 2: Custom Value Provided

To pass a custom value for --timeout (e.g., 60 seconds), run:

    python3 file.py --timeout 60

Example 2: Utilizing Choices

The choices parameter lets you restrict an argument to a predefined set of valid values. This is useful if your program has specific modes, options, or settings to validate. Here, we define a --mode option with two allowed values: basic and advanced.

    import argparse

    # Create the argument parser
    parser = argparse.ArgumentParser(description="Demonstrating the use of choices in argparse.")

    # Add the --mode argument with predefined choices
    parser.add_argument('--mode', choices=['basic', 'advanced'], help="Choose the mode of operation")

    # Parse the arguments
    args = parser.parse_args()

    # Access and display the selected mode
    if args.mode:
        print(f"Mode selected: {args.mode}")
    else:
        print("No mode selected. Please choose 'basic' or 'advanced'.")

Adding --mode: The choices argument specifies that the valid values for --mode are basic and advanced. If the user supplies any other value, argparse prints an error message and exits.
Help Text: The help parameter provides the description shown when the script is run with --help.
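The behavior of choices can also be checked from inside Python. The following is a minimal sketch that passes explicit argument lists to parse_args() and catches the SystemExit that argparse raises when it rejects input. The helper names build_parser and try_parse are illustrative assumptions, not part of argparse.

```python
import argparse


def build_parser():
    """Build the same parser as in Example 2."""
    parser = argparse.ArgumentParser(description="Demonstrating the use of choices in argparse.")
    parser.add_argument('--mode', choices=['basic', 'advanced'], help="Choose the mode of operation")
    return parser


def try_parse(argv):
    """Return the parsed mode, or the exit code if argparse rejects the input.

    Hypothetical helper for illustration only.
    """
    parser = build_parser()
    try:
        return parser.parse_args(argv).mode
    except SystemExit as exc:
        # argparse prints a usage error to stderr and exits with status 2
        return exc.code


print(try_parse(['--mode', 'basic']))         # valid choice
print(try_parse([]))                          # option omitted, so mode is None
print(try_parse(['--mode', 'intermediate']))  # invalid choice, rejected by argparse
```

Passing a list to parse_args() instead of relying on sys.argv keeps the parser easy to exercise in tests.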
Case 1: Valid Input

To specify a valid value for --mode, run:

    python3 file.py --mode basic

Case 2: No Input Provided

To run the program without specifying a mode:

    python3 file.py

Case 3: Invalid Input

If a value is provided that is not in the predefined choices:

    python3 file.py --mode intermediate

Example 3: Handling Multiple Values

The nargs parameter lets an argument accept more than one value. This is useful whenever your program needs a list of values to process, e.g., numbers, filenames, or options. Here we use nargs='+' to define a --numbers option that takes one or more integers.

    import argparse

    # Create an ArgumentParser object
    parser = argparse.ArgumentParser(description="Demonstrating how to handle multiple values using argparse.")

    # Add the --numbers argument with nargs='+'
    parser.add_argument('--numbers', nargs='+', type=int, help="List of numbers to process")

    # Parse the arguments
    args = parser.parse_args()

    # Access and display the numbers
    if args.numbers:
        print(f"Numbers provided: {args.numbers}")
        print(f"Sum of numbers: {sum(args.numbers)}")
    else:
        print("No numbers provided. Please use --numbers followed by a list of integers.")

Adding the --numbers Option: The user can provide a list of values as arguments for --numbers. type=int converts each value to an integer. If a non-integer value is provided, argparse prints an error message and exits. The help parameter supplies the description shown in the help output.
Parsing Phase: After the arguments are parsed, the values given to --numbers are stored as a list in args.numbers.
Utilizing the Input: You can then iterate over the list, compute statistics (e.g., sum, mean), or perform any other calculation on the input.
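As one way to act on the parsed list, the sketch below computes a few statistics from --numbers. The summarize function and the shape of the dictionary it returns are assumptions made for illustration; only the add_argument() call comes from the example above.

```python
import argparse
from statistics import mean


def summarize(argv):
    """Parse --numbers from argv and return simple statistics.

    Hypothetical helper; returns None when --numbers is omitted.
    """
    parser = argparse.ArgumentParser(description="Demonstrating how to handle multiple values using argparse.")
    parser.add_argument('--numbers', nargs='+', type=int, help="List of numbers to process")
    args = parser.parse_args(argv)

    if not args.numbers:
        return None

    # args.numbers is a plain list of ints, so built-in tools work on it directly
    return {
        "count": len(args.numbers),
        "sum": sum(args.numbers),
        "mean": mean(args.numbers),
    }


print(summarize(['--numbers', '10', '20', '30']))
print(summarize([]))
```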
Case 1: Providing Multiple Numbers

To pass multiple integers to the --numbers option, run:

    python3 file.py --numbers 10 20 30

Case 2: Providing a Single Number

To pass just one integer, run:

    python3 file.py --numbers 5

Case 3: No Input Provided

If the script is run without --numbers:

    python3 file.py

Case 4: Invalid Input

If a non-integer value is passed:

    python3 file.py --numbers 10 abc 20

Example 4: Required Optional Arguments

Optional arguments (those that begin with --) are not mandatory by default. Sometimes, however, your script cannot work properly without them. You can make such an argument mandatory by passing required=True when defining it.

In this script, --config specifies the path to a configuration file. With required=True, the script enforces that a value for --config is provided. If it is omitted, argparse prints an error message and exits.

    import argparse

    # Create an ArgumentParser object
    parser = argparse.ArgumentParser(description="Demonstrating required optional arguments in argparse.")

    # Add the --config argument
    parser.add_argument('--config', required=True, help="Path to the configuration file")

    # Parse the arguments
    args = parser.parse_args()

    # Access and display the provided configuration file path
    print(f"Configuration file path: {args.config}")

Adding the --config Option: --config is considered optional since it starts with --. However, because of required=True, users must include it when they run the script. The help parameter describes what the option does, and this description appears in the output of --help.
Parsing: The parse_args() method processes the arguments. If --config is missing, the program stops and shows a clear error message.
Accessing the Input: The value provided for --config is stored in args.config. You can then use it in your script to work with the configuration file.
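To inspect the error that argparse reports for a missing required option, the following sketch captures stderr while parsing. The parse_config helper and its return shape are illustrative assumptions; prog="file.py" is set only so that the usage line matches the script name used in this article.

```python
import argparse
import contextlib
import io


def parse_config(argv):
    """Return (config_path, exit_code, error_text) for the given arguments.

    Hypothetical helper for illustration only.
    """
    parser = argparse.ArgumentParser(
        prog="file.py",
        description="Demonstrating required optional arguments in argparse.",
    )
    parser.add_argument('--config', required=True, help="Path to the configuration file")

    captured = io.StringIO()
    try:
        # argparse writes its usage and error message to stderr before exiting
        with contextlib.redirect_stderr(captured):
            args = parser.parse_args(argv)
    except SystemExit as exc:
        return None, exc.code, captured.getvalue()
    return args.config, 0, ""


print(parse_config(['--config', 'settings.json']))

path, code, message = parse_config([])
print(code)
print(message)
```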
Case 1: Valid Input

To provide a valid path to the configuration file, run:

    python3 file.py --config settings.json

Case 2: Missing the Required Argument

To run the script without specifying --config:

    python3 file.py

Advanced Features

Beyond basic command-line arguments, argparse provides advanced features that improve the functionality and usability of your CLIs and help keep larger scripts readable and maintainable. Below are some of the capabilities you can use.

Handling Boolean Flags

Boolean flags toggle a feature on or off without requiring a value from the user. Use action='store_true' or action='store_false' to implement them.

    parser.add_argument('--debug', action='store_true', help="Enable debugging mode")

Passing --debug sets args.debug to True; without the flag, it is False.

Grouping Related Arguments

Use add_argument_group() to organize related arguments, which improves readability in complex CLIs.

    group = parser.add_argument_group('File Operations')
    group.add_argument('--input', type=str, help="Input file")
    group.add_argument('--output', type=str, help="Output file")

Grouped arguments appear under their own section in the --help output.

Mutually Exclusive Arguments

To ensure that users select only one of several conflicting options, use the add_mutually_exclusive_group() method.

    group = parser.add_mutually_exclusive_group()
    group.add_argument('--json', action='store_true', help="Output in JSON format")
    group.add_argument('--xml', action='store_true', help="Output in XML format")

This ensures that the user can choose either JSON or XML, but not both.

Conclusion

The argparse module simplifies building reliable CLIs that handle command-line arguments in Python programs. From a single positional input to more involved setups with choices and nargs, it lets developers build user-friendly and robust command-line tools. Following best practices, such as giving arguments clear names and writing informative help text, makes your scripts friendlier to use and easier to maintain.
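As a closing sketch, the boolean flag, argument group, and mutually exclusive group from the Advanced Features section can be combined in a single parser. The build_parser function name and the sample argument lists are assumptions for illustration; the add_argument() calls are the ones shown earlier.

```python
import argparse


def build_parser():
    """Combine a boolean flag, an argument group, and a mutually exclusive group."""
    parser = argparse.ArgumentParser(description="Sketch combining argparse's advanced features.")

    # Boolean flag: False unless --debug is passed
    parser.add_argument('--debug', action='store_true', help="Enable debugging mode")

    # Argument group: shown under its own heading in --help
    files = parser.add_argument_group('File Operations')
    files.add_argument('--input', type=str, help="Input file")
    files.add_argument('--output', type=str, help="Output file")

    # Mutually exclusive group: at most one of --json / --xml is accepted
    fmt = parser.add_mutually_exclusive_group()
    fmt.add_argument('--json', action='store_true', help="Output in JSON format")
    fmt.add_argument('--xml', action='store_true', help="Output in XML format")

    return parser


parser = build_parser()

# Sample arguments (hypothetical values) passed as a list instead of sys.argv
args = parser.parse_args(['--debug', '--input', 'data.txt', '--json'])
print(args.debug, args.input, args.output, args.json, args.xml)

# Conflicting flags make argparse print an error and raise SystemExit
try:
    parser.parse_args(['--json', '--xml'])
except SystemExit as exc:
    print(f"argparse rejected conflicting flags with exit code {exc.code}")
```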
21 July 2025 · 10 min to read
