
Time Series Forecasting with ARIMA in Python 3

Hostman Team
Technical writer
Python
15.07.2024
Reading time: 23 min

Data analytics has long become an integral part of modern life. We encounter a massive flow of information daily, which we need to collect and interpret correctly. One method of data analysis is time series forecasting. A time series (dynamic series) is a sequence of data points or observations taken at regular intervals. Examples of time series include monthly sales, daily temperatures, annual profits, and so on. Time series forecasting is a scientific field where models are built to predict the future behavior of a process or phenomenon based on past observations recorded in the dynamic series.

In this guide, we will focus on using the ARIMA model, one of the most commonly applied approaches in time series analysis. We will thoroughly examine the process of using the ARIMA model in Python 3—from the initial stages of loading and processing data to the final stage of forecasting. We will also learn how to determine and interpret the parameters of the ARIMA model and how to evaluate its quality.

Whether you are new to data analysis or an experienced analyst, this guide aims to teach you how to apply the ARIMA model to time series forecasting, and to do so effectively and in an automated manner, using Python's extensive functionality.

Setting Up the Working Environment for Data Analysis in Python

Installing Python

First and foremost, you need to install Python itself—the programming language we will use for data analysis. You can download it from the official website, python.org, following the installation instructions provided there. After completing the installation, open the command line (on Windows) or terminal (on macOS/Linux) and enter:

python --version

If everything was done correctly, you will see the version number of the installed Python.

Setting Up the Development Environment

To work with Python, you can choose a development environment (IDE) that suits you. In this guide, we will work with Jupyter Notebook, which is very popular among data analysts. Other popular options include PyCharm, Visual Studio Code, and Spyder. To install Jupyter Notebook, enter the following in the command line:

pip install jupyter

Installing Necessary Python Packages

Python ships with a standard library of built-in modules. These are extremely useful tools, but more in-depth data analysis calls for additional packages. In this guide, we will use:

  • pandas (for working with tabular data)

  • numpy (for scientific computations)

  • matplotlib (for data visualization)

  • statsmodels (a library for statistical models).

You can install these libraries using the pip3 install command in the terminal or command line:

pip3 install pandas numpy matplotlib statsmodels

We will also need the libraries warnings (for generating warnings) and itertools (for creating efficient looping structures), which are already included in the standard Python library, so you do not need to install them separately. To check the installed packages, use the command:

pip list

As a result, you will get a list of all installed modules and their versions.

Creating a Working Directory

Your working directory is the place on your computer where you will store all your Python scripts and project files. To create a new directory, open the terminal or command line and enter the following commands:

cd path_to_your_directory
mkdir Your_Project_Name
cd Your_Project_Name

Here, path_to_your_directory is the path to the location where the project folder will be created, and Your_Project_Name is the name of your project.

After successfully completing the above steps, you are ready to work on data analysis in Python. Your development environment is set up, your working directory is ready, and all necessary packages are installed.

Loading and Processing Data

Starting Jupyter Notebook

Let's start by launching Jupyter Notebook, our main tool for writing and testing Python code. In the command line (or terminal), navigate to your working directory and enter the following command:

jupyter notebook

A new tab with the Jupyter Notebook interface will open in your browser. To create a new document, select the "New" tab in the top right corner of the window and choose "Python 3" from the dropdown menu. You will be automatically redirected to a new tab where your notebook will be created.

Importing Libraries

The next step is to import the necessary Python libraries. Create a new cell in your notebook and insert the following code:

import warnings
import itertools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

To run the code, press Shift+Enter. This executes the current cell and moves the focus to the next one, where you can continue writing code. These libraries are now available in your project, and you can use their functionality for various data analysis tasks.

Loading Data

For time series forecasting in Python, we will use the Airline Passengers dataset. It tracks the monthly number of passengers on international airlines, expressed in thousands, from 1949 to 1960. The data is available in the jbrownlee/Datasets repository on GitHub (the URL below). To load the CSV file via URL, use the pandas library:

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"
time_series = pd.read_csv(url)

If your CSV file is stored locally on your computer, use the following line of code to load it:

time_series = pd.read_csv('airline_passengers.csv')

Now the data is saved in a DataFrame named time_series. A DataFrame is the primary data structure in the pandas library: a two-dimensional table where each row is a separate observation and each column is a feature or variable of that observation. To verify that the data loaded correctly and to confirm its format, display the first few rows of the dataset:

print(time_series.head())

This code will output the first five rows of the loaded dataset, allowing you to quickly check if they were loaded correctly and if they look as expected. By default, the head() method outputs five rows, but you can specify a different number in the method's parentheses to view another quantity.

You can also view the last rows of the DataFrame:

print(time_series.tail())

Data Processing

Before starting the analysis, the data needs to be preprocessed. Generally, this can involve many steps, but for our time series, we can limit ourselves to the following actions.

Checking for Missing Values

Handling missing values is an important step in the preprocessing of dynamic series. Missing values can cause issues in the analysis and distort forecasting results. To check for missing values, you can use the isnull() method from the pandas library:

print(time_series.isnull().sum())

If 0 is indicated for all columns, this means there are no missing values in the data. However, if missing values are found during execution, they should be handled. There are various ways to handle missing values, and the approach will depend on the nature of your data. For example, we can fill in missing values with the column's mean value:

time_series = time_series.fillna(time_series.mean())

To replace missing values only in certain columns, use this command:

time_series['Column 1'] = time_series['Column 1'].fillna(time_series['Column 1'].mean())

Data Type Conversion 

Each column in a DataFrame has a specific data type. For dynamic series, the datetime type, designed specifically for storing dates and times, is particularly important. By default, pandas reads such information as text: even if a column contains dates, pandas will treat them as ordinary strings. In our case, we need to convert the Month column to datetime so that we can work with the temporal data:

time_series['Month'] = pd.to_datetime(time_series['Month'])

Setting the Datetime Column as the Index

In pandas, each data row has its unique index (similar to a row number). However, sometimes it is more convenient to use a specific column from your data as an index. When working with time series, the most convenient choice for the index is the column containing the date or time. This allows for easy selection and analysis of data for specific time periods. In our case, we use the Month column as the index:

time_series.set_index('Month', inplace=True)

Rescaling Data

Another important step in data preprocessing is checking the need for rescaling the data. If the range of your data is too large (e.g., the Passengers value ranges from thousands to millions), you may need to transform the data. In the case of airline passenger data, they look quite organized, and such rescaling may not be relevant. However, it is always important to check the data range before further steps. An example of data standardization when the range is large:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
time_series[['Passengers']] = scaler.fit_transform(time_series[['Passengers']])

All the aforementioned steps are important in preparing data for further analysis. They help improve the quality of the time series and make the process of working with it simpler and more efficient. In this guide, we have covered only some data processing steps. But this stage can also include other actions, such as detecting and handling anomalies or outliers, creating new variables or features, and dividing the data into subgroups or categories.
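As an illustration of one such extra step, a simple interquartile-range (IQR) check for outliers might look like the following sketch (the numbers are invented for the example):

```python
import pandas as pd

# Hypothetical monthly values; 900 is an obvious outlier
values = pd.Series([112, 118, 132, 129, 121, 135, 148, 900])

# Interquartile range: the spread of the middle 50% of the data
q1, q3 = values.quantile(0.25), values.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are flagged as outliers
outliers = values[(values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)]
print(outliers.tolist())  # [900]
```

Whether to drop, cap, or keep flagged points depends on whether they are data errors or genuine extreme events.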

Data Visualization

An important element when working with data is its visual representation. Using matplotlib, we can easily turn data into a visual chart, which helps us understand the structure of the time sequence. Visualization allows us to immediately see trends and seasonality in the data. A trend is the general direction of data movement over a long period. Seasonality is recurring data fluctuations in predictable time frames (week, month, quarter, year, etc.). Generally, a trend is associated with long-term data movement, while seasonality is associated with short-term, regular, and periodic changes.

For example, if you see that the number of passengers grows every year, this indicates an upward trend. If the number of passengers grows in the same months every year, this indicates annual seasonality.

To draw a chart, use the following lines of code:

plt.figure(figsize=(15,8))
plt.plot(time_series['Passengers'])
plt.title('Time Series Plot')
plt.xlabel('Date')
plt.ylabel('Passengers')
plt.show()

For our data, we get the following plot:

[Figure: time series plot of monthly airline passengers, 1949–1960]

Stationarity of Data

In time series analysis, it is crucial to pay attention to the concept of stationarity. Stationarity in a time series means that the series' characteristics (such as mean and variance) remain constant over time. Non-stationary series often lead to errors in further predictions.

The ARIMA model can adapt the series to stationarity "on its own" through a special model parameter (d). However, understanding whether your initial time series is stationary helps you better understand ARIMA's workings.

There are several methods to check for stationarity in a series:

  1. Visual Analysis: Start by plotting the data and observe the following aspects:

    • Mean: Fluctuations in this indicator over time may signal that the time series is not stationary.

    • Variance: If the variance changes over time, this also indicates non-stationarity.

    • Trend: A visible trend on the graph is another indication of non-stationarity.

    • Seasonality: Seasonal fluctuations on the graph can also suggest non-stationarity.

  2. Statistical Analysis: Perform statistical tests like the Dickey-Fuller test. This method provides a quantitative assessment of a time series' stationarity. The null hypothesis of the test assumes that the time series is non-stationary. If the p-value is less than the significance level of 0.05, the null hypothesis is rejected, and the series can be considered stationary.

Running the Dickey-Fuller test on our data might look like this:

from statsmodels.tsa.stattools import adfuller

print('Test result:')
df_result = adfuller(time_series['Passengers'])
df_labels = ['ADF Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used']
for result_value, label in zip(df_result, df_labels):
    print(label + ' : ' + str(result_value))

if df_result[1] <= 0.05:
    print("Strong evidence against the null hypothesis, the series is stationary.")
else:
    print("Weak evidence against the null hypothesis, the series is not stationary.")

Our time series is not stationary, but in the following sections, we will automatically search for parameters for the ARIMA model and find the necessary parameters to make the series stationary.

Test result:
ADF Test Statistic : 0.8153688792060482
p-value : 0.991880243437641
#Lags Used : 13
Number of Observations Used : 130
Weak evidence against the null hypothesis, the series is not stationary.

Even though we don't need to manually make the series stationary, it's useful to know which methods can be used to do so. There are many methods, including:

  • Differencing: One of the most common methods, differencing involves calculating the difference between consecutive observations in the time series.

  • Seasonal Differencing: A variation of regular differencing, applied to data with a seasonal component.

  • Log Transformation: Taking the logarithm of the data can help reduce variability in the series and make it more stationary.

Some time series may be particularly complex and require combining transformation methods. After transforming the series, you should recheck for stationarity using the Dickey-Fuller test to ensure the transformation was successful.
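The first two techniques are one-liners in pandas. A small sketch on a synthetic series, chosen so the effect of each transformation is obvious:

```python
import numpy as np
import pandas as pd

# Synthetic monthly series with a constant upward trend
idx = pd.date_range("1949-01-01", periods=48, freq="MS")
series = pd.Series(100 + 2.0 * np.arange(48), index=idx)

# First-order differencing: change between consecutive observations
first_diff = series.diff().dropna()       # constant 2.0 -> trend removed

# Seasonal differencing: change versus the same month a year earlier
seasonal_diff = series.diff(12).dropna()  # constant 24.0

# Log transformation: dampens variance that grows with the level
log_series = np.log(series)
```

Note that diff() drops information about the level of the series, so forecasts made on differenced data must be cumulatively summed back to the original scale.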

ARIMA Model

ARIMA (AutoRegressive Integrated Moving Average) is a statistical model used for analyzing and forecasting time series data.

  • AutoRegressive (AR): Uses the dependency between an observation and a number of lagged observations (e.g., predicting tomorrow's weather based on previous days' weather).

  • Integrated (I): Involves differencing the time series data to make it stationary.

  • Moving Average (MA): Models the error between the actual observation and the predicted value using a combination of past errors.

The ARIMA model is usually denoted as ARIMA(p, d, q), where p, d, and q are the model parameters:

  • p: The order of the autoregressive part (number of lagged observations included).

  • d: The degree of differencing (number of times the data is differenced to achieve stationarity).

  • q: The order of the moving average part (number of lagged forecast errors included).

Choosing the appropriate values for (p, d, q) involves analyzing autocorrelation and partial autocorrelation plots and applying information criteria.

Seasonal ARIMA Model

Seasonal ARIMA (SARIMA) extends ARIMA to account for seasonality in time series data. In many cases, time series exhibit clear seasonal patterns, such as ice cream sales increasing in summer and decreasing in winter. SARIMA captures these seasonal patterns.

SARIMA is typically denoted as SARIMA(p, d, q)(P, D, Q)m, where p, d, q are non-seasonal parameters, and P, D, Q are seasonal parameters:

  • p, d, q: The same as in ARIMA.

  • P: The order of seasonal autoregression (number of lagged seasons affecting the current season).

  • D: The degree of seasonal differencing (number of times seasonal trends are differenced).

  • Q: The order of seasonal moving average (number of lagged seasonal forecast errors included).

  • m: The length of the seasonal period (e.g., 12 for monthly data with yearly seasonality).

Like ARIMA, SARIMA is suitable for forecasting time series data but with the added capability of capturing and modeling seasonal patterns.

Although ARIMA, particularly seasonal ARIMA, may seem complex due to the need to carefully select numerous parameters, automating this process can simplify the task.

Defining Model Parameters

The first step in configuring an ARIMA model is determining the optimal parameter values for our specific dataset.

To tune the ARIMA parameters, we will use "grid search." The essence of this method is that it goes through all possible parameter combinations from a predefined grid of values and trains the model on each combination. After training the model on each combination, the model with the best performance is selected.

The more different parameter values, the more combinations need to be checked, and the longer the process will take. For our case, we will use only two possible values (0 and 1) for each parameter, resulting in a total of 8 combinations for the ARIMA parameters and 8 for the seasonal part (with a seasonal period length = 12). Thus, the total number of combinations to check is 64, leading to a relatively quick execution.
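The arithmetic above is easy to verify with itertools, which we will also use in the grid search itself:

```python
import itertools

# Each of p, d, q can be 0 or 1
param_values = range(0, 2)
pdq = list(itertools.product(param_values, param_values, param_values))

print(len(pdq))              # 8 non-seasonal combinations
print(len(pdq) * len(pdq))   # 64 models to fit in total
```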

It's important to remember that the goal is to find a balance between the time spent on the grid search and the quality of the final model, meaning finding parameter values that yield the highest quality while minimizing time costs.

Importing Necessary Packages

from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

Statsmodels provides us with methods for building ARIMA models, and itertools (which we imported earlier) is used to create combinations of possible parameter values.

Ignoring Warnings

When working with large datasets and complex computations like statistical analysis or machine learning, libraries and functions may generate warnings about potential issues or non-optimality. However, these warnings are often insignificant or irrelevant to your specific case. Therefore, we set the warnings filter to ignore:

warnings.filterwarnings("ignore")

Creating a Range of Parameters for Model Tuning

To determine the model parameters, we'll define the function search_optimal_sarima.

def search_optimal_sarima(time_series, seasonal_cycle):
    order_vals = diff_vals = ma_vals = range(0, 2)
    pdq_combinations = list(itertools.product(order_vals, diff_vals, ma_vals))
    seasonal_combinations = [(combo[0], combo[1], combo[2], seasonal_cycle) for combo in pdq_combinations]
       
    smallest_aic = float("inf")
    optimal_order_param = optimal_seasonal_param = None

    for order_param in pdq_combinations:
        for seasonal_param in seasonal_combinations:
            try:
                sarima_model = sm.tsa.statespace.SARIMAX(time_series,
                                                         order=order_param,
                                                         seasonal_order=seasonal_param,
                                                         enforce_stationarity=False,
                                                         enforce_invertibility=False)

                model_results = sarima_model.fit()
                if model_results.aic < smallest_aic:
                    smallest_aic = model_results.aic
                    optimal_order_param = order_param
                    optimal_seasonal_param = seasonal_param
            except Exception:  # skip combinations where the model fails to fit
                continue

    print('ARIMA{}x{} - AIC:{}'.format(optimal_order_param, optimal_seasonal_param, smallest_aic))

seasonal_cycle_length = 12
search_optimal_sarima(time_series, seasonal_cycle_length)

The first three lines of code in our function create the parameter ranges. As we already know, the ARIMA model has three main parameters: p, d, and q. In the code above, each parameter is drawn from range(0, 2), meaning it can take the value 0 or 1. The itertools.product() function then generates all possible combinations of the three parameters, such as (0, 0, 0), (0, 0, 1), (0, 1, 1), and so on.

Then we create additional combinations by adding the seasonal period to each of the pdq combinations. This allows the model to account for seasonal influences on the time series.

Finding the Best Parameters for the Model

Now we need to apply the parameters we determined earlier to automatically tune ARIMA models. When working with forecasting models, our task is to choose the model that best explains and predicts the data. However, selecting the best model is not always straightforward. The Akaike Information Criterion (AIC) helps us compare different models and determine which one is better. AIC helps evaluate how well the model fits the data, considering its complexity. So, the goal is to find the model with the lowest AIC value.

The code above iterates through all possible parameter combinations and uses the SARIMAX function to build the seasonal ARIMA model. The order argument sets the main parameters (p, d, q), and seasonal_order sets the seasonal parameters of the model (P, D, Q, m).

For our data, we get the following result:

ARIMA(0, 1, 1)x(1, 1, 1, 12) - AIC:920.3192974989254

Building and Evaluating the SARIMAX Model

Once we have found the optimal parameters using grid search, we can use these parameters to train the SARIMAX model on our time series data. This helps us understand how well the model fits our data and provides an opportunity to adjust the model’s parameters if necessary.

First, we define the SARIMAX model with the previously found parameters:

from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(time_series, order=(0, 1, 1), seasonal_order=(1, 1, 1, 12))

Next, fit the model:

results = model.fit()

Print the model summary:

print(results.summary())

The model summary is widely used to assess the quality of the parameter fit. Key aspects to pay attention to include:

  • Coefficients: They should be statistically significant. Check the p-values of the coefficients (P>|z|); they should be less than 0.05.

  • AIC (Akaike Information Criterion): A lower AIC value indicates a better model fit.

  • Ljung-Box (L1) (Q): This is the p-value for the Ljung-Box Q-statistic. If the p-value is greater than 0.05, the residuals are random, which is good.

  • Jarque-Bera (JB): This is a test for the normality of residuals. If Prob(JB) is greater than 0.05, the residuals are normally distributed, which is good.

  • Heteroskedasticity (H): This is a test for heteroskedasticity in the residuals. If Prob(H) (two-sided) is greater than 0.05, the residuals are homoscedastic, which is good. Heteroskedasticity occurs when the variance of your forecast errors changes depending on the time period, which means there is a non-uniformity in your data.

Ideally, your model should have statistically significant coefficients, a low AIC value, and residuals that are normally distributed and homoscedastic. Meeting these criteria indicates a good model.

For our model, we obtained the following output:

[Output: SARIMAX model summary table]

Plot the model diagnostics:

results.plot_diagnostics(figsize=(12, 8))
plt.show()

This command generates four diagnostic plots:

  • Residuals Plot: A plot of model residuals over time. If the model is good, the residuals should be random, and the plot should look like white noise.

  • Q-Q Plot: A plot comparing the distribution of residuals to a standard normal distribution. If the points follow the diagonal line, it indicates that the residuals are normally distributed.

  • ACF Plot: A plot of the autocorrelation of residuals. If the model is good, the residuals should not be correlated with each other; bars that stay inside the shaded confidence band indicate this.

  • Histogram of Residuals: A histogram of the distribution of residuals. If the model is good, the residuals should be normally distributed, and the histogram should resemble a bell curve.

These plots, along with the model summary, help us check how well the model fits our data and whether it was correctly specified. If the model is incorrect or unsuitable for the data, it may provide inaccurate forecasts, which could negatively impact decisions made based on these forecasts.

Our diagnostic plots look as follows:

[Figure: model diagnostic plots]

The model we selected generally meets the requirements, but there is still potential for improving the parameters of the seasonal ARIMA model. Applying SARIMA to time series data often requires a careful approach, and it is always beneficial to conduct a thorough data analysis and spend more time on data preprocessing and exploratory analysis before applying time series models.

Static and Dynamic Forecasting

After successfully training the model, the next step is to generate forecasts and compare the predicted values with the actual data.

Static Forecasting

First, we generate forecasted values using the model, starting from a specific date and extending to the end of the dataset. The get_prediction method returns a prediction object from which we can extract forecasted values using predicted_mean:

st_pred = results.get_prediction(start=pd.to_datetime('1955-12-01'), dynamic=False)
forecast_values = st_pred.predicted_mean

Here, December 1955 is used as an example starting date, but you can adjust this date according to your needs.

Now we have the forecasted values that we can compare with the actual time series data. We will use the Mean Squared Error (MSE) as our metric for evaluating the accuracy of the forecast:

actual_values = time_series['1955-12-01':]['Passengers']
forecast_mse = ((forecast_values - actual_values) ** 2).mean()
print('Mean Squared Error of the forecast is {}'.format(round(forecast_mse, 2)))

MSE is a widely accepted metric for evaluating the performance of forecasting models. A lower MSE indicates a more accurate model. Of course, there is no perfect model, and there will always be some deviation between forecasts and actual data. In our case, the Mean Squared Error of the forecast is 170.37.
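Because MSE is in squared units, it is often helpful to also report the root mean squared error (RMSE), which is back in the data's own units (thousands of passengers):

```python
import numpy as np

forecast_mse = 170.37  # the MSE computed above
forecast_rmse = np.sqrt(forecast_mse)
print('RMSE of the forecast is {}'.format(round(forecast_rmse, 2)))  # about 13.05
```

An RMSE of roughly 13 means the static forecast is off by about 13,000 passengers per month on average.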

Finally, we visualize the results to visually assess the accuracy of our forecasts compared to the actual data:

plt.figure(figsize=(15,8))

plt.plot(actual_values.index, actual_values, label='Actual Values', color='blue')
plt.plot(forecast_values.index, forecast_values, label='Forecasted Values', color='red')

plt.title('Actual and Forecasted Values')
plt.xlabel('Date')
plt.ylabel('Passengers')
plt.legend()

plt.show()

This code generates a plot showing the actual and forecasted passenger numbers over time. The red line represents the forecasted values, while the blue line shows the actual data.

[Figure: actual vs. forecasted values]

This visualization helps you understand how well the model predicts the data.

Dynamic Forecasting

Dynamic forecasting generally provides a more realistic view of future time series behavior because it incorporates forecasts into future predictions.

In static forecasting, the model uses the entire known dataset to forecast each subsequent value. Dynamic forecasting, however, uses the most recent forecasted values for future predictions, starting from a user-defined start date.

To perform dynamic forecasting, set the dynamic parameter to True:

dyn_pred = results.get_prediction(start=pd.to_datetime('1955-12-01'), dynamic=True)
dynamic_forecast_values = dyn_pred.predicted_mean

You can also calculate the Mean Squared Error for the dynamic forecast:

mse_dynamic_forecast = ((dynamic_forecast_values - actual_values) ** 2).mean()
print('Mean Squared Error of the dynamic forecast is {}'.format(round(mse_dynamic_forecast, 2)))

And plot the actual and dynamically forecasted values:

plt.figure(figsize=(15,8))

plt.plot(actual_values.index, actual_values, label='Actual Values', color='blue')
plt.plot(dynamic_forecast_values.index, dynamic_forecast_values, label='Dynamic Forecast', color='green')

plt.title('Actual and Dynamically Forecasted Values')
plt.xlabel('Date')
plt.ylabel('Passengers')
plt.legend()

plt.show()

[Figure: actual vs. dynamically forecasted values]

After performing static and dynamic forecasts, we can evaluate whether our time series model is successful. The next step is to attempt to predict future data in this time series.

Creating and Visualizing Forecasts

Now we can finally use the ARIMA model in Python to forecast future values.

To perform forecasting for a certain number of steps ahead, you can use the get_forecast method from the results model:

pred_future = results.get_forecast(steps=12)

We use the trained model (results) to get forecasts for the next 12 periods. Since our data includes information up to December 1960, we will generate predictions for the number of passengers each month for the year 1961.

We will print the forecasted mean values and confidence intervals:

print(f'Forecasted mean values:\n\n{pred_future.predicted_mean}')
print(f'\nConfidence intervals:\n\n{pred_future.conf_int()}')

We can also visualize our forecast:

fig = plt.figure()
plt.plot(pred_future.predicted_mean, label='Forecasted Mean Values')
plt.fill_between(pred_future.conf_int().index,
                 pred_future.conf_int().iloc[:, 0],
                 pred_future.conf_int().iloc[:, 1], color='k', alpha=.2)
plt.legend()  
plt.show()

[Figure: 1961 forecast with confidence interval]

This visualization is very useful for understanding what the model predicts. The forecasted mean values show the expected number of passengers each month in 1961, and the shaded area around the forecast represents the confidence interval.

Conclusion

In this tutorial, we discussed how to apply the ARIMA model for time series forecasting using Python. We covered the entire process from data loading and preprocessing to finding optimal parameters for the model, evaluating it, and ultimately forecasting future values.

Using ARIMA helps us understand the application of more advanced forecasting techniques. It is important to remember that the ARIMA model might not work for all time series, and the results will depend on the quality of your initial data and the preprocessing performed.

Now you can automate the forecasting of time series data using the ARIMA model and the Python programming language. We encourage you to practice and revisit this tutorial with different datasets to enhance your skills.

On our app platform you can find Python applications, such as Celery, Django, FastAPI and Flask. 


Similar

Python

Useful Tips for Web Data Scraping

In one of the previous articles, we learned what parsing is and looked at examples of obtaining data from HTML pages using Python. In this guide, we continue to move in that direction and offer web scraping best practices and tips that will help you automatically extract data from most existing websites. Obtaining data automatically may be prohibited by the terms of use of websites. We do not encourage violations of these terms, the rules specified in the robots.txt file, or any other applicable legal norms. Use the methods presented here only within permitted scenarios, and respect the policies of website owners. Tip 1. Learn to Work with DevTools By the way information is delivered, websites can be divided into two groups: static and dynamic. On static websites, all data is stored in the form of fixed HTML files that are kept on the server. Their contents do not change unless the developer modifies them. Dynamic websites, on the other hand, support real-time content generation and can load information from storage or, for example, from a database. Usually, writing a script for a static site is easier, since the information is definitely located inside the HTML document, and you don’t need to look for additional requests. Working with the Web Inspector The first thing a developer needs in order to identify the source of data quickly is to learn how to use the developer tools (DevTools). They exist in every browser and can be opened using the F12 key, or the combination Ctrl + Alt + I on Windows, or Command + Option + I on macOS. At first, you will only need two tabs: Elements and Network. The first allows you to see the structure of the page and determine in which DOM element the data is located. The Network tab is needed for working with requests, which we will later copy. The tabs are located at the top of the developer tools. Most often, information reaches the site in two ways: In the HTML markup of the page. 
This happens if the data is added to the page during backend processing.

In JSON format. Such data can be requested by the frontend both during page loading and after certain user actions on the page.

Tip 2. Use a Ready-Made Algorithm to Start Working with Any Donor Site

Below is an action algorithm recommended for starting work with any donor site.

Find a GET request with content type text/html, which the browser sends when the page is initialized. To do this, go to the page from which you need to extract data, open the web inspector on the Network tab, and clear the requests by clicking the trash bin icon to the left of the request search bar. Reload the page with Ctrl + R on Windows/Linux or Command + R on macOS. One of the first requests will be the needed GET request with content type text/html.

Click on the request you found, then go to the Response tab. A preview of the server's response will open. The page layout may appear broken; this is normal.

Try to find the required data visually in the preview mode. For example, the HTML markup of articles on Hostman is generated by the server; if you needed to automatically obtain the text of an article, most of the work would already be done.

If you can't find the data visually, go to the HTML markup view of the server response (not to be confused with the Elements tab). Activate search within the response with Ctrl + F on Windows or Command + F on macOS. Enter an example of data that you know is definitely on the page (for instance, the phrase "configuring Nginx" if you know the article contains it). The browser will highlight the substring if matches are found. Often, if the information is delivered by the server as HTML markup, selector names remain the same. For convenience, you can use the standard element-picking tool: Ctrl + Shift + C on Windows or Cmd + Shift + C on macOS. Press the shortcut and select the element directly on the page.
The browser will show the desired element, and its selectors can be conveniently transferred into your code. If the required data is not present, proceed to the next step.

Find the requests that contain only JSON. This is easiest to do by filtering: click on the request search bar and enter the filter:

mime-type: application/json

Go through each request with the donor site's domain and repeat the search for data, as in the previous step. If no necessary data is found, you will most likely need to resort to browser emulation to parse the information.

Tip 3. Use Quick Export of Requests

In most cases, along with the request, the browser sends request headers and cookies to the server. Headers transmit metadata that lets the server understand what data format is being requested and how best to deliver it. Cookies store session information and user preferences, which the server uses to form a personalized response. Without this data, the server may reject the request if it considers it insufficiently trustworthy.

Exporting a Request with cURL

This method lets you export ready-made request code, and not only in Python; it works for any request.

1. Find the desired request in the web inspector.
2. Right-click the request, then choose Copy and Copy as cURL. The request information is now on your clipboard.
3. Go to curlconverter.com, a Swiss Army knife for developers of parsing and automation scripts.
4. Click Python in the programming language selection bar.
5. Paste the copied request into the input field.

You now have a ready-made code template with all request parameters, suitable for importing into your IDE. The code contains dictionaries with headers, cookie data, JSON request parameters (json_data, if present), and everything necessary to fully duplicate the request made in the browser.

Tip 4. Use a Virtual Environment When Working with Python

Most often, scripts for parsing and automation are later uploaded to a remote server.
A virtual environment creates a separate environment for the project and isolates its dependencies from system libraries. This helps avoid version conflicts and reduces the risk of unexpected failures. We explained more about virtual environments and how to create them in another article.

To quickly transfer the project to a server, provided you worked in a virtual environment on your local computer, first save the list of libraries with versions from pip into a requirements.txt file:

pip freeze > requirements.txt

If you have just created a server on Ubuntu, you can use a universal script to install Python, a virtual environment, and all dependencies on a clean server. First, transfer the project files (using the scp utility or the FTP protocol) and go to the project directory. Then specify the required Python version in the PYVER variable at the beginning of the command and execute it:

export PYVER=3.9 && sudo apt update && sudo apt upgrade -y && sudo apt install -y software-properties-common && sudo add-apt-repository ppa:deadsnakes/ppa -y && sudo apt update && sudo apt install -y python${PYVER} python${PYVER}-venv python${PYVER}-dev python3-pip && python${PYVER} -m venv venv && source venv/bin/activate && pip install --upgrade pip && [ -f requirements.txt ] && pip install -r requirements.txt

Tip 5. Include Error Handlers in Your Algorithm

When developing a parser, it is important to provide an error handling mechanism. Network failures, changes in the HTML structure, or unexpected blocking by the site may cause script failures. Add request retries, timeouts, and a logging system for all actions and errors. This approach lets you quickly detect problems, adjust parsing algorithms, and keep the application stable even when the donor site changes.
In Python, you can use:

try/except/finally constructs;
the logging library for logging;
loops for retrying failed requests;
timeouts, for example:

With requests:

requests.get("https://hostman.com", timeout=20)

With aiohttp:

timeout = aiohttp.ClientTimeout(total=60, sock_connect=10, sock_read=10)
async with aiohttp.ClientSession(timeout=timeout) as session:
    async with session.get(url) as response:
        return await response.text()

Tip 6. Implement Your Parser as a Generator

A generator is an object that yields elements iteratively, as they are needed. Generators are especially convenient when developing a parsing script for the following reasons:

Lazy evaluation. Generators calculate and return data "on the fly," which makes it possible to process large volumes of information without consuming significant amounts of memory. When parsing large files or web pages, this is critical: data is processed gradually, and only the current part is stored in memory, not the entire result at once.

Increased performance. Since elements are generated as needed, you can begin processing and transferring data (for example, to a database or a bot) before the entire dataset has been obtained. This reduces delays and allows you to react faster to incoming data.

Convenient code organization. Generators simplify the implementation of iterative processes, letting you focus on the parsing logic rather than managing iteration state. This is especially useful when you need to process a data stream and pass it to other parts of the system.

Example of Implementing a Parser as a Generator in Python

In the loop where the generator is consumed, it is convenient to write data to a database or, for example, send notifications through a Telegram bot. Using generators makes the code more readable.
import requests
from bs4 import BeautifulSoup

class MyParser:
    def __init__(self, url):
        self.url = url

    def parse(self):
        """
        Generator that sequentially returns data
        (for example, titles of elements on a page).
        """
        response = requests.get(self.url)
        if response.status_code != 200:
            raise Exception(f"Failed to retrieve page, status: {response.status_code}")
        soup = BeautifulSoup(response.text, "html.parser")
        items = soup.select("div")
        for item in items:
            title = item.select_one("h1").get_text(strip=True) if item.select_one("h1") else "No title"
            yield {
                "title": title,
                "content": item.get_text(strip=True)
            }

if __name__ == "__main__":
    parser = MyParser("https://example.com")
    for data_item in parser.parse():
        print(data_item["title"], "--", data_item["content"])

Tip 7. Use an Asynchronous Approach to Speed Up Processing a Large Number of Requests

When parsing a large number of pages, a synchronous approach often becomes a bottleneck, since each request waits for the previous one to finish. Asynchronous libraries, such as aiohttp in Python, allow you to execute many requests simultaneously, which significantly speeds up data collection. However, to avoid overloading both your application and the donor servers, it is important to regulate the request flow properly. This is where throttling, exponential backoff, and task queues come into play.

How It Works

Asynchronous requests. Create an asynchronous session with specified timeouts (for example, total timeout, connection timeout, and read timeout). This allows you to process many requests in parallel without blocking the main execution thread.

Throttling. To prevent excessive load on the donor server, limit the number of simultaneous requests. This can be done using semaphores or other concurrency control mechanisms (for example, asyncio.Semaphore), so that you do not send requests faster than allowed.

Exponential backoff.
If a request fails (for example, due to a timeout or temporary blocking), use an exponential backoff strategy: on each retry, the waiting interval increases (for example, 1 second, then 2, 4, 8…), which gives the server time to recover and reduces the likelihood of repeated errors.

Task queues. Organizing queues (for example, with asyncio.Queue) helps manage a large flow of requests. First, a queue of URLs is formed; then requests are processed as execution "slots" become available. This ensures an even distribution of load and stable operation of the parser.

Example of Implementation in Python Using aiohttp

import asyncio
import aiohttp
from aiohttp import ClientTimeout

# Limit the number of simultaneous requests
semaphore = asyncio.Semaphore(10)

async def fetch(session, url):
    async with semaphore:
        try:
            async with session.get(url) as response:
                return await response.text()
        except Exception:
            # Apply exponential backoff in case of error
            for delay in [1, 2, 4, 8]:
                await asyncio.sleep(delay)
                try:
                    async with session.get(url) as response:
                        return await response.text()
                except Exception:
                    continue
            return None

async def main(urls):
    timeout = ClientTimeout(total=60, sock_connect=10, sock_read=10)
    async with aiohttp.ClientSession(timeout=timeout) as session:
        tasks = [asyncio.create_task(fetch(session, url)) for url in urls]
        results = await asyncio.gather(*tasks)
        # Process the obtained data
        for result in results:
            if result:
                print(result[:200])  # Print the first 200 characters of the response

# Example list of URLs for parsing
urls = ["http://hostman.com"] * 100

asyncio.run(main(urls))

Recommendations for Developers

A few more recommendations will help simplify a developer's work.

Check whether the donor site has a public API. Sometimes the task of writing a parsing algorithm has already been solved, and the site offers a convenient API that fully covers the required functionality.

Monitor changes in the site's structure.
Donor site developers may change the layout, which would require you to update the selectors of the elements used in your code.

Test function execution at every stage. Automated tests (unit tests, integration tests) help promptly detect issues related to site structure changes or internal code modifications.

Checklist for Determining the Parsing Method

We have systematized the information from this article so you can understand which parsing method to use when working with any donor site.

Conclusion

The universal parsing methods presented here form a reliable foundation for developing algorithms capable of extracting data from a wide variety of websites, regardless of the programming language chosen. Following these scraping best practices and tips allows you to build a flexible, scalable, and change-resistant algorithm. Such an approach not only helps you use system resources optimally but also ensures quick integration of the obtained data with databases, messengers, or other external services.
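Tip 5 above recommends retries, timeouts, and logging but stops short of a complete example. The sketch below combines the three ideas in one helper; fetch_with_retries and its parameters are illustrative names, not part of any library, and the transport function is passed in so the retry logic stays testable without a network:

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("parser")

def fetch_with_retries(fetch, url, retries=3, base_delay=1.0):
    """Call fetch(url), retrying with exponential backoff on failure."""
    for attempt in range(retries):
        try:
            return fetch(url)
        except Exception as err:
            delay = base_delay * (2 ** attempt)  # 1 s, 2 s, 4 s, ...
            logger.warning("Attempt %d for %s failed (%s); retrying in %.2f s",
                           attempt + 1, url, err, delay)
            time.sleep(delay)
    raise RuntimeError(f"All {retries} attempts for {url} failed")
```

In real code, fetch could be a thin wrapper around requests.get(url, timeout=20); keeping the retry loop separate from the transport makes it easy to unit-test with a fake fetch function.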
23 September 2025 · 12 min to read
Python

How to Use Python time.sleep()

Sometimes, while running a program, it's necessary to pause: wait for data to load, give the user time to enter input, or reduce the load on the system. One of the simplest ways to achieve this in Python is the time.sleep() function, which suspends program execution for a given interval. In this article, we'll examine how time.sleep() works in Python, its features, alternatives, and possible errors. We'll discuss when it's appropriate to use it in multithreaded and asynchronous programs, and when it's better to choose asyncio.sleep() or other waiting mechanisms.

What is the sleep() Function in Python?

Python's time.sleep() function freezes the execution of the current thread for a specified period of time. It is part of the built-in time module and was added to the standard library to make creating pauses in code simple. In practice, sleep() is useful for pauses in test environments, delays between API requests, or intervals between sending messages. However, do not confuse slowing down a script with system-level tasks such as thread synchronization: if precise timing coordination or asynchronous I/O is needed, other tools are more suitable.

How time.sleep() Works

The time.sleep() function pauses the current thread for the specified number of seconds. In a multithreaded scenario, other threads continue running, but the one where time.sleep() was called remains "frozen" for that interval. It's important to note that time.sleep() blocks code execution at that point, delaying all subsequent operations. Ignoring this can lead to reduced performance or even a frozen user interface in desktop applications.
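A small sketch makes this blocking behavior measurable: time.sleep() suspends the calling thread for at least the requested interval, often a few milliseconds more depending on the OS scheduler:

```python
import time

start = time.perf_counter()
time.sleep(0.2)  # block the current thread for ~200 ms
elapsed = time.perf_counter() - start

# The pause is never shorter than requested, but may be slightly longer
print(f"Requested 0.200 s, slept {elapsed:.3f} s")
```

Running this typically prints a value just above 0.200, which is exactly the "minimum delay, not exact delay" guarantee described above.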
When time.sleep() is Used

Most often, time.sleep() is used in testing and debugging, when a short delay is needed—for example, to verify the correctness of an API response or wait for a server reply. It's also used for step-by-step script execution, giving the user time to view information or enter data. In demonstrations, tutorials, and prototyping, time.sleep() helps simulate long-running processes, and when working with external services, it helps avoid penalties or blocks caused by overly frequent requests. However, sleep() is not the only way to slow down code execution; we'll review some alternatives later in the article.

How to Use time.sleep() in Python

You must import the time module before you can use time.sleep(). The required delay in seconds is then passed as a parameter; it can be a whole number or a floating-point number.

Basic Syntax of time.sleep()

To call the time.sleep() function, first import the time module:

import time

time.sleep(5)

In this example, the program will "sleep" for 5 seconds. The number passed to the function can be either an integer or a float, which allows sleeping for fractions of a second.

Syntax:

time.sleep(seconds)

The time.sleep() function does not return any value. That means you cannot precisely determine how accurate the pause was—it simply suspends the current thread for the specified duration.

Example: Delaying Code Execution

Suppose you have a small script that prints messages with a 2-second interval. To add a delay in Python, just insert time.sleep(2):

import time

print("First message")
time.sleep(2)
print("Second message")
time.sleep(2)
print("Third message")

When running this script, the user will see a 2-second pause between each message. That's exactly how a delay in Python works using time.sleep(2).

Parameters of time.sleep()

The time.sleep() function accepts only one parameter, but it can be either an integer or a float.
This adds flexibility when implementing delays in Python.

Passing Values in Seconds

Most examples of time.sleep() usage pass an integer representing seconds. For example:

time.sleep(10)

Here, the script pauses for 10 seconds. This is convenient when you need a long pause or want to limit request frequency.

Using Fractions of a Second (Milliseconds)

Sometimes you need to pause for a few milliseconds or fractions of a second. To do this, pass a floating-point number:

time.sleep(0.5)

This creates a half-second pause. However, because of OS and Python timer limitations, the delay may slightly exceed 500 milliseconds. For most tasks this isn't critical, but in high-precision real-time systems, specialized tools should be used instead.

Alternative Ways to Pause in Python

Although time.sleep() is the most popular and simplest way to create pauses, there are other methods that may be more suitable when waiting for external events or handling multiple threads. Let's look at the most common alternatives.

Using input() for Waiting

The simplest way to pause in Python is by calling input(). It suspends execution until the user presses Enter or enters data. Example:

print("Press Enter to continue...")
input()
print("Resuming program execution")

While this feels like a pause, technically it's not a timed delay: the program waits for user action, not a fixed interval. This method is rarely useful in automated scripts but can be handy in debugging or console utilities where a "pause on demand" is needed.

Waiting with threading.Event()

If you're writing a multithreaded program, it can be more useful to use synchronization objects like threading.Event(). You can configure it to block a thread until a signal is received.
Example:

import threading
import time

event = threading.Event()

def worker():
    print("Starting work in thread")
    event.wait()
    print("Event received, resuming work")

thread = threading.Thread(target=worker)
thread.start()

time.sleep(3)
event.set()

In this case, the thread is blocked until event.set() is called. You can still use time.sleep() to set a maximum pause, but unlike plain sleep(), this approach allows more flexible control: the thread can be "woken up" immediately without waiting for the full interval.

asyncio.sleep() for Asynchronous Programs

In asynchronous Python programming (the asyncio module), asyncio.sleep() is used. Unlike time.sleep(), it doesn't block the entire thread but only suspends the current coroutine, allowing the event loop to continue running other tasks. Example:

import asyncio

async def main():
    print("Start async work")
    await asyncio.sleep(2)
    print("2 seconds passed, resuming")

asyncio.run(main())

This is especially useful when you have multiple asynchronous functions that should run concurrently without interfering with each other. If you use regular time.sleep() in async code, it will block the entire event loop, causing other coroutines to wait too.

Common Issues When Using time.sleep()

The time.sleep() function is simple, but misusing it can cause unexpected problems. It's important to understand how it affects program execution so you don't block important processes.

Blocking the Main Thread

The main feature of time.sleep() is that it blocks the thread where it was called. If you use it in the main thread of a GUI application (for example, Tkinter or PyQt), the interface will stop responding, creating a "frozen" effect. To avoid this, use time.sleep() only in separate threads or switch to asynchronous approaches like asyncio.sleep() for non-blocking delays. In GUI applications, it's better to use timers (QTimer, after, etc.), which call functions at intervals without blocking the interface.
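To illustrate the advice above—keep time.sleep() out of the main thread—here is a minimal sketch in which the pause runs in a worker thread while the main thread continues immediately:

```python
import threading
import time

results = []

def worker():
    time.sleep(0.5)  # blocks only this worker thread
    results.append("worker finished")

thread = threading.Thread(target=worker)
thread.start()

# The main thread is not blocked and can keep working meanwhile
results.append("main thread still responsive")

thread.join()  # wait for the worker at a point we choose
print(results)
```

Because the worker sleeps for half a second, the main thread's append runs first; the printed list shows "main thread still responsive" before "worker finished". The same pattern is what GUI frameworks effectively do with their timer APIs.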
Use in Multithreaded and Asynchronous Code

In multithreaded code, time.sleep() can be called independently in each thread. Note that it releases the Global Interpreter Lock (GIL) while sleeping, so other threads can run during one thread's pause; the exact interleaving still depends on OS-level thread scheduling.

In asynchronous code, time.sleep() should be used cautiously. If called inside an event loop (for example, within code run by asyncio.run()), it blocks the entire loop, defeating the benefits of async programming. Instead, use asyncio.sleep(), which hands control back to the scheduler, letting other coroutines run in the background.

Real-World Example of Using time.sleep()

Imagine you're writing a script to periodically poll an external API, which, according to its rules, must not be called more than once every 30 seconds. If requests are too frequent, the server may return errors or block your IP.

Solution using time.sleep():

import time

def poll_api():
    print("Making API request...")

def main():
    while True:
        poll_api()
        time.sleep(30)

if __name__ == "__main__":
    main()

Here, after each request, we pause for 30 seconds with time.sleep(). This ensures no more than two requests per minute, respecting the limits.

Async alternative:

import asyncio

async def poll_api():
    print("Making API request...")

async def main():
    while True:
        await poll_api()
        await asyncio.sleep(30)

if __name__ == "__main__":
    asyncio.run(main())

This version doesn't block the entire program, allowing other requests or tasks to run in the same async environment. It's more flexible and scalable.

Conclusion

Organizing pauses and delays is an important aspect of Python development. time.sleep() is the first and most obvious tool for this, but the choice between time.sleep(), asyncio.sleep(), and other methods should depend on your project's architecture. In single-threaded scripts and console utilities, time.sleep() is perfectly fine, but for multithreaded and asynchronous applications, other mechanisms are better.
Key recommendations:

Use time.sleep() for short delays in tests, pauses between requests, or interface demonstrations.
Don't block the main thread of GUI applications, to avoid a "frozen" interface.
In async code, replace time.sleep() with asyncio.sleep() to keep the event loop efficient.
In multithreaded programs, remember that only the current thread pauses; scheduling of the other threads is handled by the OS and the GIL.
In special cases, use threading.Event() or input() to wait for events or user actions.
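The last recommendation—using threading.Event() as an interruptible pause—can be sketched like this. Unlike time.sleep(), event.wait(timeout) returns early the moment another thread calls event.set(); the helper name interruptible_sleep is illustrative, not a standard function:

```python
import threading
import time

stop_event = threading.Event()

def interruptible_sleep(seconds):
    """Pause like time.sleep(), but wake immediately if stop_event is set."""
    woke_early = stop_event.wait(timeout=seconds)
    return woke_early  # True if interrupted, False if the full timeout elapsed

# Another thread cancels the pause after 0.1 s:
threading.Timer(0.1, stop_event.set).start()

start = time.perf_counter()
interrupted = interruptible_sleep(5.0)  # would be 5 s, but is cut short
elapsed = time.perf_counter() - start
print(interrupted, f"{elapsed:.2f} s")
```

This pattern is handy for worker loops that should sleep between iterations yet shut down promptly when the program exits.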
19 September 2025 · 8 min to read
Python

How to Delete Characters from a String in Python

When writing Python code, developers often need to modify string data. Common string modifications include:

Removing specific characters from a sequence
Replacing characters with others
Changing letter case
Joining substrings into a single sequence

In this guide, we will focus on the first transformation—deleting characters from a string in Python. It's important to note that strings in Python are immutable, meaning that any method or function that modifies a string will return a new string object with the changes applied.

Methods for Deleting Characters from a String

This section covers the main methods in Python used for deleting characters from a string. We will explore the following methods:

replace()
translate()
re.sub()

For each method, we will explain the syntax and provide practical examples.

replace()

The first Python method we will discuss is replace(). It is used to replace specific characters in a string with others. Since strings are immutable, replace() returns a new string object with the modifications applied.

Syntax:

original_string.replace(old, new[, count])

Where:

original_string – the string where modifications will take place
old – the substring to be replaced
new – the substring that will replace old
count (optional) – the number of occurrences to replace (if omitted, all occurrences will be replaced)

First, let's remove all spaces from the string "H o s t m a n":

example_str = "H o s t m a n"
result_str = example_str.replace(" ", "")
print(result_str)

Output:

Hostman

We can also use the replace() method to remove newline characters (\n):

example_str = "\nHostman\nVPS"
print(f'Original string: {example_str}')
result_str = example_str.replace("\n", " ")
print(f'String after adjustments: {result_str}')

Output:

Original string: 
Hostman
VPS
String after adjustments:  Hostman VPS

The replace() method has an optional third argument, which specifies the number of replacements to perform.
example_str = "Hostman VPS Hostman VPS Hostman VPS"
print(f'Original string: {example_str}')
result_str = example_str.replace("Hostman VPS", "", 2)
print(f'String after adjustments: {result_str}')

Output:

Original string: Hostman VPS Hostman VPS Hostman VPS
String after adjustments:   Hostman VPS

Here, only two occurrences of "Hostman VPS" were removed, while the third occurrence remained unchanged. We have now explored the replace() method and demonstrated its usage in different situations. Next, let's see how we can delete and modify characters in a string using translate().

translate()

The Python translate() method functions similarly to replace() but with additional flexibility. Instead of replacing characters one at a time, it allows mapping multiple characters using a dictionary or translation table. The method returns a new string object with the modifications applied.

Syntax:

original_string.translate(mapping_table)

In the first example, let's replace all occurrences of the $ symbol in a string with spaces:

example_str = "Hostman$Cloud$—$Cloud$Service$Provider."
print(f'Original string: {example_str}')
result_str = example_str.translate({ord('$'): ' '})
print(f'String after adjustments: {result_str}')

Output:

Original string: Hostman$Cloud$—$Cloud$Service$Provider.
String after adjustments: Hostman Cloud — Cloud Service Provider.

To improve code readability, we can define the mapping table before calling translate(). This is useful when dealing with multiple replacements:

example_str = "\nHostman%Cloud$—$Cloud$Service$Provider.\n"
print(f'Original string: {example_str}')

# Define the translation table
example_table = {ord('\n'): None, ord('$'): ' ', ord('%'): ' '}
result_str = example_str.translate(example_table)
print(f'String after adjustments: {result_str}')

Output:

Original string: 
Hostman%Cloud$—$Cloud$Service$Provider.

String after adjustments: Hostman Cloud — Cloud Service Provider.
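Instead of building the mapping with ord() by hand, the standard str.maketrans() helper can produce the same translation table; mapping a character to None deletes it:

```python
example_str = "\nHostman%Cloud$—$Cloud$Service$Provider.\n"

# $ and % become spaces; \n is mapped to None, i.e. deleted
table = str.maketrans({"$": " ", "%": " ", "\n": None})
result_str = example_str.translate(table)
print(result_str)  # Hostman Cloud — Cloud Service Provider.
```

This is equivalent to the ord()-based dictionary above, but maketrans() validates the mapping and reads more clearly when several characters are involved.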
re.sub()

In addition to replace() and translate(), we can use regular expressions for more advanced character removal and replacement. Python's built-in re module provides the sub() function, which searches for a pattern in a string and replaces it.

Syntax:

re.sub(pattern, replacement, original_string[, count=0, flags=0])

pattern – the regular expression pattern to match
replacement – the string or character that will replace the matched pattern
original_string – the string where modifications will take place
count (optional) – limits the number of replacements (default is 0, meaning replace all occurrences)
flags (optional) – modifies the behavior of the regex search

Let's remove all whitespace characters (\s) using the sub() function from the re module:

import re

example_str = "H o s t m a n"
print(f'Original string: {example_str}')
result_str = re.sub(r'\s', '', example_str)
print(f'String after adjustments: {result_str}')

Output:

Original string: H o s t m a n
String after adjustments: Hostman

Using Slices to Remove Characters

In addition to the methods above, Python also allows the use of slices. As we know, slices extract a sequence of characters from a string. To delete characters from a string by index in Python, we can use the following slice:

example_str = "\nHostman \nVPS"
print(f'Original string: {example_str}')
result_str = example_str[1:9] + example_str[10:]
print(f'String after adjustments: {result_str}')

In this example, we used slices to remove newline characters. The output of the code:

Original string: 
Hostman 
VPS
String after adjustments: Hostman VPS

Apart from using two slice parameters, you can also use a third one, which specifies the step size for index increments. For example, if we set the step to 2, the slice will keep every second character, removing the characters at odd indexes. Keep in mind that indexing starts at 0.
Example:

example_str = "Hostman Cloud"
print(f'Original string: {example_str}')
result_str = example_str[::2]
print(f'String after adjustments: {result_str}')

Output:

Original string: Hostman Cloud
String after adjustments: HsmnCod

Conclusion

In this guide, we learned how to delete characters from a string in Python using different methods, including regular expressions and slices. The choice of method depends on the specific task: the replace() method is suitable for simpler cases, while re.sub() is better for more complex situations.
23 August 2025 · 5 min to read
