Time Series Forecasting with ARIMA in Python 3
Hostman Team
Technical writer
Python
15.07.2024
Reading time: 23 min

Data analytics has long been an integral part of modern life. We encounter a massive flow of information daily, which we need to collect and interpret correctly. One method of data analysis is time series forecasting. A time series (also called a dynamic series) is a sequence of data points or observations taken at regular intervals. Examples of time series include monthly sales, daily temperatures, annual profits, and so on. Time series forecasting is a scientific field in which models are built to predict the future behavior of a process or phenomenon based on past observations recorded in the series.

In this guide, we will focus on using the ARIMA model, one of the most commonly applied approaches in time series analysis. We will thoroughly examine the process of using the ARIMA model in Python 3—from the initial stages of loading and processing data to the final stage of forecasting. We will also learn how to determine and interpret the parameters of the ARIMA model and how to evaluate its quality.

Whether you are new to data analysis or an experienced analyst, this guide aims to teach you how to apply the ARIMA model to time series forecasting. The goal is not just to apply it, but to do so effectively and in an automated manner, using the extensive functionality of Python.

Setting Up the Working Environment for Data Analysis in Python

Installing Python

First and foremost, you need to install Python itself—the programming language we will use for data analysis. You can download it from the official website, python.org, following the installation instructions provided there. After completing the installation, open the command line (on Windows) or terminal (on macOS/Linux) and enter:

python --version

If everything was done correctly, you will see the version number of the installed Python.

Setting Up the Development Environment

To work with Python, you can choose a development environment (IDE) that suits you. In this guide, we will work with Jupyter Notebook, which is very popular among data analysts. Other popular options include PyCharm, Visual Studio Code, and Spyder. To install Jupyter Notebook, enter the following in the command line:

pip install jupyter

Installing Necessary Python Packages

Python ships with a set of built-in modules that are extremely useful, but for more in-depth data analysis you will need additional tools. In this guide, we will use:

  • pandas (for working with tabular data)

  • numpy (for scientific computations)

  • matplotlib (for data visualization)

  • statsmodels (a library for statistical models).

You can install these libraries using the pip3 install command in the terminal or command line:

pip3 install pandas numpy matplotlib statsmodels

We will also need the libraries warnings (for managing warning messages) and itertools (for creating efficient looping constructs), which are included in the standard Python library, so you do not need to install them separately. To check the installed packages, use the command:

pip list

As a result, you will get a list of all installed modules and their versions.

Creating a Working Directory

Your working directory is the place on your computer where you will store all your Python scripts and project files. To create a new directory, open the terminal or command line and enter the following commands:

cd path_to_your_directory
mkdir Your_Project_Name
cd Your_Project_Name

Here, path_to_your_directory is the path to the location where the project folder will be created, and Your_Project_Name is the name of your project.

After successfully completing the above steps, you are ready to work on data analysis in Python. Your development environment is set up, your working directory is ready, and all necessary packages are installed.

Loading and Processing Data

Starting Jupyter Notebook

Let's start by launching Jupyter Notebook, our main tool for writing and testing Python code. In the command line (or terminal), navigate to your working directory and enter the following command:

jupyter notebook

A new tab with the Jupyter Notebook interface will open in your browser. To create a new document, select the "New" tab in the top right corner of the window and choose "Python 3" from the dropdown menu. You will be automatically redirected to a new tab where your notebook will be created.

Importing Libraries

The next step is to import the necessary Python libraries. Create a new cell in your notebook and insert the following code:

import warnings
import itertools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

To run the code, press Shift+Enter: this executes the current cell and moves the focus to the next one, where you can continue writing. These libraries are now available in your project, and you can use their functionality for various data analysis tasks.

Loading Data

For time series forecasting in Python, we will use the Airline Passengers dataset. This dataset represents the monthly tracking of the number of passengers on international airlines, expressed in thousands, from 1949 to 1960. You can find this data here. To load data from a CSV file via URL, use the pandas library:

url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"
time_series = pd.read_csv(url)

If your CSV file is stored locally on your computer, use the following line of code to load it:

time_series = pd.read_csv('airline_passengers.csv')

Now the data is saved in a DataFrame named time_series. A DataFrame is the primary data structure in the pandas library. It is a two-dimensional table where each row is a separate observation, and the columns are various features or variables of that observation. To verify that the data loaded correctly and confirm its format, you can display the first few rows of the dataset:

print(time_series.head())

This code will output the first five rows of the loaded dataset, allowing you to quickly check if they were loaded correctly and if they look as expected. By default, the head() method outputs five rows, but you can specify a different number in the method's parentheses to view another quantity.
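For example, to display the first ten rows:

print(time_series.head(10))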

You can also view the last rows of the DataFrame:

print(time_series.tail())

Data Processing

Before starting the analysis, the data needs to be preprocessed. Generally, this can involve many steps, but for our time series, we can limit ourselves to the following actions.

Checking for Missing Values

Handling missing values is an important step in time series preprocessing. Missing values can cause issues in the analysis and distort forecasting results. To check for missing values, you can use the isnull() method from the pandas library:

print(time_series.isnull().sum())

If 0 is shown for all columns, there are no missing values in the data. However, if missing values are found, they should be handled. There are various ways to do this, and the right approach depends on the nature of your data. For example, we can fill missing values with each column's mean (passing numeric_only=True so that non-numeric columns such as Month are skipped):

time_series = time_series.fillna(time_series.mean(numeric_only=True))

To replace missing values only in certain columns, use this command:

time_series['Column 1'] = time_series['Column 1'].fillna(time_series['Column 1'].mean())

Data Type Conversion 

Each column in a DataFrame has a specific data type. For time series, the datetime type, designed specifically for storing dates and times, is particularly important. By default, pandas reads CSV data as text, so even if a column contains dates, pandas will treat them as ordinary strings. In our case, we need to convert the Month column to datetime so that we can work with temporal data:

time_series['Month'] = pd.to_datetime(time_series['Month'])

Setting the Datetime Column as the Index

In pandas, each data row has its unique index (similar to a row number). However, sometimes it is more convenient to use a specific column from your data as an index. When working with time series, the most convenient choice for the index is the column containing the date or time. This allows for easy selection and analysis of data for specific time periods. In our case, we use the Month column as the index:

time_series.set_index('Month', inplace=True)
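To confirm that pandas now treats Month as a DatetimeIndex, you can inspect the index:

print(time_series.index)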

Rescaling Data

Another important step in data preprocessing is checking whether the data needs rescaling. If the range of your data is very large (e.g., values running from thousands to millions), you may need to transform the data. The airline passenger numbers stay within a fairly narrow range, so rescaling is not really needed here, but it is always worth checking the data range before moving on. Here is an example of standardizing data when the range is large:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
time_series[['Passengers']] = scaler.fit_transform(time_series[['Passengers']])

All the aforementioned steps are important in preparing data for further analysis. They help improve the quality of the time series and make the process of working with it simpler and more efficient. In this guide, we have covered only some data processing steps. But this stage can also include other actions, such as detecting and handling anomalies or outliers, creating new variables or features, and dividing the data into subgroups or categories.

Data Visualization

An important element when working with data is its visual representation. Using matplotlib, we can easily turn data into a visual chart, which helps us understand the structure of the time sequence. Visualization allows us to immediately see trends and seasonality in the data. A trend is the general direction of data movement over a long period. Seasonality is recurring data fluctuations in predictable time frames (week, month, quarter, year, etc.). Generally, a trend is associated with long-term data movement, while seasonality is associated with short-term, regular, and periodic changes.

For example, if you see that the number of passengers grows every year, this indicates an upward trend. If the number of passengers grows in the same months every year, this indicates annual seasonality.

To draw a chart, use the following lines of code:

plt.figure(figsize=(15,8))
plt.plot(time_series['Passengers'])
plt.title('Time Series Plot')
plt.xlabel('Date')
plt.ylabel('Passengers')
plt.show()

For our data, we get the following plot:

[Plot: monthly airline passenger numbers, 1949-1960]
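The plot already hints at both an upward trend and yearly seasonality. To separate these components explicitly, you can use the seasonal decomposition utility from statsmodels. Here is a minimal sketch for our monthly data; the multiplicative model is chosen because the seasonal swings grow along with the trend:

from statsmodels.tsa.seasonal import seasonal_decompose

# period=12 because the data is monthly with yearly seasonality
decomposition = seasonal_decompose(time_series['Passengers'], model='multiplicative', period=12)
decomposition.plot()  # draws the observed, trend, seasonal, and residual components
plt.show()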

Stationarity of Data

In time series analysis, it is crucial to pay attention to the concept of stationarity. Stationarity in a time series means that the series' characteristics (such as mean and variance) remain constant over time. Non-stationary series often lead to errors in further predictions.

The ARIMA model can adapt the series to stationarity "on its own" through a special model parameter (d). However, understanding whether your initial time series is stationary helps you better understand ARIMA's workings.

There are several methods to check for stationarity in a series:

  1. Visual Analysis: Start by plotting the data and observe the following aspects:

    • Mean: Fluctuations in this indicator over time may signal that the time series is not stationary.

    • Variance: If the variance changes over time, this also indicates non-stationarity.

    • Trend: A visible trend on the graph is another indication of non-stationarity.

    • Seasonality: Seasonal fluctuations on the graph can also suggest non-stationarity.

  2. Statistical Analysis: Perform statistical tests like the Dickey-Fuller test. This method provides a quantitative assessment of a time series' stationarity. The null hypothesis of the test assumes that the time series is non-stationary. If the p-value is less than the significance level of 0.05, the null hypothesis is rejected, and the series can be considered stationary.

Running the Dickey-Fuller test on our data might look like this:

from statsmodels.tsa.stattools import adfuller

print('Test result:')
df_result = adfuller(time_series['Passengers'])
df_labels = ['ADF Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used']
for result_value, label in zip(df_result, df_labels):
    print(label + ' : ' + str(result_value))

if df_result[1] <= 0.05:
    print("Strong evidence against the null hypothesis, the series is stationary.")
else:
    print("Weak evidence against the null hypothesis, the series is not stationary.")

Our time series is not stationary, as the test output below shows. In the following sections, we will automatically search for the ARIMA parameters needed to make the series stationary.

Test result:
ADF Test Statistic : 0.8153688792060482
p-value : 0.991880243437641
#Lags Used : 13
Number of Observations Used : 130
Weak evidence against the null hypothesis, the series is not stationary.

Even though we don't need to manually make the series stationary, it's useful to know which methods can be used to do so. There are many methods, including:

  • Differencing: One of the most common methods, differencing involves calculating the difference between consecutive observations in the time series.

  • Seasonal Differencing: A variation of regular differencing, applied to data with a seasonal component.

  • Log Transformation: Taking the logarithm of the data can help reduce variability in the series and make it more stationary.

Some time series may be particularly complex and require combining transformation methods. After transforming the series, you should recheck for stationarity using the Dickey-Fuller test to ensure the transformation was successful.
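For reference, here is a minimal sketch of what these transformations might look like with pandas and numpy, using the time_series DataFrame loaded earlier:

from statsmodels.tsa.stattools import adfuller

# First-order differencing: subtract the previous observation
differenced = time_series['Passengers'].diff().dropna()

# Seasonal differencing: subtract the value from the same month one year earlier
seasonal_diff = time_series['Passengers'].diff(12).dropna()

# Log transformation: helps stabilize growing variance
log_transformed = np.log(time_series['Passengers'])

# Recheck stationarity after transforming, e.g. the p-value for the differenced series
print(adfuller(differenced)[1])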

ARIMA Model

ARIMA (AutoRegressive Integrated Moving Average) is a statistical model used for analyzing and forecasting time series data.

  • AutoRegressive (AR): Uses the dependency between an observation and a number of lagged observations (e.g., predicting tomorrow's weather based on previous days' weather).

  • Integrated (I): Involves differencing the time series data to make it stationary.

  • Moving Average (MA): Models the error between the actual observation and the predicted value using a combination of past errors.

The ARIMA model is usually denoted as ARIMA(p, d, q), where p, d, and q are the model parameters:

  • p: The order of the autoregressive part (number of lagged observations included).

  • d: The degree of differencing (number of times the data is differenced to achieve stationarity).

  • q: The order of the moving average part (number of lagged forecast errors included).

Choosing the appropriate values for (p, d, q) involves analyzing autocorrelation and partial autocorrelation plots and applying information criteria.
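As an illustration, here is a minimal sketch of how you might inspect these plots for our series. The series is differenced once first (as the d parameter would do) so that the trend does not dominate the picture:

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

differenced = time_series['Passengers'].diff().dropna()
fig, axes = plt.subplots(2, 1, figsize=(12, 8))
plot_acf(differenced, lags=40, ax=axes[0])   # the lag where the ACF cuts off hints at q
plot_pacf(differenced, lags=40, ax=axes[1])  # the lag where the PACF cuts off hints at p
plt.show()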

Seasonal ARIMA Model

Seasonal ARIMA (SARIMA) extends ARIMA to account for seasonality in time series data. In many cases, time series exhibit clear seasonal patterns, such as ice cream sales increasing in summer and decreasing in winter. SARIMA captures these seasonal patterns.

SARIMA is typically denoted as SARIMA(p, d, q)(P, D, Q)m, where p, d, q are non-seasonal parameters, and P, D, Q are seasonal parameters:

  • p, d, q: The same as in ARIMA.

  • P: The order of seasonal autoregression (number of lagged seasons affecting the current season).

  • D: The degree of seasonal differencing (number of times seasonal trends are differenced).

  • Q: The order of seasonal moving average (number of lagged seasonal forecast errors included).

  • m: The length of the seasonal period (e.g., 12 for monthly data with yearly seasonality).

Like ARIMA, SARIMA is suitable for forecasting time series data but with the added capability of capturing and modeling seasonal patterns.

Although ARIMA, particularly seasonal ARIMA, may seem complex due to the need to carefully select numerous parameters, automating this process can simplify the task.

Defining Model Parameters

The first step in configuring an ARIMA model is determining the optimal parameter values for our specific dataset.

To tune the ARIMA parameters, we will use "grid search." The essence of this method is that it goes through all possible parameter combinations from a predefined grid of values and trains the model on each combination. After training the model on each combination, the model with the best performance is selected.

The more different parameter values, the more combinations need to be checked, and the longer the process will take. For our case, we will use only two possible values (0 and 1) for each parameter, resulting in a total of 8 combinations for the ARIMA parameters and 8 for the seasonal part (with a seasonal period length = 12). Thus, the total number of combinations to check is 64, leading to a relatively quick execution.

It's important to remember that the goal is to find a balance between the time spent on the grid search and the quality of the final model, meaning finding parameter values that yield the highest quality while minimizing time costs.

Importing Necessary Packages

from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

Statsmodels provides us with methods for building ARIMA models, and itertools (which we imported earlier) is used to create combinations of possible parameter values.

Ignoring Warnings

When working with large datasets and complex computations like statistical analysis or machine learning, libraries and functions may generate warnings about potential issues or non-optimality. However, these warnings are often insignificant or irrelevant to your specific case. Therefore, we set the warnings filter to ignore:

warnings.filterwarnings("ignore")

Creating a Range of Parameters for Model Tuning

To determine the model parameters, we'll define the function search_optimal_sarima.

def search_optimal_sarima(time_series, seasonal_cycle):
    order_vals = diff_vals = ma_vals = range(0, 2)
    pdq_combinations = list(itertools.product(order_vals, diff_vals, ma_vals))
    seasonal_combinations = [(combo[0], combo[1], combo[2], seasonal_cycle) for combo in pdq_combinations]
       
    smallest_aic = float("inf")
    optimal_order_param = optimal_seasonal_param = None

    for order_param in pdq_combinations:
        for seasonal_param in seasonal_combinations:
            try:
                sarima_model = sm.tsa.statespace.SARIMAX(time_series,
                                                         order=order_param,
                                                         seasonal_order=seasonal_param,
                                                         enforce_stationarity=False,
                                                         enforce_invertibility=False)

                model_results = sarima_model.fit()
                if model_results.aic < smallest_aic:
                    smallest_aic = model_results.aic
                    optimal_order_param = order_param
                    optimal_seasonal_param = seasonal_param
            except Exception:
                continue

    print('ARIMA{}x{} - AIC:{}'.format(optimal_order_param, optimal_seasonal_param, smallest_aic))

seasonal_cycle_length = 12
search_optimal_sarima(time_series, seasonal_cycle_length)

The first three lines of code in our function create the parameter ranges. As we already know, the ARIMA model has three main parameters: p, d, and q. In the code above, each of them is assigned range(0, 2), meaning it can take the value 0 or 1. The itertools.product() function generates all possible combinations of these three parameters, such as (0, 0, 0), (0, 0, 1), (0, 1, 1), and so on.

Then we create additional combinations by adding the seasonal period to each of the pdq combinations. This allows the model to account for seasonal influences on the time series.
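To see exactly what itertools.product() produces here, you can run this small check on its own:

import itertools
print(list(itertools.product(range(0, 2), repeat=3)))
# [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]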

Finding the Best Parameters for the Model

Now we need to apply the parameters we determined earlier to automatically tune ARIMA models. When working with forecasting models, our task is to choose the model that best explains and predicts the data. However, selecting the best model is not always straightforward. The Akaike Information Criterion (AIC) helps us compare different models and determine which one is better. AIC helps evaluate how well the model fits the data, considering its complexity. So, the goal is to find the model with the lowest AIC value.
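In its simplest form, AIC = 2k - 2ln(L), where k is the number of estimated parameters and L is the maximized value of the model's likelihood function. The 2k term penalizes complexity: adding parameters always improves the likelihood, but it also raises the criterion, so the lowest AIC marks the best trade-off between fit and simplicity.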

The code above iterates through all possible parameter combinations and uses the SARIMAX function to build the seasonal ARIMA model. The order parameter sets the main parameters (p, d, q), and seasonal_order sets the seasonal parameters of the model (P, D, Q, m).

For our data, we get the following result:

ARIMA(0, 1, 1)x(1, 1, 1, 12) - AIC:920.3192974989254

Building and Evaluating the SARIMAX Model

Once we have found the optimal parameters using grid search, we can use these parameters to train the SARIMAX model on our time series data. This helps us understand how well the model fits our data and provides an opportunity to adjust the model’s parameters if necessary.

First, we define the SARIMAX model with the previously found parameters:

from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(time_series, order=(0, 1, 1), seasonal_order=(1, 1, 1, 12))

Next, fit the model:

results = model.fit()

Print the model summary:

print(results.summary())

The model summary is widely used to assess the quality of the parameter fit. Key aspects to pay attention to include:

  • Coefficients: They should be statistically significant. Check the p-values of the coefficients (P>|z|); they should be less than 0.05.

  • AIC (Akaike Information Criterion): A lower AIC value indicates a better model fit.

  • Ljung-Box (L1) (Q): The Ljung-Box Q-statistic at lag 1; check its accompanying p-value, Prob(Q). If the p-value is greater than 0.05, the residuals are random (white noise), which is good.

  • Jarque-Bera (JB): This is a test for the normality of residuals. If Prob(JB) is greater than 0.05, the residuals are normally distributed, which is good.

  • Heteroskedasticity (H): This is a test for heteroskedasticity in the residuals. If Prob(H) (two-sided) is greater than 0.05, the residuals are homoscedastic, which is good. Heteroskedasticity occurs when the variance of your forecast errors changes depending on the time period, which means there is a non-uniformity in your data.

Ideally, your model should have statistically significant coefficients, a low AIC value, and residuals that are normally distributed and homoscedastic. Meeting these criteria indicates a good model.

For our model, we obtained the following output:

[Output: SARIMAX model summary table]
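If you prefer to check the randomness of the residuals directly rather than reading it off the summary table, statsmodels also exposes the Ljung-Box test as a function. A minimal sketch, assuming the fitted results object from above:

from statsmodels.stats.diagnostic import acorr_ljungbox

print(acorr_ljungbox(results.resid, lags=[12]))  # lb_pvalue > 0.05 suggests random residuals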

Plot the model diagnostics:

results.plot_diagnostics(figsize=(12, 8))
plt.show()

This command generates four diagnostic plots:

  • Residuals Plot: A plot of model residuals over time. If the model is good, the residuals should be random, and the plot should look like white noise.

  • Q-Q Plot: A plot comparing the distribution of residuals to a standard normal distribution. If the points follow the diagonal line, it indicates that the residuals are normally distributed.

  • ACF Plot (Correlogram): A plot of the autocorrelation of residuals. If the model is good, the residuals should not be correlated with each other, which is indicated by the autocorrelation bars staying inside the shaded confidence band.

  • Histogram of Residuals: A histogram of the distribution of residuals. If the model is good, the residuals should be normally distributed, and the histogram should resemble a bell curve.

These plots, along with the model summary, help us check how well the model fits our data and whether it was correctly specified. If the model is incorrect or unsuitable for the data, it may provide inaccurate forecasts, which could negatively impact decisions made based on these forecasts.

Our diagnostic plots look as follows:

[Output: four diagnostic plots of the model residuals]

The model we selected generally meets the requirements, but there is still potential for improving the parameters of the seasonal ARIMA model. Applying SARIMA to time series data often requires a careful approach, and it is always beneficial to conduct a thorough data analysis and spend more time on data preprocessing and exploratory analysis before applying time series models.

Static and Dynamic Forecasting

After successfully training the model, the next step is to generate forecasts and compare the predicted values with the actual data.

Static Forecasting

First, we generate forecasted values using the model, starting from a specific date and extending to the end of the dataset. The get_prediction method returns a prediction object from which we can extract forecasted values using predicted_mean:

st_pred = results.get_prediction(start=pd.to_datetime('1955-12-01'), dynamic=False)
forecast_values = st_pred.predicted_mean

Here, December 1955 is used as an example starting date, but you can adjust this date according to your needs.

Now we have the forecasted values that we can compare with the actual time series data. We will use the Mean Squared Error (MSE) as our metric for evaluating the accuracy of the forecast:

actual_values = time_series['1955-12-01':]['Passengers']
forecast_mse = ((forecast_values - actual_values) ** 2).mean()
print('Mean Squared Error of the forecast is {}'.format(round(forecast_mse, 2)))

MSE is a widely accepted metric for evaluating the performance of forecasting models. A lower MSE indicates a more accurate model. Of course, there is no perfect model, and there will always be some deviation between forecasts and actual data. In our case, the Mean Squared Error of the forecast is 170.37.
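Because MSE is measured in squared units, it is often easier to interpret its square root (RMSE), which is on the original scale of the data. For our forecast this is roughly 13, i.e., the predictions deviate from the actual values by about 13 thousand passengers on average:

rmse = np.sqrt(forecast_mse)
print('Root Mean Squared Error of the forecast is {}'.format(round(rmse, 2)))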

Finally, we visualize the results to visually assess the accuracy of our forecasts compared to the actual data:

plt.figure(figsize=(15,8))

plt.plot(actual_values.index, actual_values, label='Actual Values', color='blue')
plt.plot(forecast_values.index, forecast_values, label='Forecasted Values', color='red')

plt.title('Actual and Forecasted Values')
plt.xlabel('Date')
plt.ylabel('Passengers')
plt.legend()

plt.show()

This code generates a plot showing the actual and forecasted passenger numbers over time. The red line represents the forecasted values, while the blue line shows the actual data.

[Plot: actual (blue) and forecasted (red) passenger numbers]

This visualization helps you understand how well the model predicts the data.

Dynamic Forecasting

Dynamic forecasting generally provides a more realistic view of future time series behavior because it incorporates forecasts into future predictions.

In static forecasting, the model uses the entire known dataset to forecast each subsequent value. Dynamic forecasting, however, uses the most recent forecasted values for future predictions, starting from a user-defined start date.

To perform dynamic forecasting, set the dynamic parameter to True:

dyn_pred = results.get_prediction(start=pd.to_datetime('1955-12-01'), dynamic=True)
dynamic_forecast_values = dyn_pred.predicted_mean

You can also calculate the Mean Squared Error for the dynamic forecast:

mse_dynamic_forecast = ((dynamic_forecast_values - actual_values) ** 2).mean()
print('Mean Squared Error of the dynamic forecast is {}'.format(round(mse_dynamic_forecast, 2)))

And plot the actual and dynamically forecasted values:

plt.figure(figsize=(15,8))

plt.plot(actual_values.index, actual_values, label='Actual Values', color='blue')
plt.plot(dynamic_forecast_values.index, dynamic_forecast_values, label='Dynamic Forecast', color='green')

plt.title('Actual and Dynamically Forecasted Values')
plt.xlabel('Date')
plt.ylabel('Passengers')
plt.legend()

plt.show()

[Plot: actual (blue) and dynamically forecasted (green) passenger numbers]

After performing static and dynamic forecasts, we can evaluate whether our time series model is successful. The next step is to attempt to predict future data in this time series.

Creating and Visualizing Forecasts

Now we can finally use the ARIMA model in Python to forecast future values.

To perform forecasting for a certain number of steps ahead, you can use the get_forecast method from the results model:

pred_future = results.get_forecast(steps=12)

We use the trained model (results) to get forecasts for the next 12 periods. Since our data includes information up to December 1960, we will generate predictions for the number of passengers each month for the year 1961.

We will print the forecasted mean values and confidence intervals:

print(f'Forecasted mean values:\n\n{pred_future.predicted_mean}')
print(f'\nConfidence intervals:\n\n{pred_future.conf_int()}')

We can also visualize our forecast:

plt.figure(figsize=(15, 8))
plt.plot(pred_future.predicted_mean, label='Forecasted Mean Values')
plt.fill_between(pred_future.conf_int().index,
                 pred_future.conf_int().iloc[:, 0],
                 pred_future.conf_int().iloc[:, 1], color='k', alpha=.2)
plt.legend()  
plt.show()

[Plot: forecasted mean values for 1961 with shaded confidence interval]

This visualization is very useful for understanding what the model predicts. The forecasted mean values show the expected number of passengers each month in 1961, and the shaded area around the forecast represents the confidence interval.
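It can also be helpful to plot the forecast together with the historical series to see it in context. A minimal sketch based on the objects created above:

plt.figure(figsize=(15, 8))
plt.plot(time_series['Passengers'], label='Historical Data', color='blue')
plt.plot(pred_future.predicted_mean, label='Forecasted Mean Values', color='red')
plt.fill_between(pred_future.conf_int().index,
                 pred_future.conf_int().iloc[:, 0],
                 pred_future.conf_int().iloc[:, 1], color='k', alpha=.2)
plt.title('Historical Data and 12-Month Forecast')
plt.xlabel('Date')
plt.ylabel('Passengers')
plt.legend()
plt.show()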

Conclusion

In this tutorial, we discussed how to apply the ARIMA model for time series forecasting using Python. We covered the entire process from data loading and preprocessing to finding optimal parameters for the model, evaluating it, and ultimately forecasting future values.

Using ARIMA helps us understand the application of more advanced forecasting techniques. It is important to remember that the ARIMA model might not work for all time series, and the results will depend on the quality of your initial data and the preprocessing performed.

Now you can automate the forecasting of time series data using the ARIMA model and the Python programming language. We encourage you to practice and revisit this tutorial with different datasets to enhance your skills.

On our app platform you can find Python applications, such as Celery, Django, FastAPI and Flask. 
