Data analytics has long been an integral part of modern life. We encounter a massive flow of information daily, which we need to collect and interpret correctly. One method of data analysis is time series forecasting. A time series (also called a dynamic series) is a sequence of data points or observations taken at regular intervals. Examples of time series include monthly sales, daily temperatures, annual profits, and so on. Time series forecasting is a scientific field in which models are built to predict the future behavior of a process or phenomenon based on past observations recorded in the series.
In this guide, we will focus on using the ARIMA model, one of the most commonly applied approaches in time series analysis. We will thoroughly examine the process of using the ARIMA model in Python 3—from the initial stages of loading and processing data to the final stage of forecasting. We will also learn how to determine and interpret the parameters of the ARIMA model and how to evaluate its quality.
Whether you are new to data analysis or an experienced analyst, this guide aims to teach you how to apply the ARIMA model to time series forecasting, and to do so not just mechanically but effectively and in an automated manner, using the extensive functionality of Python.
First and foremost, you need to install Python itself—the programming language we will use for data analysis. You can download it from the official website, python.org, following the installation instructions provided there. After completing the installation, open the command line (on Windows) or terminal (on macOS/Linux) and enter:
python --version
If everything was done correctly, you will see the version number of the installed Python.
To work with Python, you can choose a development environment (IDE) that suits you. In this guide, we will work with Jupyter Notebook, which is very popular among data analysts. Other popular options include PyCharm, Visual Studio Code, and Spyder. To install Jupyter Notebook, enter the following in the command line:
pip install jupyter
Along with Python, some basic libraries are always installed. These are extremely useful tools, but you may need additional tools for more in-depth data analysis. In this guide, we will use:
pandas (for working with tabular data)
numpy (for scientific computations)
matplotlib (for data visualization)
statsmodels (a library for statistical models)
You can install these libraries using the pip3 install command in the terminal or command line:
pip3 install pandas numpy matplotlib statsmodels
We will also need the warnings library (for managing warning messages) and itertools (for creating efficient looping constructs); both are included in the standard Python library, so you do not need to install them separately. To check the installed packages, use the command:
pip list
As a result, you will get a list of all installed modules and their versions.
Your working directory is the place on your computer where you will store all your Python scripts and project files. To create a new directory, open the terminal or command line and enter the following commands:
cd path_to_your_directory
mkdir Your_Project_Name
cd Your_Project_Name
Here, path_to_your_directory is the path to the location where the project folder will be created, and Your_Project_Name is the name of your project.
After successfully completing the above steps, you are ready to work on data analysis in Python. Your development environment is set up, your working directory is ready, and all necessary packages are installed.
Let's start by launching Jupyter Notebook, our main tool for writing and testing Python code. In the command line (or terminal), navigate to your working directory and enter the following command:
jupyter notebook
A new tab with the Jupyter Notebook interface will open in your browser. To create a new document, select the "New" tab in the top right corner of the window and choose "Python 3" from the dropdown menu. You will be automatically redirected to a new tab where your notebook will be created.
The next step is to import the necessary Python libraries. Create a new cell in your notebook and insert the following code:
import warnings
import itertools
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
To run the code, press Shift+Enter. This executes the current cell and moves the focus to the next one, where you can continue writing. These libraries are now available in your project, and you can use their functionality for various data analysis tasks.
For time series forecasting in Python, we will use the Airline Passengers dataset, which tracks the monthly number of passengers on international airlines, in thousands, from 1949 to 1960. To load the data from a CSV file via URL, use the pandas library:
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv"
time_series = pd.read_csv(url)
If your CSV file is stored locally on your computer, use the following line of code to load it:
time_series = pd.read_csv('airline_passengers.csv')
Now the data is stored in a DataFrame named time_series. A DataFrame is the primary data structure of the pandas library: a two-dimensional table in which each row is a separate observation and each column is a feature or variable of that observation. To verify that the data loaded correctly and is in the expected format, display the first few rows of the dataset:
print(time_series.head())
This code will output the first five rows of the loaded dataset, allowing you to quickly check that they were loaded correctly and look as expected. By default, the head() method outputs five rows, but you can pass a different number in the parentheses to view another quantity.
You can also view the last rows of the DataFrame:
print(time_series.tail())
Before starting the analysis, the data needs to be preprocessed. Generally, this can involve many steps, but for our time series, we can limit ourselves to the following actions.
Handling missing values is an important step in preprocessing a dynamic series. Missing values can cause issues in the analysis and distort forecasting results. To check for missing values, you can use the isnull() method from the pandas library:
print(time_series.isnull().sum())
If 0 is indicated for all columns, this means there are no missing values in the data. However, if missing values are found during execution, they should be handled. There are various ways to handle missing values, and the approach will depend on the nature of your data. For example, we can fill in missing values with the column's mean value:
time_series = time_series.fillna(time_series.mean())
To replace missing values only in certain columns, use this command:
time_series['Column 1'] = time_series['Column 1'].fillna(time_series['Column 1'].mean())
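Note that for time series, filling with the overall mean can blur the temporal structure. Linear interpolation, which estimates a missing point from its neighbors in time, is often a better fit. A minimal sketch, applied to the Passengers column:

# Linear interpolation estimates each missing value from its neighboring
# observations, preserving the shape of the series better than the mean.
time_series['Passengers'] = time_series['Passengers'].interpolate(method='linear')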
Each column in a DataFrame has a specific data type. For dynamic series, the datetime type, designed specifically for storing dates and times, is particularly important. By default, pandas reads such information as text: even if a column contains dates, pandas will treat them as ordinary strings. In our case, we need to convert the Month column to datetime so that we can work with temporal data:
time_series['Month'] = pd.to_datetime(time_series['Month'])
In pandas, each data row has its own unique index (similar to a row number). However, it is sometimes more convenient to use a specific column of your data as the index. When working with time series, the most convenient choice is the column containing the date or time, as it allows easy selection and analysis of data for specific periods. In our case, we use the Month column as the index:
time_series.set_index('Month', inplace=True)
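As a side note, the datetime conversion and index assignment can also be combined into a single step when loading the file. A sketch using the url variable defined earlier:

# Parse the Month column as dates and use it as the index in one call.
time_series = pd.read_csv(url, parse_dates=['Month'], index_col='Month')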
Another important step in data preprocessing is checking whether the data needs rescaling. If the range of your data is too large (e.g., the Passengers value ranges from thousands to millions), you may need to transform the data. The airline passenger data looks quite uniform, so such rescaling may not be necessary here, but it is always important to check the data range before proceeding. Here is an example of data standardization for when the range is large:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
time_series[['Passengers']] = scaler.fit_transform(time_series[['Passengers']])
All the aforementioned steps are important in preparing data for further analysis. They help improve the quality of the time series and make the process of working with it simpler and more efficient. In this guide, we have covered only some data processing steps. But this stage can also include other actions, such as detecting and handling anomalies or outliers, creating new variables or features, and dividing the data into subgroups or categories.
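As one example of such an extra step, here is a minimal sketch of an outlier check based on the interquartile range (IQR); the 1.5×IQR rule used below is a common convention, not a requirement:

# Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] as outlier candidates.
q1 = time_series['Passengers'].quantile(0.25)
q3 = time_series['Passengers'].quantile(0.75)
iqr = q3 - q1
outliers = time_series[(time_series['Passengers'] < q1 - 1.5 * iqr) |
                       (time_series['Passengers'] > q3 + 1.5 * iqr)]
print(outliers)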
An important element of working with data is its visual representation. Using matplotlib, we can easily turn data into a chart, which helps us understand the structure of the time sequence. Visualization allows us to immediately see trends and seasonality in the data. A trend is the general direction of data movement over a long period. Seasonality is recurring data fluctuations in predictable time frames (week, month, quarter, year, etc.). Generally, a trend is associated with long-term data movement, while seasonality is associated with short-term, regular, and periodic changes.
For example, if you see that the number of passengers grows every year, this indicates an upward trend. If the number of passengers grows in the same months every year, this indicates annual seasonality.
To draw a chart, use the following lines of code:
plt.figure(figsize=(15,8))
plt.plot(time_series['Passengers'])
plt.title('Time Series Plot')
plt.xlabel('Date')
plt.ylabel('Passengers')
plt.show()
For our data, we get the following plot:
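To make the trend and seasonal components explicit, you can also decompose the series. A minimal sketch using the seasonal_decompose function from statsmodels; model='multiplicative' is an assumption that suits this dataset, since the seasonal swings grow with the overall level of the series ('additive' is the alternative):

from statsmodels.tsa.seasonal import seasonal_decompose

# Split the series into trend, seasonal, and residual components.
# period=12 reflects the monthly data with yearly seasonality.
decomposition = seasonal_decompose(time_series['Passengers'], model='multiplicative', period=12)
decomposition.plot()
plt.show()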
In time series analysis, it is crucial to pay attention to the concept of stationarity. Stationarity in a time series means that the series' characteristics (such as mean and variance) remain constant over time. Non-stationary series often lead to errors in further predictions.
The ARIMA model can adapt the series to stationarity "on its own" through a special model parameter (d). However, understanding whether your initial time series is stationary helps you better understand ARIMA's workings.
There are several methods to check for stationarity in a series:
Visual Analysis: Start by plotting the data and observe the following aspects:
Mean: Fluctuations in this indicator over time may signal that the time series is not stationary.
Variance: If the variance changes over time, this also indicates non-stationarity.
Trend: A visible trend on the graph is another indication of non-stationarity.
Seasonality: Seasonal fluctuations on the graph can also suggest non-stationarity.
Statistical Analysis: Perform statistical tests like the Dickey-Fuller test. This method provides a quantitative assessment of a time series' stationarity. The null hypothesis of the test assumes that the time series is non-stationary. If the p-value is less than the significance level of 0.05, the null hypothesis is rejected, and the series can be considered stationary.
Running the Dickey-Fuller test on our data might look like this:
from statsmodels.tsa.stattools import adfuller

print('Test result:')
df_result = adfuller(time_series['Passengers'])
df_labels = ['ADF Test Statistic', 'p-value', '#Lags Used', 'Number of Observations Used']

for result_value, label in zip(df_result, df_labels):
    print(label + ' : ' + str(result_value))

if df_result[1] <= 0.05:
    print("Strong evidence against the null hypothesis, the series is stationary.")
else:
    print("Weak evidence against the null hypothesis, the series is not stationary.")
Test result:
ADF Test Statistic : 0.8153688792060482
p-value : 0.991880243437641
#Lags Used : 13
Number of Observations Used : 130
Weak evidence against the null hypothesis, the series is not stationary.
Our time series is not stationary. However, in the following sections, we will automatically search for the ARIMA model parameters, including the ones needed to make the series stationary.
Even though we don't need to manually make the series stationary, it's useful to know which methods can be used to do so. There are many methods, including:
Differencing: One of the most common methods, differencing involves calculating the difference between consecutive observations in the time series.
Seasonal Differencing: A variation of regular differencing, applied to data with a seasonal component.
Log Transformation: Taking the logarithm of the data can help reduce variability in the series and make it more stationary.
Some time series may be particularly complex and require combining transformation methods. After transforming the series, you should recheck for stationarity using the Dickey-Fuller test to ensure the transformation was successful.
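For illustration, here is a minimal sketch of how differencing and a log transformation could be applied and then rechecked with the Dickey-Fuller test; it reuses the time_series DataFrame from earlier:

from statsmodels.tsa.stattools import adfuller

# First-order differencing: subtract each observation from the next one.
differenced = time_series['Passengers'].diff().dropna()

# A log transform followed by differencing often stabilizes both
# the variance and the mean of a series like this one.
log_differenced = np.log(time_series['Passengers']).diff().dropna()

# Recheck stationarity after each transformation.
print('p-value after differencing:', adfuller(differenced)[1])
print('p-value after log transform + differencing:', adfuller(log_differenced)[1])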
ARIMA (AutoRegressive Integrated Moving Average) is a statistical model used for analyzing and forecasting time series data.
AutoRegressive (AR): Uses the dependency between an observation and a number of lagged observations (e.g., predicting tomorrow's weather based on previous days' weather).
Integrated (I): Involves differencing the time series data to make it stationary.
Moving Average (MA): Models the error between the actual observation and the predicted value using a combination of past errors.
The ARIMA model is usually denoted as ARIMA(p, d, q), where p, d, and q are the model parameters:
p: The order of the autoregressive part (the number of lagged observations included).
d: The degree of differencing (the number of times the data is differenced to achieve stationarity).
q: The order of the moving average part (the number of lagged forecast errors included).
Choosing the appropriate values for (p, d, q) involves analyzing autocorrelation and partial autocorrelation plots and applying information criteria.
Seasonal ARIMA (SARIMA) extends ARIMA to account for seasonality in time series data. In many cases, time series exhibit clear seasonal patterns, such as ice cream sales increasing in summer and decreasing in winter. SARIMA captures these seasonal patterns.
SARIMA is typically denoted as SARIMA(p, d, q)(P, D, Q)m, where p, d, q are the non-seasonal parameters and P, D, Q are the seasonal parameters:
p, d, q: The same as in ARIMA.
P: The order of seasonal autoregression (the number of lagged seasons affecting the current season).
D: The degree of seasonal differencing (the number of times seasonal trends are differenced).
Q: The order of the seasonal moving average (the number of lagged seasonal forecast errors included).
m: The length of the seasonal period (e.g., 12 for monthly data with yearly seasonality).
Like ARIMA, SARIMA is suitable for forecasting time series data but with the added capability of capturing and modeling seasonal patterns.
Although ARIMA, particularly seasonal ARIMA, may seem complex due to the need to carefully select numerous parameters, automating this process can simplify the task.
The first step in configuring an ARIMA model is determining the optimal parameter values for our specific dataset.
To tune the ARIMA parameters, we will use "grid search." The essence of this method is that it goes through all possible parameter combinations from a predefined grid of values and trains the model on each combination. After training the model on each combination, the model with the best performance is selected.
The more different parameter values, the more combinations need to be checked, and the longer the process will take. For our case, we will use only two possible values (0 and 1) for each parameter, resulting in a total of 8 combinations for the ARIMA parameters and 8 for the seasonal part (with a seasonal period length = 12). Thus, the total number of combinations to check is 64, leading to a relatively quick execution.
It's important to remember that the goal is to find a balance between the time spent on the grid search and the quality of the final model, meaning finding parameter values that yield the highest quality while minimizing time costs.
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
Statsmodels provides us with the methods for building ARIMA models, and itertools (which we imported earlier) is used to create combinations of possible parameter values.
When working with large datasets and complex computations such as statistical analysis or machine learning, libraries and functions may generate warnings about potential issues or suboptimal settings. These warnings are often insignificant or irrelevant to your specific case, so we set the warnings filter to ignore:
warnings.filterwarnings("ignore")
To determine the model parameters, we'll define the function search_optimal_sarima:
def search_optimal_sarima(time_series, seasonal_cycle):
    order_vals = diff_vals = ma_vals = range(0, 2)
    pdq_combinations = list(itertools.product(order_vals, diff_vals, ma_vals))
    seasonal_combinations = [(combo[0], combo[1], combo[2], seasonal_cycle) for combo in pdq_combinations]

    smallest_aic = float("inf")
    optimal_order_param = optimal_seasonal_param = None

    for order_param in pdq_combinations:
        for seasonal_param in seasonal_combinations:
            try:
                sarima_model = sm.tsa.statespace.SARIMAX(time_series,
                                                         order=order_param,
                                                         seasonal_order=seasonal_param,
                                                         enforce_stationarity=False,
                                                         enforce_invertibility=False)
                model_results = sarima_model.fit()
                if model_results.aic < smallest_aic:
                    smallest_aic = model_results.aic
                    optimal_order_param = order_param
                    optimal_seasonal_param = seasonal_param
            except Exception:
                continue

    print('ARIMA{}x{} - AIC:{}'.format(optimal_order_param, optimal_seasonal_param, smallest_aic))

seasonal_cycle_length = 12
search_optimal_sarima(time_series, seasonal_cycle_length)
The first lines of the function create the parameter grid. As we already know, the ARIMA model has three main parameters: p, d, and q. In the code above, each of them comes from range(0, 2), meaning it can take the value 0 or 1. The itertools.product() function generates all possible combinations of the three parameters, such as (0, 0, 0), (0, 0, 1), (0, 1, 1), and so on.
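To see the grid concretely, you can print the combinations that itertools.product() generates; this mirrors what the function does internally:

# All 2 * 2 * 2 = 8 combinations of (p, d, q) with values in {0, 1}:
print(list(itertools.product(range(0, 2), repeat=3)))
# [(0, 0, 0), (0, 0, 1), (0, 1, 0), (0, 1, 1), (1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]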
Then we create additional combinations by adding the seasonal period to each of the pdq combinations. This allows the model to account for seasonal influences on the time series.
Now we need to apply the parameters we determined earlier to automatically tune ARIMA models. When working with forecasting models, our task is to choose the model that best explains and predicts the data. However, selecting the best model is not always straightforward. The Akaike Information Criterion (AIC) helps us compare different models and determine which one is better. AIC helps evaluate how well the model fits the data, considering its complexity. So, the goal is to find the model with the lowest AIC value.
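Formally, AIC = 2k - 2·ln(L), where k is the number of estimated parameters and L is the maximized value of the model's likelihood function. The 2k term penalizes complexity, so among models that explain the data equally well, the simpler one receives the lower (better) score.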
The code above iterates through all possible parameter combinations and uses the SARIMAX function to build the seasonal ARIMA model. The order parameter sets the main parameters (p, d, q), and seasonal_order sets the seasonal parameters of the model (P, D, Q, m).
For our data, we get the following result:
ARIMA(0, 1, 1)x(1, 1, 1, 12) - AIC:920.3192974989254
Once we have found the optimal parameters using grid search, we can use these parameters to train the SARIMAX model on our time series data. This helps us understand how well the model fits our data and provides an opportunity to adjust the model’s parameters if necessary.
First, we define the SARIMAX model with the previously found parameters:
from statsmodels.tsa.statespace.sarimax import SARIMAX
model = SARIMAX(time_series, order=(0, 1, 1), seasonal_order=(1, 1, 1, 12))
Next, fit the model:
results = model.fit()
Print the model summary:
print(results.summary())
The model summary is widely used to assess the quality of the parameter fit. Key aspects to pay attention to include:
Coefficients: They should be statistically significant. Check the p-values of the coefficients (P>|z|); they should be less than 0.05.
AIC (Akaike Information Criterion): A lower AIC value indicates a better model fit.
Ljung-Box (L1) (Q): This is the p-value for the Ljung-Box Q-statistic. If the p-value is greater than 0.05, the residuals are random, which is good.
Jarque-Bera (JB): This is a test for the normality of residuals. If Prob(JB) is greater than 0.05, the residuals are normally distributed, which is good.
Heteroskedasticity (H): This is a test for heteroskedasticity in the residuals. If Prob(H) (two-sided) is greater than 0.05, the residuals are homoscedastic, which is good. Heteroskedasticity occurs when the variance of your forecast errors changes depending on the time period, which means there is a non-uniformity in your data.
Ideally, your model should have statistically significant coefficients, a low AIC value, and residuals that are normally distributed and homoscedastic. Meeting these criteria indicates a good model.
For our model, we obtained the following output:
Plot the model diagnostics:
results.plot_diagnostics(figsize=(12, 8))
plt.show()
This command generates four diagnostic plots:
Residuals Plot: A plot of model residuals over time. If the model is good, the residuals should be random, and the plot should look like white noise.
Q-Q Plot: A plot comparing the distribution of residuals to a standard normal distribution. If the points follow the diagonal line, it indicates that the residuals are normally distributed.
ACF Plot: A plot of the autocorrelation of the residuals. If the model is good, the residuals should not be correlated with each other, which shows up as the autocorrelation bars staying inside the shaded confidence band.
Histogram of Residuals: A histogram of the distribution of residuals. If the model is good, the residuals should be normally distributed, and the histogram should resemble a bell curve.
These plots, along with the model summary, help us check how well the model fits our data and whether it was correctly specified. If the model is incorrect or unsuitable for the data, it may provide inaccurate forecasts, which could negatively impact decisions made based on these forecasts.
Our diagnostic plots look as follows:
The model we selected generally meets the requirements, but there is still potential for improving the parameters of the seasonal ARIMA model. Applying SARIMA to time series data often requires a careful approach, and it is always beneficial to conduct a thorough data analysis and spend more time on data preprocessing and exploratory analysis before applying time series models.
After successfully training the model, the next step is to generate forecasts and compare the predicted values with the actual data.
First, we generate forecasted values with the model, starting from a specific date and extending to the end of the dataset. The get_prediction method returns a prediction object from which we can extract the forecasted values using predicted_mean:
st_pred = results.get_prediction(start=pd.to_datetime('1955-12-01'), dynamic=False)
forecast_values = st_pred.predicted_mean
Here, December 1955 is used as an example starting date, but you can adjust this date according to your needs.
Now we have the forecasted values that we can compare with the actual time series data. We will use the Mean Squared Error (MSE) as our metric for evaluating the accuracy of the forecast:
actual_values = time_series['1955-12-01':]['Passengers']
forecast_mse = ((forecast_values - actual_values) ** 2).mean()
print('Mean Squared Error of the forecast is {}'.format(round(forecast_mse, 2)))
MSE is a widely accepted metric for evaluating the performance of forecasting models. A lower MSE indicates a more accurate model. Of course, there is no perfect model, and there will always be some deviation between forecasts and actual data. In our case, the Mean Squared Error of the forecast is 170.37.
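Because MSE is expressed in squared units, it is often convenient to also report the root mean squared error (RMSE), which is in the original units (thousands of passengers) and easier to interpret. A small addition:

# RMSE puts the error back into the original units; for our data it is
# roughly 13 (thousand passengers), the square root of the MSE above.
rmse = np.sqrt(forecast_mse)
print('Root Mean Squared Error of the forecast is {}'.format(round(rmse, 2)))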
Finally, we visualize the results to visually assess the accuracy of our forecasts compared to the actual data:
plt.figure(figsize=(15,8))
plt.plot(actual_values.index, actual_values, label='Actual Values', color='blue')
plt.plot(forecast_values.index, forecast_values, label='Forecasted Values', color='red')
plt.title('Actual and Forecasted Values')
plt.xlabel('Date')
plt.ylabel('Passengers')
plt.legend()
plt.show()
This code generates a plot showing the actual and forecasted passenger numbers over time. The red line represents the forecasted values, while the blue line shows the actual data.
This visualization helps you understand how well the model predicts the data.
Dynamic forecasting generally provides a more realistic view of future time series behavior, because it feeds the model's own forecasts back in as inputs for subsequent predictions.
In static forecasting, the model uses the entire set of known values to forecast each subsequent point. Dynamic forecasting, by contrast, starts at a user-defined date and from there relies on its most recent forecasted values to predict the next ones.
To perform dynamic forecasting, set the dynamic parameter to True:
dyn_pred = results.get_prediction(start=pd.to_datetime('1955-12-01'), dynamic=True)
dynamic_forecast_values = dyn_pred.predicted_mean
You can also calculate the Mean Squared Error for the dynamic forecast:
mse_dynamic_forecast = ((dynamic_forecast_values - actual_values) ** 2).mean()
print('Mean Squared Error of the dynamic forecast is {}'.format(round(mse_dynamic_forecast, 2)))
And plot the actual and dynamically forecasted values:
plt.figure(figsize=(15,8))
plt.plot(actual_values.index, actual_values, label='Actual Values', color='blue')
plt.plot(dynamic_forecast_values.index, dynamic_forecast_values, label='Dynamic Forecast', color='green')
plt.title('Actual and Dynamically Forecasted Values')
plt.xlabel('Date')
plt.ylabel('Passengers')
plt.legend()
plt.show()
After performing static and dynamic forecasts, we can evaluate whether our time series model is successful. The next step is to attempt to predict future data in this time series.
Now we can finally use the ARIMA model in Python to forecast future values.
To forecast a certain number of steps ahead, use the get_forecast method of the fitted model:
pred_future = results.get_forecast(steps=12)
We use the trained model (results) to obtain forecasts for the next 12 periods. Since our data includes information up to December 1960, we will generate predictions of the monthly passenger numbers for 1961.
We will print the forecasted mean values and confidence intervals:
print(f'Forecasted mean values:\n\n{pred_future.predicted_mean}')
print(f'\nConfidence intervals:\n\n{pred_future.conf_int()}')
We can also visualize our forecast:
plt.figure(figsize=(15, 8))
plt.plot(pred_future.predicted_mean, label='Forecasted Mean Values')
plt.fill_between(pred_future.conf_int().index,
                 pred_future.conf_int().iloc[:, 0],
                 pred_future.conf_int().iloc[:, 1], color='k', alpha=.2)
plt.legend()
plt.show()
This visualization is very useful for understanding what the model predicts. The forecasted mean values show the expected number of passengers each month in 1961, and the shaded area around the forecast represents the confidence interval.
In this tutorial, we discussed how to apply the ARIMA model for time series forecasting using Python. We covered the entire process from data loading and preprocessing to finding optimal parameters for the model, evaluating it, and ultimately forecasting future values.
Working with ARIMA is also a useful stepping stone toward more advanced forecasting techniques. It is important to remember that the ARIMA model might not suit every time series, and the results will depend on the quality of your initial data and the preprocessing performed.
Now you can automate the forecasting of time series data using the ARIMA model and the Python programming language. We encourage you to practice and revisit this tutorial with different datasets to enhance your skills.
On our app platform, you can find ready-to-deploy Python applications such as Celery, Django, FastAPI, and Flask.