How to Read Excel Files in Python using Pandas

Technical writer

Python

03.10.2024

Reading time: 8 min

Excel files are commonly used to organize, sort, and analyze data in a tabular format with rows and columns. They are widely applied in industries like data analysis, finance, and reporting.

Using Python, the pandas library allows for efficient manipulation of Excel files, enabling operations like reading and writing data. This article will cover how to use the read_excel function from pandas to read Excel files.

Installing Pandas

To begin, install pandas by running the following command:

pip install pandas

This will install pandas along with the required dependencies in your work environment. Additionally, the openpyxl module is needed for reading .xlsx files.

Why OpenPyXL?

Excel files come in different formats and extensions. To ensure compatibility when working with these files, pandas allows you to specify the engine you want to use. Below is a list of supported engines for reading Excel files:

OpenPyXL: Used for reading and writing .xlsx files (Excel 2007+).
XlsxWriter: Primarily used for writing .xlsx files.
xlrd: Used for reading older .xls files (Excel 97-2003).
Pyxlsb: Used for reading .xlsb (binary Excel format) files.

OpenPyXL also supports Excel-specific features, such as formatting and formulas. OpenPyXL is already installed as a dependency of pandas, but you can install it using the following command:

pip install openpyxl

While OpenPyXL can be used on its own to read Excel files, it is also integrated as an engine within pandas for reading and writing .xlsx files.

We will work with an Excel file that you can download here. Download the file and move it into your working environment.

Basic Usage of read_excel Function

The Excel file we are working with has the following structure:

It also has three worksheets: Orders, Returns, and Users.

To read this file, the read_excel function from pandas will be used.

The read_excel function in pandas is used to import data from Excel files into a pandas DataFrame, a powerful structure for analyzing and manipulating data. This function is highly versatile, allowing users to read data from specific sheets, columns, or ranges.

Here is how to use this function while specifying the engine:

import pandas as pd 

df = pd.read_excel('SuperStoreUS-2015.xlsx')

print(df)

This code imports the pandas library and uses the read_excel function to read the SuperStoreUS-2015.xlsx Excel file into a pandas DataFrame. The print(df) statement outputs the DataFrame contents, displaying the data from the Excel file. Below is the resulting output:

       Row ID Order Priority  Discount  Unit Price  Shipping Cost  ...  Ship Date     Profit Quantity ordered new    Sales Order ID

0      20847           High      0.01        2.84           0.93  ... 2015-01-08     4.5600                    4    13.01    88522

1      20228  Not Specified      0.02      500.98          26.00  ... 2015-06-15  4390.3665                   12  6362.85    90193

2      21776       Critical      0.06        9.48           7.29  ... 2015-02-17   -53.8096                   22   211.15    90192

3      24844         Medium      0.09       78.69          19.99  ... 2015-05-14   803.4705                   16  1164.45    86838

4      24846         Medium      0.08        3.28           2.31  ... 2015-05-13   -24.0300                    7    22.23    86838

The read_excel function is highly flexible and can be adapted to various usage scenarios. Next, we will explore how to use it for reading specific sheets and columns.

Reading Specific Sheets and Columns

Excel files can come with multiple sheets and as many columns as possible. The read_excel function takes the sheet_name argument to tell pandas which sheet to read. By default, read_excel will load all worksheets. Here is how you can use the sheet_name argument:

df = pd.read_excel('SuperStoreUS-2015.xlsx', sheet_name="Returns")

print(df)

This will read the Returns sheet, and here is an example output:

      Order ID    Status

0           65  Returned

1          612  Returned

2          614  Returned

3          678  Returned

4          710  Returned

...        ...       ...

1629    182681  Returned

1630    182683  Returned

1631    182750  Returned

1632    182781  Returned

1633    182906  Returned

[1634 rows x 2 columns]

The sheet_name argument also takes integers that are used in zero-indexed sheet positions. For instance, using pd.read_excel('SuperStoreUS-2015.xlsx', sheet_name=1) will load the Returns sheet as well.

You can also choose to read specific columns from the Excel file. The read_excel function allows for selective column reading using the usecols parameter. It accepts various formats:

A string for Excel column letters or ranges (e.g., "A:C").
A list of integers for column positions.
A list of column names.

Here is an example using column names:

import pandas as pd

df = pd.read_excel('SuperStoreUS-2015.xlsx', usecols=['Row ID', 'Sales'])

print(df)

In this case, the usecols parameter specifies that only columns Row ID and Sales from the Excel file should be imported into the DataFrame. The code below does the same thing, but using Excel column letters:

import pandas as pd

df = pd.read_excel('SuperStoreUS-2015.xlsx', usecols='A,X')

print(df)

Here is the output:

      Row ID    Sales

0      20847    13.01

1      20228  6362.85

2      21776   211.15

3      24844  1164.45

4      24846    22.23

...      ...      ...

1947   19842   207.31

1948   19843   143.12

1949   26208    59.98

1950   24911   135.78

1951   25914   506.50

You can also use range selection to read columns by their position. In the code below, we are reading from Order Priority to Customer ID.

df = pd.read_excel('SuperStoreUS-2015.xlsx', usecols='B:F')

Here is an example output when reading columns B to F:

     Order Priority  Discount  Unit Price  Shipping Cost  Customer ID

0              High      0.01        2.84           0.93            3

1     Not Specified      0.02      500.98          26.00            5

2          Critical      0.06        9.48           7.29           11

3            Medium      0.09       78.69          19.99           14

4            Medium      0.08        3.28           2.31           14

Additionally, you can provide a callable that evaluates column names, reading only those for which the function returns True.

Handling Missing Data in Excel Files

In Excel files, missing data refers to values that are absent, often represented by empty cells. When reading an Excel file into a pandas DataFrame, missing data is automatically identified and handled as NaN (Not a Number), which is pandas placeholder for missing values.

Pandas offers several methods to handle missing data, such as:

dropna(): Removes rows or columns with missing values.
fillna(): Replaces missing values with a specified value (e.g., 0 or the mean of the column).
isna(): Detects missing values and returns a boolean DataFrame.

For example, using fillna on our Excel file will replace all missing values with 0:

df = pd.read_excel('SuperStoreUS-2015.xlsx')

df_cleaned = df.fillna(0)

Handling missing data is essential to ensure accurate analysis and prevent errors or biases in data-driven decisions.

Reading and Analyzing an Excel File in Pandas

Let’s make a pragmatic use of the notion we have learned. In this practical example, we will walk through reading an Excel file, performing some basic analysis, and exporting the manipulated data into various formats.

Specifically, we’ll calculate the sum, maximum, and minimum values for the Profit column for the year 2015, and export the results to CSV, JSON, and a Python dictionary.

Step 1: Loading the Excel File

The first step is to load the Excel file using the read_excel function from pandas:

import pandas as pd

df = pd.read_excel('SuperStoreUS-2015.xlsx', usecols=['Ship Date', 'Profit'])

print(df.head())

This code reads the SuperStoreUS-2015.xlsx file into a pandas DataFrame and displays the first few rows, including the Ship Date and Profit columns.

Step 2: Calculating Profit for June 2015

Next, we will filter the data to include only records from June 2015 and calculate the total, maximum, and minimum profit for that month. Since the date format in the dataset is MM/DD/YYYY, we will convert the Ship Date column to a datetime format and filter by the specific month:

df['Ship Date'] = pd.to_datetime(df['Ship Date'], format='%m/%d/%Y')

df_june_2015 = df[(df['Ship Date'].dt.year == 2015) & (df['Ship Date'].dt.month == 6)]

# Calculate the sum, max, and min for the Profit column

profit_sum = df_june_2015['Profit'].sum()

profit_max = df_june_2015['Profit'].max()

profit_min = df_june_2015['Profit'].min()

print(f"Total Profit in June 2015: {profit_sum}")

print(f"Maximum Profit in June 2015: {profit_max}")

print(f"Minimum Profit in June 2015: {profit_min}")

The output will be something like:

print(f"Total Profit in June 2015: {round(profit_sum, ndigits=2)}")

print(f"Maximum Profit in June 2015: {round(profit_max, ndigits=2)}")

print(f"Minimum Profit in June 2015: {round(profit_min, ndigits=2)}")

Step 3: Exporting the Manipulated Data

Once the profit for June 2015 has been calculated, we can export the filtered data to different formats, including CSV, JSON, and a Python dictionary.

# Export to CSV

df_june_2015.to_csv('SuperStoreUS_June2015_Profit.csv', index=False)

# Export to JSON

df_june_2015.to_json('SuperStoreUS_June2015_Profit.json', orient='records')

# Convert to Dictionary

data_dict = df_june_2015.to_dict(orient='records')

print(data_dict[:5])

In this step, the data is first exported to a CSV file and then to a JSON file. Finally, the DataFrame is converted into a Python dictionary, with each row represented as a dictionary.

Conclusion

In this article, we have learned how to use the read_excel function from pandas to read and manipulate Excel files. This is a powerful function with the ability to simplify data filtering for a better focus on the rows or columns we want.

Check out our app platform to find Python applications, such as Celery, Django, FastAPI and Flask.

Python

03.10.2024

Reading time: 8 min

Similar

Microservices