How to Read Excel Files in Python using Pandas
Excel files are commonly used to organize, sort, and analyze data in a tabular format with rows and columns. They are widely applied in industries like data analysis, finance, and reporting.
Using Python, the pandas library allows for efficient manipulation of Excel files, enabling operations like reading and writing data. This article will cover how to use the read_excel function from pandas to read Excel files.
Installing Pandas
To begin, install pandas by running the following command:
pip install pandas
This will install pandas along with the required dependencies in your work environment. Additionally, the openpyxl module is needed for reading .xlsx files.
Why OpenPyXL?
Excel files come in different formats and extensions. To ensure compatibility when working with these files, pandas allows you to specify the engine you want to use. Below is a list of supported engines for reading Excel files:
OpenPyXL: Used for reading and writing .xlsx files (Excel 2007+).
XlsxWriter: Primarily used for writing .xlsx files.
xlrd: Used for reading older .xls files (Excel 97-2003).
Pyxlsb: Used for reading .xlsb (binary Excel format) files.
OpenPyXL also supports Excel-specific features, such as formatting and formulas. OpenPyXL is already installed as a dependency of pandas, but you can install it using the following command:
pip install openpyxl
While OpenPyXL can be used on its own to read Excel files, it is also integrated as an engine within pandas for reading and writing .xlsx files.
We will work with an Excel file that you can download here. Download the file and move it into your working environment.
Basic Usage of read_excel Function
The Excel file we are working with has the following structure:
It also has three worksheets: Orders, Returns, and Users.
To read this file, the read_excel function from pandas will be used.
The read_excel function in pandas is used to import data from Excel files into a pandas DataFrame, a powerful structure for analyzing and manipulating data. This function is highly versatile, allowing users to read data from specific sheets, columns, or ranges.
Here is how to use this function while specifying the engine:
import pandas as pd
df = pd.read_excel('SuperStoreUS-2015.xlsx')
print(df)
This code imports the pandas library and uses the read_excel function to read the SuperStoreUS-2015.xlsx Excel file into a pandas DataFrame. The print(df) statement outputs the DataFrame contents, displaying the data from the Excel file. Below is the resulting output:
Row ID Order Priority Discount Unit Price Shipping Cost ... Ship Date Profit Quantity ordered new Sales Order ID
0 20847 High 0.01 2.84 0.93 ... 2015-01-08 4.5600 4 13.01 88522
1 20228 Not Specified 0.02 500.98 26.00 ... 2015-06-15 4390.3665 12 6362.85 90193
2 21776 Critical 0.06 9.48 7.29 ... 2015-02-17 -53.8096 22 211.15 90192
3 24844 Medium 0.09 78.69 19.99 ... 2015-05-14 803.4705 16 1164.45 86838
4 24846 Medium 0.08 3.28 2.31 ... 2015-05-13 -24.0300 7 22.23 86838
The read_excel function is highly flexible and can be adapted to various usage scenarios. Next, we will explore how to use it for reading specific sheets and columns.
Reading Specific Sheets and Columns
Excel files can come with multiple sheets and as many columns as possible. The read_excel function takes the sheet_name argument to tell pandas which sheet to read. By default, read_excel will load all worksheets. Here is how you can use the sheet_name argument:
df = pd.read_excel('SuperStoreUS-2015.xlsx', sheet_name="Returns")
print(df)
This will read the Returns sheet, and here is an example output:
Order ID Status
0 65 Returned
1 612 Returned
2 614 Returned
3 678 Returned
4 710 Returned
... ... ...
1629 182681 Returned
1630 182683 Returned
1631 182750 Returned
1632 182781 Returned
1633 182906 Returned
[1634 rows x 2 columns]
The sheet_name argument also takes integers that are used in zero-indexed sheet positions. For instance, using pd.read_excel('SuperStoreUS-2015.xlsx', sheet_name=1) will load the Returns sheet as well.
You can also choose to read specific columns from the Excel file. The read_excel function allows for selective column reading using the usecols parameter. It accepts various formats:
A string for Excel column letters or ranges (e.g., "A:C").
A list of integers for column positions.
A list of column names.
Here is an example using column names:
import pandas as pd
df = pd.read_excel('SuperStoreUS-2015.xlsx', usecols=['Row ID', 'Sales'])
print(df)
In this case, the usecols parameter specifies that only columns Row ID and Sales from the Excel file should be imported into the DataFrame. The code below does the same thing, but using Excel column letters:
import pandas as pd
df = pd.read_excel('SuperStoreUS-2015.xlsx', usecols='A,X')
print(df)
Here is the output:
Row ID Sales
0 20847 13.01
1 20228 6362.85
2 21776 211.15
3 24844 1164.45
4 24846 22.23
... ... ...
1947 19842 207.31
1948 19843 143.12
1949 26208 59.98
1950 24911 135.78
1951 25914 506.50
You can also use range selection to read columns by their position. In the code below, we are reading from Order Priority to Customer ID.
df = pd.read_excel('SuperStoreUS-2015.xlsx', usecols='B:F')
Here is an example output when reading columns B to F:
Order Priority Discount Unit Price Shipping Cost Customer ID
0 High 0.01 2.84 0.93 3
1 Not Specified 0.02 500.98 26.00 5
2 Critical 0.06 9.48 7.29 11
3 Medium 0.09 78.69 19.99 14
4 Medium 0.08 3.28 2.31 14
Additionally, you can provide a callable that evaluates column names, reading only those for which the function returns True.
Handling Missing Data in Excel Files
In Excel files, missing data refers to values that are absent, often represented by empty cells. When reading an Excel file into a pandas DataFrame, missing data is automatically identified and handled as NaN (Not a Number), which is pandas placeholder for missing values.
Pandas offers several methods to handle missing data, such as:
dropna(): Removes rows or columns with missing values.
fillna(): Replaces missing values with a specified value (e.g., 0 or the mean of the column).
isna(): Detects missing values and returns a boolean DataFrame.
For example, using fillna on our Excel file will replace all missing values with 0:
df = pd.read_excel('SuperStoreUS-2015.xlsx')
df_cleaned = df.fillna(0)
Handling missing data is essential to ensure accurate analysis and prevent errors or biases in data-driven decisions.
Reading and Analyzing an Excel File in Pandas
Let’s make a pragmatic use of the notion we have learned. In this practical example, we will walk through reading an Excel file, performing some basic analysis, and exporting the manipulated data into various formats.
Specifically, we’ll calculate the sum, maximum, and minimum values for the Profit column for the year 2015, and export the results to CSV, JSON, and a Python dictionary.
Step 1: Loading the Excel File
The first step is to load the Excel file using the read_excel function from pandas:
import pandas as pd
df = pd.read_excel('SuperStoreUS-2015.xlsx', usecols=['Ship Date', 'Profit'])
print(df.head())
This code reads the SuperStoreUS-2015.xlsx file into a pandas DataFrame and displays the first few rows, including the Ship Date and Profit columns.
Step 2: Calculating Profit for June 2015
Next, we will filter the data to include only records from June 2015 and calculate the total, maximum, and minimum profit for that month. Since the date format in the dataset is MM/DD/YYYY, we will convert the Ship Date column to a datetime format and filter by the specific month:
df['Ship Date'] = pd.to_datetime(df['Ship Date'], format='%m/%d/%Y')
df_june_2015 = df[(df['Ship Date'].dt.year == 2015) & (df['Ship Date'].dt.month == 6)]
# Calculate the sum, max, and min for the Profit column
profit_sum = df_june_2015['Profit'].sum()
profit_max = df_june_2015['Profit'].max()
profit_min = df_june_2015['Profit'].min()
print(f"Total Profit in June 2015: {profit_sum}")
print(f"Maximum Profit in June 2015: {profit_max}")
print(f"Minimum Profit in June 2015: {profit_min}")
The output will be something like:
print(f"Total Profit in June 2015: {round(profit_sum, ndigits=2)}")
print(f"Maximum Profit in June 2015: {round(profit_max, ndigits=2)}")
print(f"Minimum Profit in June 2015: {round(profit_min, ndigits=2)}")
Step 3: Exporting the Manipulated Data
Once the profit for June 2015 has been calculated, we can export the filtered data to different formats, including CSV, JSON, and a Python dictionary.
# Export to CSV
df_june_2015.to_csv('SuperStoreUS_June2015_Profit.csv', index=False)
# Export to JSON
df_june_2015.to_json('SuperStoreUS_June2015_Profit.json', orient='records')
# Convert to Dictionary
data_dict = df_june_2015.to_dict(orient='records')
print(data_dict[:5])
In this step, the data is first exported to a CSV file and then to a JSON file. Finally, the DataFrame is converted into a Python dictionary, with each row represented as a dictionary.
Conclusion
In this article, we have learned how to use the read_excel function from pandas to read and manipulate Excel files. This is a powerful function with the ability to simplify data filtering for a better focus on the rows or columns we want.
03 October 2024 · 8 min to read