Last Updated on July 14, 2022 by Jay
In this tutorial, we’ll learn how to merge multiple Excel files in Python. The “merge” operation refers to combining files based on a common unique key identifier/column. This operation is identical to the “Merge Query” inside Excel PowerQuery.
Sometimes there’s confusion between a “merge” and “append”, so let’s clear that out.
- Append – Combine files by adding data vertically (at the bottom of a file). When you have files with the (more or less) same format/columns and you want to aggregate those files, use Append. Check here for tutorial on append.
- Merge – Combine files by adding data horizontally (to the right of a file). When you have files containing different aspects for the same data records, and you want to aggregate those files, use Merge.
This tutorial assumes that you have a basic understanding of Python and the pandas library. If you feel lost, this series of articles might help.
The Excel files
Below are a few mock-up datasets, feel free to download them and follow along. Each file (except the death report) contains about 100,000 records. The three files contain different information on the same group of insurance policyholders, and our goal is to create a “master database” to store all information in one place. Don’t laugh if you think I’m joking – in the corporate/finance world, we in fact use Excel as a database to store information. I DO NOT recommend it at all, but it’s a common practice.
I purposely created some dummy data for the first two files, with both having 100k rows and 50+ columns. This is to help test how fast we can merge them.
We notice the “PolicyID” columns contain unique keys we can use to link up policies among the three spreadsheets. Being proficient in Excel, our first reaction is probably: “lookup will do the job. I can use lookups to find values for each PolicyID, and bring all data fields into one spreadsheet!”
That turned out to be a bad idea if you are working with a large dataset. Using Excel means we need to build a massive spreadsheet with millions of lookup and other formulas. Each time we need to update something in the file, it probably will take half an hour to re-calculate, a very good way to spend our time on!
This is where Python really shines. It will provide a 1000x boost to our productivity!
The Python???? way
As usual, we start off by importing our favorite library –
pandas. Then we read all three Excel files into Python.
import pandas as pd df_1 = pd.read_excel('sheet1.xslx') df_2 = pd.read_excel('sheet2.xslx') df_3 = pd.read_excel('sheet3.xslx')
As a reminder, a pandas dataframe is a tabular data object which looks exactly like an Excel spreadsheet – rows, columns, and cells!
Pandas has a method
.merge() to merge multiple datasets efficiently.
df_combine = df_1.merge(df_2, left_on='PolicyID', right_on = "ID", how='left') df_combine = df_combine.merge(df_3, on='PolicyID')
The first merge()
df_1 is called the left dataframe, and the
df_2 is called the right dataframe. Merging
df_1 basically means we are bringing all the data from the two dataframes together, matching each record from
df_1, using a common unique key.
Just like the Excel vlookup formula, except that we’ve achieved the same result with 1 line of code instead of millions of formulas!
Notice that in the first Excel file, the column “PolicyID” contains policy numbers. In contrast, in the second Excel file, the column “ID” contains the policy numbers, so we had to specify that, for the left dataframe (
df_1), we want to use “PolicyID” column as the unique key. For the right dataframe (
df_2), we want to use “ID” column as the unique key.
There is the same number of records in both
df_2, so we can get a one-to-one match and bring both dataframes together. Note the number rows didn’t change in the resulting dataframe, but the number of columns is now the sum of all columns from both files!
The second merge()
We take the result from the first merge operation, then merge with another
df3. This time, because both dfs have the same common column “PolicyID”, we just need to specify it with
on = 'PolicyID'. Let’s look at the argument how.
how = ‘inner’ (or left blank)
how = 'inner' This is the default value when the argument is left out. Inner means to use the intersection of keys from both frames. In other words, only keys that exist in both dataframes will be returned. Note the final record count: 19,777 rows which is equal to the count from file 3.
df_combine = df_combine.merge(df3, on = 'PolicyID') ## how = 'inner'
how = ‘left’
how = 'left' This means using only keys from the left dataframe. Keys that don’t exist in the left dataframe won’t be returned. Note the final record count: 99,999 rows which is equal to the count from file 1.
df_combine = df_combine.merge(df3, on = 'PolicyID', how='left')
A few interesting observations about the final combined dataframe:
- Both PolicyID (from
df_1) and ID (from
df_2) got brought into the dataframe, we’ll have to drop one to clean up the data.
- There are two “Account Value” columns. Account Value_x (from
df_2) and Account Value_y (from
df_3). When there are two identical columns, by default,
pandaswill assign suffix “_x”, “_y”, etc. to the end of column names. We can change the suffix by using the optional argument
suffixes=('_x', '_y')inside the
Putting it All Together
import pandas as pd #load datasets from Excel files df_1 = pd.read_excel('sheet1.xlsx') df_2 = pd.read_excel('sheet2.xlsx') df_3 = pd.read_excel('sheet3.xlsx') #merge datasets df_combine = df_1.merge(df_2, left_on='PolicyID', right_on = "ID") df_combine = df_combine.merge(df_3, on='PolicyID') #output back into Excel df_combine.to_excel('combine_df.xslx')
Merge Performance: Excel vs Python
You are probably already familiar with Excel and know how slow it can be if you have thousands of lookup formulas. I want to show you how fast Python can merge two large datasets.
You read that correct – it was less than 0.1 seconds to merge the two datasets with ~100k rows and 50+ columns each! I’m not even going to bother testing this in Excel, which I’m sure will be orders of magnitude slower than Python.
I like to spend my time on better things than waiting for Excel to refresh. Hope you find this tutorial helpful and start finding ways to save your precious time as well!
I have been referring to this example while working on a similar project. I have a question though. I have 3 different xslx files, and all 3 don’t share the same key. Sheet1 shares with sheet 2 and sheet2 with sheet3. I was able to merge the first 2 sheets together, but when I try to get sheet3 it doesn’t work. I am wondering how to use the left_on and right_on in this case and if how=left or how=inner.
Thanks for the question! I’ve already answered this question on my Youtube channel, but for the benefit of others who see this question, I’m pasting my response here:
For 3 files with different keys, you can:
1. merge file #1 and #2 first on one set of key, let’s call this resulting dataframe “a”
2. merge the above dataframe “a” with file #3, using the common key (from files #2 and #3).
I wouldn’t worry about the “left_on/right_on” and “how” arguments, I don’t think you need them based on your description. However, in case this is any help:
The “left_on” and “right_on” are arguments to use when your “left” and “right” dataframes have different column names for the key column. If they key columns in both dataframes have the exact same name, you can just use the argument “on” instead.
The “how” argument specify how to perform the merge, this is similar to the SQL join. Basically:
left – keep all the records in the left dataframe, if a record is in the left dataframe but not in the right one, you’ll get NAs for columns in the right dataframe after the merge
right – keep all the records in the right dataframe, similar behavior to the above
inner – keep all the records that exist in both dataframes (if a record is in one table but not the other, it will be removed)
outter – keep all records from both dataframes