How to extract and clean data from multiple excel sheets using python

Performing analysis on data from an excel sheet is an everyday task for many professionals and students.

Mustafa Adel Amer
3 min readOct 14, 2021

Speeding up the process of extracting and cleaning data from excel faster will boost productivity, enhance analysis capabilities, and save time to try higher-level analytics.

We will go through extracting data from Excel Workbook BP Statistical Review on World Energy 2020 published publicly by BP.

Each sheet has many empty data cells and other challenges related to the naming of entries.

Extracting and cleaning data from a single excel sheet is here.

In this post, we will discuss how we can automate the process of extracting data from multiple worksheets from the BP Statistical Review on World Energy 2020 Excel Workbook.

Input and output of one excel sheet — Image by author

Extracting Multiple Excel Sheets

Workflow — Image by author

The code is splitter into a function that performs cleaning and adjust the naming of sectors and scenarios.

The GitHub repository is published here.

def clean_table_bp(country):"""
It takes in a Excel Sheet Name
Returns clean dataframe of the data inside the sheet
"""
xls = pd.ExcelFile('bp-energy-outlook-2020-summary-tables.xlsx')#read the dataframe from the sheet
df = pd.read_excel(xls, country)
#adjust column names
df.columns =['sector', '1995', '2000','2005', '2010', '2015', '2018', '1995-2018',
'empty1', '2025 Rapid scenario', '2030 Rapid scenario', '2035 Rapid scenario',
'2040 Rapid scenario', '2045 Rapid scenario', '2050 Rapid scenario', '2018-2050', 'empty2',
'2025 Net Zero scenario', '2030 Net Zero scenario', '2035 Net Zero scenario', '2040 Net Zero scenario',
'2045 Net Zero scenario', '2050 Net Zero scenario', '2018-2050', 'empty3',
'2025 Bussiness as susual', '2030 Bussiness as usual', '2035 Bussiness as usual',
'2040 Bussiness as susual', '2045 Bussiness as usual', '2050 Bussiness as usual', '2018-2050']
#create a column name with the country of interest
df['country'] = country
#choose sectors of interest
df = df[df['sector'].isin(sectors_int)]

#drop the empty rows
df.drop(['empty1', 'empty2', 'empty3'], axis = 1, inplace = True)

#adjust sector names to differentiate between consumption and production of fuels
for x in [6,7,8,9,10,11,13,14, 25,26,27,29,30]:
if x<15:
df.loc[x,'sector'] = df.loc[x,'sector']+' consumption'
if x>15:
df.loc[x,'sector'] = df.loc[x,'sector']+' production'

#print("sector amendments")
#reset_index
df = df.reset_index()
return df

The second is a loop to perform cleaning on each selected sheet then merge the resulting dataframes in one.

# Methodology :loop over the country list and create a dataframe for each country then append the dataframes

#create an empty dataframe
df_final = pd.DataFrame()

# loop over the country list
for i in country_list:

#perform the function clean_table_bp on each element in the country list
df = clean_table_bp(i)

# merge the resulting dataframes
df_final = pd.concat([df, df_final], axis = 0)

# return the final dataframe
df_final

This process saves 2 hours of manual work to clean and merge the data.

In the next post, we will discuss how we can benchmark time series data in Python in seconds.

--

--