How to extract and clean data from multiple excel sheets using python

Performing analysis on data from an excel sheet is an everyday task for many professionals and students.

3 min readOct 14, 2021

Speeding up the process of extracting and cleaning data from excel faster will boost productivity, enhance analysis capabilities, and save time to try higher-level analytics.

We will go through extracting data from Excel Workbook BP Statistical Review on World Energy 2020 published publicly by BP.

Each sheet has many empty data cells and other challenges related to the naming of entries.

Extracting and cleaning data from a single excel sheet is here.

In this post, we will discuss how we can automate the process of extracting data from multiple worksheets from the BP Statistical Review on World Energy 2020 Excel Workbook.

Input and output of one excel sheet — Image by author

Extracting Multiple Excel Sheets

The code is splitter into a function that performs cleaning and adjust the naming of sectors and scenarios.

The GitHub repository is published here.

def clean_table_bp(country):"""
It takes in a Excel Sheet Name
Returns clean dataframe of the data inside the sheet
"""    xls = pd.ExcelFile('bp-energy-outlook-2020-summary-tables.xlsx')#read the dataframe from the sheet
    df = pd.read_excel(xls, country)#adjust column names
    df.columns =['sector', '1995', '2000','2005', '2010', '2015', '2018', '1995-2018', 
                         'empty1', '2025 Rapid scenario', '2030 Rapid scenario', '2035 Rapid scenario', 
                         '2040 Rapid scenario', '2045 Rapid scenario', '2050 Rapid scenario', '2018-2050', 'empty2', 
                         '2025 Net Zero scenario', '2030 Net Zero scenario', '2035 Net Zero scenario', '2040 Net Zero scenario', 
                         '2045 Net Zero scenario', '2050 Net Zero scenario', '2018-2050', 'empty3', 
                         '2025 Bussiness as susual', '2030 Bussiness as usual', '2035 Bussiness as usual',
                         '2040 Bussiness as susual', '2045 Bussiness as usual', '2050 Bussiness as usual', '2018-2050']#create a column name with the country of interest
    df['country'] = country
        #choose sectors of interest
    df = df[df['sector'].isin(sectors_int)]
        
    #drop the empty rows
    df.drop(['empty1', 'empty2', 'empty3'], axis = 1, inplace = True)
    
            #adjust sector names to differentiate between consumption and production of fuels 
    for x in [6,7,8,9,10,11,13,14, 25,26,27,29,30]:
        if x<15:
            df.loc[x,'sector'] = df.loc[x,'sector']+' consumption'
        if x>15:
            df.loc[x,'sector'] = df.loc[x,'sector']+' production'
        
        #print("sector amendments")#reset_index
    df = df.reset_index()
    return df

The second is a loop to perform cleaning on each selected sheet then merge the resulting dataframes in one.

# Methodology :loop over the country list and create a dataframe for each country then append the dataframes

#create an empty dataframe
df_final = pd.DataFrame()

# loop over the country list
for i in country_list:
    
    #perform the function clean_table_bp on each element in the country list
    df = clean_table_bp(i)
    
    # merge the resulting dataframes
    df_final = pd.concat([df, df_final], axis = 0)
    
# return the final dataframe    
df_final

This process saves 2 hours of manual work to clean and merge the data.

In the next post, we will discuss how we can benchmark time series data in Python in seconds.

How to extract and clean data from multiple excel sheets using python

Performing analysis on data from an excel sheet is an everyday task for many professionals and students.

Extracting Multiple Excel Sheets

Written by Mustafa Adel Amer