Extract and clean data from Excel using Python

Example on extracting and cleaning BP Statistical Review of World Energy using Python

Mustafa Adel Amer
3 min readOct 13, 2021

BP statistical review of the world energy is published annually to show the views of BP on the future of energy in different scenarios.

You can review the scenario building process here.

The excel workbook that has perfectly shows the data but has empty cells that hinder working directly with the data. Our goal here is to show how we can use python code to extract the data in clean format from the BP Statistical Review. We will present here how to get the data in clean formate from the workbook using Python Pandas. on the example of Brazil data.

How does the data look like in the excel sheet?

The excel sheet has many empty rows and columns that would require manual work to get the data in a clean format.

How to extract the data in a clean format?

1.Read the data in Jupyter Notebook.

#read the excel workbook as a whole
xls = pd.ExcelFile(‘bp-energy-outlook-2020-summary-tables.xlsx’)

2. Read the Brazil excel sheet from the file.

#read specific worksheet from the excel workbook
brazil = pd.read_excel(xls, ‘Brazil’)

The file after reading in python without data wrangling

The third row with index 2 has the column names. How can we use it for column names?

3. Adjust the column names.

brazil.iloc[2].tolist()

brazil.columns =

[‘sector’, ‘1995’, ‘2000’,’2005', ‘2010’, ‘2015’, ‘2018’, ‘1995-2018’, ‘empty1’, ‘2025 Rapid scenario’, ‘2030 Rapid scenario’, ‘2035 Rapid scenario’,
‘2040 Rapid scenario’, ‘2045 Rapid scenario’, ‘2050 Rapid scenario’, ‘2018–2050’, ‘empty2’, ‘2025 Net Zero scenario’, ‘2030 Net Zero scenario’, ‘2035 Net Zero scenario’, ‘2040 Net Zero scenario’, ‘2045 Net Zero scenario’, ‘2050 Net Zero scenario’, ‘2018–2050’, ‘empty3’, ‘2025 Bussiness as usual, ‘2030 Bussiness as usual, ‘2035 Bussiness as usual, ‘2040 Bussiness as usual, ‘2045 Bussiness as usual, ‘2050 Bussiness as usual, ‘2018–2050’]

4. Identify rows of interest.

sectors_int = [‘Oil’, ‘Natural gas+’, ‘Coal’, ‘Nuclear’, ‘Hydro’, ‘Renewables*’, ‘Oil (Mbd)’, ‘Natural gas+ (Bcm)’, ‘Transport’, ‘Industry’, ‘Non-combusted’, ‘Buildings’, ‘Inputs to power’, ‘Oil’, ‘Natural gas+’, ‘Coal’, ‘Oil (Mbd)’, ‘Natural gas+ (Bcm)’, ‘Net CO2 emissions (Mt)’]

Note that the rows of interest have oil, Natural gas, Coal repeated twice for production and consumption. We need to find a way to differentiate between them.

In the sector column, there are repeated entries that we want to differentiate

5. Identify the index values of production and consumption data to erase confusion.

consumption_inx = [6,7,8,9,10,11,13,14]
production_inx = [25,26,27,29,30]
idx_list = [6,7,8,9,10,11,13,14, 25,26,27,29,30]

6. Adjust the names of sector entries based on the index value.

Your data is clean now in the form below. The only missing thing is the way you want to treat missing data but that is not the scope for now.

Final clean dataset.

The full code is available on Github.

Do you think there was a better way to get this task done? please share with us in the comments below.

--

--