Step-by-step data analysis process with code using Python
The Data Analyst Nanodegree offered by Udacity is among the best out there for learning Python and gaining hands-on experience through well-structured projects.
Data analysis is using data to answer questions and deliver insights that can create value or reshape understanding.
The course defines the data analysis process through the following steps:
- Question — What are you trying to answer?
- Wrangling — gathering, assessing completeness, and cleaning the data in hand.
- Exploration — exploring, augmenting, and learning from the data.
- Draw conclusions — distill the insights and learnings gathered from exploring the data.
- Communicate findings
The following summarizes the steps presented in the course for preparing datasets for analysis using Python.
Analyze datasets in practice using pandas
1] Gather the data
Read the dataset from its source; it can be a CSV file, an Excel sheet, an API, a webpage, or a database.
import pandas as pd

df = pd.read_csv('file_name.csv')
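As a minimal, self-contained sketch of the gathering step (the file name and columns here are invented — we first write a tiny CSV so the read call has something to load):

```python
import pandas as pd

# create a tiny CSV so the example is self-contained
pd.DataFrame({"city": ["Cairo", "Oslo"], "temp": [35, 8]}).to_csv(
    "file_name.csv", index=False
)

# gather: read the dataset from its CSV source
df = pd.read_csv("file_name.csv")
print(df.shape)  # (2, 2)
```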
2] Assessing the data
Aims at getting a general understanding of the size and shape of the dataset, identifying missing data, knowing the existing data types, and getting summary statistics.
The steps are as follows:
- Load packages
- Understand how the data look
- Get summary statistics
- Identify duplicated rows
- Identify missing values
- Count the number of missing values in each column
- Identify data types
- Identify unique values
- Count unique values in specific columns
# get the first few rows of the dataset
df.head()
# understand the existing data types
df.dtypes
# get the size of the dataframe
df.shape
# info on the number of non-NA data points in each column
df.info()
# count the number of missing values in each column
df.isnull().sum()
# number of unique entries in each column
df.nunique()
# summary statistics on numerical columns
df.describe()
# slice the dataframe to assess a specific section
df.iloc[row_index, column_index]
# count unique values in a specific column
df['column_name'].value_counts()
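The assessment calls above can be tried end to end on a small, made-up dataframe (the column names and values are purely illustrative):

```python
import pandas as pd
import numpy as np

# hypothetical dataset with one missing value and one duplicated row
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Eve"],
    "age": [25, 30, 30, np.nan],
})

print(df.shape)               # (4, 2)
print(df.dtypes)              # name: object, age: float64
print(df.isnull().sum())      # name 0, age 1
print(df.duplicated().sum())  # 1 duplicated row
print(df["name"].nunique())   # 3 unique names
```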
3] Data cleaning
Data wrangling (cleaning) prepares the data for analysis. It aims at complete data that has no null values or duplicated entries and has the right columns at the right data types. The process falls into three general categories: handling incorrect data types, handling missing data (removing or replacing it), and handling duplicate data.
- Drop unnecessary columns.
# drop unnecessary column
df.drop(['column_1', 'column_2'], axis=1, inplace=True)
- Rename columns
# rename column names and index
df.rename(columns={'old_column_name': 'new_name'}, inplace=True)
df.rename(index={old_index: new_value}, inplace=True)
- Replace spaces in column names with underscores and strip leading/trailing spaces
#remove spaces in column names
df.rename(columns=lambda x: x.strip().lower().replace(' ', '_'), inplace=True)
# or
df.columns = [column.replace(' ', '_') for column in df.columns]
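Both approaches can be checked on a throwaway dataframe (the column names here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({" First Name ": [1], "Last Name": [2]})

# normalize: strip whitespace, lowercase, replace spaces with underscores
df.rename(columns=lambda x: x.strip().lower().replace(" ", "_"), inplace=True)
print(list(df.columns))  # ['first_name', 'last_name']
```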
- Adjust data types
df['column_name'] = df['column_name'].astype('required_data_type')
- Query Data — get specific data to look at
df = df.query('column_name == "the_filter_value"')
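Note that `query` takes a single string expression, with string literals quoted inside it. A minimal check with invented column names:

```python
import pandas as pd

df = pd.DataFrame({"airline": ["KLM", "SAS", "KLM"], "delay": [5, 12, 30]})

# keep only the KLM rows; note the inner quoting of the string literal
klm = df.query('airline == "KLM"')
print(len(klm))  # 2
```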
- Drop rows with null values “NA”
# drop rows with Na values
df.dropna(inplace = True)
# fill missing data with mean
df['column'] = df['column'].fillna(df['column'].mean())
# check if any column in this dataframe has NA values
df.isnull().sum().any()
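Filling with the column mean can be verified on a tiny hypothetical column (here the mean of 10 and 20 is 15):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"score": [10.0, np.nan, 20.0]})

# replace the missing value with the mean of the non-missing values
df["score"] = df["score"].fillna(df["score"].mean())
print(df["score"].isnull().any())  # False
```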
- Drop duplicated rows
# drop duplicated data
df.drop_duplicates(inplace=True)
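A quick sanity check of duplicate removal on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

# drop fully duplicated rows (the first occurrence is kept)
df.drop_duplicates(inplace=True)
print(len(df))  # 2
```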
- Save the resulting clean dataset
Full list of cleaning codes
# drop unnecessary columns
df.drop(['column_1', 'column_2'], axis=1, inplace=True)
# rename column names and index
df.rename(columns={'old_column_name': 'new_name'}, inplace=True)
df.rename(index={old_index: new_value}, inplace=True)
# remove spaces in column names
df.rename(columns=lambda x: x.strip().lower().replace(' ', '_'), inplace=True)
# or
df.columns = [column.replace(' ', '_') for column in df.columns]
# convert a column into datetime
df['time_column'] = pd.to_datetime(df['time_column'])
# adjust the data type to string
df['column_name'] = df['column_name'].astype('str')
# query the dataframe
df = df.query('column_name == "the_filter_value"')
# drop rows with NA values
df.dropna(inplace=True)
# fill missing data with the column mean
df['column'] = df['column'].fillna(df['column'].mean())
# check if any column in this dataframe has NA values
df.isnull().sum().any()
# get duplicated rows
df.duplicated()
# drop duplicated rows
df.drop_duplicates(inplace=True)
# save the resulting clean dataframe
df.to_csv('file_name.csv', index=False)
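Putting the cleaning steps together, here is a runnable end-to-end sketch on a small, invented airline-style dataset (column names, values, and the output file name are all assumptions for illustration):

```python
import pandas as pd

# hypothetical raw data combining the issues above:
# messy column names, a missing value, and a duplicated row
raw = pd.DataFrame({
    " Flight Date ": ["2021-01-01", "2021-01-02", "2021-01-02", None],
    "Delay": [5, 12, 12, 7],
})

# clean the column names
raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]
# adjust the data type of the date column
raw["flight_date"] = pd.to_datetime(raw["flight_date"])
# drop rows with NA values, then drop duplicated rows
raw.dropna(inplace=True)
raw.drop_duplicates(inplace=True)
# save the resulting clean dataset
raw.to_csv("clean_flights.csv", index=False)
print(raw.shape)  # (2, 2)
```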
The data preparation steps are demonstrated in this project on airline data analysis, offered as part of the Nanodegree.