Step-by-step data analysis process with code using Python
The Data Analyst Nanodegree offered by Udacity is among the best out there for learning Python and gaining hands-on experience through well-structured projects.
Data analysis is using data to answer questions and deliver insights that can create value or reshape understanding.
The course defines the data analysis process through the following steps:
- Question — What are you trying to answer?
- Wrangling — gathering, assessing completeness, and cleaning the data in hand.
- Exploration — exploring, augmenting, and learning from the data.
- Draw conclusions — distill the insights and learnings gathered from exploring the data.
- Communicate findings
The following summarizes the steps presented in the course for preparing datasets for analysis using Python.
Analyze datasets in practice using pandas
1] Gather the data
Read the dataset from its source; it can be a CSV file, an Excel sheet, an API, a webpage, or a database.
import pandas as pd

df = pd.read_csv('file_name.csv')
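As a minimal, self-contained sketch of the gathering step (the file name and columns here are invented — we first write a tiny CSV so the read call has something to load):

```python
import pandas as pd

# create a tiny CSV so the example is self-contained
pd.DataFrame({"city": ["Cairo", "Oslo"], "temp": [35, 8]}).to_csv(
    "file_name.csv", index=False
)

# gather: read the dataset from its CSV source
df = pd.read_csv("file_name.csv")
print(df.shape)  # (2, 2)
```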
2] Assessing the data
Aims at getting a general understanding of the size and shape of the dataset, identifying missing data, knowing the existing data types, and getting summary statistics.
The steps are as follows:
- Load packages
- Understand how the data look
- Get summary statistics
- Identify duplicated rows
- Identify missing values
- Count the number of missing values in each column
- Identify data types
- Identify unique values
- Count unique values in specific columns
# get the first few rows of the dataset
df.head()
# understand the existing data types
df.dtypes
# get the size of the dataframe
df.shape
# info on the number of non-NA data points in each column
df.info()
# count the number of missing values in each column
df.isnull().sum()
# number of unique entries in each column
df.nunique()
# summary statistics on numerical columns
df.describe()
# slice the dataframe to assess a specific section
df.iloc[row_index, column_index]
# count unique values in a specific column
df['column_name'].value_counts()
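The assessment calls above can be tried end to end on a small, made-up dataframe (the column names and values are purely illustrative):

```python
import pandas as pd
import numpy as np

# hypothetical dataset with one missing value and one duplicated row
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Bob", "Eve"],
    "age": [25, 30, 30, np.nan],
})

print(df.shape)               # (4, 2)
print(df.dtypes)              # name: object, age: float64
print(df.isnull().sum())      # name 0, age 1
print(df.duplicated().sum())  # 1 duplicated row
print(df["name"].nunique())   # 3 unique names
```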
3] Data cleaning
Data wrangling (cleaning) prepares the data for analysis. It aims at complete data that has no null values or duplicated entries and has the right columns at the right data types. The process falls into three general categories: handling incorrect data types, handling missing data (removing or replacing it), and handling duplicate data.
- Drop unnecessary columns.
# drop unnecessary column
df.drop(['column_1', 'column_2'], axis=1, inplace=True)
- Rename columns
# rename column names and index
df.rename(columns={'old_column_name': 'new_name'}, inplace=True)
df.rename(index={old_index: new_value}, inplace=True)
- Replace spaces in column names with underscores and strip leading/trailing spaces
#remove spaces in column names
df.rename(columns=lambda x: x.strip().lower().replace(' ', '_'), inplace=True)
# or
df.columns = [column.replace(' ', '_') for column in df.columns]
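Both approaches can be checked on a throwaway dataframe (the column names here are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({" First Name ": [1], "Last Name": [2]})

# normalize: strip whitespace, lowercase, replace spaces with underscores
df.rename(columns=lambda x: x.strip().lower().replace(" ", "_"), inplace=True)
print(list(df.columns))  # ['first_name', 'last_name']
```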
- Adjust data types
df['column_name'] = df['column_name'].astype('required_data_type')
- Query Data — get specific data to look at
df = df.query('column_name == "the_filter_value"')
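Note that `query` takes a single string expression, with string literals quoted inside it. A minimal check with invented column names:

```python
import pandas as pd

df = pd.DataFrame({"airline": ["KLM", "SAS", "KLM"], "delay": [5, 12, 30]})

# keep only the KLM rows; note the inner quoting of the string literal
klm = df.query('airline == "KLM"')
print(len(klm))  # 2
```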
- Drop rows with null values “NA”
# drop rows with Na values
df.dropna(inplace = True)
# fill missing data with mean
df['column'] = df['column'].fillna(df['column'].mean())
# check if any column in this dataframe has NA values
df.isnull().sum().any()
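Filling with the column mean can be verified on a tiny hypothetical column (here the mean of 10 and 20 is 15):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"score": [10.0, np.nan, 20.0]})

# replace the missing value with the mean of the non-missing values
df["score"] = df["score"].fillna(df["score"].mean())
print(df["score"].isnull().any())  # False
```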
- Drop duplicated rows
# drop duplicated data
df.drop_duplicates(inplace=True)
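A quick sanity check of duplicate removal on made-up data:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": ["x", "x", "y"]})

# drop fully duplicated rows (the first occurrence is kept)
df.drop_duplicates(inplace=True)
print(len(df))  # 2
```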
- Save the resulting clean dataset
Full list of cleaning codes
# drop unnecessary columns
df.drop(['column_1', 'column_2'], axis=1, inplace=True)
# rename column names and index
df.rename(columns={'old_column_name': 'new_name'}, inplace=True)
df.rename(index={old_index: new_value}, inplace=True)
# remove spaces in column names
df.rename(columns=lambda x: x.strip().lower().replace(' ', '_'), inplace=True)
# or
df.columns = [column.replace(' ', '_') for column in df.columns]
# convert a column into datetime
df['time_column'] = pd.to_datetime(df['time_column'])
# adjust the data type to string
df['column_name'] = df['column_name'].astype('str')
# query the dataframe
df = df.query('column_name == "the_filter_value"')
# drop rows with NA values
df.dropna(inplace=True)
# fill missing data with the column mean
df['column'] = df['column'].fillna(df['column'].mean())
# check if any column in this dataframe has NA values
df.isnull().sum().any()
# get duplicated rows
df.duplicated()
# drop duplicated rows
df.drop_duplicates(inplace=True)
# save the resulting clean dataframe
df.to_csv('file_name.csv', index=False)
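Putting the cleaning steps together, here is a runnable end-to-end sketch on a small, invented airline-style dataset (column names, values, and the output file name are all assumptions for illustration):

```python
import pandas as pd

# hypothetical raw data combining the issues above:
# messy column names, a missing value, and a duplicated row
raw = pd.DataFrame({
    " Flight Date ": ["2021-01-01", "2021-01-02", "2021-01-02", None],
    "Delay": [5, 12, 12, 7],
})

# clean the column names
raw.columns = [c.strip().lower().replace(" ", "_") for c in raw.columns]
# adjust the data type of the date column
raw["flight_date"] = pd.to_datetime(raw["flight_date"])
# drop rows with NA values, then drop duplicated rows
raw.dropna(inplace=True)
raw.drop_duplicates(inplace=True)
# save the resulting clean dataset
raw.to_csv("clean_flights.csv", index=False)
print(raw.shape)  # (2, 2)
```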
The data preparation steps are demonstrated in this project on airline data analysis, offered as part of the Nanodegree.