Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. EDA is used to understand what the data can tell us beyond the formal modeling or hypothesis testing task. It is a crucial step in the data analysis process. ^[[AI and Data Scientist Roadmap]]
EDA with Pandas
Data Preparation
During data preparation, we explore and clean the dataset so that it is ready for the next step.
First look at the DataFrame
After loading the dataset with pandas, we usually start by looking at general information about the DataFrame, such as:
- `df.head()` to see the first rows;
- `df.columns` to list the columns (an attribute, so no parentheses);
- `df.shape` to see the number of rows and columns;
- `df.describe()` to see summary statistics, such as count, mean and std, for all the numeric columns.
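The first-look calls above can be sketched as follows. The toy DataFrame and its column names (`name`, `location`, `speed_mph`, `height_ft`) are hypothetical stand-ins for whatever dataset is actually loaded.

```python
import pandas as pd

# Hypothetical toy data standing in for a real dataset (e.g. loaded with pd.read_csv).
df = pd.DataFrame(
    {
        "name": ["Alpha", "Bravo", "Charlie", "Delta"],
        "location": ["Ohio", "Texas", "Ohio", "Florida"],
        "speed_mph": [60.0, 72.5, 55.0, 80.0],
        "height_ft": [110, 150, 95, 200],
    }
)

print(df.head(2))  # first two rows (head() defaults to five)
print(df.columns)  # column labels
print(df.shape)    # (number of rows, number of columns)

# Summary statistics are computed for the numeric columns only by default.
summary = df.describe()
print(summary)
```

Note that `describe()` skips the two text columns here; passing `include="all"` would add them to the summary.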
Cleaning process
In the cleaning process, we can perform a series of operations:
- Drop columns that are not needed (or select just the columns that we need);
- Convert the `dtype` of certain columns (example: convert a date from a string to a `datetime` object);
- Rename columns with `df.rename(columns={"old_name": "new_name"})`;
- Drop or fill `NaN` values, which can be located with `df.isna()` and counted per column with `df.isna().sum()`;
- Drop duplicate rows, which can be located with `df.duplicated(subset=["column1", "column2"])`.
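A minimal sketch of these cleaning operations, applied in sequence. The data and column names are invented for illustration, and the order of the steps is one reasonable choice rather than a fixed recipe.

```python
import pandas as pd

# Hypothetical raw data with typical problems: an unneeded column, dates stored
# as strings, an awkward column name, a missing value, and a duplicated row.
raw = pd.DataFrame(
    {
        "coaster_name": ["Alpha", "Bravo", "Bravo", "Charlie"],
        "Opening date": ["2001-05-01", "2010-07-15", "2010-07-15", "1999-03-20"],
        "speed_mph": [60.0, 72.5, 72.5, None],
        "notes": ["", "", "", ""],
    }
)

# Drop a column that is not needed.
df = raw.drop(columns=["notes"])

# Rename a column.
df = df.rename(columns={"Opening date": "opening_date"})

# Convert the dtype: string -> datetime.
df["opening_date"] = pd.to_datetime(df["opening_date"])

# Count missing values per column (isna() gives booleans, sum() counts the True ones).
missing_per_column = df.isna().sum()

# Locate duplicates on a subset of columns, then keep only the first occurrence.
duplicate_mask = df.duplicated(subset=["coaster_name", "opening_date"])
df = df.loc[~duplicate_mask].reset_index(drop=True)

# Drop rows with a missing speed (df.fillna(...) would fill them instead).
df = df.dropna(subset=["speed_mph"]).reset_index(drop=True)
print(df)
```

Whether to drop or fill missing values depends on the dataset; dropping is used here only because it keeps the example short.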
Feature Understanding
This step helps identify which features are meaningful, see how they are distributed, and check for outliers. Useful operations are:
- Count the values of a column with `df["column"].value_counts()`, which can also be plotted with `.plot(kind="bar", title="Plot title")`.
- Analyze the distribution of values inside a column with a histogram or density plot.
- Analyze the relationships between features with scatterplots, correlation heatmaps and pairplots.
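The operations above might look like the sketch below, which uses only pandas and matplotlib. The data is hypothetical, and the `Agg` backend is an assumption made so the code runs without a display. The correlation matrix computed at the end is the table that a correlation heatmap would visualize.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend, so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical data; column names are illustrative only.
df = pd.DataFrame(
    {
        "location": ["Ohio", "Texas", "Ohio", "Florida", "Ohio", "Texas"],
        "speed_mph": [60.0, 72.5, 55.0, 80.0, 65.0, 70.0],
        "height_ft": [110, 150, 95, 200, 120, 140],
    }
)

# Count how often each value appears in a categorical column.
location_counts = df["location"].value_counts()

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
# Bar plot of the value counts.
location_counts.plot(kind="bar", title="Coasters per location", ax=axes[0])
# Histogram to inspect the distribution of a numeric column.
df["speed_mph"].plot(kind="hist", bins=5, title="Speed distribution", ax=axes[1])
# Scatterplot to inspect the relationship between two features.
df.plot(kind="scatter", x="height_ft", y="speed_mph", title="Speed vs. height", ax=axes[2])
fig.tight_layout()

# Pairwise Pearson correlation between the numeric features.
correlation = df[["speed_mph", "height_ft"]].corr()
print(location_counts)
print(correlation)
```

In a notebook the figure renders inline; in a script, `fig.savefig(...)` or `plt.show()` with an interactive backend would be needed to view it.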
Ask questions
Finally, we can ask questions about the dataset (example: “What are the locations with the fastest roller coaster?”) and answer them with plots or statistics. This usually involves filtering the necessary data (much like a SQL query) and producing plots or statistics from the filtered DataFrame.
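As an illustration, the roller coaster question could be answered with a filter followed by a group-by aggregation. The data, the column names and the "opened in 2000 or later" filter are assumptions made up for this sketch.

```python
import pandas as pd

# Hypothetical data; names and values are invented for illustration.
df = pd.DataFrame(
    {
        "name": ["Alpha", "Bravo", "Charlie", "Delta", "Echo"],
        "location": ["Ohio", "Texas", "Ohio", "Florida", "Texas"],
        "speed_mph": [60.0, 72.5, 55.0, 80.0, 70.0],
        "year": [1995, 2005, 2012, 2018, 2001],
    }
)

# Filter rows, similar to a SQL WHERE clause: coasters opened in 2000 or later.
recent = df[df["year"] >= 2000]

# Aggregate, similar to GROUP BY ... ORDER BY: top speed per location, fastest first.
fastest_by_location = (
    recent.groupby("location")["speed_mph"].max().sort_values(ascending=False)
)
print(fastest_by_location)
```

The resulting Series can be passed straight to `.plot(kind="bar")` to turn the answer into a chart.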
tags: data-science
resources: