Exploratory Data Analysis (EDA) is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. EDA is used to understand what the data can tell us beyond the formal modeling or hypothesis testing task. It is a crucial step in the data analysis process. ^[[AI and Data Scientist Roadmap]]

EDA with Pandas

Data Preparation

During data preparation, we explore and clean the dataset so that it is ready for the next step.

First look at the Dataframe

After loading the dataset with pandas, we usually start by looking at general information about the DataFrame, such as:

  • df.head() to see the first rows (five by default);
  • df.columns to list the columns (an attribute, not a method);
  • df.shape to see the number of rows and columns, in that order;
  • df.describe() to see summary statistics, such as count, mean and std, for all the numeric columns.
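
A minimal sketch of this first look, assuming a small made-up DataFrame in place of a real dataset (the column names and values are hypothetical, chosen only for illustration):

```python
import pandas as pd

# Hypothetical toy data standing in for a dataset loaded from a file.
df = pd.DataFrame({
    "name": ["Alpha", "Bravo", "Comet", "Dart"],
    "location": ["Ohio", "Texas", "Ohio", "Japan"],
    "speed_mph": [55.0, 72.5, 64.0, 95.0],
    "year": [1995, 2001, 2010, 2018],
})

print(df.head(2))            # first two rows

columns = list(df.columns)   # columns is an attribute, so no parentheses
n_rows, n_cols = df.shape    # shape is (rows, columns)

summary = df.describe()      # by default, only numeric columns are summarized
print(summary)
```

Note that describe() skips the two string columns here; passing include="all" would add them to the summary.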

Cleaning process

In the cleaning process, we can perform a series of operations:

  • Drop columns that are not needed (or select just the columns that we need);
  • Convert the dtype of certain columns (example: convert a date from a string to a datetime object);
  • Rename columns with df.rename(columns={"old_name": "new_name"});
  • Drop or fill NaN values, which can be located with df.isna() and counted per column with df.isna().sum();
  • Drop duplicate rows, which can be located with df.duplicated(subset=["column1", "column2"]) and removed with df.drop_duplicates(subset=["column1", "column2"]).
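
The cleaning operations above could be chained roughly as follows. This is a sketch on invented data: the column names ("Coaster Name", "opened", "notes") are assumptions made for the example, and the order of the steps will vary from dataset to dataset.

```python
import pandas as pd

# Hypothetical raw data with an unneeded column, string dates,
# missing values, and one duplicated row.
raw = pd.DataFrame({
    "Coaster Name": ["Alpha", "Bravo", "Bravo", "Comet"],
    "opened": ["1995-05-01", "2001-07-15", "2001-07-15", "2010-03-20"],
    "speed_mph": [55.0, None, None, 64.0],
    "notes": ["", "", "", ""],
})

df = raw.drop(columns=["notes"])                   # drop a column we do not need
df["opened"] = pd.to_datetime(df["opened"])        # string -> datetime
df = df.rename(columns={"Coaster Name": "name"})   # rename a column

missing_per_column = df.isna().sum()               # NaN count per column

duplicate_mask = df.duplicated(subset=["name", "opened"])  # True for repeated rows
df = df.loc[~duplicate_mask]                       # keep the first occurrence only

df = df.dropna(subset=["speed_mph"]).reset_index(drop=True)  # drop rows missing speed
```

Whether to drop or fill missing values (for example with df["speed_mph"].fillna(...)) depends on the analysis; dropping is used here only to keep the sketch short.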

Feature Understanding

This step helps identify which features are meaningful, how their values are distributed, and whether there are any outliers. Useful operations are:

  1. Count the values for each column with df["column"].value_counts(), which can also be plotted with .plot(kind="bar", title="Plot title").
  2. Analyze the distribution of values inside a column with a histogram or density plot.
  3. Analyze the relationships between features with scatterplots, correlation heatmaps and pairplots.
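
As a rough illustration, the numbers behind these plots can be computed with pandas alone. The data below is invented, and the plotting calls are left as comments because they assume matplotlib is installed.

```python
import pandas as pd

# Hypothetical data: one categorical and two numeric features.
df = pd.DataFrame({
    "location": ["Ohio", "Texas", "Ohio", "Japan", "Ohio", "Texas"],
    "speed_mph": [55.0, 72.5, 64.0, 95.0, 60.0, 80.0],
    "height_ft": [100.0, 150.0, 120.0, 250.0, 110.0, 170.0],
})

# 1. Value counts for a column (most frequent value first).
location_counts = df["location"].value_counts()
# location_counts.plot(kind="bar", title="Coasters per location")

# 2. Distribution of a numeric column: the bin counts a histogram would draw.
speed_bins = pd.cut(df["speed_mph"], bins=[50, 65, 80, 100]).value_counts().sort_index()
# df["speed_mph"].plot(kind="hist", bins=3, title="Speed (mph)")

# 3. Relationship between numeric features: the matrix behind a correlation heatmap.
corr = df[["speed_mph", "height_ft"]].corr()
# df.plot(kind="scatter", x="speed_mph", y="height_ft")

print(location_counts, speed_bins, corr, sep="\n\n")
```

The bin edges in pd.cut are arbitrary choices for this example; in practice they would be picked after looking at the range of the data.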

Ask questions

Finally, we can ask questions about the dataset (example: “What are the locations with the fastest roller coaster?”) and answer them with plots or statistics. This usually involves filtering the necessary data (much like an SQL query) and then producing plots or statistics from the filtered DataFrame.
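
One way to answer the example question is sketched below, again on made-up data with assumed column names; the 70 mph threshold is arbitrary and only shows what a filter looks like.

```python
import pandas as pd

# Hypothetical roller coaster data.
df = pd.DataFrame({
    "name": ["Alpha", "Bravo", "Comet", "Dart", "Echo", "Fury"],
    "location": ["Ohio", "Texas", "Ohio", "Japan", "Ohio", "Texas"],
    "speed_mph": [55.0, 72.5, 64.0, 95.0, 60.0, 80.0],
})

# Top speed per location, fastest first (similar to GROUP BY ... ORDER BY in SQL).
fastest_by_location = (
    df.groupby("location")["speed_mph"]
    .max()
    .sort_values(ascending=False)
)

# Row-level filter, similar to a WHERE clause.
fast_only = df.query("speed_mph > 70")

print(fastest_by_location)
print(fast_only)
```

The resulting Series can be passed straight to .plot(kind="bar") when a chart is a better answer than a table.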


tags: data-science resources: