Content

Step 1: Loading and Inspecting the Dataset
Step 2: Basic Dataset Information
Step 3: Identifying Duplicate Entries
Step 4: Exploring Unique Values
Step 5: Visualizing Counts of Unique Values
Step 6: Detecting Missing Values
Step 7: Handling Missing Data
Step 8: Checking Data Types
Step 9: Filtering the Dataset
Step 10: Box Plot for Quick Visualization
Step 11: Correlation Matrix
Conclusion

Vijona

7 Feb at 9:45

Exploratory Data Analysis (EDA) with Python: An In-Depth Guide Using Essential Functions

In data analysis, understanding your dataset’s structure and distribution is crucial before making any interpretations or applying models. Exploratory Data Analysis (EDA) provides this understanding through systematic exploration. Here, we’ll focus on using Python functions to gain insights without relying heavily on graphical methods, though we’ll also touch on some visualization techniques.

Step 1: Loading and Inspecting the Dataset

We’ll start with the Titanic dataset, a popular dataset in data analysis, and set up the environment by importing necessary libraries.

import pandas as pd
import numpy as np
import seaborn as sns

# Load the data
df = pd.read_csv('titanic.csv')

# Preview the data
df.head()

This code loads the Titanic dataset and displays its first five rows, giving you a quick overview of its structure.
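If you want to try the loading step without the CSV file at hand, the sketch below reads a tiny in-memory sample instead. The three rows and the name `sample` are made up for illustration; only the column names follow the Titanic layout used in this guide.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the Titanic column layout (illustrative values only)
csv_text = """PassengerId,Survived,Pclass,Sex,Age,Fare
1,0,3,male,22,7.25
2,1,1,female,38,71.28
3,1,3,female,,7.92
"""

# read_csv accepts any file-like object, so StringIO stands in for 'titanic.csv'
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)    # (number of rows, number of columns)
print(sample.head(2))  # first two rows
print(sample.tail(1))  # last row
```

Checking `shape` alongside `head()` is a cheap way to confirm that the whole file was read and not just the part you can see.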

Step 2: Basic Dataset Information

It’s important to familiarize yourself with the dataset’s structure. The info() and describe() functions provide a high-level summary of the data.

# Basic information about the dataset
df.info()

# Descriptive statistics
df.describe()

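The info() function lists each column’s data type and non-null count, which exposes columns with missing values, while describe() provides basic statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for numerical columns.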
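By default describe() skips text columns. A minimal sketch of how to summarize them as well, using a small made-up frame (the values and the name `sample` are assumptions, not rows from the real dataset):

```python
import pandas as pd

# Illustrative data only; column names mirror the Titanic dataset
sample = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "S"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Default behaviour: numeric columns only
numeric_summary = sample.describe()

# Excluding numeric columns yields count, unique, top and freq for the rest
categorical_summary = sample.describe(exclude=["number"])

print(numeric_summary)
print(categorical_summary)
```

For non-numeric columns the summary reports how many distinct values exist and which one is most frequent, which is usually more informative than a mean.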

Step 3: Identifying Duplicate Entries

Duplicate data can bias results, so it’s good to identify any duplicate rows early on.

# Count duplicate rows
df.duplicated().sum()

A result of 0 means that no row is an exact copy of another.
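When the count is not 0, you will usually want to look at the duplicated rows and then drop them. The sketch below shows one possible way, on a made-up frame with one repeated row (all values are illustrative assumptions):

```python
import pandas as pd

# Illustrative data: the first and third rows are identical
sample = pd.DataFrame({
    "Name": ["Allen", "Braund", "Allen", "Cumings"],
    "Pclass": [3, 3, 3, 1],
    "Fare": [8.05, 7.25, 8.05, 71.28],
})

# keep=False marks every member of a duplicate group, not just the later copies
dupe_mask = sample.duplicated(keep=False)
print(sample[dupe_mask])

# Keep the first occurrence of each row and renumber the index
deduped = sample.drop_duplicates().reset_index(drop=True)
print(len(sample), "->", len(deduped))
```

Inspecting the flagged rows before dropping them helps you decide whether they are true duplicates or legitimate repeated observations.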

Step 4: Exploring Unique Values

Understanding the range of values within categorical columns is helpful, especially for feature analysis.

# Unique values in specific columns
print(df['Pclass'].unique())
print(df['Survived'].unique())
print(df['Sex'].unique())

This returns the distinct values within each specified column.
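To survey every column at once instead of printing them one by one, nunique() gives the number of distinct values per column. The sketch below uses a made-up frame, and the threshold of 3 for "low cardinality" is an arbitrary assumption chosen for the example.

```python
import pandas as pd

# Illustrative data only
sample = pd.DataFrame({
    "Pclass": [3, 1, 3, 2, 1],
    "Survived": [0, 1, 1, 0, 1],
    "Sex": ["male", "female", "female", "male", "female"],
    "Fare": [7.25, 71.28, 7.92, 13.0, 53.1],
})

# Number of distinct values in each column
unique_counts = sample.nunique()
print(unique_counts)

# Columns with few distinct values are likely categorical (threshold is an assumption)
low_cardinality = unique_counts[unique_counts <= 3].index.tolist()
print(low_cardinality)
```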

Step 5: Visualizing Counts of Unique Values

Visualizations like count plots make it easier to see the frequency of categories within a column.

# Count plot for unique values in 'Pclass'
sns.countplot(x='Pclass', data=df)

This plot reveals the distribution of values in the Pclass column.
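The numbers behind a count plot can also be obtained directly with value_counts(). A short sketch on made-up values (the data below is an assumption for illustration):

```python
import pandas as pd

# Illustrative passenger classes
sample = pd.DataFrame({"Pclass": [3, 1, 3, 2, 3, 1]})

# Absolute frequency of each class, ordered by class label
counts = sample["Pclass"].value_counts().sort_index()

# Relative frequency (shares that sum to 1)
shares = sample["Pclass"].value_counts(normalize=True).sort_index()

print(counts)
print(shares.round(2))
```

Relative frequencies are handy when you want to compare the same column across subsets of different sizes.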

Step 6: Detecting Missing Values

Missing values can impact analysis quality. Chaining isnull() and sum() counts the null entries in each column.

# Check for null values
df.isnull().sum()

This reveals that ‘Age’ and ‘Cabin’ have missing values, which you’ll need to address for thorough analysis.
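Raw counts are easier to judge when expressed as a share of all rows. Since the mean of a boolean column is the fraction of True values, isnull().mean() gives that share directly. The frame below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data with gaps in 'Age' and 'Cabin'
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Fraction of missing entries per column, largest first
missing_share = sample.isnull().mean().sort_values(ascending=False)

# Keep only the columns that actually have gaps
incomplete = missing_share[missing_share > 0]
print(incomplete)
```

A column that is mostly empty may be a candidate for dropping, whereas one with only a few gaps is usually worth filling.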

Step 7: Handling Missing Data

One way to address missing values is by replacing them with a specific value, such as 0.

# Replace missing values with 0
df.replace(np.nan, 0, inplace=True)

# Verify changes
df.isnull().sum()

This fills every null value with 0. For a column such as ‘Age’, however, 0 is a misleading placeholder that pulls the mean down, so filling with the column mean or median may be preferable depending on the context.
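As a sketch of the alternative mentioned above, the example below fills a numeric column with its median and a text column with a placeholder label. The data, the choice of the median, and the label "Unknown" are all assumptions made for illustration; the right strategy depends on your analysis.

```python
import numpy as np
import pandas as pd

# Illustrative data with gaps
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 38.0],
    "Cabin": [np.nan, "C85", np.nan, "E46"],
})

# Work on a copy so the original stays available for comparison
filled = sample.copy()

# Numeric column: the median is less sensitive to extreme values than the mean
filled["Age"] = filled["Age"].fillna(filled["Age"].median())

# Text column: an explicit label keeps "missing" distinguishable from real values
filled["Cabin"] = filled["Cabin"].fillna("Unknown")

print(filled)
```

Treating each column separately avoids writing a numeric 0 into a text column, which is what a blanket replacement does.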

Step 8: Checking Data Types

Understanding data types is crucial, as it guides you in selecting appropriate analysis techniques for each attribute.

# Check data types of each column
df.dtypes

The dtypes attribute lists each column’s data type, helping distinguish numerical from categorical data.
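Once you know the data types, select_dtypes() can split the columns into groups. A stored type does not always match the meaning of a column, though: a passenger class stored as an integer is really a category. The sketch below illustrates both points on made-up data.

```python
import pandas as pd

# Illustrative data only
sample = pd.DataFrame({
    "Pclass": [3, 1, 3],
    "Fare": [7.25, 71.28, 7.92],
    "Sex": ["male", "female", "female"],
})

# Split column names by stored type
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(numeric_cols, other_cols)

# 'Pclass' is numeric in storage but categorical in meaning, so convert it
sample["Pclass"] = sample["Pclass"].astype("category")
numeric_after = sample.select_dtypes(include="number").columns.tolist()
print(numeric_after)
```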

Step 9: Filtering the Dataset

Filtering allows you to analyze subsets of data based on specific criteria.

# Filter for first-class passengers
df[df['Pclass'] == 1].head()

This code returns the first five rows in which the passenger travelled first class.
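Filters can be combined with & (and) and | (or). Each condition needs its own parentheses because these operators bind more tightly than comparisons. A minimal sketch on made-up data:

```python
import pandas as pd

# Illustrative data only
sample = pd.DataFrame({
    "Pclass": [1, 3, 1, 2],
    "Sex": ["female", "male", "male", "female"],
    "Age": [38.0, 22.0, 54.0, 27.0],
})

# Both conditions must hold: first class AND female
first_class_women = sample[(sample["Pclass"] == 1) & (sample["Sex"] == "female")]

# Either condition may hold: older than 50 OR travelling in class 1 or 2
older_or_upper = sample[(sample["Age"] > 50) | (sample["Pclass"].isin([1, 2]))]

print(first_class_women)
print(older_or_upper)
```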

Step 10: Box Plot for Quick Visualization

Box plots are an effective way to examine the spread and detect outliers in numerical data.

# Box plot for the 'Fare' column
df[['Fare']].boxplot()

This gives a quick view of fare distribution, including any potential outliers.
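The points a box plot draws beyond its whiskers can also be listed numerically. The sketch below applies the conventional 1.5 × IQR rule, which is the default whisker length in matplotlib-based box plots, to a handful of made-up fares (the values are assumptions for illustration).

```python
import pandas as pd

# Illustrative fares with one extreme value
fares = pd.Series([7.25, 7.92, 8.05, 13.0, 26.0, 512.33], name="Fare")

# Interquartile range: the spread of the middle 50% of the data
q1 = fares.quantile(0.25)
q3 = fares.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are flagged as potential outliers
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = fares[(fares < lower) | (fares > upper)]

print(outliers)
```

A flagged value is only a candidate for closer inspection; whether it is an error or a genuine extreme observation has to be decided from context.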

Step 11: Correlation Matrix

The correlation matrix quantifies relationships between numerical features. You can visualize it for a more intuitive understanding.

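# Correlation matrix (numeric columns only; text columns such as 'Name' and 'Sex'
# would raise an error in recent pandas versions)
df.corr(numeric_only=True)

# Visualize the correlation matrix
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap="coolwarm")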

Coefficients near 1 indicate a strong positive linear relationship, values near -1 indicate a strong inverse relationship, and values near 0 indicate little or no linear relationship.
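Often only the correlations with one target column matter. The sketch below restricts a made-up frame to its numeric columns, computes the matrix, and ranks the features by their correlation with 'Survived'. The data and the choice of target are assumptions for illustration.

```python
import pandas as pd

# Illustrative data only
sample = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 2, 3, 1, 3],
    "Fare": [7.25, 71.28, 13.0, 8.05, 53.1, 7.92],
    "Sex": ["male", "female", "female", "male", "female", "male"],
})

# Correlation is only defined for numeric columns, so select them first
corr = sample.select_dtypes(include="number").corr()

# Correlation of every other feature with the target, most negative first
target_corr = corr["Survived"].drop("Survived").sort_values()
print(target_corr)
```

Keep in mind that these coefficients measure linear association only and do not by themselves show that one variable causes another.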

Conclusion

Exploratory Data Analysis is a fundamental part of any data project. With these Python functions, you can achieve a comprehensive understanding of your dataset, helping you make informed decisions before advancing to more complex analyses. Integrating both graphical and non-graphical approaches offers a fuller perspective on your data.

Happy Analyzing!

Source: digitalocean.com
