Pandas DataFrame Analysis

What is a Pandas DataFrame?

A Pandas DataFrame is a two-dimensional, labeled data structure with columns of potentially different types. It is similar to a table in a database or a spreadsheet in Excel. DataFrames are one of the most commonly used structures in data analysis tasks, allowing efficient handling, manipulation, and analysis of data.

Why Analyze DataFrames?

DataFrames sit at the center of most work in data science, machine learning, and statistical analysis. With them you can load, clean, transform, and visualize data efficiently. These steps form the core of the data analysis process, enabling deeper insights and informed decision-making.

1. Getting Started with Pandas DataFrame

Installing Pandas

To begin using Pandas, you need to install the library. This can be done using either pip or conda:

bash

pip install pandas
# or
conda install pandas

Importing Pandas

Once Pandas is installed, import it into your script:

python

import pandas as pd

Creating a DataFrame

You can create a DataFrame from various data sources. Here are a few examples:

From a CSV file:

python

df = pd.read_csv('file.csv')

From a dictionary:

python

data = {'Name': ['Alice', 'Bob'], 'Age': [25, 30]}
df = pd.DataFrame(data)

From a list:

python

data = [['Alice', 25], ['Bob', 30]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
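To make the three construction routes concrete, here is a minimal sketch that builds the same two-row table each way. The CSV content is hypothetical and is held in an in-memory string (via io.StringIO) so that the example runs without a file on disk; the variable names are illustrative only.

```python
import io

import pandas as pd

# Hypothetical sample data, used only for illustration.
csv_text = "Name,Age\nAlice,25\nBob,30\n"

# read_csv accepts a file path or any file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text))

# The same table built from a dictionary of columns...
df_dict = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# ...and from a list of rows plus explicit column names.
df_list = pd.DataFrame([['Alice', 25], ['Bob', 30]], columns=['Name', 'Age'])

print(df_csv)
print(df_dict.equals(df_list))
```

The dictionary form is column-oriented and the list form is row-oriented, so the better choice usually depends on how the data arrives.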

2. Exploring DataFrame Structure

Displaying DataFrame

You can view the first few rows of a DataFrame with the .head() method:

python

print(df.head())

Similarly, to view the last few rows, use .tail():

python

print(df.tail())

Understanding DataFrame Dimensions

The .shape attribute gives you the number of rows and columns in a DataFrame:

python

print(df.shape)

Getting Column Names

To retrieve column names, use .columns:

python

print(df.columns)

Data Types in DataFrame

Check the data types of each column using .dtypes:

python

print(df.dtypes)
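The inspection calls above can be tried together on a small table. The following is a sketch on a hypothetical three-row DataFrame, not output from any real dataset.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Carol'], 'Age': [25, 30, 35]})

# .shape is a (rows, columns) tuple, so it can be unpacked directly.
n_rows, n_cols = df.shape
print(n_rows, n_cols)

# .columns is an Index object; wrap it in list() for a plain Python list.
column_names = list(df.columns)
print(column_names)

# .head(n) and .tail(n) accept the number of rows to show (the default is 5).
first_two = df.head(2)
last_one = df.tail(1)
print(first_two)
print(last_one)

# .dtypes reports one data type per column.
print(df.dtypes)
```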

3. Data Selection and Indexing

Selecting Columns

Use single brackets with a column name to select one column, or pass a list of names to select several:

python

# Single column (returns a Series)
df['Name']

# Multiple columns (returns a DataFrame)
df[['Name', 'Age']]

Selecting Rows

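To select rows, use .iloc[] for integer-position-based selection and .loc[] for label-based selection. An .iloc[] slice excludes its end position, whereas a .loc[] slice includes its end label:

python

# Position-based selection: rows at positions 0 through 4
df.iloc[0:5]

# Label-based selection: rows labeled 0 through 5, inclusive
df.loc[0:5]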
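The difference between position and label is easiest to see when the index is not the default 0, 1, 2, and so on. The sketch below uses a hypothetical DataFrame whose index labels are 10, 20, 30, and 40.

```python
import pandas as pd

# Hypothetical sample data with a non-default integer index.
df = pd.DataFrame(
    {'Name': ['Alice', 'Bob', 'Carol', 'Dave'], 'Age': [25, 30, 35, 40]},
    index=[10, 20, 30, 40],
)

# Positions 0 and 1; the end position (2) is excluded.
by_position = df.iloc[0:2]

# Labels 10 through 30; the end label (30) is included.
by_label = df.loc[10:30]

print(by_position)
print(by_label)
```

Here the position slice returns two rows while the label slice returns three, even though both slices look similar at a glance.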

Conditional Selection

Filter rows based on conditions like this:

python

df[df['Age'] > 25]
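Filters can also combine several conditions. The following sketch assumes a hypothetical DataFrame with a City column in addition to Name and Age; each condition is wrapped in parentheses because & and | bind more tightly than comparison operators in Python.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol'],
    'Age': [25, 30, 35],
    'City': ['New York', 'Los Angeles', 'New York'],
})

# Both conditions must hold (logical AND).
older_in_ny = df[(df['Age'] > 25) & (df['City'] == 'New York')]

# Either condition may hold (logical OR).
young_or_la = df[(df['Age'] < 30) | (df['City'] == 'Los Angeles')]

# ~ negates a Boolean mask; .isin() tests membership in a list of values.
outside_ny = df[~df['City'].isin(['New York'])]

print(older_in_ny)
print(young_or_la)
print(outside_ny)
```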

4. Data Cleaning Techniques

Handling Missing Data

Identify missing data using .isnull():

python

print(df.isnull())

You can drop rows with missing values:

python

df.dropna(inplace=True)

Filling Missing Values

To fill missing values with a specific value, use .fillna():

python

df.fillna(0, inplace=True)
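Filling every gap with 0 is not always appropriate, so it can help to count the missing values first and then choose a fill value per column. The sketch below uses hypothetical data and, as one possible choice rather than a recommendation, fills missing ages with the column mean.

```python
import pandas as pd

# Hypothetical sample data; None becomes a missing value in both columns.
df = pd.DataFrame({'Name': ['Alice', 'Bob', None], 'Age': [25, None, 35]})

# Summing the Boolean mask counts the missing values in each column.
missing_per_column = df.isnull().sum()
print(missing_per_column)

# A dictionary passed to fillna() sets a separate fill value per column.
filled = df.fillna({'Name': 'Unknown', 'Age': df['Age'].mean()})
print(filled)

# Without inplace=True, dropna() returns a new DataFrame and leaves df as is.
dropped = df.dropna()
print(dropped)
```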

Removing Duplicates

To remove duplicate rows:

python

df.drop_duplicates(inplace=True)
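By default a row counts as a duplicate only when every column matches an earlier row. The sketch below, on hypothetical data, shows how to inspect duplicates before removing them and how the subset and keep parameters narrow the comparison.

```python
import pandas as pd

# Hypothetical sample data: the third row repeats the first exactly,
# while the fourth repeats only Bob's name.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Alice', 'Bob'],
    'Age': [25, 30, 25, 31],
})

# .duplicated() marks rows that repeat an earlier row in full.
duplicate_mask = df.duplicated()
print(duplicate_mask)

# Drop fully repeated rows only.
unique_rows = df.drop_duplicates()

# Compare on 'Name' alone and keep the last occurrence of each name.
unique_names = df.drop_duplicates(subset='Name', keep='last')

print(unique_rows)
print(unique_names)
```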

5. Data Transformation and Manipulation

Renaming Columns

Rename columns using .rename():

python

# Returns a renamed copy; df keeps 'Name' and 'Age' for the examples that follow
df_renamed = df.rename(columns={'Name': 'Full Name', 'Age': 'Years'})

Changing Data Types

Convert a column’s data type using .astype():

python

df['Age'] = df['Age'].astype(float)
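.astype() raises an error if any value cannot be converted. When a column may contain stray text, one option is pd.to_numeric() with errors='coerce', which turns unparseable entries into missing values instead. The data below is hypothetical.

```python
import pandas as pd

# Hypothetical sample data: ages stored as text, with one unusable entry.
df = pd.DataFrame({'Age': ['25', '30', 'unknown']})

# df['Age'].astype(float) would raise a ValueError because of 'unknown'.
# errors='coerce' converts what it can and marks the rest as NaN.
ages = pd.to_numeric(df['Age'], errors='coerce')

print(ages)
print(ages.dtype)
```

The resulting missing values can then be handled with the cleaning techniques from the previous section.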

Adding/Removing Columns

To add a new column:

python

# The list needs one entry per row (this assumes df has exactly two rows)
df['City'] = ['New York', 'Los Angeles']

To remove a column:

python

# Returns a copy without 'City'; df keeps the column for the grouping examples
df_without_city = df.drop(columns=['City'])
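New columns are often computed from existing ones rather than typed in by hand. The sketch below uses hypothetical data and a hypothetical AgeInMonths column to show a derived column alongside a non-destructive drop.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# A list assigned to a new column must have one entry per row.
df['City'] = ['New York', 'Los Angeles']

# Arithmetic on a column is applied element by element.
df['AgeInMonths'] = df['Age'] * 12

# drop() without inplace=True returns a new DataFrame and leaves df unchanged.
trimmed = df.drop(columns=['City'])

print(df)
print(trimmed)
```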

6. Aggregating and Grouping Data

GroupBy Operation

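The .groupby() method lets you group rows by the values in a column and then apply aggregate functions such as .mean(), .sum(), and .count(). Select the numeric column before averaging, because recent versions of Pandas raise an error when .mean() is applied to a text column such as Name:

python

grouped = df.groupby('City')['Age'].mean()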

Multiple Aggregation Functions

You can apply multiple aggregation functions using .agg():

python

df.groupby('City').agg({'Age': ['mean', 'sum'], 'Name': 'count'})
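The dictionary form of .agg() produces two-level column names. As an alternative sketch, named aggregation gives each result a flat, readable name. The data and the output column names (mean_age, total_age, people) are hypothetical.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol', 'Dave'],
    'Age': [25, 30, 35, 40],
    'City': ['New York', 'Los Angeles', 'New York', 'Los Angeles'],
})

# Each keyword is an output column: (input column, aggregation function).
summary = df.groupby('City').agg(
    mean_age=('Age', 'mean'),
    total_age=('Age', 'sum'),
    people=('Name', 'count'),
)

print(summary)
```

The grouping column becomes the index of the result, so individual groups can be looked up with summary.loc['New York'].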

7. Sorting and Ranking Data

Sorting Data

Sort data by column values using .sort_values():

python

df.sort_values(by='Age', ascending=False, inplace=True)

Sort by index using .sort_index():

python

df.sort_index(inplace=True)

Ranking Data

Rank data based on column values:

python

df['Rank'] = df['Age'].rank()
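Ties affect both sorting and ranking. The sketch below, on hypothetical data with two people of the same age, sorts by a second column to break ties and uses rank(method='min') so that tied values share the lowest rank instead of the default averaged rank.

```python
import pandas as pd

# Hypothetical sample data: Alice and Carol have the same age.
df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Carol'], 'Age': [30, 25, 30]})

# Sort by Age (descending), then by Name (ascending) to break ties.
sorted_df = df.sort_values(by=['Age', 'Name'], ascending=[False, True])
print(sorted_df)

# The default method='average' gives tied values the mean of their ranks.
average_rank = df['Age'].rank(ascending=False)

# method='min' gives tied values the lowest rank in the tie.
df['Rank'] = df['Age'].rank(method='min', ascending=False)

print(average_rank)
print(df)
```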

8. Merging and Joining DataFrames

Concatenating DataFrames

Use pd.concat() to stack DataFrames. With axis=0, the rows of a second DataFrame (df2) are placed below those of the first (df1):

python

df_combined = pd.concat([df1, df2], axis=0)
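Concatenation keeps each input's original index labels, which can leave repeated labels in the result. The sketch below defines two hypothetical one-row DataFrames to show this and uses ignore_index=True to renumber the rows.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
df1 = pd.DataFrame({'Name': ['Alice'], 'City': ['New York']})
df2 = pd.DataFrame({'Name': ['Bob'], 'City': ['Los Angeles']})

# Original index labels are preserved, so label 0 appears twice.
stacked = pd.concat([df1, df2], axis=0)
print(stacked)

# ignore_index=True discards the old labels and numbers the rows 0, 1, ...
renumbered = pd.concat([df1, df2], axis=0, ignore_index=True)
print(renumbered)
```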

Merging DataFrames

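Join two DataFrames on a shared column using pd.merge(). In this example both df1 and df2 must contain a City column, and how='inner' keeps only the rows whose City value appears in both:

python

df_merged = pd.merge(df1, df2, on='City', how='inner')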
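The how argument controls which rows survive a merge. The sketch below uses two hypothetical tables (people and cities) to compare an inner join with a left join, and adds indicator=True to show where each row came from.

```python
import pandas as pd

# Hypothetical sample data: Chicago has no match in the cities table.
people = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Carol'],
    'City': ['New York', 'Los Angeles', 'Chicago'],
})
cities = pd.DataFrame({
    'City': ['New York', 'Los Angeles'],
    'State': ['NY', 'CA'],
})

# Inner join: only rows whose City appears in both tables.
inner = pd.merge(people, cities, on='City', how='inner')
print(inner)

# Left join: every row from people; unmatched rows get NaN for State.
# indicator=True adds a _merge column describing each row's origin.
left = pd.merge(people, cities, on='City', how='left', indicator=True)
print(left)
```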

9. Visualizing Data with Pandas

Plotting Data

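You can create plots directly from a DataFrame or one of its columns with .plot(). Pandas draws these plots with Matplotlib by default, so Matplotlib must be installed:

python

df['Age'].plot(kind='hist')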

Types of Plots

Pandas supports various plot types such as line, bar, histogram, and more. Here’s an example of a line plot:

python

df.plot(kind='line', x='Name', y='Age')

10. Exporting DataFrames

Saving DataFrame to CSV

To save a DataFrame to a CSV file:

python

df.to_csv('data.csv', index=False)

Exporting to Excel and JSON

Similarly, export to Excel. Writing .xlsx files requires an Excel engine such as openpyxl to be installed:

python

df.to_excel('data.xlsx', index=False)

Or to JSON:

python

df.to_json('data.json')
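When no file path is given, to_csv() and to_json() return the serialized text instead of writing a file, which makes it easy to check an export before saving it. The sketch below does a round trip on hypothetical data; orient='records' is one of several JSON layouts and is chosen here only because it is easy to read.

```python
import io
import json

import pandas as pd

# Hypothetical sample data, used only for illustration.
df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [25, 30]})

# With no path, to_csv() returns the CSV content as a string.
csv_text = df.to_csv(index=False)

# Reading the text back should reproduce the original values.
restored = pd.read_csv(io.StringIO(csv_text))
print(restored)

# orient='records' produces a JSON list with one object per row.
json_text = df.to_json(orient='records')
records = json.loads(json_text)
print(records)
```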


Frequently Asked Questions

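For DataFrames too large to fit comfortably in memory, consider using Dask or reading the data in chunks, for example with the chunksize parameter of pd.read_csv().

To filter on several conditions at once, combine them with the & (and) or | (or) operators and wrap each condition in parentheses, as in df[(df['Age'] > 25) & (df['City'] == 'New York')].

.loc[] is used for label-based indexing, and .iloc[] is used for integer-position-based indexing.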
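The advice about splitting large data into chunks can be sketched with the chunksize parameter of pd.read_csv(), which yields one DataFrame per chunk instead of loading everything at once. The CSV content here is a hypothetical in-memory string, and a chunk size of 2 is used only to keep the example small.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for a large CSV file.
csv_text = "Name,Age\nAlice,25\nBob,30\nCarol,35\nDave,40\nEve,45\n"

total_age = 0
row_count = 0

# Each iteration yields a DataFrame with at most `chunksize` rows.
for chunk in pd.read_csv(io.StringIO(csv_text), chunksize=2):
    total_age += chunk['Age'].sum()
    row_count += len(chunk)

# Combine the per-chunk totals into an overall mean.
mean_age = total_age / row_count
print(row_count, mean_age)
```

This pattern works for aggregates that can be accumulated piece by piece, such as sums and counts; other statistics may need a different approach.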
