Pandas DataFrame Operations You’ll Use in Every Data Project

I work with pandas every single day. After three years of data work, I’ve found that about 20 operations cover 90% of what I need to do. Here they are, organized by how often I reach for them.

Loading and First Look

import pandas as pd

# Read data
df = pd.read_csv('sales.csv', parse_dates=['order_date'])

# First things I always check
df.shape              # How many rows and columns?
df.head()             # What does it look like?
df.dtypes             # What types are the columns?
df.describe()         # Basic statistics
df.isnull().sum()     # How much data is missing?

These five lines are how I start every single project. They take 10 seconds and save hours of confusion later.

Filtering Data

# Single condition
big_orders = df[df['amount'] > 1000]

# Multiple conditions (use & for AND, | for OR)
premium_recent = df[(df['amount'] > 1000) & (df['order_date'] > '2025-01-01')]

# Filter with isin for multiple values
selected = df[df['status'].isin(['completed', 'shipped'])]

# Query syntax (cleaner for complex filters)
result = df.query('amount > 1000 and region == "North"')

Grouping and Aggregation

# Basic groupby
monthly = df.groupby(df['order_date'].dt.month)['amount'].sum()

# Multiple aggregations
summary = df.groupby('region').agg(
    total_sales=('amount', 'sum'),
    avg_order=('amount', 'mean'),
    num_orders=('order_id', 'count'),
    unique_customers=('customer_id', 'nunique')
).reset_index()

# Pivot table (my go-to for cross-tabulations)
pivot = pd.pivot_table(df, values='amount', 
    index='region', columns='product_category',
    aggfunc='sum', fill_value=0)

Handling Missing Data

# Fill with a specific value
df['phone'] = df['phone'].fillna('Not provided')

# Fill with the column median (robust to outliers)
df['amount'] = df['amount'].fillna(df['amount'].median())

# Forward fill (great for time series)
df['price'] = df['price'].ffill()

# Drop rows where specific columns are null
df = df.dropna(subset=['email', 'name'])

Creating New Columns

# Simple calculation
df['profit'] = df['revenue'] - df['cost']

# Conditional column
df['size'] = pd.cut(df['amount'], bins=[0, 100, 500, float('inf')], 
                    labels=['Small', 'Medium', 'Large'])

# Apply a function (note: this raises on missing emails;
# use df['email'].str.split('@').str[1] to handle NaN gracefully)
df['email_domain'] = df['email'].apply(lambda x: x.split('@')[1])

# Multiple conditions with np.select
import numpy as np
conditions = [df['score'] >= 90, df['score'] >= 70, df['score'] >= 50]
choices = ['A', 'B', 'C']
df['grade'] = np.select(conditions, choices, default='F')

Merging DataFrames

# Inner join (only matching rows)
merged = pd.merge(orders, customers, on='customer_id')

# Left join (keep all orders, even without customer match)
merged = pd.merge(orders, customers, on='customer_id', how='left')

# Merge on different column names
merged = pd.merge(orders, products, 
    left_on='product_code', right_on='sku')

Performance Tips

  • Use category dtype for columns with few unique values – it can cut memory use dramatically, often by an order of magnitude for low-cardinality string columns
  • Read only needed columns: pd.read_csv('file.csv', usecols=['col1', 'col2'])
  • Use itertuples() instead of iterrows() when you must iterate – it's often around 10x faster
  • For files over 1GB, consider reading in chunks (chunksize=) or using polars instead of pandas
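Here's a quick sketch of the first and third tips in action, on a toy DataFrame (the column names and sizes are just for illustration):

```python
import numpy as np
import pandas as pd

# Toy data: a low-cardinality string column and a numeric column
df = pd.DataFrame({
    'status': np.random.choice(['completed', 'shipped', 'pending'], size=100_000),
    'amount': np.random.rand(100_000),
})

# Tip 1: converting a low-cardinality column to category shrinks it a lot,
# because pandas stores small integer codes instead of repeated strings
before = df['status'].memory_usage(deep=True)
df['status'] = df['status'].astype('category')
after = df['status'].memory_usage(deep=True)
print(f"status column: {before:,} bytes -> {after:,} bytes")

# Tip 3: when you genuinely must loop, itertuples() yields lightweight
# namedtuples and is much faster than iterrows()
total = 0.0
for row in df.itertuples(index=False):
    total += row.amount
```

Run it on your own data before trusting the memory numbers; the savings depend entirely on how repetitive the column is.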

Bookmark this page. I guarantee you’ll come back to it. These operations are the bread and butter of data work in Python.
