I work with pandas every single day. After three years of data work, I’ve found that about 20 operations cover 90% of what I need to do. Here they are, organized by how often I reach for them.
Loading and First Look
import pandas as pd
# Read data
df = pd.read_csv('sales.csv', parse_dates=['order_date'])
# First things I always check
df.shape # How many rows and columns?
df.head() # What does it look like?
df.dtypes # What types are the columns?
df.describe() # Basic statistics
df.isnull().sum() # How much data is missing?
These five lines are how I start every single project. They take 10 seconds and save hours of confusion later.
Filtering Data
# Single condition
big_orders = df[df['amount'] > 1000]
# Multiple conditions (use & for AND, | for OR)
premium_recent = df[(df['amount'] > 1000) & (df['order_date'] > '2025-01-01')]
# Filter with isin for multiple values
selected = df[df['status'].isin(['completed', 'shipped'])]
# Query syntax (cleaner for complex filters)
result = df.query('amount > 1000 and region == "North"')
Grouping and Aggregation
# Basic groupby
monthly = df.groupby(df['order_date'].dt.month)['amount'].sum()
# Multiple aggregations
summary = df.groupby('region').agg(
    total_sales=('amount', 'sum'),
    avg_order=('amount', 'mean'),
    num_orders=('order_id', 'count'),
    unique_customers=('customer_id', 'nunique')
).reset_index()
# Pivot table (my go-to for cross-tabulations)
pivot = pd.pivot_table(df, values='amount',
                       index='region', columns='product_category',
                       aggfunc='sum', fill_value=0)
Handling Missing Data
# Fill with a specific value
df['phone'] = df['phone'].fillna('Not provided')
# Fill with the column median (more robust to outliers than the mean)
df['amount'] = df['amount'].fillna(df['amount'].median())
# Forward fill (great for time series)
df['price'] = df['price'].ffill()
# Drop rows where specific columns are null
df = df.dropna(subset=['email', 'name'])
Creating New Columns
# Simple calculation
df['profit'] = df['revenue'] - df['cost']
# Conditional column
df['size'] = pd.cut(df['amount'], bins=[0, 100, 500, float('inf')],
                    labels=['Small', 'Medium', 'Large'])
# Apply a function
df['email_domain'] = df['email'].apply(lambda x: x.split('@')[1])
# Multiple conditions with np.select
import numpy as np
conditions = [df['score'] >= 90, df['score'] >= 70, df['score'] >= 50]
choices = ['A', 'B', 'C']
df['grade'] = np.select(conditions, choices, default='F')
Merging DataFrames
# Inner join (only matching rows)
merged = pd.merge(orders, customers, on='customer_id')
# Left join (keep all orders, even without customer match)
merged = pd.merge(orders, customers, on='customer_id', how='left')
# Merge on different column names
merged = pd.merge(orders, products,
                  left_on='product_code', right_on='sku')
Performance Tips
- Use the `category` dtype for columns with few unique values – saves 80% memory in many cases
- Read only the columns you need: `pd.read_csv('file.csv', usecols=['col1', 'col2'])`
- Use `itertuples()` instead of `iterrows()` when you must iterate (about 10x faster)
- For files over 1GB, consider using `polars` instead of pandas
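The `category` tip is easy to verify yourself. Here's a quick sketch with made-up data (the `region` column and its values are invented for illustration) that compares memory before and after the conversion:

```python
import pandas as pd

# A low-cardinality string column: 100,000 rows, only 4 unique values
df = pd.DataFrame({'region': ['North', 'South', 'East', 'West'] * 25_000})

before = df['region'].memory_usage(deep=True)  # stored as Python strings
df['region'] = df['region'].astype('category')
after = df['region'].memory_usage(deep=True)   # stored as small integer codes

print(f'object dtype: {before:,} bytes')
print(f'category dtype: {after:,} bytes')
```

The savings come from how categoricals are stored: each row holds a tiny integer code pointing into a single lookup table of unique values, instead of a full string object per row. The fewer the unique values relative to row count, the bigger the win.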
Bookmark this page. I guarantee you’ll come back to it. These operations are the bread and butter of data work in Python.
