Parse and Clean Messy CSV Files with Python and Pandas
Introduction
CSV files are everywhere — exported from databases, downloaded from APIs, generated by spreadsheets. But real-world CSV files are rarely clean.
They often contain missing values, inconsistent capitalization, extra whitespace, duplicate rows, wrong data types, and broken formatting. Before you can analyze the data, build dashboards, or generate reports, you usually need to clean it first.
This is where pandas becomes invaluable.
In this tutorial, you will learn how to:
- Load CSV files with pandas
- Inspect data quality
- Handle missing values
- Clean column names and string data
- Fix data types
- Remove duplicate rows
- Filter and transform data
- Save cleaned results
All examples use a deliberately messy employee dataset so you can follow along step by step.
All examples are tested on Python 3.12 and pandas 2.x.
Setup
Install pandas:
pip install pandas
pandas is the most widely used Python library for data analysis and cleaning. Its core data structure is the DataFrame — a table of rows and columns, similar to a spreadsheet or SQL table.
Creating a Messy Sample File
To practice data cleaning, start by creating a deliberately messy dataset:
import pandas as pd
data = {
"Name": ["Alice", "bob", "CAROL", "alice", None, "Dave"],
"Age": [28, 34, None, 28, 22, 45],
"Department": [
"Engineering",
"marketing",
"Engineering",
"Engineering",
"Sales",
" HR "
],
"Salary": ["75000", "82000", "90000", "75000", "60000", "105000"],
"Join Date": [
"2020-01-15",
"2019-03-22",
"2021-07-01",
"2020-01-15",
"2022-11-30",
"2018-05-10"
],
}
df = pd.DataFrame(data)
df.to_csv("employees.csv", index=False)
print("Sample file created")
Expected output:
Sample file created
This file intentionally contains:
- Inconsistent capitalization (bob, CAROL)
- A missing name and a missing age (None)
- Extra whitespace in a department name (" HR ")
- Salary values stored as text rather than numbers, one with a thousands separator ("105,000")
- A fully duplicate row (Alice appears twice with identical data)
This closely resembles the kind of messy data you encounter in real projects.
Loading and Inspecting the File
Load the file and run three basic checks:
import pandas as pd
df = pd.read_csv("employees.csv")
print(df.head())
print(df.shape)
print(df.dtypes)
df.head() shows the first five rows — a quick sanity check that the file loaded correctly.
df.shape returns (rows, columns):
Example output:
(6, 5)
df.dtypes shows each column’s inferred data type. This is where problems often surface first:
Example output:
Name object
Age float64
Department object
Salary object
Join Date object
Notice that Salary is object (text), not a number: the thousands separator in "105,000" keeps pandas from reading the column as numeric. Age is float64 because pandas uses floats when a column contains missing values. Both will need fixing.
Inspecting Data Quality
Before making any changes, get a full picture of what needs cleaning.
Missing Values
print(df.isnull().sum())
Example output:
Name 1
Age 1
Department 0
Salary 0
Join Date 0
dtype: int64
To see the percentage of missing data per column:
print(df.isnull().sum() / len(df) * 100)
This helps decide whether to fill or drop — a column with 60% missing data is usually better dropped than filled.
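When a column crosses that kind of threshold, you can drop it programmatically. A minimal sketch of a threshold-based drop; the 50% cutoff and the missing_ratio name are arbitrary choices for illustration:
# Drop any column where more than half of the values are missing
missing_ratio = df.isnull().sum() / len(df)
df = df.drop(columns=missing_ratio[missing_ratio > 0.5].index)
In this tutorial's dataset no column comes close to that threshold, so nothing is removed.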
Duplicate Rows
print(df.duplicated().sum())
Example output:
1
To see which rows are duplicates:
print(df[df.duplicated()])
Basic Statistics
print(df.describe())
describe() summarizes numeric columns with count, mean, min, max, and quartiles. It is useful for spotting suspicious outliers or impossible values.
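By default, describe() skips non-numeric columns. Passing include="all" also reports counts, unique values, and the most frequent value for text columns:
print(df.describe(include="all"))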
Fixing Column Names
Inconsistent column names cause problems later — especially when column names have spaces or mixed capitalization. Standardize them all at once:
df.columns = (
df.columns
.str.lower()
.str.replace(" ", "_")
)
print(df.columns)
Expected output:
Index(['name', 'age', 'department', 'salary', 'join_date'], dtype='object')
Lowercase, underscore-separated column names are easier to type, work cleanly with pandas attribute access (df.join_date), and match the conventions of most data tools.
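Real exports sometimes ship with stray whitespace in the headers themselves. Adding a strip step to the same chain covers that case; this is a small extension of the snippet above rather than a required step for this dataset:
df.columns = (
    df.columns
    .str.strip()
    .str.lower()
    .str.replace(" ", "_")
)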
Handling Missing Values
There are two main strategies: drop rows with missing data, or fill them with a substitute value.
Dropping Missing Rows
df_clean = df.dropna()
print(df_clean.shape)
Expected output:
(4, 5)
This removes any row containing at least one missing value. It is simple but can discard too much data if missing values are scattered across many rows.
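dropna() also accepts parameters that make the removal more targeted: subset restricts the check to specific columns, and thresh keeps any row that still has a minimum number of non-missing values. A short sketch (the variable names are just for illustration):
# Drop only the rows where the name is missing
df_named = df.dropna(subset=["name"])
# Keep rows that have at least 4 non-missing values across the 5 columns
df_mostly_complete = df.dropna(thresh=4)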
Filling Missing Values
A more targeted approach fills each column individually:
df["age"] = df["age"].fillna(df["age"].median())
df["name"] = df["name"].fillna("Unknown")
Median is preferred over mean for numeric columns because it is less affected by outliers. For categorical columns like name, a placeholder string like "Unknown" is more informative than dropping the row.
Verify the result:
print(df.isnull().sum())
Expected output:
name 0
age 0
department 0
salary 0
join_date 0
dtype: int64
Fixing Data Types
CSV files store everything as text. pandas infers types on load, but it often gets things wrong.
Convert Salary to Numeric
df["salary"] = pd.to_numeric(df["salary"])
Convert Join Date to Datetime
df["join_date"] = pd.to_datetime(df["join_date"])
Verify the corrected types:
print(df.dtypes)
Expected output:
name object
age float64
department object
salary int64
join_date datetime64[ns]
dtype: object
Correct data types unlock proper operations — you cannot calculate the average salary or filter by date range if those columns are stored as strings.
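For example, with the corrected types these one-liners now behave as expected:
# Average salary across all employees
print(df["salary"].mean())
# Earliest join date in the dataset
print(df["join_date"].min())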
Cleaning String Columns
Text columns often have inconsistent casing and stray whitespace.
Standardize Capitalization
df["name"] = df["name"].str.title()
df["department"] = df["department"].str.title()
This converts bob to Bob, CAROL to Carol, and marketing to Marketing.
Remove Extra Whitespace
df["department"] = df["department"].str.strip()
This removes leading and trailing spaces. " HR " becomes "HR".
Verify the result:
print(df["department"].unique())
Expected output:
['Engineering' 'Marketing' 'Sales' 'Hr']
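One side effect worth noting: str.title() also lowercases acronyms, which is why HR now appears as Hr; a simple replacement mapping can restore such values. And when several text columns need the same whitespace treatment, you can strip them all at once. A minimal sketch:
# Strip whitespace from every text (object) column in one pass
text_cols = df.select_dtypes(include="object").columns
df[text_cols] = df[text_cols].apply(lambda col: col.str.strip())
# Restore the acronym that title-casing flattened
df["department"] = df["department"].replace({"Hr": "HR"})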
Removing Duplicates
print(f"Before: {len(df)} rows")
df = df.drop_duplicates()
print(f"After: {len(df)} rows")
Expected output:
Before: 6 rows
After: 5 rows
To remove duplicates based only on specific columns — useful when some columns are allowed to differ:
df = df.drop_duplicates(subset=["name", "department"])
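By default drop_duplicates() keeps the first occurrence of each duplicate group. Passing keep="last" keeps the last occurrence instead, and keep=False drops every copy:
# Keep the last occurrence of each (name, department) pair instead of the first
df = df.drop_duplicates(subset=["name", "department"], keep="last")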
Filtering and Transforming Data
Once the data is clean, analysis is straightforward.
Filtering Rows
high_salary = df[df["salary"] > 80000]
engineering = df[df["department"] == "Engineering"]
Creating New Columns
df["seniority"] = (
df["join_date"]
.dt.year
.apply(lambda y: "Senior" if y < 2020 else "Junior")
)
print(df[["name", "join_date", "seniority"]])
Expected output:
name join_date seniority
0 Alice 2020-01-15 Junior
1 Bob 2019-03-22 Senior
2 Carol 2021-07-01 Junior
4 Unknown 2022-11-30 Junior
5 Dave 2018-05-10 Senior
Derived columns like this are common in analytics workflows — converting raw dates into categories, computing ratios, or flagging rows that meet certain conditions. For simpler list-based transformations outside of pandas, list comprehensions are often faster and more readable than applying pandas operations to plain Python data.
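The same pattern works for simple flags. For instance, a boolean column marking employees above a salary threshold; the high_earner name and the 80,000 cutoff are arbitrary choices for illustration:
# True for anyone earning above the threshold
df["high_earner"] = df["salary"] > 80000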
Saving the Clean Data
Save as CSV
df.to_csv("employees_clean.csv", index=False)
print("Cleaned data saved")
Expected output:
Cleaned data saved
index=False prevents pandas from writing the row index as an extra column.
Save as Excel
df.to_excel("employees_clean.xlsx", index=False)
Excel output is useful for sharing with non-technical users or feeding into business reporting tools. For more control over Excel formatting, the openpyxl tutorial covers reading, writing, and styling .xlsx files directly.
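Note that pandas needs an Excel engine installed to write .xlsx files; openpyxl is the usual choice:
pip install openpyxl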
Common Real-World Cleaning Problems
Mixed Numeric Formats
Real salary data often looks like "$75,000" instead of "75000". Strip the formatting before converting:
df["salary"] = (
df["salary"]
.str.replace("$", "", regex=False)
.str.replace(",", "", regex=False)
)
df["salary"] = pd.to_numeric(df["salary"])
Inconsistent Date Formats
A single CSV file may contain dates written as 2024-01-01, 01/01/2024, and Jan 1 2024. pd.to_datetime() handles many formats automatically, but when it cannot, specify the format explicitly:
df["join_date"] = pd.to_datetime(df["join_date"], format="%Y-%m-%d")
Non-Standard Null Values
Missing data does not always appear as an empty cell. Enterprise exports often use "N/A", "NULL", "-", or "unknown". Tell pandas to treat them as missing during load:
df = pd.read_csv(
"file.csv",
na_values=["N/A", "NULL", "-", "unknown"]
)
Wrap-Up
Data cleaning is one of the most important and time-consuming steps in any data workflow. With pandas, the most common problems — missing values, bad types, inconsistent strings, duplicates — can each be solved in one or two lines.
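As a recap, the individual steps combine naturally into a single reusable function. A minimal sketch assuming the column layout used in this tutorial; the clean_employees name is just an example:
import pandas as pd

def clean_employees(path):
    df = pd.read_csv(path)
    # Standardize column names
    df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
    # Fill missing values
    df["age"] = df["age"].fillna(df["age"].median())
    df["name"] = df["name"].fillna("Unknown")
    # Fix data types
    df["salary"] = pd.to_numeric(df["salary"].astype(str).str.replace(",", "", regex=False))
    df["join_date"] = pd.to_datetime(df["join_date"])
    # Clean string columns
    df["name"] = df["name"].str.title()
    df["department"] = df["department"].str.strip().str.title()
    # Remove exact duplicates
    return df.drop_duplicates()

df = clean_employees("employees.csv")
print(df.shape)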
The cleaned dataset from this tutorial pairs naturally with other automation workflows. You can feed it into a web scraping pipeline to collect fresh data on a schedule, or use the file renaming script to organize output files automatically. For questions or future tutorial ideas, get in touch via the Contact page.