Parse and Clean Messy CSV Files with Python and Pandas

7 min read

Introduction

CSV files are everywhere — exported from databases, downloaded from APIs, generated by spreadsheets. But real-world CSV files are rarely clean.

They often contain missing values, inconsistent capitalization, extra whitespace, duplicate rows, wrong data types, and broken formatting. Before you can analyze the data, build dashboards, or generate reports, you usually need to clean it first.

This is where pandas becomes invaluable.

In this tutorial, you will learn how to:

  • Load CSV files with pandas
  • Inspect data quality
  • Handle missing values
  • Clean column names and string data
  • Fix data types
  • Remove duplicate rows
  • Filter and transform data
  • Save cleaned results

All examples use a deliberately messy employee dataset so you can follow along step by step.

All examples are tested on Python 3.12 and pandas 2.x.


Setup

Install pandas:

pip install pandas

pandas is the most widely used Python library for data analysis and cleaning. Its core data structure is the DataFrame — a table of rows and columns, similar to a spreadsheet or SQL table.


Creating a Messy Sample File

To practice data cleaning, start by creating a deliberately messy dataset:

import pandas as pd

data = {
    "Name": ["alice", "bob", "CAROL", "alice", None, "Dave"],
    "Age": [28, 34, None, 28, 22, 45],
    "Department": [
        "Engineering",
        "marketing",
        "Engineering",
        "Engineering",
        "Sales",
        "  HR  "
    ],
    "Salary": ["75000", "82000", "90000", "75000", "60000", "105000"],
    "Join Date": [
        "2020-01-15",
        "2019-03-22",
        "2021-07-01",
        "2020-01-15",
        "2022-11-30",
        "2018-05-10"
    ],
}

df = pd.DataFrame(data)
df.to_csv("employees.csv", index=False)

print("Sample file created")

Expected output:

Sample file created

This file intentionally contains:

  • Inconsistent capitalization (bob, CAROL, alice)
  • A missing name and a missing age (None)
  • Extra whitespace in a department name (" HR ")
  • Salary values stored as strings instead of numbers
  • A fully duplicate row (Alice appears twice with identical data)

This closely resembles the kind of messy data you encounter in real projects.


Loading and Inspecting the File

Load the file and run three basic checks:

import pandas as pd

df = pd.read_csv("employees.csv")

print(df.head())
print(df.shape)
print(df.dtypes)

df.head() shows the first five rows — a quick sanity check that the file loaded correctly.

df.shape returns (rows, columns):

Example output:

(6, 5)

df.dtypes shows each column’s inferred data type. This is where problems often surface first:

Example output:

Name          object
Age          float64
Department    object
Salary         int64
Join Date     object
dtype: object

Notice that Salary comes back as int64 even though the values were written as strings: they are plain digits, so pandas silently re-infers them as numbers on load. Real exports are rarely that tidy, and formatted values such as "$75,000" load as object (strings) and need explicit conversion. Age is float64 rather than int64 because pandas represents missing values as NaN, which requires a float column.
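
A quick way to see why numbers stored as strings are dangerous: string comparison is lexicographic, character by character, so a "larger" salary string can compare as smaller:

```python
# String comparison is character by character, so "9..." sorts
# after "1..." no matter what numbers the strings represent.
print("90000" > "105000")   # True  (lexicographic)
print(90000 > 105000)       # False (numeric)
```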


Inspecting Data Quality

Before making any changes, get a full picture of what needs cleaning.

Missing Values

print(df.isnull().sum())

Example output:

Name          1
Age           1
Department    0
Salary        0
Join Date     0
dtype: int64

To see the percentage of missing data per column:

print(df.isnull().sum() / len(df) * 100)

This helps decide whether to fill or drop — a column with 60% missing data is usually better dropped than filled.
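
The percentage check translates directly into an automatic column filter. A minimal sketch with hypothetical data (the 50% cutoff is an arbitrary choice for illustration):

```python
import pandas as pd

# Hypothetical frame: column "b" is 75% missing.
df = pd.DataFrame({
    "a": [1, 2, 3, 4],
    "b": [1, None, None, None],
})

# Keep only columns with less than 50% missing values
# (the cutoff is an arbitrary threshold for illustration).
missing_pct = df.isnull().sum() / len(df) * 100
df = df.loc[:, missing_pct < 50]

print(df.columns.tolist())  # ['a']
```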

Duplicate Rows

print(df.duplicated().sum())

Example output:

1

To see which rows are duplicates:

print(df[df.duplicated()])

Basic Statistics

print(df.describe())

describe() summarizes numeric columns with count, mean, min, max, and quartiles. It is useful for spotting suspicious outliers or impossible values.
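
describe() only summarizes; to act on what it reveals, a boolean filter can flag the implausible rows. A sketch with hypothetical data and illustrative thresholds:

```python
import pandas as pd

# Hypothetical frame with one implausible value.
df = pd.DataFrame({"Age": [28, 34, 230, 45]})

# Thresholds here are illustrative assumptions, not rules.
suspicious = df[(df["Age"] < 16) | (df["Age"] > 100)]

print(len(suspicious))  # 1
```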


Fixing Column Names

Inconsistent column names cause problems later — especially when column names have spaces or mixed capitalization. Standardize them all at once:

df.columns = (
    df.columns
    .str.lower()
    .str.replace(" ", "_")
)

print(df.columns)

Expected output:

Index(['name', 'age', 'department', 'salary', 'join_date'], dtype='object')

Lowercase, underscore-separated column names are easier to type, work cleanly with pandas attribute access (df.join_date), and match the conventions of most data tools.


Handling Missing Values

There are two main strategies: drop rows with missing data, or fill them with a substitute value.

Dropping Missing Rows

df_clean = df.dropna()
print(df_clean.shape)

Expected output:

(4, 5)

This removes any row containing at least one missing value. It is simple but can discard too much data if missing values are scattered across many rows.
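
A middle ground is dropna(subset=...), which drops a row only when a critical column is missing. A sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical data with gaps in different columns.
df = pd.DataFrame({
    "name": ["Alice", None, "Carol"],
    "age": [28, 34, None],
})

# Drop a row only when "name" is missing; a missing age is tolerable.
df = df.dropna(subset=["name"])

print(df.shape)  # (2, 2)
```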

Filling Missing Values

A more targeted approach fills each column individually:

df["age"] = df["age"].fillna(df["age"].median())
df["name"] = df["name"].fillna("Unknown")

Median is preferred over mean for numeric columns because it is less affected by outliers. For categorical columns like name, a placeholder string like "Unknown" is more informative than dropping the row.
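
When a numeric column varies by category, a group-wise fill is often more accurate than one global median. A sketch with hypothetical data, filling each missing age with the median of its department:

```python
import pandas as pd

# Hypothetical data: one missing age in Engineering.
df = pd.DataFrame({
    "department": ["Engineering", "Engineering", "Engineering", "Sales"],
    "age": [28.0, None, 40.0, 22.0],
})

# Fill each missing age with the median age of that department.
df["age"] = df["age"].fillna(
    df.groupby("department")["age"].transform("median")
)

print(df["age"].tolist())  # [28.0, 34.0, 40.0, 22.0]
```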

Verify the result:

print(df.isnull().sum())

Expected output:

name          0
age           0
department    0
salary        0
join_date     0
dtype: int64

Fixing Data Types

CSV files store everything as text. pandas re-infers types on load, but formatted values (currency symbols, thousands separators, ambiguous dates) often defeat the inference.

Convert Salary to Numeric

If salary loads as object, as formatted text usually does, pd.to_numeric converts it. The call is a harmless no-op when pandas has already inferred a numeric type:

df["salary"] = pd.to_numeric(df["salary"])

Convert Join Date to Datetime

df["join_date"] = pd.to_datetime(df["join_date"])

Verify the corrected types:

print(df.dtypes)

Expected output:

name                  object
age                  float64
department            object
salary                 int64
join_date     datetime64[ns]
dtype: object

Correct data types unlock proper operations — you cannot calculate the average salary or filter by date range if those columns are stored as strings.
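
A short sketch of both operations, using hypothetical already-typed data in place of the cleaned frame:

```python
import pandas as pd

# Hypothetical, already-typed data standing in for the cleaned frame.
df = pd.DataFrame({
    "salary": [75000, 82000, 90000],
    "join_date": pd.to_datetime(["2020-01-15", "2019-03-22", "2021-07-01"]),
})

# Numeric dtype unlocks aggregation...
print(df["salary"].mean())

# ...and datetime dtype unlocks date-range filtering.
recent = df[df["join_date"] >= "2020-01-01"]
print(len(recent))  # 2
```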


Cleaning String Columns

Text columns often have inconsistent casing and stray whitespace.

Standardize Capitalization

df["name"] = df["name"].str.title()
df["department"] = df["department"].str.title()

This converts bob to Bob, CAROL to Carol, and marketing to Marketing. Be aware that title() also rewrites acronyms (HR becomes Hr), so map known acronyms back afterwards if their casing matters.

Remove Extra Whitespace

df["department"] = df["department"].str.strip()

This removes leading and trailing spaces. " HR " becomes "HR".

Verify the result:

print(df["department"].unique())

Expected output:

['Engineering' 'Marketing' 'Sales' 'Hr']

Removing Duplicates

print(f"Before: {len(df)} rows")

df = df.drop_duplicates()

print(f"After: {len(df)} rows")

Expected output:

Before: 6 rows
After: 5 rows

To remove duplicates based only on specific columns — useful when some columns are allowed to differ:

df = df.drop_duplicates(subset=["name", "department"])
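
drop_duplicates keeps the first occurrence by default; the keep parameter changes that. A sketch with hypothetical data, retaining the most recent salary per name (keep=False would instead drop every duplicated name entirely):

```python
import pandas as pd

# Hypothetical data: Alice appears twice with different salaries.
df = pd.DataFrame({
    "name": ["Alice", "Alice", "Bob"],
    "salary": [75000, 80000, 82000],
})

# keep="last" retains the final occurrence of each name.
latest = df.drop_duplicates(subset=["name"], keep="last")

print(latest["salary"].tolist())  # [80000, 82000]
```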

Filtering and Transforming Data

Once the data is clean, analysis is straightforward.

Filtering Rows

high_salary = df[df["salary"] > 80000]
engineering = df[df["department"] == "Engineering"]

Creating New Columns

df["seniority"] = (
    df["join_date"]
    .dt.year
    .apply(lambda y: "Senior" if y < 2020 else "Junior")
)

print(df[["name", "join_date", "seniority"]])

Expected output:

      name  join_date seniority
0    Alice 2020-01-15    Junior
1      Bob 2019-03-22    Senior
2    Carol 2021-07-01    Junior
4  Unknown 2022-11-30    Junior
5     Dave 2018-05-10    Senior

Derived columns like this are common in analytics workflows — converting raw dates into categories, computing ratios, or flagging rows that meet certain conditions. For simpler list-based transformations outside of pandas, list comprehensions are often faster and more readable than applying pandas operations to plain Python data.


Saving the Clean Data

Save as CSV

df.to_csv("employees_clean.csv", index=False)
print("Cleaned data saved")

Expected output:

Cleaned data saved

index=False prevents pandas from writing the row index as an extra column.

Save as Excel

df.to_excel("employees_clean.xlsx", index=False)

Excel output requires the openpyxl package (pip install openpyxl), which pandas uses as its engine for .xlsx files. It is useful for sharing with non-technical users or feeding into business reporting tools. For more control over Excel formatting, the openpyxl tutorial covers reading, writing, and styling .xlsx files directly.


Common Real-World Cleaning Problems

Mixed Numeric Formats

Real salary data often looks like "$75,000" instead of "75000". Strip the formatting before converting:

df["salary"] = (
    df["salary"]
    .str.replace("$", "", regex=False)
    .str.replace(",", "", regex=False)
)

df["salary"] = pd.to_numeric(df["salary"])

Inconsistent Date Formats

A single CSV file may contain dates written as 2024-01-01, 01/01/2024, and Jan 1 2024. pd.to_datetime() infers the format from the first value, and pandas 2.x raises an error when later values do not match it. Pass format="mixed" to parse each value independently, or specify the format explicitly when you know it:

df["join_date"] = pd.to_datetime(df["join_date"], format="%Y-%m-%d")
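
When a column mixes parseable dates with hopeless values, errors="coerce" converts the failures to NaT (not-a-time) instead of raising, so you can count and inspect them afterwards. A small sketch with hypothetical data:

```python
import pandas as pd

# Hypothetical column mixing valid dates and junk.
dates = pd.Series(["2024-01-01", "not a date", "2021-07-01"])

# Unparseable values become NaT instead of raising an error.
parsed = pd.to_datetime(dates, errors="coerce")

print(parsed.isnull().sum())  # 1
```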

Non-Standard Null Values

Missing data does not always appear as an empty cell. Enterprise exports often use "N/A", "NULL", "-", or "unknown". Tell pandas to treat them as missing during load:

df = pd.read_csv(
    "file.csv",
    na_values=["N/A", "NULL", "-", "unknown"]
)

Wrap-Up

Data cleaning is one of the most important and time-consuming steps in any data workflow. With pandas, the most common problems — missing values, bad types, inconsistent strings, duplicates — can each be solved in one or two lines.

The cleaned dataset from this tutorial pairs naturally with other automation workflows. You can feed it into a web scraping pipeline to collect fresh data on a schedule, or use the file renaming script to organize output files automatically. For questions or future tutorial ideas, get in touch via the Contact page.