Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Part 7: Cleaning Data & Casting

Part 7: Cleaning Data & Casting

This notebook introduces practical data cleaning and type casting in pandas.

Main skills:

  • Detect and remove missing data with dropna.

  • Convert custom placeholders (for example "NA", "Missing") into real NaN.

  • Fill missing values with fillna.

  • Inspect data types with dtypes.

  • Convert text/object columns to numeric with pd.to_numeric.

  • Handle missing values directly when importing CSV files.

Reliable analysis depends on clean missing-value handling and correct numeric dtypes.

We’ll create a small DataFrame that purposely contains missing values of different kinds:

  • Python None

  • np.nan

  • Custom placeholders: "NA" and "Missing"

Then we’ll explore dropna, replace, isna, fillna, and type casting.

import pandas as pd
import numpy as np

df_small = pd.DataFrame({
    "first": ["Alice", "Bob", np.nan, "Carol", None],
    "last": ["Smith", "Jones", None, "Lee", "Jones"],
    "email": [
        "alice@example.com",
        np.nan,
        None,
        "carol@example.com",
        "Missing",
    ],
    "uid": ["AS100293", "BJ240806", np.nan, "NA", "SJ251203"],
    "year": [1993, 2006, None, np.nan, 2003],
    "age": [32, 19, np.nan, None, 22],
    # "empty_col": [np.nan, np.nan, np.nan, np.nan, np.nan],
})
df_small

1. Dropping missing values with dropna

dropna removes rows or columns based on missing-value rules.

Common patterns in this section:

  • dropna() or dropna(how="any"): remove rows with at least one missing value.

  • dropna(how="all"): remove rows (or columns) only if everything is missing.

  • dropna(subset=[...]): require specific columns to be non-missing.

Different analysis tasks need different strictness for missing data.

# Default: drop rows (axis='index') with ANY missing value
df_any = df_small.dropna()
df_any
# Explicitly: rows with ANY missing
df_any2 = df_small.dropna(axis="index", how="any")
df_any2
# Rows with ALL values missing only
df_allrows = df_small.dropna(axis="index", how="all")
df_allrows
# Drop columns where ALL values are missing (our 'empty_col')
df_drop_allcols = df_small.dropna(axis="columns", how="all")
df_drop_allcols

⚠️ If you do dropna(axis="columns", how="any") you’ll likely drop almost every column (because most real columns have at least one missing). That can leave you with an empty (or nearly empty) DataFrame.

Using subset=[...] focuses missing-value checks on the most important columns.

Examples:

  • Keep only rows with an email.

  • Drop rows only if both uid and email are missing.

This lets you avoid over-dropping rows when non-critical columns are missing.

# Drop rows where a specific subset has missing values
df_email_req = df_small.dropna(subset=["email"])
df_email_req
# Multiple required columns, drop only if ALL of the listed are missing
df_uid_or_email_req = df_small.dropna(how="all", subset=["uid", "email"])
df_uid_or_email_req

2. Custom missing values → real NaNs

Values like "NA" or "Missing" are text, not true missing values.

Once placeholders become NaN, pandas missing-data tools (isna, dropna, fillna) work consistently.

# Replace custom placeholders with real NaNs
df_clean = df_small.replace({"NA": np.nan, "Missing": np.nan})
df_clean

isna() shows where missing values are.

# Where are the NaNs now?
df_clean.isna()

3. Filling missing values with fillna

fillna(value) replaces missing values with a chosen value.

# Example: fill missing strings with a flag
df_filled_str = df_clean.fillna("MISSING")
df_filled_str
# For numeric columns, a numeric fill is often more useful (e.g., 0 or -1)
df_filled_num = df_clean.copy()
df_filled_num["year"] = df_filled_num["year"].fillna(-1)
df_filled_num

4. Dtypes and why mean() may fail

Many numeric-looking columns are stored as object (strings), especially after messy imports.

If dtype is not numeric, operations like mean() may fail or behave incorrectly.

Check first with: df.dtypes

# Check dtypes (note: attribute, not a method)
df_clean.dtypes
# This will raise a TypeError because 'age' is still strings/NaN, not numbers:
# df_clean["age"].mean()

age contains strings (e.g., "32") and NaNs. You cannot compute a numeric mean on object dtype.

This is why we need what is called casting: converting a value, row or column from one variable type to another.

5. Casting to numeric

If the values of a column are not numeric, we cannot use operations like the mean.

  • Casting directly to int fails if there are NaNs.

  • Use float (or to_numeric) instead, then compute the mean.

Use: pd.to_numeric(series, errors="coerce")

  • Valid numbers are converted.

  • Invalid values become NaN.

After conversion, numeric stats (mean, median, etc.) work correctly.

# Safe numeric conversion
df_clean["age_num"] = pd.to_numeric(df_clean["age"], errors="coerce")  # strings -> numbers, others -> NaN
df_clean.dtypes
# Fails: cannot cast NaNs to int directly
# df_cast["age_num"] = df_cast["age_num"].astype(int)   # ValueError

# Works: floats can hold NaN
df_clean["age_num"] = df_clean["age_num"].astype(float)
df_clean.dtypes
# Now you can compute numeric stats
df_clean["age_num"].mean()

6. Handling custom missing values when reading CSVs

Suppose we save the dataframe we created earlier with missing values to file:

df_small.to_csv("data/test_df.csv", index=False)

You can normalize placeholders at load time.

pd.read_csv(..., na_values=[...]) tells pandas which strings should be interpreted as missing on load.

na_vals = ["NA", "Missing"]
df_loaded = pd.read_csv("data/test_df.csv", na_values=na_vals)
df_loaded

This will automatically treat those strings as NaN on import.

Exercises for Part 7

Exercise 7.1:

  1. For all columns in the big dataset, compute the number and percentage of missing values.

  2. Show the top 10 columns by missingness.

Exercise 7.2:

  1. Convert the CompTotal (salary) to numeric.

  2. Drop missing values.

  3. Filter all rows with a salary greater than 10 millions.

  4. Compute the mean.

  5. Create a copy with missing CompTotal filled by the global median.

  6. Compute the mean salary on the filled copy and compare.

  7. Count how many salaries were missing before the fill.

Solutions

#### YOUR CODE HERE ####