Part 7: Cleaning Data & Casting¶
This notebook introduces practical data cleaning and type casting in pandas.
Main skills:
- Detect and remove missing data with dropna.
- Convert custom placeholders (for example "NA", "Missing") into real NaN.
- Fill missing values with fillna.
- Inspect data types with dtypes.
- Convert text/object columns to numeric with pd.to_numeric.
- Handle missing values directly when importing CSV files.
Reliable analysis depends on clean missing-value handling and correct numeric dtypes.
We’ll create a small DataFrame that purposely contains missing values of different kinds:
- Python None
- np.nan
- Custom placeholders: "NA" and "Missing"
Then we’ll explore dropna, replace, isna, fillna, and type casting.
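As a quick preliminary check (a minimal sketch): pandas treats both None and np.nan as missing once a column is numeric.

```python
import numpy as np
import pandas as pd

# Both None and np.nan are coerced to NaN in a float column
s = pd.Series([1.0, None, np.nan])
print(s.dtype)            # float64
print(s.isna().tolist())  # [False, True, True]
```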
import pandas as pd
import numpy as np
df_small = pd.DataFrame({
    "first": ["Alice", "Bob", np.nan, "Carol", None],
    "last": ["Smith", "Jones", None, "Lee", "Jones"],
    "email": [
        "alice@example.com",
        np.nan,
        None,
        "carol@example.com",
        "Missing",
    ],
    "uid": ["AS100293", "BJ240806", np.nan, "NA", "SJ251203"],
    "year": [1993, 2006, None, np.nan, 2003],
    "age": ["32", "19", np.nan, None, "22"],  # stored as strings on purpose (see section 4)
    "empty_col": [np.nan, np.nan, np.nan, np.nan, np.nan],  # fully missing, used in section 1
})
df_small
1. Dropping missing values with dropna¶
dropna removes rows or columns based on missing-value rules.
Common patterns in this section:
- dropna() or dropna(how="any"): remove rows with at least one missing value.
- dropna(how="all"): remove rows (or columns) only if everything is missing.
- dropna(subset=[...]): require specific columns to be non-missing.
Different analysis tasks need different strictness for missing data.
# Default: drop rows (axis='index') with ANY missing value
df_any = df_small.dropna()
df_any
# Explicitly: rows with ANY missing
df_any2 = df_small.dropna(axis="index", how="any")
df_any2
# Rows with ALL values missing only
df_allrows = df_small.dropna(axis="index", how="all")
df_allrows
# Drop columns where ALL values are missing (our 'empty_col')
df_drop_allcols = df_small.dropna(axis="columns", how="all")
df_drop_allcols
⚠️ If you do dropna(axis="columns", how="any") you’ll likely drop almost every column (because most real columns have at least one missing value). That can leave you with an empty (or nearly empty) DataFrame.
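To see that over-dropping in action, here is a minimal sketch on a toy DataFrame (column names invented for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1, np.nan, 3],
    "b": [np.nan, 2, 3],
    "c": [1, 2, 3],  # the only fully populated column
})
# how="any" on columns keeps only columns with no missing values at all
print(toy.dropna(axis="columns", how="any").columns.tolist())  # ['c']
```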
Using subset=[...] focuses missing-value checks on the most important columns.
Examples:
- Keep only rows with an email.
- Drop rows only if both uid and email are missing.
This lets you avoid over-dropping rows when non-critical columns are missing.
# Drop rows where a specific subset has missing values
df_email_req = df_small.dropna(subset=["email"])
df_email_req
# Multiple required columns: drop only if ALL of the listed columns are missing
df_uid_or_email_req = df_small.dropna(how="all", subset=["uid", "email"])
df_uid_or_email_req
2. Custom missing values → real NaNs¶
Values like "NA" or "Missing" are text, not true missing values.
Once placeholders become NaN, pandas missing-data tools (isna, dropna, fillna) work consistently.
# Replace custom placeholders with real NaNs
df_clean = df_small.replace({"NA": np.nan, "Missing": np.nan})
df_clean
isna() shows where missing values are.
# Where are the NaNs now?
df_clean.isna()
3. Filling missing values with fillna¶
fillna(value) replaces missing values with a chosen value.
# Example: fill missing strings with a flag
df_filled_str = df_clean.fillna("MISSING")
df_filled_str
# For numeric columns, a numeric fill is often more useful (e.g., 0 or -1)
df_filled_num = df_clean.copy()
df_filled_num["year"] = df_filled_num["year"].fillna(-1)
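Instead of filling column by column, fillna also accepts a dict mapping column names to fill values; a small sketch on a toy frame (names are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "name": ["Ann", np.nan],
    "year": [1999, np.nan],
})
# Different fill value per column in a single call
filled = toy.fillna({"name": "UNKNOWN", "year": -1})
print(filled["name"].tolist())  # ['Ann', 'UNKNOWN']
print(filled["year"].tolist())  # [1999.0, -1.0]
```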
df_filled_num
4. Dtypes and why mean() may fail¶
Many numeric-looking columns are stored as object (strings), especially after messy imports.
If dtype is not numeric, operations like mean() may fail or behave incorrectly.
Check first with:
df.dtypes
# Check dtypes (note: attribute, not a method)
df_clean.dtypes
# This will raise a TypeError because 'age' is still strings/NaN, not numbers:
# df_clean["age"].mean()
age contains strings (e.g., "32") and NaNs. You cannot compute a numeric mean on object dtype.
This is why we need casting: converting a value, row, or column from one data type to another.
5. Casting to numeric¶
If the values of a column are not numeric, we cannot use operations like the mean.
- Casting directly to int fails if there are NaNs.
- Use float (or to_numeric) instead, then compute the mean.
Use:
pd.to_numeric(series, errors="coerce")
Valid numbers are converted; invalid values become NaN.
After conversion, numeric stats (mean, median, etc.) work correctly.
# Safe numeric conversion
df_clean["age_num"] = pd.to_numeric(df_clean["age"], errors="coerce") # strings -> numbers, others -> NaN
df_clean.dtypes
# Fails: cannot cast NaNs to int directly
# df_clean["age_num"] = df_clean["age_num"].astype(int)  # ValueError
# Works: floats can hold NaN
df_clean["age_num"] = df_clean["age_num"].astype(float)
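If you want to keep integers while allowing missing values, pandas also provides the nullable Int64 extension dtype (note the capital I); a minimal sketch, separate from df_clean:

```python
import numpy as np
import pandas as pd

s = pd.Series(["32", "19", np.nan])  # object dtype, like the raw 'age' column
# Plain int fails on NaN, but the nullable extension dtype can hold <NA>
s_int = pd.to_numeric(s, errors="coerce").astype("Int64")
print(s_int.dtype)   # Int64
print(s_int.mean())  # 25.5
```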
df_clean.dtypes
# Now you can compute numeric stats
df_clean["age_num"].mean()
6. Handling custom missing values when reading CSVs¶
Suppose we save the dataframe we created earlier with missing values to file:
df_small.to_csv("data/test_df.csv", index=False)
You can normalize placeholders at load time.
pd.read_csv(..., na_values=[...]) tells pandas which strings should be interpreted as missing on load.
na_vals = ["NA", "Missing"]
df_loaded = pd.read_csv("data/test_df.csv", na_values=na_vals)
df_loaded
This will automatically treat those strings as NaN on import.
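A self-contained variant of the same idea, using an in-memory CSV (io.StringIO) so no file on disk is needed:

```python
import io
import pandas as pd

csv_text = "uid,email\nAS1,alice@example.com\nNA,Missing\n"
# na_values turns the listed strings into NaN at load time
df = pd.read_csv(io.StringIO(csv_text), na_values=["NA", "Missing"])
print(df.isna().sum().tolist())  # [1, 1] -> one missing value per column
```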
Exercises for Part 7¶
Exercise 7.1:¶
For all columns in the big dataset, compute the number and percentage of missing values.
Show the top 10 columns by missingness.
Exercise 7.2:¶
- Convert CompTotal (salary) to numeric.
- Drop missing values.
- Filter out rows with a salary greater than 10 million.
- Compute the mean.
- Create a copy with missing CompTotal filled by the global median.
- Compute the mean salary on the filled copy and compare.
- Count how many salaries were missing before the fill.
Solutions¶
#### YOUR CODE HERE ####