Part 7: Cleaning Data & Casting¶
This notebook introduces practical data cleaning and type casting in pandas.
Main skills:
- Detect and remove missing data with dropna.
- Convert custom placeholders (for example "NA", "Missing") into real NaN.
- Fill missing values with fillna.
- Inspect data types with dtypes.
- Convert text/object columns to numeric with pd.to_numeric.
- Handle missing values directly when importing CSV files.
Reliable analysis depends on clean missing-value handling and correct numeric dtypes.
We’ll create a small DataFrame that purposely contains missing values of different kinds:
- Python None
- np.nan
- Custom placeholders: "NA" and "Missing"
Then we’ll explore dropna, replace, isna, fillna, and type casting.
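As a quick preliminary check (a minimal sketch): pandas treats both None and np.nan as missing once a column is numeric.

```python
import numpy as np
import pandas as pd

# Both None and np.nan are coerced to NaN in a float column
s = pd.Series([1.0, None, np.nan])
print(s.dtype)            # float64
print(s.isna().tolist())  # [False, True, True]
```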
import pandas as pd
import numpy as np
df_small = pd.DataFrame({
    "first": ["Alice", "Bob", np.nan, "Carol", None],
    "last": ["Smith", "Jones", None, "Lee", "Jones"],
    "email": [
        "alice@example.com",
        np.nan,
        None,
        "carol@example.com",
        "Missing",
    ],
    "uid": ["AS100293", "BJ240806", np.nan, "NA", "SJ251203"],
    "year": [1993, 2006, None, np.nan, 2003],
    "age": ["32", "19", np.nan, None, "22"],  # stored as strings on purpose (see section 4)
    "empty_col": [np.nan, np.nan, np.nan, np.nan, np.nan],  # fully missing, used in section 1
})
df_small
1. Dropping missing values with dropna¶
dropna removes rows or columns based on missing-value rules.
Common patterns in this section:
- dropna() or dropna(how="any"): remove rows with at least one missing value.
- dropna(how="all"): remove rows (or columns) only if everything is missing.
- dropna(subset=[...]): require specific columns to be non-missing.
Different analysis tasks need different strictness for missing data.
# Default: drop rows (axis='index') with ANY missing value
df_any = df_small.dropna()
df_any
# Explicitly: rows with ANY missing
df_any2 = df_small.dropna(axis="index", how="any")
df_any2
# Rows with ALL values missing only
df_allrows = df_small.dropna(axis="index", how="all")
df_allrows
# Drop columns where ALL values are missing (our 'empty_col')
df_drop_allcols = df_small.dropna(axis="columns", how="all")
df_drop_allcols
⚠️ If you do dropna(axis="columns", how="any") you’ll likely drop almost every column (because most real columns have at least one missing value). That can leave you with an empty (or nearly empty) DataFrame.
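To see that over-dropping in action, here is a minimal sketch on a toy DataFrame (column names invented for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1, np.nan, 3],
    "b": [np.nan, 2, 3],
    "c": [1, 2, 3],  # the only fully populated column
})
# how="any" on columns keeps only columns with no missing values at all
print(toy.dropna(axis="columns", how="any").columns.tolist())  # ['c']
```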
Using subset=[...] focuses missing-value checks on the most important columns.
Examples:
- Keep only rows with an email.
- Drop rows only if both uid and email are missing.
This lets you avoid over-dropping rows when non-critical columns are missing.
# Drop rows where a specific subset has missing values
df_email_req = df_small.dropna(subset=["email"])
df_email_req
# Multiple required columns: drop only if ALL of the listed columns are missing
df_uid_or_email_req = df_small.dropna(how="all", subset=["uid", "email"])
df_uid_or_email_req
2. Custom missing values → real NaNs¶
Values like "NA" or "Missing" are text, not true missing values.
Once placeholders become NaN, pandas missing-data tools (isna, dropna, fillna) work consistently.
# Replace custom placeholders with real NaNs
df_clean = df_small.replace({"NA": np.nan, "Missing": np.nan})
df_clean
isna() shows where missing values are.
# Where are the NaNs now?
df_clean.isna()
3. Filling missing values with fillna¶
fillna(value) replaces missing values with a chosen value.
# Example: fill missing strings with a flag
df_filled_str = df_clean.fillna("MISSING")
df_filled_str
# For numeric columns, a numeric fill is often more useful (e.g., 0 or -1)
df_filled_num = df_clean.copy()
df_filled_num["year"] = df_filled_num["year"].fillna(-1)
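Instead of filling column by column, fillna also accepts a dict mapping column names to fill values; a small sketch on a toy frame (names are illustrative):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "name": ["Ann", np.nan],
    "year": [1999, np.nan],
})
# Different fill value per column in a single call
filled = toy.fillna({"name": "UNKNOWN", "year": -1})
print(filled["name"].tolist())  # ['Ann', 'UNKNOWN']
print(filled["year"].tolist())  # [1999.0, -1.0]
```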
df_filled_num
4. Dtypes and why mean() may fail¶
Many numeric-looking columns are stored as object (strings), especially after messy imports.
If dtype is not numeric, operations like mean() may fail or behave incorrectly.
Check first with:
df.dtypes
# Check dtypes (note: attribute, not a method)
df_clean.dtypes
# This will raise a TypeError because 'age' is still strings/NaN, not numbers:
# df_clean["age"].mean()
age contains strings (e.g., "32") and NaNs. You cannot compute a numeric mean on object dtype.
This is why we need casting: converting a value, row, or column from one data type to another.
5. Casting to numeric¶
If the values of a column are not numeric, we cannot use operations like the mean.
- Casting directly to int fails if there are NaNs.
- Use float (or to_numeric) instead, then compute the mean.
Use:
pd.to_numeric(series, errors="coerce")
Valid numbers are converted; invalid values become NaN.
After conversion, numeric stats (mean, median, etc.) work correctly.
# Safe numeric conversion
df_clean["age_num"] = pd.to_numeric(df_clean["age"], errors="coerce") # strings -> numbers, others -> NaN
df_clean.dtypes
# Fails: cannot cast NaNs to int directly
# df_clean["age_num"] = df_clean["age_num"].astype(int)  # ValueError
# Works: floats can hold NaN
df_clean["age_num"] = df_clean["age_num"].astype(float)
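If you want to keep integers while allowing missing values, pandas also provides the nullable Int64 extension dtype (note the capital I); a minimal sketch, separate from df_clean:

```python
import numpy as np
import pandas as pd

s = pd.Series(["32", "19", np.nan])  # object dtype, like the raw 'age' column
# Plain int fails on NaN, but the nullable extension dtype can hold <NA>
s_int = pd.to_numeric(s, errors="coerce").astype("Int64")
print(s_int.dtype)   # Int64
print(s_int.mean())  # 25.5
```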
df_clean.dtypes
# Now you can compute numeric stats
df_clean["age_num"].mean()
6. Handling custom missing values when reading CSVs¶
Suppose we save the dataframe we created earlier with missing values to file:
df_small.to_csv("data/test_df.csv", index=False)
You can normalize placeholders at load time.
pd.read_csv(..., na_values=[...]) tells pandas which strings should be interpreted as missing on load.
na_vals = ["NA", "Missing"]
df_loaded = pd.read_csv("data/test_df.csv", na_values=na_vals)
df_loaded
This will automatically treat those strings as NaN on import.
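A self-contained variant of the same idea, using an in-memory CSV (io.StringIO) so no file on disk is needed:

```python
import io
import pandas as pd

csv_text = "uid,email\nAS1,alice@example.com\nNA,Missing\n"
# na_values turns the listed strings into NaN at load time
df = pd.read_csv(io.StringIO(csv_text), na_values=["NA", "Missing"])
print(df.isna().sum().tolist())  # [1, 1] -> one missing value per column
```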
Exercises for Part 7¶
Exercise 7.1:¶
For all columns in the big dataset, compute the number and percentage of missing values.
Show the top 10 columns by missingness.
Exercise 7.2:¶
- Convert CompTotal (salary) to numeric.
- Drop missing values.
- Filter out rows with a salary greater than 10 million.
- Compute the mean.
- Create a copy with missing CompTotal filled by the global median.
- Compute the mean salary on the filled copy and compare.
- Count how many salaries were missing before the fill.
Solutions¶
#### YOUR CODE HERE ####