Part 4: Filtering — Using Conditionals to Filter Rows and Columns¶

This notebook introduces filtering: selecting only rows (and columns) that match conditions.

Main ideas:

Build a boolean mask (True/False per row).
Use that mask to filter DataFrames.
Combine conditions with AND, OR, and NOT.
Filter large real-world data (CompTotal, Country, LanguageHaveWorkedWith).
Use string-based filters with .str.contains(...).

Filtering is one of the most common operations in data analysis, reporting, and model preparation.

1. Boolean masking¶

A boolean mask is a Series of True/False values aligned to DataFrame rows.

We start by creating a new small DataFrame with an extra person who shares a last name (Jones) so we can demonstrate filtering:

import pandas as pd

people = {
    "first": ["Alice", "Bob", "Carol", "Sarah"],
    "last": ["Smith", "Jones", "Lee", "Jones"],
    "email": [
        "alice@example.com",
        "bob@example.com",
        "carol@example.com",
        "sarah@example.com",
    ],
    "uid": ["AS100293", "BJ240806", "CL150510", "SJ251203"],
    "year": [1993, 2006, 2010, 2003],
    "age": [32, 19, 15, 22],
}
df_small = pd.DataFrame(people)

Basic filter: single condition¶

# Create a boolean Series where last name is "Jones"
mask_jones = (df_small["last"] == "Jones")
print(mask_jones)  # True/False per row

# Apply the mask to get only people with last name Jones
print("People with last name Jones:")
df_small[mask_jones]

# You can also do it inline:
df_small[df_small["last"] == "Jones"]

# Or with `.loc`:
df_small.loc[df_small["last"] == "Jones"]

df[mask]: quick shorthand for row filtering.
df.loc[mask, columns]: explicit row+column selection.

Combining conditions: AND / OR¶

To filter by conditions, we use pandas operators:

& for AND
| for OR
~ for NOT

Always wrap each comparison in parentheses, for example: (df["last"] == "Jones") & (df["first"] == "Bob")

# AND: last name Jones and first name Bob
both_conditions = df_small[(df_small["last"] == "Jones") & (df_small["first"] == "Bob")]
print("Last name Jones AND first name Bob:")
both_conditions

# OR: last name Jones OR year before 2010
or_condition = df_small[(df_small["last"] == "Jones") | (df_small["year"] == 2010)]
print("Last name Jones OR born before 2005:")
or_condition

Important: Use & for logical AND and | for OR. Don’t use the Python and/or here, they don’t work elementwise on Series.

Negation¶

Negation is performed with the ~ operator.

~condition flips True to False and False to True.

# Opposite of last name Jones
not_jones = df_small[~(df_small["last"] == "Jones")]
print("People whose last name is NOT Jones:")
not_jones

2. Filtering the big Stack Overflow survey DataFrame¶

Let’s try to apply some masks on a real dataframe.

High salary filter¶

df = pd.read_csv("data/survey_results_public.csv")

# Filter respondents with total compensation greater than 70,000
high_salary = df["CompTotal"] > 70000

print("High earners (CompTotal > 70000) with selected columns:")
df.loc[high_salary, ["Country", "LanguageHaveWorkedWith", "CompTotal"]]

Filtering by a list of countries¶

countries_of_interest = ["Switzerland", "United States of America", "Germany", "India", "Canada"]

print(f"Respondents from {countries_of_interest}:")
country_mask = df["Country"].isin(countries_of_interest)
df.loc[country_mask, "Country"]

You can combine filters, for example high salary and specific countries:

print("High earners in selected countries:")
df.loc[high_salary & country_mask, ["Country", "LanguageHaveWorkedWith", "CompTotal"]]

3. String-based filtering¶

Many columns (e.g., languages worked with) are stored as semicolon-separated strings. You cannot use == to check membership; use .str.contains. Handle missing values safely.

# First, create a mask for knowing Python; protect against NaNs by using na=False
knows_python = df["LanguageHaveWorkedWith"].str.contains("Python", na=False)

# Show some respondents who know Python
print("Respondents who use Python:")
df.loc[knows_python, "LanguageHaveWorkedWith"]

Exercises for Part 4¶

Excercise 1:
- Filter all respondents in the 35–54 range using the "Age" column.
- Report how many respondents are in that age range.
- Filter by people in the 35–54 range and from Switzerland.
Excercise 2:
- Filter for people that have worked with the Python language but now with the Java language

Solutions¶

#### YOUR CODE HERE ####