Part 4: Filtering — Using Conditionals to Filter Rows and Columns¶
This notebook introduces filtering: selecting only rows (and columns) that match conditions.
Main ideas:
Build a boolean mask (
True/Falseper row).Use that mask to filter DataFrames.
Combine conditions with AND, OR, and NOT.
Filter large real-world data (
CompTotal,Country,LanguageHaveWorkedWith).Use string-based filters with
.str.contains(...).
Filtering is one of the most common operations in data analysis, reporting, and model preparation.
1. Boolean masking¶
A boolean mask is a Series of True/False values aligned to DataFrame rows.
We start by creating a new small DataFrame with an extra person who shares a last name (Jones) so we can demonstrate filtering:
import pandas as pd
people = {
"first": ["Alice", "Bob", "Carol", "Sarah"],
"last": ["Smith", "Jones", "Lee", "Jones"],
"email": [
"alice@example.com",
"bob@example.com",
"carol@example.com",
"sarah@example.com",
],
"uid": ["AS100293", "BJ240806", "CL150510", "SJ251203"],
"year": [1993, 2006, 2010, 2003],
"age": [32, 19, 15, 22],
}
df_small = pd.DataFrame(people)Basic filter: single condition¶
# Create a boolean Series where last name is "Jones"
mask_jones = (df_small["last"] == "Jones")
print(mask_jones) # True/False per row
# Apply the mask to get only people with last name Jones
print("People with last name Jones:")
df_small[mask_jones]# You can also do it inline:
df_small[df_small["last"] == "Jones"]
# Or with `.loc`:
df_small.loc[df_small["last"] == "Jones"]df[mask]: quick shorthand for row filtering.df.loc[mask, columns]: explicit row+column selection.
Combining conditions: AND / OR¶
To filter by conditions, we use pandas operators:
&for AND|for OR~for NOT
Always wrap each comparison in parentheses, for example:
(df["last"] == "Jones") & (df["first"] == "Bob")
# AND: last name Jones and first name Bob
both_conditions = df_small[(df_small["last"] == "Jones") & (df_small["first"] == "Bob")]
print("Last name Jones AND first name Bob:")
both_conditions# OR: last name Jones OR year before 2010
or_condition = df_small[(df_small["last"] == "Jones") | (df_small["year"] == 2010)]
print("Last name Jones OR born before 2005:")
or_conditionImportant: Use
&for logical AND and|for OR. Don’t use the Pythonand/orhere, they don’t work elementwise on Series.
Negation¶
Negation is performed with the ~ operator.
~condition flips True to False and False to True.
# Opposite of last name Jones
not_jones = df_small[~(df_small["last"] == "Jones")]
print("People whose last name is NOT Jones:")
not_jones2. Filtering the big Stack Overflow survey DataFrame¶
Let’s try to apply some masks on a real dataframe.
High salary filter¶
df = pd.read_csv("data/survey_results_public.csv")# Filter respondents with total compensation greater than 70,000
high_salary = df["CompTotal"] > 70000
print("High earners (CompTotal > 70000) with selected columns:")
df.loc[high_salary, ["Country", "LanguageHaveWorkedWith", "CompTotal"]]Filtering by a list of countries¶
countries_of_interest = ["Switzerland", "United States of America", "Germany", "India", "Canada"]
print(f"Respondents from {countries_of_interest}:")
country_mask = df["Country"].isin(countries_of_interest)
df.loc[country_mask, "Country"]You can combine filters, for example high salary and specific countries:
print("High earners in selected countries:")
df.loc[high_salary & country_mask, ["Country", "LanguageHaveWorkedWith", "CompTotal"]]3. String-based filtering¶
Many columns (e.g., languages worked with) are stored as semicolon-separated strings. You cannot use == to check membership; use .str.contains. Handle missing values safely.
# First, create a mask for knowing Python; protect against NaNs by using na=False
knows_python = df["LanguageHaveWorkedWith"].str.contains("Python", na=False)
# Show some respondents who know Python
print("Respondents who use Python:")
df.loc[knows_python, "LanguageHaveWorkedWith"]Exercises for Part 4¶
Excercise 1:
Filter all respondents in the 35–54 range using the
"Age"column.Report how many respondents are in that age range.
Filter by people in the 35–54 range and from Switzerland.
Excercise 2:
Filter for people that have worked with the Python language but now with the Java language
Solutions¶
#### YOUR CODE HERE ####