Part 6: Transforming Data¶
This notebook introduces common data transformation patterns in pandas:
Sorting rows and Series.
Summarizing data with aggregation functions.
Grouping data and computing per-group statistics.
Combining multiple metrics into result tables.
This is useful because:
Sorting lets you order your data to highlight top values, see trends, or restore a logical sequence.
Most real-world analysis centers on grouping observations and computing aggregate statistics (mean, median, counts, etc.) rather than peeking at individual rows.
1. Sorting (small DataFrame)¶
Main functions in this section:
sort_values(by=...): sort rows by one or more columns.
sort_index(): sort rows by index labels.
Series.sort_values(): sort values within one column.
import pandas as pd
people = {
"first": ["Alice", "Bob", "Carol", "Sarah", "Tony"],
"last": ["Smith", "Jones", "Lee", "Jones", "Stark"],
"email": [
"alice@example.com",
"bob@example.com",
"carol@example.com",
"sarah@example.com",
"tony@example.com"
],
"uid": ["AS100293", "BJ240806", "CL150510", "SJ251203", "TS010100"],
"year": [1993, 2006, 2010, 2003, 2000],
"age": [32, 19, 15, 22, 25],
}
df_small = pd.DataFrame(people)

# 1a. Sort by a single column (year ascending by default)
df_small = df_small.sort_values(by="year")
print("Sorted by year (oldest → youngest):")
df_small

# 1b. Sort descending
df_small = df_small.sort_values(by="year", ascending=False)
print("\nSorted by year descending (youngest → oldest):")
df_small

# 1c. Sort by last name descending, then year ascending
df_small = df_small.sort_values(
by=["last", "year"],
ascending=[False, True]
)
print("\nSorted by last ↓, then year ↑:")
df_small

# 1d. Restore the original index order
df_small = df_small.sort_index()
print("\nRestored original index with sort_index():")
df_small

# 1e. Sorting a Series
print("\nYear series sorted:")
df_small["year"].sort_values()2. Sorting (big DataFrame)¶
The same sorting logic scales to large datasets.
Sort alphabetically by Country.
Sort by Country and then CompTotal for within-country ordering.
nlargest(10, "CompTotal") and nsmallest(10, "CompTotal") for top/bottom values.
df = pd.read_csv("data/survey_results_public.csv")
schema_df = pd.read_csv("data/survey_results_schema.csv", index_col="qname")

# 2a. Sort by Country alphabetically
df = df.sort_values(by="Country")
print("First 50 respondents sorted by Country:")
df.head(50)

# 2b. Sort by Country then by CompTotal (low → high)
df = df.sort_values(by=["Country", "CompTotal"])
print("\nSample sorted by Country then CompTotal:")
df.head(10)

# 2c. Top 10 highest salaries
top10_CompTotal = df.nlargest(10, "CompTotal")
print("\nTop 10 salaries:")
top10_CompTotal[["Country", "CompTotal"]]

# 2d. Bottom 10 lowest salaries
bottom10_CompTotal = df.nsmallest(10, "CompTotal")
print("\nBottom 10 salaries:")
bottom10_CompTotal[["Country", "CompTotal"]]

3. Basic aggregation on one column¶
median() on a Series for a one-variable summary.
df.median(numeric_only=True) for medians across all numeric columns.
This gives robust central-tendency estimates and fast numeric summaries before deeper modeling.
# Median salary across all respondents
print("Global median salary:", df["CompTotal"].median())
# What if you call df.median()?
print("\nAll numeric-column medians:")
df.median(numeric_only=True)

Note: .median() ignores NaNs. Calling df.median(numeric_only=True) returns a Series of medians for every numeric column.
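The NaN-skipping behaviour can be checked on a small self-contained sketch (toy values, not from the survey):

```python
import pandas as pd

# A toy Series with one missing value
s = pd.Series([10.0, None, 30.0, 20.0])

# The NaN is skipped, so the median is taken over [10, 20, 30]
med = s.median()
print(med)  # 20.0
```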
4. Quick summary with .describe()¶
describe() returns compact descriptive statistics.
For numeric columns, typical outputs include:
count, mean, std, min, quartiles (25%, 50%, 75%), max.
For categorical/text columns (if requested), it summarizes frequencies.
It gives a first-pass data quality check and distribution overview.
# Quick summary with `.describe()` on the entire DataFrame
df.describe()

# Quick summary of the "CompTotal" column
df["CompTotal"].describe()

count: number of non-missing values
mean: average (sensitive to outliers)
50% quantile = median
std, min, max, and other percentiles
count() vs value_counts()¶
count() gives the number of non-missing entries. value_counts() gives the frequency of each distinct value.
They answer different questions:
“How much data is present?” (count)
“How is it distributed across categories?” (value_counts)
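On a toy Series (illustrative values, not survey data) the difference looks like this:

```python
import pandas as pd

# A toy Series with a missing value and repeated categories
colors = pd.Series(["red", "blue", "red", None, "blue", "red"])

# count(): how many non-missing entries are present
n_present = colors.count()

# value_counts(): how often each distinct value occurs (NaN excluded by default)
freq = colors.value_counts()

print(n_present)    # 5
print(freq["red"])  # 3
```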
# Compare count vs. value_counts
print("Non-null Salary count:", df["CompTotal"].count())
print("Value counts of a categorical column (e.g. Age):")
print(df["Age"].value_counts().head())
# Example: AISelect usage
print("\nAISelect question (code lookup):")
print(schema_df.loc["AISelect", "question"])
print("\nResponse counts for AISelect:")
print(df["AISelect"].value_counts(dropna=False))
print("\nRelative frequencies of AISelect:")
print(df["AISelect"].value_counts(normalize=True).head())5. GroupBy by country¶
df.groupby("Country") creates a grouped view of the data, where each group contains rows for one country.
# Quick alternative to get country counts:
df["Country"].value_counts().head()# The GroupBy object
country_grp = df.groupby("Country")
# You can pull out a single country’s subframe:
swiss = country_grp.get_group("Switzerland")
print("\nFirst rows for Switzerland:")
swiss.head(3)

# This is equivalent to a mask:
mask = df["Country"] == "Switzerland"
print("\nSame via boolean mask:")
df.loc[mask].head(3)

6. Aggregating within groups¶
6.1 Value counts of AISelect by country¶
groupby(...)[column].value_counts() computes category frequencies inside each group.
Example: AI usage responses per country.
# AISelect counts per country (multi-indexed Series)
ai_by_ctry = country_grp["AISelect"].value_counts()
ai_by_ctry.head(10)

# To select Switzerland’s AISelect breakdown:
print("\nAISelect counts in Switzerland:")
ai_by_ctry.loc["Switzerland"]

6.2 Median salary per country¶
med_salary_by_ctry = country_grp["CompTotal"].median()
print("\nMedian salary by country:")
print(med_salary_by_ctry.head(10))
# Specific country:
print("\nSwitzerland median salary:", med_salary_by_ctry.loc["Switzerland"])6.3 Multiple aggregations with .agg()¶
groupby(...)[col].median() computes one metric per group.
.agg(["mean", "median"]) computes several metrics in one pass.
# Compute both mean and median salary by country
salary_stats = country_grp["CompTotal"].agg(["mean", "median"]).sort_values("median", ascending=False)
print("\nSalary mean & median by country:")
print(salary_stats.head(10))

6.4 Grouping Gotcha: Summing Boolean Masks by Group¶
If you build a mask and apply the string method on the resulting Series:

# Mask for Python users in India
filt = df["Country"] == "India"
df.loc[filt, "LanguageHaveWorkedWith"].str.contains("Python", na=False)

you get a Series of True/False for India, and can do:

df.loc[filt, "LanguageHaveWorkedWith"].str.contains("Python", na=False).sum()

to count them. But if you try the same thing directly on the GroupBy:

country_grp["LanguageHaveWorkedWith"].str.contains("Python", na=False).sum()  # IT RETURNS AN ERROR

you’ll get an AttributeError, because country_grp["..."] is a grouped (SeriesGroupBy) object, not a Series. The fix is to run the string operation inside each group with apply():

country_grp["LanguageHaveWorkedWith"].apply(lambda s: s.str.contains("Python", na=False).sum())

Together Exercises¶
How do you compute, for each country, the percentage of respondents who know Python?
#### YOUR CODE HERE ####

Exercises for Part 6¶
Exercise 6.1¶
For each country in the list ["Switzerland", "Germany", "United States of America", "China", "France", "Italy"], compute:
The percentage of respondents that answered “Yes” to AISelect.
The median Salary.
Present the results in a DataFrame with columns:
Country | PctAIUsers | AvgSalary.
Exercise 6.2¶
Step 1
Compute the mean and median Salary for each country.
Store the result in a DataFrame with columns:
Country | MeanSalary | MedianSalary | NRespondents.
Step 2
Restrict to the top 10 countries by NRespondents.
Reset the index so Country is a column.
Exercise 6.3¶
Step 1
Using the existing "Age" column (e.g. "25-34 years old"), group respondents by bracket.
For each bracket compute:
AvgSalary (mean Salary), PctKnowsPython (percentage who know Python, based on your boolean column).
Create a DataFrame:
AgeBracket | AvgSalary | PctKnowsPython | NRespondents
Step 2
Count how many respondents fall into each "Age" bracket.
Solutions¶
#### YOUR CODE HERE ####