
Part 6: Transforming Data

This notebook introduces common data transformation patterns in pandas:

  • Sorting rows and Series.

  • Summarizing data with aggregation functions.

  • Grouping data and computing per-group statistics.

  • Combining multiple metrics into result tables.

This is useful because:

  • Sorting lets you order your data to highlight top values, see trends, or restore a logical sequence.

  • Most real-world analysis centers on grouping observations and computing aggregate statistics (mean, median, counts, etc.) rather than peeking at individual rows.

1. Sorting (small DataFrame)

Main functions in this section:

  • sort_values(by=...): sort rows by one or more columns.

  • sort_index(): sort rows by index labels.

  • Series.sort_values(): sort values within one column.

import pandas as pd

people = {
    "first": ["Alice", "Bob", "Carol", "Sarah", "Tony"],
    "last": ["Smith", "Jones", "Lee", "Jones", "Stark"],
    "email": [
        "alice@example.com",
        "bob@example.com",
        "carol@example.com",
        "sarah@example.com",
        "tony@example.com"
    ],
    "uid": ["AS100293", "BJ240806", "CL150510", "SJ251203", "TS010100"],
    "year": [1993, 2006, 2010, 2003, 2000],
    "age": [32, 19, 15, 22, 25],
}
df_small = pd.DataFrame(people)
# 1a. Sort by a single column (year ascending by default)
df_small = df_small.sort_values(by="year")
print("Sorted by year (oldest → youngest):")
df_small
# 1b. Sort descending
df_small = df_small.sort_values(by="year", ascending=False)
print("\nSorted by year descending (youngest → oldest):")
df_small
# 1c. Sort by last name descending, then year ascending
df_small = df_small.sort_values(
    by=["last", "year"],
    ascending=[False, True]
)
print("\nSorted by last ↓, then year ↑:")
df_small
# 1d. Restore the original index order
df_small = df_small.sort_index()
print("\nRestored original index with sort_index():")
df_small
# 1e. Sorting a Series
print("\nYear series sorted:")
df_small["year"].sort_values()
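Two more sort_values() options are worth knowing: key= transforms the values before comparing, and na_position controls where missing values land. A minimal sketch on a hypothetical toy frame (not df_small):

```python
import pandas as pd

# Hypothetical toy frame (not df_small) to illustrate two extra options.
toy = pd.DataFrame({
    "name": ["bob", "Alice", "carol"],
    "score": [88.0, None, 95.0],
})

# key= applies a function to the column before comparing,
# making this sort case-insensitive.
by_name = toy.sort_values(by="name", key=lambda s: s.str.lower())
print(by_name["name"].tolist())  # ['Alice', 'bob', 'carol']

# na_position="first" puts missing values at the top instead of the bottom.
by_score = toy.sort_values(by="score", na_position="first")
print(by_score["score"].tolist())
```

Note that key= requires pandas 1.1 or later.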

2. Sorting (big DataFrame)

The same sorting logic scales to large datasets.

  • Sort alphabetically by Country.

  • Sort by Country and then CompTotal for within-country ordering.

  • nlargest(10, "CompTotal") and nsmallest(10, "CompTotal") for top/bottom values.

df = pd.read_csv("data/survey_results_public.csv")
schema_df = pd.read_csv("data/survey_results_schema.csv", index_col="qname")
# 2a. Sort by Country alphabetically
df = df.sort_values(by="Country")
print("First 50 respondents sorted by Country:")
df.head(50)
# 2b. Sort by Country then by CompTotal (low → high)
df = df.sort_values(by=["Country", "CompTotal"])
print("\nSample sorted by Country then CompTotal:")
df.head(10)
# 2c. Top 10 highest salaries
top10_CompTotal = df.nlargest(10, "CompTotal")
print("\nTop 10 salaries:")
top10_CompTotal[["Country", "CompTotal"]]
# 2d. Bottom 10 lowest salaries
bottom10_CompTotal = df.nsmallest(10, "CompTotal")
print("\nBottom 10 salaries:")
bottom10_CompTotal[["Country", "CompTotal"]]

3. Basic aggregation on one column

  • median() on a Series for one-variable summary.

  • df.median(numeric_only=True) for medians across all numeric columns.

This gives robust central-tendency estimates and fast numeric summaries before deeper modeling.

# Median salary across all respondents
print("Global median salary:", df["CompTotal"].median())

# What if you call df.median()?
print("\nAll numeric-column medians:")
df.median(numeric_only=True)

Note: .median() ignores NaNs. Calling df.median(numeric_only=True) returns a Series of medians for every numeric column.
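A quick sketch of that NaN behavior on a toy Series:

```python
import pandas as pd

# .median() skips missing values by default (skipna=True),
# and .count() reports only the non-null entries.
s = pd.Series([10, 20, None, 30])
print(s.median())  # 20.0 — the median of the three non-null values
print(s.count())   # 3
```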

4. Quick summary with .describe()

describe() returns compact descriptive statistics.

For numeric columns, typical outputs include:

  • count, mean, std, min, quartiles (25%, 50%, 75%), max.

For categorical/text columns (if requested), it summarizes frequencies.

It gives a first-pass data quality check and distribution overview.

# Quick summary with `.describe()` on the entire DataFrame
df.describe()
# Quick summary of the "CompTotal" column
df["CompTotal"].describe()
  • count: number of non-missing values

  • mean: average (sensitive to outliers)

  • 50% quantile = median

  • std, min, max, and other percentiles
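If the default quartiles are not enough, describe() also accepts a percentiles argument; a small sketch on toy data (not the survey):

```python
import pandas as pd

# Toy Series: ask describe() for custom percentiles.
s = pd.Series(range(1, 101))
summary = s.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```

The 50% row is always included, since it is the median.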

count() vs value_counts()

  • count() gives number of non-missing entries.

  • value_counts() gives frequency of each distinct value.

They answer different questions:

  • “How much data is present?” (count)

  • “How is it distributed across categories?” (value_counts)

# Compare count vs. value_counts
print("Non-null CompTotal count:", df["CompTotal"].count())
print("Value counts of a categorical column (e.g. Age):")
print(df["Age"].value_counts().head())

# Example: AISelect usage
print("\nAISelect question (code lookup):")
print(schema_df.loc["AISelect", "question"])

print("\nResponse counts for AISelect:")
print(df["AISelect"].value_counts(dropna=False))

print("\nRelative frequencies of AISelect:")
print(df["AISelect"].value_counts(normalize=True).head())

5. GroupBy by country

df.groupby("Country") creates a grouped view of the data, where each group contains rows for one country.

# Quick alternative to get country counts:
df["Country"].value_counts().head()
# The GroupBy object
country_grp = df.groupby("Country")

# You can pull out a single country’s subframe:
swiss = country_grp.get_group("Switzerland")
print("\nFirst rows for Switzerland:")
swiss.head(3)
# This is equivalent to a mask:
mask = df["Country"] == "Switzerland"
print("\nSame via boolean mask:")
df.loc[mask].head(3)

6. Aggregating within groups

6.1 Value counts of AISelect by country

groupby(...)[column].value_counts() computes category frequencies inside each group.

Example: AI usage responses per country.

# AISelect counts per country (multi-indexed Series)
ai_by_ctry = country_grp["AISelect"].value_counts()
ai_by_ctry.head(10)
# To select Switzerland’s AISelect breakdown:
print("\nAISelect counts in Switzerland:")
ai_by_ctry.loc["Switzerland"]

6.2 Median salary per country

med_salary_by_ctry = country_grp["CompTotal"].median()
print("\nMedian salary by country:")
print(med_salary_by_ctry.head(10))

# Specific country:
print("\nSwitzerland median salary:", med_salary_by_ctry.loc["Switzerland"])

6.3 Multiple aggregations with .agg()

  • groupby(...)[col].median() computes one metric per group.

  • .agg(["mean", "median"]) computes several metrics in one pass.

# Compute both mean and median salary by country
salary_stats = country_grp["CompTotal"].agg(["mean", "median"]).sort_values("median", ascending=False)
print("\nSalary mean & median by country:")
print(salary_stats.head(10))
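.agg() also supports named aggregation, where each output column gets an explicit name. A minimal sketch on hypothetical toy data (not the survey):

```python
import pandas as pd

# Toy frame standing in for the survey data.
toy = pd.DataFrame({
    "country": ["CH", "CH", "DE", "DE", "DE"],
    "salary": [100, 120, 60, 80, 70],
})

# Keyword arguments name the output columns directly.
stats = toy.groupby("country")["salary"].agg(
    mean_salary="mean",
    median_salary="median",
)
print(stats)
```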

6.4 Grouping Gotcha: Summing Boolean Masks by Group

# Mask for Python users in India
filt = df["Country"] == "India"
df.loc[filt, "LanguageHaveWorkedWith"].str.contains("Python", na=False)

you get a Series of True/False values for respondents in India, and you can count the True entries with:

(df.loc[filt, "LanguageHaveWorkedWith"].str.contains("Python", na=False)).sum()

to count them. But if you try directly on the GroupBy:

country_grp["LanguageHaveWorkedWith"].str.contains("Python", na=False).sum() # IT RETURNS AN ERROR

you’ll get an AttributeError, because country_grp["..."] is a grouped object, not a Series.

The fix is to run the string operation inside each group with .apply():

country_grp["LanguageHaveWorkedWith"].apply(lambda s: s.str.contains("Python", na=False).sum())
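The same pattern on a toy frame (hypothetical data, not the survey) shows the whole flow end-to-end:

```python
import pandas as pd

# Toy stand-in for the survey columns.
toy = pd.DataFrame({
    "country": ["IN", "IN", "CH", "CH", "CH"],
    "langs": ["Python;SQL", "Java", "Python", None, "C++"],
})

# .apply() runs the string test on each group's Series,
# then sums the booleans to count matches per group.
py_users = toy.groupby("country")["langs"].apply(
    lambda s: s.str.contains("Python", na=False).sum()
)
print(py_users)
```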

Together Exercises

How do you compute, for each country, the percentage of respondents who know Python?

#### YOUR CODE HERE ####

Exercises for Part 6

Exercise 6.1

  • For each country in the list ["Switzerland", "Germany", "United States of America", "China", "France", "Italy"], compute:

    • The percentage of respondents that answered “Yes” to AISelect.

    • The median Salary.

  • Present the results in a DataFrame with columns:
Country | PctAIUsers | MedianSalary.

Exercise 6.2

  1. Step 1

    • Compute the mean and median Salary, and the number of respondents, for each country.

    • Store the result in a DataFrame with columns: Country | MeanSalary | MedianSalary | NRespondents.

  2. Step 2

    • Restrict to the top 10 countries by NRespondents.

    • Reset the index so Country is a column.

Exercise 6.3

  1. Step 1

    • Using the existing "Age" column (e.g. "25-34 years old"), group respondents by bracket.

    • For each bracket compute:

      • AvgSalary (mean Salary),

      • PctKnowsPython (percentage who know Python, based on your boolean column).

      • Create a DataFrame with columns:
        AgeBracket | AvgSalary | PctKnowsPython | NRespondents

  2. Step 2

    • Count how many respondents fall into each "Age" bracket.

Solutions

#### YOUR CODE HERE ####