
Part 6: Transforming Data

This notebook introduces common data transformation patterns in pandas:

  • Sorting rows and Series.

  • Summarizing data with aggregation functions.

  • Grouping data and computing per-group statistics.

  • Combining multiple metrics into result tables.

This is useful because:

  • Sorting lets you order your data to highlight top values, see trends, or restore a logical sequence.

  • Most real-world analysis centers on grouping observations and computing aggregate statistics (mean, median, counts, etc.) rather than peeking at individual rows.

1. Sorting (small DataFrame)

Main functions in this section:

  • sort_values(by=...): sort rows by one or more columns.

  • sort_index(): sort rows by index labels.

  • Series.sort_values(): sort values within one column.

import pandas as pd

people = {
    "first": ["Alice", "Bob", "Carol", "Sarah", "Tony"],
    "last": ["Smith", "Jones", "Lee", "Jones", "Stark"],
    "email": [
        "alice@example.com",
        "bob@example.com",
        "carol@example.com",
        "sarah@example.com",
        "tony@example.com"
    ],
    "uid": ["AS100293", "BJ240806", "CL150510", "SJ251203", "TS010100"],
    "year": [1993, 2006, 2010, 2003, 2000],
    "age": [32, 19, 15, 22, 25],
}
df_small = pd.DataFrame(people)
# 1a. Sort by a single column (year ascending by default)
df_small = df_small.sort_values(by="year")
print("Sorted by year (oldest → youngest):")
df_small
# 1b. Sort descending
df_small = df_small.sort_values(by="year", ascending=False)
print("\nSorted by year descending (youngest → oldest):")
df_small
# 1c. Sort by last name descending, then year ascending
df_small = df_small.sort_values(
    by=["last", "year"],
    ascending=[False, True]
)
print("\nSorted by last ↓, then year ↑:")
df_small
# 1d. Restore the original index order
df_small = df_small.sort_index()
print("\nRestored original index with sort_index():")
df_small
# 1e. Sorting a Series
print("\nYear series sorted:")
df_small["year"].sort_values()
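Two more sort_values() options are worth knowing: key= transforms the values before comparing, and na_position controls where missing values land. A minimal sketch on a hypothetical toy frame (not df_small):

```python
import pandas as pd

# Hypothetical toy frame (not df_small) to illustrate two extra options.
toy = pd.DataFrame({
    "name": ["bob", "Alice", "carol"],
    "score": [88.0, None, 95.0],
})

# key= applies a function to the column before comparing,
# making this sort case-insensitive.
by_name = toy.sort_values(by="name", key=lambda s: s.str.lower())
print(by_name["name"].tolist())  # ['Alice', 'bob', 'carol']

# na_position="first" puts missing values at the top instead of the bottom.
by_score = toy.sort_values(by="score", na_position="first")
print(by_score["score"].tolist())
```

Note that key= requires pandas 1.1 or later.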

2. Sorting (big DataFrame)

The same sorting logic scales to large datasets.

  • Sort alphabetically by Country.

  • Sort by Country and then CompTotal for within-country ordering.

  • nlargest(10, "CompTotal") and nsmallest(10, "CompTotal") for top/bottom values.

df = pd.read_csv("data/survey_results_public.csv")
schema_df = pd.read_csv("data/survey_results_schema.csv", index_col="qname")
# 2a. Sort by Country alphabetically
df = df.sort_values(by="Country")
print("First 50 respondents sorted by Country:")
df.head(50)
# 2b. Sort by Country then by CompTotal (low → high)
df = df.sort_values(by=["Country", "CompTotal"])
print("\nSample sorted by Country then CompTotal:")
df.head(10)
# 2c. Top 10 highest salaries
top10_CompTotal = df.nlargest(10, "CompTotal")
print("\nTop 10 salaries:")
top10_CompTotal[["Country", "CompTotal"]]
# 2d. Bottom 10 lowest salaries
bottom10_CompTotal = df.nsmallest(10, "CompTotal")
print("\nBottom 10 salaries:")
bottom10_CompTotal[["Country", "CompTotal"]]

3. Basic aggregation on one column

  • median() on a Series for one-variable summary.

  • df.median(numeric_only=True) for medians across all numeric columns.

This gives robust central-tendency estimates and fast numeric summaries before deeper modeling.

# Median salary across all respondents
print("Global median salary:", df["CompTotal"].median())

# What if you call df.median()?
print("\nAll numeric-column medians:")
df.median(numeric_only=True)

Note: .median() ignores NaNs. Calling df.median(numeric_only=True) returns a Series of medians for every numeric column.
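A quick sketch of that NaN behavior on a toy Series:

```python
import pandas as pd

# .median() skips missing values by default (skipna=True),
# and .count() reports only the non-null entries.
s = pd.Series([10, 20, None, 30])
print(s.median())  # 20.0 — the median of the three non-null values
print(s.count())   # 3
```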

4. Quick summary with .describe()

describe() returns compact descriptive statistics.

For numeric columns, typical outputs include:

  • count, mean, std, min, quartiles (25%, 50%, 75%), max.

For categorical/text columns (if requested), it summarizes frequencies.

It gives a first-pass data quality check and distribution overview.

# Quick summary with `.describe()` on the entire DataFrame
df.describe()
# Quick summary of the "CompTotal" column
df["CompTotal"].describe()
  • count: number of non-missing values

  • mean: average (sensitive to outliers)

  • 50% quantile = median

  • std, min, max, and other percentiles
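If the default quartiles are not enough, describe() also accepts a percentiles argument; a small sketch on toy data (not the survey):

```python
import pandas as pd

# Toy Series: ask describe() for custom percentiles.
s = pd.Series(range(1, 101))
summary = s.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```

The 50% row is always included, since it is the median.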

count() vs value_counts()

  • count() gives number of non-missing entries.

  • value_counts() gives frequency of each distinct value.

They answer different questions:

  • “How much data is present?” (count)

  • “How is it distributed across categories?” (value_counts)

# Compare count vs. value_counts
print("Non-null CompTotal count:", df["CompTotal"].count())
print("Value counts of a categorical column (e.g. Age):")
print(df["Age"].value_counts().head())

# Example: AISelect usage
print("\nAISelect question (code lookup):")
print(schema_df.loc["AISelect", "question"])

print("\nResponse counts for AISelect:")
print(df["AISelect"].value_counts(dropna=False))

print("\nRelative frequencies of AISelect:")
print(df["AISelect"].value_counts(normalize=True).head())

5. GroupBy by country

df.groupby("Country") creates a grouped view of the data, where each group contains rows for one country.

# Quick alternative to get country counts:
df["Country"].value_counts().head()
# The GroupBy object
country_grp = df.groupby("Country")

# You can pull out a single country’s subframe:
swiss = country_grp.get_group("Switzerland")
print("\nFirst rows for Switzerland:")
swiss.head(3)
# This is equivalent to a mask:
mask = df["Country"] == "Switzerland"
print("\nSame via boolean mask:")
df.loc[mask].head(3)

6. Aggregating within groups

6.1 Value counts of AISelect by country

groupby(...)[column].value_counts() computes category frequencies inside each group.

Example: AI usage responses per country.

# AISelect counts per country (multi-indexed Series)
ai_by_ctry = country_grp["AISelect"].value_counts()
ai_by_ctry.head(10)
# To select Switzerland’s AISelect breakdown:
print("\nAISelect counts in Switzerland:")
ai_by_ctry.loc["Switzerland"]

6.2 Median salary per country

med_salary_by_ctry = country_grp["CompTotal"].median()
print("\nMedian salary by country:")
print(med_salary_by_ctry.head(10))

# Specific country:
print("\nSwitzerland median salary:", med_salary_by_ctry.loc["Switzerland"])

6.3 Multiple aggregations with .agg()

  • groupby(...)[col].median() computes one metric per group.

  • .agg(["mean", "median"]) computes several metrics in one pass.

# Compute both mean and median salary by country
salary_stats = country_grp["CompTotal"].agg(["mean", "median"]).sort_values("median", ascending=False)
print("\nSalary mean & median by country:")
print(salary_stats.head(10))
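.agg() also supports named aggregation, where each output column gets an explicit name. A minimal sketch on hypothetical toy data (not the survey):

```python
import pandas as pd

# Toy frame standing in for the survey data.
toy = pd.DataFrame({
    "country": ["CH", "CH", "DE", "DE", "DE"],
    "salary": [100, 120, 60, 80, 70],
})

# Keyword arguments name the output columns directly.
stats = toy.groupby("country")["salary"].agg(
    mean_salary="mean",
    median_salary="median",
)
print(stats)
```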

6.4 Grouping Gotcha: Summing Boolean Masks by Group

# Mask for Python users in India
filt = df["Country"] == "India"
df.loc[filt, "LanguageHaveWorkedWith"].str.contains("Python", na=False)

you get a Series of True/False values for respondents in India, and you can count the True entries with:

(df.loc[filt, "LanguageHaveWorkedWith"].str.contains("Python", na=False)).sum()

to count them. But if you try directly on the GroupBy:

country_grp["LanguageHaveWorkedWith"].str.contains("Python", na=False).sum() # IT RETURNS AN ERROR

you’ll get an AttributeError, because country_grp["..."] is a grouped object, not a Series.

The fix is to run the string operation inside each group with .apply():

country_grp["LanguageHaveWorkedWith"].apply(lambda s: s.str.contains("Python", na=False).sum())
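The same pattern on a toy frame (hypothetical data, not the survey) shows the whole flow end-to-end:

```python
import pandas as pd

# Toy stand-in for the survey columns.
toy = pd.DataFrame({
    "country": ["IN", "IN", "CH", "CH", "CH"],
    "langs": ["Python;SQL", "Java", "Python", None, "C++"],
})

# .apply() runs the string test on each group's Series,
# then sums the booleans to count matches per group.
py_users = toy.groupby("country")["langs"].apply(
    lambda s: s.str.contains("Python", na=False).sum()
)
print(py_users)
```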

Together Exercises

How do you compute, for each country, the percentage of respondents who know Python?

#### YOUR CODE HERE ####

Exercises for Part 6

Exercise 6.1

  • For each country in the list ["Switzerland", "Germany", "United States of America", "China", "France", "Italy"], compute:

    • The percentage of respondents that answered “Yes” to AISelect.

    • The median Salary.

  • Present the results in a DataFrame with columns:
Country | PctAIUsers | MedianSalary.

Exercise 6.2

  1. Step 1

    • Compute the mean and median Salary, and the number of respondents, for each country.

    • Store the result in a DataFrame with columns: Country | MeanSalary | MedianSalary | NRespondents.

  2. Step 2

    • Restrict to the top 10 countries by NRespondents.

    • Reset the index so Country is a column.

Exercise 6.3

  1. Step 1

    • Using the existing "Age" column (e.g. "25-34 years old"), group respondents by bracket.

    • For each bracket compute:

      • AvgSalary (mean Salary),

      • PctKnowsPython (percentage who know Python, based on your boolean column).

      • Create a DataFrame with columns:
        AgeBracket | AvgSalary | PctKnowsPython | NRespondents

  2. Step 2

    • Count how many respondents fall into each "Age" bracket.

Solutions

#### YOUR CODE HERE ####