Part 3: Indexes, Lookup by Key, and Sorting¶
This notebook introduces indexing, which is the foundation for selecting and filtering rows.
By the end, you should be able to:
Turn a regular column into an index.
Understand the difference between label-based and position-based row access.
Restore the default index when needed.
Load CSV files with a meaningful index directly.
Use a schema table to decode column names.
Sort data by index in ascending and descending order.
1. Adding an integer identifier & using a unique uid as the index¶
We’ll start with the small DataFrame from before (df_small) which already has a unique uid column (e.g., AS150190). We’ll also add a simple integer identifier, then set uid as the index.
import pandas as pd
people = pd.DataFrame({
"first": ["Alice", "Bob", "Carol"],
"last": ["Smith", "Jones", "Lee"],
"email": ["alice@example.com", "bob@example.com", "carol@example.com"],
"uid": ["AS100293", "BJ240806", "CL150510"],
})
# Build DataFrame
df_small = pd.DataFrame(people)# Show current frame
print("Before setting index:")
df_smallset_index() moves one column into the DataFrame index.
It replaces the default
0,1,2,...index with values fromuid.In this way we can set a more meaningful column as index.
inplace=Trueapplies the transformation to the dataset directly without creating a new variable.
# Set uid as the index
df_small.set_index("uid", inplace=True)
# Inspect the index
print("\nAfter setting uid as index:")
print("\nIndex object:", df_small.index)
df_smallKey points:
set_index("uid", inplace=True)makes theuidcolumn become the DataFrame’s index, replacing the default integer index.You can inspect what the current index is via
df.index.
2. Accessing rows via .loc vs .iloc after changing the index¶
.loc[...]: uses index labels (here:uidstrings such as"BJ240806")..iloc[...]: uses integer positions (0,1,2, ...).
Examples in this notebook:
df_small.loc["BJ240806"]means “row with index label BJ240806”.df_small.iloc[0]means “first row by position”.
If you try df_small.loc[0] after setting uid as index, pandas raises a KeyError because label 0 does not exist.
# Example: access by uid with .loc
print("Row by uid via .loc:")
print(df_small.loc['BJ240806'])
# .iloc by position still works (e.g., first row)
print("\nRow by position via .iloc:")
print(df_small.iloc[0])3. Resetting the index if it was modified accidentally¶
If you ever want to revert and bring the index back as a column:
# Reset index back to default integer index
df_small = df_small.reset_index()
print("After reset_index:")
df_smallThe index is back to the default
RangeIndex(0..n-1).uidis again a normal column.
This is useful when you changed the index earlier and now want position-based workflows again.
4. Setting the index on the big survey DataFrame on load¶
When reading the large Stack Overflow survey results, we can directly set a meaningful unique identifier as the index. The column ResponseId uniquely identifies each respondent.
# Read with ResponseId as index right away
df = pd.read_csv('data/survey_results_public.csv', index_col="ResponseId")
# Quick check
print("Index name and sample:")
print(df.index.name, "— first 5 indices:")
print(df.index[:5])5. Real-world schema lookup example¶
Suppose you want to understand what the question code AISelect means in the schema, and you have schema_df loaded from the schema CSV.
First, set its index to the question name field (commonly qname), so lookups are simple:
# Load schema with index set to question names
schema_df = pd.read_csv('data/survey_results_schema.csv', index_col="qname")
# Lookup the full entry for 'AISelect'
print("Schema entry for AISelect:")
schema_df.loc["AISelect"]# Another code we may not know: "Check"
print("\nSchema entry for 'Check':")
schema_df.loc["Check"]# If the question text is truncated in display, access the full field explicitly
print("\nFull question text for 'Check':")
print(schema_df.loc["Check", "question"])Why this helps:
By setting qname as the index, you can directly do schema_df.loc[...] to resolve what each coded column in the main survey means, making interpretation and downstream labeling far easier.
6. Sorting alphabetically (ascending and descending) in place¶
You can sort a DataFrame by an axis (e.g., index or a column). Here are examples:
# Sort schema_df by its index (qname) alphabetically ascending
schema_df.sort_index(ascending=True, inplace=True)
print("First few qnames after ascending sort:")
print(schema_df.index[:5])
# Sort schema_df by index descending
schema_df.sort_index(ascending=False, inplace=True)
print("\nFirst few qnames after descending sort:")
print(schema_df.index[:5])Note: inplace=True modifies the DataFrame directly. Newer pandas practice often prefers assignment (e.g., schema_df = schema_df.sort_index(...)) because some inplace behaviors are being discouraged for clarity.
Exercise for Part 3¶
Sort the people in
df_smallby date of birth so that the oldest person appears first and the youngest last (i.e., ascending by date).After sorting, print:
The first row (should be the oldest person).
The last row (youngest person).
Solution¶
#### YOUR CODE HERE ####