Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Part 3: Indexes, Lookup by Key, and Sorting

Part 3: Indexes, Lookup by Key, and Sorting

This notebook introduces indexing, which is the foundation for selecting and filtering rows.

By the end, you should be able to:

  • Turn a regular column into an index.

  • Understand the difference between label-based and position-based row access.

  • Restore the default index when needed.

  • Load CSV files with a meaningful index directly.

  • Use a schema table to decode column names.

  • Sort data by index in ascending and descending order.

1. Adding an integer identifier & using a unique uid as the index

We’ll start with the small DataFrame from before (df_small) which already has a unique uid column (e.g., AS150190). We’ll also add a simple integer identifier, then set uid as the index.

import pandas as pd

people = pd.DataFrame({
    "first": ["Alice", "Bob", "Carol"],
    "last": ["Smith", "Jones", "Lee"],
    "email": ["alice@example.com", "bob@example.com", "carol@example.com"],
    "uid": ["AS100293", "BJ240806", "CL150510"],
})

# Build DataFrame
df_small = pd.DataFrame(people)
# Show current frame
print("Before setting index:")
df_small

set_index() moves one column into the DataFrame index.

  • It replaces the default 0,1,2,... index with values from uid.

  • In this way we can set a more meaningful column as index.

  • inplace=True applies the transformation to the dataset directly without creating a new variable.

# Set uid as the index
df_small.set_index("uid", inplace=True)

# Inspect the index
print("\nAfter setting uid as index:")
print("\nIndex object:", df_small.index)
df_small

Key points:

  • set_index("uid", inplace=True) makes the uid column become the DataFrame’s index, replacing the default integer index.

  • You can inspect what the current index is via df.index.


2. Accessing rows via .loc vs .iloc after changing the index

  • .loc[...]: uses index labels (here: uid strings such as "BJ240806").

  • .iloc[...]: uses integer positions (0, 1, 2, ...).

Examples in this notebook:

  • df_small.loc["BJ240806"] means “row with index label BJ240806”.

  • df_small.iloc[0] means “first row by position”.

If you try df_small.loc[0] after setting uid as index, pandas raises a KeyError because label 0 does not exist.

# Example: access by uid with .loc
print("Row by uid via .loc:")
print(df_small.loc['BJ240806'])

# .iloc by position still works (e.g., first row)
print("\nRow by position via .iloc:")
print(df_small.iloc[0])

3. Resetting the index if it was modified accidentally

If you ever want to revert and bring the index back as a column:

# Reset index back to default integer index
df_small = df_small.reset_index()
print("After reset_index:")
df_small
  • The index is back to the default RangeIndex (0..n-1).

  • uid is again a normal column.

This is useful when you changed the index earlier and now want position-based workflows again.

4. Setting the index on the big survey DataFrame on load

When reading the large Stack Overflow survey results, we can directly set a meaningful unique identifier as the index. The column ResponseId uniquely identifies each respondent.

# Read with ResponseId as index right away
df = pd.read_csv('data/survey_results_public.csv', index_col="ResponseId")

# Quick check
print("Index name and sample:")
print(df.index.name, "— first 5 indices:")
print(df.index[:5])

5. Real-world schema lookup example

Suppose you want to understand what the question code AISelect means in the schema, and you have schema_df loaded from the schema CSV.

First, set its index to the question name field (commonly qname), so lookups are simple:

# Load schema with index set to question names
schema_df = pd.read_csv('data/survey_results_schema.csv', index_col="qname")

# Lookup the full entry for 'AISelect'
print("Schema entry for AISelect:")
schema_df.loc["AISelect"]
# Another code we may not know: "Check"
print("\nSchema entry for 'Check':")
schema_df.loc["Check"]
# If the question text is truncated in display, access the full field explicitly
print("\nFull question text for 'Check':")
print(schema_df.loc["Check", "question"])

Why this helps: By setting qname as the index, you can directly do schema_df.loc[...] to resolve what each coded column in the main survey means, making interpretation and downstream labeling far easier.


6. Sorting alphabetically (ascending and descending) in place

You can sort a DataFrame by an axis (e.g., index or a column). Here are examples:

# Sort schema_df by its index (qname) alphabetically ascending
schema_df.sort_index(ascending=True, inplace=True)
print("First few qnames after ascending sort:")
print(schema_df.index[:5])

# Sort schema_df by index descending
schema_df.sort_index(ascending=False, inplace=True)
print("\nFirst few qnames after descending sort:")
print(schema_df.index[:5])

Note: inplace=True modifies the DataFrame directly. Newer pandas practice often prefers assignment (e.g., schema_df = schema_df.sort_index(...)) because some inplace behaviors are being discouraged for clarity.

Exercise for Part 3

  1. Sort the people in df_small by date of birth so that the oldest person appears first and the youngest last (i.e., ascending by date).

  2. After sorting, print:

    • The first row (should be the oldest person).

    • The last row (youngest person).

Solution

#### YOUR CODE HERE ####