Part 5: Updating Rows and Columns

This notebook shows how to modify DataFrames: renaming columns, updating values, applying transformations, and adding or removing rows. Changing labels and values is how we clean and enrich data; the patterns below cover renaming, mass updates, safe assignment, and applied transformations.

Core goals:

  • Rename columns globally or partially.

  • Update specific rows/cells without chained-assignment issues.

  • Apply string and custom functions to Series/DataFrames.

  • Use map and replace for value recoding.

  • Combine and split columns.

  • Append and remove rows in controlled ways.


1. Renaming columns in df_small

Clean, consistent column names improve readability and reduce downstream bugs.

import pandas as pd

people = {
    "first": ["Alice", "Bob", "Carol", "Sarah"],
    "last": ["Smith", "Jones", "Lee", "Jones"],
    "email": [
        "alice@example.com",
        "bob@example.com",
        "carol@example.com",
        "sarah@example.com",
    ],
    "uid": ["AS100293", "BJ240806", "CL150510", "SJ251203"],
    "year": [1993, 2006, 2010, 2003],
    "age": [32, 19, 15, 22],
}
df_small_orig = pd.DataFrame(people)
# We create a copy that we can freely modify without modifying the original dataframe
df_small = df_small_orig.copy()

# 1a. Directly replace all column names
df_small.columns = ["first_name", "last_name", "email_address", "id_code", "byear", "age_2025"]
print("After assigning new column list:")
df_small.columns
# 1b. Uppercase all column names via list comprehension
df_small.columns = [col.upper() for col in df_small.columns]
print("\nAfter uppercasing column names:")
df_small.columns
# 1c. Modify names more flexibly: replace underscores with spaces
# (the columns still contain underscores from step 1a, now uppercased)
df_small.columns = [col.replace("_", " ") for col in df_small.columns]
print("\nAfter replacing underscores with spaces:")
df_small.columns
# 1d. Rename only some columns (partial) using .rename
df_small = df_small.rename(columns={"FIRST NAME": "first", "LAST NAME": "last"})
print("\nAfter selective rename:")
df_small.columns
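Besides a dict, `.rename` also accepts a function, which is applied to every column label. This is handy for systematic renames; a minimal sketch on a throwaway frame (the column names here are illustrative):

```python
import pandas as pd

demo = pd.DataFrame({"First Name": [1], "Last Name": [2]})

# Pass a function to rename: it is applied to each column label in turn
demo = demo.rename(columns=lambda c: c.lower().replace(" ", "_"))
print(demo.columns.tolist())  # ['first_name', 'last_name']
```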

2. Updating row values safely

These methods perform explicit assignment and avoid ambiguous chained operations.

# Reload the original dataframe
df_small = df_small_orig.copy()

# Update entire row 1 (zero-based index) with new data using .loc
df_small.loc[1] = ["Robert", "Jones", "robert@example.com", "RJ010100", 2000, 25]
print("After replacing all of row 1:")
df_small
# Update only some columns in a row
df_small.loc[2, "email"] = "carol.modified@example.com"
print("\nAfter updating only email for row 2:")
df_small
# Update a single value using .at
df_small.at[0, "last"] = "Smythe"
print("\nAfter changing Alice's last name via .at:")
df_small

Warning:

Avoid chained assignment like this (can produce SettingWithCopyWarning and unreliable behavior):

mask = df_small["last"] == "Jones"
# This is unsafe / may not persist:
df_small[mask]["last"] = "J."

Correct:

df_small.loc[mask, "last"] = "J."
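The same `.loc` pattern extends to updating several columns at once under a condition. A small sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"last": ["Jones", "Lee", "Jones"], "age": [19, 15, 22]})

# One .loc call updates both columns for every row matching the mask;
# the list on the right is broadcast across each matching row
mask = df["last"] == "Jones"
df.loc[mask, ["last", "age"]] = ["J.", 0]
print(df)
```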

3. String methods on Series

df["email"].str.lower() applies lowercase conversion elementwise across the entire column.

Vectorized .str operations are concise, readable, and usually faster than manual loops.

# Uppercase/lowercase transformations safely
df_small["email"] = df_small["email"].str.lower()  # vectorized lowercase
print("\nEmails after lowercasing:")
df_small["email"]

4. apply / applymap / map

Where to use each:

  • Series.apply(func): apply a function to each value in one column.

  • DataFrame.apply(func): apply a function to each column (or row) as a Series.

  • DataFrame.applymap(func): apply a function to every cell (elementwise); deprecated in pandas ≥ 2.1 in favor of DataFrame.map.

  • Series.map(mapping_or_func): map values using a dictionary or function.

These tools are useful because they handle custom transformations when built-in vectorized methods are not enough.

# apply on a Series: length of each email
email_lengths = df_small["email"].apply(len)
print("\nLength of each email (Series.apply):")
email_lengths
# apply with a user-defined function
def shout(email):
    return email.upper()
print("\nEmail shouted via apply:")
df_small["email"].apply(shout)
# with lambda
print("\nEmail first 5 chars using lambda:")
df_small["email"].apply(lambda x: x[:5])
# Difference: df.apply(len) calls len on each column (a Series) -> number of rows per column
print("\nUsing df.apply(len):")
df_small.apply(len)
# applymap applies to every element of the DataFrame
# (deprecated in pandas >= 2.1; use DataFrame.map there instead)
print("\nApplying str.lower to every string cell (applymap):")
df_small.applymap(lambda x: x.lower() if isinstance(x, str) else x)
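DataFrame.apply can also work row-wise with axis=1, where each row is passed to the function as a Series. A minimal sketch with made-up names:

```python
import pandas as pd

df = pd.DataFrame({"first": ["Alice", "Bob"], "last": ["Smith", "Jones"]})

# axis=1 hands each row (as a Series) to the function
df["full"] = df.apply(lambda row: f"{row['first']} {row['last']}", axis=1)
print(df["full"].tolist())  # ['Alice Smith', 'Bob Jones']
```

Note that for simple string concatenation like this, the vectorized `df["first"] + " " + df["last"]` (used in section 7) is faster; axis=1 is for logic that truly needs the whole row.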

5. map (elementwise with a mapping dict)

map vs replace for recoding values

  • map({...}) on a Series recodes matching values; unmatched values become NaN.

  • replace({...}) changes only matched values and keeps unmatched values unchanged.

Choose map when you want strict remapping, replace when you want partial substitution without introducing NaN.

# Suppose we want to abbreviate first names
df_small["first_abbrev"] = df_small["first"].map({"Alice": "A.", "Robert": "R.", "Carol": "C."})
print("\nAfter mapping first names to abbreviations:")
df_small[["first", "first_abbrev"]]
# Note: unmapped values become NaN; to avoid that, you can fill or use replace
df_small["first_abbrev_fallback"] = df_small["first"].replace({"Alice": "A.", "Robert": "R.", "Carol": "C."})
print("\nUsing replace (no NaNs for unmatched since exact replacements):")
df_small[["first", "first_abbrev_fallback"]]

6. Big DataFrame: rename column and use map

This section demonstrates two practical cleaning steps on a large dataset:

  • Renaming CompTotal to Salary for clearer semantics.

  • Mapping country names to short codes (US, CH, DE, ...).

These operations make analysis outputs easier to read and standardize.

df = pd.read_csv("data/survey_results_public.csv")

# Rename 'CompTotal' to 'Salary'
df = df.rename(columns={"CompTotal": "Salary"})
print("Columns after renaming compensation:")
print([c for c in df.columns if "Salary" in c])

# Example map usage: create a simplified label for some countries
country_abbrev = {
    "United States of America": "US",
    "Switzerland": "CH",
    "Germany": "DE",
    "India": "IN",
    "Canada": "CA",
}
df["country_code"] = df["Country"].map(country_abbrev)
print("\nSample country codes:")
df.loc[df["country_code"].notna(), ["Country", "country_code"]].head()

7. Combine first & last into a full name, then split back

  • Combine columns: df["full_name"] = df["first"] + " " + df["last"]

  • Remove no-longer-needed columns: drop(columns=[...])

  • Split back with: .str.split(" ", expand=True)

Feature engineering often requires building intermediate text fields and then re-normalizing structure.

# Combine into full_name
df_small["full_name"] = df_small["first"] + " " + df_small["last"]
print("After combining:")
df_small
# Drop the original first & last columns
df_small.drop(columns=["first", "last"], inplace=True)
print("\nAfter dropping first & last:")
df_small
# Split full_name back into first & last
df_small[["first", "last"]] = df_small["full_name"].str.split(" ", expand=True)
print("\nAfter splitting full_name:")
df_small
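If some full names contain more than one space (e.g. middle names or particles), splitting into exactly two columns needs n=1 so only the first space splits. A sketch with made-up names:

```python
import pandas as pd

full = pd.Series(["Ada Lovelace", "Jean van Damme"])

# n=1 splits only at the first space, keeping the remainder together
parts = full.str.split(" ", n=1, expand=True)
parts.columns = ["first", "last"]
print(parts)
```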

8. Appending rows & the ignore_index=True quirk

Two common patterns:

  • pd.concat([df, new_df], ignore_index=True) to append one or many rows.

  • df.loc[new_index] = {...} for quick insertion of a single row.

Why ignore_index=True matters: it rebuilds a clean continuous index after concatenation.

df_small = df_small_orig.copy()

# Add a new row using pd.concat
new_row = pd.DataFrame([{"first": "Tony"}])

# Method 1: Concatenate, resetting the index
df_small = pd.concat([df_small, new_row], ignore_index=True)
df_small
# Method 2: Using .loc to append a new row
new_idx = len(df_small)
df_small.loc[new_idx] = {"first": "Tony"}
df_small
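To see why ignore_index=True matters, compare the resulting indexes with and without it on two tiny frames:

```python
import pandas as pd

a = pd.DataFrame({"x": [1, 2]})
b = pd.DataFrame({"x": [3]})

# Without ignore_index, each frame keeps its original labels: 0, 1, 0
print(pd.concat([a, b]).index.tolist())                     # [0, 1, 0]
# With ignore_index=True, a fresh continuous RangeIndex is built
print(pd.concat([a, b], ignore_index=True).index.tolist())  # [0, 1, 2]
```

Duplicate labels like `0, 1, 0` make `.loc[0]` return two rows, which is rarely what you want after an append.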

9. Appending another DataFrame

df_small = df_small_orig.copy()

# Create a smaller DataFrame with same columns
new_people = pd.DataFrame({
    "first": ["Diana", "Eli"],
    "last":  ["Prince", "Musk"],
    "email": ["diana@themiscira.com", "eli@space.com"],
    "full_name": ["Diana Prince", "Eli Musk"],
})

# Append new_people to df_small (resetting index for a clean result)
df_small = pd.concat([df_small, new_people], ignore_index=True)
print("\nCombined DataFrame:")
df_small

10. Removing rows by index and by condition

  • drop(index=...) removes rows by index label.

  • Combine with a mask to drop conditionally, e.g. rows where last == "Lee".

Selective removal is a key cleanup step before statistics or model training.

# Remove a row by its index label, e.g., drop index 1
df_small = df_small.drop(index=1)
print("\nAfter dropping index=1:")
df_small
# Remove rows where last name == "Lee"
mask_lee = df_small["last"] == "Lee"
df_no_lee = df_small.drop(index=df_small[mask_lee].index)
print("\nAfter dropping rows where last == 'Lee':")
df_no_lee
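An equivalent, often simpler way to drop by condition is to keep the complement with boolean indexing. A sketch with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"last": ["Smith", "Lee", "Jones"]})

# Keep every row whose last name is NOT "Lee"; .copy() avoids
# later SettingWithCopyWarning on the filtered result
df_no_lee = df[df["last"] != "Lee"].copy()
print(df_no_lee["last"].tolist())  # ['Smith', 'Jones']
```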

Exercise for Part 5

Exercise 5.1

  • Transform every column name so that it starts with an uppercase letter (e.g. first → First, email → Email).

  • Replace each first and last name with the first letter followed by a dot, using apply().

  • Extract the domain (after the @) from each email.

Exercise 5.2

  • Create an Adult column to be True when the numeric age is > 18, and False otherwise (including NaN ages).

Solutions

#### YOUR CODE HERE ####