Part 2: Basic Data Structures, Series, and Selections

1. Motivation: Without Pandas

In pure Python, a common way to represent tabular data is using dictionaries:

  • Keys are column names.

  • Values are lists with one entry per row (or nested dictionaries).

Example: build a small people collection with the columns first, last, and email.

people = {
    "first": ["Alice", "Bob", "Carol"],
    "last": ["Smith", "Jones", "Lee"],
    "email": ["alice@example.com", "bob@example.com", "carol@example.com"],
}

This works, but lacks convenient vectorized operations, alignment, and metadata.
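
To make the limitation concrete, here is a minimal sketch of a simple lookup in pure Python: finding the email of everyone whose last name is "Jones". The name jones_emails is introduced only for this illustration.

```python
people = {
    "first": ["Alice", "Bob", "Carol"],
    "last": ["Smith", "Jones", "Lee"],
    "email": ["alice@example.com", "bob@example.com", "carol@example.com"],
}

# The lists are only related by position, so we have to walk them in parallel
jones_emails = [
    email
    for last, email in zip(people["last"], people["email"])
    if last == "Jones"
]
print(jones_emails)
```

Nothing ties the three lists together except their order, so every query has to rebuild that relationship by hand.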

2. Converting Dict to DataFrame

import pandas as pd

df_small = pd.DataFrame(people)
df_small

  • Columns correspond to keys.

  • Each list position becomes a row, so all lists must have the same length.
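
The length requirement can be checked with a small sketch. The ragged dictionary below is a made-up example, not part of the course data; pandas should refuse to build a DataFrame from it.

```python
import pandas as pd

# Hypothetical input: "last" has one entry fewer than "first"
ragged = {
    "first": ["Alice", "Bob", "Carol"],
    "last": ["Smith", "Jones"],
}

error_message = None
try:
    pd.DataFrame(ragged)
except ValueError as exc:
    # pandas rejects columns of different lengths
    error_message = str(exc)

print(error_message)
```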

# Similar behaviour for numpy arrays
import numpy as np

data = np.array([[10, 2, 1993], [24, 8, 2006], [15, 5, 1810]])
df_array = pd.DataFrame(data, columns=['day', 'month', 'year'])
df_array
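
As a small aside on the columns argument: if it is omitted, pandas falls back to integer column labels. A minimal sketch, reusing the same array (df_default is a name introduced here for illustration):

```python
import numpy as np
import pandas as pd

data = np.array([[10, 2, 1993], [24, 8, 2006], [15, 5, 1810]])

# Without columns=, pandas labels the columns 0, 1, 2
df_default = pd.DataFrame(data)
print(list(df_default.columns))

# With columns=, each inner list becomes one row under the given names
df_array = pd.DataFrame(data, columns=["day", "month", "year"])
print(df_array.shape)
```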

3. Series and Column Access

# Single column access returns a Series
df_small['email']
df_small.email  # shorthand, but can conflict

Caveat: attribute access is only a shorthand. It breaks when the column name clashes with an existing DataFrame method or attribute (e.g. count), or when the name is not a valid Python identifier (spaces, punctuation, etc.).

df['email'] is unambiguous; df.email is syntactic sugar for it. (pandas.pydata.org)
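
The clash is easy to reproduce. The sketch below uses a hypothetical frame (df_clash) with a column named count, which collides with the DataFrame.count method:

```python
import pandas as pd

# Hypothetical frame whose column name collides with DataFrame.count
df_clash = pd.DataFrame({"count": [3, 1, 2]})

by_bracket = df_clash["count"]   # the column, as a Series
by_attribute = df_clash.count    # the DataFrame.count method, not the column

print(type(by_bracket).__name__)
print(callable(by_attribute))
```

Note that the attribute form does not raise an error here; it silently returns the method, which is arguably worse than failing.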


4. Selecting Multiple Columns and Inspecting Available Columns

# Suppose we want first + email only
df_small[['first', 'email']]
# List all columns in a DataFrame
df_small.columns
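
One detail worth checking here is the return type: single brackets with one name give a Series, while double brackets (a list of names) give a DataFrame, even for a single column. A minimal sketch, rebuilding the small frame so the block runs on its own:

```python
import pandas as pd

df_small = pd.DataFrame(
    {
        "first": ["Alice", "Bob", "Carol"],
        "last": ["Smith", "Jones", "Lee"],
        "email": ["alice@example.com", "bob@example.com", "carol@example.com"],
    }
)

as_series = df_small["email"]     # one name -> Series
as_frame = df_small[["email"]]    # list of names -> DataFrame with one column
column_names = df_small.columns.tolist()  # plain Python list of column names

print(type(as_series).__name__, type(as_frame).__name__)
print(column_names)
```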

5. Indexing with loc and iloc (on the small df)

# .loc uses labels
df_small.loc[0]                             # first row by label
df_small.loc[0, 'email']                    # scalar
df_small.loc[[0, 1], ['first', 'email']]    # multiple rows and columns
# .iloc uses integer positions
df_small.iloc[0]         # first row
df_small.iloc[0, 2]      # first row, third column (email)
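
On df_small, labels and positions happen to coincide because the default index is 0, 1, 2. The difference shows up once the index is something else. The sketch below assumes a made-up index of 10, 20, 30 (df_labeled is a hypothetical name):

```python
import pandas as pd

# Hypothetical frame with a non-default integer index
df_labeled = pd.DataFrame(
    {"first": ["Alice", "Bob", "Carol"], "last": ["Smith", "Jones", "Lee"]},
    index=[10, 20, 30],
)

by_label = df_labeled.loc[10, "first"]   # row labelled 10
by_position = df_labeled.iloc[0, 0]      # row at position 0, column at position 0

# Label 0 does not exist in this index, so .loc[0] raises KeyError
try:
    df_labeled.loc[0]
    label_zero_found = True
except KeyError:
    label_zero_found = False

print(by_label, by_position, label_zero_found)
```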

6. Return to the Big Survey Data

# Load the data from a CSV file
df = pd.read_csv('data/survey_results_public.csv')
# Re-check shape
df.shape

Example: explore a column (e.g., “Employment”).

Note: column names are case-sensitive and must match the schema exactly.
If the column is named “Employment”, we can do:

df.loc[0]                                 # first respondent
df.loc[0, 'Employment']                   # their answer to Employment
df.loc[[0, 1, 2], 'Employment']           # first three respondents' Employment
df.loc[0:2, 'Employment']                 # slicing; inclusive of 2
df.loc[0:2, 'Employment':'EdLevel']       # column range selection, inclusive

.loc[0:2] slices by label and includes the end label, so it returns rows 0, 1, and 2; this trips up people coming from Python list slicing, where the end is excluded. (pandas.pydata.org)
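
The inclusive behaviour can be compared side by side with .iloc, which follows the usual Python convention. The frame below is a hypothetical stand-in for the survey data; its column order and values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the survey data; the values are made up
df_demo = pd.DataFrame(
    {
        "Employment": ["a", "b", "c", "d"],
        "Country": ["w", "x", "y", "z"],
        "EdLevel": ["p", "q", "r", "s"],
    }
)

# Label slicing: both the end row label and the end column label are included
label_slice = df_demo.loc[0:2, "Employment":"EdLevel"]

# Position slicing: the end position is excluded, as with Python lists
position_slice = df_demo.iloc[0:2, 0:2]

print(label_slice.shape)
print(position_slice.shape)
```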


Exercise for Part 2

In the small df:

  • Modify the dictionary to add a column uid containing a unique identifier for that person, in the form of: FLDDMMYY

    • F: First letter of the first name

    • L: First letter of the last name

    • DD: day of birth (two digits, zero-padded)

    • MM: month of birth (two digits, zero-padded)

    • YY: last two digits of the year of birth

(You can use the dates from the numpy array in section 2)

In the big survey DataFrame:

  • Retrieve rows 10 through 15 and print their Employment through EdLevel columns (adjust the column names against the schema if necessary).

  • Print the last 10 answers from the last column.

  • Print 5 answers from the exact middle of the dataframe.

  • (Optional) Count how many respondents answered “Yes” when asked if they currently use AI tools in their development process (AISelect column).

Solution

#### YOUR CODE HERE ####