Part 2: Basic Data Structures, Series, and Selections¶
1. Motivation: Without Pandas¶
In pure Python, a common way to represent tabular data is using dictionaries:
Keys are column names.
Values are lists with one entry per row, or nested dictionaries.
Example: build a small people collection with first, last, email.
people = {
    "first": ["Alice", "Bob", "Carol"],
    "last": ["Smith", "Jones", "Lee"],
    "email": ["alice@example.com", "bob@example.com", "carol@example.com"],
}
This works, but lacks convenient vectorized operations, alignment, and metadata.
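To make the limitation concrete, here is a minimal sketch of two everyday operations on the dictionary-of-lists layout in pure Python. The names `row_1` and `short_last_names` are only illustrative and are not part of the tutorial's data.

```python
# Same dictionary-of-lists layout as above.
people = {
    "first": ["Alice", "Bob", "Carol"],
    "last": ["Smith", "Jones", "Lee"],
    "email": ["alice@example.com", "bob@example.com", "carol@example.com"],
}

# Reassembling one "row" means indexing every list at the same position.
row_1 = {column: values[1] for column, values in people.items()}

# Filtering needs an explicit loop over positions; nothing keeps the
# three lists in sync if one of them is edited on its own.
short_last_names = [
    people["email"][i]
    for i, last in enumerate(people["last"])
    if len(last) <= 3
]

print(row_1)
print(short_last_names)
```

Both operations work, but every step has to be written by hand and repeated for each new question asked of the data.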
2. Converting Dict to DataFrame¶
import pandas as pd
df_small = pd.DataFrame(people)
df_small
Columns correspond to keys.
Each row is built from the values at the same position in every list, so all lists must have the same length.
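The length requirement can be checked with a small sketch. The `uneven` dictionary is a hypothetical example made up for illustration; with lists of unequal length the constructor is expected to raise a `ValueError` rather than pad the shorter list.

```python
import pandas as pd

people = {
    "first": ["Alice", "Bob", "Carol"],
    "last": ["Smith", "Jones", "Lee"],
    "email": ["alice@example.com", "bob@example.com", "carol@example.com"],
}
df_small = pd.DataFrame(people)
print(df_small.shape)          # (number of rows, number of columns)
print(list(df_small.columns))  # column names follow the dictionary keys

# Hypothetical dictionary whose lists have different lengths.
uneven = {
    "first": ["Alice", "Bob", "Carol"],
    "email": ["alice@example.com", "bob@example.com"],  # one entry short
}

try:
    pd.DataFrame(uneven)
    constructed = True
except ValueError as err:
    # pandas does not guess how to align lists of unequal length
    constructed = False
    print(f"ValueError: {err}")
```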
# Similar behaviour for numpy arrays
import numpy as np
data = np.array([[10, 2, 1993], [24, 8, 2006], [15, 5, 1810]])
df_array = pd.DataFrame(data, columns=['day', 'month', 'year'])
df_array
3. Series and Column Access¶
# Single column access returns a Series
df_small['email']
df_small.email # shorthand, but can conflict
Caveat: df_small['email'] is unambiguous. df_small.email is syntactic sugar: it does not return the column if the name clashes with an existing DataFrame method or attribute (e.g. count), and it cannot be written at all if the name contains spaces, punctuation, or other characters that are not valid in a Python identifier. (pandas.pydata.org)
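A short sketch of both failure modes, using a hypothetical frame `df_demo` whose column names were chosen to trigger them (they are not columns from the tutorial's data):

```python
import pandas as pd

# Hypothetical frame: "count" clashes with DataFrame.count, and
# "first name" is not a valid Python identifier.
df_demo = pd.DataFrame({
    "count": [3, 1, 2],
    "first name": ["Alice", "Bob", "Carol"],
})

by_brackets = df_demo["count"]   # the column, as a Series
by_attribute = df_demo.count     # the DataFrame.count method, not the column

print(type(by_brackets).__name__)
print(callable(by_attribute))

# Names with spaces can only be reached with brackets.
print(df_demo["first name"].tolist())
```

The attribute form does not raise an error in the first case; it silently hands back the method, which is why bracket access is the safer habit.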
4. Selecting Multiple Columns and Inspecting Available Columns¶
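Selecting several columns uses a list of names inside the indexing brackets, which is why the brackets appear doubled. As a small sketch on a stand-in frame (`df_demo` is illustrative only), the return type depends on whether a single name or a list is passed:

```python
import pandas as pd

df_demo = pd.DataFrame({
    "first": ["Alice", "Bob"],
    "email": ["alice@example.com", "bob@example.com"],
})

one_column = df_demo["email"]    # single name -> Series
as_frame = df_demo[["email"]]    # list of names -> DataFrame, even with one name

print(type(one_column).__name__, one_column.shape)
print(type(as_frame).__name__, as_frame.shape)

# df.columns is an Index object; list() turns it into plain Python strings.
print(list(df_demo.columns))
```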
# Suppose we want first + email only
df_small[['first', 'email']]
# List all columns in a DataFrame
df_small.columns
5. Indexing with loc and iloc (on the small df)¶
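On the small df the row labels and the row positions are both 0, 1, 2, so .loc and .iloc can look interchangeable. A minimal sketch with hypothetical string labels (`df_labeled` and its index are assumptions made for illustration) shows where they differ:

```python
import pandas as pd

# Hypothetical string labels chosen for illustration.
df_labeled = pd.DataFrame(
    {"first": ["Alice", "Bob", "Carol"], "last": ["Smith", "Jones", "Lee"]},
    index=["a", "b", "c"],
)

by_label = df_labeled.loc["b", "first"]   # row labelled "b", column "first"
by_position = df_labeled.iloc[1, 0]       # second row, first column: same cell
print(by_label, by_position)

# With this index, 0 is not a label, so .loc[0] raises KeyError.
try:
    df_labeled.loc[0]
    found = True
except KeyError:
    found = False
print(found)
```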
# .loc uses labels
df_small.loc[0] # first row by label
df_small.loc[0, 'email'] # scalar
df_small.loc[[0, 1], ['first', 'email']] # multiple rows and columns
# .iloc uses integer positions
df_small.iloc[0] # first row
df_small.iloc[0, 2] # first row, third column (email)
6. Return to the Big Survey Data¶
# Load the data from a CSV file
df = pd.read_csv('data/survey_results_public.csv')
# Re-check shape
df.shape
Example: explore a column (e.g., “Employment”)
NOTE: column names are case-sensitive and must match exactly what the schema shows.
If the column is “Employment”, we can do:
df.loc[0] # first respondent
df.loc[0, 'Employment'] # their answer to Employment
df.loc[[0, 1, 2], 'Employment'] # first three respondents' Employment
df.loc[0:2, 'Employment'] # slicing; inclusive of 2
df.loc[0:2, 'Employment':'EdLevel'] # column range selection, inclusive
.loc[0:2] is label slicing and includes the end label; this often trips up people coming from Python list slicing, where the end is excluded. (pandas.pydata.org)
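The difference can be seen side by side in a small sketch. The frame below is a stand-in for the survey data; its values are made up for illustration and do not come from the CSV file.

```python
import pandas as pd

# Stand-in for the survey data; the values are made up for illustration.
df_demo = pd.DataFrame({
    "Employment": ["Employed", "Student", "Employed", "Retired"],
    "EdLevel": ["Bachelor", "Secondary", "Master", "Bachelor"],
})

by_label = df_demo.loc[0:2, "Employment"]   # labels 0, 1, 2 -> three rows
by_position = df_demo.iloc[0:2, 0]          # positions 0, 1 -> two rows
plain_list = list(range(4))[0:2]            # list slicing also excludes the end

print(len(by_label), len(by_position), len(plain_list))
```

Only the label-based slice keeps the row at 2; .iloc follows the same end-exclusive convention as ordinary Python sequences.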
Exercise for Part 2¶
In the small df:
Modify the dictionary to add a column uid containing a unique identifier for that person, in the form of: FLDDMMYY
F: First letter of the first name
L: First letter of the last name
DD: day of birth
MM: month of birth
YY: year of birth
(You can use the dates from the numpy array in section 2)
In the big survey DataFrame:
Retrieve the rows 10 through 15 and print their Employment through EdLevel columns (adjust column names if necessary using schema).
Print the last 10 answers from the last column.
Print 5 answers from the exact middle of the dataframe.
(Optional) Count how many respondents answered “Yes” when asked if they currently use AI tools in their development process (AISelect column).
Solution¶
#### YOUR CODE HERE ####