Part 1: Getting Started with Pandas and the Stack Overflow Survey Dataset

1. Setup and Installation

We’ll assume you have Python ≥3.10. (Pandas 2.x itself supports older interpreters: 2.0 runs on Python 3.8+, and 2.1 through 2.3 require 3.9+.) Install via pip:

!pip install --upgrade pandas

Or, if using conda:

conda install pandas

Verify the version so we know what behavior to expect. The current stable series is 2.3.x as of mid-2025. Pandas 2.0 introduced breaking changes from 1.x and removed deprecated APIs, so older tutorials may mention things that no longer exist.

import pandas as pd
print(pd.__version__)  # Expect 2.2.x / 2.3.x; warn if <2.0
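
The comment above mentions warning on installs older than 2.0. A minimal sketch of such a check follows. The helper name warn_if_old_pandas is hypothetical, and it assumes the version string starts with a numeric major component:

```python
import warnings


def warn_if_old_pandas(version: str, minimum_major: int = 2) -> bool:
    """Return True (and emit a warning) if `version` is older than `minimum_major`.

    Hypothetical helper for this tutorial; only the leading major number is inspected.
    """
    major = int(version.split(".")[0])
    if major < minimum_major:
        warnings.warn(
            f"pandas {version} is older than {minimum_major}.0; "
            "some code in this tutorial may not run as written."
        )
        return True
    return False


# A plain string is used here so the sketch runs on its own;
# in the notebook you would pass pd.__version__ instead.
is_old = warn_if_old_pandas("2.2.3")
print(is_old)
```

Comparing only the major number keeps the check simple. A stricter comparison would need a proper version parser.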

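Note: Pandas 2.0+ enforces deprecations announced during 1.x, so code written for older versions may need slight adaptation. For example, DataFrame.append was removed in 2.0 (use pd.concat instead), the .ix indexer has been gone since 1.0 (use .loc or .iloc), and some implicit dtype coercions that older code relied on no longer happen.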

2. Download and Prepare the Stack Overflow 2024 Survey Data

Go to the official Stack Overflow Developer Survey page and download the 2024 full data set (CSV).
- The public results file is named survey_results_public.csv.
- There’s a companion schema file, survey_results_schema.csv, which maps column codes to human-readable questions.
Unzip the downloaded archive and rename the extracted folder to data in your working directory.
Confirm that inside data/ you have at least:
- survey_results_public.csv
- survey_results_schema.csv
- README (explains the files and structure)
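
The check described above can be scripted. This is a minimal sketch, assuming the folder is named data and sits in the current working directory. The helper missing_files is hypothetical, and the schema file is assumed to be a CSV, as the exercise at the end of this part suggests:

```python
from pathlib import Path


def missing_files(folder, expected):
    """Return the names in `expected` that are not regular files inside `folder`."""
    root = Path(folder)
    return [name for name in expected if not (root / name).is_file()]


# Assumed file names; adjust them if your extracted archive differs.
expected = ["survey_results_public.csv", "survey_results_schema.csv"]
absent = missing_files("data", expected)
if absent:
    print("Missing from data/:", ", ".join(absent))
else:
    print("All expected files found.")
```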

The dataset comes from the official 2024 survey and has ~65,000 responses. In the public CSV, each row is one respondent and each column holds the answer to one question. (survey.stackoverflow.co, Kaggle)

3. CSV vs Excel (Why we’re using CSV here)

CSV is plain text with delimiter-separated values. It’s lightweight, universally readable, and ideal for data interchange between systems.
Excel (.xlsx, .xls) is a richer format (.xls is binary, .xlsx is zipped XML) supporting multiple sheets, formatting, formulas, etc., but it requires more complex parsing and, in pandas, an extra reader library such as openpyxl.
For large-scale programmatic analysis and reproducibility, CSV is preferred because it’s simple, version-control friendly, and doesn’t embed presentation metadata.

References for the distinctions: (DataCamp, GeeksforGeeks, Spreadsheet Planet)
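
To make the "plain text" point concrete, here is a small sketch that parses CSV text held in a string and writes it back out. The rows are invented for illustration and do not come from the survey:

```python
import io

import pandas as pd

# Invented sample rows; only the layout matters here.
csv_text = "ResponseId,Country,YearsCode\n1,Germany,5\n2,Brazil,12\n"

# read_csv accepts any file-like object, so the text can be parsed directly.
small = pd.read_csv(io.StringIO(csv_text))
print(small.dtypes)  # column types are inferred from the text

# Writing back without the index reproduces the same lines of text.
round_trip = small.to_csv(index=False)
print(round_trip)
```

Nothing beyond pandas is needed for this round trip, and the text can be diffed in version control. Reading an .xlsx file with pd.read_excel would additionally need an engine such as openpyxl installed.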

4. First Pandas Usage

import pandas as pd  # standard alias

df = pd.read_csv('data/survey_results_public.csv')

# Number of rows and columns
print(df.shape)

# Summary of types, non-null counts
df.info()
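
By default, df.info() switches to a condensed summary once a frame has more than 100 columns (the display.max_info_columns option). If the survey frame turns out to be that wide, the per-column listing can be requested explicitly. A sketch on an invented 101-column frame:

```python
import io

import pandas as pd

# Toy frame with 101 columns, one more than the default
# display.max_info_columns limit of 100.
wide = pd.DataFrame({f"c{i}": [1, 2, 3] for i in range(101)})
print(wide.shape)  # (rows, columns)

# Capture both variants in buffers so they can be compared.
short_buf, long_buf = io.StringIO(), io.StringIO()
wide.info(buf=short_buf)                                 # condensed summary
wide.info(buf=long_buf, verbose=True, show_counts=True)  # per-column listing

print(short_buf.getvalue())
```

On the survey frame, the equivalent call would be df.info(verbose=True, show_counts=True).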

Viewing data slices

df.head()        # first 5 rows

df.head(10)      # first 10 rows

df.tail()        # last 5 rows
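
The same calls on a small invented frame make it easy to see exactly which rows come back. This is a sketch for illustration only:

```python
import pandas as pd

toy = pd.DataFrame({"n": range(12)})  # invented data: 12 rows, values 0..11

first_five = toy.head()          # rows 0-4
first_ten = toy.head(10)         # rows 0-9
last_five = toy.tail()           # rows 7-11
all_but_last_two = toy.head(-2)  # a negative n drops rows from the end

print(last_five)
```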

Exercise for Part 1

There is an additional .csv file in the data folder.

Load the schema file into a DataFrame.
Answer the following:
- How many rows and columns does the schema DataFrame have?
- Inspect the first few rows to understand its structure. What are the key fields/columns it contains, and what do they mean in context of the survey?
- Are there any missing values in the schema? Identify which column(s), if any, have missing entries and how many.
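
As a hint for the last question, one common way to count missing entries per column is isna().sum(). The sketch below uses an invented frame, not the schema file, so the exercise itself is left open:

```python
import pandas as pd

# Invented frame with a few gaps; None becomes a missing value.
toy = pd.DataFrame(
    {
        "code": ["Q1", "Q2", "Q3"],
        "label": ["Age", None, "Country"],
        "note": [None, None, "free text"],
    }
)

missing_per_column = toy.isna().sum()  # count of missing entries per column
print(missing_per_column[missing_per_column > 0])  # only columns with gaps
```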

Solution

#### YOUR CODE HERE ####