Estimation

Lab setup

First, make sure you have completed the initial setup.

Open Terminal. Run the update command to make sure you have the latest code.
```
$ mwc update
```

Move to this lab's directory.

```
$ cd ~/Desktop/making_with_code/shuyuan/labs/estimation
```

Move to your MWC directory.
```
$ cd ~/Desktop/making_with_code
```

Get a copy of this lab's materials.

```
$ git clone https://git.makingwithcode.org/mwc/estimation.git
```

How this lab meets the learning objectives

The previous two labs focused on classification: predicting a categorical outcome (spam/ham, digit 0–9). This lab introduces estimation: predicting a continuous value, and association: discovering relationships between behaviors in large datasets. Both are grounded in a real public health dataset that raises genuine ethical questions.

A4.3.1 — Linear regression (Part 2): Students build and interpret a linear regression model predicting a continuous health outcome. They examine the slope, intercept, and r² as measures of model fit, and encounter the limits of linear relationships in noisy real-world data.

A4.3.5 — Association rules (Part 3): Students mine the BRFSS dataset for associations between health behaviors using the Apriori algorithm. The medical diagnosis framing (symptoms that co-occur) grounds the otherwise abstract concept of support, confidence, and lift. The crime analysis example from the IB standard is referenced in the discussion.

A4.4.1 — Ethical considerations (Part 4): The ethical discussion is woven throughout but concentrated in Part 4. The BRFSS context makes bias, consent, and causation vs. correlation concrete and personal. Students examine how a predictive health model could be used, and misused.

Background on BRFSS

The Behavioral Risk Factor Surveillance System (BRFSS) is a US health survey conducted annually since 1984. It collects data on health behaviors, chronic diseases, and use of preventive services. The dataset used here is a cleaned subset of the 2015 survey (~253,000 respondents) with features selected for clarity: BMI, physical activity, fruit/vegetable consumption, smoking status, alcohol consumption, and several health outcomes (diabetes, heart disease, etc.).

The full dataset and codebook are available from the CDC. A cleaned version is available on Kaggle as "Diabetes Health Indicators Dataset."

Pacing

Suggested pacing (8 class periods):

Period 1: Part 1 (notebook setup, data exploration)
Periods 2–3: Part 2 (linear regression, r², slope/intercept)
Periods 4–5: Part 3 (association rules, Apriori)
Periods 6–8: Part 4 (ethical discussion, final analysis)

This lab uses Jupyter notebooks. Class discussion works well between sections; the ethical questions in Part 4 benefit from whole-class conversation before students write their individual responses.

So far in this unit you have built models that classify: spam or ham, digit 0 through 9. Now we turn to two different questions:

Estimation: Given someone's health behaviors, can we predict a continuous health outcome—such as their BMI?
Association: Which health behaviors tend to occur together?

These are different kinds of ML tasks, and they raise different ethical questions.

Part 1: Setup and Exploration

This lab uses Jupyter notebooks—an interactive environment that mixes code, output, and written analysis in a single document. Notebooks are widely used in data science and machine learning research because they make it easy to explore data and document your reasoning as you go.

💻 Start the notebook:

```
$ uv run jupyter lab
```

Open brfss.ipynb. You should see a notebook with sections corresponding to each part of this lab.

The BRFSS dataset

The Behavioral Risk Factor Surveillance System (BRFSS) is a US health survey conducted annually since 1984. It collects information about health behaviors and conditions from hundreds of thousands of Americans each year.

💻 Run the first two cells of the notebook to load and explore the data.

The dataset has one row per survey respondent and includes:

| Column | Description |
| --- | --- |
| `BMI` | Body mass index |
| `PhysActivity` | Does the respondent do physical activity? (1=Yes, 0=No) |
| `Fruits` | Eats fruit at least once per day (1=Yes, 0=No) |
| `Veggies` | Eats vegetables at least once per day (1=Yes, 0=No) |
| `Smoker` | Has smoked at least 100 cigarettes in lifetime (1=Yes, 0=No) |
| `HvyAlcoholConsump` | Heavy alcohol consumption (1=Yes, 0=No) |
| `Diabetes_binary` | Has been told they have diabetes (1=Yes, 0=No) |
| `HeartDiseaseorAttack` | Has had coronary heart disease or heart attack (1=Yes, 0=No) |
| `GenHlth` | General health (1=Excellent to 5=Poor) |
| `Age` | Age category (1=18-24 up to 13=80+) |

Part 2: Linear Regression

In classification, the output was a category. In linear regression, the output is a continuous number. We try to predict one variable (the response or dependent variable) from one or more other variables (the predictor or independent variables).

The simplest case is one predictor. The linear regression model is:

y = m * x + b

where:

y is the predicted value (e.g., BMI)
x is the predictor (e.g., physical activity)
m is the slope: how much y changes for each unit increase in x
b is the intercept: the predicted y when x = 0

The model learns m and b by minimizing the mean squared error between predictions and actual values.
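For a single predictor, the values of m and b that minimize the mean squared error can be written in closed form. The sketch below is only an illustration of that idea: the numbers are made up (they are not BRFSS values), and `fit_line` is a hypothetical helper, not the notebook's starter code.

```python
# Made-up illustrative data, NOT taken from BRFSS.
# x: physical activity (1 = yes, 0 = no); y: BMI
x = [0, 0, 0, 1, 1, 1]
y = [31.0, 29.0, 30.0, 26.0, 27.0, 25.0]


def fit_line(x, y):
    """Return (m, b) minimizing the mean squared error of y ≈ m * x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: how x and y vary together, divided by how much x varies
    s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    s_xx = sum((xi - mean_x) ** 2 for xi in x)
    m = s_xy / s_xx
    # Intercept: the fitted line always passes through (mean_x, mean_y)
    b = mean_y - m * mean_x
    return m, b


m, b = fit_line(x, y)
print(f"slope m = {m:.2f}, intercept b = {b:.2f}")
# prints: slope m = -4.00, intercept b = 30.00
```

Because x here is a 0/1 variable, the intercept equals the mean BMI of the x = 0 group (30) and the slope equals the difference between the two group means (26 - 30 = -4).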

How well does the model fit?

r² (R-squared) measures the proportion of variation in y that is explained by the model. An r² of 1.0 means the model perfectly predicts y; r² = 0 means the model does no better than predicting the mean of y.

r² = 1 - (sum of squared residuals) / (total sum of squares)

In practice, a good r² depends on the domain. For physics experiments, r² > 0.99 is typical. For predicting human behavior from a single variable, r² = 0.10–0.30 is often considered reasonable.
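The r² formula can be checked by hand on a small example. This is a minimal sketch with made-up numbers; `r_squared` is a hypothetical helper name, and the two prediction lists are chosen only to show a reasonably good fit and the "always predict the mean" baseline.

```python
def r_squared(actual, predicted):
    """r² = 1 - (sum of squared residuals) / (total sum of squares)."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_y) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Made-up BMI values, NOT taken from BRFSS
actual = [31.0, 29.0, 30.0, 26.0, 27.0, 25.0]

# Predictions from a line that separates the two groups
line_predictions = [30.0, 30.0, 30.0, 26.0, 26.0, 26.0]

# Baseline: always predict the mean of y (28.0)
mean_predictions = [28.0] * len(actual)

print(round(r_squared(actual, line_predictions), 3))  # prints 0.857
print(r_squared(actual, mean_predictions))            # prints 0.0
```

The baseline scores exactly 0 because its residuals are the same deviations from the mean that make up the total sum of squares.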

💻 Run the linear regression cells in the notebook. The starter code fits a model predicting BMI from physical activity, then from age.

Multiple regression

You can include more than one predictor:

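```
from sklearn.linear_model import LinearRegression

# Assumes df is the DataFrame loaded in the first cells of the notebook
X = df[["PhysActivity", "Age", "Fruits", "Veggies", "Smoker"]]  # predictors
y = df["BMI"]                                                    # response
model = LinearRegression()
model.fit(X, y)  # learns one coefficient per predictor, plus an intercept
```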

💻 Fit a multiple regression model using several predictors. Observe how r² changes when you add more variables.
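One thing to keep in mind when comparing models: on the data a least-squares model was fitted to, r² can only stay the same or rise when a predictor is added, even if the new predictor is mostly noise. The sketch below illustrates this with made-up data and a hand-written closed form for the two-predictor case. The helper names are hypothetical, and this is not how scikit-learn computes its fit.

```python
def centered_sum(u, v):
    """Sum of products of deviations from the mean (n times the covariance)."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))


def r2_one_predictor(x1, y):
    # With a single predictor, r² is the squared correlation of x1 and y
    return centered_sum(x1, y) ** 2 / (centered_sum(x1, x1) * centered_sum(y, y))


def r2_two_predictors(x1, x2, y):
    s11, s22, s12 = centered_sum(x1, x1), centered_sum(x2, x2), centered_sum(x1, x2)
    s1y, s2y, syy = centered_sum(x1, y), centered_sum(x2, y), centered_sum(y, y)
    det = s11 * s22 - s12 ** 2  # zero if x1 and x2 are perfectly collinear
    # Least-squares coefficients for y ≈ b0 + b1 * x1 + b2 * x2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    # Explained sum of squares divided by total sum of squares
    return (b1 * s1y + b2 * s2y) / syy


# Made-up values, NOT taken from BRFSS
activity = [0, 0, 0, 1, 1, 1, 0, 1]          # 1 = physically active
age = [3, 9, 6, 4, 8, 2, 11, 10]             # age category
bmi = [29.0, 31.5, 30.0, 25.5, 27.5, 25.0, 32.0, 28.0]

r2_one = r2_one_predictor(activity, bmi)
r2_two = r2_two_predictors(activity, age, bmi)
print(f"activity only:  r² = {r2_one:.3f}")
print(f"activity + age: r² = {r2_two:.3f}")
```

A higher r² after adding a variable is therefore not, by itself, evidence that the variable matters. Checking the model on data it was not fitted to is one common way to tell the difference.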

Part 3: Association Rules

Linear regression asks: given x, what is the predicted value of y? Association rule mining asks a different question: which variables tend to occur together?

A classic example is market basket analysis: if a customer buys bread and butter, how likely are they to also buy jam? In health data, we might ask: do people who smoke also tend to drink heavily? Do people who exercise regularly also tend to eat more fruits and vegetables?

Key terms

Support of a rule A → B: the fraction of records where both A and B occur.

Confidence of a rule A → B: given that A occurs, how often does B also occur? (Conditional probability: P(B | A))

Lift of a rule A → B: how much more likely is B when A occurs, compared to how often B occurs overall?

Lift = Confidence(A → B) / Support(B)

Lift > 1 means A and B co-occur more than chance. Lift = 1 means A and B are independent. Lift < 1 means A makes B less likely.
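The three measures can be computed directly from their definitions. The following is a small sketch on made-up records (not BRFSS data); the function names are hypothetical and are not part of mlxtend.

```python
# Made-up records: each set lists the behaviors one respondent reported.
records = [
    {"Smoker", "HvyAlcoholConsump"},
    {"Smoker"},
    {"Smoker", "HvyAlcoholConsump", "Fruits"},
    {"Fruits", "Veggies"},
    {"Fruits", "Veggies", "PhysActivity"},
    {"Veggies", "PhysActivity"},
    {"Smoker", "Fruits"},
    {"PhysActivity", "Fruits", "Veggies"},
]


def support(itemset, records):
    """Fraction of records that contain every item in itemset."""
    return sum(itemset <= r for r in records) / len(records)


def confidence(a, b, records):
    """Estimate of P(B | A): support of A and B together, divided by support of A."""
    return support(a | b, records) / support(a, records)


def lift(a, b, records):
    """Confidence of A → B divided by the overall support of B."""
    return confidence(a, b, records) / support(b, records)


a = {"Smoker"}
b = {"HvyAlcoholConsump"}
print(support(a | b, records))    # prints 0.25 (2 of 8 records)
print(confidence(a, b, records))  # prints 0.5  (2 of the 4 smokers)
print(lift(a, b, records))        # prints 2.0  (0.5 / 0.25)
```

In this toy data, heavy drinking is twice as common among smokers as in the sample as a whole, which is what a lift of 2 expresses.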

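💻 Run the association rule cells in the notebook. The mlxtend library implements the Apriori algorithm. Start with a moderate minimum support threshold (such as min_support=0.1), then lower it gradually and observe which additional rules are discovered. Very low thresholds can make the search slow.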
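To see why the support threshold matters so much for running time, it can help to look at the level-wise search behind Apriori. The sketch below is a simplified illustration written for this purpose, not mlxtend's implementation. The records are made up and `apriori_sketch` is a hypothetical name.

```python
from itertools import combinations


def apriori_sketch(records, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(records)

    def support(itemset):
        return sum(itemset <= r for r in records) / n

    # Level 1: frequent single items
    items = sorted({item for r in records for item in r})
    current = {frozenset([i]) for i in items
               if support(frozenset([i])) >= min_support}
    frequent = {s: support(s) for s in current}

    k = 2
    while current:
        # Join: combine frequent (k-1)-itemsets into candidate k-itemsets
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: a k-itemset can only be frequent if all of its
        # (k-1)-item subsets are frequent
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in current for sub in combinations(c, k - 1))
        }
        # Count support only for the candidates that survived pruning
        current = {c for c in candidates if support(c) >= min_support}
        frequent.update({c: support(c) for c in current})
        k += 1
    return frequent


# Made-up records, NOT taken from BRFSS
records = [
    {"Smoker", "HvyAlcoholConsump"},
    {"Smoker"},
    {"Smoker", "HvyAlcoholConsump", "Fruits"},
    {"Fruits", "Veggies"},
    {"Fruits", "Veggies", "PhysActivity"},
    {"Veggies", "PhysActivity"},
    {"Smoker", "Fruits"},
    {"PhysActivity", "Fruits", "Veggies"},
]

frequent = apriori_sketch(records, min_support=0.25)
for itemset, s in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), s)
```

Lowering the threshold lets more itemsets survive each level, so the number of candidates at the next level grows quickly. That is the reason very low thresholds become slow on a large dataset.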

A4.3.5 — Association rules. The IB standard uses a crime analysis example (areas with high vandalism also have high theft); this lab uses health behaviors for the same concept. Both show the same analytical pattern: discover frequent co-occurrences, compute confidence and lift.

The IB standard says "mining techniques using the association rule and interpretation of the results for a given scenario." The lab directly addresses this: students interpret the rules they discover in the health context. Encourage students to sort by lift (not just confidence or support) to find the most interesting associations.

Note: Apriori works with binary features. The BRFSS dataset has mostly binary columns, making it well suited. BMI (continuous) is excluded from the association rule mining unless students binarize it (e.g., BMI > 30).
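Binarizing BMI as described in the note could look something like the sketch below. The rows are made up, the column name `HighBMI` is a hypothetical choice, and the cutoff of 30 is simply the example value from the note.

```python
BMI_THRESHOLD = 30  # example cutoff from the note; other cutoffs are possible


def binarize_bmi(rows, threshold=BMI_THRESHOLD):
    """Return copies of rows with BMI replaced by a 0/1 HighBMI flag."""
    result = []
    for row in rows:
        # Copy every column except the continuous BMI value
        new_row = {key: value for key, value in row.items() if key != "BMI"}
        new_row["HighBMI"] = 1 if row["BMI"] > threshold else 0
        result.append(new_row)
    return result


# Made-up rows using column names from the dataset table
rows = [
    {"BMI": 24.0, "Smoker": 1, "PhysActivity": 1},
    {"BMI": 33.5, "Smoker": 0, "PhysActivity": 0},
    {"BMI": 30.0, "Smoker": 1, "PhysActivity": 0},
]

binary_rows = binarize_bmi(rows)
flags = [row["HighBMI"] for row in binary_rows]
print(flags)  # prints [0, 1, 0]
```

Because the comparison is strict, a BMI of exactly 30 is flagged as 0 here. Where the cutoff is placed changes which rules are found, which is itself worth discussing with students.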

Part 4: Ethical Considerations

The BRFSS data was collected to understand population health. But the models we can build from it—linear regression that predicts BMI, association rules that link behaviors to disease—could be used in ways the original researchers did not intend.

Read through the questions below, discuss them with your group, and write your answers in the notebook.

A4.4.1 — Ethical implications. The IB standard lists accountability, algorithmic fairness, bias, consent, environmental impact, privacy, security, societal impact, and transparency as potential ethical issues. This section touches on several:

Bias: Survey respondents self-report their behaviors. People tend to under-report smoking and drinking and to over-report healthy behaviors. This systematic bias reduces model accuracy, and it does not affect all groups equally.
Causation vs. correlation: The models show associations, not causes. A health insurance company that set rates from predicted BMI would be applying the model well beyond the purpose for which the data was collected.
Consent: Respondents consented to participating in a public health survey, not to having their patterns mined and used in predictive models.
Transparency: Who can see the model's predictions? Who can challenge them?

Give students 15–20 minutes for independent or group writing before opening the whole-class discussion. The strongest discussions come when students have had time to form and commit to a position.

Questions for discussion

1. Causation vs. correlation

The regression model shows that, on average, people with more physical activity have lower BMI. Does this mean that exercising causes lower BMI? What else might explain this association?

Your answer:

2. Self-reporting bias

BRFSS data is self-reported: respondents answer questions about their own behaviors. What kinds of behaviors might people systematically under-report or over-report? How might this affect the accuracy of a model trained on this data?

Your answer:

3. Predictive use

Suppose a health insurance company trained a model on BRFSS data to predict an applicant's likely future healthcare costs, and used the predictions to set insurance premiums.

What are the potential benefits of this use?
What are the potential harms?
Who might be unfairly disadvantaged?
Is this use consistent with what respondents consented to when they participated in the survey?

Your answer:

4. Transparency and accountability

If a person were denied affordable insurance based on a model's prediction, should they have the right to know why? Should they be able to challenge the prediction?

Your answer:

5. Your own position

Based on your discussion, write a short paragraph (4–6 sentences) summarizing your view: under what conditions, if any, is it appropriate to use health survey data to build predictive models for commercial purposes?

Your answer:

✅ CHECKPOINT 4

Complete all questions in Part 4 of the notebook. Then push your completed notebook:

```
$ mwc submit
```

Be prepared to share your group's position on question 5 with the class.

Discussion prompts

Connecting to other areas of ML:

Linear regression is supervised: it predicts a designated target (here, BMI). Association rule mining is unsupervised: it describes which behaviors co-occur without any designated target. How does this difference change how you interpret the results of each?
The association rules we mined were between binary variables. Real health data also includes continuous variables (blood pressure, cholesterol). What modifications would be needed to mine associations among continuous variables?

Connecting to societal impact:

BRFSS data is collected by the US government and published openly. Should public health data be freely available for anyone to build predictive models? What conditions, if any, should be placed on its use?
This lab used health behaviors to predict BMI and diabetes. What other personal behaviors might be predicted from survey data? Where would you draw the line?

Linking to other IB standards:

A4.4.2: How does the increasing integration of health data into AI systems affect individual privacy and equity? Who benefits from predictive health models? Who bears the risks?

Lab reflection

Strengths:

The BRFSS dataset is real and large enough that statistical patterns are genuine, not artifacts of a toy dataset.
The ethical discussion in Part 4 is grounded in concrete technical details from the earlier parts—students can connect "the model says X" to "here is how that model could be used or misused."
The notebook format allows students to interleave code, results, and written analysis in a way that supports genuine scientific reasoning.
The progression from classification to regression to association rules gives students a sense of the range of ML tasks.

Weaknesses and open questions:

The BRFSS dataset is US-centric; for classes outside the US, the health context may feel distant. Consider whether a more locally relevant dataset would increase engagement (e.g., WHO Global Health Observatory data).
Multiple regression is introduced briefly. Students who want to understand feature importance in multiple regression should also look at standardized coefficients or permutation importance, which are not currently included.
The Apriori algorithm can be slow for very low support thresholds. Starting with min_support=0.1 is recommended; min_support=0.01 may time out.
Open question: Should students be asked to train a classification model (logistic regression or decision tree) predicting diabetes from health behaviors as part of this lab? This would unify the regression and classification threads, but risks making the lab too long. Currently excluded, but a natural extension.
Open question: Is the BRFSS the right dataset, or should we use the Heart Disease dataset (UCI) or another health dataset with different characteristics? BRFSS is large and has strong ethical angles; Heart Disease is smaller but more focused.