Estimation

Lab setup

First, make sure you have completed the initial setup.

If you are part of a course

  1. Open Terminal. Run the update command to make sure you have the latest code.
    $ mwc update
  2. Move to this lab's directory.
    $ cd ~/Desktop/making_with_code/shuyuan/labs/estimation
    

If you are working on your own

  1. Move to your MWC directory.
    $ cd ~/Desktop/making_with_code
    
  2. Get a copy of this lab's materials.
    $ git clone https://git.makingwithcode.org/mwc/estimation.git

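So far in this unit you have built models that classify: spam or ham, digit 0 through 9. Now we turn to two different questions:

  1. Linear regression: given x, what is the predicted value of y?
  2. Association rule mining: which variables tend to occur together?

These are different kinds of ML tasks, and they raise different ethical questions.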


Part 1: Setup and Exploration

This lab uses Jupyter notebooks—an interactive environment that mixes code, output, and written analysis in a single document. Notebooks are widely used in data science and machine learning research because they make it easy to explore data and document your reasoning as you go.

💻 Start the notebook:

$ uv run jupyter lab

Open brfss.ipynb. You should see a notebook with sections corresponding to each part of this lab.

The BRFSS dataset

The Behavioral Risk Factor Surveillance System (BRFSS) is a US health survey conducted annually since 1984. It collects information about health behaviors and conditions from hundreds of thousands of Americans each year.

💻 Run the first two cells of the notebook to load and explore the data.

The dataset has one row per survey respondent and includes:

Column                 Description
BMI                    Body mass index
PhysActivity           Does the respondent do physical activity? (1=Yes, 0=No)
Fruits                 Eats fruit at least once per day (1=Yes, 0=No)
Veggies                Eats vegetables at least once per day (1=Yes, 0=No)
Smoker                 Has smoked at least 100 cigarettes in lifetime (1=Yes, 0=No)
HvyAlcoholConsump      Heavy alcohol consumption (1=Yes, 0=No)
Diabetes_binary        Has been told they have diabetes (1=Yes, 0=No)
HeartDiseaseorAttack   Has had coronary heart disease or heart attack (1=Yes, 0=No)
GenHlth                General health (1=Excellent to 5=Poor)
Age                    Age category (1=18-24 up to 13=80+)

Part 2: Linear Regression

In classification, the output was a category. In linear regression, the output is a continuous number. We try to predict one variable (the response, or dependent variable) from one or more other variables (the predictors, or independent variables).

The simplest case is one predictor. The linear regression model is:

y = m * x + b

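where y is the response, x is the predictor, m is the slope (the change in predicted y for a one-unit increase in x), and b is the intercept (the predicted y when x = 0).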

The model learns m and b by minimizing the mean squared error between predictions and actual values.
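
To make "minimizing the mean squared error" concrete, here is a minimal sketch of the closed-form least-squares solution for one predictor. The function name fit_line and the data are made up for illustration; the notebook's starter code may fit the model differently (for example, with scikit-learn).

```python
def fit_line(xs, ys):
    """Return (m, b) that minimize mean squared error for y = m * x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the best-fit line passes through (mean_x, mean_y).
    b = mean_y - m * mean_x
    return m, b


# Made-up data (not from BRFSS) that lies exactly on y = 2x + 1.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
m, b = fit_line(xs, ys)
print(m, b)
```

Because the made-up points lie exactly on a line, the fit recovers that line's slope and intercept. With real survey data the points scatter, and m and b describe the line that comes closest on average.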

How well does the model fit?

r² (R-squared) measures the proportion of variation in y that is explained by the model. An r² of 1.0 means the model perfectly predicts y; r² = 0 means the model does no better than predicting the mean of y.

r² = 1 - (sum of squared residuals) / (total sum of squares)

In practice, a good r² depends on the domain. For physics experiments, r² > 0.99 is typical. For predicting human behavior from a single variable, r² = 0.10–0.30 is often considered reasonable.
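
The r² formula above can be computed directly from a list of actual values and a list of predictions. This is a sketch with made-up numbers; r_squared is a hypothetical helper, not a function from the notebook.

```python
def r_squared(actual, predicted):
    """r² = 1 - (sum of squared residuals) / (total sum of squares)."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


actual = [2, 4, 6, 8]          # made-up values, mean is 5

perfect = [2, 4, 6, 8]         # predictions match exactly
mean_only = [5, 5, 5, 5]       # always predict the mean of y
close = [3, 4, 6, 7]           # off by one on two of the four points

print(r_squared(actual, perfect))
print(r_squared(actual, mean_only))
print(r_squared(actual, close))
```

The first two cases reproduce the endpoints described above: perfect predictions give r² = 1, and always predicting the mean gives r² = 0.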

💻 Run the linear regression cells in the notebook. The starter code fits a model predicting BMI from physical activity, then from age.

Multiple regression

You can include more than one predictor:

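from sklearn.linear_model import LinearRegression

# df is the BRFSS DataFrame loaded earlier in the notebook.
X = df[["PhysActivity", "Age", "Fruits", "Veggies", "Smoker"]]  # predictors
y = df["BMI"]                                                   # response
model = LinearRegression()
model.fit(X, y)

# score() returns r² for the data passed in.
print(model.score(X, y))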

💻 Fit a multiple regression model using several predictors. Observe how r² changes when you add more variables.


Part 3: Association Rules

Linear regression asks: given x, what is the predicted value of y? Association rule mining asks a different question: which variables tend to occur together?

A classic example is market basket analysis: if a customer buys bread and butter, how likely are they to also buy jam? In health data, we might ask: do people who smoke also tend to drink heavily? Do people who exercise regularly also tend to eat more fruits and vegetables?

Key terms

Support of a rule A → B: the fraction of records where both A and B occur.

Confidence of a rule A → B: given that A occurs, how often does B also occur? (Conditional probability: P(B | A))

Lift of a rule A → B: how much more likely is B given A, compared to how often B occurs overall? Here Support(B) is the fraction of all records where B occurs.

Lift = Confidence(A → B) / Support(B)

Lift > 1 means A and B co-occur more than chance. Lift = 1 means A and B are independent. Lift < 1 means A makes B less likely.
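
As a worked example of the three definitions, the sketch below computes support, confidence, and lift for the rule Smoker → HvyAlcoholConsump. The ten records are made up for illustration, not drawn from BRFSS, and rule_metrics is a hypothetical helper.

```python
def rule_metrics(records, a, b):
    """Return (support, confidence, lift) for the rule a -> b."""
    n = len(records)
    count_a = sum(1 for r in records if r[a] == 1)
    count_b = sum(1 for r in records if r[b] == 1)
    count_ab = sum(1 for r in records if r[a] == 1 and r[b] == 1)
    support = count_ab / n            # both A and B occur
    confidence = count_ab / count_a   # P(B | A)
    lift = confidence / (count_b / n) # Confidence(A -> B) / Support(B)
    return support, confidence, lift


# Ten made-up respondents: 2 do both, 2 only smoke, 1 only drinks, 5 do neither.
records = (
    [{"Smoker": 1, "HvyAlcoholConsump": 1}] * 2
    + [{"Smoker": 1, "HvyAlcoholConsump": 0}] * 2
    + [{"Smoker": 0, "HvyAlcoholConsump": 1}] * 1
    + [{"Smoker": 0, "HvyAlcoholConsump": 0}] * 5
)

support, confidence, lift = rule_metrics(records, "Smoker", "HvyAlcoholConsump")
print(support, confidence, lift)
```

In this made-up sample, 2 of 10 records have both behaviors (support 0.2), half of the 4 smokers drink heavily (confidence 0.5), and 3 of 10 respondents overall drink heavily, so the lift is 0.5 / 0.3, or about 1.67.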

💻 Run the association rule cells in the notebook. The mlxtend library implements the Apriori algorithm. Start with a low minimum support threshold and observe which rules are discovered.
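
To see why the minimum support threshold matters, here is a hand-rolled sketch of the level-wise idea behind Apriori: only itemsets that are already frequent get extended into larger ones. This is not mlxtend's API or implementation. The function frequent_itemsets and the eight transactions are made up for illustration.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    items = sorted({item for t in transactions for item in t})
    current = [frozenset([item]) for item in items]  # size-1 candidates
    result = {}
    k = 1
    while current:
        # Keep only the candidates that are frequent enough.
        frequent = {s: support(s) for s in current}
        frequent = {s: sup for s, sup in frequent.items() if sup >= min_support}
        result.update(frequent)
        # Join frequent size-k itemsets into size-(k+1) candidates.
        k += 1
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Pruning step: every size-(k-1) subset must itself be frequent.
        current = [
            c for c in candidates
            if all(frozenset(sub) in frequent for sub in combinations(c, k - 1))
        ]
    return result


# Each made-up transaction lists the behaviors one respondent reported.
transactions = [
    {"Smoker", "HvyAlcoholConsump"},
    {"Smoker", "HvyAlcoholConsump"},
    {"Smoker"},
    {"PhysActivity", "Fruits"},
    {"PhysActivity", "Fruits", "Veggies"},
    {"PhysActivity", "Veggies"},
    {"Fruits", "Veggies"},
    {"PhysActivity", "Fruits", "Veggies"},
]

high = frequent_itemsets(transactions, min_support=0.3)
low = frequent_itemsets(transactions, min_support=0.2)
print(len(high), len(low))
```

Lowering the threshold from 0.3 to 0.2 admits rarer combinations, such as the three-item set of PhysActivity, Fruits, and Veggies. On a dataset the size of BRFSS, a very low threshold can produce far more itemsets than anyone can inspect.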


Part 4: Ethical Considerations

The BRFSS data was collected to understand population health. But the models we can build from it—linear regression that predicts BMI, association rules that link behaviors to disease—could be used in ways the original researchers did not intend.

Read through the questions below, discuss them with your group, and write your answers in the notebook.

Questions for discussion

1. Causation vs. correlation

The regression model shows that, on average, people with more physical activity have lower BMI. Does this mean that exercising causes lower BMI? What else might explain this association?

Your answer:


2. Self-reporting bias

BRFSS data is self-reported: respondents answer questions about their own behaviors. What kinds of behaviors might people systematically under-report or over-report? How might this affect the accuracy of a model trained on this data?

Your answer:


3. Predictive use

Suppose a health insurance company trained a model on BRFSS data to predict an applicant's likely future healthcare costs, and used the predictions to set insurance premiums.

Your answer:


4. Transparency and accountability

If a person were denied affordable insurance based on a model's prediction, should they have the right to know why? Should they be able to challenge the prediction?

Your answer:


5. Your own position

Based on your discussion, write a short paragraph (4–6 sentences) summarizing your view: under what conditions, if any, is it appropriate to use health survey data to build predictive models for commercial purposes?

Your answer: