So far in this unit you have built models that classify: spam or ham, digits 0
through 9. Now we turn to two different questions:
Estimation: Given someone's health behaviors, can we predict a continuous
health outcome—such as their BMI?
Association: Which health behaviors tend to occur together?
These are different kinds of ML tasks, and they raise different ethical
questions.
Part 1: Setup and Exploration
This lab uses Jupyter notebooks—an interactive environment that mixes code,
output, and written analysis in a single document. Notebooks are widely used
in data science and machine learning research because they make it easy to
explore data and document your reasoning as you go.
💻
Start the notebook:
$ uv run jupyter lab
Open brfss.ipynb. You should see a notebook with sections corresponding to
each part of this lab.
The BRFSS dataset
The Behavioral Risk Factor Surveillance System (BRFSS) is a US health survey
conducted annually since 1984. It collects information about health behaviors
and conditions from hundreds of thousands of Americans each year.
💻
Run the first two cells of the notebook to load and
explore the data.
The dataset has one row per survey respondent and includes:
Column                 Description
BMI                    Body mass index
PhysActivity           Does the respondent do physical activity? (1=Yes, 0=No)
Fruits                 Eats fruit at least once per day (1=Yes, 0=No)
Veggies                Eats vegetables at least once per day (1=Yes, 0=No)
Smoker                 Has smoked at least 100 cigarettes in lifetime (1=Yes, 0=No)
HvyAlcoholConsump      Heavy alcohol consumption (1=Yes, 0=No)
Diabetes_binary        Has been told they have diabetes (1=Yes, 0=No)
HeartDiseaseorAttack   Has had coronary heart disease or heart attack (1=Yes, 0=No)
GenHlth                General health (1=Excellent to 5=Poor)
Age                    Age category (1=18-24 up to 13=80+)
Part 2: Linear Regression
In classification, the output was a category. In linear regression, the
output is a continuous number. We try to predict one variable (the response,
or dependent variable) from one or more other variables (the predictors,
or independent variables).
The simplest case is one predictor. The linear regression model is:
y = m * x + b
where:
y is the predicted value (e.g., BMI)
x is the predictor (e.g., physical activity)
m is the slope: how much y changes for each unit increase in x
b is the intercept: the predicted y when x = 0
The model learns m and b by minimizing the mean squared error between
predictions and actual values.
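To make "minimizing the mean squared error" concrete, here is a minimal sketch that fits m and b with the closed-form least-squares solution for one predictor. The six data points are made up for illustration and are not BRFSS values; the helper names `fit_line` and `mse` are hypothetical and are not part of the notebook's starter code.

```python
# Toy data (made up for illustration, NOT real BRFSS values):
# x = 1 if the respondent is physically active, 0 otherwise; y = BMI.
x = [0, 0, 0, 1, 1, 1]
y = [30.0, 29.0, 31.0, 27.0, 26.0, 28.0]


def fit_line(x, y):
    """Return the slope m and intercept b that minimize mean squared error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Closed-form least-squares solution for a single predictor.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def mse(x, y, m, b):
    """Mean squared error of the line y = m * x + b on the data."""
    return sum((yi - (m * xi + b)) ** 2 for xi, yi in zip(x, y)) / len(x)


m, b = fit_line(x, y)
print(f"m = {m:.2f}, b = {b:.2f}, MSE = {mse(x, y, m, b):.3f}")
# prints: m = -3.00, b = 30.00, MSE = 0.667
```

Because x is a 0/1 variable here, the intercept b equals the mean BMI of the x = 0 group, and the slope m equals the difference between the two group means.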
How well does the model fit?
r² (R-squared) measures the proportion of variation in y that is explained
by the model. An r² of 1.0 means the model perfectly predicts y; r² = 0 means
the model does no better than predicting the mean of y.
r² = 1 - (sum of squared residuals) / (total sum of squares)
In practice, what counts as a good r² depends on the domain. In tightly
controlled physics experiments, r² above 0.99 is common. When predicting human
behavior from a single variable, r² between 0.10 and 0.30 is often considered
reasonable.
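The r² formula above can be computed by hand in a few lines. The sketch below uses made-up BMI values (not BRFSS data) and a hypothetical helper `r_squared` to check the two boundary cases described earlier: perfect predictions give 1, and always predicting the mean gives 0.

```python
def r_squared(y_true, y_pred):
    """r² = 1 - (sum of squared residuals) / (total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1 - ss_res / ss_tot


# Made-up BMI values for six respondents (illustration only).
y_true = [30.0, 29.0, 31.0, 27.0, 26.0, 28.0]

perfect = list(y_true)                                  # predicts every value exactly
mean_only = [sum(y_true) / len(y_true)] * len(y_true)   # always predicts the mean
group_means = [30.0, 30.0, 30.0, 27.0, 27.0, 27.0]      # one prediction per group

print(f"{r_squared(y_true, perfect):.3f}")      # 1.000
print(f"{r_squared(y_true, mean_only):.3f}")    # 0.000
print(f"{r_squared(y_true, group_means):.3f}")  # 0.771
```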
💻
Run the linear regression cells in the notebook. The
starter code fits a model predicting BMI from physical activity, then from age.
💻
Fit a multiple regression model using several predictors.
Observe how r² changes when you add more variables.
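As a rough sketch of what adding predictors does to r², the example below fits a least-squares model with one predictor and then with two. It uses NumPy on synthetic data, which is an assumption made for illustration: the notebook's starter code may use a different library, and the numbers generated here are not BRFSS values. The helper `fit_r2` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for two predictors (NOT real BRFSS values).
active = rng.integers(0, 2, size=n)   # 0/1, like PhysActivity
age = rng.integers(1, 14, size=n)     # 1..13, like the Age category
# Made-up relationship plus noise, chosen only for illustration.
bmi = 28.0 - 1.5 * active + 0.2 * age + rng.normal(0, 4, size=n)


def fit_r2(predictors, y):
    """Least-squares fit with an intercept; returns (coefficients, r²)."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ coef
    ss_res = float(residuals @ residuals)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return coef, 1 - ss_res / ss_tot


coef_one, r2_one = fit_r2([active], bmi)
coef_two, r2_two = fit_r2([active, age], bmi)
print(f"r² with PhysActivity only:    {r2_one:.3f}")
print(f"r² with PhysActivity and Age: {r2_two:.3f}")
```

When r² is measured on the same data used for fitting, adding a predictor to a least-squares model with an intercept can never lower it. A higher r² on its own is therefore not evidence that the new variable matters.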
Part 3: Association Rules
Linear regression asks: given x, what is the predicted value of y? Association
rule mining asks a different question: which variables tend to occur together?
A classic example is market basket analysis: if a customer buys bread and
butter, how likely are they to also buy jam? In health data, we might ask: do
people who smoke also tend to drink heavily? Do people who exercise regularly
also tend to eat more fruits and vegetables?
Key terms
Support of a rule A → B: the fraction of records where both A and B occur.
Confidence of a rule A → B: given that A occurs, how often does B also
occur? (Conditional probability: P(B | A))
Lift of a rule A → B: how much more likely is B given A, compared to how often
B occurs overall?
Lift = Confidence(A → B) / Support(B)
Lift > 1 means A and B co-occur more than chance. Lift = 1 means A and B are
independent. Lift < 1 means A makes B less likely.
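The three definitions above can be checked by hand on a tiny example. The ten records below are invented for illustration (they are not BRFSS respondents), and the helpers `support`, `confidence`, and `lift` are hypothetical names, not functions from any library.

```python
# Ten invented records; each set lists the behaviors that respondent reported.
records = [
    {"Smoker", "HvyAlcoholConsump"},
    {"Smoker", "HvyAlcoholConsump"},
    {"Smoker"},
    {"Smoker"},
    {"HvyAlcoholConsump"},
    {"PhysActivity"},
    {"PhysActivity"},
    {"PhysActivity", "Smoker"},
    {"PhysActivity"},
    {"PhysActivity"},
]


def support(records, items):
    """Fraction of records that contain every item in `items`."""
    items = set(items)
    return sum(items <= r for r in records) / len(records)


def confidence(records, a, b):
    """P(B | A): support of A and B together, divided by support of A."""
    return support(records, set(a) | set(b)) / support(records, a)


def lift(records, a, b):
    """Confidence(A -> B) divided by Support(B)."""
    return confidence(records, a, b) / support(records, b)


a, b = {"Smoker"}, {"HvyAlcoholConsump"}
print(f"support    = {support(records, a | b):.2f}")   # 0.20
print(f"confidence = {confidence(records, a, b):.2f}")  # 0.40
print(f"lift       = {lift(records, a, b):.2f}")        # 1.33
```

In this toy data, 2 of 10 records contain both behaviors, 5 contain Smoker, and 3 contain HvyAlcoholConsump, so confidence is 2/5 and lift is (2/5) / (3/10), which is greater than 1.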
💻
Run the association rule cells in the notebook. The
mlxtend library implements the Apriori algorithm. Start with a low minimum
support threshold and observe which rules are discovered.
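To see why the minimum support threshold matters, here is a simplified pure-Python sketch of the level-wise idea behind Apriori: an itemset can only be frequent if all of its subsets are frequent, so larger candidates are built only from smaller itemsets that already passed the threshold. This is not mlxtend's implementation, the function name `frequent_itemsets` is hypothetical, and the records are invented for illustration.

```python
from itertools import combinations


def frequent_itemsets(records, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(records)

    def supp(itemset):
        return sum(itemset <= r for r in records) / n

    items = sorted({item for r in records for item in r})
    current = [frozenset([i]) for i in items
               if supp(frozenset([i])) >= min_support]
    found = {}
    k = 1
    while current:
        for s in current:
            found[s] = supp(s)
        k += 1
        # Candidates of size k: unions of two frequent (k-1)-itemsets.
        candidates = {a | b for a in current for b in current
                      if len(a | b) == k}
        # Pruning step: keep a candidate only if every (k-1)-subset is
        # already known to be frequent and its own support is high enough.
        current = [
            c for c in candidates
            if all(frozenset(sub) in found for sub in combinations(c, k - 1))
            and supp(c) >= min_support
        ]
    return found


# Ten invented records (NOT BRFSS respondents).
records = [
    {"Smoker", "HvyAlcoholConsump"}, {"Smoker", "HvyAlcoholConsump"},
    {"Smoker"}, {"Smoker"}, {"HvyAlcoholConsump"},
    {"PhysActivity"}, {"PhysActivity"}, {"PhysActivity", "Smoker"},
    {"PhysActivity"}, {"PhysActivity"},
]

for threshold in (0.25, 0.15, 0.05):
    result = frequent_itemsets(records, threshold)
    print(threshold, len(result), "frequent itemsets")
```

Lowering the threshold admits rarer combinations: in this toy data, the Smoker and HvyAlcoholConsump pair (support 0.2) appears only once the threshold drops below 0.2.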
Part 4: Ethical Considerations
The BRFSS data was collected to understand population health. But the models
we can build from it—linear regression that predicts BMI, association rules
that link behaviors to disease—could be used in ways the original researchers
did not intend.
Read through the questions below, discuss them with your group, and write your
answers in the notebook.
Questions for discussion
1. Causation vs. correlation
The regression model shows that, on average, people who report physical activity
have lower BMI. Does this mean that exercising causes lower BMI? What else
might explain this association?
Your answer:
2. Self-reporting bias
BRFSS data is self-reported: respondents answer questions about their own
behaviors. What kinds of behaviors might people systematically under-report or
over-report? How might this affect the accuracy of a model trained on this data?
Your answer:
3. Predictive use
Suppose a health insurance company trained a model on BRFSS data to predict
an applicant's likely future healthcare costs, and used the predictions to set
insurance premiums.
What are the potential benefits of this use?
What are the potential harms?
Who might be unfairly disadvantaged?
Is this use consistent with what respondents consented to when they
participated in the survey?
Your answer:
4. Transparency and accountability
If a person were denied affordable insurance based on a model's prediction,
should they have the right to know why? Should they be able to challenge the
prediction?
Your answer:
5. Your own position
Based on your discussion, write a short paragraph (4–6 sentences) summarizing
your view: under what conditions, if any, is it appropriate to use health
survey data to build predictive models for commercial purposes?