estimation: Questions and suggestions for revision

Open questions

1. The "previous BRFSS lab." The devnote says "Make it a continuation of the BRFSS lab." I don't see a previous BRFSS lab in the dp course directory. I've written the estimation lab as if BRFSS is introduced here for the first time. Is there a prior BRFSS lab I should reference and build on? If so, what did students do in it?

2. Dataset source. The notebook currently loads BRFSS data from a GitHub URL (a cleaned Kaggle version). This is fragile: the load fails if the URL changes or the network is unavailable. Should I bundle a local copy in the repo? The file is ~25 MB, which is large for a git repo. Alternatives: (a) use the kaggle CLI to download it, (b) use a different, smaller health dataset, (c) host it on makingwithcode.org.

3. A4.3.5 (association rules) and the crime analysis example. The IB standard explicitly mentions crime analysis as an example. Should I include a secondary example using crime data (alongside the BRFSS data) to make the direct connection to the standard clearer for teachers reviewing alignment?

4. Should students train a classification model here? The teaching note raises this as an open question. Having students predict diabetes (binary) from health behaviors using logistic regression or a decision tree would unify the regression and classification threads. But it might make the lab feel too long or unfocused. What do you prefer?

5. Jupyter notebooks. This is the first lab in the sequence to use notebooks. The devnote says "Back to jupyter notebooks," implying students have used them before. Should the lab include a brief refresher on notebook usage, or assume competence from a previous course?

6. US-centric data. BRFSS covers only US respondents. For an IB course with international students, this may feel less relevant. Should I include an alternative non-US dataset (e.g., WHO data) or frame the US context explicitly?
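
On question 2, a middle path between bundling the file and loading it from a URL on every run is to download once and reuse a local copy. The sketch below is only an illustration under assumptions: `fetch_once` and its `cache_path` argument are hypothetical names, the URL is passed in by the caller rather than guessed here, and only the standard library is used.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_once(url: str, cache_path: str) -> Path:
    """Download url to cache_path unless a non-empty copy already exists.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    path = Path(cache_path)
    if path.exists() and path.stat().st_size > 0:
        return path  # reuse the cached copy; no network access needed
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".part")
    urlretrieve(url, tmp)  # download under a temporary name first
    tmp.replace(path)      # rename only after the download has finished
    return path
```

The cache directory could be added to `.gitignore` so the ~25 MB file never enters the repo, while a run that already has the file keeps working offline.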

Suggestions for improvement

Confounding variables. The ethical discussion touches on correlation versus causation but does not engage deeply with confounding. A brief exploration of a confounding variable (e.g., income may influence both physical activity and BMI) would make the causal reasoning more rigorous.
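
One way such an exploration might look is a small simulation in which income drives both activity and BMI, with no direct link between the two. Everything here is an assumption made for illustration: the data are synthetic standardized scores, the coefficients are made up, and `pearson` and `partial_corr` are hypothetical helper names. The adjustment uses the standard first-order partial correlation formula.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(xs, ys, zs):
    """Correlation of xs and ys after controlling for zs."""
    r_xy, r_xz, r_yz = pearson(xs, ys), pearson(xs, zs), pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Synthetic data: income influences both variables; activity has NO
# direct effect on BMI in this made-up model.
rng = random.Random(0)
income = [rng.gauss(0, 1) for _ in range(5000)]
activity = [0.8 * z + rng.gauss(0, 1) for z in income]
bmi = [-0.8 * z + rng.gauss(0, 1) for z in income]

raw = pearson(activity, bmi)
adjusted = partial_corr(activity, bmi, income)
print(f"raw r = {raw:.2f}, adjusted for income r = {adjusted:.2f}")
```

Students would see a clearly negative raw correlation that largely disappears once income is controlled for, which gives the discussion something concrete to reason about.
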
Residual plots. Adding a residual plot after each regression would help students see where the model fails (non-linear relationships, heteroscedasticity) and build intuition for model diagnostics.
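
As a rough sketch of what a residual cell could compute, the example below fits a least-squares line to deliberately curved toy data and lists the residuals. The data are made up, and `fit_line` and `residuals` are hypothetical helper names, not functions from the notebook.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx
    return slope, intercept


def residuals(xs, ys):
    """Observed minus fitted values for the least-squares line."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


# Toy data with a curved (quadratic) relationship, chosen so the
# straight-line fit is visibly wrong.
xs = list(range(11))
ys = [x ** 2 for x in xs]
res = residuals(xs, ys)

for x, r in zip(xs, res):
    print(f"x = {x:2d}  residual = {r:6.1f}")
```

The residuals are positive at both ends and negative in the middle, the U-shaped pattern that signals a non-linear relationship. In the notebook, the same `(x, residual)` pairs could be drawn as a scatter plot with a horizontal line at zero.
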
Binarize BMI for association rules. Currently, BMI (a continuous variable) is excluded from association rule mining. Adding a cell that binarizes BMI (e.g., BMI >= 30 = "obese", BMI < 25 = "healthy") and then mines associations would connect Parts 2 and 3 more tightly.
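
A minimal sketch of such a cell follows, assuming each respondent can be represented as a set of item labels. The cutoffs are parameters whose defaults (obese at 30 and above, healthy below 25) are assumptions to adjust; `bmi_label` and `rule_stats` are hypothetical names, and the toy respondents are made up rather than taken from BRFSS.

```python
def bmi_label(bmi, obese_from=30.0, healthy_below=25.0):
    """Return "obese", "healthy", or None for values between the cutoffs."""
    if bmi >= obese_from:
        return "obese"
    if bmi < healthy_below:
        return "healthy"
    return None


def rule_stats(rows, antecedent, consequent):
    """Support and confidence of the rule antecedent -> consequent.

    rows is a list of sets of item labels, one set per respondent.
    """
    n_ante = sum(1 for r in rows if antecedent <= r)
    n_both = sum(1 for r in rows if (antecedent | consequent) <= r)
    support = n_both / len(rows)
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Made-up toy respondents: (BMI, behavior flags). Not real BRFSS values.
toy = [
    (32.1, {"smoker"}),
    (23.4, {"active"}),
    (27.0, {"active"}),
    (35.8, set()),
    (22.0, {"active"}),
]

rows = []
for bmi, flags in toy:
    label = bmi_label(bmi)
    rows.append(flags | ({label} if label else set()))

support, confidence = rule_stats(rows, {"active"}, {"healthy"})
print(f"active -> healthy: support = {support:.2f}, confidence = {confidence:.2f}")
```

Note that respondents between the two cutoffs receive no BMI item at all, a design choice worth surfacing to students because it changes which rules can be found.
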
Teacher version of the ethical discussion. The ethical questions in Part 4 are open-ended. A teacher's guide with "seed ideas" for each question (not answers, but things to listen for) would help teachers facilitate the discussion. Currently this guidance appears only in the teaching note, which is already long.