classification_features: Questions and suggestions for revision

Review these after reading the lab draft. I'll revise based on your responses.


Open questions

1. Dataset access. The lab downloads the SMS Spam Collection from the UCI repository at runtime. This is convenient but requires internet access and depends on a third-party URL being stable. Should we bundle the dataset in the repo instead? (It's ~500KB, small enough to include.)

2. Scope of data cleaning. The current lab covers duplicates, class imbalance, and encoding quirks. Should it also include a discussion of missing data (there is none in this dataset) and outliers (some messages are very long)? Or is it better to keep the data-cleaning section focused on what's actually in this dataset?

3. Cross-validation. The lab uses a single train/test split. k-fold cross-validation would give a more reliable performance estimate, because every message is held out exactly once, and it is standard practice, but it adds complexity. Should I add a brief section on CV, or save it for a later lab?

4. The matrices segue. The devnote sketch says "At this point, we introduce matrices." That step is no longer in the lab. I assumed the matrix introduction would come naturally in the neural networks lab when we discuss the forward pass. Is that right, or should this lab end with a brief introduction to the matrix representation of feature vectors?

5. TF-IDF. The original sketch included TF-IDF as a way to go beyond hand-designed features. TF-IDF is currently not in the lab; I moved it to classification_neural as an extension. Should it be here instead, as a bridge to learned features?

6. Duration. The lab is marked as 8 periods, matching the other labs. Do the four parts feel like they fit 8 periods? Part 4 (logistic regression + tuning) might run short—should I add more depth there, or use that time for a class competition to build the best classifier?
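
To help judge the added complexity in question 3, here is a rough sketch of k-fold cross-validation in plain Python. Everything in it is an assumption for illustration: the toy data (one made-up feature per message, the number of "!" characters), the threshold classifier, and the function names are hypothetical and not taken from the lab draft. The folds are not stratified, which the real lab would probably want given the class imbalance.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and deal them into k nearly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(X, y, k, fit, predict):
    """Return one accuracy per fold; each fold is held out exactly once."""
    scores = []
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in held_out)
        scores.append(correct / len(held_out))
    return scores

# Hypothetical toy data: feature = count of "!" in a message, label 1 = spam.
X = [0, 1, 0, 0, 1, 0, 5, 4, 6, 0, 1, 7, 0, 3, 0, 1, 0, 5, 0, 2]
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1]

def fit(xs, ys):
    """Hypothetical 'training': pick the threshold with the best training accuracy."""
    def train_accuracy(t):
        return sum((x >= t) == bool(label) for x, label in zip(xs, ys))
    return max(range(8), key=train_accuracy)

def predict(threshold, x):
    """Predict spam (1) when the feature reaches the learned threshold."""
    return int(x >= threshold)

scores = cross_validate(X, y, k=5, fit=fit, predict=predict)
mean_score = sum(scores) / len(scores)
print("per-fold accuracy:", scores)
print("mean accuracy:", mean_score)
```

The only new idea compared with a single split is the loop over folds and the averaging at the end, so a brief CV section may be feasible if the fit and predict steps are already in place from earlier parts of the lab.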


Suggestions for improvement