Machine Learning

Statistics 561, M-W 10:15-11:45

Welcome to STA 561: Probabilistic Machine Learning, Spring 2024

Quick references:

  • Instructor: Eric Laber, eric.laber@duke.edu, laber-labs.com
  • Office hours: M 11:30AM-12:30PM, Sun 8PM, or by appointment; location: Zoom
  • TAs:
    • Jingan Zhou, jingan.zhou@duke.edu
    • Miles Martinez, miles.martinez@duke.edu
    • Yixin Zhang, yixin.zhang7@duke.edu
    • Sanjev Vishnu Thulasiraman, sanjevvishnu.thulasiraman@duke.edu
    • Jay Bagrecha, jay.bagrecha@duke.edu
    • Rihui Oh, rihui.ou@duke.edu
    • Yen-Chun Liu, yenchun.liu@duke.edu
  • TA office hours will be set in your lab sections
  • Course Sakai Page

Overview

The goal of this course is to introduce the statistical underpinnings needed to solve many modern statistical problems. Our focus will be on key ideas in prediction and decision making. Often we will try to find the simplest version of a problem/algorithm/idea that illustrates the salient features while leaving more complex, nuanced versions to homework or self-study. On a related note, this course is not a catalog of machine learning algorithms and all their variants; such a catalog would immediately be out-of-date, as new methods are constantly being introduced. (Furthermore, learning and using new methods becomes dramatically easier if one has a strong intuitive and theoretical understanding of the foundations of statistics/ML.) While much of our lecture time will be spent on proofs and derivations, the homework will involve putting these ideas into practice with simulation experiments or data analyses.

Pre-requisites

I will assume that students have a basic understanding of mathematical statistics, calculus, basic analysis, linear algebra, and computing. There are many excellent resources online for shoring up gaps in these areas; Coursera, EdX, Udemy, YouTube, etc., are great places to start. While I will do my best to review key ideas, I will take for granted that students know basic results such as the strong law of large numbers, the central limit theorem, and matrix decompositions. If there is sufficient interest, I will also hold several review sessions throughout the semester to help students prepare for the more technical material.
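
If you want a quick self-check on this background, here is a minimal Python sketch (my own illustration, not course material; the Exponential(1) distribution and the sample sizes are arbitrary choices) of two of the results I will take for granted: the central limit theorem, checked by simulation, and a matrix decomposition via NumPy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Central limit theorem: standardized means of n Exponential(1) draws
# (mean 1, sd 1) should be approximately N(0, 1) for moderately large n.
n, reps = 500, 10_000
draws = rng.exponential(scale=1.0, size=(reps, n))
z = (draws.mean(axis=1) - 1.0) / (1.0 / np.sqrt(n))
print(f"mean of z (should be near 0): {z.mean():.3f}")
print(f"sd of z   (should be near 1): {z.std():.3f}")

# Matrix decompositions: eigendecomposition of a symmetric PSD matrix,
# then reconstruct it from its eigenvalues and eigenvectors.
A = rng.standard_normal((5, 5))
S = A @ A.T
eigvals, eigvecs = np.linalg.eigh(S)
recon = eigvecs @ np.diag(eigvals) @ eigvecs.T
print(f"reconstruction error (should be ~0): {np.linalg.norm(S - recon):.2e}")
```

If either half of this sketch feels unfamiliar, that is a good place to start shoring up.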

Syllabus (subject to change; roughly one topic per week)

  1. Linear regression review
  2. Linear regression, regularization, and noise addition (see the sketch after this list)
  3. Cross-validation and inference
  4. Post-selection inference
  5. Linear regression and online estimation
  6. Kernel methods
  7. Random forests
  8. Partial linear models
  9. Active learning (i.e., sequential experimental design)
  10. Large margin classifiers
  11. Nearest neighbor methods
  12. Batch decision problems (one-stage)
  13. Batch decision problems (multi-stage)
  14. Contextual bandits
  15. Reinforcement learning
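
To give a flavor of how the homework will put the early topics into practice, here is a minimal ridge regression sketch (the synthetic data and the lambda values are my own arbitrary choices; this is an illustration, not the course's canonical implementation). Ridge regression has the closed form b = (X'X + lambda * I)^{-1} X'y:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: y = X @ beta + noise, with only a few active coefficients.
n, p = 200, 10
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:3] = [2.0, -1.0, 0.5]
y = X @ beta + rng.standard_normal(n)

def ridge(X, y, lam):
    """Closed-form ridge estimate: (X'X + lam * I)^{-1} X'y."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

# lam = 0 recovers ordinary least squares; larger lam shrinks the estimate.
for lam in [0.0, 1.0, 100.0]:
    b = ridge(X, y, lam)
    print(f"lambda = {lam:6.1f}   ||beta_hat|| = {np.linalg.norm(b):.3f}")
```

The closed form makes the trade-off concrete: lambda = 0 is ordinary least squares, and increasing lambda shrinks the coefficient norm at the cost of bias.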

Notes on the notes below

I’ll be updating the notes as we go along. Some I’ll update a lot, some only a little. Read ahead of class at your own peril. But, be warned, you might learn something that’s not on the exam (gasp!). Also, there isn’t an exam (double gasp!).

References

We will primarily use slides and the (virtual) whiteboard for lectures. A list of references for background and/or further study will be provided with each topic. General references that you may find useful include:

  • Elements of Statistical Learning, Hastie, Tibshirani, and Friedman PDF
  • Reinforcement Learning: An Introduction, Sutton and Barto PDF
  • Pattern Classification, Duda, Hart, and Stork Amazon. (There are PDFs online from the authors, but they’ve asked others not to distribute them, so they are not linked here.)

Some references on classic linear models that may be useful for background include:

  • Linear Models with Python, Faraway Amazon
  • Transformation and Weighting in Regression, Carroll and Ruppert Amazon

Advice

The objective of this course is to develop your statistical thinking for prediction and decision problems. I strongly encourage you to work with your classmates on all homework and projects, and to focus on deep understanding rather than on your grades. Some of the problems will be ambiguous and open-ended. This is (mostly) intentional and is meant to give you practice making choices and assumptions when you approach an ill-defined problem. (In applications, problems are rarely cleanly posed when they first reach you.) I also encourage you to seek out other sources that explain the same material in another way or that explore issues we didn’t cover in class.

Grading

Grades will be based on homework (80%) and a project (20%). You can work with your classmates on everything; thus, by appealing to the wisdom of crowds, there is no reason for low homework scores. Late homework will not be accepted, but the lowest homework score will be dropped.

COVID

Due to COVID-19, the course will be partly online. This is unfortunate because meeting in person often gives me clues about which topics are causing confusion (or boredom!) and lets me adapt the material accordingly. Because I won’t be able to see your eyes glaze over or see the pained expressions on your faces, we will need to take extra steps to make sure that everyone is following along. Please let me or your TAs know if you are struggling with the material and what extra content might be helpful. We may also need to slow down to accommodate this new format. I would rather you learn 10 topics well than 15 topics poorly.

Once my student

Once my student, always my student. After this class is over, please don’t hesitate to reach out if there’s something you think I might be able to help you with. (Statistics-wise, life-wise, etc. However, I don’t want to support the GoFundMe for your novel; I read a few of the chapters you released online, and let’s be honest, they’re not very good. Your writing is terrible and we are all dumber for having read them. Now, that idea you had for a doggie-door that does automatic grooming…I might be able to get behind that.)

Lectures

  1. Review of linear regression
  2. Linear regression and regularization
  3. Scaling up
  4. Classification
  5. Decision making