PHY 7097 "Machine Learning"
Class Diary for Fall 2020 (tentative!)

 

Date Notes
Tuesday 09/01 Introductions, course policies. Syllabus
Inspirational reading: UF AI Initiative; UF first US university to acquire world's most advanced AI system.
Motivational reading: Physics Careers: the Myths, the Data and Tips for Success; invited talk at the Career Forum of the Pheno 2020 conference.
Background reading: AI in particle physics
Hot off the press: NSF AI Research Institutes
New job openings: AI fellows
Tuesday 09/01 Introduction to Google Colab. Getting set up for the Python tutorials.
Thursday 09/03 Python tutorial (part 1):
00. Introduction
01. How to run Python code
02. A quick tour of Python language syntax
03. Basic Python semantics: variables and objects
04. Basic Python semantics: operators
Supplemental reading for Labor Day: Chapter 1.
Tuesday 09/08 Python tutorial (part 2):
05. Built-in scalar types: simple values
06. Built-in data structures
Tuesday 09/08 07. Control flow.
Team exercises: analyze the elements of a list (min, max, mean); swap the contents of two variables. (A short sketch appears at the end of this entry.)
08. Defining and using functions
09. Errors and exceptions.
Hint: avoid giving your variables the same names as built-in functions
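A minimal sketch for the team exercise, assuming a plain Python list of numbers (the data and variable names are illustrative):

values = [3, 1, 4, 1, 5, 9, 2, 6]           # toy data, assumed
print("min:", min(values))                   # smallest element
print("max:", max(values))                   # largest element
print("mean:", sum(values) / len(values))    # arithmetic mean

a, b = 10, 20
a, b = b, a                                  # swap via tuple unpacking
print(a, b)                                  # 20 10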
Thursday 09/10 Python tutorial (part 3):
10. Iterators
11. List comprehensions
12. Generators
(A short sketch contrasting list comprehensions and generators appears at the end of this entry.)
Practice quiz
More Python exercises and quizzes (with answers) are available here.
Inspiring message from the president.
For individual reading:
14. String manipulation and regular expressions
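A short illustrative sketch (not from the tutorial) contrasting a list comprehension, which builds the whole list in memory, with a generator, which yields values lazily:

squares_list = [n ** 2 for n in range(10)]   # list comprehension: built eagerly
squares_gen = (n ** 2 for n in range(10))    # generator expression: evaluated lazily
print(squares_list)
print(sum(squares_gen))                      # a generator can be consumed only once

def countdown(n):
    """A generator function: yields values one at a time."""
    while n > 0:
        yield n
        n -= 1

print(list(countdown(3)))                    # [3, 2, 1]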
Tuesday 09/15 Python tutorial (part 4):
13. Modules and packages
15. A preview of data science tools
16. Resources for further reading
Quiz
Tuesday 09/15 What is data science?
NumPy tutorial. Begin reading Chapter 2 "Introduction to NumPy".
2.1. "Understanding Data Types in Python"
A Python List Is More Than Just a List, Fixed-Type Arrays in Python, Creating Arrays from Python Lists, Creating Arrays from Scratch.
2.2. "The Basics of NumPy Arrays"
NumPy Array Attributes, Array Indexing: Accessing Single Elements, Array Slicing: Accessing Subarrays, Subarrays as no-copy views, Creating copies of arrays, Reshaping of Arrays, Array Concatenation and Splitting.
2.3. "Computation on NumPy Arrays: Universal Functions"
The Slowness of Loops, Introducing UFuncs, Exploring NumPy's UFuncs: array arithmetic, absolute value, trigonometric functions, exponents and logarithms, specialized ufuncs.
2.4. "Aggregations: Min, Max, and Everything In Between"
Summing the Values in an Array, Minimum and Maximum, Multidimensional aggregates, Example: What Is the Average Height of US Presidents?
Thursday 09/17 NumPy tutorial. Finish reading Chapter 2 "Introduction to NumPy".
2.5. "Computation on Arrays: Broadcasting"
Introducing broadcasting. Fig. 2-4: Visualization of NumPy broadcasting. Rules of broadcasting. (Skip the rest of that section.) A short warm-up sketch covering broadcasting, masks, fancy indexing, and sorting appears at the end of this entry.
2.6. "Comparisons, Masks, and Boolean Logic"
Skip "Example: Counting Rainy Days in Seattle". Go over: Comparison operators as ufuncs. Working with Boolean arrays. Skip the rest.
2.7. "Fancy Indexing"
Read: Exploring fancy indexing. Skip: Combined Indexing. Read: Example: selecting random points. Skip: the rest of the section.
2.8. "Sorting Arrays"
Read: Fast Sorting in NumPy: np.sort and np.argsort. Sorting along rows or columns. Partial sorts: Partitioning. Example: k-Nearest neighbors.
2.9. "Structured Data: NumPy's Structured Arrays" (Skip that section)
Programming practice: NumPy exercises.
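A short warm-up sketch (illustrative, not from the book) touching the broadcasting, masking, fancy-indexing, and sorting topics above; the arrays are assumed toy data:

import numpy as np

rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(3, 4))       # toy 3x4 integer array

# Broadcasting: subtract the column means (shape (4,)) from every row of X (shape (3, 4)).
centered = X - X.mean(axis=0)

# Boolean masking: comparison operators act elementwise and produce a mask.
mask = X > 5
print("values above 5:", X[mask])

# Fancy indexing: pass an array of indices to pull out several elements at once.
flat = X.ravel()
print("picked elements:", flat[[0, 3, 7]])

# Sorting: np.argsort returns the indices that would sort the array.
row = X[0]
print("sorted row:", row[np.argsort(row)])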
Tuesday 09/22 Finish discussing the NumPy exercises.
Matplotlib tutorial:
04-01. Simple Line Plots. Adjusting the Plot: Line Colors and Styles. Adjusting the Plot: Axes Limits. Labeling Plots. Aside: Matplotlib Gotchas.
04-02. Simple Scatter Plots. Scatter Plots with plt.plot. Scatter Plots with plt.scatter. plot Versus scatter: A Note on Efficiency.
04-03. Visualizing Errors. Basic Errorbars. [Skip: Continuous Errors]
Tuesday 09/22 Matplotlib tutorial:
04-04. Density and Contour Plots. Visualizing a Three-Dimensional Function.
04-05. Histograms, Binnings, and Density. [Bug fix: replace normed=True with density=True.] Two-Dimensional Histograms and Binnings. [Skip: Kernel density estimation.] (A short snippet with the corrected call appears at the end of this entry.)
04-06. Customizing Plot Legends. Choosing Elements for the Legend. Legend for Size of Points. [Bug fix: use the full URL of the California cities data file: https://raw.githubusercontent.com/jakevdp/PythonDataScienceHandbook/master/notebooks/data/california_cities.csv] [Skip: Multiple Legends.]
04-07. Customizing Colorbars. [Skip: Color limits and extensions.] Discrete Color Bars. Example: Handwritten Digits.
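A minimal snippet showing the corrected plt.hist call with density=True (the toy data below is assumed for illustration):

import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x = rng.normal(size=1000)                  # toy sample

# Old call (removed in newer Matplotlib): plt.hist(x, bins=30, normed=True)
plt.hist(x, bins=30, density=True)         # density=True normalizes the histogram to unit area
plt.xlabel('x')
plt.ylabel('density')
plt.show()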
Thursday 09/24 Finish the Matplotlib tutorial:
04-08. Multiple Subplots. plt.axes: Subplots by Hand. plt.subplot: Simple Grids of Subplots. plt.subplots: The Whole Grid in One Go. plt.GridSpec: More Complicated Arrangements.
04-09. Text and Annotation. Example: Effect of Holidays on US Births [Use the full address for the data file https://raw.githubusercontent.com/jakevdp/data-CDCbirths/master/births.csv]. Transforms and Text Position. Arrows and Annotation.
04-10. Customizing Ticks. Hiding Ticks or Labels. Major and Minor Ticks. Reducing or Increasing the Number of Ticks. Fancy Tick Formats.
04-11. Customizing Matplotlib: Configurations and Stylesheets. [Skip: Plot Customization by Hand (if you decide to try it, replace ax = plt.axes(axisbg='#E6E6E6') with ax = plt.axes(facecolor='#E6E6E6') to fix the error) and Changing the Defaults: rcParams.] Stylesheets.
04-12. Three-Dimensional Plotting in Matplotlib. Three-dimensional Points and Lines. Three-dimensional Contour Plots. Wireframes and Surface Plots. Surface Triangulations. Example: Visualizing a Möbius strip.
Skip the remaining sections 04-13 to 04-15.
Tuesday 09/29 Team exercises: Numpy and Matplotlib practice.
Quiz
Tuesday 09/29 05-01. What is Machine Learning? Examples of:
Supervised learning: classification and regression.
Unsupervised learning: clustering and dimensionality reduction.
Thursday 10/01 Statistics 101: Exploratory data analysis.
Key terms for data types: continuous, discrete, categorical, binary, ordinal.
Key terms for rectangular data: data frame, feature, outcome, records. Non-rectangular data structures.
Key terms for estimates of location: mean, weighted mean, median, weighted median, trimmed mean, outliers, robustness.
Key terms for variability metrics: deviations, variance, standard deviation, mean absolute deviation, median absolute deviation from the median, range, order statistics, percentile, interquartile range (IQR).
Key terms for distribution shapes: boxplot, frequency table, histogram, density plot, violin plot, contour plot.
Key terms for categorical data: mode, expected value, bar charts, pie charts.
Key terms for correlation: correlation coefficient, correlation matrix, scatterplot.
Tuesday 10/06 Statistics 102:
2. Data and sampling distributions.
2.1. Random sampling and sample bias. Population. Sample. Random and stratified sampling. Bias. Random selection. Sampling with replacement. Sampling without replacement. Sample mean versus population mean. Sample size versus sample quality.
2.2. Selection bias. Vast search effect. Data snooping. Regression to the mean.
2.3. Sampling distribution of a statistic. Sample statistic. Data distribution. Sampling distribution. Central limit theorem. Standard error.
Tuesday 10/06 Statistics 102:
2. Data and sampling distributions.
2.4. The bootstrap. Bootstrap sample. Resampling. Jackknife. (A minimal bootstrap sketch appears at the end of this entry.)
2.5. Confidence intervals. Confidence levels. Interval endpoints.
2.6. Normal distribution. z-score. Standard normal. QQ-plot.
2.7. Long-tailed distributions.
2.8. Student's t-distribution.
2.9. Binomial distribution. Trial. Success. Probability of success.
2.10. Poisson and related distributions. Exponential distribution.
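A minimal bootstrap sketch (illustrative toy data, not from the textbook): estimate a 95% confidence interval for the sample mean by resampling with replacement.

import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=200)        # assumed toy sample

# Draw 5000 bootstrap samples (same size as the data, with replacement)
# and record the mean of each resample.
boot_means = [rng.choice(data, size=data.size, replace=True).mean()
              for _ in range(5000)]

# The 2.5th and 97.5th percentiles of the bootstrap distribution give a 95% CI.
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"bootstrap 95% CI for the mean: [{lo:.2f}, {hi:.2f}]")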
Thursday 10/08 Discussion of the pros and cons of bootstrapping.
Statistics 103:
3. Statistical experiments and significance testing.
3.1. A/B testing. Treatment, treatment group, control group, randomization, subjects, test statistic, blind study, double blind study.
3.2. Hypothesis tests. Null hypothesis, alternative hypothesis, one-way and two-way hypothesis tests.
3.3. Resampling. The bootstrap. Permutation test.
3.4. Statistical significance and p-values. p-value, alpha, type 1 and type 2 errors.
3.6. Multiple testing. Look-elsewhere effect. Adjustment of p-values.
3.9. Chi-square test. Pearson residuals.
3.10. Multi-armed bandit problem. Exploration–exploitation tradeoff. A few strategies: epsilon-first, epsilon-greedy, epsilon-decreasing.
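A minimal epsilon-greedy sketch for a two-armed bandit (the arm success rates below are assumed for illustration):

import numpy as np

rng = np.random.default_rng(1)
true_rates = [0.05, 0.08]          # hypothetical success rates of arms A and B
counts = np.zeros(2)               # number of pulls per arm
values = np.zeros(2)               # running mean reward per arm
epsilon = 0.1                      # exploration probability

for _ in range(10000):
    # Explore a random arm with probability epsilon, otherwise exploit the best arm so far.
    arm = rng.integers(2) if rng.random() < epsilon else int(np.argmax(values))
    reward = float(rng.random() < true_rates[arm])
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]    # incremental mean update

print("pulls per arm:", counts, "estimated rates:", values.round(3))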
Tuesday 10/13 Statistics quiz
5.02. Introducing Scikit-Learn. Data Representation in Scikit-Learn: features and samples, features matrix, target array. Scikit-Learn's Estimator API.
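A minimal sketch of the estimator API pattern (choose a model, fit, predict) using the built-in iris data; the choice of classifier here is an illustrative assumption:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                        # features matrix and target array
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=42)

model = GaussianNB()               # 1. choose a model class and its hyperparameters
model.fit(Xtrain, ytrain)          # 2. fit the model to the training data
y_pred = model.predict(Xtest)      # 3. apply the model to new data
print("accuracy:", accuracy_score(ytest, y_pred))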
Tuesday 10/13 Supervised learning example: Simple linear regression, fit() and predict() methods. Supervised learning example: Iris classification [bug fix: replace cross_validation with model_selection], accuracy score. Unsupervised learning example: Iris dimensionality. Unsupervised learning: Iris clustering [bug fix: replace GMM with GaussianMixture]. Application: Exploring Hand-written Digits: dimensionality reduction [bug fix: replace spectral with Spectral or another valid color map], classification on digits, confusion matrix.
Thursday 10/15 5.03. Hyperparameters and model validation. Thinking about Model Validation: Holdout sets, Model validation via cross-validation [bug fix: change cv=LeaveOneOut(len(X)) to cv=LeaveOneOut() ]. Selecting the Best Model. The Bias-variance trade-off. Validation curve. Validation curves in Scikit-Learn [bug fix: replace "from sklearn.learning_curve import validation_curve" with "from sklearn.model_selection import validation_curve"]. Learning curves. Learning curves in Scikit-Learn. Validation in Practice: Grid Search [bug fix: replace "from sklearn.grid_search import GridSearchCV" with "from sklearn.model_selection import GridSearchCV"] [bug fix: delete the "hold=True" argument].
Team exercise: model validation and hyperparameter optimization. (A minimal cross-validation and grid-search sketch appears below.)
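A minimal cross-validation and grid-search sketch using the corrected model_selection imports mentioned above (the model and parameter grid are illustrative assumptions):

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross-validation: five train/validate splits, one score per fold.
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X, y, cv=5)
print("mean 5-fold CV accuracy:", scores.mean().round(3))

# Grid search: cross-validate every candidate hyperparameter value and keep the best.
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': [1, 3, 5, 7, 9]}, cv=5)
grid.fit(X, y)
print("best hyperparameter:", grid.best_params_)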
Tuesday 10/20 Announcement: ML Hackathon.
Review: Validation curves and learning curves. Cross-validation (cv), number of folds.
Review of the team exercise.
5.04. Feature Engineering. [bug fix: replace "from sklearn.preprocessing import Imputer" with "from sklearn.impute import SimpleImputer" and from then on use SimpleImputer instead of Imputer.] Categorical features: one-hot encoding, sparse matrices. Text features: word counts, term frequency-inverse document frequency. Derived features: basis function regression, polynomial features. Imputation of missing data. Feature pipelines: PolynomialFeatures+LinearRegression.
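A minimal sketch of the SimpleImputer replacement and a feature pipeline, in the spirit of the handbook example (the toy data below is assumed):

import numpy as np
from sklearn.impute import SimpleImputer          # replaces the deprecated Imputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

X = np.array([[np.nan, 0, 3],
              [3, 7, 9],
              [3, 5, 2],
              [4, np.nan, 6],
              [8, 8, 1]])
y = np.array([14, 16, -1, 8, -5])

# Pipeline: impute missing values -> add polynomial features -> fit a linear model.
model = make_pipeline(SimpleImputer(strategy='mean'),
                      PolynomialFeatures(degree=2),
                      LinearRegression())
model.fit(X, y)
print(model.predict(X))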
Tuesday 10/20 5.06. In Depth: Linear regression. Simple Linear Regression. Basis function regression. Polynomial basis functions. Gaussian basis functions. Regularization. Ridge regression (Tikhonov regularization), Lasso regularization. Example: Voronoi interpolation of Monte Carlo sampled functions. (A short ridge-versus-lasso sketch appears at the end of this entry.)
5.05. In Depth: Naive Bayes Classification. Bayesian Classification: Bayes theorem, generative models. Gaussian Naive Bayes. Predicting the posterior probabilities. Multinomial Naive Bayes. Example: classifying text. When to Use Naive Bayes. Team exercise: classifying text for a different set of newsgroups.
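Referring back to the regularization topic above: a short sketch contrasting ordinary least squares with ridge (L2) and lasso (L1) regularized polynomial regression on noisy toy data (the data, degree, and alpha values are illustrative assumptions):

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(0)
x = 10 * rng.random(50)
y = np.sin(x) + 0.1 * rng.normal(size=50)
X = x[:, None]

for name, reg in [('OLS', LinearRegression()),
                  ('Ridge (L2)', Ridge(alpha=0.1)),
                  ('Lasso (L1)', Lasso(alpha=0.01, max_iter=100000))]:
    # Regularization penalizes large polynomial coefficients and tames overfitting.
    model = make_pipeline(PolynomialFeatures(degree=7), StandardScaler(), reg)
    model.fit(X, y)
    print(name, "training R^2:", round(model.score(X, y), 3))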
Thursday 10/22 5.07. In Depth: Support vector machines. Motivating Support Vector Machines: generative versus discriminative classification. Support Vector Machines: Maximizing the Margin: fitting a support vector machine, support vectors, kernel SVM, tuning the SVM: softening the margins. Homework example: Face Recognition. [bug fixes: use "from sklearn.decomposition import PCA as RandomizedPCA"; "from sklearn.model_selection import train_test_split"]
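A minimal SVM sketch with an RBF kernel and a softened margin (the toy dataset and parameter values are illustrative assumptions):

from sklearn.datasets import make_blobs
from sklearn.svm import SVC

# Toy two-class data; C controls how much the margin is softened,
# and the RBF kernel allows nonlinear decision boundaries.
X, y = make_blobs(n_samples=100, centers=2, random_state=0, cluster_std=1.5)
model = SVC(kernel='rbf', C=1.0, gamma='scale')
model.fit(X, y)
print("number of support vectors per class:", model.n_support_)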
Tuesday 10/27 Quiz (can be taken online at any time during nominal class hours, i.e., between 9:30 and 11:30 am)
No class - instead attend the 2020 HiPerGator Symposium
[Optional] Continued participation in the ML4SCI 2020 Hackathon
Tuesday 10/27 No class - instead attend the 2020 HiPerGator Symposium
[Optional] Continued participation in the ML4SCI 2020 Hackathon
Thursday 10/29 5.08. In Depth: Decision trees and random forests. Ensemble methods. Motivating Random Forests: Decision Trees. Creating a decision tree. Decision trees and overfitting. Ensembles of Estimators: Random Forests. BaggingClassifier. Random Forest Regression. Homework example: Random Forest for Classifying Digits. (A minimal random-forest sketch appears at the end of this entry.)
A toy decision tree: The Akinator Genie.
5.11. In Depth: k-means clustering. Introducing k-Means. k-means Algorithm: Expectation-Maximization. Caveats: sensitivity to the initial guess; the number of clusters must be chosen in advance; limited to linear cluster boundaries. SpectralClustering. Homework examples: k-means on digits; k-means for color compression.
[Optional] Continued participation in the ML4SCI 2020 Hackathon
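A minimal random-forest sketch in the spirit of the digits homework example above (the train/test split and number of trees are illustrative assumptions):

from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

X, y = load_digits(return_X_y=True)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, random_state=0)

# An ensemble of randomized decision trees, each fit on a bootstrap sample of the data.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(Xtrain, ytrain)
print("test accuracy:", round(accuracy_score(ytest, forest.predict(Xtest)), 3))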
Tuesday 11/03 Quiz (administered at the beginning of class)
5.12. In Depth: Gaussian Mixture Models. Motivating GMM: Weaknesses of k-Means. Generalizing E-M: Gaussian Mixture Models. Choosing the covariance type. GMM as density estimation. How many components? Akaike and Bayesian information criteria. Bug fixes: replace
from sklearn.mixture import GMM
gmm = GMM(n_components=4).fit(X)

with
from sklearn import mixture
gmm = mixture.GaussianMixture(n_components=4).fit(X)

Also replace
for pos, covar, w in zip(gmm.means_, gmm.covars_, gmm.weights_):
with
for pos, covar, w in zip(gmm.means_, gmm.covariances_, gmm.weights_):
Also replace
Xnew = gmm16.sample(400, random_state=42)
with
Xnew, Ynew = gmm16.sample(400)
5.13. In Depth: Kernel Density Estimation. Motivating KDE: Histograms. Kernel Density Estimation in Practice. Selecting the bandwidth via cross-validation. Bug fixes: replace
hist = plt.hist(x, bins=30, normed=True)
with
hist = plt.hist(x, bins=30, density=True)
Optional homework examples: KDE on a sphere (you may need to follow these additional installation instructions), Not-So-Naive Bayes. Bug fix: replace
scores = [val.mean_validation_score for val in grid.grid_scores_]
with
scores = grid.cv_results_['mean_test_score']
Tuesday 11/03 5.09. In Depth: Principal Component Analysis. Introducing Principal Component Analysis. Components and explained variance. PCA as dimensionality reduction. PCA for visualization: Hand-written digits. [Bug fix: 'spectral' -> 'Spectral'] What do the components mean? Choosing the number of components. PCA as noise filtering. Example: Eigenfaces. Bug fix: replace
from sklearn.decomposition import RandomizedPCA
with
from sklearn.decomposition import PCA as RandomizedPCA
5.10. In Depth: Manifold Learning. Manifold Learning: "HELLO". Multidimensional Scaling (MDS). MDS as Manifold Learning. Nonlinear Embeddings: Where MDS Fails. Nonlinear Manifolds: Locally Linear Embedding. Example: Isomap on Faces. Example: Visualizing Structure in Digits. Bug fix: replace
from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')

with
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')
mnist.target = mnist.target.astype(np.int8) # fetch_openml() returns targets as strings
Thursday 11/05 Discussion of the choices for final projects. Scikit-learn examples database. Programming practice:
Exercise 1: Comparing different clustering algorithms on toy datasets.
Exercise 2: Selecting the number of clusters with silhouette analysis on KMeans clustering.
Exercise 3: Comparison of supervised classifiers.
Tuesday 11/10 Quiz.
Preference poll.
Must-watch video: Neural Networks series at 3blue1brown.
Legacy of neural network research in Physics at UF: R. Field.
Deep learning: overview.
Introduction to neural networks. Artificial neuron. Inputs, weights, connections, bias, activation function, output. The basic structure of a neural network: input layer, output layer, hidden layers. Backpropagation and gradient descent.
Tuesday 11/10 An example of a basic neural network: classifying the handwritten digits from the MNIST dataset.
Variations in the network architecture: different choices for the activation function, the loss function, the optimizer, and the metrics. Hyperparameters: learning rate, regularization, momentum. Available datasets in Keras. (A minimal Keras sketch of such a network appears at the end of this entry.)
Must-watch video: Alphago: The Movie.
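A minimal Keras sketch of a dense network for the MNIST digits (the layer sizes, optimizer, and number of epochs are illustrative choices, not necessarily those used in class):

from tensorflow import keras

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0      # scale pixel values to [0, 1]

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),        # input layer: 28x28 = 784 pixels
    keras.layers.Dense(128, activation='relu'),        # hidden layer
    keras.layers.Dense(10, activation='softmax'),      # output layer: one node per digit
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=5, validation_split=0.1)
print("test accuracy:", model.evaluate(x_test, y_test, verbose=0)[1])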
Thursday 11/12 Class cancelled due to tropical storm Eta.
Tuesday 11/17 General guidelines for building neural networks. Different choices for the architecture and the hyperparameters.
Example: Toy classification with Tensorflow.
Example: Binary classification of the IMDB dataset.
How to deal with overfitting: lower the network capacity, add weight regularization, add dropout.
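A minimal Keras sketch of the three remedies above, in the spirit of the IMDB example (layer sizes, dropout rate, and L2 strength are illustrative assumptions):

from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Dense(16, activation='relu',                       # reduced network capacity
                 kernel_regularizer=regularizers.l2(0.001),   # L2 weight regularization
                 input_shape=(10000,)),
    layers.Dropout(0.5),                                      # dropout
    layers.Dense(16, activation='relu',
                 kernel_regularizer=regularizers.l2(0.001)),
    layers.Dropout(0.5),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='rmsprop', loss='binary_crossentropy', metrics=['accuracy'])
model.summary()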
Tuesday 11/17 Example: multiclass classification of Reuters newswires.
Example: regression on the Boston housing price dataset.
Thursday 11/19 EXAM
Tuesday 11/24 Special lecture by Prof. Matt Gitzendanner on HiPerGator: "Introduction to Research Computing and HiPerGator"
Tuesday 11/24 Convolutional neural network (convnet). Example: classifying the MNIST digits. (A minimal convnet sketch appears at the end of this entry.)
Example: generative learning, variational autoencoders. Example: MNIST digits latent space.
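A minimal convnet sketch for the MNIST digits (the filter counts and layer layout are illustrative assumptions, not the exact network shown in class):

from tensorflow import keras
from tensorflow.keras import layers

(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()
x_train = x_train[..., None] / 255.0        # add a channel axis, scale to [0, 1]
x_test = x_test[..., None] / 255.0

model = keras.Sequential([
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation='relu'),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(64, activation='relu'),
    layers.Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=3, validation_split=0.1)
print("test accuracy:", model.evaluate(x_test, y_test, verbose=0)[1])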
Thursday 11/26 Thanksgiving holiday - no class.
Tuesday 12/01 Rubric (30 pts total):
1. Introducing the topic [10]: what is the question we are trying to answer? Why is it important? Previous approaches - pros and cons. Why do we expect a (new) ML method would help in this case?
2. Machine learning aspect [10]: what ML technique was applied, what dictated the choice of this particular technique, rough description of the technique, choice of hyperparameters, training/validation (if applicable), results, conclusions.
3. Overall impression [5] and time management [5]: optimal mix of text/graphics/formulas, no spelling or grammar mistakes, appropriate font size, labeling the plots/axes, effective use of color/illustrations; finish within the allotted time of 12 min. (leaving 3 min for questions).

The remaining 10 pts will be distributed for submitting an evaluation and feedback for at least ten presentations.

The deadline for preparing the final projects is December 1, and everybody should be ready to present on that date. The order of speakers was chosen randomly on Nov. 30 and is announced below.

09:30 am Final project presentation: session 1 (4 talks).

01. JA: "OPTICS clustering"
04. AD: "Predicting cryogenic thermalization with neural networks"
07. NG: "Using topological data science to detect cosmic voids"
15. AR: "Transformer modelling in music or something easier"

Tuesday 12/01 10:40 am Final project presentation: session 2 (3 talks).

06. SG: "Affinity propagation clustering"
14. NR: "Binary classification of Higgs from 4 lepton background processes using Neural network"
09. SKu: "Predicting match results in the English Premier League"

Thursday 12/03 10:00 am Final project presentation: session 3 (6 talks). Note the unusual early start time.

17. BS: "Dispersed Multiphase Flow Generation using 3D Steerable CNN"
20. LT: "Glitch classification using a convolutional autoencoder"
13. VR: "Graph Neural Networks"
12. TM: "Enhancing detection of Gravitational waves with Machine Learning"
02. SB: "Voting Classifier and Voting Regressor"
08. SKa: "A basic time series forecasting of the stock market using LSTM"

Tuesday 12/08 09:30 am Final project presentation: session 4 (4 talks).

11. IM: "Recovering Binary Black Hole Mergers with Convolutional Neural Networks"
03. DC: "Novelty and Outlier Detection"
16. MS: "Manifold learning: t-SNE"
19. CS: "Decomposing signals in components"

Tuesday 12/08 10:40 am Final project presentation: session 5 (3 talks).

05. PE: "Expanding LISA Optical Pathlength Noise Simulations"
10. LL: "The Winning Formula"
18. SS: "Hierarchical clustering"