CS109 notes, by the class schedule:
Table of contents.
 CS109 notes, by the class schedule:
 Week 1: What is Data Science
 Week 2: Intro Data Analysis and Viz
 Week 3: Databases, SQL and more Pandas
 Week 4: Probability, regression and some stats
 Week 5: Scikit learn & regression
 Week 6: SVM, trees and forests
 Week 7: Machine Learning best practices
 Week 8: EC2 and Spark
 Week 9: Bayes!
  Moving on from sklearn...
 Week 10: Text
 Week 11: Clustering!
 Week 12: Deep Learning
 Week 13: Final Project & Wrapup
 Additional Resources
Notes for Harvard's CS109 data science class
These are my notes for Harvard's 2015 CS109 class, which I went through with Sydney Machine Learning's study group from August to October 2017 at Amazon Web Services' Sydney office.
Why CS109? This class was recommended on Quora and a few other places as a good resource for practical data science, so here goes. These notes are updated as I work through the course syllabus, the labs and the homeworks.
The stuff to watch and work through:  Lecture videos & notes
Note: download the videos using this script, and merge pull request 11 to get the 2015 videos.
Study Suggestions before starting:
 Use an internet blocker app like SelfControl to stop procrastinating, and a pomodoro app to break up study into chunks.
 Use a paper notebook for notes as you watch the videos and do the labs.
 Get a second monitor so you can watch videos/have lab notebooks open and work through them at the same time. (I got a 1440p 27-inch monitor from eBay and it made things so much easier than just using my laptop.)
 Don't look at the lab and homework answers; try to do them first on your own. Discuss with others before looking at solutions.
CS109 notes, by the class schedule:
Week 1: What is Data Science
Lecture 1 introduces data science. The basic stuff covered in every blog post.
Week 2: Intro Data Analysis and Viz
The lecture 2 notebook goes through getting data and putting it into a pandas dataframe.
Lab 1 has three very introductory notebooks: pythonpandas, followed by babypython, and finally git. However, since the course dates back to 2015, some of the Python is a bit dated and uses 2.x code.
After doing the three intro notebooks, hw0 runs you through installing anaconda, git, and setting up a github and aws account.
Hw0 has one interesting section, where you solve the Monty Hall problem one step at a time. I didn't really like their steps, so I made a simpler Monty Hall implementation.
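For reference, a minimal Monty Hall simulation can be sketched in a few lines (this is my own sketch, not the homework's code):

```python
import random

def monty_hall(switch, trials=10_000, rng=None):
    """Simulate the Monty Hall game; return the contestant's win rate."""
    rng = rng or random.Random(0)
    wins = 0
    for _ in range(trials):
        car = rng.randrange(3)        # door hiding the car
        pick = rng.randrange(3)       # contestant's first pick
        # host opens a goat door that isn't the contestant's pick
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            # switch to the one remaining closed door
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += (pick == car)
    return wins / trials
```

Switching should win about 2/3 of the time, staying only about 1/3, which is the whole point of the puzzle.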
Moving on to Lecture 2 and its quiz notebook, which go through some more pandas, scraping web pages and parsing them.
I made a couple of notebooks to expand on some of the stuff covered:
 movielens notebook for basic pandas workflow of downloading a zip file, extracting it and putting into pandas dataframes and doing some q&a
 twitter notebook  basic usage of twitter api and doing something with tweets
 ask a q, get data to answer it, explore & check data, then model it and finally communicate and visualize the results.
 keep viz simple and think of the kind of chart needed.
Week 3: Databases, SQL and more Pandas
Lab 2 introduces web scraping with requests and then parsing html with beautiful soup 4.
Lecture 4 (video, slides) covers some more Pandas and SQL.
Lecture 5 (slides, video) on stats is a bit sparse. Some supplementary material:
 Stanford Statistics Course  check on this one vs the MIT one.
 Think Stats is a good basic book covering stats using Python.
 Think Bayes follows on from Think Stats and covers Bayesian stats in Python.
Week 4: Probability, regression and some stats
Lab 3 has three notebooks:
 Lab3Probability covers basic probability. Uses a lot of numpy methods, so it's a good idea to brush up on numpy.
 scipy.stats  very handy, has most stats stuff needed for DS built in.
 Lab3Frequentism, Samples and the Bootstrap  uses seaborn for plotting, very handy. A good guide to sns factorplot and facetgrids.
 The PDF gives the probability density of a continuous random variable across the set of values it can take (the sample space).
 The PMF gives the probability that a discrete random variable is exactly equal to some value.
 The CDF gives the probability that a random variable, discrete or continuous, takes a value less than or equal to x.
 Video
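A quick way to keep PMF vs CDF straight is to compute both for a fair die (a stdlib-only sketch; scipy.stats does all of this for real distributions):

```python
from fractions import Fraction

# PMF of a fair six-sided die: P(X = k) = 1/6 for each face k
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

def cdf(x):
    """CDF: P(X <= x), the PMF summed over all outcomes <= x."""
    return sum(p for k, p in pmf.items() if k <= x)
```

Here cdf(3) comes out to 1/2 and cdf(6) to 1. A PDF plays the PMF's role for continuous variables, with the sum replaced by an integral.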
Lecture 6: Story Telling and Effective Communication (slides, video)
Good insights on how to tell a story with data. Infer, model, use an algorithm and draw conclusions (and check!).
 Start with two fundamental questions:
 What's the goal? Think of that first, rather than going straight to all the many ways you can slice and dice data.
 Who cares? Know your audience and tell them a story. Have a clear sense of direction and logic.
 Read some how-tos on scientific writing
 have some memorable examples or small stories
Tell a story:
 know your audience and why/what they care about this data  what do they want?
 Don't make them think: clearly tell them what you want them to know, in a way they can follow. Highlight key facts and insights.
 unexpectedness  show something the audience didn't expect. I liked the story which highlighted that bigfoot sightings are dropping sharply
 What tools can we give the audience? For example, a web app for them to further explore the data, or a takeaway presentation with key points.
 be careful of your point of view and don't distort the data, but depending on the audience you can frame your story  for example presenting war deaths in red rather than a regular plot color.
 important to put the message up front: what does my graph show? Show it in stages if there's a lot of data, highlighting what to look at. Design matters.
More resources:
Lecture 7: Bias and Regression (slides, video)
 think about bias, missing data, etc
 combine independent, unbiased estimators for a parameter into one:
 fisher weighting
 nate silver weighting method
 Bonferroni
 good segment on weighting variables
 regression towards the mean
 think of regression in terms of the population, or as a projection onto the column space of x, i.e. what combination of the variables of x gets us closest to the value of y?
 linear regression means we're taking a linear combination of predictors; the actual regression function can be nonlinear
 what function of x gives the best predictor of y?
 Gauss-Markov Theorem
 the residuals are the difference between the actual and predicted values of y. Plot residuals vs fitted values and vs each predictor variable: a good way to eyeball the quality of a linear regression model
 R^2 measures goodness of fit, but a high R^2 doesn't mean the model is good.
 Best way to check a model is prediction.
Week 5: Scikit learn & regression
Lab 4  Regression in Python (video, notebook)
 Wikipedia article
 We have data X, which is related to Y in some way.
 Linear regression uses X to predict Y, and also tells us the predictive power of each variable of X
 Linear regression assumes the distribution of each Y is normal (which isn't always so)
 there are many ways to fit a linear regression model, most common is the least squares method
 be careful that the features (variables in X) aren't too similar
 explore your data, plot variables, scatterplots etc. Use seaborn to plot regression.
 use sklearn to split dataset into a train/test
 use crossvalidation
 overfitting happens when the model 'learns' the train data so performs better on that than the test dataset
 there are many types of regressions so think about which one to use
 for high-d data, closed form vs gradient descent:
 closed form  solve with math. This becomes computationally intensive very quickly; it's O(n^3)
 gradient descent is O(n), so for large high-d data it's gradient descent all the way
 Logistic regression  used where the outcome is binary, for example a chance of success/failure. Read Adit's explanation.
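To make the closed form vs gradient descent point concrete, here's a sketch fitting y = a*x + b both ways on a toy dataset (1-D, so the "closed form" is just the textbook slope/intercept formulas rather than a full matrix inversion):

```python
def fit_closed_form(xs, ys):
    """Closed-form least squares for y = a*x + b (1-D normal equations)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def fit_gradient_descent(xs, ys, lr=0.01, steps=5000):
    """The same fit found iteratively by gradient descent on mean squared error."""
    a = b = 0.0
    n = len(xs)
    for _ in range(steps):
        # gradient of MSE = (1/n) * sum (a*x + b - y)^2
        ga = (2 / n) * sum((a * x + b - y) * x for x, y in zip(xs, ys))
        gb = (2 / n) * sum((a * x + b - y) for x, y in zip(xs, ys))
        a, b = a - lr * ga, b - lr * gb
    return a, b
```

On xs = [0, 1, 2, 3], ys = [1, 3, 5, 7] both recover a=2, b=1. In higher dimensions the closed form means inverting X'X, which is where the O(n^3) cost comes from.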
Lecture 8: More Regression (video, slides)
 collinearity  when some variables are highly correlated with each other  this is bad
 Logistic Regression
 Odds Ratio: ratio of two different people's odds of an outcome.
 Crude Odds Ratio Estimate  quick estimate but flawed as doesn't control for anything.
 Confounding Factors  i.e is one group predisposed to our outcome for some reason?
 Curse of dimensionality  in high d settings, vast majority of data is near boundaries, not center. But, high d can also be a blessing.
 dealing with high dimensionality: ridge regression, shrinkage estimation
 Stein's Paradox wikipedia, good article
 LASSO and Ridge help with high D data by reducing features
 Lasso does L1 regularization, reduces number of features
 Ridge does L2 regularization, doesn't necessarily reduce features but reduces the impact of features on the model by shrinking coefficient values
 Elasticnet does both L1 and L2 regularization
Lecture 9: Classification (video, slides)
 we take data and assign labels
 1 nearest neighbour  simple classification method for low d data
 slow, has to check all points to find nearest neighbour
 k nearest neighbours  use k nearest points to find decision boundary
 find ideal k
 what distance function to use?
 my own very simple kNN algo implementation
 cross validation  for the 5-fold cross validation here, the data is split into 6 folds: 4 for training, one for validation, and the sixth for testing, which is only used at the end.
 CIFAR10 for 60K images  is split into 50K training and 10K test
 pics are 32x32x3
 L1 distance is the absolute diff b/w two vectors
 L2 is the Euclidean distance i.e "ordinary" straightline distance
 for images, l1 and l2 are pretty bad, so there are a lot more methods
 more features are good for classification, but too many features means the data gets sparse  the curse of dimensionality strikes
 so often we want to reduce dimensionality
 Principal Component Analysis  project a dataset from many variables into fewer less correlated ones, called the principal components.
 Singular Value Decomposition (SVD)  computational method to calculate the principal components of a dataset. It factors a large matrix of data into three smaller matrices:
A (m×n) = U (m×r) × Σ (r×r) × V (r×n)
The values in the middle r×r matrix are the singular values, and we can discard the small ones to reduce the amount of data to a more manageable number.
 good pca and svd explanation
 Watch Statistics for Hackers
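Since the lecture's kNN and its distance functions fit in a few lines, here's a stdlib-only toy version (my own sketch, in the same spirit as my kNN notebook above):

```python
from collections import Counter

def l1(u, v):
    """L1 (Manhattan) distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(u, v))

def l2(u, v):
    """L2 (Euclidean) distance: ordinary straight-line distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

def knn_predict(train, point, k=3, dist=l2):
    """Label `point` by majority vote among its k nearest training points."""
    neighbours = sorted(train, key=lambda item: dist(item[0], point))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]
```

For example, with two well-separated clusters labelled 'a' and 'b', a point near the 'a' cluster gets labelled 'a'. Note the prediction cost: every query scans all training points, which is why kNN prediction is slow.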
HW2 Q1 (notebook)
 Uses svd and pca to analyze gene data
 a pandas exercise in downloading csv files into a data frame, using pd.datetime and visualising samples vs time
Week 6: SVM, trees and forests
Now the course finally gets interesting. Before starting this week's work, think about project ideas and watch Hans Rosling videos to see how to present data. Pitch your project idea (to a study group or the internet at large).
There are quite a few companies automating the entire data science chain, so the key is being able to present your findings well.
HW 2 Questions 2,3 & 4 (notebook)
HW2 depends wholly on week 5, so it's a good idea to get that done first. I used seaborn for all the viz questions, which makes some of them trivial.
 q2 looks at polling data and bias
 q3 is more of the same but with seaborn
 q4 nothing much to see here besides using list comprehensions to make a list of all the .csv files (I'm trying to use python to do all the work instead of my stone age past life of copying and pasting links)
import requests

url_str = "http://elections.huffingtonpost.com/pollster/api/charts/?topic=2014senate"
election_urls = [election['url'] + '.csv' for election in requests.get(url_str).json()]
Lab 5: Machine Learning
Learning Models (notebook, video)
 we often have a small sample of a much larger dataset, and we want to predict the larger population from our sample.
 this isn't just statistical analysis, as we make models which involve domain knowledge and choices.
 need to think about whether our sample is in some way representative of the population
 Stochastic noise, i.e randomness
 systematic error, i.e. where the sampling isn't representative, like polling people via landlines
 overfitting: models can 'memorize' the data points in the training set, becoming useless or inaccurate at predicting real-world data. A more and more complex model will keep doing better on the training data while getting worse on test/validation data. The best model minimizes test set error, not training set error.
 great illustration of variance at 24:30 and 35min in the video
 use
from sklearn.model_selection import train_test_split
for splitting datasets into a train/test split (the course-era sklearn.cross_validation module has since been removed from sklearn). See sklearn  sklearn has 3 main features:
 build and fit models
 predict
 transform data.
 sklearn expects data in a 2d array or a matrix of size
[n_samples, n_features]
, so reshape 1d data using .reshape(-1, 1)
 Validation  keep a chunk of data separate to check the model after training on the test/train data.
 Cross Validation: randomly split data into K different train/test splits, so you train on K-1 partitions and test on 1; there are K combinations in total, leading to K risk estimates. This leads to better results than just doing one test/train split.
 regularization helps with overfitting
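The K-fold idea above can be sketched as a tiny index-splitting helper (just a sketch; sklearn's KFold also handles shuffling and uneven folds):

```python
def k_fold_splits(n, k):
    """Yield (train_indices, val_indices) for each of the k folds."""
    folds = [list(range(i, n, k)) for i in range(k)]  # round-robin fold assignment
    for i in range(k):
        val = folds[i]
        # train on everything outside the held-out fold
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, val
```

With n=10 and k=5 you get 5 splits of 8 training and 2 validation indices, and each index lands in exactly one validation fold.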
Classification (notebook, video)
 sort data into classes, i.e what kind of fruit
 most datasets can be projected onto a lower dimensional space, e.g. using PCA
 read sklearn's PCA docs
 kNN:
 Logistic Regression  use sklearn; the main parameter is C, which defines how much to regularize the data. Read this explanation
 Use sklearns GridSearchCV to find hyperparameters
 one way to classify: use PCA to reduce the feature space, then use logistic regression to classify
 many datatypes, like images, have tons of features, so important to reduce dimensionality.
 sklearn PCA returns the principal components in order of how much they explain the data:
from sklearn.decomposition import PCA
pca = PCA(n_components=60) # PCA with no. of components to return
X = pca.fit_transform(data)
print(pca.explained_variance_ratio_) # how much of variance is explained
 sklearn uses the same interface for all its classifiers, so makes it easy to put a wrapper around gridsearchCV and pass in diff classifiers to compare.
 discriminative classifier  finds the decision boundary b/w classes
 maximum-margin classifier  for many classification problems, multiple different lines can separate the classes. Choose the line where the margin b/w classes is largest, which makes this model sensitive to the boundaries b/w classes, rather than to point samples deep inside a class.
 SVM is a discriminative classifier which finds the widest possible margin b/w classes, with some points touching the boundary; these are called the support vectors (since they support or establish the margins.)
Lecture 10: SVM, Evaluation (video, slides)
 KNN  training is fast, prediction slow since we need to check all the data points to find the nearest neighbours
 but if we know the decision boundary (the separating hyperplane) we don't need all the data points
 w: weight vector defines the orientation of the hyperplane, and bias b.
 so a new point x is classified by
w(transpose)*x + b
 this is the mathematical model of a neuron, invented 1957 by Rosenblatt
 step function vs sigmoid activation
 Support Vector Machines (SVM) are widely used; some consider it the best off-the-shelf classifier. They add a new dimension to help separate classes and also use maximum margin classification. SVM is called SVM because of the support vectors defining the max margin lines for the classification boundary.
 large data is good for training svm as the points on the boundary are rare and svm cares about establishing the boundary
 since outliers can change the svm boundaries, there is a concept of slack variables: they allow the SVM to misclassify outliers to make a neat decision boundary. sklearn uses the parameter C to define the slack; the lower the number, the more the slack.
 kernel tricks for svm  go to arbitrarily many dimensions with little computational cost. Need to think about what kernel to use. Read What are kernels in machine learning and SVM and why do we need them?.
 read Andrew Ng's cs229 svm notes
 todo: take sklearn's 'faces' dataset and use svm to predict
 svm tips:
 normalize data to the [0,1] or [-1,1] interval. (check whether the library already normalizes)
 RBF kernel is a good default
 Read Chih-Wei Hsu's practical guide to SVM
 tune parameters  which kernel, what parameters for it, and what C?
 ROC curve: plot true positive rate vs false positive rate
 the true +ve rate is tp/(tp+fn)
 the false +ve rate is fp/(fp+tn)
 one useful summary stat is area under the curve
 Precision Recall Curve sklearn, quora
 Precision:
tp/(tp+fp)
 how much of what we flag is right, i.e. the probability that a random +ve prediction is actually +ve (since some +ve predictions are false positives).
 Recall:
tp/(tp+fn)
 same as the ROC's true positive rate: how much of the +ve data we are finding. For a random +ve sample, the probability that we make a correct prediction. (this considers the false negatives.)
 ideally we want both to be one.
 good way to compare classifiers
 How do you classify multiple classes with an SVM, e.g. 10 types of fruit?
 One vs all  pick one class and train it against all the other classes combined, so you train n classifiers for n classes.
 One vs One  train n(n-1)/2 classifiers, one for each pair of classes, and take a majority vote
 use a confusion matrix of predicted label vs true labels to see classification results
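All the rates above fall straight out of the confusion-matrix counts; a stdlib sketch with made-up labels:

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for one positive class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

# toy labels: 4 real positives, 4 real negatives
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
tp, fp, fn, tn = confusion_counts(y_true, y_pred)

precision = tp / (tp + fp)  # how much of what we flagged is right
recall = tp / (tp + fn)     # the ROC's true positive rate
fpr = fp / (fp + tn)        # the ROC's false positive rate
```

Here precision and recall both come out to 0.75 and the false positive rate to 0.25. sklearn.metrics computes all of these (and the curves) for you.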
Lecture 11: Decision Trees and Random Forests (video, slides)
 Books: read Elements of Statistical Learning and Pattern Recognition and Machine Learning
 Decision Tree  fast training and prediction, easy to understand and interpret. A DT basically partitions the feature space into cells using a series of decision rules on one feature at a time
 which feature to query and what thresholds to use?
 node purity: do splits so cells are pure, i.e have only one class in them
 Gini Impurity gives us the expected error of predicting a class from a random sample
 Gini Index
 Node purity gain: compare the gini impurity of parent vs child nodes. Lets us see whether a split has improved classification more than a simple misclassification count would show.
 Optimal trees  diff ways to get there
 Tree pruning  it's easy to overfit, so first we grow the tree, then go backwards and remove 'bad' decisions, merge cells, etc.
 DT disadvantages: sensitive to small changes in data, prone to overfitting, only axis-aligned splits
 DT vs SVM:
 Netflix prize winners used an ensemble of over 800 models. Somewhat disappointing, as they didn't come up with a new method
 DT doesn't perform well, but what if we use many of them?
 Bootstrap is one way to do this. It's a resampling method from statistics, useful to get error bars on estimates.
 Bootstrap lets us generate more sample sets from one dataset, each one slightly different.
 Take N data points and draw N times with replacement, then get an estimate from each bootstrapped sample.
 bagging: bootstrap aggregating, where you learn a classifier from each bootstrap sample and average the results. (normally uses just one type of classifier)
 See bootstrap example notebook
 bagged DTs perform better than a single tree
 not useful for linear models
 Bias-Variance trade-off: train many models with high variance; their average can then get close
 Random Forest builds on bagging, builds a tree from each bootstrap sample with node splits selected from random feature subsets. See 1, or rather 2
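Gini impurity and the node-purity gain from Lecture 11 are compact enough to sketch directly (my own sketch of the standard formulas):

```python
from collections import Counter

def gini(labels):
    """Gini impurity: expected error when predicting a randomly drawn class."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

def purity_gain(parent, children):
    """Drop in impurity from splitting `parent` into `children`, size-weighted."""
    n = len(parent)
    return gini(parent) - sum(len(c) / n * gini(c) for c in children)
```

A 50/50 parent has impurity 0.5; splitting it into two pure children gives the maximum possible gain of 0.5, which is the kind of split a decision tree is hunting for.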
Week 7: Machine Learning best practices
HW 3 Q1 due: a lot of pandas manipulation on baseball data
Start the project
Towards the end of the course you will work on a month-long data science project. The goal of the project is to go through the complete data science process to answer questions you have about some topic of your own choosing. You will acquire the data, design your visualizations, run statistical analysis, and communicate the results. You will work closely with other classmates in a 3-4 person project team.
Lab 6: Machine Learning 2
Classification, Probabilities, ROC curves, and Cost (video,notebook)
 Not a very useful lab; you're essentially better off working from the scikit-learn docs to understand how to use it.
Comparing Models (video, notebook)
 learn Bayes classifiers in sklearn
 use ROC curves  often accuracy isn't the relevant metric; the true +ve and false -ve rates matter more. Since false negatives can be costly, you often want to change the threshold probability from the default 0.5. So write your own prediction function, as the sklearn one uses 0.5, or with Bayes classifiers adjust the prior probability.
 sklearn has a roc function, tutorial
Lecture 12: Ensemble Methods (video, slides)
 philosophical point: why do ensemble methods work so well? In real life the wisdom of the crowd is overrated, but it does a lot better in computerland. Some averaging methods pick up useful stuff in the data, while others cancel out each other's errors.
 Decision trees are easy but have poor predictive accuracy and tend to overfit
 Ensemble learning combines many learners using methods like weighted avg, boosting, etc. See sklearn's ensemble page
 random forests is an extension of bagging of decision trees
 random forests using sqrt(features) randomly chosen predictors to make each tree give the best results
 Boosting is another ensemble method like bagging, but better for most applications.
 tune by num of trees, splits in each tree, weights
 often uses very simple trees; you adjust the weights as you make more trees, to better classify the things you got wrong. At the end you take a weighted average.
 the most popular & successful boosting algorithm is AdaBoost
Note: read this series on machine learning
Lecture 13: Best Practices (video, slides)
 storytelling is important for data scientists  explain what's going on, even in your own notebooks. Good presentation is very important.
 Diff b/w bagging and random forest: rf has bagging idea + random feature subsets
 didn't really find this video that useful, for example:
 knn and trees can be used for regression too, but why would we?
 SVM's can be used for regression
 take care to normalize your data in a sane manner
Week 8: EC2 and Spark
HW3 q2 uses the iris data set & q3 uses sklearn's digits dataset and gridsearchCV to find best parameters for a KNN classifier for two simple datasets.
Lab 7: Decision Trees, Random Forests, Ensemble Methods (video, notebook)
 decision trees, but they use their own function on top of sklearn so it's a bit annoying
 random forests:
 take a random subsample of the data
 select a subset of all possible variables (at each node) and build the tree on the best split
 repeat, then finally take a majority vote
 Ensemble learning  put together a lot of classifiers to build a better one
 AdaBoost Classifier uses weights on the training data.
 Gradient Boost Classifier is similar to AdaBoost but uses tree as its base classifier, can do regression or classification.
Lecture 14: Best Practices, Recommendations and MapReduce (video, slides)
 should have started final project by now, if not, start now!
 Nesting: use 5 fold cross validation, take each fold and further apply 5 fold cv to find the right hyperparameters, then use those params in the original fold to train the classifier.
 think about how and why to normalize data, e.g. data already in the 0-1 range might not need to be normalized; treat normalization choices as hyperparameters of your model.
 get the mean estimates from your training data and use them to normalize the training, validation and testing data. These values need to be stored so future data can be normalized the same way.
 know your application domain
 many problems are to do with imbalanced data. Fixes to get balanced data for training:
 oversample: take bigger samples of the sparse classes
 subsample: take smaller samples of the more frequently occurring classes
 or weight the classes so the classifier pays more attention to the sparse classes, see sklearn
 Missing data, where some of the features are missing for certain samples:
 delete, but can reduce data too much
 use mean
 use regression to estimate missing values
 Collaborative filtering is a common technique for recommendations or filling in missing values. Can be user-based or item-based.
 KNN is widely used for collaborative filtering.
 SVD is widely used, famously in the netflix contest. watch this svd video for a refresher if needed.
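The store-the-training-statistics point above can be sketched like this (just a sketch; sklearn's StandardScaler does the same job):

```python
def standardize_params(train):
    """Mean and standard deviation computed from the *training* data only."""
    n = len(train)
    mean = sum(train) / n
    std = (sum((x - mean) ** 2 for x in train) / n) ** 0.5
    return mean, std

def standardize(values, mean, std):
    """Apply the stored training-set parameters to any split, or to future data."""
    return [(x - mean) / std for x in values]
```

Fit the parameters on the training set, then reuse them unchanged on validation, test and production data, so nothing about the test set leaks into the model.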
Moving on from Machine Learning...
MapReduce is a way to deal with very large datasets by distributing work over many machines. Developed by Google; Apache Hadoop is an open source implementation.
 data is in key value pairs and is distributed to however many servers
 we write a map function which does something to a key-value pair.
 the mapreduce implementation re-aggregates the key-value pairs coming in from all the servers (sorted by keys)
 we write a reduce function which does something else
 simple mapreduce example:
 map takes in sentences and sends them off to different machines to process
 those machines send back key:value pairs. Say we are just counting words, so machine 1 sends back "the":1, "the":1, "a":4 and machine 2 sends back "the":4, "a":10
 there can be a combine step here, which takes the output from one mapper and combines it, so the two "the" pairs from machine 1 become one "the":2. This reduces network traffic and makes the work upstream easier. The combine output has to be of the same type as the mapper's, since combine is just an optional optimization step, not a DO WORK thing.
 the mapreduce machinery aggregates all the key:value pairs (sorted by key) and sends them to the reduce function, which sums the counts, and finally we end up with "the":6 and "a":14
 Udacity has a course on mapreduce with hadoop
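The word-count walkthrough above can be sketched end to end in plain Python (the mapper/combine/reduce names are mine, standing in for the framework's hooks):

```python
from collections import defaultdict
from itertools import chain

def mapper(sentence):
    """Map step: emit a (word, 1) pair for every word."""
    return [(word, 1) for word in sentence.split()]

def combine(pairs):
    """Optional combine step: pre-sum the counts from one machine's mapper."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

def reduce_counts(all_pairs):
    """Reduce step: sum the counts shuffled in from every machine."""
    totals = defaultdict(int)
    for word, n in chain.from_iterable(all_pairs):
        totals[word] += n
    return dict(totals)
```

Feeding the example's machine outputs through combine and then reduce gives "the":6 and "a":14, matching the walkthrough; note combine's output is (word, count) pairs, the same type the mapper emits.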
Lecture 15: MapReduce Combiners and Spark (video, slides)
 refresher on mapreduce: it takes away the headache of dealing with machines, clusters, etc
 be careful how you use combiners; they shouldn't affect the algorithm's correctness
 the reducer should work whether or not the combiner runs
 you could do in-mapper combining, but here we are getting into complex territory
 Why did we switch from good old write-a-function-and-run-it computing to the horror that is map reduce?
 large, large data, which looms like a T-rex over our teeny tiny computers cowering away in fear. Luckily the computers are gathered in data centers, where mapreduce and other such algos can harness them in parallel.
 many problems are embarrassingly parallel
 Apache Hadoop is an open source implementation of map reduce, built way back in 2006. It provides a way to store data on clusters of commodity machines and process it using mapreduce, without having to worry about the rocket science of managing the clusters and splitting data and processes across all the machines.
 Though these days you probably want to use Hadoop with Apache Spark running on Amazon's EMR or Google Cloud.
 python has tools like dask and ipyparallel for parallel computing. for many projects python is enough
 Functional programming: your functions take an input, process it and return an output, without messing around with other functions or data stored elsewhere. This makes so much sense that it shouldn't be a big deal, it sounds like the Toyota Corolla of the computing world, but it is a big deal because people manage to write functions which mess with the program's state enough that, after a few rounds of their unfunctional running around, no one has a good idea of the actual state of things.
 Functional programming is clean! Watch Joel Grus Learning Data Science Using Functional Python or this one
 Getting back to mapreduce: when you program functionally, it becomes possible to distribute the data and computing across different machines, since the functions aren't trying to mess with each other. When a machine dies it's no big deal; we know what data and function we sent it, so we just resend the work to some other machine.
 aside: this is why pandas always outputs a new dataframe whenever we do something, emulating this ideal of not mutating existing things. Spark does similar stuff.
 Apache Spark makes all this easy for us. It runs on top of Hadoop and provides a nice interface and methods for all kinds of stuff. So nicer, faster, shinier than Hadoop.
 Spark stores data in a Resilient Distributed Dataset (RDD)  a fault-tolerant collection of items which can be operated on in parallel.
 the basics of using spark: write a function, some kind of mapreduce job; spark runs it on the RDD and makes a new, processed RDD.
 a lot of the pandas style commands work on spark
 spark is lazy  it makes a graph of the stuff we want to do and then runs it from start to end each time we execute, so caching is important to avoid recomputing the same stuff over and over again
 explore spark…. todo
Week 9: Bayes!
Lab 8: Vagrant and VirtualBox, AWS, and Spark (video, notebook)
 Moving on from sklearn...
Lecture 16: Bayes Theorem and Bayesian Methods (video, slides)

book recommendations from the lecture: (the one I liked best in bold)

easy reading: How Bayes’ Rule Cracked the Enigma Code also see her talk at google.
 simple stuff, in python: Think Bayes, though maybe too basic
 Probabilistic Programming & Bayesian Methods for Hackers  the text and python code are in jupyter notebooks which you can clone to your own pc and go through.

Proper textbooks: Bayesian Data Analysis and Statistical Rethinking  this comes with python examples.

Bayes rule: P(A|B) = P(B|A) P(A) / P(B)

bayes rule tells us how to update our beliefs (the prior) as we get new data, which gives us a new posterior (which is just a fancy word for our new, updated belief). A more wordy description:
The theorem itself can be stated simply. Beginning with a provisional hypothesis about the world (there are, of course, no other kinds), we assign to it an initial probability called the prior probability or simply the prior. After actively collecting or happening upon some potentially relevant evidence, we use Bayes’s theorem to recalculate the probability of the hypothesis in light of the new evidence. This revised probability is called the posterior probability or simply the posterior.
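Plugging numbers into the rule makes the update concrete; a sketch with a made-up diagnostic test (the 1% prior, 95% sensitivity and 5% false positive rate are all hypothetical):

```python
def bayes_posterior(prior, likelihood, likelihood_given_not):
    """P(H|E) = P(E|H) P(H) / P(E), with P(E) expanded by total probability."""
    evidence = likelihood * prior + likelihood_given_not * (1 - prior)
    return likelihood * prior / evidence

# hypothetical disease test: 1% prior, 95% sensitivity, 5% false positive rate
posterior = bayes_posterior(prior=0.01, likelihood=0.95, likelihood_given_not=0.05)
```

Even this quite accurate test only lifts the posterior to about 16%, because the prior is so low; that posterior then becomes the prior for the next piece of evidence.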
 bayes is controversial because traditional stats doesn't like giving numbers to unknown things; for example, bayes essentially makes up the prior. The prior is often our subjective belief about something.
 however, even when starting out with different priors, as long as they aren't ridiculously dogmatic, with a large sample size the posteriors from different priors will converge
 discriminative model: focus on predicting y given x; generative model: we simulate the entire model, i.e. we can generate both x and y
 naive bayes: assumes the probabilities are conditionally independent of each other, which greatly simplifies the calculations. Sometimes unrealistic, but works well for many scenarios. sklearn has bayes classifiers, of course.
 Conjugate prior: if you start with a prior from a family of distributions, like beta, the posterior stays in the same family. Simplifies computations
 one way to think about bayesian
More Reading:
Lecture 17: Bayesian Methods Continued (video, slides)
 you can estimate your prior from the data, though some bayesians would say you're tainting your priors and the data by doing that, but this is an accepted way to get an acceptable prior.
 the lecture works through examples from a blog, which has collected its bayes posts into this book: Introduction to Empirical Bayes: Examples from Baseball Statistics. the explanations in the book look great.
Note: Bayes is simple to do yet hard to understand. So read a number of guides/blogs/posts/youtubes till it makes sense. Some talks to see:
 Eric J Ma Bayesian Statistical Analysis with Python PyCon 2017  30 min talk, uses PyMC3
 Christopher Fonnesbeck Probabilistic Programming with PyMC3 PyCon 2017  30min, more PyMC3
Week 10: Text
hw5  did this with my group, so I need to redo the hw and commit it to my own github.
Week 11: Clustering!
Week 12: Deep Learning
Used tensorflow and keras. update repo.
Week 13: Final Project & Wrapup
My final project was a proof-of-concept bot which ingested your bank transactions and answered questions about your money. It used wit.ai to parse user queries, machine learning to categorize transactions, and straightforward scipy to crunch numbers and make graphs with matplotlib and seaborn. It was a fun learning exercise to make something which briefly lived on Facebook Messenger and could answer questions. Using wit.ai made the NLP easy, though with more time, writing my own NLP parser or using one of the many offline libraries would be a good learning exercise.
Additional Resources
Stuff I found useful to understand the class material better.
 Computational and Inferential Thinking  the textbook for UC Berkeley's Foundations of Data Science class (http://data8.org/).
 Python Data Science Handbook