CS109 notes, by the class schedule:
Table of contents.
- CS109 notes, by the class schedule:
- Week 1: What is Data Science
- Week 2: Intro Data Analysis and Viz
- Week 3 : Databases, SQL and more Pandas
- Week 4: Probablity, regression and some stats
- Week 5: Scikit learn & regression
- Week 6: SVM, trees and forests
- Week 7: Machine Learning best practices
- Week 8: EC2 and Spark
- Week 9: Bayes!
- - Moving on from sklearn...
- Week 10: Text
- Week 11: Clustering!
- Week 12: Deep Learning
- Week 13: Final Project & Wrapup
- Additional Resources
Notes for Harvard's Cs109 data science class
Why CS109? This class was recommended at Quora and a few other places as being a good resource for practical data science, so here goes. These notes are updated as I work through the course syllabus and the labs and homeworks.
The stuff to watch and work through: - Lecture videos & notes
Study Suggestions before starting:
- Use a internet blocker app like SelfControl to stop procrastinating and a pomodoro app to break up study into chunks.
- Use a paper notebook for notes as you watch the videos and do the labs.
- Get a second monitor so you can watch videos/have lab notebooks open and work through at the same time. (I got a 1440p 27inch monitor from ebay and it made things so much easier from just using my laptop).
- Don't look at the lab and hw answers - try to do them first on your own. Discuss with others before looking at solutions.
CS109 notes, by the class schedule:
Week 1: What is Data Science
Lecture 1 introduces data science. The basic stuff covered in every blog post.
Week 2: Intro Data Analysis and Viz
The lecture 2 notebook goes through getting data and putting it into a pandas dataframe.
After doing the three intro notebooks, hw0 runs you through installing anaconda, git, and setting up a github and aws account.
I made a couple of notebooks expand on some of the stuff covered:
- movielens notebook for basic pandas workflow of downloading a zip file, extracting it and putting into pandas dataframes and doing some q&a
- twitter notebook - basic usage of twitter api and doing something with tweets
- ask a q, get data to answer it, explore & check data, then model it and finally communicate and visualize the results.
- keep viz simple and think of the kind of chart needed.
Week 3 : Databases, SQL and more Pandas
- Stanford Statistics Course - check on this one vs the MIT one.
- Think Stats is a good basic book covering stats using Python.
- Think Bayes follows on from Think Stats and covers Bayesian stats in Python.
Week 4: Probablity, regression and some stats
Lab 3 has three notebooks: - Lab3-Probability covers basic probability. Uses a lot of numpy methods, so its a good idea to brush up on numpy. - scipy.stats - very handy, has most stats stuff needed for DS built in. - Lab3-Frequentism, Samples and the Bootstrap - use seaborn for plotting, very handy. a good guide to sns factorplot and facetgrids - PDF tells us the probability of where a continuus random variable will be in set of possible values that random variable can be (the sample space). - PMF tells us the probability that a discrete random variable will be ecactly equal to some value - CDF function tells us the probability that a random discrete or continous variable X will take a value less than or equal to X. Video
Good insights on how to tell a story with data. Infer, model, use an algorithim and draw conclusions (and check!).
- Start with two fundamental questions:
- Whats the goal? think first of that rather than going first to all the many ways you can slice and dice data.
- Who cares? Know your audience and tell them a story. Have a clear sense of direction and logic.
- Read some howto's on scientific writing
- have some memorable examples or small stories
Tell a story:
- know your audience and why/what they care about this data - what do they want?
- Don't make them think - clearly tell them what you want them to know in a way they can follow. highlight key facts and insights.
- unexpectedness - show something the audience didn't expect. I liked the story which highlighted that bigfoot sightings are dropping sharply
- What tools can we give the audience? For example, a web app for them to further explore the data, or a takeaway presentation with key points.
- be careful of your point of view and don't distort the data, but depending on the audience you can frame your story - for example presenting war deaths in red rather than a regular plot color.
- important to put the message up front - what does my graph show? Show it in stages if a lot of data, highlighting what to look at. Design matters.
- think about bias, missing data, etc
- combine independent, unbiased estimators for a parameter into one:
- fisher weighting
- nate silver weighting method
- good segment on weighting variables
- regression towards the mean
- think of regression in terms of population or a projection of the column space of x - i.e what combination of the variables of x gets us closest to the value of y?
- linear regression means we're taking linear combination of predictors, the actual regression equation can be nonlinear
- what function of x gives the best predictor of y?
- Gauss-Markov Theorem
- the residuals are the diff b/w the actual value of y and the predicted value - plot residuals vs fitted values and vs each predictor variable - good way to eyeball quality of linear regression model
- variance R^2 measures goodness of fit, but doesn't mean model is good.
- Best way to check a model is prediction.
Week 5: Scikit learn & regression
- Wikipeda article
- We have data X, which is related to Y in some way.
- Linear regression uses X to predict Y, and also tells us the predictive power of each variable of X
- Linear regression assumes the distribution of each Y is normal (which isn't always so)
- there are many ways to fit a linear regression model, most common is the least squares method
- be careful that the features (variables in X) aren't too similar
- explore your data, plot variables, scatterplots etc. Use seaborn to plot regression.
- use sklearn to split dataset into a train/test
- use cross-validation
- overfitting happens when the model 'learns' the train data so performs better on that than the test dataset
- there are many types of regressions so think about which one to use
- for high d data, closed form vs gradient decent:
- closed form - use math to solve. this becomes computationally intensive very quickly, is ordered n cubed
- gradient descent is O(n), so for large high d data it's gradient descent all the way
- Logistic regression - used where outcome is binary, for example a chance of success/failure. read:Adit's explanation.
- collinerarity - when some variables are highly correlated with each other - this is bad
- Logistic Regression
- Odds Ratio: ratio of two different people's odds of an outcome.
- Crude Odds Ratio Estimate - quick estimate but flawed as doesn't control for anything.
- Confounding Factors - i.e is one group pre-disposed to our outcome for some reason?
- Curse of dimensionality - in high d settings, vast majority of data is near boundaries, not center. But, high d can also be a blessing.
- dealing with high dimensionality: ridge regression, shrinkage estimation
- Stein's Paradox wikipedia, good article
- LASSO and Ridge help with high D data by reducing features
- Lasso does L1 regularization, reduces number of features
- Ridge does L2 regularization, doesn't necessarily reduce features but reduces the impace of features on the model by reducing coefficient value
- Elasticnet does both L1 and L2 regularization
- we take data and assign labels
- 1 nearest neighbour - simple classification method for low d data
- slow, has to check all points to find nearest neighbour
- k nearest neighbours - use k nearest points to find decision boundary
- find ideal k
- what distance function to use?
- my own very simple kNN algo implementation
- cross validation - for 5 fold cross validation, the data is split into 6 folds - 4 for training, one for validation and the sixth for testing, which is only used at the end.
- CIFAR-10 for 60K images - is split into 50K training and 10K test
- pics are 32x32x3
- L1 distance is the absolute diff b/w two vectors
- L2 is the Euclidean distance i.e "ordinary" straight-line distance
- for images, l1 and l2 are pretty bad, so there are a lot more methods
- more features are good for classification, but too many features means the data gets sparse - the curse of dimensionality strikes
- so often we want to reduce dimensionality
- Principal Component Analysis - project a dataset from many variables into fewer less correlated ones, called the principal components.
- Singular Value Decomposition (SVD) - computational method to calculate pricipal components of a dataset. It transforms a large matrix of data into three smallter matrixes:
A (m*n) = U(m*r) x E(r*r) x V(r*n). The values in the middle matrix
r*rare the singular values and we can discard bits of them to reduce the amount of data to a more manageable number.
- good pca and svd explanation
- Watch Statistics for Hackers
HW2 Q1 (notebook)
- Uses svd and pca to analyze gene data
- a pandas excercise in downloading csv files into a data frame, usijng pd.datetime and visualising samples vs time
Week 6: SVM, trees and forests
Now the course finally gets interesting. Before starting this weeks work, think about project ideas and see Hans Rosling videos to see how to present data. Pitch this project idea (to study group or the internet at large).
There are quite a few companies automating the entire datascience chain, so the key is being able to present your findings well.
HW 2 Questions 2,3 & 4 (notebook)
H2 depends wholly on week 5, so good idea to get it done first. Used seaborn for all the viz questions makes some of them trivial.
- q2 looks at polling data and bias
- q3 is more of the same but with seaborn
- q4 nothing much to see here besides using list comprehensions to make a list of all the .csv files (I'm trying to use python to do all the work instead of my stone age past life of copying and pasting links)
url_str = "http://elections.huffingtonpost.com/pollster/api/charts/?topic=2014-senate" election_urls = [election['url'] + '.csv' for election in requests.get(url_str).json()]
Lab 5: Machine Learning
- we often have a small sample of a much dataset, and we want to predict the larger data from our sample.
- this isn't just statistical analysis, as we make models which involve domain knowledge and choices.
- need to think about whether our sample is in some way representative of the population
- Stochastic noise, i.e randomness
- systematic error, i.e where the sampling isn't representative, like polling ppl using landlines
- overfitting: models can 'memrize the the data points in the training set, becoming useless or inaccurate at predicting real world data. With many data sets a more and more complex dataset will keep getting better while getting worse on test/validation data. The best model minimizes test set error, and not training set error.
- great illustration of variance at 24:30 and 35min in the video
from sklearn.cross_validation import train_test_splitfor splitting datasets into a train test split. See sklearn
- sklearn has 3 main features:
- build and fit models
- transform data.
- sklearn expects days in a 2d array or a matrix of size
[n_samples, n_features], so reshape 1d data using
- Validation - keep a chunk of data seperate to check the model after training on the test/train data.
- Cross Validation: randomly split data into K diff train/test splits - so you traion on K-1 partitions and test on 1, so there are a total of K combinations, leading to K risks. This leads to better results then just doing one test/train split.
- regularization helps with overfitting
- sort data into classes, i.e what kind of fruit
- most datasets can be projected onto a lower dimensial space, for e.g using PCA
- read sklearn's PCA docs
- Logistic Regression - use sklearn, main paramater is C which defines how much to regularize the data. Read this explanation
- Use sklearns GridSearchCV to find hyperparameters
- one way to classify: use PCA to reduce the feature space, then use logistic regression to classify
- many datatypes, like images, have tons of features, so important to reduce dimensionality.
- sklearn PCA returns the principal components in order of how much they explain the data:
from sklearn.decomposition import PCA pca = PCA(n_components=60) # PCA with no. of components to return X = pca.fit_transform(data) print(pca.explained_variance_ratio_) # how much of variance is explained
- sklearn uses the same interface for all its classifiers, so makes it easy to put a wrapper around gridsearchCV and pass in diff classifiers to compare.
- discriminative classifier - finds the decision boundary b/w classes
- maximum-margin classifier - for many classifincation problems, multiplie diff lines can seperate classes. Choose that line where the margin b/w classes is the largest, which makes this model senstive to boundaries b/w classes, rather than to point samples deep inside a class.
- SVM is a discrimant classier which finds the widest possible margin b/w classes, including some points touching the boundary which are called the support vectors. (since they support or establish the margins.)
- KNN - training is fast, prediction slow since we need to check all the data points to find the nearest neighbours
- but if we know the decision boundary (the seperating hyperplane) we don't need all the data points
- w: weight vector defines the orientation of the hyperplane, and bias b.
- so a new point x is classified by
w(transpose)*x + b
- this is the mathematical model of a neuron, invented 1957 by Rosenblatt
- step function vs sigmoid activation
- Support Vector Machines (SVM) are widely used, some consider it best of the shelf classifier. They add a new dimension to help seperate classes and also use maximum margin classification. SVM is called svm becuase of the support vectors defining the max margin lines for the classification boundary.
- large data is good for training svm as the points on the boundary are rare and svm cares about establishing the boundary
- since outliers can change the svm boundaries, there is a concept of slack variables - it allows the SVM to missclassify outliers to make a neat decision boundary. sklearn uses the parameter C to define the slack. the lower the number the more the slack.
- kernel tricks for svm - go to aribitarily mary dimensions with little computational cost. need to think about what kernel to use. Read What are kernels in machine learning and SVM and why do we need them?.
- read Andrew Ng's cs229 svm notes
- todo: tale sklearns 'faces' dataset and use svm to predict
- svm tips:
- normalize data to 0,1 or -1,1 interval. (check whether the library already normalizes)
- RBF kernel is a good default
- Read Chich-Wei Hsu practical guide to SVM
- tune paramaters - which kernel, what parameters for it and what C?
- ROC curve: plot true positive rate vs false positive rate
- true +ve is tp/(tp+fn)
- false +ve is fp/(fp+tn)
- one useful summary stat is area under the curve
- Precision Recall Curve sklearn, quora
tp/(tp+fp)- how much are we getting right, or the probability a a random +ve sample is actually +ve (since some +ve samples are false positives).
tp/(tp+fn)- same as in the ROC, how much of the data are we finding. for a random +ve sameple, the probability that it's making a correct prediction. (consider the false negatives.)
- ideally we want both to be one.
- good way to compare classifiers
- How do you classify multple classes with a SVM, e.g 10 types of fruit
- One vs all - pick one class, and train it against all the other classes one by one. So you train n classifers for n classes.
- One vs One - Train n(n-1)/2 classifiers, take majority vote
- use a confusion matrix of predicted label vs true labels to see classification results
- Books: Read Elements of Statiscal Learning and Pattern Recognition and Machine Learning
- Decision Tree - fast training and prediction, easy to understand and interpret. DT basically paritions a feature space into cells using a series of decision rules on one feature at a time
- which feature to query and what thresholds to use?
- node purity: do splits so cells are pure, i.e have only one class in them
- Gini Impurity gives us the expected error of predicted a class from a random sample
- Gini Index
- Node purity gain: compare gini impurity of parent vs child nodes. Lets us see whether a split has improved classification better than a simple missclassification number.
- Optimal trees - diff ways to get there
- Tree pruning - easy to overfit, so first we make the tree, then go backwards and remove 'bad' decisions, or merge cells etc.
- DT disadvatages: sensitive to small changes in data, overfit, only axis aligned splits
- DT vs SVM:
- Netflix prize winners used an ensemble of over 800 models. somewhat disapointing as they didn't come up with a new method
- DT doesn't perform well, but what if we use many of them?
- Bootstrap is one way to do this. It's a resampling method from statistics, useful to get error bags on estimates.
- Bootstrap lets us generate more sample sets from one dataset, each one slightly different.
- Take N data points and draw N times with replacement, then get an estimate from each bootstrapped samples.
- bagging: bootstrap aggregrating, where you learn a classifer from each bootstrap sample and average the . (normally uses just one type of classifier)
- See bootstrap example notebook
- bagged DT's perform better than one a single tree
- not useful for linear models
- Bias-Variance trade off: train models with a high variance, then the average might get close
- Random Forest builds on bagging, builds a tree from each bootstrap sample with node splits selected from random feature subsets. See 1, or rather 2
Week 7: Machine Learning best practices
HW 3 Q1 due: a lot of pandas manipulation on baseball data
Start the project
Towards the end of the course you will work on a month-long data science project. The goal of the project is to go through the complete data science process to answer questions you have about some topic of your own choosing. You will acquire the data, design your visualizations, run statistical analysis, and communicate the results. You will work closely with other classmates in a 3-4 person project team.
Lab 6: Machine Learning 2
- Not a very useful lab, essentially better of from sciki-learn understand how to use:
- learn Bayes classifiers in sklean
- use ROC curves - often accuracy is the the relevant, the true +ve and false -ve rate is more important. Since False negatives can be costly, you often want to change the threshold probability from the default 0.5. So write your own prediction function as the sklearn one uses 0.5, or with bayes classifiers adjust the prior probability.
- sklearn has a roc function, tutorial
- philisophical point: who do ensenble methods work so well? in real life wisdom of the crowds is overrated, but it does a lot better in computerland. Some averaging methods pick up useful stuff in the data, while others cancel out each others errors.
- Decision trees are easy but have poor predictive accuracy, tend to ovefit
- Ensemble learning combines many learners using methods like weighted avg, boosting, etc. See sklearn's ensemble page
- random forests is an extension of bagging of decision trees
- random forests using sqrt(features) of random predicters to make trees give the best results
- Boosting is another ensemble method like bagging, but better for most applications.
- tune by num of trees, splits in each tree, weights
- often using very simple trees, which you adjust the weights as u make more trees to classify better the things u got wrong. At the end u make a weighted avg.
- most popular & succesful boosting algorithim is AdaBoost
Note: read this series on machine learning
- story telling is important for data scientists - explain whats going on, even in your own notebooks. good presentation is very important.
- Diff b/w bagging and random forest: rf has bagging idea + random feature subsets
- didn't really find this video that useful, for example:
- knn and trees can be used for regression too, but why would we?
- SVM's can be used for regression
- take care to normalize your data in a sane manner
Week 8: EC2 and Spark
- decision trees, but they use their own function on top of sklearn so its a bit annoying
- random forests:
- take a random subsample of the data
- select a subset of all possible variables (at each node) and build the tree on the best split
- repeat, then finally take a majority vote
- Ensemble learning - put together a lot of classifiers to build a better one
- AdaBoost Classifier uses weights on the training data.
- Gradient Boost Classifier is similar to AdaBoost but uses tree as its base classifier, can do regression or classification.
- should have started final project by now, if not, start now!
- Nesting: use 5 fold cross validation, take each fold and further apply 5 fold cv to find the right hyperparameters, then use those params in the original fold to train the classifier.
- think about how and why to normalize data, e.g data already in a range from 0-1 might not need to be normalized, think about normazlization as hyperparameters to your model.
- get the mean estimates from your training data and use that to normalize training, validation and testing data. these values need to be stored to normalize future data.
- know your application domain
- many problems are to do with imbalanced data. Fixes to get balanced data for training:
- oversample: take bigger samples of the sparse clasess
- subsampling: take smaller samples of the more frequently occuring classes
- or weight classes so classifier pays more attention to the sparese classes, see sklearn
- Missing data, where some of the features are missing for certain samples:
- delete, but can reduce data too much
- use mean
- use regression to estimate missing values
- Collobrative filtering is a common technique for recommendations or filling in missing values. Can be user based or item based.
- KNN is widely used for collobrative filtering.
- SVD is widely used, famously in the netflix contest. watch this svd video for a refresher if needed.
Moving on from Machine Learning...
Map reduce is a way to deal with very large data sets by distributing work over many machines. Developed by Google, and Apache Hadoop is a open source implementation.
- data is in key value pairs and is distributed to however many servers
- we write a map function which does something to a key value pair.
- the mapreduce implementation reaggregrates the key value pairs incoming from all the servers (sorted by keys)
- we write a reduce which does something else
- simple mapreduce example:
- map takes in sentences, sends them off to diff machines to process,
- those machines send back key:value pairs, say we are just counting words so we get back from machine 1:
"a":4and machine 2 sends back "the":4, "a"10,
- there can be a combine step here, which takes the output from one mapper and combines it, so the two
thewords from machine 1 become one
"the":2. this reduces network traffic and makes the work upstream easier. the combine output has to be of the same type as the mapper, since combine is just an optional optimizatizion step not a DO WORK thing.
- the mapreduce thingamajig aggregrates all the key:value pairs and sends them to the
- reduce function, which counts the words and finally we end up with "the":6 and "a":14
- Udacity has a course on mapreduce with hadoop
- refresher on mapreduce: it takes away the headache of dealing with machines, clusters, etc
- be careful on how you use reducers, they shouldn't effect the algorithims effectiveness
- the reducer should work whether or not the combiner runs
- you could have a in-mapper combining, but here we are gettting into complex territory
- Why did we switch to map reduce from good old write a function and run it computing to the horror that is map reduce?
- large large data which looms like a t-rex, scaring our current teeny tiny computers cowering away in fear. luckily the computers gathered in data centers where mapreduce and other such algos can harness them in parallel.
- many problems are embarrisingly parallel
- Apache Hadoop is an open source implementation of map reduce, built way back in 2006. It provides us a way to store data on clusters of commodity machines and process it using mapreduce, without having to worry about the rocketscience of managing the clusters and splitting data and processes across all the machines.
- Though these days you probably want to use Hadoop with Apache Spark running on Amazon's EMR or Google Cloud.
- python has tools like dask and ipyparallel for parallel computing. for many projects python is enough
- Functional programming: where your functions take an input, process it and return an output, without messing around with other functions or data stored elsewhere. this makes so much sense that it shouldn't be a big deal, it sounds like a toyota corolla of the computing world, but its a big deal because ppl manage to write functions which fuck around with your programming world enough that after a few rounds of their unfunctional running around no one has a good idea of whats the actual state of things.
- Functional programming is clean! Watch Joel Grus Learning Data Science Using Functional Python or this one
- Getting back to mapreduce, when you are programming functionally, then it makes it possible to distribute the data and computing across different machines as the functions aren't trying to mess with each other. When a machine dies, its no big deal as we know what data and function we sent it to process, we just resend it to some other machine.
- aside: this is why pandas always outputs a new dataframe whenever we do something, trying to emulate this ideal of not fucking up existing things. Spark also does similar stuff.
- Apache Spark makes all this easy for us. Runs on top of Hadoop and provides nice interface and methods for all kind of stuff. So nicer, faster, shineir than Hadoop.
- Spark stores data on a Resilient distributed dataset (RDD) - a fault tolerant collection of stuff which can be operated on in parallel.
- the basics of using sparl: write a function, some kind of mapreduce job, spark runs it on the RDD and makes a new processed RDD.
- a lot of the pandas style commands work on spark
- spark is lazy - it makes a graph of the stuff we want to do and then runs it from start to end every time we execute. so caching is important so we don't keep computing same stuff over and over again
- explore spark…. todo
Week 9: Bayes!
- Moving on from sklearn...
book recommendations from the lecture: (the one I liked best in bold)
- simple stuff, in python: Think Bayes, though maybe too basic
- Probabilistic Programming & Bayesian Methods for Hackers - text and python code is in jupyter notebooks which u can clone to your own pc and go through.
Bayes rule: P(A|B) = P(B|A)P(A) / P(B)
bayes rules tells us how to update our beliefs (the prior) as we get new data, which gives us a new posterior (which is just a fancy word for our new, updated belief). A more wordy description:
The theorem itself can be stated simply. Beginning with a provisional hypothesis about the world (there are, of course, no other kinds), we assign to it an initial probability called the prior probability or simply the prior. After actively collecting or happening upon some potentially relevant evidence, we use Bayes’s theorem to recalculate the probability of the hypothesis in light of the new evidence. This revised probability is called the posterior probability or simply the posterior.
- bayes is controversial becuase traditional stats doesn't like giving numbers to unknown thinks, for example bayes essentially makes up the prior. the prior is often our subject belief about something.
- however, even when starting out with different priors, given that they aren't ridiculously dogmatic, with a large sample size the different priors will converge
- discriminative model: focus on predicting y given x, generative model: we simulate the entire model, i.e we can generate x and y
- naive bayes: assumes probablities are conditionally independent of each other, greatly simplifies the calculations. sometimes unrealistic but works well for many scenarios. sklearn has bayes, of course.
- Conjugate prior says that if u start with a family of distributions, like beta, you stay in the same distribution. simplifies computations
- one way to think about bayesian
- you can estimate your prior from the data, though some bayesians would say you're tainting your priors and the data by doing that, but this is an accepted way to get an acceptable prior.
- the lecture works through examples from a blog, which has collected its bayes posts into this book: Introduction to Empirical Bayes: Examples from Baseball Statistics. the explanations in the book look great.
Note: Bayes is simple to do yet hard to understand. So read a number of guides/blogs/posts/youtubes till it makes sense. Some talks to see:
- Eric J Ma Bayesian Statistical Analysis with Python PyCon 2017 - 30 min talk, uses PyMC3
- Christopher Fonnesbeck Probabilistic Programming with PyMC3 PyCon 2017 - 30min, more PyMC3
Week 10: Text
hw5 - did this with my group, so need to redo the hw and commit to my own github.
Week 11: Clustering!
Week 12: Deep Learning
Used tensorflow and keras. update repo.
Week 13: Final Project & Wrapup
My final project was a proof of concept bot which ingested your bank transactions and answered questions about your money. It used wit.ai to parse user queries, machine learning to categorize transactions and straight forward scipy to crunch numbers and make graphs using matplotlib and seaborn. It was a fun learning excercise, to make something which lived briefly live on facebook messenger and could answer questions. using wit.ai made the NLP easy, though with more time writing my own NLP parser or using one of the many offline libraries would be a good learning excercise.
Stuff I found useful to understand the class material better.