
Notes for Harvard's CS109 data science class

These are my notes for Harvard's 2015 CS109 class, which I went through with Sydney Machine Learning's study group from August to October 2017 at Amazon Web Services' Sydney office.

Why CS109? This class was recommended on Quora and a few other places as a good resource for practical data science, so here goes. These notes are updated as I work through the course syllabus, the labs and the homeworks.

The stuff to watch and work through:

- Lecture videos & notes

Note: download the videos using this script, and merge pull request 11 to get the 2015 videos.

Study Suggestions before starting:

CS109 notes, by the class schedule:

Week 1: What is Data Science

Lecture 1 introduces data science: the basic stuff covered in every intro blog post.

Week 2: Intro Data Analysis and Viz

The Lecture 2 notebook goes through getting data and putting it into a pandas DataFrame.
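
As a quick pandas refresher, a minimal sketch of pulling a CSV into a DataFrame and poking at it (the filename and column name are placeholders, not from the course):

import pandas as pd

df = pd.read_csv("some_data.csv")           # placeholder filename
print(df.shape)                             # (rows, columns)
print(df.dtypes)                            # column types pandas inferred
print(df.head())                            # first five rows
print(df["some_column"].value_counts())     # counts for a categorical column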

Lab 1 has three very introductory notebooks: pythonpandas, followed by babypython, and finally git. However, since the course dates back to 2015, some of the Python is a bit dated and uses 2.x code.

After doing the three intro notebooks, hw0 runs you through installing Anaconda and git, and setting up GitHub and AWS accounts.

HW0 has one interesting section, where you solve the Monty Hall problem one step at a time. I didn't really like their steps, so I made a simpler Monty Hall implementation.
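
For reference, a minimal Monty Hall simulation along those lines (my own sketch, not the homework's steps):

import random

def monty_hall(switch, trials=100_000):
    """Simulate the Monty Hall game and return the win rate for a strategy."""
    wins = 0
    for _ in range(trials):
        car = random.randint(0, 2)      # door hiding the car
        pick = random.randint(0, 2)     # contestant's first pick
        # host opens a goat door that isn't the contestant's pick
        opened = next(d for d in range(3) if d != pick and d != car)
        if switch:
            pick = next(d for d in range(3) if d != pick and d != opened)
        wins += pick == car
    return wins / trials

print("stay:  ", monty_hall(switch=False))   # ~0.33
print("switch:", monty_hall(switch=True))    # ~0.67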

Moving on to Lecture 2 and its quiz notebook, which go through some more pandas, plus scraping web pages and parsing them.

I made a couple of notebooks to expand on some of the stuff covered:

Lecture 3 (slides, video):

Week 3: Databases, SQL and more Pandas

Lab 2 introduces web scraping with requests and then parsing html with beautiful soup 4.
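
The basic requests + BeautifulSoup pattern looks something like this (the URL is just a stand-in):

import requests
from bs4 import BeautifulSoup

resp = requests.get("https://example.com")   # fetch the page
resp.raise_for_status()                      # fail loudly on a bad response

soup = BeautifulSoup(resp.text, "html.parser")
print(soup.title.text)                                 # page title
links = [a.get("href") for a in soup.find_all("a")]    # all link targets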

Lecture 4 (video, slides) covers some more Pandas and SQL.

Lecture 5 (slides, video) on stats is a bit sparse. Some supplementary material:

Week 4: Probability, regression and some stats

Lab 3 has three notebooks:

- Lab3-Probability covers basic probability. It uses a lot of numpy methods, so it's a good idea to brush up on numpy.
- scipy.stats - very handy, has most stats stuff needed for DS built in.
- Lab3-Frequentism, Samples and the Bootstrap - uses seaborn for plotting, very handy. A good guide to sns factorplot and facetgrids.
- The PDF tells us the probability of where a continuous random variable will be in the set of possible values that random variable can take (the sample space).
- The PMF tells us the probability that a discrete random variable will be exactly equal to some value.
- The CDF tells us the probability that a discrete or continuous random variable X will take a value less than or equal to x.

Video
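
To keep the PDF/PMF/CDF distinction straight, here's a small scipy.stats sketch (the distributions and parameters are arbitrary examples), plus a toy bootstrap:

from scipy import stats
import numpy as np

norm = stats.norm(loc=0, scale=1)        # a continuous distribution
print(norm.pdf(0))                       # density at x=0, not a probability
print(norm.cdf(1.96))                    # P(X <= 1.96), roughly 0.975

binom = stats.binom(n=10, p=0.5)         # a discrete distribution
print(binom.pmf(5))                      # P(X == 5), roughly 0.246
print(binom.cdf(5))                      # P(X <= 5)

# toy bootstrap: resample with replacement and look at the spread of the mean
data = norm.rvs(size=100, random_state=0)
boot_means = [np.random.choice(data, size=len(data)).mean() for _ in range(1000)]
print(np.percentile(boot_means, [2.5, 97.5]))   # rough 95% interval for the mean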

Lecture 6: Story Telling and Effective Communication (slides, video)

Good insights on how to tell a story with data. Infer, model, use an algorithm and draw conclusions (and check!).

Tell a story:

More resources:

Lecture 7: Bias and Regression (slides, video)

Week 5: Scikit-learn & regression

Lab 4 - Regression in Python (video, notebook)
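
For reference, a minimal scikit-learn linear regression sketch on made-up data (not the lab's code, just the shape of it):

import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic data: y = 3x + 4 plus noise, made up purely for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X.ravel() + 4 + rng.normal(scale=2, size=100)

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # should land close to 3 and 4
print(model.score(X, y))               # R^2 on the training data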

Lecture 8: More Regression (video, slides)

Lecture 9: Classification (video, slides)

HW2 Q1 (notebook)

Week 6: SVM, trees and forests

Now the course finally gets interesting. Before starting this week's work, think about project ideas and watch some Hans Rosling videos to see how to present data. Pitch the project idea (to a study group or the internet at large).

There are quite a few companies automating the entire data science chain, so the key is being able to present your findings well.

HW 2 Questions 2,3 & 4 (notebook)

HW2 depends wholly on week 5, so it's a good idea to get that done first. I used seaborn for all the viz questions, which makes some of them trivial.

import requests

# get the list of 2014 senate polling charts and build a CSV url for each
url_str = "http://elections.huffingtonpost.com/pollster/api/charts/?topic=2014-senate"
election_urls = [election['url'] + '.csv' for election in requests.get(url_str).json()]

Lab 5: Machine Learning

Learning Models (notebook, video)

Classification (notebook, video)

from sklearn.decomposition import PCA
pca = PCA(n_components=60) # PCA with no. of components to return
X = pca.fit_transform(data)
print(pca.explained_variance_ratio_) # how much of variance is explained

Lecture 10: SVM, Evaluation (video, slides)

Lecture 11: Decision Trees and Random Forests (video, slides)

Week 7: Machine Learning best practices

HW 3 Q1 due: a lot of pandas manipulation on baseball data

Start the project

Towards the end of the course you will work on a month-long data science project. The goal of the project is to go through the complete data science process to answer questions you have about some topic of your own choosing. You will acquire the data, design your visualizations, run statistical analysis, and communicate the results. You will work closely with other classmates in a 3-4 person project team.

Lab 6: Machine Learning 2

Classification, Probabilities, ROC curves, and Cost (video,notebook)

Comparing Models (video, notebook)

Lecture 12: Ensemble Methods (video, slides)

Note: read this series on machine learning

Lecture 13: Best Practices (video, slides)

Week 8: EC2 and Spark

HW3 q2 uses the iris dataset, and q3 uses sklearn's digits dataset with GridSearchCV to find the best parameters for a KNN classifier on these two simple datasets.
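
A minimal sketch of that GridSearchCV + KNN pattern on the digits dataset (not the homework solution, just the shape of it):

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# search over the number of neighbours with 5-fold cross validation
params = {"n_neighbors": range(1, 11)}
grid = GridSearchCV(KNeighborsClassifier(), params, cv=5)
grid.fit(X_train, y_train)

print(grid.best_params_)            # best k found by cross validation
print(grid.score(X_test, y_test))   # accuracy on held-out data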

Lab 7: Decision Trees, Random Forests, Ensemble Methods (video, notebook)

Lecture 14: Best Practices, Recommendations and MapReduce (video, slides)

Moving on from Machine Learning...

MapReduce is a way to deal with very large datasets by distributing work over many machines. It was developed at Google, and Apache Hadoop is an open source implementation.
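
The classic illustration is a word count; a toy sketch of the map, shuffle and reduce steps in plain Python (no cluster, just the idea):

from collections import defaultdict
from itertools import chain

docs = ["the cat sat", "the cat ran", "a dog sat"]   # stand-in "documents"

# map: emit (word, 1) pairs for every word in every document
mapped = chain.from_iterable(((w, 1) for w in doc.split()) for doc in docs)

# shuffle: group pairs by key (the framework does this across machines)
groups = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# reduce: sum the counts for each word
counts = {word: sum(vals) for word, vals in groups.items()}
print(counts)   # {'the': 2, 'cat': 2, 'sat': 2, 'ran': 1, 'a': 1, 'dog': 1}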

Lecture 15: MapReduce Combiners and Spark (video, slides)

Week 9: Bayes!

Lab 8: Vagrant and VirtualBox, AWS, and Spark (video, notebook)

Moving on from sklearn...

Lecture 16: Bayes Theorem and Bayesian Methods (video, slides)

The theorem itself can be stated simply. Beginning with a provisional hypothesis about the world (there are, of course, no other kinds), we assign to it an initial probability called the prior probability or simply the prior. After actively collecting or happening upon some potentially relevant evidence, we use Bayes’s theorem to recalculate the probability of the hypothesis in light of the new evidence. This revised probability is called the posterior probability or simply the posterior.
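
A toy numeric example of that update, with made-up numbers: say a condition affects 1% of people, and a test catches it 90% of the time but also false-alarms 5% of the time.

# prior and test characteristics (made-up numbers for illustration)
prior = 0.01           # P(condition)
p_pos_given_c = 0.90   # sensitivity: P(positive | condition)
p_pos_given_nc = 0.05  # false positive rate: P(positive | no condition)

# total probability of a positive test
p_pos = p_pos_given_c * prior + p_pos_given_nc * (1 - prior)

# Bayes' theorem: posterior = likelihood * prior / evidence
posterior = p_pos_given_c * prior / p_pos
print(posterior)   # ~0.15, so a positive test lifts the 1% prior to about 15%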

More Reading:

Lecture 17: Bayesian Methods Continued (video, slides)

Note: Bayes is simple to do yet hard to understand, so read a number of guides/blogs/posts/videos till it makes sense. Some talks to see:

Week 10: Text

hw5 - did this with my group, so I need to redo the hw and commit it to my own GitHub.

Week 11: Clustering!

Week 12: Deep Learning

Used TensorFlow and Keras. Need to update the repo.

Week 13: Final Project & Wrapup

My final project was a proof-of-concept bot which ingested your bank transactions and answered questions about your money. It used wit.ai to parse user queries, machine learning to categorize transactions, and straightforward scipy to crunch numbers and make graphs with matplotlib and seaborn. It was a fun learning exercise to make something which briefly lived on Facebook Messenger and could answer questions. Using wit.ai made the NLP easy, though with more time, writing my own NLP parser or using one of the many offline libraries would be a good learning exercise.

Additional Resources

Stuff I found useful to understand the class material better.
