Dear Data was a year-long project by Stefanie Posavec and Giorgia Lupi.

Connectedreams’ Resources

Beginner’s Guide to Data Science

1. What is DATA SCIENCE

Connectedreams.com

Published in

Connectedreams Blog

15 min read · Jul 27, 2016

“Life is a journey, not a destination.” ― Ralph Waldo Emerson

The quote is a part of our inbuilt thinking process. It is true in many ways, for the destination is decided by the journey we take.

There is another side to this idea in the data world: the destination decides which path you should prefer in data science.

Data science is interpreted as a process of formulating a hypothesis or a quantitative question that can be answered by collecting, cleaning and analyzing data.

Data science comprises two terms:

DATA and The SCIENCE

Data is the second most important thing in data science. The most important thing is the question you are trying to answer, i.e. the science you use to find the answer. The data should follow from the question you are trying to answer.

The practitioner of data science is the data scientist. “Data Scientist” has become a popular occupation, with Harvard Business Review dubbing it “The Sexiest Job of the 21st Century”. McKinsey & Company has projected a global excess demand of 1.5 million new data scientists. Master’s degrees are currently offered at almost all the top universities, and a number of certification programmes are offered free of charge.

But as we said earlier, the journey is more intuitive than the destination. Let us dig into the process of Data Science.

2. The process of data analysis

Data flows through various processes to become information that carries knowledge. This information is useful for making decisions in real life. The processes through which data must flow to enable correct decisions can be summarized as follows:

Data requirements. The data relevant to the question being answered needs to be identified.

Data collection. Gather it from varied sources in varied formats.

Data processing. Data initially obtained must be processed or formatted for analysis.

Data cleaning. Once processed and organized, the data may still be noisy, i.e. incomplete, duplicated, or erroneous. Data cleaning is the process of removing this noise from the data.

Exploratory data analysis. The data is explored from every possible angle and analyzed to reveal patterns.

Modeling and algorithms. In general terms, models may be developed to evaluate a particular variable in the data based on other variable(s) in the data, with some residual error depending on model accuracy (i.e., Data = Model + Error).

Data product. A data product is a computer application that can be deployed with a user interface, so that end users of the data do not need to know the inner details.

Communication. The end result needs to be communicated to potential clients in a format that can then be used to make business decisions.
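
As a rough illustration of the cleaning and modeling steps above, here is a minimal Python sketch. The records, the `clean` and `fit_line` helper names, and the choice of a straight-line model are all assumptions made for this example; a real project would likely use a dedicated data library and a model suited to its own question.

```python
from statistics import mean

# Hypothetical raw records: (hours_studied, exam_score).
# None marks a missing value; one row is an exact duplicate.
raw = [(1.0, 52.0), (2.0, 55.0), (2.0, 55.0), (3.0, None), (4.0, 61.0), (5.0, 66.0)]


def clean(records):
    """Data cleaning: drop incomplete rows and exact duplicates (first occurrence kept)."""
    seen = set()
    cleaned = []
    for row in records:
        if any(value is None for value in row):
            continue  # incomplete record
        if row in seen:
            continue  # duplicate record
        seen.add(row)
        cleaned.append(row)
    return cleaned


def fit_line(points):
    """Modeling: ordinary least squares for y = intercept + slope * x."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in points) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    intercept = y_bar - slope * x_bar
    return intercept, slope


data = clean(raw)
intercept, slope = fit_line(data)

# Data = Model + Error: the residual is whatever the model fails to explain.
residuals = [y - (intercept + slope * x) for x, y in data]

for (x, y), error in zip(data, residuals):
    print(f"x={x:.1f}  data={y:.1f}  model={intercept + slope * x:.2f}  error={error:+.2f}")
```

Each printed row splits an observed value into the part the line explains and the leftover error, which is the sense of "Data = Model + Error" in the modeling step.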

3. Career Options

Data science affects academic and applied research in many domains, including machine translation, speech recognition, robotics, search engines and the digital economy, as well as the biological sciences, medical informatics, health care, social sciences and the humanities. It heavily influences economics, business and finance. From the business perspective, data science is an integral part of competitive intelligence, a newly emerging field that encompasses a number of activities, such as data mining and data analysis.

4. Schools that offer Data Science Specialization

For data science aspirants, the following is a summary of schools offering a data science specialization at the bachelor’s or master’s level. The programs are offered on campus, as executive programs, or online. Details can be found below.

a. Southern Methodist University | Master of Science in Data Science | Dedman College of Humanities and Sciences, Lyle School of Engineering and Meadows School of the Arts | Dallas, Texas | 18–24 months

b. University of California, Berkeley | Online Master of Information and Data Science (MIDS) | School of Information | Berkeley, CA | 12–20 months

c. Arizona State University | Master of Science in Business Analytics | W.P. Carey School of Business | Tempe, AZ | 9 months, in development

d. Bentley University | Master of Science in Marketing Analytics | Graduate School of Business | Waltham, MA | 1–1.5 years

f. Columbia University | Master of Science in Data Science | Data Science Institute | New York, NY | 30 credits

g. Georgia Tech | Concentrations in Analytical Tools, Business Analytics, and Computational Data Analytics | College of Computing, College of Engineering, and Scheller College of Business | Atlanta, GA | 1 year

h. Indiana University, Bloomington | Data Science M.S. | School of Informatics & Computing | Bloomington, IN | 30 credits

i. Louisiana State University | Master of Science in Analytics | E.J. Ourso College of Business | Baton Rouge, LA | 1 year

j. Michigan State University | MS in Data Science | Broad College of Business | East Lansing, MI | 1 year

l. North Carolina State University | Master of Science in Analytics | Institute for Advanced Analytics | Raleigh, NC | 10 months

m. Rutgers University | Master of Business and Science with a Concentration in Analytics — Discovery Informatics and Data Sciences; Online Master of Information | Graduate School, Professional Science Masters Programs (Master of Business and Science) and School of Communication and Information (Master of Information) | New Brunswick, NJ | 1.5–2 years

n. Stanford University | Master of Science in Statistics: Data Science | Department of Statistics | Stanford, CA | 45 credits

o. Texas A&M University | Master of Science in Analytics | Department of Statistics | Houston, TX | 2 years

5. Self Learners’ Edge!

If you are a self-learner, we have sketched out the steps you may want to go through.

Step I: Watch the recording of Connectedreams’ Groupinar on Data-Driven Innovation.

Panelists:
Ashish Gupta | Senior Software Engineer Google NYC, Ph.D. Computer Science
Prukalpa Sankar | Founder SocialCops
Soroush Vosoughi | Postdoctoral Associate MIT Media Lab

Moderator:
Manish Agarwal | Quantitative Researcher Point 72 Asset Management, Ph.D. Information Theory

Step II: Math and statistics — Khan Academy: The Math Track, Linear Algebra by MIT-OCW, Discrete Mathematics for Computer Science by MIT-OCW

Step III: Algorithms — Introduction to Algorithms by MIT-OCW

Step IV: Machine Learning — Machine Learning by Andrew Ng, Practical Machine Learning by Johns Hopkins

Step V: Coding Skills — R by Johns Hopkins, Codecademy

Step VI: Databases — Get acquainted with various databases used in data projects.

Step VII: Data management — Data cleaning and managing, Getting and Cleaning Data by Johns Hopkins, Visualization, Reporting

Step VIII: Level up with Big Data — Hadoop, Spark & MapReduce

Step IX: Get experience and network — Join Kaggle and compete; Kaggle is a great platform that data firms hire from. Subscribe to KDnuggets and keep yourself updated on data science.

6. Books for Data Analysts

A list of free online courses and books for those interested in dealing with Data.

Machine Learning

Stanford Machine Learning — This course gave me a practical start in machine learning using Octave/MATLAB.
Caltech Learning From Data — The Caltech course has by far the best available lectures to get you started in machine learning.
Neural Networks for Machine Learning — This course covers neural nets much more deeply; they get only about a week in the other machine learning courses. In particular, it even covers the RNN types.
Probabilistic Graphical Models — PGMs stand their own ground apart from machine learning. The course’s introduction mentions that they do better in certain scenarios where machine learning is weak.

Data Science

Introduction to Data Science — As the name suggests, this broadly covers a bit of each part of what a data scientist does.
Harvard Data Science — This was a class about data science given at Harvard, and all the material was made available for free online.
Berkeley Data Science — Another course in data science that made the slides available (also note that a previous semester’s lecture slides are available). They enforce the use of R’s ddply for pre-processing.
Columbia Data Science — This was a course whose material was made available as well. The professor of this course is publishing a book entirely about data science, which is only available for pre-order at this point.

Statistics

Computing for Data Analysis — This course covers a lot of ground in R.
Data Analysis — The follow-up course to Computing for Data Analysis. Here the focus is stronger on statistics and weaker on R.
Statistics One — It is focused more on theory than on applying statistics, but it provides R scripts for each week’s theoretical material.
Latent Variable Models — This is a very advanced statistics course that is yet to start.

Text Mining

Stanford Natural Language Processing — Covers the space of performing exploratory clustering of words by frequency and what is also called “context mining”.
Toronto Natural Language Processing — It is a harder version of the Stanford course on Coursera.

Social Network Analysis

Michigan Social Networks Analysis — A good introduction to SNA.
Stanford Social Network Analysis — The course is an advanced version of SNA.

If you want to learn about gender roles and character portrayal in film, this is the project to look at. http://stereotropes.bocoup.com/

Visualization

Information Visualization — Helpful in learning about how those infographics are made among other things.
Data Visualization — This course was offered once in a moodle environment that is no longer available. The video lectures are on YouTube.

Databases

Stanford Database — Covers a lot of ground on all kinds of databases.
10gen MongoDB — Covers everything in handling JSON data and structuring it in a MongoDB database.
Graph Database — For the sake of including all the databases I’ve used for data analysis, this freely available book covers a lot of ground on Neo4j graph databases.

7. Boot Camps & Scholarships!

In the quest for learning, there may be financial limitations that make it difficult for an aspirant to pursue their dreams. So we have compiled resources like bootcamps and scholarships for you.

Bit Bootcamp — New York enroll@bitbootcamp.com

The Data Incubator — New York / Washington D.C. Contact Form

Data Science Dojo — Seattle, WA / Silicon Valley, CA Contact Form

Data Science for Social Good — Chicago datascifellows@gmail.com

Data Society hello@datasociety.co

General Assembly — Boston / New York / San Francisco / Washington D.C. classes@generalassemb.ly

Insight Data Engineering & Data Science Fellows — New York / Silicon Valley, CA info@insightdatascience.com

Level — Boston / Seattle / Charlotte / Silicon Valley, Level Data Analytics Bootcamp jon@leveledu.com

Metis — New York / San Francisco, CA Contact Form

Microsoft Research Data Science Summer School ds3@microsoft.com

NYC Data Science Academy — New York info@nycdatascience.com

SlideRule founders@mysliderule.com

Thinkful hello@thinkful.com

Zipfian Academy — San Francisco hello@zipfianacademy.com

8. Data Science and Machine Learning Podcasts

You can learn the basics and keep up with the latest news in data science, machine learning and artificial intelligence by listening to these great podcasts which were compiled by Matt Fogel.

The Data Skeptic. A great starting point on some of the basics of data science and machine learning. Kyle and Linh Da explore basic data science concepts.
Linear Digressions. Hosted by Katie Malone and Ben Jaffe of online education startup Udacity, this weekly podcast covers diverse topics in data science and machine learning: teaching specific concepts like Hidden Markov Models and how they apply to real-world problems and datasets.
Partially Derivative. Each week, hosts Chris Albon and Jonathon Morgan, both experienced technologists and data scientists, talk about the latest news in data science over drinks.
The O’Reilly Data Show. This podcast features Ben Lorica, O’Reilly Media’s Chief Data Scientist, speaking with other experts about timely big data and data science topics.
Data Stories. Data Stories is a little more focused on data visualization. Every other week, Enrico Bertini and Moritz Stefaner cover diverse topics in data with their guests.
Learning Machines 101. Billing itself as “A Gentle Introduction to Artificial Intelligence and Machine Learning”, this podcast can still get quite technical and complex, covering topics like: “How to Reason About Uncertain Events using Fuzzy Set Theory and Fuzzy Measure Theory” and “How to Represent Knowledge using Logical Rules”.
Talking Machines. Every other week, hosts Katherine Gorman and Ryan Adams speak with a guest about their work, and news stories related to machine learning.

A full list of MOOC courses is available on the link below. http://www.mooc-list.com

9. Conversations With 3 Experts

“Imagination is more important than knowledge.” ― Albert Einstein

If Einstein were to say something similar today, he would probably say something like “Imagination is more important than data”.

Well, Einstein isn’t here to say anything, so here are a few excerpts from my conversations with three experts who deal with data every day at work and graduated in the 90s and 00s.

a. Conversation with Amrita A. Mohan | Director, Clinical Bioinformatics at CHDI Management, Inc

“Data science is a very broad topic. I would not call Data Science a subject that people train in these days. Data science in my opinion is a very multidisciplinary cross functional space.

Data science in bioinformatics is a combination of biology, life science and information technology. You need to know what sort of questions to ask and how to ask them. You need to answer whether what you are looking at is a true phenomenon represented in biology or some sort of artifact, where someone ran a bad experiment and probably did not know it had failed. To address those kinds of challenges, application sciences is trivial and one needs to possess a basic idea of statistics and maths, followed by substantial domain expertise.

You can imagine someone with a life science background to be a data scientist or you can imagine someone who has a financial engineering or economics sort of background also to be a data scientist. It all depends on what you are looking at.

The bioinformatics space is still new in India and not as mature as in the US, but people are talking about it. It is interesting to see the change over a period starting 10 years ago, when no one knew about data science, to now, when everyone is talking about data. Data science is going to become the next need of the hour. I cannot think of an area where people are not looking at data. It is hard to see how not looking at data is going to help you! It adds value, and you can learn a lot when you look at data with an open mind.

My area of expertise is predominantly in human genetic disorders like Huntington’s disease. It is a genetic disorder that follows a pattern: if an individual in a family tests positive, then there is almost a 50% chance that someone in the next generation will test positive for Huntington’s disease.

My particular goal is largely structured around clinical trials. Clinical trials in Huntington’s disease generate tonnes of data. Sometimes there are clinical trials where thousands of people are being monitored and you keep analyzing them. On an annual basis you will read how their brain is functioning, how their mind is working, how they feel and whether they are depressed or happy. Imagine yourself doing this for 20 thousand people for 10 years, which will lead to a large number of data points. The other thing is, when you have a clinical trial with drugs you always have data involved. You always know whether the drug worked or failed. You have to address whether the drug was safe in the combination, which patients responded well, why they responded and why the others did not. All of the above is covered under clinical computation. We haven’t found the cure yet, but what can we learn from the failures? This is the question you ask when you go into data science. Finding a cure is one thing, but it is also about eliminating the less promising therapies. You are making yourself more efficient because you are not going to keep repeating the same mistakes.

There are also several varied projects that I look into which cover this domain.”

b. Conversation with Manish Agarwal | A Quantitative Researcher

Quantitative and Qualitative Analysis — “The qualitative analysis terminology does not exist in practice! Back in 1990–95, theoretical and experimental studies were performed on a massive scale. For an experiment, various people from varied backgrounds were called upon, and then analysis was performed to arrive at facts or to prove a hypothesis. There was intensive behavioral analysis of people, which then involved measuring the experiment results on a large scale.

This was referred to as qualitative study. Let us analyze the scenario now. Today’s era is data driven! The revolution of industries, the tech bubble and economic change have changed everything. People now generate data. There is Snapchat, WhatsApp and varied other social media which generate massive data, along with user-generated data like Wikipedia.

Now if the same economist has a theory today, he can verify it on massive data. If a hypothesis is proposed, then there is a confidence and a proof that can be attached. The facts can now be measured on a scale and a metric can be assigned to each of them. So let us say there are a hundred million tweets a day: what fraction of those tweets are angry tweets? You can put a number on the amount of anger. All of a sudden, all the research we used to do which was non-quantitative in a way, and a lot of things which we were not able to put numbers on before, can now be analyzed quantitatively. Research has to be quantitative on paper, as ultimately you will have to define the quantities you are working with well, but now that we have data we are able to do all those things precisely.

If there were tonnes of data back when Einstein was doing his research, he would probably be too distracted to come out with a very good theoretical model for space-time behavior. Sometimes one needs to put the data away to think about what makes sense mathematically and theoretically.

Information theory is less about data and more about how many things you can infer about a certain system just by looking at the mathematical constraints. Those are the areas that are never going to change, and the ones in which there is a lot you can do just by sitting in a room and thinking hard. Nothing can replace that. Maybe data can help in later stages to verify what you hypothesized, but data cannot replace a good mathematical model.”

c. Conversation with Aditi Saini | Data Engineer, Barclays Technology Center India

Data science is more about analytics on top of the data, with the help of machine learning and other statistical techniques. It is also a means of getting insights from data: intelligent information that may not be obtainable by working with such large data in traditional ways, without visualizing and analyzing data patterns using data science as a tool.

You can generate a lot of data or collect it from a variety of sources, but how you will use that data to obtain a pattern is the main task. A huge tech stack is involved in the streaming process of data, and that is quite an important task.

I am currently analyzing and building models for analysis of ‘internal organization fraud’ by tracking a person’s movements which involves a lot of parameters and feature points including the visual data from the CCTVs and other sources. There is data involved and so various approaches/methodologies are involved which keep advancing. It is quite a challenge and a wide domain full of new possibilities which enables us to solve wider problems and have a wide learning opportunity.