But seriously: how is Data Science Retreat?

After recommending DSR to many people and explaining the same things over and over again, I decided to put multiple conversations together into this article. I'm attending the 7th batch of Data Science Retreat, a full-time bootcamp held in Berlin from April 18 to July 15, 2016. My description covers only the batch I attended, so it is probably different from how DSR looked a year ago and will surely be different in 2 years.

There are still 2 weeks left to go, but I am almost completely done with my project and things have started to settle down, giving me more time to think about the future and how I will apply Data Science from now on, so this writing may come in handy. This course wasn't my first experience in the area, but it shaped the Data Scientist I am today.

First month

In the first month, we had full-time classes on almost every one of the 20 business days. Subjects ranged from data cleaning, Jupyter notebooks, Python, R, and SQL to Spark, Streaming, and Big Data architectures. For 3 days, we had a mini-competition between random pairs in the class, based on a Kaggle problem. On another 2 days, you were expected to start working on your portfolio project; for most people, that meant generating ideas by looking at datasets or at different applications of Machine Learning.

Coming up with an idea for a portfolio project

DSR requires you to present a portfolio project for graduation: something you made end-to-end, leveraging different Data Science techniques. Want to use Python or R, the languages taught in the course? Excellent. Scala or Julia, studied after hours? Even better. The idea is to show you can have ideas, find datasets, clean them up, make hypotheses, validate them, come up with a predictive model using Machine Learning, and present the project to technical and non-technical audiences. The whole course is structured around this requirement, and the majority of classes seem to be there to prepare you for this job.

Your project is expected to pass these requirements:

  1. It answers a good question.
  2. It has a possible business case.
  3. Data must be available.
  4. The technology must already exist.
  5. The solution's accuracy must be checkable, so you know when it does and does not work.

When choosing a project from a Kaggle competition, for instance, you fail 1 and 3, because you are not showing you can ask good questions (the question is already there) or find datasets (the dataset is also there, easily accessible). If you decide to build a recommender system, an accurate validation may require collecting data after the system is already working with users, which neglects 3 and 5. And since you don't have more than 3 months to start and finish your project, 4 is important: you shouldn't spend time creating a completely different Deep Learning framework just because the available options don't support your problem for some reason. I could add a 6th requirement, "Use Machine Learning", since this is probably the main subject of Data Science Retreat. Doing a Data Science project without Machine Learning may be judged as not challenging enough for the program.

In the end, you just need to show you can handle a Data Science project end to end when it's necessary. Depending on your previous experience or the size of the company you join after DSR, you will have a lead role, and knowing this will help you.

Mentorship

DSR has a long list of mentors, but the main one is Jose Quesada, also director of the company behind the course. Everyone has a 30-minute one-on-one with him each week, to give feedback about classes, teachers, and everything about Data Science that has crossed your mind since the previous meeting. They say they follow the Meerkat Method, where you are pushed beyond your current capabilities to expand your limits. It can be frustrating if you are not aware that Data Science is an area expanding at a fast pace, with a large amount of knowledge already out there to be acquired.

The best way to reach out to other mentors is through email. In the middle of the course, the list of names, emails, and specialties was posted in our Slack. Want more help than that? You have to be proactive and ask for it yourself. Writing short, direct emails that explain the project to people who are not familiar with it is an art mastered by few, but it helps you get answers. Every now and then you could see other mentors coming to the office for lunch, which was in fact when many good conversations happened.

Though I could make a list of "official" mentors who helped me during the development of the project and with random questions about Data Science in general, I'd say my best mentors were right beside me. No one in my batch of 8 people was a beginner before coming to Berlin. We had specialists in AI, Data Warehousing, Visualizations, Economics, Physics, Aerospace Engineering, Business Intelligence, and Computer Science. I personally have been learning a lot by osmosis.

Second month

In the second month, time was split about equally between classes and the portfolio project. Going deeper into R and Python, we also had lessons on Visualizations, Presentations, HR interview practice (sigh), Geographical Data, and Deep Learning. With R and Python, we practiced optimizing algorithms (Cython included); in Deep Learning, the classes were mostly focused on computer vision.

From this period, I'd highlight the presentation practice. The first time, we were told on a Friday evening to prepare a 20-minute presentation (on a subject of our own choice) for Tuesday. The second time, we had 3 hours to go from "damn, I need an idea" to "recording? OK, my name is Irio Musskopf and my target audience is X and Y…". A challenging task that can never be practiced enough.

Third month

In the third month: Model Pipelines, Recommender Systems, and Technical Communication. Since everyone was now super focused on their projects, these final classes had to be killer ones, and baby, they were. Especially Model Pipelines changed the way I used to see Data Science. Coming from Software Engineering, I read a lot of code to learn. In general, the Data Science code you find is, let me find the words… "not so easily maintainable". Model Pipelines are a way of applying well-known Software Engineering patterns to Machine Learning models. You start seeing a light at the end of the tunnel, a world where a Jupyter notebook doesn't need to be run in a crazy order to work and you can actually understand what's going on. Imagine the feeling of going from code full of GOTOs to functional programming.
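To make that concrete, here is a minimal sketch of the pipeline idea in Python, assuming scikit-learn; the library choice and the toy dataset are my own illustration, not something the class prescribed:

    # A minimal sketch of the pipeline idea, assuming scikit-learn;
    # library and dataset are my illustration, not DSR's material.
    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # All steps live in one object with a fixed execution order,
    # instead of being scattered across notebook cells.
    model = Pipeline([
        ("scale", StandardScaler()),    # preprocessing belongs to the model,
        ("clf", LogisticRegression()),  # so it is fit on training data only
    ])

    model.fit(X_train, y_train)
    print(model.score(X_test, y_test))

Swapping the classifier or adding a feature-engineering step is a one-line change, and there is no hidden notebook state you have to re-run in the right order.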

Outside the Curriculum

As I mentioned before, a lot is learned from the others attending DSR, so telling you what happened in my batch may not help you understand what you're going to find in the next one. I could list: how to manage version control for Jupyter notebooks, different Deep Learning techniques, Scala, graph analyses with Neo4j, and even how to say "Happy Birthday" in Vietnamese. By the way, the batch had 8 people of 7 different nationalities.