My transition from research to data science

TSA Kaggle Competition — Chipy blog 1

(This is part 1 of a 3-part series about my experiences in the Chipy mentorship program.)

Hi and thanks for taking the time to check out my first post about my experience in the Chicago Python User’s Group (Chipy) mentorship program for Fall 2017! Chipy is a great community of Python users, and hosts monthly meetings in Chicago featuring talks about cool things you can do in Python, as well as team project nights. The mentorship program pairs mentees with a more experienced Python user to help guide them on a project, and I’m thrilled to be part of the program this Fall as a mentee!

My interests are diverse, but in general, I am interested in increasing efficiency and helping improve people’s lives. I also love to travel, but dread the TSA screening process with its long lines and frustrations. A few months ago, the TSA launched a Kaggle competition to help improve its passenger screening algorithm, and I decided to jump right in as my mentorship program project. Specifically, the TSA wants to improve the algorithm behind the millimeter wave scanners used to detect potential threats on passengers’ bodies and clothing.

Photograph: Benoit Tessier/Reuters

The current algorithms yield too many false positives. Each false positive costs TSA agents time, because they have to manually screen the flagged passenger, and it increases the wait time for all the other passengers moving through screening.

Photograph: Scott Olson/Getty Images

The competition includes 19,499 3D image scans from 17 different body zones, about 3 TB of data. It’s an ambitious project, but since I am currently in transition, I have the time and attention to focus on it, and it will help me learn a lot in a short amount of time.

My goals as a Chipy mentee and for this project are to:

  1. Learn to work with large datasets and cloud computing (3 TB per phase — largest data set I’ve ever worked with!)
  2. Gain experience with machine learning algorithms (my data analysis background is largely statistics based, and I want to expand my skillset)
  3. Increase my experience and confidence using Python
  4. Begin making contributions to the open source community by documenting this project
  5. Get hired as a data scientist!


My plan for tackling this project is to begin by training a model with images from a single body zone. I am choosing zone 6.

  1. Start simple: choose a random sample of 30 images from zone 6 to begin training a model
  2. Data preparation and preprocessing
  3. Train a model on my 30 images from zone 6
  4. When working well, increase sample size from zone 6
  5. Incrementally repeat 2–4 until I have all data from zone 6
  6. Also get connected to Google Cloud Platform to either stream data or start running code in their shell (at some point there will be more data than I want to download onto my computer)
  7. Repeat 2–5 with data from another body zone
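
A minimal sketch of how steps 1, 4, and 5 could look in Python, using only the standard library. The scan IDs are made-up placeholders rather than the competition’s real file names, and the function names and the doubling schedule (30, 60, 120, ...) are assumptions for illustration, not something the plan above specifies. The idea is to shuffle the zone 6 scan IDs once with a fixed seed, so every larger sample contains the smaller ones and the runs stay reproducible.

```python
import random


def shuffled_ids(scan_ids, seed=0):
    """Return the scan IDs in a reproducible random order."""
    ordered = sorted(scan_ids)            # sort first so input order does not matter
    random.Random(seed).shuffle(ordered)  # seeded shuffle, same result every run
    return ordered


def growing_samples(scan_ids, start=30, factor=2, seed=0):
    """Yield nested random samples that grow until all scan IDs are included."""
    ordered = shuffled_ids(scan_ids, seed)
    size = start
    while size < len(ordered):
        yield ordered[:size]              # each sample extends the previous one
        size *= factor
    yield ordered                         # final pass: every scan in the zone


if __name__ == "__main__":
    # Hypothetical IDs standing in for the zone 6 scans.
    zone6_ids = [f"scan_{i:04d}" for i in range(100)]
    for sample in growing_samples(zone6_ids):
        print(len(sample))                # prints 30, then 60, then 100
```

Because the samples are nested, a model trained on the first 30 images sees the same 30 again in every later round, which makes it easier to tell whether a change in results comes from the extra data or from a change in the code.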

This will be a challenging project, but I know I will learn a lot. I look forward to sharing what I’m doing and appreciate any feedback and ideas for improvement. It’s my hope that my successes and challenges will be helpful to others working on similar projects and transitioning into data science!

My biggest Mistake as a Python Beginner — Chipy blog 2

(This is part 2 of a 3-part series about my experiences in the Chipy mentorship program.)

Don’t just read articles and take a class before starting a data science project. Analyze an interesting dataset as a shadow project alongside an online class.

When I first started to learn Python for data science, I took the typical approach I had used for many years in school. I had no idea what any of the syntax meant, so I knew I had to take a beginner’s course. I jumped right into a developer’s course on Udemy and followed along with the tutorials and basic exercises. Python was my first programming language, and I found it super exciting to learn a new skill. The course was tremendously helpful for understanding syntax, program flow, and other basics such as the differences between a list, a dictionary, and a tuple, all things I definitely needed to know. However, I quickly became bored with the example exercises and started to lose motivation. I love working with data, and programming simple games just wasn’t doing it for me. Even though I was learning important concepts in Python, I didn’t have anything relevant to show for it. I felt like I wasn’t making any progress towards my goal of becoming a data scientist, and this was frustrating.

I thought I needed to go through this basic learning process before I could jump into a data science project. I was wrong.

When I was accepted into the Chipy mentorship program, I chose an ambitious project: a Kaggle image recognition challenge that calls for a machine learning algorithm and comes with 3D images and 3 TB worth of data. This is the first thing I have really tried to do in Python, and there are a lot of basics I am still learning. I had never built an ML model before and didn’t even have a great handle on the “10,000 ft view” of neural networks and how they work. For a few days, I read blog posts on ML basics and a few white papers. These were helpful for some basic understanding, but gave me nothing I could put into practice immediately. More frustration.

Then I found a much better approach. It was clear I needed to take a course on deep learning in Python, and I found a good one on Udemy, “Deep Learning A-Z,” that uses the Keras library I was planning to use for my project. The course has great video explanations of the underlying neural network concepts (much easier to understand than reading about them in a blog post or paper), and I can re-watch the videos at any time if I need a refresher. Moreover, the course covers the basic workflow of an ML data science project, complete with step-by-step templates for data preprocessing and different types of models. After understanding the basics, I jumped right into the module that builds a model for a basic artificial neural network. After completing each step with the example dataset for the course, I did the exact same step using a subset of my 3D image dataset.

The advantage of this strategy is that I am gaining conceptual understanding and making progress on something relevant and practical at the same time. I get to learn a concept, and then immediately apply it to a new, more complicated dataset. Oftentimes there are additional steps I need to take to modify my 3D image dataset, but a few Google searches, trial and error, and advice from my mentor help with that. I know I am making much more progress this way than if I had completed the entire artificial neural network module before trying to apply the concepts to my 3D dataset. I would have had to re-watch some of the videos and look up concepts I had forgotten about anyway, so this way is much more efficient.

End result: I learned new things, remember them better because I actually applied them in a different context, and have a data science project to show for it at the end of the course! A much better use of my time.

Choosing a project

Finding projects to help you along on your Python learning journey can be overwhelming. There is so much to learn about data science, there are tons of open datasets, and the concepts apply across almost every sector of the economy. It’s impossible to become an expert in all of it, and no one will expect that of you. But you have to start somewhere. I’ve found it helpful to write out a list of learning goals centered on the essential skills every data scientist must have. There are a lot of great resources online about becoming a data scientist that mention important skills and concepts to be familiar with. Consult a variety of these online resources, and talk to data scientists in your city to find out what skills employers in your area and target sector value most.

Once you have a list of learning goals, it’s time to choose a project. You’ve heard this a thousand times, but it’s so true: choose a topic you’re interested in (e.g., healthcare, financial modeling, education). You’re going to be spending a lot of quality time with your project, and it will be extra frustrating and demoralizing if you’re working on a problem you don’t care about. Search for datasets in your topic(s) of interest (good places to start are open government datasets and Kaggle competitions), or devise your own project. Try to pick a project that hits several of the learning goals on your list, so you can be efficient with your time.

Finally, keep in mind the scope of your project. Do you want to do a small project to get your feet wet, one that will take a short amount of time and that you can quickly push to GitHub? Do you want to do a big project that will probably hit more learning objectives, but take more time? How many hours per week do you have to dedicate to learning Python for data science? By when would you like to have a project portfolio started that you can showcase to potential employers? It’s important to ask yourself questions such as these to ensure you’re making progress (and staying motivated) in the timeframes that you have.

Choosing a good online course

Now that you have a project, figure out what questions you want to ask about the data (this is usually already specified if you’re doing a Kaggle project, for example). A few online searches of your questions should give you an idea of what kinds of analysis techniques to use. Find a course online that covers those analysis topics. There are many classes and tutorials out there to choose from, so look through each syllabus and find one that fits your learning style. The courses I’ve found most helpful are the ones that walk you through an example project, complete with a conceptual overview and the steps you should take when doing that type of analysis. Consider your own data: if you want to build a predictive model to answer a business question, follow along with a course that uses a similar example. This structure can be much more helpful than courses that just explain how to do isolated tasks without immediately applying them. Also keep in mind it may not be necessary to complete the entire course to get the information you need for your project. If so, only do the relevant sections. The course will always be there if you need the other sections later on.

Yvonne K Matos