TSA Kaggle Data Challenge — ChiPy blog 1

Yvonne K Matos
3 min read · Oct 29, 2017

--

(This is part 1 of a 3-part series about my experiences in the Chipy mentorship program.)

Hi and thanks for taking the time to check out my first post about my experience in the Chicago Python User’s Group (Chipy) mentorship program for Fall 2017! Chipy is a great community of Python users, and hosts monthly meetings in Chicago featuring talks about cool things you can do in Python, as well as team project nights. The mentorship program pairs mentees with a more experienced Python user to help guide them on a project, and I’m thrilled to be part of the program this Fall as a mentee!

My interests are diverse, but in general I am interested in increasing efficiency and helping improve people’s lives. I also love to travel, but dread the TSA screening process with its long lines and frustrations. A few months ago, the TSA launched a Kaggle competition to help improve its passenger screening algorithm, and I decided to jump right in as my mentorship program project. The TSA wants to improve the passenger screening algorithm for the millimeter wave detection scanners used to detect potential threats on passengers’ bodies and clothing.

Photograph: Benoit Tessier/Reuters

The current algorithms yield too many false positives, which costs TSA agents time manually screening false-positive passengers and increases wait times for all other passengers moving through screening.

Photograph: Scott Olson/Getty Images

The competition includes 19,499 potential threat labels from 17 different body zones in 1,148 3D scan images, about 3 TB of data. It’s an ambitious project, but since I am currently in transition, I have the time and attention to focus on it, and it will help me learn a lot in a short amount of time.
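A quick back-of-envelope check on those figures (the totals are from the paragraph above; the arithmetic is just a sanity check):

```python
# Rough scale of the competition data, using the totals stated above.
num_scans = 1_148
num_labels = 19_499
total_tb = 3

gb_per_scan = total_tb * 1024 / num_scans   # roughly 2.7 GB per 3D scan
labels_per_scan = num_labels / num_scans    # roughly 17, one label per body zone

print(round(gb_per_scan, 1), round(labels_per_scan))  # 2.7 17
```

So each scan alone is a couple of gigabytes, and the label count works out to about one label per body zone per scan.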

My goals as a Chipy mentee and for this project are to:

  1. Learn to work with large datasets and cloud computing (3 TB per phase — largest data set I’ve ever worked with!)
  2. Gain experience with machine learning algorithms (my data analysis background is largely statistics based, and I want to expand my skillset)
  3. Increase my experience and confidence using Python
  4. Begin making contributions to the open source community by documenting this project
  5. Get hired as a data scientist!

Strategy

My plan for tackling this project is to start by training a model with images from a single body zone; I am choosing zone 6.

  1. Start simple: choose a random sample of 30 images from zone 6 to begin training on
  2. Data preparation and preprocessing
  3. Train a model on my 30 images from zone 6
  4. Once the model is working well, increase the sample size from zone 6
  5. Incrementally repeat 2–4 until I have all data from zone 6
  6. Also get connected to Google Cloud Platform to either stream data or start running code in their shell (at some point there will be more data than I want to download onto my computer)
  7. Repeat 2–5 with data from another body zone
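Step 1 above can be sketched in a few lines. This assumes a Kaggle-style labels file where each row pairs a scan ID and zone (e.g. `scan0001_Zone6`) with a threat probability; that ID format, and the helper name, are my assumptions for illustration, not the competition’s documented schema.

```python
import csv
import io
import random

def sample_zone_ids(labels_csv, zone, n, seed=0):
    """Pick n scan IDs at random for one body zone.

    Assumes rows like 'scanid_Zone6,1' (Id, Probability) --
    the exact label format is an assumption here.
    """
    reader = csv.reader(io.StringIO(labels_csv))
    ids = [row[0].split("_Zone")[0]
           for row in reader
           if row and row[0].endswith(f"_Zone{zone}")]
    random.seed(seed)  # fixed seed so the sample is reproducible
    return random.sample(ids, min(n, len(ids)))

# Tiny synthetic labels file: 40 scans, each with a zone-6 label.
labels = "\n".join(f"scan{i:04d}_Zone6,{i % 2}" for i in range(40))
subset = sample_zone_ids(labels, zone=6, n=30)
print(len(subset))  # 30
```

Filtering on the `_Zone6` suffix rather than a substring avoids accidentally matching zone 16, and fixing the seed makes the 30-image subset reproducible across runs.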

This will be a challenging project, but I know I will learn a lot. I look forward to sharing what I’m doing and appreciate any feedback and ideas for improvement. It’s my hope that my successes and challenges will be helpful to others working on similar projects and transitioning into data science!
