Generating Labeled Training Data for Your ML/AI Models featuring Angie Hugeback
TWiML Talk 006
My guest this time is Angie Hugeback, who is principal data scientist at Spare5. In this show, Angie and I discuss the real-world practicalities of generating training datasets.
This week’s podcast is sponsored by Spare5 (now Mighty AI). Spare5 helps customers generate the high-quality labeled training datasets that are so crucial to accurate machine learning models.
Angie and I talk through the challenges faced by folks who need to label training data, and how to develop a cohesive system for performing the various labeling tasks you’re likely to encounter. We discuss some of the ways that bias can creep into your training data and how to avoid that. We explore some of the popular third-party options that companies look at for scaling training data production, and how they differ. And Angie gives us her top three tips for folks tasked with generating training data for AI.
Thank You Spare5!
Spare5 has graciously sponsored this episode. If you’re struggling with generating labeled training data for your machine learning or AI-based products, you should definitely take a look at what they’ve got to offer.
Above all, I’m just very grateful to Spare5 for helping to make this podcast possible for all of you, and I really encourage you to show them some love back: reach out to them on Twitter at @spare5 and thank them, visit their web site, or request a demo. All of those things let them know how much you appreciate this podcast and their support for it.
Finally, they’ve got a special offer for 25 lucky TWiML Talk listeners. Learn more on the podcast and sign up here: spare5.com/podcast.
UPDATE 1/10/17: Spare5 is now Mighty AI. The company announced its new name in conjunction with the close of a $14M financing round led by Intel Capital, with participation from GV, Accenture Ventures, and others.
About Angie Hugeback
Mentioned in the Interview
- Spare5 | Training Data as a Service
- Metropolis–Hastings Algorithm
- Importance Sampling
- Rserve — Binary R server
- Machine Learning: The High Interest Credit Card of Technical Debt (Note: In the interview, Angie referred to a Microsoft paper that she recommended. After the interview she realized it was this one by Google Research.)
- Seven Rules of Thumb for Web Site Experimenters
- ModelTracker: Redesigning Performance Analysis Tools for Machine Learning [PDF] [Youtube]
- A cautionary tale about humans creating biased AI models | TechCrunch
IMAGE CREDIT: UMN