Talk on designing machine learning pipelines for mining transactional SMSes

Meet Paul Meinshausen, one of the speakers at The Fifth Elephant 2017

Abhishek Balaji
The Fifth Elephant Blog
Jul 21, 2017

Paul Meinshausen is a globally experienced data scientist. He is a Co-Founder of PaySense, a mobile fintech startup based in Mumbai, and was Chief Data Officer at the company until February 2017. Before co-founding PaySense, Paul was Vice President of Data Science at Housing.com, where he led the Data Science Lab and the Product Analytics and Business Intelligence teams. Earlier he was Principal Data Scientist at Teradata, where he worked on machine learning projects in the Banking, Telecom, Automotive, and E-Commerce industries across the South Asia and APAC regions. Paul was a Data Science for Social Good Fellow at the Computation Institute at the University of Chicago in 2013.

Between 2009 and 2011, he served as an analyst for the U.S. Department of the Army and was deployed to Kabul, Afghanistan, at the headquarters of the International Security Assistance Force. Paul has an academic research background in behavioural science: he was a researcher in the Department of Psychology at Harvard University and a Fulbright Scholar at the Middle East Technical University in Turkey.

The problem statement of Paul’s talk is the design of a machine learning system to extract structured, precise information from raw SMS data with minimal expert guidance.

SMSs carry latent structure: the same message template can be populated with personalised information and sent to a theoretically unlimited number of customers. In that sense, the problem is conceptually more tractable than mining social media messages such as tweets, which usually carry no such structure. However, the templates themselves vary immensely, from e-commerce and banking use-cases to cab and delivery notifications. Each use-case requires some initial domain expertise to define efficient data models. If that upfront modelling is done well, for example by defining an ontology for the domain of personal finance, the amount of expert assistance required afterwards can be reduced significantly.
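To make the idea of latent structure concrete, here is a minimal Python sketch. It is not taken from Paul's talk: the message wording, the masking rules, and the field names (`account`, `amount`, `date`, `balance`) are all invented for illustration. It shows two things: masking the personalised parts of a message reveals the shared template, and a hand-written pattern for that template can then pull out structured fields.

```python
import re

# Hypothetical messages: the wording and account numbers are invented.
MESSAGES = [
    "Your a/c XX1234 is debited with Rs 500.00 on 21-07-2017. Avl bal Rs 12000.50",
    "Your a/c XX9876 is debited with Rs 1250.00 on 03-06-2017. Avl bal Rs 830.00",
]

# Mask the parts that vary per recipient so the shared template shows through.
# These three rules are assumptions that only cover the toy messages above.
MASKS = [
    (re.compile(r"\b\d{2}-\d{2}-\d{4}\b"), "<DATE>"),
    (re.compile(r"\bRs\.? ?\d+(?:\.\d+)?"), "<AMOUNT>"),
    (re.compile(r"\bXX\d+\b"), "<ACCOUNT>"),
]


def skeleton(message):
    """Return the message with its personalised values replaced by tokens."""
    for pattern, token in MASKS:
        message = pattern.sub(token, message)
    return message


# A hand-written pattern for this one hypothetical template.
DEBIT = re.compile(
    r"Your a/c (?P<account>XX\d+) is debited with Rs (?P<amount>\d+(?:\.\d+)?) "
    r"on (?P<date>\d{2}-\d{2}-\d{4})\. Avl bal Rs (?P<balance>\d+(?:\.\d+)?)"
)


def extract(message):
    """Return a dict of structured fields, or None if the template does not match."""
    match = DEBIT.fullmatch(message)
    if match is None:
        return None
    fields = match.groupdict()
    fields["amount"] = float(fields["amount"])
    fields["balance"] = float(fields["balance"])
    return fields


if __name__ == "__main__":
    for message in MESSAGES:
        print(skeleton(message))
        print(extract(message))
```

Both messages reduce to the same skeleton, which is what makes grouping by template feasible. Writing a pattern like `DEBIT` by hand for every template is exactly the kind of expert effort that a learned system would aim to minimise.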

While the concrete and specific topic of the talk is the case of SMSs, Paul has also tried his best to generalize the talk to the broader problem statement of extracting structured information from raw unstructured (or latent structured) data.

The first takeaway is a piece of problem-solving advice:

“If you have a tough question that you can’t answer, first tackle a simpler question that you can’t answer.”

In other words, break your problems into pieces and solve them one at a time.

The second takeaway has to do with architecture and the design of machine learning systems. When you’re applying machine learning to a problem, you’ll almost never fully understand the problem at the beginning, which means that 25% or 50% of the way into the project you’ll need to do things you didn’t know you’d have to do at the start. Unfortunately, you also face path dependency: previous choices affect and constrain subsequent choices. The ability to deal with this is extensibility. Paul’s view is that engineers and startups both regularly make the mistake of making a big issue of scalability while forgetting about extensibility. That’s a shame, because their systems are more likely to break down for lack of extensibility than for lack of scalability.
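One way to picture extensibility in a layered pipeline is sketched below in Python. This is an illustrative assumption, not the architecture from the talk: the stage names, the record fields, and the toy rules standing in for trained models are all hypothetical. Each stage takes a record and returns an enriched copy, so a stage discovered to be necessary halfway through a project can be added to the list without rewriting the earlier ones.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]
Stage = Callable[[Record], Record]


def classify_sender(record: Record) -> Record:
    """Toy rule standing in for a trained sender classifier (hypothetical)."""
    sender = str(record["sender"])
    category = "bank" if sender.endswith("BANK") else "other"
    return {**record, "category": category}


def tag_transaction(record: Record) -> Record:
    """Layered on top of classify_sender: only looks at bank messages."""
    text = str(record["text"]).lower()
    if record.get("category") == "bank" and "debited" in text:
        return {**record, "transaction": "debit"}
    return record


def run_pipeline(record: Record, stages: List[Stage]) -> Record:
    """Apply each stage in order, feeding every stage the previous output."""
    for stage in stages:
        record = stage(record)
    return record


if __name__ == "__main__":
    stages = [classify_sender, tag_transaction]
    # The sender ID and message text are invented for this example.
    sms = {"sender": "AX-HYPBANK", "text": "Your a/c XX1234 is debited with Rs 500.00"}
    print(run_pipeline(sms, stages))
```

Because stages only add keys and never mutate their input, appending a new stage to `stages` leaves the behaviour of the existing ones unchanged.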

SMSs in India present a rich source of personalized data for many business use-cases, especially in fintech and personal finance management. If you’re a data scientist or engineer interested in using SMSs or a similar dataset, you should find this talk interesting. More broadly, if you’re interested in the problem of designing machine learning systems that layer models onto each other in pipelines, then this talk is for you!

Read more on Paul’s talk or buy tickets for The Fifth Elephant 2017!

Have you got your tickets for The Fifth Elephant 2017 yet? Check out https://fifthelephant.in/2017/ for more updates!

Join the #fifthelephant community on Slack. Propose topics for meetups. Speak at The Fifth Elephant meetups throughout the year. If you work on deep learning and artificial intelligence, book tickets for Anthill Inside today! https://anthillinside.in/2017/

Slack: https://friends.hasgeek.com

Mailing list: http://eepurl.com/cygsfr
