Our Story

Tim Delisle · Published in Datalogue · Jun 20, 2017


My first job as a data scientist after graduating college was on the Data Science and Insights team at Merck. Our raison d’être was generating insights for the company: about pharmaceuticals, about the industry, and about our customers.

Get: Hard drives, envelopes and one-way connectors

So, we’d work all hours, scouring the world (from our desks, sadly) for interesting data sets to solve real business problems. In pharma, where I was working, that meant finding potential new life-saving drugs. But the sources we had to wrangle were as diverse as you could imagine, from FTP transfers and S3 buckets to, no joke, taking hard drives out of their envelopes and plugging them in.

Once we got our hands on the data, our work had just started. Inevitably, there were two problems with the data: its format, and its format (you read that correctly).

Me, feeling the pain.

Each database we looked at would be in a different technical format; across five sources, we’d be lucky to see only three formats. These ranged from tabular files to document stores to relational databases. Any data format you could imagine would be there.

So, we would use finicky one-way connectors to try to get all of our sources into a common denominator format. Hacky and slow, but it got us there.
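
For a sense of what that pattern looked like, here’s a minimal sketch using pandas as the common denominator; the file names, table name, and connection details are all hypothetical, not our actual pipeline:

```python
# A minimal sketch of funneling mixed sources into one common format
# (pandas DataFrames). File names and connections are hypothetical.
import sqlite3

import pandas as pd

frames = []

# Tabular: a CSV pulled down over FTP
frames.append(pd.read_csv("compounds.csv"))

# Document: newline-delimited JSON exported from an S3 bucket
frames.append(pd.read_json("trial_sites.jsonl", lines=True))

# Relational: a table from a database that arrived on a hard drive
with sqlite3.connect("legacy_dump.db") as conn:
    frames.append(pd.read_sql("SELECT * FROM investigators", conn))

# Each connector is one-way: it gets data in, but every new source
# means another bespoke loader to write and maintain.
```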

Know: Regular expressions

Once we had each database in the same technical format, the data itself still wasn’t in the same semantic format. One data scientist would write AU, another would write AUSTRALIA; one would write Sen, Sonia, another would write Sonia Sen; one would write 625 Avenue of the Americas New York NY 10011 and one would write {street:625 Avenue of the Americas} {city:New York} {state:NY} {zip:10011}… They all meant the same things, and we had to be able to identify them as belonging to the same ontology.

This was a lot harder to deal with. We would write regular expressions to try to identify all of the expressions of the same data. Of course, we’d then test the regexes, try to identify all of the edge cases, and build them in. But, inevitably, they didn’t catch all of the dates, names, and pharmaceutical information we were looking for. So, just finding the information was hard.
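
As a toy illustration (these are simplified patterns, not the ones we actually shipped), here’s the kind of matcher we’d write, along with the edge cases that would slip past it:

```python
import re

# A simplified date matcher: it handles the formats we thought of,
# then reality supplies the ones we didn't.
DATE_RE = re.compile(
    r"\b(\d{4}-\d{2}-\d{2}"          # 2017-06-20
    r"|\d{1,2}/\d{1,2}/\d{2,4})\b"   # 6/20/17, 06/20/2017
)

samples = [
    "2017-06-20",    # matches
    "6/20/17",       # matches
    "Jun 20, 2017",  # missed: spelled-out month
    "20.06.2017",    # missed: dotted European format
]
for s in samples:
    print(s, "->", bool(DATE_RE.search(s)))
```

Every miss meant another branch bolted onto the pattern, and the regex only ever got harder to read and to trust.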

Transform: Ad-hoc scripting

Transforming it? Even harder, and painfully manual. Writing bespoke scripts that were dependent on the regular expressions we wrote and the system that fed us data. Building dependencies. Testing, deploying, and hoping for the best.
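
The transforms themselves looked something like this sketch; the mapping and field names here are illustrative only, not our actual code:

```python
# A sketch of the kind of bespoke, brittle transform script we'd deploy.
COUNTRY_MAP = {"AU": "AUS", "AUSTRALIA": "AUS", "AUSTRALIE": "AUS"}

def normalize_record(record: dict) -> dict:
    out = dict(record)
    # Depends on the upstream regexes having found the right field at
    # all; an unmapped spelling silently falls through unchanged.
    raw = record.get("country", "").strip().upper()
    out["country"] = COUNTRY_MAP.get(raw, raw)
    return out

print(normalize_record({"name": "Sonia Sen", "country": "Australia"}))
# {'name': 'Sonia Sen', 'country': 'AUS'}
```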

I’d work around the clock, sometimes sleeping under my desk, trying to deliver insights, but 80–90% of my time was being spent on data janitorial work.

There had to be a better way. Some way to automate this.

The lightbulb moment

So I began working on my master’s thesis at Cornell Tech, on automated data preparation. I was working with Serge Belongie. He is a computer vision professor, so, to him, every problem is a computer vision problem. I may have been influenced by him a little, and started to look for applications of computer vision everywhere.

During my research, I was coming home late at night, and I was under-slept and over-caffeinated. I was scrawling my training data all over the whiteboards in my living room. Trying to find patterns. Trying to find ways of solving this problem.

Newly engaged, my fiancée at the time (now wife!) wanted everything to be neat and tidy, so naturally, she’d erase everything as soon as I fell asleep.

One day, I stumbled across a work in progress: a whiteboard that I had filled with phone numbers, partially erased. I couldn’t quite make out any of the phone numbers any more. But what was left ignited a spark that became the foundation of this company.

The Whiteboard.

What was left was none of the data itself; it was just the data’s structure. You couldn’t tell what the phone numbers were, but you could tell that they were phone numbers. There is a structure to a phone number that we, as humans, understand. But, more importantly, I realized that that structure could be made understandable by a machine.
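
To make the idea concrete, here’s a toy sketch (not the model we eventually built) of reducing a value to its structural signature:

```python
import re

def structure_signature(value: str) -> str:
    """Replace the data with its shape: digits -> d, letters -> a."""
    sig = re.sub(r"\d", "d", value)
    return re.sub(r"[A-Za-z]", "a", sig)

# The digits are gone, but the shape survives, and a machine can still
# recognize "a phone number was here".
print(structure_signature("(212) 555-0147"))  # (ddd) ddd-dddd
print(structure_signature("(917) 555-0199"))  # (ddd) ddd-dddd
```

Two different phone numbers collapse to the same signature, which is exactly what a model needs in order to learn the type of a value without memorizing the values themselves.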

Datalogue: getting data into the hands of the people who need it

Data is driving every business today, and at Datalogue we strongly believe in the power that data can bring. Financial data insights keep markets thriving. Hospitals use data to improve their processes and save lives. Governments use data to get services to those who need them most. Data creates real value in countless ways. And getting to the real value of data shouldn’t be hindered by the cost, time and friction of preparing data manually.

Automating the data janitorial work, the work that kept me up at night in my first job, means that governments, businesses and people can actually start impacting the world in a data-driven way.

This can’t happen if data scientists are spending 80% of their time sanitizing their data. And what started as solving the problems I had as a junior data scientist became Datalogue: building the automated pipelines that deliver data to data-hungry companies, instantly and on demand.

Our first demo day…
