The Deep Tech part of building a Deep Tech company (part 1)

Alan Mosca
Published in nPlan
5 min read · Nov 25, 2019

Background

For many people at nPlan, the story starts here: a 3-minute pitch by my co-founder Dev Amratia, in front of over five hundred of Europe's best venture capitalists. We told the world that construction programmes (or schedules, if you're not in the UK) were notoriously inaccurate, and that we knew how to fix it. After all, it's not news that most large construction projects are delayed and over budget.

Thanks to our efforts, a new solution was possible: one that allowed a planner (or scheduler) to understand the risk embedded in their project's timeline. And that very risk could be summarised for the project manager, for the person responsible for submitting a bid, for the CEO, and even for the clients of a construction company.

Our first team picture!

A not-so-small problem kept going round and round in my head: "How the hell are we supposed to be doing all of this?" Answering that question was, in essence, my job as technical co-founder.

We had a working prototype that demonstrated that the problem was tractable. We had also collected a substantial dataset (over one hundred thousand schedule files) that would allow us to train a good enough machine learning model.

Because we had about 1TB of data already, and a very complex problem at hand, I was constantly thinking about how to build the real model. You know, the one that you actually put into the product. The one that you sell to customers.

This story is about what happened before that day. It’s about what we did that allowed us to claim that we knew how to solve the $2T/year problem in construction. It’s a story that continues today and that we’re constantly writing.

Part one: Our first dataset

It was a very exciting day. I had just downloaded our first batch of construction schedules from a client: about ten thousand of them, to be precise. Each file had an average of two thousand activities, for a total of twenty million activities, give or take. Anyone familiar with machine learning would, on receiving such a dataset, sigh in relief: there was enough data to at least prove the learnability of the problem.

Fortunately, the ecosystem of software used to create the very large Gantt chart of a construction project is quite small. The consequence is that there were only a couple of file formats we needed to understand, which invited a second sigh of relief.

Building a preprocessor that put everything into a database was therefore conceptually easy, if somewhat time-consuming. I wasn't sure how stable our database schema was going to be, so for the time being I used MongoDB.
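
For the curious, here is a minimal sketch of that kind of preprocessor, using pymongo as the client library. The XML element and attribute names (Activity, Id, PlannedStart, and so on) are invented for illustration; the real schedule formats, and our real schema, were a fair bit messier.

```python
# Minimal sketch of a schedule preprocessor: parse one XML file and write
# its activities into MongoDB. Element/attribute names are hypothetical.
import xml.etree.ElementTree as ET
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
activities = client["schedules"]["activities"]

def load_schedule(path: str) -> int:
    """Parse one schedule file and insert its activities into MongoDB."""
    root = ET.parse(path).getroot()
    docs = []
    for node in root.iter("Activity"):  # hypothetical element name
        docs.append({
            "schedule_file": path,
            "activity_id": node.get("Id"),
            "name": node.get("Name"),
            "planned_start": node.get("PlannedStart"),
            "planned_finish": node.get("PlannedFinish"),
        })
    if docs:
        activities.insert_many(docs)
    return len(docs)
```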

Meanwhile, the dataset was already growing. By the time we had some rudimentary tools that could read this format, there were over one hundred thousand XML files that took up 1TB of disk space, and a resulting 800GB database.

Surprisingly (or not, if you think about it), the actual amount of disk space taken only shrank by 20%, even though we were discarding a lot of variables and pruning aggressively.

The first pipeline run took about five days just to load all the data. It was at this moment that it became clear to me that we wouldn't be able to throw together a quick solution (an MVP?) and roll it out quickly to the prospective clients who were giving us this data.

The first sign that we were processing close to full speed!

This was (and still is) scale problem no. 1 for nPlan: as we get data from more clients, the time it takes to reprocess everything from scratch grows linearly with the number of schedules.

I couldn't tell Dev to slow down traction; that's not how you build a successful startup. Parallel processing only helped a little (about a 50% speedup), because most of these operations were disk-bound.

Upgrading the disks on our cloud provider’s instances was painful, but at least we could now process our entire dataset in about 2 days. By then João, Josh, and Vahan had joined (we finally had an engineering team!), and I was very glad that they did.

We first realised that the preprocessing was embarrassingly parallel, and that we could shard the source data across multiple disks to speed up the loading. We could even split the work across several machines, or use tools like Apache Storm (TFX and Beam weren't around at the time).
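
To give a flavour of what that sharded loading looked like (this is a sketch, not our actual code), here is the idea with Python's multiprocessing: each worker processes its own slice of files and opens its own MongoDB connection, since clients shouldn't be shared across processes. The load_one function is a stand-in for the per-file preprocessor sketched earlier.

```python
# Sketch of embarrassingly parallel preprocessing: shard schedule files
# across worker processes. load_one is a hypothetical per-file loader.
from multiprocessing import Pool
from pathlib import Path
from pymongo import MongoClient

def load_one(path: str) -> int:
    """Each worker opens its own client, parses one file, and writes it out."""
    client = MongoClient("mongodb://localhost:27017")
    # ... parse the file and insert its activities, as in the earlier sketch ...
    client.close()
    return 1

def preprocess_all(schedule_dir: str, workers: int = 8) -> int:
    paths = [str(p) for p in Path(schedule_dir).glob("*.xml")]
    with Pool(processes=workers) as pool:
        return sum(pool.map(load_one, paths))
```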

This shifted the bottleneck to MongoDB write operations, so we had to become smarter about writing into the database. We only needed to update a record if something had actually changed, so we introduced schedule fingerprinting and checksum-based caching. That got us down to about half a day, which was an acceptable penalty to pay every time we had to reprocess the entire dataset.
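
The fingerprinting idea is simple enough to sketch: hash each schedule file and skip the write entirely if the stored fingerprint hasn't changed. The collection and field names below are made up for illustration.

```python
# Checksum-based caching sketch: only reprocess a schedule if its file
# contents have changed since the last run.
import hashlib
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
fingerprints = client["schedules"]["fingerprints"]

def file_fingerprint(path: str) -> str:
    """Checksum the raw schedule file in 1MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def needs_reprocessing(path: str) -> bool:
    """Return True (and record the new fingerprint) only if the file changed."""
    digest = file_fingerprint(path)
    cached = fingerprints.find_one({"_id": path})
    if cached and cached.get("sha256") == digest:
        return False
    fingerprints.update_one({"_id": path}, {"$set": {"sha256": digest}}, upsert=True)
    return True
```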

We still had a lot of processing to do before we got to the actual learning. We had a toy model (a random forest), trained on a small subset of the data, that we used to predict the outcomes of an activity, but we did not yet have a complete dataset.
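
That toy model was nothing exotic. Something along these lines, with scikit-learn, captures the spirit of it; the feature names and the delay target below are hypothetical, not our actual feature set.

```python
# Toy baseline sketch: a random forest on a small tabular extract of
# activities. Column names here are invented for illustration.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

df = pd.read_csv("activities_sample.csv")  # hypothetical small extract
features = ["planned_duration", "num_predecessors", "num_successors"]
X, y = df[features].to_numpy(), df["delay_days"].to_numpy()

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)
print("R^2 on held-out activities:", model.score(X_test, y_test))
```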

Transforming the data from MongoDB into numeric form is a truly embarrassingly parallel problem as well. We had no idea whether any line of code we were writing would survive for longer than a week, so we wrote our own little parallel implementation of map/reduce jobs in Python to get the data transformations done quickly, and managed to extract a 100GB file that looked like a supervised ML dataset. This is usually the starting point for an ML project, and we were already a few months in, having written an epic amount of code to get there.
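
The helper itself was nothing fancy. In spirit it looked roughly like this: map a function over items in worker processes, then fold the partial results together in the parent. The featurise transform and the fake documents below are purely illustrative.

```python
# Sketch of a tiny parallel map/reduce helper in Python. The transform and
# the example documents are hypothetical, for illustration only.
from functools import reduce
from multiprocessing import Pool
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")
R = TypeVar("R")

def parallel_map_reduce(
    map_fn: Callable[[T], R],
    reduce_fn: Callable[[R, R], R],
    items: Iterable[T],
    workers: int = 8,
) -> R:
    """Map over items in worker processes, then fold the results in the parent."""
    with Pool(processes=workers) as pool:
        mapped = pool.map(map_fn, list(items))
    return reduce(reduce_fn, mapped)

def featurise(doc: dict) -> list:
    """Hypothetical per-document transform: one numeric feature row per document."""
    return [[doc.get("planned_duration", 0), doc.get("num_predecessors", 0)]]

def concat(a: list, b: list) -> list:
    return a + b

if __name__ == "__main__":
    fake_docs = [{"planned_duration": d, "num_predecessors": d % 3} for d in range(5)]
    print(parallel_map_reduce(featurise, concat, fake_docs))
```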

Up next

In part two, I’ll talk about the part of the journey where we had to build ML models based on this dataset, and all the subsequent evolutions of the pipeline and the dataset.

In part three, I'll cover product. ML startups don't talk about product enough, and figuring out how to deliver value to clients is usually a very difficult part of the job.

Get in touch

If you think that doing machine learning at scale is exciting, we are hiring! Look at nplan.io/careers for current opportunities.

You can also get in touch at twitter.com/nplanHQ and twitter.com/nitbix
