Data Updates: Automatic or Manual?

Published in Coord · 6 min read · Aug 9, 2018

Just as when you’re buying a car, one question comes up repeatedly when we’re trying to keep our data fresh: automatic or manual? Should we update the data by hand, or should we write code that does it for us?

Coord takes in data from lots and lots of sources to power our mobility APIs. And you might be asking, why would we ever do data entry by hand? As software engineers, we certainly prefer writing code to doing data entry, but that doesn’t mean it’s always the right thing for the business. In fact, we end up doing more manual data updates at Coord than you’d expect.

So why does manual win out so often? It’s not the gas mileage. Read on for some of our reasons.

The Ski Rental Problem

There’s a problem in computer science called “the ski rental problem”. It works like this. Say you’re learning how to ski. You could either rent skis and have to pay every time you hit the slopes, or buy a pair and pay just once. The problem is, right now, you have no idea how much you’re going to ski. You could end up hating it — or even breaking your leg your second time out. So what do you do?

One winning strategy is to rent skis until you’ve spent as much on rentals as it would cost to buy a pair, and then buy. If you quit before you ever buy skis, you paid only for the rentals you actually used, which is the best you could have done. In the worst case, you buy skis and never use them again, and you’ve spent twice as much as you had to. It turns out that no deterministic strategy can guarantee better than that in the worst case.
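To make the "twice as much" bound concrete, here is a minimal sketch of the break-even strategy in Python. The function names and the prices are made up for illustration; the only thing taken from the problem itself is the rule "rent until rentals would exceed the purchase price, then buy."

```python
def break_even_spend(days_skied: int, rent_per_day: int, purchase_price: int) -> int:
    """Total spend if you rent while rental spend stays within the purchase price, then buy."""
    # Largest number of rental days whose total cost does not exceed the purchase price.
    rental_days = purchase_price // rent_per_day
    if days_skied <= rental_days:
        # You quit before ever buying: you only paid for rentals you used.
        return days_skied * rent_per_day
    # You rented for a while, then bought a pair on the next ski day.
    return rental_days * rent_per_day + purchase_price


def best_possible_spend(days_skied: int, rent_per_day: int, purchase_price: int) -> int:
    """What you would have paid if you had known days_skied in advance."""
    return min(days_skied * rent_per_day, purchase_price)


# Hypothetical prices: 50 per rental day, 500 for a pair of skis.
RENT, PRICE = 50, 500
worst_ratio = max(
    break_even_spend(days, RENT, PRICE) / best_possible_spend(days, RENT, PRICE)
    for days in range(1, 200)
)
```

With these numbers the worst case is skiing 11 days: ten rentals plus a purchase costs 1000, while buying on day one would have cost 500.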

When you start buying gear, it’s hard to know when to stop.

Data updates are kind of like skis in this problem. You can either take a little bit of time and effort to manually update data when it changes, or spend more time up front to build an automatic system. So this gives us a good rule of thumb: don’t code anything until you’ve spent as much time on manual updates as the code would take to write.

This also means that we can look at factors that will make manual or automatic updates more appealing:

If the automatic update code is really easy to write (say, because you’re reading from a well-documented and well-structured API), then you should switch to automatic sooner. Reading GBFS data for bike share systems into our API is straightforward enough that we built an importer for it very quickly. On the other hand, some toll data is only on the Internet in the form of images or PDFs, which would be very difficult to parse, and likely not worth our effort, so we read it manually.
Similarly, the harder each manual update is, the more appealing an automatic system should be. For example, inputting all of New York City’s parking signs into our Curbs API would require a Herculean effort, so we built the importer for this data as an automatic system from day one.
Finally, the less frequently you have to update the data, the longer you should stick with manual. Because bike share availability changes minute by minute, all of our bike share updates are automatic, whereas most toll prices change once a year or less.
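The three factors above can be put into a back-of-the-envelope calculation. This is only a sketch: the function names are ours, and every hour count and update frequency below is an invented placeholder, not a real figure from our systems.

```python
import math


def updates_before_switching(build_hours: float, manual_hours_per_update: float) -> int:
    """Number of manual updates whose total effort equals the cost of writing the importer."""
    return math.ceil(build_hours / manual_hours_per_update)


def years_before_switching(
    build_hours: float, manual_hours_per_update: float, updates_per_year: float
) -> float:
    """How long you would stay manual under the rule of thumb, given the update frequency."""
    return updates_before_switching(build_hours, manual_hours_per_update) / updates_per_year


# Placeholder numbers for illustration only.
# Hard-to-parse source, easy manual update, changes about once a year:
toll_years = years_before_switching(build_hours=80, manual_hours_per_update=1, updates_per_year=1)
# Easy-to-parse source that changes every minute (525,600 minutes per year):
bike_years = years_before_switching(build_hours=8, manual_hours_per_update=0.25, updates_per_year=525_600)
```

Under these made-up inputs the toll source stays manual for decades, while the bike share source crosses the threshold in well under a day, which matches the intuition in the list above.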

These may be obvious insights in retrospect, but it’s good to have a framework for considering the question.

Automatic Isn’t So Automatic

The ski rental problem misses a key factor: just like skis, data import scripts take maintenance too. Just because you’ve written code doesn’t mean your job is done. Say we have an API that we read to get real-time toll data. What happens if the API endpoint changes, or if the response format gets updated? When you’re writing code, it’s important to take maintenance time into account.
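One rough way to account for maintenance is to treat it as a recurring cost that eats into the time the importer saves. The sketch below assumes maintenance can be estimated as a flat number of hours per year, which is a simplification; the function name and all numbers are hypothetical.

```python
import math


def break_even_years(
    build_hours: float,
    manual_hours_per_year: float,
    maintenance_hours_per_year: float,
) -> float:
    """Years until an importer has saved as much time as it cost to build.

    Returns math.inf when maintenance costs at least as much as the manual
    work it replaces, meaning the importer never pays for itself.
    """
    yearly_savings = manual_hours_per_year - maintenance_hours_per_year
    if yearly_savings <= 0:
        return math.inf
    return build_hours / yearly_savings


# Hypothetical: 40 hours to build, replaces 30 hours of manual work per year,
# but needs 10 hours per year of fixes when the upstream format changes.
pays_off_in = break_even_years(40, 30, 10)

# Hypothetical: the upstream source changes so often that upkeep exceeds the savings.
never_pays_off = break_even_years(40, 10, 12)
```

The second case is the one the plain ski rental framing hides: an importer that costs more to keep alive than the data entry it replaces.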

This is another reason why it’s good to do manual updates for a while before scripting data imports: you can get a sense of how the data looks, how regular it is, and how much maintenance it would take if you were to import it automatically. There’s no substitute for really understanding your input data, and doing the first few imports manually is one of the best ways to get there.

Manual Isn’t All Manual

By the same token, just because you have manual work as part of your data update process doesn’t mean you don’t have any code to write. There are often ways to build computer-assisted systems that get most of the benefit of automatic imports with a fraction of the cost.

One of the best examples for us is figuring out when toll rates change. Even if we don’t want to automatically figure out what the current toll rates are from an agency’s website, we can still check whether they are different from what they were before. So we built an automatic system that alerts us whenever a toll agency updates their website in a way that might indicate a new rate. This way, we only have to do manual data entry when something changes.
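A common way to build this kind of change detector is to store a fingerprint of the page and compare it on each check. The sketch below shows that general technique, not a description of our actual alerting system; fetching the page and sending the alert are left out, and the function names are our own.

```python
import hashlib


def fingerprint(page_text: str) -> str:
    """Hash of the page text with runs of whitespace collapsed.

    Collapsing whitespace avoids false alarms from purely cosmetic changes
    such as reindented markup.
    """
    normalized = " ".join(page_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def has_changed(page_text: str, previous_fingerprint: str) -> bool:
    """True if the page differs from the version that produced previous_fingerprint."""
    return fingerprint(page_text) != previous_fingerprint


# Illustrative page contents, not real toll data.
stored = fingerprint("Passenger cars: $5.00")
same_page = has_changed("Passenger cars:   $5.00\n", stored)   # whitespace only
new_rate = has_changed("Passenger cars: $5.50", stored)        # rate edited
```

A check like this cannot tell a rate change from any other edit to the page, so a human still has to look at each alert. That is the point: the code only decides when a person needs to look.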

We also build scripts that let us enter the data in a much more convenient format that we automatically convert to our backend data structures. This keeps the burden of manual updates lower.
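As an illustration of the idea, here is a minimal sketch that turns a hand-edited CSV sheet into typed records. The column names, the TollRate type, and the sample rows are all invented for this example and do not reflect our real backend schema.

```python
import csv
import io
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class TollRate:
    """Hypothetical backend record for one toll rate."""
    facility: str
    vehicle_class: str
    price_cents: int


def parse_toll_sheet(sheet_text: str) -> list[TollRate]:
    """Convert a CSV sheet with facility, vehicle_class, price_usd columns into records."""
    rows = csv.DictReader(io.StringIO(sheet_text))
    return [
        TollRate(
            facility=row["facility"].strip(),
            vehicle_class=row["vehicle_class"].strip(),
            # Decimal avoids floating-point rounding when converting dollars to cents.
            price_cents=int(Decimal(row["price_usd"]) * 100),
        )
        for row in rows
    ]


# Made-up sample of what a person might type into a spreadsheet.
SHEET = """facility,vehicle_class,price_usd
Example Bridge,car,6.12
Example Bridge,truck,15.50
"""

rates = parse_toll_sheet(SHEET)
```

The person doing the update works in a format that is easy to type and review, and the script handles the conversion into whatever shape the backend expects.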

Curbs: a Hybrid of Manual and Automatic

Much of our curb data comes from surveys, where humans walk the curb and take pictures of parking signs and other relevant features. This is in many ways the ultimate in manual data updates. So, we get a lot of questions about how we keep our curb data fresh.

It’s important to note that, even in cities where we survey the curbs, we still get a lot of data from the cities. This can include things like temporary parking permits or parking meter rates. So much of our curb data is, in fact, updated continuously using automatic processes.

In fact, since cities have to actually paint new signs in order to change parking regulations, it is sometimes possible to get a digital feed of curb feature updates from city governments even in cases where the city doesn’t know all the regulations that exist today. This is another good example of the blurry line between automatic and manual data collection.

Conclusion

The more data sources we explore, the more we realize that the right way to import data in our system depends on many different factors. By being flexible about our systems and methods, we can focus on bringing the best data to users, no matter where it comes from or how we get it.

If you face this same question, here’s our best advice:

Remember the ski rental problem: try manual first if possible.
Ask yourself how often you need to update the data and how much work is involved.
Figure out how hard it would be to build and maintain an automated system.
If an automated system would be cost-prohibitive, consider whether there’s work you could do to make manual updates easier.