The Future of Data Engineering

What would happen after SQL is no more.

Dima
Live Long and Prosper

--

I believe SQL would:

  • Exist in 5 years,
  • Exist as legacy in 10 years, and
  • Go extinct in 15 years.

What would replace it?

A language designed to be pre-optimized for real-time querying.

Say we want a rolling count of visitors to a certain page over the past hour.

At a high level, it’s one SQL query:

SELECT COUNT(*) FROM visits WHERE timestamp > CUTOFF

One may well live with such an implementation for the first prototype.

Going deeper, this design obviously doesn’t scale. An expiring, bucketing- or sliding-window-based algorithm is the way to go.
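The bucketing approach can be sketched in a few lines. This is a hypothetical illustration, not anything from the post itself: the `RollingCounter` name and the per-second bucket granularity are my assumptions.

```python
import time
from collections import deque

class RollingCounter:
    """Rolling visitor count over a sliding window, using per-second buckets."""

    def __init__(self, window_seconds=3600):
        self.window = window_seconds
        self.buckets = deque()  # (second, count) pairs, oldest first

    def record_visit(self, now=None):
        second = int(now if now is not None else time.time())
        if self.buckets and self.buckets[-1][0] == second:
            # Same second as the newest bucket: just bump its count.
            self.buckets[-1] = (second, self.buckets[-1][1] + 1)
        else:
            self.buckets.append((second, 1))

    def count(self, now=None):
        # Expire buckets that slid out of the window, then sum the rest.
        cutoff = (now if now is not None else time.time()) - self.window
        while self.buckets and self.buckets[0][0] <= cutoff:
            self.buckets.popleft()
        return sum(c for _, c in self.buckets)
```

Each query touches only the buckets inside the window rather than every row ever recorded, and memory stays bounded by the window size.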

It would soon become clear that the naive query and the optimized algorithm can and should be implemented in the same language.

In other words, a language would be implemented and popularized in which a syntax as simple as the SELECT above internally gets “compiled” into a sliding-window-based algorithm.

What language would it be?

I’d bet on a clean, lazily evaluated one, like Haskell.

LINQ and its clones would do.
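As a toy illustration of that lazily evaluated, LINQ-like flavor, here is a Python generator pipeline; the `visits` data and the predicates are made up. Composing the filters merely describes the query, and nothing executes until a terminal operation forces it:

```python
visits = [("/home", 100), ("/about", 200), ("/home", 3700)]
cutoff = 3600

# Each step builds a lazy generator; no data is scanned yet.
recent = (v for v in visits if v[1] > cutoff)
home_hits = (v for v in recent if v[0] == "/home")

# Only this terminal sum() pulls data through the pipeline.
count = sum(1 for _ in home_hits)
```

This deferred-execution shape is what could let a compiler swap the naive scan for an incremental algorithm behind the same surface syntax.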

Syntax-wise, http://rethinkdb.com/ makes it look neat.

As a conceptual change, the set of queries to run efficiently would become part of the DB “schema”.

They would have to be either known beforehand, or would take some time to “propagate” — i.e., to replay existing data against the algorithms that maintain efficient internal data structures.
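That replay, or “propagation”, is just a fold of historical events into the internal state. A minimal sketch, with a made-up `replay` helper and `(page, timestamp)` event shape:

```python
from collections import Counter

def replay(events, aggregate=None):
    """Fold (page, timestamp) events into a per-page visit counter."""
    aggregate = aggregate if aggregate is not None else Counter()
    for page, _ts in events:
        aggregate[page] += 1
    return aggregate

# One-off catch-up over existing data...
history = [("/home", 1), ("/about", 2), ("/home", 3)]
state = replay(history)

# ...then the same function keeps folding live events.
state = replay([("/home", 4)], state)
```

The point is that catch-up over old data and steady-state processing of live data run through the identical code path.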

Of course, while prototyping, one could run any query in “interactive mode”. This makes it no different from common MapReduce frameworks — except it is implemented in a language that enables the prototype code to be production-ready once proven feasible.

The jobs of DBAs and DB SREs would change.

Their daily work would be to keep the cluster running both production and sandbox databases in parallel.

Their jobs would be to keep the production database sane and holding the load with single-digit-millisecond latency, while making sure that developers can run their “slow” queries against data that either reflects the live site or, at least, is not too far behind it.

One could work on snapshots as well, for faster, repeatable results.

SREs and DBAs would be the people in charge of approving a new real-time “query” to enter the “schema”. Since each new query might require non-trivial data structures to be added internally, reviewing queries to ensure that disk, memory, and network profiles don’t get bloated would be a major part of the job of these new-generation SREs and DBAs.

DBAs would take care of the underlying algorithms, and perhaps manually optimize them. The programming language would allow proving mathematically that those “rewritten” queries mean the same as the original ones.

“Rewritten” queries may be rather complex: for the sake of efficiency, they might span different “tables” of real-time pre-processed results, or even create those tables as necessary. DBAs would certainly work closely with the engineering team, since oftentimes slightly modifying a query allows it to be computed significantly more cheaply. For example, a machine learning algorithm is unlikely to need specifically the 95th percentile; a cheaper-to-compute statistic might well do.
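To make the cost difference concrete: an exact 95th percentile needs every value (or a full sort), while a fixed-bucket histogram approximates it in constant memory. A sketch under assumed bucket bounds; the latency values below are invented:

```python
import bisect

BOUNDS = [1, 2, 5, 10, 20, 50, 100, 200, 500]  # assumed latency buckets, ms

def make_histogram():
    return [0] * (len(BOUNDS) + 1)  # one extra bucket for values > BOUNDS[-1]

def add(hist, value):
    # O(log buckets) per observation, O(1) memory overall.
    hist[bisect.bisect_left(BOUNDS, value)] += 1

def approx_percentile(hist, q):
    """Upper bound of the bucket holding the q-th percentile (0 < q <= 1)."""
    target = q * sum(hist)
    seen = 0
    for i, count in enumerate(hist):
        seen += count
        if seen >= target:
            return BOUNDS[i] if i < len(BOUNDS) else float("inf")
    return float("inf")
```

The answer is only as precise as the bucket bounds, which is often an acceptable trade for never having to store the raw values.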

SREs would keep an eye on how those tables are sharded. Designing real-time-sensitive dataflows between replicas, especially when fault tolerance is an integral part of the requirements, is a non-trivial task.

And the permissions to run certain types of queries would be controlled by the SREs.

In a well-designed engineering organization, there would be no need to run MapReduces or other types of jobs. One would just define a new type of “table”, get it [pre-]approved by an SRE on the sandbox cluster, and be ready to start using data from it.

Even real-time data!

The job of data-driven software architects and developers would change.

After this happens, we would have a world where machine learning engineers and data scientists see the “time to market” of their features shrunk from weeks and months to hours and days.

And they would not have to worry about bugs and logical errors when converting slow-but-correct statistics-computation algorithms into fast-and-caching ones that can be used in production.

I am very much looking forward to this world.

--