Thank you HN: 6 insights from the TimescaleDB launch

Andrew Staller
Published in Timescale
9 min read · Apr 13, 2017

Last week we launched TimescaleDB in beta with our first blog post, and then posted it to Hacker News.

Since then, our HN post received over 300 points with 130+ comments, our blog post was viewed over 20,000 times, and our project surpassed 1,000 stars on GitHub. (What a week!)

300+ points!
1000+ stars!

Our HN post also generated a lot of great comments (some positive, some negative, but all equally valuable). In particular, there are 6 themes we noticed that we’d like to discuss in detail.

But first, a big thank you to the HN community! We could not have reached these milestones without your support.

And this is just the beginning. In the coming months, we plan to: describe our architecture, design decisions, and motivation in detail; release new optimizations and features; publish (and open source) more benchmarks; integrate with important 3rd party tools (e.g., Grafana); and more. Your feedback and support will be invaluable as we continue to improve and grow.

Here are 6 insights from our HN post comments:

1. TimescaleDB challenges some database assumptions

Looking back on these last several days, the biggest thing that stood out was the healthy nature of the discussion. People are very passionate about databases and have deeply held beliefs, some of which are challenged by TimescaleDB.

Here is one question that was on a lot of people’s minds: Really, a row-based store for time-series data?

While I appreciate PostgreSQL every day, am I the only one who thinks this is a rather bad idea? The row based engine of PG is the antithesis of efficient storage and retrieval of timeseries of similar patterns, yielding almost no compression. Columnar storage should naturally be much better at that (BigQuery, Redshift, Citus), culminating in purpose built stores like Influx, Prometheus or KDB. Prometheus for example manages to compress an average 64-bit FP sample including metadata to just 1.3–3.3 bytes depending on engine. As most DB stuff is I/O bound, that usually translates into at least an order of magnitude faster lookups. (via endymi0n)

While column-oriented data stores are good for some types of queries (e.g., key-value lookups, rollups on a single field), they are quite limiting or inefficient when it comes to more complex queries (e.g., complex predicates involving multiple columns). It’s for this query power that we are betting on PostgreSQL.
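
To make that concrete, here is a minimal sketch of the kind of multi-column query we have in mind. The table, columns, and values are hypothetical, and the hypertable setup follows the pattern in our README:

```sql
-- Hypothetical sensor table; a TimescaleDB hypertable is still a
-- regular PostgreSQL table, so standard SQL applies throughout.
CREATE TABLE conditions (
  time        TIMESTAMPTZ NOT NULL,
  device_id   TEXT        NOT NULL,
  location    TEXT,
  temperature DOUBLE PRECISION,
  humidity    DOUBLE PRECISION
);

-- Turn it into a hypertable partitioned on the time column
-- (using the create_hypertable helper from the TimescaleDB README).
SELECT create_hypertable('conditions', 'time');

-- A query whose predicate spans several non-key columns, which is
-- awkward to express in a pure key-value or single-metric store:
SELECT device_id, avg(temperature) AS avg_temp
FROM conditions
WHERE time > NOW() - INTERVAL '24 hours'
  AND location = 'lab-3'
  AND humidity > 80
GROUP BY device_id
ORDER BY avg_temp DESC;
```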

We also believe that having access to a full SQL interface is powerful. This response to endymi0n’s comment captures that sentiment:

Not all time series data is metrics (which is what Influx & Prometheus are adapted for). Any kind of audit log — to support a customer service console, for example — is going to be a time series of fairly complex, multi-field records. Orders (and order modifications), support requests, payouts and transfers… Most of business activity falls under the aegis of time series, nominally immutable, even when it’s not numbers. Working with this kind of data is definitely helped by a comprehensive SQL dialect.

To take a contrary position: whose infrastructure is so large that a specialized SQL and storage engine (like Influx) is necessary? Actually not that many companies…so why does infrastructure always end up with these special snowflake versions of things? (via solidsnack9000)

(As an aside: We do think TimescaleDB works well for metrics, but are working on making it easier. More in #4 below.)

That said, one reason why NoSQL databases are popular is that not everyone likes SQL:

Good time series performance is more than just using column-based storage. You also need a query language to take advantage of this and the ordering guarantees it gives you. SQL while it has tried to reinvent itself, is a very poor language for querying TS databases. (via jnordwick)

Although we believe SQL is a very rich language for time-series data, we do recognize that SQL is not for everyone (e.g., the popularity of PromQL). That said, SQL is quite powerful, and a lot of developers are familiar with it:

Wow, lots of very critical comments. I for one think this is a very good idea. I have a use case right now that pretty much fits the manifesto of timescale. Having the power of SQL is very atteactive [sic]. I’m looking to move my current setup to timescale. Will let you guys know how it goes. (via artellectual)

In fact, being able to apply SQL on time-series data may also force you to rethink how to most efficiently store your data, e.g., to take advantage of features like JOINs:

Joins against time series data sound really nice. (via mrkurt)

(Yes, we think so too.)
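
For readers who want to picture it, here is a hedged sketch that joins the hypothetical conditions hypertable from above against an ordinary relational table of device metadata:

```sql
-- Hypothetical metadata table living alongside the time-series data.
CREATE TABLE devices (
  device_id TEXT PRIMARY KEY,
  model     TEXT,
  owner     TEXT
);

-- Hourly temperature rollup per device model: a JOIN plus a GROUP BY,
-- written in plain PostgreSQL SQL against the hypertable from above.
SELECT date_trunc('hour', c.time) AS hour,
       d.model,
       avg(c.temperature)         AS avg_temp
FROM conditions c
JOIN devices d ON d.device_id = c.device_id
WHERE c.time > NOW() - INTERVAL '7 days'
GROUP BY hour, d.model
ORDER BY hour, d.model;
```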

2. Lies, damned lies, and benchmarks

Benchmarks are the double-edged sword of database evaluation. On the one hand, with so many database options out there, benchmarks provide a degree of objective comparison. Yet, as we have all seen, benchmarks can be cherry-picked and tuned in a way that doesn’t tell the whole story (similar to the complaint about statistics in general).

What do you plan on doing for benchmarking? I don’t expect you to get something like STAC done, but will you try to find general benchmarks that others use?

I’ve been working with TS databases for a long time now, and it never fails that every database vendor always has benchmarks showing they are the best (no put that more bluntly, when you come out with your own benchmark suite and you are the fastest/smallest/bestest I won’t be surprised or believe you).

I don’t expect you to be the fastest when having a row-oriented architecture, and it would be an unfair comparison against the non-free databases, but I would like realistic numbers.

Actually, if you came out 2nd to 3rd against competitors’ workloads, I would be far more impresses [sic]. (via jnordwick)

We agree that cherry-picked benchmarks are a bit of a pox on our industry (and have ourselves experienced the annoyance of not being able to reproduce previously-touted performance). We are doing our best to be transparent with benchmarks, and will soon publish broader results that hopefully cast our database in an honest light (e.g., won’t always present TimescaleDB in a positive way). We also expect those results to clarify how the tradeoffs between row- and column-oriented stores are more complex and nuanced than many think (i.e., your data model and query workloads matter a lot).

Evaluating and comparing database systems is also a lot more than just comparing benchmark numbers. For example, PostgreSQL has many features (e.g., complex data types, geospatial support via PostGIS) that many NoSQL databases don’t even support. Operational management (including backups, disaster recovery, etc.) is also something that benchmarks don’t capture. On the other hand, PostgreSQL is worse on data compression and any fair evaluation must consider that as well. Our goal going forward will be to provide more holistic system evaluations to the broader community (which also lets us scratch our scientific itch).

Stay tuned for more benchmarks and evaluations soon.

3. The PostgreSQL community is vibrant

One of the biggest things we took away from attending PGConf a couple of weeks ago was the vibrancy of the community. While PostgreSQL has been around for 20+ years, the project is still quite active and feels like it is “coming of age” (with version 10 slated for release later this year). As Amazon announced at the same conference, the top 3 fastest-growing AWS products are all PostgreSQL-related: Redshift, RDS, and Aurora.

Members of the PostgreSQL community are also quite helpful to each other. When some commenters asked about other PostgreSQL extensions and features, other members of the community were kind enough to jump in (e.g., this very detailed response from ozgune, CTO of Citus Data, or this suggestion from X86BSD on ZFS as a compression option).

In fact, quite a few people are already trying to store time-series data in PostgreSQL:

This looks nice. I’ve had to roll my own PostgreSQL schema and custom functions a few times for timeseries data and if this prevents the need for that I’m impressed. (via Dangeranger)

I was investigating the same topic (PG based timeseries database) for a stock tick data project, would definitely give timescaledb a try. (via wsxiaoys)

(If this is you, please give TimescaleDB a try and let us do the heavy lifting for you.)

4. Schema management can be annoying

For many use cases, it is convenient to just throw some data at your time-series database without jumping through hoops to define and manage schemas. This sentiment was captured well in the following comment:

I quite like Postgresql (and deploy it all the time), and I’m no fan of nosql stuff, which just means you don’t have to properly analyze your database structure before-hand, but with time-series it’s different matter. The data you tend to send to generic time-series databases tends to be very unpredictable. I currently don’t care what data is sent to Prometheus or Influx. This includes, but is not limited to ZFS stats of our storage, system load, network traffic, VMWare, nginx/haproxy, application usage and errors, … I know that when I’ll need it, I’ll have it available and can try to correlate the data at any point in the future. In TimescaleDB it looks like I would have to pre-create tables fitting all those stats, which would make it an absolute pain in the ass. (via koffiezet)

We recognize these concerns. That said, most databases (including many that claim to be “schema-less”) actually have some type of schema internally. Columns often cannot mix datatypes, e.g., integers cannot suddenly store string values. Basic datatypes are necessary for good reasons, such as being able to perform aggregations on integers or define predicates on labels. For similar reasons, a database needs some structure to ingest data for computations, e.g., the average CPU usage across a cluster of servers.

What the comment is really saying (as we see it) is that the schema should be implicit from the structure of the ingested data, and that the schema should be able to expand to accommodate additional columns in the future. In other words, sometimes you just want to throw time-series data into a store without worrying about managing the schema.

Even though we already support semi-structured data through PostgreSQL’s native JSON and JSONB datatypes, we plan to explore HTTP interfaces that will give the option to auto-define and auto-manage a structured schema.

(To be completely honest, this wasn’t high on our roadmap before, but now, thanks to your feedback, we’re already working on it.)
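
In the meantime, here is a minimal sketch of the JSONB route mentioned above (the table and payload fields are hypothetical):

```sql
-- Hypothetical metrics table: a typed time column plus a free-form
-- JSONB payload for whatever fields a collector happens to send.
CREATE TABLE metrics (
  time    TIMESTAMPTZ NOT NULL,
  source  TEXT        NOT NULL,
  payload JSONB
);
SELECT create_hypertable('metrics', 'time');

-- Standard PostgreSQL JSONB operators still work: pull a numeric
-- field out of the payload and average it per source.
SELECT source,
       avg((payload->>'cpu_load')::DOUBLE PRECISION) AS avg_cpu
FROM metrics
WHERE time > NOW() - INTERVAL '1 hour'
  AND payload ? 'cpu_load'
GROUP BY source;
```

The tradeoff, of course, is that fields inside the payload are not typed or validated the way regular columns are, which is part of what the auto-managed schema work aims to address.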

5. Did you really need to build “yet another database”?

This one we’ve heard many times before, and saw again on our HN post:

At my last company we did tens of billions of sensor events per day into Cassandra and I thank god our engineering team was smart enough to spend manpower on product instead of writing yet another database (via jjirsa)

More options dilute talent and stifle innovation — people spend time writing new tools instead of advancing state of the art

But by all means, it’s not my money being burned, so I’m not losing sleep over it. I’ll just sit here and giggle as people talk about scaling Postgres and in my mind all I hear is single point of failure masters and vacuum to prevent wraparound in high write workflows (via jjirsa)

In the abstract, we agree with the author: most of the time, there’s no need to reinvent the wheel. (And this is part of the reason why we chose to engineer up from PostgreSQL, as opposed to building an entirely new DBMS from scratch.)

We also believe that database teams like ours should focus on improving the database state of the art so that product teams like his can focus on the product.

Questioning and rethinking the status quo is how innovation happens, especially within the database industry. Cassandra, PostgreSQL, and every other successful database were developed by engineers questioning the status quo.

6. You can’t make everyone happy

You can’t make everyone happy, even when it comes to writing style. Take these two opposing comments:

Loved this article, as a competing database company, they did a fantastic job relating to developers and being authentic! Great job, please keep this up, it will definitely make you a winner. (via marknadal)

I really hate this stile [sic] of writing. Why does it have to sound like every other hipster it text? (via sigi45)

In fact, both of our founders have very different writing styles themselves (and they first started writing papers together nearly 20 years ago).

For those who liked the writing, don’t worry, there will be more posts in that style.

And for those who wanted something “less hipster”: don’t worry, we’ll have articles like that as well. Including a technical post that our part-CTO / part-Professor Mike Freedman is already working on (coming soon!).

Like this post? Please recommend and/or share.

Want to learn more? Join our Slack community, follow us here on Medium, check out our GitHub, and sign up for the community mailing list below.
