Time zones and Flux — Part II

Part II: Time zones in practice: Flux

MarcoDB
8 min read · Jun 6, 2022

This is the second part of a two-part series on time zones. You can find the first article in the series here.

If you’ve come across the Flux timezone package, you might be wondering what it does, exactly. The documentation is not very generous with use cases and explanations, though InfluxData offers a short blog post on the subject. There you learn how option location, in conjunction with the timezone package, can help define day boundaries for functions like aggregateWindow(). This is huge, but the full story is a bit more complicated than that…
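Before diving into the complications, here is a minimal sketch of the basic pattern that blog post describes (the bucket and measurement names are hypothetical). With option location set, each 1d window produced by aggregateWindow() starts at midnight Los Angeles time rather than midnight UTC:

import "timezone"

option location = timezone.location(name: "America/Los_Angeles")

from(bucket: "example-bucket")  // hypothetical bucket
    |> range(start: -7d)
    |> filter(fn: (r) => r._measurement == "m")  // hypothetical measurement
    |> aggregateWindow(every: 1d, fn: mean)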

First off, the timezone package offers two functions: timezone.fixed(), which deals with fixed offsets, and timezone.location(), which deals with IANA time zone identifiers (location-based). If you’ve read the previous article in this series, then you already know I’m not familiar with use cases for fixed offsets, so our focus will be on location-based time zones.
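For reference, both functions return a record that can be assigned to option location. A minimal sketch:

import "timezone"

// A fixed offset: always UTC-8, with no DST rules
option location = timezone.fixed(offset: -8h)

// ...or a location-based time zone, where offsets and DST rules come
// from the IANA time zone database:
// option location = timezone.location(name: "America/Los_Angeles")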

One thing that might sound surprising at first is that when you set option location, the Flux output continues to be displayed in UTC. This actually makes a lot of sense: you should expect any server to return standardized, machine-readable data, and the client to take responsibility for formatting that data for ease of user interpretation. A Flux query runs on a server, so end-user formatting is not its job.

Is “option location” really the end of the story?

The main role of a query on a time series database is to process time series data, and as anticipated above, that includes a correct understanding of day boundaries. Say, for example, that you want to process only data within a “business hours” window. hourSelection() can be helpful for that, but only if we can instruct it to interpret the business hours in local time.
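Something along these lines, a sketch with a hypothetical bucket and example hours, would keep only records between 09:00 and 17:00 Edmonton local time:

import "timezone"

option location = timezone.location(name: "America/Edmonton")

from(bucket: "example-bucket")  // hypothetical bucket
    |> range(start: -24h)
    |> hourSelection(start: 9, stop: 17)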

And this is where option location falls short, as it forces us to assume that all data can be processed with a single definition of “local time” (UTC or otherwise). In other words, it constrains a query to operate only on data from a specific location. What should you do if you’re ingesting data from multiple locations? Take for example a downsampling task to generate per-day summaries. Do you really need to manually build a separate downsampling task per location?

Luckily the answer to this last question seems to be “no”. If you look at the documentation for aggregateWindow(), you’ll find a location argument, which gives us hope. Never mind that as of 6/5/22 the location argument of aggregateWindow() is documented as being of string type (try it out, though: it actually takes a record, and it complains if you pass a string). Experimentally, you’ll also learn that most functions that deal with time seem to accept a location argument, even when the documentation makes no mention of it. For some examples, look up date.truncate(), date.hour() or hourSelection(). Experimentally we know they all accept a location argument, but as of 6/5/22 none of them mentions it in the documentation.

And this is where I should probably pause to warn you that everything I say below should be treated as baseless speculation (BS?) until the documentation is updated to guarantee that location is part of the “official contract” for these functions. I don’t work for InfluxData and I have no visibility into their roadmap. I’m just a user, practically desperate to make time zones work with InfluxDB, so I can only write about what seems to work based on my own experiments.

Per-function “location” override

What can we do with a location argument? A lot, sometimes, but in many cases, not so much. There are two classes of functions that operate on time:

  • Functions that operate on individual instances of time, like date.truncate() and date.hour()
  • Functions that operate on streams of tables, like hourSelection() and aggregateWindow()

The “location” argument and functions that operate on individual instances of time

Passing a location to functions that operate on individual instances of time is exactly what you want. Your table has a _time column and a timezone column, and you happily map() away each record:

import "date"
import "timezone"

data
    |> map(fn: (r) => ({ r with
        hour: date.hour(
            t: r._time,
            location: timezone.location(name: r.timezone)
        )
    }))

Each record gets precisely the interpretation that it needs.

Let’s try to exploit what we learned about Mountain Time, Arizona and DST in the previous article in this series to validate that Flux correctly interprets the corresponding IANA time zone identifiers:

import "date"
import "csv"
import "timezone"
// See https://www.timeanddate.com/time/us/arizona-no-dst.html#:~:text=cooler%20evening%20temperatures.-,War%20Time%20DST,-Arizona%20used%20DST
// for this choice of dates. Arizona doesn't use DST, but it did
// for one year in 1918
datesCsv = "
timeStr,tz
2022-05-28T01:45:00.000Z,America/Los_Angeles
2022-12-28T01:45:00.000Z,America/Los_Angeles
1918-05-28T00:45:00.000Z,America/Phoenix
2022-05-28T00:45:00.000Z,America/Phoenix
2022-12-28T00:45:00.000Z,America/Phoenix
1918-05-28T00:45:00.000Z,America/Edmonton
2022-05-28T00:45:00.000Z,America/Edmonton
2022-12-28T00:45:00.000Z,America/Edmonton
"
setTime = (tables=<-) => tables
|> map(fn: (r) => ({ r with _time: time(v: r.timeStr) }))
|> drop(columns: [ "timeStr" ])
allData = csv.from(csv: datesCsv, mode: "raw")
|> setTime()
|> map(fn: (r) => ({ r with
"date.hour()": date.hour(
t: r._time,
location: timezone.location(name: r.tz)
),
"date.truncate()": date.truncate(
t: r._time,
unit: 1d,
location: timezone.location(name: r.tz)
)
}) )
|> yield(name: "allData")

The output of this query is:

_time                 date.hour()  date.truncate()       tz
--------------------  -----------  --------------------  -------------------
2022-05-28T01:45:00Z  18           2022-05-27T07:00:00Z  America/Los_Angeles
2022-12-28T01:45:00Z  17           2022-12-27T08:00:00Z  America/Los_Angeles
1918-05-28T00:45:00Z  18           1918-05-27T06:00:00Z  America/Phoenix
2022-05-28T00:45:00Z  17           2022-05-27T07:00:00Z  America/Phoenix
2022-12-28T00:45:00Z  17           2022-12-27T07:00:00Z  America/Phoenix
1918-05-28T00:45:00Z  18           1918-05-27T06:00:00Z  America/Edmonton
2022-05-28T00:45:00Z  18           2022-05-27T06:00:00Z  America/Edmonton
2022-12-28T00:45:00Z  17           2022-12-27T07:00:00Z  America/Edmonton

Great news: InfluxDB recognizes that DST applies to Phoenix’s 1918-05-28T00:45:00Z, but not to the same date in 2022 (unlike America/Edmonton). date.hour() accurately reports a local hour of 17 or 18 for all locations, taking DST into account when appropriate. date.truncate() with argument unit: 1d shows us the UTC timestamp of “local time midnight”, accurately again.

The “location” argument and functions that operate on streams of tables

Functions that operate on streams of tables are a different story. If my table has data from multiple time zones, a single location argument forces me to treat all the data as sharing the same time zone, unless I take the extra effort to partition my data into separate tables, one per required location. I would have preferred to see a timezoneColumn argument, to explicitly state the time zone to apply to each record being evaluated. Luckily, this can be more or less obtained with a wrapper function, if we can get Flux to stay out of the way.

Let’s start with the same data we used in the previous example. We need to partition the stream of tables into three different streams of tables, one per time zone identifier. Luckily for us, Flux has recently introduced an experimental way to iterate over arrays: array.map(). So we just need an array of time zone identifiers to get started. Ordinarily, this would be easily done with a simple use of findColumn():

tzList = allData
    |> distinct(column: "tz")
    |> findColumn(fn: (key) => true, column: "_value")

OK, so… as I write this, that simple code above returns an empty array, at least when the processing that defines allData includes a yield() (it appears yield() somehow poisons allData in a way that makes findColumn() fail). Let’s not get discouraged by the fact that findColumn() is broken right now; I’m sure it’ll get fixed eventually. If not, there are ways to use reduce() to simulate it. I might write an article on that one of these days, but for now we’ll just need to assume that we have:

tzList = [
    "America/Los_Angeles",
    "America/Phoenix",
    "America/Edmonton"
]

Once that’s available, we can use array.map() to map tzList to an array of streams of tables, one per time zone in tzList. Each one of these streams of tables can safely use the location argument, since each one of these streams of tables is about a single time zone. It’s actually much easier to implement than to describe in words:

import "timezone"
import "experimental/array"  // newer Flux releases also offer array.map() in "array"

// Example business hours, interpreted in each record's local time
start = 9
stop = 17

selector = (tz, tables=<-) => tables
    |> filter(fn: (r) => r.tz == tz)
    |> hourSelection(
        start: start,
        stop: stop,
        location: timezone.location(name: tz)
    )

selectedDataList = tzList
    |> array.map(fn: (x) => allData |> selector(tz: x))

With this, we now have a selectedDataList array that contains the three streams of tables, one per time zone.

As you might have already guessed, the tables argument of the union() function expects an array of streams of tables, so a simple application of

union(tables: selectedDataList)

should recombine the data into a single stream of tables.
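One caveat worth noting: union() makes no promises about the row order of the combined output, so if anything downstream assumes time order, it’s worth sorting explicitly:

union(tables: selectedDataList)
    |> sort(columns: ["_time"])
    |> yield(name: "selectedData")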

Hurray for robustness!

I could have closed with the last sentence, but I feel extremely uncomfortable leaving without even mentioning the hidden “exercises for the reader” implied by the previous code sample.

First, be aware that your data must be flawless for this code to work. If you have one million records and a single record contains an invalid IANA time zone identifier, the corresponding invocation of a time function with the location argument will fail, and the lack of exception handling in Flux will do the rest: your entire task will fail, and you’ll have nothing but an error to show for it. “Flawless” doesn’t just mean that you must have performed full validation on all the time zone identifiers that get stored in InfluxDB. It also means that the “eventual consistency” of InfluxDB writes must have settled: if you get a single partial record with a missing time zone tag, you’re done.
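A partial mitigation, sketched below under this article’s assumptions, is to drop records whose tz tag never made it into the database before applying any per-record time zone logic. Note that this does not protect against identifiers that are present but invalid (e.g. misspelled):

// Keep only records that actually carry a tz tag
cleanData = allData
    |> filter(fn: (r) => exists r.tz)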

One has to wonder about the choice of being so “black and white” about query results. Would it be so bad to just drop the processing of the record with the bad data, but let the rest of the query survive?

Second, “use union()” is easier said than done. union() is another very “black and white” function, with a lot of arbitrariness built in. The tables argument must be an array of at least two elements. What’s wrong with an array of one element? It might be stretching it, but what’s wrong even with an empty array? So, yes, use union() when there are at least two time zones to process, but don’t use it when there’s a single time zone or when the data set is empty. There are ways to build code branching in Flux to support this logic (see the sketch below), but it’s verbose and not very intuitive.
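For the record, here is roughly what that branching can look like. This is only a sketch under this article’s assumptions; in particular, the empty-array fallback is a hypothetical placeholder that builds a throwaway table and filters everything out to obtain an empty stream:

import "array"

n = length(arr: selectedDataList)

recombined =
    if n >= 2 then
        union(tables: selectedDataList)
    else if n == 1 then
        selectedDataList[0]
    else
        // hypothetical empty-stream fallback
        array.from(rows: [{_time: now(), tz: ""}])
            |> filter(fn: (r) => false)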

Incidentally, before array.map() there was probably no imperative reason for union() to support an array of one or zero elements, with the usual excuse that “there are no use cases for it”. But with the introduction of array.map(), that is definitely no longer the case.

Third, “use union()” implies that your schema must be clean. union() tends to get upset when there’s a schema mismatch between tables. And this is maddening, because when you do a from() you can get data with an inconsistent schema: without going very far, even the _field / _value pattern requires the _value column to take as many types as necessary. So, yes, you can have a schema mismatch in your original data, but once you partition it by time zone as selectedDataList, you can’t merge it back with union(). This would be the main argument in begging InfluxData to give us a timezoneColumn argument instead of a location argument for functions that operate on streams of tables, because there are things that a “native” function can do which a Flux function can’t.
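One workaround, sketched here with a hypothetical bucket and field name, is to force a uniform schema before partitioning, for example by restricting the query to a single field so that _value has a single type:

uniformData = from(bucket: "example-bucket")  // hypothetical bucket
    |> range(start: -30d)
    |> filter(fn: (r) => r._field == "temperature")  // hypothetical field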

To close, yes, time zones in Flux kind of work, but there’s still a ways to go before we can really be comfortable using them. Things seem to be moving in the right direction though, and we can be hopeful.
