Very Cool Flight Feature: Query Batching
In case you are not aware, InfluxDB 3.0 is out. You can access it in InfluxDB Cloud Serverless or InfluxDB Cloud Dedicated today.
What you need to understand about InfluxDB 3.0 is that it is built on the Apache Arrow project. This means that it:
- Uses Apache Arrow for in-memory columnar data.
- Uses Apache Parquet for on disk columnar data.
- Uses Flight (and FlightSQL) as the primary query interface.
Flight is basically a client-server protocol that ingests queries and returns Arrow. One of the many advantages of being based on Apache Arrow is that InfluxDB users get the benefits of all of that upstream work. For example, there is a set of Arrow libraries for all of the major programming languages. I am, of course, most familiar with the Python library.
The other day Jay was helping me with some code, and he pointed out that the Arrow reader has built-in batching, which turns out to be super useful for the downsampler container I am writing.
Check out this function:
def write_downsampled_data(reader):
    row_count = 0
    try:
        while True:
            # read_chunk() returns the next batch of Arrow data
            # (plus any app metadata); it raises StopIteration
            # when the stream is exhausted
            batch, buff = reader.read_chunk()
            df = batch.to_pandas()
            row_count += df.shape[0]
            # drop the extra column that InfluxQL queries add
            if 'iox::measurement' in df.columns:
                df = df.drop('iox::measurement', axis=1)
            target_client.write(record=df,
                                data_frame_measurement_name=target_measurement,
                                data_frame_timestamp_column="time",
                                data_frame_tag_columns=tags.to_pylist())
    except StopIteration:
        # end of the stream: every chunk was written
        return True, row_count
    except Exception as e:
        print("write exception caught")
        print(e)
        return False, str(e)
After making a query and getting back a reader object, you pass that reader into this function. The function then reads “chunks” of data from the server and converts each one to Pandas, which makes it easy for me to drop the extra column (that column is a clue that I am using InfluxQL for the queries), and then uses the new influxdb3 Python client to write the downsampled data. When the last chunk has been read, the reader raises a StopIteration exception, and I return from the function.
This code is much more robust and efficient than if I tried to bring back all of the data at once. But, for me, the really cool thing is that this is all standard Arrow programming. As an InfluxDB 3.0 user, I get all of this high-performance computing and ease of use from a super high quality upstream project! As a developer, this is the kind of thing I live for :)