Hao Gao
Jul 21, 2017 · 1 min read

I am using the Kafka connector on a production Kafka cluster, but I usually do not query Kafka directly. We have ETL jobs that persist Kafka data to our data warehouse (S3) in Parquet format, so in most cases we just query the warehouse. Querying the warehouse is much, much faster.

So my suggestion is: do not query Kafka directly. If you have to query Kafka (e.g. for debugging), you can build an offset-to-timestamp mapping and use it as partition keys. Once you have this mapping, you can narrow down your Kafka scan range by time.
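As a minimal sketch of the mapping idea: assume you periodically sample (timestamp, offset) pairs from a partition (the sampling itself is not shown, and the sample data below is hypothetical). A sorted list plus binary search then turns a time window into an offset range:

```python
import bisect

def offsets_for_window(mapping, start_time, end_time):
    """Given a sorted list of (timestamp, offset) pairs sampled from a
    Kafka partition, return a (start_offset, end_offset) range that
    covers [start_time, end_time]."""
    timestamps = [ts for ts, _ in mapping]
    # Last sample at or before start_time bounds the scan from below.
    lo = bisect.bisect_right(timestamps, start_time) - 1
    start_offset = mapping[max(lo, 0)][1]
    # First sample at or after end_time bounds the scan from above.
    hi = bisect.bisect_left(timestamps, end_time)
    end_offset = mapping[min(hi, len(mapping) - 1)][1]
    return start_offset, end_offset

# Hypothetical mapping sampled every 1000 offsets:
mapping = [(1500000000, 0), (1500003600, 1000), (1500007200, 2000)]
print(offsets_for_window(mapping, 1500001000, 1500006000))  # -> (0, 2000)
```

The coarser the sampling, the wider (but still correct) the returned range; you trade mapping size for scan precision.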

select * from kafka_topic where _timestamp > start_time and _timestamp < end_time

The above query will only scan from start_time to end_time instead of scanning the whole Kafka topic.
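Once you have an offset range for the window, you can also prune the scan directly by offset. A small sketch of building such a query string; it assumes the connector exposes a `_partition_offset` internal column (check your connector version) and the table name is illustrative:

```python
def pruned_query(topic, start_offset, end_offset):
    """Build a Presto query that scans only [start_offset, end_offset)
    using the Kafka connector's _partition_offset internal column
    (assumed available in your connector version)."""
    return (
        f"SELECT * FROM {topic} "
        f"WHERE _partition_offset >= {start_offset} "
        f"AND _partition_offset < {end_offset}"
    )

print(pruned_query("kafka_topic", 0, 2000))
```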

So to answer your questions:

#1. Yes, by default it always reads from the beginning. If you have an offset-to-timestamp mapping, you can start reading from any point you want.

#2. No, but I found something interesting: https://www.qubole.com/blog/caching-presto/

#3. Just run the query again and you will get the latest messages. Presto is not a stream-processing framework; if you want to do analytics on a stream, maybe take a look at Flink?
