MySQL table to AWS S3 as a Parquet file and querying it with AWS Athena
For analysing big datasets and building reports, MySQL is not a natural choice because of its architecture. Since MySQL stores data in a row-based format, common operations such as aggregates and GROUP BYs may not perform well on large datasets.
This is where columnar storage formats like Apache Parquet (https://parquet.apache.org/) come to the rescue. But then the question is: how can we query and report on a file-based format with SQL? Fortunately, AWS S3 and Athena work very well together for exactly this scenario.
The architecture
Here I'm using Python, SQLAlchemy, pandas and PyArrow for this task.
I already have the query to be run, because I have built a Metabase-like application in Django REST framework; I'm not showing the query builder code here. After running the query, I want to store the data as a Parquet file for further processing.
The query to be run is now in the variable sql_query.
In the code below, engine is the SQLAlchemy connection to MySQL. organisation_id and report.id are two unique identifiers used to build the S3 location.
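The end-to-end flow can look roughly like the sketch below. The connection string, bucket name and key layout are placeholders, sql_query, organisation_id and report are assumed to come from the query-builder part of the application, and writing through pyarrow.fs.S3FileSystem is just one way to get the file into S3.

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
from pyarrow import fs
from sqlalchemy import create_engine

# engine is the SQLAlchemy connection to the MySQL database
# (connection string and credentials here are placeholders)
engine = create_engine("mysql+pymysql://user:password@mysql-host:3306/reports_db")

# run the query built earlier and load the result into a DataFrame
df = pd.read_sql(sql_query, engine)

# organisation_id and report.id keep the S3 location unique per report
bucket = "my-report-bucket"  # placeholder bucket name
s3_key = f"reports/{organisation_id}/{report.id}/data.parquet"
s3_url = f"s3://{bucket}/{s3_key}"

# convert the DataFrame to an Arrow table and write it to S3 as a Parquet file
table = pa.Table.from_pandas(df, preserve_index=False)
s3 = fs.S3FileSystem()  # picks up credentials from the usual AWS chain
pq.write_table(table, f"{bucket}/{s3_key}", filesystem=s3)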
table = pa.Table.from_pandas(df, preserve_index=False)
The line above also gives us the datatypes we need for creating the Athena table, but we still need to do some small regular-expression processing on them to fix a few issues.
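The exact substitutions depend on the columns in your data, so the snippet below is only an illustration of the idea: walk over table.schema and rewrite the Arrow type names into types that Athena's DDL accepts.

import re

def athena_columns(schema):
    # turn each Arrow field into an Athena column definition,
    # e.g. int64 -> bigint, timestamp[ns] -> timestamp
    columns = []
    for field in schema:
        dtype = str(field.type)
        dtype = re.sub(r"^int64$", "bigint", dtype)
        dtype = re.sub(r"^int32$", "int", dtype)
        dtype = re.sub(r"^timestamp\[.*\]$", "timestamp", dtype)
        dtype = re.sub(r"^(large_)?string$", "string", dtype)
        columns.append("`{}` {}".format(field.name, dtype))
    return ",\n".join(columns)

column_definitions = athena_columns(table.schema)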
The remaining pieces are three helper functions: create_athena_database, drop_athena_table and create_athena_table.
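Their exact implementation depends on the setup; the sketch below assumes Athena is reached through a SQLAlchemy engine created with the PyAthena dialect (the region, staging directory and database names are placeholders) and that column_definitions is the string built from table.schema above.

from sqlalchemy import create_engine, text

# Athena through SQLAlchemy; region and staging bucket are placeholders,
# credentials come from the usual AWS credential chain
athena_engine = create_engine(
    "awsathena+rest://:@athena.us-east-1.amazonaws.com:443/default"
    "?s3_staging_dir=s3://my-athena-query-results/"
)

def create_athena_database(database):
    with athena_engine.connect() as conn:
        conn.execute(text(f"CREATE DATABASE IF NOT EXISTS {database}"))

def drop_athena_table(database, table_name):
    with athena_engine.connect() as conn:
        conn.execute(text(f"DROP TABLE IF EXISTS {database}.{table_name}"))

def create_athena_table(database, table_name, column_definitions, s3_url):
    # Athena wants the S3 folder that contains the Parquet file, not the file itself
    location = s3_url.rsplit("/", 1)[0] + "/"
    ddl = f"""
    CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table_name} (
        {column_definitions}
    )
    STORED AS PARQUET
    LOCATION '{location}'
    """
    with athena_engine.connect() as conn:
        conn.execute(text(ddl))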
So, with the datatypes we have derived from the query, we create a table in Athena using the create_athena_table function, so that Athena can read the Parquet file at the location stored in the variable s3_url.
Connecting to Athena (and through it to the data in S3) is done using SQLAlchemy. Rather than altering the table when the query changes, I drop the table and create a new table definition.
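Putting it together, a run for one report might look like the lines below; the database name "reports" and the report_<id> table name are just examples.

# rebuild the table definition for this report and query it through Athena
create_athena_database("reports")
drop_athena_table("reports", f"report_{report.id}")
create_athena_table("reports", f"report_{report.id}", column_definitions, s3_url)

preview = pd.read_sql(f"SELECT * FROM reports.report_{report.id} LIMIT 10", athena_engine)
print(preview)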