Using Server Side Encryption on S3 with Dask

Eric Ness
When I Work Data
Nov 14, 2018

We use S3 to store our data lake at When I Work. This allows us to query the data with Presto or Athena and to access the files directly as well. Recently we started experimenting with Dask to distribute the processing of our largest data sets. Dask extends our favorite open-source data science tools, like numpy and pandas, to distribute processing across multiple computers.

Dask is a great tool for reading and processing data, but we ran into a snag when writing the data to our lake. We use AWS Server Side Encryption to add another layer of security onto our data storage. While Dask supports Server Side Encryption, it isn't clear from the documentation how to implement it. This blog post combines information that is spread across multiple sources into one place.

Code

The script below writes four JSON files to the S3 bucket that you specify, one file per partition in the dataframe. Although we use JSON format to store our data, this pattern works just as well with CSV files written with dask.dataframe.to_csv.
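Here is a minimal sketch of the pattern. The sample dataframe is illustrative, and the bucket and key values are placeholders; the s3_additional_kwargs key holds per-request arguments that s3fs forwards to S3, which is where the encryption settings go.

import dask.dataframe as dd
import pandas as pd

# Placeholders -- update these for your environment.
BUCKET = "your-bucket-name"
KMS_KEY_ID = "your-kms-key-id"

# Build a small dataframe and split it into 4 partitions so that
# 4 JSON files are written, one per partition.
df = pd.DataFrame({"id": range(8), "value": list("abcdefgh")})
ddf = dd.from_pandas(df, npartitions=4)

# storage_options is handed to s3fs; s3_additional_kwargs contains
# arguments that s3fs passes along to S3 on each request, including
# the Server Side Encryption settings.
ddf.to_json(
    f"s3://{BUCKET}/dask-sse-demo/part-*.json",
    storage_options={
        "s3_additional_kwargs": {
            "ServerSideEncryption": "aws:kms",
            "SSEKMSKeyId": KMS_KEY_ID,
        }
    },
)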

The script passes the storage_options parameter on to the storage interface, which in this case is s3fs. Other s3fs options, such as bucket version awareness, can be set the same way.
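As a hypothetical example, reading the files back with bucket version awareness enabled would pass version_aware, a standard s3fs constructor argument, through the same mechanism:

# Every key in storage_options becomes an argument to
# s3fs.S3FileSystem; this assumes the script above has run.
ddf_back = dd.read_json(
    f"s3://{BUCKET}/dask-sse-demo/part-*.json",
    storage_options={"version_aware": True},
)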

There are several changes that need to be made for this script to work. The BUCKET variable needs to be updated to an S3 bucket that you have permission to write to. If you have a KMS key that you'd like to use, it can be specified in the KMS_KEY_ID variable. If not, the SSEKMSKeyId parameter should be removed from storage_options and S3 will simply use the default encryption key. You will also need to make sure that your AWS credentials are set properly for the write operation.
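For the default-key case, the write call from the script above reduces to this sketch:

# Without SSEKMSKeyId, S3 encrypts the objects with the default
# KMS key for aws:kms encryption.
ddf.to_json(
    f"s3://{BUCKET}/dask-sse-demo/part-*.json",
    storage_options={
        "s3_additional_kwargs": {"ServerSideEncryption": "aws:kms"}
    },
)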

Conclusion

Dask has greatly increased our ability to process large amounts of data and opened up the possibility of scaling as needed in the future by adding clusters. Now that we have figured out how to securely write data to our lake, we're excited about all the opportunities that Dask opens up.
