I think it’s reasonable to assume that not all of our customers will view our Data Lake and its technologies through the same lens that we use in our team of Data Engineers.
Whilst we have chosen to lay down our data in S3 using Delta Lake and Apache Spark, we recognise that other teams, such as Data Scientists, have their own priorities for delivering value to their customers, and will choose the tooling that best satisfies those priorities and their team’s skillset.
The ecosystem of tools for Data Scientists is diverse, with this community of highly skilled data professionals using a plethora of libraries and software in addition to the ones we use ourselves. …
I am a Senior Data Engineer in the Enterprise DataOps Team at SEEK in Melbourne, Australia. My colleagues and I develop and maintain a Redshift Data Warehouse and an S3 Data Lake using Apache Spark.
Back in December of 2019, Databricks added manifest file generation to their open source (OSS) variant of Delta Lake. This made it possible to query OSS Delta Lake tables in S3 with Amazon Redshift Spectrum or Amazon Athena.
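As a sketch of what this looks like in practice: generating the manifest from Spark, then pointing an Athena (or Spectrum) external table at the `_symlink_format_manifest` directory the generation step writes under the table path. The bucket, table name, and columns below are hypothetical, and the DDL is the documented symlink-manifest pattern rather than anything specific to our pipelines.

```sql
-- Spark SQL: generate symlink manifests for an OSS Delta table
-- (s3://my-bucket/events is a placeholder path)
GENERATE symlink_format_manifest FOR TABLE delta.`s3://my-bucket/events`;

-- Athena DDL: read the Delta table's Parquet files via the manifest
-- (schema and names are illustrative)
CREATE EXTERNAL TABLE events (
  event_id string,
  event_ts timestamp
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://my-bucket/events/_symlink_format_manifest/';
```

Because Athena reads the manifest rather than listing the S3 prefix directly, it only sees the Parquet files that belong to the current table version, skipping files that Delta has logically removed.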
Delta Lake is an open source storage layer built on the columnar Parquet file format. It provides ACID transactions over cloud object stores like Amazon S3, simplifying the development of incremental data pipelines beyond what Parquet alone offers, and it also supports schema evolution of tables. …
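The incremental-pipeline benefit is easiest to see with Delta Lake's transactional `MERGE INTO`, which upserts a batch of changes atomically, something plain Parquet files cannot do. A minimal sketch in Spark SQL, where the `events` and `events_updates` tables are hypothetical:

```sql
-- Atomically upsert a batch of changes into a Delta table.
-- Readers see either the pre-merge or post-merge state, never a partial write.
MERGE INTO events AS target
USING events_updates AS source
  ON target.event_id = source.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

With plain Parquet, the equivalent operation would mean rewriting whole files or partitions with no isolation for concurrent readers; Delta's transaction log is what makes the merge safe.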