Data Engineering Digest #12 (May 2020)
Maycon Viana Bordin · Published in data.plumbers · 17 min read · Jun 19, 2020
Photo by Simon Migaj from Pexels

New Tools

Use Delta Lake 0.6.0 to Automatically Evolve Table Schema and Improve Operational Metrics
Learn more about Delta Lake release 0.6.0 and how it will allow you to automatically evolve table schema in merge…
databricks.com

A multi-node, elastic, petabyte scale, time-series database on Postgres for free (and more ways we…
Today we have a big announcement: we're officially making multi-node TimescaleDB, a petabyte-scale distributed…
blog.timescale.com

TerminusDB
TerminusDB is an open source model driven graph database for knowledge graph representation designed specifically for…
terminusdb.com

MetricsDB: TimeSeries Database for storing metrics at Twitter
We covered Observability Engineering's high level overview in blog posts earlier here and its follow up here. Our time…
blog.twitter.com

25 Hot New Data Tools and What They DON’T Do
“Wait, do tool X and tool Y work together?
I thought they were competitive.”
medium.com

Spark 3.0

How to Speed up SQL Queries with Adaptive Query Execution
This is a joint engineering effort between the Databricks Apache Spark engineering team - Wenchen Fan, Herman van…
databricks.com

Preview Apache Spark 3.0 Using the Databricks Runtime 7.0 Beta
We're excited to announce that the Apache Spark 3.0.0-preview2 release is available on Databricks as part of our new…
databricks.com

How Python type hints simplify Pandas UDFs in Apache Spark 3.0
Pandas user-defined functions (UDFs) are one of the most significant enhancements in Apache Spark for data science…
databricks.com

Data Engineering Role

How To Think About Data
The real difference between a data engineer and a data scientist — how they think
towardsdatascience.com

Data Engineer, Data Science and Data Analyst — What the Difference?
Get to know the professions in the data field
towardsdatascience.com

Complete Data Engineer’s Vocabulary
Concepts that data engineers must know in 10 words or less
towardsdatascience.com

Voicing for Data Engineering, the unsung hero
How I have switched gear to help businesses kick-start their data infrastructure and reporting pipeline
towardsdatascience.com

Data Engineering: What is it?
towardsdatascience.com

Courses & Training

Data Engineering on GCP Specialisation: A Comprehensive Guide for Data Professionals
If you are a data professional considering to upskill, there is no shortage of learning options, but if you are looking…
towardsdatascience.com

5 Free Courses to learn Apache Spark in 2020
Hello guys, if you are thinking to learn Apache Spark to start your Big Data journey and looking for some awesome free…
medium.com

How to Prepare for the Confluent Certified Operator for Apache Kafka (CCOAK) exam
Getting the Apache Kafka certification from Confluent is a great way of making sure to have your skills recognized by…
medium.com

Datastax Certified Apache Cassandra Developer | Exam tips 2020
Preparation guidelines and resources for Apache Cassandra 3.x Developer
Associate Certification
medium.com

Podcasts & Presentations

Mapping The Customer Journey For B2B Companies At Dreamdata
An interview about the challenges of tracking the customer journey for B2B companies and how Dreamdata is addressing…
www.dataengineeringpodcast.com

Power Up Your PostgreSQL Analytics With Swarm64
An interview with Swarm64 CEO Thomas Richter about optimizing PostgreSQL on high performance hardware and FPGAs for…
www.dataengineeringpodcast.com

StreamNative Brings Streaming Data To The Cloud Native Landscape With Pulsar
An interview with StreamNative co-founder Sijie Guo about his experience contributing to the Pulsar framework for…
www.dataengineeringpodcast.com

Enterprise Data Operations And Orchestration At Infoworks
An interview with Amar Arsikere about the complexities of data operations at enterprise scale and the approach that…
www.dataengineeringpodcast.com

Machine Learning through Streaming at Lyft
Sherin Thomas talks about the challenges of building and scaling a fully managed, self-service platform for stream…
www.infoq.com

From Batch to Streaming to Both
Herman Schaaf talks about how the streaming data platform at Skyscanner evolved over time. This platform now processes…
www.infoq.com

Kafka: A Modern Distributed System
Tim Berglund covers Kafka's distributed system fundamentals: the role of the Controller, the mechanics of leader…
www.infoq.com

Real Data Architectures

Enabling Data-Driven Decisions
A story of building central Data Platform at the Financial Times using the latest technology trends.
medium.com

Data Culture

Bye Bye Big Data!
Everyone used to say that big data was the future. Was it wrong?
What about now?
towardsdatascience.com

Choose Smart Data Over Big Data to Save Your Business
From a data engineer’s perspective
towardsdatascience.com

Data Lake

A Complete Guide On Serverless Data Lake using AWS Glue, Athena and QuickSight
Step-By-Step Walkthrough on ETL Data Processing, Querying and Visualization in a Serverless Data Lake
towardsdatascience.com

Data Lake — Design For Better Architecture, Storage, Security & Data Governance
Data-driven outcomes, forecasting, and predicting business trends is essential to any business. Today we see at least…
medium.com

Datalake File Ingestion: From FTP to AWS S3
Transferring files from FTP server to AWS s3 using Paramiko in Python
towardsdatascience.com

Data Lake Analytics
Data Lake an ever-evolving set of technologies which is used to store structured, semi-structured and unstructured data…
medium.com

Tale of Data Warehouse, Data Lake and Data Pond
As a data engineer, you always have a confusion on Data Warehouse, Lake and Pond. What are they and the most important…
medium.com

Data Governance

How Privacy Killed RBAC
This is a short story about how pressures from the real world can have a grave impact on our world of technology
medium.com

Principles of lazy data documentation — and how to get your team onboard
Documenting data is a pain. Is there hope for the lazy? Here are a handful of tools and techniques for low-friction…
blog.quiltdata.com

Collaborative Data Catalog for DataOps
Modernizing data platforms can create challenges.
Although traditional data catalogs can deliver some visibility across…
medium.com

Implementation of Decentralized Data Quality
A Viewpoint shift from Data Quality to Collaborative Data Quality
towardsdatascience.com

DataOps

Make your data, and your organization ready for AI
Ever since AI was thrust into the spotlight with Watson in 2007, organizations have wanted to leverage AI in their…
medium.com

A DataOps perspective on App and Data Democratization
How DataOps facilitates access to data and apps, and helps to scale a data-driven company
medium.com

Why DataOps Is Here to Stay
With DataOps, data engineers and data scientists can work together, bringing a level of collaboration and…
towardsdatascience.com

4 Easy Ways to Start DataOps Today
The primary source of information about DataOps is from vendors (like DataKitchen) who sell enterprise software into…
medium.com

What the Heck is *Ops?
A Guide to Ops Terms and Whether We Need Them
medium.com

The Seven Pillars of DataOps
What is DataOps?
medium.com

Data Formats

GAVRO — Managed Big-Data Schema Evolution
Wouldn’t it be great to build a data ingestion architecture that was resilient to change?
More specifically, resilient…
towardsdatascience.com

Delta Lake

Delta Lake in production: a critical evaluation
I have seen several posts and tutorials on Delta Lake using “Hello World” kind of examples, where everything works…
medium.com

Delta Lake
Hola!
medium.com

Delta Lake: Schema Enforcement & Evolution
medium.com

Use Delta Lake 0.6.0 to Automatically Evolve Table Schema and Improve Operational Metrics
Learn more about Delta Lake release 0.6.0 and how it will allow you to automatically evolve table schema in merge…
databricks.com

Apache Avro

How to deserialize AVRO messages in Python Faust?
Faust is a stream processing library, porting the ideas from Kafka Streams to Python.
medium.com

Apache Parquet

Cool Sh*t I Just Learned: Parquet’s Predicate Pushdown
Does it really exist? A pursuit of finding Parquet’s Predicate Pushdown empirical evidence 🕵🏻♂️
medium.com

Data Pipelines

Self-serve data pipelining platform
By — Karuna Saini ( Engineer, Data Platform)
medium.com

How to build a scalable big data analytics pipeline
Set up an end-to-end system at scale
towardsdatascience.com

Real Time Data Pipeline — More Than We Expected
When we were considering migrating our data delivery pipeline from batches of hourly files into a real time streaming…
medium.com

Build Your First Data Pipeline in just Ten Minutes
Step-by-step process to build your first data pipeline with a real-world use case using PDI.
medium.com

Data Quality Tools

Introducing a new pySpark’s library: owl-data-sanitizer
A library to democratize data quality within companies with pySpark data pipelines.
towardsdatascience.com

Data Processing

Getting started with Spark and batch processing frameworks
What you need to know before you dive into big data processing with Apache Spark and other frameworks.
blog.insightdatascience.com

Hadoop vs. HDFS vs. HBase vs. Hive
What’s the difference?
medium.com

Beyond Pandas: Spark, Dask, Vaex and other big data technologies battling head to head
API and performance comparison on a billion-rows dataset.
What should you use?
towardsdatascience.com

Apache Spark

DIY: Apache Spark & Docker
Set up a Spark cluster in Docker from scratch
towardsdatascience.com

Apache Spark with Kubernetes and Fast S3 Access
Use Spark in a simple and portable way on-promise and in the cloud
towardsdatascience.com

Working with JSON in Apache Spark
Denormalising human-readable JSON for sweet data processing
medium.com

Revealing Apache Spark Shuffling Magic
Five Important Aspects of Apache Spark Shuffling to know for building predictable, reliable and efficient Spark…
medium.com

Apache Spark BigQuery Connector — Optimization tips & example Jupyter Notebooks
Learn how to use the BigQuery Storage API with Apache Spark on Cloud Dataproc
medium.com

Apache Spark With DynamoDB Use Cases
Code examples of JAVA Spark applications that write and read data from DynamoDB tables running in an AWS EMR cluster.
medium.com

The Most Complete Guide to pySpark DataFrames
A bookmarkable cheatsheet containing all the Dataframe Functionality you might need
towardsdatascience.com

Big Data: Spark, AWS & SQL
simple cloud computing with S3 & EMR
medium.com

Spark Serialization Errors
A deep dive into the causes of serialization errors in Spark
medium.com

Quickstart: Apache Spark on Kubernetes
Give your big loads a smooth sailing using the native Apache Spark Operator for Kubernetes
towardsdatascience.com

Dynamic Partition Pruning in Spark 3.0 - DZone Big Data
With the release of Spark 3.0, big improvements were implemented to enable Spark to execute faster and there came many…
dzone.com

Apache Spark: Caching
Apache Spark provides an important feature to cache intermediate data and provide significant performance improvement…
towardsdatascience.com

The Pros and Cons of Running Apache Spark on Kubernetes
Kubernetes support was only recently added for Spark.
How does it compare to other deployment modes and is it worth it?
towardsdatascience.com

Five Ways to Perform Aggregation in Apache Spark
Aggregation being the widely used operator among data analytics assignments, Spark provides a solid framework for the…
medium.com

UDAF and Aggregators: Custom Aggregation Approaches for Datasets in Apache Spark
Aggregations on data records is a necessary part of a data analytics exercise and therefore Spark is designed to put…
towardsdatascience.com

How to modernize and scale VaR calculations and risk management with Apache Spark, Delta Lake and…
Managing risk within the financial services, especially within the banking sector, has increased in complexity over the…
databricks.com

Apache Hive

Understanding Hadoop Hive
Hive is a data warehouse system which is used for querying and analysing large datasets stored in HDFS. It process…
medium.com

Hive UDAFs, or why Java’s type system sucks
Apache Hive is one of the most ubiquitous big data technologies out there; its job is to enable all kinds of data…
medium.com

Apache Hadoop

Apache YARN & Zookeeper
All about Resource Allocation and High Availability
towardsdatascience.com

Partition Management in Hadoop
Our solution to the Hadoop small files problem
medium.com

Presto

Presto sails on the Ship of Theseus
The Open Source Software, Presto, presents a real-life case study of the philosophical problem: The Ship of Theseus.
medium.com

Querying Multiple Data Sources with a Single Query using Presto’s Query Federation
In this first post in a new series, we introduce Presto and show how to use it to combine data from several sources…
medium.com

Apache Drill

Query data in Google Cloud Storage with SQL using Apache Drill
Google Cloud users are no strangers to BigQuery.
Its a petabyte scale serverless warehouse with SQL interface, blazing…
medium.com

Apache Sqoop

Apache Sqoop
RDBMS to HDFS and back
towardsdatascience.com

A Step by Step Guide for Loading Oracle Datasets into Hadoop using Docker Containers
Tutorial : Oracle To Hadoop with Docker containers
medium.com

Apache Sqoop — hide your password!
Apache Sqoop is a versatile and very useful tool when it comes to gathering data for your Big Data project.
medium.com

Stream Processing

Realtime Stream Processing Architectural Solution
In my previous post I have described the feasibility study on technology selection for a realtime stream processing…
medium.com

To stream or to not stream. That is a question.
Data streaming through Kafka is becoming an essential part of any data application. We are using Kafka mostly thanks to…
medium.com

Apache Flink

Stream Processing Best Practices with Apache Flink
Apache Flink is used for building a pipeline for streaming data analysis. This section discusses best practises I have…
medium.com

Apache Flink Series 9 — How Flink & Standalone Cluster Setup Work?
In this post, I am going to explain, how Flink starts itself, and what happens when you submit your job to the…
medium.com

Flink Map, CoMap, RichMap and RichCoMap Functions
Flink has a powerful functional streaming API which let application developer specify high-level functions for data…
medium.com

Flink Checkpointing
State management comes out of the box for Flink and it is considered as the first-class citizen.
While Flink abstracts…
medium.com

SQL Editor for Apache Flink SQL
This is the very first version of the SQL Editor for Apache Flink.
medium.com

Apache Spark

Spark Streaming with HTTP REST endpoint serving JSON data
Speed up development and testing of spark structured streaming pipelines using HTTP REST endpoint as streaming source.
medium.com

Streaming from Kafka to PostgreSQL through Spark Structured Streaming
medium.com

Integration Testing in Spark Structured Streaming
A guide for writing integration test for a Spark Structured Streaming Application
medium.com

Apache Flume

Apache Flume
Trickle-feed unstructured data into HDFS using Apache Flume
towardsdatascience.com

Clustering & Resources

Apache YARN & Zookeeper
All about Resource Allocation and High Availability
towardsdatascience.com

Change Data Capture

Stream your data changes in MySQL into ElasticSearch using Debizium, Kafka, and Confluent JDBC…
How to stream data changes from MySQL into Elasticsearch Index
towardsdatascience.com

Faster Change Data Capture for your Data Lake
The intent of this article is to discuss and present a new, faster approach to performing Change Data Capture (CDC) for…
medium.com

Debezium

Stream your data changes in MySQL into ElasticSearch using Debizium, Kafka, and Confluent JDBC…
How to stream data changes from MySQL into Elasticsearch Index
towardsdatascience.com

Data liberation pattern using Debezium engine
Integrating legacy applications into your Event-Driven Architecture.
medium.com

CDC made Easy with KTable, Debezium and Kafka Connect
by Karthikeyan Siva Baskaran and Somanath Sankaran
medium.com

Debezium- Production Deployment Preparation
If you are working on Debezium and plans to move it to production, I will suggest you go through this self-explanatory…
medium.com

Debezium Custom Converters
Creating custom converters using Debezium’s new SPI to override value conversions
medium.com

MySQL to PostgreSQL using Kafka Connect
Our objective would be to quickly set up a data pipeline and move data from MySQL to
PostgreSQL.
medium.com

Storage

Apache HDFS

Hadoop Distributed File System
A comprehensive guide to understanding HDFS and it’s inner workings
towardsdatascience.com

Messaging

Apache Kafka

Apache Kafka vs. Enterprise Service Bus (ESB)
Typically, an enterprise service bus (ESB) or other integration solutions like extract-transform-load (ETL) tools have…
medium.com

Is Apache Kafka a Database?
Can and should Apache Kafka replace a database? How long can and should I store data in Kafka? How can I query and…
medium.com

Kafka with AVRO vs., Kafka with Protobuf vs., Kafka with JSON Schema
Experiments with Kafka serialisation schemes — playing with AVRO, Protobuf, JSON Schema in Confluent Streaming…
medium.com

SSL Authentication with Apache Kafka
Apache Kafka is the next big thing in Event-driven architectures and Microservices ecosystem and with its fast…
medium.com

Schema and Topic Design in Event-Driven Systems (featuring Kafka!)
In the microservices world, communication between services provides a host of problems — one of them being “how do we…
medium.com

Ballerina Kafka Serialization with Avro
This article will demonstrate how to use Apache Avro serialization / deserialization in Ballerina Kafka producers and…
medium.com

Apache-Kafka — Stream Avro Serialized Objects In 6 Steps.
Set up the environment for Kafka (Kafka server, Zookeeper, Schema Registry) and Docker.
medium.com

Learn how to use Kafkacat — the most versatile Kafka CLI client
Kafkacat is an awesome tool and today I want to show you how easy it is to use it and what are some of the cool things…
medium.com

Kafka for Engineers
Here are things about Kafka that you need to understand as a software engineer
levelup.gitconnected.com

The streaming bridges — A Kafka, RabbitMQ, MQTT and CoAP example
medium.com

3 Libraries You Should Know to Master Apache Kafka in Python
First thing first. Why Kafka?
Kafka is intended for boosting an event-driven architecture. It empowers the architecture…
towardsdatascience.com

Intro to streaming data and Apache Kafka
Overview of streaming data architectures and why Apache Kafka has become so popular
towardsdatascience.com

Kafka Internals For Troubleshooting
It is not necessary to know Kafka internals in order to run it but understanding these internals helps to provide…
medium.com

Apache Pulsar

One new contestant to bring down the King: Apache Pulsar
Nowadays we’re in a new age of Event-Driven Architecture rise. This is not the first time we’ve lived that. Before…
medium.com

Apache Pulsar 2.5.2
The Apache Pulsar 2.5.2 version is a huge effort from the community, with over 56 commits, general improvements and bug…
medium.com

Workflow Management

Apache Airflow

Getting started with Airflow locally and remotely
Airflow has been around for a while, but it has gained a lot of traction lately. So what is Airflow? How can you use…
towardsdatascience.com

How We Monitor Apache Airflow in Production
A quick guide on what (and how) to monitor to keep your workflows running smoothly.
blog.gojekengineering.com

3 Steps to Advanced Alerting on Airflow with Databand
As a data engineer, you need to create trust in your data. You need to be aware of problems in pipeline timeliness and…
medium.com

Build your first data warehouse with Airflow on GCP
What are the steps in building a data warehouse? What cloud technology should you use?
How to use Airflow to…
towardsdatascience.com

A Complete Guide to Setting up a Local Development Environment for Airflow (with Docker and…
Collaborate on Airflow workflows with ease using this setup that includes a docker-compose, PyCharm, and DAG validation…
medium.com

A Gentle Introduction To Understand Airflow Executor
Let’s discuss details about what’s Airflow executor, compare different types of executors to help you make a decision
towardsdatascience.com

Automated Reporting System Using Airflow
Configure scheduled reports in under 15 minutes
medium.com

Apache Airflow and Kubernetes — Pain Points and Plugins to the Rescue
I explore some of the Airflow pain points we struggled with and how plugins were used to address them.
medium.com

From Zero to Apache Airflow Contribution — Part 1
How to make your first contribution to Apache Airflow project
medium.com

From Zero to Apache Airflow Contribution — Part 2
You are in part 2 of how to make your first Apache Airflow contribution. If you haven’t started in part 1, I advise you…
medium.com

How to create an ETL pipeline in Python with Airflow
Simple ETL with Airflow
medium.com

Airflow: how and when to use it (Advanced)
Beyond basic concepts of Airflow, there is a lot to consider. Choosing an operator and DAG structure is important for…
towardsdatascience.com

Machine Learning Workflow

Kubeflow 1.0 — Quick Overview
Kubeflow is an open-source and free machine learning Kubernetes-native platform for developing, orchestrating…
medium.com

Developing Machine Learning Pipelines
In Data Science, you are only as good as the way you structure your work
towardsdatascience.com

Operationalization of ML Pipelines on Apache Mesos and Hadoop using Airflow
An architecture for bringing ML models into production at NEW YORKER
towardsdatascience.com

Cloud Providers

AWS

Add Newly Created Partitions Programmatically into AWS Athena schema
Python script to load new partitions using Glue Job. Simple. Fast.
Clean.
medium.com

Lessons From Processing 300 Million Messages a Day
Common problems when scaling AWS Glue
medium.com

Challenges during migration from On-Premise to AWS Cloud
In case of on-premise set up, infrastructure management is one of the challenge for a company. Now a days, companies…
medium.com

Glue ETL — Redshift Source
Redshift is a common destination for ETL pipelines. Using Redshift as an ETL source is not very common. It also creates…
medium.com

Extract, Transform, Load (ETL) — AWS Glue
Learn how to use AWS Glue for ETL operations in Spark on Novel Corona Virus Dataset
towardsdatascience.com

7 Things I Found Annoying About AWS Glue
1. Extremely slow start times
medium.com

Build first ETL solution using AWS Glue..
In this post, I am going to discuss how we can create ETL pipelines using AWS Glue. We will learn what is aws glue, how…
medium.com

Cross-account AWS Glue Data Catalog access with Glue ETL
To process data in AWS Glue ETL, DataFrame or DynamicFrame is required. A DataFrame is similar to a table and supports…
medium.com

Implementing Glue ETL job with Job Bookmarks
AWS Glue is a fully managed ETL service to load large amounts of datasets from various sources for analytics and data…
medium.com

Process Events with Kinesis and Lambda
Processing Kinesis Events with Lambda
medium.com

AWS Data Analytics — Kinesis Part-1
How to move data on AWS ?
medium.com

AWS Kinesis Data Streaming with Lambda and Serverless
Today we are going to explore AWS Kinesis Data Streaming with Lambda functions.
So, Amazon Kinesis is a managed…
medium.com

Google Cloud

Loading Data from multiple CSV files in GCS into BigQuery using Cloud Dataflow (Python)
A Beginner’s Guide to Data Engineering on Google Cloud Platform
medium.com

Implementing a Data Vault in BigQuery
In my last post I went over how we went about implementing a fitness program at Pandera, the need for a custom solution…
medium.com

How to use Dynamic SQL in BigQuery
Format a string, and use EXECUTE IMMEDIATE
towardsdatascience.com

Google Cloud Data Catalog — Integrate Your On-Prem RDBMS Metadata
Code samples with a practical approach on how to ingest metadata from on-premise Relational Databases into Google Cloud…
medium.com

Where is my data? The answer is Google Data Catalog
All you want to know about GCP Data Catalog
medium.com

Optimal performance with Bigtable: changing the key of your table with Apache Beam
Bigtable is a high performance distributed NoSQL database. But what if you discover that performance is not so great in…
towardsdatascience.com

Transform JSON to CSV from Google bucket using a Dataflow Python pipeline
In this article, we will try to transform a JSON file into a CSV file using dataflow and python
medium.com

Cloud Firestore on Beam with Java
Recently, I’ve been in charge to create a streaming data pipeline which handle data coming through Cloud Pub/Sub…
medium.com

Using the Bigtable emulator with Apache Beam and BigtableIO
Cloud Bigtable is a great high performance distributed NoSQL database that can store petabytes of data, but sometimes…
medium.com

Data Pipeline in GCP: Cloud Function Basics
Most Data Scientists, prefer to own the end to end data pipeline of their models, but owning a pipeline requires a lot…
towardsdatascience.com

Azure

Continuous integration and delivery in Azure Data Factory
Definitive guide to building CI/CD pipelines for Azure Data Factory using Azure DevOps
medium.com

How to build a data platform with Azure and Snowflake
This blog explains how Azure and Snowflake provide a very powerful toolset to
quickly build data platforms.
medium.com

Azure Data Lake Store Gen2 Snapshot
Data Lake Storage Gen2 is the result of converging the capabilities of our two existing storage services, Azure Blob…
medium.com

Monitoring and Reporting the real-time data in Power BI
About Kafka: Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to…
medium.com

How to Increase Azure Databricks Cluster vCPU Cores Limits
The solution of the warning: “This account may not have enough CPU cores to satisfy this request” for Azure Databricks
towardsdatascience.com

Databases

NoSQL

MongoDB 4.2.6
MongoDB is a popular distributed document database. It offers replication via a homegrown consensus protocol which…
jepsen.io

TileDB 2.0 and the Future of Data Science
The Helicopter View
medium.com

Everything you should know about NoSQL database — System Design
It is hard to choose between relation (RDBS) or non-relational database (NoSQL) while designing a system. A fair…
medium.com

Firestore/Datastore: unlocked the query filter capabilities in Go
Firestore and Datastore are powerful but the query capabilities are limited. Discover a Go library that I wrote for…
medium.com

Azure Cosmos Database Concepts, Data Modelling — Part 1
NoSQL is becoming the latest trend in all applications.
You must have seen a lot of changes in the last decade where…
medium.com

A Bumpy Journey To Rewrite A Bulk Upload API For Cassandra — P1: The Database
A deep dive into how Cassandra handles data writes and consistencies
medium.com

Understanding Distributed database/system using Cassandra
When I first came across the term distributed systems or database, the very reaction that came to my mind was that…
medium.com

Comparing CQL and the DynamoDB API
Six years ago, a few of us were busy hacking on a new unikernel, OSv, which we hoped would speed up any Linux…
medium.com

An introduction to Apache HBase
Overview
medium.com

HBase DB in Distributed Systems
HBase Database overview
medium.com

Apache Druid Migration: AWS to GCP — Part 2
In this article, we will discuss how to migrate Druid from AWS to GCP cloud platform. For details on how to set up…
medium.com

In-Memory & Data Grid

Distributed Cache Design : 🖥
A Cache is like short-term memory. It is typically faster then the origin data source. You know Accessing data from RAM…
medium.com

Relational

How does Spanner avoid single point of failures in writes?
Google’s Spanner is a relational database with 99.999% availability which is roughly 5 mins a year. Spanner is a…
medium.com

Migrating from Postgres to CockroachDB: bulk loading performance
I recently migrated a large database (~400GB) from PostgreSQL over to CockroachDB. This blog is a recap of my process…
medium.com

SQLZoo: The Best Way to Practice SQL
A Wild Playground to Test Your Skills with Solutions
towardsdatascience.com

Modern Data Warehouses

Building a Data Warehouse Pipeline: Basic Concepts & Roadmap
Five processes to improve your data pipeline operability and performance
towardsdatascience.com

The Rise and Fall of the OLAP Cube
One of the biggest shifts in data analytics over the past decade is the move away from building ‘data cubes’, or ‘OLAP…
towardsdatascience.com

Making queries 100x faster with Snowflake
Why and how we migrated our product usage data from PostgreSQL to Snowflake
medium.com