Data Engineering Digest #8 (January 2020)
Maycon Viana Bordin · Published in data.plumbers · 8 min read · Feb 7, 2020
Photo by eberhard grossgasteiger from Pexels

New Tools

Introducing Flyte: Cloud Native Machine Learning and Data Processing Platform
Today Lyft is excited to announce the open sourcing of Flyte, a structured programming and distributed processing…
eng.lyft.com

Data Engineering Role

Most In Demand Tech Skills for Data Engineers
Data Engineer is the fastest growing job title according to a 2019 analysis. Which tech skills are most in demand for…
towardsdatascience.com

Courses & Training

7 Resources to Becoming a Data Engineer - KDnuggets
Data Engineering is one of the fastest growing and in-demand occupations among Data Science practitioners. The ability…
www.kdnuggets.com

Top 13 data engineer and data architect certifications
Data and big data analytics are the lifeblood of any successful business. Getting the technology right can be…
www.cio.com

Notes for Databricks CRT020 Exam Prep
As I walk through the Databricks exam prep for Apache Spark 2.4 with Python 3, I’m collating notes based on the…
medium.com

Podcasts

Change Data Capture For All Of Your Databases With Debezium
An interview about how the Debezium framework simplifies implementing change data capture for all of your database…
www.dataengineeringpodcast.com

Planet Scale SQL For The New Generation Of Applications
An interview about YugabyteDB and how it was architected to power the new generation of planet scale applications The…
www.dataengineeringpodcast.com

Replatforming Production Dataflows
An interview about how Mayvenn replatformed their production dataflows using Ascend and improved their ability to…
www.dataengineeringpodcast.com

Pay Down Technical Debt In Your Data Pipeline With Great Expectations
An interview about how the Great Expectations framework helps you add meaningful tests and validation to your data…
www.dataengineeringpodcast.com

Real Data Architectures

Designing Production-Ready Kappa Architecture for Timely Data Stream Processing
At Uber, we use robust data processing systems such as Apache Flink and Apache Spark to power the streaming…
eng.uber.com

A Deep Dive into Unified’s Data Lake
What is a data lake? How does it work? In this post we answer these questions in the context of Unified’s data lake.
medium.com

Some Common Data Science Stacks
7 stacks from interviewing Analysts, Scientists, and Engineers.
towardsdatascience.com

Data Culture

Empower Data Owners to become a Data-Driven Enterprise
A detailed look at the missing Data Owner role that keeps organizations from becoming data driven.
medium.com

The data product lifecycle
Your organisation wants to dive head-first into data and AI but you don’t really know where to start? Data&AI is on the…
medium.com

Data Lake

What Is a Data Lakehouse? — The Databricks Blog
Over the past few years at Databricks, we’ve seen a new data management paradigm that emerged independently across many…
databricks.com

How Amazon is solving big-data challenges with data lakes
Back when Jeff Bezos filled orders in his garage and drove packages to the post office himself, crunching the numbers…
www.allthingsdistributed.com

The Distributed Data Mesh as a Solution to Centralized Data Monoliths
Instead of building large, centralized data platforms, enterprise data architects should create distributed data…
www.infoq.com

A Guide To Modern Batch Data Warehousing — Extraction
Redefining the data extraction patterns to follow “Functional Data Engineering” best practices
towardsdatascience.com

Starting out with data puddles, then we’ll think about data lakes
Comic Relief is re-thinking its data ingestion, storage and query stack with Lambda, S3 & Athena.
Here is a quick intro…
medium.com

Multi-tenancy for Big Data: Part 2
Modern businesses understand that data is not just important to your business, it is your business.
blog.ellation.com

Data Governance

Observability for Data Engineering
Observability is a fast-growing concept in the Ops community that caught fire in recent years, led by major…
medium.com

How Data Quality Can Kill your Data Science Project… If You’re Not Careful
If “Data Scientist is the sexiest job in the 21st Century”, then data quality is the least sexy aspect, but it’s still…
medium.com

Towards a Data Quality Score in open data (part 1)
Why Open Data Toronto created a score to assess data quality and what it measures
medium.com

Reducing Organizational Complexity with DataOps
Organizational complexity creates significant problems, but executives in a McKinsey Survey showed little understanding…
medium.com

Data Formats

Comparison of Big Data storage layers: Delta vs Apache Hudi vs Apache Iceberg. Part#1
All you will read here is personal opinion or lack of knowledge :) Please feel free to contact me for fixing incorrect…
medium.com

Delta Lake

Is “Delta Lake” Replacing “Data Lakes”? (Ep. 6)
Delta Lake is another striking open-source project that Databricks supports. What’s the value of Delta Lake?
medium.com

Migrating from Hive to Delta Lake + Hive in Hybrid Cloud Environment
Everything about migration from Hive to Delta Lake + Hive
medium.com

Partitioned Delta Lake : Part 3
A tutorial about how to use partition in Delta Lake
medium.com

Upsert In Delta Lake : Part 4
Welcome to the fourth part of the series on how to upsert/merge data from an Apache Spark DataFrame into a Delta table.
medium.com

Delta Lake: Extract the real value from Data Lake
Delta Lake provides great features and solves some of the biggest issues that come with a data lake.
On top of all, it…
medium.com

Apache Parquet

Compaction / Merge of small parquet files
Optimising size of parquet files for processing by Hadoop or Spark
medium.com

Data Pipelines

4 Easy steps to setting up an ETL Data pipeline from scratch
Setting up an ETL pipeline within a few commands
towardsdatascience.com

Data Processing

Big Data Analytics: Apache Spark vs. Apache Hadoop
Learn why Apache Spark was created, and how it addresses Apache Hadoop’s shortcomings.
towardsdatascience.com

Apache Spark

How we reduced our Apache Spark cluster cost using best practices
It’s been about 3 months now since I switched over to Lisbon from Italy. I’ve been offered a chance to work with one of…
medium.com

Spark in Docker in Kubernetes: A practical approach for scalable NLP
Natural Language Processing using the Google Cloud Platform’s Kubernetes Engine
towardsdatascience.com

Spark deserves a better IDE
Authors: Raj Bains, Maciej Szpakowski
medium.com

Spark UDAF could be an option!
Calculate average on sparse arrays
medium.com

Infrastructure as Code: Introduction to Continuous Spark Cluster Deployment with Cloud Build and…
Imagine you want to start building some data pipelines in Spark or implement a model with Spark ML, the first step…
medium.com

The What, Why, and When of Apache Spark
Before-you-code Spark basics
towardsdatascience.com

Apache Hive

Herding the Elephants: moving data from PostgreSQL to Hive
In this article we are sharing learnings and practical advice for making PostgreSQL data available to Spark in an…
medium.com

Presto

Presto-Powered S3 Data Warehouse on Kubernetes
Presto is a distributed query engine capable of bringing SQL to a wide variety of data stores, including S3 object…
medium.com

Stream Processing

Apache Flink

Flink as a Service at JW Player
JW Player is the world’s largest network-independent platform for video delivery and intelligence. Our global footprint…
medium.com

Apache Flink State Types
Apache Flink is 4th generation open source data processing framework.
Flink does support stateful and stateless…
medium.com

Timers management in Apache Flink
Introduction
medium.com

Change Data Capture

Practical Change Data Streaming Use Cases with Apache Kafka & Debezium
Gunnar Morling discusses practical matters, best practices for running Debezium in production on and off Kubernetes…
www.infoq.com

Messaging

Apache Kafka

Streaming Machine Learning with Tiered Storage
Kai Waehner: The combination of streaming machine learning (ML) and Confluent Tiered Storage enables you to build…
www.confluent.io

Pipeline to the Cloud - On-Premises Data Streaming for Cloud Analytics
Robin Moffatt: This article shows how you can offload data from on-premises transactional (OLTP) databases to…
www.confluent.io

Streams and Tables in Apache Kafka: Event Processing Fundamentals
Michael Noll: Part 2 of this series discussed in detail the storage layer of Apache Kafka: topics, partitions, and…
www.confluent.io

Streams and Tables in Apache Kafka: Elasticity, Fault Tolerance & Advanced Concepts
Michael Noll: Now that we've learned about the processing layer of Apache Kafka ® by looking at streams and…
www.confluent.io

7 mistakes when using Apache Kafka
Apache Kafka is used as a message broker but can be extended by additional tools to become a whole message processing…
blog.softwaremill.com

Who and why uses Apache Kafka?
Some claim that Kafka is one of the most popular tools in the world.
blog.softwaremill.com

Kafka the afterthoughts: message encoding and schema management
In this article I share notes and thoughts, from my journey with Kafka, about data encoding and schema management.
medium.com

Event-driven Autoscaling for Kubernetes with Kafka & Keda
Autoscale Kubernetes workloads based on message count in a Kafka topic
medium.com

Apache Pulsar

Why Apache Pulsar — A Gentle Comparison with Kafka
What is Apache Pulsar?
medium.com

What are Pulsar Functions?
From Pulsar in Action by David Kjerrumgaard
medium.com

pulsar-express, a web interface for Apache Pulsar
Pulsar-express aims to be a simple web application that allows users to see information about their Apache Pulsar…
medium.com

Workflow Management

Apache Airflow

Confessions of an Airflow user
Airflow, Airflow, Airflow… how I love and hate thee. The siren calls of scale and flexibility tempt me, even as I have…
medium.com

Generalizing data load processes with Airflow
Data load processes should not be written twice, they should be generalized
towardsdatascience.com

Automatic Airflow DAG creation for Data Scientists and Analysts
TL;DR: DAG creator is a python script when runs, it will pick the latest json definition files and substitutes the…
towardsdatascience.com

Why Apache Airflow Is a Great Choice for Managing Data Pipelines
A look at capabilities which makes Airflow better than its predecessors
towardsdatascience.com

Reliably Upgrading Apache Airflow at Slack’s Scale
For two years we’ve been running Airflow 1.8, and it was time for us to catch up. Here’s how we did it without…
slack.engineering

Scaling DAG Creation With Apache Airflow
One of the more difficult tasks within the Data Science community is not designing a model to a well-constructed…
towardsdatascience.com

Apache Airflow is Fun For Data Engineer!
Implementation Apache Airflow in Tunaiku
medium.com

Integrating Airflow + Datadog on docker-compose
Integrating Airflow running on Docker + Datadog took way longer than I expected, so I decided to share my guide to the…
medium.com

Cloud Providers

AWS & Snowflake vs GCP: how do they stack up when building a data platform?
When we talk about data, the number of technologies available on the market is overwhelming and staying up to date is a…
medium.com

AWS

Getting Started with Data Analysis on AWS
Learn how to use AWS Glue, Amazon Athena, and Amazon QuickSight to transform, enrich, analyze, and visualize…
towardsdatascience.com

Amazon S3 Data Lake-Storing & Analyzing the Streaming Data on the go — Serverless Approach
Making A S3 Data Lake by Storing the Streaming Data && Analyzing it on the go…
towardsdatascience.com

Publish Streaming data into Aws S3 Datalake and Query it
Consume Streaming data from Aws Kinesis, build Datalake in S3 and run SQL queries from Athena.
medium.com

How to merge NoSQL and SQL using AWS Glue
How to report on data from both NoSQL and SQL at the same time without going crazy
levelup.gitconnected.com

Google Cloud

Building Real-time data pipelines with Google Cloud Pub/Sub
Motivation
medium.com

Azure

Building a Dynamic data pipeline with Databricks and Azure Data Factory
TL;DR A few simple useful techniques that can be applied in Data Factory and Databricks to make your data pipelines a…
towardsdatascience.com

Securing access to Azure Data Lake gen2 from Azure Databricks
There are a number of ways to configure access to Azure Data Lake Storage gen2 (ADLS) from Azure Databricks (ADB). This…
medium.com

Databases

NoSQL

Relational vs NoSQL and RDBMS to NoSQL Migration - DZone Database
Given the choice of a Relational Database (RDBMS) vs a NoSQL database, it has become more important to select the right…
dzone.com

The Multi-Model Knowledge Graph - DZone Database
Enterprise Knowledge Graphs (EKGs) have been on the rise and are incredibly valuable tools for harmonizing internal and…
dzone.com

DynamoDB is Not a Database
Amazon describes DynamoDB as a database, but it’s best seen as a highly-durable data structure in the cloud.
medium.com

KeyDB is a Fork of Redis that is 5X Faster
What if I told you there is a fork of Redis that can run 5x faster with nearly 5x lower latency. What if you no longer…
medium.com

MongoDB in production: How connection pool size can bottleneck application scale
Understanding how MongoDB connection pools and pool sizing works is a fundamental part of running an effective MongoDB…
medium.com

Maximizing Disk Utilization with Incremental Compaction
By Raphael S. Carvalho and Benny Halevy, January 16, 2020
medium.com

In-Memory & Data Grid

Scalable Data Grid Using Apache Ignite - DZone Big Data
In this article, I introduce the concept of a Data Grid, its properties, services it offers, and finally how to design…
dzone.com

Relational

4 Data Sharding Strategies for Distributed SQL Analyzed - DZone Database
A distributed SQL database needs to automatically partition the data in a table and distribute it across nodes. This is…
dzone.com