Data Engineering Digest #11 (April 2020)
Maycon Viana Bordin · Published in data.plumbers · May 9, 2020

Photo by Ylanite Koppens from Pexels

New & Updated Tools

ProphecyHub: Metadata re-invented with Git & GraphQL for Data Engineering
Authors: Raj Bains, Arpan Agrawal, Mayank Kotwal
medium.com

Introducing Apache Pinot 0.3.0
Built at LinkedIn, Pinot is an open source, distributed, and scalable OLAP data store that we use as our de-facto…
engineering.linkedin.com

AWS Glue now supports serverless streaming ETL
AWS Glue now supports streaming ETL. This feature makes it easy to set up continuous ingestion pipelines…
aws.amazon.com

Data Engineering Role

Assessing and interviewing data engineers from a distance
When in-person technical interviews are no longer an option, hiring managers still have a wealth of online resources at…
blog.insightdatascience.com

Courses & Training

How to Prepare for and Clear the GCP : Professional Data Engineer Exam
Hey guys, so I passed the GCP : Professional Data Engineer exam on 17th January 2020 and I went through a really tough…
medium.com

PATH TO BECOME A DATA ENGINEER
Data Engineering is definitely one of the most demanded jobs in today’s world.
As the data grows the need of Data…
medium.com

(Review) Udacity Data Engineer Nanodegree
Or A.K.A, a journey to become a modern data engineer.
medium.com

Podcasts & Presentations

Building A Knowledge Graph Of Commercial Real Estate At Cherre
An interview about how Cherre builds and maintains a knowledge graph of commercial real estate data and how it enables…
www.dataengineeringpodcast.com

Making Data Collection In Your Code Easy With Rookout
An interview with Rookout's CTO about the importance of including non-technical roles in the data collection process…
www.dataengineeringpodcast.com

Building Real Time Applications On Streaming Data With Eventador
An interview with Eventador CEO Kenny Gorman about the challenges of building a managed service for streaming data to…
www.dataengineeringpodcast.com

Taming Complexity In Your Data Driven Organization With DataOps
An interview about using a DataOps approach to reduce the technical and organizational complexity that occurs in data…
www.dataengineeringpodcast.com

Real Data Architectures

Create AWS baby datalakes to handle ongoing daily data batch
In the era of micro-service architecture, AI (or BI)-powered applications are structured as a collection of services…
medium.com

How we built a modern data platform in a digital bank from the ground zero
To Start any organization’s data journey, an initial step is to build a data platform. however, it’s not rocket science…
medium.com

Data Culture

A Data Engineer’s Perspective On Data Democratization
How democratizing data shapes the data engineering effort
towardsdatascience.com

How Data Science is Boosting Netflix
When used effectively, data can transform your business in magical ways and take it to new heights.
towardsdatascience.com

When Data Science turns into Homoeopathy
A few thoughts on the importance of data literacy in data-driven companies.
medium.com

Don’t Buy Data — Invest In It
Huge sums of money are wasted on data because companies are spending it the wrong way.
There’s a difference between…
towardsdatascience.com

Data Lake

Design Patterns for Data Lakes
Data Lake is the heart of big data architecture, as a result there needs to be careful planning in designing and…
medium.com

A Data Scientist’s Guide to Data Architecture
What you need to know to build a robust data process
towardsdatascience.com

Reinventing the Data Platform in the Cloud
In our first story we pointed out which architectural aspects and paradigms are crucial for a successful data platform…
medium.com

Data Governance

Data Discovery in 2020
A brief survey of data catalogs from Big Tech data teams
medium.com

Data Catalogs — Unlocking Value in your Data Lakes
It’s increasingly clear that successful data lake transformation and adoption of self-service rests on findability and…
medium.com

Five mistakes to avoid while building a Data Platform
Having spent couple of years now in the world of data, building end to end data management platforms right from the…
medium.com

Sustainable Privacy Compliance Requires Disciplined Data Management
Data management, including meta-data management, data governance, master data management, has been advocated since the…
towardsdatascience.com

Data Formats

Delta Lake

How to optimize and increase SQL query speed on Delta Lake
There are two time-honored optimization techniques for making queries run faster in data systems: process data at a…
databricks.com

How to Build a Modern Clinical Health Data Lake with Delta Lake - The Databricks Blog
The healthcare industry is one of the biggest producers of data.
In fact, the average healthcare organization is…
databricks.com

Improving Resiliency with Databricks Delta Lake & Azure
Databricks Delta Lake with few Azure features can protect our data lake & help to restore easily in case of any issues.
medium.com

Databricks Delta Architecture
As organizations nowadays have a lot of data, which could be customer data or S3 or could be unstructured data from a…
medium.com

Data Pipelines

Building a Simple ETL Pipeline with Python and Google Cloud Platform
Extracting data from an FTP server using Google Cloud Functions
towardsdatascience.com

Improve your Data Lifecycle with Metadata-Driven Pipelines
No digital transformation program is complete without a data-based initiative. With some speculating that artificial…
medium.com

Lessons learned building serverless data pipelines
Before many of the cooler features of AI products can be productionalized you need high quality and correct data…
medium.com

Data Pipelines with OpenFaas on K3s
A short story of how I used OpenFaas, Nats, and K3s to create a data pipeline for inserting data into a data lake
medium.com

Build a Scalable Data Pipeline with AWS Kinesis, AWS Lambda, and Google BigQuery
This blog details how to handle large amounts of event-triggered data for live time backend analysis with AWS Kinesis…
medium.com

ML Pipelines

Productionising ML Projects with Google BigQuery and PySpark: Predicting Hotel Cancellations
All too often, data scientists get caught up in the exploratory phase of data science, i.e.
running multiple models on…
towardsdatascience.com

How to Build Machine Learning Pipelines with Airflow & Papermill
Learn to scale your machine learning workflows at will.
medium.com

Data Quality & Tools

Dirty Data — Quality Assessment & Cleaning Measures
Practical guide to understand, build, and execute data quality & cleaning process
towardsdatascience.com

Data Ingestion

Apache Sqoop

Sqoop scenarios and options
As part of the modern day big data architecture, it has become imperative to move data from RDBMS to Hadoop Distributed…
medium.com

Data Processing

Apache Spark

Six Spark Exercises to Rule Them All
Some challenging Spark SQL questions, easy to lift-and-shift on many real-world problems (with solutions)
towardsdatascience.com

Apache Spark Dataset Encoders Demystified
RDD, Dataframe and Dataset in Spark are different representations of a collection of data records with each one having…
towardsdatascience.com

Spark Optimizations for Advanced Users - Spark Cheat Sheet
I started Apache Spark learning almost 3 years back, when I was working with Android middle-ware project as part of my…
medium.com

An Apache Spark Application In Microservices Ecosystems
Articulating a problem reflects our knowledge about its domain. In the software and data world, we almost always…
medium.com

4 simple tips to improve your Apache Spark job performance!
Making your Apache Spark application run faster with minimal changes to your code!
medium.com

Using the Spark Aggregator class in Scala
Type-safe Aggregations: what are they?
towardsdatascience.com

Successful spark-submits for Python projects.
Smoothly run your project on an actual cluster, instead of that pretend one you’ve been using.
towardsdatascience.com

Apache Hadoop

Installing Hadoop 3.2.1 Single node cluster on Windows 10
This article is a step-by-step guide to install a Hadoop single node cluster on Windows 10 operating system.
towardsdatascience.com

Map-Reduce with Python & Hadoop on AWS EMR
Let’s do some basic Map-Reduce on AWS EMR, with the typical word count example, but using python and Hadoop streaming.
levelup.gitconnected.com

Apache Flume and Hbase in Hadoop
Welcome to lesson ‘Apache Flume and HBase’ of Big Data Hadoop tutorial which is a part of ‘big data training’ offered…
medium.com

Apache Hive

A POC for YouTube Data Analysis using Pig & Hive
In Today’s world as the 4 V’s of Big data (Volume,Variety,Velocity & Veracity) are very rapidly increasing it has…
medium.com

Accelerate Spark and Hive Jobs on AWS S3 by 10x with Alluxio as a Tiered Storage Solution
In this article, Thai Bui describes how Bazaarvoice leverages Alluxio as a caching tier on top of AWS S3 to maximize…
medium.com

Presto

Presto with Kubernetes and S3 — Benchmark
In the first part of this blog, I described how to deploy a Presto cluster with Kubernetes and configure it to access…
medium.com

Introducing our High-Performance Elasticsearch Connector for Presto
Our Presto Elasticsearch Connector is built with performance in mind.
Here are some of the use-cases it is being used…
medium.com

Stream Processing

Big Data Battle: Streaming data approach using Apache Flume vs PySpark
Big Data is a term that describes an enormous volume of data…
medium.com

Kafka Stream (KStream) vs Apache Flink - DZone Big Data
Two of the most popular and fast-growing frameworks for stream processing are Flink (since 2015) and Kafka's Stream API…
dzone.com

Apache Flink

Event-Driven Supply Chain for Crisis with FlinkSQL
How Open Source Streaming technologies can help improve supply chain during Covid-19
towardsdatascience.com

Twitter Streaming using Flink
Flink is an open source stream-processing framework. It does provide stateful computation over data streams, recovery…
medium.com

Running Flink Application on Kinesis Data Analytics(KDA)- Part 1
Learn how to run a Flink stream processing application in the AWS Kinesis Data Analytics environment. Covers some best…
medium.com

Visualize fraudulent transactions via CEP with Kafka, Flink, SQL, D3.js and Mapbox.
I am not too proud to admit it: I like writing code in javascript. I’m a novice, but along with Python it’s my go-to…
medium.com

The Foundations for Building an Apache Flink Application
Understanding stream processing using Flink from bottom-up; a practical guide for coding a Flink processing Java…
medium.com

Apache Spark

Optimize Spark Structured Streaming for Scale.
Spark structured streaming production-ready version was released in spark 2.2.0. Our team was excited to test it on a…
medium.com

The connection between Spark Streaming and Apache Kafka using Java
How to connect Kafka and Spark
medium.com

Apache Kafka Streams

Bottom Up Approach To Kafka Stream Internals
Kafka Stream library needs to complete couple of steps before getting a stream application up and running. These steps…
medium.com

Kafka Stream Processing — Composing Views By Example
The sample code and instruction to setup and run, is available in GitHub.
Event driven system designed with CQRS…
medium.com

Calculating speed, bearing and distance using Kafka Streams Processor API
Sometimes the classic DSL Kafka is not enough for us. The Processor API allows you to freely define the processor, and…
medium.com

Apache Storm

Introduction to Apache Storm
Apache Storm is a free and open source distributed realtime computation system. Apache Storm makes it easy to reliably…
medium.com

Change Data Capture

Change data Capture : Old way
Introduction
medium.com

Storage

Hadoop TeraGen Revisited
Someone asked me to help benchmark and compare throughput of on premise and cloud big data storage. Instead of just…
medium.com

Apache HDFS

HDFS Erasure Coding
Reduce storage overhead significantly in your HDFS cluster by leveraging Erasure Coding
towardsdatascience.com

Understanding Hadoop HDFS
HDFS (Hadoop Distributed File System) is a distributed file system for storing and retrieving large files with…
medium.com

Messaging

Kafka vs. RabbitMQ: Why Use Kafka?
Are all data streaming services made equal?
medium.com

Data Pipelines: Scaling a Message Broker System for Half the Cost
Tealium is a data hub platform that processes events at a large scale. We’ve seen tremendous growth in the amount of…
medium.com

Apache Kafka

Ordering of events in Kafka
Certain use cases require strict ordering of events (messages/records with data payload and/or state) to be maintained…
medium.com

Optimizing Kafka Cluster Deployments in Kubernetes
We, at Axual, pride ourselves in running high volume, mission critical Apache Kafka clusters for businesses in various…
itnext.io

Kafka 101 — An introduction to Kafka
I have been thinking of doing something this quarantine and here you go!
As the title mentions, this blog is an…
medium.com

Apache Kafka in a Nutshell
Architecture, Use Cases, and a Getting Started guide, rolled into one
medium.com

An investigation into Kafka Log Compaction
Kafka log cleaner design and usage
medium.com

Kafka Connect on Kubernetes, the easy way!
This is a tutorial that shows how to set up and use Kafka Connect on Kubernetes using Strimzi, with the help of an…
itnext.io

Apache Pulsar

Why StreamSQL moved from Apache Kafka to Apache Pulsar
Apache Kafka and event streaming are practically synonymous today. Event streaming is a core part of our platform, and…
medium.com

Data-Friendly Messenger Apache Pulsar Gains Market
Open source messaging system, Apache Pulsar has been promoted from incubator status to a top-level project as its…
medium.com

Apache Pulsar 2.5.1
Apache Pulsar community has successfully released 2.5.1 version. Learn improvements and bug fixes in Apache Pulsar…
medium.com

Workflow Management

Python Data Engineering Tools: The Next Generation
A look below the surface of new data engineering / science frameworks
medium.com

Apache Airflow

Apache Airflow in a Digital bank Production
Production Background: We have 100s of data pipelines(mostly Apache spark) in production to ingest the raw data…
medium.com

Airflow’s dashboard was testing our patience, so we made our own
tl;dr: Creating/editing a pipeline visually is impossible in existing tools and it makes data engineering a chore.
This…
medium.com

Apache Airflow — Programmatic platform for Data Pipelines and ETL/ELT workflows
In current data driven world, Data Pipelines and ETL(Extract, Transform and Load) workflows plays a major role in…
medium.com

How did I resolved pip package dependency issue in Apache Airflow?
Talks about the cyclic dependency issue with pip packages, how to resolve it using PythonVirtualenvOperator.
medium.com

Apache Airflow — Plugins, SubDAGs and SLAs
Airflow Plugins
medium.com

Elastic(autoscaling) Airflow Cluster in Kubernetes
In this article, I will demonstrate how we can build an Elastic Airflow Cluster which scales-out on high load and…
itnext.io

Airflow Schedule Interval 101
The airflow schedule interval could be a challenging concept to comprehend, even for developers work on Airflow for a…
towardsdatascience.com

Alpaca: Airflow at JW Player
How our custom platform built on top of Airflow allows users to quickly create new Airflow DAGs
medium.com

Airflow with YAML Dags and kubernetes operator
Simplifying the creation of DAGs
medium.com

Airflow : Zero to One
In current world, we process a lot of data and the churn rate of it increases exponentially with passing time, where…
medium.com

Apache Oozie

Batch Processing of data from MYSQL to HDFS using Oozie workflow
Encountered with a challenge of automating the process of data collection from MySQL server to HDFS using Oozie…
medium.com

Cloud Providers

AWS

Serverless Data Lake: Storing and Analysing Streaming Data using AWS
Making an Amazon S3 Data Lake on Streaming Data using Kinesis, Glue, Athena and Quicksight
medium.com

Building a Data Lake with AWS Lake Formation
With growing numbers of people accessing data, it is important that data platforms are flexible and scalable.
Hence…
medium.com

PEX — The secret sauce for the perfect PySpark deployment of AWS EMR workloads
How to use PEX to speed up deployment of PySpark applications on ephemeral AWS EMR clusters
towardsdatascience.com

AWS Glue: An ETL Solution with Huge Potential
AWS Glue is a fully managed serverless ETL service with enormous potential for teams across enterprise organizations.
medium.com

How I connect an S3 bucket to a Databricks notebook to do analytics.
A basic use case to connect Amazon S3 and a databricks notebook.
towardsdatascience.com

Update and Insert(upsert) Data from AWS Glue
Introduction
towardsdatascience.com

Map-Reduce with Python & Hadoop on AWS EMR
Let’s do some basic Map-Reduce on AWS EMR, with the typical word count example, but using python and Hadoop streaming.
levelup.gitconnected.com

Simplify data pipelines with AWS Glue automatic code generation and Workflows | Amazon Web Services
In this post, we discuss how to leverage the automatic code generation process in AWS Glue ETL to simplify common data…
aws.amazon.com

Exploring the public AWS COVID-19 data lake | Amazon Web Services
This post walks you through accessing the AWS COVID-19 data lake through the AWS Glue Data Catalog via Amazon SageMaker…
aws.amazon.com

A public data lake for analysis of COVID-19 data | Amazon Web Services
As the COVID-19 pandemic continues to threaten and take lives around the world, we must work together across…
aws.amazon.com

Ingest streaming data into Amazon Elasticsearch Service within the privacy of your VPC with Amazon…
Today we are adding a new Amazon Kinesis Data Firehose feature to set up VPC delivery to your Amazon Elasticsearch…
aws.amazon.com

Integrating AWS Lake Formation with Amazon RDS for SQL Server | Amazon Web Services
To grow and develop your business, you must collect data from a myriad of sources (such as relational and NoSQL…
aws.amazon.com

Simplify your Spark dependency management with Docker in EMR 6.0.0 | Amazon Web Services
Apache Spark is a powerful data processing engine that
gives data analyst and engineering teams easy to use APIs and…
aws.amazon.com

Apache Hive is 2x faster with Hive LLAP on EMR 6.0.0 | Amazon Web Services
Customers use Apache Hive with Amazon EMR to provide SQL-based access to petabytes of data stored on Amazon S3. Amazon…
aws.amazon.com

Read your S3 access logs in AWS Athena
The last two posts, I am exploring (again) the static Web hosting through S3 + Route53 and as additional layer the…
medium.com

Migrating Big Data Workloads to AWS EMR
Overview
medium.com

How to configure Kerberos in AWS EMR ?
What is Kerberos?
medium.com

Google Cloud

Creation of an ETL in Google Cloud Platform for automated reporting
Learn how to create your own serverless and fully scalable ETL for automated reporting using PyTrends as an example
towardsdatascience.com

Migrating Data Processing Hadoop Workloads to GCP
Written by Anant Damle and Varun Dhussa
medium.com

Migrating Hive ACID tables to BigQuery
Migrating data from Hadoop to Google BigQuery is a fairly straightforward process. DistCP is usually leveraged to push…
medium.com

Testing Airflow jobs on Google Cloud Composer using pytest
A reliable CI/CD without reinventing the wheel
towardsdatascience.com

Azure

How to Secure Your Azure Machine Learning Experiments
A step-by-step guide to adopt best practices and a strong security posture when deploying the Azure Machine Learning…
medium.com

Azure SQL Database network settings (Private Link, VNET Service Endpoint) and Azure Data Factory
Azure SQL Database has a few extra settings on the Firewalls and Virtual Networks tab in addition to Private Link and…
medium.com

Web Activity (Sending Email) in Azure Data Factory
Lately, I have been using Azure Data Factory (ADF) gently.
I kinda like the concept of code-less data engineering ETL…
medium.com

Databases

NoSQL

Massive Scale Databases
Introduction
itnext.io

CAP Theorem & its relevance to No SQL DB
CAP theorem, also named Brewer’s theorem after computer scientist Eric Brewer, states that it is impossible for a…
medium.com

Hadoop NoSQL: Hbase, Cassandra, MongoDB
7- DEMYSTIFYING THE HADOOP TECHNOLOGY
medium.com

Building a Distributed Hadoop Cluster with HBase on Amazon EC2’s from Scratch
If you want to build a Distributed Hadoop Cluster on AWS EC2 with HBase, then the best option is to use AWS EMR. But if…
medium.com

Cassandra in Kubernetes
Headless Services, KubeDNS, Init Containers, Lifecycle hooks and other K8s concepts on the way
medium.com

A Glance at Apache Cassandra
Apache Cassandra is a kind of distributed column-oriented NoSQL database and performs very high performance in vast…
medium.com

SQL and NoSQL: key differences
A guide to the two most common types of database systems
medium.com

SQL vs NoSQL : What’s the best option for your database?
One of the essential choices a developer must make is about what database technology to use for structuring and…
medium.com

Exploring MongoDB
Data is the new age fuel: MongoDB, the database for modern application can be the best fit when handling large data…
medium.com

Query Optimization in MongoDB
MongoDB is a NoSQL database most commonly referred as a document database.
It was basically designed for ease of…
medium.com

Persistent Databases Using Docker’s Volumes and MongoDB
With Docker Compose version 3
medium.com

Using Apache Pinot and Kafka to Analyze GitHub Events
Pinot is the latest Apache incubated project to follow in the footsteps of other tremendously popular open source…
medium.com

In-Memory & Data Grid

Apache Ignite and CAP Theorem
Is Apache Ignite consistent, available or both?
medium.com

Apache Ignite with persistence!
From last couple of years, I have been working on this amazing product from Apache called Ignite ,which is, an in…
medium.com

Persistence Options for Apache Ignite
How to choose the right persistence store for Apache Ignite.
medium.com

Modern Data Warehouses

Guide to Data Warehousing
Short and comprehensive information about different data modeling techniques
towardsdatascience.com

Data Warehouse with a Lake view
Having had various discussions around data warehousing, data lakes and big data technologies I felt the urge to share…
medium.com