Apache Spark has quickly become one of the most heavily used processing engines in the Big Data space since it became a Top-Level Apache Project in February 2014. Not only can it run in a variety of environments (locally, on a standalone Spark cluster, on Apache Mesos, on YARN, etc.), it also provides a number of libraries that help you solve a wide range of problems on Hadoop. These include SQL queries, streaming, and Machine Learning, to name a few, all running on an optimized execution engine.
We at Clairvoyant have built many Data Pipelines with Apache Spark, including Batch…
In the very first blog post I wrote a few years ago (Lessons from an Injured Runner), I detailed my experience with my first significant running injury: a stress fracture in my tibia. Up until that point, I had never dealt with any significant injury, so it was a big shock to have to hang up my running shoes for several months and to realize that I could even be seriously injured. When nothing like that has ever happened to you before, you never expect it to happen at all.
Fortunately, after several months of resting and waiting, I was back to…
The Coronavirus (COVID-19) pandemic has impacted a great many things in our lives since its first discovery in late 2019, and especially since March of 2020, when many countries around the world implemented strict safeguards within their respective borders. These safeguards were put in place to prevent the spread of the virus and to mitigate its effects on the portions of the population most vulnerable to COVID: those with weakened immune systems, such as the elderly or those battling other illnesses. …
Amazon Web Services (AWS) provides a variety of valuable services that many organizations can utilize to run their workloads. From Elastic Beanstalk (to run scalable Web Applications) to Elastic MapReduce (to run scalable Big Data workloads) to spinning up simple Virtual Machines, there are a host of benefits to being in the Cloud. You no longer have to worry about maintaining your own Data Center, and spinning up instances is as easy as a few clicks. …
Here at Clairvoyant, we’ve been using Apache Airflow heavily for the past 5 years across many of our projects. This includes a diverse set of use cases such as ingestion into Big Data platforms, code deployments, building Machine Learning models, and much more. We’ve also built, and now maintain, a dozen or so Airflow clusters.
While working with Apache Airflow, we’ve noticed a few operations that can easily be automated through a workflow to improve the longevity of an Apache Airflow instance. This, in turn, could help reduce the number of hands-on activities needed to maintain an Airflow…
The Confluent Schema Registry provides a RESTful interface for storing and retrieving Apache Avro® schemas. It stores the versioned history of all schemas based on a specified subject name strategy, provides multiple compatibility settings, allows schemas to evolve according to the configured compatibility settings, and offers expanded Avro support. It provides serializers that plug into Apache Kafka® clients and handle schema storage and retrieval for Kafka messages that are sent in the Avro format.
The Schema Registry runs as a separate process from the Kafka Brokers. Your producers and consumers still talk to Kafka to publish and read data (messages)…
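As a sketch of how this wiring typically looks, a producer that sends Avro messages is pointed at both the Kafka brokers and the Schema Registry through its configuration; Confluent's Avro serializer then registers and fetches schemas behind the scenes. The hostnames and ports below are placeholders for a local setup, not values from this post:

```properties
# Producer configuration sketch -- hostnames/ports are illustrative only.
bootstrap.servers=localhost:9092
key.serializer=org.apache.kafka.common.serialization.StringSerializer
# Confluent's Avro serializer talks to the Schema Registry for you:
# it registers new schemas and embeds a schema ID in each message.
value.serializer=io.confluent.kafka.serializers.KafkaAvroSerializer
schema.registry.url=http://localhost:8081
```

Consumers use the matching `KafkaAvroDeserializer` with the same `schema.registry.url` setting, so both sides resolve the schema ID carried in each message against the same registry.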
The purpose of this blog is to describe how to set Java 8 as the version of Java to use in both the Cloudera Quickstart VM and Hadoop. You can also read more use cases about how to upgrade to Java 8 here. The reason you might want to do this is so that you can run Spark jobs using Java 8 libraries and features (such as lambda expressions).
Without Cloudera Manager
sudo…
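The commands above are truncated in this excerpt. As a rough sketch of the general idea only (the JDK path below is hypothetical and varies by JDK build; it is not the post's exact procedure), switching the VM to Java 8 usually comes down to installing a JDK 8 package and pointing `JAVA_HOME` at it:

```shell
# Sketch only -- the exact JDK path depends on the build you install.
# Install a JDK 8 package first (e.g. via yum or an Oracle RPM), then:
export JAVA_HOME=/usr/java/jdk1.8.0_181
export PATH="$JAVA_HOME/bin:$PATH"

# Hadoop reads its Java location from hadoop-env.sh; the same
# JAVA_HOME line would go in that file as well.
echo "$JAVA_HOME"   # prints /usr/java/jdk1.8.0_181
```

With Cloudera Manager, the equivalent change is made through the cluster's Java Home Directory setting rather than by editing files by hand.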
Cloudera, one of the leading distributions of Hadoop, provides an easy-to-install Virtual Machine to help you get started quickly on their platform. With it, you can easily get a single-node CDH cluster running within a Virtual Environment. You can use this VM for personal learning, rapidly building applications on a dedicated cluster, or many other purposes.
Apache Kafka is an open-source stream-processing software platform written in Scala and Java, originally developed at LinkedIn and later donated to the Apache Software Foundation. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data…
Apache Kudu is a columnar storage manager developed for the Apache Hadoop platform. Kudu shares the common technical properties of Hadoop ecosystem applications: it runs on commodity hardware, is horizontally scalable, and supports highly available operation.
The…
Out of the box, the Cloudera Quickstart VM runs Spark 1.6.x, a very stable and reliable version of Spark. However, Spark 2.0 was released in 2016, bringing with it significant improvements in features and performance. …
Director of Big Data and Cloud Engineering for Clairvoyant LLC | Marathon Runner | Triathlete | Endurance Athlete