Simple Tips and Tricks to Improve the Performance of your Spark Applications

Pixabay — Abstract Abstraction Acceleration

Apache Spark has quickly become one of the most heavily used processing engines in the Big Data space since it became a Top-Level Apache Project in February of 2014. Not only can it run in a variety of environments (locally, Standalone Spark Cluster, Apache Mesos, YARN, etc.), but it also provides a number of libraries that can help you solve just about any problem on Hadoop. These include SQL queries, streaming, and machine learning, to name a few, all running on an optimized execution engine.

We at Clairvoyant have built many Data Pipelines with Apache Spark, including Batch…
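
To make the idea of a performance "tip" concrete, here is a minimal PySpark sketch (an illustration on my part, not taken from the article itself) of one common tuning step: caching a DataFrame that several actions reuse, so it is not re-read and recomputed each time. The events.parquet path and the status column are hypothetical.

from pyspark.sql import SparkSession

# Build a local SparkSession; on a real cluster the master comes from spark-submit
spark = (
    SparkSession.builder
    .appName("caching-example")
    .master("local[*]")
    .getOrCreate()
)

# Hypothetical input; any DataFrame reused by several actions benefits the same way
events = spark.read.parquet("events.parquet")

events.cache()  # keep the data in memory across the actions below

total = events.count()                                        # first action materializes the cache
errors = events.filter(events["status"] == "ERROR").count()   # reuses the cached data

print(total, errors)
spark.stop()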


How it happened, how I’m going to recover, and what changes I'm going to make to prevent this from EVER happening again!

Pexels.com (source)

In the very first blog post I wrote a few years ago (Lessons from an Injured Runner), I detailed my experience with my first significant running injury: a stress fracture in my tibia. Up until that point, I had never dealt with any significant injury. It was a big shock to have to hang up my running shoes for several months and to realize that I could even be seriously injured. When nothing like that has ever happened before, you never expect it to happen at all.

Fortunately, after several months of resting and waiting, I was back to…


Tips you can follow during the COVID-19 pandemic to stay safe and avoid infection while continuing to run and improve physically

Pixabay (source)

The Coronavirus (COVID-19) pandemic has impacted a great many things in our lives since it was first identified in late 2019, and especially since March of 2020, when many countries around the world put strict safeguards in place. These safeguards were meant to prevent the spread of the virus and to mitigate its impact on the portions of the population most vulnerable to COVID: those with weakened immune systems, such as the elderly or those battling other diseases. …


A few steps you can take to reduce the cost of operating on Amazon Web Services

Savings Budget Investment Money — Pixabay

Amazon Web Services (AWS) provides a variety of valuable services that many organizations can utilize to run their workloads. From Elastic Beanstalk (to run scalable Web Applications) to Elastic MapReduce (to run scalable Big Data workloads) to spinning up simple Virtual Machines, there are a host of benefits to being in the Cloud. It’s to the point where you don’t have to worry about having your own Data Center anymore. Not to mention that spinning up instances is as easy as a few clicks. …
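
As one illustrative example of the kind of step the article is about (an assumption on my part, not necessarily one of the article's own steps), a short boto3 sketch can surface unattached EBS volumes, which quietly accrue charges month after month. It assumes AWS credentials and a default region are already configured.

import boto3

ec2 = boto3.client("ec2")

# Volumes in the "available" state are not attached to any instance,
# but they still incur storage charges every month
volumes = ec2.describe_volumes(
    Filters=[{"Name": "status", "Values": ["available"]}]
)["Volumes"]

for vol in volumes:
    print(vol["VolumeId"], vol["Size"], "GiB", vol["AvailabilityZone"])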


Maintaining Apache Airflow through Regularly Scheduled DAGs

Pixabay — Robot Disassembled Blue Lightbulb

Here at Clairvoyant, we've been using Apache Airflow heavily for the past 5 years in many of our projects. These cover a diverse set of use cases, such as ingestion into Big Data platforms, code deployments, building machine learning models, and much more. We've also built, and now maintain, a dozen or so Airflow clusters.

While we’ve been working with Apache Airflow, we’ve noticed a few operations that could be easily automated through a workflow to improve the longevity of an Apache Airflow instance. This, in turn, could help reduce the number of hands-on activities needed to maintain an Airflow…
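
As a rough sketch of what such a maintenance workflow can look like (assuming an Airflow 1.10-style install and a hypothetical log directory, so adjust the names and paths to your environment), a small scheduled DAG can prune old task logs once a day:

from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

default_args = {
    "owner": "airflow",
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
}

with DAG(
    dag_id="airflow_log_cleanup",        # hypothetical DAG name
    default_args=default_args,
    start_date=datetime(2020, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:

    # Delete task log files older than 30 days, then remove any now-empty directories.
    # /usr/local/airflow/logs is an assumed base log folder; point this at your install.
    delete_old_task_logs = BashOperator(
        task_id="delete_old_task_logs",
        bash_command=(
            "find /usr/local/airflow/logs -type f -mtime +30 -delete && "
            "find /usr/local/airflow/logs -type d -empty -delete"
        ),
    )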


A guide to the installation and upgrade process for Confluent Schema Registry for Apache Kafka

Pexels

Architecture

The Confluent Schema Registry provides a RESTful interface for storing and retrieving Apache Avro® schemas. It stores a versioned history of all schemas based on a specified subject name strategy, provides multiple compatibility settings, allows schemas to evolve according to the configured compatibility settings, and offers expanded Avro support. It also provides serializers that plug into Apache Kafka® clients and handle schema storage and retrieval for Kafka messages sent in the Avro format.

The Schema Registry runs as a separate process from the Kafka Brokers. Your producers and consumers still talk to Kafka to publish and read data (messages)…
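
To make the RESTful interface concrete, here is a minimal Python sketch that registers an Avro schema and reads the latest version back, assuming the registry is running at its default http://localhost:8081 and using a hypothetical subject name "orders-value":

import json
import requests

REGISTRY_URL = "http://localhost:8081"   # default Schema Registry listener
SUBJECT = "orders-value"                 # hypothetical subject name

order_schema = {
    "type": "record",
    "name": "Order",
    "fields": [
        {"name": "id", "type": "string"},
        {"name": "amount", "type": "double"},
    ],
}

# Register a new schema version under the subject
resp = requests.post(
    f"{REGISTRY_URL}/subjects/{SUBJECT}/versions",
    headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
    data=json.dumps({"schema": json.dumps(order_schema)}),
)
resp.raise_for_status()
print("registered schema id:", resp.json()["id"])

# Fetch the latest registered version back
latest = requests.get(f"{REGISTRY_URL}/subjects/{SUBJECT}/versions/latest").json()
print("latest version:", latest["version"])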


Steps to Upgrade Java on the Cloudera Quickstart VM

Pixabay

The purpose of this blog is to describe how to set Java 8 as the version of Java used by the Cloudera Quickstart VM and by Hadoop. You can also read more about use cases for upgrading to Java 8 here. The reason you might want to do this is so that you can run Spark jobs using Java 8 libraries and features (like lambda operations, etc.).

High-Level Steps

  1. Stop the Hadoop Services
  2. Upgrade to Java 8
  3. Update configurations to use Java 8
  4. Restart services

Upgrade Steps

1. Stop the Hadoop Services

Without Cloudera Manager

  • SSH into the machine
  • Log in as root
sudo…

Steps to Install Apache Kafka on the Cloudera Quickstart VM

Cloudera, one of the leading Hadoop distributions, provides an easy-to-install Virtual Machine for getting started quickly on their platform. With it, someone can easily get a single-node CDH cluster running within a virtual environment. Users can use this VM for their own personal learning, for rapidly building applications on a dedicated cluster, or for many other purposes.

Apache Kafka is an open-source stream-processing software platform developed by LinkedIn and donated to the Apache Software Foundation, written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data…
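
Once Kafka is installed and running on the VM, a quick smoke test helps confirm the broker works. The sketch below is an illustration with assumed details (the kafka-python client, a broker on localhost:9092, and a hypothetical topic named "test-topic"), not part of the original article:

from kafka import KafkaProducer, KafkaConsumer

# Produce a single message to the hypothetical topic
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("test-topic", b"hello from the Quickstart VM")
producer.flush()   # make sure the message is actually delivered before reading it back

# Read the topic from the beginning, then stop after 5 seconds without new messages
consumer = KafkaConsumer(
    "test-topic",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,
)
for message in consumer:
    print(message.value.decode("utf-8"))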


Steps to Install Apache Kudu on the Cloudera Quickstart VM

Cloudera, one of the leading Hadoop distributions, provides an easy-to-install Virtual Machine for getting started quickly on their platform. With it, someone can easily get a single-node CDH cluster running within a virtual environment. Users can use this VM for their own personal learning, for rapidly building applications on a dedicated cluster, or for many other purposes.

Apache Kudu is a columnar storage manager developed for the Apache Hadoop platform. Kudu shares the common technical properties of Hadoop ecosystem applications: it runs on commodity hardware, is horizontally scalable, and supports highly available operation.

The…


Steps to Install Spark2 on the Cloudera Quickstart VM

Cloudera, one of the leading Hadoop distributions, provides an easy-to-install Virtual Machine for getting started quickly on their platform. With it, someone can easily get a single-node CDH cluster running within a virtual environment. Users can use this VM for their own personal learning, for rapidly building applications on a dedicated cluster, or for many other purposes.

Out of the box, Cloudera ships Spark 1.6.x. It's a very stable and reliable version of Spark. However, Spark 2.0 was released in 2016, bringing with it exceptional improvements in features and performance. …
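
Once Spark 2 is installed, the quickest way to see the difference is the new unified entry point: SparkSession replaces the separate SQLContext/HiveContext from Spark 1.6. The sketch below assumes the Spark 2 pyspark shell or spark2-submit is on the path; the DataFrame contents are just placeholder data:

from pyspark.sql import SparkSession

# SparkSession is the single entry point in Spark 2.x
spark = SparkSession.builder.appName("spark2-smoke-test").getOrCreate()

# Placeholder data: a tiny DataFrame plus a SQL query against it
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "value"])
df.createOrReplaceTempView("sample")
spark.sql("SELECT COUNT(*) AS n FROM sample").show()

spark.stop()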

Robert Sanders

Director of Big Data and Cloud Engineering for Clairvoyant LLC | Marathon Runner | Triathlete | Endurance Athlete
