How Are Big Companies Using Apache Spark?

The big data marketplace keeps growing, and competition within it has reached a whole new level. That is why open source technologies like Hadoop, Spark, and Flink must find valuable use cases to stay on top of it. Each of them is trying to offer a new approach to tackling problems, and offering something its rivals do not is exactly what Apache Spark is all about!

Early adopters tend to concentrate on technical features and capabilities. Application development picks up once companies become confident in a technology's reliability and scalability on larger data volumes. Spark, with over 90 contributors from 25 companies, fulfills these needs. So let us look at how big companies are using Apache Spark and how successful it has been to date.

Spark’s Common Use Cases

Companies rely heavily on a wide variety of data sources for their analytical products. Their data processing workflows include cleaning, transforming, and fusing unstructured external data with internal data sources. Spark is proving especially useful for successful startups. Some companies have also created simple user interfaces that open up batch data processing tasks to non-programmers.
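
To make the clean, transform, and fuse steps concrete, here is a minimal sketch in plain Python rather than Spark itself, so it runs anywhere. The record layouts, field names, and the `clean` and `fuse` helpers are all hypothetical and chosen only for illustration; in a real Spark job the same idea would typically be expressed as transformations followed by a join.

```python
from typing import Optional

# Hypothetical sample data: "internal" records keyed by customer id and
# messy "external" records from a third-party source.
internal = {
    "c1": {"name": "Ada", "plan": "pro"},
    "c2": {"name": "Grace", "plan": "free"},
}
external_raw = [
    {"customer": " C1 ", "city": "london"},
    {"customer": "c2", "city": None},    # missing value
    {"customer": "", "city": "Paris"},   # unusable: no key to join on
]


def clean(record: dict) -> Optional[dict]:
    """Normalize the join key and drop records that cannot be joined."""
    key = (record.get("customer") or "").strip().lower()
    if not key:
        return None
    city = record.get("city")
    return {"customer": key, "city": city.title() if city else "unknown"}


def fuse(internal_rows: dict, external_rows: list) -> list:
    """Clean external rows, then join them to internal rows on customer id."""
    fused = []
    for raw in external_rows:
        row = clean(raw)
        if row is None or row["customer"] not in internal_rows:
            continue
        fused.append(
            {"id": row["customer"], **internal_rows[row["customer"]], "city": row["city"]}
        )
    return fused


result = fuse(internal, external_raw)
```

Here `result` holds one fused record per customer that appears in both sources; the record with an empty key is dropped during cleaning.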


The best-known components of BDAS (the Berkeley Data Analytics Stack) are Spark and Shark, but Spark Streaming (real-time processing) and PySpark (the Python API) are also in the running. The key feature of Spark Streaming is that code written for batch processing can be reused for real-time computations with minor tweaks, which improves programmer productivity. Because of this, many companies have started using Spark Streaming for applications such as stream mining, real-time scoring of analytic models, and network optimization. CloudPhysics, for example, uses Spark Streaming to detect patterns and anomalies. Reportedly, 52% of companies prefer Apache Spark for real-time streaming.
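
The batch and streaming code reuse described above can be illustrated with a small sketch. This is plain Python, not the Spark Streaming API: the `count_errors` and `micro_batches` functions and the log format are assumptions made up for this example. The point is only that one piece of business logic can serve both a full data set and a stream that has been cut into small batches, which is the general idea behind Spark Streaming's micro-batch model.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def count_errors(lines: Iterable[str]) -> Counter:
    """Shared business logic: count ERROR lines per service name."""
    counts = Counter()
    for line in lines:
        level, service, _message = line.split(" ", 2)
        if level == "ERROR":
            counts[service] += 1
    return counts


def micro_batches(stream: Iterable[str], size: int) -> Iterator[List[str]]:
    """Group an incoming stream into small fixed-size batches."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


# Hypothetical log lines in the form "<LEVEL> <service> <message>".
log = [
    "ERROR auth token expired",
    "INFO web request served",
    "ERROR web upstream timeout",
    "ERROR auth bad password",
]

# Batch mode: run the logic once over the full data set.
batch_result = count_errors(log)

# Streaming mode: run the same function on each micro-batch and merge.
stream_result = Counter()
for chunk in micro_batches(iter(log), size=2):
    stream_result.update(count_errors(chunk))
```

Both modes call the same `count_errors` function; only the code that feeds it data differs, which is where the "minor tweaks" go.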


Spark's advantages have always helped it attract users. It is faster than Hadoop and far better suited to iterative computations, which are especially common in advanced analytics. From early on, companies started writing their own Spark libraries for regression, classification, and clustering. Problems in online advertising and marketing, fraud detection, and scientific research are being solved with Spark tools and libraries, and it is becoming easier to develop such libraries for graph and machine-learning analytics. Approximately 64% of companies use Apache Spark for advanced analytics.


Interactive analysis is one of the most important needs of any company. While MPP databases and open source SQL-on-Hadoop solutions such as Shark and Impala are gaining traction, companies have also started using Shark and BlinkDB for interactive SQL analysis. Many companies follow this general approach, and some have developed custom interactive dashboards powered by Spark and Shark. Companies now also use visual analysis tools like Tableau together with Shark, which works better than static reports and query analysis alone. More than 91% of companies use Apache Spark because of its performance gains.

Why are big companies switching over to Apache Spark?


Yahoo is already running projects on Apache Spark successfully. Yahoo, itself a web search engine, has one such project for personalization, which offers the right content to the right visitor. At the heart of this project are machine learning algorithms that identify individual visitors and their interests, which helps in serving the news they love to read or watch. So when users visit Yahoo, they are shown what they like. Such precise personalization needs real-time processing power and high speed, and Apache Spark provides both!


A startup called ClearStory recently built a platform that lets users fuse data from multiple sources in no time! It also produces interactive visualizations. The image below explains it further:

Image for post

In the finance industry, banks are using Spark as an alternative to Hadoop. Spark is used in particular to access and analyze social media profiles, call recordings, emails, and so on. This helps banks make sound business decisions on targeted advertising, customer segmentation, and credit risk assessment.


A financial institution with retail banking and brokerage operations has been using Apache Spark and has reduced its customer churn by a whopping 25%. Its platform is divided into retail, banking, trading, and investment. To get a 360-degree view of each customer, the bank uses Apache Spark as a unifying layer, and it now automates analytics with machine learning. The data in each customer repository is accessed and correlated into a single customer file, which is then forwarded to the marketing department.
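
The correlation step can be sketched roughly as follows. This is a plain Python illustration, not the bank's actual pipeline, which the source does not describe: the repository contents, the `customer_id` key, and the `build_customer_files` helper are all hypothetical. It only shows the idea of folding records from several divisions into one file per customer.

```python
from collections import defaultdict

# Hypothetical per-division repositories; field names are illustrative only.
repositories = {
    "retail": [{"customer_id": 7, "card_spend": 1200.0}],
    "banking": [
        {"customer_id": 7, "balance": 5400.0},
        {"customer_id": 9, "balance": 80.0},
    ],
    "trading": [{"customer_id": 9, "trades_last_90d": 0}],
    "investment": [{"customer_id": 7, "portfolio_value": 25000.0}],
}


def build_customer_files(repos: dict) -> dict:
    """Correlate rows from every division into one file per customer id."""
    profiles = defaultdict(dict)
    for division, rows in repos.items():
        for row in rows:
            cid = row["customer_id"]
            for field, value in row.items():
                if field != "customer_id":
                    # Prefix each field with its division to avoid name clashes.
                    profiles[cid][f"{division}.{field}"] = value
    return dict(profiles)


customer_files = build_customer_files(repositories)
```

Each entry in `customer_files` is a single combined view of one customer, which is the kind of file that could then be handed to a marketing team.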


Another financial institution uses Apache Spark to analyze the text of regulatory filings. It also analyzes its competitors' reports, which helps in discovering patterns in what is happening in the market and among the competition.


Another multinational financial institution has implemented a real-time monitoring application that runs on Apache Spark and the MongoDB NoSQL database. The application helps the bank monitor clients' activity and identify issues. Apache Spark also works well for the risk-based assessment that financial institutions perform.

As we all know, e-commerce businesses are growing fast, and real-time information is immensely important to them. This information can be passed to streaming clustering algorithms, for example the K-means clustering algorithm. The results are then combined with sources like social media profiles, comments, product reviews, recent searches, etc.
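
As a rough illustration of clustering on a stream, here is a minimal sketch of a sequential (online) K-means update in plain Python. It is a simplified stand-in, not Spark's own streaming K-means implementation: the `OnlineKMeans` class, the starting centers, and the sample events are assumptions made for this example. Each incoming point is assigned to its nearest center, and that center is moved to the running mean of the points assigned to it so far.

```python
from typing import List, Sequence


def nearest(centers: List[List[float]], point: Sequence[float]) -> int:
    """Return the index of the center closest to the point (squared distance)."""
    def sq_dist(center: Sequence[float]) -> float:
        return sum((c - x) ** 2 for c, x in zip(center, point))
    return min(range(len(centers)), key=lambda i: sq_dist(centers[i]))


class OnlineKMeans:
    """Sequential K-means: update one center per incoming point."""

    def __init__(self, initial_centers: Sequence[Sequence[float]]):
        self.centers = [list(c) for c in initial_centers]
        self.counts = [0] * len(self.centers)

    def update(self, point: Sequence[float]) -> int:
        i = nearest(self.centers, point)
        self.counts[i] += 1
        # A step size of 1/n keeps the center at the mean of its assigned points.
        rate = 1.0 / self.counts[i]
        self.centers[i] = [c + rate * (x - c) for c, x in zip(self.centers[i], point)]
        return i


# Hypothetical two-dimensional events arriving one at a time.
model = OnlineKMeans([[0.0, 0.0], [10.0, 10.0]])
events = [[1.0, 1.0], [9.0, 11.0], [3.0, 1.0], [11.0, 9.0]]
assignments = [model.update(event) for event in events]
```

Because each update touches only one center and needs no pass over past data, this style of algorithm suits data that arrives continuously.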


As most of us know, Alibaba is the largest e-commerce platform globally. Surprisingly, it also runs some of the largest Apache Spark jobs in the world! Some of these jobs analyze thousands of petabytes of data, while others perform extraction on image data. Every user interaction at Alibaba is represented in a large graph, and Apache Spark is used to process it quickly and derive precise results.


Another well-known e-commerce giant, eBay, uses Spark to target specific offers in its marketing and to enhance customer experiences. At eBay, Apache Spark is leveraged through Hadoop YARN, which manages all cluster resources and can run generic tasks. Via YARN, eBay's Spark users leverage Hadoop clusters in the range of 2,000 nodes, 20,000 cores, and 100 TB of RAM.


With such progressive companies using Apache Spark to support business development and deliver better client services, Apache Spark certainly has a bright future!

