evaStudio v0.1

evaStudio
4 min read · Jul 21, 2022


Five developers, four weeks, very little sleep. Our team is proud to present evaStudio as part of the OS Labs Tech Accelerator.

What is evaStudio?

evaStudio (beta) is a web GUI and testing playground for designing and scaling a real-time streaming data pipeline with Kafka brokers and ZooKeeper.

Our application aims to simplify the process of developing, experimenting, orchestrating, and monitoring machine learning workflows at scale, so data scientists starting on a new project can focus on analytical challenges instead of infrastructure.

What is Kafka?

Apache Kafka is an open-source stream-processing platform and distributed event store, used by over 80% of Fortune 100 companies to manage real-time data flowing between microservices in enterprise architectures. Kafka provides a unified, high-throughput, low-latency platform for handling real-time data feeds. It achieves this by using an efficient binary TCP-based protocol that batches messages together to reduce network overhead.

By relying on sequential disk I/O, Kafka achieves much higher throughput than other message brokers like RabbitMQ. Sustaining millions of messages per second even with limited resources, Kafka has become a standard part of modern “big data” infrastructure.
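The storage idea behind that throughput can be sketched in a few lines. This toy `AppendOnlyLog` class is our own illustration, not Kafka's actual on-disk format: each batch of messages is written sequentially to the end of a single file, and any record can be read back later by its byte offset.

```python
import os
import struct
import tempfile

class AppendOnlyLog:
    """Toy illustration of Kafka-style storage: records are appended
    sequentially to one file and read back by byte offset."""

    def __init__(self, path):
        self.path = path
        self.offsets = []  # byte offset of each record in the file

    def append_batch(self, messages):
        """Write a whole batch in one sequential pass, mirroring how
        producers group messages to cut per-message overhead."""
        with open(self.path, "ab") as f:
            for msg in messages:
                data = msg.encode("utf-8")
                self.offsets.append(f.tell())
                # Length-prefixed framing: 4-byte big-endian size, then payload.
                f.write(struct.pack(">I", len(data)) + data)

    def read(self, index):
        """Seek directly to a record using its stored offset."""
        with open(self.path, "rb") as f:
            f.seek(self.offsets[index])
            (length,) = struct.unpack(">I", f.read(4))
            return f.read(length).decode("utf-8")

log = AppendOnlyLog(os.path.join(tempfile.mkdtemp(), "topic-0.log"))
log.append_batch(["event-1", "event-2", "event-3"])
print(log.read(1))  # event-2
```

Because writes only ever touch the end of the file, the disk head (or SSD write path) moves sequentially, which is exactly the access pattern Kafka exploits.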

EDA Playground

Increasingly, professional data science work involves building and generating insights from real-time pipelines. Data scientists need tools to write streaming analytics in a test environment, on a small scale. While there are many platforms available for Exploratory Data Analysis for batch data, there are far fewer platforms for experimenting with real-time streaming data.

Some of our favorite data scientists have created simple ways to create “mock” streaming data with Apache Kafka for analysis in a Jupyter notebook or by using Spark and Faker. We wanted to provide a simple GUI tool for diagramming a streaming data pipeline, and spinning up several connected Kafka clusters, along with a SQL database, JupyterHub, and other processing engines like Spark, Splunk, or Storm.
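The “mock” streaming idea is simple enough to sketch with the standard library alone. The `mock_stream` generator below is a hypothetical helper of our own, not part of evaStudio's API: it replays CSV rows as a stream of dicts, optionally looping forever to keep a test topic busy.

```python
import csv
import io
import itertools
import time

def mock_stream(csv_text, interval=0.0, loop=False):
    """Replay CSV rows as a stream of dicts, one row per `interval`
    seconds. With loop=True the data is replayed indefinitely."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    source = itertools.cycle(rows) if loop else iter(rows)
    for row in source:
        yield row
        time.sleep(interval)

sample = "sensor,reading\na,1.5\nb,2.5\n"
events = list(mock_stream(sample))
print(events[0])  # {'sensor': 'a', 'reading': '1.5'}
```

In a real setup, each yielded row would be serialized and handed to a Kafka producer instead of collected into a list.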

We aim to provide a simple tool for architecting robust machine learning infrastructure at a smaller scale. Our browser-based application provides a friendly GUI for monitoring cluster health, checking performance metrics for load, latency, throughput, disk usage, and messages/bytes in and out per second.

Machine learning life cycle

During the initial phases of the machine learning lifecycle, our tool helps you train, test, verify, and monitor your model before deploying to production, so you can gather the metrics needed to scale with real-time data.

Features:

  • Design your pipeline using a drag-and-drop tool for extracting, transforming, and loading data in and out of Kafka messaging brokers.
  • Source streaming data from a CSV, SQL database, or HTTP/API, to Apache Spark or Jupyter Notebook for analysis.
  • Manage topics and messages directly inside our web GUI.
  • Monitor cluster health, and check performance metrics for load, latency, throughput, disk usage, and messages and bytes in and out per second, in order to test and scale your data architecture before loading into production.
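To make the messages-per-second and bytes-per-second metrics concrete, here is a minimal rolling-window meter. It is a conceptual sketch of how such a number can be computed, not the implementation behind our dashboard (which reads broker metrics via JMX):

```python
import time
from collections import deque

class ThroughputMeter:
    """Rolling-window meter for messages/sec and bytes/sec."""

    def __init__(self, window=60.0):
        self.window = window    # seconds of history to keep
        self.samples = deque()  # (timestamp, message size in bytes)

    def record(self, n_bytes, now=None):
        now = time.monotonic() if now is None else now
        self.samples.append((now, n_bytes))
        # Evict samples that have fallen out of the window.
        while self.samples and now - self.samples[0][0] > self.window:
            self.samples.popleft()

    def rates(self, now=None):
        """Return (messages/sec, bytes/sec) averaged over the window."""
        now = time.monotonic() if now is None else now
        live = [(t, b) for t, b in self.samples if now - t <= self.window]
        return len(live) / self.window, sum(b for _, b in live) / self.window

meter = ThroughputMeter(window=10.0)
for t in range(5):                  # five 100-byte messages, one per second
    meter.record(100, now=float(t))
print(meter.rates(now=5.0))         # (0.5, 50.0)
```

The explicit `now` argument keeps the example deterministic; in live use you would omit it and let the meter read the monotonic clock.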

You can also connect your existing Kafka clusters on AWS, provided that you follow our setup instructions for Prometheus, JMX, and Grafana installation.
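For a sense of what that setup involves, the usual pattern is to attach the Prometheus JMX exporter to each broker as a Java agent and point a Prometheus scrape job at it. The paths, port, and file names below are illustrative placeholders; our setup instructions are the authoritative source for the exact versions and locations.

```shell
# Expose each Kafka broker's JMX MBeans as Prometheus metrics on port 7071.
# /opt/jmx_prometheus_javaagent.jar and the config path are placeholders.
export KAFKA_OPTS="-javaagent:/opt/jmx_prometheus_javaagent.jar=7071:/opt/kafka-jmx-config.yml"

# Then add a scrape job to prometheus.yml, e.g.:
#   scrape_configs:
#     - job_name: kafka
#       static_configs:
#         - targets: ['broker-1:7071']
```

Grafana then reads from Prometheus as a data source to render the dashboards.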

Challenges We Faced

This was an extremely complex project to deploy in a short time frame. Given the technical complexity of integrating with Kafka, Kafka Connect, and Kafka Streams, we spent much of our initial time understanding the technology and designing a novel solution to existing problems.

We faced many hurdles with incompatible libraries, and kafkaesque bugs, as we learned how to stream data from our Node.js server to the various nodes of our Java Spring Boot Kafka microservice. We also had to learn Java and Spring in a short time in order to extract data with Kafka Connect.

Merriam-Webster Dictionary: “Kafkaesque”

Next Steps

Our current iteration on GitHub requires cloning the repository and running it with Node and Maven (for the Java Spring Boot service). We will also be making our application available on DockerHub, and we are considering other ways of making our tool more accessible as a web platform or desktop application.

Currently, we allow data to be imported as a CSV and “mock-streamed” for real-time analytics. We also connect to a PostgreSQL database, and allow data to be streamed from an HTTP or API endpoint. We provide the option of JupyterHub or Spark for analytics.

In future iterations, we hope to allow for data transformations, mapping/reducing/filtering, and connections to additional microservices through the GUI tool.
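As a sketch of what such transformations could look like, the hypothetical `transform` helper below (our own illustration, not a planned API) applies optional filter and map steps lazily over any generator-based stream, so stages compose like a pipeline:

```python
def transform(stream, *, map_fn=None, filter_fn=None):
    """Lazily apply an optional filter and map step to a stream of
    records, so transforms compose with any generator-based source."""
    for record in stream:
        if filter_fn is not None and not filter_fn(record):
            continue  # drop records the filter rejects
        yield map_fn(record) if map_fn is not None else record

readings = iter([{"sensor": "a", "value": 3}, {"sensor": "b", "value": 9}])
hot = transform(
    readings,
    filter_fn=lambda r: r["value"] > 5,          # keep only high readings
    map_fn=lambda r: {**r, "alert": True},       # tag survivors
)
print(list(hot))  # [{'sensor': 'b', 'value': 9, 'alert': True}]
```

Because each stage is a generator, records flow through one at a time, which matches the streaming model rather than batch processing.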

Team Photo: streaming over Zoom!

We just hit our stride, and are only getting started!

If you’re interested in developing this idea further with us, please join us in contributing to this open source project.
