Data on EKS Platform

Tom Xiao
8 min read · Sep 25, 2023

--

Overview

Since its launch in 2006, the Hadoop ecosystem has gained widespread adoption. However, as big data components proliferate and usage grows, traditional big data platforms face mounting challenges, including inefficient deployment of new components, limited resource sharing between applications, and ineffective operations and maintenance. Because these architectures remain largely static, they not only add operational complexity but also keep costs predictably high.

Let me share our company’s experience as an example. Previously, our big data platform relied heavily on AWS services such as EMR and EC2. Because of high costs, complex maintenance, and other issues, over the past six months we gradually migrated all platform applications to EKS, reducing costs by 90% (data volumes also dropped by about 50% over the same period due to operational changes). By moving to EKS, we can now focus solely on building data applications, while day-to-day infrastructure operations are handled by our ops team under integrated management, reducing our investment in infrastructure.

We believe our experience could benefit any organization undertaking a containerization transition. Therefore, we plan to productize our big data platform and sell it on AWS Marketplace, making adoption easier.

Our platform leverages Kubernetes and other cloud native technologies, providing a simplified architecture compared to traditional Hadoop. By offering this platform on AWS Marketplace, we aim to help organizations realize the benefits we have seen — 90% cost reduction and simplified ops — when shifting big data to containers on Kubernetes.

With AWS handling infrastructure provisioning, customers can focus on data applications while our platform provides cluster deployment, autoscaling, monitoring, and more. We believe this turnkey approach will accelerate container adoption for big data use cases.

Please reach out if you would like to learn more about our journey or get early access when we launch on AWS Marketplace. We welcome any feedback as we work to bring a proven big data container platform to the broader cloud native community.

Architecture

Data on EKS Platform (DEP) is a Kubernetes-based, cloud-native big data platform developed by our company. Built from open source components, DEP integrates Hive, Spark, Flink, Trino, HDFS, Kafka, and more. It provides data development and BI platforms including Superset, Metabase, Jupyter, Zeppelin, and a custom IDE. DEP’s metadata management and data quality monitoring improve data quality, while data operations and user profiling enable data-driven capabilities. Ops management and permissions ensure security and stability.

DEP is an all-in-one platform enabling admins, developers, analysts, and business users to complete their daily tasks. Leveraging Kubernetes auto-scaling, self-healing, and simplified ops, DEP provides a managed big data environment. Companies can focus on data while DEP handles infrastructure, monitoring, access controls, and more. Compared to legacy Hadoop, DEP reduces costs while giving employees faster, easier access to analytics and insights.
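
To make the Kubernetes integration concrete, here is a minimal sketch of submitting a Spark workload against a Kubernetes API server, one common pattern for running Spark on Kubernetes; DEP’s internal setup may differ, and the endpoint, namespace, and image below are placeholders rather than DEP’s actual configuration.

```python
from pyspark.sql import SparkSession

# Minimal sketch of Spark on Kubernetes; the API server endpoint, namespace,
# and container image are placeholders, not DEP's real settings.
spark = (
    SparkSession.builder
    .appName("dep-smoke-test")
    .master("k8s://https://<your-eks-api-server>:443")            # hypothetical EKS endpoint
    .config("spark.kubernetes.namespace", "data-platform")        # hypothetical namespace
    .config("spark.kubernetes.container.image", "spark:3.5.0")    # any image with PySpark installed
    .config("spark.executor.instances", "2")
    .getOrCreate()
)

# A trivial job to confirm executor pods schedule on the cluster.
print(spark.range(1_000_000).selectExpr("sum(id) AS total").collect())
spark.stop()
```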

Product Features

DEP’s main functions include:

  • DEP Core, which contains scheduling center, IDE, metadata management, and data quality monitoring capabilities.
  • Nebula, the data requirement module, which helps users easily find, understand, and request access to datasets, and raise new data requests with data engineers or data analysts.
  • Data-driven capabilities including data operations and user profiling to generate insights.
  • New features currently in development:
      • Model development and deployment to support machine learning workflows, so that models built by data scientists can run in production applications at scale on DEP.
      • Large language model applications that allow conversational interaction with data and insights through chatbots and virtual assistants.

DEP aims to provide an integrated environment covering the full data lifecycle. Users can discover datasets, develop models and applications, deploy them at scale, monitor data quality, and enable self-service analytics. The platform leverages Kubernetes to provide flexibility, scalability, and automated management. We welcome customer feedback to prioritize additional capabilities that would maximize the value of DEP.

Scheduling Center

On DEP, we developed a scheduling center to handle thousands of scheduling tasks of various types across the platform. The scheduling center currently supports the following job types:

  • SYNC: Extract data from business databases (MySQL, PostgreSQL, etc.) into the data warehouse ODS layer. Supports full or incremental extraction, customizable data cleansing rules, and partitioning.
  • HIVE, SPARK, TRINO: Supports SQL jobs running on different compute engines.
  • JAVA: Supports execution of custom Jar packages.
  • SHELL: Supports execution of other custom jobs.
  • EXPORT: Exports ADS layer data back to business databases for online business queries.

[Screenshots: Overview Page, Status Page, Task Status and Logs]

Additionally, the scheduling center supports automatic lineage determination, DAG diagrams, job logging, and job alerts.

The scheduling center provides a unified interface to develop, monitor, and manage ETL pipelines, data integration, and processing workloads. By leveraging Kubernetes concepts like controllers, the system is scalable, resilient, and automates repetitive tasks. We plan to continue enhancing the scheduling center with machine learning capabilities for intelligent optimization and automation.
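
The scheduling center’s interfaces are internal to DEP, so purely as an illustration of how the job types above can be chained into a DAG with lineage, here is a hypothetical sketch; the Job class, field names, and pipeline below are ours, not DEP’s actual API.

```python
from dataclasses import dataclass, field

# Hypothetical job model for illustration only; DEP's real scheduler API may differ.
@dataclass
class Job:
    name: str
    job_type: str                       # SYNC | HIVE | SPARK | TRINO | JAVA | SHELL | EXPORT
    command: str                        # SQL text, jar path, shell command, or sync spec
    depends_on: list = field(default_factory=list)

jobs = [
    Job("ods_orders_sync", "SYNC",
        "mysql://shop/orders -> ods.orders (incremental by updated_at)"),
    Job("dwd_orders", "SPARK",
        "INSERT OVERWRITE dwd.orders SELECT ... FROM ods.orders",
        depends_on=["ods_orders_sync"]),
    Job("ads_daily_gmv", "TRINO",
        "INSERT INTO ads.daily_gmv SELECT ... FROM dwd.orders",
        depends_on=["dwd_orders"]),
    Job("export_daily_gmv", "EXPORT",
        "ads.daily_gmv -> mysql://report/daily_gmv",
        depends_on=["ads_daily_gmv"]),
]

# Lineage and the DAG view boil down to a topological ordering like this
# (assumes the dependency graph is acyclic).
def run_order(jobs):
    done, order = set(), []
    while len(order) < len(jobs):
        for job in jobs:
            if job.name not in done and all(d in done for d in job.depends_on):
                done.add(job.name)
                order.append(job.name)
    return order

print(run_order(jobs))  # ['ods_orders_sync', 'dwd_orders', 'ads_daily_gmv', 'export_daily_gmv']
```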

IDE

On DEP, we developed a custom IDE to provide ad-hoc querying, data development, and job approval workflows.

In our IDE, data developers can use Hive or Trino SQL to develop, test, and request deployment of jobs. Business analysts can query data themselves and download data once approved. Every query goes through Ranger and Solr for data authorization and auditing.

The IDE provides a collaborative environment for different user personas with built-in governance. Data engineers have the tools to develop pipelines and ETL jobs. Analysts have self-service access to data for exploratory analysis. Queries are logged and access controlled to meet security policies.
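
To illustrate the flow described above (authorize, audit, then execute), here is a deliberately simplified sketch; the check_access and audit_log functions are illustrative stand-ins and do not reflect the Ranger or Solr APIs that DEP actually calls.

```python
from datetime import datetime, timezone

# Illustrative stand-ins: in DEP, authorization goes through Apache Ranger and
# audit records are stored in Solr; these functions only mimic the flow.
def check_access(user: str, table: str) -> bool:
    allowed = {"analyst_amy": {"dwd.orders", "ads.daily_gmv"}}   # hypothetical policy
    return table in allowed.get(user, set())

def audit_log(user: str, sql: str, permitted: bool) -> None:
    ts = datetime.now(timezone.utc).isoformat()
    print(f"{ts} user={user} permitted={permitted} sql={sql!r}")

def run_query(user: str, sql: str, table: str) -> None:
    permitted = check_access(user, table)
    audit_log(user, sql, permitted)                  # every query is audited, allowed or not
    if not permitted:
        raise PermissionError(f"{user} is not authorized to read {table}")
    # ... hand the SQL to Hive or Trino for execution here ...

run_query("analyst_amy", "SELECT count(*) FROM ads.daily_gmv", "ads.daily_gmv")
```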

[Screenshot: IDE]

Key benefits include:

  • Unified interface for SQL, development, testing before deploying to production
  • Collaboration between engineers and analysts
  • Data discovery for business users
  • Approval workflows for access requests
  • Centralized data authorization, auditing, and lineage tracking

By leveraging the IDE, DEP customers can accelerate analytics development, empower users with self-service, and maintain oversight — all through a single pane of glass. We welcome additional feedback on IDE capabilities that would further enhance the user experience.

Metadata Management & Data Quality

Metadata management is a crucial part of building a data platform, enabling enterprises to maximize the value of their data assets. It helps business analysts, architects, data engineers, developers, and other stakeholders clearly understand what data the organization has, where it is stored, and how it is extracted, cleansed, and maintained, and it guides users in putting that data to work.

[Screenshot: Metadata Management]

A key benefit of metadata management is ensuring data quality. In DEP, we provide additional data quality monitoring capabilities, calculating metrics and alerts at the table, column and custom rule levels. This allows data teams to promptly discover and address anomalies.
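
As a concrete example of what table-level, column-level, and custom-rule checks look like, here is a small PySpark sketch; the table name, rule thresholds, and alert wording are examples, not DEP’s built-in rules.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("dq-checks").getOrCreate()
df = spark.table("dwd.orders")   # hypothetical warehouse table

# Table-level rule: the table must not be empty.
row_count = df.count()

# Column-level rule: the null rate of a key column must stay under a threshold.
null_rate = df.filter(F.col("order_id").isNull()).count() / max(row_count, 1)

# Custom rule: order amounts must be non-negative.
bad_amounts = df.filter(F.col("amount") < 0).count()

alerts = []
if row_count == 0:
    alerts.append("dwd.orders is empty")
if null_rate > 0.01:
    alerts.append(f"order_id null rate {null_rate:.2%} exceeds 1%")
if bad_amounts > 0:
    alerts.append(f"{bad_amounts} rows have a negative amount")

# In DEP these alerts would be routed to the monitoring and alerting channel.
print(alerts or "all checks passed")
```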

[Screenshot: Data Quality]

With DEP’s metadata management, companies can:

  • Document data sources, schemas, ETL pipelines, dependencies
  • Discover data to understand availability across silos
  • Track lineage to trust data provenance
  • Manage governance policies like security, access
  • Monitor data quality with automated alerts
  • Guide and train end users on datasets

Strong metadata management and data quality monitoring provide the foundation for reliable analytics, trusted insights, and data democratization. We welcome additional perspectives on how DEP can further enhance metadata and governance capabilities to maximize value.

Request Center

We developed Request Center, a data requirements platform, to manage business demands. Business users can submit requests for data applications, dashboards, and data warehouse development. Data developers can promptly follow up and process these requests on the platform.

The data requirements platform strengthens the connection between business and data teams, reducing communication barriers.

Key benefits include:

  • Central platform for data requests across the organization
  • Standardized intake forms and workflows
  • Transparency into request status and handling
  • Prioritization based on business needs
  • Collaboration between business and technical teams
  • Audit trail of requirements gathering and delivery

[Screenshot: Submit Data Requests]

Request Center allows business users to easily log and track requests, ensuring their needs are captured and worked on. For data teams, it provides a single workflow to manage incoming tasks rather than ad-hoc emails and messages.

By streamlining data requirements gathering through the Request Center, organizations can align business stakeholders and data teams, accelerate analytics delivery, and use data more strategically. We welcome additional input on how to further enhance and leverage the platform.

Other Features

Data Operations & User Profiling

Data operations and user profiling empower business users to apply data quickly and appropriately to support online business operations.

Permission Center

On DEP, we have defined hundreds of granular permissions that can be combined to assign appropriate roles to members.

For example, data engineers may have access to:

  • Create/edit tables, views in Hive
  • Develop ETL jobs with Sqoop, Spark, etc
  • Schedule and run jobs in production

Analysts may have permissions like:

  • Read access to certain tables and views
  • Ability to run ad-hoc queries in the IDE
  • Export limited result sets

Fine-grained authorization allows implementing least privilege access tailored to each user’s responsibilities. Complex permissions can be abstracted into reusable roles mapped to job functions.
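
As an illustration of how granular permissions compose into reusable roles, consider the sketch below; the permission identifiers are invented for the example and are not DEP’s actual permission names.

```python
# Hypothetical permission identifiers; DEP defines hundreds of finer-grained ones.
DATA_ENGINEER = {
    "hive.table.create", "hive.table.alter",
    "etl.job.develop", "etl.job.schedule", "etl.job.run.prod",
}

ANALYST = {
    "hive.table.read:dwd.*", "hive.table.read:ads.*",
    "ide.query.adhoc", "ide.result.export.limited",
}

def can(role_permissions: set, permission: str) -> bool:
    return permission in role_permissions

print(can(ANALYST, "ide.query.adhoc"))    # True
print(can(ANALYST, "etl.job.run.prod"))   # False: least privilege in action
```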

Key benefits include:

  • Align access with organizational data security policies
  • Prevent unauthorized or unintentional data access
  • Simplify user management by bundling permissions into roles
  • Gain visibility into who can access what data
  • Streamline auditing for compliance

DEP’s flexible and granular permission model balances security with enabling self-service analytics.

Configuration Center

On DEP, we provide platform configuration capabilities through a Parameter Center where data administrators can manage table naming conventions, job labels, table tags, and other custom rules. Configuration changes take effect across the system in real time.
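
For instance, a table naming convention can be modeled as a simple rule the platform evaluates whenever a table is created; the pattern below is illustrative only, not DEP’s default convention.

```python
import re

# Illustrative convention: <layer>_<domain>_<entity>, where layer is ods/dwd/dws/ads.
TABLE_NAME_RULE = re.compile(r"^(ods|dwd|dws|ads)_[a-z0-9]+_[a-z0-9_]+$")

def validate_table_name(name: str) -> bool:
    return bool(TABLE_NAME_RULE.match(name))

print(validate_table_name("dwd_trade_orders"))   # True
print(validate_table_name("OrdersTemp"))         # False: rejected by the convention
```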

Product Timeline

Pricing

Pricing Models

We offer annual and monthly subscription plans with one-click deployment through AWS CloudFormation for a ready-to-use solution.
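
The one-click path runs through the AWS Marketplace console; for teams that script their infrastructure, the equivalent call looks roughly like the boto3 sketch below, where the stack name, template URL, and parameters are placeholders rather than the product’s actual template.

```python
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

# The template URL and parameters are placeholders; the real values come with
# the AWS Marketplace listing.
cfn.create_stack(
    StackName="dep-platform",
    TemplateURL="https://example-bucket.s3.amazonaws.com/dep-template.yaml",
    Parameters=[
        {"ParameterKey": "ClusterName", "ParameterValue": "dep-eks"},
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],
)
```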

Subscription Plans

We provide a single unified edition for all customers: the monthly fee is $800 and the annual fee is $8,000 (the equivalent of two months free compared with paying monthly).

Value-Added Services

We provide additional services like training, technical support, and can customize exclusive solutions based on user needs.

Key pricing details:

  • Predictable subscription model avoids variable cloud costs
  • Annual contract offers discounted rate with one yearly payment
  • Monthly subscriptions allow flexibility to start small and expand
  • Simple per-node pricing, no hidden fees or extra charges
  • Add-on services available for custom integrations, training, support

By leveraging our AWS Marketplace product, customers can deploy DEP quickly with minimal upfront costs. We aim to provide a high-value and scalable platform enabling organizations to realize fast time-to-value. Please reach out for customized pilots, proofs of concept, or pricing discussions.

For more information about DEP, please feel free to contact tom.xiao@hotmail.com if you are interested.

We will provide updates on product development progress and usage instructions through additional blog posts and announcements.

We welcome you to try out a demo, sign up for early access once launched on AWS Marketplace, or engage in a pilot project to see how DEP can empower your data teams and organization.
