DIY Data Lake using SBCs and Ceph

Arko Basu
5 min read · Jan 6, 2024

Welcome to my inaugural journey into the world of Medium blogging. Today, we embark on an adventure that hopes to unveil the hidden potential of the OrangePi, a low-cost alternative to the Raspberry Pi among single-board computers (SBCs).

For those unfamiliar with the OrangePi 3B, it's a single-board computer that packs a punch in terms of performance and versatility. However, harnessing its full capabilities can be a bit of a daunting task, especially for those venturing into the realm of embedded systems from an entry-level perspective (like myself), and also because its community is smaller than those of its competitors. This blog series aims to demystify the process, hoping to help beginners (like myself) get started quickly with this amazing product, and to offer some valuable insights for seasoned tech aficionados looking to build a data lake for their personal projects on bare metal.

Prompt-guided image for a DIY data lake, generated using DALL-E 3

Motivation

The OrangePi 3B's versatility comes from its hardware: a quad-core 64-bit Cortex-A55 processor that can reach a clock speed of up to 1.8GHz, a built-in Mali-G52 GPU, and an NPU (AI accelerator) with 0.8 TOPS of computing power, all with very low power consumption and high performance. It additionally supports an SD, an M.2 NVMe and an eMMC slot, all for less than 50 USD (for its highest-configuration 8GB RAM version, at the time of writing). This makes it a strong candidate for data-intensive applications and/or AI/ML applications on the Edge. You might want to build an NFS server for file sharing in your home network, or run a self-hosted bare-metal Kubernetes cluster with WordPress backed by persistent storage, or even build a home automation server with open-source projects. The possibilities are endless.

Even though the OrangePi 3B has NPU and VPU chips that support AI and 4K video encoding/decoding applications, these are not supported on Linux-based distributions, so we are not going to cover those aspects in this blog series. If you are interested in such applications, you can try Android OS, which has both NPU and VPU support with OpenGL/OpenCL/Vulkan.

I, for example, wished to do some experimental testing of machine learning applications on my local machine (Apple M2 13 inch 8GB), and run some applications on Kubernetes at the Edge, without having to worry about storage or compute limitations.

In this blog series I will not focus on compute considerations, beyond how I plan to use the OPI-3B for Edge applications. The focus is instead on storage considerations.

I personally didn't want to use a cloud-based storage service, primarily because I wanted to have full ownership of my data, with the ability to tightly monitor and audit access control policies, even if it meant I had to be responsible for its infrastructure, resiliency and availability. Cloud-based storage can also be very expensive at a large scale when you are working with multi-modal machine learning models at an experimental stage, not so much for the storage itself as for the cost of moving that data across a virtual network in the cloud. And since I didn't want to acquire expensive hardware during my experimentation phase, I was looking for low-cost alternatives.

My overall objective was to have an inexpensive but robust and highly scalable self-hosted data lake that can seamlessly serve the storage needs of different applications/workloads (Object/Block/File-System) that run within a home-lab network.

This is not a production-grade system (or a recommendation for one). This is only an experimental setup for a home-lab that has a low cost of entry and a highly scalable and available distributed storage infrastructure.

Problem Statement

The absence of readily available, budget-friendly resources and comprehensive guides tailored to DIY home-automation or machine learning enthusiasts hinders the adoption of distributed storage solutions such as Ceph and of cheap, programmable DIY SBCs such as Orange Pis. People (like myself) aspiring to build a reliable, low-cost data server, or applications that run on the Edge with small compute but high storage needs, encounter obstacles in hardware selection, software configuration, and optimizing performance within budget constraints.

This multi-part blog series hopes to address these challenges and provide a generic guide to harnessing the benefits of inexpensive SBCs and open-source distributed storage solutions like Ceph, so that users can build self-hosted data centers on low-cost yet high-performance computer systems without breaking the bank.

Outline

At a high level, this blog series will be split into 9 topics, each with its own objective, so that together they fit into a broader narrative while each can also be used individually as a reference as needed:

  1. Part — 1: Inexpensive DIY NAS & VS Code Server on ARM64 SBCs
  2. Part — 2: Docker Run llama-2 models on an OrangePi 5B using llama.cpp
  3. [Part 3] What/Why/Where Ceph — An introduction to the world of high-availability distributed storage infrastructure.
  4. [Part 4] Building a low cost Private Cloud on bare-metal with dedicated networking, compute and distributed storage for your home-lab
  5. [Part 5] Self-host your Wordpress Site in less than 30 min — Using Kubernetes bare-metal and Rook/Ceph for Persistent Storage.
  6. [Part 6] Deploy Kubernetes on OrangePi 3B at Edge for Time-Series anomaly detection.
  7. [Part 7] Self-Host your own MLDC Platform — Using Canonical Products to deploy Kubeflow on a bare-metal multi-node Kubernetes with Ceph as Persistent Storage and Rook to enable Dynamic Provisioning.
  8. [Part 8] Expose bare-metal Kubernetes applications using HAProxy and a VLAN from your home-lab.
  9. [Part 9] Intro to planning and designing a High Availability system for your Home-lab using Proxmox.

Not all topics are directly related to Ceph or the OrangePi 3B specifically. They do, however, provide an experimental platform (with other hardware covered in each topic) to support stateless and fault-tolerant infrastructure for a DIY home-lab that should otherwise be very inexpensive to set up.

This blog series is still being written as I do my testing. Please don't hesitate to reach out if you want me to cover specific topics or subject areas. Also, since I am new to this, I would really appreciate any suggestions, recommendations or corrections.
