The Yandex BI tool DataLens is now open source

Gadzhi Gadzhiev
Yandex
7 min read · Sep 26, 2023

Today, we published the source code of Yandex DataLens, a service for data analysis and visualization, on GitHub under the Apache 2.0 license. From now on, anyone can use the open-source version of DataLens on any infrastructure.

My name is Gadzhi Gadzhiev. Pavel Dubinin and I are responsible for the development of DataLens here at Yandex Cloud. Today, we’re going to tell you what DataLens helps users do, what new opportunities going open source creates, and which features you can deploy right away.

What is DataLens?

DataLens is a BI tool that allows you to connect to a data source, define a dataset, create visualizations, build a dashboard, and share the results with your team.

From day one, we’ve been developing the product as both an internal Yandex tool and as a cloud-native solution for Yandex Cloud customers.

Today, over 35,000 people across all our services, including search, the ride-hailing app, the e-commerce platform, maps, audio/video streaming services, and many others, work with data on an ongoing basis, and DataLens helps them all with their analytics.

Each service has its own products, data, and hypotheses to test, so we aimed to build a versatile tool. To achieve data democratization in such a huge tech company, we focused on simplicity and speed rather than manual customization. For internal users, the key advantage was integration with YTsaurus, a big data system, and ClickHouse, an analytical DBMS, which serve as our primary sources for analytics. Other BI tools cannot work with them as effectively.

Within Yandex Cloud, DataLens as a service hosts more than 100,000 instances — separate environments of the service isolated for each client. They are run by companies of all sizes and from a variety of industries: from small tech startups to big banks and national retail chains.

We developed DataLens as a sort of “smart” query generator that connects to a multitude of data sources and provides interactive visualizations. Importantly, DataLens does not store any data; instead, it accesses databases directly. Its data sources can be external databases hosted in another cloud or on premises.
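To make the "query generator" idea concrete, here is a minimal Python sketch. It is not DataLens code: `ChartSpec`, `build_query`, and `run_chart` are hypothetical names invented for this illustration, and SQLite from the standard library stands in for a real data source. It only shows the general pattern of turning a chart description into one aggregate query that runs directly against the source, with no data copied into the tool.

```python
import sqlite3
from dataclasses import dataclass

# Hypothetical whitelist; a real tool would support far more functions.
ALLOWED_AGGREGATIONS = {"SUM", "AVG", "MIN", "MAX", "COUNT"}


@dataclass(frozen=True)
class ChartSpec:
    """Hypothetical chart description: what to group by and what to aggregate."""
    table: str
    dimension: str
    measure: str
    aggregation: str = "SUM"


def quote_ident(name: str) -> str:
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def build_query(spec: ChartSpec) -> str:
    """Translate a chart spec into a single aggregate SELECT statement."""
    agg = spec.aggregation.upper()
    if agg not in ALLOWED_AGGREGATIONS:
        raise ValueError(f"unsupported aggregation: {spec.aggregation}")
    dim = quote_ident(spec.dimension)
    return (
        f"SELECT {dim}, {agg}({quote_ident(spec.measure)}) AS result "
        f"FROM {quote_ident(spec.table)} GROUP BY {dim} ORDER BY {dim}"
    )


def run_chart(conn: sqlite3.Connection, spec: ChartSpec) -> list[tuple]:
    """Run the generated query against the source; nothing is stored locally."""
    return conn.execute(build_query(spec)).fetchall()


def demo() -> list[tuple]:
    """Build a tiny in-memory source and chart total order amount per city."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (city TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("Berlin", 10.0), ("Berlin", 5.0), ("Paris", 7.5)],
    )
    return run_chart(conn, ChartSpec("orders", "city", "amount"))
```

In this sketch, the aggregation happens inside the database, so only the grouped rows needed for the chart travel back to the tool.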

Thanks to its architecture and rich visualization toolkit, DataLens can help with a variety of use cases, ranging from creating ad-hoc charts based on metrics to building large-scale dashboards with geo layers where data can be placed on a map and compared. Of course, much depends on the quality of data. That is why we offer integration with ETL and data preparation tools in our cloud-native version.

Why use DataLens?

DataLens is a versatile solution for a wide range of data analysis and visualization tasks. It enables the creation of dashboards for monitoring key business metrics and provides collaborative access to analytics. In 2023, the number of DataLens users on the cloud platform tripled, with tens of thousands of individuals relying on the tool for tasks spanning the retail, fintech, and IT sectors.

The experience of our internal customers could benefit other companies, so we are thrilled to share their use cases as well. Here is an interesting story from our ride-hailing and e-commerce businesses. When DataLens emerged, the teams first migrated to it their reports for mass roles (support, outsourcers, warehouse workers). This proved cost-effective and convenient. A little later, both teams recognized the need to migrate, and within three months, they moved over 700 reports for 4,000 users to DataLens.

These and other public case studies help new DataLens users to understand how they can use the service for their own needs. So, we are not just showcasing opportunities but building a community around the product where you can discuss your challenges and find a solution by talking to peers.

Why build a community?

From the project’s inception, we have focused on nurturing BI expertise: the more users understand how to solve their tasks with DataLens, the more interesting scenarios emerge. This propels the project forward and gives momentum to the industry overall. With this in mind, we launched several educational initiatives: we ran the Data Yoga BI marathon and several hackathons, and created data analysis courses. Since 2020, our DataLens community has been steadily growing and now boasts over 6,500 members.

Our center of expertise has already shifted towards advanced users: custom solutions and smart hacks are now found more effectively by community members than by the product team. Sometimes, we are amazed at the variety of challenges and creative ideas that community members discuss. Active participants share not only case studies, but also practical solutions, such as gathering statistics from Telegram chats or choosing between detailed aggregations and window functions for different tasks. One community member even wrote a book about DataLens: a real, printed book!

Why we went open source

Going open source is the next step in the evolution of DataLens. This way, we can engage not only users and analysts, but also the developers in our community. More people will now be able to contribute directly to the product, and its feature growth will no longer be limited by our own resources. Customers, in turn, can deploy the product on any infrastructure without fear of vendor lock-in. They can also build data ecosystems based on multiple open-source products, for example: YDB + YTsaurus + ClickHouse + DataLens. Ultimately, this will foster openness and development in the BI market as a whole.

Importantly, the main DataLens developer is still the same professional team: UX specialists, designers, analysts, and market experts. We are building a commercial-grade open-source product, and we will continue to invest in it.

Everyone benefits when we go open source:

● Clients can adapt DataLens to their requirements and gain flexibility in their infrastructure choice.

● Partners can enjoy additional opportunities to develop their expertise and implement custom deployment projects.

● IT vendors can utilize DataLens in their products.

● The BI developer community can contribute to the product.

Technical aspects of going open source

DataLens evolved along the path shared by many Yandex projects: we tried to use industry-standard technologies as much as possible, but still depended on internal libraries and infrastructure.

On the backend, we use Python 3, aiohttp, and SQLAlchemy, but the development and build processes were deeply tied to the Yandex monorepo.

When we planned to publish the source code, we decided that the “source of truth” would reside in open source, not in the internal repository. Our developers work through public pull requests, just as other contributors do. Thus, we are not just publishing the source code; we are also making product development more transparent.

This approach challenged us to rethink how our team operates. We had to transition to industry-standard package managers for dependencies and essentially relearn how to deal with Python services and packages the way external Python projects do. Transitioning from our internal infrastructure was challenging and time-intensive, but we believe that it will ultimately benefit the project.

Prior to making the project public, we had to clear the code of all internal specifics: library calls and logic for interacting with other internal services, certain parts of the interface, and configurations for our environments, such as DataLens installations in Yandex Team and Yandex Cloud.

Essentially, each of our services now comprises two components: the open-source core, accessible to everyone, and closed-source extensions that wrap the core, adding missing features and environment-specific logic.

The most apparent example is the integration with authentication systems. Initially, we are launching the open-source version without the multi-user and ACL capabilities. However, the proper extension points are already in the code: the closed-source part leverages them to integrate with Yandex ID and Yandex Cloud IAM. The extensions allow for modifying both the server-level request handling logic and the user interface.
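The core-plus-extensions split can be sketched in a few lines of Python. This is only an illustration of the general extension-point pattern, not the actual DataLens code: `CoreApp`, `register_auth_hook`, and `token_auth_extension` are hypothetical names, and a dictionary stands in for an HTTP request.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# In this sketch a request is a plain dict; a hook returns an error
# message to reject the request or None to let it through.
Request = dict
AuthHook = Callable[[Request], Optional[str]]


@dataclass
class CoreApp:
    """Hypothetical open-source core: serves requests, knows nothing about users."""
    auth_hooks: list[AuthHook] = field(default_factory=list)

    def register_auth_hook(self, hook: AuthHook) -> None:
        """Extension point: code outside the core plugs in its own checks."""
        self.auth_hooks.append(hook)

    def handle(self, request: Request) -> str:
        """Run every registered hook, then serve the request if none objected."""
        for hook in self.auth_hooks:
            error = hook(request)
            if error is not None:
                return f"403 {error}"
        return "200 OK"


def token_auth_extension(valid_tokens: set[str]) -> AuthHook:
    """Hypothetical extension standing in for a real identity provider."""
    def check(request: Request) -> Optional[str]:
        if request.get("token") not in valid_tokens:
            return "invalid or missing token"
        return None
    return check
```

With no hooks registered, `CoreApp().handle({})` serves every request, which mirrors a single-user deployment. After calling `register_auth_hook(token_auth_extension({"secret"}))`, the same core rejects requests that lack a valid token, and the core itself did not have to change.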

Over time, we plan to build an API for plugins and document it, thereby paving the way for the creation of an open-source extension ecosystem for DataLens.

For interface development within Yandex Cloud, we have always used standard technologies for building and managing dependencies, but remained reliant on a number of internal libraries.

What you can deploy and how

To run DataLens locally, all you need to do is launch several containers using Docker Compose:

git clone https://github.com/datalens-tech/datalens
cd datalens && HC=1 docker compose up

After that, you will be able to:

  • Open the interface
  • Review demo examples
  • Attach data sources
  • Build custom dashboards

The first release of the open-source version includes everything you need to try DataLens features in your infrastructure. The repository currently hosts the core service, a set of key connectors (PostgreSQL, ClickHouse, and YTsaurus), and the main interface components.

This release is just the first step, and there is a lot more work ahead. But it is a pivotal moment for us: publishing the source code fundamentally changes our approach to service maintenance and development. We will soon share the DataLens open-source roadmap on GitHub, which will be shaped by feedback from our community.

We are no longer just developing the service; we are creating an open-source BI product, together with you!
