Single Product Storage?

Nicolas Dupont
Akeneo Labs
Published in
4 min readMay 23, 2017

In our previous post of this series, we discussed the evolution of our product storage, its current state and limitations. We talk here about our R&D approach to define what would be our ideal solution today.

Existing implementations

Using the version 1.7, a customer’s project may be configured to use:

  • MySQL only: with EAV approach for product data and flat tables for other entities
  • MySQL + MongoDB: with product documents in MongoDB and flat tables for other entities in MySQL
  • MySQL + MongoDB + ElasticSearch: with product documents in MongoDB, product index for querying in ElasticSearch and flat tables for other entities in MySQL

What would we do today?

In late 2016, we took a break, a deep breath and we wondered if we could rely on a single product storage to answer all of our customer needs.

A unique storage that may be used by both Community and Enterprise Editions through a native packaged implementation.

This storage should obviously support our existing constraints and customers’ use cases. More importantly, using our acquired experience and better understanding of our market and its evolution, we tried to figure out what would be the expectations in 5 years. The idea is to prepare for the future.

We started a first study listing the eligible technical stacks (ie: set of tools) allowing us to address our expectations. This list also includes our current storage implementations, like MySQL + MongoDB + ElasticSearch stack.

We defined a list of grouped criterion to evaluate these possibilities:

  • Adoption: opensource, adoptable by the community, by the customers, by the partners, by the team, ease of installation, ease of configuration, availability on main OS distribution, required dependencies
  • Future proof: maturity, documentation, implementation for PHP, flexible storage capabilities
  • Stack: full stack simplicity, implementation simplicity
  • Cost: build cost, total cost of ownership
  • Storage and querying: ability to store a high number of products, reliability on products storage (no data loss), ability to query a high number of products

Few criterion directly eliminated a solution, for instance, if a tool being part of the storage stack is not open source.

Other criterion were rated from 1 to 5 by the team. Each group of criterion had a defined weight from 1 to 5. For instance, adoption group being for us more important than cost group.

This first study outcome was a complete matrix of eligible solutions.

Our current MySQL EAV implementation getting a global rating of 3.65 / 5. Being very good regarding adoption or tech stack simplicity. On other hand, being limited regarding querying performance and amount of supported products.

Our current MySQL / MongoDB / Elastic Search implementation getting a global rating of 3.94 / 5 for almost opposite reasons.

Interesting fact, with their now native JSON field support and related features, traditional RDBMS like MySQL or PostgreSQL appear as good competitive solutions.

When we did our first EAV implementation in 2013, store the product data as a JSON field was not eligible due to the missing ability to query these data, which is doable since MySQL 5.7+ and PostgreSQL 9.3+.

With this JSON field approach, MySQL and PostgreSQL are respectively getting a rating of 4.47 / 5 and 4.37 / 5. The difference mainly coming from our adoption criteria, PostgreSQL still being (sadly) less popular and used in PHP ecosystem.

An alternative stack appears very attractive, MySQL with JSON field + ElasticSearch to leverage the querying possibilities. This stack gets a rating of 4.32 / 5, almost the same than MySQL or PostgreSQL only. However, this option is more well-balanced between our criterion groups. Pretty easily adoptable, quite simple and very performant, scalable and future proof.

We studied around 10 other options that we’ll not detail here, all being not as good as these or very unbalanced.

Proof of Concept

We started to work on a POC on the most promising solution, MySQL with JSON field + ElasticSearch.

Our mission was to benchmark the storage stack in itself, without taking the application layer into account, to validate the following assumptions:

  • we are able to implement read / write queries we currently need in our application to address business needs
  • we can enable future known features, there is no technological lock, we can unleash innovation with this stack
  • we can handle future growing expectations regarding read / write performances and regarding the product data scalability

The first part of this POC was to design a relevant schema for both MySQL tables and ElasticSearch indexes. This schema being challenged through various aspects to guarantee support for our current and future expectations regarding read and write accesses.

Another significant topic was to improve our internal tooling and create new tools. Indeed, these tools allowed us to insert very large volume of data and to benchmark the storage. Its performance has been measured using a matrix of data queries, progressively increasing the data volume and playing with several data axes. These axes aim to represent different catalog typologies, for instance, to push the limits on the amount of structured data, or to design a set of very heterogeneous products.

This realization finally confirmed the relevance of the solution and its technical feasibility for the storage layer.

What’s next?

Even if this storage is eligible when used directly in a standalone mode, what about our current application using it?

What would be the limitations of our application? Where would be the bottlenecks? What would be the build cost of this implementation? What would be the impacts of such a change for our product and ecosystem?

We’ll answer these open questions in an upcoming post. Stay tuned! 📻

--

--