Re-building the storage from the ground up

Julien Janvier
Published in Akeneo Labs
Jun 8, 2017 · 6 min read

This is the third post in a series dedicated to the way we store products at Akeneo. Nicolas previously told the story of product storage and explained the soul-searching we did at the end of 2016 regarding this critical part of the application. In this post, we’ll go one step further. What would it take to implement this ideal storage? What would it take to implement a single storage in Akeneo PIM?

A Recap of our Ideal Single Storage

Our ideal single storage could be explained as a simple equation:

single storage = MySQL + product values into a JSON field + Elasticsearch

The storage should work the same for both Community and Enterprise editions and should be able to handle the same amount of data. And as you can see, the number of products or values is not part of the equation anymore.

As explained in the previous post, our first POC confirmed that a standalone MySQL allows us to store our product values in a JSON field.
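To give a feel for what this looks like, here is a minimal, purely illustrative sketch of a product entity whose values live in a single JSON column. The class, property and column names are hypothetical, not Akeneo’s actual code, and the mapped Doctrine type depends on the DBAL version you run.

```php
<?php
// Illustrative sketch only: a product entity whose values are stored in one
// JSON column. Doctrine only knows about the raw array payload.

use Doctrine\ORM\Mapping as ORM;

/**
 * @ORM\Entity
 * @ORM\Table(name="product")
 */
class Product
{
    /**
     * @ORM\Id
     * @ORM\Column(type="integer")
     * @ORM\GeneratedValue(strategy="AUTO")
     */
    private $id;

    /**
     * All product values, serialized as a single JSON document.
     * Depending on the Doctrine DBAL version, the type is "json" or "json_array".
     *
     * @ORM\Column(name="raw_values", type="json")
     */
    private $rawValues = [];

    public function getId()
    {
        return $this->id;
    }

    public function getRawValues()
    {
        return $this->rawValues;
    }

    public function setRawValues(array $rawValues)
    {
        $this->rawValues = $rawValues;
    }
}

// What the raw_values column could contain for one product (hypothetical shape):
// {
//   "name":   {"<all_channels>": {"<all_locales>": "Kodak SP1"}},
//   "weight": {"ecommerce": {"en_US": {"amount": 230, "unit": "GRAM"}}}
// }
```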

We had no doubt that Elasticsearch would fit our needs. We already had some experience with this tool, as we provide a special bundle that handles product searches inside very large catalogs for our most challenging customers.

Re-Modeling Product Values

This approach has a direct impact on the product value model. Even though this model is the key concept of our application, it suffers from several design flaws.

It is tied to the EAV implementation, via methods such as setMedia and setNumber. Also, you don’t directly know which kind of value you are manipulating. Is it a metric? A number? A reference data? To get this information, you have to look at the attribute linked to the value.

It has no dedicated business-oriented method per value type. It would be really useful to have a getUnit method on metric values, for instance.

It is handled by Doctrine’s unit of work, which, in itself, is not a problem. But when a product has thousands of values, it has direct consequences on memory usage and, by extension, on application speed, especially during imports and exports.

It has been designed as an entity instead of being what it really is: a value object. As a matter of fact, a product value measures, describes or quantifies a characteristic of a product. And a product value that contains “foo” is exactly the same as another value containing “foo”. It has no identity.

Considering these problems, we decided to completely re-model this key object from scratch.

We wanted to have one product value class per attribute type. They all share a minimal contract. Things are simple now: you know what you are working with, and it’s even possible to have dedicated business-oriented methods depending on the value’s type.
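As a rough illustration (the interface and class names below are assumptions, not the actual Akeneo API), a metric value could implement a minimal shared contract while exposing its own business methods:

```php
<?php
// Hypothetical sketch of the "one value class per attribute type" idea.

interface ValueInterface
{
    public function getAttributeCode();
    public function getChannelCode();
    public function getLocaleCode();
    public function getData();
}

// A metric value can now expose business-oriented methods such as getUnit().
final class MetricValue implements ValueInterface
{
    private $attributeCode;
    private $channelCode;
    private $localeCode;
    private $amount;
    private $unit;

    public function __construct($attributeCode, $channelCode, $localeCode, $amount, $unit)
    {
        $this->attributeCode = $attributeCode;
        $this->channelCode = $channelCode;
        $this->localeCode = $localeCode;
        $this->amount = $amount;
        $this->unit = $unit;
    }

    public function getAttributeCode() { return $this->attributeCode; }
    public function getChannelCode()   { return $this->channelCode; }
    public function getLocaleCode()    { return $this->localeCode; }
    public function getData()          { return ['amount' => $this->amount, 'unit' => $this->unit]; }

    // Dedicated business methods, only meaningful for metric values.
    public function getAmount() { return $this->amount; }
    public function getUnit()   { return $this->unit; }
}
```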

We wanted to make them immutable, as real value objects are. We provided dedicated factories to facilitate their creation.
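A dedicated factory per value type could hide the construction details and always hand back an immutable object. Again, this is only a sketch, reusing the hypothetical MetricValue above:

```php
<?php
// Illustrative sketch of a dedicated factory for one value type.

final class MetricValueFactory
{
    /**
     * $data comes straight from the JSON column,
     * e.g. ['amount' => 230, 'unit' => 'GRAM'].
     */
    public function create($attributeCode, $channelCode, $localeCode, array $data)
    {
        if (!isset($data['amount'], $data['unit'])) {
            throw new \InvalidArgumentException(
                sprintf('Invalid metric data for attribute "%s".', $attributeCode)
            );
        }

        return new MetricValue($attributeCode, $channelCode, $localeCode, $data['amount'], $data['unit']);
    }
}
```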

We wanted to keep them out of Doctrine’s unit of work. Doctrine is still responsible for loading and saving the JSON field that contains all the values, but we transform this JSON representation into real product value objects ourselves.
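A possible shape for that transformation, assuming the hypothetical classes sketched above and a JSON payload indexed by attribute, channel and locale, could be:

```php
<?php
// Sketch of turning the raw JSON payload loaded by Doctrine into real value
// objects, outside of the unit of work. Structure and names are assumptions.

final class ValueHydrator
{
    private $metricValueFactory;

    public function __construct(MetricValueFactory $metricValueFactory)
    {
        $this->metricValueFactory = $metricValueFactory;
    }

    /**
     * $rawValues is the decoded JSON column, indexed by attribute code,
     * then channel, then locale.
     */
    public function hydrate(array $rawValues)
    {
        $values = [];
        foreach ($rawValues as $attributeCode => $byChannel) {
            foreach ($byChannel as $channelCode => $byLocale) {
                foreach ($byLocale as $localeCode => $data) {
                    // A real implementation would pick the right factory
                    // according to the attribute type; we only handle metrics here.
                    $values[] = $this->metricValueFactory->create(
                        $attributeCode,
                        '<all_channels>' === $channelCode ? null : $channelCode,
                        '<all_locales>' === $localeCode ? null : $localeCode,
                        $data
                    );
                }
            }
        }

        return $values;
    }
}
```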

We wanted to offer a lot of design sugar. So we enriched the product layer API with meaningful objects, such as a dedicated collection of values, or a factory that is the single entry point for creating any product value object from data provided in the standard format.
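For instance, a dedicated collection could key each value by attribute, channel and locale, while the single entry-point factory would simply dispatch to per-type factories like the one shown earlier. The sketch below is illustrative only:

```php
<?php
// Hypothetical collection of values, keyed by attribute/channel/locale.

final class ValueCollection implements \Countable, \IteratorAggregate
{
    /** @var ValueInterface[] */
    private $values = [];

    public function __construct(array $values = [])
    {
        foreach ($values as $value) {
            $key = sprintf(
                '%s-%s-%s',
                $value->getAttributeCode(),
                $value->getChannelCode(),
                $value->getLocaleCode()
            );
            $this->values[$key] = $value;
        }
    }

    // Retrieve one value for a given attribute, channel and locale.
    public function getByCodes($attributeCode, $channelCode = null, $localeCode = null)
    {
        $key = sprintf('%s-%s-%s', $attributeCode, $channelCode, $localeCode);

        return isset($this->values[$key]) ? $this->values[$key] : null;
    }

    public function count()
    {
        return count($this->values);
    }

    public function getIterator()
    {
        return new \ArrayIterator(array_values($this->values));
    }
}
```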

The interesting thing about this refactoring is that it has no serious impact on the product object API, nor does it impact the public APIs of features dealing with products. Actually, all those important changes live deep in our internal API.

Re-Working Most of the Features Handling Products

The re-modeling of product values had a direct impact on all features dealing with products. Among them are variant groups, completeness, drafts, proposals, and published products.

Some features that we coded in 2013/2014 hadn’t changed much since their creation, like published products or product reverting. They didn’t even use the “standard” layer we are used to now: product updater, validator and saver. They lived with their own duplicated, and often complex, code. We took the opportunity to simplify those features by using the standard layer and all the brand new classes we introduced. We ended up with less code, hopefully fewer bugs, and more shared knowledge and confidence within the team about these antique parts of the application.

But this refactoring doesn’t only bring code industrialization; it also brings its share of simplification. For instance, there is no longer any need to struggle to detect the changes made to a product’s values. Everything is in one single place: the JSON field.

The main aspect to keep in mind is that most of the userland API regarding these features has not changed at all. Again, it’s mainly a big internal refactoring.

Not a silver bullet…

Even if we are really confident about the usefulness of this project, we know it’s not a silver bullet. And like every choice made in the development world, it comes with limitations, trade-offs and considerations to keep in mind.

Storing the flexible product values in a JSON field and systematically querying them through an Elasticsearch index implies a strong and acknowledged technical limitation: we can’t query product values directly in MySQL anymore. To handle this, we kept the core ideas of our current ElasticsearchBundle: when a product is saved in MySQL, its data is indexed in Elasticsearch. This change has been introduced by implementing new filters for our product query builder, so the querying contract remains exactly the same.
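As an illustration of that save-then-index flow, here is a sketch reusing the Product entity from earlier and the official elasticsearch/elasticsearch PHP client; the saver class, index name and document structure are assumptions, not the actual implementation:

```php
<?php
// Sketch of the "save to MySQL, then index in Elasticsearch" idea.

use Doctrine\ORM\EntityManagerInterface;
use Elasticsearch\Client;

final class ProductSaver
{
    private $entityManager;
    private $elasticsearchClient;

    public function __construct(EntityManagerInterface $entityManager, Client $elasticsearchClient)
    {
        $this->entityManager = $entityManager;
        $this->elasticsearchClient = $elasticsearchClient;
    }

    public function save(Product $product)
    {
        // 1. Persist the product (and its JSON values column) in MySQL.
        $this->entityManager->persist($product);
        $this->entityManager->flush();

        // 2. Index a search-friendly representation in Elasticsearch so that
        //    the product query builder can filter on any value.
        $this->elasticsearchClient->index([
            'index' => 'akeneo_pim_product',
            'type'  => 'pim_catalog_product',
            'id'    => $product->getId(),
            'body'  => [
                'identifier' => $product->getId(),
                'values'     => $product->getRawValues(),
            ],
        ]);
    }
}
```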

Even if it is very common in web applications, Elasticsearch introduces extra complexity regarding the installation and configuration of the product. This core change also implies extra work to migrate existing customers to the next version. We’ll explain how we handle these two aspects in an upcoming, dedicated post.

But it’s worth it!

We are aware these changes will affect our ecosystem. However, we really do think this is for the best in the mid-term, whether for our development team, our product owners, our contributors, our integrators or our customers.

First, providing a single product storage to handle all cases drastically simplifies our technical stack. It facilitates the onboarding of new developers, who can now focus on a single way of storing and manipulating data. It also helps developers already seasoned on Akeneo PIM improve their expertise by focusing on this single stack. The fine tuning and optimization of the application or server infrastructure becomes easier. Finally, the cost of building and maintaining new features, projects or extensions also benefits from this standardization.

Talking about extensions, the new design of product values also has a positive impact. In the current version, adding a custom attribute type requires overriding the native product value class. As you can only override it once, you can’t install two extensions that each add their own attribute type. With the new implementation, there is no more need for overriding: you just have to implement an interface and voilà!
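For example, an extension could ship its own value type just by implementing the hypothetical interface sketched earlier, without touching any native class:

```php
<?php
// Illustrative only: a custom value type provided by an extension.

final class ColorValue implements ValueInterface
{
    private $attributeCode;
    private $channelCode;
    private $localeCode;
    private $hexaCode;

    public function __construct($attributeCode, $channelCode, $localeCode, $hexaCode)
    {
        $this->attributeCode = $attributeCode;
        $this->channelCode = $channelCode;
        $this->localeCode = $localeCode;
        $this->hexaCode = $hexaCode;
    }

    public function getAttributeCode() { return $this->attributeCode; }
    public function getChannelCode()   { return $this->channelCode; }
    public function getLocaleCode()    { return $this->localeCode; }
    public function getData()          { return $this->hexaCode; }

    // Business method specific to this custom type.
    public function getHexaCode()      { return $this->hexaCode; }
}
```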

This implementation also unleashes future innovations and features. For instance, having a native search engine in the stack opens a lot of possibilities, the first one being complete full-text search. This new design will also allow new attribute types to be installed as plug-and-play extensions, as easily as installing an add-on in Firefox.

Last but not least, this storage has been built from day one with scalability in mind. It allows us to target larger product catalogs and to support the growth of both our existing enterprise customers and community users.

Fair enough. But when?

The implementation has just been merged on the master branch of our repository, so it will be available in Akeneo’s next version. It took 5 months of work, involving 5 core team members who opened more than 300 pull requests.

In the next posts of this series, we’ll detail the set of tools we’re building to facilitate the installation and automate the migration. We’ll also share and explain our benchmarks regarding performance and scalability.

Stay tuned!
