This is the first article in a series about how product data is stored in Akeneo PIM.
In early 2013, we started working on our open source PIM software and founded a company, Akeneo. Our goal was to build user-friendly software, not a technical tool or a framework: software designed for our end users, the marketers, giving them the freedom to efficiently structure and enrich their product data.
As open source is part of our DNA, we wanted from day one to build a strong community and a global ecosystem around this product. To maximize its adoption, we chose to implement it in PHP, a very open-source-friendly language. The application relies on the Symfony framework, which has a vibrant community pushing forward the industrialization of the PHP world. We also decided to stick to a LAMP stack (Linux, Apache, MySQL, PHP), a very common choice in the PHP ecosystem and a good way to make sure PHP developers would feel at home.
Our objective was to provide a flexible structure for products, enabling end users to define the structure of their product catalog themselves. Defining and modifying the product structure should be possible directly through a simple UI, without any technical intervention and without any service interruption.
Taking these constraints into account, we designed and implemented a first version of the product storage, based on an EAV (Entity-Attribute-Value) approach on top of Doctrine ORM.
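To make the EAV idea concrete, here is a minimal sketch in Python/SQLite, chosen for brevity and self-containment. The real implementation sits on Doctrine ORM in PHP, and the table and column names below are illustrative, not the actual Akeneo schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE attribute (id INTEGER PRIMARY KEY, code TEXT);
CREATE TABLE product   (id INTEGER PRIMARY KEY, sku TEXT);
-- One row per filled (product, attribute) pair: this "value" table
-- grows with the total number of product values, not just products.
CREATE TABLE product_value (
    product_id   INTEGER REFERENCES product(id),
    attribute_id INTEGER REFERENCES attribute(id),
    value        TEXT
);
""")
conn.execute("INSERT INTO attribute VALUES (1, 'name'), (2, 'color')")
conn.execute("INSERT INTO product VALUES (1, 'tshirt-001')")
conn.execute(
    "INSERT INTO product_value VALUES (1, 1, 'Basic T-shirt'), (1, 2, 'red')"
)

# Reading a product back requires joining through the value table;
# each extra attribute filter adds another join, which is what makes
# querying costly once the value table gets large.
rows = conn.execute("""
    SELECT a.code, v.value
    FROM product_value v JOIN attribute a ON a.id = v.attribute_id
    WHERE v.product_id = 1
""").fetchall()
product = dict(rows)
```

The payoff is flexibility: adding an attribute is just a new row in `attribute`, with no `ALTER TABLE`; the price, as described below, is paid at query time.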
When expectations meet reality
In early 2014, we released our Community Edition v1.0.0. We got good traction and more and more feedback on the use of our brand-new on-premise PIM.
Our assumption was that most of our early adopters would use it with 1k to 20k SKUs and 100 to 300 attributes (an attribute is a product property, such as a name, a description, a weight, or a color).
However, we quickly started working with a large retailer managing 50k SKUs in production. To provide them with a better user experience, we released a set of performance improvements.
One month later, a large auto parts reseller planned to manage 1.3 million SKUs in their PIM. We knew immediately that our current implementation could not meet these expectations. Our EAV approach gives schema flexibility, but the drawbacks are poor query performance and limited vertical scalability. Behind the scenes, this limit is tied to the total number of filled product values rather than to the number of products.
Considering this limitation, we designed, implemented and released an optional MongoDB storage for the product data. This document-oriented, schema-less storage matched our use case exactly: being able to store and manipulate products with different properties. For instance, a TV and a T-shirt have totally different properties, yet both are sellable products.
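An illustrative sketch of why a document store fits, using plain Python dicts and hypothetical attribute names rather than real Akeneo documents:

```python
# Two sellable products whose sets of product values barely overlap.
tv = {
    "sku": "TV-4K-55",
    "values": {"name": "4K TV", "screen_size": "55in", "resolution": "3840x2160"},
}
tshirt = {
    "sku": "TSHIRT-RED-M",
    "values": {"name": "Basic T-shirt", "color": "red", "size": "M", "fabric": "cotton"},
}

# Each document carries only its own attributes: no shared table schema,
# no sparse columns, no per-attribute join at read time.
common = set(tv["values"]) & set(tshirt["values"])
```

Here only `name` is shared between the two shapes; in a relational layout the remaining attributes would be empty cells or extra value rows, while each document simply omits what it does not have.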
Since then, when you install Akeneo PIM, you can choose either MySQL or MongoDB for your product storage (the other domain models remain stored in MySQL). This implementation also required us to revamp part of our stack. Introducing an abstraction layer allowed us to develop new features without having to think too much about the actual product persistence. To achieve this, we designed an abstract product query builder to read the data, and implemented saver and remover services to handle the write side.
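The abstraction layer described above can be sketched as follows. This is a hedged illustration in Python, not Akeneo's actual PHP API: the class and method names (`ProductQueryBuilder`, `add_filter`, `execute`) are hypothetical, and the in-memory backend merely stands in for the real MySQL and MongoDB implementations.

```python
from abc import ABC, abstractmethod

class ProductQueryBuilder(ABC):
    """Read side: build storage-agnostic product queries."""
    @abstractmethod
    def add_filter(self, attribute_code, operator, value): ...
    @abstractmethod
    def execute(self): ...

class ProductSaver(ABC):
    """Write side: persist a product regardless of the backend."""
    @abstractmethod
    def save(self, product): ...

class InMemoryQueryBuilder(ProductQueryBuilder):
    """Toy backend standing in for the MySQL or MongoDB implementations."""
    def __init__(self, products):
        self._products = products
        self._filters = []

    def add_filter(self, attribute_code, operator, value):
        assert operator == "=", "toy backend supports equality only"
        self._filters.append((attribute_code, value))
        return self

    def execute(self):
        return [p for p in self._products
                if all(p.get(code) == value for code, value in self._filters)]

products = [{"sku": "A", "color": "red"}, {"sku": "B", "color": "blue"}]
result = InMemoryQueryBuilder(products).add_filter("color", "=", "red").execute()
```

Feature code depends only on the abstract interfaces, so swapping the persistence backend does not ripple through the business logic.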
This new storage shipped in our v1.1 and has answered the needs of the most ambitious projects in terms of the number of products and product values!
We released our Enterprise Edition v1.0 a few months later, bringing extra features such as advanced permissions and workflows. These new features included a few minor improvements to the product storage.
Growing with our customers
From 2015 on, we’ve supported the setup of bigger projects. Hundreds of thousands, or even millions, of SKUs became common. We’ve also seen existing live projects, originally set up with a small data set, getting bigger: completing existing data, adding product families and attributes, managing more and more products, expanding to new channels and onboarding ever more users.
From standard projects defining hundreds of attributes, families and categories, we moved to projects having to deal with thousands of these properties.
After the question of scaling the number of SKUs, we faced the question of scaling the number of structural models. Synchronously loading 500 attributes in a web page is not a problem; with 40k attributes, the application is simply unusable. To handle these new use cases, we improved the stack from the UI down to the querying system, for example by always paginating data loading.
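The "always paginate" rule above can be sketched as a simple generator. This is an illustrative Python sketch, not Akeneo's code; `fetch_page` is a hypothetical data-access function standing in for a paginated backend query.

```python
def paginate(fetch_page, page_size=100):
    """Lazily stream items page by page instead of loading them all at once."""
    page_number = 0
    while True:
        page = fetch_page(page_number, page_size)
        if not page:
            return
        yield from page
        page_number += 1

# Fake backend: 250 attributes served in slices, as a real API would.
attributes = [f"attribute_{i}" for i in range(250)]

def fetch_page(page_number, page_size):
    start = page_number * page_size
    return attributes[start:start + page_size]

result = list(paginate(fetch_page))
```

The caller consumes a lazy stream, so memory and response time stay bounded by one page rather than by the 40k attributes a catalog might define.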
As we grew the number of attributes, we also hit a hard limitation of our MongoDB storage: a MongoDB collection supports a maximum of 64 indexes.
When filtering on a non-indexed field, MongoDB performs quite efficiently up to a certain amount of data, then performance drops drastically.
We started working on a solution for projects containing very large numbers of SKUs and product values that need to filter these products on more than 64 indexed fields, since each index is used by a different product characteristic.
Among several options, we decided to implement an Elasticsearch index to handle the query part, keeping MongoDB only to store the product data.
Thanks to our product storage query abstraction, the Elasticsearch query capability was quite easy to design and build. It shipped as an optional extension with a very light footprint.
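The resulting split can be sketched as a two-step read path: the search engine answers the query and returns matching ids, then the document store returns the full product documents. Both backends are faked here with plain dicts; the real system talks to Elasticsearch and MongoDB clients, and the function name is hypothetical.

```python
def search_then_fetch(query, search_index, document_store):
    """Query step against the search index, fetch step against the doc store."""
    matching_ids = [pid for pid, doc in search_index.items()
                    if all(doc.get(k) == v for k, v in query.items())]
    return [document_store[pid] for pid in matching_ids]

# The search index holds only the filterable fields...
search_index = {1: {"color": "red"}, 2: {"color": "blue"}}
# ...while the document store holds the complete product documents.
document_store = {
    1: {"sku": "TSHIRT-RED-M", "color": "red", "size": "M"},
    2: {"sku": "TSHIRT-BLU-M", "color": "blue", "size": "M"},
}

result = search_then_fetch({"color": "red"}, search_index, document_store)
```

Because the search engine builds its own inverted index per field, the 64-indexes-per-collection ceiling no longer constrains which product characteristics can be filtered.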
Looking back at this journey
We’ve learnt a lot about our core domain over the last few years. Being in contact with hundreds of enterprise customers and thousands of community projects has been a rich source of knowledge and challenges. Each industry has its own needs, and each project helped us better understand how to efficiently design and enrich a product catalog.
Our product storage has evolved in an iterative way. It reflects the successive challenges we faced: blazing-fast adoption, then the onboarding of numerous new customers, then their growing expectations. The subtle exercise has been to properly balance innovation and stability: innovation to onboard new customers with higher expectations or to handle the growing needs of our existing customers; stability for our existing customers, who expect to keep their current technical stack and the exact same behavior from one version to the next.
Even though we succeeded, the current state of our product storage has several drawbacks, especially for developers who use its internal API directly.
As a project developer customizing the PIM for a given customer, the current state of this storage is quite complex to understand. Which storage suits my project best? MySQL only? With MongoDB? With Elasticsearch?
As an extension developer, you have more cases to take into account during development and maintenance, especially if your extension needs to perform low-level operations on products.
As a core developer, even though we managed to decouple our business code from the persistence layer, and even with the proper abstractions, the total cost of ownership is higher than with a single storage.
This storage approach implies, in some cases, writing and maintaining one implementation per storage. This extra cost per feature directly impacts our delivery: the more costly a feature, the fewer features shipped per release. On the support side, issue qualification is also more complex and requires many environments to reproduce problems. Last but not least, on the maintenance side, even for the simplest fix we need to launch a batch of builds in our continuous integration to make sure everything works with both storage systems.
We hope this first post about the product storage helps you better understand its current state. In the upcoming post, we’ll detail the studies we’re running to improve this crucial part of Akeneo PIM. Stay tuned! 📻