Why and when to build a Machine Learning Platform (part 1)

Published in Mercado Libre Tech, Dec 11, 2020

If your company is a start-up, you’ll probably find this article engaging, but only that, since we’ll be exploring growth at scale in a technology-based company such as MercadoLibre (MELI).

Maturity and cross-unit development push data science to a level of complexity that not many companies have yet had to deal with. In the process of extending data-driven decisions, sharing and building knowledge seem limitless. As specialized data science cells are born, every new scientist brings their favorite tools and skills, and usually teams up with other experts who have acquired their own knowledge from the business world.

This, of course, will at first accelerate delivery and broaden data culture beyond its bounds, but it will also, maybe inadvertently, give rise to a new entropy that, if let loose, might soon become challenging to govern.

Organizational silos are the expected result when information and best practices are not shared across business units.

At mid-sized companies, system development historically grew siloed but later found its way into an integrated off-the-shelf enterprise platform, usually designed to support business operations in a standardized, secure and friendly fashion. A handful of visionaries who anticipated this scenario soon began aiming to acquire an integrated data science platform, a place where big data access, development flexibility, production delivery and security compliance live happily together. But only a few courageous companies, such as Netflix, Uber, Amazon and Google, have ventured into the deep challenge of building one themselves... and so did we!

At MercadoLibre, our Machine Learning (ML) platform, fully developed in-house, has been baptized “FDA”, an acronym for Fury Data Apps (more details about it will be available in Part 2).

That said, let’s dive into what our own challenge has been and where our vision is taking us.

Shipping predictions at scale

In an originally C2C retail company that has evolved to offer B2C and has developed fulfillment centers, shipping, cross-border trade and financial services (with MercadoPago currently being the company’s star), there are diverse problems to solve. Some of the production ML use cases we needed to support were:

Item Recommendation
Fraud Detection
Fake Items Moderation
Item Stock Forecast
Shipping Cost/Time Promise
Predicting Package Dimensions

*Ranking special offers every hour is one of the problems ML solves at MELI*

With more use cases being created every month, whether through a new experiment, a new ML project or even newly hired data scientists, the complexity was such that we needed to find a solution that could:

lower entry barriers in a super complex data ecosystem,
gain agility and provide common tools,
support the model development lifecycle,
help deploy trained models to production,
provide computing infrastructure,
settle in the cloud.

So far, we’ve stated what we intended to do and for what purpose but haven’t provided any further context: why were we encouraged to leap towards developing this platform instead of acquiring one?

“Ok, let’s develop our own ML platform”

Goals & decisions

First, and predictably, we analyzed some of the options available on the market, prioritizing open source platforms. Of course, these serve companies with a traditional core business excellently, since such companies can rely on third parties to provide, maintain and innovate at their own pace. But our case was, and still is, different in the sense that innovation and delivery in support of data-driven solutions are part of our DNA. So our curiosity, our proactivity and the risk-taking strategy we value would not let us wait for things to happen.

Picture what our Request For Proposals (RFP) would’ve looked like: we needed to guarantee flexibility, versatility and readiness at the speed our business moves. Whatever solution we brought in had to be able to scale limitlessly, grow organically in a multi-cloud environment, support impossible SLAs and integrate seamlessly with our entire ecosystem... and all of this still holds as we speak.

Right, with 12 sales per second, scaling at MELI means a lot.

Machine Learning was merging into business at a rapid pace; it was a call to action. To everyone else’s relief, we then began the journey of creating our own platform, with our own constraints and our own cravings.

But, as you might know, a Machine Learning project requires a wide variety of skills, including maths, statistics and coding experience, as well as technological know-how.

IT staff represent more than 30% of the company’s total headcount.

In our company, which had 8,000 employees at that time (and more than 12,000 as we speak) across IT and different business units, all with a wide array of tasks to perform, understanding our users was key during the initial phases so that we could make the right calls.

At the initial stage, we held many meetings with every cell that had been doing some kind of data-science problem solving; our goal was to scale Machine Learning solutions at MercadoLibre by taking advantage of each cell’s existing skills and providing solutions for its major pain points.

As an example, we would support data scientists by managing infrastructure for them, while attracting developers by bringing them closer to what data scientists were doing, thus adding value and speed to both areas of expertise.

Bringing both areas of expertise together: Data Science and DevOps.

How did we develop our own platform?

We’d like to mention some short but key guidelines we kept reminding ourselves of during the conception of the platform:

The development of Machine Learning solutions should not happen at all costs: production code must comply with certain quality standards.
The opportunity to enforce some best practices we had acquired while doing Machine Learning wasn’t to be ignored. It was valuable and costly knowledge that had to be leveraged when the final product was released.

This is why a cross-unit ML platform made sense, and why an interdisciplinary council of developers, cloud architects and data scientists began sketching the solution, finally arriving at a common vision and the assets that would make it possible.

Fury Data Apps (FDA) shared vision

A user who wants to develop a data-science project shall choose FDA to do so, and for that we will:

provide an essential feature set,
support adaptable pipelines, ready to assemble with other projects,
ensure support and assistance, either for current or new projects.

A team that aims to deploy a data-science solution shall choose FDA to do so, and for that we will:

maintain a service as resilient and robust as Fury itself,
ensure support and assistance,
provide a cost-effective and optimized, yet agile, infrastructure.

The whole company must know about FDA and its capabilities, namely:

understand what it is meant for,
expect a 360º experience through training and outreach, including a presence at onboarding and learning sessions.

These are our drivers and the guidelines for our development.

Where are we standing now?

Our team has grown, our platform has grown, our user base has grown. This entails greater complexity in managing, maintaining and innovating.

Our diverse perspectives add higher value and deliver better results.

So this is how we do it:

1. We contribute from three different teams, with developers fully allocated to the project.

Infrastructure (Cloud & Platform)
Machine Learning Technology
Data Team (formerly Business Intelligence)

2. We keep a unified backlog.

Through grooming and prioritization, the council takes product ownership and agrees on every decision taken.
No tasks are drawn from outside.

3. We use many agile rituals and tools: sprint planning, a prioritized backlog, daily meetings, retrospectives, pre- and post-mortems, and a kanban board.

We provide support to our products.
We encourage our users to request new features, and invite them to discuss and collaborate so that we arrive at unbiased results.

4. We accept external innovation:

If agreed by the council, external requirements will make their way into the backlog and will be subject to prioritization like the rest.

5. We host mandatory kick-off and mid-term results meetings for/with our sponsors.

Thanks to this evolution, the idea that we could improve agility, help make best practices pervasive across the company and govern the development of Machine Learning software across MercadoLibre is now successfully embodied in our own in-house solution: FDA.

Yeah! Delivering machine learning to production is possible!

In Part 2 we drill into each module FDA offers today and where our vision is heading.

Acknowledgments

Thanks to all of the former and current FDA dev squad members, from the Machine Learning Technology, Cloud & Platform and Data teams at MercadoLibre, who worked together to make it happen!
Cecilia Sassone (editor).