Be aware of the industry

# 7 — Are your Problems Similar to those of the Rest of the Industry?

The importance of not trying to reinvent the wheel.

Nicolas Rodriguez Presta
Published in Mercado Libre Tech · 3 min read · Oct 14, 2021

--

This is story #7 of the series Flight checks for any (big) machine learning project.

As the initiative moves forward, there are many opportunities to improve it and to make it evolve. Many problems will come up.

Even though these problems may seem unique to machine-learning projects, the truth is that, most probably, many others have already faced the same difficulties.

If the most difficult problems we are solving are very different from those the rest of the industry is tackling, it may be a bad smell: the jungle of solutions may have taken us down a very specific fork in the road, one that leads us away from the river.

At some point in the life cycle of our product/ML model, it is advisable to ask ourselves if the problems we have are similar to those of the rest of the industry and devote some time to finding an answer.

While this seems obvious, and some googling should do the trick, in the day-to-day bustle it is easy to skip the pauses needed to gain this perspective.

Some triggers that may help:

  • Ask yourself how the company you admire the most (Google, Facebook, Amazon, Tesla, Mercado Libre) would solve this problem. What keeps you from solving it in the same way?
  • How much of the project time is spent on maintaining or integrating legacy systems? Will that scale in the future?
  • Are there any papers that back up the way I am modeling the problem? Are there any more recent papers on the subject?
  • Is my data pipeline solution “off-the-shelf”? Wouldn’t it be worth researching how different companies solved the data pipeline problem?

For example, if the big issue in the pipeline is the time the production model takes to respond online, this kind of research will surely come in handy. Before jumping into low-level optimization of the library used, or serving the model on a ton of GPUs, you will probably want to spend some time figuring out whether the same model can be trained with another library that includes serving optimizations.

Say the main problem is that the dataset does not fit in memory. Then, before looking for a distributed way to process the data, or an instance with more memory, perhaps it would be useful to validate whether there are applicable batching solutions for training.
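To make the batching idea concrete, here is a minimal sketch using only the Python standard library. The file path, batch size, and the `model.partial_fit` call mentioned in the comment are hypothetical, invented for the example:

```python
import csv

def iter_batches(path, batch_size):
    """Yield lists of CSV rows, reading the file lazily so the whole
    dataset never has to sit in memory at once."""
    with open(path, newline="") as f:
        batch = []
        for row in csv.reader(f):
            batch.append(row)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:  # trailing partial batch
            yield batch

# A training loop would then call something like model.partial_fit(batch)
# (hypothetical incremental-learning API) once per batch,
# instead of fit(whole_dataset) once.
```

Many mainstream libraries already ship this pattern (incremental learners, data loaders, streaming datasets), which is exactly the point: check for them before building distributed infrastructure.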

Further, if the problem is orchestrating pipeline components, you can develop your own orchestration layer (and it may be necessary), but first it is useful to find out how other ML projects orchestrate their pipeline steps. Is any of that applicable?
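As a toy illustration of what dedicated orchestrators (Airflow, Kubeflow Pipelines, and the like) provide out of the box, here is the core idea: running steps in dependency order. The step names and the `actions` mapping are invented for the example; it uses only the standard library (`graphlib` requires Python 3.9+):

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each step maps to the steps it depends on.
PIPELINE = {
    "extract": set(),
    "clean": {"extract"},
    "featurize": {"clean"},
    "train": {"featurize"},
    "evaluate": {"train"},
}

def run_pipeline(dag, actions):
    """Run each step's action after all of its dependencies,
    in a valid topological order; return the order executed."""
    order = list(TopologicalSorter(dag).static_order())
    for step in order:
        actions[step]()
    return order
```

Real orchestrators add retries, scheduling, logging, and a UI on top of exactly this core, which is why building one from scratch is rarely the first option worth trying.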

While these are simple examples, in my experience the “novelty” of ML projects can be dazzling, and it is easy to fall into the fallacy that “my issues are at the state of the art, so I have to solve them from scratch.” After a little research, the answer is usually no.

The risk of moving forward with custom solutions is that their maintenance cost will increase dramatically over time, since, unlike standard solutions, they do not evolve under the pressure of industry-wide competition.

The question is: Can the components of the current solution be commoditized?

There are still 3 more checks left! Keep up and enjoy your flight!
