Machine Learning Engineering book

Reza Mahmoudi
8 min read · Nov 24, 2023


Chapter 2 — Your data science could use some engineering

  • ML engineering involves a trinity of core concepts:
      ◦ Tech (tools, frameworks, algorithms)
      ◦ People (collaborative work, communication)
      ◦ Process (development standards, experimentation rigor, Agile)
  • Three components of data science:
      ◦ Design — how the data will be collected and what structure it needs to be in to solve the problem
      ◦ Collection — the act of acquiring the data
      ◦ Analysis — gaining insights from the data through statistics
  • Modern data science is largely based on the third piece — analysis — while a data engineering team handles the first two.
  • Data science (i.e., the ‘analysis’ above) is just one component of ML engineering. Other components include:
      ◦ Project planning
      ◦ Communication
      ◦ Monitoring and evaluation
      ◦ Software development
      ◦ Model and artifact management
      ◦ Scalability
      ◦ Platform and deployment
  • Simplicity is the foundation of ML engineering — start with the simplest solution and add complexity from there as needed. Simplicity is “by far the single most important aspect of ML engineering.”
  • “Unfortunately for newcomers to the field, many data scientists believe that they are providing value to a company only when they are using the latest and “greatest” tech that comes along. Instead of focusing on the latest buzz surrounding a new approach catalogued in a seminal whitepaper or advertised heavily in a blog post, a seasoned DS realizes that the only thing that really matters is the act of solving problems, regardless of methodology. As exciting as new technology and approaches are, the effectiveness of a DS team is measured in the quality, stability, and cost of a solution it provides.”
  • Guide for building the simplest ML solution possible (this image seems ‘pin-up-next-to-your-desk’ worthy — I’d imagine tons of time, energy, and money could have been saved if everyone walked through these steps first when approaching a project)
  • Hypothetical complaint: “but it’s not data science work if the solution doesn’t use AI”
  • Ben (the author) never found the infrequent application of cutting-edge techniques demoralizing, because he started in data analytics before moving to ML.
  • One of the biggest drivers of his focus on simplicity was the fact that it was his responsibility to maintain the solution. Using a more complex solution than was needed would unnecessarily complicate his job. Takes longer to diagnose failures, harder to troubleshoot, frustrating to change internal logic for new feature incorporation, etc.
  • Less time spent maintaining solutions = more time spent building new solutions = more value provided to the business
  • “Sometimes working on the basic things that bring incredible value to a company can help you keep your job (which isn’t to say that forecasting, churn, and fraud modeling are simple, even if they don’t seem particularly interesting).”

This discussion of simplicity caused some pretty deep personal reflection for me. I am glad I encountered these ideas at this very early stage in my programming journey, because I think I am very prone to the trap that Ben sheds light on. I’d like to discuss a few examples, lessons learned, etc.

I work with a public-equity focused investment firm. Firms above a certain size need to file something called a 13-F, which is basically a list of the stocks that they own. Combing through 13-Fs of firms you respect can be a good way to find investment ideas. I had a couple of ideas of tools to build to make better use of these filings.

The firm I work for has a list of what we consider to be the best companies in the world, which we call the SCM 100 list; it serves as the shopping list for our fund. One project would essentially be something along the lines of “if > x% of a 13-F filer’s portfolio is in names on our SCM 100 list, then show us the other names from that portfolio that aren’t on the SCM 100 and that we haven’t already reviewed and excluded.” This essentially creates a list of companies that we haven’t looked at, but that are owned by investment firms that take a similar approach to ours. Sourcing this way could give us a higher “hit rate” — a larger percentage of the companies our research team looks at would actually end up on the SCM 100 list. That increases the efficiency of the research team, which is one of our firm’s most important and most expensive assets.
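To show just how simple a rules-based screen like this can be, here is a minimal sketch. The function name, the filing structure (ticker → portfolio weight), and the `threshold` parameter are all my own hypothetical choices, not anything from the book or from a real 13-F parser:

```python
def screen_filing(holdings, scm_100, excluded, threshold=0.5):
    """Screen one 13-F filing for new ideas.

    holdings: dict of ticker -> fraction of the filer's portfolio
    scm_100: set of tickers on our SCM 100 list
    excluded: set of tickers we've already reviewed and passed on
    threshold: minimum overlap weight required to trust the filer
    """
    # How much of the filer's book sits in names we already like?
    overlap = sum(w for t, w in holdings.items() if t in scm_100)
    if overlap < threshold:
        return []  # filer's style isn't similar enough to ours
    # Surface the names we haven't looked at yet
    return sorted(t for t in holdings
                  if t not in scm_100 and t not in excluded)

# Example: a filer with 60% of its book in SCM 100 names
filing = {"AAA": 0.35, "BBB": 0.25, "CCC": 0.20, "DDD": 0.20}
print(screen_filing(filing, scm_100={"AAA", "BBB"}, excluded={"CCC"}))
# -> ['DDD']
```

No model, no training data — just a weighted-overlap check and a set difference, which is exactly the kind of “boring” solution the chapter argues for.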

But when I considered whether to try building this side project, one of the things that came to mind was, “this is just a simple rules-based program, what is so exciting about that?” That deterred me from pursuing it, and I am now starting to see that this was probably a mistake. There are other issues too — such as whether I have the time, and whether I have (or would be allocated) the compute required to comb the whole database this way. But the fact that the project didn’t use cutting-edge ML should not have been a deterrent. I think that was my ego getting in the way. Another idea, which I surely have the compute for and which is just a question of personal time allocation, is to compile the holdings of, say, ten funds I really respect and show which names they added to or trimmed that quarter, along with the quarter’s low price vs. the current price, so I can evaluate the signal’s value.

There is a balance to strike between merely providing value to the business and doing something you actually enjoy. For me, in this particular situation, I think there is enough of interest (refining my basic coding skills, having the finished tool to help me find new ideas, etc.) to make it worth pursuing. But if the simpler, business-problem-focused tasks sound deathly boring to you and you never want to do anything other than work on cutting-edge AI models… well, then maybe you should try to get a spot at Anthropic or the Stanford AI Lab. And that’s okay. There’s nothing wrong with that. But know that if you are part of a small data science team serving a business that focuses on something other than cutting-edge AI, there will be times when you do things that aren’t as interesting to you.

Another lesson I learned is that I should be more willing to embrace small projects, if nothing else as a way to establish credibility. We recently decided to build an automated email alert that would notify the research team whenever one of the stocks on our SCM 100 list hit the low end of our estimated valuation range, indicating it was potentially actionable. Our lone data scientist ended up spending some time constructing this. While it may have taken me a bit longer, I think I should have taken the responsibility on myself. Not only would it have helped establish my credibility on the data side, but our PhD data scientist certainly had higher uses of his time that were a better match for his skillset, if I could have taken this off his plate.
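The core of an alert like this is a one-line comparison; everything else is plumbing. A minimal sketch of the logic — the function name, data shapes, and message format are all hypothetical, and the actual email delivery (e.g. via `smtplib`) is left out:

```python
def valuation_alerts(prices, valuation_ranges):
    """Return alert messages for tickers trading at or below the low
    end of our estimated valuation range.

    prices: dict of ticker -> latest price
    valuation_ranges: dict of ticker -> (low, high) estimated fair value
    """
    alerts = []
    for ticker, (low, high) in valuation_ranges.items():
        price = prices.get(ticker)
        if price is not None and price <= low:
            alerts.append(f"{ticker} at {price:.2f}, at/below low end of "
                          f"range {low:.2f}-{high:.2f}: potentially actionable")
    return alerts

# In production these messages would be emailed on a schedule;
# here we just print them.
for msg in valuation_alerts({"AAA": 92.10, "BBB": 150.00},
                            {"AAA": (95.0, 130.0), "BBB": (120.0, 160.0)}):
    print(msg)
```

Again, no ML anywhere — which is exactly why a task like this is a good fit for a non-specialist looking to free up the data scientist’s time.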

Along these lines, I recently interviewed a candidate for a full-time data engineer position at our firm. He said something that really caught my attention. It was pretty off-the-cuff, and I am not sure how much he had thought about it, but I thought it spoke volumes about his character and clarity. He was explaining what his role as a data engineer consisted of and how he interfaced with the data scientists and ML engineers. He said something along the lines of, “My job is basically just to spare them from having to work with SQL so they can focus on the more complicated modeling work… basically, let the big dogs eat.” This reflected not only a team-oriented mindset, but a humble confidence in his mastery of his own role and skillset, as well as a conviction about the value it provided to the firm.

I think that right now, I need to keep in mind the idea that a simple solution still provides value to the team. That is probably my highest-value route of contribution on the data science side of our firm, because it will never be our PhD’s highest use of time to do something that doesn’t involve ML. To whatever degree I can step into those situations and free him up to work on more important things, I am providing value — even if it’s just doing things he would never have gotten around to but that still help on the research side.

Back to the Ch 2 notes…

  • Communication and cooperation are key — both internally amongst the ML team as well as with other business stakeholders for whom the solution is being built.
  • “Approaching project work with a lone-wolf mentality (as has been the focus for most people throughout their academic careers) is counterproductive to solving a difficult problem.”
  • Cooperation leads to you building the correct, useful solution faster than you otherwise would have
  • Embracing and expecting change is important. If you plan for change and expect it, you can focus on what is most important: solving problems. It helps alleviate any unnecessary attachment to certain approaches or methods. It also forces the project into a “modular format of loosely coupled pieces of functionality”.
  • “ML engineering brings the core functional capabilities of a data scientist, a data engineer, and a software engineer into a hybrid role that supports the creation of ML solutions focused on solving a problem through the rigors of professional software development.”

