DATA STORIES | MACHINE LEARNING | KNIME ANALYTICS PLATFORM

About Machine Learning — How it Fails and Succeeds

A collection of insights from practicing data science

Markus Lauber
Low Code for Data Science

--

A few remarks about insights collected over my years of machine learning, especially with the low-code software KNIME and Python, and before that IBM SPSS, SAS and R/RStudio. The software changes; the challenges remain the same and will hold true with cloud platforms as well.

Before the more recent LLM hype there was quite some buzz about machine learning, and Kaggle competitions were fashionable. I have used ML over the years in a professional context in various industries (consulting, insurance, telco) as well as out of personal curiosity.

I would like to share some hints and observations along with links to working examples you can use yourself. Such a blog cannot replace solid studies or tons of books, so I do not claim to cover it all in one go; think of it more as a kind of philosophy.

“A happy yellow robot learning in the style of El Greco” (DALL·E).

If you want to dive right into the ML examples I have collected over the years, you can check this article; some I have built myself, a lot of others are from the great KNIME team and community. If you want to learn machine learning, there are also some books, videos and links in the text.

CRISP-DM — still relevant today

In my view the basics of the “Cross-industry standard process for data mining” (CRISP-DM) still hold true after all these years. It is best to think about your ML projects as an evolving process. And if I had to name the most relevant mistake, especially the one management makes when using AI and ML, it is expecting instant 99.9% gratification/precision without letting the process run through several iterations and without thinking about the balance of costs and what is actually possible to ‘predict’.

“Cross-industry standard process for data mining” (CRISP-DM) by Kenneth Jensen, own work based on ftp://public.dhe.ibm.com/software/analytics/spss/documentation/modeler/18.0/en/ModelerCRISPDM.pdf (Figure 1), CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=24930610.

A successful machine learning project will require close collaboration between the business and the technical side. Often a data scientist is expected to fill all the roles ascribed to the profession in Stephan Kolassa’s famous Venn diagram. If you can, you should also opt for including data and software engineering know-how.

Yes, you can have a KNIME/R version of this Venn diagram on the KNIME Hub or on the forum :-)

Think about your business problem

Before diving headfirst into some fancy deep learning you might want to spend a moment thinking about your business question. What is it you want to score and measure? What you will get in the end is basically not one prediction (some sort of magic truth) but a numeric score (between 0.0 and 1.0) that gives you some sort of ‘probability’ or, more precisely, a ranking of the items/customers scored. What would you actually do if customer Pete scored a 0.87? Would your management decide based on your score to contact (or hire) Pete? How much does it cost?
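As a minimal sketch in Python (the cost and value figures are made up, not from any real project), such a score only becomes useful once you compare the expected value of an action against its cost:

```python
# Turn a model score into a business decision via expected value.
# All numbers here are hypothetical -- plug in your own economics.

CONTACT_COST = 5.00       # what it costs to contact one customer
CONVERSION_VALUE = 40.00  # what a successful contact is worth

def should_contact(score: float) -> bool:
    """Contact only if the expected value exceeds the cost."""
    expected_value = score * CONVERSION_VALUE
    return expected_value > CONTACT_COST

# Pete scored 0.87 -> expected value 0.87 * 40 = 34.80 > 5.00: contact him.
print(should_contact(0.87))  # True
# A customer at 0.10 -> 4.00 < 5.00: do not contact.
print(should_contact(0.10))  # False
```

The point is that the decision threshold follows from the economics of your use case, not from the score alone.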

Note: if I had to speculate why a lot of institutions struggle with ML projects, especially in Germany, it might have something to do with the famed German engineering culture (think “Spaltmaß”). There has to be a direct connection between things, a physical logic. True or false. Probabilities and uncertainties are not seen as a sign of a project evolving towards some (maybe useful) truth but as an indicator of a fundamental weakness: not being able to pin down the exact mechanism. So when a German sees an 89% probability, the gut feeling often seems to be: this is a failure. This seems to be the fate of a lot of ML projects.

“A happy yellow robot learning in the style of Miro” (DALL·E).

Your data is the most valuable asset

Sometimes it seems a lot of people like to talk about fancy algorithms when what they should talk about is their data: whether they understand it properly and whether the data has any chance at all to tell them what they would like to know. So maybe check some basic correlations first.
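A minimal pandas sketch of such a first sanity check (the columns and values are invented stand-ins for your own table):

```python
import pandas as pd

# Tiny stand-in dataset; in practice load your own table here.
df = pd.DataFrame({
    "age":        [23, 45, 31, 52, 36, 28, 61, 40],
    "tenure":     [1, 10, 3, 15, 4, 2, 20, 8],
    "complaints": [3, 0, 2, 1, 2, 4, 0, 1],
    "churn":      [1, 0, 1, 0, 1, 1, 0, 0],
})

# Pearson correlation of every numeric feature with the target,
# sorted by strength: a first check before anything fancy.
correlations = (
    df.corr(numeric_only=True)["churn"]
      .drop("churn")
      .sort_values(key=abs, ascending=False)
)
print(correlations)
```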

And think about bringing the data into a shape that can be handled by an ML algorithm in the first place. All those nice Iris, Titanic and classic Kaggle datasets are already prepared and most often cleaned to perfection. Real-world data is messy, often un- or semi-structured, and resides in several silos where you are happy if you have one common key to match them. Until AI can sort this all out, you will spend a lot of time with your data (hopefully).

If your models are stuck and will not improve, see if you can add more data rather than trying to find a fancier algorithm.

Ask the Business Side

A data scientist should not assume one can just throw the data at an algorithm and be done. Ask the business side what they think about their data, whether they have any recommendations on what to use, and whether they have a gut feeling about some measurements or experiences from the past.

One of my most successful features in a model came from asking a very senior colleague to simply classify products into five categories based on a few variables. What would an expert consider a ‘luxury product’ or a ‘clunker’ (or something in between)? The derived feature proved to be very significant. And if you can classify a phenomenon, please do so; even if models in general should be able to find the insight themselves, doing it in an explicit way is better. So ask the experts, they already know a lot. Also share some findings with them and discuss them.
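A sketch of what such an expert rule could look like in Python; the thresholds, column names and values here are pure inventions for illustration:

```python
import pandas as pd

def expert_product_class(row) -> str:
    """Hypothetical rules a senior colleague might write down:
    classify a product into five buckets from a few variables."""
    if row["price"] > 500 and row["margin"] > 0.40:
        return "luxury"
    if row["price"] > 200:
        return "premium"
    if row["return_rate"] > 0.30:
        return "clunker"
    if row["price"] < 50:
        return "budget"
    return "standard"

products = pd.DataFrame({
    "price":       [650, 250, 30, 120, 80],
    "margin":      [0.50, 0.20, 0.10, 0.15, 0.12],
    "return_rate": [0.02, 0.05, 0.10, 0.35, 0.08],
})
# The derived feature goes into the model like any other column.
products["expert_class"] = products.apply(expert_product_class, axis=1)
print(products)
```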

But if you find something that seemingly no one has ever found before, I think of what my former boss said: “If it is interesting it is probably wrong.”

The Target …

Think about your target (or label). In some scenarios you have a good label. You know when Pete made his last purchase. But do you know when he actually made the decision to buy? Or when the root cause of the technical ticket Alice sent you just this morning really emerged? More often than not we will have to use what we have, but we should keep in mind that the ‘truth’ in the target/label might not be as precise as we might wish it to be. The same will then be true for the prediction.

You might have to rearrange your data to reflect this. Yes, this is a very general remark, but try to keep it in mind: see, test and discuss whether what you came up with best reflects the reality you want to measure and predict.
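As an illustration (all names, dates and the 90-day window are hypothetical): for a churn-style question you would fix a cutoff date and define the label relative to a window after it, so that training mirrors how the prediction will later be used:

```python
import pandas as pd

# Hypothetical purchase events: one row per purchase.
purchases = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3],
    "purchase_date": pd.to_datetime([
        "2024-01-05", "2024-04-10", "2023-08-20", "2024-02-28", "2024-03-15",
    ]),
})

cutoff = pd.Timestamp("2024-03-01")           # features only from before this date
window_end = cutoff + pd.Timedelta(days=90)   # label window after the cutoff

# Label per customer: 1 (churned) if no purchase in the 90 days after cutoff.
bought_in_window = (
    purchases[(purchases["purchase_date"] >= cutoff)
              & (purchases["purchase_date"] < window_end)]
    .groupby("customer_id").size()
)
customers = purchases[["customer_id"]].drop_duplicates().set_index("customer_id")
customers["churn_label"] = (~customers.index.isin(bought_in_window.index)).astype(int)
print(customers)
```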

Data Preparation is important

You will have to prepare the data as it will appear the next time you want to score your (new) customers/items. You can only use in training what you will know at the point of applying your model. And you will have to reproduce the data in the same quality at the point of making the prediction. Once you know about machine learning this might seem obvious, but I have seen people fail to take this into account time and time again. More details on how to actually do this are in the technical collection under “Data Preparation”.
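In scikit-learn terms this is what a pipeline gives you. A minimal sketch (with stand-in data) where imputation and scaling are fitted on the training data only and reused unchanged at scoring time:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, 5.0], [2.0, np.nan]])
y_train = np.array([0, 1, 1, 0])

# Imputation and scaling are *fitted* on the training data only ...
model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
model.fit(X_train, y_train)

# ... and applied unchanged to new data at prediction time,
# so training and scoring see identically prepared features.
X_new = np.array([[3.0, np.nan]])
print(model.predict_proba(X_new))
```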

Bring the data into the right shape

Once you have your data ready there are all sorts of further technical preparations that might also depend on your question and the algorithms used. A lot of modern environments like XGBoost or H2O.ai will take care of some of this, but you still want to read up on normalization or dimension reduction.
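A small scikit-learn sketch (with random stand-in data) of normalization followed by a PCA dimension reduction:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 20))  # stand-in for your prepared feature table

# Normalize to zero mean / unit variance, then keep enough
# principal components to explain 95% of the variance.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)
```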

If you want to take a shortcut of sorts, you can turn to the great vtreat package, which will ‘black-box’ this process for you and produce stable and reproducible results.
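A hedged sketch of how the Python port of vtreat is typically used (the tiny dataset and column names are my inventions); the fitted plan is what you keep, so new data gets the identical treatment:

```python
import pandas as pd
import vtreat  # Python port of the R vtreat package

# Tiny stand-in dataset with a messy categorical and missing values.
d_train = pd.DataFrame({
    "x_num": [1.0, None, 3.0, 4.0, 2.0, None],
    "x_cat": ["a", "b", "a", None, "c", "b"],
    "y":     [0, 1, 0, 1, 1, 0],
})

# Design treatments for a binary target "y" (positive class = 1).
plan = vtreat.BinomialOutcomeTreatment(outcome_name="y", outcome_target=1)
# fit_transform() uses cross-validation internally so the derived
# variables do not leak the outcome into the training frame.
train_prepared = plan.fit_transform(d_train, d_train["y"])
# The same fitted plan prepares new data identically at scoring time.
new_prepared = plan.transform(d_train.drop(columns=["y"]))
print(train_prepared.head())
```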

Build and Evaluate the Model

There are tons of articles and examples on how to build and measure your models (AUC, AUCPR, RMSE, LogLoss …). I will just point you to the other article.
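For reference, a minimal sketch of computing some of these metrics with scikit-learn (the arrays are stand-ins for your hold-out set):

```python
import numpy as np
from sklearn.metrics import (
    roc_auc_score, average_precision_score, log_loss, mean_squared_error,
)

# Stand-in hold-out data: true labels and predicted scores.
y_true = np.array([0, 1, 1, 0, 1, 0, 1, 0])
y_score = np.array([0.2, 0.8, 0.6, 0.3, 0.9, 0.4, 0.7, 0.1])

print("AUC    :", roc_auc_score(y_true, y_score))
print("AUCPR  :", average_precision_score(y_true, y_score))
print("LogLoss:", log_loss(y_true, y_score))

# RMSE for a regression example.
y_reg_true = np.array([3.0, 5.0, 2.5])
y_reg_pred = np.array([2.8, 5.4, 2.1])
print("RMSE   :", np.sqrt(mean_squared_error(y_reg_true, y_reg_pred)))
```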

It is always good to test your model in reality, and you definitely should do this. But again: think about your business case. Is there a time component? Maybe you check for a certain time slot instead of a single point in time. If you measure the breakdown of a machine or the onset of an illness, the model might capture the general tendency while the breakdown happens at some point in the near future. So you might not yet send out the fire brigade but instead monitor the item more closely and be ready when it happens. The model might not technically be ‘right’ but still be useful. This (again) depends on your use case. This (again) might depend on the willingness of the business side to explore, experiment and constantly improve (no initial perfect ‘Spaltmaß’).

Bring your Model into production

This is the second-hardest part: actually using the model and getting feedback on its performance. If you can automate the tracking of the results (sales numbers, churn figures), this will be easier than relying on qualitative feedback and the hunches of your business side (which happens often enough).

You should also think about the platform your model will run on and how to store the results and run them again. This is why you will most often find my examples with a way to store and reproduce the models and results, and also to store performance metrics. A lot of examples I see stop when a model is presented inside a Jupyter notebook, and then you ask: OK, how would I now store this and bring it to my company’s system?
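A sketch of the minimum I would want next to any notebook model (the file names and metric values are made up): persist the fitted model together with its metrics so the run can be stored, compared and reproduced:

```python
import json
from datetime import datetime, timezone
import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# A stand-in model; in practice this is your fitted pipeline.
X, y = make_classification(n_samples=200, random_state=0)
model = LogisticRegression().fit(X, y)

# Persist the model ...
joblib.dump(model, "model_2024.joblib")
# ... and the metrics of this run next to it.
metrics = {
    "trained_at": datetime.now(timezone.utc).isoformat(),
    "auc": 0.84,  # stand-in value from your evaluation step
}
with open("model_2024_metrics.json", "w") as f:
    json.dump(metrics, f, indent=2)

# Later, or on another machine: reload and score again.
restored = joblib.load("model_2024.joblib")
print(restored.predict(X[:5]))
```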

This is one reason why I like the KNIME platform and the H2O.ai / MOJO environment (cf. H2O classification and regression examples). You can store and reproduce your results, and the MOJO model files can be exchanged between KNIME, R/RStudio, Python and Spark(ling Water). The best way is to use a KNIME Business Hub, where you can not only store, manage and automate your models but also cater to the needs of all your data science and analytics tasks.
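A hedged sketch of the MOJO round trip in the H2O Python API (the tiny frame is a stand-in, and min_rows is only lowered to make the toy example run):

```python
import h2o
from h2o.estimators import H2OGradientBoostingEstimator

h2o.init()

# Tiny stand-in frame; in practice this is your prepared training data.
frame = h2o.H2OFrame({
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "x2": [0.5, 1.5, 0.7, 1.1, 0.2, 1.8],
    "y":  [0, 1, 0, 1, 0, 1],
})
frame["y"] = frame["y"].asfactor()

model = H2OGradientBoostingEstimator(ntrees=5, min_rows=1)
model.train(x=["x1", "x2"], y="y", training_frame=frame)

# Export the model as a portable MOJO file ...
mojo_path = model.download_mojo(path=".")
# ... which can be re-imported here or scored from KNIME, R,
# Python or Spark(ling Water) without the training environment.
imported = h2o.import_mojo(mojo_path)
print(imported.predict(frame))
```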

“A happy yellow robot learning in the style of Rembrandt” (DALL·E).

Monitor and Update your Model

OK, and now to the hardest part: maintaining and updating your model in production. From my perspective this involves a technical side (having the numbers ready, maybe even triggering some automatic re-training and deployment of the model) but also a business side. Keep in contact with the decision makers and see if they actually use your model and care about its results, or if, after the initial hype, they just moved back to the traditional way of doing things. This can especially be true if the initial benefits are not that great. Again the incremental and gradual thing: a little improvement every week or month will lead to larger benefits in the future but might initially be hard to see.
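On the technical side this can start very simply. A sketch (the thresholds and history are invented) that tracks a metric over time and flags when the drift justifies re-training:

```python
# Hypothetical weekly AUC values collected from production scoring.
auc_history = [0.84, 0.83, 0.84, 0.81, 0.78, 0.74]

BASELINE_AUC = 0.84   # performance at deployment time
MAX_DROP = 0.05       # how much degradation we tolerate

def needs_retraining(history, baseline=BASELINE_AUC, max_drop=MAX_DROP):
    """Flag the model when the latest metric falls too far below baseline."""
    return history[-1] < baseline - max_drop

if needs_retraining(auc_history):
    print("Metric drifted -- trigger the re-training / deployment job.")
```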

Machine Learning: It is about the business question, not the algorithm!

In case you enjoyed this story you can follow me on Medium (https://medium.com/@mlxl) or on the KNIME Hub (https://hub.knime.com/mlauber71) or KNIME Forum (https://forum.knime.com/u/mlauber71/summary).

--

Markus Lauber
Low Code for Data Science

Senior Data Scientist working with KNIME, Python, R and Big Data Systems in the telco industry