An Introduction to Machine Learning

by Till Bergmann, Lead Einstein Data Scientist, Salesforce

Machine learning has become a pervasive element of our daily lives, often without us being explicitly aware of it. Searching for something on the internet, the classification of e-mails as spam, Netflix recommending a movie for you, the kind of ads you see on a webpage: these are all examples of machine learning in daily use. “Smart” homes are now becoming a reality, and many of us interact with AI-driven assistants such as Siri on a daily basis. In more advanced use cases, algorithms are being developed to beat the best human players at games such as chess or Go, or to power self-driving cars. While all of these use cases rely on machine learning, the actual approaches differ widely in many ways, e.g. in the algorithms and data used. Furthermore, some machine learning models can easily be computed on your personal laptop, while others require specialized hardware (e.g. GPUs) that comes at a high price.

In this blog post we will focus on use cases applicable to enterprise business, most of which use “classic” algorithms and traditional approaches. Business use cases that can benefit from predictions are widespread: predicting lead conversion, whether a customer is going to churn (or when!), classifying customer service cases, predicting annual revenue or the lifetime value of a customer, and many more.

We will walk through some of these examples and hope to uncover some of what goes on inside the black box of machine learning. At the end, you should have a better understanding of how to leverage machine learning for your use case. In particular, we will cover the following:

  • The basic terminology of machine learning without getting too technical
  • What kind of data is necessary to build accurate models and predictions
  • How to get the most out of a model, and how to interpret the predictions

We won’t go into too many technical details in this post; rather, we will have follow-up posts on specific topics that go into more detail. This also means that we will make some simplifications, which might not be “100% technically correct” but serve to illustrate the concepts behind machine learning, which are the focus of this blog post.

The Basics of Machine Learning

While there are many subtypes of machine learning problems, we’ll focus on two basic types: binary classification and regression. Binary classification comprises problems where the outcome is either true or false, for example predicting whether a lead converts or not. There is no middle ground between these two states, as a lead either becomes a customer or does not. Regression problems, on the other hand, deal with predicting a number, for example the annual revenue of a business. The revenue is not a binary true or false, but a dollar value that can (theoretically) range from negative to positive infinity, including decimal values. The field which contains the values you are trying to predict is called the label.
If you have a use case in mind while you are reading this post, think about how you can frame your problem as either one of these two types.

While this difference in labels might not seem important at first glance, it’s a crucial distinction, as the underlying mathematics used to predict the values differs a lot between the two use cases. Regardless, there are some commonalities: No matter what algorithm is used under the hood, it works by generalizing from the input data. For this to work, you need historical data (often called training data) with labels present; for example, you need a track record of previous leads and whether they converted or not. On top of the label, you also need predictors (also called features): fields which help explain why the label is either true or false. To continue with the lead prediction example, let’s say you have information on both the company size and the title of the contact at the company (see table below).

+--------------+----------------+-------+
| Company Size | Contact Title  | Label |
+--------------+----------------+-------+
| 500          | President      | TRUE  |
| 200          | Manager        | FALSE |
| 300          | Senior Manager | FALSE |
| 10000        | Vice President | TRUE  |
| 5000         | Manager        | FALSE |
+--------------+----------------+-------+

Based on this historical data, a sales agent might now be able to come up with a simple rule-based system:

  • If the company size is 500 or more, and the contact title is VP or above, the lead is likely to convert.
  • If the company size is below 500, the lead is not likely to convert.

The salesperson can now take these rules into account when evaluating new leads, rank them accordingly, and contact the highest-ranking leads first. To come up with these rules, you need a domain expert: someone who understands all the ins and outs of your data as well as your business process. If you asked a new teammate to come up with these rules on their second day, they would most likely struggle to produce any good ones.
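
To make this concrete, here is a minimal sketch of such a rule-based system in Python. The field names, titles, and thresholds are simply taken from the toy table above and are purely illustrative:

    def likely_to_convert(company_size, contact_title):
        # Hand-written rules distilled from the historical leads above.
        senior_titles = {"Vice President", "President"}
        if company_size >= 500 and contact_title in senior_titles:
            return True
        if company_size < 500:
            return False
        # Large company but junior contact: the rules above are silent
        # here, so we fall back to "not likely".
        return False

    for size, title in [(10000, "Vice President"), (200, "Manager"), (100, "CEO")]:
        print(size, title, "->", likely_to_convert(size, title))

Note how the last lead, a CEO at a small company, is rejected by the rules; this is exactly the kind of blind spot discussed next.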

Machine learning algorithms do not have the kind of insight and domain knowledge an expert has: A number is a number, and text is text. Without someone specifying that a president outranks a vice president, an algorithm would not be able to come up with a similar rule. This might sound like a big disadvantage for machine learning, but the flip side is that machine learning is much better at generalizing than any one person. Most likely, you will have a lot more fields than just two, and it becomes impossible to come up with rules for hundreds of fields! Additionally, while business context can be very helpful, it can also hurt: Biases make it more likely for certain fields to be included in the rule sets, while others get neglected. Algorithms don’t suffer from this problem: Thanks to their “ignorance” of the wider context, they are able to extract patterns from the historical data automatically, even when dealing with hundreds of fields, and link them to the label. This also allows machine learning to deal with previously unseen data, for example predicting whether a lead converts for a company of size 100 with a CEO as the contact person. A rule-based system might have no idea how to classify this, and will continuously need to be updated as things evolve and change.

Once an algorithm has learned from the training data, the output is essentially a pairing of features and weights, i.e. numeric values that indicate how a certain value of a feature relates to the label. The combination of these features and weights is called a model. To illustrate weights in more detail, let’s leave the space of leads and use a more everyday example: predicting the weight of an animal from its size. As weight typically increases with size, size here would have a positive weight, let’s say 1.5. But what does that mean? You can visualize it in the following way: For every 1 unit increase in size, there is a 1.5 unit increase in weight. So on average, you can expect an animal that is one inch bigger than another animal to be 1.5 lbs heavier. The figure below illustrates this behavior: Every point is an animal, and the blue line is the trend that the model learned. You can see that in general the line increases steadily with animal size, but hardly any single animal is directly on the line; in fact, some are quite far away. This is the aforementioned generalization: It’s not about making a perfect prediction for one specific case, but good predictions across all cases. The complexity in real use cases comes from having not just one feature and weight, but combining tens, if not hundreds of them.
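
As a small illustration, the sketch below fits a one-feature linear model on invented animal data; the learned coefficient plays the role of the weight of 1.5 described above (all numbers are made up for illustration):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Invented data: animal size in inches, weight in lbs.
    size = np.array([[10], [14], [20], [25], [31], [40]])
    weight = np.array([16, 22, 31, 37, 48, 61])

    model = LinearRegression().fit(size, weight)
    print(model.coef_[0])  # close to 1.5: each extra inch adds ~1.5 lbs on average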

So now that we have our machine learning model with one feature, we can apply it to new data, the so-called scoring data. These would be rows where we do not know the label; for example, we know the size of an animal, but not its weight. But since we have a model with known feature weights, we can estimate the animal’s weight by applying the model. Scoring is the process that produces this prediction.
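
Continuing the hypothetical sketch above, scoring is nothing more than applying the trained model to rows where the label is missing:

    # Scoring set: sizes of animals whose weight we do not know.
    new_sizes = np.array([[18], [35]])
    print(model.predict(new_sizes))  # predicted weights for the unseen animals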

Leaving the animal kingdom and returning to our lead scoring example, we now know how to produce a machine learning model from the training set (rows 1–5) below and apply it to the scoring set (rows 6–7). Compared to the animal example, there is one small difference: We are not trying to predict a number (regression) but a true/false value (binary classification). Binary classification algorithms do not immediately predict true or false for a given row, but instead a probability: a value between 0 and 1 (or, sometimes multiplied by 100, between 0 and 100). Typically, anything above 0.5 (i.e. better than chance) is a true prediction, while values below are false. Depending on the use case, sometimes the raw probabilities are preferred over the true/false values; for example, there is a difference between a probability score of 99 and one of 51, although both can be seen as predicting a conversion!

+--------------+----------------+-------+-----------------+
| Company Size | Contact Title  | Label | Predicted Value |
+--------------+----------------+-------+-----------------+
| 500          | President      | TRUE  |                 |
| 200          | Manager        | FALSE |                 |
| 300          | Senior Manager | FALSE |                 |
| 10000        | Vice President | TRUE  |                 |
| 5000         | Manager        | FALSE |                 |
| 300          | Manager        | ?     | 0.83 (TRUE)     |
| 4000         | Vice President | ?     | 0.45 (FALSE)    |
+--------------+----------------+-------+-----------------+
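
As a hedged sketch of how such probabilities could be produced, the code below trains a logistic regression (one of several possible binary classification algorithms) on the five labeled rows from the table. The numeric ranking of titles is an assumption made purely for illustration, and the resulting probabilities will not match the table exactly:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Assumed title ranks, invented for this example.
    rank = {"Manager": 1, "Senior Manager": 2, "Vice President": 3, "President": 4}

    # Training set: [company size, title rank] and the known labels.
    X_train = np.array([
        [500, rank["President"]],
        [200, rank["Manager"]],
        [300, rank["Senior Manager"]],
        [10000, rank["Vice President"]],
        [5000, rank["Manager"]],
    ])
    y_train = np.array([1, 0, 0, 1, 0])  # TRUE = 1, FALSE = 0

    # Scoring set: the two rows with unknown labels.
    X_score = np.array([[300, rank["Manager"]], [4000, rank["Vice President"]]])

    clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    proba = clf.predict_proba(X_score)[:, 1]  # probability of TRUE
    print(proba)        # raw probabilities between 0 and 1
    print(proba > 0.5)  # thresholded at 0.5 into TRUE / FALSE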

In this section, we learned about the difference between training and scoring data, what labels and features are, and how models are learned or trained by generalizing from the training data. Finally, this model can then be applied to the scoring data to get predictions for the unknown outcomes. In the next section, we will delve further into the importance of data, and how without it, machine learning simply cannot work.

The Necessity of Good Data

In the previous section, we already mentioned the importance of historical data. But simply having some historical data lying around is not quite sufficient; it also needs to be of relatively high quality. But what does high quality mean? This is a complex topic that varies with the use case, the complexity of the data, and other factors; however, there are some common guidelines we can explore.

First, the training data should be reflective of the data that you want predictions for. For example, let’s say your sales team has introduced some codes for the size of a company: Companies with 500 employees or fewer are small, between 501 and 5000 are medium, and above 5000 are large. After this system has been in use for six months, the team realizes that it’s too broad, as too many companies are classified as small. An extra level is introduced: From now on, companies with more than 400 and up to 3000 employees are medium, between 3001 and 5000 are large, and above 5000 are extra-large. This minor change is problematic: The meaning of medium and large has now changed, and the algorithm has no idea! Predictions will suffer, as a historical medium company is different from a medium company you are trying to predict (see table below for illustration).

+--------------+----------+-------------+
| Company Size | Old Code | New Code    |
+--------------+----------+-------------+
| 500          | Small    | Medium      |
| 200          | Small    | Small       |
| 300          | Small    | Small       |
| 10000        | Large    | Extra Large |
| 5000         | Medium   | Large       |
| 300          | Small    | Small       |
| 3000         | Medium   | Medium      |
+--------------+----------+-------------+

Second, you need fields that are logically tied to your label. If all you have is the label and some unrelated fields, no algorithm will be able to deliver good predictions. For example, if all you have for your lead records is the label and the first and last name of your contact, it’s highly unlikely that much relevant information is encoded in the name that will help predict future leads. Thinking back to the animal weight prediction: If all you have is nicknames for the animals, there is simply no way any machine learning algorithm can use this to predict the weight. Machine learning, unfortunately, is not magic and cannot extract something from nothing.

Third, missing data and empty fields can be problematic, as the reasons for the data being missing are usually not clear. Often, leads are sourced in many different ways: personal contacts, referrals, web forms, and so on. The information you collect for a personal contact differs from what a web form collects; for example, you might not record an e-mail address for a personal contact, while leads collected through the web form will always include one. We can assume that personal contacts have a higher conversion rate than web form leads, and the machine learning algorithm will pick up this pattern and assign a lower probability of conversion to leads with an e-mail address. However, this clearly does not mean that you should stop collecting e-mail addresses!

Fourth, there must be enough data. This is probably the trickiest guideline, as it depends on a number of variables that are hard to nail down to specifics. When it comes to labels, you ideally want a spread of values. For binary classification, this means you need at minimum enough examples of both true and false values (e.g. converted and not converted), and in the best case something close to a 50/50 distribution, although this is not crucial, as there are mathematical ways to compensate for an uneven distribution. In most business use cases, it’s quite unlikely that you have an even distribution (5% conversion rates are much more likely than 50%), which usually does not pose a problem, with one exception: If you have only true or only false examples, the model will not be able to generalize anything about the missing label, and training will fail. For regression cases, you similarly want a spread of values. If you are trying to predict the lifetime value of a customer and all your previous examples are grouped around the same value, it is hard for any machine learning algorithm to generalize beyond that. Diversity is good in this case!
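
One common trick for uneven label distributions (just one of several, and shown here only as a sketch) is to give the rare class a larger weight during training:

    import numpy as np
    from sklearn.utils.class_weight import compute_class_weight

    # Invented labels with roughly a 5% conversion rate.
    y = np.array([1] * 5 + [0] * 95)

    weights = compute_class_weight(class_weight="balanced",
                                   classes=np.array([0, 1]), y=y)
    print(weights)  # the rare TRUE class receives a much larger weight

    # Many scikit-learn models accept class_weight="balanced" directly,
    # e.g. LogisticRegression(class_weight="balanced"), which applies
    # this re-weighting automatically during training.
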
Earlier, we already mentioned the importance of having relevant features, and usually, the more (useful) features, the better. Predicting whether a lead converts from just the company size and the title of the contact is possible, but predictions will most likely be more accurate with more information, e.g. the time since last contact and the method of last contact (e.g. e-mail vs phone).

A further dimension to take into account is the number of rows. In general, the same holds true as for features: The more, the better. However, if there is a straightforward relationship between the features and the label, it is possible to have few rows and still get accurate predictions (think of the animal weight prediction!). The more features you have, the more complex your predictions get, and the more rows the algorithm needs to generalize from the training data. As a rough guideline, a minimum of a few hundred rows is needed for relatively accurate predictions. Einstein Prediction Builder comes with a data checker you can leverage to make sure you have the minimum amount of data required.

Hopefully these examples illustrate the importance of data in machine learning, and why good data entry and maintenance are important for delivering accurate predictions. The key take-away here is that machine learning is (unfortunately) not magic, and predictions will only be as good as the input training data. The old adage of garbage in, garbage out holds true in the domain of machine learning, and to some degree, predictions will always be limited by the input data we feed in.

Interpreting Models and Using the Predictions

In the previous sections, we covered the importance of data and how a machine learning model gets trained and is then used to score predictions. But how do we best leverage these models and predictions? And how do we know if our predictions are any good?

The latter question is crucial: If you are using the predictions to make any kind of business decision, you want to be sure that they are accurate. Luckily, there are ways to measure the quality of the model: It is as easy as comparing the model’s predictions to the true outcomes! Since we do not know the outcome for any of the data in the scoring set, we cannot use those predictions; instead, during model training, a percentage (usually 10–20%) of the training data is withheld and not used for training. The data in this holdout set are given predictions just like a scoring set, but since we know the outcomes, we can compare each prediction against the ground truth and see if they match. In the table below, instead of using all five rows with known labels for training, two rows are withheld to measure accuracy. In this particular case, both predictions are correct, which means the model has a 100% holdout accuracy. Such a high number is unlikely in practice (remember, machine learning is all about generalizing, so you usually never reach 100% accuracy). It is also important to keep in mind that there are often subtle differences between the holdout set and the scoring set, so typically the accuracy decreases a little from the holdout set to the scoring set (the actual predictions). Accuracy is the easiest measurement to understand, but other, more complex metrics exist to cover the intricate details of models; you might come across terms such as precision and recall, AUROC, R², and others.

+--------------+----------------+-------+--------------+----------+
| Company Size | Contact Title  | Label | Prediction   | Set      |
+--------------+----------------+-------+--------------+----------+
| 500          | President      | TRUE  |              | Training |
| 200          | Manager        | FALSE |              | Training |
| 300          | Senior Manager | FALSE |              | Training |
| 10000        | Vice President | TRUE  | 0.85 (TRUE)  | Holdout  |
| 5000         | Manager        | FALSE | 0.34 (FALSE) | Holdout  |
| 300          | Manager        | ?     | 0.83 (TRUE)  | Scoring  |
| 4000         | Vice President | ?     | 0.45 (FALSE) | Scoring  |
+--------------+----------------+-------+--------------+----------+
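
A minimal sketch of this holdout procedure, using invented data and a 20% holdout fraction:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # Invented features and labels standing in for historical leads.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 4))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)

    # Withhold 20% of the labeled data; it is not used for training.
    X_train, X_hold, y_train, y_hold = train_test_split(
        X, y, test_size=0.2, random_state=0)

    clf = LogisticRegression().fit(X_train, y_train)
    print(accuracy_score(y_hold, clf.predict(X_hold)))  # holdout accuracy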

Now that we know whether our predictions are accurate, let’s put them to good use. As a reminder, in binary classification the predictions are usually a probability of the label being true, e.g. a prediction score of 89 means there is an 89% probability of the lead converting. In regression, the actual value of the label is predicted, e.g. a prediction can be $340,035 for an annual revenue use case. But the raw predictions do not really tell you why: Why is the score 89%? Why is it not 75%? Depending on the prediction and the specific case, the score might surprise you and not match your intuition.

One way to understand the predictions a little better is to look at the role each feature plays, by utilizing the weights the model assigned to each feature. As mentioned earlier, each field that is used to predict the label has a weight associated with it, denoting how important it was in the prediction: The higher the score, the bigger the role it played. For example, a first name field would have a much lower importance score than the title field, while a completely unrelated field (e.g. the temperature at the time the lead was entered) would have an importance of zero. However, there are some caveats: You cannot automatically infer causation from these importance scores. As mentioned earlier, the case of the missing e-mail field does not mean that you should stop collecting e-mail addresses! However, if you notice that specific titles result in a high score, you might want to target them as points of contact over lower-ranking titles.
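
In a simple linear model, these weights can be read off directly from the learned coefficients. A hypothetical sketch (data and field names invented):

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    # Toy lead data: [company size, title rank], with invented labels.
    X = np.array([[500, 4], [200, 1], [300, 2], [10000, 3], [5000, 1]])
    y = np.array([1, 0, 0, 1, 0])

    clf = LogisticRegression(max_iter=1000).fit(X, y)
    for name, weight in zip(["company_size", "title_rank"], clf.coef_[0]):
        # Larger magnitude = bigger role; the sign indicates whether the
        # field pushes the prediction towards TRUE or FALSE.
        print(name, round(weight, 4))

In practice, features should be brought to a comparable scale before comparing weights like this; otherwise fields with large numeric ranges (such as company size) end up with artificially small coefficients.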

To make matters even more complicated, interpretability of models sometimes comes at the cost of accuracy. Algorithms that produce more accurate predictions often use more complicated mathematical transformations, which are hard for humans to interpret. For example, the raw fields that you input into the model are often transformed and converted into more meaningful mathematical representations, which are hard or impossible to revert back into a human-readable value. This is often the case for text features: As algorithms ultimately can only deal with numbers, any kind of text needs to be transformed into a numerical value first. There are many ways to do this, for example by assigning numbers to picklist values: A picklist with the values small, medium, and large can be internally converted to 1, 2, 3 by the algorithm, and easily converted back. Unfortunately, some of these transformations are a one-way street: You can convert the text to numbers, but not the other way round (for example, hashing algorithms or word embeddings). These methods often result in more accurate models, but the feature weights are much harder to interpret, as we cannot map the numerical values back to the original text values.

+-------------+---------------+
| Size Code   | Numeric Value |
+-------------+---------------+
| Small       | 1             |
| Medium      | 2             |
| Large       | 3             |
| Extra Large | 4             |
+-------------+---------------+
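
A small sketch of the difference between a reversible picklist encoding and a one-way hash (the specific hash function is an arbitrary choice for illustration):

    import hashlib

    # Reversible: A picklist mapping can simply be inverted.
    to_number = {"Small": 1, "Medium": 2, "Large": 3, "Extra Large": 4}
    to_text = {v: k for k, v in to_number.items()}
    print(to_number["Medium"], "->", to_text[2])  # 2 maps back to "Medium"

    # One-way: Hashing text into one of 1000 numeric buckets. The model
    # can use the bucket as a feature, but there is no way back to "Medium".
    bucket = int(hashlib.sha1("Medium".encode()).hexdigest(), 16) % 1000
    print(bucket)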

The combination of predictions and feature importances means that each model can be used in many different ways in Salesforce, often depending on your use case. A common way of using the predictions is to rank them from high to low, e.g. to prioritize high-probability leads in our lead scoring example. Instead of having to sift through hundreds of leads to identify some to work on, the click of a button suffices to surface the most likely leads. This also works for other use cases, for example identifying users at risk of churning.

You can also use the feature importances to target new leads. As illustrated earlier, each field used as a predictor has a numerical weight associated with it, expressing how important it is in making the prediction. Instead of ranking the predictions, you can rank the weights to list the most important fields, and use this information to identify new targets. For example, if the most important feature in a prediction is the number of employees in the company, you can use this information to target companies with many employees. As elaborated earlier, this is a bit of a slippery slope, as there could be hidden relationships between other fields, and there is not necessarily a causal relationship; it could just be pure luck! Instead of blindly targeting companies with many employees, it makes more sense to take this information as a starting point and investigate further whether there is a causal relationship.

All the examples above require a “human in the loop”, but that is not always necessary. Einstein Prediction Builder serves predictions as regular Salesforce fields, which means all kinds of automated processes can be leveraged. For example, Process Builder can be set up to automatically send low-ranking leads back to the marketing team so they can nurture them further, or, going a step further, to automatically send out marketing e-mails to those low-ranking leads. Similarly, e-mails with special offers can be sent out to customers at high risk of churning. A range of automation is possible, from zero to complete automation, depending on your business practices and the use case.

In this section, we have learned how to assess whether predictions can be trusted, how to leverage them, and how to utilize further information such as feature weights.

In this blog post, we have taken a look inside the black box of machine learning. We covered the machine learning journey end-to-end, from model training to leveraging the model’s predictions, as well as the importance of data in the process. Machine learning terms such as label and features were introduced (see the glossary below), and the difference from more traditional rule-based systems was explained. There are a few key take-aways that will help you get more out of your predictions, and that we will delve into further in future blog posts:

  • Machine learning is not magic, and definitely not black magic: ML algorithms work by generalizing from historical input data and by learning which fields can help predict your label.
  • Machine learning is a broad field with different applications, for example binary classification and regression. When you want to apply machine learning in your business, think about whether and how you can frame your problem as either of those.
  • Without data, there is no machine learning: Data is the key ingredient, and without making sure that some requirements are met, predictions will not be very accurate (again, it is not magic!).
  • Interpreting the output of a model, such as predictions and feature weights, is not always straightforward and needs careful evaluation without jumping to conclusions.

As mentioned above, subsequent blog posts will both go into more details of some of these topics, as well as cover other topics (e.g. ethics). We hope that this served as a good introduction to machine learning and that at least some mysteries were uncovered!

Glossary

  • Binary classification: Machine learning problem where data is classified as either of two groups, usually true or false (for example, whether a lead converts or not).
  • Holdout data/set: The set of data held out during model training, a percentage of the training data. Predictions are compared to the known label in the holdout set to calculate the accuracy of the model.
  • Label: The field which contains the value you want to predict. For historical data (see training data), this information is known and the field is filled out. In scoring data, this field is usually empty and not known.
  • Machine learning algorithm: The mathematical process used during the training process, which learns from the training data and produces a model. Different algorithms exist depending on the type of problem (e.g. regression vs binary classification), but also within a problem. For example, there are different algorithms that can be applied to binary classification, and depending on the use case, some will work better than others.
  • Model: The mathematical representation of the generalizations learned during the training process. Usually this is an algorithm that gets applied to scoring data during the scoring process to obtain predictions.
  • Predictions: The output of applying a model to scoring data. Depending on the problem type, predictions can be probabilities (binary classification) or numerical values (regression).
  • Predictors/features: The fields in the data that hold information relating to the label. Features are the basic building blocks of a machine learning model, and model accuracy improves the more informative the features are about the label.
  • Regression: Machine learning problem in which predictions/labels are numerical values (for example, predicting annual revenue).
  • Scoring data/set: The set of data used in scoring the predictions. For this set, the label is unknown.
  • Scoring: The process during which predictions are produced from a model and scoring data.
  • Training data/set: The set of data used in training the machine learning model. For this set, the label is known and usually, this is historical data.
  • Training: The process during which a model is produced from training data.
  • Transformations: The process used to modify raw features into more useful representations. These transformations can vary in complexity, but ultimately serve to improve model quality.
  • Weights: The mathematical representation of the role a specific feature is playing in the overall model. There are differences in these representations depending on the machine learning algorithm used.
