I. The Fundamentals of Machine Learning

Patel Pooja


The Machine Learning Landscape

What is Machine Learning?

Machine learning is the science (and art) of programming computers so they can learn from data. It is a subfield of Artificial Intelligence founded on the notion that machines are capable of learning from data, spotting patterns, and making judgements with little assistance from humans.

Here is a slightly more general definition:

[Machine learning is the] field of study that gives computers the ability to learn without being explicitly programmed.

Arthur Samuel, 1959

And a more engineering-oriented one:

A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E.

Tom Mitchell, 1997

One of the best examples of a machine learning program is the spam filter in our Gmail application. The spam filter takes examples of spam emails (flagged by users) and examples of regular emails (non-spam, also called “ham”), and learns to flag spam. The examples the system learns from are called the training set, and each individual example is a training instance. The part that learns and makes predictions is called a model; neural networks and random forests are examples of models. Deciding which model to use is one of the most crucial parts of building the program, and we can measure the model’s accuracy to check whether it is doing well or not.
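
To make this concrete, here is a minimal, purely illustrative sketch of a spam filter in Python with scikit-learn. The emails below are made up; the point is only to show the pieces named above: the training set of instances, the model, and an accuracy measure (this is not how Gmail’s actual filter works).

```python
# A minimal, purely illustrative spam filter: a tiny made-up training set,
# a bag-of-words representation, and a Naive Bayes model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training set; each email is a training instance.
emails = [
    "win a free prize now",               # spam
    "limited offer, claim your money",    # spam
    "meeting rescheduled to 3pm",         # ham
    "please review the attached report",  # ham
]
labels = [1, 1, 0, 0]                      # 1 = spam, 0 = ham

vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(emails)   # turn text into word-count features

model = MultinomialNB()                      # the "model" that learns and predicts
model.fit(X_train, labels)

# Flag a new email, and measure accuracy on the training set as a sanity check.
X_new = vectorizer.transform(["claim your free prize now"])
print(model.predict(X_new))                  # expected: [1] (flagged as spam)
print(model.score(X_train, labels))          # training accuracy
```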

Examples of Applications

Let’s look at some concrete examples of Machine Learning tasks, along with the techniques that can tackle them:

  • Automatically classifying products on a production line using image analysis: This is image classification, typically performed using convolutional neural networks (CNNs).
  • Detecting tumors in brain scans: This is semantic segmentation, where each pixel in the image is classified, again typically using CNNs.
  • Automatically classifying news articles: This is text classification, which can be tackled using recurrent neural networks (RNNs), CNNs, or Transformers.
  • Creating a chatbot or a personal assistant: This involves many NLP components, including natural language understanding (NLU) and question-answering modules.
  • Forecasting your company’s revenue next year, based on many performance metrics: This is a regression task (i.e., predicting values). It may be tackled using any regression model, such as a linear regression or polynomial regression model (a short sketch appears after this list).
  • Detecting credit card fraud: This is anomaly detection.
  • Recommending a product that a client may be interested in, based on past purchases: This is a recommender system. One approach is to feed past purchases (and other information about the client) to an artificial neural network, and get it to output the most likely next purchase.

The list could go on forever, but hopefully it gives you an idea of the incredible complexity and breadth of the tasks that machine learning can handle, as well as the kinds of techniques you would use for each task.
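
For instance, the revenue-forecasting bullet above could be sketched roughly as follows. The metric names and numbers are entirely made up, and a plain linear regression stands in for “any regression model”.

```python
# A hypothetical revenue-forecasting sketch with made-up performance metrics.
import numpy as np
from sklearn.linear_model import LinearRegression

# Each row: [marketing spend, number of customers, average deal size]
X = np.array([
    [10.0, 120, 4.5],
    [12.5, 150, 4.8],
    [ 9.0, 100, 4.2],
    [15.0, 180, 5.1],
])
y = np.array([520.0, 610.0, 450.0, 700.0])   # revenue for each past period

model = LinearRegression().fit(X, y)
next_year = np.array([[14.0, 170, 5.0]])     # hypothetical metrics for next year
print(model.predict(next_year))              # predicted revenue
```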

Types of Machine Learning Systems

There are many different types of machine learning systems, so it is useful to classify them into broad categories based on the following criteria:

  • Supervised, unsupervised, semi-supervised, self-supervised, and others
  • Online versus batch learning
  • Instance-based versus model-based learning

Moreover, these criteria are not exclusive; you can combine them as your problem requires.
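
As a rough illustration of the last criterion, the sketch below (on toy 1-D data) contrasts an instance-based learner, which predicts by comparing new points to stored training instances, with a model-based learner, which fits parameters and then uses those instead.

```python
# Toy 1-D data: k-nearest neighbors (instance-based) predicts by averaging the
# closest stored examples, while linear regression (model-based) learns the
# parameters of a line and predicts from that line.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([1.2, 1.9, 3.1, 3.9, 5.2])

instance_based = KNeighborsRegressor(n_neighbors=2).fit(X, y)
model_based = LinearRegression().fit(X, y)

x_new = np.array([[2.5]])
print(instance_based.predict(x_new))  # average of the two nearest neighbors
print(model_based.predict(x_new))     # value of the fitted line at 2.5
```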

Main Challenges of Machine Learning

Since we are building machine learning models that make predictions based on data, having clean data as input is essential for the algorithms to work well. Let’s discuss some of the major challenges of machine learning.

  1. Poor quality of data: The development of any machine learning system relies heavily on data, and the absence of high-quality data is one of the major problems machine learning practitioners encounter. Unclean, noisy data can make the whole process extremely exhausting and can lead the algorithm to produce faulty, inaccurate predictions. Good data quality is therefore essential for good output.
  2. Underfitting the training data: As the name suggests, underfitting means your model is too simple to learn the underlying structure of the data; it is like trying to fit into undersized clothes. To overcome this situation (a short sketch follows this list), you can:
  • Do feature engineering (feed better features to the model)
  • Increase the training time of the model
  • Select a more powerful model, with more parameters
  • Reduce the regularization hyperparameter
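
Here is a minimal sketch of the feature-engineering remedy on synthetic data: a plain straight line underfits a quadratic relationship, while adding polynomial features (which also gives a more powerful model) fits it much better.

```python
# A toy illustration of underfitting on synthetic data: a straight line cannot
# capture a quadratic relationship, but adding polynomial features can.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
X = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + rng.normal(scale=0.2, size=50)  # quadratic signal + noise

linear = LinearRegression().fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)

print(linear.score(X, y))  # R^2 near 0: the straight line underfits
print(poly.score(X, y))    # much higher R^2 after feature engineering
```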

3. Overfitting the training data: As you might guess, overfitting is the opposite of underfitting: it means that the model performs well on the training data, but it does not generalize well. Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. Here are possible solutions:

  • Gather more training data (illustrated in the sketch after this list)
  • Analyze and clean the data thoroughly
  • Use data augmentation techniques
  • Remove outliers to reduce noise in the data
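
As a rough sketch of the first remedy, the snippet below fits a very flexible polynomial model first on a tiny synthetic sample and then on a larger one; with little data it memorizes the training set but generalizes poorly, and with more data the gap between the two scores typically shrinks.

```python
# A toy illustration on synthetic data: a degree-10 polynomial model overfits
# 10 training examples, but the train/test gap shrinks with 200 examples.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)

def make_data(n):
    X = rng.uniform(-3, 3, size=(n, 1))
    y = 0.5 * X.ravel() ** 2 + rng.normal(scale=0.3, size=n)
    return X, y

X_test, y_test = make_data(500)  # fresh data to estimate generalization
model = make_pipeline(PolynomialFeatures(degree=10), LinearRegression())

for n in (10, 200):
    X_train, y_train = make_data(n)
    model.fit(X_train, y_train)
    print(n, model.score(X_train, y_train), model.score(X_test, y_test))
# With 10 examples the training score is near perfect while the test score is
# typically much worse; with 200 examples the two scores are much closer.
```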

4. Irrelevant features: Garbage in, garbage out, as they say. Your system will only be able to learn if the training data contains enough relevant features and not too many irrelevant ones. Finding a solid set of features to train on is essential to the success of a machine learning project. This feature engineering process involves the following steps (a short sketch follows the list):

  • Feature selection (select the most useful and important features to train on)
  • Feature extraction (combine existing features to produce a more useful one; dimensionality reduction algorithms can help here)
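
The toy sketch below illustrates both steps with scikit-learn on made-up data: SelectKBest as one way to do feature selection, and PCA as one possible feature-extraction (dimensionality-reduction) technique.

```python
# A toy sketch on made-up data: only the first two of five random features
# actually drive the target.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))                                     # 5 raw features
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)   # 2 relevant ones

# Feature selection: keep the 2 features most related to the target.
selector = SelectKBest(score_func=f_regression, k=2).fit(X, y)
print(selector.get_support())        # True for the selected columns

# Feature extraction: combine the 5 features into 2 new components.
X_reduced = PCA(n_components=2).fit_transform(X)
print(X_reduced.shape)               # (100, 2)
```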

5. Insufficient quantity of training data: Unlike humans, who can learn from a handful of examples, the majority of machine learning algorithms require a large amount of data to operate correctly. Even for extremely simple problems, you usually need a few thousand examples, and for more complicated ones like speech or image recognition, you may need a very large number of examples (unless you can reuse parts of an existing model).

Testing and Validating

Now that you have built your model, it is time to check whether it works well on new cases or real-world data. One way to do that is to put your model in production and monitor how well it performs. This works, but if your model is horribly bad, your users will complain; not the best idea.

A better option is to split your data into two sets: the training set and the test set. As the names suggest, the training set is used to train your model, and the test set is used to evaluate it. By assessing your model on the test set, you can estimate the generalization error (also known as the out-of-sample error), which is the error rate on new cases. This value tells you how well your model will perform on instances it has never seen before.

If the training error is low (i.e., your model makes few mistakes on the training set) but the generalization error is high, it means that your model is overfitting the training data.

TIP

It is common to use 80% of the data for training and hold out 20% for testing. However, this depends on the size of the dataset: if it contains 10 million instances, then holding out 1% means your test set will contain 100,000 instances, probably more than enough to get a good estimate of the generalization error.
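
Here is a minimal sketch of that 80/20 workflow on synthetic data, using scikit-learn’s train_test_split; the held-out test error serves as the estimate of the generalization error.

```python
# A minimal sketch on synthetic data: hold out 20% of the instances and use the
# error on that test set as an estimate of the generalization error.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=1000)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)     # 80% train / 20% test

model = LinearRegression().fit(X_train, y_train)
train_error = mean_squared_error(y_train, model.predict(X_train))
test_error = mean_squared_error(y_test, model.predict(X_test))    # out-of-sample estimate
print(train_error, test_error)   # a large gap would suggest overfitting
```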

Happy Learning! Cheers!

If you enjoyed this content, please give it a like and follow! Your support helps me create more valuable content for you. Thank you for your support!

