How to do Artificial Intelligence: The Overarching Concepts

Alaa El Hariri
Published in Beirut AI · Oct 17, 2019

Artificial Intelligence (AI), Machine Learning (ML), Data Science (DS)… If you haven’t heard of any of those terms yet, I don’t know how you got here, but welcome! If you have heard them, though, you might be in the same league I was in about a year ago: you have a vague idea of what the three concepts are and where they overlap, but you don’t quite know where one begins and the other ends. Most importantly, you’re very interested in acquiring knowledge in the AI, ML, or DS domain, but you’re confused about what to look for so you can start.

This article will not only attempt to outline and clarify the differences, it will also outline the general steps in an AI (or ML or DS) project. The goal is to give you a general overview of the jargon and the process, as well as insight into where you can start exploring on your Artificial Intelligence journey.

AI? ML? DS?

When you’re starting out, the three terms are often thrown around interchangeably. But there are differences between Artificial Intelligence, Machine Learning, and Data Science. Below are the definitions of the three and the differences between them.

Artificial Intelligence (AI):

Artificial Intelligence is maybe the most popular of the three buzzwords, but let’s delve into what it means. Intelligence is generally regarded as a human trait, and anything made to do what a human does, without needing the human, is artificially intelligent. So it could be as simple as a self-adjusting thermostat or as complex as trying to win a game of Go. AI refers to trying to replicate an aspect of human intelligence, without really needing the human being.

Machine Learning (ML):

Machine Learning is a little more misleading in its definition than AI. It is not as simple as the thermostat. ML is a subset of Artificial Intelligence that revolves around certain types of algorithms. The algorithms are a mixture of statistics, data structures, and other computer science theory. ML algorithms are used to perform “intelligent” tasks, from telling you whether your email is spam all the way to allowing your car to drive itself. Machine Learning algorithms usually focus on autonomously improving from the available data. They also tend to be configurable enough that the developer can observe how changes within the same algorithm affect its output.

Data Science (DS):

Data Science is truly an umbrella term that encompasses many aspects of the data realm. It is the most statistics-forward of the three terms. It revolves mostly around the data itself rather than creating an autonomous algorithm to make predictions; it is more about finding patterns and drawing conclusions than making predictions. I don’t wish to mislead you, though: there is an overlap between DS and AI. Oftentimes Data Scientists will need to use AI techniques to find patterns, and sometimes making predictions is in fact part of the job, but Data Science is much broader than trying to replicate human intelligence. It steps outside of AI in that it is not just the math and computer science of it all; a Data Scientist is often more heavily involved with the experts in the field she is working in.

From Data to Predictions (The Steps)

Now that we’ve established the similarities and differences between the three main buzzwords in the space, let’s look at the steps common to creating an AI project, one that can help a Data Scientist or be the goal of a Machine Learning expert.

Identifying the Goal

Before any project begins, whether it includes AI or not, you need to identify its goals. You need to have a preliminary idea of what you want to accomplish using your artificially intelligent creation. Some goals for AI projects are:

  1. Predict tomorrow’s weather
  2. Filter out my spam email
  3. Identify if I received an image of a cat or a muffin
  4. Generate a realistic musical piece
  5. Create a self driving car
  6. Predict the probability of an individual developing a disease
  7. Identify whether a tumor is benign or malignant

Identifying the goal doesn’t only help in knowing what data to pick; it can also be an indicator of whom you may need to contact. In the case of goals 6 and 7, for example, it would be hard to know how to start without the expertise of people who have been working in the biological fields.

Data Collection

Once you’ve identified what it is you want to do, it becomes much easier to know what kind of data you need to have. If you’re lucky, there might already be an open source data set that you can use to eventually create a model.

But that is often not the case. Sometimes you have to collect the data yourself. You may even be faced with changing or dropping your goal entirely, because the data you need is impossible or too difficult to find. That is why this is generally one of the hardest steps in the entire project creation process. It becomes easier, though, if you know both your input and your output: what your algorithm will learn from and what it needs to tell you.

The way to go about the collection can range from scraping websites to collecting data from sensors over a period of time. All in all it is only an easy process if you’ve chosen your goal based on an already existing data set.
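
If you did find an open dataset, loading it can be nearly effortless. Below is a minimal sketch using scikit-learn’s bundled iris dataset; the library and the dataset are my choices for illustration, not anything this article prescribes:

```python
from sklearn.datasets import load_iris

# load_iris ships with scikit-learn, so no collection work is needed.
iris = load_iris(as_frame=True)
df = iris.frame          # the features plus a "target" column (the label)
print(df.shape)          # (rows, columns)
print(df.head())
```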

Data Cleaning

What an amazing feeling to be here! You have a dataset! In fact your dataset has thousands… no! Millions of rows! What more could you need?

I hope my sarcasm seeps through your screen, because real life is almost never this wonderful. But sometimes it is… I guess we can keep it positive and say that sometimes we get perfect data. Either way, data cleaning is actually one of the most important steps in the process.

Sometimes your dataset is a compilation of rows from a survey. It may also be data collected by an organization’s staff. Your data could have arrived from a plethora of sensors that you, yourself, set up. In all of these cases, human error is unavoidable.

If the survey was not well constructed you may find yourself with rows where values are supposed to be standardized (for example countries, cities… ) but they aren’t because they were created as free text. The organization may not have rules on how numbers or names are inserted. You may even have a case where a value and the units of measurement that value is represented in are part of the same input. Also, you, yourself, may have had a faulty sensor, or the sensor became faulty over time.

No matter what the case was, when your exploration is complete and you’ve identified all the flaws that can mess up your predictions or cause your models to fail, you need to come up with a plan to deal with them.

Two of the most prominent problems in data are empty fields and duplicated rows. Depending on your data, you may choose to remove any rows with empty fields. But sometimes that may be too much data to lose, so your other option would be to interpolate. Interpolation is when you use already existing data to populate empty fields. You may even choose to populate the empty fields with a default value such as the average. As for duplicated rows, removal is also an option, but of course there’s a catch: sometimes the repetition represents something you need, in which case you can do what is called feature engineering.
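
As a rough sketch of what handling these two problems can look like (pandas is my choice here, not something the article prescribes):

```python
import numpy as np
import pandas as pd

# Toy data with the two classic problems: an empty field and a duplicated row.
df = pd.DataFrame({
    "city": ["Beirut", "Beirut", "Tripoli", "Saida"],
    "temp": [22.0, 22.0, np.nan, 19.5],
})

df = df.drop_duplicates()                          # remove duplicated rows
df["temp"] = df["temp"].fillna(df["temp"].mean())  # default value: the average
# Alternatives: df["temp"].interpolate() to estimate from neighbouring rows,
# or df.dropna() to drop incomplete rows entirely.
print(df)
```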

Feature Engineering

I can’t explain to you why, but this is probably my favourite step. It requires that you study your data enough to know which parts of the raw data can be used as is and which parts need to be represented differently. Going back to the duplication example: instead of keeping the two rows, you can combine them into one, with an extra column that indicates the number of repetitions.

Let’s also assume you have a column identifying people’s ages but to you it is insignificant what the exact age is. You can create a new column that identifies whether a person is an adult or not by setting the flag true if someone is over 18. You may also be faced with what is called “categorical data”. Categorical data refers to columns that identify whether the entry belongs to a certain category (ex. color, country… ). There are multiple ways you can deal with such occurrences but that depends on the machine learning algorithm and the type of data you have. You can check out this article to learn more about encoding categorical data.
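
Here’s a minimal sketch of all three of these transformations, using pandas on a made-up table (the column names and values are hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    "name":  ["Rana", "Rana", "Omar"],
    "age":   [25, 25, 15],
    "color": ["red", "red", "blue"],
})

# Meaningful duplicates: collapse them and keep a repetition count instead.
df = df.groupby(["name", "age", "color"], as_index=False).size()
df = df.rename(columns={"size": "repetitions"})

# The exact age is insignificant; derive an is_adult flag instead.
df["is_adult"] = df["age"] > 18

# Categorical data: one-hot encode the "color" column.
df = pd.get_dummies(df, columns=["color"])
print(df)
```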

There are so many ways and reasons to engineer your data, and this could be an article, or series of articles, all on its own. But to keep it brief: engineering your data is when the raw form of the data is insufficient and you need to create another column (a.k.a. feature) that better represents the way your data needs to be at the end.

Exploratory Data Analysis

Once the data has been sufficiently cleaned and engineered you can move on to exploring your data. In this step most of the data visualisation happens. In less technical terms, you make graphs out of your data. And you thought you weren’t going to use that bar chart you learned about in middle school math!

This step helps you find trends in your data. It can help you identify whether your data is well spread or certain values are rare. If, for example, you wanted to try predicting whether a student would pass a particular course and only 1 out of 100 students fails it, you may be in trouble: with so few failing examples, a model that guesses “passing” every time will be correct 99% of the time.

Exploring your data is also very important in shedding light on biases. Because we as human beings are imperfect, we cannot expect our data to be perfect either. That’s why it is important to ensure that none of our human biases seeped into it. It can prove misleading, or even dangerous, if gender or race bias seeped into the data that our algorithm will learn from.

You can find such anomalies by graphing how often each value appears in a column, or simply by looking at the percentages of each type of value. You can also explore by putting your values into bins to find out which ranges most of the values fall into, as in the sketch below.
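
As a small sketch of both techniques, assuming pandas (my choice for illustration):

```python
import pandas as pd

# 99 passing students and 1 failing one, as in the example above.
grades = pd.Series(["pass"] * 99 + ["fail"])
# The percentage of each value exposes the imbalance immediately.
print(grades.value_counts(normalize=True))

ages = pd.Series([17, 21, 34, 34, 35, 52, 67])
# Binning shows which ranges most of the values fall into.
print(pd.cut(ages, bins=[0, 18, 40, 65, 100]).value_counts().sort_index())
```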

Exploration and data visualisation help you identify whether your data is skewed in any way, so you can take the precautions needed to ensure that the model you choose functions correctly.

Model Creation

Finally, the moment has arrived. Creating a machine learning model! What is a machine learning model and what does it do?!

There are many types of ML models, each one good at something in particular. Essentially there are supervised, unsupervised and reinforcement learning algorithms. Supervised means the algorithm knows the output when it’s in the training (“learning”) phase. Unsupervised learning means the algorithm will, for example, come up with its own groupings of the data. Reinforcement learning on the other hand relies on the concepts of reward and punishment to teach the algorithm to perform a task better. You can read more about the different types of machine learning here.
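
To make the supervised/unsupervised distinction concrete, here’s a tiny sketch using scikit-learn (my choice of library; reinforcement learning doesn’t fit in a few lines, so it’s left out):

```python
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X = [[1, 1], [1, 2], [8, 8], [9, 8]]   # four toy data points

# Supervised: the labels (0, 0, 1, 1) are known during training.
clf = LogisticRegression().fit(X, [0, 0, 1, 1])
print(clf.predict([[2, 1]]))           # -> a learned label

# Unsupervised: no labels; the algorithm invents its own groupings.
km = KMeans(n_clusters=2, n_init=10).fit(X)
print(km.labels_)                      # -> the groups it came up with
```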

Oftentimes, in the case of supervised learning, a machine learning model is regarded as a blackbox that takes features as input and produces output in the form of a label. A label means a prediction that the model is making.

Once your features are ready you give them to the model and with the magic of math and statistics it produces an output. The output may vary depending on your goal. It may be as simple as a flag that identifies if a row is or is not the label you wanted, and that would be called a classification. It could also be a numerical prediction like the price of a stock for example, which is called a regression.
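
A minimal sketch contrasting the two, with scikit-learn models chosen purely for illustration:

```python
from sklearn.linear_model import LinearRegression, LogisticRegression

X = [[1], [2], [3], [4]]   # one feature per row

# Classification: the label is a flag or category (e.g. 0 = ham, 1 = spam).
clf = LogisticRegression().fit(X, [0, 0, 1, 1])
print(clf.predict([[3.5]]))            # -> a class

# Regression: the label is a number (e.g. a stock price).
reg = LinearRegression().fit(X, [10.0, 20.0, 30.0, 40.0])
print(reg.predict([[3.5]]))            # -> roughly 35.0
```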

Many models exist and the variations are immense, but for the majority you can use them by following the steps below.

Prepping your data for the model

When you are first building the model you need to have a system of knowing whether your model is on the right track or not. The best way to do that is to split your data into training and testing sets.

Let’s assume you have a set of 100 rows. What you can do is divide this into two sets: one with 80 rows to use for training the model, and another with 20 rows to test whether your model trained well enough on the data.

Taking an 80–20 or a 90–10 split for training and testing is not an exact science; over the years, practicing data scientists have used these as rules of thumb. Whenever you’re dealing with large datasets, it’s often better to keep a larger percentage of the data for training.
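
With scikit-learn (one common choice, not the only one), the split is a single call:

```python
from sklearn.model_selection import train_test_split

# 100 toy rows: X holds the features, y the labels.
X = [[i] for i in range(100)]
y = [i % 2 for i in range(100)]

# Hold back 20% for testing; the other 80 rows are for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(len(X_train), len(X_test))   # 80 20
```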

Train

Who said math can’t be beautiful?

When you first start learning multiplication, you use the multiplication table. In it you have all the numbers and what multiplying them yields. You have both the inputs (the multiplicand and the multiplier) as well as the output (the product). You observe these numbers, how they interact with each other, and what the result is.

Similarly, when a model is training it has all the features and the related labels. From those it starts learning rules and patterns so it can make predictions when it gets inputs it hasn’t seen before.
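
As a sketch, assuming scikit-learn and its bundled iris dataset (both my choices for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# During training the model sees both the features and the labels,
# just like the multiplication table shows inputs and products together.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)
```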

Test

Going back to the multiplication analogy: assume that you learned that 3 x 2 = 6. You’ve memorized this as a rule and know that it works. Your teacher wants to know if you understand, so on the test she gives you 3 x 2 = ?. You can confidently answer this because you’ve seen it before. But assume that you get 2 x 3 = ?. Well, it looks familiar, and based on the rules you learned you can answer 6, but not with as much certainty as the previous one.

This is what happens in testing. Remember those 20 rows you isolated? You remove their labels and give the rows to your model. Your model will then make its predictions based on the patterns it learned from the data; some models will even give you the confidence they have in their answers. Once you feed those rows to your model, you will have the list of predictions that the model made.
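
Continuing the training sketch above (model and X_test come from there):

```python
# Hand the model the held-out features only; it guesses the labels itself.
predictions = model.predict(X_test)
print(predictions[:5])

# Many models can also report how confident they are in each answer:
probabilities = model.predict_proba(X_test)
print(probabilities[:5])
```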

Evaluate

At this point you have two sets of labels: the actual set that you extracted from the original data, and the one that was the model’s output. Just like a teacher assesses how well you’re learning multiplication by comparing your test answers to the actual answers, we evaluate machine learning models. Of course there are plenty of very refined techniques for finding out how well a machine learning model is performing, but the overall idea is similar.

Now that the model is trained, you want to use metrics to identify whether or not the model is making correct predictions. Accuracy is the easiest way to evaluate a model, but depending on the problem it isn’t always the best way. You can use precision, recall, or many of the other available metrics.
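
Continuing the same sketch, scikit-learn exposes these metrics directly:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

# y_test holds the true labels; predictions came from the Test step.
print("accuracy: ", accuracy_score(y_test, predictions))
# Iris has three classes, so precision and recall need an averaging strategy.
print("precision:", precision_score(y_test, predictions, average="macro"))
print("recall:   ", recall_score(y_test, predictions, average="macro"))
```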

Real World Applications

Now that the model is created, tested, and evaluated, we can safely launch it into the world. Machine learning and artificial intelligence are present in our everyday life. From recommendation systems (e.g. Netflix, YouTube) to weather predictions to flight price predictions, all these models started out as data and progressed through the AI tunnel until they were deemed good enough to serve people day to day.

This is not the end for our lovely models, though. Just as we never stop learning, they can’t either. Machine learning models often have mechanisms to continue learning from the data around them in the world. But this is an entire concept of its own, with plenty of studies and research conducted on it.

How to do Artificial Intelligence with Beirut AI

Beirut AI’s goal is to make AI accessible to everyone. We aim to create an AI community of enthusiasts and practitioners. That’s why we provide hands-on workshops that cover every step of the AI process, starting with data cleaning and going all the way to real-world applications.

Interested in joining one of our workshops? Make sure you sign up for our newsletter here.

We’re also launching a new University Education program where we’ll help you design an AI curriculum of workshops, talks, datathons, and more to enrich students with the skills for the AI jobs of the future! If you’d like to see Beirut AI at your university, send us a message at universities@beirutai.org.

Disclaimer

Each one of the above mentioned steps warrants an article on its own, and no explanation here is exhaustive. Research and studies on each facet of creating an AI project are their own black hole of topics. This article aims to provide the big picture of how people go about creating machine learning projects.

Special Thanks

Special thanks to Reem Mahmoud for reviewing and helping improve this article.

Thank you for reading! Till next time!

Like what we’re doing? Give us a clap!

Any topics you would like us to cover?

Drop a comment or send us an email at beirut@city.ai

You can even join us on slack

For more information visit city.ai/cities/beirut.
