Stories by Andrew D on Medium

Fairness in AI

Andrew D — Thu, 13 Apr 2023 17:45:34 GMT

Q) Algorithms are increasingly being used for decision-making in various areas of everyday life, including job applications, admission to universities, criminal justice, finance and healthcare amongst others. Research has however shown that algorithmic decisions can perpetuate existing biases and discriminate across particular groups of people. To address the problem researchers have adopted several fairness criteria and evaluation metrics. Nevertheless, when algorithmic decisions are considered to be objectively fair according to certain metrics, they are perceived to be unfair by the people who are affected by them. Discuss this problem using an example. In your discussion reference at least two research papers.

Recently, Artificial Intelligence (AI) solutions are being implemented in many impactful sectors. Such as, in agriculture, where AI has been employed to optimise crop yields and improve overall efficiency, and in Human Resources, where AI has been applied in hiring recommendation systems mostly by giving job candidates scores. In the first scenario the social consequences of a poorly designed AI solution are minimal, however in the second scenario this is definitely not the case. In these types of social applications biassed algorithms might discriminate against individuals. This could be detrimental to their wellbeing. This essay will discuss how to mitigate bias and discrimination through algorithmic fairness.

It is evident that allowing bias in AI systems leads to dangerous outcomes, but how do algorithms become biassed? In reality the algorithms themselves are not biassed, they simply do what they are taught. Hence, the bias that exists in many everyday scenarios is being taught. Therefore, during the design and development stage it is crucial to identify any potential biases and deal with them appropriately. Bias and discrimination can creep in AI systems in a number of ways. This can be the result of unrepresentative datasets, for instance, if the data is predominantly male, the algorithm will generalise better for men than for women. If the AI system is being used to make decisions that affect people’s lives, it is important that these decisions are fair. Hence, prior to introducing these systems into society, bias needs to be identified and addressed.

There are a number of ways to minimise bias, mostly through evaluating the performance using Fairness metrics. To further understand the issue at hand, consider AI loan systems at a bank. Eletter et. al. [1] managed to successfully create a model that evaluates credit applications to support loan decisions in commercial banks. Clearly, this AI application has significant social implications, being not approved for a loan has a sizable effect on the livelihood of the individual. They opted for an Artificial Neural Network (ANN) solution, that is a supervised Machine Learning (ML) technique. The ANN was trained on a dataset of 140 personal loan applications. Each applicant was represented by 11 influential variables that were deemed relevant to the loan decision. The variables are ‘Income’, ‘Job experience’, ‘Loan Size’ and eight others. The researchers deemed the results to be “very good” and stated that the system was able to correctly classify 95% of the cases in the testing set. However, they did not take into account fairness and ethics. They used accuracy to measure the performance of the ML model. However this metric alone should not be used to analyse overall performance since it ignores an important aspect, algorithmic fairness. For instance, if a programmer optimises an algorithm to increase accuracy, the algorithm will be reducing overall aggregated prediction errors, which often favour majority groups over minorities.

Here the ethical concern lies on what patterns the ML model learns from the dataset. In the UK, under the Equality Act [2], there are certain protected characteristics such as race, sex and religion that are prohibited to be a factor in a decision process. Although these features are not part of the dataset, this is not enough, since other features might be correlated. The researchers do not explain how the data is split between sex and race classes. If the dataset is unbalanced this usually leads to the algorithm favours trends in the majority population, since they have a bigger effect on the Accuracy. In this case, after the model has been developed, what needs to be done is examining the ML model fairness scores.

Algorithmic fairness [3] is defined as how fairly the algorithm treats people, on an individual and group level. Fairness metrics are used to determine whether a model is producing results that are fair to different groups of people. Common fairness metrics [4], that could have been used, are Equality of opportunity, Statistical parity and Predictive parity. These metrics will reveal whether the model is biassed towards any certain group. Inherently there is a trade-off between fairness and standard performance, a higher degree of fairness will in return reduce the standard performance of ML models [5]. Therefore, the 95% accuracy would likely go down if the model would have to be more fair. This is not necessarily a bad outcome, as fair and just loan decisions should be a high priority of any decent bank.

In conclusion, this essay aimed to explain the ethical concerns of algorithms, how to manage them and how to quantify them, mainly through algorithmic fairness. All in all, when developing AI models that affect people’s lives, it is crucial that these systems are designed and used in a way that is fair and unbiased towards different groups of people. Doing so will lead to more fair and ethical AI solutions that will benefit everyone.

(809 words)

References

[1] Shorouq Fathi Eletter, “Neuro-Based Artificial Intelligence Model for Loan Decisions”, American Journal of Economics and Business Administration, vol. 2, no. 1, pp. 27–34, 2010.

[2] “Equality act 2010,” Legislation.gov.uk. Available: https://www.legislation.gov.uk/ukpga/2010/15/contents. [Accessed: 26-Dec-2022].

[3] Pessach, Dana, and Erez Shmueli. “Algorithmic fairness.” arXiv preprint arXiv:2001.09784 (2020).

[4] Garg, Pratyush, John Villasenor, and Virginia Foggo. “Fairness metrics: A comparative analysis.” 2020 IEEE International Conference on Big Data (Big Data). IEEE, 2020.

[5] Y. Wang, X. Wang, A. Beutel, F. Prost, J. Chen, and E. H. Chi, “Understanding and improving fairness-accuracy trade-offs in multi-task learning,” arXiv.org, 04-Jun-2021. Available: https://arxiv.org/abs/2106.02705. [Accessed: 11-Nov-2022].

Gender Detection Using Machine Learning Based on Voice Analysis

Andrew D — Thu, 13 Apr 2023 17:43:21 GMT

Introduction

Machine learning (ML) is a branch of Artificial intelligence (AI) in which algorithms improve themselves over a large amount of data. The aim of this project is to experiment and evaluate different ML methods. Here ML is used to predict the sex (Male or Female) of a speaker from their voice. To accomplish this the following dataset was used https://www.kaggle.com/datasets/primaryobjects/voicegender.

This report will discuss the process of applying ML to solve a real world problem. This will include all the design decisions made, the experiment procedure and all the important results and conclusions.

Preprocessing

The dataset ‘voice.csv’ contains 3,168 recorded voice samples and was loaded on Python using pandas dataframe. In the dataset the voice samples are given using 20 features, representing the acoustic properties. Every voice sample is labelled according to speaker sex, Male or Female. The dataset is split evenly between male and female voices. Hence the dataset is balanced. The next step is to check for empty values. None of the features contain null values. The final step of preprocessing would usually be to deal with outliers. However due to the variance and diversity of speech properties this step was omitted. So it will be assumed that the provided data is accurate.

Visualisation

Prior to developing any ML method on the dataset, it is necessary to understand the data. This can be done by visualising the data. For classification useful features are features which help identify differences between the classes. Box plot and bell curve distribution are utilised to showcase this. For every feature, with respect to label, the box plot and the distribution were plotted.

Features which have a noticeable distinct interquartile range or peaks will be considered as useful. These are sd, Q25, IQR, sp.ent and meanfun. For now this will be the hypothesis for useful features.

Independent and dependent variables

Another suitable check for feature reduction is checking the correlation between variables. This will indicate which features are likely to be redundant. Below the correlation between features of the training set are displayed using a heatmap.

If the value is close to 1, (above 0.9), that means those two features are highly correlated. In the dataset the following features are highly correlated:

Meanfreq and Centroid (1)
Maxdom and Dfrange (1)
Meanfreq and median (0.93)
Meanfreq and Q25 (0.91)
Median and Centroid (0.93)
Q25 and Centroid (0.91)
Skew and Kurt (0.97)

Features with 1 correlation can be removed:

Meanfreq or Centroid.
Maxdom or Dfrange.

Therefore, Centroid and Dfrange were removed from the dataset.

From the above, one can suppose that the useless features might be:

Median
Q25
Skew or Kurt.

Splitting the Dataset

To build reliable machine learning models, the dataset was split into a train set(80%), a test set(10%) and a validation (10%) set. The train set is used to fit the model for prediction. The test set is used to tune the hyperparameters, and the validation set is only used to assess the performance of a final model**. This way we can ensure that the ML model built has a high accuracy and generalises well.

Machine Learning Models

In this section the following five ML models will be developed and discussed:

Decision Tree
Logistic Regression
Support Vector Machine
Artificial Neural Network (ANN)
k-means clustering

Decision Tree

A decision tree is a supervised classifier that arranges instances into a tree structure where leaf nodes are the final classification categories, and inner nodes and edges test an attribute. It is generally acknowledged that decision trees have a tendency to overfit on the training instances. That is, having a high accuracy on the training set but a relatively low accuracy on the test set. To mitigate this defect, it is important to tweak the hyperparameter max depth. During the training stage the main focus would be to find a model that generalises well. Another observation during training would be to note which features are deemed important by the Decision Tree. In Decision Trees important features are those which are used by the root and subsequent layers, the lower the layer the lower the importance.

Developing the Decision Tree

To develop this ML model in python scikit-learn was used. More specifically Scikit-learn’s DecisionTreeClassifier, this is a class capable of performing Decision Tree classification on a dataset. A decision tree does not require normalisation of data. The first test performed would be to find a suitable max-depth parameter. This parameter is in charge of setting the depth of the tree. A deeper tree is more likely to fit the noise/abnormality of the training data. A shallow tree is more likely to generalise better. During training, all the features were selected and a decision tree with depths ranging from 1 to 18 were analysed. Below is a plot of the accuracy scores on the training and test sets.

From this plot it is evident that the right tree depth is about 5. After this the test accuracy plummets significantly while the training accuracy reaches 100% meaning it starts to overfit. Hence the correct max depth choice is 5. This model results in a 98.5% accuracy on the training set and a 98.4% accuracy on the test set.

Feature selection

Dt1: The first test is to remove the features which had high correlation: that is removing Median, Q25 and Kurt.

This resulted in an accuracy of 98.5% on the training set and 97.7% on the test set. Which is slightly lower than the previous accuracies, this tells us that indeed these features carry low information.

DT2: The second test is to only use features which have been deemed useful. Displaying the depth 5 decision tree gives insights on which features were deemed important. The features which are at the first few layers have high information gain. Here the node was meanfun, at depth 1: IQR, at depth 2: Q75 and minfun and at depth 3: mode, Q25, sp.ent, sfm and maxdom.Therefore, the features used in this test were the following:From the box plots: sd, Q25, IQR, sp.ent and meanfun. From the depth 5 decision tree: Meanfun, IQR, Q75, Q25, sp.ent, sfm and maxdom. The final set of features are: ‘sd’, ‘IQR’, ‘skew’, ‘sp.ent’, ‘sfm’, ‘meanfun’, ‘IQR’, ‘maxdom’. This model resulted in an accuracy of 98.4% on the training set and 97.8% on the test set.

Evaluation

The above 3 Variations were tested on the evaluation set. Below are the results.

From these results it is evident that the best performing decision tree was constructed when using all the features but limiting the max height to 5.

The confusion matrix of the best performing decision tree (Dt).

4.2) Logistic Regression

Logistic regression is a supervised learning technique which is usually used for binary classification. Similar to linear regression it first fits the prediction to the best fitted line however then it uses a sigmoid function to convert this prediction into a number between 0 and 1, that is, the probability of an item being part of a class. Finally the probability is converted into 0 or 1 depending on a threshold, 0.5 for binary classification.

Developing the Logistic Regression model

Before fitting the data, logistic regression requires that the data is normalised. Here min-max normalisation is applied so that the data is in the range (0, 1).

The parameters to tune here are; Solver (the algorithm used to optimise), penalty (regularisation used) and C (Regularisation strength). The right choice of parameters should regulate training and reduce overfitting.

The following configurations were tested. These cover a sufficient amount of parameter configurations.

Interestingly the best training and test accuracy of 97.4% was achieved with every solver with no penalty. This tells us that for this dataset every solver algorithm converges to the same accuracy and that regulation is not needed.

Feature Selection

Again to try and improve the accuracy the features which had high correlation (Median, Q25 and Kurt) were removed.

The accuracy on the Test set was unchanged. Confirming that these three features are not useful.

Evaluation

Finally the two configurations were tested on the evaluation set.

Again the solution which uses all 18 features performed slightly better.

The confusion matrix of the best performing Logistic Regression model.

4.3) Support Vector Machine (SVM)

SVM is a supervised learning model which separates the training set by finding a hyperplane that distinctly classifies the data points. These hyperplanes are established with the help of support vectors. Support vectors are data points that are closest to the hyperplane. SVM tries to find the hyperplane that maximises the margin.

Image of Hyperplane and support vectors. (from https://scikit-learn.org/stable/modules/svm.html)

Developing the SVM

Using Scikit-learn we can develop an efficient SVM model for the voice gender classification task. The data passed to the SVM function should be normalised. The important parameters to take note here are the following:

C — Regularization parameter. Must be positive, the larger the value is the smaller the strength of the regularisation.
Kernel — will define which method is used to separate the data points.The options are ‘linear’, ‘poly’, ‘rbf’ (radial basis function), ‘sigmoid’ and ‘precomputed’. However, precomputing is invalid as the training data is not a square matrix.

The hypothesis here would be that the linear SVM will not perform as well as rbf and poly since the speech data looks complex and not linearly separable.

Every type of Kernel was tested and below are the best 2 SVM configurations.

Kernel = ‘rbf’ and C = 100, best test accuracy (98.7%) and 99.3% on the training set.
Kernel = ‘poly’ and C = 100, best train accuracy (99.4%) and 97.7% on the test set.
Feature selection

We would like to investigate how the accuracies achieved would change if we remove certain features. Again the features which had high correlation (Median, Q25 and Kurt) were removed.

This is how the accuracy changes when removing these 3 features.

PolySVM Accuracy on the train set: 99.3% (-0.1%)
PolySVM Accuracy on the test set: 98.7% (+1%)
RbfSVM Accuracy on the train set: 99.2% (-0.1%)
RbfSVM Accuracy on the test set: 99.3% (+0.6%)

Removing these features helped improve the accuracy on the test set. This suggests that Median, Q25 and Kurt add noise to the dataset and therefore should be removed.

Evaluation

The final 4 configurations were tested on the evaluation set.

Again, no useful information was lost when removing Median, Q25 and Kurt. Also Poly and RBF methods performed equally well on this dataset.

The confusion matrix of the best performing SVM model

4.4) Artificial Neural Network (ANN)

Artificial Neural Network is inspired from the biology of the human brain. An ANN is a collection of perceptrons and activation functions that map input to output. By making use of an input layer, hidden layer/s and the output layer an accurate representation of a function is captured.

Developing an ANN

Using PyTorch we can develop an ANN for the voice gender classification task.

Similar to SVM the data passed to the ANN should be normalised. When building an ANN the important design decisions are the following:

The Number of Hidden layers. The architecture opted for was simple, consisting of an input layer, a hidden layer and an output layer. All three layers are fully connected linear layers. This should be sufficient to capture the right decision surface.
The Hidden layer size. Usually the more hidden units the complex the representation is. It is impossible to know the right amount of units prior to testing. The hidden layer size in this case was evaluated at 10, 25, 50, and 100.
The activation function, here leaky ReLu was chosen. An improved version of the commonly used ReLU activation function. This type of activation function contains a parameter, Relu factor, that controls the angle of the negative slope. The applied Relu factors were 0.1, 0.01 and 0.001.
The optimization function, here Adam was used since it is slightly better than Stochastic Gradient Descent. This is because it uses gradient descent with momentum hence can escape local minimums. The learning rate parameter controls the step size for the weight updates at each iteration. The tried learning rates were 0.01, 0.001 and 0.0001.
Number of epochs. This number should be sufficient enough for the model to learn the dataset, however too many epochs may lead to overfitting. The tried number of epochs were 50, 100, 250, 500 and 1,000.
Loss function, here, since the output (Sex) is a binary variable, the loss function will be binary cross entropy. This gives us the error between the predicted class and the actual class.
Dropout rate, a dropout layer helps reduce overfitting by relying less on certain neurons. This is done by randomly selected neurons to be ignored during training. Dropout rate sets the ratio of how many neurons are to be ignored. The tried amounts were 0.3, 0.5 and 0.7.

To develop a sufficient ANN a random search through the above options was constructed. This is done through the following method:

Get a random combination of parameters.
If the combination is already tested, go back to step 1.
Develop and train the ANN with these specific parameter choices.
Predict the Test set.
If the accuracy on the Test set is the best seen so far, save the model’s parameters and update best Test accuracy.
Repeat for a number of times.

Evaluation

Using this method, below are some of the parameters of models that achieved 98%+ accuracy on the Test set during the random search. (98% was chosen because it is the best accuracy other standard ML model managed)

The best Model achieved an accuracy of 0.993 on the test set and 0.99 on the validation set. The ANN hyperparameters using in this model are:

Hidden layer of 100
Relu factor 0.1
Dropout rate 0.3
Learning rate 0.01
Epoch 500

Given that this performance is very high and the stochastic nature of training no feature selection will be done. Hence no further testing is required.

Confusion matrix of the Best performing ANN

4.5) k-means clustering

k-means clustering is an unsupervised learning method. k represents the number of clusters the algorithm should identify. Here k is set to 2, since we want to identify the sex(M/F) of the voice. The clustering is done by assigning every data point to the nearest centroid. To measure ‘closeness’ a distance function is used.

Developing k-means clustering

To develop this model in python, scikit-learn will be used.

The data needs to be normalised prior to applying the k-means clustering function. Also, since the algorithm is unsupervised and there are no helpful parameters to tweak, the testing set is ignored, so that the algorithm has a larger training set to fit.

In scikit-learn we can set which algorithm k-means clustering uses, the options are Lloyd or Elkan. The algorithm used did not make any effect on the accuracy.

k-means algorithm had an accuracy of 66.5% on the training set and 64.3% on the validation set. This is relatively poor accuracy, Feature selection should help improve this.

Feature selection

Test 1

Again, we would like to investigate how the accuracy would change if certain features were removed. Again the features which had high correlation (Median, Q25 and Kurt) were removed. This improved the accuracy slightly:

67.4% on the training set
65.2% on the evaluation set.

Test 2

Due to the fact that the k-means algorithm is sensitive to outliers, sometimes having less data is better. The second test is to only use features which have been deemed useful by the box plots.

These were: sd, Q25, IQR, sp.ent and meanfun.

Using just these 5 features the accuracy improved significantly:

88% on the training set
87% on the evaluation set.

Test 3

During this test we will be finding out which two features can separate the dataset best. This is done by investigating which two features obtain the best accuracy.

Out of all the combinations of the 18 features, the best 2 features to use in k-means are meanfun and minfun.

Achieved accuracy of: 94.4%.

Below is a visualisation of how these two features classify the data points.

Red = Female, Blue = Male

Since the accuracy when using these 2 features was significantly higher than the previously best accuracy, a k-means clustering model using only meanfun and minfun was developed.

Evaluation

From testing the best model was achieved when using only meanfun and minfin.

This model achieved the following accuracies:

94.4% on the training set
93.6% on the evaluation set.

Confusion matrix for this model.

5) Results and Observations

Now that we have established how the data was preprocessed and how every ML method was designed, this section will compare and contrast the ML models.

Performance of ML models

Below is a table showcasing the best accuracy obtained by every ML method on the evaluation set.

From these results it is evident that ANN is the best performing ML model and K-means clustering is the least. Also, Logistic Regression and SVM obtained the same accuracy and Decision tree managed a slightly better performance.

Why was ANN the best model?

One might suppose that ANN obtained the best accuracy compared to the other standard ML models because of ANN’s ability to learn complex data. ANN are highly flexible, given that they make use of multiple neural units. This gives them the ability to learn a variety of non-linear decision surfaces making them a suitable option on nearly every dataset. Also, when trained correctly ANN can be made to avoid overfitting. As described in Section 4.4 there are many parameters that affect the performance of the ANN, during training a random search through some of the parameters was opted for, this method helped in finding a model that generalises well. All in all, one might presume that the voice data consists of complex and irregular data which ANN managed to fit better than the other models. However, it is important to remember that the performance gap was minimal. This implies that standard ML models, such as a well designed Decision Tree still managed to fit the voice data with satisfactory accuracy.

Which model is most likely to succeed on other real world datasets?

As this report demonstrated the first step before fitting the data to any ML model would be to understand the given features and application. This is crucial since different ML models are more suitable for different tasks. For example, if the data is mostly linearly separable one might opt for Logistic Regression. However most real world datasets consist of complex and noisy data hence we would require a model that can handle this type of data. Therefore, for most real world applications many individuals opt for ANN. This is mostly due to the fact that ANN can learn non-linear decision surfaces that generalise well when trained correctly, as discussed already. On the other hand, the disadvantages of using ANN is that they are hard to interpret, especially when opting for a deep learning solution with millions of neural units. If the decision making of the model might be questioned, one might opt for designing a Decision Tree. Compared to an ANN, the way data flows through a Decision Tree is far more legible by humans. Moreover, it is hard to notice defects and biases in ANN compared to Decision Tree where one can see the deciding factors at every node. There are ways to mitigate these problems, such as making sure the dataset is balanced and creating explainable AI. Also another important factor is training time. Given the large number of parameters, training an ANN is time consuming and requires certain hardware characteristics. On the other hand SVM and K-means clustering have a faster training time, especially for simple datasets.

All in all, ANN is a great option for many real-world applications with some exceptions.

Which features are not useful?

The experimentation performed aimed to answer this question. Given that I am not a speech expert, determining if a feature is useful or not just by looking at the 20 features of the given speech dataset was impractical. However in some situations this could be easily done. For example to determine how wealthy an individual is the features eye colour and hair colour are irrelevant and we would suppose that age would be an important feature. Therefore, here a more machine learning approach was opted for. We know that useful features for classification are features which help identify differences between the classes. Hence, for every feature, with respect to sex, the box plot and the distribution were plotted. Our hypothesis would be that good features are ones which have noticeable distinct interquartile range and peaks. Below is an example of a good feature and a bad feature.

Q75 seems to be a bad feature since most of the data have similar values for both sexes, making it hard to distinguish data points.

IQR seems to be a good feature since there is a noticeable difference between the male and female values.

Another check was made by making use of a correlation matrix. Highly dependent variables are redundant and may be adding another dimension for little gain. Also variables which had a correlation of 1, that is they are the same variable were removed.

These checks are useful for us to understand the data, but sometimes it is better to let the model decide if a feature is useful or not.

Decision Tree

A Decision Tree uses important features at the top of the tree (especially root node), hence useful features will always have a greater effect. In fact the most accurate Decision Tree was obtained when using all the features but limiting the max tree depth to 5. By doing so a more generalised model was constructed. This outperformed models with feature selection.

Logistic Regression

Without feature selection Logistic Regression managed an accuracy of 98.4% on the validation set. This accuracy went slightly down to 98.1% when the features Median, Q25 and Kurt were removed. Given that this resulted in a change of just 0.3% it is evident that these features do not hold useful information.

SVM

When developing the SVM it was noted that the accuracy with and without feature selection was the same. This furthermore suggests that Median, Q25 and Kurt are useless features. Since they added 3 dimensions for no accuracy gain.

ANN

The best ANN model managed an accuracy of 99%. However no feature selection was performed. This is mostly due to the fact that retraining a neural network is stochastic. Meaning that the performance might improve or decrease for a number of reasons either then the features given. Hence due to this ambiguous feature selection was omitted for ANN.

K-means clustering

With all the features k-means clustering had an accuracy of 64.3%. This is to be expected because the data points for male and female are not easily distinguishable when using all the features. When Median, Q25 and Kurt were removed this improved to 65.2%. It seems that K-means clustering improves with less dimensions, hence only features deemed useful by the boxplots were used, that is, sd, Q25, IQR, sp.ent and meanfun. Using these features the accuracy improved significantly to 87%. This solidifies that these features are useful for classification since there is a noticeable difference between male and female values. The last test was to find the best two features that separate the male and female points. The best two features for this were meanfun and minfun, this resulted in an accuracy of 94.3%. This tells us that meanfun and minfun together are great features.

Below is a table of how the model performed when feature selection was applied.

This tells us that some ML models need feature selection more than others. Also all in all, the results conclude that the feature selection from distribution plot and correlation matrix were suitable approaches.

6) Conclusion

All assignment objectives were achieved. This project showcases the ML progress from start to finish. Including the challenges of exploring a complex dataset, preparing the data for multiple ML models, tweaking models parameters and obtaining useful results. The methodology and experimental procedure used can be applied to any real world application. Here this setup resulted in successful ML models with overall high accuracies. With the best model being ANN achieving an accuracy of 99%. Furthermore, this project showcases the importance of feature selection and model selection. As discussed, certain models are better at certain applications.

Solving Open AI Control Problems using Reinforcement Learning

Andrew D — Thu, 13 Apr 2023 17:26:25 GMT

What is Reinforcement Learning?

In Reinforcement Learning an agent learns what to do by interacting with the environment. The fundamental framework of RL is the Interaction Loop. The agent receives a state from the environment and based on this state the agent takes an action that it deemed to be beneficial. Certain actions lead to positive or negative rewards. The ultimate goal is to maximise the sum of all rewards into the future. There are a number of contrasting approaches to achieve this, such as Value Based methods, Policy Based methods and Actor-Critic Based methods. Moreover, RL has a number of real world applications especially in control problems given that RL agents can be taught to handle large and complex environments efficiently. In this report concepts of deep RL will be discussed.

Brief description of Value based, Policy based and Actor-Critic models.

There exist different ways of teaching the RL agent. The main three methodologies are Value Based, Policy Based and Actor-Critic methods. In value based methods the algorithm keeps track of “how good” each state (or state-action pair) is and in every iteration these values are updated. However, in some cases there are a lot of possible states, therefore this function may be an approximation, such as in DQN. Contrasting with that approach, policy based models focus on directly inferring a parameterized policy, independent of the value function, that maximises the reward on an
environment. This is done by gradient updates. Policy based methods can learn stochastic policies that are important in adversarial scenarios. Moreover, since the method is improving the policy directly, it requires less memory than value based methods. However policy learning leads to high variance estimates of the gradient updates that might destabilise the learning process. Actor-Critic models include a policy gradient and a value function estimator. This is done by having two main components, the actor and the critic. In simple terms, the actor takes an action and the critic provides feedback. Therefore, the model learns two functions, a policy that controls how the agent acts and a value function to assist the policy update by measuring how good the action taken is. All in all, the aim of Actor-Critic models is to benefit from both value based and policy based approaches while minimising some of their shortcomings.

Value Based models

Value based methods use a value function to determine “how good” each state or state-action pair is. At every iteration this value function is updated using the Bellman Optimality Equation. Initially this method has no idea what the value of the states are. Their value is learnt through experience and numerous episodes. For this method to converge, every state must be visited multiple times and different actions must be picked from each state. For this to happen usually an epsilon-greedy policy is implemented. This is a simple way to allow the model to explore different states and actions. Here, with 𝜖 probability pick a random action, and with 1 − 𝜖 probability pick the best action by consulting the value function. Where 𝜖 is a value between 0 and 1. For value based methods to be successful the policy should start by exploring different actions in various states and by time reduce exploration and start exploiting the known state values. One common way to achieve this exploration-exploitation balance is by introducing epsilon decay in the epsilon-greedy policy. Such as 𝜖 = 1/𝑘 or 𝜖 = 𝑒 −𝑘/1000. Therefore, as the number of episodes (k) increases 𝜖 will decrease resulting in less exploring. An example of a value based model is Q-learning where the optimal state-action policy is found using a Q function. The Q function estimates the value/quality of taking an action at a given state and again is updated iteratively using Bellman Optimality Equation.

Deep Q-Networks (DQN)

DQN are an extension of the model-free and value-based model, Q learning. The fundamental difference is that DQN make use of Artificial Neural Networks (ANN) as the Q function. The main advantage over Q-learning is that DQN can handle environments that consist of continuous action and states, making them applicable to many real world scenarios.
This is possible since the Q function is no longer implemented as a table. An ANN is trained to approximate the Q-table. To better understand how DQN utilised consider the research done by Volodymyr Mnih Et al [5]. They developed a RL agent to learn how to play classic Atari games using DQN. The input of the network is the state, here it is images displaying the game state. The neural network is made up of a couple of convolution layers followed by fully connected layers. The output is the Q value of all the possible actions. In this research DQN models managed to obtain above
human level control in many Atari games. To develop a DQN the following components are required:

Experience Replay, in DQN experience, defined as the following tuple: , can be stored and reused. Experience replay memory is made up of a number of experience tuples. Therefore, the first step is to populate the Experience Replay by generating experiences by following a policy. The ANN will be trained by sampling a mini batch from the replay memory. Hence the training examples are independent and identically distributed making them suitable.
Target Network, in Q-learning only one state-action value is updated in every timestep, however in DQN many Q values are updated at every timestep. This causes a problem, an update can affect the value of the next state. Therefore, a target network is used to stabilise learning. A target network is a copy of the DQN that is updated every n steps, where n is a hyper-parameter.
Error Clipping, The Error function of the DQN is defined as the difference between the predicted Q value and the target network’s Q value. To further improve the stability of the algorithm the error is clipped in the range -1 and 1.

The described algorithm is the fundamental DQN algorithm, however there exist many improved versions of this algorithm. Such as, Double DQN that addresses maximisation bias, Dueling DQN that uses state value and advantage value to estimate the Q-value of an action and many more. It is interesting to note that in Matteo Hessel Et al. research [3] they noted that the performance of a DQN improves significantly if several independent improvements are combined together. To conclude, DQN is a good and flexible RL model that enables the agent to learn large environments efficiency.

Policy Based Methods

Policy based models learn a parameterized policy directly, without the need of a value function. The aim is still to maximise the cumulative future rewards. In policy gradient methods the optimal parameterized policy is obtained by gradient descent. Usually an ANN is used as the policy function. With the state being the input and usually having a soft-max output unit. Hence the output is the probability of taking each action, that is, 𝜋𝜃 (𝑎 | 𝑠) outputs a vector of probabilities. The optimal action should have the highest probability, therefore its value should be close to 1. To achieve this, at every iteration, gradient ascent on 𝜋𝜃 (𝑎∗ | 𝑠) must be performed. In policy gradient, actions values are changed proportional to the estimated value. The typical update rule used in policy gradient algorithms follows the following policy gradient equation:

Where: 𝜏 is the trajectory, 𝑠𝑡 is the current state an agent is in at timestep t, 𝑎𝑡 is the action an agent takes at timestep t, log 𝜋𝜃 (𝑎𝑡 | 𝑠𝑡 ) is the log of the probability of the current policy to choose 𝑎𝑡 in 𝑠𝑡 , 𝛾 is the discount factor and R(𝑠𝑡 ′ , 𝑎𝑡 ′ ) is the reward received from 𝑠𝑡 using 𝑎𝑡 . With this equation the update rule can be defined as follows: 𝜃 ← 𝜃 + 𝛼∇𝜃 E𝜋𝜃 𝑅(𝜏) Where: 𝜃 is the neural network weights, 𝛼 is the learning rate and ∇𝜃 E𝜋𝜃 𝑅(𝜏) is defined already. The above methodology can be implemented following the REINFORCE algorithm, a Monte Carlo variant of policy gradients. Pseudo-code for REINFORCE.

(1) Initialise 𝜃 arbitrarily.
(2) Generate an episode following current policy.
(3) For each time step in episode: Use Update Rule to update 𝜃 .
(4) Repeat

The advantage of policy based methods over value based methods are that they require less memory since only a policy function is required. Moreover they can learn stochastic policies and can handle continuous actions efficiently. On the other hand the main disadvantage is their high variance estimates of the gradient updates. There are variants of REINFORCE that mitigate variance such as including baselines in the policy gradient.

Actor-Critic Methods

Actor-Critic can be thought of as a temporal difference (TD) version of Policy gradient. As the name suggests there are two main components, the Actor and the Critic. The Actor picks the actions and the Critic provides feedback. In this way, the agent is learning from feedback. Both components are updated at every iteration. Therefore, the Actor becomes slightly better at choosing actions and the Critic will give better feedback in the future. The main idea of this method is that by having both a policy function, 𝜋𝜃 (𝑠, 𝑎), and a value/advantage function, ˆ𝑞𝑤 (𝑠, 𝑎), it combines the best properties of policy and value methods. The update happens after every action. At each timestep, the state is passed as input through the Actor and the Critic. The process is the following:

(1) The policy takes the state and outputs an action 𝐴𝑡 .
(2) The critic takes 𝐴𝑡 as input and, using 𝑆𝑡 and 𝐴𝑡 , computes the value of taking that action at that state (computed the Q value).
(3) 𝐴𝑡 is performed in the environment and outputs a new state 𝑆𝑡 +1 and a reward 𝑅𝑡 +1.
(4) The Actor updates its policy parameters using the following update rule: Δ𝜃 = 𝛼∇𝜃 (log 𝜋𝜃 (𝑠, 𝑎)) ˆ𝑞𝑤 (𝑠, 𝑎) Where: 𝛼 is the learning rate, log 𝜋𝜃 (𝑠, 𝑎)) is is the logarithmic of the policy output when given s and a, and ˆ𝑞𝑤 (𝑠, 𝑎) is the value of taking a in s.
(5) The Actor produces the next action to take, 𝐴𝑡 +1, given the new state, 𝑆𝑡 +1.The Critic then updates its value parameters using the following equation: Δ𝑤 = 𝛽 ( ̄𝑅(𝑠, 𝑎) + 𝛾 ˆ𝑞𝑤 (𝑠𝑡 +1, 𝑎𝑡 +1) − ˆ𝑞𝑤 (𝑠𝑡 , 𝑎𝑡 )) ∇𝑤 ˆ𝑞𝑤 (𝑠𝑡 , 𝑎𝑡 ) Where: 𝛽 is the learning rate, 𝑅(𝑠, 𝑎) is the reward, 𝛾 is the discount rate, ˆ𝑞𝑤 (𝑠𝑡 +1, 𝑎𝑡 +1) is the Q value of 𝑎𝑡 +1 in 𝑠𝑡 +1 and ˆ𝑞𝑤 (𝑠𝑡 , 𝑎𝑡 ) is the Q value of 𝑎𝑡 in 𝑠𝑡.

The above methodology is the basic Actor-Critic method, there exists a couple of enhancements, such as, replacing the Q function with an Advantage function and using Gaussian distribution to output probability distribution hence can handle continuous action spaces.

Two Actor-Critic Methods: Deep Deterministic Policy Gradient (DDPG) and Soft Actor Critic (SAC)

DDPG. [ 4] is an actor-critic, model-free, deterministic, off-policy method. It is an extension of DQN that can handle continuous action spaces. This is made possible by having as the actor a policy network that takes the state
as input and outputs the exact action. Additionally, the critic is a Q-value network that takes in state and action as input and outputs the corresponding Q-value. Similar to DQN it trains on samples from a replay buffer and uses target networks for both the actor and the critic. This feature greatly improves the stability of learning. The Q function is updated by first computing the target Q values

hen these values are using to update the Q-function by gradient descent using the following equation:

SAC. [2] is an actor-critic, model-free, scholastic, off-policy method. The object of this method is to obtain an optimal policy that maximizes both the long-term expected reward and the entropy of the policy. In return this would provide a suitable balance between exploitation and exploration. In SAC, entropy is defined as uncertainty of the policy at that state, that is, how difficult it is to predict which action the policy is going to take from a state. The Critic is made up of two Q function that are updated using the Bellman equation. This process is composed of two main parts: the
first part being computing the target Q values using the following equation

and then using gradient descent by using the following equation:

The min is taken from these two Q functions this is done to mitigate positive bias in the policy improvement step. The policy is updated using gradient ascent on the following equation:

Here ̃𝑎𝜃 (𝑠) is a sample from the policy. All in all, SAC allows for sample efficient and stable learning. This makes it a suitable algorithm for control problems.

METHODOLOGY

Open AI Gym Lunar Lander (Discrete Actions)

Problem Description. Lunar Lander is a toy game based around physics control, where the objective is to land the rocket smoothly and as close to the landing pad as possible. The environment is defined as follows: Actions: In the discrete version there are 4 actions available: 0. Do nothing. 1. Fire left orientation engine. 2. Fire down engine. 3. Fire right orientation engine. Observation space: The state is a vector of 8 items, being: the coordinates of the lander in x & y, its linear velocities in x & y, its angle, its angular velocity, and two Boolean variables that represent whether
each leg is in contact with the ground or not. Starting State: The lander starts at the top centre of the viewport with a random initial force applied to its centre of mass. Reward system: Reward for moving from the top of the screen to the landing pad and coming to rest is about 100–140 points. If the lander moves away from the landing pad, it loses reward. If the lander crashes, it receives an additional -100 points. If it comes to rest, it receives an additional +100 points. Each leg with ground contact is +10 points. Firing the main engine is -0.3 points each frame. Firing the side engine is -0.03 points each frame. Solved is 200 points. Episode Terminates: The episode ends if a following condition is met: 1) If the lander crashes (the lander body gets in contact with the moon). 2) If the lander gets outside of the viewport (x coordinate is greater than 1) 3) If the lander is stationary (landed). During experimenting, the problem is considered to be solved/converged when the average score over the last 50 episodes is equal or above to 195. Then the algorithm is made to play on the test set. The test set is made up of 50 new episodes. The libraries used were the ones used in class, namely, gym, torch and other standard python libraries.

Solving Using DQN. Standard DQN, as explained before, was implemented to solve this environment in Python. The only difference is that the implemented DQN made use of a priority experience replay. The idea behind this improvement is that some samples are more important than others, hence they should be weighted. The bigger the error, the higher the priority. This is done to sample data more efficiently. The Network used was a simple feed forward ANN consisting of the following architecture: Input layer -> ReLU -> Output Layer. The loss function employed was
Smooth L1 Loss and the optimiser used was Adam. An epsilon greedy policy was used during training, to balance explorative and exploitative steps. Since the algorithm uses a priority queue, the update algorithm is changed slightly to adjust to the biased sampling used. Expected Q value computation code: expected_qvalues = b_rewards + GAMMA * (1 — b_dones.type(torch.int64)) * max_target_qvalues. The expected Q value are computed by using the Target network to obtain the best action to take in the next state. Calculating Lose code: loss = b_weights * loss_fn(qvalues,
expected_qvalues). In DQN the loss is the difference between the Q values and the expected Q values, b_weights is the importance-sampling weights and is introduced to remove bias from sampling. The error is Clipped using the following code: param.grad.data.clamp_(-1,1). The tested parameters were batch size, hidden layer size, learning rate and epsilon decay. The others were kept constant as their effects are minimal and well known. Three variants were designed

V1) The first DQN configuration tested used a hidden layer size of 124, a batch size of 75, a learning rate of 0.001 and an epsilon decay of 0.996. This configuration converged in 683 episodes and achieved an average score of 211 on the test set.
V2) The second DQN used a hidden layer size of 75, a batch size of 100, a learning rate of 0.001 and an epsilon decay of 0.994. This configuration converges in 719 episodes and achieved an accuracy of 165 on the test set.
V3) The last configuration consisted of a batch size of 50, hidden layer size of 100, a learning rate of 0.001 and epsilon decay of 0.997. This converged in 738 episodes and managed a score of 202 on the test set

Solving Using a Policy Based Method, REINFORCE. REINFORCE, explained in Section 2.3. The Policy was implemented as a deep feed forward ANN. Consisted of the following architecture: Input Layer->ReLU->Hidden Layer->ReLU->Output Layer ->Softmax. The last layer is a softmax because the output of the network is the probability distribution of the possible actions. Code for picking an action: sampler = Categorical(probs); action = sampler.sample(). Therefore every action has a chance of getting picked by the sampler, making it a stochastic policy. During training, first an episode is played out using the policy till completion then the policy is updated. The update is made in the following way, first the discounted reward at every timestep during the episode is computed. Code for calculated cumulative Reward: for i in reversed(range(0,len(rewards))): cumulative_rewards = cumulative_rewards * gamma + rewards[i]; discounted_rewards[i] = cumulative_rewards;
The weights of the policy network are updated by applying gradient descent on this loss equation. The algorithms parameters are gamma, ANN hidden size. and learning rate. Gamma was set constant at 0.99.

V1) The first configuration tested used a hidden size of 124 and a learning rate of 0.001. This configuration converged in 5,753 episodes and achieved a score of 193 on the test set.
V2) The next test was to half the size of the hidden layer, 64. This setup converged 7,624 episodes and managed a score of 182 on the test set.
V3) The last tested setup was a neural network size of 256. However, this set up failed to converge.

From these results one can conclude that for discrete action space DQN with prioritised experience replay performed better than REINFORCE. The learning rates of REINFORCE showcases the drawbacks of on-policy methods. Since it learns a policy function directly this leads to a lack of exploration that can result in sub-optimal solutions. Moreover, version V3 of REINFORCE did not converge to the required score (within 10,000 episodes). On the other hand, DQN proved to be a suitable algorithm for discrete action spaces. The main advantages of DQN are that it samples experience efficiently and uses another policy (epsilon-greedy policy) to allow for exploration. All in all, DQN converged in a timely manner and performed well on the Test set.

To conclude, this project seeks to explain and showcase different deep RL methods and paradigms. Moreover, four RL algorithms were implemented to solve a control problem. The main difference between Standard RL and Deep RL is that introducing Neural Networks means that convergence is no longer guaranteed. However, such approximates are essential in order to learn and generalize on large state spaces. Hence, Deep RL is crucial in real world applications, such as in control problems, that are complex with large state and action spaces.

Computer Vision: Viola-Jones Object Detection

Andrew D — Tue, 06 Dec 2022 22:49:44 GMT

For humans, distinguishing objects in an image is a straightforward task. However, for a computer program to detect specific objects from an image it can result in a difficult task. A computer needs to make use of some image features to gather intel on the objects in the image. Then train the program by some learning mechanism on these features. Finally when given a never seen before image some sort of matching/classifying needs to take place. Therefore the need for a framework is evident. One popular solution is Viola-Jones Object Detection. It is a framework developed by Paul Viola and Michael Jones. Although the framework is relatively slow to train, it can quickly detect objects accurately. Moreover, it works really well to detect human faces and can do so in real time. This makes the Viola-Jones a reliable algorithm to detect faces from live video recordings (for example webcams and security cameras).

Viola-Jones Object detection is made up of 4 stages, which are:

Selecting Haar-like features
Creating an integral image
Running AdaBoost training
Creating classifier cascades

To understand how the procedures work one must know what every step is achieving.

Step 1, Selecting Haar features. Haar features are digital image features. More specifically, they calculate image intensities, that is, light vs dark pixels. This is done by making use of a detection window, which is a specific area (usually rectangular) in an image, as can be seen in the diagram below. The sum of these windows, which is worked out by subtracting the sum of pixels under the white rectangle from sum of pixels under the black rectangle, is calculated. Finally the sections are categorized using the difference between the sums of the windows.

From these sections a computer can tell specific properties. For example in a face detectioning algorithm, all faces share similar attributes. Mainly, a nose, two eyes, lips, etc. Now looking at the intensities of an image of a face one can note certain characteristics, for example the nose bridge region is brighter than the eyes. A program using haar-like features can use these similar properties to classify an image efficiently. More on this later.

Step 2, Creating an integral image. An integral image gives a fast and simple way to calculate the value of any haar-like feature. This is done by making use of the Summed Area Table. The value for location (x, y) on the integral image is the sum of the pixels above and to the left of the (x, y) on the original image plus itself. For example, from below to calculate the (2,2) = 16: 5+2+3+6 = 16.

After constructing the integral image, it is used to efficiently calculate the sum of dark/light rectangular regions in Viola-Jones Algorithm. This is done by summing up (see diagram below) the corner pixels of the wanted window.

To get the sum of the blue rectangle, just add the green values and subtract the red values in the integral image. 1+21–11–3 = 8.

Step 3, Running AdaBoost training. The AdaBoost, that is Adaptive Boosting algorithm, is a learn and predict ML model that identifies the best features. It does this through information on the training dataset. The algorithm sets a minimum threshold for determining a useful feature. The output is a classifier. This ‘strong’ classifier is made up of many ‘weak’ classifiers. To find these weak classifiers the algorithm runs for N iterations. When choosing N, one must be careful to not overtrain the classifier. This is done when the model predicts training examples well but cannot predict never seen before data accurately. To avoid overtraining N should not be a high number.

Step 4, creating classifier Cascades. Above it was noted that a strong classifier is made up from many weak classifiers. Here, this strong classifier is turned into a cascade. Having many stages, each stage consists of a weak classifier. By having this chain of stages, it can effectively test our image/subimage. This is since the task of every stage is to determine if in the given window there is not a face or maybe there is a face. The output will always be no or maybe. If ‘no’, the window is discarded and it does not need to go through further levels. If ‘maybe’, the next classifier is called and the process is repeated until the last stage is finished. This is a much more cost effective and a fast procedure since many windows will be quickly discarded.

To conclude, as seen from the above image using Viola-Jones object detection is highly accurate to detect faces. The above 4 steps contribute in making the process of object detection cost efficient and accurate. Hence Viola-Jones is used for detecting faces in live footage. In the following sections, results from using a Viola-Jones framework will be discussed.

Amazon’s AI recruiting tool that showed bias against women

Andrew D — Tue, 06 Dec 2022 22:37:14 GMT

Firstly, one must understand what bias is and how it affects AI systems. Cambridge English Dictionary defines bias as “the action of supporting or opposing a particular person or thing in an unfair way, because of allowing personal opinions to influence your judgement”. From this definition the danger of allowing bias in AI systems can be seen. It is the programmer’s duty to eliminate or minimise bias in algorithms. This is because programs are not inherently biassed, however the people designing them are. By removing personal judgement and prejudice from AI systems, these systems would generalise better and produce more accurate results. However, omitting bias when creating AI techniques can be a difficult procedure. This is because, as will be shown in the next section, spotting bias can be a problematic task.

To understand better how bias can creep into AI systems, take a look at one of the world’s biggest companies, Amazon, and how their hiring tool had a fundamental problem. It was favouring male applicants. According to a Reuters article, “Amazon scraps secret AI recruiting tool that showed bias against women”[1]. Amazon developed a hiring tool that used AI to give job candidates scores. However it was not doing this in a gender-neutral way. The score given to an applicant had a lot to do with gender rather than qualifications. This bias to promote male candidates occurred because of how the algorithm was trained. “Amazon’s computer models were trained to vet applicants by observing patterns in resumes submitted to the company over a 10-year period.” The problem with this approach was that most of these resumes came from men. The system taught itself to penalise resumes that included the word “women’s” for example, “women’s football team”. Also it downgraded graduates of two all-women’s colleges. Moreover, it was seen that the system would promote higher masculine language. Given the rise of popularity of hiring tools in big companies, it is crucial that the underlying algorithms do not discriminate. As seen above, even though gender and names were not included, the AI system still managed to contain bias in favour of male candidates.

From the Reuters article, an understanding of the problem is achieved. In this section, a deeper look into how Amazon’s hiring tool was biassed will be discussed. To understand bias in AI, one must have knowledge about the design of the system[2]. This AI tool was designed to give a score according to the individual’s resume. In this case it was clear that the tool will result in many false negatives, that is, wrongly disqualifying a potentially strong candidate. Since having a different background or gender to others doesn’t make the candidate a weak option. However the system was trained to do exactly this.

How was it trained to assume this? Machines rely heavily on information to come up with predictions. It is crucial that the data presented to an algorithm in the training phase is relevant. It must be able to make intelligent predictions from it. In this case Amazon used the resumes submitted by past candidates. For the algorithm, good resumes were those which received an offer by a hiring manager which were mostly male candidates, resulting in a class imbalance. With just this dataset the algorithm failed to accurately generalise because the data is not an accurate representation of the population of software engineering resumes. Other than that, Amazon used a ‘bag of words’/keywords to score the resumes. Clearly this NLP component didn’t work as intended, since it was noted that the algorithm promotes masculine words while penalising feminine ones.

It is shocking that a company as influential as Amazon in the automated world failed to notice all these issues while creating the hiring tool. How can one prevent this from happening in the future? First of all, when creating algorithms one must be aware of bias and influence being taught to the AI. In this specific case, it is well researched that diversity is crucial in the workplace and improves overall team performance. Therefore, when building a hiring tool these considerations must be taken into account. A solution might be to promote diversity instead of penalising it. For legal or ethical purposes a company would need to have certain minorities hence the algorithm can be made to promote words that are associated to these minorities when necessary. This way the target score aligns closely to the real company goal.

In recent years there has been a rapid rise of Natural Language Processing applications. Many companies introduced automation to handle parts of their businesses. As seen above, even Amazon failed to realise bias in their AI. These NLP systems have an effect on society as a whole. Especially when used to hire people. It is crucial that there isn’t any discrimination. As seen, bias can take many forms. Hence it is up to the programmers and designers to be knowledgeable about it and minimise the effect. Only then can these AI systems be truly beneficial to society.

REFERENCES

[1]https://www.reuters.com/article/us-amazon-com-jobs-automation-insight-idUSKCN1MK08G

[2]https://becominghuman.ai/amazons-sexist-ai-recruiting-tool-how-did-it-go-so-wrong-e3d14816d98e

Johnson’s algorithm To find simple cycles in a directed graph.

Andrew D — Tue, 06 Dec 2022 22:02:09 GMT

Discussion about how Johnson’s algorithm works and how it is used for finding all simple (elementary) cycles in a directed graph.

In graph theory, a cycle is a path in the graph such that the first and last vertex is the same. For example, A->B->C->B->A where A,B and C are vertices. Furthermore, a simple (elementary) cycle is a cycle where no vertex is repeated more than once except the first and last vertex. For example, A->B->C->A. Given a graph and its strongly connected components Johnson’s algorithm will find all simple cycles within the graph.

Here is an overview on how it works:

The first step is to index the nodes in the graph from 0 to n, where n is the size of the graph-1. Then divide the graph into subgraphs. Each subgraph corresponds to a strongly connected component (above is already explained how to find these SCC within a graph). This is because a cycle will always be in a SCC. Since by definition of SCC once the path leaves the SCC there is no path back in that SCC. Hence a cycle can never leave a SCC since the first and last vertex are the same. Johnson’s algorithm is called on every subgraph. This algorithm similarly to Tarjan’s starts a dfs but this time it keeps track of three lists, being; stack, blockedSet, blockedMap and total (total will hold all the simple cycles). The purpose of stack is to keep track of the vertices visited during a search, while blockedset is used to keep track of the already visited vertices. As for blockedMap, it is used to keep track of vertices which are in the blockedset because they lead to a deadend. A deadend occurs when a vertex leads only to a vertex which is in the blockedSet hence the search has nowhere to go. However they are unblocked (hence removed from the blockedSet) when the vertex they lead to is no longer in the blockedSet. The elements inside the blockedMap are of form (1,2), which would translate to if 1 is no longer in the blockedSet then 2 is removed from the blockedSet.

Everytime a DFS is started, keep track of the starting vertex. Also while traversing through the graph (by exploring the children of vertices in a dfs manner (explained already above)), everytime a new vertex is discovered add it to the stack and to the blockedSet. Always expand vertices which are not in the blockedSet, if vertex is the starting node then that is a simple cycle add it to the total by getting the vertices in the stack (hence path) and backtrack. If a vertex leads only to vertices found in the blockedSet, then the algorithm is in a deadend, use the stack to backtrack but also update the blockedMap. Vertices can also be unblocked, hence removed from the blockedSet, this is possible while backtracking. If while backtracking, a vertex which leads to the starting vertex is popped from the stack then since it might lead to the starting vertex it is unblocked. Everytime, a vertex is unblocked, check the blockedSet since it might lead to more vertices being unblocked.

After the DFS ends and hence all the simple cycles from the starting vertex are found, the starting vertex is removed from the graph and another search from a different starting vertex is started. This is repeated till no more vertices are left. At this point, it is guaranteed that all simple cycles within the graph are found.

With a time complexity of O((c + 1)(n + e)) Johnson’s algorithm is a fast and easy to understand solution to find all simple cycles within a graph.

AI search techniques for Sliding 8-Tile puzzle

Andrew D — Thu, 09 Jun 2022 20:05:52 GMT

AI search techniques for Sliding 8-Tile puzzle

Introduction

The program takes as input the initial state of an 8-tile puzzle, then one of 4 possible search algorithms is chosen. If the algorithm chosen uses a heuristic function, this function can either be Manhattan distance or number of misplaced tiles. After the search is complete, the program outputs:

The Plan (list of actions in the right sequence).
The length of the plan.
The number of unique states generated.
The time it took to find the solution.
The board from initial state to goal state, after each action.
If the plan is a Valid or non-valid plan.

In the following section it will be shown how all of the above was managed to be achieved.

Implementation

Firstly, the empty tile is represented by a ‘0’. A board/state is represented by an array made up of 9 integers. The goal state is constant and is stored in a variable goalState = [1,2,3,4,5,6,7,8,0]. The initial state is obtained from the user by function getInput(). getInput() asks the user to input a string to represent the initial board face. Then the input is put into an array. Example “8 6 7 2 5 4 3 0 1” would be stored as [8,6,7,2,5,4,3,0,1].

From every state there are certain actions which are applicable. The 4 possible actions are: Left, Right, Up, Down. It would be helpful if there is a function which checks which actions are available from a given state. availableActions(puzzle) does exactly this. It checks where the empty tile is and the preconditions of every action. The function returns a 4 boolean array. It is important to note that [0] ->Left, [1] -> Right, [2] -> Up and [3] → Down. The function swap(actual, action) applies the action by swapping the empty tile with an adjacent one.

In automated planning a search algorithm starts from an initial node (root) and finds its way to a goal node by expanding branches from nodes. Hence the need to implement a “tree” type object is evident. In class StateTree() a node is defined as an object having:

a state (the 9 integer array)
A list of children
A list of actions. These actions, if applied from the initial state, will lead to this state.
A variable which keeps track of how many children the state has.

The node class has a function called addChild(self, state, action). This function appends an object of type node in the current node’s children list. However in every search function addNodes(node) is used to generate possible children from the current node. This function will get the available actions and apply them one by one to generate new states and add them to the children list of current node.

How the different search strategies were implemented.

Now that there is an understanding of how the nodes are represented, let’s look at how the search algorithms were built.

The 4 search algorithms implemented are:

Breadth First Search
Greedy Best First Search
A* Search
Enforced Hill Climbing

1. BFS is a simple blind search algorithm which expands every node layer by layer (breadth first) until the goal node is reached. Initially, the initial state is placed in the toVisit queue/list. To avoid ending up in loops or cycles UniqueStates will hold a list of unique visited states. A while loop is created, inside the loop the following procedure is repeated:

The first value inside the toVisit list is popped and this is the current node to expand.
Check if the node is a goal node. If is it end else:
Expand node by adding children using addNodes(node)
For every child node check if it leads to an undiscovered state if yes add to toVisit list and UniqueStates list.

This way all unique nodes are expanded until goal state is reached. The plan returned from this search will be optimal since it returns the shallowest goal node (hence shortest).

Before describing the other 3 algorithms, first the heuristic needs to be understood. The heuristic can be either:

Number of Misplaced Tiles
or, Manhattan Distance

misplacedTiles(puzzle) given a state function returns the number of tiles in the incorrect position. It does this by checking if the elements in the array are in the goalstate indexes.

Manhatten_Distance(puzzle) gives the heuristic value by working the mhd of all tiles to their correct positions and returning the sum. mhd() works out the individual values by getting the (x,y) coordinates of the tile and the tile’s correct location then works out the difference.

2. Greedy Best First Search starts from the initial node and selects the node which has the best heuristic value, this node is expanded and the process is repeated. The nodes are stored in a list called ‘find’ which is made up of dictionaries of type {‘node’: X, ‘priority’: h}. This way it is easy to compare different nodes heuristic value. A while loop is created, inside the loop the following procedure is repeated:

Find the index of the lowest (best) heuristic valued node in find. This will be the node to expand. Also remove it from the list.
Check if the node is a goal node. If is it end else:
Expand node by adding children using addNodes(node)
For every child node check if it leads to an undiscovered state if yes: work out the heuristic and add to find list and UniqueStates list.

3. A* starts from the initial node and similarly to GBFS, selects the node which has the best heuristic function but in A* this is made up from the sum of the heuristic and the cost. That is f(n) = h+c. c is the cost to be at the current node. Which is the steps it takes to reach the current node from root. This can be easily found by finding the length of it’s plan. The rest of the function is the same as before.

4. Enforced Hill Climbing starts from the initial node and expends nodes in a BFS style however when a node of better heuristic is found, the OpenList is discarded and only this node is added. In the implementation three different lists are managed: best, states, visited. best keeps track of the nodes to visit, states keep track of all unique nodes already seen and visited keep track of all the nodes already expanded. Variable H keeps track of the best heuristic seen so far. When expanding a node if none of the child nodes have a better heuristic then all of them are added to the best list. However if one of them has a better heuristic, then the best list is thrown away, H is updated and the child node is added to it.

Results and evaluation

Results for the following two initial states.

Time taken by Search

From these graphs one can note that the most efficient (in terms of speed) is Greedy Best First Search and the worst is Breadth First Search. This is what was expected since BFS expands all the possible unique combinations from the initial node irrespective of their heuristic (blind/uninformed search) until the goal state is reached. In contrast, Greedy Best First Search starts from the initial node and always expands the node having the best heuristic value till the goal node is reached. This way a solution is usually found quickly. It’s important to note also that A* took longer than GBFS and Enforced Hill Climb, this was also expected since A* will always converge to the optimal solution, which usually is “harder” to find than any solution.

Unique States explored by Search

Understanding states generated

Since the algorithms never get stuck in loops or cycles, there is a strong positive correlation between time taken to execute and states generated. Hence there is no need to go into detail about these results. (Same reasoning as when looking at the time taken graphs which was already discussed above)

Length of Plans

From the above graph, an important observation is that no other search technique has a shorter (better) plan than A*. So will BFS, these two search algorithms will find the optimal solution. (N.B: BFS is not in the graph since it took more than 15 minutes to run). Enforced Hill Climbing search on average returns the longest plan. This is not surprising due to the fact that it will try to expand the best heuristic valued child of a node (hence adding a step (an action to the step)).

Conclusion from the graphs

Important takeaways about the performance of each of the search strategies. The optimal solutions, that is, the algorithm which always returns the shortest path are BFS and A*. From these two strategies, it is clear that A* is the better option in terms of efficiency. Since, as seen from the graphs above, A* manages to find the plan in much less time and by expanding/exploring significantly less states. This is mainly due to the fact that A* is an informed search algorithm while BFS does not use a heuristic/evaluation function (hence a blind). However it can be the case that the problem is satisfiable given any solvable plan. Here Greedy Best First Search and Enforced Hill Climb Search have their advantages over BFS and A*. These two search strategies (GBFS and EHCS) are significantly faster to compute and will take less storage to work out since visiting less states. In situations where time is crucial these strategies may be opt for.

Another key part that makes a fast and efficient search algorithm is the heuristic used. From the results above, there is a clear distinction between using Manhattan distance as a heuristic and using Misplaced tiles as a heuristic. To solve the same problem, despite the search algorithm chosen, using Mhd as a heuristic was proven to be more efficient in terms of both time and states explored. Hence it can be concluded that Manhattan distance is a better representation to how close/(far away) the current state is to the goal state in a 8-tile puzzle. Another reason why Mhd is a better heuristic to misplaced tiles could be that since search algorithms try to find the best h valued node, the range of values offered by mhd is greater than that of misplaced tiles(0–8). Therefore, when using misplaced tiles as heuristic the algorithm faces many more occurrences where it has to choose between the same h values nodes.

Reinforcement Learning model for BlackJack

Andrew D — Thu, 09 Jun 2022 19:58:04 GMT

The implementation designs of every algorithm that was constructed will be discussed. However, before discussing the RL methods, firstly an understanding of the simulated BlackJack environment must be achieved.

Simulating a game of Blackjack

In this simplified version of Blackjack the game mechanics are the following:

At the beginning of the round, the deck is restored and shuffled.
The player gets two cards and the dealer gets dealt a card.
The player has to decide to HIT or STAND. (Here is where the model decides)
If the players card’s sum is more than 21, the player has lost and the game ends
Else, ask the player to HIT or STAND again.
When the player chooses to STAND, the dealer policy is executed.

Dealer policy:

1. HIT — if the sum of dealer cards is less than 17.
2. STAND otherwise (the sum is greater or equal to 17).
After the dealer plays, the game winner is announced. (Reward is given)

A more RL oriented look at the above game mechanics:

The following is valid for all the RL methods.

The state will be s = (PlayerSum, DealerCard, hasAce?). Therefore the State-Action pairs will be Q = (s, HIT/STAND). The player sum is between 12 and 21 while the dealer’s card is between 2 and 11 (11 being an Ace), and hasAce? is a boolean variable determining if the player has an Ace which can turn into a 1.

For exploration reasons it is important to know the number of times an action 𝑎 ∈ 𝒜 was selected in a state 𝑠 ∈ S. Hence, the counts of every state action pair were stored also.

The Reward given is the following:

If the outcome of the round is a win, the reward is +1,
if the outcome is a loss, the reward is -1,
otherwise in case of a draw the reward is 0.

Therefore the reward is given only after the game winner is identified. Hence after the final state is reached.

How are the State-Action Values and Counts stored?

The State-Action values and their respective counts are stored in a dictionary named “StateActionValues”. The keys of this dictionary are tuples in the form of: (s, a), where s is the state (for example, (12,3,True) ) and a is the action (HIT/STAND). This dictionary will return a list containing two elements. The 1st being the actual Q value expected by taking action a in state s and the 2nd element being the number of times action a was chosen in state s. Example: StateActionValues[((12,3,True),STAND)] = [-0.3853, 218]. -0.3853 being the expected return if Stand is picked and 218 being the number of times the model chose Stand from that state. By making use of a dictionary, the look up speed for every state action pair is very fast and hence the algorithms can adjust these values efficiently. It is important to note that this dictionary is initialised before every model is called by calling InitialiseQvalues().

InitialiseQvalues() sets every StateActionValues (that will be needed) Qvalue and counts to 0.

RL methods

In the next part of this section, how every model was created will be discussed.

The three RL methods that were constructed (each having 4 variations) are Monte Carlo, SARSA and Q-Learning (SARSAMAX). The first two are On-policy while the last is Off-policy. All of the models fall under Model Free Learning, hence they are suitable for learning to play BlackJack. This is because it is impossible to know the underlying Markov Decision Process (MDP) for a game of Blackjack because of the “randomness” of the deck (therefore transitions). Model Free learning models can learn without any knowledge about the underlying MDP.

Monte Carlo

MC methods learn from experience by looking at complete episodes. Therefore the state-action values are updated after the episode is finished (final state is reached). More specifically, it uses a 𝝐-Greedy policy to improve. After every episode the Q values (V(S)) and Ncounts (N(S)) used need to be updates following the following equations:

The 𝜖-Greedy Policy is the following

With probability 𝜖 choose a random action.
With probability 1 − 𝜖, choose the best action

In MC there is the need to keep track of the current episode, currentEpi will do just this. Hence every time an action is chosen from a state it is added to currentEpi by currentEPi.append((state,a)). Every time an action needs to be taken, MC(state,currentEpi,Emode) is called. This function will make a choice depending on the 𝝐-Greedy policy (stochastic).

Before the function returns the reward, the Q-values and counts are adjusted by calling updateQvalues(currentEpi,result).

Two versions of MC were built: playBlackjackMC() and playBlackjackMCExplore(). The only difference being that in playBlackjackMCExplore() the action taken from the first state in an episode is chosen randomly. This is done by a = random.randint(STAND, HIT) for only the first state-action in the episode. The rest of the episode uses 𝝐-Greedy policy with 𝝐 = 1/k.

SARSA (State–action–reward–state–action)

The sarsa model is similar to the MC one with the differences being the following.

SARSA is a temporal difference method hence it learns from every time step.
Perform the Q-values update in every timestep and the formula for the update is the following:

where alpha is the step size

For Sarsa function playBlackjackSarsa(episode,Emode) is called. Again Emode parameter will handle the configuration of the 𝝐-Greedy policy. The difference to MC being that it updates the values at every timestep and uses sarsaUpdateValues() to do so. Hence it only keeps track of the previous Q and the current Q. It is expected that SARSA learns faster than MC.

Q-learning (SARSAMAX)

It is an Off policy method. Hence it uses another policy to select the next action from a given state. The reference policy being a configuration of SARSA 𝝐-Greedy policy. The main difference here is that in Q-learning the best/greedy action is chosen. Q-Learning is biassed towards the highest values and hence exploration is limited. In the implementation, Q-learning mode first learns Sarsa’s policy then switches to Q-learning (calling function qLearning()) where it uses the Sarsa policy but always chooses the best possible action.

After implementing these models, proper testing (mostly input-output testing) was performed to ensure that they work as intended. Finally, various plots and tables were extracted from the results of the models. These will be used to analyse the RL models and compare them to each other. Moreover, the trade off between exploration and exploitation will be highlighted.

Results and observations

The table below displays the mean number of wins, draws and losses over the last 100,000 episodes for each policy obtained using each algorithm configuration.

Dealer Advantage plot

Discussion — The differences between each algorithm and algorithm configuration, and how exploration and exploitation had an effect on the performance of each algorithm.

This section will be dedicated to discussing the difference between each algorithm, the configurations of the algorithms (different epsilon values) as well as the performance contrast between the multiple set ups.

Algorithms

Starting off with Monte Carlo, this algorithm is the most basic one as it lacks the step size that the other algorithms include. Looking at the results of this concept, it is clear to see that the performance is the worst when it comes to winning when compared to the other algorithms. This low performance could be attributed to the fact that Monte Carlo does not update Q values until the end of an episode unlike the other two algorithms that update Q values after each state change. Hence learning is slower than TD learning models. The advantages of MC methods are that the estimates are unbiased due to actually finding the reward before updating the Q values. A particular MC variety which performed poorly compared to the rest is MC exploring starts. This makes sense since in this environment the only actions are HIT and STAND. Hence when choosing a random action at the start of the episode, there is a 50/50 odds that the method will choose either one, which can easily be a very bad option. Consider the case where the sum is 20 (cards being 10 and 10) and the method chooses HIT or the case where the sum is 13 and the method chooses STAND. The dealer has the biggest advantage when playing with this algorithm.

The Sarsa algorithm can be seen as an improvement on the previous model as not only does it include the step size feature, but also updates the state action values after each time step. In fact a slight enchantment in the winning percentages is observed. When compared to the other methods, Sarsa seemed to be quite an exploitative algorithm. This can be seen in its graphs in which the top state action pairs have a much larger frequency when compared to the other algorithms. One can speculate that this is due to the fact that Sarsa’s Q values are inflated due to not getting the real reward every time an update is made. Sarsa also had overall a good performance level. In fact, Sarsa with 𝜖 = 𝑒−𝑘/10000 managed to minimise the dealer advantage the most. (Dealer advantage being 0.13855). This is not a surprise, since this configuration manages to explore more state action pairs at the start and later on, as k grows, it exploits the policy. Resulting in a decent trade-off between exploration and exploitation.

The Q Learning models are off-policy models. Hence a Sarsa policy with 𝜖 greedy was used as a reference. However when Q learning is active it will always choose the best action according to the policy. Hence Q learning tends to exploit much more than explore. With this being said, it did not have the highest win percentages on average with the four configurations. However this might be due to the fact that the sample size that the algorithm had to go off was not huge, meaning that there could have been mistakes that were not ironed out with the low number of repetitions. It could be the case that it did not converge to the optimal policy. The performance was slightly behind that of Sarsa in this case.

Configurations

Beginning this discussion with the Monte Carlo simulation, it is quite clear to see that the exploring starts configuration had an extremely high amount of exploration with the randomised first choice, however this did not produce good results.

With regards to the four possible different configurations of the other two algorithms, there is a pattern that can be noticed by looking at the graphs and the numbers. Since they did not have a configuration of exploring starts it is easier to compare between them. In general, the two configurations that had the highest win percentage were the 𝜖 = 𝑒−𝑘/10000 and the 𝜖 = 0.1. This could be attributed to some factors. Regarding the 𝜖 = 0.1 it could be that the model has a high exploration and keeps on exploring at a consistent rate, no matter the episode number. Thus, it has a higher chance of finding new states and moves that might be better than the current ideal move. For instance, if there is a state that has a high percentage of winning if it is hit on, but in this small chance that it fails, the algorithm is much less likely to go down that path again and it would start to favour a state that might not be the correct one. The optimal policy is reached by exploring many state action pairs and converging to the correct expected Q values. The 𝜖 = 𝑒−𝑘/10000 configuration might have a good performance level as the more episodes go on, the less likely it is to explore and more likely it is to choose the ideal option and exploit. At the same time it takes more episodes exploring compared to 𝜖 = 𝑒−𝑘/1000 configuration which starts to exploit the policy quicker. 𝑒−𝑘/10000 might be the best explore to exploit trade off configuration.

From the graphs of each algorithm, it is easy to observe that the 𝜖 = 1/k configurations are much more likely to exploit as opposed to explore. This can be seen with the frequency counts of the most likely states, as they have a much higher count than their counterparts.

Looking at the graphs that show the state action pairs that were not explored could not really tell you much about the algorithms or configurations. The values are so low and there is little correlation to them. The only piece of information that might be correct is that when the first configuration was run (the configuration with the most exploring incentive) there were no empty states in each of the methods. However, if all the algorithms were run again, there could have been completely different results in this specific piece of data. Once again, this can be attributed to pure chance. However given a configuration that keeps exploring (like 𝜖 = 0.1) and many episodes to run, the chances are that all states will be entered and explored.

Finally, looking at the BlackJack Strategy sheets, which displays what one should play, according to the model, given the current state. It is quite evident that the models converged to a good enough policy after 500,000 episodes. This can be concluded by analysing how the model chooses to Hit when the sum is a small number and chooses to Stand when the sum is a larger number. However a few exceptions can be noted, more specifically the choice to stand with a sum of 12 when having an ace. This for a reasonable player seems to be the bad option and would rather Hit, however some of the policies converged to Stand. One reason for this might be due to the rareness of the state (for the sum to be 12 with an ace, the hand must be double aces) hence the policy hasn’t had enough experience with that state to be certain and converge to the ideal solution. This can be said for many other states. With that said the policies seem to Hit more when the user still has a playable ace, this is expected and correct (since the chance of losing is lower).

All in all, it is important to keep in mind that these results are not an exhaustive list or an end all be all. In fact almost every time the algorithms are run a slight variation of the graphs is observed. This is due to the very large sample state of a game of Blackjack and also the order of the episodes have an impact on the resulting policies (the wins and loses at the start of the training loop have a greater final effect). Moreover five hundred thousand episodes is not a lot in the grand scheme of things and all methods would benefit from an increased number of runs. Some of them might even converge to the optimal policy if allowed to run for a longer time.

-Learn Reinforcement learning through the same book I use. Found: https://amzn.to/3WqY1W5

-The country am from does not allow Medium Partner Program hence I do not make any money from Medium. Appreciate any tip amount if you found the report interesting :) -

AI Drones In Search And Rescue

Andrew D — Mon, 16 May 2022 08:13:13 GMT

Search and Rescue with AI Drones

Search and Rescue is often a very serious and dangerous line of work. When a natural disaster such as a tsunami, earthquake occurs, sending in first responders has its risks since more often than not they have to deal with the aftermaths of the disaster and could even be at risk of being put in danger themselves. Even in human caused disasters, such as bombings or shootings, there are always risks at hand, such as dealing with radiation of a nuclear explosion and so forth. The use of autonomous drones is helpful in this regard.[2] Piloted drones are in fact used nowadays, an example would be the deadly Alabama tornadoes where drones are being sent in to look for anyone who might be in danger, however these are piloted, not autonomous. Autonomous drones have the capability of covering more ground in faster time and it would be much more efficient if they could help the people on their own as this would allow the rescuers to handle the situation much more easily. This isn’t an easy task since Search and Rescue is often never the same and as such the drones would need to be programmed in a way that they can adapt to different environments and elements that the aforementioned disasters might introduce. Therefore this is why we should begin developing and thinking of ideas to better develop these drones in order to make the problems that are faced much more manageable.

How is the data needed for the solution to be gathered?

In order to obtain the data which we will use to program the autonomous drones the use of piloted drones will be used. Piloted drones are nowadays being used in the search and rescue field and as such they propose information regarding the different scenarios the drones might face. The different situations which were recorded by the piloted drones all had different solutions which the pilot performed with the drone. Therefore the mentioned solutions may be implemented into the autonomous drones’ software. As such when a drone enters a chaotic scene, for example after an earthquake occurs, they would be able to compare the current situation with previous recorded earthquakes in order to know exactly how to locate people who are injured and then notify the human supervisors in order to rescue them. Various companies and universities have research teams working on autonomous drones. For example, in order to solve looking for something in a chaotic scene a research team has created an artificial neural network system that can run on a computer on board the drone. [2] This neural network allows the drone to emulate some of the ways that human vision works. Therefore if these different research teams share their work it would be easier to make autonomous drones a reality. Another way that data can be gathered to solve this solution is through trial and error. In the beginning, these autonomous drones will not operate 100% efficiently. As such, pilots would be able to take control if the drones are experiencing any errors during an operation. These errors can then be worked after the operation, in order for them not to repeat. Companies such as Google and Amazon have created autonomous drones that are capable of delivering packages on their own and in the near future will most probably be used around the entirety of the US. These drones can be used as a backbone for Search and Rescue as they already have navigation implemented and as such the same software can be used for these drones to navigate to and from danger zones.

When trying to automate the search and rescue process there are a lot of challenges and boundaries that will cause inconveniences, and setbacks to this implementation. A search and rescue environment is a dynamic and challenging scenario that even poses risk to the highly trained individuals that are specialized for such missions. Hence introducing autonomous drones to aid in a specific aspect of these missions would not only be of aid to the first responders but also to the people at risk.

Our implementation is targeted at locating the people in need. The main challenges faced by our implementation are: the highly volatile, high risk, and dynamic environments that our drones will be operating in. In an effort to adapt and deal with these issues, our drones will be equipped with a multitude of gadgets that will allow our intelligent system to gather data about the environment and conform and operate in those situations. The main equipment on our drones is split into two categories, navigation, and survivor spotting. For navigation, LiDar sensors and satellite navigation will be implemented, along with the standard equipment already on the drones. For survivor spotting, IR-RGB cameras, thermal cameras, ultrasonic sensors, high powered LED spotlights, and an air contamination sensor. Along with these information gathering tools, a loudspeaker and microphone will also be equipped for communication.

Challenges such as fog, heavy downpour, low light conditions, smoke, and many others pose risk to the traditional methods of search and rescue, namely, helicopter spotters or inflatable boats. The equipment included on our drones was purposely chosen to be able to adapt to these situations and aid the rescuers in spotting and location the people in need faster and more efficiently.

The equipment on our drones was selected specifically to deal with the variability of the environments. The IR-RGB and thermal cameras will provide a view of the environment from multiple different spectrums allowing the A.I. system to analyze the situation it is in in depth. The ultrasonic sensor offers another layer of information when dealing in close range (approx. 5–10 meters). IR-RGD cameras have a photographic sensor that captures light between the 400–70,000nm range, outside the human visibility spectrum, which allows the A.I. to effectively see through smoke, rain, and other semi opaque mediums. For even worse visibility conditions thermal cameras can offer another spectrum, locating heat signatures and their intensity. Ultrasonic sensors will be mainly used for guidance and maneuvering obstacles. The loudspeaker and microphone would be used in situations of contact with survivors, the drone would ping the location of the survivors back to the rescuers and play a pre programmed message to reassure the people in need. In the more volatile cases, the loudspeaker and microphone could be accessed manually to communicate directly with the survivors.

While using drones might solve or optimize some issues like, the need for large amounts of food and water for a large crew of rescuers and spotters, or fatigue and tiredness coming for long periods looking for survivors. Drones introduce a new layer of challenges, while drones don’t need food, water, and rest, they do need to charge on a more frequent basis than rescuers need to. On location generators are already common in search and rescue situations, so charging is less of a problem and more of a minor inconvenience. Rescue vehicles with on board generators could travel as a charge hub alongside the drones to offer a quick charge portable station and also be a first responder in case survivors are found. Another limitation of drones for search and rescue is their range, small drones have a range maxing out at around 10km so exploring a larger area of land would require the controller station to also move as well. Workarounds for this limitation can be setting up the control center on a movable vehicle such as a customized truck, or including a signal relay card within each drone, allowing drones to exceed the range by ‘hopping’ their transmission off of one another. This approach will slow down transfer speeds significantly but would overall exponentially increase the range of the drones, as the number of drones used is increased. Since all the drones would need to transfer back to the control station is a location ping of where the survivors were found, having a weak connection line wouldn’t be that big of an issue when only transferring kilobytes or bytes of data.

Overall, the implementation of drones for search and rescue would be very beneficial for all the parties involved. Human overseers would still need to be implemented however, resources needed, fatigue, stress, and loss of life due to time restraints would all be decreased by the ease of deployment of drones.

A.I. techniques

For our implementation to function, different A.I. techniques need to be applied and work seamlessly together. The 2 main aspects where A.I. would need to be implemented is in the path planning, and in object detection. Our drones would need to self pilot themselves and be able to spot people, while also taking into consideration their shortcomings and the environmental challenges. The implementation would consist of 2 A.I. systems, one taking care of path plotting and piloting the drone, while the other gathers information from the onboard equipment and analyzes them in real time. If a match for a survivor is found, the A.I. in charge of spotting survivors would communicate that it has found a message to the pilot A.I., which would then ping the given longitude and latitude coordinates back to the control station.

The path plotting and piloting intelligent system would be managed using a Multi-objective path planning algorithm (MoPP). An MoPP algorithm is a specialized autonomous path planning intelligent system that as input takes the area of land that needs to be traversed. The system takes into account the communication range of the drones, the charge capacity, and the amount of drones being used. Using satellite navigation software like google maps, it takes predetermined routes (roads, footpaths, etc.) and alters them for the most efficient traversal of the entire area given the number of drones available, the communication ranges and the charge stations available. This algorithm was most highlighted in Autonomous robots issue 44 in June of 2020. Deep learning neural networks trained on human drone flight data would be used to supplement the MoPP algorithm for flight and maneuverability.

The survivor spotting portion would be handled using deep learning object detection and classification algorithms. Using a combination of publicly available, and custom sourced videos and imagery, our neural networks would be trained at detecting human heat signatures and forms . In addition using the RGB photos and video feed confirm their presence and differentiate them from nonhuman entities.

Actuation:

The search and rescue drones must be capable of performing a set of actions fully autonomous in any given environment. The drone must adapt to the ever changing environment it is sent into. An AI drone should handle the tasks of:

Optimal route planning.

Navigation through the environment.

Search thoroughly places of interest.

Identify living human beings.

Transmit the location of survivors and sensory data to the rescue team.

Firstly, the drone must be able to understand and differentiate between the search and rescue scenarios. Mostly being, outdoor vs indoor and large scale vs small scale operations. Also, the drone should be able to produce a level of urgency to the given situation. For all situations, the AI needs to handle partial observability and incorrect knowledge. A way for the drone to adapt to the different scenarios is by changing the resources used. One way of doing this is, if it is sent to an outdoor operation the drone makes use of gps and maps. While if it is sent into an indoor area the drone should rely more on sensors to determine the position in space it is in and where it can fly to. In both situations the drone makes use of complex search algorithms for path planning as mentioned in the Ai techniques section. Combined with gathering of big data using sensors and cameras and deep learning models will improve decision making and efficiency.

How will the drone interact with the environment? The AI drone does not need to change the environment to satisfy its goals. It is there to navigate and gather information about the environment. The AI drone makes use of computer vision and sensors which enables them to collect sensory data from the given environment. [1]The drones are able to detect objects and record important values such as oxygen level, radiation, visibility, temperature and much more. These values are reported to the rescue team in real time. More importantly, the drone searches the area to locate survivors. Images and locations are communicated back to the rescue team.

User interaction:

The end goal of the rescue drone is to locate survivors. However, there are still some minor goals that must be satisfied when reaching this end goal. These being, communicate an alert to the survivors and assess the conditions. Firstly, upon reaching the people, a distress message should be played. This message would be a simple audio file that is played through the drone’s speakers. Something in the nature of, “Stay calm and remain in this location. Help is on its way!”. This distress message is there to give hope to the survivors and try to convince them to stay in their exact positions. More ideal would be to equip the drone with basic human resources, such as, water and food. However, this is not done for our drone as speed and simplicity are prioritized in the drone’s build.

Secondly, the drone must be able to quickly and effectively assess the condition of the survivor. For this to be done with little to no mistakes, the Ai must contain a software that has been trained to make these assessments. This software should be trained using computer vision combined with reinforcement learning. Prior to the actual missions, the Ai must learn how to solve computer vision tasks to learn more about the conditions.[10] Using CV, the Ai can identify injuries, measure body temperature, oxygen level, radiation level and any sensory information that the Ai can use to determine living conditions (this would be the environment’s states). The Ai is given a large amount of sensory data and shouldn’t rely heavily on only one or two, as in some situations these might lead to inaccuracy. For example, if the camera malfunctions the Ai should not use image processing techniques to determine conditions. The action taken by the Ai is setting a priority level based on the environment states. Let’s say it can rank the priority from 1 to 5, 1 being little to none urgency and 5 being a life and death situation. Reinforcement learning is used to improve the decision making part of the Ai. This can be optimized by simulating a large number of scenarios and examining the results of the Ai. In a real life operation, all of this must be done quickly by the drone and communicated back to the rescue team. The rescue team acts in accordance to the priority level. Getting first to the ones of higher priority will lead to more lives saved.

Ethical Issues:

Ethics can be defined as choosing what’s morally right and wrong. Furthermore, ethics are the foundations of culture and human life. Hence, AI machines should follow ethical principles since they have an effect on society. When dealing with Search and Rescue autonomous drones it is critical that the AI performs ethically. There are a number of issues one must consider before implementing the AI system.

A common concern and maybe the most important issue is accountability. Given the infallible nature of search and rescue there is no space for the AI drone to fail when given a task. If it does fail, this will result in loss of precious time in life and death situations and may even result in loss of human life. This error might be caused by lack of experience and prior knowledge the drone had or due to the environment, for example a bird flies into the drone and causes damage. In fact, a system failure may be caused by a number of entities. The question in hand is who should be held accountable for the failure.

Andee

To deal with this issue one must focus on the more important element. Conserving human life. What can one do to limit system failure? Given the stochastic environment the AI drone will navigate in, it is almost impossible to simulate all the possible scenarios the drone will be facing. However, care must be taken to ensure that the drones will never stop reacting to the environment. This must be done by implementing an intelligent system which is capable of solving in real time issues. What if the system still fails to make the right decision? A response to this query in search and rescue is to have more than one drone search the same area. The more drones available, the less chance that a mistake is made. Needless to say, this approach comes with drawbacks, some being; cost will be higher since there is a need for multiple drones and efficiency would be reduced since the drones need to cover the same area.

[11]Other ethical issues revolve around use, privacy concerns, bias and practical concerns. These issues must be considered throughout the development of drones. Politics and human bias should have a limited effect in the development and implementation phases of the AI. All in all, these drones should improve society. This must be the goal when considering AI solutions for real life problems.

Conclusion:

To conclude, in this report we provided a short overview of an Ai solution to a real-world problem. That is the development and implementation of search and rescue AI drones. The challenge was to create a drone capable of navigating through the environment fully autonomous and locate and identify survivors with a high degree of precision. We briefly discussed the various Ai techniques that help solve these challenges. Many solutions for an Ai drone to help in search and rescue are already being developed in the real world. Ai technology will help facilitate and improve many tasks in society. Although all this, it is important to implement ethical machines, this was discussed in the ethics part of the report. AI drones would be beneficial to mankind since they would help reduce risks that we may face, since they would replace us in the most dangerous situations, and help us rescue people stuck in extremely risky environments with less time taken, and therefore the people that are rescued could be given the care that they require in shorter time frames. This will be achieved through the methods that were discussed in this document, and with various other solutions that may be obtained through the work of other universities, and other research teams that work for various different companies. As such this is an AI solution that cannot be solved alone, but through the help of a lot of different people.

Citations and references:

[1] Samira Hayat, “Multi-objective drone path planning for search and rescue with quality-of-service requirements”, Springer Link. Available:https://link.springer.com/article/10.1007/s10514-020-09926-9

[2] Vijayan Asari, “Autonomous drones can help search and rescue after disasters”, The conversation.

Available:https://theconversation.com/autonomous-drones-can-help-search-and-rescue-

after-disasters-109760

[3] Alex Brokaw, “Autonomous search-and-rescue drones outperform humans at navigating forest trails ”, The Verge.

Available:https://www.theverge.com/2016/2/11/10965414/autonomous-drones-deep-learning-navigation-mapping

[4] Oisin McGrath, “Are UAVs the future of search and rescue?”, Airmed&rescue.

Available:https://www.airmedandrescue.com/latest/long-read/are-uavs-future-search-and-rescue

[5] Fintan Corrigan, “How Do Drones Work And What Is Drone Technology”, Drone Zone. Available:https://www.dronezon.com/learn-about-drones-quadcopters/what-is-drone-technology-or-how-does-drone-technology-work/

[6] Jonathan Feist, “Autonomous drone vs self-flying drones, what’s the difference?”, Dronerush. Available: https://dronerush.com/autonomous-drone-vs-self-flying-drones-10653/

[7] National Oceanic and Atmospheric Administration, “What is lidar?”, Ocean Service.

Available: https://oceanservice.noaa.gov/facts/lidar.html#:~:text=Lidar%2C%20which%20stands%20for%20Light,variable%20distances)%20to%20the%20Earth.

[8] Bob Gross, “Ultrasonic Sensors to Detect Human Presence”, Max Botix. Available: https://www.maxbotix.com/ultrasonic-sensors-detecting-people-156.htm#:~:text=Human%20Presence%20Detection%20with%20Ultrasonic,excellent%20reading%20to%20reading%20stability.

[9] Tristan Greene, “A beginner’s guide to AI: Computer vision and image recognition”, TNW. Available: https://thenextweb.com/neural/2018/07/17/a-beginners-guide-to-ai-computer-vision-and-image-recognition/

[10] Alexander Bernstein and Evgeny Burnaev, “Reinforcement Learning in Computer Vision”, ResearchGate. Available: https://www.researchgate.net/publication/324558061_Reinforcement_learning_in_computer_vision

[11]Keng Siau, “Ethical and Moral issues with AI”, ResearchGate Available: https://www.researchgate.net/profile/Keng_Siau/publication/325934375_Ethical_and_Moral_Issues_with_AI/links/5b97316d92851c78c418f7e4/Ethical-and-Moral-Issues-with-AI.pdf

Face Detection Algorithm

Andrew D — Thu, 12 May 2022 08:38:23 GMT

How can computers recognize faces?

Understanding Face Detection with the Viola-Jones Object Detection Cascade

Viola-Jones Object detection is made up of 4 stages, which are:

Selecting Haar-like features
Creating an integral image
Running AdaBoost training
Creating classifier cascades

To understand how the procedures work one must know what every step is achieving.

To get the sum of the blue rectangle, just add the green values and subtract the red values in the integral image. 1+21–11–3 = 8.