Feature selection using Relief algorithms, with a Python example

An in-depth analysis of how Relief algorithms work, where they are used, and a Python example of how to use them.

Yash Dagli
6 min read · Jul 29, 2019

Table of Contents:

  1. Importance of Feature Selection
  2. Overview of the Relief Family of Algorithms
  3. Basic Relief Algorithm
  4. ReliefF Extension Algorithm
  5. Python Example
  6. References

1. Importance of Feature Selection:

In many machine learning problems there are hundreds or thousands of potential features describing each input object. The majority of learning methods do not behave well in these circumstances, for various reasons: the curse of dimensionality (overfitting), noisy data, long training times, and so on. It is therefore important to apply feature selection techniques before a model is trained.

Moreover, using feature selection techniques has many advantages:

  1. Reduced training time.
  2. A less complex model that is easier to interpret.
  3. Improved accuracy if the right subset is chosen.
  4. Reduced overfitting.

The Relief family of algorithms is primarily used as a feature subset selection method in the preprocessing step, before a model is learned.

2. Overview of the Relief Family of Algorithms:

There are three algorithms in the Relief family:

  1. Basic Relief algorithm: limited to classification problems with two classes.
  2. ReliefF: an extension of Relief that can deal with multiclass problems.
  3. RReliefF: an adaptation of ReliefF for continuous-class (regression) problems.

We will discuss Relief and ReliefF in this post, although the basic idea behind all three algorithms remains the same.

The core idea behind Relief algorithms is to estimate the quality of attributes based on how well their values distinguish between instances that are near each other.

Advantages of Relief Algorithms:

While many heuristic measures assume conditional (upon the target variable) independence of the attributes for estimating the quality of the attributes, Relief algorithms do not make this assumption. They are efficient, aware of the contextual information, and can correctly estimate the quality of attributes in problems with strong dependencies between attributes.

Relief algorithms have mostly been viewed as a feature subset selection method applied in a preprocessing step before the model is learned, and they are among the most successful preprocessing algorithms. They are in fact general feature estimators and have been used successfully in a variety of settings.

3. Basic Relief Algorithm

Algorithm:

Input: for each training instance, a vector of attribute values and the class value

Output: the vector W of estimations of the qualities of attributes

Pseudo code:

1. set all weights W[A] := 0.0;
2. for i := 1 to m do begin
3.     randomly select an instance Rᵢ;
4.     find nearest hit H and nearest miss M;
5.     for A := 1 to a do
6.         W[A] := W[A] - diff(A, Rᵢ, H)/m + diff(A, Rᵢ, M)/m;
7. end;

Suppose that I₁, I₂, …, Iₙ are examples in an instance space, where each example is a vector of attribute values Aᵢ, i = 1, …, a (a is the number of attributes), and each example has a target value tⱼ. The examples are therefore points in an a-dimensional space.

Working:

All the attribute weights are initially set to zero (line 1 of the pseudocode).

Then we select a random instance Rᵢ (line 3) and find its two nearest neighbors: one from the same class as Rᵢ, known as the nearest hit H, and one from the other class, known as the nearest miss M (line 4). All the attribute weights are then updated based on the values of Rᵢ, H and M (lines 5 and 6). This whole process is repeated m times, where m is a user-defined parameter.

Diff function:

For nominal attributes: diff(A, I₁, I₂) = 0 if the two instances have the same value of attribute A, and 1 otherwise.

For numerical attributes: diff(A, I₁, I₂) = |value(A, I₁) - value(A, I₂)| / (max(A) - min(A)), i.e. the absolute difference normalized by the range of the attribute, so that diff always lies in the interval [0, 1].

The weight update for the attributes (line 6) works on a simple idea: if Rᵢ and H have different values for an attribute (i.e. the diff value is large), the attribute separates two instances of the same class, which is not desirable, so we decrease its weight. On the other hand, if Rᵢ and M have different values for an attribute, the attribute separates two instances of different classes, which is desirable, so we increase its weight.
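To make this concrete, here is a minimal NumPy sketch of the pseudocode above (my own illustration, not code from the original paper). It assumes all attributes are numerical, so diff is the absolute difference normalized by the attribute's range, and it uses Manhattan distance to find the nearest hit and miss:

import numpy as np

def relief(X, y, m, seed=None):
    """Basic Relief: X is an (n_samples, n_attributes) array, y holds two class labels."""
    rng = np.random.default_rng(seed)
    n, a = X.shape
    span = X.max(axis=0) - X.min(axis=0)                 # max(A) - min(A), normalizes diff to [0, 1]

    W = np.zeros(a)                                      # line 1: all weights start at 0
    for _ in range(m):                                   # line 2: repeat m times
        i = rng.integers(n)                              # line 3: pick a random instance Rᵢ
        R = X[i]
        dist = np.abs(X - R).sum(axis=1).astype(float)   # Manhattan distance from Rᵢ to every instance
        dist[i] = np.inf                                 # never pick Rᵢ as its own neighbor
        same = (y == y[i])
        H = X[np.where(same)[0][np.argmin(dist[same])]]    # line 4: nearest hit
        M = X[np.where(~same)[0][np.argmin(dist[~same])]]  # line 4: nearest miss
        W += -np.abs(R - H) / span / m + np.abs(R - M) / span / m   # lines 5-6: update every attribute
    return W

On real data, m is often set to a few hundred, or the loop simply runs over every instance once.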

Example: (Manual)

Dataset for the example (values reconstructed from the calculations below):

Row | A | B | C | D
  1 | 9 | 2 | 2 | 0
  2 | 5 | 1 | 0 | 0
  3 | 9 | 3 | 2 | 1
  4 | 8 | 3 | 1 | 1
  5 | 6 | 0 | 0 | 1

Here, rows 1, 2, …, 5 are the instances, A, B and C are the attributes, and D is the target class (with two classes, 0 and 1). We will compute the weights of the attributes and then select the 2 best attributes, i.e. the ones with the highest weights.

Let m = 2 (i.e. we will perform 2 iterations).

Step 1: Set all attribute weights to 0: A = B = C = 0.

First Iteration:

Step 2: Suppose we select row 5 (i.e. 6, 0, 0) as our randomly selected instance.

Step 3: Find the nearest hit and miss. Using Manhattan distance, we find that row 4 is the nearest hit (after trying all rows of the same class): |6–8| + |0–3| + |0–1| = 6, and row 2 is the nearest miss (after trying all rows of the other class): |6–5| + |0–1| + |0–0| = 2.

Step 4 : Update all the weights of attributes.

A,B,C : current weight = 0

A = 0-((|6–8|/(9–5))/2) + ((|6–5|/(9–5))/2) = 0-(0.5/2)+(0.25/2) = -0.125

B = 0-((|0–3|/(3–0))/2)+((|0–1|/(3–0))/2) = 0-(1/2)+(1/6) = -0.33

C = 0-((|0–1|/(2–0))/2)+((|0–0|/(2–0))/2) = 0-(1/4) + 0 = -0.25

Second Iteration:

Step 2: Now suppose we select row 4 (i.e. 8, 3, 1) as our randomly selected instance.

Step 3 : Find the nearest miss and hit. Using Manhattan distance, we find that row 3 is the nearest hit : |8–9| + |3–3| + |1–2| = 2 and row 1 is the nearest miss : |8–9| + |3–2| + |1–2| = 3.

Step 4 : Update all the weights of attributes.

A, B, C: current weights = -0.125, -0.33, -0.25.

A = -0.125-((|8–9|/(9–5))/2) + ((|8–9|/(9–5))/2) = -0.125 (the hit and miss contributions cancel out)

B = -0.33-((|3–3|/(3–0))/2)+((|3–2|/(3–0))/2)=-0.33 -0+0.166=-0.167

C = -0.25-((|1–2|/(2–0))/2)+((|1–2|/(2–0))/2) = -0.25

Thus after 2 iterations we have A = -0.125, B = -0.167, C = -0.25.

Therefore we can select A and B as our 2 best features, as they have the highest (least negative) weights.
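The two iterations above can be checked with a short script (a sketch in which the data and the "randomly" chosen rows are hard-coded so the numbers match the manual calculation):

import numpy as np

# attribute values (A, B, C) of the rows used in the two iterations above
row1, row2 = np.array([9, 2, 2]), np.array([5, 1, 0])
row3, row4, row5 = np.array([9, 3, 2]), np.array([8, 3, 1]), np.array([6, 0, 0])

span = np.array([9 - 5, 3 - 0, 2 - 0])   # max - min of A, B, C over the whole dataset
m = 2
W = np.zeros(3)

# iteration 1: R = row 5, nearest hit = row 4, nearest miss = row 2
W += -np.abs(row5 - row4) / span / m + np.abs(row5 - row2) / span / m
# iteration 2: R = row 4, nearest hit = row 3, nearest miss = row 1
W += -np.abs(row4 - row3) / span / m + np.abs(row4 - row1) / span / m

print(dict(zip("ABC", W.round(4).tolist())))   # {'A': -0.125, 'B': -0.1667, 'C': -0.25}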

The original Relief can deal with nominal and numerical attributes. However, it cannot deal with incomplete data and is limited to two-class problems. Its extension, which solves these and other problems, is called ReliefF.

4. ReliefF Extension Algorithm:

Algorithm ReliefF:

Input: for each training instance a vector of attribute values and the class value

Output: the vector W of estimations of the qualities of attributes.

Pseudo code:

1. set all weights W[A] := 0.0;
2. for i := 1 to m do begin
3.     randomly select an instance Rᵢ;
4.     find k nearest hits Hⱼ;
5.     for each class C ≠ class(Rᵢ) do
6.         from class C find k nearest misses Mⱼ(C);
7.     for A := 1 to a do
8.         W[A] := W[A] - Σⱼ diff(A, Rᵢ, Hⱼ)/(m·k) + Σ_{C ≠ class(Rᵢ)} [P(C)/(1 - P(class(Rᵢ))) · Σⱼ diff(A, Rᵢ, Mⱼ(C))]/(m·k);
9. end;

k is a user-defined value. In most cases it can safely be set to 10.

Working:

Like Relief, ReliefF randomly selects an instance Rᵢ (line 3), but it then searches for the k nearest neighbors of the same class, called the nearest hits Hⱼ (line 4), and for the k nearest misses Mⱼ(C) from each of the other classes (lines 5 and 6). Then, like Relief, ReliefF updates the weight of each attribute, but the contributions of all hits and all misses are averaged (lines 7, 8 and 9).

The contribution of the misses from each class C is weighted with the prior probability of that class, P(C). Since the class of the hits is missing from the sum, each probability weight has to be divided by the factor 1 - P(class(Rᵢ)), so that the weights of all the miss contributions sum to 1.

The selection of k hits and misses is the basic difference from Relief, and it makes the algorithm more robust to noise.
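As an illustration of this update, here is a rough NumPy sketch of ReliefF (again my own illustration, assuming numerical attributes and complete data; the missing-value handling discussed below is omitted):

import numpy as np

def relieff(X, y, m, k=10, seed=None):
    """Sketch of ReliefF for numerical attributes and a multiclass target."""
    rng = np.random.default_rng(seed)
    n, a = X.shape
    span = X.max(axis=0) - X.min(axis=0)                      # normalizes diff to [0, 1]
    classes, counts = np.unique(y, return_counts=True)
    prior = {c: cnt / n for c, cnt in zip(classes, counts)}   # P(C) estimated from the data

    W = np.zeros(a)
    for _ in range(m):
        i = rng.integers(n)                                   # line 3: random instance Rᵢ
        R, c = X[i], y[i]
        dist = np.abs(X - R).sum(axis=1).astype(float)        # Manhattan distance
        dist[i] = np.inf                                      # never pick Rᵢ itself

        # line 4: k nearest hits (same class as Rᵢ), contribution averaged over m·k
        hit_idx = np.where(y == c)[0]
        hits = hit_idx[np.argsort(dist[hit_idx])[:k]]
        W -= (np.abs(X[hits] - R) / span).sum(axis=0) / (m * k)

        # lines 5-9: k nearest misses from each other class, weighted by P(C)/(1 - P(class(Rᵢ)))
        for other in classes:
            if other == c:
                continue
            miss_idx = np.where(y == other)[0]
            misses = miss_idx[np.argsort(dist[miss_idx])[:k]]
            w = prior[other] / (1.0 - prior[c])
            W += w * (np.abs(X[misses] - R) / span).sum(axis=0) / (m * k)
    return W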

To deal with incomplete data, the diff function is changed: missing attribute values are treated probabilistically, using the conditional probabilities of the attribute's values given the class.

A manual example of this algorithm would be tedious, but it can be understood from the Relief example above.

5. Python Example
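One way to run this in Python is with the ReliefF implementation from the skrebate package; the snippet below is a sketch of that usage on the toy two-class dataset from the manual example (the package choice and the parameter values are assumptions for illustration, and the original post's code may differ):

import numpy as np
from skrebate import ReliefF            # pip install skrebate

# Toy two-class dataset from the manual example (attributes A, B, C; target D)
X = np.array([[9, 2, 2],
              [5, 1, 0],
              [9, 3, 2],
              [8, 3, 1],
              [6, 0, 0]], dtype=float)
y = np.array([0, 0, 1, 1, 1])

# n_neighbors=1 mimics basic Relief (one nearest hit and one nearest miss per class)
fs = ReliefF(n_features_to_select=2, n_neighbors=1)
fs.fit(X, y)

print(fs.feature_importances_)          # estimated weight of each attribute A, B, C
print(fs.top_features_[:2])             # indices of the two highest-weighted attributes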

We can see that in the first example (the two-class problem), the same attributes, A and B, are selected by the Python code.

6. References:

Robnik-Šikonja, M. & Kononenko, I. (2003). Theoretical and Empirical Analysis of ReliefF and RReliefF. Machine Learning, 53, 23-69. https://doi.org/10.1023/A:1025667309714
