Stories by Pranav Joshi on Medium

Usual Graph Algorithms

Pranav Joshi — Sat, 18 Nov 2023 11:17:11 GMT

This article follows the book “Algorithms” by Jeff Erickson. I’m assuming you already know the basics, so I’ll skip straight to the theorems and algorithms.

Depth First Search

This is the simplest way of traversing a graph. The algorithm goes like this :

1) Iterate over the children of the current node and check whether one of them is unvisited (not unprocessed, but unvisited). If it is, put it on the stack (a sort of to-do list that follows the “Last In First Out” principle).

2) Check if the node on top of the stack is unvisited (it should be if it was added to the stack in the last step).

3a) If yes, make it the new current node and then “visit” it (maintain a “visited” array at all times and do visited[current_node] = 1).

3b) If no, that is, the current node has no unvisited children left, remove it from the stack and make the node below it the current node. (Optional : do visited[current_node] = 2 before removing it to show that this node is “dead”.)

4) Go back to step 1.

A simple example :

A represents “active” i.e. visited and currently on stack and D represents “done” i.e. visited and removed from stack

Now, that’s good and all, but this looks pretty useless. Why are we traversing the graph in the first place ? Usually, we do this because we want to process all the nodes. Maybe you are building a 2*2*2 cube solver and want to find a sequence of steps that takes you from an unsolved state (node) to a solved state and want to verify (process) whether the current state is solved or not.

To do this, the sensible thing to do would be to check if the state is solved before putting more nodes on the stack and processing them (before step 1 that is). This is called “pre-processing”. Similarly, processing a node after its children are processed (just before step 3b) is called “post-processing” , and the order in which we pre-process or post-process nodes is called pre-order and post-order.
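Here is a minimal sketch of the traversal with both orders recorded. The function name dfs_orders and the dictionary-of-lists graph format are my own assumptions for illustration, and I keep an iterator next to each node on the stack so that scanning the children can resume where it stopped.

```python
def dfs_orders(graph, start):
    """Iterative DFS; returns (pre_order, post_order) for nodes reachable from start."""
    visited = {start}
    pre_order, post_order = [start], []
    # Each stack entry pairs a node with an iterator over its children,
    # so the scan of a node's children resumes where it left off.
    stack = [(start, iter(graph.get(start, ())))]
    while stack:
        node, children = stack[-1]
        for child in children:
            if child not in visited:
                visited.add(child)
                pre_order.append(child)   # pre-processing would happen here
                stack.append((child, iter(graph.get(child, ()))))
                break
        else:
            # No unvisited child is left, so the node is "done".
            post_order.append(node)       # post-processing would happen here
            stack.pop()
    return pre_order, post_order


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
pre, post = dfs_orders(graph, "a")
print(pre)   # ['a', 'b', 'd', 'c']
print(post)  # ['d', 'b', 'c', 'a']
```

A node appears in the pre-order the moment it becomes “active” and in the post-order the moment it becomes “done”.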

Dependency Graphs and Topological Sort

In problems like finding the first n Fibonacci numbers, it makes sense to calculate in a “ground up” approach. This is because finding the n-th number is only possible if you know the (n-1)-th and (n-2)-th number. If we think of these numbers as being attributes of nodes in a graph of n nodes, then we can say that the node ’n’ depends on node ‘n-1’ and node ‘n-2’ . To represent this, we draw an edge from node ’n’ to ‘n-1’ and ‘n-2’ . Such a graph is called a dependency graph. (Take a breather and try some examples. I know you need it.) . If we start at the n-th node and do a DFS, the post-order we would get is the order in which we want to compute the values for the nodes.

This animation explains it better :

The post-order is in this case, the reverse topological order (ik, ik, … heavy words) . First let’s define what a topological order is :

An ordering of nodes such that for every edge X -> Y , the node X comes before the node Y. In particular, if X precedes Y in the ordering, we can never go from Y to X.

It’s clear that any graph that has a cycle can’t have such an ordering and neither can an un-directed graph. Thus, we have a topological order only for a Directed A-cyclic Graph (DAG) .

In fact, for any DAG , the post-order is the reverse of the topological order. If you don’t believe me, let me prove it to you :

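Define v.post as the index of node v in the post-order for any v. Take any edge u -> v and look at the moment when DFS checks this edge while u is active. If v is unvisited, then v is visited and finished inside DFS(u), so v.post < u.post. If v is already done, then again v.post < u.post. The only other option is that v is still active, which means u is reachable from v, and together with the edge u -> v that gives a cycle, which is impossible in a DAG. Thus v.post < u.post for every edge u -> v, so u comes before v in the reversed post-order, which is exactly what a topological order requires.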
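To make this concrete, here is a small sketch for the Fibonacci dependency graph described earlier. The names post_order, deps and fib are my own, and I use a recursive DFS for brevity. Computing the values in post-order works because every dependency finishes first, and reversing the post-order gives a topological order.

```python
def post_order(graph, start):
    """Recursive DFS; returns the nodes in the order in which they finish."""
    order, visited = [], set()

    def visit(node):
        visited.add(node)
        for child in graph[node]:
            if child not in visited:
                visit(child)
        order.append(node)

    visit(start)
    return order


n = 6
# Node k depends on nodes k-1 and k-2; nodes 0 and 1 are the base cases.
deps = {k: [k - 1, k - 2] if k >= 2 else [] for k in range(n + 1)}
order = post_order(deps, n)

fib = {}
for k in order:  # every dependency of k is already computed at this point
    fib[k] = k if k < 2 else fib[k - 1] + fib[k - 2]

topological = order[::-1]
print(order)        # [1, 0, 2, 3, 4, 5, 6]
print(fib[n])       # 8
print(topological)  # [6, 5, 4, 3, 2, 0, 1]
```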

Minimum Spanning Trees

This is a tree spanning the whole graph (if it’s connected) such that the total weight of the tree, that is, the sum of the weights of the edges that the tree contains, is minimum. The most popular algorithms to compute MSTs use some flavour of this property, which holds when all the edge weights are distinct :

Suppose that the weights of the edges in the graph are all distinct. Now suppose you are somehow given a sub-graph S of the main graph G that you know is also a sub-graph of the minimum spanning tree T. Then for all connected components C of S , the edge (if it exists) of minimum weight that is adjacent from node u in C to node v in G-C is called a safe edge. It can be proven that T contains all safe edges for any S. Thus, if you want to find T, just keep adding these safe edges to S until S becomes T and there are no more safe edges. The order in which you do this is where the algorithms diverge.

One popular algorithm is Jarník’s algorithm (also known as Prim’s algorithm), where S has a single connected component at all stages. Naively this algorithm runs in O(VE), since we compute a safe edge in O(E) and do so V times. If we maintain a priority queue that contains the edges leaving S, maybe as a binary min-heap, then every insertion or extraction costs O(log E) = O(log V), and since every edge enters and leaves the queue at most once, the whole algorithm runs in O(E log V). Actually, the best data structure for such a priority queue is a Fibonacci heap, which allows us to run the algorithm in O(E + V log V) time because we keep vertices in the queue in that case, and not edges.

Another simple, yet effective algorithm is Kruskal’s algorithm, where we just sort the edges first, then iterate over them and add an edge to S if it is safe. This runs in O(E log E) = O(E log V) as well, with the sorting being the dominant cost.
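A rough sketch of Kruskal’s algorithm is given below. The article doesn’t say how to test whether an edge is safe, so I am assuming a simple union-find structure for that; the function name kruskal and the (weight, u, v) edge format are also my own choices.

```python
def kruskal(num_nodes, edges):
    """edges: list of (weight, u, v) for an undirected graph on nodes 0..num_nodes-1."""
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent pointers to the representative of x's component.
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps the trees shallow
            x = parent[x]
        return x

    tree = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            # The edge joins two different components of S, so it is safe.
            parent[root_u] = root_v
            tree.append((weight, u, v))
    return tree


edges = [(1, 0, 1), (4, 0, 2), (2, 1, 2), (7, 1, 3), (3, 2, 3)]
mst = kruskal(4, edges)
print(mst)                       # [(1, 0, 1), (2, 1, 2), (3, 2, 3)]
print(sum(w for w, _, _ in mst))  # 6
```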

Breadth First Search

This algorithm uses a queue , which is once again, a To-Do-list of nodes, but follows the First In First Out principle (as the name suggests) . We start off with one node in the queue, which is marked ‘visited’ immediately. Then we repeat these steps until the queue is empty :

Process the current node (right at the start of the queue), if you want to.
For all children of the current node, check if they are un-visited one by one. If they are, put them (all the un-visited children) at the end of queue and process them (if you want to).
Remove the current node from the queue, thus making the next node the new “current node”

The last step isn’t done by literally popping the element from the array and renaming all the indices, but by maintaining a variable that keeps track of the index of the first element of queue in the actual array used to store the queue elements (or in case of C/C++ , simply incrementing the base address of the array)
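The steps above, including the trick of tracking the front of the queue with an index, can be sketched like this. The name bfs_order and the dictionary-of-lists graph format are assumptions on my part.

```python
def bfs_order(graph, start):
    """Returns the nodes reachable from start in the order BFS visits them."""
    queue = [start]      # the array that stores the queue elements
    visited = {start}
    head = 0             # index of the current first element of the queue
    while head < len(queue):
        node = queue[head]
        for child in graph.get(node, ()):
            if child not in visited:
                visited.add(child)
                queue.append(child)
        head += 1        # "remove" the current node without shifting the array
    return queue


graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
print(bfs_order(graph, "a"))  # ['a', 'b', 'c', 'd']
```

Since nothing is ever physically removed, the array ends up holding the full visiting order, which is handy for debugging.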

Shortest Path Trees

The concept is simple ; It’s a tree made of (not all) edges of the original Graph such that all the paths from the “source node”, which is the root of the tree to any particular node are of minimum length. Note that here, the original graph can be cyclic, un-directed or weighted. The only constraint is that the weights should not be negative. Before we go in, some notation :

‘s’ refers to the source node
v.dist is the length of the current best known path from s to v
w(u,v) is the weight of edge u->v
v.prev is the node that comes before v in the shortest known (computed) path from s to v.

Alright, now let’s start with an algorithm called Dijkstra’s algorithm :

1) Set v.dist to infinity for all v.
2) Set s.dist = 0 (obviously).
3) Put ALL the nodes in a queue.
4) Find the node u in the queue with minimum u.dist. This value can be proved to be correct for non-negative weights.
5) For every child ‘v’ of the current node ‘u’ , check if v.dist > u.dist + w(u,v).
6) If yes, assign u.dist + w(u,v) as the new value of v.dist and assign u as v.prev.
7) Remove u from the queue.
8) If the queue is empty, we are done, else jump to step 4 .

Now, why does this work ? We said that every time we find ‘u’ , its u.dist is the correct value. To prove this, let S be the set of all the nodes that you have removed from the queue, so that the queue contains all and only the nodes from G-S. Make an induction hypothesis that for any node w in S , w.dist is the correct value. For every such w , we have also updated the tentative u.dist of every child u of w that is in G-S , that is to say that every u in G-S connected to S directly has a tentative u.dist, equal to the length of the best path to u that stays inside S until its last edge. Now, call the node in the queue with minimum dist u_0 , and suppose there were a path from s to u_0 shorter than u_0.dist. This path starts in S and ends in G-S , so let u_1 be the first node on it that is in G-S. If u_1 were u_0 itself, the path would stay inside S until its last edge, and u_0.dist would already be at most its length. So u_1 is different from u_0 , and the path has this structure : s -> … -> u_1 -> … -> u_0 . Because the weights are non-negative, its length is at least the length of the part s -> … -> u_1 , which is at least u_1.dist , which in turn is at least u_0.dist since u_0 was the minimum. That contradicts the path being shorter, meaning that such a path isn’t possible and u_0.dist is correct.

As you can guess, this algorithm doesn’t have a complexity of O(V+E) , but we can actually find the Shortest Path Tree in O(V+E) in special circumstances, namely if all the weights are the same or if the graph is a DAG. If all the weights are the same, the algorithm to use is just simple BFS ! Basically, you go “level” by level. First you compute v.dist for s , then for all children of s (call that layer 1) , then for all the unvisited children of the nodes in layer 1 (call that layer 2) , and so on. We are guaranteed to never have to recompute v.dist for a v in a layer that we have already explored, because v.dist is just the level that v is in, times the common weight (immediately and forever). If the graph is a DAG with different weights, BFS layers are not enough. Instead, process the nodes in topological order and update v.dist for every child v of the current node exactly as Dijkstra’s algorithm does. By the time we reach a node, every node with an edge into it has already been processed, so its dist is final.

Ok, but like, what IS the complexity ? Well it’s O(V²) . Every iteration we remove one node from the queue and every iteration we have to do at most V checks to find the node u with minimum u.dist. Thus, the full complexity is O(V²) . Now this can be improved to O((V + E) logV) using a priority queue. Basically you keep the queue organised as a heap, so that the node with minimum dist is always at the front. Removing that node takes a time of logV, and we do this once for each of the V nodes. Also, every time you update v.dist for any node v that is in the queue, you have to move the node to its new place in the heap, which again needs a time of logV (using the usual “sift-up” step for heaps). This happens for every edge once at worst (remember, you are updating all u.dist and THEN finding the minimum before adding that minimum node to S).
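Below is a sketch of the priority queue version using Python’s heapq module. Note one deviation from the description above: heapq has no operation for moving an entry inside the heap, so instead of updating a node in place I push a fresh entry and skip the stale ones when they come out. This keeps the same O((V + E) logV) bound. The function name and the graph format are my own assumptions.

```python
import heapq


def dijkstra(graph, source):
    """graph: {u: [(v, weight), ...]} with non-negative weights."""
    dist = {source: 0}
    prev = {source: None}
    heap = [(0, source)]
    done = set()            # the set S of nodes whose dist is final
    while heap:
        d, u = heapq.heappop(heap)
        if u in done:
            continue        # stale entry left over from an earlier update
        done.add(u)
        for v, w in graph.get(u, ()):
            if v not in dist or d + w < dist[v]:
                dist[v] = d + w
                prev[v] = u
                heapq.heappush(heap, (dist[v], v))
    return dist, prev


graph = {
    "s": [("a", 4), ("b", 1)],
    "b": [("a", 2), ("c", 5)],
    "a": [("c", 1)],
    "c": [],
}
dist, prev = dijkstra(graph, "s")
print(dist)  # {'s': 0, 'a': 3, 'b': 1, 'c': 4}
print(prev)  # {'s': None, 'a': 'b', 'b': 's', 'c': 'a'}
```

Following prev backwards from any node gives its path in the shortest path tree.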

Another algorithm is the Bellman-Ford algorithm, which goes as follows :

1) Set v.dist = inf for all v
2) Set s.dist = 0
3) Set i = 0
4) For all edges (u,v) , if u.dist + w(u,v) < v.dist , set v.dist = u.dist + w(u,v)
5) i++
6) If i = V-1, stop
7) Jump to step 4
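The loop above can be sketched in a few lines. This is my own version and not the book’s implementation. The early exit when a pass changes nothing and the extra pass that checks for negative cycles are additions of mine, and the (u, v, weight) edge format is an assumption.

```python
def bellman_ford(num_nodes, edges, source):
    """edges: list of (u, v, weight); nodes are numbered 0..num_nodes-1."""
    inf = float("inf")
    dist = [inf] * num_nodes
    dist[source] = 0
    for _ in range(num_nodes - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            break  # nothing can be improved any more
    # If some edge can still be improved, there is a negative cycle.
    has_negative_cycle = any(dist[u] + w < dist[v] for u, v, w in edges)
    return dist, has_negative_cycle


edges = [(0, 1, 4), (0, 2, 5), (2, 1, -3), (1, 3, 2)]
print(bellman_ford(4, edges, 0))                       # ([0, 2, 5, 4], False)
print(bellman_ford(2, [(0, 1, 1), (1, 0, -2)], 0)[1])  # True
```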

Every time we run steps 4 and 5, we are guaranteed that v.dist ≤ dist(v,i) , which is the length of “the” (not current) shortest path from s to v consisting of at most ‘i’ edges. When i=V-1, dist(v,i) is the length of the actual shortest path (assuming no negative cycles), since a shortest path never needs to visit a node twice. So after V-1 iterations, v.dist will also be this value, that is to say the tree can’t be improved anymore. So, if after V-1 iterations, the tree CAN somehow be improved, we can conclude that the graph has negative cycles. The book implements the algorithm like this :

This algorithm has the complexity O(VE). The same algorithm (with some modifications) can be implemented using a queue, giving us what’s called Moore’s improvement. The idea is that we just run BFS many-many times and push the children v of current node u for which we have corrected edges (u,v) in the queue again. When the queue doesn’t contain any element, or if we have gone through V “phases”, we stop the algorithm. The book explains the concept of phases nicely :

What the book means by RELAX(u->v) is “update v.dist to u.dist + w(u,v)” , and when it says that “u->v” is tense, it means that this correction is possible, that is, u.dist + w(u,v) < v.dist.

In every phase, this new algorithm avoids checking the edges that are trivially not possible to be updated in that phase.

All pair shortest paths

In the last problem, we used v.dist to mean the minimum distance from s to v. Here, we use dist(u,v) to mean the minimum distance from any node u to any node v. And we use dist(u,v,l) to mean the minimum length of a path from u to v with at most ‘l’ edges. The solution can be written with dynamic programming as :

We can remove the last dimension of ‘l’ by updating dist(u,v) (which represents the current best known value) like this :

and initially, setting dist(u,v) to infinity when u != v and 0 when u==v .
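As a sketch under my own assumptions, one standard recurrence that fits these definitions is dist(u,v,l) = min( dist(u,v,l-1) , min over edges x->v of dist(u,x,l-1) + w(x,v) ), with dist(u,v,0) being 0 when u == v and infinity otherwise. Dropping the ‘l’ dimension means updating a single table in place for V-1 rounds, which is what the code below does. The function name and edge format are mine, and the graph is assumed to have no negative cycles.

```python
def all_pairs_shortest_paths(num_nodes, edges):
    """edges: list of (x, v, weight); returns a table with dist[u][v]."""
    inf = float("inf")
    dist = [[0 if u == v else inf for v in range(num_nodes)]
            for u in range(num_nodes)]
    for _ in range(num_nodes - 1):          # a shortest path has at most V-1 edges
        for u in range(num_nodes):
            for x, v, w in edges:
                # Try to reach v from u by going to x first and then using x->v.
                if dist[u][x] + w < dist[u][v]:
                    dist[u][v] = dist[u][x] + w
    return dist


edges = [(0, 1, 3), (1, 2, 1), (0, 2, 5), (2, 0, 2)]
for row in all_pairs_shortest_paths(3, edges):
    print(row)
# [0, 3, 4]
# [3, 0, 1]
# [2, 5, 0]
```

This takes O(V²E) time, the same as running the Bellman-Ford loop from every node.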

Strong components

A strong component of a graph G is a maximal sub-graph S such that for any two nodes u,v in S, there exists a path from u to v and from v to u in S. (Maximal means that no more nodes can be added to S without breaking this property.)

A strong component graph scg(G) of G is the graph obtained by shrinking every strong component S of G into a single node s with edge (s_1,s_2) representing any edge from u to v with u in S_1 and v in S_2. For example, in this picture, the graph on the right is the strong component graph of the graph on the left.

Computing scg(G) can be done in O(V+E) using this property :

For any depth first tree T (a tree formed on doing depth first traversal) of G, each strong component S of G contains exactly one node u that does not have a parent in S for the graph T.

This is because if we enter a strong component S via ‘u’ , we explore all of S during DFS(u). That is to say, that if reach(u) denotes all the nodes that are explored during DFS(u), then reach(u) contains all the nodes in S. Now, there are two ways we could enter S via a different node ‘v’ after we have entered S through ‘u’. Either we re-enter S during DFS(u) or we enter after DFS(u) has been executed. The first case is not possible, since the parent of ‘v’ in T, which lies outside S, would be reachable from S while S is reachable from it, and so it would also have to be in S. And the second case is impossible because after DFS(u) has been executed, every node in S has been visited.

There’s another property that we need, for which I’ll have to define some more terms:

The low-link value of a node u, written as u.low represents the minimum value of v.pre for all v that we try to visit (check if it’s visited) during DFS(u) . It’s easy to see that u.low ≤ u.pre . (Note : if you’ve forgotten what u.pre was, it’s the ‘starting’ time for u, that is, the number of nodes visited before u is pre-processed). Computing u.low is straightforward using the fact that u.low is the minimum of all v.low for all children v of u in the DFS tree T and v.pre for all children v of u in the original graph G and finally, u.pre. So during the post-processing of u during DFS, we can compute u.low. It’s also easy to see that all the nodes in any strong component have the same low-link value.

A sink component is a strongly connected component S such that no other strong component is reachable from S. Similarly, a source component is a strong component, which isn’t reachable by any other strong component.

Now, onto the property :

A node u is the root of a sink component if and only if u.low = u.pre and for no other node v in reach(u), you have v.low = v.pre .

The algorithm that we’ll use is called Tarjan’s algorithm and it goes as follows :

During DFS of the graph, update u.low every time you return to u.
If u.low = u.pre , mark reach(u) as a strong component and “remove” reach(u) from G for the remaining DFS. (This ensures that at any point we run into u.low = u.pre, it’s impossible that v.low = v.pre for some other v in reach(u) for the current graph G, since we would have removed reach(v) if it happened, and thus u is the root of a sink component in G at that stage.)

When we say “remove” reach(u) from G, we actually do this implicitly by marking every node in reach(u) as ‘non-existent’. For non-existent nodes, we don’t update low-link value upon visiting their parents. Now, here’s a big issue : how do you compute reach(u) after you have finished DFS(u) (since that is when we check for u.low = u.pre) ? You can’t just make nodes in reach(u) non-existent on-the-fly when you are doing DFS. What you need to do is to maintain a list of the nodes in reach(u) for every u that you haven’t made non-existent. This can be done by maintaining a 2nd stack (the first stack is what we are using for the actual DFS) where you don’t remove nodes from stack when they are explored, but rather when you encounter u.low = u.pre . You do that by marking the last node as non-existent and then popping it off.
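Here is a compact sketch of the whole procedure. I use recursion for the DFS itself, so the only explicit stack is the 2nd one described above, and an on_stack set plays the role of the “existent” marker. All names are my own and this is not taken from the book.

```python
def strong_components(graph):
    """graph: {u: [v, ...]}; returns the strong components as lists of nodes."""
    pre, low = {}, {}
    stack, on_stack = [], set()   # the 2nd stack and the "existent" marker
    components = []

    def dfs(u):
        pre[u] = low[u] = len(pre)   # number of nodes visited before u
        stack.append(u)
        on_stack.add(u)
        for v in graph.get(u, ()):
            if v not in pre:
                dfs(v)
                low[u] = min(low[u], low[v])
            elif v in on_stack:      # ignore nodes that are already "removed"
                low[u] = min(low[u], pre[v])
        if low[u] == pre[u]:
            # u is the root of a sink component of what is left of the graph.
            component = []
            while True:
                v = stack.pop()
                on_stack.discard(v)
                component.append(v)
                if v == u:
                    break
            components.append(component)

    for u in graph:
        if u not in pre:
            dfs(u)
    return components


graph = {1: [2], 2: [3], 3: [1, 4], 4: [5], 5: [4]}
print(strong_components(graph))  # [[5, 4], [3, 2, 1]]
```

Notice that the sink component {4, 5} is reported first, as the property above predicts. For big graphs the recursion can hit Python’s recursion limit, in which case the DFS needs an explicit stack as well.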

End

I wouldn’t be surprised if you end up being more confused after reading this than before. If you are a visual learner (which helps immensely in a subject so “graph”ical like this one), then take a look at this playlist :

https://medium.com/media/ed9aa866ca4c56dbf5f4fb40d4372769/href

It contains more than enough information to understand the algorithms.

Rubik’s cube and Group Theory

Pranav Joshi — Sat, 11 Nov 2023 15:52:31 GMT

The standard Rubik’s cube is a very popular puzzle that is solved when all faces of the cube are of the same color. There is a big community of cubers, mathematicians and computer scientists fascinated by this puzzle and its several variants.

Disclaimer

As the title suggests, this article is directed towards people who know Group theory (mainly commutators and conjugates) well enough and wish to apply it to the standard cube. If you are a cuber or a programmer who is reading this, you might have a little difficulty, but fear not ; go through my full group theory notes (which is also the source of this article) and you’ll understand more than enough :

Group Theory

Introduction to the puzzle

Since you are reading this article, I assume you at least know how to solve the standard cube. In case you don’t have a cube at hand, this simulation can come in handy for the rest of this article :

alg.cubing.net

Throughout this article, I’ll be using “r” and “R^(-1)” for what they call “ R’ ” .

Representing a cube on paper

To represent a certain state of the cube, keeping the faces fixed, you need to describe :

The position of the corner cubies , given by the permutations ‘v’ of numbers from 1 to 8 (indices really).
The position of the edge cubies, given by the permutations ‘w’ of numbers from 1 to 12 .
The orientation of the permuted corner cubies, given by the array ‘r’ of size 8 made of (not necessarily distinct) numbers from 0 to 2 (both inclusive)
The orientation of the permuted edge cubies, given by the array ‘s’ of size 12 made of (not necessarily distinct) numbers which are either 0 or 1.

The way that the orientations of corners are determined is that you label the faces of each corner cubie in the solved state as 0,1,2 in such a way that when you curl your right hand fingers in the direction “0 to 1 to 2” , your thumb points away from the cube and the face labeled 0 must be facing Front or Back. Then, once the corners are permuted, you note down the number on the face that is facing the Front or Back direction. Then, due to the chirality (handedness) of the corner cubies, given only this number we can figure out the full orientation of that cubie (given its position, of course) , using the right hand rule.

Similarly, for the edges, you label the faces of edge cubies in the solved state as 0 and 1 , with 0 facing Front, Back, Top, or Bottom in that particular priority order. Then again, you note down the number of the face of the permuted cubie, that is facing Front, Back, Top, or Bottom.

Thus, the 4-tuple (v,r,w,s) gives a full description of the cube’s state. So, you might be tempted to say that the Rubik’s cube can be modelled by the group S_8 *(Z_3)⁸ * S_12 * (Z_2)¹² . But in fact, most of the tuples that are in this group are just not possible, because they are “illegal” , that is, unattainable by the 12 basic operations you can do on the cube. To get the actual structure of the group modelling the cube, we must find the constraints that any legal 4-tuple must satisfy. But before that, let’s play around with the cube a bit.

Firstly, notice that every state of the Rubik’s cube can also be represented by the sequence of steps that lead to this state. So, there is a notion of “combining” two states given by x_1 =(v_1,r_1,w_1,s_1) and x_2 =(v_2,r_2,w_2,s_2) , by considering the state that we would get by doing the sequences that lead to x_1 and x_2 one after another. In keeping with the way we deal with permutations and the like, I will write this new state, or sequence, however you see fit, as x_3 = x_2 x_1 where we do the sequence for x_1 first, and then the sequence for x_2 . For example, for x_1 = F and x_2 = U , the sequence (or state) x_3 = x_2 x_1 is UF , which usually is written as FU in every tutorial on solving the cube. But we’ll use UF and read right to left. For any arbitrary x_1,x_2, we can do this combination as :

Here P(v_1) and Q(w_2) are matrices that are applied on vectors r_2 and s_2 , and the operation ‘+’ is done modulo 3 for corners and modulo 2 for edges.

Since it doesn’t look easy to deal with orientations, we should probably deal with permutations first.

Permutations of corners

To keep things tidy, whenever we’ll use one of the 12 basic moves in the few following passages, we’ll be referring to only the ‘v’ part of the move that is in fact fully expressed by the 4-tuple (v,r,w,s) .

Say you had a sequence of moves ‘x’ (that is now a permutation, and not a 4-tuple) that changes only one corner cubie ‘a’ on the top layer, then the commutator of ‘x’ and ‘U’ is a 3-cycle since supp(x) and supp(U) have only that cubie in common. But first, we must find such an x . One good strategy to find it is to consider a move that does not interfere with the top layer at all, namely D, and conjugate it with a move like ‘r’ that moves only one cubie from the support of D to the top layer, and precisely to where ‘a’ is sitting. This conjugate will basically involve ‘a’ in the permutation, while not involving any other cubie from the top layer. Thus x = RDr moves only one cubie from the top layer. So, the commutator

is a three cycle. Now, once you have a three cycle, you can get more three cycles by looking at the conjugates of this. For example, conjugating with L gives us the 3-cycle

that permutes 3 corner cubies in the top layer among themselves. Try it out on the 2nd simulator that I linked ! Just remember, that we are reading from right to left and the whole world reads from left to right. So what you would have to input is :

LR'DRUR'D'RU'L'

Notice that you can move one of these corner cubies that is adjacent to a corner that we don’t touch, in any (corner) position on the cube without moving the other 2 cubies that are moved by our 3-cycle. What this fact means is that ANY 3-cycle of corner cubies is legally possible.

Moreover, since all even permutations (of all 8 corners) can be written as products of 3-cycles, we can in fact invert all such states of the cube which are formed by even permutations, since the inverse, being an even permutation too, can be constructed from three cycles.

What about the odd permutations ? We can just apply one of the 12 basic moves first on the given state, which will make the new state an even permutation, since any basic move is a 4 cycle, which is odd. But this is ignoring the effects of the basic move on the edges, which will become important later.

Permutations of edges

Now, just like we did for the corners, for edges, we’ll only consider the ‘w’ part of any sequence. Once again, doing the thing with conjugates, and commutators, we can prove that every 3-cycle of edges is possible legally. To do that you first need one such cycle. Since F and R have only one edge piece common in their supports, their commutator might work, but then you would be messing around with the corner pieces too. Once again, to simplify this, you can take a conjugate of D that has exactly one edge cubie from the top layer in the support, and no corner cubies from the top layer, if you view it as a permutation of corners. Then take a commutator of this conjugate with U . Particularly, we are talking about (U . f_M(D)) where M is the move that rotates the middle slice by 90 degrees. (Yes, yes, we could represent this as LR and then, a change of orientation of the cube, but we’ll use M, since it makes thinking easier. Although M is actually not a basic move, but we’ll allow this since we know what M means. In essence, by f_M(D), we mean lrFRL )

Then, you can just take its conjugates.

And thus, again, every even permutation of edges is invertible. Moreover, since every basic move is a 4 cycle in edges too, for the odd permutations, we can just do one basic move, and then invert the resulting state using 3-cycles. Once again, we are ignoring the effects of this basic move on the corners.

Characteristics of a legal permutation

Since every legal state is derived from the solved state by a sequence of basic moves, we should look at what is invariant under these basic moves. We already know that a basic move is a 4-cycle in both the edges and the corners. Initially, the parity of both v and w is even, since they are the identity permutations. Now, on every move, the parity of v and w , both change simultaneously. What this means is that either both v and w are even, or both are odd. Basically, they have the same parity.

Thus, states with different parities of v and w are illegal.

When is a permutation legally invertible

So now that we know that v and w will have the same parity, if both v and w are even, we can invert them by 3-cycles, and if both are odd, we can do a basic move, making them both even, and then again, invert them using 3-cycles.

This means that a permutation of corners and edges (not considering their orientations), described by (v,w) , is legal if and only if v and w have the same parity.

Orientation of edges and corners

A sequence that is the identity permutation might not be the “blank”, or “do nothing” sequence, because it might be changing the orientations of the cubies. Thus, after some deep thinking, one can come up with these useful sequences :

The first sequence is RF’R’FRF’R’F (read left to right).

Now, by the symmetry of the cube, you can change the orientations of any 2 adjacent corners and any 2 adjacent edges. Moreover, by composing 2 such moves, you can in fact switch the orientation of any 2 corners (adjacent or not) and any 2 edges.

Notice that what these two moves are doing algebraically is adding 1 to one coordinate and subtracting 1 from another (modulo 3 for corners and modulo 2 for edges) among the coordinates of ‘r’ and ‘s’ that correspond to the altered cubies. Say for example, you have r = (1,0,1,1,1,1,2,2) for a state of the cube in which you have the correct permutations of cubies, and just unsolved orientations. What you can do is switch the corners that correspond to the first 2 coordinates in r , using the move that we have figured out, in order to transform r into (0,1,1,1,1,1,2,2) . And then do the same thing for the 2nd and 3rd coordinate to get (0,0,2,1,1,1,2,2), and then for the 3rd and 4th to get (0,0,0,0,1,1,2,2), and so on. You can do a similar thing for s too, just modulo 2 instead of 3.

Notice that under these moves, the sums of coordinates of r and s (modulo 3 and 2 respectively), namely R = r_1 + r_2 + … + r_8 (mod 3) and S = s_1 + s_2 + … + s_12 (mod 2),

are invariant, since we add 1 and subtract 1 from the sums in every move. So, what this means, is that once we make the 2nd last coordinate in r to be 0, the last element will be R . That is, r can be transformed to (0,0,0,0,0,0,0,R). Similarly, s can be transformed to (0,0,…,0,S) .
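The “push everything to the last coordinate” idea can be simulated on the array alone, without touching an actual cube. In this sketch (the function name is hypothetical), zeroing coordinate i stands for applying the pair move to cubies i and i+1 as many times as needed, so whatever we take away from one coordinate is added to the next.

```python
def push_orientations(orientation, modulus=3):
    """Zero every coordinate except the last by shifting values to the right."""
    r = list(orientation)
    for i in range(len(r) - 1):
        shift = r[i]
        r[i] = 0                               # remove `shift` from coordinate i
        r[i + 1] = (r[i + 1] + shift) % modulus  # and add it to coordinate i+1
    return r


r = (1, 0, 1, 1, 1, 1, 2, 2)
print(push_orientations(r))              # [0, 0, 0, 0, 0, 0, 0, 0]
print(push_orientations((1, 0, 0, 0, 0, 0, 0, 0)))  # [0, 0, 0, 0, 0, 0, 0, 1]
print(push_orientations((1, 1, 0, 1), modulus=2))   # [0, 0, 0, 1]
```

In each case the last coordinate ends up equal to the sum of the original array modulo 3 (or 2), which is exactly the invariant.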

Now, for any legal state of the cube, R=0 and S = 0 because initially, in the solved state, this is true, and the sums are in fact invariant under each of the 12 basic moves (Don’t take my word for it, test it out please) .

So, what this means is that for any legal state where the cubies are permuted correctly, we are able to transform r to (0,0,0,0,0,0,0,0) and s to (0,0,0,0,0,0,0,0,0,0,0,0) .

And in fact, since ANY sequence doesn’t change the sums, for any legal state, even if the cubies aren’t permuted correctly, if we can permute them correctly first, then we would only have to correct the orientations of the cubies with the same sums as the original configuration. The condition to be able to permute the cubies correctly is that v,w must have the same parity. So, if this condition is satisfied, and, if R = 0 and S = 0 in the original unsolved state, then since this is also the case in the new unsolved state with correct permutation of the cubies, we can correct their orientation.

When is the cube solvable

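In the last para of the last section, we showed that IF the cube is in a state given by the 4-tuple (v,r,w,s) such that :

1) v and w have the same parity,
2) R = 0 , that is, the coordinates of r sum to 0 modulo 3,
3) S = 0 , that is, the coordinates of s sum to 0 modulo 2,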

then the cube can be solved by first correcting the permutation of the cubies, and then correcting the resulting orientations of the cubies.

We have also shown along the way that each of these 3 conditions is satisfied by any legal state of the cube. Thus, the cube is solvable ONLY IF each of these 3 hold.

Thus, these three are the necessary and sufficient conditions for a state of the cube to be solvable.
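If you want to check a 4-tuple by computer, the three conditions (equal parity of v and w, corner orientations summing to 0 modulo 3, edge orientations summing to 0 modulo 2) fit in a few lines. This is only a sketch : the function names are mine, and I am assuming the permutations are stored as 0-indexed lists where perm[i] is the position that cubie i is sent to.

```python
def parity(perm):
    """0 for an even permutation of 0..n-1, 1 for an odd one."""
    seen = [False] * len(perm)
    odd = 0
    for start in range(len(perm)):
        length, i = 0, start
        while not seen[i]:
            seen[i] = True
            i = perm[i]
            length += 1
        if length > 1:
            odd ^= (length - 1) % 2  # a k-cycle is a product of k-1 swaps
    return odd


def is_solvable(v, r, w, s):
    """Checks the three legality conditions on a state (v, r, w, s)."""
    return parity(v) == parity(w) and sum(r) % 3 == 0 and sum(s) % 2 == 0


solved = (list(range(8)), [0] * 8, list(range(12)), [0] * 12)
print(is_solvable(*solved))                                     # True

twisted_corner = [1] + [0] * 7
print(is_solvable(solved[0], twisted_corner, solved[2], solved[3]))  # False

swapped_corners = [1, 0] + list(range(2, 8))
print(is_solvable(swapped_corners, solved[1], solved[2], solved[3]))  # False
```

A single twisted corner or a single swapped pair of corners is the classic way a cube gets “broken” when it is reassembled carelessly, and both are caught here.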

This is also referred to as the “Fundamental theorem of Cube theory”

A question

If you have ever come across a certain tutorial on how to solve the cube, you probably know that there are two main sequences of moves used to solve the 2nd layer of the cube. These sequences have the effect of moving the edge cubie on the intersection of top and front face to the position of the edge cubie at the intersection of the front and left face, or the front and right face respectively. For cubers who have cubed for so long that they have forgotten the notation and instead have the moves in their muscle memory, and for the people who have never seen the tutorial, I want you to try and recreate this sequence on your own.

Hint

It has to do with conjugation and 3-cycles. You already know about the three cycle that cycles 3 edge cubies on the top face. Make use of that.

Answer

I encourage you to try it on your own before reading this.

It’s these sequences :

which when written in the way that the world writes are :

R'U'L'RF'R'LUL'RFR'LR
LURL'FLR'U'RL'F'LR'L'

Of course, there can be multiple answers, since there are multiple sequences. This exercise is just to show that it’s very possible to create your own algorithms to do clever things with the cube.

Homework (for programmers)

Now that you have some idea on the kind of legality conditions of cubes, try making a cube solver for the 2*2*2 cube and thus verify the legality condition for corners in 3*3*3 case (think about how these are related) . A good place to start is BFS . If you don’t know how to start despite reading the whole article, have a look at this picture :

This is an example of encoding that you could use.

Thanks

Most of the Group theory that I’ve read till now has been from the Abstract Algebra playlist on YouTube by Michael Penn.

Abstract Algebra | Preliminaries

The reason I started reading Group Theory was because it was part of the Discrete Mathematics (ES214) course at IITGN, which at the time I am writing this is still going on.

Neeldhara — ES 214 | Aug-Nov 2023

The 2*2*2 cube solver problem was given to me and my team as an assignment in the Data Structures and Algorithms course being held at IITGN this semester, taught by Prof. Balgopal Komarath.

Visualisations for Graph theory and Gradient Descent

Pranav Joshi — Mon, 06 Nov 2023 19:01:14 GMT

There are plenty of libraries and tools which can help you visualise a graph. They take in the abstract definition of a graph through an adjacency matrix, or an adjacency list, or something similar. These tools create a graph embedding in a 2D plane, like the ones you’ll see in a standard graph theory course, and they do it automatically! Positioning, shapes of edges, everything is done by the computer; no manual input required. But how does this happen ?

What is a 2D embedding ?

It’s basically the graph drawn out, with nodes and edges plotted on the Cartesian plane. To create a simple, but ugly looking embedding, you can use any plotting library (I am using matplotlib) and just scatter plot the nodes with random positions. Then connect the nodes by lines or curves, which will serve as edges.

So what’s the issue ?

The thing is; humans don’t like ugly things, and the embedding you created would certainly look ugly, since after all, it’s randomly generated. So how do we draw a good looking graph ?

What are the characteristics of a good looking graph ?

It’s hard to give a definite list, since beauty lies in the eyes of the beholder, but usually, the accepted criterion is that the nodes should be placed at an ideal distance (say lo) with adjacent nodes given higher priority. That is, for any error in the distance of two nodes, penalise the graph by a higher amount if the nodes were adjacent.

The problem statement

Given an adjacency matrix A , figure out the best possible positions (x_i,y_i) for all nodes ‘i’ in order to minimise the squared error given by :

Here x,y are vectors made of x_i,y_i values. Notice that I have added an extra factor of ‘alpha’ . This is the default priority given to the distance between a pair of nodes. The thing that we are squaring is the error in the distance between two nodes. And yes, I’m over-counting by a factor of 2, but that doesn’t matter, since it’s a constant factor.
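Written out, one form of the loss that is consistent with this description is shown below. Treat it as an assumed reconstruction: the symbol d_ij (the distance between nodes i and j) is introduced here, and l_0 stands for the ideal distance ‘lo’.

```latex
L(x, y) = \sum_{i} \sum_{j \neq i} \left(A_{ij} + \alpha\right)\left(d_{ij} - l_0\right)^2,
\qquad
d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}
```

Every unordered pair of nodes appears twice in the double sum, which is the over-counting by a factor of 2.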

The solution

Simple ; just gradient descent! This is a popular technique to solve all sorts of continuous optimisation problems. First, let’s find the partial derivative of our loss function L with respect to a single coordinate, say x_i.
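A sketch of that derivative, under two assumptions that are not stated in the text: the loss is a sum over ordered pairs of (A_ij + alpha) times the squared distance error, and A is symmetric. The symbols d_ij and l_0 (for ‘lo’) are introduced here.

```latex
L = \sum_{i} \sum_{j \neq i} \left(A_{ij} + \alpha\right)\left(d_{ij} - l_0\right)^2,
\qquad
d_{ij} = \sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}

\frac{\partial d_{ij}}{\partial x_i} = \frac{x_i - x_j}{d_{ij}}

\frac{\partial L}{\partial x_i}
= 4 \sum_{j \neq i} \left(A_{ij} + \alpha\right)\left(d_{ij} - l_0\right)\frac{x_i - x_j}{d_{ij}}
= 4 \sum_{j \neq i} \left(A_{ij} + \alpha\right)\left(1 - \frac{l_0}{d_{ij}}\right)\left(x_i - x_j\right)
```

The factor 4 is 2 from the square times 2 from the pair (i, j) being counted twice. A constant factor like this can be absorbed into the step size.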

Now, we calculate the gradient of L with respect to the whole of vector x by combining these derivatives.

Finally, we update the vector x iteratively by adding to it the negative of this gradient times the step size. We do the same for y as well.
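Before the vectorised version, here is a deliberately slow, loop-based sketch of the same idea using only the standard library. The function names, the fixed value of `alpha`, and the exact form of the loss (a sum over ordered pairs with weights A_ij + alpha, A symmetric) are assumptions made for illustration.

```python
import math
import random

def layout_loss(A, x, y, lo, alpha):
    """Weighted squared error of pairwise distances (each pair counted twice)."""
    n = len(x)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                d = math.hypot(x[i] - x[j], y[i] - y[j])
                total += (A[i][j] + alpha) * (d - lo) ** 2
    return total

def layout_gradient(A, x, y, lo, alpha):
    """Gradient of layout_loss with respect to x and y, assuming A is symmetric."""
    n = len(x)
    gx, gy = [0.0] * n, [0.0] * n
    for i in range(n):
        for j in range(n):
            if i != j:
                d = math.hypot(x[i] - x[j], y[i] - y[j])
                scale = 4 * (A[i][j] + alpha) * (1 - lo / d)
                gx[i] += scale * (x[i] - x[j])
                gy[i] += scale * (y[i] - y[j])
    return gx, gy

def descend(A, x, y, lo=0.3, alpha=0.1, dt=0.01, iterations=200):
    """Repeatedly step against the gradient; returns the new coordinates."""
    x, y = list(x), list(y)
    for _ in range(iterations):
        gx, gy = layout_gradient(A, x, y, lo, alpha)
        x = [xi - dt * g for xi, g in zip(x, gx)]
        y = [yi - dt * g for yi, g in zip(y, gy)]
    return x, y

# A symmetric path graph 0 - 1 - 2 with random starting positions.
A = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
rng = random.Random(1)
x0 = [rng.random() for _ in range(3)]
y0 = [rng.random() for _ in range(3)]
x1, y1 = descend(A, x0, y0)
print(layout_loss(A, x0, y0, 0.3, 0.1), "->", layout_loss(A, x1, y1, 0.3, 0.1))
```

The printed loss should drop after descending. The numpy version does the same arithmetic on whole matrices at once instead of looping over pairs.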

Ok, now let’s implement this in python.

from numpy import *
from matplotlib.pyplot import *
def Solve(A,lo=0.3,dt=0.01,iterations=10000):
    cW = 1
    cX = 100
    xlim(0,cW*cX)
    ylim(0,cW*cX)
    Ao = A
    A = A+A.T
    N = A.shape[0]
    A = A + sum(A)/N**2
    x = cW*random.random(N)
    y = cW*random.random(N)
    for i in range(N):
        for j in range(N):
            if(Ao[i][j]):
                plot([cX*x[i],cX*x[j]],[cX*y[i],cX*y[j]],'--g')  
    scatter(cX*x,cX*y,50,'red')
    title("before")
    Xd = (x[newaxis,:]-x[:,newaxis])
    Yd = (y[newaxis,:]-y[:,newaxis])
    I = identity(N)
    print(sum(((Xd**2+Yd**2)**0.5-lo)**2))
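    for i in range(iterations):
        # pairwise coordinate differences: Xd[i][j] = x[j] - x[i]
        Xd = (x[newaxis,:]-x[:,newaxis])
        Yd = (y[newaxis,:]-y[:,newaxis])
        # pairwise distances; adding I keeps the diagonal non-zero so the inverse below is safe
        distance = (Xd**2+Yd**2)**0.5+I
        # lo/distance off the diagonal, 0 on the diagonal
        repulse = (distance**-1 - I)*lo
        # contribution of every pair to the negative gradient, weighted by A
        Fx = -2*A*Xd*(1-repulse)
        Fy = -2*A*Yd*(1-repulse)
        # net "force" on each node
        fx = sum(Fx,0)
        fy = sum(Fy,0)
        # gradient descent step
        x += fx*dt
        y += fy*dt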
    figure()
    xlim(0,cW*cX)
    ylim(0,cW*cX)
    for i in range(N):
        for j in range(N):
            if(Ao[i][j]):
                plot([cX*x[i],cX*x[j]],[cX*y[i],cX*y[j]],'--g')  
    scatter(cX*x,cX*y,50,'red',alpha=1)
    title("after")
    print(sum(((Xd**2+Yd**2)**0.5-lo)**2))
# A test case
#N=7
#A = zeros((7,7))
#A[0][1]=1
#A[0][2]=1
#A[0][3]=1
#A[1][4]=1
#A[1][5]=1
#A[2][6]=1

# Another test case
N=10
A = random.randint(-N/3+1,2,(N,N))>0

Solve(A,0.3,0.0001*N)

The typical output is something like this :

Loss function value : 8.14

Loss function value : 1.9

As you can see, after gradient descent, the graph isn’t an eyesore anymore, like it was before.

This method is called force-directed graph drawing. Essentially, we can think of each pair of nodes as being joined by a spring (a stiffer one if the nodes are adjacent), which applies a force on the nodes proportional to the deformation of the spring.

Neural Network from scratch

Pranav Joshi — Wed, 01 Nov 2023 20:33:04 GMT

This isn’t one of those articles where you’ll be given an overview of the structure of a neural net. Neither is this a “for dummies” article where you’ll be taught how to use sklearn and scipy. Instead, this is more of a walk-through, where I provide all the necessary code to build a neural net from scratch, using no libraries whatsoever (well, except numpy and some visualisation related libraries, and sklearn.datasets to get a dataset), as well as the information needed for everything to make sense. First, it’s important for you to read one of those “overview” articles before reading this, because this isn’t a topic I can cover in a post of reasonable length. Anyway, let’s start.

Gradient Descent

This topic is something that you’ll find LOTS of resources on. For example, 3Blue1Brown has a really nice video on it. So, I’ll just be skipping this. In the end, all you have to know is that for a function f(x), you update x by adding (-f’(x)*dt) to it, where f’(x) is the gradient of ‘ f ’ wrt ‘ x ’ (x may be a vector) and dt is called the step size. It also doesn’t hurt to learn about Jacobians, inner products and tensors.
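As a quick illustration of that update rule, here is a minimal sketch for a function of one variable. The function name `gradient_descent` and the example f(x) = (x - 3)**2 are chosen only for illustration.

```python
def gradient_descent(df, x, dt=0.1, iterations=100):
    """Repeatedly apply x <- x - f'(x) * dt."""
    for _ in range(iterations):
        x = x - df(x) * dt
    return x

# f(x) = (x - 3)**2 has derivative f'(x) = 2 * (x - 3) and its minimum at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x=0.0)
print(x_min)
```

With this step size the distance to the minimum shrinks by a constant factor on every iteration, so the result ends up very close to 3.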

Automatic differentiation

The reason that gradient descent works so smoothly for incredibly complicated functions, with even more complicated gradients (as functions), is that, just like most problems, we can break big problems into smaller ones. This is done by automatic differentiation, which is basically a clever way of using the multi-variable chain rule, which is given by :

Multivariable Chain Rule
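In symbols, the standard form of this rule is the following (when the quantities are vectors, each factor becomes a Jacobian and the product a matrix product):

```latex
\frac{\partial f}{\partial t} = \sum_{i} \frac{\partial f}{\partial x_i}\,\frac{\partial x_i}{\partial t}
```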

Here, ‘t’ is a variable (possibly a vector or a matrix) that all ‘x_i’ are dependent on (among other dependencies), and ‘f’ is dependent on ‘x_i’ (and thus, ultimately on t) . Now, all of this is good, but how does this help us ?

Functions as Graphs

You must already be thinking of functions as “things” that take in stuff and spit out stuff. Rather than just “things”, let’s use a word that’s a bit more formal. How about “nodes” ?

Actually, each node will be slightly more than just the function. It’ll store all (or sufficiently many) of these things : the input values (t) , the output value (x), the gradient of the final function (f) with respect to the output of the node and the gradient of the final function (f) with respect to the input values. Usually, at any instant we need to apply the chain rule, we’ll already know the gradient of the final function (f) wrt the output (x) and we can compute the gradient of the output (x) wrt any of the inputs (t). Now, here’s where I make an important distinction : the input to a node (seen as a variable) and the input node (also a variable) that the input is identical to, are actually separate entities. When I say “gradient of f wrt input t” , I mean just that, and NOT “gradient of f wrt input node t” . This is because the input node could also be the input node of many other nodes, all of which will change on changing the value of the input node, but changing the value of the input to one node doesn’t change the value of the other nodes. Okay, enough details.

Simple, but fundamental functions, like addition, multiplication, etc. are what we can easily compute gradients for. So, once we implement these as nodes, we can implement complicated functions, like polynomials, sigmoid, etc. as graphs made of these nodes.

The function sin(0.01 x²) implemented as a graph

For example, in the above graph, “sin” is the final node and thus the output function. Obviously, the gradient of the output function (f = “sin”) wrt the output of the node “sin” is 1 . So far, I’ve talked about the gradient of the final function wrt the output in a verbose manner. Let’s name it something. How about a property of the node objects, say node.grad ? Similarly, let’s name the gradient of the output wrt the input as node.dxdt . So, the gradient of ‘f’ wrt input ‘t’ is given by… node.grad * node.dxdt . Here, the * operation represents either matrix multiplication, or inner product, depending on the context.

But this isn’t the end. For any input node t , the value t.grad is calculated by summing the values of node.grad * node.dxdt over all the output nodes of t .
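To make the summation rule concrete before the full implementation, here is a stripped-down sketch that works on plain scalars only. The class `Scalar` and the helpers `mul`, `sin_node` and `backward` are hypothetical and much simpler than the classes defined next; they only illustrate how node.grad * node.dxdt is accumulated.

```python
import math

class Scalar:
    """A node holding a value, its gradient, and (input node, local derivative) pairs."""
    def __init__(self, value, inputs=()):
        self.value = value
        self.grad = 0.0
        self.inputs = list(inputs)   # pairs (input_node, d(self)/d(input))

def mul(a, b):
    return Scalar(a.value * b.value, [(a, b.value), (b, a.value)])

def sin_node(a):
    return Scalar(math.sin(a.value), [(a, math.cos(a.value))])

def backward(nodes):
    """nodes must be in creation (topological) order; the last one is the output f."""
    for n in nodes:
        n.grad = 0.0
    nodes[-1].grad = 1.0                      # df/df = 1
    for n in reversed(nodes):                 # consumers before their inputs
        for inp, dxdt in n.inputs:
            inp.grad += n.grad * dxdt         # sum over all consumers of inp

# f(x) = sin(0.01 * x * x); x feeds the product twice, so it receives two contributions.
x = Scalar(2.0)
c = Scalar(0.01)
xx = mul(x, x)
u = mul(c, xx)
f = sin_node(u)
backward([x, c, xx, u, f])
print(x.grad, math.cos(0.01 * 4.0) * 0.02 * 2.0)
```

The two printed numbers should agree: the second is the derivative cos(0.01 x²) * 0.02 x worked out by hand at x = 2.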

Okay, now equipped with this knowledge, let’s implement this in python :

from matplotlib.pyplot import *
from numpy import *
from numpy.linalg import eig
from graphviz import Digraph
from IPython.display import Image
from sklearn.datasets import load_breast_cancer

g = Digraph("Network")
NodeCollection = []
Variables = []
counter = 0
dt = 0.01
def Reset():
    global g,NodeCollection,Variables,counter,dt
    g = Digraph("Network")
    NodeCollection = []
    Variables = []
    counter = 0
    dt = 0.01
class Node:
    name = None
    COLOR = "black"
    def __init__(self,name=None,draw=True):
        global counter,NodeCollection
        self.value = None
        self.outputNodes = []
        self.inputNodes = []
        self.id = counter
        self.draw = draw
        if self.name==None and name == None:
            self.name = "node "+str(self.id)
        elif name!=None:
            self.name = name
        if draw:
            g.node(str(self.id),self.name,color=self.COLOR)
        NodeCollection.append(self)
        counter += 1
    def __repr__(self):
        return f"name : {self.name}\n value : \n{self.value}\n grad : \n{self.grad}\n"
    def __add__(self,other):
        return Add([self,toNode(other)],"+")
    def __mul__(self,other):
        return Mul([self,toNode(other)],"*")
    def __pow__(self,other):
        return Pow([self,toNode(other)],"**")
    def __truediv__(self,other):
        # Python 3 dispatches the / operator to __truediv__ (__div__ is Python 2 only)
        return self*(toNode(other)**(-1))
    def __neg__(self):
        return Neg([self],"-")
    def __sub__(self,other):
        return self + (-toNode(other))
    def recieve(self):
        self.grad = 0
        for n in self.outputNodes:
            DFDX = n.dfdx_value[self.id]
            GRAD = n.grad
            #print("recievong from",n.id,"aka",n.name)
            #print("DFDX",DFDX.shape)
            #print("GRAD",GRAD.shape)
            if len(DFDX.shape)==1 and GRAD.shape==(1,DFDX.shape[0]) and len(self.value.shape)==2:
                self.grad += GRAD.T @ DFDX[newaxis,:]
            elif DFDX.shape==(1,) or GRAD.shape ==(1,):
                self.grad += GRAD*DFDX
            else:
                #self.grad += dot(GRAD,DFDX)
                self.grad += GRAD @ DFDX
class Function(Node):
    COLOR = "green"
    f = None
    dfdx = None
    dfdx_value = None
    def __init__(self,inputNodes,name=None,draw=True):
        super().__init__(name,draw)
        for n in inputNodes:
            if self.draw and n.draw:
                g.edge(str(n.id),str(self.id))
            n.outputNodes.append(self)
        self.inputNodes = inputNodes
        self.forward()
    def __repr__(self):
        return f"name : {self.name}\n value : \n{self.value}\n grad : \n{self.grad}\n dfdx : \n{self.dfdx_value}\n"
    def forward(self):
        self.inputs = dict([(node.id,node.value) for node in self.inputNodes])
        self.value = self.f(self.inputs)
        self.dfdx_value = self.dfdx(self.inputs)
        n = int(prod(self.value.shape))
        #if n > 1:
        #    self.grad = identity(n)    
        #else :
        #    self.grad = array(1)
        self.grad = identity(n)
    def backward(self):
        for n in self.inputNodes:
            n.recieve()
            n.backward()
class Variable(Node):
    COLOR = "red"
    def __init__(self,value,name=None,draw=True):
        super().__init__(name,draw)
        self.value = value
        n = prod(self.value.shape)
        self.grad = identity(n)
        Variables.append(self)
    def backward(self):
        pass
    def forward(self):
        global dt
        self.grad.resize(self.value.shape)
        self.value = self.value - self.grad*dt
class Constant(Variable):
    COLOR="black"
    def recieve(self):
        pass
    def forward(self):
        pass
def toNode(other,draw=True):
    name = None
    if isinstance(other,Node):
        return other
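    if type(other) != ndarray:
        # numpy's iterable() is a function, so call it instead of comparing against it
        if not iterable(other):
            # scalars become one-element arrays named after their value
            name = str(other)
            other = array([other])
        else:
            other = array(other)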
    return Constant(other,name,draw)

Now, let’s add some fundamental functions.

class Add(Function):
    name = "+"
    def f(self,inputs):
        S = 0
        for id in inputs:
            S = S + inputs[id]
        return S
    def dfdx(self,inputs):
        G = dict()
        for id in inputs:
            n = prod(inputs[id].shape)
            if n>1:
                G[id] = identity(n)
            else:
                G[id] = ones(prod(self.value.shape))[:,newaxis]
        return G
class Mul(Function):
    name = "*"
    def f(self,inputs):
        S = 1
        for id in inputs:
            S = S*inputs[id]
        return S
    def dfdx(self,inputs):
        G = dict()
        for id in inputs:
            S = 1
            for Id in inputs:
                if Id == id:
                    continue
                S = S*inputs[Id]
            S = S.flatten()
            n = prod(inputs[id].shape)
            if n > 1:
                m, = S.shape
                if m > 1:
                    S = diag(S)
                else :
                    S = S * identity(n)
            else:
                S = S[:,newaxis]
            G[id] = S
        return G
class Exp(Function):
    name = "exp"
    def f(self,inputs):
        return exp(next(iter(inputs.values())))
    def dfdx(self,inputs):
        id = next(iter(inputs.keys()))
        x = next(iter(inputs.values()))
        n = prod(x.shape)
        x = exp(x)
        x = x.flatten()
        if n>1:
            return {id:diagflat(x)}
        return {id:x[:,newaxis]}
class Pow(Function):
    name = "**"
    def f(self,inputs):
        x,n = inputs.values()
        return x**n
    def dfdx(self,inputs):
        ids = list(inputs.keys())
        x,n = inputs.values()
        m = prod(x.shape)
        if m > 1:
            return {ids[0]:diagflat(n*x**(n-1))}#,ids[1]:(log(x)*x**n).flatten()[:,newaxis]}
        return {ids[0]:(n*x**(n-1)).flatten()[:,newaxis]}#,ids[1]:(log(x)*x**n).flatten()[:,newaxis]}
class Neg(Function):
    name = "-"
    def f(self,inputs):
        return -next(iter(inputs.values()))
    def dfdx(self,inputs):
        id = next(iter(inputs.keys()))
        x = next(iter(inputs.values()))
        n = int(prod(x.shape))
        #if n>1:
        #    return {id:-identity(n)}
        #return {id:-array([1])}
        return {id:-identity(n)}
class Dot(Function):
    name = "."
    def f(self,inputs):
        x,y = inputs.values()
        d = dot(x,y)
        if len(d.shape)>0:
            return d
        return array([d])
    def dfdx(self,inputs):
        id1,id2 = inputs.keys()
        x,y = inputs.values()
        if len(x.shape) == 1:
            x = x[newaxis,:]
            y = y[newaxis,:]
        return {id1:y,id2:x}
class Sum(Function):
    name = "sum"
    def f(self,inputs):
        x, = inputs.values()
        return array([sum(x)])
    def dfdx(self,inputs):
        x, = inputs.values()
        id, = inputs.keys()
        n = prod(x.shape)
        return {id:ones((1,n))}
class Sin(Function):
    name = "sin"
    def f(self,inputs):
        x, = inputs.values()
        return sin(x)
    def dfdx(self,inputs):
        x, = inputs.values()
        id, = inputs.keys()
        n = prod(x.shape)
        return {id:diagflat(cos(x))}
class Cos(Function):
    name = "cos"
    def f(self,inputs):
        x, = inputs.values()
        return cos(x)
    def dfdx(self,inputs):
        x, = inputs.values()
        id, = inputs.keys()
        n = prod(x.shape)
        return {id:diagflat(-sin(x))}  # d/dx cos(x) = -sin(x)
class MatFunc(Function):
    def forward(self):
        self.inputs = dict([(node.id,node.value) for node in self.inputNodes])
        self.value = self.f(self.inputs)
        n = int(prod(self.value.shape))
        self.grad = identity(n)
    def __init__(self, inputNodes, name=None, draw=True):
        super().__init__(inputNodes, name, draw)
        for n in self.inputNodes:
            if isinstance(n,MatFunc):
                n.recieve = n.send
            else:
                n.recieve = lambda:None
    def recieve(self):
        super().recieve()
        self.send()
class MatMul(MatFunc):
    name = "@"
    def f(self,inputs):
        W,X = inputs.values()
        return W @ X
    def send(self):
        w,x = self.inputNodes
        W = w.value
        X = x.value
        G = self.grad
        w.grad = G @ X.T
        x.grad = W.T @ G
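# sigmoid is not defined elsewhere in this listing; this helper is an assumed definition
def sigmoid(X):
    return 1/(1+exp(-X))
class SigmM(MatFunc):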
    name = "Sigma"
    def f(self,inputs):
        X, = inputs.values()
        return sigmoid(X)
    def send(self):
        x, = self.inputNodes
        X = x.value
        G = self.grad
        sig = sigmoid(X)
        x.grad = G*(sig**2/exp(X))
class SqNorM(MatFunc):
    name = "Norm"
    def f(self,inputs):
        X, = inputs.values()
        return array([linalg.norm(X,'fro')])**2
    def send(self):
        x = self.inputNodes[0]
        x.grad = 2*x.value*self.grad
class DotM(MatFunc):
    name = "."
    def f(self,inputs):
        X,Y = inputs.values()
        return sum(X*Y)
    def send(self):
        x,y = self.inputNodes
        X,Y = x.value,y.value
        G = self.grad # assumed to be a scalar
        x.grad = Y*G
        y.grad = X*G
class AddM(MatFunc):
    name = "+"
    def f(self,inputs):
        S = 0
        for X in inputs.values():
            S = S + X
        return S
    def send(self):
        for n in self.inputNodes:
            if len(n.value.shape) == 1 or n.value.shape[0] == 1:
                n.grad = sum(self.grad,axis=0)
            else :
                n.grad = self.grad
class NegM(MatFunc):
    name = "-"
    def f(self,inputs):
        X, = inputs.values()
        return -X
    def send(self):
        x = self.inputNodes[0]
        x.grad = -self.grad

Back Propagation

Remember what we are doing all of this for. We want to do gradient descent on the nodes that are not dependent on any other nodes. We know we can calculate the gradient of f wrt the value of a node, given that all its output nodes have the latest value of the gradient wrt their value. This means we must compute these values in reverse topological order (the reverse of the order of creation, in this case) . And, of course, for computing the outputs of the nodes, we need to move in forward order. The first process is called “back propagation” . Similarly, I name the second process “forward propagation” .
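A small sketch of why the creation order works. The toy graph and the helper `consumers_first` are hypothetical; the dictionary plays the role of a list of nodes kept in the order they were created, where a node can only be built from nodes that already exist.

```python
# Each entry: node name -> names of its input nodes, listed in creation order.
# Python dicts keep insertion order, so this mirrors a list of nodes.
graph = {"x": [], "w": [], "x*w": ["x", "w"], "exp": ["x*w"], "loss": ["exp", "x"]}

creation_order = list(graph)
backward_order = creation_order[::-1]     # order for back propagation
forward_order = creation_order            # order for forward propagation

def consumers_first(order, graph):
    """True if every node appears only after all the nodes that consume it."""
    seen = set()
    for name in order:
        users = [n for n, inputs in graph.items() if name in inputs]
        if not all(u in seen for u in users):
            return False
        seen.add(name)
    return True

print(backward_order, consumers_first(backward_order, graph))
```

In the reversed order, every node is reached only after all of its output nodes, so their gradients are already up to date when it sums them.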

def BacProp(show=False):
    global NodeCollection
    for i in range(len(NodeCollection)-2,-1,-1):
        n=NodeCollection[i]
        n.recieve()
        if show:
            print(n)
def forProp(show=False):
    global NodeCollection
    for i in range(len(NodeCollection)):
        n=NodeCollection[i]
        n.forward()
        if show:
            print(n)
def Descend(iterations=100):
    global dt,NodeCollection,Variables
    L = NodeCollection[-1]
    print("before",L.value)
    for i in range(iterations):
        BacProp()
        forProp()
    print("after",L.value)

In fact, with all of this, we are in a position to build a small neural network right away. First, let’s make a perceptron.

Perceptron

It’s a function that outputs the sigmoid of a linear combination of different features (nodes in the preceding layer) . The weights are variables (the trainable parameters of the model, not hyperparameters). There’s a lot of information on this topic as well, so I’ll just shut up and show you the code.
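Stripped of the graph machinery, the computation a perceptron performs can be sketched in a few lines. The names `sigmoid` and `perceptron_output` are chosen for illustration, and there is no bias term, matching the perceptron defined below.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def perceptron_output(features, weights):
    """Sigmoid of the weighted sum of the features (no bias term)."""
    return sigmoid(sum(w * f for w, f in zip(weights, features)))

# The weighted sum here is 0.5*1.0 - 0.25*2.0 = 0, and sigmoid(0) = 0.5.
print(perceptron_output([1.0, 2.0], [0.5, -0.25]))
```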

def Sigm(inputNode,name="S",draw=True):
    out = Neg([inputNode],None,False)
    out = Exp([out],None,False)
    out = Add([out,Constant(array([1]),None,False)],None,False)
    out = Pow([out,Constant(array([-1]),None,False)],name,True)
    if draw and inputNode.draw:
        g.edge(str(inputNode.id),str(out.id))
    return out
def perceptron(layer,draw=True):
    nl = []
    for n in layer :
        nl.append(Mul([n,Variable(random.random(1),None,False)],None,False))
    S = Add(nl,"+",False)
    S = Sigm(S,"P",draw)
    for n in layer:
        if draw and n.draw:
            g.edge(str(n.id),str(S.id))
    S.nl = nl
    return S

Loss function

Let’s also make a function that gives the squared Euclidean distance between two vectors. This is what’s called a loss function (there are many varieties), which tells us how bad our prediction was.
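On plain lists of numbers, this loss is simply the sum of squared differences. The function name `squared_error` is chosen for illustration.

```python
def squared_error(prediction, target):
    """Sum of squared differences, i.e. the squared Euclidean distance."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target))

# (0.9 - 1.0)**2 + (0.2 - 0.0)**2 is roughly 0.05
print(squared_error([0.9, 0.2], [1.0, 0.0]))
```

The graph version below builds the same quantity out of the Neg, Add, Pow and Sum nodes, so that its gradient can be propagated backwards.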

def SqEr(inputNodes,name="L",draw=True):
    x,y = inputNodes
    out = Neg([y],None,False)
    out = Add([x,out],None,False)
    two = Constant(array([2]),None,False)
    out = Pow([out,two],None,False)
    out = Sum([out],name,draw)
    for n in inputNodes:
        if draw and n.draw:
            g.edge(str(n.id),str(out.id))
    return out

Neural Net

And finally, let’s make the whole neural net.

class PerceptronNet:
    def __init__(self,LS=(3,3,3)):
        Reset()
        LS = list(LS)  # copy, so that pop() mutates neither the caller's list nor the default
        n = LS.pop(0)
        layer0 = [Constant(ones(1)) for _ in range(n)]
        layers = [layer0]
        for n in LS:
            last_layer = layers[-1]
            layers.append([perceptron(last_layer) for _ in range(n)])
        P = perceptron(layers[-1])
        y = Constant(ones(1),"y")
        L = SqEr([P,y])
        self.y = y
        self.layer0 = layer0
        self.P = P
        self.L = L
        self.layers = layers
    def assign(self,X,y):
        if X.shape[1]!= len(self.layer0):
            print("Can't deal with this many variables.")
            return
        self.y.value = y
        for c in range(X.shape[1]):
            self.layer0[c].value = X[:,c]
        forProp()
    def predict(self):
        forProp()
        loss = self.L.value
        y_pred = self.P.value
        return y_pred,loss

Usage

We can use this on the breast cancer dataset like this :

X,y = load_breast_cancer(return_X_y=True)

# PCA
mu = mean(X,0)
sd = var(X,0)**0.5
X = (X-mu)/sd
## Assume X = Y @ R.T and Y.T @ Y = D
C = X.T @ X
D,RT = eig(C)
idx = argsort(D)[-7:]
D = D[idx]
RT = RT[:,idx]
X = X @ RT
X = real(X)

#Neural Net
net = PerceptronNet([7,5])
net.assign(X,y)
dt = 0.001
print(mean(around(net.P.value)==net.y.value)*100)
Descend(100)
print(mean(around(net.P.value)==net.y.value)*100)
Show(g,200)

This gives this graph :

The accuracy is 90-95%, measured on the training data itself. We can increase this drastically by descending for longer. But to do that, we need a more efficient model. This can be done by using matrices for layers, rather than individual perceptrons.

Layers as Matrices

def Layer(X,pin,pout,name=None,bias=False):
    W = random.random((pin,pout))
    W = Variable(W,"W")
    Y = MatMul([X,W])
    if bias:
        b = random.random(pout)
        b = Variable(b,"b")
        Y = AddM([Y,b]) 
    Y = SigmM([Y],name)
    return Y
def SqErM(inputNodes):
    yp,y = inputNodes
    myp = NegM([yp])
    er = AddM([myp,y])
    s = SqNorM([er],"L")+0
    return s
class LayerNet:
    def __init__(self,X,y,middle=[3,3,3],bias=False,normalise=False):
        if isinstance(bias,bool):
            bias = [bias for _ in range(len(middle))]
        Reset()
        m = X.shape[1]
        n = y.shape[1]
        mu = mean(X,axis=0)
        sdi = diag(var(X,axis=0)**(-0.5))
        X = Constant(X,"X")
        y = Constant(y,"y")
        self.y = y
        self.X = X
        if normalise:
            nmu = Constant(-mu,'b')
            sdi = Constant(sdi,"W")
            X = AddM([X,nmu],"shifted X")
            X = MatMul([X,sdi],"normalised X")
        out = Layer(X,m,middle[0],"layer1",True)
        for i in range(len(middle)-1):
            out = Layer(out,middle[i],middle[i+1],f"layer {i+1}",bias[i])
        y_pred = Layer(out,middle[-1],n,'y_pred',bias[-1])
        L = SqErM([y_pred,y])
        self.y_pred = y_pred
        self.L = L
    def predict(self,X,y,testName = ""):
        self.X.value = X
        self.y.value = y
        forProp()
        y_pred = self.y_pred.value
        print("accuracy on test ",testName," : ",round(mean(around(y_pred)==y)*100,2),"%")
        return y_pred
    def train(self,iterations=100,dtvalue=0.01):
        global dt
        dtold = dt
        dt = dtvalue
        y_pred = self.y_pred.value
        y = self.y.value
        print("accuracy before training : ",round(mean(around(y_pred)==y)*100,2),"%")
        Descend(iterations)
        dt = dtold
        y_pred = self.y_pred.value
        y = self.y.value
        print("accuracy on training : ",round(mean(around(y_pred)==y)*100,2),"%")

Using this for the breast cancer dataset, without even doing PCA, we get 96–98% accuracy on the test data :

X,y = load_breast_cancer(return_X_y=True)
m = X.shape[1]
X = X[:]
y = y[:,newaxis]
X,Xtest = X[:300,:],X[300:,:]
y,ytest = y[:300,:],y[300:,:]

net = LayerNet(X,y,[5],normalise=True)
net.train(100,0.01)
net.predict(Xtest,ytest)
Show(g)

This looks like this :

A neural net with middle layer of 5 perceptrons

Now, you might think that neural networks aren’t so good after all, since we got only 96–98% accuracy. Well, the thing is, I chose the wrong activation function and loss function. We usually use the sigmoid and squared error for regression tasks. In this case, we want to perform a classification task, for which we should use the soft-max activation function and cross entropy loss function. Using that and 2 hidden layers with 10 and 3 perceptrons, we get this output :

accuracy before training :  48.67 %
before [2.77841405]
after [0.00597004]
accuracy on training :  100.0 %
accuracy on test    :  96.28 %
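For reference, the soft-max activation and cross entropy loss mentioned above can be sketched on plain lists like this. The function names are chosen for illustration and this is not the code that produced the output shown; that lives in the linked notebook.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities; subtracting the max avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(probabilities, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(probabilities[true_class])

p = softmax([2.0, 0.0])
print(p, cross_entropy(p, 0))
```

The loss is small when the correct class gets a probability close to 1 and grows without bound as that probability approaches 0, which is why it suits classification better than the squared error.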

The model was trained on 300 samples and tested on 300 samples from the breast cancer dataset. To view the final code click on the link :

ML_Journey/AutoDif.ipynb at main · pranav-joshi-iitgn/ML_Journey

A more complicated example would be a hand written digit classifier.

ML_Journey/Digits.ipynb at main · pranav-joshi-iitgn/ML_Journey

End

Finally, I’d like to admit that this was my first time building a neural network as well. I’m NOT an expert in this field, so take everything (especially notation) with a grain (brick) of salt. Hopefully, this helps some lost soul, just as an older Medium post on the same topic helped me.

Contact

Linked In : https://www.linkedin.com/in/pranav-joshi-77487b232/

YouTube : https://youtube.com/@iknowsomestuff7131?feature=shared