Emulating Gene Expression in Prokaryotes using a JavaScript Neural Network
This is my first article and I’m really happy about it :) so please, if you have any constructive suggestions/comments you’re very welcome to leave them.
Let’s start!
For those who don’t have a biology background, don’t panic! I’ll try to guide you through some key concepts that will make this tutorial easier to follow (and maybe get you to start loving biology as well).
First, what are we trying to achieve here?
Gene expression refers to the process that allows, in this case, a bacterium (a prokaryote) to synthesize something it needs for a greater purpose, mostly proteins, by using a particular gene in its DNA.
In our example, we are going to emulate how a bacterium behaves in terms of gene expression in situations where there is plenty of glucose (its favorite form of sugar!) and, on the contrary, where there is an insufficient amount of it and the bacterium uses lactose instead (another form of sugar, a less effective one, energetically speaking).
To do so, prokaryotes use what is called the lac operon: a group of genes that contains a promoter (other operons can have more than one, but this one in particular uses only one). A promoter is a “special place” in any operon that controls gene expression. So, in this case, the genes in this particular operon encode proteins that allow a bacterium to use lactose as an energy source.
In order to be as efficient as possible, a bacterium should express the lac operon only when two conditions are met:
- Lactose is available, and
- Glucose is not available
You might wonder, how is a bacterium able to do this? With the use of two regulatory proteins:
- The lac repressor, which “senses” lactose levels
- The catabolite activator protein (CAP), which acts as a glucose sensor
In other words, these two bind to the DNA of the lac operon and regulate its transcription (if you are remembering the promoter, you’re on the right path :) because it has a lot to do with this).
I don’t want to get into details here as they are not relevant to what we are trying to build in this tutorial, but if you’re interested in learning more about it you can go to the Khan Academy section.
Continuing with the regulatory part, the lac operon contains a number of DNA regions to which particular regulatory proteins can bind and control its transcription:
- The promoter (we already mentioned it) is where a particular protein in charge of transcription (RNA polymerase) binds. That is, if this protein binds here, the bacterium will be able to eat lactose because it will eventually produce the “forks” to do so. Simple as that.
- The operator is also a place where a particular protein binds, in this case the lac repressor. Imagine the latter as a gigantic rock occupying part of an available parking lot; in this analogy the car acts as the protein in charge of transcription (RNA polymerase). What happens to the car in this situation? It can’t park. The same goes for RNA polymerase: it can’t “park” on the promoter because the lac repressor (the gigantic rock) gets in its way.
- The CAP binding site is a positive regulatory site that is bound by the protein we already mentioned, CAP. Conversely to what happens with the lac repressor, this enables RNA polymerase to bind to the promoter more actively.
After this introduction we can summarize our goal here as an attempt to emulate bacterial behavior depending on how much lactose and glucose is available in the environment.
What’s next?
Let’s dig a little deeper into what happens in every situation with every element involved in this process, so we can start thinking in a “coding” way by defining some binary rules :)
First element, the lac repressor
- 1st rule: When lactose is not available, the lac repressor binds to the operator, preventing RNA polymerase from transcribing the gene. If you think about it, it makes total sense. Why would you need “special forks” to eat a “special meal” if there is none?
- 2nd rule: On the other hand, I’m sure you all figured it out already :) if there is lactose, we are going to need “special forks”, and how are these made? By expressing the gene, you got it! … You may ask, how is this possible? Well, basically, when there is lactose around there is also allolactose, which is a rearranged form of lactose. This molecule binds to the repressor and makes it “let go” of the operator, so it no longer blocks transcription.
Second element, Catabolite activator protein (CAP)
The lac repressor by itself is not enough to make a bacterium express the gene as efficiently as it should. Working with it alone yields only a few transcriptions, unless the operon gets extra help from CAP. But remember, CAP is tied to glucose availability, NOT LACTOSE. That leads us to…
- 1st rule: When the glucose level is low, cAMP is produced. Think of it as a messenger. This guy appears to let us know that we’re out of yummy sugar (glucose). cAMP attaches to CAP, allowing it to bind to the DNA region that activates transcription more intensively. As we are running out of glucose, we really need to take advantage of lactose.
- 2nd rule: As you can surely imagine, when glucose is high, cAMP is not present. Consequently, CAP can’t bind to the DNA region that allows high-rate transcription, resulting in transcription occurring at a low level.
Finally, when does the lac operon really TURN ON (at high levels)?
These are the most important premises we are going to define, as they lead to our final conclusion and allow us to finally code something (YAY!).
- Glucose must be unavailable
- Lactose must be available
Everything in between may lead to transcription but at a lower rate, which is also something we can emulate by training our neural network with values that represent those situations as well. So, we have (you’ll find a plain-code sketch of these four cases right after this list):
- glucose present, lactose absent: no transcription
- glucose present, lactose present: low-level transcription
- glucose absent, lactose absent: no transcription
- glucose absent, lactose present: strong transcription
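Before we bring in any neural network, here is a minimal plain-JavaScript sketch of those four cases, just to make the “binary rules” explicit. The function name lacOperonState is a hypothetical helper I made up for illustration, not part of any library:

```javascript
// A plain-logic version of the four cases above, no neural network involved yet.
// lacOperonState is a hypothetical helper used only for illustration.
function lacOperonState(glucosePresent, lactosePresent) {
  // Without lactose there is nothing to digest, so nothing gets transcribed.
  if (!lactosePresent) return 'no transcription';
  // With lactose, the rate depends on whether glucose is also around.
  return glucosePresent ? 'low-level transcription' : 'strong transcription';
}

console.log(lacOperonState(true, false));  // no transcription
console.log(lacOperonState(true, true));   // low-level transcription
console.log(lacOperonState(false, false)); // no transcription
console.log(lacOperonState(false, true));  // strong transcription
```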
Let me code, pleeeeaaaaaaaase…
I used the brain.js library for this project, but there are a lot of options (and you can write your own neural network logic as well, it’s an excellent practice :)). If you’d like to understand more deeply what this particular neural network involves and how its learning process works, you can check this site.
Let’s go through each piece of code, shall we?
For the most basic type of neural network you’ll find something like this: a certain amount of input neurons, the “hidden” layer (where training takes place) and the output neurons, intended to show the final result.
… and in order to model the bacteria’s behavior we are going to need 2 inputs (2 neurons) taking care of glucose and lactose levels, and 3 outputs (3 neurons) corresponding to the no transcription, low transcription and strong transcription outcomes. At this point, there is no connection between them, but our library brain.js will be in charge of that; we don’t have to do anything. In addition, it’s useful to know that it sets default values for some aspects of the network that can be changed if needed (such as the number of iterations to train the network, the acceptable error threshold for the training data, etc.).
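As a rough sketch of the setup (brain.js infers the number of input and output neurons from the shape of the training data, so the only thing we hint at here is the hidden layer):

```javascript
// Creating the network with brain.js.
// Input/output sizes are inferred later from the training data,
// so the only option we tweak here is the hidden layer.
const brain = require('brain.js');

const net = new brain.NeuralNetwork({
  hiddenLayers: [3], // one small hidden layer is plenty for this toy problem
});
```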
Nevertheless, the most important aspect here is how well we train our neural network. This is achieved by providing several combinations of inputs and outputs, which eventually lead to the correct final outcome we are expecting. Take a look at the data I used (you can write your own as well; in fact, I encourage you to do so):
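Here is a hedged sketch of what that training data can look like; the output names (none, low, strong) and most of the values are illustrative assumptions rather than my original dataset, but they follow the four rules we defined above:

```javascript
// A sketch of the training data: the output keys (none/low/strong)
// and most of the values below are illustrative, not the original dataset.
const trainingData = [
  // glucose present, lactose absent: no transcription
  { input: { lactose: 0, glucose: 1 }, output: { none: 1, low: 0, strong: 0 } },
  // glucose present, lactose present: low-level transcription
  { input: { lactose: 1, glucose: 1 }, output: { none: 0, low: 1, strong: 0 } },
  // glucose absent, lactose absent: no transcription
  { input: { lactose: 0, glucose: 0 }, output: { none: 1, low: 0, strong: 0 } },
  // glucose absent, lactose present: strong transcription
  { input: { lactose: 1, glucose: 0 }, output: { none: 0, low: 0, strong: 1 } },
  // an in-between case, like the { lactose: 0.3, glucose: 0.6 } example below
  { input: { lactose: 0.3, glucose: 0.6 }, output: { none: 0, low: 0.3, strong: 0.7 } },
];
```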
As you may have noticed, it contains input and output values ranging between 0 and 1. In other words, for a given input we are expecting a defined output (or, in some cases, a probability of occurrence for each outcome). For example: given { lactose: 0.3, glucose: 0.6 }, and taking into account everything we learned about each of them and how they affect the bacteria’s behavior, we can picture a situation where the output has a 30% (0.3) probability of low-rate transcription and a 70% (0.7) probability of strong transcription.
Note: please notice that the outputs of this particular model need to sum up to 100% to make sense.
And that’ll be all! You run the code and, after the training process, you’ll be able to use your trained neural network. Pretty cool, right? You just have to feed it some inputs to see how accurate the predictions are.
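Something along these lines, assuming the net and trainingData sketched above (the printed numbers are only an example of the kind of output you might see, not exact results):

```javascript
// Train the network with the data above (default options), then query it.
net.train(trainingData);

// Plenty of lactose, almost no glucose: we expect "strong" to dominate.
const result = net.run({ lactose: 0.9, glucose: 0.1 });
console.log(result);
// e.g. { none: 0.03, low: 0.08, strong: 0.91 } (your exact numbers will vary)
```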
If you want your network to become more precise, you just need to add more training data and play a little bit with the default values associated with its training (like the number of iterations).
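For instance, brain.js lets you pass a second argument to train() with options like these (the values below are just a hedged example of nudging the usual defaults, not mandatory settings):

```javascript
// Tweaking the training options via the second argument of train().
net.train(trainingData, {
  iterations: 40000,   // more passes over the data than the usual default
  errorThresh: 0.003,  // keep training until the error drops below this value
  log: true,           // print the error periodically so you can watch it fall
});
```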
Thank you so much for taking the time to read the article. If you liked it, feel free to 👏👏👏 a few times, so other people can enjoy it as well :)