Project Geneva: Beating State Censorship with Genetic Algorithms
One of the most impactful uses of Artificial Intelligence in today’s society
The internet is a great tool. It’s not surprising that I am a huge fan of all the potential it provides (I have relied on it heavily for my journey). The internet is an integral part of humanity’s progress and will continue to be a large part of the advancements of the future.
Unfortunately, as technology grows, so does the potential for misuse. Machine Learning can automate many complex tasks in ways that would’ve been impossible just 15 years ago, but this also enables censorship and mass surveillance at a scale previously unimaginable. Ironically, AI and Machine Learning are also among the best tools for dealing with such problems.
I came across one such project in the video AI against Censorship: Genetic Algorithms, The Geneva Project, ML in Security, and more!, a fantastic interview with a leader of the project. The interview was particularly interesting because the guest explains why Genetic Algorithms are well suited to this task. As someone who has long argued that Evolutionary Algorithms are underutilized, this had me interested. In this article, I will share some Machine Learning takeaways along with some ideas I had while looking into the project. This article is meant to introduce you to this crucial project and bring it to your attention. Make sure you look into it, as work like this is instrumental for the future.
According to the project’s website, “Geneva is a novel experimental genetic algorithm that evades censorship by manipulating the packet stream on one end of the connection to confuse the censor.” It has two components:
- Strategy Engine: The strategy engine is responsible for running a given censorship evasion strategy over active network traffic.
- Genetic Algorithm: The genetic algorithm is the learning component that evolves new strategies (using the engine) against a given censor.
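To make the interplay between these two components concrete, here is a toy sketch in Python (my own illustration, not Geneva’s actual API): a genetic algorithm evolves candidate strategies while a stand-in “engine” scores them against a simulated censor.

```python
import random

random.seed(0)  # make the toy run reproducible

# Toy sketch (my own illustration, not Geneva's actual API): candidate
# strategies are short chains of packet-level actions, and a stand-in
# "engine" scores how well each strategy fares against a simulated censor.
ACTIONS = ["duplicate", "drop", "tamper", "fragment"]

def random_strategy(length=3):
    return [random.choice(ACTIONS) for _ in range(length)]

def engine_score(strategy):
    # Stand-in for running the strategy over live traffic: reward
    # strategies that modify packets without dropping them outright.
    if "drop" in strategy:
        return 0.0
    return float(strategy.count("tamper") + strategy.count("fragment"))

def evolve(generations=20, pop_size=10):
    population = [random_strategy() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=engine_score, reverse=True)
        survivors = population[: pop_size // 2]   # keep the fittest half
        children = []
        for parent in survivors:
            child = parent[:]                     # copy, then mutate one gene
            child[random.randrange(len(child))] = random.choice(ACTIONS)
            children.append(child)
        population = survivors + children
    return max(population, key=engine_score)

best = evolve()
```

In the real system, the engine runs strategies over live network traffic instead of a simulated scoring function, but the evolve-evaluate loop has the same shape.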
The tool is open source, and their Github can be found here. Anyone can set up and run Geneva from their machines. Running it starts the algorithm, which tests different strategies and refines them over time. When a strategy breaks through the censorship, the team studies the results to extract information about the underlying censorship system.
Given the nature of the project, there are a lot of networking-related terminologies in their documentation. I know nothing about the field, so I won’t pretend to be able to break those aspects down. If anybody more experienced in these areas wants to get on and talk about those factors, I’d be more than happy to open up my platform to you. But I can explain the interesting AI aspects.
In the video I shared about Differential Evolution, I talked about how Evolution Based Algorithms have a larger possible search space than traditional gradient-based methods. This advantage really shows itself in problems like this. An evolutionary algorithm (like the Genetic Algorithms mentioned here) can traverse through a much more varied search space.
Consider the search space itself. Gradient-based methods need a smooth, continuous search space; Genetic Algorithms can operate even when that isn’t the case. This is why they can tweak solutions and chain together discrete components of the search space to create new candidate solutions.
The project lead actually touched upon this during the linked interview. He mentioned that there is no gradient for ML methods to compute over. This is true for both the search space (strategies are built by chaining commands) and the output (whether traffic passes through the censorship). In fact, he even mentioned that since Genetic Algorithms can test anything, the team had to constrain the algorithm to a set of basic commands.
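To see why no gradient is needed, here is a minimal sketch (command names and helper functions are my own placeholders) of how genetic operators act directly on discrete chains of commands:

```python
import random

random.seed(1)

# Illustrative only: strategies are discrete chains of commands, so there
# is no gradient to follow. Genetic operators work directly on the chains.
COMMANDS = ("duplicate", "drop", "tamper", "fragment")

parent_a = ["duplicate", "tamper", "drop"]
parent_b = ["fragment", "tamper", "duplicate"]

def crossover(a, b):
    # Single-point crossover: splice a prefix of one chain onto a
    # suffix of the other, producing a new candidate strategy.
    point = random.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(chain):
    # Swap one command for another: a small, discrete "step" in the
    # search space, with no notion of direction or gradient.
    chain = chain[:]
    chain[random.randrange(len(chain))] = random.choice(COMMANDS)
    return chain

child = mutate(crossover(parent_a, parent_b))
```

Both operators only ever recombine and swap discrete symbols, which is exactly the kind of move a gradient-based optimizer cannot make.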
Search Space Commands
Evolution-based methods start from a basic set of valid commands. These are used both to create candidate solutions and to tweak existing solutions during the recombination/mutation phases. The commands are domain-specific; for this project, we have the following building blocks.
It’s important to understand how the team came up with these as the basis. They used a straightforward definition of censorship: censorship is merely the modification of network traffic. A strategy, then, is simply a “description of how network traffic should be modified”. From that lens, it’s clear that the building blocks should consist of the possible ways that network packets might be modified.
The goal of a censorship evasion strategy is to modify the network traffic in such a way that the censor is unable to censor it, but the client/server communication is unimpacted. — From the authors
The Github readme provides a pretty concise description of each of the building blocks.
- duplicate: takes one packet and returns two copies of the packet
- drop: takes one packet and returns no packets (drops the packet)
- tamper: takes one packet and returns the modified packet
- fragment: takes one packet and returns two fragments or two segments
It is important to note that because duplication and fragmentation “introduce branching, these actions are composed into a binary-tree structure called an action tree. The trigger describes which packets the tree should run on, and the tree describes what should happen to each of those packets when the trigger fires.” An example tree can be found here.
The action trees are combined to create full-fledged evasion strategies.
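A minimal sketch of what such an action tree might look like (the class name, packet representation, and tampering stand-in are my own simplifications, not Geneva’s actual code):

```python
# Branching actions (duplicate/fragment) produce two packets, and each
# packet is routed to one of the node's two subtrees.
class Action:
    def __init__(self, name, left=None, right=None):
        self.name = name
        self.left = left
        self.right = right

    def run(self, packet):
        if self.name == "drop":
            return []                    # packet is discarded
        if self.name == "tamper":
            return [packet + "*"]        # stand-in for modifying a field
        if self.name in ("duplicate", "fragment"):
            # Branching: two packets, one per subtree (missing subtree
            # means the packet is passed through unchanged).
            left = self.left.run(packet) if self.left else [packet]
            right = self.right.run(packet) if self.right else [packet]
            return left + right
        return [packet]                  # unknown action: pass through

# duplicate -> tamper the first copy, send the second copy untouched
tree = Action("duplicate", left=Action("tamper"))
result = tree.run("SYN")
# result == ["SYN*", "SYN"]
```

A trigger (not shown here) would decide which packets this tree fires on; the tree itself only describes what happens to a packet once it does.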
Fitness functions are crucial to all Genetic Algorithms. They determine what the algorithm considers good and bad, and ultimately direct what kinds of strategies are evolved. Just like the building blocks, designing this function is not trivial. The authors of this project chose a relatively simple function (which is great because it leaves room for many enhancements and modifications). In the documentation, they share the priorities behind it.
As mentioned earlier, the possible search space of solutions is effectively infinite, which makes cost-effective performance crucial. This is especially true because the system runs on local machines (not the large servers behind most ML these days), and it is pitted against the computing power of a state. This is why the authors put such a premium on pruning deadweight solutions early.
This hierarchy accomplishes a significant search space reduction. Instead of Geneva fuzzing the entire space of possible strategies (for which there are many!), it instead quickly eliminates strategies that break the underlying connection and encourages the genetic algorithm to concentrate effort on only those strategies that keep the underlying connection alive.
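This hierarchy can be sketched as a toy fitness function (the exact priorities are in Geneva’s documentation; the structure and numbers here are my own illustration): strategies that kill the underlying connection are punished immediately, so the search concentrates on connection-preserving strategies.

```python
# Hedged sketch of a hierarchical fitness function in the spirit of the
# description above; the specific weights are illustrative, not Geneva's.
def fitness(connection_alive, evaded_censor, num_actions):
    if not connection_alive:
        return -1000        # prune dead-connection strategies early
    score = 0
    if evaded_censor:
        score += 100        # the real goal: traffic got through
    score -= num_actions    # mild pressure toward simpler strategies
    return score

fitness(True, True, 2)      # 98: alive, evaded the censor, 2 actions
```

Because the dead-connection penalty dominates everything else, even a complicated censor-evading strategy can never outrank one that keeps the connection alive, which is exactly the search-space reduction described above.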
The fitness function creates learners with an emphasis on minimizing regret (regret here being compute wasted on useless solutions). This works out for them because the search space is so large that we can still hope to reach the best viable performance. I would be interested in trying runs that “reserve” slots for a few bad solutions; sometimes these introduce strong learners when mixed with strong candidates over many generations. Since the building blocks here are simple, it might not be very effective in this case, but I would love to see it.
This is an interesting example of adversarial design, where the learner learns by trying to break the censors. During the interview, Kevin talked about how the team was also trying to design censors themselves. It would be interesting to explore an automated censor that co-evolves with the evader. Not only is this likely to happen in the future, but such a setup would accelerate the process. This kind of adversarial approach has seen great success elsewhere in ML, GANs being a famous example.
Combining this with a setup where the results of all runs are linked back to a central system (as is done for self-driving cars) would be a great way to supercharge the evasion agent. It would also leverage scale very well. Make sure you look into the project and share your thoughts.
If you liked this article, check out my other content. I post regularly on Medium, YouTube, Twitter, and Substack (all linked below). I focus on Artificial Intelligence, Machine Learning, Technology, and Software Development. If you’re preparing for coding interviews check out: Coding Interviews Made Simple, my weekly newsletter. You can get the premium version for less than 0.5 USD/day. The premium version will unlock high-quality solutions to weekly coding problems, special discussion posts, and a great community. It has helped a ton of people in their preparations.
To help me write better articles and understand you, fill out this survey (anonymous). It will take 3 minutes at most and allow me to improve the quality of my work.
Feel free to reach out if you have any interesting jobs/projects/ideas for me as well. Always happy to hear you out.
For monetary support of my work, my Venmo and PayPal are below. Any amount is appreciated and helps a lot. Donations unlock exclusive content such as paper analysis, special code, consultations, and specific coaching:
Reach out to me
If you’d like to discuss tutoring, text me on LinkedIn, IG, or Twitter. Check out the free Robinhood referral link. We both get a free stock (you don’t have to put any money), and there is no risk to you. Not using it is just losing free money.
Check out my other articles on Medium: https://rb.gy/zn1aiu
My YouTube: https://rb.gy/88iwdd
Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y
My Instagram: https://rb.gy/gmvuy9
My Twitter: https://twitter.com/Machine01776819
If you’re preparing for coding/technical interviews: https://codinginterviewsmadesimple.substack.com/
Get a free stock on Robinhood: https://join.robinhood.com/fnud75