A bug of six months, finally solved!
Yes, the one constant that is a nightmare for every engineer and developer out there. The one thing capable of driving you completely mad and, in the worst cases, even making you question your own intellect.
I am not a bright guy. I write bad code. Some of the code I wrote six months ago truly embarrasses me now. But one thing I always try to do is not make the same mistake over and over again. And as you keep writing more and more code, your code gets better on its own. So what's the context here? Here's the story!
After I joined the Engineering team from the Analytics team here at Intelligent Machines, I started reading through our code far more than ever before. Most of our code was deployed to the cloud in different ways under different strategies. My foremost concern at the time was to standardize and automate all of it, so that our team could focus on writing code rather than deploying it. Since the team was already familiar with containerization, the next steps toward building a service-oriented architecture were easy.
We had started using Istio as a service mesh a while back in our organization. The learning curve was steep and not pleasant at all, but the payoff was rewarding: it opened up a bunch of options for us and enabled many things infrastructure-wise. Some time later, we adopted Terraform, an infrastructure automation tool, and published three packages to the Terraform registry that we used internally for different infrastructure automation. We were using GitHub Actions for our CI/CD pipeline. Everything worked perfectly; we were very happy.
Looks like a win so far, right? It certainly was. I had never felt so happy as when I tied all those things together into a single thread.
But for the last six months or more, I could not solve one problem that was eating at my core. Did I find a workaround for the bug right away?
Yes! I needed to keep our application running. But the workaround broke our automation, which had been my first priority. The statement of the bug was simple.
“We cannot run DB migrations in our CI using Kubernetes Jobs/Deployments”
Well, this is only partially true, because we could run migrations against our Cloud SQL databases. The problems started when we migrated our workloads to Azure. The migration process itself was peaceful, but the bug was there. So my initial workaround was simple.
- Just log in to the container after deployment and run the migration manually from one of the pods. As the backend team was small and DB changes were not that frequent, we lived with it for a while.
Our application deployment strategy was simple. Most of our applications consist of some Deployments to run the API, some Secrets for config, and some Services to expose them. We use an Istio Gateway and VirtualServices for routing.
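For readers unfamiliar with Istio routing, here is a minimal sketch of what such a setup typically looks like. All hosts, names, and ports below are placeholders, not our actual config:

```yaml
# A Gateway accepting HTTP traffic at the mesh edge, and a VirtualService
# routing it to the API's Kubernetes Service. Names/hosts are hypothetical.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: api-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
    - port:
        number: 80
        name: http
        protocol: HTTP
      hosts:
        - "api.example.com"
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: api
spec:
  hosts:
    - "api.example.com"
  gateways:
    - api-gateway
  http:
    - route:
        - destination:
            host: api        # the Kubernetes Service name
            port:
              number: 8080
```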
#1 Kubernetes Job way
As a DB migration is always a one-time thing and we did not want to keep a pod running for it, we first tried running the migration in a batch Job so that the pod would terminate when the job was done. But we moved away from that, because a Helm deployment cannot upgrade Job definitions on the fly (a Job's pod template is immutable), and it is not very wise to delete all Jobs in a namespace or hard-code multiple Job names to delete.
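A migration Job of this kind might look roughly like the sketch below. The image, command, and Secret name are all assumptions for illustration, not our real manifests:

```yaml
# One-shot migration Job: the pod runs the migration and terminates.
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration          # hypothetical name
spec:
  backoffLimit: 1
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: migrate
          image: registry.example.com/api:latest   # placeholder image
          command: ["npm", "run", "migrate"]       # assumed migration command
          envFrom:
            - secretRef:
                name: db-credentials               # assumed Secret with DB config
```

The catch is that `spec.template` of a Job is immutable once created, which is why `helm upgrade` cannot simply replace it in place.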
#2 Kubernetes Deployment way
So the next step was to convert the Jobs into Deployments and tweak the container's command so that it runs the migration first and then boots an API server like the other pods for that service, which simply increases the API pod count by one. The reasoning: if the pod is going to run anyway, let it run in a meaningful way and serve requests.
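The command tweak can be sketched as below; again, the image and the `migrate`/`serve` commands are hypothetical stand-ins:

```yaml
# Deployment variant: migrate first, then start the API server so the
# pod keeps serving traffic instead of idling after the migration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-migrator
spec:
  replicas: 1
  selector:
    matchLabels:
      app: api-migrator
  template:
    metadata:
      labels:
        app: api-migrator
    spec:
      containers:
        - name: api
          image: registry.example.com/api:latest   # placeholder image
          command: ["sh", "-c", "npm run migrate && npm run serve"]
```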
On GCP, the DB migration ran perfectly, as most of our Cloud SQL databases were connected via the Cloud SQL proxy. On Azure, our database setup was simple: the DB could only be accessed by services deployed on Azure.
At this stage, I was blinded by Istio. I could not see the obvious thing that other engineers would have seen. I tried running migrations many times in a namespace where istio-injection was enabled. Every pod in that namespace was booted with a sidecar proxy named istio-proxy, and the migration was not running. I stumbled upon the Service Entry documentation for Istio and wrote my own external mesh ServiceEntry so that the connection to my DB, hosted outside the cluster, could go over TCP. It failed!!
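For context, a ServiceEntry of the kind I wrote looks roughly like this (the host and port are hypothetical; an Azure PostgreSQL host is shown only as an example):

```yaml
# Declare the external database to the mesh so sidecars can route TCP to it.
apiVersion: networking.istio.io/v1beta1
kind: ServiceEntry
metadata:
  name: external-db
spec:
  hosts:
    - mydb.postgres.database.azure.com   # hypothetical external DB host
  location: MESH_EXTERNAL
  ports:
    - number: 5432
      name: tcp-postgres
      protocol: TCP
  resolution: DNS
```

In theory this registers the external host with the mesh; in our case, it did not fix the migration.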
I contacted Azure support about this; I had had a fantastic experience with them earlier. As usual, they helped me verify that the connection from my pods to the DB was fine. We exchanged a few emails after that, but the problem persisted.
This time, I had had enough. I didn't want this thing to prolong my suffering any further. We tried changing the meshConfig to allow any outgoing connections from our service mesh, but it was all for nothing. Then I dug into an issue on the official Istio repository titled "Istio sidecar not allowing connection to the database". No luck yet! After some time I found this gem, which finally made me look elsewhere: "Connection to external service with ALLOW_ANY Still drop the connection".
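The meshConfig change we tried amounts to the snippet below (shown here in IstioOperator form; how you apply it depends on how Istio was installed):

```yaml
# Mesh-wide outbound policy: let sidecars pass traffic through to
# external hosts that are not registered in the mesh.
meshConfig:
  outboundTrafficPolicy:
    mode: ALLOW_ANY
```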
It states that Envoy, upon which the istio-proxy container is based, takes a variable amount of time to boot, so our container might attempt the migration before istio-proxy is ready to route the request to an external party. The author of the issue suggested running such deployments in a namespace where istio sidecar injection is disabled. So I moved the Deployment to another namespace and ran it. It failed! It turned out my Secrets were still in the old namespace, which the Deployment could no longer access. Classic rookie mistake! I added the Secret to the new namespace as well, and it finally ran successfully!
I know, the solution seems very simple in the end: just move the workload to a namespace without sidecar injection enabled. I am so happy to have gotten to the bottom of this that it compelled me to write this article at 2 AM. Now my services can be truly automated again, and I can sleep in peace knowing it's resolved.
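Besides moving the workload to a namespace that lacks the `istio-injection=enabled` label, Istio also supports opting a single workload out of injection via a pod-template annotation; a sketch (the Deployment fragment is hypothetical):

```yaml
# In the migration Deployment's pod template: tell the sidecar injector
# to skip this workload even in an injection-enabled namespace.
template:
  metadata:
    annotations:
      sidecar.istio.io/inject: "false"
```

The per-pod annotation avoids the Secret-across-namespaces problem I ran into above, since the workload stays in its original namespace.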
It looks like the Istio team is still not working on this; the issue was automatically closed by istio-bot on their GitHub repository. I am currently running Istio 1.7 and the proxy bug still persists. The only way to run these migrations is without a sidecar.
I hope this reminds each of us to look at things more simply, and saves you from a lot of head-scratching moments.
Stay bug-free and virus-free in these COVID times. Happy coding!!!