At Zoosk, the backend contains internal Java microservices that handle up to 400,000 requests per minute. Building new features and debugging production issues became troublesome because the codebase was polluted with anti-patterns. It did not help that the creators of the services left the company without leaving much documentation on why certain code behaved the way it did. The services were abstracted to handle every possible scenario, even ones that would never happen in production. “Simplicity over complexity, complexity over complicatedness”; otherwise one ends up like this:
I never said “WTF” more in my life than when I first looked at the Java code. As part of migrating our Java microservices at Zoosk to Amazon Web Services, we wanted to improve the quality of the code and reduce the number of “WTF”s developers would have to face. It had gotten to the point where people had a sour taste for the Java services.
It is a challenge for companies to update their old technology. The whole process is a pain and a huge investment, especially if the application is a giant monolith. Microservices that follow the Single Responsibility Principle (SRP) are simpler to refactor because they are not intertwined with other code paths of the application. Each one is responsible for one thing, and testing is simple.
The refactor started with investigating the frameworks and best practices other companies were adopting. That research informed our decisions about which aspects of the services needed to change. The issues we solved included updating the old stack, easing and automating the development process, improving API documentation, improving monitoring and alerting, simplifying the code, and evangelizing the new processes.
Updating the Old
Java 6 to Java 8
How: Because Java is backwards compatible, upgrading and updating dependencies was easy. Make sure unit tests are in place for regression testing of the upgrade.
Why: Falling behind on updates leads to missing out on the features and bug fixes of libraries that have moved to Java 8. By updating the Java version, we removed the annoying UnsupportedClassVersionError caused by the JVM and the compiled artifact having different Java versions.
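One way to see where an UnsupportedClassVersionError comes from is to read the version a compiled class declares in its header. The sketch below is illustrative only: `ClassVersionCheck` and its method names are hypothetical and not part of the Zoosk code base. It relies on the class file layout, in which a 4-byte magic number is followed by a 2-byte minor and a 2-byte major version (50 for Java 6, 52 for Java 8).

```java
import java.nio.ByteBuffer;

// Hypothetical helper, for illustration: reports which Java release
// a compiled .class file targets by reading its 8-byte header.
final class ClassVersionCheck {

    // Every class file starts with the magic number 0xCAFEBABE.
    private static final int CLASS_MAGIC = 0xCAFEBABE;

    /** Returns the class-file major version (50 = Java 6, 52 = Java 8). */
    static int majorVersion(byte[] classBytes) {
        if (classBytes.length < 8) {
            throw new IllegalArgumentException("Too short to be a class file");
        }
        ByteBuffer header = ByteBuffer.wrap(classBytes); // big-endian by default
        if (header.getInt() != CLASS_MAGIC) {
            throw new IllegalArgumentException("Not a Java class file");
        }
        header.getShort();                 // minor version, ignored here
        return header.getShort() & 0xFFFF; // major version as an unsigned value
    }

    /** Maps a major version to a Java release number (valid for Java 5 and later). */
    static int javaRelease(int majorVersion) {
        return majorVersion - 44;
    }
}
```

Passing the bytes of a class from the artifact (for example, read with `Files.readAllBytes`) shows whether it was compiled for a newer release than the JVM that is trying to load it.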
Converted custom messaging format to RESTful
Why: Following no standards in the code base led to a lot of confusion about how the services worked. Each service exposed a single POST endpoint, and the actual endpoints that needed to be called were found inside the body of the request. This is great for a batch call endpoint; too bad none of the services used batch calling. The endpoint just became a dumping ground for all calls, and working out what was called when an issue occurred was a nightmare. Instead of following HTTP standards, such as returning a 401 if a user is unauthorized, all services returned HTTP 200 even if there were errors in processing a request. There was a lot of home-brewed code to handle request processing that could have been replaced by Spring with a couple of annotations.
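To make the status code point concrete, here is a minimal, framework-free sketch of mapping a request outcome to an HTTP status instead of always answering 200. The `Outcome` enum and `StatusMapping` class are assumptions made for illustration; in a Spring service the framework would take care of most of this.

```java
import java.net.HttpURLConnection;

// Hypothetical outcome type and mapping, for illustration only.
final class StatusMapping {

    enum Outcome { SUCCESS, INVALID_REQUEST, UNAUTHORIZED, NOT_FOUND, SERVER_ERROR }

    /** Picks the HTTP status that matches the outcome of processing a request. */
    static int httpStatusFor(Outcome outcome) {
        switch (outcome) {
            case SUCCESS:
                return HttpURLConnection.HTTP_OK;             // 200
            case INVALID_REQUEST:
                return HttpURLConnection.HTTP_BAD_REQUEST;    // 400
            case UNAUTHORIZED:
                return HttpURLConnection.HTTP_UNAUTHORIZED;   // 401
            case NOT_FOUND:
                return HttpURLConnection.HTTP_NOT_FOUND;      // 404
            default:
                return HttpURLConnection.HTTP_INTERNAL_ERROR; // 500
        }
    }
}
```

With distinct status codes, clients and monitoring tools can tell failures apart without parsing the response body.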
How: Created a Swagger Spec of each service. Ran Swagger CodeGen against the Swagger Spec to create boilerplate code for the Spring Boot App with Swagger UI annotations.
Why: Contract changes for service endpoints involved updating the FogBugz pages containing the API documentation that clients used. Developers were not keeping the docs updated whenever the contract of an endpoint changed. Clients interacting with or adopting the endpoint would use stale documentation, leading to confusion and developer time spent debugging. With Swagger there is one source of truth for the API documentation. The docs are generated from the annotations in the codebase. Every change gets code reviewed, so a developer who modifies the contract of a service without updating the annotations to reflect the new contract would be caught.
How: Used Java VisualVM to profile the Spring Boot app while running load tests against the service. The results showed that the services had far more heap allocated than they needed.
Why: In the cloud, you are charged for what you provision. If the resources allocated to a service are underutilized, money goes flying into the trash. By optimizing the JVM to be more performant and use fewer resources, one can maximize the dollars spent. We wanted to instill the practice of not arbitrarily setting JVM values. Instead, we wanted to load test with predicted traffic and analyze how much CPU and memory the service actually needs.
Before, our Auth Service used an average of 989 MB of memory.
After the JVM changes, the service's average heap memory usage dropped to 85.9 MB with no noticeable performance impact.
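The same heap figures that a profiler displays can also be read from inside the JVM through the standard `java.lang.management` API. The sketch below is an assumption-labeled illustration, not the tooling we used; the `HeapReport` class is hypothetical.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Hypothetical helper, for illustration: summarizes heap usage of the running JVM.
final class HeapReport {

    private static final long BYTES_PER_MB = 1024L * 1024L;

    /** Snapshot of the current heap usage as reported by the JVM. */
    static MemoryUsage currentHeap() {
        return ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
    }

    /** Formats used, committed, and maximum heap in whole megabytes. */
    static String describe(MemoryUsage heap) {
        // getMax() returns -1 when no maximum is defined.
        String max = heap.getMax() < 0 ? "undefined" : (heap.getMax() / BYTES_PER_MB) + " MB";
        return String.format("heap used=%d MB, committed=%d MB, max=%s",
                heap.getUsed() / BYTES_PER_MB, heap.getCommitted() / BYTES_PER_MB, max);
    }

    public static void main(String[] args) {
        System.out.println(describe(currentHeap()));
    }
}
```

Sampling this during a load test gives a rough idea of how far the used heap sits below the configured maximum, which is the gap that heap tuning aims to close.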
Monitoring and Alerting
How: Converted from Java 6 JUL to SLF4J, removed log guards in the code base, created a standard logback.xml file for all services to consume, and sent logs to Elasticsearch.
Why: All logs used to be dumped into catalina.out. The worst part was that they were in a format that made root cause analysis hard, unlike other tiers at Zoosk, which had their logs standardized and shipped to Splunk for querying and alerting. Shipping logs to Elasticsearch using Fluentd in a standardized format allowed us to query the application logs in Kibana. SLF4J parameterized logs allowed us to remove the log guards from our code.
//Try debugging the issue of the service not adding a notification with these log statements
Jun 07, 2017 5:40:56 PM com.zoosk.service.feed.notification.cql.AddOperation processRequest
WARNING: null
Jun 07, 2017 5:40:57 PM com.zoosk.service.feed.notification.cql.AddOperation processRequest
WARNING: null
Jun 07, 2017 5:41:37 PM com.zoosk.service.feed.notification.cql.GetOperation handleV2
WARNING: null
//New format standardized to give time, location, log level, class, line number, and the entire message.
2017-10-10 11:10:38.416 [http-nio-12311-exec-3] ERROR i.s.api.NotificationsApiController (95) - All host(s) tried for query failed (tried: localhost/0:0:0:0:0:0:0:1:9042 (com.datastax.driver.core.exceptions.TransportException: [localhost/0:0:0:0:0:0:0:1:9042] Cannot connect), localhost/127.0.0.1:9042 (com.datastax.driver.core.exceptions.TransportException: [localhost/127.0.0.1:9042] Cannot connect))
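A log guard exists to avoid building an expensive message when its level is disabled. SLF4J makes the guard unnecessary with `{}` placeholders, as in `log.debug("Added notification {}", id)`, because the message is only formatted when the level is enabled. To keep the example free of dependencies, the sketch below swaps in `java.util.logging` with a message supplier, which defers the work in a similar way. The class, method names, and the counter are hypothetical and exist only to show the effect.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Level;
import java.util.logging.Logger;

// Illustration only: compares a guarded log call with a deferred one.
final class LazyLoggingDemo {

    private static final Logger LOG = Logger.getLogger(LazyLoggingDemo.class.getName());

    // Counts how often the (pretend) expensive message is actually built.
    static final AtomicInteger messagesBuilt = new AtomicInteger();

    static String expensiveDescription(String notificationId) {
        messagesBuilt.incrementAndGet();
        return "Adding notification " + notificationId;
    }

    /** Old style: an explicit guard around the log call. */
    static void guarded(String notificationId) {
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine(expensiveDescription(notificationId));
        }
    }

    /** Guard-free style: the logger checks the level before invoking the supplier. */
    static void deferred(String notificationId) {
        LOG.fine(() -> expensiveDescription(notificationId));
    }

    /** With FINE disabled, neither style should build the message. */
    static int buildsAtInfoLevel() {
        LOG.setLevel(Level.INFO);
        messagesBuilt.set(0);
        guarded("n-1");
        deferred("n-2");
        return messagesBuilt.get();
    }
}
```

Both styles skip the work when the level is off, but the deferred form does it without the extra `if` around every log statement.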
How: Replaced the custom JMX metric logging framework with the Spring Boot Actuator metrics framework. We used Telegraf to funnel the data into InfluxDB to be visualized in Grafana. New Relic APM instrumentation was added to each service.
Why: By removing the custom JMX framework created at Zoosk, we were able to reduce the amount of code we needed to support in favor of using Spring to manage our application metrics. Visualization load time in Grafana decreased compared to when we used Ganglia. New Relic has provided us with application performance monitoring for all transactions that are served, and its APM product has made my life a whole lot easier when debugging issues in production.
Why: The only time an alert was raised was when a service was down and a feature on the site stopped working. We now have the data to create rules that recognize these patterns and catch failures before they happen. For example, we can alert if CPU or memory usage is over 80% or if we see N error logs. Instead of being reactive with our services, we became proactive with alerting in place.
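The kind of rule described above can be sketched as a simple threshold check. This is a hypothetical illustration of the logic only: the `AlertRule` class and its parameters are assumptions, and in practice such rules live in the monitoring stack rather than in service code.

```java
// Hypothetical alert rule, for illustration: fires on high utilization
// or on too many error logs within one evaluation window.
final class AlertRule {

    private final double maxUtilization; // fraction between 0 and 1, e.g. 0.80
    private final int maxErrorLogs;      // the "N" error logs per window

    AlertRule(double maxUtilization, int maxErrorLogs) {
        this.maxUtilization = maxUtilization;
        this.maxErrorLogs = maxErrorLogs;
    }

    /** Returns true when any threshold is crossed. */
    boolean shouldAlert(double cpuUtilization, double memoryUtilization, int errorLogCount) {
        return cpuUtilization > maxUtilization
                || memoryUtilization > maxUtilization
                || errorLogCount >= maxErrorLogs;
    }
}
```

For instance, `new AlertRule(0.80, 10).shouldAlert(0.85, 0.40, 0)` returns true because CPU usage is above 80%, even though memory and error counts are fine.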
Why: The quality of the Java services was inconsistent. Services had complex code and formatting errors that a linter would have prevented. A code coverage tool revealed that our unit test suites covered only 10% of the code base. Services sometimes contained no Javadoc, which JAutoDoc can generate for you. Using these tools gave all services a standardized level of code quality with minimal effort.
Why: With code coverage we are able to find branches of code that are not tested. One could have hundreds of unit tests, but if the tests only cover 15% of the code base, that is not as good as five unit tests that cover 90% of the code. It’s not about the quantity but the quality. In our case it was zero percent code coverage. The existing unit tests were being skipped or were broken. No insight could be provided on whether a newly developed feature broke existing behaviors of a service. In this model, our developer cost for implementing a feature increased because bugs catchable by a unit test were not caught until they hit production.
How: Created Dockerfiles for each service to integrate with the Zoosk Docker framework.
Why: The majority of tiers at Zoosk are in Docker containers, and the internal tools used to ease the development process involve containerized applications. We decided to follow the standard to allow for easier deployment and testing of Java services. QA testing a feature on Zoosk requires a QA VM with all containerized apps tagged with the feature name. All of that setup takes one crane command. Because Java services were not containerized, we had to create the artifact and set up the service in the QA VMs ourselves. Now all the QA person has to do is run the same crane command, with no special setup for the Java services. By moving to dockerized services we were able to leverage the orchestration service Amazon Elastic Container Service and reduce the amount of dev-ops support. For autoscaling, it is generally faster to spin up a container than it is to spin up an EC2 instance off an AMI.
How: Converted the Java WAR file deployed to Tomcat into a standalone Spring Boot JAR.
Why: Testing changes for our Java service required us to SCP the artifact to our development VM, put it in a specific location under a specific name, and restart Tomcat in order to get the service into a runnable state. Following the steps to run a service was confusing and a hassle, but with Spring Boot I can now run it on my local machine, development VM, or wherever I want, all with one command.
How: Adopt open source frameworks such as Spring and Dropwizard to replace custom code.
Why: By shifting to open source one can reduce the amount of code to manage and get new features from these frameworks without having to develop them. Developers around the world use these frameworks. Bringing someone up to speed for developing these services is easier than getting them to learn a custom homebrew framework that is not actively maintained because the guy who made it left.
Created a system for going from zero to cloud
How: I took the first stab at the refactoring process and documented it for one service. Members of the team each took a service, followed the guide, and updated it with their experience refactoring and migrating the service to the cloud.
Why: The guide documents the development steps that developers, new and old, follow to get a standardized service to the cloud. It was a team-developed document that integrated feedback from everyone who will be working on Java services. The guide improved the speed and estimation accuracy of developing new services in the cloud.
Making the changes to our Java microservices was tech debt that needed to be addressed now instead of later. Every day we put it off added more dollars to the cost of tackling the task. We were able to simplify and update the code base. We implemented best practices to improve the entire development process of our services. We now have better visibility into the health of our Java microservices and a standardized quality for each service we ship. Developers can now develop, test, and ship a Java microservice to AWS with ease.