Maintaining a large distributed service-oriented platform is hard. As the organization and the architecture grow and more features are added, it becomes increasingly difficult to keep a mental picture of all the services and the pockets of knowledge that form across the organization. Manually drawn architecture diagrams go stale quickly, which makes it hard for new employees to get up to speed on all the services. It also makes auditing service endpoints for security purposes impossible, since our security engineers are unfamiliar with the codebases. To address these issues, we at Knewton have been building Kizceral, a project that came out of a hack day. Kizceral is a dynamic, real-time service dependency visualizer based on Vizceral, leveraging tracing data generated by our in-house distributed tracing library, TDist.
How it Works
Kizceral consists of two main parts. The first is a Java backend responsible for gathering tracing data and building a dependency graph. The second part is a React frontend used to visualize the graph.
When a user loads Kizceral, a request is made to the backend to build the service dependency graph. The graph is built by querying Zipkin for the list of service names and a list of spans for each of those services. The Kizceral backend then walks the spans (following the Dapper data model, explained here) and builds links by connecting each target service with its direct dependencies. The graph inference makes a best effort in cases where only partial tracing data is available (e.g., missing annotations or Kafka topics). A graph representation is then sent back to the UI along with all the RPC methods that each service exposes. Because this graph is built in real time and includes current traffic information, it differs from what Zipkin can generate using its bundled offline aggregation job.
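To make the graph inference concrete, here is a minimal sketch, with hypothetical types that are not Knewton's actual code, of how a dependency graph can be derived from Dapper-style spans: each span records the service that produced it and its parent span, so linking a span's service to its parent's service yields a directed dependency edge.

```java
import java.util.*;

public class DependencyGraphBuilder {

    // A drastically simplified span: id, optional parent id, and the
    // name of the service that emitted it.
    record Span(String id, String parentId, String service) {}

    // Returns a map of caller service -> set of callee services.
    public static Map<String, Set<String>> build(List<Span> spans) {
        // First pass: resolve which service owns each span id.
        Map<String, String> serviceBySpanId = new HashMap<>();
        for (Span s : spans) {
            serviceBySpanId.put(s.id(), s.service());
        }
        // Second pass: each child span links its parent's service
        // (the caller) to its own service (the callee).
        Map<String, Set<String>> graph = new HashMap<>();
        for (Span s : spans) {
            if (s.parentId() == null) continue; // root span: no caller
            String caller = serviceBySpanId.get(s.parentId());
            // Skip partial data (unknown parent) and in-process spans.
            if (caller == null || caller.equals(s.service())) continue;
            graph.computeIfAbsent(caller, k -> new HashSet<>())
                 .add(s.service());
        }
        return graph;
    }
}
```

The "skip unknown parent" branch mirrors the best-effort handling above: when tracing data is incomplete, the builder simply omits the edge rather than failing.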
Here is a simplified view of all the dependencies:
The frontend is written in React and combines information from multiple sources in our infrastructure, allowing users to view fine-grained details about each service as they explore our architecture. For example, from a service node in the UI, a user can view:
- The version and latest deployed code change of that service from version control
- The number of instances and CPU / memory allocated from Marathon
- Service logs on Mesos
- RPC methods the service exposes
Here is an example architecture diagram:
A user can select a service node and get a simplified view with all the upstream and downstream dependencies.
Kizceral is a monitoring app, so it is designed to have as little impact as possible on other running services. Since thousands of traces are generated every second during peak traffic, we added a caching layer for the graph, the list of services and their details, and each service's RPC methods. This was especially important at Knewton because the database cluster that stores tracing data is multitenant: building an accurate dependency graph requires querying large amounts of data in real time, which can stress the database. For example, fetching the trace list of a service with long-running requests, each containing several thousand sub-requests, could cause severe latencies in the database, making caching essential.
Our solution was a cache that dynamically adjusts refresh rates and the number of traces fetched when querying service RPC data. Services that publish longer or more complicated spans are polled less often, since querying them adds more load on our databases.
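The adaptive refresh idea could be sketched as follows. This is an illustrative toy, not Knewton's actual implementation: the scaling heuristic, names, and intervals are all assumptions. Services whose trace lists come back large, a rough proxy for query cost, get a longer interval before the next refresh.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.function.Function;

public class AdaptiveTraceCache {

    // Per-service cached traces and the earliest time each may be refreshed.
    private final Map<String, List<String>> cachedTraces = new ConcurrentHashMap<>();
    private final Map<String, Long> nextRefreshAt = new ConcurrentHashMap<>();

    // Base refresh interval; expensive services are polled less often.
    private static final long BASE_INTERVAL_MS = 30_000;

    public List<String> getTraces(String service,
                                  Function<String, List<String>> query) {
        long now = System.currentTimeMillis();
        Long due = nextRefreshAt.get(service);
        if (due == null || now >= due) {
            // Hit the trace store only when this service's entry is stale.
            List<String> traces = query.apply(service);
            cachedTraces.put(service, traces);
            // Heuristic: stretch the interval with trace volume, capped
            // at 10x the base, so heavy services are queried less often.
            long factor = Math.min(10, 1 + traces.size() / 1000);
            nextRefreshAt.put(service, now + BASE_INTERVAL_MS * factor);
        }
        return cachedTraces.getOrDefault(service, List.of());
    }
}
```

Repeated calls within the refresh window return the cached list without touching the database, which is the property that protects the multitenant cluster.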
While developing Kizceral's UI, we also noticed and fixed a small bug in Vizceral's rendering and pushed the fix upstream: rendering on HiDPI displays was blurry. Since Vizceral uses Three.js and WebGL under the hood, we traced the problem to the device pixel ratio not being set properly. The end result: crystal-clear architecture diagrams!
We would like Kizceral to become a valuable one-stop, bird's-eye view of our entire platform. To achieve that, we would like to add some features:
- Ability to visualize errors and warnings as they happen in real time. There are a few ways of doing this, but we think the best approach would be to add error annotations inside the tracing payload generated by TDist. On error, we could generate the receive or send annotations for either client or server, with an additional annotation capturing the error or warning. Since these additional annotations will be stored by Zipkin, the Kizceral backend will be able to query them and forward them to the UI. The UI could then color nodes red or orange depending on severity.
- Ability to add cost information, to help non-technical teams get an accurate picture of the platform's cost per student, partner, or path through the platform. Surfacing that in a tool like Kizceral would also help bridge the gap between non-technical teams and the technical teams building the services.
- Ability to SSH to a service instance directly from the Kizceral UI with a simple click on a service node.
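The error-annotation idea in the first item above could look something like the sketch below. All names here are hypothetical, not part of Zipkin's or TDist's actual schema: alongside the usual client/server send/receive annotations, an extra annotation records the error and its severity, which the UI maps to a node color.

```java
import java.util.*;

public class ErrorAnnotations {

    enum Severity { WARNING, ERROR }

    // A bare-bones key/value annotation, as stored on a span.
    record Annotation(String key, String value) {}

    // On failure, emit the normal lifecycle annotation plus two extra
    // annotations capturing the error and its severity.
    static List<Annotation> annotateServerError(Severity severity, String message) {
        return List.of(
            new Annotation("ss", "server send"),               // standard Zipkin-style annotation
            new Annotation("error.severity", severity.name()), // hypothetical extra annotation
            new Annotation("error.message", message));         // hypothetical extra annotation
    }

    // The UI could color a node by the worst severity seen on its spans.
    static String nodeColor(Severity severity) {
        return severity == Severity.ERROR ? "red" : "orange";
    }
}
```

Because the extra annotations ride along in the same span payload, no new storage path is needed: the backend queries Zipkin as it already does and simply forwards any `error.*` annotations it finds to the UI.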