Have you ever wondered what happens under the hood when we make a simple Google search? Or how Google makes sure that we get all those accurate search results within a snap of the fingers? Ever thought about how they manage such large-scale distributed systems?
Before anything else, what the heck is a distributed system?
A distributed system is basically a network of autonomous systems/servers connected by a middleware layer so they can share resources, capabilities, files, and so on. The goal is to make the entire network work as a single computer.
Distributed Programming is the art of solving the same problem that you can solve on a single computer using multiple computers — Mikito Takada
Let me explain with the example of a simple Google search. Whenever we make a search on Google, a front-end service may distribute the query to hundreds of query servers, each searching within its own index. The query may also be sent to a number of other subsystems which process advertisements, check spelling, or look for specialized results like images and videos. Results from all these services are selectively combined into the results page. This model is called “Universal Search”. In general, thousands of machines and many different services might be needed for one universal search.
On a monthly basis, Google handles over a billion search requests. To handle these requests, Google uses over a million computers.
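To make the fan-out idea concrete, here is a minimal sketch in Python. The subsystem functions and their names are purely illustrative, not Google's actual services; the point is just that one front end dispatches the query to many subsystems in parallel and selectively combines their results.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical subsystems a front-end service might fan a query out to.
def web_index(query):
    return [f"web result for {query}"]

def image_search(query):
    return [f"image result for {query}"]

def spell_check(query):
    return []  # no spelling suggestions for this query

SUBSYSTEMS = [web_index, image_search, spell_check]

def universal_search(query):
    # Fan the query out to every subsystem in parallel,
    # then combine the partial results into one results page.
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda fn: fn(query), SUBSYSTEMS)
    return [result for partial in partials for result in partial]
```

In the real system each call would be an RPC to a separate server (or hundreds of them), but the combining step looks much the same.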
Need for a Tracing Infrastructure in Distributed Systems
In such a complex system, poor performance in any subsystem would affect the entire search cycle. An engineer looking at the overall latency would know there is a problem but may not be able to guess which service is at fault, since new services are added or removed every day and a small issue in one application can hurt the performance of others. Moreover, each service may have been developed by a different team, perhaps in a different language. Taking all this into account, a system or tool that helps us understand system behavior and reason about latency would be invaluable.
What’s so special about Dapper?
Although there are other tracing systems like Magpie and X-Trace, certain design choices made in Dapper have been the reason for its huge implementation success. Any tracing system for a distributed system needs to record information about all the work done in the system on behalf of a given initiator.
The image above shows a simple distributed system of five nodes and the path a request takes through the system. RPC stands for Remote Procedure Call: in distributed systems, an RPC is when a program causes a procedure to execute on another system on a shared network. For tracing, Dapper does not rely only on annotation-based monitoring schemes, which assume there is no additional information beyond the recorded message identifiers and timestamped events for each request and RPC execution. Instead, it also uses trees, spans, and traces.
The tree's nodes are spans, and its edges indicate the relationship of each span to its parent span. Dapper records a human-readable span name, a span id, and a parent id for each span. The image above shows the relationships of the five spans shown earlier as recorded by Dapper. A span can also contain information from multiple hosts. The image below shows a detailed view of a single span.
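A rough sketch of this span model in Python may help. The field and span names below are my own illustrations (the paper does not prescribe this exact layout): each span carries a name, its own id, and its parent's id, and the trace tree can be recovered by grouping spans by parent.

```python
from dataclasses import dataclass, field
from typing import Optional, List

@dataclass
class Span:
    """One node of a trace tree, in the spirit of Dapper's span."""
    name: str                        # human-readable span name
    span_id: int                     # this span's id
    parent_id: Optional[int] = None  # None marks the root span of the trace
    annotations: List[str] = field(default_factory=list)

def build_tree(spans):
    """Group spans by parent id to recover the parent/child edges."""
    children = {}
    for s in spans:
        children.setdefault(s.parent_id, []).append(s)
    return children

# A tiny trace: one front-end span with two children, one of which
# itself calls a helper (names are hypothetical).
spans = [
    Span("Frontend.Request", 1),
    Span("Backend.Call", 2, parent_id=1),
    Span("Backend.DoSomething", 3, parent_id=1),
    Span("Helper.Call", 4, parent_id=3),
]
tree = build_tree(spans)
```

Here `tree[1]` holds the two child spans of the root, mirroring the edges in the image above.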
The Dapper collection and logging process has three stages. First, span data is written to local log files. Second, it is pulled by the collection infrastructure from all production hosts. Finally, it is written to a cell in a Dapper Bigtable (a high-performance data storage system built by Google) repository. A trace is laid out as a single Bigtable row, with each column corresponding to a span.
The median latency of the trace data collection pipeline is 15 seconds. Dapper also provides an API to the trace data in the repository, which developers at Google use to build general-purpose and application-specific analysis tools.
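The row-per-trace, column-per-span layout can be mimicked with a plain dictionary. This is only a shape sketch, not real Bigtable access, and the `write_span` helper and the ids in it are hypothetical:

```python
# One "row" per trace id; within a row, one "column" per span id.
repository = {}

def write_span(trace_id, span_id, span_data):
    """Append one span into its trace's row, as in the layout above."""
    row = repository.setdefault(trace_id, {})
    row[span_id] = span_data

write_span("trace-42", "span-1", {"name": "Frontend.Request"})
write_span("trace-42", "span-2", {"name": "Backend.Call", "parent": "span-1"})

# Reading back a whole trace is a single-row lookup:
trace = repository["trace-42"]
```

Keeping every span of a trace in one row means a full trace can be fetched with a single row read, which is part of why this layout works well for analysis tools.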
I hope you now have an idea of how the tech giant Google monitors and traces its entire distributed network to understand latency issues and resolve them, all for one reason: to improve its user experience. Always keep a customers-first attitude to build great products. Cheers!
This article is a summary of the research paper “Dapper, a Large-Scale Distributed Systems Tracing Infrastructure”. You can find the entire research paper over here.