Building the Google Photos Web UI

A peek under the hood

Published in Google Design · 27 min read · Jul 10, 2018

A few years ago I had the privilege of being an engineer on the Google Photos team and part of the initial launch in 2015. A lot of people contributed to the product — designers, product managers, researchers, and countless engineers (across Android, iOS, Web, and the server) to name just some of the major roles. My responsibility was the web UI, and more specifically the photo grid.

We wanted to try something ambitious and simultaneously support a full-width (justified) layout, preserve the aspect ratio of each photo, be scrubbable (ie let you jump to any section of your archive), handle hundreds of thousands of photos, scroll at 60fps, and load near instantly.

At the time no other photo gallery supported all of this, and to the best of my knowledge none do even now. While many other galleries now support some of these features, they usually square-crop each photo to make the layout work.

Here is a technical write-up of how we solved those challenges, and a peek under the hood of how the web version of Google Photos works.

Why was this hard?

The two biggest challenges both come down to size.

The first size challenge is that for users with large photo collections (and some users have over a quarter-million photos uploaded) there is simply too much metadata. Sending even minimal information (the photo URLs, width, height, and timestamps) is many megabytes of data for a full collection, which would directly run counter to our goal of near-instant loading.

The second size challenge is the photos themselves. With modern HDPI screens, even a small photo thumbnail is often 50KB or more. A thousand thumbnails may be 50 megabytes, and not only is that a lot to download, if you try to place them all in the webpage immediately you can slow down the browser. The old Google+ Photos would become sluggish after scrolling through 1000–2000 photos and Chrome would eventually crash the tab after loading 10000.

Let’s break down the discussion into the various parts (you can search for the bolded title to go straight there).

1. Scrubbable Photos: the ability to quickly jump to any part of the photo library.
2. Justified Layout: fill the width of the browser and preserve the aspect ratio of each photo (no square crops).
3. 60fps Scrolling: ensuring the page remains responsive even when looking at many thousands of photos.
4. Instantaneous Feel: minimize the time waiting for anything to load.

1. Scrubbable Photos

There are a few approaches to dealing with large collections. The oldest is probably pagination, where you show a fixed number of results and click “next” to see the subsequent batch, ad nauseam. A more popular modern approach is infinite scrolling, so named because, while you only load a fixed number of results at first, as you scroll closer to the end you automatically fetch the next batch and insert it in the page, repeatedly. If done well, the user can keep scrolling continuously (seemingly infinitely) without ever needing to stop.

Both pagination and infinite scrolling share a similar drawback in that you need to load all the earliest content to get to the end, so finding that photo from years ago can become a chore.

For a normal document the scrollbar functions how you’d expect, and better still you can grab it and quickly skip over entire sections to jump straight to the end, or to any other point. With pagination the scrollbar will hit the bottom of the page (but not the library), and for an infinitely scrolling page the scrollbar is always changing: if you drag to the end you’ll notice it scoot back a little as the page grows longer.

A scrubbable photo grid presents a third option, one where the scrollbar behaves properly.

To support jumping to any section of photos we need to pre-allocate space on the page so that the scrollbar is representative. That would be relatively easy if we had the information for all photos, but as it’s too much to send, we need to do so in a different way.

This is where many of the other scrubbable galleries take a shortcut — they crop every photo into an identical square shape. This way, you only need to know the total number of photos to calculate the entire page layout: for a given square size you can trivially take the width of the viewport and use it to calculate the number of columns and rows:
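A minimal sketch of that calculation, assuming a hypothetical `squareGridSize` helper and ignoring the margins between tiles:

```javascript
// Hypothetical helper: sizes a grid of identical square tiles.
function squareGridSize(photoCount, viewportWidth, tileSize) {
  const columns = Math.max(1, Math.floor(viewportWidth / tileSize));
  const rows = Math.ceil(photoCount / columns);
  const height = rows * tileSize;
  return { columns, rows, height };
}

// 1000 photos, a 1120px-wide viewport, and 160px tiles.
const squareGrid = squareGridSize(1000, 1120, 160);
// squareGrid is { columns: 7, rows: 143, height: 22880 }
```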

For only three lines of code the sizing is done, and then to render and position the photos it’s barely a dozen more.

The approach we came up with to reduce the initial size of the metadata was to subdivide a user’s collection of photos into discrete sections, and on initial load send the sections and counts. For example, a simple way to section photos would be by month — you can count these on the server (or pre-compute them) and even for millions of photos spanning decades it is still a trivial amount of data. A very basic representation of the data may look like this:
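As a rough illustration only (the field names and counts below are made up, not the real payload), a per-month section list might be shaped like this:

```javascript
// Hypothetical initial payload: one entry per section, newest first.
const sectionCounts = [
  { id: "2018-06", count: 214 },
  { id: "2018-05", count: 1350 },
  { id: "2018-04", count: 87 },
];

// The client can total the library without knowing anything about individual photos.
const totalPhotos = sectionCounts.reduce((sum, section) => sum + section.count, 0);
// totalPhotos is 1651
```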

In extreme cases this could still have problems for users who take a lot of photos in a given month (eg professional photographers) — the goal of sections is to reduce every bucket to a manageable amount of metadata, but for heavy users a month may still contain many thousands of photos (and thus megabytes of data). Thankfully the clever infrastructure team outdid themselves and built a sophisticated solution that takes all sorts of things into account (eg geolocation, proximate timestamps, etc…) creating custom sections for each user.

The photo grid is divided into sections, segments, and tiles.

With that information, the client can estimate how much space each section will need and put a placeholder into the DOM, and then, when the user scrolls, quickly retrieve the corresponding photo metadata from the server, compute the full layout, and update the page.

On the client, once we have the metadata for a section, we actually go one step further and segment the photos in each section into the individual days. We discussed dynamic segmentation (eg by location, person, date, etc…) which could still make a great future feature.

Estimating the section sizes turned out to be remarkably easy. You can just take the count of photos for a section and multiply it by a best guess at a normal aspect ratio:
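A minimal sketch of such an estimate, assuming a 3:2 landscape aspect ratio as the best guess (the helper name and the default ratio are assumptions for illustration):

```javascript
// Hypothetical helper: guesses a section's height from its photo count alone.
function estimateSectionHeight(photoCount, viewportWidth, targetRowHeight, assumedAspectRatio = 3 / 2) {
  // Total width if every photo sat side by side at the target row height.
  const unwrappedWidth = photoCount * targetRowHeight * assumedAspectRatio;
  // Wrap that strip into rows as wide as the viewport.
  const rows = Math.ceil(unwrappedWidth / viewportWidth);
  return rows * targetRowHeight;
}

const estimatedHeight = estimateSectionHeight(300, 1120, 180);
// 300 photos -> 73 rows -> 13140px
```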

You may be asking yourself how this could possibly be accurate. The truth is it isn’t, not even close.

It’s fortunate that I initially overthought this part of the problem (for reasons I’ll explain in the Layout section) but it turns out that you don’t need the estimate to be very good (and for large numbers of photos it is often wrong by tens of thousands of pixels). The only important thing is that the estimate is vaguely representative so that the scrollbar keeps the appearance of accuracy.

The simple trick is that when you finally load a section you calculate the difference between the estimated height and the actual height. If there is a difference you simply shift all the sections below it by that amount.

If you’re loading sections that are above the scroll point you may also need to update the scroll position. However, all of that can be done in the merest fraction of a second, within a single animation frame, so that there is no perceivable difference to the user.
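The correction described above can be sketched as follows. The section shape (`top`, `height`) and the function name are assumptions, and a real grid would also write the new positions to the DOM within the same frame:

```javascript
// Hypothetical shape: sections is an array of { top, height } sorted by top.
function applyActualHeight(sections, index, actualHeight, scrollTop) {
  const section = sections[index];
  const oldBottom = section.top + section.height;
  const delta = actualHeight - section.height;
  section.height = actualHeight;
  // Shift every section below the one that changed by the same amount.
  for (let i = index + 1; i < sections.length; i++) {
    sections[i].top += delta;
  }
  // If the resized section was entirely above the scroll point, keep the
  // same photos on screen by moving the scroll position as well.
  return oldBottom <= scrollTop ? scrollTop + delta : scrollTop;
}

const demoSections = [
  { top: 0, height: 1000 },
  { top: 1000, height: 2000 }, // estimated height
  { top: 3000, height: 500 },
];
// The middle section turns out to be 2300px tall while we are looking at the third.
const newScrollTop = applyActualHeight(demoSections, 1, 2300, 3100);
// demoSections[2].top is now 3300 and newScrollTop is 3400
```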

2. Justified Layout

All justified image layouts that I know about use an ingenious, if relatively simple, approach: they accept it is okay to have a grid of varying row heights. All the photos in a single row are scaled to the same height, and every row is the same width, but any two rows may vary in height, and the difference is usually not very noticeable.

By giving up a uniform height you can preserve the aspect ratio of every photo while achieving a fixed-width grid with uniform spacing. The algorithm to achieve it isn’t even difficult: you select a maximum row height, and then, one photo at a time, you scale the photo to that height and add its width to a running tally. Every time the tally exceeds the viewport width, you scale down each photo in the row until the row fits the width (its height will get shorter), and start a new row.

For example laying out 14 photos:

It’s a pretty naive solution but it works well; Google+ used it, Google Search uses a form of it, and Flickr kindly open sourced their implementation of it in 2016 (theirs is slightly smarter and checks if it’s better scaling up with one fewer photo or scaling down with the extra one). The code can be as simple as:
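A minimal sketch of the naive approach, assuming photos are represented only by their aspect ratios (width / height) and ignoring margins. This is an illustration, not the Google+ or Flickr implementation:

```javascript
// Hypothetical helper: greedy row-by-row justified layout.
function naiveLayout(aspectRatios, viewportWidth, maxRowHeight) {
  const rows = [];
  let row = [];
  let ratioSum = 0;
  for (const ratio of aspectRatios) {
    row.push(ratio);
    ratioSum += ratio;
    // ratioSum * maxRowHeight is the row width at the maximum height.
    if (ratioSum * maxRowHeight >= viewportWidth) {
      // Shrink the whole row so it exactly fills the viewport width.
      rows.push({ photos: row, height: viewportWidth / ratioSum });
      row = [];
      ratioSum = 0;
    }
  }
  // Leftover photos that do not fill a row stay at the maximum height.
  if (row.length > 0) rows.push({ photos: row, height: maxRowHeight });
  return rows;
}

const naiveRows = naiveLayout([1.5, 1.5, 1.5, 1.5, 1.5, 0.75, 1.5], 1120, 180);
// Two rows: five photos at roughly 149px, then two leftover photos at 180px.
```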

However, because I was initially (if unnecessarily) concerned about having the estimates neatly match the final layout, I went looking for a more sophisticated solution and in the process ended up with a superior one.

My theory was that once we’d estimated a layout we should be able to fit the photos to that area. This is essentially a line-wrapping problem, in many ways similar to text layout (wrapping words in text to justify a paragraph). The Knuth & Plass line-breaking algorithm is a well documented dynamic programming approach that I felt could be adapted for photo layout.

Instead of making decisions one line at a time, it lays out the entire section as a whole, so that each line may be influenced by the successive ones.

It does this through a combination of boxes, glue, and penalties. Boxes are the indivisible blocks to be positioned (usually the words, but sometimes characters), glue are the blocks that can be stretched or shrunken (usually the whitespace in the line), and penalties can be applied to discourage certain things (often hyphenation or line breaks).

In the following diagram you can see how the glue between the boxes varies in size across the lines.

There are some differences for a photo layout but they actually make it simpler. For text, there are a lot more variations people will accept — you can alter the spacing between words, you can even alter the spacing between letters in a word, and you can hyphenate mid-word. For photos, people find it distracting when the margins between the photos vary in size, and photo-hyphenation doesn’t even make sense.

There are some great write-ups on how the text algorithm works, but here is how we adapted it to photos.

Photos became the boxes, we could drop the concept of glue entirely, and penalties could be simplified as well. Although now I say that, perhaps it’s more analogous to say we dropped boxes and that photos were the glue (ie the flexible part of our layout are the photos not the whitespace). Maybe we just had sticky boxes?

Instead of altering the spacing between photos, we’d follow the approach of other justified layouts and adjust the row heights. Most of the time there would be multiple places to wrap a row: wrapping earlier would give taller rows (scaling up to fill the width), and wrapping later would give shorter rows (scaling down to fit). By considering all the possible permutations of rows, we could find the one that best fit the desired area.

That meant we were left with three primary considerations: ideal row height, maximum shrink factor (how much we could scale a row down from ideal), and maximum stretch factor (how much we could scale up).

The algorithm works by checking every photo one at a time, looking for permissible row breaks — ie a group of photos that when scaled to fit the width will have a height that falls within the accepted range (maxShrink ≤ height ≤ maxStretch). Every time it finds a permissible break it adds it to the list of possibilities, and looks for permissible breaks from there and so on until it has considered every photo and every possible set of rows.
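To make the search for permissible breaks concrete, here is a small sketch. It assumes photos are given as aspect ratios, that `minHeight` and `maxHeight` stand in for the shrink and stretch limits, and that margins are ignored (all names are hypothetical):

```javascript
// Hypothetical helper: finds every photo index at which a row that starts
// at `start` could end while keeping its height inside the accepted range.
function permissibleBreaks(aspectRatios, start, viewportWidth, minHeight, maxHeight) {
  const breaks = [];
  let ratioSum = 0;
  for (let end = start; end < aspectRatios.length; end++) {
    ratioSum += aspectRatios[end];
    // Height of the row once it is scaled to fill the viewport width.
    const height = viewportWidth / ratioSum;
    if (height < minHeight) break; // more photos only make the row shorter
    if (height <= maxHeight) breaks.push({ end, height });
  }
  return breaks;
}

const fourteenPhotos = new Array(14).fill(1.5);
const firstRowBreaks = permissibleBreaks(fourteenPhotos, 0, 1120, 140, 220);
// The first row may end at index 3 (about 187px tall) or index 4 (about 149px tall).
```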

For example, with those 14 photos, an acceptable row may have been after 3 or 4 photos, and if we broke at 3 there would be an acceptable break at 6 or 7, although if we broke at 4 there would be acceptable breaks at 7 or 8. These represent multiple completely different, yet all valid, layouts for the grid.

The final piece is to calculate a badness score for each row, that is, how non-ideal it is. A row that is the target height has a badness score of 0, and the more a row has to shrink or stretch, the higher the badness score will be. The final cost of each row (this will make sense in a moment) is calculated using demerits, which are often the cube or square of the badness plus some penalties (eg for line breaks). There are many articles on how best to calculate badness and demerits; in our case we use a power of the ratio of each row against the max stretch/shrink size (the power more heavily penalizes rows that are a long way from ideal).

Having run the algorithm we end up with a graph of nodes, where each node represents a possible photo to break on, and each edge represents a row (there may be multiple edges for any given node, signifying that there may be multiple break points when starting from any photo). For each of these edges we can assign a cost (the demerit value).

As an example, for our 14 photos, a target row height (180px), and a given viewport (1120px), it found 19 possible row arrangements (edges), leading to 12 unique grid permutations (or paths through the graph). Displayed below is each unique row, and the possible rows it can connect to. The blue route is the least bad (dare I say best?) one. If you follow the lines you’ll see that each combination constructs a full grid containing every photo — no two rows are the same, and no two grids the same.

Unique row and grid combinations for 14 photos

Finding the optimum photo grid (ie the one with the lowest combined set of bad rows) is as simple as calculating the shortest path through the graph.

Luckily for us, the graph we produce is what’s known as a Directed Acyclic Graph (DAG), which is one where there are no loops and you can only go one way (ie you can’t repeat nodes/photos). This means that calculating the shortest path can be done in linear time (which is computer speak for quickly). Better still, we can actually calculate the shortest path while we are producing the graph.

To calculate the length of the path we simply add up the cost we assigned to each row, and every time we find a new edge that connects to a node, check if this makes a shorter path for that node back to the start — if so remember it.
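Putting the pieces together, here is a simplified sketch of the idea rather than the production FlexLayout. It assumes photos are aspect ratios, ignores margins, squares the badness as a stand-in for the real demerit formula, and requires the last row to be permissible too (a real grid would likely treat a short final row specially):

```javascript
// Hypothetical sketch: shortest path over permissible row breaks.
function flexLayoutSketch(aspectRatios, viewportWidth, { target, minHeight, maxHeight }) {
  const n = aspectRatios.length;
  // best[i] is the lowest total cost of laying out the first i photos;
  // previous[i] remembers where the last row of that layout started.
  const best = new Array(n + 1).fill(Infinity);
  const previous = new Array(n + 1).fill(-1);
  best[0] = 0;
  for (let start = 0; start < n; start++) {
    if (best[start] === Infinity) continue; // no valid layout ends just before here
    let ratioSum = 0;
    for (let end = start; end < n; end++) {
      ratioSum += aspectRatios[end];
      const height = viewportWidth / ratioSum;
      if (height < minHeight) break; // rows only get shorter from here
      if (height > maxHeight) continue; // not enough photos in the row yet
      // Badness is 0 at the target height and 1 at the shrink/stretch limit.
      const range = height >= target ? maxHeight - target : target - minHeight;
      const badness = Math.abs(height - target) / range;
      const cost = best[start] + badness ** 2; // assumed demerit: badness squared
      if (cost < best[end + 1]) {
        best[end + 1] = cost;
        previous[end + 1] = start;
      }
    }
  }
  if (best[n] === Infinity) return null; // no permissible layout exists
  // Follow the remembered edges back from the last photo.
  const rows = [];
  for (let end = n; end > 0; end = previous[end]) {
    rows.unshift({ start: previous[end], end: end - 1 });
  }
  return { rows, cost: best[n] };
}

// Twelve square photos in a 1000px viewport: rows of 4, 5, or 6 photos are permissible.
const squares = new Array(12).fill(1);
const flexResult = flexLayoutSketch(squares, 1000, { target: 200, minHeight: 150, maxHeight: 260 });
// Picks two rows of six (slightly short) over three rows of four (noticeably tall).
```

Because each row can only hold a bounded number of photos before it falls below the minimum height, the inner loop stops early and the total work grows roughly linearly with the number of photos.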

Here is an illustration of what the computer “sees” as it looks through those 14 photos — the top line shows what photos it is currently looking at (the starting and ending photo for a row), the graph below shows what break points it has discovered, and which edges connect, and at every point it will highlight in pink the currently shortest path for each node. This is actually just another representation of the picture graph shown above — each of the edges between the boxes corresponds to one of those unique rows.

Starting from the first photo it finds an acceptable break point at index 2, with a cost of 114. It then finds another acceptable break point at index 3 with a much higher cost of 9483. It now needs to check those two new indexes (2 and 3) for where they could break. From 2 it finds 5 and 6, and at this point the shortest path for 6 is back via 2 (114 + 1442 = 1556) so it marks it. When photo 3 finds a path to 6 we check the cost again, but because it was so expensive to get to 3 initially, the total cost (9483 + 1007 = 10490) means that 6 keeps its allegiance to 2. Towards the end of the animation you can see that the first path to 11 was non-ideal and switches when node 8 is considered.

Finding the optimum row combination for 14 photos

We keep doing this through the entire set of photos until we get to the very last photo (index 13). At that point the shortest path (and best layout) can be found by following the shortest route we marked along the way (and colored in blue in the animation).

Here is a comparison of what the naive algorithm produced (on the left), and what the line-wrap algorithm achieved (on the right). Both were given a target height of 180px. You can see two interesting things: one is that the naive layout always goes under, and the other is that the line-wrap one was just as happy to go over. However, the line-wrap algorithm produced a grid that was much closer to the target height.

Comparison between layout approaches, given a target height of 180px

We found in our testing that the line-wrap (we named it FlexLayout) algorithm produced both objectively and subjectively more desirable grids. It consistently produced grids of more uniform height (smaller variation between rows), and ones with average row heights much closer to the requested target. And it did appreciably better with panoramas and other edge cases that usually trip up a naive algorithm — this is because in the naive approach the panoramic (ultra-wide) photo will be added to the first row it’s considered, and so often scaled very small because there may be multiple photos on that row already, whereas with the FlexLayout all possible rows are considered, and ones that overly shrink the panorama will have high badness values, encouraging the selection of grids where the panorama is placed by itself or with few others.

That may mean that there are a few rows that will be a little more bad (a few more pixels from target height), to prevent one row being much more bad (extremely short or extremely tall). It minimizes surprises.

There are many factors that affect how many possible layouts there are. More photos is one of the largest factors, but the viewport width can also constrain it, and then the actual parameters for shrink-ability/stretch-ability have a big effect too.

Unique layouts for 25 photos across viewport sizes

You can get a sense of this when looking at the graph for 25 photos across a narrow, medium, and wide viewport. In the narrow window there were only a few breakpoints available but we needed a lot of rows, in the medium window there were more breakpoints, and in the wide window while there were even more breakpoints we did not need as many rows so there were actually fewer total arrangements.

The total number of unique layouts grows exponentially with the number of photos. For a medium width viewport I had the layout calculate the actual unique paths for a set of photos and got:

For 1000 photos there were simply too many for the computer to measure, so it couldn’t actually count the precise number of unique paths (it’s an amusing quirk that the algorithm in this case can know it has found the best path nearly instantly, even if it can’t verify itself in reasonable time).

We can estimate the unique layout permutations by taking the average number of permissible breakpoints per row and raising that to the power of the likely number of rows. Most viewports support 2–3 breakpoints per row, and with most rows about 5 or more photos, you can ballpark the number of layouts with 2.5^(count/5).

For 1000 photos that would be a number with 79 zeros on the end. 1260 photos would have a googol of layouts.
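The ballpark above is easy to check by working in powers of ten. In this sketch the function name is hypothetical, and the defaults are the 2.5 breakpoints per row and 5 photos per row quoted in the text:

```javascript
// Returns log10 of the estimated layout count, ie the power of ten it is close to.
function log10LayoutCount(photoCount, breaksPerRow = 2.5, photosPerRow = 5) {
  return (photoCount / photosPerRow) * Math.log10(breaksPerRow);
}

const powerFor1000 = log10LayoutCount(1000); // about 79.6, so roughly 4 x 10^79
const powerFor1260 = log10LayoutCount(1260); // about 100.3, so just past a googol (10^100)
```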

While the naive approach will consider a single layout and pick it every time, the line-wrap algorithm considers millions, billions, trillions, and many more unique layouts and selects the best one.

In case you’re curious, it’s also very quick. The layout for 100 photos takes about 2 thousandths of a second (2ms). 1000 photos takes 10ms, 10000 photos takes 50ms, and 1 million photos only takes 1.5 full seconds (we’ve tested). In comparison the naive algorithm takes about 2ms, 3ms, 30ms, and 400ms for those same numbers — faster, but not to a meaningful degree.

So, while the original intent had been to use the sheer number of possible layouts to pick the one that best fit the available space (ie make the layout match the estimate), we found that we can smoothly adjust for the gap between estimated and actual size, which frees us to always present users with the best possible grid.

The layout works so well the team has since ported it to Android and iOS, and the three implementations are kept in sync.

The last layout trick we do is run the algorithm twice for each section. The first time we run it to lay out all the photos within each segment, the second time we run it to lay out all the segments within the section. The primary reason for this is that sometimes there are very short segments that do not fill a row, and the layout algorithm will suggest options to coalesce them; just as with photos, it will look at all the possible groupings to select the most ideal.

3. 60fps Scrolling

Having scrubbable photos and ideal layouts would not count for much if the browser couldn’t handle it. Which, by itself, it actually can’t — fortunately we can help.

One of the biggest ways that websites can feel slow (other than initial load times) is in how smoothly they respond to user interaction, especially scrolling. Browsers try to redraw the contents of the screen 60 times every second (60fps) and when they’re successful it looks and feels very smooth — when they don’t it can feel janky.

To maintain 60fps each update needs to be rendered in a mere 16ms (1/60) and the browser needs some of that time for itself — it has to marshal the events, parse style information, calculate layouts, convert all the elements into pixels, and finally draw them to the screen — that leaves around 10ms for an app to do its own work.

Within those 10ms, applications need to be both efficient in what they do, as well as careful not to make the browser perform unnecessary work.

Maintaining a constant-size DOM

One of the worst things for page performance is having too many elements. The problem is twofold: it consumes more memory for the browser (eg at 50KB per thumbnail, 1000 photos is 50 megabytes, and the 10000 photos that used to crash Chrome was half a gigabyte); additionally, there are more individual pieces for which the browser needs to compute styles and positions, and to composite during layout.

While most users will have thousands of photos in their library, the screen can usually only fit a few dozen.

So, instead of placing every photo into the page and keeping it there, every time the user scrolls we calculate what photos should be visible and make sure they are in the document.

For any photo that used to be in the document but is no longer visible we pull it back out.

While scrolling the page there are probably never more than 50 photos present, even as you scroll through tens of thousands. This keeps the page snappy at all times and prevents crashing the tab.

And, because we group photos into segments and sections we can often take a shortcut and detach entire groups instead of each individual photo.
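A minimal sketch of that bookkeeping at the section level. It assumes each section is tracked as a `{ top, height }` record sorted by position; the function names are hypothetical, and attaching or detaching the actual DOM nodes is left out:

```javascript
// Returns the indexes of the sections that intersect the viewport.
function visibleSections(sections, scrollTop, viewportHeight) {
  const bottom = scrollTop + viewportHeight;
  const visible = [];
  for (let i = 0; i < sections.length; i++) {
    const { top, height } = sections[i];
    if (top + height <= scrollTop) continue; // entirely above the viewport
    if (top >= bottom) break; // entirely below, and so is everything after it
    visible.push(i);
  }
  return visible;
}

// Compares what is in the document with what should be there.
function diffAttached(attached, visible) {
  const visibleSet = new Set(visible);
  return {
    toAttach: visible.filter((i) => !attached.has(i)),
    toDetach: [...attached].filter((i) => !visibleSet.has(i)),
  };
}

const gridSections = [0, 1000, 2000, 3000].map((top) => ({ top, height: 1000 }));
const visibleBefore = visibleSections(gridSections, 900, 800); // [0, 1]
const visibleAfter = visibleSections(gridSections, 2100, 800); // [2]
const domChanges = diffAttached(new Set(visibleBefore), visibleAfter);
// domChanges is { toAttach: [2], toDetach: [0, 1] }
```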

Minimizing changes

There are some great articles on the Google Developers site about rendering performance and how to use the powerful analysis tools that are built into Google Chrome — I’ll touch on a few aspects here as it applies to photos, but the other write-ups are well worth a read. The first thing to understand is the page rendering life-cycle:

Every time there is a change to the page (usually triggered by JavaScript, but sometimes CSS styles or animations) then the browser checks what styles apply to the affected elements, recalculates their layouts (the size and positions), and then paints all the elements (ie converts text, images, etc… to pixels). For efficiency the browser usually breaks the page into different sections it calls layers and paints these separately, and so a final step of compositing (arranging) those layers is performed.

Most of the time you never need to think about this, the browser is pretty clever, but if you keep changing the page on it (for example constantly adding or removing photos) then you need to be efficient in how you do that.

Sections, segments, and tiles are positioned absolutely

One way we minimize updates is by positioning everything relative to its parent. Sections are positioned absolutely relative to the grid, segments are positioned absolutely relative to their section, and tiles (the photos) are positioned absolutely relative to the segment.

What this means is that when we need to move a section because the estimated and actual layout heights were different, instead of needing to make hundreds (or thousands) of changes to every photo that was below it, we need only update the top position of the following sections. This structure helps isolate each part of the grid from unnecessary updates.

Modern CSS even provides a way to let the browser know — the contain keyword lets you indicate to what degree an element can be considered independently of the DOM. We annotate the sections and segments accordingly.

There are some easy performance pitfalls as well, for example the scroll event can fire multiple times within a single frame, and the same for resize. There is no need to force the browser to recalculate style and layout for the first events if you will change them a second time anyway.

Fortunately there is a handy way to avoid that. You can ask the browser to execute a specific function before the next repaint by using window.requestAnimationFrame(callback). In the scroll and resize handlers we use this to schedule a single callback instead of immediately updating — for resize we go a step further and delay updating for half a second until the user has settled on the final window size.
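A small sketch of that coalescing pattern. The `createFrameScheduler` name is an assumption, and the frame function is injectable only so the example can run outside a browser:

```javascript
// Wraps `update` so that any number of calls within one frame run it only once.
function createFrameScheduler(update, requestFrame = (cb) => globalThis.requestAnimationFrame(cb)) {
  let scheduled = false;
  return function schedule() {
    if (scheduled) return; // an update is already queued for the next frame
    scheduled = true;
    requestFrame(() => {
      scheduled = false;
      update();
    });
  };
}

// In a browser: window.addEventListener("scroll", createFrameScheduler(updateGrid));

// Demonstration with a fake frame queue standing in for the browser.
const frameQueue = [];
let updateCount = 0;
const onScroll = createFrameScheduler(() => { updateCount++; }, (cb) => frameQueue.push(cb));
onScroll();
onScroll();
onScroll(); // three scroll events before the next repaint
frameQueue.forEach((cb) => cb()); // the next frame arrives
// updateCount is 1
```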

The second common pitfall is something known as layout thrashing. Once the browser has calculated layout, it caches it, and so you can happily request the width, height, or position of any element pretty quickly. However if you make any changes to properties that could affect layout (eg width, height, top, left) you immediately invalidate that cache, and if you try to read one of those properties again the browser will be forced to recalculate the layout (perhaps multiple times in the same frame).

Where this can really cause problems is in loops with updates to many elements (eg hundreds of photos). If in each iteration you read one of the layout properties and then change it (say moving photos or sections to the correct spots), then you are triggering a new layout calculation for every step in the loop.

The simple way to avoid this is to first read all the values you need, and then write all the values (ie batch and separate reads from writes). For our case we avoid ever reading the values, and instead keep track of the size and position that every photo should be in, and absolutely position them all. On scroll or resize we can re-run all our calculations based on the positions we have been tracking, and safely update knowing that we will never thrash. Here is what a typical scroll frame looks like (everything is only called once):

Rendering and Painting event order for a typical scroll update
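The write-only update described above can be sketched like this. The tile records are assumed to come from the layout calculation, and the plain object with a `style` property is a stand-in for a real DOM element:

```javascript
// Positions every tile from tracked numbers; nothing is read back from the DOM,
// so the browser never has to recalculate layout in the middle of the loop.
function positionTiles(tiles) {
  for (const { element, left, top, width, height } of tiles) {
    const style = element.style;
    style.position = "absolute";
    style.left = `${left}px`;
    style.top = `${top}px`;
    style.width = `${width}px`;
    style.height = `${height}px`;
  }
}

// Stand-in for a DOM element so the sketch runs outside a browser.
const fakeTile = { element: { style: {} }, left: 0, top: 184, width: 270, height: 180 };
positionTiles([fakeTile]);
```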

Avoiding long running code

With the exception of Web Workers, and some of the native async handlers like the Fetch API, everything in a tab essentially runs on the same thread — both rendering and JavaScript. That means any code a developer runs will prevent the page from redrawing until it completes — for example a long-running scroll event handler.

The two most time-consuming things our grid does are layout and element creation. For both we try to limit them to the essential operations.

For example, the layout algorithm takes 10ms for 1000 photos and 50ms for 10000, which could use up our entire frame allowance. However, given that we subdivide our grid into sections and segments, we usually only need to lay out a few hundred photos at any time (which takes 2–3ms).

The most “costly” layout event should be a browser resize, because that would need us to re-calculate the sizes of every section. Instead we fall back to the simple estimate calculation, even for loaded sections, and only perform the full FlexLayout for the presently visible section. We can then defer the complete layout calculation for the other sections until we scroll back to them.

The same happens with element creation: we only create the photo tiles just before we need them.

Result

The end result of all the hard work is a grid that can maintain 60fps the majority of the time, even if it occasionally drops some frames.

These dropped frames usually occur when a major layout event happens (such as inserting a brand new section) or occasionally when the browser performs garbage-collection on very old elements.

4. Instantaneous Feel

I suspect that most front-end engineers would agree that sleight of hand plays a role in many good UIs. The trick is selecting what smoke to use and how to angle the mirrors.

My favorite example of this is a secret that a colleague at YouTube shared with me. When they first implemented the navigation progress bar (the red bar that appears at the very top when you change pages) they had no way of actually measuring the progress, so they just animated it at the speed most pages took, and then it sort of “hangs” towards the end until the page actually does load. I have no idea if the current version is still pretending or if it actually works, but the point is it doesn’t matter.

It wasn’t necessary to be accurate; what mattered was that it helped the page feel responsive.

In this section I’ll share a few of the tricks we use to make Google Photos appear a little faster than it really is — mostly how we disguise image load times.

The first, and probably most effective, thing we do is preemptively load content that we think you are about to look at.

After loading any tile that is visible we then attempt to stay a page ahead so the thumbnails have loaded by the time you scroll.

However, especially for HDPI screens (where we need to load larger thumbnails), if you are scrolling quickly then the network connection may not be able to fulfill all those requests in time.

We handle that by loading extremely small placeholders for as many as 4 or 5 full screens in the future, and replacing them once they get closer to the viewport.

This means that if you are scrolling relatively slowly (at a speed suitable for actually looking at all photos) then you should never see any loading, and if you are scrubbing quickly (at a speed suggesting you are searching for a photo) we can give you enough of the visual context to help guide your search.

This is a complex tradeoff between doing unnecessary work over-fetching content, and providing a better experience.

We take a few factors into consideration. The first thing is to observe the scroll direction, and only pre-load content in the direction the user is heading. We also measure the scroll speed and skip loading full-res thumbnails as soon as we think you’re scrubbing, and at an even higher threshold disable low-res preloading if you’re flying through content.
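One way such a policy could be expressed is sketched below. The thresholds, names, and units (pixels per second) are purely illustrative assumptions, not the values used in the product:

```javascript
// Hypothetical thresholds, in pixels per second.
const SCRUB_SPEED = 4000; // above this, assume the user is scrubbing
const FLYING_SPEED = 20000; // above this, even placeholders are wasted work

function preloadPlan(previousScrollTop, scrollTop, elapsedMs) {
  const delta = scrollTop - previousScrollTop;
  // Guard against a zero interval between two scroll samples.
  const speed = Math.abs(delta) / (Math.max(elapsedMs, 1) / 1000);
  return {
    direction: delta >= 0 ? "down" : "up", // only preload in this direction
    loadThumbnails: speed < SCRUB_SPEED,
    loadPlaceholders: speed < FLYING_SPEED,
  };
}

const browsing = preloadPlan(0, 50, 16); // about 3125 px/s: load everything ahead
const scrubbing = preloadPlan(0, 160, 16); // about 10000 px/s: placeholders only
const flying = preloadPlan(1000, 0, 16); // about 62500 px/s upward: load nothing
```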

In each case (normal thumbnail and low-res) we’re scaling images. Now that modern screens have such high resolutions the common practice to ensure images look crisp is to load an image that is twice as large as the space you are filling and then shrink it (so there are more actual pixels than the space it takes). For the low-res placeholders we request very small images, and also at lower compression quality (eg 25%) and then scale them up.

Here is an example of a sleepy leopard — the image on the left is used in the grid when the tile is fully loaded (it gets scaled down to half size), the image on the right is the low-res placeholder you will only see if you are scrolling quickly (it gets scaled up).

Also observe the byte sizes. The HDPI thumbnail was 71.2KB (gzipped) while the low-res placeholder was only 889B (gzipped) — the thumbnail was 80x bigger! Put another way, a single tile in the grid is the same as 4 or more pages of low-res placeholders.

For a very small increase in extra network traffic we can give the user a much better experience, a grid that always feels full, and always provides visual context.

The last little touch with the low-res tiles was how we asked the browser to render them. By default when you scale up an image the browser will smooth it a little (the center image below), but this doesn’t look very good. You can apply a blur filter (the right-most image) which makes it look more deliberate, but the downside is that filter is computationally expensive, and if you apply it to hundreds of elements you will negatively affect rendering and scroll performance. So we chose to go in the other direction and lean-in to our low-res look by asking the browser to leave the image pixelated (the left-most image) — to be honest, I’m not sure if this is still in the product today, there have been a few refactorings.

The hope is that the user never sees the low-res images except during fast scrolling. When a low-res image was already in the viewport at the moment its replacement arrived, we previously used a quick animation to make it look like the image loaded (instead of flashing into place). That’s easily achieved by overlaying the two pictures and animating the opacity (from fully transparent to fully opaque). This cross-fade technique has become very common across the web; for example, all the images in this Medium post probably did it. I believe the cross-fade for low/high-res has since been turned off, but it does still occur from the empty (grey) tile to the image.

It makes it look like the image is loading. We did this swiftly (in 100ms) which is just enough time to take the edge off, without feeling indulgent. I’ve slowed the animation below to make it more observable.
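The opacity animation itself is simple. Here is a minimal sketch of a linear cross-fade, assuming the 100ms duration mentioned above; the function name is hypothetical, and in practice the same effect is often achieved with a CSS `transition` on `opacity` rather than computing values by hand.

```typescript
const FADE_MS = 100; // duration quoted in the text

// Opacity of the incoming (top) image after elapsedMs, clamped to the range 0..1.
function crossFadeOpacity(elapsedMs: number, durationMs: number = FADE_MS): number {
  if (durationMs <= 0) {
    return 1; // no animation requested: show the new image immediately
  }
  const progress = elapsedMs / durationMs;
  return Math.min(1, Math.max(0, progress));
}
```

Because the new image sits on top of the old one, only the top layer's opacity needs to change; the layer underneath can be removed once the fade completes.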

We use this technique a second time when transitioning from a thumbnail photo into the full-screen view. When the user clicks a tile we immediately start loading the full-res image and in the meantime scale and animate the thumbnail into place. When the full image has loaded we overlay it and do the opacity animation between them. The only difference is that this time, because we are only applying it to a single element, we can afford to use the more expensive blur filter (which is handy because the pixelated effect is less charming on large images).

Transition from photo grid to full-screen

At all times, when scrolling through photos or transitioning to the full-screen view, we are trying to provide a smooth experience to the user that always feels like it is responding to their input, even when the content isn’t ready. Contrast this with how it would feel if, when you clicked a tile, it either displayed a blank screen or did nothing until the full photo loaded.

We even apply this concept to the empty sections. If you recall, our scrubbable grid only loads sections when it needs to (although like tiles it attempts to pre-load nearby sections). This means, especially if you grab the scrollbar and race ahead, you can get to sections that have not loaded yet — the grid has pre-allocated space for them, but doesn’t know what photos go there or what the layout is.

To make scrolling feel more natural we put a texture in the unloaded sections that is the same height as the target row size and colored to look like an empty tile. When we first launched, it just looked like rows (the left-most picture), although the team has more recently changed the texture to be rows and columns (the right-most picture) which more closely approximates photos. The middle picture is what it looks like when the section has loaded but the tiles have not.

It’s like animal tracks for photo loading states — next time you’re scrubbing through Google Photos see if you can spot the differences.

Instead of using an image for the texture, it was actually created using CSS. This gives the added bonus that the width and height can be dynamically generated to match the target row height that is used for the grid.
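One way to build such a texture is a repeating CSS gradient with hard color stops, generated from the target row height. This is a sketch of the rows-only variant; the function name, the gap size, and the grey color are assumptions, and the actual CSS used in the product is not shown in this post.

```typescript
// Builds a CSS background value that draws tile-colored bands, one per row,
// separated by a transparent gap. rowHeight should match the grid's target row height.
function placeholderTexture(rowHeight: number, gap: number = 4, color: string = "#eee"): string {
  const period = rowHeight + gap; // height of one band plus the gap below it
  return (
    "repeating-linear-gradient(to bottom, " +
    `${color} 0px, ${color} ${rowHeight}px, ` +
    `transparent ${rowHeight}px, transparent ${period}px)`
  );
}
```

The result can be assigned to the `background` of the element that reserves space for an unloaded section, and regenerated whenever the target row height changes.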

We have a few other tricks but they’re mostly about prioritizing network requests. For example, instead of flooding the network with a request for 100 photo thumbnails, we batch them into 10 or so at a time, so if the user suddenly starts scrolling again we don’t end up with 90 photos we loaded but didn’t use. Similarly, we always prioritize loading the visible thumbnails over the off-screen ones.
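The batching and prioritization can be sketched as a small planning step. The batch size of 10 comes from the text above; the `TileRequest` shape and the `planBatches` name are assumptions for this example.

```typescript
interface TileRequest {
  id: string; // hypothetical identifier for the thumbnail to fetch
  visible: boolean; // true if the tile is currently inside the viewport
}

// Orders visible tiles ahead of off-screen ones, then splits the queue into batches.
function planBatches(tiles: TileRequest[], batchSize: number = 10): TileRequest[][] {
  const size = Math.max(1, Math.floor(batchSize));
  const ordered = tiles
    .filter((tile) => tile.visible)
    .concat(tiles.filter((tile) => !tile.visible));
  const batches: TileRequest[][] = [];
  for (let start = 0; start < ordered.length; start += size) {
    batches.push(ordered.slice(start, start + size));
  }
  return batches;
}
```

A loader would request one batch at a time and re-plan before starting the next, so a sudden scroll only wastes the batch already in flight.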

We even look to see if we have already loaded a similarly sized thumbnail and can use it instead. This last use-case arises primarily after browser resizes, where you will often end up with a grid layout that is almost the same but with rows just a few pixels different. Instead of re-downloading every photo we will slightly scale the images we already have (opting for new ones only if the difference is too much).
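A minimal sketch of that reuse check follows. The 10% tolerance and the function name are assumptions; the post does not say how large a difference counts as "too much".

```typescript
// Returns the cached thumbnail width closest to neededWidth, provided it is within
// the relative tolerance. Returns undefined when a fresh download is warranted.
function pickReusableWidth(
  cachedWidths: number[],
  neededWidth: number,
  tolerance: number = 0.1 // hypothetical: accept up to a 10% size difference
): number | undefined {
  if (neededWidth <= 0) {
    return undefined;
  }
  let best: number | undefined = undefined;
  for (const width of cachedWidths) {
    const diff = Math.abs(width - neededWidth);
    if (diff / neededWidth > tolerance) {
      continue; // too different: scaling would look soft or waste bytes
    }
    if (best === undefined || diff < Math.abs(best - neededWidth)) {
      best = width;
    }
  }
  return best;
}
```

After a small resize, a tile that now needs a 410px-wide image can keep using a cached 400px one, while a jump to 600px triggers a new request.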

Conclusion

A tremendous amount of care and attention goes into every detail of the Google Photos experience, and the photo grid is just one part of a much bigger product.

While it may at first appear simple, and even stationary, the grid is nearly always thinking — loading, pre-fetching, animating, creating, removing, and presenting your content the best that it can.

Keeping the grid performing well (and constantly improving) has been an ongoing priority for the team. They have comprehensive monitoring to measure the scrolling frame-rate, the section and image load times, and many other metrics, and continue to improve the performance and experience each year.

Here is a short screen capture of what it looks like to scroll through a gallery. At slow speeds you only see the full-res images. As we speed up you can start to see the pixelated placeholders, which resolve as soon as we slow down again, and as we race forwards there is a brief glimpse of empty grey tiles until the grid catches up.

Scrolling and scrubbing the photo grid

A big thank you to my former manager on Photos, Vincent Mo, who in addition to his support shot all the great photos used throughout this post (which also served as a test set during development). Thanks also to Jeremy Selier, the Photos Web Lead, and his team, who continue to maintain and improve the Photos Web UI today.