Focal Systems

Hailed as the future of retail, Amazon's cashier-less Go stores have arguably been one of the most ambitious projects launched in the retail landscape in a long time. As an engineer myself, I am stunned by the scale and robustness of the system. Dilip Kumar, Jimmy Pang, and the rest of the Go team achieved what most people thought to be impossible. This robustness, however, comes at the cost of vast amounts of costly hardware, heavy computational requirements, DevOps overhead, and numerous other challenges.

All of this raises the question: is Amazon Go actually a feasible solution for a large format grocery store?

We wanted to find out. We approached the question from a technical angle, given our deep learning expertise (for a less technical version also check out Brad Stone’s and Matt Day’s excellent Bloomberg article that reaches eerily similar conclusions).

TL;DR: In our analysis, we find that even though GPU compute is getting cheaper each year, the system will not break even in a large format grocery store, compared to the status quo of operating the front end with cashiers, until after 2040.

I. Economic Benefits of the Amazon Go Platform in Large Format Grocery Stores

Direct Labor Savings

The Amazon Go platform eliminates the need for cashiers and for out-of-stock scanners / inventory counters. These benefits imply the following direct labor savings (a worked calculation follows the list):

  1. Cashier: Today's cashier makes $12/hour on average nationally. While grocery stores may have 10 lanes, and at peak dates/times may open all 10, this is rare. On average, we estimate that 5 cashiers are staffed in your average grocery store. If the store is open every day, 7am–midnight, this implies a $372,300 annual savings per store.
  2. Out of stock scanning / inventory counting: With cameras watching each product, there is no more need to scan outs for 4 hours a day in the morning, run cycle counts to ground for 2 hours, and manually count inventory each quarter (which takes ~100 hours). This provides additional direct labor savings of about $40,000/store/year.
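A quick worked check of these two line items. This is a minimal sketch; the stocker wage in the second item is our assumption, since the post only states the ~$40,000 total:

```python
# Direct labor savings arithmetic, using the estimates above.
wage = 12.00            # $/hour, national average cashier wage
cashiers = 5            # cashiers staffed on average
hours_per_day = 17      # 7am to midnight
days_per_year = 365

cashier_savings = wage * cashiers * hours_per_day * days_per_year
print(f"cashier savings: ${cashier_savings:,.0f}")          # $372,300

# Out-of-stock scans (4 h/day), cycle counts (2 h/day), and quarterly
# inventory counts (~100 h each). At $12/hour this is ~$31k/year; the
# ~$40,000 figure implies a wage nearer $15/hour for these roles, which
# we assume here.
inventory_hours = (4 + 2) * days_per_year + 4 * 100          # 2,590 hours
print(f"inventory savings: ${inventory_hours * 15:,.0f}")    # ≈ $38,850
```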

Indirect Labor Savings

The platform also creates enormous amounts of data that indirectly permits additional optimization of the workforce and supply chain. With that data, the system knows how many picks per hour each stocker is completing and exactly when items go out of stock, so management knows how much to allocate on the shelf next planogram cycle to ensure the allocation lasts the whole day. This helps management optimize their planograms, the supply chain, and in-store labor performance.

The indirect labor savings of such a benefit are very hard to estimate; nevertheless, we estimate them at $50,000/store/year.

How Much Lift Does “No, Seriously, Just Walk Out” Provide?

A key hypothesis Amazon Go is currently testing is:

If lines are shorter, more people will go there.

It is an interesting sociological experiment to think about what the increase in foot-traffic to one store over another would be as a function of the difference in checkout speeds. One would expect the bigger the delta in checkout speeds, the more foot traffic the faster store would get compared to the slower store. To conduct this experiment, you would need 2 stores with very similar offerings that would attract very similar demographics, but very different average checkout speeds. And you would need to conduct this in a few different geographies.

How is Amazon conducting this experiment?

The Amazon Go team pivoted in 2015 from the larger 30,000 sqft store format to the convenience store format to attract the lunch-time rush. To test the hypothesis described above, they opened 14 small grab-and-go stores, almost all within 3 blocks of a Chipotle, which currently has the largest market share in this category and happens to have very long lines.

Figure 1: Most Amazon Go Stores are within 3 blocks of a Chipotle

Focal’s “Amazon Go and Chipotle” Data Collection

To understand this customer preference, Focal deployed people counters above the entrances of 3 Amazon Go stores in San Francisco, and for each one we also deployed people counters at the nearest Chipotle to establish a baseline. We gathered data for 21 days, 5 hours a day (noon–5pm). The measured data shows the average number of customers entering an Amazon Go store was ~15% higher (11 more customers per hour) than at the neighboring Chipotle. The average wait time at Chipotle during the measurement period was 9.01 minutes from entering the line until food was handed to the customer, vs. Amazon Go, which is essentially instantaneous. Here is the number of customers per hour interval, averaged over the 21-day period, in both stores.

Figure 2: New customer arrivals at Amazon and Chipotle during lunch time.

Thus (15% ÷ 9.01 minutes ≈ 1.6% per minute), one can conclude that for every minute your checkout is faster than a neighboring store selling similar things, you will get roughly 1.6% more traffic.

As a corollary, assuming these numbers hold for a large format grocery store where the average basket size is $41.30/customer with 140,000 shoppers a year per store, this implies $867,300 in annual revenue lift.

Assuming a 20% gross margin, we conclude the Amazon Go platform would deliver $173k in Incremental Gross Profit per grocery store per year.
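Putting the lift arithmetic in one place (same figures as above):

```python
# Lift arithmetic: measured traffic lift applied to a typical store.
basket_usd = 41.30      # average basket size per customer
shoppers = 140_000      # shoppers per store per year
lift = 0.15             # ~15% more entries than the Chipotle baseline
gross_margin = 0.20

lift_revenue = basket_usd * shoppers * lift        # $867,300
gross_profit = lift_revenue * gross_margin         # ≈ $173,460
print(f"${lift_revenue:,.0f} revenue lift -> ${gross_profit:,.0f} gross profit")
```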

This brings the total economic benefit of the Amazon Go system to the following:

Figure 3: The total annual benefit Amazon Go provides to a large format grocery store comes out to $645,000.

II. Accuracy Requirements and Feasibility of Amazon Go in a Large Format Grocery Store

Analysis of how accurate the system must be

The average basket quantity in the US in 2018 was 8 items per basket. That is 8 opportunities per trip for the system to be unsure, or worse, wrong. When the system is unsure, the system needs to take additional steps which very often come in the form of human intervention which costs the store money and frustrates shoppers.

We call these cases a "Can Not Decide" (CND) event (Dilip, in his re:MARS address, called these "low confidence" events). Assuming a specific CND rate per "grab", and treating grabs as independent, the percent of shoppers that need to be audited is: % audited = 1 − (1 − CND%)^(basket size).

Setting the basket size to 8 items, here is the percentage of shoppers that must be audited as a function of the system's CND rate:

Figure 4: As CND events increase, significantly more shoppers need to be audited.

This suggests that even at a 1% CND% the system would require 7.73% of transactions to be audited in store or remotely.
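A minimal sketch of this calculation, assuming each grab independently triggers a CND event, which is the independence assumption behind the 7.73% figure:

```python
def audit_fraction(cnd_rate: float, basket_size: int = 8) -> float:
    # Probability that at least one grab in the basket is a CND event,
    # i.e. that the trip needs a human audit.
    return 1.0 - (1.0 - cnd_rate) ** basket_size

print(f"{audit_fraction(0.01):.2%}")   # 7.73% of trips at a 1% CND rate
print(f"{audit_fraction(0.05):.2%}")   # 33.66% at a 5% CND rate
```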

First, getting any AI system to 99% accuracy is extremely difficult. Second, auditing even 8% of shoppers' baskets in store would make the system very frustrating, and auditing that many transactions through remote labeling of video data (which is how Amazon Go performs these audits) would be very expensive.

Assuming the Amazon Go Platform has a 5% CND%, this implies that 55,000 shoppers a year would need to be remotely audited per store. The cost of auditing a single 20 minute shopping trip may be ~$2.

This would imply a labeling cost of $110,000/year/store, which is itself the cost of 3–4 full-time cashiers.

However, as the system gets more accurate, this cost will shrink toward zero.

The case for scales in the shelves vs. other ways to solve SKU recognition

Dilip mentioned in his address that they had to build custom cameras. This is because their camera specs are highly unusual and demanding: the cameras have to be able to tell the difference between two very similar looking products from 20–30 feet away, and they need to collect accurate depth data to build a 3D scene that helps track people from camera to camera, frame to frame. Even with these special cameras, there are still occlusion events that prevent the cameras from seeing which product was grabbed or replaced. Since the bar for accuracy is so high (as explained above), even a 1% error rate is unacceptable, and in some edge cases the cameras alone do not provide enough signal to disambiguate between two highly plausible events.

To solve the shortfalls of an overhead camera-only system, Amazon Go installs scales in the shelves that help the system perform even in the midst of an occlusion event.

Some Amazon Go me-too systems have touted the ability to retrofit stores with just a few ceiling-mounted cameras and no scales (Standard Cognition says it can achieve Amazon Go with just 17 cameras per aisle and no scales). Focal conducted a study in which we installed hundreds of cameras in one aisle, covering nearly every viewpoint possible, and asked customers to try to fool the system; we concluded that it is impossible to achieve >95% accuracy with a camera-only system. For example, when items on the bottom shelf are pushed to the back, a ceiling-mounted camera cannot possibly have the perspective required to see the item and determine whether it is in stock or out of stock, let alone to see whether a customer has grabbed it. This lets a customer secretly grab the item by hugging the shelf, occluding the grab event or the product features so the cameras lack the information to predict what just happened. Even with the right perspective to see the grab, the customer can move quickly enough to cause motion blur that defeats recognition. This motion blur occurs with all cameras (global or rolling shutter) unless they use a very bright flash, as the E-ZPass / FasTrak systems do, which would not be acceptable in a grocery environment.

Thus, a camera-only system would require a significant number of "audits", which would be very costly, would deter customers from using the system, and would defeat the entire purpose of the system, mandating the need for shelf weight sensors.

How to Handle Audits Remotely While Still Maintaining Low Shrink?

In many self-checkout systems in the US (Costco) and in Europe (Tesco), random manual audits are performed to ensure customers are not stealing. This frustrates customers so much that in some cases it calls the viability of the system into question.

Instead of stopping customers, Amazon Go solves this by deploying a number of Human-In-The-Loop (HITL) labelers who take all the input from the system and try to infer what happened on the shelves in the case of a CND event. This has the added benefit of producing labeled training data of "hard examples" for the system to learn from over time. Eventually the hope is to automate this as the system learns. We discuss the cost of such an audit system further below.

Also, as we discussed above, the only way this HITL can work is if there is sufficient “signal” captured in the store for the labeler to be able to accurately and quickly disambiguate between two likely events, which is not true with camera-only systems.

III. Compute Requirements: How many GPUs would each store need?

Throughput

The compute power required to run the Amazon Go system is directly proportional to both the amount of data the system produces and the algorithmic complexity of the processing required: the more cameras, the more expensive the compute; the more models in the pipeline, the more expensive the compute.

While Moore's law for GPUs (we will call this "Jensen's Law") is in full effect (increasing teraflops per dollar by ~50% a year), we estimate the compute costs of the system to be orders of magnitude more expensive than employing cashiers, and they will remain so until after 2040.

Here is our current estimate of the compute costs of deploying Amazon Go in a large format grocery store:

  • Amount of Data Produced by the System: There are about 300 cameras covering the 1,000 sqft. store, which implies 1 camera per 3 sqft. A large format grocery store is typically 50,000–100,000 sqft. This suggests the system would require 15,000–30,000 cameras. Each camera produces 30 frames per second of RGB-Depth data.
  • Algorithmic Complexity:
Figure 5: Our prediction of the Amazon Go computer vision pipeline: from pixels to person tracks and actions, paired with scale data to infer which product was taken by which track.
  • Image Difference: Each frame is compared to the previous frame to determine if they are the same. If nothing has changed, then the system should reuse the results from the previous frame’s predictions. In low traffic times, this will reduce the compute required by 10x. However, in a busy store, this will not help very much at all. This simple function could be done locally on the camera by an Image Signal Processor (ISP) or by an intermediary node that performs this math on the raw pixels or the H.265 encoded camera feed. MPEG naturally does this already to a certain extent.
  • Person Detection: Each sufficiently different frame needs to be forward passed through an initial Deep Learning model to detect people. The state-of-the-art model for this task takes 70 milliseconds for a batch of 16 images (on a T4 GPU). This implies each GPU can process ~228 images per second. With 30,000 cameras producing 30 frames a second, the system needs 3,947 GPUs just to run the person detection model.
  • Caveat #1: The T4 GPU runs at 8 Teraflops, while bigger GPUs like the V100 achieve 100 Teraflops but cost 5x more. We run our analysis on T4s rather than the many other options for GPU computing.
  • Caveat #2: Image differencing and random dropping of frames can significantly reduce the amount of data needed to be forward passed. The more aggressive you make the dropping of data on the input, the more likely the system is to make a mistake. There is likely a happy middle for this that saves on compute but still meets the system’s accuracy requirements.
  • Caveat #3: If latency is not a concern, most tech companies (like Facebook) do not use GPUs for inference because of the cost described below. Instead they leverage CPUs which are cheaper per TFLOP, but are way too slow for this task (unless customers would be ok getting their receipt next week!).
  • Caveat #4: The model size and inference times can be reduced a small amount with quantization (10–20% in the best case, by our measurements). Some purport that model distillation can reduce model size and inference time by 100x without any loss in accuracy, but this has been debunked in academia.
  • Person Re-Identification: Each Person Detection event in each image is cropped out and compared to all the Person Tracks currently live in the system, either to match the new detection to an existing track or to create a new Person Track (if near the entrance). The compute requirements of the state-of-the-art model for this task scale with the number of person detection events and the number of person tracks. We ran our own experiments: with 100 person tracks in the system, each GPU could match 50 person detection events per second (on a T4). Assuming an average of 100 people in a typical grocery store at any one time, and cameras producing 30 frames a second with 4x coverage per person, the system needs 240 GPUs just for Person Re-Identification. We assume only 100 people in the store here; the number could be far greater during a lunch rush or a holiday event, in which case the system could need 5–10x that, requiring 1,000–2,000 GPUs for this task per store.
  • Action Recognition: Each person track is cropped out of the video feeds and fed into an action recognition model to detect a "grab" or "replace" event and match that event with a change-in-weight event from one of the scales. The state-of-the-art model for this task again scales with the number of person tracks and the length of the shopping trip. In our own experiments, with 100 person tracks, each with 4x coverage, and the 30 fps feed downsampled to 2 frames per second, we required another 80 GPUs (T4s). Again, we assume only 100 people in the store; during a lunch rush or a holiday event the system could need 5–10x that, requiring 400–800 more GPUs for this task per store. (The sketch after this list reproduces this arithmetic.)
  • (Potentially) Product Detection and Product Recognition: We do not include these models in our analysis since scales + action recognition should be able to solve the problem, but the pipeline may benefit from detecting items in people's hands. These models would be very similar to the Person Detection and Person Re-Identification models discussed earlier. Product Detection may even share the same trained weights as Person Detection if the tasks prove similar enough. However, Product Recognition would require a very different model than Person Re-Identification, since there are 100,000+ different products per store, and would require 500–1,000 more GPUs per store.
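To make the arithmetic in the list above concrete, here is a sketch of the per-task GPU counts. These are our estimates (T4-class GPUs, 30 fps cameras, 100 concurrent shoppers), not Amazon's actual deployment numbers:

```python
# Per-task GPU arithmetic from the list above.
CAMERAS = 30_000    # ~1 camera per 3 sqft over 100,000 sqft
FPS = 30
SHOPPERS = 100      # concurrent shoppers, normal traffic
COVERAGE = 4        # cameras with a view of each person

# Person detection: 70 ms per batch of 16 -> ~228 images/sec/GPU
detection_gpus = CAMERAS * FPS / 228                 # ≈ 3,947

# Person re-identification: ~50 detection-to-track matches/sec/GPU
reid_gpus = SHOPPERS * FPS * COVERAGE / 50           # = 240

# Action recognition: track crops downsampled to 2 fps; ~80 GPUs measured
action_gpus = 80

normal = detection_gpus + reid_gpus + action_gpus
# Only the shopper-dependent stages scale with a 10x holiday rush:
peak = detection_gpus + (reid_gpus + action_gpus) * 10
print(f"normal traffic: ≈ {normal:,.0f} GPUs")       # ≈ 4,267
print(f"holiday peak:   ≈ {peak:,.0f} GPUs")         # ≈ 7,147
```

Note that only the shopper-dependent stages scale during a rush, which is why the holiday estimate lands near 7,000 GPUs rather than 10x the normal figure.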

The GPU / camera ratio

As we discussed above, the compute power (number of GPUs) required to run today's Amazon Go system is directly proportional to (1) the amount of data the system produces (the number of cameras and their resolution) and (2) the algorithmic complexity of the processing pipeline. Given the pipeline described above, the number of GPUs required per camera can vary with the choice of algorithms, the type of GPUs, the model input resolution, how aggressively images are downsampled, the number of frames skipped (you do not always need to run every model on every frame), and the depth resolution, color resolution, and fps (frames per second) of the cameras. This is such a complex optimization problem that I am confident there are 50+ Amazon infrastructure/systems/devops engineers working solely on it today.

With that said, for the pipeline to be 99+% accurate, a Deep Learning engineer like myself would want the fps to be as high as possible (ideally without dropping any data), the input resolution to be as high as possible so the algorithm can see every available feature, and the model to have as many parameters as possible (a 1,000-layer model vs. a 50-layer model) to get that extra 1% in performance.

For this reason, assuming 30,000 cameras per 100,000 sq ft store, we estimate each store would require >3,000 GPUs during normal shopper traffic and >7,000 GPUs during holidays. This implies a Camera to GPU ratio of 10 to 1 for normal shopper traffic and 4 to 1 during holidays.

Latency

The first launch of the Amazon Go platform used local GPUs, which delivered very low latency (quick response times) and allowed the system to update the customer's app within a few seconds of the customer taking a product off the shelf.

However, Amazon quickly realized that this was an expensive feature that was not scalable and that customers did not really care about. First, sending the customer a receipt 5 to 45 minutes after the trip was over provided a fine user experience. Second, it creates a Human-In-The-Loop buffer for handling CND events. Finally, by moving all the compute to the cloud, Amazon can scale much more quickly, skirt the need for fire marshal approval in every town for what amounts to a small datacenter per store, and in the end save a lot of money.

IV. Implied Cost Today

To assess the final cost of the system, one must decide whether to use the cloud or have retailers purchase and host the GPUs themselves. We break down the cost of each approach below.

Figure 6: Implied costs of owning versus using the cloud to run the system. While owning the hardware comes with a higher initial cost, the ongoing cost is significantly cheaper compared to using the cloud.

All the details of this analysis can also be found here.

V. Implied Cost in the Future

Figure 7: Predicted annual cost of the Amazon Go platform using the cloud solution vs. the owned-compute solution vs. the baseline (cashiers). The Amazon Go platform does not break even against the baseline until after 2040.

As shown in the figure above, even with aggressive assumptions about falling compute, camera, and scale hardware costs, we estimate the system will not deliver a positive ROI in large format grocery until after 2040 compared to the baseline method of operating the front end with cashiers.

Here are the time-based assumptions that went into these predictions (a toy sketch of the projection mechanics follows the list):

  1. Jensen's Law will continue, and GPUs will increase in cost effectiveness (teraflops per dollar) by 50% a year. However, this would require Taiwan Semiconductor Manufacturing (TSM) to start building a completely new fab today, which they have not done, since the current fab cannot add any more transistors to a 12-inch wafer.
  2. The increase in teraflops per dollar will equal the increase in teraflops per kilowatt. This is likely not true, since adding transistors increases total power draw and therefore ongoing electricity costs. We ignore this for now.
  3. The cost of the cameras will come down by 10% and the scales by 5% each year. For the high resolution market, this assumption is extremely ambitious since it has not been happening historically, but if Amazon were to build millions of these, the cost would come down.
  4. The cashier baseline cost will increase by 1% a year which accounts for increases in salaries and inflation. In some parts of the US, this has been way higher. In NY and NJ for example, the minimum wage is going from ~$8/hour to $15/hour, which is much more than 1%.
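A toy sketch of how these four assumptions drive the breakeven year. The starting dollar figures below are illustrative placeholders of ours, not the actual model behind Figure 7; only the growth and decay rates come from the list above:

```python
# Illustrative starting costs (placeholders, not the Figure 7 model).
compute = 2_000_000     # annual cloud GPU cost today
cameras = 3_000_000     # annualized camera hardware cost
scales = 1_000_000      # annualized shelf-scale hardware cost
baseline = 412_000      # cashiers + inventory counting today

year = 2019
while compute + cameras + scales > baseline:
    compute /= 1.5      # assumption 1: teraflops per dollar +50%/year
    cameras *= 0.90     # assumption 3: cameras 10% cheaper each year
    scales *= 0.95      # assumption 3: scales 5% cheaper each year
    baseline *= 1.01    # assumption 4: cashier costs +1%/year
    year += 1

print(f"breakeven year: {year}")   # falls after 2040 with these placeholders
```

The dynamic worth noticing: the GPU cost collapses within a few years under Jensen's Law, so the slow-declining camera and scale hardware is what pushes breakeven past 2040.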

VI. Other Topics Worth Mentioning

  1. Weighables: There is currently no solution in Amazon Go for pay by the pound fruits, nuts, coffee, vegetables, etc.
  2. Bandwidth: The bandwidth requirements of this system dramatically exceed what is installed in stores today. Each store would have to undergo a very expensive overhaul with its Internet Service Provider (ISP) to allow for the 30+ Gbps of upload this system needs (a back-of-envelope check follows this list). In fact, stores would likely have to pay an additional ~$5,000/month to enable the upload speeds required to make this system work on the cloud, which by itself is the cost of one full-time cashier.
  3. The cloud vs. onsite compute tradeoff: Since the performance and cost effectiveness of GPUs continue to increase, it makes little sense for retailers to "lock in" a cost of compute by purchasing GPUs. Instead, they would be better served by leveraging the cloud until after 2040, so that they benefit from each year's decline in compute prices.
  4. Why use GPUs and not custom ASICs or something more scalable? One major area of research worth discussing is the use of custom ASICs per camera for person detection and person re-identification. We will not go deep into this here, but all attempts so far to build custom ASICs big enough to run the large models these systems require have failed (Movidius, Teradeep, Nervana, Coral, Nano, etc.). As evidence, there are no major edge deployments of any of these in any DL system to date. After discussing with many experts in silicon, we believe the custom ASIC route would save on electricity but would cost 5–10x per chip and would likely not match the throughput / latency of Nvidia's GPUs.
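As promised in item 2, a back-of-envelope check of the 30+ Gbps figure, assuming roughly 1 Mbps per camera after H.265 compression (our assumption, and an optimistic one for 30 fps RGB-D feeds):

```python
# Sanity check of the 30+ Gbps upload figure.
cameras = 30_000
mbps_per_camera = 1.0   # assumed H.265 bitrate per camera (optimistic for RGB-D)
upload_gbps = cameras * mbps_per_camera / 1_000
print(f"{upload_gbps:.0f} Gbps of sustained upload")   # 30 Gbps
```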

VII. Conclusions, Discussion, and Focal’s Vision

This step function that the Amazon team has achieved belongs in the same class of technical accolades as SpaceX's drone-ship rocket landings: simply breathtaking.

However, similar to space travel, the cost is just not economical yet for the general market.

Further to that point, our 2040 prediction is only true for Amazon, not for other retailers, as Amazon has the financial and technical ability to build it themselves and the resources to withstand the losses until the economics make sense. In fact, it is Amazon investing in the solution before it makes financial sense that pulls the prediction in to 2040; if they do not, it may be 2050 or later. With Amazon making $12B a year in operating profit, Amazon can afford to lose >$1B a year on this strategy for the next 20 years if that makes the future happen ten years sooner. And after that, they will have the datasets, the DevOps efficiency of AWS, the economies of scale, and the know-how required to make it cost effective. However, Target or Tesco or any other retailer doing the same would financially cripple the company, since it would require them to invest in their own AWS equivalent or get Microsoft to give them compute at cost, which makes no sense for Azure. The heavy losses from this strategy would surely get the CEO fired by the Street in the meantime.

As for startups trying to develop this technology on their own to sell to retailers at a profit, they too could not possibly sustain the heavy losses long enough to make the solution cheap enough. I do not believe that would be possible until 2050–2060, assuming the startup wants to make >50% margins, which means charging $1,288,000/store/year. And for Walmart to roll this out, the startup would need to charge Walmart $15B a year ($1,288,000 × 11,766 stores). That would be the largest software deal of all time! I do not think Walmart would ever pay that amount per year to anyone.

We have given this a considerable amount of thought, since for 4 years every VC and retailer has asked us to go build this system. So why isn't Focal building this? Let this document serve as our answer.

We have the ability, but we just do not think there is a market for it yet. Five startups promising to build this technology have formed in the 3 years since we were founded, and they have raised 100x what we have raised. But where are they? What stores can we go in and see it working? With what revenue?

We are staying the course. It is our conclusion that the system will not beat baseline economics in large format grocery until after 2040, and that these startups are playing to get bought or will eventually have to pivot. They will die if one of those two doesn't happen first. There is absolutely no business model there.

This however does not mean there isn’t room for innovation in this space with this technology.

At Focal, we take the same class of technology (deep learning computer vision) and deploy it in very cost effective ways to instantly increase sales, decrease costs, and improve the customer experience in more scalable, low-investment ways. We have done this with 10+ major retailers, across 3 continents, with 100+ stores deployed over the last 3 years. We are rolling this technology out with retailers as we speak! (Press release on this coming out this month.)

We are building the Operating System for Brick and Mortar Retail. We deploy small, inexpensive cameras on shelves that take images once an hour and run product recognition on every SKU in the store. Integrating this data with the Inventory Management System is sufficient to automatically order, predict when to increase or decrease forecasts, optimize per-store planograms and labor schedules, and train, direct, and manage labor. And since the cameras are one hundred times less expensive and the compute required is a million times smaller, our technology delivers 5–10x ROI today.

In summary, we believe that retailers should not follow Amazon on this journey. Anyone telling you to go down this path either has an agenda or just doesn't know the information in this document. However, there is another way to leverage this technology to transform your business today.

That is what Focal is all about! More on that in our next blog post… Thank you for reading!

— Francois Chaubard, CEO Focal Systems
