Serverless Doorbell — Ring.com and Azure Functions (Part 3)

Part 3 — Video AI with Logic Apps and Cognitive Services

This is part 3 in a series on how I have been using Azure serverless to extend and connect the functionality of my ring.com video doorbell. Part 1 is here.

Ever since I got my ring.com video doorbell installed, I’ve been connecting and extending it with serverless components. Previously I hadn’t added much new functionality that I couldn’t already get from the ring app. Today is when that all changed. Below is how I used Azure Logic Apps, Functions, and Cognitive Services to detect the faces of anyone who comes to my doorstep, grab the audio, and create an AI-powered log inside of a Cosmos DB database.

Screenshot from https://www.videoindexer.ai

Orchestrating complexity with serverless

I knew I wanted this type of facial recognition the moment I had access to the ring.com API. However, it wasn’t as easy as making a single “grab video feed” call. Here’s how the reverse-engineered ring.com API works:

  • When an event happens (motion or doorbell press), an event is added to my doorbell’s history — this is what I have pushing to Azure Event Grid in part 1.
  • That event has some info on a SIP session, but I never figured out how to use it to pull in the live feed (though their mobile app can).
  • After motion has stopped, a recording of the event is saved under the ID of the initial event and published to my ring.com account. Publishing can take anywhere from 10 seconds to 10 minutes, depending on how long someone was at my doorstep.

So the challenge was: how can I detect and analyze the video as quickly as possible? If I did something like “when a ring event occurs, fetch the recording” in an Azure Function, I would get a 404, as the recording doesn’t exist yet. At the same time, I never know exactly how long it will be until the event “completes” and the recording is published. I need some way to orchestrate and maintain state for the process, which isn’t always easy in the traditionally stateless world of serverless. Enter Azure Logic Apps.

Azure Logic Apps provided exactly the orchestration capabilities I needed to solve this rather simply. Not to mention, Logic Apps comes loaded with 200+ connectors to different APIs, reducing the amount of code I needed to write. The only code I actually had to write for this entire process was these 23 lines of Azure Functions code to grab the video recording link. #serverlessFTW
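To make the contract of that function concrete, here is a minimal sketch in Python. It is not the Node code from the repo linked at the end, and `lookup_recording_url` is a hypothetical stand-in for the call to the reverse-engineered ring.com API. The only point it illustrates is the behavior the workflow relies on: a 404 while the recording is still pending, and the URL once it exists.

```python
from typing import Callable, Optional


def get_recording(event_id: str,
                  lookup_recording_url: Callable[[str], Optional[str]]) -> dict:
    """Return an HTTP-style response for a recording lookup.

    lookup_recording_url is a hypothetical helper: it returns the recording
    URL for an event, or None while the recording has not been published.
    """
    url = lookup_recording_url(event_id)
    if url is None:
        # Recording not published yet: tell the caller to try again later.
        return {"status": 404, "body": "recording not ready"}
    return {"status": 200, "body": url}


# Usage with a fake lookup that only knows one finished recording.
fake_store = {"event-1": "https://example.invalid/recordings/event-1.mp4"}
print(get_recording("event-1", fake_store.get))
print(get_recording("event-2", fake_store.get))
```

Keeping the function this small means all the waiting and retrying lives in the workflow, not in the code.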

Azure Logic App to drive this entire process

Here’s what’s happening:

  • I used the Event Grid trigger to listen to the topic I created in part 1: whenever a motion or doorbell event occurs with my doorbell. This is why Event Grid is awesome. I can add or remove these listeners at will, and subscribers instantly get notified of the events they care about.
  • I immediately start checking for a recording. Logic Apps provides an “until” loop and will maintain the state for the conditions. If I get a 404 response from the function, I delay 15 seconds and check again.
  • I send the recording URL to the Azure Cognitive Service for Video Indexing. This is an all-in-one service that will take any video and pull out faces, detect sentiment, grab audio transcripts, and save the video after upload. All I needed was a free API key and I was off to the races with this out-of-the-box connector.
  • Next, a similar “until” loop checks the processing status of the video and keeps checking until it’s completely processed.
  • Once the video is processed, I grab the video breakdown data, and push it all into a Cosmos DB collection for future analysis and aggregation.
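For readers who think better in code than in designer screenshots, the steps above can be sketched roughly as follows. This is a Python approximation of the workflow's shape, not what Logic Apps runs under the hood. Every callable passed to `process_event` is a hypothetical stand-in (for the function, the Video Indexer connector, and the Cosmos DB connector), the document fields are made up, and the `max_attempts` cap is an assumption added so the loop cannot spin forever.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def poll_until(check: Callable[[], Optional[T]],
               delay_seconds: float = 15.0,
               max_attempts: int = 60,
               sleep: Callable[[float], None] = time.sleep) -> T:
    """Call check() until it returns a non-None value, pausing between tries."""
    for attempt in range(max_attempts):
        result = check()
        if result is not None:
            return result
        if attempt < max_attempts - 1:
            sleep(delay_seconds)
    raise TimeoutError(f"gave up after {max_attempts} attempts")


def process_event(event_id, get_recording_url, upload_video,
                  get_breakdown_if_processed, save_document,
                  sleep=time.sleep):
    """Mirror the workflow steps; every callable is a hypothetical stand-in."""
    # Wait for the recording (None plays the role of the 404 response).
    url = poll_until(lambda: get_recording_url(event_id), sleep=sleep)
    # Hand the recording to the video indexing service.
    video_id = upload_video(url)
    # Wait until indexing has finished.
    breakdown = poll_until(lambda: get_breakdown_if_processed(video_id),
                           sleep=sleep)
    # Store the result for later analysis.
    document = {"id": event_id, "videoUrl": url, "breakdown": breakdown}
    save_document(document)
    return document


# Demo with fakes: the recording shows up on the third check.
answers = iter([None, None, "https://example.invalid/event-1.mp4"])
saved = []
doc = process_event(
    "event-1",
    get_recording_url=lambda _id: next(answers),
    upload_video=lambda url: "video-1",
    get_breakdown_if_processed=lambda video_id: {"faces": []},
    save_document=saved.append,
    sleep=lambda seconds: None,  # skip real waiting in the demo
)
print(doc["videoUrl"])
```

The difference in the real thing is that Logic Apps persists the loop state between checks, so nothing has to stay running (or billed) while the workflow waits.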

Now, I know Logic Apps better than the average developer, but even so this entire logic app took only about 45 minutes to build. 23 lines of code, 3 out-of-the-box connectors, and 1 workflow later, I get rich insights into the videos captured by my IoT doorbell. How much do I have to pay for this science fiction fantasy come to life? Assuming I get 3 visitors a day, it’d come to about $0.18 a month. All told, my entire serverless project to this point is less than $2 a month.
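Working backwards from that estimate gives a feel for the cost per visitor. The 3 visitors a day and $0.18 a month are the figures above; the 30-day month is an assumption, and no actual Azure pricing is used here.

```python
visitors_per_day = 3
days_per_month = 30          # assumption: a 30-day month
monthly_cost_usd = 0.18      # the estimate quoted above

runs_per_month = visitors_per_day * days_per_month
cost_per_run_usd = monthly_cost_usd / runs_per_month

print(runs_per_month)               # 90
print(round(cost_per_run_usd, 4))   # 0.002
```

In other words, roughly a fifth of a cent per visitor analyzed.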

Making my doorbell detection smarter

Now that my videos are automatically being uploaded and analyzed into my Cosmos DB account and Video Indexer profile, what more can I do? Well, first off, when the initial videos started coming in, all of the faces were “unknown.” Unfortunately no celebrities have been visiting, so the cognitive service doesn’t recognize any of the faces. No worries though: I can “train” my profile and teach it about the faces it is seeing. After opening a recording in the Video Indexer portal, I can replace the label “Unknown Person” with their name. Next time they show up in a video, Video Indexer recognizes them and correctly assigns a name. This same type of learning can be applied to the audio transcripts as well, so I can start to teach my profile about the words it may expect to hear on my doorstep to get more accurate results. The results aren’t perfect (sometimes the quality of video or the angle isn’t quite good enough to pick up a face), but it works often enough that I’m very happy with it.
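Once faces have names, the stored documents become something you can count. Here is a minimal sketch of that kind of aggregation in Python. The document shape (a `faces` list of objects with a `name`) is a simplified, hypothetical one; the real Video Indexer breakdown is richer and its field names may differ. Treating any label that starts with “Unknown” as unnamed is also an assumption.

```python
from collections import Counter


def count_known_visitors(documents, unknown_prefix="Unknown"):
    """Count in how many recordings each named face appears."""
    counts = Counter()
    for doc in documents:
        # Use a set so a face seen several times in one video counts once.
        names = {face["name"] for face in doc.get("faces", [])}
        named = {name for name in names if not name.startswith(unknown_prefix)}
        counts.update(named)
    return counts


# Hypothetical documents, as they might be read back from the collection.
docs = [
    {"id": "event-1", "faces": [{"name": "Alice"}, {"name": "Unknown Person"}]},
    {"id": "event-2", "faces": [{"name": "Alice"}, {"name": "Alice"}]},
    {"id": "event-3", "faces": []},
]
print(dict(count_known_visitors(docs)))  # {'Alice': 2}
```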

You may be asking: “Does anyone really need an in-depth historic analysis of the patterns of faces that appear on their doorstep every month?” Maybe not. But for less than a cup of coffee a month, why not create my own doorstep version of HAL 9000 so my home may one day turn against me?

I’m sorry Jeff… I’m afraid I can’t do that

If you’re interested in building something similar yourself, all the code is in my GitHub account. Enjoy! https://github.com/jeffhollan/functions-node-ring-doorbell