FStop.fm — Technical Challenges

FStop.fm is a niche social network I built last year to connect photographers with models. It’s essentially a mashup of Tinder and Instagram with features that cater to visual artists looking to network with one another. I wrote another post about some of the lessons I learned bringing FStop to market; this post will detail some of the technical challenges I faced.


Scaling on Azure

FStop lives in Azure as an App Service. When it received press coverage, the accompanying spike in traffic brought my little web app down. This was because I hadn’t configured any kind of autoscaling on my lower-tier app service instance.

Azure makes it ridiculously easy to scale out your app service. I set up a few rules based on CPU and Memory consumption and we were back online within minutes. The core configuration looks something like this, which was achievable within a few clicks:

Configuring scale out in the Azure portal

Note that this is just the app service. FStop is powered by Azure SQL which scales just as easily within a few clicks in the Azure portal.


Troubleshooting App Service Outages

The FStop web app would occasionally go down for minutes at a time. I initially thought that this was just the nature of the lower App Service tiers, but it turns out the app was actually crashing on a regular basis.

To diagnose this, I used an App Service Extension called “Crash Diagnoser”.

After adding this extension and navigating to it from within the portal, you can configure the extension to capture dumps from the w3wp.exe process — the IIS worker process that keeps your web app responsive.

This revealed that email notifications were causing the crashes. For some reason, the .NET SMTP client was choking on O365 and exhausting ports, which crashed the app. Instead of sinking hours into troubleshooting, I decided to kill two birds with one stone and replace the SmtpClient with a SendGrid implementation, something that was on my backlog anyway and that also addressed performance and scale concerns.


Continuous, Seamless Deployments

FStop’s DevOps pipeline looks something like this:

  1. I check in code.
  2. TFS Build runs and then unit/integration tests run.
  3. The app is deployed to the staging slot of the App Service which has a connection string pointed at a mirror of the production database.
  4. A full suite of Selenium UI tests run.
  5. Production database is mirrored, staging is promoted to production, and any necessary database migrations are performed.
  6. Selenium smoke tests run against production, and the deployment is rolled back if anything goes awry.

This allows me to push to production as often as I want without any concern for interrupting anyone’s experience. Swapping production and staging is what makes the deployment effectively seamless and, worst case, I can restore the mirror of the production database and swap the slots back if something goes wrong.

Some gotchas:

  • You may need to set your machine key in your web.config to support truly seamless swaps when you’re using any kind of SignalR backplane.
  • You may need to implement a custom warm-up action in your web.config to ensure a truly seamless swap.

Azure SQL Replication / Disaster Recovery

One afternoon I started receiving emails and complaints on social media that FStop was down on all platforms. My uptime monitoring service started blowing up my inbox as well.

It turns out the Azure datacenter in the region I had selected (US East) was experiencing downtime. It lasted for hours and there was nothing I could do. This was super stressful as it happened right after the launch of our mobile app. The timing couldn’t have been worse.

I hadn’t invested any time in mitigating that risk because I’d never experienced it before. There are many ways to address the need for high availability, but I just needed a low-cost mitigation plan; I wasn’t operating a high-frequency trading platform or medical telemetry solution here. This was just a social network whose downtime would, worst case, inconvenience some folks. I decided to spin up a failover database via geo-replication that I could route requests to should this ever happen again.


Scaling App Services with SignalR

FStop’s messaging and notifications depend on SignalR for their real-time functionality.

Once FStop scaled out to two App Service instances, my real-time messaging tests started failing. What was really interesting is that almost exactly 50% of sent messages never reached the target browser instance. I would send ten messages from Browser A to Browser B; Browser B would only receive five or six of them.

This was caused by the fact that if Browser A establishes a connection with Instance 1 of my web app and Browser B establishes a connection with Instance 2 of my web app, SignalR won’t be able to synchronize this messaging without some help. This is where a SignalR backplane comes into play. It sounds simple, but it took a few hours of me pulling my hair out before I recognized the correlation between 50% failure and two instances at scale.

Within a couple lines of code and a couple clicks in the Azure Portal I was able to roll out a SignalR backplane on Redis to solve this problem, get my tests passing, and get my code deploying.
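
To make the 50% symptom concrete, here is a toy model of the problem. It is not SignalR code: the Instance and Backplane classes are hypothetical, and it assumes the load balancer alternates send requests evenly between the two instances (a real one distributes them less neatly, which would explain "five or six" rather than exactly five).

```javascript
// Toy model only: each "instance" knows about its own connections and nothing else.
class Instance {
    constructor(backplane) {
        this.connections = new Map(); // userId -> inbox (array of received messages)
        this.backplane = backplane;
        // With a backplane, every instance hears about every published message.
        if (backplane) backplane.subscribe((userId, msg) => this.deliverLocal(userId, msg));
    }
    connect(userId) {
        const inbox = [];
        this.connections.set(userId, inbox);
        return inbox;
    }
    deliverLocal(userId, msg) {
        const inbox = this.connections.get(userId);
        if (inbox) inbox.push(msg); // dropped if the user is connected to another instance
    }
    send(userId, msg) {
        if (this.backplane) this.backplane.publish(userId, msg);
        else this.deliverLocal(userId, msg);
    }
}

// Minimal publish/subscribe hub standing in for the shared Redis channel.
class Backplane {
    constructor() { this.subscribers = []; }
    subscribe(handler) { this.subscribers.push(handler); }
    publish(userId, msg) { this.subscribers.forEach((handler) => handler(userId, msg)); }
}

function simulate(useBackplane, messageCount) {
    const backplane = useBackplane ? new Backplane() : null;
    const instances = [new Instance(backplane), new Instance(backplane)];
    const inboxB = instances[1].connect("browserB"); // Browser B lives on Instance 2
    for (let i = 0; i < messageCount; i++) {
        // Assumption: send requests alternate between the two instances.
        instances[i % 2].send("browserB", `message ${i}`);
    }
    return inboxB.length;
}

console.log(simulate(false, 10)); // 5
console.log(simulate(true, 10));  // 10
```

Without the shared channel, only the sends that happen to land on Browser B's instance are delivered; with it, every instance sees every message and the one holding the connection delivers it.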


Building Broadcasts

Users of FStop wanted to be able to send “casting calls” to other users. The way this would work is as follows:

  1. John, a photographer, has to staff a shoot that requires two pale blonde models.
  2. John opens the FStop app and configures his search filter accordingly: Caucasian, tall, blonde, within 30 miles of Downtown Los Angeles.
  3. John taps the “Broadcast” icon and composes his message. He selects the number of users to message (capped at 50) and taps Submit. He then receives a confirmation that this broadcast will soon be reviewed and approved by an FStop staff member.
  4. An FStop staff member sees a queue of pending broadcasts and approves/rejects as needed.
  5. Upon approval, the casting call is sent to 50 models that match John’s search criteria. The broadcast is also posted on the FStop Feed.

It’s a simple set of requirements but figuring out how to facilitate it at scale and low cost was interesting. The overall architecture looks like this:

It’s essentially a queuing mechanism that gets processed by an Azure Webjob. I then created a little dashboard in the FStop admin panel that lets us approve and monitor casting calls:

Note that the dollars there reflect the future intent of monetizing based on exposure — the more people you broadcast to, the more you pay. This hasn’t been rolled out.
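
The approve-then-fan-out flow described above can be sketched as a small state machine. This is an in-memory illustration only, not the actual WebJob: createBroadcast, review, processQueue, findMatches, and sendMessage are all hypothetical names, and posting to the FStop Feed is left out.

```javascript
const MAX_RECIPIENTS = 50; // cap taken from the requirements above

// A new broadcast always starts out pending review.
function createBroadcast(senderId, message, requestedCount) {
    return {
        senderId,
        message,
        recipientCount: Math.min(requestedCount, MAX_RECIPIENTS),
        status: "pending",
    };
}

// Staff decision: a broadcast can only be reviewed once.
function review(broadcast, approved) {
    if (broadcast.status !== "pending") throw new Error("already reviewed");
    broadcast.status = approved ? "approved" : "rejected";
    return broadcast;
}

// One pass of the background job: fan out approved broadcasts, skip everything else.
function processQueue(queue, findMatches, sendMessage) {
    let sent = 0;
    for (const broadcast of queue) {
        if (broadcast.status !== "approved") continue;
        const recipients = findMatches(broadcast).slice(0, broadcast.recipientCount);
        recipients.forEach((userId) => sendMessage(userId, broadcast.message));
        broadcast.status = "sent"; // prevents a second delivery on the next pass
        sent += recipients.length;
    }
    return sent;
}

// Usage: one approved and one still-pending broadcast.
const queue = [createBroadcast("john", "Casting call", 80), createBroadcast("amy", "Test shoot", 10)];
review(queue[0], true);
const outbox = [];
const allModels = Array.from({ length: 200 }, (_, i) => `model${i}`);
const delivered = processQueue(queue, () => allModels, (userId) => outbox.push(userId));
console.log(delivered); // 50
```

Marking a broadcast as sent before the next pass is what keeps a retried job from messaging the same models twice; a real queue would need that update to be durable.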


Cache-busting the SPA

FStop was built as a single-page application using DurandalJS. It’s bundled and minified as two payloads, the SPA bits and the vendor bits.

This is made possible by ASP.NET’s out-of-the-box bundling. But what if John loads up FStop.fm and then I push an update? He’ll have the stale bits unless he refreshes. I solved this by appending a “BuildVersion” header to my responses:

With this information, the client just needs to cache its initial version on load and then inspect the BuildVersion header on subsequent server responses to determine whether or not the user should be prompted to reload the app:

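// Prompt the user at most once when the server reports a different build.
if (!warned) {
    var newVersionNumber = response.getResponseHeader("BuildVersion");
    // A missing header should not be mistaken for a new version.
    if (newVersionNumber && newVersionNumber !== cachedVersionNumber) {
        warned = true;
        doSomething(); // e.g. prompt the user to reload the app
    }
}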
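
For completeness, here is a self-contained sketch that also covers the "cache its initial version on load" step. The createVersionWatcher helper and its callback are hypothetical names for illustration; only the BuildVersion header comes from the approach described above.

```javascript
// Remembers the first BuildVersion it sees, then reports a change exactly once.
function createVersionWatcher(onNewVersion) {
    let cachedVersionNumber = null;
    let warned = false;
    return function inspect(headerValue) {
        if (!headerValue) return;            // header missing: nothing to compare
        if (cachedVersionNumber === null) {  // first response: cache the initial version
            cachedVersionNumber = headerValue;
            return;
        }
        if (!warned && headerValue !== cachedVersionNumber) {
            warned = true;                   // never nag the user twice
            onNewVersion(headerValue);
        }
    };
}

// Usage: feed it the header value from every server response.
const inspectVersion = createVersionWatcher((version) => {
    console.log(`Version ${version} is available, please reload.`);
});
inspectVersion("1.0.0"); // cached as the initial version
inspectVersion("1.0.1"); // triggers the callback
```

Keeping the state inside a closure means the check can be called from wherever responses are handled without relying on global variables.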

Proximity Searching

Given a thousand users with locations stored as latitude/longitude coordinates, how do you go about returning results for someone who wants to see users within a 50 mile radius of their coordinates?

Calculating proximity on the fly is tough to do quickly, especially with tens or hundreds of thousands of users.

Step 1: Store their location as Lat/Long

You need to get your hands on lat/long coordinates. I do this on the client: the mobile apps derive them from GPS or manually entered zip codes, and the web app derives them through a call to Google’s Geocoding API.

Step 2: Look at how awesome this is

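Picture the search radius as a red circle drawn inside a blue square, its bounding box. Searching a large dataset for coordinates that fall within the red circle is expensive because every row needs a distance calculation. Searching for coordinates within the blue box is much more affordable because it only takes simple comparisons on latitude and longitude, and the expensive circle check then only has to run on the rows that survive the box. If performance is more important than precision, you could even just search for the blue box. The code I use to achieve this is below: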

var dapperQuery = $"SELECT {userColumns} FROM [USER] WHERE {filterCriteria}";
// Append the location filters to the base query.
dapperQuery = $"{dapperQuery} AND " +
    "([LocationLong] > @longMin AND [LocationLong] < @longMax AND [LocationLat] > @latMin AND [LocationLat] < @latMax) AND " + // FIRST LINE: cheap bounding-box filter
    "((geography::STPointFromText([LocationPoint], 4326).STDistance(@userPoint)) <= @searchRadius)"; // SECOND LINE: exact distance check

This was one of the issues that killed the user experience once I went from 500 to 5K users the day FStop started receiving press coverage. Boxing the query like this led to a dramatic performance improvement.
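
The box parameters in that query (@latMin, @latMax, @longMin, @longMax) have to be computed from the searcher's coordinates and radius. A minimal sketch of one way to do that follows, assuming a spherical Earth and ignoring the poles and the 180 degree meridian; the boundingBox function and its constants are illustrative, not taken from the FStop codebase.

```javascript
const EARTH_RADIUS_MILES = 3958.8; // mean Earth radius
const MILES_TO_METERS = 1609.344;

// Returns the lat/long box that encloses a circle of radiusMiles around (lat, long).
function boundingBox(lat, long, radiusMiles) {
    const toRadians = Math.PI / 180;
    const angularRadius = radiusMiles / EARTH_RADIUS_MILES; // radius as an angle, in radians
    const latDelta = angularRadius / toRadians;
    // Lines of longitude converge away from the equator, so the box must be wider in degrees.
    const longDelta = Math.asin(Math.sin(angularRadius) / Math.cos(lat * toRadians)) / toRadians;
    return {
        latMin: lat - latDelta,
        latMax: lat + latDelta,
        longMin: long - longDelta,
        longMax: long + longDelta,
    };
}

// Usage: a 30 mile search around Downtown Los Angeles.
const box = boundingBox(34.05, -118.25, 30);
// STDistance on a geography with SRID 4326 reports meters, so convert the radius too.
const searchRadius = 30 * MILES_TO_METERS;
console.log(box, searchRadius);
```

The box comparison only pays off if the latitude and longitude columns are indexed, which is an assumption about the schema rather than something stated above.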

Denormalizing Likes

The FStop Feed lets users “like” posts:

But the feed sorts posts with a complicated algorithm that takes into account age, likes over time, comment activity, and the reputation of the user submitting.

Performing a join between Posts and Likes would likely slow down the feed page, especially as thousands of posts and likes are handled. So I decided to denormalize these bits and store Likes right on the Posts table in the form of UserIdsUpvoted. This tells me how many likes any post has along with the users who liked it. In my opinion, denormalizing your domain model is perfectly fine when it’s done in the name of performance.
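
To illustrate the idea, here is a sketch of working with such a column. The storage format is an assumption: the post does not say how UserIdsUpvoted is encoded, so this treats it as a comma-separated string of user ids, and the helper names are hypothetical.

```javascript
// Assumption: UserIdsUpvoted is a comma-separated string of user ids on the post row.
function parseUpvotes(userIdsUpvoted) {
    return userIdsUpvoted ? userIdsUpvoted.split(",") : [];
}

// Returns a copy of the post with the user added; liking twice has no extra effect.
function addLike(post, userId) {
    const ids = new Set(parseUpvotes(post.UserIdsUpvoted));
    ids.add(String(userId));
    return { ...post, UserIdsUpvoted: Array.from(ids).join(",") };
}

// The like count falls out of the same column, with no join required.
function likeCount(post) {
    return parseUpvotes(post.UserIdsUpvoted).length;
}

function hasLiked(post, userId) {
    return parseUpvotes(post.UserIdsUpvoted).includes(String(userId));
}

// Usage
let post = { Id: 1, UserIdsUpvoted: "" };
post = addLike(post, 7);
post = addLike(post, 9);
post = addLike(post, 7); // duplicate like is ignored
console.log(likeCount(post), hasLiked(post, 9)); // 2 true
```

The trade-off is that the column grows with a post's popularity and two simultaneous likes can overwrite each other unless the update is made atomic, so this fits best when reads far outnumber writes.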


Thanks for reading!

Mick