Explain by Example: Designing Production Systems

Michelle Xie
May 26, 2020

The difference between a prototype and a production system is like the difference between a party and a wedding. That got me thinking about how little I know about weddings. According to brides.com, you should start planning your wedding 12 months in advance. They even have a month-by-month checklist of everything you need to do.

Now, I don’t know about you, but I wouldn’t spend 12 months planning a prototype. I might spend a couple of hours or a couple of days. A production system, like a wedding, takes much longer to plan, design, and build. It might not need 12 months, but it will definitely need months.

Highly Available

For starters, if you were planning a wedding, you would have to make sure everything is available.

Things like:

  • Is the venue available for the wedding day?
  • Will the food, drinks, and wedding cake arrive on time for the wedding day?
  • Will my makeup artist, hair stylist, photographer, musicians, officiant, and fiancé show up on the wedding day?

In the production systems world, this is what being highly available means: when people show up to use your system, they expect it to be fully functional, just as guests expect the wedding to be. It wouldn’t be a wedding if the guests had to bring their own food and drinks or provide their own entertainment.

For example, what if the catering company you booked cannot cater for the number of guests attending the wedding?

You could book multiple catering companies as backup, so that if the first one cannot meet demand, the extra catering workload is load balanced across the others.

Including components such as load balancers in your production system design helps keep your services highly available, and if you are designing this wedding (or production system) in the cloud, there are several load balancing options to choose from.
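
To make the catering analogy concrete, here is a minimal round-robin load balancer sketch in Python. The `RoundRobinBalancer` class and the backend names are hypothetical, and real cloud load balancers add health probes, weighting, and session handling on top of this. The sketch only shows requests being handed to backends in turn while skipping any that are marked as down.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Hand out backends in turn, skipping any that are marked unhealthy."""

    def __init__(self, backends: list[str]) -> None:
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend: str) -> None:
        self.healthy.discard(backend)

    def mark_up(self, backend: str) -> None:
        self.healthy.add(backend)

    def next_backend(self) -> str:
        # Check each backend at most once per request before giving up.
        for _ in range(len(self.backends)):
            candidate = next(self._ring)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


lb = RoundRobinBalancer(["caterer-a", "caterer-b", "caterer-c"])
first_round = [lb.next_backend() for _ in range(4)]
lb.mark_down("caterer-b")  # caterer B cannot take any more orders
second_round = [lb.next_backend() for _ in range(3)]
print(first_round)
print(second_round)
```

Because the balancer skips an unhealthy backend instead of failing, losing one caterer reduces capacity rather than taking the whole service down.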

Fault Tolerant

What happens in the case that something becomes unavailable? Can the wedding still run without it?

Let’s say on the wedding day, the photographer becomes terribly ill. Will you cancel the entire wedding because you won’t be able to get the perfect photo to post to Instagram or the perfect video to post to YouTube?

Some people might argue that you absolutely cannot run a wedding without a photographer, but they might change their minds if you removed all the Instagram servers, YouTube servers, and printers from the world.

In other words, you can still run the wedding without a photographer. Some guests might film or photograph it on their smartphones. It would be much nicer to have the photographer, but the wedding would not fail just because one photographer didn’t show up.

When designing production systems, you want your services to be fault tolerant: if one component of the system fails, it should not bring down the entire system.
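
In code, one common way to get this behaviour is graceful degradation. You wrap a dependency that is allowed to fail and fall back to a lower-quality alternative when it does. The Python sketch below is a minimal illustration. `book_photographer`, `guest_phone_photos`, and `with_fallback` are hypothetical names, and only connection and timeout errors are treated as recoverable.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("wedding")


def book_photographer() -> list[str]:
    # Hypothetical dependency that fails on the day.
    raise ConnectionError("photographer is ill")


def guest_phone_photos() -> list[str]:
    # Hypothetical degraded alternative: lower quality, but available.
    return ["phone_photo_1.jpg", "phone_photo_2.jpg"]


def with_fallback(primary, fallback):
    """Call primary; on a connectivity failure, log it and use fallback instead."""
    try:
        return primary()
    except (ConnectionError, TimeoutError) as exc:
        log.warning("primary failed (%s); falling back", exc)
        return fallback()


photos = with_fallback(book_photographer, guest_phone_photos)
print(photos)
```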

Disaster Recovery

OK, but what if the entire wedding turns into a complete disaster? This is every girl’s worst nightmare. So, as you can imagine, when production systems fall over, it is every IT team’s worst nightmare.

How can you recover from such disasters?

Disasters are unpredictable, but that doesn’t mean we shouldn’t plan for them. Disaster Recovery (DR) plans and Business Continuity Plans (BCP) describe what to do when a disaster strikes. A DR plan focuses on restoring systems and data, while a BCP covers how the business keeps operating in the meantime.

Let’s say the venue you booked for the wedding suddenly burns down. You can’t really have a wedding without a venue, so this is a disaster!

But what if you had a backup venue?

You would have to contact your guests and redirect everyone to the backup venue. That isn’t ideal, but as far as the guests are concerned, the wedding is still going ahead.
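
In system terms, the backup venue is a secondary site in an active-passive failover setup. Here is a minimal Python sketch, assuming a hypothetical `Site` with a simple health flag. In practice the flag would be driven by health checks, and the rerouting often happens through DNS or a traffic manager.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    healthy: bool = True


class FailoverRouter:
    """Active-passive failover: use the primary site unless it is down."""

    def __init__(self, primary: Site, secondary: Site) -> None:
        self.primary = primary
        self.secondary = secondary

    def route(self) -> Site:
        if self.primary.healthy:
            return self.primary
        if self.secondary.healthy:
            return self.secondary
        raise RuntimeError("both sites are down")


main_venue = Site("main-venue")
backup_venue = Site("backup-venue")
router = FailoverRouter(main_venue, backup_venue)

before = router.route().name
main_venue.healthy = False  # the main venue burns down
after = router.route().name
print(before, "->", after)
```

Guests (clients) only ever ask the router where to go. That is why they can be redirected without needing to know a disaster happened.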

Scalable

Typically, a party is much smaller than a wedding. It usually has a more specific purpose. For example, you might be celebrating a birthday, a graduation, or someone moving away. You do not have to invite the aunt you have not seen in 10 years, the cousin you have not spoken to since childhood, or a close friend who lives in another city. However, if you do not invite them to your wedding, they might get a little upset.

As you can imagine, the size of a party (the scope of a prototype) vs. the size of a wedding (the scope of a production system) varies significantly.

What if your third cousin twice removed brings a plus-one?

You did not plan for it. Would they have to stand awkwardly at the side for the entire wedding because there were not enough seats? If they did, they probably would not think very highly of your wedding, which, bear in mind, you spent 12 months planning.

So being able to scale quickly to handle both predictable and unpredictable workloads is very important. There are two ways to scale. If you can predict the increased workload (for example, your cousin RSVP’d with a plus-one), then you know you need to plan for an extra person, and this can typically be handled by manual scaling. But what if your cousin did not tell you they were bringing a plus-one? You can still handle situations like this by setting up autoscaling.

Autoscaling means the scaling happens automatically, or dynamically, in response to demand. Let’s say the venue you picked checks people in as they walk through the entrance (for safety) and keeps count of how many people have entered. As soon as the number of people exceeds the number of seats available, one of the venue helpers is notified so they can quickly bring out more seats from the storage room before the wedding starts.
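
The check-in counter is essentially a metric-driven scaling rule. Here is a minimal Python sketch, assuming a hypothetical `AutoscaledVenue` that brings seats out in fixed batches (`batch_size`) whenever the headcount exceeds capacity. Real autoscalers typically also add a cooldown period and scale back in when load drops.

```python
import math


class AutoscaledVenue:
    """Add seats automatically whenever check-ins exceed current capacity."""

    def __init__(self, seats: int, batch_size: int = 10) -> None:
        self.seats = seats
        self.batch_size = batch_size  # hypothetical: seats carried out per trip
        self.checked_in = 0
        self.alerts: list[str] = []

    def check_in(self, guests: int = 1) -> None:
        self.checked_in += guests
        if self.checked_in > self.seats:
            self._scale_out()

    def _scale_out(self) -> None:
        # Bring out enough whole batches to cover the shortfall.
        shortfall = self.checked_in - self.seats
        added = math.ceil(shortfall / self.batch_size) * self.batch_size
        self.seats += added
        self.alerts.append(f"added {added} seats, capacity now {self.seats}")


venue = AutoscaledVenue(seats=100)
venue.check_in(100)  # exactly full: no action needed
venue.check_in(1)    # the unannounced plus-one arrives
print(venue.alerts)
```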

Ok, but what about catering?

One extra person means one extra mouth to feed. You can scale vertically by telling the cooks to add more ingredients to the shared dishes they are serving, making each dish bigger so that it feeds X+1 people rather than X. Or, if the meals and drinks are served individually, you can scale horizontally by asking the cooks to make one more serving and bring out an extra wine glass and set of cutlery. Rather than growing what you already have, you add another unit of it.
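
The difference is easier to see in numbers. In this minimal Python sketch, a hypothetical `Pool` has some number of instances (dishes, or servers), each with a fixed capacity. Scaling up keeps the count the same and makes each instance bigger. Scaling out keeps the size the same and adds instances.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    """Hypothetical model: total capacity = instances x capacity per instance."""

    instances: int
    per_instance: int

    @property
    def capacity(self) -> int:
        return self.instances * self.per_instance


def scale_up(pool: Pool, demand: int) -> Pool:
    """Vertical: same number of instances, each one made bigger."""
    return Pool(pool.instances, math.ceil(demand / pool.instances))


def scale_out(pool: Pool, demand: int) -> Pool:
    """Horizontal: same size per instance, more instances."""
    return Pool(math.ceil(demand / pool.per_instance), pool.per_instance)


kitchen = Pool(instances=4, per_instance=25)  # e.g. 4 shared dishes feeding 25 each
print(scale_up(kitchen, demand=101))   # Pool(instances=4, per_instance=26)
print(scale_out(kitchen, demand=101))  # Pool(instances=5, per_instance=25)
```

Scaling up is often simpler, but it hits a ceiling because one pot (or one machine) can only get so big. Scaling out can keep growing, and it also gives you some redundancy.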

Secure

Speaking of venue check-ins, a wedding is not just any gathering, so naturally its security needs to be much more robust than a party’s.

How do you ensure things like:

  • Only guests invited to the wedding can enter the venue
  • Guests are not bringing in firearms, knives, explosives, or anything dangerous
  • Guests do not steal any valuables from the wedding

In the production systems world, this translates to:

  • Ensuring that guests who enter the venue can identify themselves and show an invite tied to their identity (authentication and authorization)
  • Being able to trust that the guests who are allowed in will not be malicious, violent, or dangerous to other guests
  • Keeping valuables such as wedding gifts locked away so they are not accessible to just anyone at the wedding

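Normally, you would protect your wedding environment with a mix of security cameras, security monitors, and security guards. In the production systems world, you would protect your environment with tools such as a directory service (for example, Active Directory) to check who is who, firewalls to control what traffic gets in, and a key vault to lock away secrets such as keys and passwords.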
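
To make the idea of an invite tied to an identity concrete, here is a minimal Python sketch using the standard library’s `hmac` module. The host signs each guest ID with a secret key. The door check then verifies the signature and confirms that the invite belongs to the person presenting it. The key is a placeholder that, in a real system, would come from a key vault, and production identity systems typically use standard token formats rather than a hand-rolled scheme like this.

```python
import hashlib
import hmac

# Placeholder only: in production, load this from a key vault, never hard-code it.
SECRET_KEY = b"hypothetical-key-from-a-key-vault"


def _sign(guest_id: str) -> str:
    return hmac.new(SECRET_KEY, guest_id.encode(), hashlib.sha256).hexdigest()


def issue_invite(guest_id: str) -> str:
    """Create an invite that can only be validated with the secret key."""
    return f"{guest_id}:{_sign(guest_id)}"


def admit(invite: str, presented_id: str) -> bool:
    """Admit only if the invite is genuine AND belongs to the person at the door."""
    guest_id, _, signature = invite.rpartition(":")
    # compare_digest avoids leaking information through timing differences.
    genuine = hmac.compare_digest(signature.encode(), _sign(guest_id).encode())
    return genuine and guest_id == presented_id


invite = issue_invite("aunt-may")
print(admit(invite, "aunt-may"))                        # True
print(admit(invite, "gatecrasher"))                     # False: someone else's invite
print(admit("gatecrasher:" + "0" * 64, "gatecrasher"))  # False: forged signature
```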

Data Management

Designing how to access and manage your data is an art in itself, and I won’t go too in-depth into it today. Think of data as the money that finances the wedding, and you’ll see why data is so important these days. If you do not set a budget outlining how much you can spend overall and how much you should spend on each part of the wedding, you might find that you accidentally spent 50% of your wedding budget on floral arrangements because you “liked all the options too much”. This definitely sounds like something I would accidentally do.

When it comes to data, there are trade-offs that you have to make just like choosing what to spend your money on. These trade-offs vary depending on the type of system or application that you are building.

If you are doing a lot of database reads, you might want to consider caching data for faster read access. A cache temporarily stores a copy of the data closer to the systems or applications that read it, so reads are faster. It also offloads read workloads from the main database. Think of it like keeping a small notebook with the contact details of every wedding service you will need to call (the bakery, the photographer, the venue, and the florist your aunt recommended) so you don’t have to bother your aunt every time you need to call the florist.
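
The notebook works like what is often called the cache-aside pattern. Here is a minimal Python sketch, assuming a hypothetical `ContactNotebook` class and made-up phone numbers. Real caches such as Redis or Memcached also need an expiry policy so that entries do not go stale forever.

```python
from typing import Callable


class ContactNotebook:
    """Cache-aside: check the notebook first and only ask the source on a miss."""

    def __init__(self, source: Callable[[str], str]) -> None:
        self.source = source              # slow but authoritative lookup (the aunt)
        self.cache: dict[str, str] = {}
        self.source_lookups = 0

    def get(self, name: str) -> str:
        if name in self.cache:
            return self.cache[name]       # hit: the source is not bothered
        self.source_lookups += 1          # miss: fetch once and remember it
        value = self.source(name)
        self.cache[name] = value
        return value


aunts_contacts = {"florist": "555-0100", "bakery": "555-0199"}  # hypothetical data
notebook = ContactNotebook(aunts_contacts.__getitem__)
for _ in range(3):
    notebook.get("florist")
print(notebook.source_lookups)  # 1: only the first call reached the aunt
```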

Now, if your system or application does a lot of writes, you probably do not want to rely on a cache.

Why not? Consider this scenario. You call the bakery 3 times to describe your perfect wedding cake.

  1. Call #1: You speak with Victoria and tell her that you would like almonds added to the cake.
  2. Call #2: You speak with Brandon and tell him to make the cake nut-free because some of the guests have said they have a nut allergy.
  3. Call #3: You speak with Victoria and ask her for a price quote for the cake.

Now, if the baker only looked at Victoria’s notes, they would add almonds to the wedding cake, which means a guest with a nut allergy would have a really bad time at the wedding. The price quote based on Victoria’s notes would also be higher than one based on Brandon’s, because of the extra ingredients involved.

So when you are designing a system or application that is write-heavy, you need to maintain a single, trusted source of truth and make sure any copies are kept in sync with it.
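
One simple way to keep copies honest is to send every write to a single primary record and have each copy check a version number before it answers. The Python sketch below is a minimal illustration of that approach. `CakeOrder` and `NotesCopy` are hypothetical names, and real databases achieve the same goal with replication and cache invalidation rather than this exact scheme.

```python
class CakeOrder:
    """The single source of truth: every change is written here first."""

    def __init__(self) -> None:
        self.spec: dict[str, str] = {}
        self.version = 0

    def update(self, **changes: str) -> None:
        self.spec.update(changes)
        self.version += 1


class NotesCopy:
    """A read copy (one staff member's notes) kept in sync with the primary."""

    def __init__(self, primary: CakeOrder) -> None:
        self.primary = primary
        self.spec: dict[str, str] = {}
        self.version = -1

    def read(self) -> dict[str, str]:
        if self.version != self.primary.version:  # stale: resync before answering
            self.spec = dict(self.primary.spec)
            self.version = self.primary.version
        return self.spec


order = CakeOrder()
victorias_notes = NotesCopy(order)
brandons_notes = NotesCopy(order)

order.update(nuts="almonds")  # call 1, taken by Victoria
order.update(nuts="none")     # call 2, taken by Brandon
print(victorias_notes.read()["nuts"])  # none
```

The trade-off is that every read now includes a version check against the primary. That is one reason caches help most when writes are rare.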

Monitoring

How do you know if you are hosting a good wedding whilst it is running?

You could probably gauge it from the guests’ reactions. The first 5 minutes of the vow exchange will be filled with emotion, tears, and cheers. However, if the vow exchange took 5 hours, the audience would become less emotional and less cheerful. Some might get extremely bored, some might fall asleep, and some might even leave.

Even if you spend 12 months planning the “perfect” wedding, the perfect wedding will not happen. Something will go wrong, which is why monitoring and continuously logging everything that takes place is important. It lets you spot issues and fix them as soon as they come up.
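
Here is a minimal Python sketch of the idea using the standard `logging` module. Each step of the day is timed and logged, and it is flagged if it runs over its time budget. The step names and budgets are hypothetical. A production system would send these logs and alerts to a monitoring service rather than keep them in a list.

```python
import logging
import time
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("wedding")
overruns: list[str] = []


@contextmanager
def monitored(step: str, budget_s: float):
    """Log when a step starts and ends, and flag it if it overruns its budget."""
    start = time.monotonic()
    log.info("start: %s", step)
    try:
        yield
    finally:
        elapsed = time.monotonic() - start
        if elapsed > budget_s:
            overruns.append(step)
            log.warning("%s overran: %.2fs (budget %.2fs)", step, elapsed, budget_s)
        else:
            log.info("end: %s after %.2fs", step, elapsed)


with monitored("vow exchange", budget_s=0.05):
    time.sleep(0.1)  # stands in for a vow exchange that runs far too long

with monitored("cake cutting", budget_s=1.0):
    pass
```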

What happens if the people at the back cannot see or hear the vow exchange? Is it fair that they should get a less enjoyable experience than the people at the front simply because they are seated further back?

Three common terms you might have heard when discussing performance are bandwidth, latency, and throughput.

What is bandwidth?

The average human speaks at about 60 decibels (dB). There is a limit to how loud we can speak. Some people have a higher limit than others, but everyone has a limit. Think of this limit as each person’s (voice) bandwidth. In networking, bandwidth is likewise the maximum amount of data a connection can carry per second.

What is latency?

Now, even though someone might be speaking at 60 dB, a person 100 metres away will hear them at a lower level than a person 10 metres away. The person 100 metres away will also hear the speaker slightly later, because the sound waves have to travel further. This lag is known as (voice) latency. Latency grows with distance because it takes longer for something to travel from a source to its destination.
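
The numbers are easy to work out. This Python sketch assumes, for illustration, that sound travels at about 343 m/s and that the 60 dB figure is measured 1 metre from the speaker in open air. Under those assumptions, the delay grows linearly with distance, while the level drops by 20 dB for every tenfold increase in distance.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # in air at roughly 20 degrees C


def delay_ms(distance_m: float) -> float:
    """Time for sound to travel distance_m, in milliseconds."""
    return distance_m / SPEED_OF_SOUND_M_S * 1000


def level_db(source_db: float, ref_m: float, distance_m: float) -> float:
    """Free-field inverse-square law: the level falls by 20*log10 of the distance ratio."""
    return source_db - 20 * math.log10(distance_m / ref_m)


for d in (10, 100):
    print(f"{d:>3} m: arrives after {delay_ms(d):.0f} ms at {level_db(60, 1, d):.0f} dB")
```

Network latency works the same way, except that the signal travels through cables at a large fraction of the speed of light instead of the speed of sound.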

What is throughput?

Throughput is how much is actually delivered, as opposed to the most that could be delivered. Even though a typical person has the bandwidth to speak louder than 60 dB, they usually won’t. Imagine two partners yelling their wedding vows at one another just so the guests at the back can hear them. That would be rather ridiculous, wouldn’t it?

The same goes for systems. Even though they might have the bandwidth to deliver at maximum capacity, their actual throughput is usually lower, because running at maximum capacity all the time is strenuous and leaves no headroom for spikes.
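
To put this into numbers, this hypothetical Python example has a link with 100 Mbit/s of bandwidth that moves a 500 MB file in 60 seconds. The throughput is the rate actually achieved, and the gap between the two figures is the headroom.

```python
def throughput_mbit_s(megabytes: float, seconds: float) -> float:
    """Data actually delivered per second, in megabits per second (1 byte = 8 bits)."""
    return megabytes * 8 / seconds


BANDWIDTH_MBIT_S = 100.0  # hypothetical link capacity: the most it could carry

observed = throughput_mbit_s(megabytes=500, seconds=60)
utilization = observed / BANDWIDTH_MBIT_S
print(f"throughput {observed:.1f} Mbit/s = {utilization:.0%} of bandwidth")
```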

When planning a wedding or designing a production system, you have to think about performance. Should you add speakers at the back and use a microphone (to reduce latency) so that the guests at the back can hear too? Should you have a live video feed at the front so that all guests can see the vow exchange? Or do you simply speak louder (increase throughput) to minimize cost?

Agile

How changeable or adaptable is this wedding?

Let’s say a worldwide pandemic hits and governments across the world impose restrictions on the number of people who can gather in one place (for those reading this years from now, this is a true story).

You have 3 options:

  1. Reduce the guest list from hundreds to tens, or even single digits
  2. Cancel the wedding
  3. Continue with the wedding but change the process or the normal way it is usually conducted

Option 3 only applies if you have the right virtual infrastructure set up to allow you to easily adapt the physical in-person wedding to an online virtual wedding.

The same applies to production systems. What happens when a change in the market requires you to add or shut down particular services? How easily can the system adapt? What about adding new components or features to the system? Do you have to make a lot of changes to allow for this, or can you simply build on top of the existing system? Having an agile system is important because most change is unpredictable, and if you do not have the flexibility to adapt, the entire system could collapse or become obsolete.
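
One way to build in that flexibility is to let new behaviour plug in alongside the old instead of rewriting it. Below is a minimal Python sketch of a registry (sometimes called a plugin pattern). The format names and functions are hypothetical.

```python
from typing import Callable

Format = Callable[[list[str]], str]

# Registry of ways to run the ceremony; new formats plug in without editing old ones.
CEREMONY_FORMATS: dict[str, Format] = {}


def ceremony_format(name: str) -> Callable[[Format], Format]:
    """Decorator that registers a ceremony format under the given name."""
    def register(fn: Format) -> Format:
        CEREMONY_FORMATS[name] = fn
        return fn
    return register


@ceremony_format("in_person")
def in_person(guests: list[str]) -> str:
    return f"seating {len(guests)} guests at the venue"


# Added later, when restrictions hit: in_person and run_ceremony stay untouched.
@ceremony_format("virtual")
def virtual(guests: list[str]) -> str:
    return f"sending a video-call link to {len(guests)} guests"


def run_ceremony(fmt: str, guests: list[str]) -> str:
    return CEREMONY_FORMATS[fmt](guests)


print(run_ceremony("virtual", ["aunt-may", "cousin-li"]))
```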

Reliable

No one likes unreliability, and that’s a fact. If you do not trust a system to handle your data securely, you would not hand over your data. If you do not trust a system to perform a task well, you would not use it. The reliability of a system ties very closely to the concepts covered above. If you design and plan your system well, it will be reliable, because you know that even when something goes wrong, there are plans and steps you can take to remediate it.
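
One way to see how the earlier ideas feed into reliability is to put numbers on them. The Python sketch below uses made-up availability figures and assumes that failures are independent. Components that must all work multiply their availabilities together, while redundant copies only fail if every copy fails at once.

```python
from math import prod


def serial(*availabilities: float) -> float:
    """Every component is required: multiply their availabilities."""
    return prod(availabilities)


def redundant(availability: float, copies: int) -> float:
    """Independent copies: the service fails only if every copy fails."""
    return 1 - (1 - availability) ** copies


venue, caterer = 0.99, 0.95  # hypothetical availabilities
without_backup = serial(venue, caterer)
with_backup = serial(venue, redundant(caterer, copies=2))
print(f"without backup caterer: {without_backup:.4f}")
print(f"with backup caterer:    {with_backup:.4f}")
```

The weakest required component still caps the total. Even with a perfect caterer, this wedding could never be more available than the venue’s 0.99.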

Take the wedding date and time. You wouldn’t send out invitations without planning ahead and first getting an indication of when most of the services are available. You might have a timeline in mind, but until you have a list of confirmations, you would not have a reliable date for guests to add to their calendars.

That’s it for now. Just remember: a bad party (prototype) is easily forgotten and forgiven, but a bad wedding (production system) will always leave a bad memory.

Author: Michelle Xie

Originally published at https://www.linkedin.com.
