Audio-First: Becoming the Next Player in Social Media

Published in The Startup, Sep 6, 2020 (10 min read)

Last week, Product Tienda welcomed Midas Kwant, founder of Rodeo, an audio-first social app. In this piece, we captured that conversation while zooming out to cover the recent rise of audio.

By Eshita Nandini & Abhi

As we near our seventh month of quarantine, limited physical interaction has started to take a toll on Americans. Early on, we turned to video-conferencing software to keep in touch with friends and family, but Zoom fatigue set in quickly: today, a mere 30-minute call tends to exhaust us, when months ago we enthusiastically attended two-hour happy hours.

There has been an uptick in audio platforms as they race to replace in-person interactions. As new solutions flood the space, it’s worth taking a step back to look at the evolution of social media to understand how we got here and where we’re going. With the future still uncertain, new types of digital interaction aren’t just a novelty; they’re a necessity. At first, video seemed the obvious choice to replace in-person gatherings, but as time goes on, we’ve noticed people creating and using more audio-first apps.

An Overview of Social Media and the Unbundling of Media

The evolution of media, and specifically of social media, is a story of bundling and unbundling. Namely, we’ve cycled through phases of decentralization and centralization of the creation, distribution, consumption, and monetization of information.

The start of social media can be traced back to the age of mass media in the latter half of the 20th century, which was characterized by large media organizations serving as the primary creators of content. In this time, people like Walter Cronkite gained fame by broadcasting content to nearly 30 million Americans every night. This era was followed by new media and then social media starting in the early 2000s. Social media enabled user-generated content (UGC), which created an intimate experience where everyone could be a consumer and a creator.

Facebook and other platforms began unbundling mass media by allowing users to share articles, music, and video directly on their personal profiles. Being able to broadcast content directly to friends and family allowed for a feedback loop to form between creators and consumers and the ability to form community online. So while consumers became less reliant on central entities to provide information, algorithms leveraged these feedback loops to personalize content for both creators and consumers. These algorithms gave every user a unique version of Facebook; so while there was only one version of CBS Evening News in 1962, there were a billion versions of Facebook in 2012.

As social media usage grew, it became possible to both serve a niche interest and build a massive audience. UGC became professionalized and the experience became much less intimate. Soon, people were doing things on social media that couldn’t have been imagined in traditional media. Mass social media became defined by virality and virality became commonplace.

For the first 10–15 years of the millennium, as social media became the gathering place for the world at large, the technology of intimate, synchronous conversations stagnated. From the early days of social media, text-based communication dominated all synchronous and intimate interactions. People were texting each other in 2013 much as they were in 2003, just through different platforms. This paved the way for new technologies to emerge and enable intimate, sometimes synchronous conversations, such as Snapchat, which launched in 2011.

We expected phone calls to become obsolete, since there was innovation in every medium except audio, but phone calls have increased since the start of the pandemic. According to the New York Times, “Verizon said it was now handling an average of 800 million wireless calls a day during the week, more than double the number made on Mother’s Day, historically one of the busiest call days of the year…In contrast, internet traffic is up around 20 percent to 25 percent from typical daily patterns.”

The rise is stunning given how voice calls have long been on the decline. New needs are emerging in the crisis. “We’ve become a nation that calls like never before,” said Jessica Rosenworcel, a commissioner at the Federal Communications Commission, the agency that oversees phone, television, and internet providers. “We are craving human voice.”

The Nature of Audio and the Airpod Shift

So, why was audio left behind in the greatest and fastest upheaval of media since the printing press?

In our conversation with Midas, we discussed the nature of audio itself and the ways people were predisposed to use it. First, phone calls had deeply embedded the idea that audio’s primary role was synchronous, 1-to-1 communication. Second, users can’t skim audio in the same way that they can skim text or quickly glance at images/videos. This made audio a bad match for mass social media where users became used to scrolling through and briefly looking at a large quantity of content.

For these reasons, most progress in social audio at the start of new media centered on long-form podcasts. However, the rollout of Airpods in 2016 was viewed as a potential tipping point for social audio. At the time, Josh Constine predicted that since Airpods remove friction and enable “always in” headphones, there would be a rise of snackable audio. In 2019 alone, there were 60 million Airpods sold.

Midas compared the impact of Airpods to what the front-facing camera did for selfie culture. The introduction of front-facing cameras made selfies the “lingua franca for millennials.” This new hardware changed our roles as users: “By turning the camera on oneself, the user became the model, the photographer, the art director, the image retoucher, and the publisher of their image.”

Airpods will not only expedite innovation in audio, but they may enable companies to build new use cases and change the way users create, distribute, consume, and monetize audio content.

The Social Audio-First Approach to Replacing In-Person Interactions

Let’s take a brief look at in-person interactions and how these are emulated in audio-first apps.

At a high level, interactions can be divided into 1:1 and group interactions. The dynamics of the conversation differ greatly depending on which type of interaction we have.

In 1:1 conversations, the participants can be close in relationship, mere acquaintances, or strangers. In almost all of these situations, conversations can take place in both asynchronous and synchronous settings. This could mean voice memos or hopping on a phone call to chat. Although 1:1 conversations do require coordinating a time to speak or waiting for a response, there is relatively little complexity since only two schedules need to be synced.

Group interactions become a little more complicated as there are more than two schedules. However, some in-person group interactions don’t rely on coordinating ahead of time and instead rely on members indicating they are free. Imagine, for example, the college experience of sitting in a common area when you have free time and coming across a friend who was doing the same.

Additionally, the size of the group matters and can transform a casual conversation into a panel, where only a handful of people dominate the conversation. Recreating group interactions on a platform requires a bit more thought around appropriate features to serve them.

The trade-off between liquidity and familiarity

Every social media platform struggles with the tension between liquidity and familiarity, regardless of its focus on groups vs. 1:1 or synchronous vs. asynchronous communication. Liquidity refers to the requirement that a user is able to interact with someone else roughly when they are available. If a user cannot consistently interact with someone else when they are available, a platform has low liquidity. The second aspect, familiarity, is the requirement that when a user is available, they find either someone they are already familiar with or someone they share an interest with. This interest can be a topic (baking bread) or an intention (simply wanting to meet new people).

While both of these characteristics are important, there is a tradeoff between them: if a user can consistently find someone else to interact with, but the interaction is consistently irrelevant, the platform delivers a poor experience with high liquidity and low familiarity. This tradeoff is especially important for apps that focus exclusively on synchronous interactions and especially difficult for social audio.
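To make the two terms concrete, here is a minimal sketch of how they could be measured from a log of moments when a user was available. The Session record, its fields, and both formulas are our own illustrative assumptions, not metrics that any of the apps discussed here are known to use.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Session:
    """One moment when a user was available (hypothetical record)."""
    user: str
    partner: Optional[str]   # None means nobody else was around
    familiar: bool = False   # True if the partner was a known contact or shared an interest


def liquidity(sessions: List[Session]) -> float:
    """Share of available moments in which the user found anyone at all."""
    if not sessions:
        return 0.0
    matched = sum(1 for s in sessions if s.partner is not None)
    return matched / len(sessions)


def familiarity(sessions: List[Session]) -> float:
    """Share of matched moments in which the partner was familiar or relevant."""
    matched = [s for s in sessions if s.partner is not None]
    if not matched:
        return 0.0
    return sum(1 for s in matched if s.familiar) / len(matched)


# Made-up example: the user usually finds someone, but rarely someone relevant.
sessions = [
    Session("ana", "ben"),
    Session("ana", "cy"),
    Session("ana", "dee", familiar=True),
    Session("ana", None),
]
print(liquidity(sessions), familiarity(sessions))
```

In this made-up log the user finds someone three times out of four, but only one of those three conversations is with someone relevant, which is the high-liquidity, low-familiarity experience described above.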

A synchronous, social audio platform requires spending time and effort to speak with another person or group, which is more taxing than texting synchronously or replying to a tweet asynchronously.

This tradeoff has a slightly different dynamic in a group setting, depending on the relationships between participants. A group can be characterized as being strong-tied or weak-tied: a weak-tied group includes people who gather and might not have pre-existing ties to each other, whereas a strong-tied group hosts people who have existing, close relationships with each other.

In a strong-tied group, a higher level of familiarity usually allows for smoother, constantly evolving conversations. In a weak-tied group, new perspectives are more likely to be introduced and new relationships are more likely to form. However, an app that focuses on strong-tied groups has an upper limit on liquidity because a person’s network is only so large.

So, how are apps generally addressing the above?

The evolution of social media has created dominant platforms like Facebook and Twitter that allow for conversations with many people at once, but are restricted by the medium of text or specifically by character counts. To fulfill the need for high-context, low-latency conversation between a small number of people, audio-first applications make sense.

Rodeo, for example, is a synchronous audio app that lets a user indicate when they are free, so that available people from their pre-existing, strong-tied groups can jump in. Rodeo is reversing the dynamic where every group interaction requires coordinating schedules ahead of time. Midas notes that his goal was to make a mobile experience that captures mutual availability in friend groups. The app bootstraps usage by having people join as groups, completely separated from other groups that are using the app.
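The idea of mutual availability inside separated groups can be sketched in a few lines. This is only an illustration of the concept as described in our conversation, not Rodeo's actual implementation; the function name, the group names, and the data are hypothetical.

```python
from typing import Dict, List, Set


def available_friends(user: str, groups: Dict[str, Set[str]], free_now: Set[str]) -> List[str]:
    """Return members of the user's own groups who have marked themselves free."""
    peers: Set[str] = set()
    for members in groups.values():
        if user in members:          # only groups the user belongs to count
            peers |= members
    peers.discard(user)              # the user is not their own conversation partner
    return sorted(peers & free_now)  # free people outside the user's groups stay hidden


# Hypothetical data: two groups that never see each other.
groups = {
    "roommates": {"ana", "ben", "cy"},
    "book club": {"dee", "eli"},
}
free_now = {"ana", "cy", "dee"}

print(available_friends("ana", groups, free_now))
```

Here only cy is shown to ana: dee is also free, but belongs to a different group, which mirrors the separation between groups described above.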

This strategy is reminiscent of how Facebook onboarded one university at a time so that students would join with their existing communities and peers. This allowed for a base of liquidity and familiarity, which automatically made its experience more sticky. Later on, as it expanded beyond universities and eventually other countries, Facebook observed that to get users to stick, it just had to get any individual to 7 friends in 10 days. It had to establish initial liquidity and familiarity to create stickiness, after which point it could expand the user experience. In a similar way, as long as Rodeo’s current strategy retains users, it will have the option to open its platform in the future, allowing users to discover other groups, communicate with other users separately, etc.
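The "7 friends in 10 days" observation is essentially a threshold check on a new user's first days. A rough sketch of such a check follows, assuming we have a signup date and the dates on which friends were added; the function name, parameters, and dates are hypothetical and do not reflect how Facebook actually computed it.

```python
from datetime import date, timedelta
from typing import List


def reached_threshold(signup: date, friend_dates: List[date],
                      friends_needed: int = 7, window_days: int = 10) -> bool:
    """True if enough friends were added within the window after signup."""
    deadline = signup + timedelta(days=window_days)
    early = [d for d in friend_dates if signup <= d <= deadline]
    return len(early) >= friends_needed


signup = date(2020, 9, 1)

# Hypothetical user A: seven friends added within the first ten days.
user_a = [signup + timedelta(days=n) for n in (0, 1, 1, 2, 4, 6, 9)]
# Hypothetical user B: seven friends in total, but four arrive too late.
user_b = [signup + timedelta(days=n) for n in (0, 1, 2, 12, 15, 20, 30)]

print(reached_threshold(signup, user_a), reached_threshold(signup, user_b))
```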

Let’s also take a look at Clubhouse, an app where the user is able to enter any room and join a conversation. Since it is in private beta and invite-only, the web of connections is still pretty condensed relative to how people are connected in real life. Clubhouse doesn’t have to worry about liquidity, since its users are mostly weak-tied tech and entertainment folks who are interested in broadly similar topics and people and can therefore strike up conversations and form new connections rather easily.

Many social-audio apps feature communities that are formed outside of the app and then moved onto the platform. As innovation in audio accelerates, many questions remain: what is audio-first doing well that other social platforms cannot? Looking ahead, what are the additional opportunities for social audio?

What to Look Forward to with Audio-First/Only

The future state of social audio will be defined by significant opportunities around which users are served and how users are served.

In the short-term, we’re seeing audio-first social serving users based on relevance or familiarity: either they (1) serve niche communities with relevant, shared interests whether it be a topical interest or a functional interest (Clubhouse, QuarantineChat), or they (2) serve small groups with people that are familiar with each other (Rodeo, Chalk, Cappuccino).

While some products are mainly serving the early adopter Silicon Valley crowd, others like QuarantineChat have broken through to different demographics by serving the common shared interest of wanting to meet new people during the pandemic. Similarly, in the future, social audio will serve a broad demographic. For example, older populations who find visual interfaces frustrating can be served with audio-only or audio-first interfaces. Less educated or illiterate populations in emerging markets can be economically empowered with audio interfaces, enabling social commerce and connecting people to markets.

There will certainly be enterprise implications of social audio as well. Consumer products have consistently shaped enterprise applications in the past, since employees have a poor experience at work if they are forced to switch from using intuitive personal applications on their iPhones to using outdated, frustrating enterprise applications on old hardware. For one, as consumers become comfortable with audio in their personal lives, voice interactions will supplement and replace some of the interactions that currently happen through email or chat.

Unlocking new technology capabilities and new demographics will enable new ways to serve users. For example, while audio-centric social apps today still require some form of visual interaction, audio-only applications are increasingly possible. What happens when audio is the primary way to navigate an app? Similarly, our always-on Airpods create a platform for new forms of content. In contrast to long-form podcasts, the frictionless experience of Airpods will allow users to seamlessly consume bite-sized audio clips.

Looking back to the evolution of media, it feels like we are in an unbundling phase where different types of interactions not being served on major platforms are beginning to be served by niche applications. And like the evolution of media, it is likely that we’ll see another rebundling soon enough.

For example, the nature of Airpods and the increasing comfort with social audio generally may result in a wave of user-generated audio content on social-audio platforms. Eventually, audio-based UGC will become professionalized, either as social-audio platforms grow or as Facebook and Twitter adopt or acquire the same capabilities.

Of course, the future depends on what we do as consumers and builders in the present. Social audio seems poised to transcend the phone call, enabling a broader array of people to engage with each other in ways that better represent the range of our in-person interactions and open up new possibilities.

Product Tienda is an intimate community of product-focused folks gathering around conversation about exciting verticals + technologies.

Huge thanks to Nathan Baschez for his invaluable feedback in forming this piece!