Sound in Windows: WASAPI in C++

Shahid Khan
6 min read · Jan 20, 2024



Have you ever wondered how to capture audio from your microphone and play it through the speakers yourself? Doing it on the web in JavaScript is super simple, but what about doing it in a native application? In this blog, we’ll capture audio from the microphone and play it through the speakers on Windows using the Windows Audio Session API (WASAPI).

The Windows Audio Session API is an interface that Windows provides to let programmers manage audio streams. This is the interface your browser most likely uses to render and capture audio data: Chromium, the open-source project on top of which the Chrome and Edge browsers are built, uses WASAPI for audio.

I won’t explain the COM library here, as it’s a whole topic on its own; for that, I suggest reading up on the Windows COM API. If you want to go deeper on WASAPI itself, the MSDN documentation is pretty amazing.

Today, I’ll give you an introduction to this interface and hopefully, you’ll leave with something valuable.

How does it work?

Well, it all begins with what’s called an enumerator. The enumerator helps you choose the device you want to use to capture or render audio. But before we can do that, we need to do some housekeeping: initializing the COM library. Check out this example:

// A number that represents the result of a COM API call.
// This'll help us identify and deal with errors.
HRESULT hr;
IMMDeviceEnumerator* enumerator = NULL;

// Initializes the COM library
hr = CoInitializeEx(NULL, COINIT_MULTITHREADED);
assert(SUCCEEDED(hr));

hr = CoCreateInstance(
    CLSID_MMDeviceEnumerator,
    NULL,
    CLSCTX_ALL,
    IID_IMMDeviceEnumerator,
    (void**)&enumerator
);
assert(SUCCEEDED(hr));

You’ll see me using these assert statements all over the place; this is just so we catch errors immediately and don’t lose our minds for two hours debugging. In the CoCreateInstance call, the first parameter is the class ID of the object we want to create, the third is the execution context, the fourth is the ID of the interface we want on that object, and the last one is the pointer to which the call writes the address of the created object. In our case, we create an enumerator that’ll let us select a device to use.
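If you’d rather not hard-crash on failure, you could swap the asserts for a small helper like the hypothetical CHECK_HR macro below. It isn’t part of the code in this post, just a sketch of the idea; it needs <stdio.h> and <stdlib.h>.

// Hypothetical alternative to assert: print the failing HRESULT and exit.
#define CHECK_HR(expr)                                                   \
    do {                                                                 \
        HRESULT _hr = (expr);                                            \
        if (FAILED(_hr)) {                                               \
            fprintf(stderr, "%s failed with HRESULT 0x%08lX\n", #expr,   \
                    (unsigned long)_hr);                                 \
            exit(1);                                                     \
        }                                                                \
    } while (0)

// Usage: CHECK_HR(CoInitializeEx(NULL, COINIT_MULTITHREADED));

For the rest of this post I’ll stick with plain asserts to keep the snippets short.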

Next up, the enumerator interface has a method that’ll allow us to get the default recording and playback devices.

IMMDevice* recorder = NULL;
IMMDevice* renderer = NULL;

hr = enumerator->GetDefaultAudioEndpoint(eCapture, eConsole, &recorder);
assert(SUCCEEDED(hr));

hr = enumerator->GetDefaultAudioEndpoint(eRender, eConsole, &renderer);
assert(SUCCEEDED(hr));

// Release returns the remaining reference count, not an HRESULT,
// so there is nothing to assert on here.
enumerator->Release();

The eCapture and eRender enum values mean we want a recording device and a playback device respectively. The eConsole parameter just means the default system role; you can also ask for the default communications device if you have one set up. The last parameter is a pointer to the pointer to which the system will write the address of the device object. After that, we release the enumerator, as we don’t need it anymore.
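By the way, the default endpoint isn’t your only option. Before releasing the enumerator, you can ask it to list every active device and pick one yourself. The sketch below assumes you also #include <Functiondiscoverykeys_devpkey.h> for PKEY_Device_FriendlyName; it’s not part of the loopback code we’re building.

// Sketch: list all active playback endpoints by their friendly names.
IMMDeviceCollection* devices = NULL;
hr = enumerator->EnumAudioEndpoints(eRender, DEVICE_STATE_ACTIVE, &devices);
assert(SUCCEEDED(hr));

UINT count = 0;
devices->GetCount(&count);

for (UINT i = 0; i < count; i++) {
    IMMDevice* device = NULL;
    IPropertyStore* props = NULL;
    PROPVARIANT name;
    PropVariantInit(&name);

    devices->Item(i, &device);
    device->OpenPropertyStore(STGM_READ, &props);
    props->GetValue(PKEY_Device_FriendlyName, &name);

    wprintf(L"Device %u: %ls\n", i, name.pwszVal);

    PropVariantClear(&name);
    props->Release();
    device->Release();
}
devices->Release();

For this post, though, we’ll stick with the default endpoints we just grabbed.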

Next up, we activate these devices to get an IAudioClient interface through which we’ll do the actual recording and playback. There’s one more layer; we’ll get to that soon.

hr = recorder->Activate(IID_IAudioClient, CLSCTX_ALL, NULL, (void**)&recorderClient);
assert(SUCCEEDED(hr));

hr = renderer->Activate(IID_IAudioClient, CLSCTX_ALL, NULL, (void**)&renderClient);
assert(SUCCEEDED(hr));

Then, we get the Mix format. This structure specifies the sample rate, bits per sample, channels, and frame size of the audio we get from the recording device. We need to give this format to the playback interface to tell it the characteristics of the audio data to be played.

hr = recorderClient->GetMixFormat(&format);
assert(SUCCEEDED(hr));

printf("Mix format:\n");
printf(" Frame size : %d\n", format->nBlockAlign);
printf(" Channels : %d\n", format->nChannels);
printf(" Bits per second: %d\n", format->wBitsPerSample);
printf(" Sample rate: : %d\n", format->nSamplesPerSec);

After that, we need to do two more things. First, we need to initialize the IAudioClient interfaces we created and then get the corresponding services for actually recording and playing the audio.

hr = recorderClient->Initialize(AUDCLNT_SHAREMODE_SHARED, 0, 10000000, 0, format, NULL);
assert(SUCCEEDED(hr));

hr = renderClient->Initialize(AUDCLNT_SHAREMODE_SHARED, 0, 10000000, 0, format, NULL);
assert(SUCCEEDED(hr));

hr = renderClient->GetService(IID_IAudioRenderClient, (void**)&renderService);
assert(SUCCEEDED(hr));

hr = recorderClient->GetService(IID_IAudioCaptureClient, (void**)&captureService);
assert(SUCCEEDED(hr));

In the Initialize method, the first parameter specifies how we want to initialize the client. There are two ways. The first is shared mode: the audio passes through the Windows mixing engine so that multiple applications can use the same device. The second is exclusive mode; you can guess what that does. You get exclusive access to the device, but that also means no other application can use it. It has lower latency and a bunch of other nice things, but comes at that cost.

The third parameter is the buffer duration in 100-nanosecond units; the number I’ve specified means one second.

The fifth parameter is the format to use; we pass the mix format we just got from the system.
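Nothing in this post depends on it, but the magic 10000000 reads better with a name, and after Initialize you can ask the client how many frames it actually allocated. A small sketch of both, assuming the same variables as above:

// REFERENCE_TIME is in 100-nanosecond units, so one second is 10,000,000 of them.
const REFERENCE_TIME ONE_SECOND = 10000000;

hr = recorderClient->Initialize(AUDCLNT_SHAREMODE_SHARED, 0, ONE_SECOND, 0, format, NULL);
assert(SUCCEEDED(hr));

// Ask how many frames the shared buffer can actually hold.
UINT32 bufferFrames = 0;
hr = recorderClient->GetBufferSize(&bufferFrames);
assert(SUCCEEDED(hr));
printf("Capture buffer holds %u frames (about %.2f seconds)\n",
       bufferFrames, (double)bufferFrames / format->nSamplesPerSec);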

The last two GetService calls give us the services we’ll use to actually capture and render audio data. Let’s put them to work next.

UINT32 nFrames;
DWORD flags;
BYTE* captureBuffer;
BYTE* renderBuffer;

hr = recorderClient->Start();
assert(SUCCEEDED(hr));

hr = renderClient->Start();
assert(SUCCEEDED(hr));

while (true) {
    hr = captureService->GetBuffer(&captureBuffer, &nFrames, &flags, NULL, NULL);
    assert(SUCCEEDED(hr));

    hr = renderService->GetBuffer(nFrames, &renderBuffer);
    assert(SUCCEEDED(hr));

    // Copy the captured frames before releasing either buffer
    memcpy(renderBuffer, captureBuffer, format->nBlockAlign * nFrames);

    hr = renderService->ReleaseBuffer(nFrames, 0);
    assert(SUCCEEDED(hr));

    hr = captureService->ReleaseBuffer(nFrames);
    assert(SUCCEEDED(hr));
}

The first two variables are output parameters: the capture service fills them in to tell us how many frames were captured and which flags apply to them.

The captureBuffer is a pointer that will store the address of the buffer captured.

The renderBuffer is a pointer that will store the address of the buffer to which we can write our audio data. You can write anything to it; we’ll write what we just captured.

In the next four lines, we start the clients.

After starting, we begin an infinite loop; you can put your own exit condition there.

In the loop, we first get the capture buffer: the address of that buffer is written to the captureBuffer variable, and the number of frames captured and the flags are written to nFrames and flags respectively.

Then we get the buffer for the render service. It writes to the renderBuffer pointer the address of the actual buffer we can write to. Once a buffer is released, its data is sent on to the endpoint (the device in exclusive mode, the mixing engine in shared mode) and the pointer is no longer ours to use. So we first copy the captured data into the render buffer using trusty ol’ memcpy, and only then release both buffers.

You’ll notice that in memcpy the size parameter is format->nBlockAlign * nFrames. That’s because the capture service’s GetBuffer reports a count of frames, and each frame is more than one byte: its size is called the frame size, and you can read it from the format you’re using as the nBlockAlign property. For example, stereo 32-bit float audio has a frame size of 2 × 4 = 8 bytes.
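One caveat: the loop above busy-spins even when nothing new has been captured, and it ignores the flags it asked for. A slightly more careful loop body, sketched below, only copies when the capture client reports a pending packet via GetNextPacketSize and writes silence when the AUDCLNT_BUFFERFLAGS_SILENT flag is set. It’s an optional refinement, not something the rest of this post depends on.

// Sketch of a more careful loop body.
UINT32 packetFrames = 0;
hr = captureService->GetNextPacketSize(&packetFrames);
assert(SUCCEEDED(hr));

if (packetFrames > 0) {
    hr = captureService->GetBuffer(&captureBuffer, &nFrames, &flags, NULL, NULL);
    assert(SUCCEEDED(hr));

    hr = renderService->GetBuffer(nFrames, &renderBuffer);
    assert(SUCCEEDED(hr));

    if (flags & AUDCLNT_BUFFERFLAGS_SILENT) {
        // The capture client says this packet is silence, so write zeros.
        memset(renderBuffer, 0, format->nBlockAlign * nFrames);
    } else {
        memcpy(renderBuffer, captureBuffer, format->nBlockAlign * nFrames);
    }

    hr = renderService->ReleaseBuffer(nFrames, 0);
    assert(SUCCEEDED(hr));

    hr = captureService->ReleaseBuffer(nFrames);
    assert(SUCCEEDED(hr));
} else {
    Sleep(1); // nothing captured yet; don't burn a whole CPU core
}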

And with that, our loopback is done. We’re successfully looping the microphone to the speakers. In the next blog, we’ll abstract this into classes to make it easier to work with, and develop a UDP server and a relay to create a voice communication app.

To compile this, you need to link the Ole32 library and use the includes listed below. Use this command to compile with GCC: g++ loop.cpp -lole32 -o loop.exe, then run loop.exe to hear your beautiful voice. Peace :)
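If you’re building with MSVC instead of GCC, the equivalent should be something along these lines (I haven’t verified the exact invocation, so treat it as a rough guide):

cl loop.cpp ole32.lib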

Following is the whole code for reference:

#include <stdio.h>
#include <string.h>
#include <Windows.h>
#include <initguid.h>
#include <Audioclient.h>
#include <mmdeviceapi.h>
#include <assert.h>

int main() {
    HRESULT hr;
    IMMDeviceEnumerator* enumerator = NULL;
    IMMDevice* recorder = NULL;
    IMMDevice* renderer = NULL;
    IAudioClient* recorderClient = NULL;
    IAudioClient* renderClient = NULL;
    IAudioRenderClient* renderService = NULL;
    IAudioCaptureClient* captureService = NULL;
    WAVEFORMATEX* format = NULL;

    hr = CoInitializeEx(NULL, COINIT_MULTITHREADED);
    assert(SUCCEEDED(hr));

    hr = CoCreateInstance(
        CLSID_MMDeviceEnumerator,
        NULL,
        CLSCTX_ALL,
        IID_IMMDeviceEnumerator,
        (void**)&enumerator
    );
    assert(SUCCEEDED(hr));

    hr = enumerator->GetDefaultAudioEndpoint(eCapture, eConsole, &recorder);
    assert(SUCCEEDED(hr));

    hr = enumerator->GetDefaultAudioEndpoint(eRender, eConsole, &renderer);
    assert(SUCCEEDED(hr));

    // Release returns the remaining reference count, not an HRESULT
    enumerator->Release();

    hr = recorder->Activate(IID_IAudioClient, CLSCTX_ALL, NULL, (void**)&recorderClient);
    assert(SUCCEEDED(hr));

    hr = renderer->Activate(IID_IAudioClient, CLSCTX_ALL, NULL, (void**)&renderClient);
    assert(SUCCEEDED(hr));

    hr = recorderClient->GetMixFormat(&format);
    assert(SUCCEEDED(hr));

    printf("Mix format:\n");
    printf("  Frame size     : %d\n", format->nBlockAlign);
    printf("  Channels       : %d\n", format->nChannels);
    printf("  Bits per sample: %d\n", format->wBitsPerSample);
    printf("  Sample rate    : %lu\n", format->nSamplesPerSec);

    hr = recorderClient->Initialize(AUDCLNT_SHAREMODE_SHARED, 0, 10000000, 0, format, NULL);
    assert(SUCCEEDED(hr));

    hr = renderClient->Initialize(AUDCLNT_SHAREMODE_SHARED, 0, 10000000, 0, format, NULL);
    assert(SUCCEEDED(hr));

    hr = renderClient->GetService(IID_IAudioRenderClient, (void**)&renderService);
    assert(SUCCEEDED(hr));

    hr = recorderClient->GetService(IID_IAudioCaptureClient, (void**)&captureService);
    assert(SUCCEEDED(hr));

    UINT32 nFrames;
    DWORD flags;
    BYTE* captureBuffer;
    BYTE* renderBuffer;

    hr = recorderClient->Start();
    assert(SUCCEEDED(hr));

    hr = renderClient->Start();
    assert(SUCCEEDED(hr));

    while (true) {
        hr = captureService->GetBuffer(&captureBuffer, &nFrames, &flags, NULL, NULL);
        assert(SUCCEEDED(hr));

        hr = renderService->GetBuffer(nFrames, &renderBuffer);
        assert(SUCCEEDED(hr));

        // Copy the captured frames before releasing either buffer
        memcpy(renderBuffer, captureBuffer, format->nBlockAlign * nFrames);

        hr = renderService->ReleaseBuffer(nFrames, 0);
        assert(SUCCEEDED(hr));

        hr = captureService->ReleaseBuffer(nFrames);
        assert(SUCCEEDED(hr));
    }

    // This code won't be reached, but if the loop condition changes
    // you should always release the resources

    recorderClient->Stop();
    renderClient->Stop();

    captureService->Release();
    renderService->Release();

    recorderClient->Release();
    renderClient->Release();

    recorder->Release();
    renderer->Release();

    CoUninitialize();
}

