Station 2. Stories Behind Developing One of the Most Sophisticated Yandex Devices
Yandex Station 2 is our newest device that is becoming the centerpiece of the smart home with Alice, our voice assistant. We have completely overhauled the interior and exterior design, added an LED screen on the top panel, and tried to learn from the previous generation’s experience when it came to many details that may not be immediately obvious.
I want to share the inner workings of this process. Below, you will find several stories about different aspects of hardware development: we will discuss the study of the room’s shape with microphones, the scattering of light in transparent materials, generative animations, and the unexpected benefits of FPGAs.
A Story About Sound
Initially, we planned for Station 2 to be smaller than the first Station. However, our acoustic flagship — Station Max — had already been launched by the time we started engineering Station 2. It was almost impossible for a smaller device to compete with the Max when it came to sound. To counter this, we decided to make the sound different — not directed at a person but evenly distributed throughout the room (the so-called “360-degree sound”). In addition, we sought to strike a balance between all frequencies so that users could play vocal compositions or podcasts without the annoyance of overwhelming bass.
The 360-degree mode is already used in our smaller speakers. In Station Mini, the surround feeling is achieved by directing the speaker downwards and dispersing the sound to the sides. Yet Station 2 was supposed to be a more powerful device with much better sound quality than the Mini. Early experiments demonstrated that a downward-facing speaker could not provide the quality we were looking for.
In addition, we wanted to preserve the hallmark of the first Station and Station Max — stereo sound capabilities.
As a result of many experiments and testing a variety of sound schemes, we came up with the one shown in the above illustration: two full-range speakers are directed forward and backward. On the right and left sides, they are supported by passive radiators responsible for the low frequencies.
The foremost advantage of this arrangement is that you can rotate the device however you like. The speakers can point forward and backward or left and right — the latter positioning slightly enhances the stereo effect. But herein lies the problem: few people put the speaker in the center of the room. In the real world, one of the speakers — or sometimes both — can end up close to a wall, and the sound deteriorates due to constant reflections. Two years ago, we would have given up and opted for a safer sound scheme. But conveniently, amid designing the acoustics for Station 2, we had some encouraging results while playing around with Room Correction. (I will tell you what it is and how it affects the sound in the following story.) After seeing these results, we decided to take our chances.
We had to do some tinkering with the speakers — they needed to be full-range, with the best possible frequency response, but also very compact. In addition, we discovered an interesting aspect: with this arrangement, you need to take the position of the device’s center of gravity into account. The higher the center of gravity, the less stable the device. The speakers use specially shaped magnets, additionally weighted to ensure stability. And yes, our speakers are traditionally custom-made — it is impossible to find comparable speakers on the market.
An unexpected advantage of this symmetrical layout is that high-volume sound does not give rise to excess vibrations throughout the device’s body. They tend to negatively affect everything, including the quality of voice recognition by Alice, because microphone membranes vibrate along with the device itself.
A Story About Room Correction
In their reviews for the first-gen Station and Station Max, users often pointed out that there is too much bass. Low frequencies dominated, even in instrumental tracks and orchestral music. We found that, at least partly, this perception depended on where exactly the device was placed. A small study showed that many people have the Station on a recessed shelf or niche, and some users even put it on the floor in the very corner of the room. Walls and room partitions in close proximity to the device would create acoustic barriers. The resulting changes in frequency ratio often meant precisely what our users wrote about: the overabundance of bass.
This issue can be solved with an equalizer — we launched our own in December last year. But this function is aimed at music lovers more than at the broader userbase. We wanted the Station to sound great out of the box, even if put in a closet.
On the one hand, there are solutions on the market that help you adapt the speaker system to a specific room. On the other hand, this primarily applies to multi-speaker setups. The sound quality is optimized for the listener located at a particular spot, like on a sofa in the center of the room.
To automatically adjust settings depending on the environment, the device needs to be aware of its surroundings. Unlike a self-driving car, a speaker has no lidar (a laser sensor): it cannot determine the shapes of nearby objects. What it does have are microphones.
Using Station 2’s microphones, we recorded the sounds it produced in an acoustic chamber and used them as reference. When the speaker plays the same sounds in the user’s actual environment, it gauges the subtle differences between what it “hears” and the prerecorded reference. This way, the device can get a rough understanding of how the environment distorts the sounds being played. Suppose that some frequencies are louder than the reference by 2–3 decibels: then the Station dials down the volume at these frequencies.
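The band-by-band comparison described above can be sketched roughly like this. This is a minimal illustration, not Yandex's actual algorithm: the function name, the band layout, and the 6 dB cap are all assumptions.

```python
def room_correction_gains(reference, measured, max_cut_db=6.0):
    """Per-band correction in dB: compare the in-room recording against
    the anechoic-chamber reference and attenuate bands the room boosts.

    reference, measured: per-band RMS levels in dB (same band order).
    Illustrative sketch only; names and the cap are assumptions.
    """
    gains = []
    for ref_db, meas_db in zip(reference, measured):
        # Negative gain = turn this band down; never boost (cap at 0),
        # and never cut more than max_cut_db.
        gains.append(max(min(ref_db - meas_db, 0.0), -max_cut_db))
    return gains

# The room boosts the lowest band by ~3 dB, so the speaker dials it down:
print(room_correction_gains([80.0, 80.0, 80.0], [83.0, 80.5, 79.0]))
# → [-3.0, -0.5, 0.0]
```

Note the asymmetry: the sketch only ever attenuates, mirroring the "dials down the volume" behavior described above, since boosting a frequency the room already swallows can overdrive the driver.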
At first, we tried to regulate the sound at all frequencies. We spent many weeks on tests, but in the end, we realized that the perception of the “highs” depends not just on the speaker’s physical location but also on the listener’s. We have not yet learned to detect where a person is in relation to the Station; besides, there can often be several listeners. Nevertheless, we are pleased with how the bass and part of the midrange adapt to the surrounding room.
A Story About Design
Coming up with a design for a whole new generation of a device is not an easy undertaking. It is essential that users familiar with previous Yandex Stations can see continuity in the new smart speaker and feel that it belongs to a larger family of products. On top of that, we had to develop something entirely new.
As a start, we decided to find out what exterior features people associate with Yandex Stations the most. These features were:
- Simple shapes: simpler than those of, say, the Amazon Echo.
- Lights: LEDs around the volume controls on the first Station, a glowing Alice logo on the Station Mini, etc.
- Body fabric.
All three features had to stay. Station 2 is the second device since the Yandex Module that uses our new design language, which we plan to keep on using in the future. The core property of this language is its two-dimensionality. The main part is responsible for function (in this case, it is covered up with fabric, and the function is sound), and the top part is responsible for emotion. If you look at Yandex Module or the power adapter for the new Station 2, you will notice this two-dimensional principle in action.
In Station 2, we strived for a more elegant look compared to the first generation. To achieve this elegance, we gave the base a sleeker shape. The smoothness creates a subtle illusion: it almost looks like the Station floats above the table. In one of the prototypes, the device’s sides seamlessly transitioned into the base. This prototype could tip over with a faint push, so we decided to slightly complicate the lower part while preserving the rounded corners. To compare the stability of our prototypes, we placed them on an incline and observed the angles at which the speakers would start falling over. Eventually, we improved the tipping angle by 4–5 degrees. The final design of the base is quite intricate:
Picking the suitable fabric was a challenge of its own. We wanted to take a step forward and make the station a part of the home interior: it had to look good in most environments. After experimenting with different fabrics, we realized that mélange makes for the best solution. You may be familiar with this type of fabric, as it is often seen in clothing and upholstered furniture:
With mélange, the thread is spun from fibers of different colors, and the cloth itself is woven from this thread. By adding 30% gray thread, we got a dark Anthracite color with lighter inclusions:
But when picking among earlier Stations, most customers gravitated toward the light gray option. It would not mesh well with the new speaker’s black top panel, so we came up with a Sand-colored variation instead. Cobalt blue, in turn, is practically black, but with a gentle twist — it is the kind of look people often go after.
A Story About the Top Panel
The idea of lights, which I listed among the main features shared by all Yandex Stations, is taken to an extreme with Station 2. We chose LED as the most suitable technology for our purpose: to illuminate the top panel from edge to edge. Many difficulties lay ahead, only a handful of which we had anticipated.
The concept of light scattering can occasionally be found in consumer devices, although not very often. Typically, one or more lenses are placed above the light sources. We ran experiments with different types of LEDs, varying their number and arrangement. It became clear that we would need well over 50 diodes. With so many diodes, lenses were impractical, as they would make our device too expensive. We had to develop a panel that would diffuse the light all by itself — and look gorgeous while doing it.
We tried laying out translucent pieces of plastic on top of the LED board. It was important that a single diode neither illuminate the entire panel nor stand out as a bright dot. We wanted something in between, so that the light flows smoothly between the diodes. Moreover, the picture had to be clearly visible in the sun. Glossy surfaces glare more, and matte surfaces can reduce saturation. Even the matting method plays a role.
An additional complication was equipping the panel with a central unit — an “island” with touch buttons and microphones. We considered the option of lighting the entire panel, including through the printed circuit board of the island. It was supposed to be made either from a transparent film or fiberglass so thin that it is translucent. Alas, not everything is meant to be transparent: during trials, the glowing diodes would highlight the tracks, microcircuits, resistors, and capacitors, making them visible. In the end, we decided to allocate some dedicated space for the island. We carefully chose the central unit’s size to avoid users accidentally hitting the microphones when pressing buttons. By the way, you can switch between tracks by swiping across the three buttons.
The presence of this central unit imposed certain limitations, prompting us to add extra LEDs to the Alice logo, whose light actually did need to stand out prominently. In one of the prototypes we overdid it, and several distinct dots would shine through the logo. In another mockup, the light from the diodes would trickle into the neighboring microphone openings, which cannot be completely isolated.
In the final version, the island’s components are hidden under the opaque central area of the film that covers the entire top panel. The scattering of light inside the plastic (polycarbonate, to be precise) is achieved by adding a special diffusing agent, the composition of which is almost as secret as the Coca-Cola recipe. The edges of the top panel are visible from the sides: the light reaches them, too. Even if the speaker is above eye level, it will still give you a visual hint at what action is being performed. 84 LEDs provide the backlight — this is the number we eventually settled on.
Brainstorming and implementing the glowing top was one of the lengthiest steps in the development of Station 2. Producing such complex “sandwiches” with an IML film of a sufficiently large area is by no means a trivial task. We found only one company in China that could produce panels of sufficient quality.
A Story About Generative Animations
The screen described above cannot display any specific objects or indicators. 84 LEDs may sound like a lot, but in practice, this roughly corresponds to a resolution of 7x12 pixels minus the island: less than in early color-screen phones. For comparison, even the monochrome display of the Nokia 3310 had a resolution of 84x48, tallying 4032 pixels.
But how do we show mesmerizing animations while playing music with only 84 lights? The solution came to light as a result of close cooperation between a developer and a tech-savvy designer. We represent the audio track as a raw data sequence, cut off the outlier bursts, and smooth the data out. At each moment, the renderer “sees” a frame with a set of data distributed by frequencies. You may picture it as such: the highs have one level, the mids another, and the subs a level of their own. Next, we distribute the data across the LEDs and determine which group of diodes is responsible for which frequencies. If there is a lot of sub-bass, the corresponding diodes flash with higher intensity, and so on.
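As a toy sketch of the band-to-diode mapping just described (the function name, the 0–1 loudness scale, and the band split are illustrative assumptions, not the actual renderer):

```python
def frame_to_leds(band_levels, leds_per_band):
    """Map smoothed per-band loudness (0.0-1.0) to LED brightness (0-255).

    band_levels:   e.g. [subs, mids, tops] after smoothing and outlier cuts
    leds_per_band: how many diodes each frequency band drives
    """
    frame = []
    for level, count in zip(band_levels, leds_per_band):
        # Clamp to [0, 1], then scale to the 8-bit brightness range.
        brightness = int(round(min(max(level, 0.0), 1.0) * 255))
        frame.extend([brightness] * count)
    return frame

# Strong sub-bass, moderate mids, silent tops:
print(frame_to_leds([1.0, 0.5, 0.0], [2, 2, 2]))  # → [255, 255, 128, 128, 0, 0]
```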
However, in this mode, the animations tended to look a bit too simplistic. To combat this, we added a layer of mathematical modification on top of the data. From each point, we compute the so-called Manhattan distance, which allows us to light up not just one LED but a kind of circle around it, made up of several diodes. As a final touch, we randomize the borders of this circle a bit. This results in splashes of ambiguous shapes that continuously “flow” into each other — the picture changes as the track plays.
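A minimal sketch of such a splash, assuming a 7x12 grid from the earlier comparison (the grid size, names, and jitter scheme are assumptions for illustration):

```python
import random

def splash(center, radius, width=7, height=12, jitter=1):
    """Return the set of (x, y) LED cells forming a rough 'circle' around
    `center`: a Manhattan-distance diamond with a randomized border."""
    cx, cy = center
    lit = set()
    for x in range(width):
        for y in range(height):
            d = abs(x - cx) + abs(y - cy)  # Manhattan distance from the center
            # Randomizing the comparison blurs the splash's outline a bit.
            if d <= radius + random.randint(-jitter, jitter):
                lit.add((x, y))
    return lit

# With jitter=0 the splash is a clean diamond of 2r² + 2r + 1 = 25 cells
# for r = 3; jitter makes the edge organic and different on every frame.
print(len(splash((3, 6), 3, jitter=0)))  # → 25
```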
Alright, I told you how we generate shapes on the top panel. But how does the Station choose which colors to use? The answer is that our designer drew a large gradient map with a complex, multi-colored, and non-linear gradient. At each moment in time, a set of colors for LEDs is taken from a certain fragment of this map, then from a neighboring one, and so on. This is how images that pop up during the animation transform not only in shape but also in color. Unique combinations are generated each time, even when you replay the same track.
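The map-sampling idea might look something like this in code. The tiny map, the wrap-around walk, and all names here are illustrative assumptions, not the designer's actual gradient:

```python
def sample_colors(gradient_map, origin, count):
    """Take `count` consecutive colors from the map, starting at `origin`,
    wrapping around the row so the walk never runs off the map."""
    row, col = origin
    width = len(gradient_map[row])
    return [gradient_map[row][(col + i) % width] for i in range(count)]

# A 2x4 "map" of RGB tuples; each animation step shifts the origin,
# so the same shapes get painted with gradually changing colors.
gmap = [
    [(255, 0, 0), (255, 128, 0), (255, 255, 0), (128, 255, 0)],
    [(0, 255, 128), (0, 128, 255), (64, 0, 255), (192, 0, 192)],
]
print(sample_colors(gmap, (0, 2), 3))
# → [(255, 255, 0), (128, 255, 0), (255, 0, 0)]
```

Moving the origin a little on every frame is what makes the palette drift over time, while the shapes themselves keep being driven by the audio data.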
A Story About Chipageddon and FPGAs
During the development of Station Mini, our previous device, we first experienced the Chipageddon, the chip shortage caused by the pandemic. That experience prepared us to brace for problems with finding the right components for Station 2.
The Chipageddon hit the availability of three components especially hard: the Type-C controller, the STM32 (more on that below), and the amplifier. The latter had to be substituted during development; initially, we designed the device with a different amplifier model in mind. It turned out that class D amplifiers with digital input and built-in signal processing are represented by just a few SKUs in the entire market, and their tuning varies significantly from one manufacturer to another. So, we had about two months to bring the sound level on the new amplifier to the specified requirements.
Our previous speakers used a standard STM32 as the LED controller, and the early versions of Station 2 were no exception. But at the beginning of last year, STMicroelectronics, who used to produce these controllers by the millions, stopped supplying them in such quantities. The sharp decrease in production volumes soon triggered a chain reaction: many companies began panic-buying whatever was left of the STM32s. As a result, the chips not only went up in price but disappeared from the market altogether. At that moment, Chinese analogs with the characteristics we needed (memory size and clock frequency) were available. But there were few of them; they cost about 2.5 times more than the STM chips we were familiar with and lacked in-depth documentation. We also looked at domestic microcontrollers: they turned out to be even more expensive, and the possibility of producing hundreds of thousands of chips in the required (short) time frame was also murky.
Then one of our distributors suggested a more affordable FPGA (field-programmable gate array) as a replacement for the STM32. Powerful circuits of this type are used for complex calculations and fast processing, such as Fourier transforms or video transcoding. Simple FPGAs, in turn, are often used to control simple devices. We were won over when the distributor let us know they could quickly prepare an MVP of such a solution. In the meantime, we had an opportunity to gain expertise in an area that was not entirely familiar to us. The thing is, working with FPGAs is not the same as working with an STM32. Yes, you write code in an IDE in both cases, but FPGA programming is based on a completely different logic.
We had already figured out LED control in Station Max: it has 400 single-color diodes. In Station 2, there are 84 four-color LEDs (RGBW), which makes for 336 channels. Remembering that the FPGA is all about parallel processing, we decided to take advantage of this. The STM32 controlled the four hundred diodes through shift register chips chained in series into one 400-bit register. In comparison, the FPGA drives seven 48-bit chains in parallel — plus one shorter chain for the logo LEDs. This made it possible to increase the frame rate to several hundred per second, making the animation smoother.
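Conceptually, the parallel layout splits the frame buffer like this. This is a Python sketch of the data layout only; the real firmware is written in an HDL, and the function name is an assumption:

```python
def split_into_chains(frame, chains=7, chain_len=48):
    """Split a flat 336-channel frame into 7 chains of 48 channels each,
    which the FPGA can shift out simultaneously."""
    assert len(frame) == chains * chain_len
    return [frame[i * chain_len:(i + 1) * chain_len] for i in range(chains)]

# One frame now costs 48 shift clocks instead of 336 (plus the short
# logo chain), which is what buys the much higher frame rate.
parts = split_into_chains(list(range(336)))
print(len(parts), len(parts[0]))  # → 7 48
```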
But LEDs are easy to figure out, even when involving a frame buffer and 256 brightness levels per channel. It was a lot trickier to implement touch controls.
While the STM32 has a built-in touch interface controller, the FPGA does not. It might seem that there is a large selection of touch microcircuits to choose from, made by many manufacturers. But the Chipageddon took it all away. The necessary chips were practically absent — the last remaining batch of several hundred thousand that we had managed to order had to be reserved for Station Mini.
Therefore, we began building a touch input controller on the FPGA. In principle, it is enough to connect each touch zone to two FPGA pins and one capacitor. But due to the design of Station 2, the tracks between the integrated circuit and the touch controls had to be run all over the board, through connectors, and along a flexible cable. There were a lot of interfering LED control signals around. The first firmware with touch functionality worked fairly well…right until we turned on the animation. That was when false positives started occurring: the system would register a key press even when no one touched the button.
We had to collect raw data from the sensors, analyze it, and build digital filters while also selecting parameters to reduce noise. Then we found that the average signal level from a touch zone drifts slowly over time, so the response threshold had to be made adaptive. But that was not the end either. It turned out that the quality of the signal depends on the specific microcircuit. About 17 out of 20 test batch speakers worked reliably. With the remaining three, we observed false positives. We tweaked the filtering parameters, and the three “bad” devices stopped misbehaving. But four of the “good” ones broke.
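The adaptive-threshold idea can be sketched in a few lines. This is a conceptual model only; the real filtering runs in FPGA firmware, and all names and constants here are assumptions:

```python
def make_touch_detector(alpha=0.01, threshold=40):
    """Return a touch detector with a slowly adapting baseline.

    alpha:     how fast the baseline tracks slow drift in the signal
    threshold: how far above baseline a sample must rise to count as a touch
    """
    baseline = None

    def detect(sample):
        nonlocal baseline
        if baseline is None:
            baseline = float(sample)
        touched = (sample - baseline) > threshold
        if not touched:
            # Only non-touch samples pull the baseline, so a long press
            # does not get absorbed into the average.
            baseline += alpha * (sample - baseline)
        return touched

    return detect

detect = make_touch_detector()
readings = [100, 101, 99, 100, 160, 158, 100]  # a press around sample 5
print([detect(r) for r in readings])
# → [False, False, False, False, True, True, False]
```

Because the baseline follows the slow drift but ignores fast excursions, the same fixed `threshold` keeps working as the average signal level wanders over time.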
This effect was caused by the fact that every microcircuit has individual delays in each cell: the signal transmission speed slightly differs. We are talking nanoseconds — but even such minuscule differences matter. Finally, we managed to rewrite the measurement algorithm so that the difference between the microcircuits did not affect anything. But by this time, we had already added a special touch signal quality meter to the FPGA firmware and introduced it into the testing procedure during manufacture. Now we make sure that the touch controls work properly on every Station.
A Story About Zigbee
Previously, Alice would manage smart home devices only through the cloud. A “turn on the light bulb” request would be sent to the Yandex smart home cloud, then to the light bulb manufacturer’s cloud. Finally, the manufacturer’s cloud signaled for the bulb to turn on: that is how it worked for most smart devices. This chain is long and fragile, which often leads to delays in the execution of commands.
The modern Zigbee protocol allows you to send commands locally, directly from the speaker to a light bulb. This is possible when there is a Zigbee chip installed on both devices. Such chips can operate on battery power for several years — dozens of times longer than Wi-Fi modules. While Zigbee’s bandwidth is much more modest, it is plenty for a smart home. In addition, a Zigbee network is a mesh network in the sense that devices with constant power (such as sockets) can act as repeaters and automatically expand the coverage. Newly installed devices choose the closest repeater to connect to, and if it fails, they quickly adapt to the changed topology.
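A toy illustration of the mesh behavior described above: a device joins via the best-reachable repeater and re-joins if that repeater fails. All names and the quality scale are assumptions for illustration, not the Zigbee specification:

```python
def best_repeater(link_quality, alive):
    """Pick the best currently-alive repeater for a device to join.

    link_quality: {repeater_name: link quality, higher is better}
    alive:        set of repeater names that are currently powered
    """
    candidates = {name: q for name, q in link_quality.items() if name in alive}
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

links = {"socket-hall": 200, "socket-kitchen": 140}
alive = {"socket-hall", "socket-kitchen"}
print(best_repeater(links, alive))  # → socket-hall

alive.discard("socket-hall")        # the closest repeater loses power...
print(best_repeater(links, alive))  # → socket-kitchen (topology adapts)
```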
It was crucial to find a place for the Zigbee module somewhere within the speaker’s body. As a result, it shares the board with the top panel’s LEDs located right under the touch controls. It is positioned at the maximum possible distance from all the metal components that could negatively impact the module’s antenna.
The smart speaker still relies on the cloud for voice recognition, but for us, the emergence of Zigbee is a step towards a fully local smart home.
A Story About a Full-Fledged Type-C
What was the difficulty in implementing the Type-C port in first-gen Stations? When users see USB Type-C, they expect that they can power the device with any adapter, such as a phone charger. This is where a task arises: we need to determine whether the connected adapter is suitable. The USB Power Delivery (PD) protocol was developed precisely to tackle this.
Two special chips are needed for the protocol to work correctly: one in the adapter and another in the device. The adapter uses a chip to tell you what voltage and current it can provide. The device responds by communicating its needs for voltage and current, and the connection occurs with the parameters suitable for both parties.
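A simplified model of this "capabilities / request" exchange might look as follows. The profile values and names are illustrative; real PD is a wire protocol between the two chips, not a function call:

```python
def negotiate(source_profiles, required_volts, required_amps):
    """Pick the first advertised source profile that satisfies the sink's
    needs, or None if the adapter cannot supply enough power."""
    for volts, amps in source_profiles:
        if volts == required_volts and amps >= required_amps:
            return (volts, amps)
    return None

# What a hypothetical charger advertises as (volts, amps) pairs:
adapter = [(5, 3), (9, 3), (15, 3), (20, 2.25)]
print(negotiate(adapter, 15, 3))   # → (15, 3): a suitable profile exists
print(negotiate([(5, 3)], 15, 3))  # → None: this adapter cannot deliver it
```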
When we were designing Station 2’s power system, we settled, for the first time ever, on affordable PD controllers for both the adapter and the device. The power supply was designed exclusively for this speaker. It outputs the exact parameters needed and is also explicitly adapted for playing music, capable of providing brief on-demand surges of much stronger current. And, of course, you can use it to charge any other devices that support PD.
However, the greatest difficulty was not in developing a suitable adapter but in handling situations where the power supply either does not have a PD chip or has one but cannot provide enough power. We made a judgment call at this point: we let the Station boot up from unsupported power supplies and quietly alert the user that a more suitable adapter should be used. If the power supply supports the 15 V/3 A PD profile, the speaker is fully functional.
Station 2 is one of the most technologically sophisticated Yandex devices to date — we had faced many challenges during its development and hope that we were able to make a smart speaker that users will love.