Musical Instrument Practice Support System based on Crossmodal Correspondences to be Presented at IUI 2023

Shigeo Yoshida
OMRON SINIC X
Mar 15, 2023

This is Shigeo YOSHIDA, a Senior Researcher at OMRON SINIC X.

We are glad to announce that our paper on a musical instrument practice support system based on crossmodal correspondences will be presented at the 28th Annual Conference on Intelligent User Interfaces (IUI 2023). This work is joint research by OMRON SINIC X and The University of Tokyo (Cyber Interface Lab and Saruwatari & Koyama Lab).

Kota Arai, Yutaro Hirao, Takuji Narumi, Tomohiko Nakamura, Shinnosuke Takamichi, and Shigeo Yoshida. 2023. TimToShape: Supporting Practice of Musical Instruments by Visualizing Timbre with 2D Shapes based on Crossmodal Correspondences. In 28th International Conference on Intelligent User Interfaces (IUI ’23), March 27–31, 2023, Sydney, NSW, Australia. ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3581641.3584053

In this work, we propose a method for representing timbres as 2D shapes based on the correspondence between timbres and shapes that people share. Below is a brief introduction to the background, proposal, method, and evaluation. For more details, please refer to our paper.

Background and Proposal

Music players can convey information and feelings to their audience by appropriately varying the timbre of their sound. However, because timbre is a high-dimensional and perceptual concept, it is difficult for learners to develop the skills needed to produce a desired timbre.

To address this, we propose a system that supports the acquisition of timbre-control skills by converting timbres into visual shapes, based on an intuitive correspondence between timbres and shapes.

To build such a system, we focused on "Crossmodal Correspondences," associations between different sensory modalities.

Crossmodal Correspondences

Humans feel a natural connection between different modalities, such as sight and hearing (e.g., “bright voices” and “dark voices”). This non-arbitrary associative relationship between different modalities is called “Crossmodal Correspondences.”

A well-known example of crossmodal correspondences is the "Bouba-Kiki effect." When asked, "If you had to name the two shapes below, which would be Bouba and which would be Kiki?" most people answer that the left one is Kiki and the right one is Bouba.

https://en.wikipedia.org/wiki/Bouba/kiki_effect#/media/File:Booba-Kiki.svg

Many such correspondences exist between different modalities: shape and sound as in Bouba-Kiki, brightness and sound as mentioned earlier, and shape and taste or smell.

In this study, we focused on the crossmodal correspondence between timbre and shape for intuitive visualization of timbre. However, it is not obvious what shape should be assigned to an arbitrary timbre produced on a musical instrument.

Method for Visualizing Timbres with 2D Shapes based on Crossmodal Correspondences

We built a system that generates and visualizes, in real time, the 2D shape corresponding to the current timbre, based on correspondences between a small set of timbres and shapes that the learner specifies in advance.

First, timbres are encoded into a latent space by a variational autoencoder (VAE) trained without supervision, following previous research. The system then estimates a shape for an arbitrary timbre by linearly interpolating the timbre–shape correspondences the learner has provided in advance. Finally, the estimated shape is decoded and visualized for the learner in real time.
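The interpolation step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the anchor latent codes, the shape parameterization (two hypothetical parameters per shape), and the use of inverse-distance weighting as one concrete way to realize linear interpolation are all assumptions made for the example.

```python
import numpy as np

# Hypothetical anchors: VAE latent codes of timbres the learner labeled,
# paired with parameters of the 2D shapes they associated with them.
anchor_latents = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
anchor_shapes = np.array([[0.1, 0.9], [0.8, 0.3], [0.5, 0.5]])

def interpolate_shape(z, latents=anchor_latents, shapes=anchor_shapes, eps=1e-8):
    """Estimate shape parameters for a latent timbre code z by
    inverse-distance weighting over the learner's anchor correspondences."""
    d = np.linalg.norm(latents - z, axis=1)  # distance to each anchor
    w = 1.0 / (d + eps)                      # closer anchors weigh more
    w /= w.sum()                             # normalize to a convex combination
    return w @ shapes

# A timbre that lands exactly on an anchor recovers that anchor's shape;
# timbres in between blend the neighboring shapes smoothly.
print(interpolate_shape(np.array([1.0, 0.0])))  # ≈ [0.8, 0.3]
```

In the actual system, the resulting shape parameters would then be decoded into a 2D shape and rendered each frame as the learner plays.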

Below is the proposed system in actual use. As you can see, it presents shapes in real time in response to the learner's sound.

User Study

Two user studies were conducted to verify the effectiveness of the proposed system.

In the first study, we used crowdsourcing (n=75) to verify whether the proposed system could generate a shape that could be perceived as corresponding to an arbitrary violin timbre. The results showed that the shapes generated by the proposed system were more likely to be perceived as corresponding to the timbre than randomly generated shapes.

In the second study, we investigated how visual feedback of the generated shapes affected violin practice with six violin players. The proposed system was well received in terms of visual clarity, ease of understanding the relationship between timbre and visual feedback, and ease of recalling that relationship even after the visual feedback was removed.

Call for Interns

At OMRON SINIC X, we conduct fundamental research on natural language processing, computer vision, machine learning, robotics, and human-computer interaction (HCI). If you are interested in working with us as an intern, please visit our call-for-internship page.
