Interpreting Auslan with gesture analysis: A sign of tone deaf times

Ron Au
Trends in Data Science
10 min read · Oct 26, 2022


The deaf community and Auslan interpreting

1 in every 50 Australians lives with complete or partial hearing loss (Australian Bureau of Statistics, 2018), yet their needs are often forgotten and, in particular, misunderstood by the hearing population. With little widespread uptake of Auslan (Australian Sign Language) among the latter, an ever-present gap remains between the deaf community and full access to society. The primary mechanism for bridging this divide has been interpreters: hearing individuals who ‘translate’ to and from Auslan. Interpreters have been put to good use in broadcast events, but they are not a practical aid for daily life, as the process reduces a person’s agency and appointments are not guaranteed.

Perhaps due to the technology industry’s proclivity to look for solutions before problems, or the growing capabilities showcased within data science, many have looked to automation. The most commonly repeated approach is to use data from motion sensors or computer vision for gesture analysis. Unfortunately, research has shown that these innovation attempts tend to pave over complex issues rather than fill gaps in accessibility.

Potential for data-driven gesture analysis

On the surface, this problem scenario does indeed lend itself to data analysis. Machine learning models have been able to classify motion data into Auslan signs using hand-mounted hardware as early as 1995 (Kadous, 1995). Since then, computer vision pipelines have also been shown to reliably identify hand poses (Parton, 2005).

Advances in mainstream cameras also mean detail can be captured in low-light environments thanks to better sensors and processors, which in turn allow for faster shutter speeds and higher frame rates of recorded observations. As machine learning models and the hardware they run on improve, so too does the latency involved in translating human movement into data, analysing it and generating output.

In order to carefully ‘cook’ raw data, notation systems such as HamNoSys (Hanke, 2004) and SignWriting (https://signwriting.org) have been developed to ‘phoneticise’ and codify sign language. They aim to address the problem of capturing sign language in recorded or programmatic form without losing or changing meaning. In turn, these systems position gesture data to be used more readily for data science tasks, especially natural language processing (Parton, 2005).
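To give a rough sense of what such a notation codifies, the sketch below models a single sign as a structured record. The field names and example values are simplified assumptions for illustration only; they are not HamNoSys or SignWriting itself.

```python
from dataclasses import dataclass

# A simplified, hypothetical record of the attributes a notation system
# like HamNoSys captures for one sign. Real notations are far richer
# (symmetry, contact, repetition, non-manual features, and so on).
@dataclass
class SignRecord:
    handshape: str      # e.g. "flat", "fist", "spread-5"
    orientation: str    # palm/finger orientation, e.g. "palm-down"
    location: str       # place of articulation, e.g. "chin", "neutral-space"
    movement: str       # e.g. "arc-downward", "circular"
    two_handed: bool

# Once codified like this, signs become rows in a dataset that
# classification and NLP pipelines can consume directly.
example = SignRecord("flat", "palm-down", "neutral-space", "arc-downward", False)
print(example)
```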

German Sign Language gesture attributes codified as symbols in the HamNoSys system. (Hanke, 2004)
Example of the SignWriting system: a hand gesture and facial expression represented by a symbol graphic.
“Pretend you can see through the back of the head. You are reading and writing how your face “feels” when you sign” — SignWriting (https://signwriting.org/archive/docs1/sw0008-About-SignWriting.pdf)

Additionally, the ubiquity of camera devices, as well as growing familiarity with and appetite for AI products, means that computer vision approaches have become increasingly accessible to the mainstream.

If computer-driven approaches could reliably interpret Auslan, they would shift some of the onus of accessibility away from the deaf community, much as websites and applications carry a responsibility to be accessible to screen-reading software. Working implementations that only require the cameras people already own would be a relatively straightforward upgrade to their sense of agency. Be My Eyes, for example, is an app that pairs blind individuals with sighted users, allowing them to complete otherwise impossible tasks simply by using their smartphone (Be My Eyes, 2022; Appendix B).

Data challenges with existing approaches

1. Auslan is a temporal spatial data task

Using computer vision to classify signing is certainly an explored field. Prior approaches typically employ object detection or pose estimation to identify the hands, then feed that data into a model trained to classify the gesture as a category of sign.
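As a rough sketch of that pipeline, the snippet below detects hand landmarks in a single frame and passes them to a classifier. It assumes MediaPipe Hands and scikit-learn are available; the model file and the image path are hypothetical placeholders.

```python
# Sketch of the typical static-gesture pipeline described above:
# detect hand landmarks in one frame, then classify the pose.
import cv2
import joblib
import mediapipe as mp
import numpy as np

hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1)
classifier = joblib.load("fingerspelling_model.pkl")  # hypothetical pre-trained model

frame = cv2.imread("sign_frame.jpg")  # hypothetical input frame
results = hands.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

if results.multi_hand_landmarks:
    landmarks = results.multi_hand_landmarks[0].landmark
    # 21 hand landmarks, each with x, y, z -> a 63-dimensional feature vector
    features = np.array([[p.x, p.y, p.z] for p in landmarks]).flatten()
    print(classifier.predict([features]))  # e.g. ["C"]
```

Note that every frame is treated independently here, which is precisely the limitation discussed below.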

Fingerspelling dataset used for training, representing the letters A, L, P, H, A, B, E and T. (Halvardsson et al., 2021)
Tom Quirk from Coviu demonstrating static gesture (fingerspelling) detection, with a detected hand boxed and labelled “C”. (Appendix A)

What you may notice about the prior work above is that it has all been performed on static hand gestures. These approaches take a video frame and analyse it with no relation to the previous or next movement. Auslan, however, is signed using both hands to create gestured motions, not only static poses. A single moment in two separate signs can look identical, so any reasonable representation of the language cannot rely on static poses alone.

As such, the task of classifying signs is a temporal spatial one, which brings with it all the complexities of combining time series with coordinate data. A slice of time representing a sign is no longer a single frame of motion or image data, but dozens. At 60 records per second, a sign lasting just one and a half seconds yields 90 frames to analyse, each carrying its own set of coordinates.
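To make the scale concrete, here is a minimal sketch of the data volume, assuming two hands tracked with 21 landmarks each (the landmark count is an assumption borrowed from common hand-tracking models, not from the studies above).

```python
import numpy as np

# One static frame: 2 hands x 21 landmarks x (x, y, z)
landmarks_per_frame = 2 * 21 * 3          # 126 values

# A single sign captured at 60 frames per second for 1.5 seconds
frames_per_sign = int(60 * 1.5)           # 90 frames

sign_clip = np.zeros((frames_per_sign, landmarks_per_frame))
print(sign_clip.shape)  # (90, 126): thousands of values for one sign,
                        # versus 126 for a single static pose
```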

While this issue frequently goes unacknowledged or is simply relegated to further research in concluding statements, research by Mohammed Waleed Kadous explored it in depth (Kadous, 1996). Kadous identified that an even more difficult hurdle than analysing a sequence of data points is the demarcation of signs: recognising when one ends and the next begins.
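One naive way to attempt that demarcation is to look for pauses where the hands barely move between signs. The sketch below is only a heuristic under that assumption; the threshold values are arbitrary, and Kadous’s work shows the real problem is considerably harder than this.

```python
import numpy as np

def segment_by_motion(frames: np.ndarray, threshold: float = 0.01, min_pause: int = 5):
    """Naive sign demarcation: mark a boundary wherever frame-to-frame
    movement stays below a threshold for several consecutive frames.
    frames has shape (n_frames, n_features); threshold and min_pause
    are arbitrary values that would need tuning on real recordings."""
    velocity = np.linalg.norm(np.diff(frames, axis=0), axis=1)
    still = velocity < threshold
    boundaries, run = [], 0
    for i, is_still in enumerate(still):
        run = run + 1 if is_still else 0
        if run == min_pause:  # a sustained pause -> likely sign boundary
            boundaries.append(i)
    return boundaries
```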

Additionally, some approaches analyse only the pose of the hands but not their position in relation to the body. By localising the x, y, z coordinate space to just the wrist and fingers, a great deal of information about how gestures move across the body is lost. For instance, the signs for “day” and “night” use identical hand poses but move in opposite directions (Appendix C).
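As an illustration of what is lost, the sketch below contrasts wrist-relative landmarks with landmarks normalised against a body reference (here, an assumed pair of shoulder coordinates). Only the latter preserves where the hand sits and moves relative to the signer.

```python
import numpy as np

def wrist_relative(hand: np.ndarray) -> np.ndarray:
    """Hand landmarks (21, 3) expressed relative to the wrist (landmark 0).
    Position on the body is discarded, so 'day' and 'night' look identical."""
    return hand - hand[0]

def body_relative(hand: np.ndarray, left_shoulder: np.ndarray,
                  right_shoulder: np.ndarray) -> np.ndarray:
    """Hand landmarks expressed relative to the shoulder midpoint and
    scaled by shoulder width, so movement across the body is preserved."""
    centre = (left_shoulder + right_shoulder) / 2
    scale = np.linalg.norm(right_shoulder - left_shoulder) or 1.0
    return (hand - centre) / scale
```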

2. Auslan’s modality is not exclusively manual

Even if hand gestures are reliably captured and analysed, they do not carry the whole of the message being communicated. Facial expressions, body language and the modality of a gesture are combined with specific signs to convey meaning. These aspects are known as the paralanguage of signing. In “On the Conventionalization of Mouth Actions in Australian Sign Language” (Schembri, 2016), Adam Schembri observed that 46.7% of dialogue in Auslan involved mouth actions across different text-types.

Addressing this issue is not simple, because facial analysis, body pose estimation and gesture recognition are separate tasks, yet they need to be processed together to generate accurate output. The difficulty is exacerbated if the channels are captured with different inputs, such as when hardware sensor data must be aligned with image data.
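One possible way to capture the manual and non-manual channels from the same camera frame is a holistic landmark model; the sketch below uses MediaPipe Holistic purely as an example, with a hypothetical input image. It only gathers the separate streams; fusing them into a single interpretation remains the hard, unsolved part.

```python
import cv2
import mediapipe as mp

holistic = mp.solutions.holistic.Holistic()
frame = cv2.imread("signer_frame.jpg")  # hypothetical input frame
results = holistic.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

channels = {
    "face": results.face_landmarks,          # mouth actions, expression
    "body": results.pose_landmarks,          # posture, where hands sit on the body
    "left_hand": results.left_hand_landmarks,
    "right_hand": results.right_hand_landmarks,
}
print({name: lm is not None for name, lm in channels.items()})
```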

It is worth mentioning, however, that the oft-cited “Inference of attitudes from nonverbal communication in two channels” study (Mehrabian et al., 1967), which claimed communication of attitudes is 55% facial, 38% vocal and 7% verbal, has been largely disputed. Lapakko (1997) identifies issues in the methodology of the study, while Amsel (2019) describes the misuse of these particular figures.

Despite the difficulties involved, it nonetheless remains crucial to address, as current methods frame the data problem as capturing hand gestures only, not interpreting human intent.

3. Auslan is not English

Just as captions are not a cure-all for the deaf, neither is translating Auslan directly into English. Auslan is a separate language with its own grammar and even its own northern and southern dialects. If the aforementioned challenges with temporal spatial data and expression were solved, there would still remain the issue of language itself. Existing demonstrations tend to showcase interpretation in isolation, where a single sign is analysed and a single word or letter is generated. To form intelligible English, a linear sequence of such outputs would not suffice.
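As a toy illustration, consider stitching per-sign outputs together. The gloss sequence below is hypothetical and not actual Auslan grammar; the point is only that word-for-word output is not English, so a sequence-level translation step would be needed.

```python
# Hypothetical per-sign recognition output (glosses). Not actual Auslan
# grammar; used only to show that concatenation is not translation.
recognised_glosses = ["COFFEE", "WANT", "YOU", "QUESTION"]

naive_english = " ".join(recognised_glosses).lower()
print(naive_english)  # "coffee want you question": not intelligible English

def translate(glosses: list[str]) -> str:
    """Placeholder for a sequence-to-sequence translation step,
    e.g. an encoder-decoder model trained on gloss/English pairs."""
    raise NotImplementedError
```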

Implementation Considerations

We must not consider the implementation of such methods solely as a technical problem or in isolation from business and human implications. As such, it’s worth examining the validity of pursuing them in the first place and what hidden costs a successful implementation may carry.

1. The data science community is not listening

Despite the data challenges mentioned above having been researched for years, practitioners continue to adopt the same approaches. Perhaps because this space appears ripe for innovation (refer again to the “Potential for data-driven gesture analysis” section), there is commonly an eagerness to prove a project’s hypothesis more than its applicability. Innovation and impact in these instances are thus measured from zero rather than against prior research or user feedback: the belief that something is innovative simply because it has not been done before.

Take, for example, the study “HELPING SIGNERS”, which has an honourable goal in its title but nevertheless only assesses Auslan within the scope of isolated, static gestures. Indeed, as recently as 2022 another study remained focused on static gestures (Subburaj & Murugavalli, 2022). Reading through the paper, there is much reflection on the use of data and analysis techniques but little scrutiny of integration with the community. A patent application published in 2021 describes a system requiring glove sensors to interpret static fingerspelling (Retek et al., 2017). These works indicate that data practitioners are putting forth incomplete approaches with serious intent, while not measuring the agency afforded to deaf individuals as a success metric.

Better collaboration with the deaf community as stakeholders would help steer further research. Whether it’s due to prioritising research over applicability or due to research being founded on assumptions, it does appear these works are hearing less than the deaf community they claim to innovate for.

2. A successful implementation may not be a helpful solution

“Success” here is defined as meeting the technological challenges mentioned previously and being made available for use.

Imagining that gesture analysis interpretation were the norm, a degree of deafness ‘literacy’ would be lost in the general population. That is, the more people believe interpreting is a solved problem, the less they are faced with having to understand what deaf individuals experience. It encourages hearing people to look past the nuanced needs of the deaf community and reduces their opportunities to engage with and understand that disability.

If a successful implementation relied on hand-mounted hardware, the agency it afforded would come at the cost of accessibility. While the ubiquity of camera devices is a lower barrier by comparison, tailored hardware may limit usability through cost, availability or compatibility.

Consider, too, the feedback loop created when technologically successful implementations are rejected by the deaf community. Poor adoption can discourage further innovation attempts from data scientists, even when that rejection stems from poor business understanding and a lack of user collaboration.

Training performed on data will also codify the language in place. That is, if models are used and relied upon long term, the ability of the language to evolve naturally is stymied. For example, gender biases in sign language have been present for decades (McIntosh, 1990), with children absorbing them from a young age. Appendix D shows how such biases already exist, with signs for male and female words deliberately separated as ‘high’ and ‘low’. Use of models with static weights will reinforce biases that are embedded in the model, and therefore in the language.

References

Amsel, Tuvya Thomas. (2019). “An Urban Legend Called: “The 7/38/55 Ratio Rule””. DOI:10.2478/ep-2019-0007

Australian Bureau of Statistics. (2018). “Long-term health conditions — Australia”, National Health Survey: First results, Table 3.2. https://www.abs.gov.au/statistics/health/health-conditions-and-risks/national-health-survey-first-results/latest-release

Be My Eyes. (2022). “Our Story”. https://bemyeyes.com/about

Hanke, T. (2004). “HamNoSys — representing sign language data in language resources and language processing contexts.”, Institute of German Sign Language and Communication of the Deaf. https://www.sign-lang.uni-hamburg.de/dgs-korpus/files/inhalt_pdf/HankeLRECSLP2004_05.pdf

Halvardsson, Gustaf., Peterson, Johanna., Soto-Valero, César., Baudry, Benoit. (2021). “Interpretation of Swedish Sign Language Using Convolutional Neural Networks and Transfer Learning”, SN Computer Science (2021) 2:207. https://doi.org/10.1007/s42979-021-00612-w

Kadous, Mohammed Waleed. (1995). “GRASP: Recognition of Australian Sign Language Using Instrumented Gloves”

Kadous, Mohammed Waleed. (1996). “Machine Recognition of Auslan Signs Using PowerGloves: Towards Large-Lexicon Recognition of Sign Language”

Lapakko, David. (1997). “Three cheers for language: A closer examination of a widely cited study of nonverbal communication”, Communication Education, 46:1, 63–67, DOI:10.1080/03634529709379073

Mohammad, Saif M. (2022). “The NRC Valence, Arousal, and Dominance (NRC-VAD) Lexicon”. https://saifmohammad.com/WebPages/nrc-vad.html

McIntosh, Rebecca Anne. (1990). “Is ASL gender-biased? : a study of whether deaf school children can recognize gender-bias in their language”, Graduate Student Theses, The University of Montana. https://scholarworks.umt.edu/etd/7802

Mehrabian, A., Ferris, S. R. (1967). “Inference of attitudes from nonverbal communication in two channels”, Journal of Consulting Psychology, 31(3), 248–252. https://doi.org/10.1037/h0024648

Parton, Becky Sue. (2005). “Sign Language Recognition and Translation: A Multidisciplined Approach From the Field of Artificial Intelligence”, The Journal of Deaf Studies and Deaf Education, Volume 11, Issue 1, Winter 2006, Pages 94–101. https://doi.org/10.1093/deafed/enj003

Retek, David., Palhazi, David., Kajtar, Marton., Alvarez, Attila., Posci, Peter., Nemeth, Andras., Trosztel, Matyas., Robotka, Zsolt., Rovnyai, Janos. (2017). “Computer vision based sign language interpreter”, United States Patent and Trademark Office. US20210174034A1

Subburaj, S., Murugavalli, S. (2022). “Survey on sign language recognition in context of vision-based and deep learning”, Measurement: Sensors. Volume 23, October 2022. https://doi.org/10.1016/j.measen.2022.100385

Appendices

A) HealthHack 2018: AusLan Party (video, articles)

Tom Quirk from Coviu demonstrating static gesture (fingerspelling) detection, with a detected hand boxed and labelled “C”.

https://youtube.com/watch?v=DDGplO5jB4M

https://www.healthhack.com.au/health-hack-stories/2019/4/19/healthhack-brisbane-2018-event-review

https://www.linkedin.com/pulse/how-we-used-ai-translate-sign-language-real-time-jo-childs/

B) Be My Eyes: Hans Jørgen Wiberg at TEDxCopenhagen (video)

Hans Jørgen Wiberg, the founder of Be My Eyes, proposing the idea at TEDxCopenhagen in 2013. On the use of our idle phone time to play mindless games: “I think we can do better”.

https://youtube.com/watch?v=IfeLJxCSLC0

C) Signbank: “night” and “day” (video)

Demonstrations of the Auslan signs for “night” and “day” (video stills).

https://auslan.org.au/dictionary/words/night-1.html

https://auslan.org.au/dictionary/words/day-1.html

D) ASL That: Gender Distinction | ASL — American Sign Language (video)

Still from an instructional ASL video showcasing that signs for male words are “high” or related to intellect whereas signs for female words are “low” or related to submissiveness.

https://youtube.com/watch?v=fUlWG-qGFis
