Homoscedasticity and Mixed-Effects Models

Mattia Di Gangi
Jan 24 · 7 min read

What homoscedasticity means in linear regression, why heteroscedasticity calls for mixed-effects models, and a real example from spoken language translation.

Heteroscedasticity: you cannot perform linear regression on randomly dropped balls.
Photo by Patrick Fore on Unsplash

Linear regression is a popular statistical model for finding a linear relationship between two variables, and it works particularly well when your data is distributed as in Figure 1.

Example of a well-fit linear regression
Figure 1: Very high correlation between X and Y. Linear regression is likely the best model.

However, like any other statistical method, linear regression works only under well-defined assumptions. When the assumptions do not hold, you need to find a different model to fit your data.

In this post, I try to explain homoscedasticity, an assumption behind linear regression that, when violated, makes it a bad fit for your data. In that case, called heteroscedasticity, the main alternative is to move to linear mixed-effects models. The post closes with an example taken from a published research paper.

Homoscedasticity

The variance of the dependent variable is constant across the population: it does not depend on the x values.

If the variances are not constant, this property does not hold and we have its opposite: heteroscedasticity.

Heteroscedasticity typically arises when the data points are not independent, usually because there has been some sort of grouping in the data collection. For instance, different portions of the data set may have been collected at different times, or the data may have a hierarchical structure.

Imagine collecting data about the effects of exercising. For each point, we record hours of physical exercise and a body parameter (let’s say heart rate). We can easily imagine that, while there is a trend that shows the benefit of exercising, its magnitude will be different for people in their 20s or in their 50s. The global linear regression may be difficult to interpret if the age factor is not considered, and we can get some surprising results, an effect known as Simpson’s paradox.
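The reversal can be reproduced in a few lines of NumPy. The numbers below are made up purely for illustration: each of three groups shows the same positive within-group slope, while the pooled regression slope over all points comes out negative.

```python
import numpy as np

# Three groups: within each, y grows with x (slope +2),
# but group-level offsets push y down as x shifts up.
slopes, xs, ys = [], [], []
for g in range(3):
    x = np.arange(5) + 10 * g          # group g occupies a higher x range
    y = 2 * x - 30 * g + 5             # same within-group slope, lower offset
    slopes.append(np.polyfit(x, y, 1)[0])
    xs.append(x)
    ys.append(y)

pooled_slope = np.polyfit(np.concatenate(xs), np.concatenate(ys), 1)[0]
print(slopes)        # each within-group slope is +2
print(pooled_slope)  # the pooled slope is negative: Simpson's paradox
```

A single regression over all points would conclude that more exercise is harmful, while every group individually shows the opposite trend.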

Linear Mixed-Effects Models

When data groups have different variances (heteroscedasticity), the regression line is pulled more by the group with higher variance. To fit the data well, we need to model both the variance within groups and the variance across groups.

Mixed-effects models add parameters for random effects: one or more variables in the data that explain the grouping. This is different from fitting a separate linear regression for each group, because in the latter case the global trend in the whole population is ignored. Moreover, in a smaller sample (a single group) the noise weighs more than in a larger sample containing multiple groups.

The fixed effects then explain the global trend, while the random effects explain the local trends. The result is a method that takes the best of both worlds, but requires choosing the random effects carefully.

Beware that linear mixed-effects models share a strong assumption with linear regression: the residuals are normally distributed. Again, if this assumption does not hold, a mixed-effects model will also be a bad fit for your data.
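As a minimal sketch of the idea (assuming the statsmodels package, with synthetic data standing in for real measurements): fitting a random intercept per group recovers the global slope, and the residuals can then be inspected against the normality assumption.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_groups, n_per = 10, 30
g = np.repeat(np.arange(n_groups), n_per)
u = rng.normal(0.0, 3.0, n_groups)[g]           # random intercept per group
x = rng.normal(0.0, 1.0, g.size)
y = 1.5 * x + u + rng.normal(0.0, 0.5, g.size)  # true fixed slope: 1.5

df = pd.DataFrame({"y": y, "x": x, "g": g})
fit = smf.mixedlm("y ~ x", df, groups=df["g"]).fit()
print(fit.params["x"])   # close to the true slope 1.5
print(fit.resid.mean())  # residuals centred near zero
```

A pooled regression on the same data would mix the group offsets into the noise; the random intercepts absorb them, leaving well-behaved residuals.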

A Real Analysis

We are in the following scenario: we have a machine translation (MT) system trained to translate well-structured input text. Instead of using it in a test scenario similar to the training set, we want to assess its feasibility for spoken language translation (SLT).

The following example compares a properly structured text with the transcript of a spoken sentence:

On Friday night, I like to have pizza.
on friday night i like ehm to have mmh pizza

In SLT, the most common approach is to first transcribe the input speech with an automatic speech recognition (ASR) system; the transcript is then normalized for the MT system’s sake and given to it as input. For example:

were running with our dog and its enjoying

-> we’re running with our dog, and it’s enjoying.

Obviously, the output of ASR contains errors, in particular for words that can be misinterpreted in many ways. Some words may be dropped altogether by the system, and others can be mistaken for similar-sounding words (of <-> off; where <-> were) or for different phrases (wearable robot <-> wreck a raw boat, anatomy <-> and that to me).

In our study, we have 8 different ASR systems of varying quality, and 2 MT systems built following different approaches, referred to below as NEURAL and MMT.

The goal is to understand which MT system is more robust to ASR errors and how the ASR quality affects the results.

ASR quality is measured in terms of word error rate (WER) and MT quality is evaluated with translation edit rate (TER). The two similar acronyms correctly suggest a similarity between the metrics: both are based on the Levenshtein distance between the system’s output and a given reference sentence. As a quick reminder, the Levenshtein distance counts the INSertions, DELetions and SUBstitutions needed to turn the output into the reference; the metric then normalizes this count by the length of the reference in words.
Both metrics are computed sentence by sentence and then averaged on the whole test set. The ASR systems have a varying quality as measured by WER:
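A minimal WER implementation in plain Python, mirroring the definition above, looks like this (TER follows the same scheme with block shifts added, which are omitted here):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                              # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,       # deletion
                          d[i][j - 1] + 1,       # insertion
                          d[i - 1][j - 1] + sub) # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```

On the disfluent example from earlier, the two filler words count as insertions: `wer("on friday night i like to have pizza", "on friday night i like ehm to have mmh pizza")` gives 2/8 = 0.25.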

WER values for the eight systems
Figure 2: WER of the ASR systems on the test set under examination.

A first analysis (Figure 3) shows, for each ASR system, the number of times that each of the two MT systems produces the best translation as measured by the TER metric:

Figure 3: Count and percentage of winning outputs of each MT system relative to every ASR system. The winning output is the one with lowest TER.

For each ASR system, NEURAL wins in about half of the cases while MMT wins in about 30% of the cases, the rest being ties. However, NEURAL is a stronger system when translating clean input, so it can “win” even when its TER degradation is similar to, or worse than, MMT’s. It is then relevant for us to understand whether WER can be used as a predictor of translation quality degradation.

For this goal, we resort to linear mixed-effects models, as we know that the data points are grouped at least by ASR system. For all the translations available from all the ASR outputs, we plot the WER of each sentence (observation) against its DELTA TER, computed as the difference between the TER obtained when translating the noisy ASR output and the TER obtained when translating the clean transcript.

The random effects are assumed to explain all the possible sources of variance: the intrinsic difficulty of translating a single utterance (UttID), the WER variance within a single ASR system (SysID), and the variance of WER across the systems (WER).

The results are shown in Figure 4. Beta represents the fixed-effects line, whose parameters are its intercept and its slope (WER). The slope indicates the variation in TER caused by a single point of WER. We can observe that WER produces a similar response in both systems, causing a TER degradation of (0.61 ± 0.02) for NEURAL and (0.56 ± 0.02) for MMT. With the variance explained by our random effects, the residuals are about zero, meaning that this linear mixed-effects model is a good fit for the data.

Parameters of the mixed-effects model. With variance explained away we can use WER as a predictor of the degradation.
Figure 4: parameters of the linear mixed-effects model evaluating DELTA TER against WER.
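An analysis along these lines can be sketched with statsmodels (the column names DeltaTER, WER, SysID and the simulated data are assumptions for illustration, not the paper’s actual setup; the model here is simplified to a random intercept per ASR system, while the full analysis also includes per-utterance effects):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n_sys, n_utt = 8, 60
sys_id = np.repeat(np.arange(n_sys), n_utt)
sys_eff = rng.normal(0.0, 2.0, n_sys)[sys_id]     # per-system offset
wer_ = rng.uniform(5.0, 40.0, sys_id.size)        # sentence-level WER (%)
# Simulated degradation: 0.6 TER points per WER point, plus noise
delta_ter = 0.6 * wer_ + sys_eff + rng.normal(0.0, 3.0, sys_id.size)

df = pd.DataFrame({"DeltaTER": delta_ter, "WER": wer_, "SysID": sys_id})
fit = smf.mixedlm("DeltaTER ~ WER", df, groups=df["SysID"]).fit()
print(fit.params["WER"])  # slope: TER degradation per point of WER
```

The fitted slope plays the role of the Beta coefficients in Figure 4: a per-WER-point degradation estimate that is no longer distorted by which ASR system a sentence came from.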

A further mixed-effects model is applied to the three WER components SUB, DEL and INS to evaluate how they affect the two systems.
The results shown in Figure 5 reveal that NEURAL, compared with MMT, is more sensitive to substitutions (0.68 ± 0.02 vs 0.54 ± 0.02), but much more robust to deletions (0.43 ± 0.02 vs 0.59 ± 0.02) and slightly more robust to insertions (0.56 ± 0.03 vs 0.60 ± 0.03).

Another mixed-effects model that breaks the components of WER. SUB is the most impactful on NEURAL.
Figure 5: mixed-effects model that breaks WER into its three components SUB, DEL, INS.

Our paper also presents examples of how the different types of ASR errors affect the two systems.

Conclusions

After a high-level introduction of the basic ideas, we showed a practical example from the spoken language translation world. Given the possible sources of variance in our data (random effects), we were able to find a linear relationship between the transcription quality of an ASR system, as measured by WER, and the corresponding degradation in translation quality relative to translating the clean, correct transcript.

If this model made you curious, the following section lists useful resources that I found online.

Read more

For a more formal introduction read: http://userwww.sfsu.edu/efc/classes/biol710/mixedeffects/Mixed-Effects-Models.pdf

The following is a more thorough explanation with math and examples: https://stats.idre.ucla.edu/other/mult-pkg/introduction-to-linear-mixed-models/


Mixed-effects models with Matlab: https://www.mathworks.com/help/stats/linear-mixed-effects-models.html

Mixed-effects models in R (with a long description, recommended): https://www.r-bloggers.com/getting-started-with-mixed-effect-models-in-r/

Machine Translation @ FBK

Machine Translation Research Unit at Fondazione Bruno Kessler, Trento, Italy. We develop state of the art technology that supports both human translators and multilingual communication applications.

Written by Mattia Di Gangi
PhD Student in Machine Translation. Blogging about research, technology and life.

