Elo as a tool for ranking LLMs

Rahul
Thomson Reuters Labs
Jun 27, 2024

In a previous blog post, Vikas and Shreya talked about the importance of evaluating LLMs and how they built an evaluation framework for LLMs. We now build on that work to see how we can visualise the results our evaluation framework produces. We also investigate techniques to benchmark LLMs through human evaluation.

LLM Leaderboard

Our aim is to build an end-to-end solution that helps users at Thomson Reuters (TR) evaluate the performance of LLMs against their gold datasets. To achieve this, we want to support the following workflow:

  • Users can upload the dataset they want to evaluate.
  • Users can then set up an evaluation run by selecting an LLM (modifying its default parameters if necessary) and a prompt template.
  • The evaluation results are stored in a DB and can be downloaded by the user.
  • If the same dataset has been evaluated against more than one LLM, the evaluation becomes available in the Elo battle arena and users can rate it.

We aim to support this workflow incrementally. To begin with, we created an evaluation pipeline and evaluated a set of datasets (both internal and external) against it. The aim was to benchmark different models and to track their performance over time using a common set of metrics. When a new model becomes available, it is evaluated within the same framework; its results can then be compared to the existing results and, depending on how well it performs, a decision can be made on whether to consider that model for further use cases.

To present the results of LLM evaluations, we created a leaderboard that displays individual scores for how well an LLM (that TR employees have access to) performs against various task categories.

Note: The evaluation framework is a work in progress. The metrics displayed in the screenshot below (and others in the blog post) are therefore intermediate and should not be taken as authoritative reporting.

Task Leaderboard

The individual score assigned to each task category is the mean (converted to a percentage) of the following individual metrics, with each metric normalised to a value between 0 (worst) and 1 (best).
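
As a minimal sketch of this calculation (the metric names and values below are placeholders, not metrics from our framework):

# Hypothetical normalised metric scores for one LLM on one task category.
metric_scores = {"metric_a": 0.72, "metric_b": 0.64, "metric_c": 0.81}

# The category score is the mean of the normalised metrics, shown as a percentage.
category_score = 100 * sum(metric_scores.values()) / len(metric_scores)
print(f"Category score: {category_score:.1f}%")  # Category score: 72.3%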

The task categories “Classification” and “Multiple Choice Question Answer” have the following metrics:

The task categories “Entity Extraction”, “Question Answer” and “Summarization” have the following metrics:

It is also possible to click on a task category to see the individual scores and evaluation runs that constitute the overall score.

Leaderboard for the Classification task category

The individual scores that constitute the average are also accessible.

Individual scores for gpt-3.5-turbo for legal bench (consumer contracts qa) dataset

The above leaderboards present an overview of how well an LLM performs against datasets, both internal (to TR) and public. These leaderboards can be used to narrow down the LLMs that might be best suited for a task category.

While comprehensive in terms of the metrics covered, the above approach has an obvious shortcoming: it does not allow benchmarking LLMs through human evaluation of their responses.

Elo Leaderboard

Automated evaluation scores like ROUGE can provide an objective overview of how well an LLM performed. However, compared to human evaluations, automated evaluations have the following shortcomings:

  • Understanding context: Humans are better than LLMs at understanding context, and can make nuanced judgements about the appropriateness or accuracy of an LLM’s output.
  • Capturing subjectivity: Given that most of the datasets we evaluated are legal datasets, a subject matter expert would be better equipped than an LLM to judge how well the generated response addresses the query.
  • Detecting errors of omission: Human subject matter experts would be more experienced in assessing whether the generated response captures all the important information.
  • Generative outputs: Automatic metrics don’t work very well for tasks with open-ended generative outputs, where there are many possible good answers. Metrics like ROUGE, which rely on word overlap as a proxy for a good output, have been shown to correlate poorly with human judgements, and also depend on having a gold answer to compare to. For some tasks we may not have gold data.
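
To make the word-overlap point concrete, here is a simplified unigram-recall sketch (not the full ROUGE implementation; the example sentences are invented):

def unigram_recall(candidate, reference):
    """Simplified ROUGE-1-style recall: share of reference words that appear in the candidate."""
    cand = set(candidate.lower().split())
    ref = reference.lower().split()
    return sum(1 for tok in ref if tok in cand) / len(ref)

# Two reasonable answers with little word overlap score poorly against each other.
print(unigram_recall(
    "The clause permits termination with thirty days notice",
    "Either party may end the agreement after giving 30 days notice",
))  # ~0.27 (only 3 of 11 reference words overlap)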

For these reasons, we decided to incorporate humans into our LLM evaluation framework.

From our evaluation runs for each dataset-LLM pair, we had access to the response that the LLM generated for each item in the dataset, and we wanted humans to evaluate these responses. A possible approach would have been to simply ask subject matter experts (SMEs) to rate each generated response and aggregate those ratings into an overall rating for the dataset-LLM pair. However, this rates each LLM in isolation and does not provide an overall comparative score. What we wanted was to evaluate how LLMs performed compared to each other.

We searched for what other developers working on the same problem had proposed, and came across Chatbot Arena, which uses the Elo rating system to rank LLMs. It implements a battle arena where two models (selected at random and hidden from the user) generate an answer to the same query. The user then records their preference (Model A is better, Model B is better, both are good, or both are bad), and this feedback is used to update the Elo ratings. Only after the user has submitted their response does the system reveal which models generated the answers. We decided to implement this approach for our evaluations, as it gives us a way to compare LLMs directly against each other.

Implementation

Unlike Chatbot Arena, which allows users to input their own queries, we wanted users of our tool to evaluate the responses that we had already generated for our datasets. To achieve this, we first cleaned up the data and the recorded responses and created a JSON file for each dataset with the following schema:

[
  {
    "query_id": "Unique identifier for the query",
    "llm_id": "Unique identifier for the LLM",
    "llm_eval_config_id": "Unique identifier for the dataset-model pair evaluation run",
    "eval_dataset_name": "Evaluation dataset name",
    "query": "The query that was sent to the LLM",
    "output": "The response the LLM generated"
  },
  ...
]

We added this data to a database (DynamoDB). We then created an API which, given a dataset, returns a random pair of query/output records from that dataset, with the constraint that both members of the pair share the same query_id.
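
A minimal sketch of this pairing logic, assuming the records follow the JSON schema above and have already been loaded into memory (the function name and the use of Python's random module are illustrative; the actual API queries DynamoDB):

import random
from collections import defaultdict

def sample_battle_pair(records, eval_dataset_name):
    """Pick two responses to the same query, generated by two different LLMs."""
    # Group the responses for the requested dataset by query_id.
    by_query = defaultdict(list)
    for rec in records:
        if rec["eval_dataset_name"] == eval_dataset_name:
            by_query[rec["query_id"]].append(rec)

    # Keep only queries that have responses from at least two distinct LLMs.
    candidates = [
        recs for recs in by_query.values()
        if len({r["llm_id"] for r in recs}) >= 2
    ]
    recs = random.choice(candidates)

    # Sample two responses from different LLMs for that query.
    model_a = random.choice(recs)
    model_b = random.choice([r for r in recs if r["llm_id"] != model_a["llm_id"]])
    return model_a, model_b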

We then present this query/answer pair to the user and ask them to record their feedback. We present them with the same options as Chatbot Arena — Model A is better, Model B is better, both are good, or both are bad.

Rate Models

Like Chatbot Arena, we start with a rating of 1000 for each LLM. The rating gets updated each time the user records their feedback. After the user has evaluated the models, they are informed of the models that they were evaluating and their updated Elo rating.

Model Rating Feedback
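
For reference, a minimal sketch of the classic per-battle Elo update (the K-factor of 32 is an illustrative choice, not a value from our system; our leaderboard ratings are actually recomputed with the Bradley-Terry fit shown further below):

def expected_score(rating_a, rating_b, base=10, scale=400):
    """Probability that model A is preferred over model B under the Elo model."""
    return 1 / (1 + base ** ((rating_b - rating_a) / scale))

def update_elo(rating_a, rating_b, score_a, k=32):
    """Update both ratings after one battle; score_a is 1, 0, or 0.5 for a tie."""
    e_a = expected_score(rating_a, rating_b)
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - e_a))
    return new_a, new_b

# Both models start at 1000; "Model A is better" maps to score_a = 1.
print(update_elo(1000, 1000, score_a=1))  # (1016.0, 984.0)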

The architecture for the above flow is provided below.

Elo Architecture

For updating the Elo ratings, we base our work on the Bradley-Terry model as described by Chatbot Arena.

import math

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression


def compute_mle_elo(self, ptbl_win, SCALE=400, BASE=10, INIT_RATING=1000, sample_weight=None):
    """Compute Elo ratings using maximum likelihood estimation."""
    # Map each model name to a column index in the design matrix.
    models = pd.Series(np.arange(len(ptbl_win.index)), index=ptbl_win.index)
    p = len(models)
    X = np.zeros([p * (p - 1) * 2, p])
    Y = np.zeros(p * (p - 1) * 2)
    cur_row = 0
    sample_weights = []
    for m_a in ptbl_win.index:
        for m_b in ptbl_win.columns:
            if m_a == m_b:
                continue
            # Skip pairs with missing (NaN) win counts.
            if math.isnan(ptbl_win.loc[m_a, m_b]) or math.isnan(ptbl_win.loc[m_b, m_a]):
                continue
            # One row where m_a is preferred (label 1), weighted by m_a's wins over m_b.
            X[cur_row, models[m_a]] = +math.log(BASE)
            X[cur_row, models[m_b]] = -math.log(BASE)
            Y[cur_row] = 1.0
            sample_weights.append(ptbl_win.loc[m_a, m_b])
            # One row with the same features but label 0, weighted by m_b's wins over m_a.
            X[cur_row + 1, models[m_a]] = math.log(BASE)
            X[cur_row + 1, models[m_b]] = -math.log(BASE)
            Y[cur_row + 1] = 0.0
            sample_weights.append(ptbl_win.loc[m_b, m_a])
            cur_row += 2
    X = X[:cur_row]
    Y = Y[:cur_row]
    # Fit a logistic regression without intercept; the coefficients are the model strengths.
    lr = LogisticRegression(fit_intercept=False, penalty=None, tol=1e-6)
    lr.fit(X, Y, sample_weight=sample_weights)
    # Rescale the fitted strengths to the familiar Elo scale.
    elo_scores = SCALE * lr.coef_[0] + INIT_RATING
    return pd.Series(elo_scores, index=models.index).sort_values(ascending=False)
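
Here, ptbl_win is expected to be a square pandas DataFrame of pairwise win counts, where ptbl_win.loc[A, B] holds the number of times model A was preferred over model B. A hypothetical example of building such a table from recorded battles (model names and counts are illustrative only):

import pandas as pd

# Hypothetical battle log: (winner, loser) pairs derived from user feedback.
battles = [
    ("model_a", "model_b"),
    ("model_a", "model_b"),
    ("model_a", "model_c"),
    ("model_b", "model_a"),
    ("model_b", "model_c"),
    ("model_c", "model_a"),
    ("model_c", "model_b"),
]

model_names = sorted({name for pair in battles for name in pair})

# ptbl_win.loc[A, B] counts how often model A was preferred over model B.
ptbl_win = pd.DataFrame(0.0, index=model_names, columns=model_names)
for winner, loser in battles:
    ptbl_win.loc[winner, loser] += 1

# ratings = compute_mle_elo(ptbl_win)  # in our code this is a method on the evaluator class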

The overall ratings for each LLM are then presented in a separate Elo leaderboard.

Leaderboard Elo Rating

Lessons Learnt

The datasets we have evaluated so far belong to the legal domain, and we currently have no way to ensure that only SMEs familiar with a dataset rate its responses. In addition, our approach to fetching a random query/output pair is not extensible, as it assumes that the range of query_id values for each dataset is known and that the IDs are integers in incremental order.

In the coming weeks, we aim to implement a role-based solution where only the feedback from allowed users is used to update the Elo ratings. We are also transitioning to DocumentDB as a datastore which will allow us to improve our logic for generating random pairs.

Once a role-based solution is in place, we can further segregate the Elo ratings by role, making it possible to view the ratings given by SMEs and by non-SMEs separately.

Future

We believe we have laid the groundwork for building an evaluation framework and LLM leaderboard at Thomson Reuters. We have an evaluation pipeline that can be reused when a new LLM becomes available, and we are able to calculate a common set of metrics that give us an idea of how well the LLM performs. We aim to make this accessible to everyone at Thomson Reuters. To that end, we are working closely with other teams at Labs on the following initiatives:

  • Midas — The Core Capability team at TR Labs is building a gold data management system (called Midas). Our evaluation framework would utilise Midas to store the gold data that our solution evaluates. This would become the way new datasets are added to our evaluation pipeline.
  • Foundational Research — We are also closely working with the Foundational Research Team to improve our framework and make it statistically robust and extensible.

Additionally, we aim to provide an API which can be used to add pairwise comparisons between different LLMs to our Evaluation Feedback DB. This would provide a way for teams at TR to add comparison results obtained in other annotation tools to our database. The Elo ratings will be recalculated after new data is added.

Our aim remains to build an evaluation framework that is useful to everyone working with LLMs at Thomson Reuters. Our use cases fall into the following two broad categories:

  • ML practitioners have a ranking list that they can use to decide which LLM to pick for their next task.
  • Product partners can have a baseline measure of how well untrained LLMs will perform on their tasks.

We will keep working with these groups as we continue to develop our evaluation framework and explore novel ways to present the findings. Watch this space for further updates!

💬 Until then, let the conversation begin, here, or start a chat on our LinkedIn group!
