Deploying LLMs in AWS for job listing summarisation

Tim Elfrink
the-stepstone-group-tech-blog
Aug 12, 2024

Not all that glitters is an OpenAI GPT model, especially when it comes to summarisation. That is the lesson from the journey we embarked on a few months ago when we started working on job listing summarisation.

StepStone, one of the world’s leading job platforms, has long built products revolving around data, and AI is at the forefront of its strategy. One of our largest sources of data is job listing text, and summarising it matters to StepStone for several reasons:

  • It allows users to preview the most important aspects of a job listing, giving them a quick sense of whether they could be a good fit.
  • It unifies writing style which, together with concise and clear text, improves readability for applicants who want a quick overview of the available offers.
  • It reduces the compute time needed to calculate embeddings, which benefits other use cases such as recommendation algorithms that match users with suitable job offers.
  • It provides a more digestible format for mobile application users.

Given the different requirements of these use cases across the organisation, LLMs are a great fit. However, from the very beginning we have focused on open-source models or, alternatively, models available through AWS.

What constitutes a good summary

There are multiple use cases for job listing summaries across StepStone, such as chatbots, the mobile app, or SEO pages. Each of them comes with different length constraints and requirements, so no single standard fits all. We can, however, set some general guidelines to assess the quality of a summary:

  • It covers all relevant aspects of the job listing, at minimum an introduction to the role, requirements, duties, and compensation.
  • It contains no factual errors or hallucinations, as it is critical that none of these are shown to end users.
  • It is well formed and well written, with no unfinished sentences and a cohesive style throughout.
  • It retains the most valuable information from the listing. We have also used similarity metrics such as BERTScore and BLEURT to control how extractive our summaries are with respect to the original listing texts, as sketched below.
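
As a minimal sketch of the last point, this is roughly how a generated summary can be scored against its source listing with the bert-score package; the texts and settings here are illustrative, not our production configuration.

# Minimal sketch: scoring a generated summary against the original listing
# with BERTScore; texts and parameters are illustrative.
from bert_score import score

listing_text = "We are looking for a data engineer to build pipelines ..."  # full job listing
summary_text = "Data engineer role focused on building data pipelines ..."  # model output

# P, R, F1 are tensors with one value per candidate/reference pair
P, R, F1 = score(
    [summary_text],   # candidates
    [listing_text],   # references
    lang="en",
    verbose=False,
)
print(f"BERTScore F1: {F1.item():.3f}")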

We have had the support of StepStone’s Linguistic Services team, which provided detailed feedback on what a good summary should look like.

Thanks to their support, we defined the above guidelines and identified what a high-quality output from our model looks like.

Objectives and strategy

Self-hosting models in our own AWS account is tempting, as it would give us full control over and accountability for them, without depending on external providers.

Open-source models are a great option for this; Falcon 40B was the first approach we experimented with, and it yielded decent results. However, hosting such large models also has disadvantages, particularly cost. GPU instances are expensive, and their usage should be carefully planned, as we want no surprises in our bills.

Hosting Falcon on a g5.12xlarge instance costs $3.573/hour under a one-year savings plan, roughly $31,300 per instance per year if it runs continuously. This does not account for horizontal autoscaling, which, given the current volume of new listings arriving at StepStone every day, would push the cost considerably higher. A parallel setup in a development environment would double it again, for a total spend in the range of several tens of thousands of dollars per year.

In short, we would like to keep costs lower and, by moving to a smaller model, potentially reduce latency as well. We aim to do all of this while preserving or even improving the quality of our summaries.

Benchmarking models for summarisation

As mentioned before, serving large models like Falcon in our own AWS account also means a large bill.

Looking strictly at performance, there are a few more things about this model’s summaries that could be improved:

  • A limited context length of only 2,048 tokens. Since job listings can be long, this forced us to handle such cases with a map-reduce approach using LangChain (see the sketch after this list).
Map-reduce approach to summarise long listing texts in Falcon 40B

This is not ideal: context can get lost, and attention can end up spread more thinly than desirable instead of focusing on the most important parts of the listing.

  • Response time exceeds our requirements, taking around 15–30 seconds to generate a summary.
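
For reference, the map-reduce workaround mentioned above looks roughly like this with the legacy LangChain summarize-chain API; the chunk sizes and the llm object (a wrapper around the Falcon endpoint) are illustrative assumptions.

# Sketch of the map-reduce summarisation workaround for Falcon's 2,048-token
# context window (legacy LangChain API; chunk sizes are illustrative).
from langchain.chains.summarize import load_summarize_chain
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.docstore.document import Document

def summarise_long_listing(llm, listing_text: str) -> str:
    # Split the listing into chunks that fit the model's context window
    splitter = RecursiveCharacterTextSplitter(chunk_size=4000, chunk_overlap=200)
    docs = [Document(page_content=chunk) for chunk in splitter.split_text(listing_text)]

    # Map: summarise each chunk separately; Reduce: combine the partial summaries
    chain = load_summarize_chain(llm, chain_type="map_reduce")
    return chain.run(docs)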

So, before considering options like fine-tuning, we first need a model that generates high-quality summaries from which we can distil knowledge, because curating training samples manually is an extensive and lengthy process.

Therefore, our first step must be to ensure we can generate good summaries.

Testing summary generation with various models

Bedrock is a new AWS service that grants access to foundation models through a simple API call. It offers multiple powerful LLMs, such as Titan and models from AI21 Labs and Anthropic. Of these, we have experimented extensively with Claude 2, one of the most capable models on Bedrock: it offers a large context length (up to 100,000 tokens), robust performance and availability, and fast output generation. Besides closed-source models, Bedrock also provides access to massive open-source models such as Mixtral 8x7B Instruct, which are usually very expensive to host yourself. Paying per token rather than per instance-hour can be really cost effective.
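
As a rough sketch, invoking Claude 2 on Bedrock from Python looks like this with boto3; the region, prompt template, and generation parameters are illustrative, not our production settings.

# Minimal sketch of calling Claude 2 on Bedrock with boto3.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="eu-central-1")  # region is an assumption

def summarise_listing(listing_text: str) -> str:
    # Claude 2 on Bedrock uses the Human/Assistant text-completion prompt format
    prompt = f"\n\nHuman: Summarise this job listing:\n{listing_text}\n\nAssistant:"
    response = bedrock.invoke_model(
        modelId="anthropic.claude-v2",
        body=json.dumps({
            "prompt": prompt,
            "max_tokens_to_sample": 500,
            "temperature": 0.2,
        }),
    )
    return json.loads(response["body"].read())["completion"]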

Our Bedrock client code is packaged as a Docker image and deployed as an AWS Lambda function. This way, we can access every Bedrock model through a simple Lambda invocation from any region, provided we can make an HTTP request through an Application Load Balancer (ALB) or API Gateway.

To generate summaries on demand, we build a Docker container with the above workflow and, to productionise it, embed it into the following process:

Architecture for generating summaries in bulk or on-demand

Periodically, a scheduled CloudWatch event triggers a Lambda function that starts a SageMaker Processing job running the above workflow. The same process can also be run on demand through our ALB. The output summaries can be ingested by other StepStone services or written directly to an S3 bucket.
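
A minimal sketch of the scheduler Lambda, assuming the summarisation container lives in ECR; the image URI, role ARN, bucket, and instance type below are placeholders, not our actual resources.

# Sketch of the scheduler Lambda: a CloudWatch (EventBridge) rule triggers this
# handler, which starts a SageMaker Processing job running the summarisation container.
import time
import boto3

sagemaker = boto3.client("sagemaker")

def handler(event, context):
    job_name = f"listing-summaries-{int(time.time())}"
    sagemaker.create_processing_job(
        ProcessingJobName=job_name,
        RoleArn="arn:aws:iam::123456789012:role/summarisation-processing-role",  # placeholder
        AppSpecification={
            "ImageUri": "123456789012.dkr.ecr.eu-central-1.amazonaws.com/listing-summariser:latest",  # placeholder
        },
        ProcessingResources={
            "ClusterConfig": {
                "InstanceCount": 1,
                "InstanceType": "ml.m5.xlarge",  # illustrative instance type
                "VolumeSizeInGB": 30,
            }
        },
        ProcessingOutputConfig={
            "Outputs": [{
                "OutputName": "summaries",
                "S3Output": {
                    "S3Uri": "s3://example-bucket/job-listing-summaries/",  # placeholder
                    "LocalPath": "/opt/ml/processing/output",
                    "S3UploadMode": "EndOfJob",
                },
            }]
        },
    )
    return {"statusCode": 200, "body": job_name}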

Operational costs

While models like Claude and Mixtral performed exceptionally well at summary generation, maintaining an infrastructure that relies on Bedrock can be expensive. That, together with privacy concerns around the data we use, prompted us to steer away from this approach and try a cheaper, self-hosted solution based on an open-source model.

The literature has shown that with just 1,000 samples we can already fine-tune LLMs effectively.

Let’s fine-tune Llama 2 7B with QLoRA, which quantises the pre-trained model to just 4 bits, reducing memory usage enough to run the fine-tuning on a single GPU instance.

To generate the input dataset, we set up a process of creating summaries for a subset of our listings and then reviewing them with the support of our Linguistics team. This step ensures that the training data for the fine-tuned model conforms to our internal listing text guidelines and tone.
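
A minimal QLoRA fine-tuning sketch with the Hugging Face transformers, peft, and trl libraries follows, assuming the curated pairs are stored as a JSONL file with a "text" field; the file name, target modules, and hyperparameters are illustrative.

# QLoRA fine-tuning sketch for Llama 2 7B (transformers + peft + trl);
# dataset path and hyperparameters are illustrative.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig
from trl import SFTTrainer

model_id = "meta-llama/Llama-2-7b-hf"

# 4-bit quantisation keeps the base model small enough for a single GPU
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token

# LoRA adapters are the only trainable parameters
peft_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")

# ~1,000 curated (listing, summary) pairs formatted into a single "text" field
dataset = load_dataset("json", data_files="curated_summaries.jsonl", split="train")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    dataset_text_field="text",
    max_seq_length=2048,
    tokenizer=tokenizer,
    args=TrainingArguments(
        output_dir="llama2-7b-summaries-qlora",
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        learning_rate=2e-4,
        bf16=True,
        logging_steps=10,
    ),
)
trainer.train()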

Fine-tuning on such a small set of samples takes just under 2 hours which, given the instance pricing, amounts to a total of:

1.95 h × $1.813/h ≈ $3.54 for fine-tuning the model

So fine-tuning is cheap. Furthermore, the resulting model can run on a g5.2xlarge instance, hosted for only $0.8524/hour, roughly 76% cheaper than hosting the original Falcon 40B model.

Conclusion and final takes

We have compared the performance of different LLMs and presented an architecture for generating summaries in bulk in your own AWS infrastructure, while keeping resource usage efficient and output quality consistently high.

Thank you for reading!

Special thanks to Ricardo Fernandez for writing the original post and integrating the solution!

Tim Elfrink is a Machine Learning Engineer at The Stepstone Group.