LLM in a Real-Feature World

Tamar Yankelevich
CyberArk Engineering
8 min read · Mar 6, 2024


Image created by Pixlr

“People say nothing is impossible, but I do nothing every day.” -Winnie the Pooh

Since November 30, 2022, we have been doing less and less (almost nothing!) in our day-to-day work.

As a software engineer, I enjoy GitHub Copilot completing my thoughts even faster than I can think them. I consult GPT on how to solve problems as though a principal software architect were sitting next to me at every moment, reviewing my code and helping with high-level design (HLD) and technical issues. After all this, we now want to use an LLM (Large Language Model) in the real feature world: not just demos, a "hello world" printed on the screen, or magic pictures made by AI.

We also want to create impact for our customers through LLMs, integrating AI with the user's own data to deliver real value.

So can we do it? For us, the answer is yes.

Going for a Real-World Test Drive

For our use case, we wanted to send an LLM the long stream of user actions (thousands of events) recorded in a browser session and get back a summary of those actions, including security aspects.

Auditors often run into challenges when reviewing these browser sessions manually. It can be difficult for them to identify which sessions are suspicious or need extra attention, especially if they lack technical expertise. Additionally, some sessions can be quite long, making a full review impractical. Unfortunately, this inefficiency leads to wasted time and increased security risk.

Just imagine how amazing it would be if AI were able to go through long sessions, summarize them, and even mark any suspicious behavior it finds. Pretty cool, right?

In this post, we embark on a journey to share the insights gained during our exploration. We’ll focus on:

  • Prompt engineering
  • Effectively handling large amounts of data
  • Token optimization to save cost and improve latency

To begin, let’s discuss how to choose the right model for your specific needs.

Choosing the Right LLM Model

“The formulation of the problem is often more essential than its solution.” -Albert Einstein

How can you find an LLM model that best aligns with your needs?

Although that’s a question for a different post, you can take a look at Alex Abramov’s post that helps you decide what’s the best model for you (Spoiler: There is no “best model.” Choose the one that suits your needs in all aspects: security, privacy, simplicity, pricing, rate limits, content length, processing time, and more).

The best fit for our needs was Claude 2.1 (Anthropic), provided by Amazon Bedrock.

Now that you’ve chosen Claude, GPT, or Bard as your LLM, let’s make sure the model communicates with you effectively. In the next section, you’ll find some helpful tips on prompt engineering to get exactly the results you need. Let’s get started!

Prompt Engineering in 1000 Tokens or Less

“I may not have gone where I intended to go, but I think I have ended up where I needed to be.” — Douglas Adams

It took time to get the final result, and prompt engineering was the key to understanding what we wanted and how to get it.

Many guides explain what is best for prompt engineering, but here we summarize some of the tips we used.

  • Role-based prompting: Ask the model to answer as if it were someone else (a security auditor, for example) to get a response from a specific perspective. A bit of role-play with the AI goes a long way.
  • Format output: Specify the exact output format.

For example: Answer in a JSON format with the following JSON schema:

But this is not enough!

Our request for the summary in the above format generated the following funny response:

“Certainly, I’ll give you the summary in JSON format.” Then the JSON was attached. But this is not what I asked for. There should be no text before or after the JSON.

To handle this, try the “put words in Claude’s mouth” tip: write the beginning of Claude’s response for it. For example, add an opening “{“ to the beginning of the Assistant turn to signal that the response should contain JSON only (see the sketch after this list).

  • Few-shot prompting: Add examples to the prompt that illustrate the expected output.
    The AI better understands your expectations and is more accurate as a result.
  • Human-Like Interaction: When interacting with the AI, treat it as if it were a human. Break down complex tasks into smaller, understandable parts. This helps improve the AI’s understanding of your instructions. You can check if your instructions are clear by showing them to a friend and asking if they can follow them.
  • Content limitation: Clearly define content boundaries. For example, never reveal the value of passwords or secrets in your summary, only mention their usage. If you must quote a sentence that includes one, obfuscate the value.
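To make these tips concrete, here is a minimal Python sketch of calling Claude 2.1 on Amazon Bedrock with a role, a JSON-only output instruction, and a pre-seeded “{“ in the Assistant turn. The prompt wording, schema, and parameters are illustrative assumptions, not our exact production prompt.

import json

import boto3

bedrock = boto3.client("bedrock-runtime")

# Hypothetical output schema used only for this sketch.
schema_hint = '{"summary": "string", "suspicious_events": ["string"]}'

prompt = (
    "\n\nHuman: You are a security auditor. Summarize the browser session "
    "events below and flag any suspicious behavior. Answer in JSON only, "
    f"matching this schema: {schema_hint}\n\n"
    "<events>\n(session events go here)\n</events>"
    "\n\nAssistant: {"  # pre-seeding "{" pushes the reply to start as JSON
)

response = bedrock.invoke_model(
    modelId="anthropic.claude-v2:1",  # Claude 2.1 on Bedrock
    body=json.dumps({
        "prompt": prompt,
        "max_tokens_to_sample": 1000,
        "temperature": 0,
    }),
)

completion = json.loads(response["body"].read())["completion"]
summary = json.loads("{" + completion)  # re-attach the brace we injected
print(summary)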

Remember, the key lies in experimentation and continuous refinement until you find the perfect prompt for your needs. Happy prompting!

Our prompts are now polished to meet our needs, which means we are ready to take on the next challenge: handling a large amount of data.

Large Data and Quota Challenges

“You may delay, but time will not.” — Benjamin Franklin

In our journey to integrate LLM into a real-world feature, we encountered some interesting challenges when dealing with a large volume of data. AI, as powerful as it is, struggles with accuracy when confronted with massive chunks of information.

To overcome this hurdle, we broke down the data into smaller, more manageable chunks. This approach not only improved accuracy but also had a positive impact on the overall performance.

There are two ways you can do this:

  1. Collect and summarize

The first method involves dividing the data into chunks, processing each segment independently, and then stitching the results together into the complete response. It’s like a puzzle; each piece adds to the bigger picture.

An advantage of this method is the ability to run multiple requests concurrently.

  2. Chunk by chunk

The second method is a bit more dynamic. Here we send the data in chunks, but there’s a twist: with each chunk, we also include the output of the previous one, resulting in a well-connected and complete storyline. A sketch of both methods appears below.
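Here is a minimal Python sketch of both methods. It assumes the Bedrock client from the earlier example; the chunk size, prompts, and the summarize_chunk helper are illustrative, not our production code.

import json
from concurrent.futures import ThreadPoolExecutor

import boto3

bedrock = boto3.client("bedrock-runtime")


def summarize_chunk(text: str) -> str:
    """Ask Claude 2.1 to summarize a single piece of text."""
    response = bedrock.invoke_model(
        modelId="anthropic.claude-v2:1",
        body=json.dumps({
            "prompt": f"\n\nHuman: Summarize these session events:\n{text}\n\nAssistant:",
            "max_tokens_to_sample": 500,
        }),
    )
    return json.loads(response["body"].read())["completion"]


def collect_and_summarize(events: list[str], chunk_size: int = 200) -> str:
    """Method 1: summarize chunks concurrently, then stitch the pieces together."""
    chunks = ["\n".join(events[i:i + chunk_size])
              for i in range(0, len(events), chunk_size)]
    # Keep concurrency modest so we stay under the Bedrock rate limits.
    with ThreadPoolExecutor(max_workers=4) as pool:
        partial_summaries = list(pool.map(summarize_chunk, chunks))
    return summarize_chunk("\n\n".join(partial_summaries))  # final stitching pass


def chunk_by_chunk(events: list[str], chunk_size: int = 200) -> str:
    """Method 2: send chunks sequentially, carrying the summary so far with each one."""
    summary_so_far = ""
    for i in range(0, len(events), chunk_size):
        chunk = "\n".join(events[i:i + chunk_size])
        summary_so_far = summarize_chunk(
            f"Summary so far:\n{summary_so_far}\n\nNew events:\n{chunk}\n\n"
            "Update the summary so it covers everything seen so far."
        )
    return summary_so_far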

But… there are also limits and quotas. Let’s talk numbers. Here’s a snapshot from Amazon listing the quota data:

Source: Amazon

These numbers are essential for your journey. It’s not just about speed; it’s about the overall performance that aligns with your specific requirements.

A note of caution: when adopting the “collect and summarize” approach, or generally whenever running multiple requests concurrently, be mindful not to exceed these limits. To keep performance well balanced, respect the concurrency thresholds and implement strategic delays; a small backoff sketch follows.
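For the strategic-delay part, a simple exponential backoff around the Bedrock call goes a long way. A minimal sketch (the retry count and sleep times below are illustrative, not tuned values):

import time

from botocore.exceptions import ClientError


def invoke_with_backoff(call, max_retries: int = 5):
    """Run a Bedrock call, backing off exponentially when we get throttled."""
    for attempt in range(max_retries):
        try:
            return call()
        except ClientError as err:
            if err.response["Error"]["Code"] != "ThrottlingException":
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, ... before retrying
    raise RuntimeError("Still throttled after retries")

Wrapping each summarize_chunk call with invoke_with_backoff keeps the concurrent approach inside the quota without hand-tuning delays.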

But…is there any way to overcome this obstacle? Are there any other methods besides using a strategic delay?

Since latency is directly proportional to the number of input and output tokens, the solution is token optimization. Optimizing tokens is what you are looking for!

How to Optimize LLM Tokens for Price, Latency, and Throughput Efficiency

“There’s no such thing as a free lunch.” -Milton Friedman

Why do we need to minimize tokens? AI is trendy, but remember, token usage isn’t free. It impacts costs, throughput and latency.

Every model has a different price. For instance, the Claude 2.1 model costs $0.008 per 1,000 input tokens. That might not sound like much, but imagine the cost of dealing with large volumes of data: a single session can cost around $1, and for a feature used by many customers across many sessions, that adds up quickly!

Additionally, latency is directly proportional to the number of input and output tokens, and there is also a throughput quota: for Claude, the maximum number of tokens processed per minute is 200K. So token optimization becomes an integral part of the development process.

In the beginning, a single session consumed about 30K tokens, but as we kept optimizing, we were able to represent the same session using only ~8K tokens. Pretty cool, huh?
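As a back-of-the-envelope calculation (the output-token price and output size below are assumptions; check the current Bedrock pricing page), the before-and-after looks roughly like this:

INPUT_PRICE_PER_1K = 0.008    # USD per 1,000 input tokens (from the post)
OUTPUT_PRICE_PER_1K = 0.024   # assumed output-token price for Claude 2.1


def session_cost(input_tokens: int, output_tokens: int = 2_000) -> float:
    """Rough cost of a single summarized session, in USD."""
    return (input_tokens / 1_000) * INPUT_PRICE_PER_1K \
        + (output_tokens / 1_000) * OUTPUT_PRICE_PER_1K


print(session_cost(30_000))  # before optimization: ~30K input tokens
print(session_cost(8_000))   # after optimization: ~8K input tokens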

Here are a few tricks for minimizing tokens:

  1. CSV/Minified JSON Format: This can often save more than 50% of the cost. If your data allows it, a CSV (or YAML) format is much better for you; if not, at least minify your JSON input/output. Another bonus: roughly a 50% reduction in processing time!

Let’s demonstrate with a few examples:

Original JSON Data (81 tokens)

Minified JSON (44 tokens)

CSV Format (29 tokens)
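The examples above were shown as images in the original post, so here is a hypothetical equivalent in Python; the records and exact token counts will differ with your data.

import csv
import io
import json

events = [
    {"time": 1706787293000, "action": "navigate", "url": "https://example.com/login"},
    {"time": 1706787301000, "action": "click", "url": "https://example.com/settings"},
]

pretty_json = json.dumps(events, indent=4)                 # most tokens
minified_json = json.dumps(events, separators=(",", ":"))  # whitespace removed

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["time", "action", "url"])
writer.writeheader()
writer.writerows(events)
csv_format = buffer.getvalue()                             # field names appear only once

print(pretty_json, minified_json, csv_format, sep="\n\n")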

  2. Format Optimization: Optimizing token usage goes beyond file formats. Take dates, for instance. Converting a ‘human’ format like ‘Thursday, February 1, 2024, 11:34:53 AM’ (17 tokens) to epoch time (‘1706787293000’, 5 tokens) demonstrates a simple token-saving trick.

This tip can be applied to other formats, too. For example, you can send large numbers in scientific notation instead of their full numerical representation.
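A small sketch of the date trick (assuming the timestamp is UTC):

from datetime import datetime, timezone

human = "Thursday, February 1, 2024, 11:34:53 AM"
parsed = datetime.strptime(human, "%A, %B %d, %Y, %I:%M:%S %p")
epoch_ms = int(parsed.replace(tzinfo=timezone.utc).timestamp() * 1000)
print(epoch_ms)  # 1706787293000, far fewer tokens than the verbose form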

  3. URL Trimming: Another example from our use case was URL trimming. URLs are often laden with tokens and can be trimmed down. Consider this 59-token URL:

The query params in the URL are not interesting, though, so by stripping them you can convey the same information with only 8 tokens.
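The original 59-token URL was shown as an image, so the URL below is a hypothetical stand-in; the idea is simply to drop the query string and keep the part that matters.

from urllib.parse import urlsplit, urlunsplit

url = "https://console.example.com/home?region=us-east-1&session=abc123&ref=campaign"
scheme, netloc, path, _query, _fragment = urlsplit(url)
trimmed = urlunsplit((scheme, netloc, path, "", ""))
print(trimmed)  # https://console.example.com/home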

  4. Field Selectivity: When creating your input data, be selective about the fields you include. Not every field is essential for the model to understand the data or generate a meaningful response. Exclude unnecessary fields to reduce the token count. For example, if you are processing user data, omitting less crucial details like middle names or secondary addresses can make a notable difference.
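A tiny sketch of field selectivity with hypothetical event fields; the allow-list is an assumption about what a summary actually needs.

KEEP = {"time", "action", "url", "user"}


def slim(event: dict) -> dict:
    """Drop fields the model does not need, cutting the token count."""
    return {key: value for key, value in event.items() if key in KEEP}


print(slim({
    "time": 1706787293000, "action": "click", "url": "https://example.com/settings",
    "user": "j.doe", "browser_version": "121.0", "screen_resolution": "1920x1080",
}))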

Our model is now aligned with our needs, accurately responding to inputs. We have also successfully optimized costs, latency, and token usage, and are almost ready to deploy it in a real-world scenario.

What’s Next in Real-World LLMs?

“Somewhere, something incredible is waiting to be known.” -Carl Sagan

During the past year, AI has changed the game and transformed the way we think, develop, and complete tasks. I hope this post has given you some cool ideas on how to make a real impact and use LLMs in real-world scenarios. I invite you to try it and experience the challenges and excitement yourself. While we cannot predict the specifics, the pace of discovery and innovation in the world of AI assures us that the journey continues; something new is always waiting to be known.

Special thanks to Michael Balber, Principal Software Architect and co-author of this post, for his expertise and contribution to our LLM journey. Michael’s guidance and professionalism have been instrumental in turning our LLM vision into a real-world feature.
