GenAI: Lessons working with LLMs

Dion Almaer
Published in Ben and Dion · 6 min read · Feb 15, 2023

Creativity & Constraints, Foundations & Flywheels

The developer community is buzzing around the new world of LLMs. Roadmaps for the year are getting ripped up one month in, and there is a whole lot of tinkering… and I love the smell of tinkering.

At Shopify we shared a new Winter Edition, which packaged up 100+ features for merchants and developers. Some of the launches had a lil Shopify Magic in them, using LLMs to make life better for our users.

I had a lot of fun, shipping something for developers that used LLMs, and I thought I would write about a few things that I learned going through the process of getting to shipping.

NOTE: This was originally posted on my own corner of the Internet.

What did we ship? mock.shop

We want to make it as easy as possible for developers to learn and explore commerce by playing. We wanted to remove as much friction as possible from exploring a commerce data model and building a custom frontend to show off your work.

This is where mock.shop comes in: it sits in front of a Shopify store, but doesn’t require you to create one yourself. Just start playing with it and hitting it directly!

One thing we have heard from some developers is that they are new to GraphQL and/or to the particulars of the commerce domain. We show examples, including the GraphQL and code for working with it, but could we go even further?

Generate query with AI

What if you could just use your words and ask us to generate the GraphQL for you? That’s exactly what we did. And here’s what we learned…

Foundations & Flywheels

We used OpenAI for this work, and when working with LLMs you are working with a black box. While GPT-3 had some knowledge of GraphQL and Shopify, its knowledge was outdated and often wrong. Out of the box you are working with whatever the model has sucked up, and you can’t trust this data at all.

You need to do all you can to feed the black box information so that it can come up with the best results. Given the black box, you will need to experiment and keep poking it to see if you are making it better or worse.

Here are some of the foundational things that we did:

Feed it the best input

Gather all of the information that you think will nudge the model in the right direction. In our case we gathered the GraphQL schema (SDL) for the Shopify storefront APIs, and then a bunch of good examples. With these in hand, we chunked them up and created OpenAI embeddings from them. You end up with a library of these embeddings, which are vectors that represent the chunks of text.

With these embeddings we can take user queries (e.g. “Get me 7 of the most recent products”), get an embedding from that query, and then look for similar embeddings in the library that we have created. Those will contain snippets such as the schema for the products GraphQL section, and some of the good examples that work with products. We call this context, and you pass it to the OpenAI completions endpoint as part of a prompt.
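That similarity lookup can be sketched in a few lines — the cosine-similarity ranking below illustrates the technique, and the function names and library shape are my own placeholders, not our production code:

```javascript
// Cosine similarity between two embedding vectors.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Given the user query's embedding, pick the top-k most similar
// chunks from the embedding library to use as context.
function topKChunks(queryEmbedding, library, k = 3) {
  return library
    .map((entry) => ({
      text: entry.text,
      score: cosineSimilarity(queryEmbedding, entry.embedding),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((entry) => entry.text);
}
```

In practice a vector store does this search for you, but the core of it really is just “rank chunks by how close their vectors are to the query’s vector.”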

Customize the prompt

You will want to play with prompts that result in the right kind of output for your use case. In our case we are looking for the black box to not just start completing with sentences, but rather give back valid GraphQL.

You end up with a prompt such as:

Answer the question as truthfully as possible using the provided context, and if you don’t have the answer, say “I don’t know”.

Context:
${context}

Question:
What is a Shopify GraphQL query, formatted with tabs, for: ${query}

Answer:

You can see how the prompt is:

  • Politely asking for the answer to be truthful
  • Nudging for the answer to be tied to the given context (from the embeddings) vs. making it up from whole cloth, and saying that it’s ok to say “I don’t know”!
  • Asking for a formatted GraphQL query
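Assembled in code, that prompt template looks something like the sketch below (the helper name is illustrative):

```javascript
// Build the completion prompt from the retrieved context chunks and
// the user's natural-language query.
function buildPrompt(contextChunks, query) {
  const context = contextChunks.join("\n");
  return (
    "Answer the question as truthfully as possible using the provided " +
    "context, and if you don't have the answer, say \"I don't know\".\n" +
    `Context:\n${context}\n\n` +
    `Question:\nWhat is a Shopify GraphQL query, formatted with tabs, for: ${query}\n\n` +
    "Answer:"
  );
}
```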

One other way we try to stop the model from hallucinating is by setting the temperature to 0 when we make the completion call. From the OpenAI docs: “What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.”
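A sketch of that completions call — the model name and token limit below are placeholder choices from the GPT-3 era, not necessarily what we shipped:

```javascript
// Build the request body for the OpenAI completions endpoint.
// temperature: 0 keeps the output as focused and deterministic as
// possible, nudging the model to stick to the provided context.
function buildCompletionRequest(prompt) {
  return {
    model: "text-davinci-003", // illustrative model choice
    prompt,
    temperature: 0,
    max_tokens: 500,
  };
}

// Example call (requires an OpenAI API key):
async function completePrompt(prompt, apiKey) {
  const response = await fetch("https://api.openai.com/v1/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${apiKey}`,
    },
    body: JSON.stringify(buildCompletionRequest(prompt)),
  });
  const json = await response.json();
  return json.choices[0].text.trim();
}
```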

It’s quite funny to see how we do everything to try to get the model to speak the truth with this type of use case!

Feedback and Flywheels

Now it’s time for the flywheels to kick in. You want to keep feeding the context with high quality examples, sometimes show what NOT to do, play with different prompts, and start getting feedback.

You will see lots of examples where users are asked for feedback, e.g. in support systems and documentation: did this help? Is it accurate? To train the model as well as possible, look for ways to get this information from the experts (humans!) and feed it back in, as well as simply tracking what your users are asking for and how well you are meeting those needs!

Creativity & Constraints

We have the foundations in place, and the quality of data will improve through the flywheels. Now it’s time to get more constrained. We are doing all we can to nudge for truth, but you can’t trust these things, so what guardrails should you put in place?

We really want the GraphQL that we show to be valid, so… how about we do some validation?

We take the GraphQL that comes back and do a couple of things:

  • Tweak it, where possible, to insert valid IDs and content for the dataset in our mock.shop instance
  • Validate the GraphQL to make sure the syntax is correct
  • Run it against mock.shop, since we have real IDs, and show the results to the user!

You can’t assume anything, so you will often need a guard step once you get results.
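A real guard would lean on a proper GraphQL parser (e.g. the `parse` function from graphql-js), but even a rough structural check shows the shape of the idea — the function below is an illustrative sketch, not our actual validation:

```javascript
// Rough structural sanity check on model output before showing it.
// A production guard would use a real parser (graphql-js `parse`);
// this illustrative version just catches the most obvious garbage.
function looksLikeGraphQL(output) {
  const trimmed = output.trim();
  // Must open with a query/mutation keyword or a bare selection set.
  if (!/^(query|mutation)\b|^\{/.test(trimmed)) return false;
  // Braces must balance and never close more than they open.
  let depth = 0;
  for (const ch of trimmed) {
    if (ch === "{") depth++;
    if (ch === "}") depth--;
    if (depth < 0) return false;
  }
  return depth === 0 && trimmed.includes("{");
}
```

Anything that fails the check never reaches the user; they see a “couldn’t generate a query” state instead of broken GraphQL.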

ChatGPT vs. Stockfish

There was a lot of hubbub when someone pitted ChatGPT against Stockfish in a game of chess. Many used it as a way to laugh at ChatGPT. This thing is crazy! It made all kinds of invalid moves! No doy! You have to assume that and build systems to tame it… a chess engine wouldn’t allow invalid moves.

Defensive

You have to be incredibly defensive. You are poking a brain with electrodes. It comes out with amazing things, but you can’t trust everything that comes back. Even making remote calls to OpenAI is flaky; the service often goes down.

Not only will you be checking for timeouts and errors in results, but you should also consider a feature flag toggle. In the case of mock.shop, the tool is usable without any of the AI features. They are progressive enhancements to the product.

We can add checks to automatically turn it off if something really bad is happening with OpenAI. Marry both:

const openAIStatusRequest = fetch("https://status.openai.com/api/v2/status.json");

and check the results for the type of incident:

openAIStatus.status.indicator === "major"
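Put together, a small guard like the sketch below can drive the feature flag. The fail-closed behavior and which indicator levels count as “down” are our own choices here, not an OpenAI recommendation:

```javascript
// Decide whether to disable the AI features based on the
// status.openai.com payload.
function isOpenAIDegraded(statusJson) {
  const indicator = statusJson?.status?.indicator;
  return indicator === "major" || indicator === "critical";
}

// Check the public status page before enabling "Generate query with AI".
async function aiFeatureEnabled() {
  try {
    const response = await fetch("https://status.openai.com/api/v2/status.json");
    const statusJson = await response.json();
    return !isOpenAIDegraded(statusJson);
  } catch {
    // If we can't even reach the status page, fail closed:
    // the tool still works without the AI progressive enhancement.
    return false;
  }
}
```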

It’s incredibly fun, getting creative with how you can use the power of LLMs, which are getting better and faster all the time. The black box nature can be frustrating at times, but it’s worth it.

I hope you are having some fun tinkering!

There are so many helpful libraries out there. I have been working with some friends on Polymath to make it simple to import and create the libraries, as well as query it all.
