That Time When We Broke the System

Vlad G
11 min read · Sep 19, 2024

A couple of months back, I wrapped up building a product with Generative AI at its core. Now that some time has passed, I can look back and assess what happened, how we did it, and—most importantly—the impact of bringing Generative AI into an established process.

Also, are we doomed? Is AI making humans obsolete? Will machines replace us all? These are also good questions to ask.

Before we go any further — yes, the situation has changed. We saw it coming, though. If you were waiting for SHTF — this is it. But it’s not TEOTWAWKI yet. Here’s why.

Let’s start with the main question: What problem were we solving? What were the initial assumptions, and how did they pan out for the team?

Back in the days when Google Search was still relevant, it was important to appear at the top of the search results page. The difference between being on the first and second pages is, in many cases, millions of dollars. One of the critical parts of getting there is a constant flow of fresh and unique content. Do you want your website to be found? Then you need to keep banging that content out all the time. You can’t copy-paste. You can’t reuse it. You can’t even quote too much. The content needs to be fresh and unique, every time, all the time.

People, creative as they may be in general, suck at being creative consistently. If you’ve ever heard of “writer’s block,” you know what I am talking about. Producing fresh content isn’t a big deal when you own one website or blog. If you have more than 100 websites covering the same topic, consistently creating fresh and unique content for each one becomes much more challenging.

Of course, people outsource content creation all the time, from Mark Twain famously helping Ulysses S. Grant get his memoirs written and published, to today’s content farms charging anywhere from $10 per blog post to $500 and up for ghostwriting your life story as a book. One of the dirty secrets of ghostwriting content farms, which are the closest fit for feeding content to multiple websites, is that the quality of the writing varies significantly. I would dare to say that quality is often sacrificed for quantity; that’s why they’re called “farms” and not “writers’ academies.”

If you are operating hundreds of websites, writing a few blog posts and pages for each of them every month becomes a pretty big operation. Your budget, however, usually doesn’t grow with it, so you outsource the writing to a content farm at around $15 apiece, give or take. It’s not much for a large company in the grand scheme of things, but think about it from the other side of the table. A copywriter is paid the same $15 per copy, minus the fees the farm withholds. So if they want to hit that sweet $20/hour spot (the highest minimum wage in the US as of this writing), they are on the hook for a piece of content every 30 minutes, give or take. How many can you realistically churn out in a day? Every day for a week? Every week for a month? How long can you keep that up?
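To make that math concrete, here is a quick back-of-the-envelope calculation. The farm’s cut is a made-up illustrative number (farms don’t publish their margins); everything else comes from the figures above.

```python
# Back-of-the-envelope economics of content-farm writing.
# The 30% farm fee is a hypothetical, illustrative number.
rate_per_piece = 15.00   # what the client pays per piece, USD
farm_fee = 0.30          # assumed cut withheld by the farm
target_hourly = 20.00    # the wage the writer is aiming for, USD/hour

writer_take = rate_per_piece * (1 - farm_fee)    # ~$10.50 per piece
pieces_per_hour = target_hourly / writer_take    # ~1.9 pieces per hour
minutes_per_piece = 60 / pieces_per_hour         # ~31 minutes per piece

print(f"Writer keeps ${writer_take:.2f} per piece")
print(f"Needs one piece every ~{minutes_per_piece:.0f} minutes to earn ${target_hourly:.0f}/hour")
```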

All of this makes the content coming out of content farms pretty low-quality. Additionally, the writers who research your niche and get to the point where their content is actually good tend to burn out. The quality you get is anything but consistent.

On top of that, each outsourced piece of content still gets reviewed by the editors, which adds time. The request has to be picked up by one of the nameless writers at the content farm; the writer does a bit of research, writes a copy that will pass a basic readability check, and submits it; then one of the editors reviews the submission and decides whether the content is good enough to be used. On average, it takes about 2 to 3 days from ordering a copy to having a usable piece of content.

Problem TL;DR: The company’s need for content pushes it to outsource content production. The outsourced content is low-quality and has high latency. Both factors feed a significant backlog of content for in-house editors to review, rewrite, or replace, resulting in publishing delays.

When you think about the overall content production process, outsourcing is a black box. You throw the request in, magic happens, and you get content out. You don’t really care what happens inside the box as long as you provide consistent input and receive consistent output. That’s where we decided to plug in Generative AI: we were going to replace (if only partially) one mystery box with another mystery box.

This decision, of course, didn’t come out of nowhere. We did our due diligence and analyzed the process through Value Stream Mapping workshops, which let us identify the best place to plug in Generative AI without disrupting the process. One of our assumptions was that, for the famous fast-cheap-good triangle, we would be okay if the “good” (a.k.a. the quality part) didn’t drop too much. We wanted the sliders to move on cheap (the “must”) and fast (the “should”) for the whole thing to make sense. Everyone was okay with improving two out of three, even if it meant a minor hit on quality. The content was being reviewed by in-house copywriters and editors anyway; a small dip in quality wouldn’t change their situation much, but getting more content cheaper and faster would improve things significantly.

In about four months, we rolled out the first proof of concept of the product. Subjectively, as described by the editors, the generated content was always readable, which wasn’t always the case with the outsourced content. Of course, there were issues with the content produced by Generative AI, but overall, it looked like we were within the target parameters. Once the POC validated most, if not all, of our assumptions, we moved on. Our new focus was turning the POC into a minimally usable product. Additionally, we wanted to collect real-world metrics on the product’s cost, quality, and effectiveness.

The most problematic item on the metrics list was “content quality.” How do you even define “content quality”? We agreed that it is a synthetic, subjective measure of how the produced content is perceived. How much time does an editor need to spend “fixing” the GenAI content? How does that time compare to the time they used to spend fixing outsourced content? What, specifically, is being fixed?

To understand this better, we used feedback from the editors and copywriters who generated and edited content using GenAI. We also used metrics that compared the “first draft” (the copy received from a content farm or produced by GenAI) to the “final draft” (the copy that got published to one of the production websites). And we set up a feedback review process: we read EVERY. SINGLE. PIECE. of feedback that users provided, out loud, to the whole team, every week.

So, how did we do? What did we learn? Once our proof of concept became the product — what happened?

Once we started running actual content requests through the system, the first thing we realized was that content latency, i.e., the time it takes from order to fulfillment, had dropped drastically: from 2 to 3 days down to 2 to 3 minutes! This helped address the growing backlog of content that needed to be reviewed and rewritten. Instead of days or even weeks, turnaround times are now measured in hours. The content covered by the GenAI use case no longer contributed to the growth of the backlog.

Think about it in terms of the overall process.

Imagine ordering four pieces of content from an outsourced service on Monday. You get them two days later; it’s already Wednesday. You spend 2 hours reviewing and editing them, find out one needs to go back for a rewrite (and who knows whether the writer will actually rewrite it), and another is totally unusable. So you end up writing that one yourself; add the writing time to the review time. It’s now Thursday, and you have three pieces of content. With any luck, the rewrite comes back from the content farm on Friday, and by the end of the week you finally have the four pieces you wanted. You publish your content on Friday. Your content lives in your backlog for days.

With the GenAI mystery box we’ve built, you will have your first draft for order number 1 ready by the time you finish placing order number 4. AI rewrites take a minute or two, and if you are not happy with the direction that AI is taking, you can start anew, which takes another couple of minutes. In about 20 minutes of twiddling your thumbs and waiting for GenAI to respond to your requests, you have four first drafts that are good enough to be edited further. So you spend the same 2 hours reviewing and editing, which gives you results by lunchtime. You publish your content, and it’s still Monday. Your “content latency” is just gone.

We’ve got our “fast” side of the triangle. But wait: how expensive is it to generate content with AI?

I have to caveat this comparison by saying that I can’t make a direct, apples-to-apples comparison between the two content-generating systems, i.e., between the two mystery boxes. Many things go into signing a contract with a third-party service provider like a content farm: lawyers with NDAs, procurement with financials, Enterprise Architecture with technology agreements, plus the time and effort spent figuring out the niche of writers you want to work with, the cost of paying for unusable content, and so on. Similarly, there’s a lot of effort in setting up an infrastructure capable of supporting products that leverage Generative AI. The good news is that the infrastructure is there to support more than a single product or use case, so it’s hard to attribute a system-level cost to this product alone. Instead, I am going to look at the cost of each output. Mystery box number one (the content farm) produces content at $15 apiece. Mystery box number two (the GenAI-based product) produces content of the same size at an average of 15 cents apiece (as of early 2024).
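For a sense of where a number like 15 cents comes from, here is a rough sketch of per-piece API cost for a two-step pipeline. The token counts are my guesses, and the prices are approximate Claude 2 list prices per million tokens from early 2024, not figures from the project; check current pricing before relying on them.

```python
# Rough per-piece cost estimate for a two-step generation pipeline.
# Token counts are illustrative assumptions; prices are approximate
# early-2024 Claude 2 list prices (USD per million tokens).
PRICE_IN_PER_MTOK = 8.00    # input tokens (approximate)
PRICE_OUT_PER_MTOK = 24.00  # output tokens (approximate)

steps = [
    # (input_tokens, output_tokens), assumed sizes per step
    (3_000, 800),    # step 1: brief + injected data -> outline
    (4_500, 1_500),  # step 2: outline + data -> full draft
]

cost = sum(
    tin / 1e6 * PRICE_IN_PER_MTOK + tout / 1e6 * PRICE_OUT_PER_MTOK
    for tin, tout in steps
)
print(f"Estimated API cost per piece: ${cost:.2f}")  # lands in the ~$0.10-0.15 range
```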

It looks like we got our “cheap” side of the triangle, too. But both “fast” and “cheap” won’t matter much if your “good,” i.e., the quality of the content, isn’t at an acceptable level. From the beginning, we had assumed that content quality would dip a little, and as long as that “little” stayed really little, we were okay trading it for the wins on faster and cheaper. But how much is too much? When does the loss of quality stop justifying the wins?

Since we were operating between the “first draft” and the “final draft,” our real audience was the in-house copywriters and editors who had to turn the first draft into the final draft. The quality of the content directly influences how much work the editor needs to do to bridge that gap. We could look at the time they spent or, again, at content latency. But those numbers don’t tell the whole story, since these people are working on other content, too; tying content latency directly to quality isn’t quite the right move. But what if we ask them directly what they think about the GenAI-produced content? How would they rate it? Can they point out specific issues with it?

We pushed the users to leave as much feedback as possible and asked them to be as candid as they could. No holding back. Gloves off.

By reading every single feedback item we received, we could categorize and prioritize the major content issues. For example, even though we supplied the LLM with correct factual data, it still made factual mistakes, so one of the categories was factual errors. Another example is word count: when ordering a piece of content, editors are pretty specific about its size, yet the LLM we used at the time easily missed the mark by more than a hundred words. In total, we identified around a dozen feedback categories. Ranking them by the number of occurrences, i.e., how often content fell into a specific category, told us which issue type to address first. We addressed most of the issues by tweaking the prompts and the data supplied to the LLM. This led to editors needing to make smaller and smaller changes to bring the generated content to production quality. Content latency went down again!
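A minimal sketch of the kind of tally that drives this prioritization. The category names beyond the two mentioned above, and the item structure, are placeholders, not the team’s actual taxonomy.

```python
# Tally feedback items by category to decide which issue type to fix first.
from collections import Counter

feedback_items = [
    {"id": 101, "category": "factual error"},
    {"id": 102, "category": "word count miss"},
    {"id": 103, "category": "factual error"},
    {"id": 104, "category": "tone/style"},
    # ... one entry per piece of feedback read in the weekly review
]

by_category = Counter(item["category"] for item in feedback_items)
for category, count in by_category.most_common():
    print(f"{category}: {count} occurrences")
```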

While the content rating scores remained consistent, we saw the number of items in the “positive feedback” category grow. The editors figured out that if they didn’t like the quality of the generated content, they could abandon it (and rate it as “bad”) and start from scratch with better instructions to generate better content. It is, after all, fast and cheap to do so. Once they got what they wanted, they would leave positive feedback (without scoring) and move on. In Value Stream Mapping terms, the cost of a re-do with the content farm was tens of dollars and days of time; the cost of a re-do with GenAI was tens of cents and minutes.

A less subjective metric was the “first to last draft” comparison: the difference between the initial draft (generated by GenAI or produced by the content farm) and the final draft after revision by one of the in-house editors. The GenAI content consistently outperformed the content farm in both the best- and worst-case scenarios. When content quality was good, the gap between how much GenAI content and farm content had to be changed was around 10%. In the worst case, around 70% of the outsourced content was edited; the worst case for GenAI content was around 30% changed. The “difference of the difference” came out to roughly 30%.
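One way such a “percent changed” number could be computed is a word-level diff between the two drafts. This is my illustration of the idea, not the team’s actual metric implementation.

```python
# Approximate how much of a first draft survived into the final draft.
import difflib

def percent_changed(first_draft: str, final_draft: str) -> float:
    """Share of the text that differs, based on word-level similarity."""
    first_words = first_draft.split()
    final_words = final_draft.split()
    similarity = difflib.SequenceMatcher(None, first_words, final_words).ratio()
    return (1.0 - similarity) * 100

# Toy example: two of five words changed.
print(f"{percent_changed('the quick brown fox jumps', 'the quick red fox leaps'):.0f}% changed")
```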

Wait, what? Is AI-generated content of better quality than content coming from the content farm? We thought it would be slightly worse — it turns out, it’s about 30% better on average.

900 times faster. 100 times cheaper. 30% better. I think we broke the system. We shrunk the triangle on all three sides! Seriously, I think we broke the system!

At this point, I should mention that all of this was done with a Claude 2 model. By the time of this writing, Claude 3 and Claude 3.5 have been released, significantly outperforming Claude 2 while being cheaper. We used 2-step prompts injected with data, no other methodologies or more sophisticated approaches. We didn’t build a custom model and didn’t fine-tune anything: just prompts and data.
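For readers wondering what “2-step prompts injected with data” can look like in practice, here is a minimal sketch using the Anthropic Python SDK. The outline/draft split, the prompts, and the injected fields are my assumptions for illustration, not the team’s actual pipeline.

```python
# A sketch of a two-step, data-injected prompt chain against Claude 2.
# Prompts and step structure are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment
MODEL = "claude-2.1"

def generate_piece(topic: str, facts: str, target_words: int) -> str:
    # Step 1: turn the brief and the supplied factual data into an outline.
    outline = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Using only these facts:\n{facts}\n\n"
                       f"Draft a detailed outline for an article about {topic}.",
        }],
    ).content[0].text

    # Step 2: expand the outline into a full draft, re-injecting the facts
    # and stating the target length explicitly.
    draft = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": f"Facts:\n{facts}\n\nOutline:\n{outline}\n\n"
                       f"Write an article of about {target_words} words "
                       f"that follows the outline and sticks to the facts.",
        }],
    ).content[0].text
    return draft
```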

This is why I believe this is the SHTF situation. An already-obsolete model consistently produces better results than humans. It took us less than a year to build the product, underlying infrastructure included, using a basic methodology and a pretty mediocre Generative AI model.

Think of what’s coming next.
