I created a complete (audio) book in 10+ languages in a few days using generative AI: Here is what I learned

Filip Sokołowski
19 min read · Dec 21, 2022
a futuristic artificial intelligence apparatus with white screen spills out many books, by Brian Bolland --ar 3:2 --v 4 --q 2; processed with Photoshop

I created an entire (audio) book in 10+ languages within a few days using generative AI tools. My e-book is now available in 11 languages (English, German, Spanish, Chinese, French, Japanese, Italian, Portuguese, Dutch, Swedish, Finnish) on Kindle Unlimited and for single purchase at $2.99 (or the local currency equivalent) as an e-book. In this blog post I would like to tell you why I did it, how I did it and what I learned.

(Audio) Book Samples:

Please find samples of “Chapter 1 — A new friend” as an e-book PDF and as an audiobook here.

A Small Blue Dot — An Artificial Journey Through the Solar System; translated to 10+ languages
  1. Motivation
  2. My workflow & what I learned along the way
  3. The final product
  4. Conclusion

Motivation

You may wonder what my motivation was to write a whole book only with the help of AI. Well, the holiday season was around the corner, and don’t they always say “the best gifts are homemade”? (I’ll let you decide whether that also applies to AI-supported creations.) So that was one reason: I needed some gifts for friends and family. On the other hand, generative AI tools have developed rapidly in recent times. OpenAI’s ChatGPT recently made headlines by growing to over 1 million users in just a few days, which speaks to the general interest in generative AI. But ChatGPT is only the tip of the iceberg, and it is made specifically for human-like conversations. There are many other fields of application (such as natural-language processing, text-to-speech, text-to-image, etc.) in which AI can be used. I wanted to explore the limits of these tools by creating an entire children’s (audio) book (incl. cover & illustrations) using AI. My goal was to find out how far generative AI tools can already take you, and to see what I could learn from this small experiment.

My workflow & what I learned along the way

It was my goal to use AI as much as possible throughout the process, but in the end I knew that without human intervention my book would be nothing. Generally speaking, regardless of whether AI is used or not, there are four steps: 1. content creation, 2. packaging, 3. localisation and 4. publishing. The third step is probably optional, but since the whole process was fairly simple and quick, I decided to create my book directly in several languages to offer my creation to a wider audience. Of course, a fifth step, marketing, is still missing from this view, but that is not the focus of this post (after all, you are reading a content marketing piece right now).

My workflow & tools that I used

For creating the manuscript I used a combination of ChatGPT and Jasper.AI for all text-to-text use cases. For the audiobook I used Murf.AI to generate the voices of the characters and the narrator via text-to-speech. I did the mixing and the adding of ambience, foley and music/SFX in Adobe Audition CC, using Artlist.io to obtain these assets. I created the cover and illustrations for the book with OpenAI Dall-E 2 and/or MidJourney and post-processed them in Adobe Photoshop CC. Before translation, I merged all chapters and increments into one Word master document. I translated the final manuscript into 10+ languages with DeepL Pro. Finally, I did the layout and formatting in Amazon Kindle Create to produce the EPUB/KPF files required for publishing on Amazon Kindle Direct Publishing (KDP), and in the last step I uploaded the EPUB/KPF file along with metadata (title, author, keywords, pricing, etc.) to Amazon KDP. More about the individual steps, and why I chose these tools, in detail below.

Stage 1: Content Creation

Of course, every book starts the same way: with an idea. I wanted to use as much AI as possible in my little experiment, but the actual idea had to come from me; I couldn’t let the AI take that away. I came up with the plot myself, in the form of chapters around a central storyline. It should only be the task of the AI to convert the plot into a continuous text, with dialogue for the characters and a text for the narrator. Of course, I could have typed “write me an outline for a children’s book about space exploration” into ChatGPT, but that would have been very arbitrary and not my own idea. For ideation in particular, however, ChatGPT is certainly a good way to get an outline/structure quickly.

Creating the Manuscript

For my plot, I loosely followed Aristotle’s classical three-act structure (setup, rising action, resolution) for dramas. In a nutshell, the story goes like this: a little girl named Lina finds a robot named Sam in the basement; he comes to life and, a few days later, transformed into a spaceship, takes her on a journey through the solar system. Along the way, Sam explains the most interesting and exciting facts about our planets to her. After a while, Lina realises how much we have already learned about our universe with the help of science, but also that many mysteries remain unexplained. She also realises that Earth is still the most liveable planet in the solar system. Of course it is important to explore other planets, but they are more or less hostile to life, so it is important for humanity to protect our own planet.

So in summary, my plot looks like this:

Table of contents / outline / structure of my book

So, the main plot was already established, but how do I get a continuous text out of it? This would normally have been the author’s job, but I outsourced this step to generative AI tools.

Limitations of ChatGPT and workarounds I used:

After a few attempts at writing full paragraphs / chapters with ChatGPT, I had to realise that there are some limitations when writing a book:

a) the maximum response length is limited (no whole book / chapter comes back from a single prompt),

b) the chatbot is designed to be very transactional, so it’s hard to make simple changes in the body text, and

c) it takes ChatGPT time to make little edits to existing paragraphs / responses in the thread.

With this in mind, I took a slightly different approach and also used another tool called Jasper.AI. This step is purely optional, but after a few minutes with just ChatGPT and my master Word document, I realised that the transactional, conversation-focused way of working with ChatGPT was not conducive to my workflow: I would have had to constantly switch between Word and ChatGPT and copy texts back and forth. That’s why I decided to use Jasper.AI as an additional tool; more on that, and why, below.

Expanding and driving the plot through AI:

After the rough story, structure & main plot points had been established in my human draft, I could tackle the individual chapters. I proceeded as follows: first, I wrote further plot points as bullet points per chapter and asked ChatGPT to turn them into a continuous text.

Here is my prompt & response for Chapter 1:

ChatGPT request & response
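As an aside: I used the ChatGPT web UI for this step. If you wanted to script the same bullet-points-to-prose step today, a minimal sketch with the OpenAI Python client could look like the following; the client call is real, but the model name and prompt wording are my own illustrative assumptions, not part of my original workflow.

    # Sketch: scripted version of the "bullet points -> continuous text" step.
    # Assumes the OpenAI Python client (pip install openai) and an
    # OPENAI_API_KEY in the environment.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    plot_points = [
        "Lina finds an old robot named Sam in the basement",
        "Sam comes to life and introduces himself",
        "Sam promises to show Lina the solar system",
    ]

    prompt = (
        "Write one chapter of a children's book as continuous text with "
        "dialogue attribution tags such as 'said Lina'. Cover these plot "
        "points in order:\n- " + "\n- ".join(plot_points)
    )

    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": prompt}],
    )
    print(response.choices[0].message.content)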

I then copied these paragraphs into Jasper.AI. The main advantage of Jasper.AI is that you have a truly collaborative interface, which means that you can quickly make changes directly in the body text. In addition, you can set a style, tone and keywords for the entire document.

Jasper.AI interface with “style, tone, keyword” parameters and “compose” module

The advantage is that the larger the document grows, the better the AI picks up the established style and writes new paragraphs in it. For children’s books, for example, it is important to use dialogue attribution. In written texts, dialogue attribution is usually done using tags, such as “said Lina,” or “replied Sam,” to indicate who is speaking. Once I had used dialogue attribution for a while, newly created paragraphs from new prompts followed the same pattern.

Jasper.AI offers many templates for creating different content. I decided to use the “Freeform Document” mode, which is very versatile. On the left side, in the navigation menu, you can see the mentioned parameters like “Content description / brief”, “Tone of voice” and “Keywords”. In Freeform Document mode you also have the option of using the “Composer”: you can enter a prompt directly in the editor, for example “Sam and Lina now continue their journey towards the planet Mercury”, and execute it directly via CMD/CTRL + Enter. In this way I was able to formulate small sections according to my defined plot and insert them flexibly in the right places.

The whole manuscript came to over 8,000 words in the end. Normally this process would have taken weeks, but with the help of ChatGPT and Jasper.AI I was able to create the manuscript in about 8 hours spread over several days.

Creating the Audiobook

Now that the manuscript was finished, I wanted to generate an audiobook from it via text-to-speech. For this purpose, I looked at numerous text-to-speech tools. Many tools have very different focus areas (e.g. personal reading, voice cloning, audiobooks, e-learning, ads, podcasts, blogs, articles). For me, the quality of the speech synthesis was the most important criterion when choosing a tool. I also wanted a simple multi-voice editor / studio so that I didn’t have to edit the individual speakers / voices together afterwards in Adobe Audition.

In the end, I had Murf.AI, Lovo.AI, Resemble.AI, Coqui.AI and Natural Reader on the shortlist. I chose Murf.AI because of the combination of 1. the quality of its speech synthesis and 2. its multi-voice, timeline-editing functionality. In general, I noticed that you can’t say flatly that tool A or B has the better speech quality; it depends very much on the individual voices, and quality varies widely across providers and voices. In terms of pure quality, in my eyes only Lovo.AI, Coqui.AI and Natural Reader would have been better, but unfortunately they had weak multi-voice/timeline editors, which would have meant more audio editing / post-processing effort in Adobe Audition.

In addition, in Murf.AI and many other tools you can manually adjust words-per-minute (WPM)/speed, pitch, emotion, emphasis, pronunciation and pauses per sentence/section. I used -7% pitch and -5% speed for all speakers & the narrator. When copy-pasting text, Murf.AI also lets you choose whether to keep everything in one block or split it by paragraphs / sentences. This makes it much easier to attribute / select speakers per section.

Murf.AI Studio Mode: Multi-Voice Timeline Editor with WPM/Speed, Pitch, Emphasis, Emotion and Pause
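Incidentally, the dialogue attribution tags from the manuscript also make it possible to pre-sort the text by speaker before the text-to-speech step. Here is a minimal sketch of that idea; the regex, the attribution verbs and the sample text are my own simplifying assumptions, not part of Murf.AI’s product.

    # Sketch: splitting generated text into (speaker, text) segments using
    # the dialogue attribution tags ("said Lina", "replied Sam"), so each
    # segment can be assigned the right voice in a multi-voice TTS editor.
    import re

    TAG = re.compile(
        r'["“](?P<line>[^"”]+)["”],?\s*(?:said|replied|asked)\s+(?P<who>\w+)'
    )

    def attribute_speakers(text, default="Narrator"):
        """Return a list of (speaker, text) pairs for a multi-voice editor."""
        segments, pos = [], 0
        for m in TAG.finditer(text):
            if m.start() > pos:  # narration between quoted lines
                segments.append((default, text[pos:m.start()].strip()))
            segments.append((m.group("who"), m.group("line").strip()))
            pos = m.end()
        if pos < len(text):  # trailing narration
            segments.append((default, text[pos:].strip()))
        return [(who, seg) for who, seg in segments if seg.strip(".,;: ")]

    sample = '"Hello, I am Sam," said Sam. Lina stepped closer. "Can you fly?" asked Lina.'
    for who, line in attribute_speakers(sample):
        print(f"{who}: {line}")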

From the Murf.AI studio I could then easily export a single high-quality stereo .WAV file for each chapter. Most tools also support the insertion of background music, ambience, foley and SFX via connected stock libraries. Unfortunately, the stock asset selection was very poor, so I decided to add music, SFX, foley etc. to the audiobook myself with the help of Artlist.io and Adobe Audition CC.

Mixing Narrator, Characters and Music, SFX, Foleys, Ambience in Adobe Audition CC
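I did the actual mix in Adobe Audition, which is a manual, visual process. For readers who prefer a scripted approach, a rough equivalent of “duck a music bed under the narration” can be sketched with the pydub library; this is a swapped-in alternative, and the file names and gain values are illustrative assumptions.

    # Sketch: a scripted alternative to the Adobe Audition mix, using pydub
    # (pip install pydub; requires ffmpeg). File names are illustrative.
    from pydub import AudioSegment

    narration = AudioSegment.from_wav("chapter_01_narration.wav")
    music = AudioSegment.from_file("space_ambience.mp3")

    music_bed = music - 18  # duck the music bed 18 dB under the narration
    while len(music_bed) < len(narration):  # loop the bed to full length
        music_bed += music_bed
    music_bed = music_bed[: len(narration)].fade_in(2000).fade_out(3000)

    mix = narration.overlay(music_bed)
    mix.export("chapter_01_mix.wav", format="wav")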

Unfortunately, as far as I know, Amazon Kindle Direct Publishing (KDP) does not currently support the upload of self-created audiobooks. Typically, you have to give the manuscript to an Amazon network partner, who then turns it into an audiobook. If you know another way, please let me know. In the meantime, only the e-books are available on Amazon. If I find a way to publish my audiobook as well, I will do so at a later date.

The whole text-to-speech process also took me probably 8–10 hours: transferring my manuscript chapter by chapter into Murf.AI, selecting speakers per sentence/section, inserting pauses/emotions, etc. Murf.AI Basic costs $19 per month and includes 2 hours of voice generation per user/month as well as commercial usage rights. This is probably 10 times cheaper than producing the audiobook with conventional methods, i.e. human narrators/voice actors/sound engineers.

Stage 2: Packaging

Now that my English manuscript and audiobook were finished, it was time to bring the children’s book to life visually. For this, I decided to create one dedicated cover per language and a total of 15 illustrations for the book. There are many different text-to-image tools (e.g. OpenAI Dall-E 2, MidJourney, StableDiffusion, Rosebud.AI, Nightcafe, BRIA or CrAIyon), but after my first attempts I found OpenAI Dall-E 2 and MidJourney to be the most advanced. However, since OpenAI Dall-E 2 is fairly expensive at $15 for 115 credits / image generations, I worked with MidJourney for the most part. Finally, prompt crafting is enormously important for achieving good results. I would like to share my experience with you in the next section.

Recommended MidJourney prompt anatomy:

There is official documentation from MidJourney that deals with prompt crafting and all available parameters. There are also excellent other blog posts and resources on the subject (9 TRICKS FOR WRITING AI PROMPTS TO CREATE THE BEST MIDJOURNEY PORTRAITS, 56 AWESOME MIDJOURNEY EXAMPLES TO JUMPSTART YOUR AI PORTRAIT GENERATING, MidJourney Reddit, Prompter Guide & Tool). However, I largely followed Joy Olivia Miller’s prompt structure, which I explain briefly below.

For the most predictable text-to-image outputs, your prompt should basically be a formula of X + Y + Z

/imagine X + Y + Z

Prompt anatomy:

  • X (Content)= describe what you want (and its characteristics)
  • Y (Style)= provides style-related preferences
  • Z (Parameters)= covers size and render information through parameters

Example components:

  • X (Content)= a small blue dot, children book cover, little astronaut girl and robot friend
  • Y (Style)= front view, portrait, negative space, symmetry, vector illustration
  • Z (Parameters) = --ar 2:3 --v 4 --q 2

Full prompt:

/imagine a small blue dot, children book cover, little astronaut girl and robot friend :: front view, portrait, negative space, symmetry, vector illustration --ar 2:3 --v 4 --q 2

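If you generate many images, it can help to build prompts programmatically so the X + Y + Z anatomy stays consistent across a whole series. A tiny sketch (the helper function is my own, not a MidJourney feature):

    # Sketch: a tiny helper to compose prompts following the X + Y + Z anatomy.
    def mj_prompt(content, style, parameters):
        """Compose a MidJourney prompt: content :: style, then parameters."""
        return f"{', '.join(content)} :: {', '.join(style)} {' '.join(parameters)}"

    print(mj_prompt(
        content=["a small blue dot", "children book cover",
                 "little astronaut girl and robot friend"],
        style=["front view", "portrait", "negative space",
               "symmetry", "vector illustration"],
        parameters=["--ar 2:3", "--v 4", "--q 2"],
    ))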

The most important (in my opinion) MidJourney parameters briefly explained:

Aspect ratio:

--aspect, or --ar Generates images with the desired aspect ratio. Try --ar 16:9 for example, to get a 16:9 aspect ratio (~448x256).

Version:

--version <N> or --v <N> selects the model version. At the time of writing, version 4 is the current default, so you do not need to specify it. Use --v 3 for the previous model, or --v 1 / --v 2 for the older algorithms (version 1, formerly the "vibe" option, is sometimes better for macro shots or textures).

Quality Values:

--quality changes how much time is spent generating your image. The shortcut version is --q. Please only use one of the values below. Any other value will be rounded to a valid value instead.

--quality 0.25 Rough results, 4x faster / cheaper.

--quality 0.5 Less detailed results but 2x faster / cheaper.

--quality 1 The default value, you do not need to specify it.

--quality 2 More detailed results, but 2x slower and 2x the price (2 GPU minutes per /imagine).

--quality 5 Kind of experimental, might be more creative or detailed... also might be worse! (5 GPU minutes per /imagine).

Advanced Text Weights:

You can suffix any part of the prompt with ::0.5 to give that part a weight of 0.5. If the weight is not specified, it defaults to 1. See also Text Prompt Questions.

Some examples:

  • /imagine hot dog::1.5 food::-1 — This sends a text prompt of hot dog with the weight 1.5 and food of weight -1
  • /imagine hot dog::1.25 animal::-0.75 — Sends hot dog of weight 1.25 and animal of negative 0.75
  • /imagine hot dog:: food::-1 animal — Sends hot dog of weight 1, food of weight -1 and animal of weight 1

As mentioned, however, I recommend that you read the full documentation and try out all the parameters for yourself.

Creating consistent look & feel across many different prompts:

With this I had already defined the general structure for my prompts, but I noticed that the style of the outputs still varied a lot, even though I included descriptions such as “vector illustration” in the Y-part (Style). To work around this problem, I further restricted the characteristics down to the artist level. In the end, the AI doesn’t really generate new content; it recombines what it was trained on. There are many different blogs and posts where the community has reverse-engineered artists, eras, methods, styles, etc. I can especially recommend ANDREI KOVALEV’s MIDJOURNEY STYLES LIBRARY.

I used it to identify some illustrators I particularly liked (e.g.: by Adrian Tomine, by Alex Ross, by Becky Cloonan, by Brian Bolland, by Butcher Billy, by Darwyn Cooke, by Mike Mayhew, by Tang Yau Hoong). For the illustrations in my book I used Tang Yau Hoong’s style as far as possible. In the end, I didn’t have to mention the artist as a contributor / illustrator, but I included all full prompts as descriptions of the images in my book.

Generative AI & IP: Should it be possible to monetise generative AI assets that are based on training data from third-party artists?

/edit 12/22/22: Since the paragraph above has caused some queries regarding the use of third-party artist styles and their commercialisation on the subreddit /r/GPT3/, I would like to comment on it and set the record straight.

Here is a comment by Impractical_Lychee regarding the above:

I would like to clarify once again that my aim with this small experiment was precisely to direct the public dialogue towards this circumstance. In the end, I used Tang Yau Hoong’s exact style in my MidJourney prompts, and apparently MidJourney v4 was even explicitly trained on it. However, according to the Terms of Service of MidJourney Inc., as a paid user I am the complete owner of the assets I create and can do what I want with them. Before publishing my book, I asked Tang Yau Hoong via e-mail whether he was OK with it, and I explicitly included all prompts in plain text in the book. If Tang gets back to me, I will update you.

Now back to my article.

Here is an example of the same prompt, but with different style parameters according to different illustrators:

Creating the planet illustrations:

Since I wanted to include an illustration of each planet in the main part of my book, “Chapter 2 — Our Solar System”, I quickly realised that the AI dreamed too freely and created very artistically liberal images. Since the children’s book also has a certain educational purpose, I wanted to make sure the pictures looked reasonably scientifically accurate. To achieve this, I used MidJourney’s option to give the AI an image as input: you simply include the URL of the input image in the prompt.

Input Image:

Source: https://www.nbcnews.com/id/wbna22650678

Prompt & Result:

/imagine https://s.mj.run/pVFpKvavDBQ planet mercury

This way I could make sure that the planets looked at least roughly like they do in reality.

Creating the book covers:

Book covers are important for readers in different languages and territories because they serve as the first introduction to a book and can influence a reader’s decision to purchase or borrow it. In addition, book covers can differentiate a book from others in the same genre and help readers find a specific book in a crowded market. Overall, book covers play a crucial role in attracting and engaging readers.

For this I actually chose exactly the same approach as for my illustrations. Basically, I only changed the aspect ratio of the output to portrait format via --ar 2:3.

/imagine a small blue dot, children book cover, fjord, little swedish astronaut girl and robot friend, front view, portrait, negative space, symmetry, robot, by Beatrix Potter

In the end, I only added the title, subtitle and author manually in Photoshop. In some cases I changed the composition of the output graphic in Photoshop to improve the layout and create space for the texts. Where this created empty image areas, I filled them using “Content-Aware Fill” in Photoshop. I also applied colour gradients over the graphics and drop shadows to the text to improve readability. I decided to create a separate book cover for each language / territory.

Layout of final cover art in Adobe Photoshop CC

Stage 3: Localization

So with that, I had created my final manuscript in English, including all the illustrations and the book cover. In order to make my book available to as large a target group as possible, I wanted to translate it for the most important markets and languages. For this I used DeepL Pro, which allows the translation of entire documents and currently supports 27 languages. For the languages I speak, I did a short proofread myself and found no grammatical errors, though some of the translations are awkward in terms of word choice and idiomatic usage. So now I had my manuscript in 11 languages as well.

Translation of Manuscript via DeepL Pro
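I used the DeepL web interface for the document upload, but the same step can be scripted with DeepL’s official Python client. A minimal sketch, assuming an auth key and illustrative file names and language codes:

    # Sketch: batch document translation via the official DeepL Python client
    # (pip install deepl). Auth key, file names and language codes are
    # illustrative; check DeepL's docs for the currently supported targets.
    import deepl

    translator = deepl.Translator("your-deepl-auth-key")

    for lang in ["DE", "ES", "FR", "IT", "PT-PT", "NL", "SV", "FI", "JA", "ZH"]:
        translator.translate_document_from_filepath(
            "manuscript_en.docx",
            f"manuscript_{lang.lower()}.docx",
            target_lang=lang,
        )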

Stage 4: Publishing

The last step was to publish my final manuscript, which was now a Word document, on Amazon Kindle Direct Publishing (KDP) as an e-book. For this, the Word-based manuscript has to be converted into the EPUB/KPF file format. To do this, I used Amazon’s own tool, Kindle Create. All you have to do is import the Word document and use a wizard to define the chapters and the design/theme. I then exported the e-books as EPUB/KPF.

Publishing final e-books to Amazon KDP

The final product

My e-book is now available in 11 languages (English, German, Spanish, Chinese, French, Japanese, Italian, Portuguese, Dutch, Swedish, Finnish) on Kindle Unlimited and for single purchase at $2.99 (or the local currency equivalent) as an e-book.

Conclusion

When I look back on my entire experiment, I have noticed three things in particular that I would like to emphasise.

Creative job mobility

Firstly, AI has allowed me to try out a completely different creative field. I have done a lot of photography, video and graphic design in the past, but writing a book or creating an audiobook was not on my list. Using generative AI allowed me personally to work in a different creative field, as an author. Of course, I could have acquired the skills to write a book on my own, but the barrier to entry in this field is high, and doing it properly requires a lot of experience. I am not saying that my book is particularly good or bad. Certainly there are many general principles for writing (children’s) books and audiobooks that I didn’t follow at all, but it was the availability of AI tools that gave me the idea and made me take the first step. I can certainly say now that this won’t be the last book I create this way. In this sense, the availability of generative AI tools has expanded my mobility in the creative field by letting me move into a completely different area. I expect the same for the entire creative industry: authors will create illustrations themselves, and illustrators will make audiobooks themselves. One could speak of a consolidation of professional job profiles in the creative field.

Impact of AI on the creative process

On the one hand, I had never written a book before, so the whole process was new to me, but I quickly realised that generative AI tools, ChatGPT and Jasper.AI in particular, allowed me to iterate on increments and get ideas quickly. If you wrote a book in the classic manual way, you would probably be reluctant to discard sections/chapters that had already been formulated. Because formulation by the AI is so quick and easy, I was able to concentrate much more on the plot. I also realised the importance of generative AI middleware. Big software companies like NVIDIA, Microsoft, Google, OpenAI, etc. have the resources to train general-purpose natural language models, which is probably not economically viable for smaller companies. This is where AI middleware comes in: smaller, specialised companies adapt the big pre-trained NLP models for specific applications.

Development & convergence of AI tools

What has impressed me very much is the rapid development of the entire generative AI space and the quality of the outputs. The trigger for this little experiment was certainly the release of ChatGPT, but the progress in quality from MidJourney v3 to MidJourney v4 alone is incredible: MidJourney released version 2 in April 2022, the open beta and version 3 in July 2022, and the current version 4 in November 2022. In addition, the field is currently very fragmented across various generative AI tools. I would assume that in the near future we will see a consolidation of the many generative AI applications into complete end-to-end (E2E) software solutions for verticals/horizontals or workflows. For example, Microsoft has already integrated Dall-E 2 into its Canva-like web application Microsoft Designer. Another example is veed.io, which already offers numerous genAI use cases in different areas. Finally, genAI is still specialised along traditional creative roles, and even in my mini workflow I noticed several workflow breaks when switching between different tools. I expect that in the future complete creative teams / departments will work together on projects with the help of integrated, E2E genAI-supported applications.

Bonus (Edit 22.12.22, 01:53AM)

Since you made it to the end of my blog post, and because I have received the request several times now, I decided to publish all cover art images (and unpublished cover images) plus the corresponding prompts here as a PDF bonus. You can also find my Photoshop layout file here. I mostly used MidJourney v4 for the creation of the artworks. Occasionally I used OpenAI Dall-E Outpainting to replace small portions of MidJourney outputs. I then only added titles & texts in Photoshop and filled missing image parts via “Content-Aware Fill”.

In total I sent out over 500 prompts for the book (cover, illustrations), and only 26 graphics, or 5.2 percent (11 cover images + 15 illustrations), made it into the final book(s). This alone would have cost me approx. 65 USD on Dall-E 2 ($15 / 115 credits), which is why I mostly used MidJourney. You can find everything that didn’t make the cut on my MidJourney profile.
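For the curious, the numbers above check out, assuming one Dall-E 2 credit per prompt:

    # Quick sanity check of the hit rate and the hypothetical Dall-E 2 cost,
    # assuming one credit per prompt at $15 per 115 credits.
    prompts, kept = 500, 26
    print(f"hit rate: {kept / prompts:.1%}")            # -> 5.2%
    print(f"Dall-E 2 cost: ${prompts / 115 * 15:.2f}")  # -> $65.22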


Filip Sokołowski

Tech, Media and Telco enthusiast and creator with a passion for the Metaverse, Web3, and AI. Based in Frankfurt, Germany.