Midjourney in Tandem with Dall-E 3 (and Chain of Thought prompting in GPT)

Alex Tully
8 min read · Oct 23, 2023


Now I already wrote an article about Dall-E 3 vs. Midjourney. But what about Dall-E 3 PLUS Midjourney? Would it be possible to use them in conjunction so that the strengths of one would compensate for the weaknesses of the other?

First, let’s consider the relative strengths and weaknesses of each. I’ve whipped up this table to illustrate them:

Table of strengths and weaknesses of Dall-E 3 and Midjourney

Dall-E 3’s biggest strength is that it is more likely to give you every element you want when you prompt it with longer, more complicated pieces of text.

Midjourney does worse at taking in large amounts of text (which is why I recommend prompting it with underscores; see here for a post about it). But you can feed it a huge amount of data by prompting it with an image, something which is currently impossible with Dall-E 3. And Midjourney also lets you adjust your creations to match your vision exactly using inpainting (something which I first explored here, and in many creations since).
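If you find yourself building those underscore keywords by hand a lot, a couple of lines of Python will do it for you. A minimal sketch (the `underscore` name is just something I made up for illustration):

```python
def underscore(*phrases: str) -> str:
    """Replace spaces with underscores inside each phrase, so Midjourney
    reads every phrase as one token, then join the phrases with commas."""
    return ", ".join(p.strip().replace(" ", "_") for p in phrases)

# Prints: low_angle_sunlight, cloud_inversion, tree_ferns
print(underscore("low angle sunlight", "cloud inversion", "tree ferns"))
```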

So if you’re going to use the two generators in tandem, the logical sequence would be:

  1. Start by feeding a nice big text prompt into Dall-E 3, to get an image with all or most of the elements you want
  2. Use the output from #1 as an image prompt in Midjourney, together with text keywords for those elements
  3. Edit the image from #2 in Midjourney inpainting, perhaps using more images from Dall-E 3 as part of the prompts for inpainting
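If you prefer scripting to clicking, steps 1 and 2 can be sketched with the OpenAI Python SDK. Midjourney has no official API, so the most a script can do is generate the Dall-E 3 image and print a prompt for you to paste into Discord; the keywords and weight below are placeholders, not gospel:

```python
# A sketch of steps 1 and 2, assuming the OpenAI Python SDK (v1+) and an
# OPENAI_API_KEY in the environment. Midjourney has no official API, so
# step 2 just prints a prompt to paste into Discord.
from openai import OpenAI

client = OpenAI()

# Step 1: feed a nice big text prompt into Dall-E 3.
base = client.images.generate(
    model="dall-e-3",
    prompt=(
        "Aerial photograph taken with a wide-angle lens, highlighting the "
        "details of a massive mountain resembling Federation Peak..."
    ),
    size="1792x1024",  # the closest Dall-E 3 size to 16:9
    n=1,
)
image_url = base.data[0].url

# Step 2: use that image as a weighted image prompt in Midjourney.
# Note the space before :: (see the bug note later in this post).
keywords = "rice_terraces, cloud_inversion, tree_ferns"
print(f"{image_url} ::9 {keywords} --ar 16:9 --s 50 --style raw")
```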

I decided to use this technique to repeat what I did for Cirque Jungle Tunnels (I wrote about it here, but basically it shows Antarctica after the climate has dramatically warmed, like in J.G. Ballard’s novel The Drowned World). Using just Midjourney I got an image like this:

Title: Cirque Jungle Tunnels (Made in Midjourney) https://www.instagram.com/p/CyNcBqarEzl/

And below is what I got using Dall-E 3 + Midjourney:

Title: Cirque Jungle Tunnels II (Made in Dall-E 3 + Midjourney) https://www.instagram.com/p/CyvQnm7JX6L/

Before I’d been using Dall-E 3 through Bing Image Creator, but OpenAI finally got around to giving me access through my GPT subscription. So I decided to experiment with that. Using GPT, I prompted Dall-E 3 with: Aerial photograph taken with a wide-angle lens, highlighting the details of a massive mountain resembling Federation Peak. Low-angle sunlight sidelights a towering limestone crag with an underground city carved into it, resembling Uplistsikhe or the Anasazi cliff-dwellings. Rice terraces are intricately carved into the steep mountainside. The landscape of sharp ridges is blanketed by dense subtropical jungle of tree ferns and palm trees. Long shadows fall across a dense cloud inversion of thick fog filling U-shaped valleys. The scene is devoid of any snow or ice and has a high horizon with desaturated cold hues.

Now there’s one thing that might catch you unawares if you use Dall-E 3 through GPT. The bot automatically rewords your prompt, though it’s not obvious at first. For example, the bot gave me back this:

Screenshot of Dall-E 3 output (run in GPT-4)

But when I clicked on the top right one (the one I liked the most), I noticed this on the side:

Screenshot of Dall-E 3 output (run in GPT-4)

It had reworded my prompt to: Wide-angle aerial photograph of a towering limestone mountain, highlighted by low-angle sunlight. The mountain has features similar to Federation Peak and has an underground city carved into it, evoking the essence of the Anasazi cliff-dwellings. Rice terraces beautifully curve around the steep slopes. The surrounding landscape is covered in a lush subtropical jungle with tree ferns and palm trees dominating the view. U-shaped valleys are filled with a thick cloud inversion, creating long dramatic shadows. The entire scene is presented with a high horizon and cold desaturated colors, devoid of snow or ice.

Now this at least had most of the elements I was hoping for, but some of the other prompts (and hence their results) were not so good.
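Incidentally, if you ever call Dall-E 3 through the API rather than through the chat window, you don’t have to hunt for the reworded prompt in a sidebar: it comes back in the response. A minimal sketch, again assuming the OpenAI Python SDK:

```python
from openai import OpenAI

client = OpenAI()

result = client.images.generate(
    model="dall-e-3",
    prompt="Wide-angle aerial photograph of a towering limestone mountain...",
    size="1792x1024",
)

# Dall-E 3 rewrites prompts on the API side too; the rewritten version is
# returned with each image, so you can see exactly what was rendered.
print(result.data[0].revised_prompt)
print(result.data[0].url)
```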

Time to feed it into Midjourney. I ran the prompt: <URL of Above Image>::9 cavates in a crag like Uplistsikhe or the Anasazi cliff_dwellings in the Transantarctic_Mountains with rice_terraces, sidelit by low_angle_sunlight casting long_shadows on a cloud_inversion of dense_fog, covered by subtropical_jungle of tree_ferns and palm_trees, wide_angle aerial_photo by Sony A7R II with high_horizon and desaturated_cold_tones --no snow, ice, icecap, glaciers --ar 16:9 --s 50 --style raw and got:

Midjourney Output

I loved the bottom left picture, with its tunnels going into the vertical sections between the rice terraces.

Image made in Midjourney

All I needed to do was use inpainting to make the terrain look more post-glacial.

I went back to Dall-E 3 to get an image to include in the image prompt. It was a bit annoying how GPT was taking liberties to create its own prompts, but maybe I could use that to my advantage. I wanted to try using GPT’s reasoning abilities to figure out what kind of landscape would emerge, and then create prompts to pass to Dall-E 3. I kept GPT in Dall-E 3 mode.

Screenshot from GPT

And prompted:

Assume the role of a geologist, with expertise in the geology of the Trans-Antarctic Mountains:

First describe the bedrock topography of the Queen Maud Mountains (in the Trans-Antarctic Range) in detail.

Second assume that Antarctica undergoes sudden and complete deglaciation due to an abrupt warming in climate. Make an exhaustive list of the post-glacial features that would be observable in the Queen Maud Mountains.

Third, assume that the Queen Maud Mountains have had a subtropical climate (with several metres of annual rainfall) for 2,000 years since their sudden deglaciation. They have been colonised by lithophytic tree ferns and palm trees. What would still be the same? And how would the landscape have changed over this time? Remember that this is only two thousand years after deglaciation, and that karst landscapes take much longer to form.

Fourth, create and run Dall-E 3 prompts to depict images from an upland area of the landscape described in the third step. The images must be aerial photographs taken with a wide-angle lens, highlighting rugged details of the landscape in desaturated cold tones. There must be a high horizon and sidelighting by low angle sunlight casting shadows on a cloud inversion of thick fog. Be sure to include keywords referring to landforms that you predicted to persist in the third step. Also be sure to include keywords reminding Dall-E 3 that the climate is too warm for ice or snow to persist.

This is a prompt engineering technique called Chain of Thought, known as such because LLMs such as GPT tend to do better if you force them to reason step-by-step. Even more importantly, it makes it easy to catch hallucinations and know where to tweak the prompt. For example, GPT predicted a karst landscape, which wouldn’t have had time to form in 2,000 years, so I had to rework the prompt with a sentence telling the bot not to go off track at that point. I know everyone wants to be optimistic about LLMs, but wonderful as they are, the situation with hallucinations is not getting better, so it’s imperative to fact-check every single claim they make.
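For anyone who wants to run this outside the chat window, the same staged prompt works through the chat API. A sketch below, assuming the OpenAI Python SDK; the model name and the abridged stages are placeholders, and the API won’t render images itself, so you would pass the stage-four prompts to the images endpoint separately:

```python
from openai import OpenAI

client = OpenAI()

# Chain of Thought: one prompt that forces the model through numbered
# stages (describe -> deglaciate -> 2,000 years on -> image prompts), so
# each intermediate step can be fact-checked before trusting the next.
stages = """Assume the role of a geologist with expertise in the geology
of the Trans-Antarctic Mountains.

First, describe the bedrock topography of the Queen Maud Mountains.
Second, assume sudden, complete deglaciation; list the post-glacial
features that would be observable.
Third, assume 2,000 years of subtropical climate since deglaciation.
What persists, and what has changed? (Karst takes far longer to form.)
Fourth, write Dall-E 3 prompts depicting an upland area of that landscape."""

reply = client.chat.completions.create(
    model="gpt-4",  # placeholder; use whichever GPT-4-class model you have
    messages=[{"role": "user", "content": stages}],
)
print(reply.choices[0].message.content)
```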

Anyway, here are screenshots of the bot’s output for each stage of the chain of thought:

Screenshot of Dall-E 3 Output (in GPT)
Screenshot of Dall-E 3 Output (in GPT)
Screenshot of Dall-E 3 Output (in GPT)

Note that I fact-checked every item on the lists in #2 and #3.

Screenshot of Dall-E 3 Output (in GPT)
Screenshot of Dall-E 3 Output (in GPT)

The second image appealed to me:

Image made in Dall-E 3

So I went back to the Midjourney image of rice terraces and tunnels, clicked Vary (Region), and selected everything except the underground city beneath the rice terraces. Then I prompted: <URL of Above Image of Post-Glacial Valley> ::9 glacial_valley with hanging_valleys and aretes forming steep_ridges around cirques, sidelit by low_angle_sunlight casting long_shadows on a cloud_inversion of dense_fog, covered by lithophytic tree_ferns and palm_trees, wide_angle aerial_photo by Sony A7R II with high_horizon and desaturated_cold_tones --no snow, ice, icecap, glaciers --s 50 --style raw --ar 16:9

  • Note that I needed to put a space before the double colon, because Midjourney has become buggy and started screwing up if you put a double colon directly after an image prompt. Without the space, it interprets the double colon and the number as part of the URL. A trivial helper makes the space impossible to forget; a sketch follows (the function name and URL are mine).
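```python
def mj_image_prompt(image_url: str, weight: int, text: str, flags: str) -> str:
    """Assemble a Midjourney prompt, keeping a space before the double
    colon so '::weight' isn't read as part of the image URL."""
    return f"{image_url} ::{weight} {text} {flags}"

print(mj_image_prompt(
    "https://example.com/postglacial.png",  # placeholder URL
    9,
    "glacial_valley with hanging_valleys and aretes",
    "--no snow, ice --s 50 --style raw --ar 16:9",
))
```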

Anyway, I got these 4 images:

Midjourney Output

And I loved the top left one. Selecting that and doing a bit of very basic inpainting, I got my final product:

Title: Cirque Jungle Tunnels II (Made in Dall-E 3 -> Midjourney) https://www.instagram.com/p/CyvQnm7JX6L/

What I learned from this project:

  • Dall-E 3 works differently in GPT than in Bing Image Creator, because GPT doesn’t let you prompt the bot directly; instead it inserts itself as an intermediary to “refine” your prompt.
  • It’s possible to make the most of the above by directing GPT to use its reasoning abilities and expert knowledge when creating the prompts (being sure, of course, to verify every stage of its argument in case of hallucinations).
  • Dall-E 3’s images work great as image prompts for Midjourney.

You can try the above with pretty much any genre of images. Good luck with it!
