
Disclaimer: Any opinions expressed below belong solely to the author.

OpenAI has just announced the third version of its AI-powered image generator, DALL-E, which has seen less publicity and traction over the past year, having been surpassed by solutions such as Midjourney or open-source Stable Diffusion.

The updated image generator is coming to ChatGPT Plus subscribers in early October, with a host of new features.

Differences in output quality between DALL-E 2 (on the left) and DALL-E 3 (on the right) / Image Credit: OpenAI

Obviously, the company behind ChatGPT had to offer something that its competitors can’t, and that is exactly what it is promising with this update.

Most importantly, unlike others, DALL-E is finally integrated with ChatGPT. This is a big deal, as a good understanding of input text is necessary for achieving the desired visual output.

And since OpenAI is currently the leader among developers of large language models, it is only natural that it would (and should) use that prowess to make image generation better, both in output quality and in ease of use, so that users no longer have to learn how to "speak" to the model before it can produce what they actually want.

DALL-E 3 / Image Credit: OpenAI

This is known as “prompt engineering” — a practice of combining certain words and key phrases to nudge the AI in the desired direction, rather than simply describing what you want in a natural manner.
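
To make the contrast concrete, here is a purely illustrative sketch; neither prompt comes from OpenAI's materials, they simply show the difference between describing what you want and "engineering" it:

```python
# Hypothetical illustration only; neither prompt is taken from OpenAI's materials.

# What a user actually wants, stated in plain language:
natural_prompt = "A cosy cabin by a lake at sunset, painted like a watercolour."

# How the same request often looks after "prompt engineering":
# stacked keywords and style tags meant to nudge the model in the right direction.
engineered_prompt = (
    "log cabin, lakeside, golden hour, soft warm lighting, watercolour painting, "
    "muted palette, wet-on-wet texture, highly detailed, 4k"
)
```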

Well, according to OpenAI, this won't be necessary with DALL-E 3, which is expected to be a lot better at understanding the nuances of written language, as presented in its promotional video.

Promptly fired?

Prompt engineering has become one of the AI buzz phrases of the past year, leading to the creation of new jobs in the field, but progress is so quick that newly employed prompt engineers might already be forced to look for something else to do, given how human-like interactions with AI are becoming.

On the other hand, this might also be a major junction, with different solutions following their own paths, some of which may still require considerable human expertise.

Image Credit: SIphotography / depositphotos

This is because DALL-E 3 is coming with a handful of limitations. It allows artists to opt out of having the model trained on their data. It also prevents users from asking the model to produce artwork in the style of any living artist, likely a hedge against possible lawsuits, which are already being filed against OpenAI and other AI companies.

Direct, mass-market competitors like Midjourney may be forced to follow a similar path, but free, open-source solutions like Stable Diffusion won't need to, since anybody can run a local instance of the model and train it on any images they like.

Doing so will, of course, require some knowledge, especially as it’s unlikely that SD will be able to match ChatGPT’s proficiency in handling conversational language.
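
For readers curious what "running a local instance" actually involves, here is a minimal sketch using the open-source Hugging Face diffusers library (my choice of tooling; the article does not name any particular stack), which loads a publicly released Stable Diffusion checkpoint and generates an image on a local GPU:

```python
# Minimal, illustrative sketch of running Stable Diffusion locally with the
# Hugging Face diffusers library. Assumes a CUDA-capable GPU and that the
# publicly released v1.5 checkpoint can be downloaded from the Hugging Face Hub.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

prompt = "a surreal landscape made of food, rolling hills, warm evening light"
image = pipe(prompt, num_inference_steps=30, guidance_scale=7.5).images[0]
image.save("local_generation.png")
```

Fine-tuning on your own images (for example with the DreamBooth or LoRA training scripts in the same ecosystem) is where the "some knowledge" part really kicks in, since it means preparing datasets and managing GPU memory rather than typing a sentence into a chat box.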

In other words, it might be the beginning of the kind of market fragmentation that we see in other IT services and products, depending on the degree of flexibility one needs.

Image created in DALL-E 3: “A vast landscape made entirely of various meats spreads out before the viewer. tender, succulent hills of roast beef, chicken drumstick trees, bacon rivers, and ham boulders create a surreal, yet appetizing scene. the sky is adorned with pepperoni sun and salami clouds.” / Image Credit: OpenAI

A rookie user who would like to just get some pretty pictures will be more than happy with something that works well enough out of the box, while designers (in a multitude of fields) may require the ability to fine-tune their models — including by having minute control over their prompts — at the expense of ease of use.

In fact, this divergence may only grow with time as AI is, ultimately, just a tool that’s supposed to serve our needs. For all the generative AI art flooding the internet over the past 12 to 18 months, the real value is in output that has practical use. This means that in most professional cases, it has to be perfect or near-perfect.

Think of designing assets for 3D projects, video games, or movies. Or product and packaging design. Or highly accurate, choreographed photography scenes, including realistic people, real-life locations, angles, focal lengths of lenses used, times of day, seasons of the year, and so on.

To convey a desired message for professional use, creators already have to have a lot of control, and an intelligent approximation coming from a large, closed-source, pre-trained model is unlikely to be good enough.

It is also where those new jobs that AI advocates are promising might be created. Lighting engineers, colour graders, and photo retouchers might soon be replaced (unless they upgrade their own skills) by someone who can speak the language that smart software needs in order to produce what is required of it.

Plain, natural language may never be accurate enough — just think of all the times you told someone something and they didn’t quite understand you. Well, the same is bound to keep happening with computers. No matter how intelligent they become, they can’t read your mind and there’s only so much information a conversation can contain.

Which is why instructing them in more direct, specific, technical terms may actually be a better way to achieve what you want, when precision is key.

After all, the most we can achieve with conversational AI is parity with humans, not clairvoyance (at least not until we begin implanting chips into our brains). And as long as that is the case, there’s bound to be plenty of work left for us, even if it requires constant adaptation to changing conditions.

Featured Image Credit: DALL-E 3 / OpenAI
