OpenAI’s new machine learning model DALL.E creates images from text

The name of this machine learning model comes from a combination of artist Salvador Dalí and Pixar’s beloved robot WALL·E. A truly creative name, if we dare say. DALL.E is a 12-billion parameter version of OpenAI’s GPT-3 specialising in image generation from text.

DALL.E is trained on a dataset of text-image pairs, and the images it creates can include anthropomorphised animals and objects. It can combine unrelated concepts to form plausible pictures, render text, and transform existing images.

GPT-3 is proof that language can allow large neural networks to perform text generation tasks, and Image GPT has generated high-fidelity images. OpenAI aims to show that using language to manipulate visual concepts is now closer than ever.

DALL.E is a transformer language model. It receives text and image data as a single stream of up to 1280 tokens and uses maximum likelihood to generate the tokens one by one. DALL.E can create images from scratch and even regenerate regions of an existing image to match the text prompt’s requirements.
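The token-by-token generation described above can be sketched as greedy autoregressive decoding. This is an illustrative toy, not OpenAI’s implementation: `logits_fn` stands in for the 12-billion-parameter transformer, and the vocabulary size and model are made up for the example.

```python
import numpy as np

def generate(logits_fn, prompt_tokens, total_len, vocab_size):
    """Greedy maximum-likelihood decoding: at each step, append the
    single most likely next token given everything generated so far.
    DALL.E treats text and image tokens as one data stream; here both
    are simply integers drawn from a shared vocabulary."""
    tokens = list(prompt_tokens)
    while len(tokens) < total_len:
        logits = logits_fn(tokens)          # shape: (vocab_size,)
        tokens.append(int(np.argmax(logits)))
    return tokens

# Hypothetical stand-in for the transformer: returns deterministic
# pseudo-random logits based only on the sequence length.
def toy_model(tokens):
    rng = np.random.default_rng(len(tokens))
    return rng.standard_normal(16)

out = generate(toy_model, [1, 2, 3], total_len=10, vocab_size=16)
```

In the real model the prompt tokens encode the caption, and the remaining positions are filled with image tokens that a separate decoder turns back into pixels; the loop structure, however, is the same.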

This can have significant impacts on society, and OpenAI will continue investigating how DALL.E and future models will affect work processes and professions, how much bias appears in the output, and any ethical concerns the technology raises.

DALL.E can create images for many different sentence types, allowing for exploration of language’s compositional structure. It can modify the attributes of objects, draw multiple objects, and combine multiple attributes, including positioning and stacking objects relative to one another. There is a level of controllability over the positioning and attributes of some objects, but the way captions are phrased directly affects the outcome. The more objects and attributes are introduced, the more likely DALL.E is to become confused. DALL.E is also not robust to rephrasing: semantically equivalent captions do not always produce the correct interpretation.

Images from the prompt ‘an armchair in the shape of an avocado. An armchair imitating an avocado’. Image from source below.

For different viewpoints, DALL.E can apply distortions to scenes, which also let the researchers tinker with how it generates reflections. Even macro photos and other ways of visualising objects are possible.

DALL.E can understand some implied context, such as sunlight necessitating shadows even when they are not explicitly mentioned. Thus it can function almost like a 3D rendering engine such as Unity, producing plausible lighting conditions without every single detail being stated in the caption.

The model can even generate designs for fashion and furniture.

As language composition involves both the real and the imaginary, DALL.E can create objects that do not exist in real life. Even unrelated concepts will not hamper the generation of interesting images. This includes anthropomorphised animals and objects, hybrid animals, and even emojis.

GPT-3 can perform tasks even if it is only armed with a description and a cue to answer in a specific way. Given a phrase such as ‘here is the sentence "a person walking his dog in the park" translated into French:’, GPT-3 will answer in French, ‘un homme qui promène son chien dans le parc’. This is called zero-shot reasoning. DALL.E can also do this with the images it generates when instructed in the proper way.
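A zero-shot prompt is just a description, an input, and a cue assembled into one string. A minimal sketch of that assembly follows; the function name and format are illustrative, not part of any OpenAI API.

```python
def zero_shot_prompt(description, query, cue):
    """Build a zero-shot prompt: a task description, the input to
    transform, and a cue that signals how the model should answer.
    No examples of the task are included, hence 'zero-shot'."""
    return f"{description}\nhere is the sentence '{query}' {cue}"

prompt = zero_shot_prompt(
    "Translate English to French.",
    "a person walking his dog in the park",
    "translated into French:",
)
```

For DALL.E the same pattern holds, except the cue asks for an image (or an image completion) rather than text.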

OpenAI did not anticipate many of these capabilities, and it tested DALL.E’s expertise with analogical reasoning problems, particularly Raven’s Progressive Matrices, a visual IQ test widely used in the 20th century.

DALL.E also has knowledge about geographical facts and objects from different periods of time.

Read more from OpenAI -> Article