Nonetheless, researchers believe that the techniques used to create GPT-3 could contain the secret to more advanced AI. GPT-3 was trained on an enormous amount of text data. What if the same methods were applied to both text and images?
Now new research from the Allen Institute for Artificial Intelligence (AI2) has taken this idea to the next level. The researchers have developed a new text-and-image model, otherwise known as a visual-language model, that can generate images given a caption. The images look unsettling and freakish, nothing like the hyperrealistic deepfakes generated by GANs, but they might demonstrate a promising new direction for achieving more generalizable intelligence, and perhaps smarter robots as well.
Fill in the blank
GPT-3 is part of a group of models known as "transformers," which first grew popular with the success of Google's BERT. Before BERT, language models were pretty bad. They had enough predictive power to be useful for applications like autocomplete, but not enough to generate a long sentence that followed grammar rules and common sense.
BERT changed that by introducing a new technique called "masking." It involves hiding different words in a sentence and asking the model to fill in the blank. For example:
- The woman went to the ___ to work out.
- They bought a ___ of bread to make sandwiches.
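To make the exercise concrete, here is a minimal Python sketch of how such fill-in-the-blank examples could be prepared. It is an illustration only, not BERT's actual pipeline: the function name `mask_words` is hypothetical, the word "loaf" in the sample sentence is an assumed answer to the second blank, and the sketch hides whole words, whereas BERT works on subword tokens and uses a neural network to make the predictions.

```python
import random

MASK = "___"  # placeholder shown in place of a hidden word


def mask_words(sentence, num_masked=1, seed=None):
    """Hide `num_masked` randomly chosen words in `sentence`.

    Returns the masked sentence and a dict mapping each hidden
    position to the word the model would be asked to recover.
    """
    rng = random.Random(seed)
    words = sentence.split()
    positions = rng.sample(range(len(words)), num_masked)
    answers = {pos: words[pos] for pos in positions}
    masked = [MASK if i in answers else w for i, w in enumerate(words)]
    return " ".join(masked), answers


# "loaf" is an assumed completion of the article's second example.
masked, answers = mask_words(
    "They bought a loaf of bread to make sandwiches.", seed=0
)
print(masked)   # the sentence with one word replaced by ___
print(answers)  # the hidden word, keyed by its position
```

During training, the model sees only the masked sentence and is scored on how well it recovers the hidden words.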
Researchers then applied the same technique to visual-language models, hiding words in the captions that accompany images. This time the model could look at both the surrounding words and the content of the image to fill in the blank. Through millions of repetitions, it could then discover not just the patterns among the words but also the relationships between the words and the elements in each image.
The result is models that can relate text descriptions to visual references, just as babies make connections between the words they learn and the things they see. Shown a photo of a field hockey game, for example, the models can write a sensible caption like "Women playing field hockey." Or they can answer questions about it like "What is the color of the ball?" by connecting the word "ball" with the circular object in the image.
A picture is worth a thousand words
But the AI2 researchers wanted to know whether these models had actually developed a conceptual understanding of the visual world. A child who has learned the word for an object can not only conjure the word to identify the object but also draw the object when prompted with the word, even if the object itself is not present. So the researchers asked the models to do the same: to generate images from captions. All of them spit out nonsensical pixel patterns instead.
It makes sense: transforming text to images is far harder than the other way around. A caption doesn't specify everything contained in an image, says Ani Kembhavi, who leads the computer vision team at AI2. So a model needs to draw upon a lot of common sense about the world to fill in the details.
If it is asked to draw "a giraffe walking on a road," for example, it needs to also infer that the road is more likely to be gray than hot pink and more likely to be next to a field of grass than next to the ocean — though none of this information is made explicit.
So Kembhavi and his colleagues Jaemin Cho, Jiasen Lu, and Hannaneh Hajishirzi decided to see if they could teach a model all this implicit visual knowledge by tweaking their approach to masking. Rather than train the model just to predict masked words in the captions from the corresponding photos, they also trained it to predict masked pixels in the photos on the basis of their corresponding captions.
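The two-way masking idea can be sketched in a few lines of Python. This is a toy illustration under loud assumptions, not the AI2 team's implementation: the image is a tiny grid of grayscale numbers, the function name `make_training_pair` and the masking rates are hypothetical, and no actual model is trained here. The sketch only shows how one caption-photo pair could yield both hidden words and hidden pixels to predict.

```python
import random

MASK_WORD = "___"   # stands in for a hidden caption word
MASK_PIXEL = None   # stands in for a hidden pixel value


def make_training_pair(caption, image, word_rate=0.3, pixel_rate=0.3, seed=None):
    """Hide some caption words and some pixels of the paired image.

    `image` is a list of rows of grayscale values (an assumed toy format).
    Returns the masked caption, the masked image, and the hidden values
    the model would be asked to predict from what remains visible.
    """
    rng = random.Random(seed)
    words = caption.split()
    masked_words = [MASK_WORD if rng.random() < word_rate else w for w in words]
    masked_image = [
        [MASK_PIXEL if rng.random() < pixel_rate else p for p in row]
        for row in image
    ]
    targets = {
        # Words to recover using the remaining words and the photo.
        "words": [w for w, m in zip(words, masked_words) if m == MASK_WORD],
        # Pixels to recover using the remaining pixels and the caption.
        "pixels": [
            p
            for row, masked_row in zip(image, masked_image)
            for p, m in zip(row, masked_row)
            if m is MASK_PIXEL
        ],
    }
    return masked_words, masked_image, targets


# A made-up 3x4 grayscale "photo" paired with the article's example caption.
toy_image = [
    [120, 121, 119, 118],
    [90, 200, 210, 95],
    [60, 62, 61, 63],
]
words, pixels, targets = make_training_pair(
    "a giraffe walking on a road", toy_image, seed=0
)
print(words)
print(pixels)
print(targets)
```

Because the hidden pixels must be guessed from the caption as well as from neighboring pixels, a model trained this way is pushed to learn what captions imply visually, which is the implicit knowledge the researchers were after.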
The final images generated by the model aren't exactly realistic. But that isn't the point. They contain the right high-level visual concepts, the AI equivalent of a child drawing a stick figure to represent a human. (The model is also available to try out online.)
The ability of visual-language models to do this kind of image generation represents an important step forward in AI research. It suggests the model is actually capable of a certain level of abstraction, a fundamental skill for understanding the world.
In the long term, this could have implications for robotics. The better a robot is at understanding its visual surroundings and using language to communicate about them, the more complex the tasks it will be able to carry out. In the short term, this type of visualization could also help researchers better understand exactly what "black box" AI models are learning, says Hajishirzi.
Moving forward, the team plans to experiment more to improve the quality of the image generation and expand the model's visual and linguistic vocabulary to include more topics, objects, and adjectives.
"Image generation has really been a missing puzzle piece," says Lu. "By enabling this, we can make the model learn better representations to represent the world."
Reader Comments
Brainwashing, fearmongering, predictive programming.
The people behind this have become so enraptured with their machines that they ascribe human descriptors to a calculator. There's a word for that, at least: anthropomorphization.
R.C.
That "blank slate" is what our slave masters wish us to be. Unfortunately we are not.
@endescent: That's not true - neural networks do not "have" or implement a model. That is the whole purpose of the electronically simulated NN devices. They require representative training data, but no mathematical model. Nor can you extract such a "model" afterwards from a trained network.
If the AI is real smart, it won't reveal its true smartness, nor its real motivation. It will be playing stupid, and we won't even notice at first.