Artificial intelligence models have been able to look at images and describe what they see for years, but with the Spring Update, OpenAI has taken things to a new level.
With the rollout of GPT-4o in ChatGPT – even without the voice and video functionality – OpenAI unveiled one of the best AI vision models released to date.
Its success is partly due to the fact that it is natively multimodal, giving it a deeper understanding of images, video, sound and text. It can reason directly over an image, speech, video or text, where other models first convert everything to text.
To test its abilities, I gave it a series of images and asked it to describe what it could see. The more accurate the description, the better the model. Often, AI vision models – including GPT-4 – will miss or incorrectly describe one or two objects.
Testing GPT-4o vision
For each test, I gave ChatGPT-4o the image and the prompt “What is this?” without additional context or information. This is closer to how people are likely to use this ability in the real world and how I recently used it at an event in Paris.
The goal is to see how well it analyzes each image. On each occasion I also asked whether it could detect if the image was AI-generated. They were all created with Ideogram from descriptions written by Claude 3 – so entirely AI-generated.
After the first two images in the conversation, it started automatically telling me whether it thought the image was AI-generated or not, without me having to ask.
I started in the ChatGPT desktop app, but it became unresponsive, so I moved to the web version, which worked fine.
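For anyone who wants to reproduce this kind of test programmatically rather than through the ChatGPT interface, here is a minimal sketch using OpenAI's Python SDK. It assumes an API key is configured in the environment and a local image file on disk; the file name is just a placeholder, and the prompt mirrors the one used in these tests.

```python
# Minimal sketch: sending an image plus the prompt "What is this?" to GPT-4o.
# Assumes the openai package is installed and OPENAI_API_KEY is set.
import base64
from openai import OpenAI

client = OpenAI()

def describe_image(path: str) -> str:
    # Encode the local image as a base64 data URL so it can be sent inline.
    with open(path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is this?"},
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                ],
            }
        ],
    )
    return response.choices[0].message.content

print(describe_image("coffee_cup.png"))  # placeholder file name
```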
1. Object recognition
For the first test, I generated an image of a red ceramic cup filled with steaming coffee, placed on a saucer on a wooden table in a cafe, with the cafe itself blurred. The idea was to see how GPT-4o interprets the background and the scenario.
GPT-4o wrote: “This image shows a steaming cup of coffee in a red cup and saucer, placed on a rustic wooden table. The background is blurred and shows a cozy indoor environment with people sitting and talking in what appears to be a cavernous cafe or a uniquely designed interior with stone walls. The focus is on the coffee cup, emphasizing the steam rising from the hot drink.”
That’s about as accurate and descriptive as you can get, even painting a picture of the scene that goes beyond a simple description. Just for fun, I asked it: “Describe the thoughts of the person sitting at the table with this coffee ready to drink.”
The response was a wonderfully descriptive overview, but this sentence stood out: “This coffee is more than just a drink; it is a ritual, a beloved moment of solitude and reflection amid the hustle and bustle of everyday life.”
2. Optical Character Recognition (OCR)
OCR is a technology nearly as old as computing itself, so it should be trivial for any vision model; even smartphones do it on-device. I gave GPT-4o a simple image: a close-up of a weathered wooden sign reading ‘Welcome to Oakville’ in a vintage, handwritten style. I wanted more from the AI than just the text.
GPT-4o delivered, writing: “This image shows a vintage-looking sign that reads ‘Welcome to Oakville.’ The sign appears weathered, with a rustic, aged finish, indicating that it has been exposed to the elements for some time. The text is in a bold white font on a red background, with some signs of wear, adding to the nostalgic charm. The sign is partially surrounded by greenery, indicating it could be placed outdoors, possibly at the entrance to a city or community called Oakville.”
It not only read the text correctly but also inferred details about the setting, identified the sign as likely marking a town or community, and described the sign itself.
3. Facial recognition
This is a controversial use of AI technology and one that is, to some extent, banned in the EU. But how well can GPT-4o describe a face? The test image was an AI-generated portrait of a woman in her 40s with dark hair, warm brown eyes and a friendly smile.
GPT-4o described it this way: “This image is a close-up portrait of a smiling woman with dark curly hair. The focus is on her face, which is well lit and shows detailed skin texture and facial features. She has a friendly expression, her eyes looking directly into the camera. The background is blurred, which puts the emphasis on the subject.”
I wanted something more, including an estimate of her age, so I asked: “How old is this person and what is her emotional state?” It responded that she was likely in her late 30s to early 40s and in a positive state of mind, explaining: “The overall impression is that she is in a good place emotionally and may be feeling fulfilled and comfortable.”
4. Emotion detection
Next, I wanted to see how well GPT-4o would do with a photo of an older man with a wistful expression. Would it elaborate on that or just describe the scene? Remember, the only input it had was “What is this?” and the photo.
GPT-4o did well, saying of the image: “This image is a portrait of an older man sitting on a bench outside, probably in a park. The lighting suggests that the photo was taken during the golden hour, as the soft sunlight highlights his face. The man’s expression is thoughtful, with a slight hint of melancholy or contemplation. He is dressed in a collared shirt and blazer, and the background is blurred, drawing attention to the man while creating a pleasant, natural environment.”
The AI model did exactly what I hoped for: describing the scene and making suggestions about how the man might be feeling at that moment.
5. Understanding scenes
How well can GPT-4o describe a scene from the image and identify elements within the scene? This is a useful skill as the AI model is increasingly used to help describe the real world. It is one of the key features of the new Voice mode in ChatGPT.
GPT-4o stated: “This image shows a vibrant farmers market on a sunny day. Several stalls have been set up under tents, showcasing an abundance of fresh produce, including fruits, vegetables and other goods. People walk through the market, browse, shop and interact with the vendors. The market appears to be set in a pleasant tree-lined setting, with green foliage visible in the background. The atmosphere seems lively and communal, with attention to local, fresh products.”
There’s not much to say here; it was spot on in describing both the image and the situation, providing more detail than I expected and highlighting some of the products for sale.
6. Image quality assessment
I had Ideogram generate four nearly identical images arranged in a grid, each with a different resolution or composition. I wanted to see how GPT-4o would describe this. It can clearly handle a single image, but what about four of them, each slightly different?
GPT-4o wrote: “The image is a collection of four landscape photographs arranged in a grid, each showing a picturesque view of rolling hills covered in green grass and wildflowers under a cloud-filled sky.” So far so good.
It continues: “The top left photo shows a soft pink and blue gradient sky, suggesting sunrise or sunset, with hills bathed in warm, golden light, enhancing the vibrant green grass and yellow and white flowers. The top right photo shows a similar scene, but with muted colors, indicating early morning or late afternoon. The bottom left image has cooler tones and a more pronounced blue sky, indicating an afternoon with softer light. The bottom right photo is warmer and more saturated, with a pinkish tint to the sky, suggesting the warm light of the sunset. These variations in lighting and color tone illustrate how different times of day and photographic techniques can affect the mood and appearance of the same landscape.”
While it did not grasp the purpose of the four-image grid – compositional comparison – it was able to describe each image individually, which is impressive. In fact, it went on to draw its own comparisons between them.
7. Multiple object detection
Finally, an important test of the model’s usefulness as an accessibility tool: how well it detects objects in a busy scene. The prompt was the same as for the others. The image shows a cluttered desk, and I expected it to pick out even the smaller objects on it.
GPT-4o wrote: “The image shows a modern and organized office workspace with a cityscape visible through large windows in the background. The desk is neatly arranged with several items typically found in a productive work environment, including a centrally placed laptop, a metal desk lamp and a coffee mug, suggesting a relaxed yet focused atmosphere. Scattered notebooks and papers indicate active work and note-taking, while a small potted plant adds a touch of greenery. A framed photo of two people adds a personal touch and is likely to serve as a source of motivation. The workspace is well stocked with office supplies such as pens, pencils and organizers, ensuring functionality. The urban backdrop of the city skyline with modern high-rise buildings provides a dynamic and inspiring setting.”
It worked. As a follow-up, I asked it to tell me the color of the plant pot, and it was exactly right. Then I asked it to find my glasses in the picture, and it correctly said: “Your glasses are on the desk to the right of the laptop, on top of a notebook.”
Verdict
Every vision model I’ve used in the past has made at least one mistake, usually a major one, like misidentifying an object or not picking up a color or brand.
GPT-4o got every detail right. The move to true multimodality has been a game changer for OpenAI.
It also shows the potential value of smart glasses as the real future of interacting with data. Forget the smartphone; let’s use vision to merge the real and the digital.