15th May, 2024
OpenAI Omnimodal GPT-4o
My thoughts on OpenAI's latest release, GPT-4o.
OpenAI just released their latest model, GPT-4o, where the 'o' stands for "omni". The word "omni" means "all" or "everywhere", and "modal" refers to "a particular mode in which something exists or is experienced or expressed". So, omnimodal means "all modes". This model is trained end-to-end on a wide variety of data sources, including text, images, video, and audio. This is a huge leap from their previous models, which were primarily trained on text, with other modalities handled by separate, bolted-on components.
Prologue
Now, it was very much expected that OpenAI was going to release an update of some kind, and there was plenty of speculation about what it could be. Just a few weeks back, an unknown model appeared under the name "im-a-good-gpt2-chatbot" on the LMSYS Chatbot Arena, where models are pitted against each other in head-to-head competition. There were numerous reports of this model being extremely intelligent; it was able to beat every other model in the arena, even GPT-4-turbo, the king of the leaderboard. A lot of people speculated that this could be the next model from OpenAI.
Most people were expecting a GPT-5 or at least a GPT-4.5 release, but just a day before the announcement, Sam Altman, CEO of OpenAI, tweeted that all the speculations were wrong and that they were going to release something no one had ever seen before, adding that it "feels like magic to me".
Release Day
On the release day, Mira Murati comes on stage and announces that they are releasing GPT-4o. The banger here is that it's free for everyone to use in ChatGPT. This was a huge surprise, as it makes GPT-4-level intelligence accessible to everyone. This is a huge leap in democratizing AI.
All this was amazing, but they weren't done yet. They announced an upgraded "Voice Mode" in GPT-4o. For those who don't know, prior to GPT-4o you could use Voice Mode to talk to ChatGPT with latencies of 2.8 seconds (GPT-3.5) and 5.4 seconds (GPT-4) on average. To achieve this, Voice Mode was a pipeline of three separate models: one simple model transcribes audio to text, GPT-3.5 or GPT-4 takes in that text and outputs text, and a third simple model converts the output text back to audio. This wasn't particularly efficient, and it couldn't handle real-time conversation.
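To make that cascade concrete, here's a rough sketch of the three-stage loop using OpenAI's public Python SDK. To be clear, this is my own illustration of the architecture described above, not OpenAI's actual implementation; the model and voice names (whisper-1, tts-1, alloy) are just the public API's stand-ins for whatever they ran internally.

```python
# Rough sketch of the old cascaded Voice Mode: three separate models,
# each adding its own latency. An illustration only, not OpenAI's
# actual internal pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def voice_mode_turn(audio_path: str) -> bytes:
    # Stage 1: a speech-to-text model transcribes the user's audio.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=f
        )

    # Stage 2: the language model sees only the text -- tone, pauses,
    # and background sound have already been thrown away.
    reply = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": transcript.text}],
    )

    # Stage 3: a text-to-speech model reads the reply back out.
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=reply.choices[0].message.content,
    )
    return speech.content  # audio bytes to play back to the user
```

Each stage blocks on the previous one, so the latencies stack up, and anything that isn't text (tone, emotion, multiple speakers) is lost at stage 1. That's exactly what an end-to-end model avoids.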
With GPT-4o, they have trained a single new model end-to-end across text, vision, and audio, meaning all inputs and outputs are processed by the same neural network. This lets the model behave more like a human than ever before. The latency is almost negligible, and you can dial in the expressiveness of the voice to your liking. This truly felt like a giant leap forward in STT (speech-to-text) and TTS (text-to-speech) technology, essentially solving both.
The "Voice" Reaction ✨
What caught people off-guard was the way the model spoke. It was so human-like that it was hard to distinguish between a human and the model. It had no issues understanding the context of the situation and handled pauses like a champ. It also bore an uncanny resemblance to the voice of "Samantha" (voiced by Scarlett Johansson) from the movie "Her". For those who don't know, Samantha is an AI assistant in "Her" who is capable of learning and evolving. The movie is set in the near future, where the protagonist falls in love with Samantha, and it explores the relationship between the two. Btw, if you haven't watched the movie, I highly recommend it. It's a beautiful film. One thing I found weird was how flirtatious the model was. It was like talking to a human, but with a lot of what I would call "sass". I don't know if it was intentional or not, but it was a bit weird.
Other Features
Apart from the voice mode, the model crushes various benchmarks and its competitors in the market. The only model I see coming close to GPT-4o is Meta's upcoming "Llama 400B", which is still in training; I'm very eager to see how it performs against GPT-4o. OpenAI's blog post on GPT-4o also mentions a slew of new capabilities that are possible with GPT-4o. Some of them include: enhanced text-to-image, 3D object synthesis, text-to-font, speaker diarization, iterative editing of images, sound generation, better multilingual tokenization, a new ChatGPT Mac app, and many more.
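The multilingual tokenization point is easy to verify yourself: GPT-4o ships with a new tokenizer (o200k_base, versus GPT-4's cl100k_base), and you can compare them with OpenAI's tiktoken library. The Hindi sentence below is just a sample I picked; exact counts will vary by text, but non-Latin scripts generally need far fewer tokens under the new encoding.

```python
# Compare GPT-4's tokenizer (cl100k_base) with GPT-4o's (o200k_base)
# on a non-English sentence. Fewer tokens means cheaper and faster
# processing of the same text.
import tiktoken

old = tiktoken.get_encoding("cl100k_base")  # GPT-4 / GPT-4-turbo
new = tiktoken.get_encoding("o200k_base")   # GPT-4o

text = "नमस्ते, आप कैसे हैं?"  # "Hello, how are you?" in Hindi

print(len(old.encode(text)), "tokens with cl100k_base")
print(len(new.encode(text)), "tokens with o200k_base")
```

Fewer tokens per sentence means non-English users get faster responses and lower costs for the same text.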
It's insane to see how far ahead of the curve OpenAI is and how hard they are pushing the boundaries of AI. I'm a very pro-open-source person, and just when the open source community was starting to catch up with GPT-4, OpenAI slips way ahead again. Hopefully, open source soon catches up so that we can see more amazing things in the future.
Conclusion
My friends and I had a discussion about whether AI is good for the masses or not. Their view that "AI dumbs down the masses" is somewhat correct. It's very important to know that AI is a tool and not a replacement for humans (yet). It's not a "one size fits all" solution, and you have to know the limitations of AI and use it accordingly.
What I feel is that AI brings the world's knowledge to your fingertips. It's like having a super-intelligent friend who knows everything; it's up to you to use it wisely. I'm very excited to see what the next few years hold for AI and how it will shape our future for the better. Here is a tweet from Andrej Karpathy about the state of AI. LOL.