
How GPT-4 Will Change Human-AI Interaction

“What comes after ChatGPT?” is the question I posed in Everydays 171, and the consensus answer is MLLMs, or Multimodal Large Language Models. Although text-based interfaces for interacting with language models (such as ChatGPT) helped familiarize millions of people with AI, they’re just a step on the way to a much more natural human-AI interface. MLLMs will make interacting with language models and AI as simple as having a natural conversation.

And we may be just a week away from OpenAI releasing an MLLM.

WTF? GPT-4 Release Next Week

OpenAI will reportedly release GPT-4 as early as next week. The update to their language model is rumored to have 100 trillion parameters, more than 500x the size of GPT-3’s 175 billion. However, other sources say that GPT-4 will be similar in size but much better trained. Either way, most expect GPT-4 to be multimodal, making it the first MLLM available for public use.

In this tweet, Jim Fan outlines what changes if GPT-4 is, in fact, multimodal. Here’s what he predicts:

If GPT-4 is multimodal, we can predict with reasonable confidence what GPT-4 might be capable of, given Microsoft’s prior work Kosmos-1:

1. Visual IQ test: yes, the ones that humans take!

2. OCR-free reading comprehension: input a screenshot, scanned document, street sign, or any pixels that contain text. Reason about the contents directly without explicit OCR. This is extremely useful to unlock AI-powered apps on multimedia web pages, or “text in the wild” from real world cams.

3. Multimodal chat: have a conversation about a picture. You can even provide “follow-up” images in the middle.

4. Broad visual understanding abilities, like captioning, visual question answering, object detection, scene layout, common sense reasoning, etc.

5. Audio & speech recognition (??): wasn’t mentioned in Kosmos-1 paper, but Whisper is already an OpenAI API and should be fairly easy to integrate.

Jim Fan’s predictions draw on rumors of what Andreas Braun (Microsoft Germany CTO) said, and largely on the research paper for Kosmos-1, which shows very real progress being made on MLLMs. An image from that paper shows what MLLMs will be capable of:

Source: arXiv

For one, if GPT-4 is released next week, it will be pretty demoralizing for everyone who has started building with the ChatGPT API. It’s likely that GPT-4 will effectively make anything built on older versions obsolete. That’s beside the point, though.

What’s more important is that an MLLM from OpenAI would change human-AI interaction completely. We’d be getting much closer to the world shown in most sci-fi films, where we just talk to computers and things get done. With hardly any effort on our part, we guide the computer to do our work for us, just like Tony Stark/Iron Man interacting with Jarvis or Friday.

Connecting task-specific models through APIs is what will really make the magic happen. In yesterday’s note on creating Pokémon cards with AI, the human has to shuttle output back and forth between Midjourney and ChatGPT because these AIs do different things. APIs connect the web today, but eventually LLMs will replace that glue, knowing how to interact with other models on their own and coming back with finished products in multimedia formats.
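
To make that concrete, here is roughly what that glue code looks like today, as a sketch rather than anyone’s actual pipeline. It assumes the openai Python client (v0.27-era interface) and an OPENAI_API_KEY environment variable; Midjourney has no public API, so OpenAI’s image endpoint stands in for the image model, and the “image prompt is the last line” convention is purely illustrative.

```python
# Sketch of hand-wiring two task-specific models: a chat model writes the card text
# plus an artwork prompt, and a separate image model renders the artwork.
# Assumes the openai Python client (v0.27-era interface) and OPENAI_API_KEY set.
# Midjourney has no public API, so OpenAI's image endpoint stands in here.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

# Step 1: the language model invents the card and an artwork prompt.
chat = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{
        "role": "user",
        "content": (
            "Invent a Pokémon-style trading card: name, HP, one move, "
            "and finish with a single-line image prompt for the artwork."
        ),
    }],
)
card_text = chat.choices[0].message.content
print(card_text)

# Step 2: carry the prompt over to the image model by hand (or by script).
image_prompt = card_text.splitlines()[-1]  # illustrative convention: prompt is the last line
image = openai.Image.create(prompt=image_prompt, n=1, size="1024x1024")
print(image["data"][0]["url"])
```

An MLLM would absorb exactly this hand-off: one request in, text and finished artwork out.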

We’re already seeing builders connect language and vision models, such as Prismer (GitHub code here):

Overall, the work behind Kosmos-1, Prismer, and the MLLMs that follow is the best thing I’ve seen that could convince me we’re getting closer to AGI, or Artificial General Intelligence.