
What’s Next After ChatGPT?

I feel comfortable saying that ChatGPT has now reached the mainstream for two reasons:

  1. My grandma is asking me to explain ChatGPT to her, which I didn’t expect to happen this fast.
  2. And there’s now an API for ChatGPT, which will make it a standard across the web.

Now, any developer who has an idea for applying Generative AI to a new use case can actually build it. OpenAI has handed everyone the building blocks to put a state-of-the-art language model behind their own product. And it’s reasonably priced too.

OpenAI has released ChatGPT and Whisper APIs for developers. ChatGPT is a model for generating coherent text and is priced at $0.002 per 1,000 tokens (about 750 words). The Whisper API is priced at $0.006 per minute; it accepts audio inputs in multiple formats and can transcribe audio to text with accuracy comparable to a skilled human transcriptionist. OpenAI has also modified its terms of service so that data submitted through the API is no longer used for service improvements, including model training. – Ars Technica
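To put that pricing in perspective, here’s a minimal sketch of calling the ChatGPT API with the openai Python library (the v0.27-era interface that shipped alongside this announcement) and estimating the cost from the token usage it reports. The prompt and the cost math are my own; only the model name and the $0.002 rate come from the announcement above.

```python
# Minimal sketch: call the ChatGPT API and estimate cost from reported token usage.
# Assumes the openai Python package (v0.27-era interface) and an API key in
# the OPENAI_API_KEY environment variable.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

PRICE_PER_1K_TOKENS = 0.002  # $0.002 per 1,000 tokens, per OpenAI's announcement

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",  # the model behind ChatGPT
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain ChatGPT to my grandma in two sentences."},
    ],
)

answer = response["choices"][0]["message"]["content"]
tokens_used = response["usage"]["total_tokens"]  # prompt + completion tokens
print(answer)
print(f"Tokens: {tokens_used}, cost ≈ ${tokens_used / 1000 * PRICE_PER_1K_TOKENS:.5f}")
```

A typical exchange like this burns well under a thousand tokens, so each call costs a fraction of a cent, which is exactly why so many teams can afford to bolt the API onto an existing product overnight.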

Some of the notable early users of the ChatGPT API include Snapchat’s “My AI” bot, Quizlet’s Q-Chat, which will reportedly help students study, and Instacart’s “Ask Instacart” feature, launching later this year, which will let customers ask food-related questions.

We’re seeing plenty of other demos on Twitter, such as Mckay Wrigley’s Paul Graham GPT, an app that answers questions based on the famed VC’s hundreds of essays.

In Everydays 163, I mentioned 4 ways LLMs (large language models) will change how we consume content. One of those ways was making books more accessible. Jim Fan had a great take on this new paradigm shift, given the low price point of the ChatGPT API.

Jim’s point is that for about $4, a developer can run the entire Harry Potter series (just over a million words) through the ChatGPT API and build what amounts to a “HarryPotterGPT.” That would give a new boost to fan fiction, letting anyone accurately write in the voice of Hagrid or Hermione, and it would let readers engage with the franchise in a much more personal and intimate way. And that’s just what I thought of in a few minutes.
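To make that concrete, here is a rough sketch of one way a “HarryPotterGPT” could be wired up, following the same embeddings-plus-retrieval pattern as Paul Graham GPT above. Everything here is an assumption on my part rather than Jim’s recipe: the chunking, the in-memory search, and the back-of-the-envelope cost math are all illustrative.

```python
# Illustrative sketch of a "HarryPotterGPT" built on embeddings + retrieval.
# Assumes the openai Python package (v0.27-era interface) and numpy; the book
# text, chunking, and in-memory search are stand-ins for a real pipeline.
import os
import openai
import numpy as np

openai.api_key = os.environ["OPENAI_API_KEY"]

# Back-of-the-envelope: the series is roughly 1.1M words, on the order of
# 1.5M tokens; at $0.002 per 1,000 tokens that's about $3 to push every word
# through the model once.
print("Approx. cost to process the whole series: $", 1_500_000 / 1000 * 0.002)

def embed(texts):
    """Embed a list of text chunks with OpenAI's embedding endpoint."""
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in resp["data"]])

# Pretend `chunks` holds the books split into ~500-token passages.
chunks = ["Hagrid knocked on the door of the hut...", "Hermione raised her hand..."]
chunk_vectors = embed(chunks)

def ask_in_character(question, character="Hagrid", k=3):
    """Retrieve the most relevant passages and answer in the character's voice."""
    q_vec = embed([question])[0]
    scores = chunk_vectors @ q_vec  # ada embeddings are ~unit length, so this is cosine-like
    context = "\n".join(chunks[i] for i in np.argsort(scores)[-k:])
    resp = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system",
             "content": f"Answer in the voice of {character}, using only this context:\n{context}"},
            {"role": "user", "content": question},
        ],
    )
    return resp["choices"][0]["message"]["content"]

print(ask_in_character("How do you get to Diagon Alley?"))
```

The specific stack doesn’t matter; the point is that every step, from embedding the books to answering questions in character, now costs pocket change.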

WTF? LLMs with a Sense of Sight

In the world of Generative AI, where so much is in flux, and so many tech companies are invested in winning, the only way to survive is always to be thinking about what’s next. Luckily, I didn’t have to look far. Microsoft unveiled Kosmos-1, an AI language model with visual perception abilities, essentially giving an LLM the sense of sight. This is slightly different from computer vision, though.

Microsoft researchers have introduced Kosmos-1, a multimodal large language model capable of analyzing images for content, solving visual puzzles, recognizing visual text, and understanding natural language instructions. The model is a step towards building an artificial general intelligence that can perform tasks at the same level as humans. Kosmos-1 can analyze images, read text from an image, and write captions for images. Microsoft plans to make Kosmos-1 available to developers. – arXiv

A few things stood out to me about Kosmos-1 from that paper.

One, multimodal LLMs (or MLLMs) are the logical upgrade to LLMs like ChatGPT because they allow the AI to comprehend many media formats. Instead of just being a text-to-text or text-to-image AI, an MLLM can work with multiple types of data input and output. If you’re still confused, think about normal person-to-person communication, which is multimodal: not only do we hear the words, we also see the speaker’s nonverbal expressions and register their intonation, all of which change what is being communicated. I’ve made myself an LLM study guide because I know that understanding LLMs will give me knowledge I can leverage and apply to MLLMs.

Two, the examples Microsoft walks through in the paper are outstanding. In the Windows 10 dialog example (page 3), the AI figures out from a screenshot which buttons to press to reach the outcome the user wants in the Windows OS. This is massive news for bots as a whole. Instead of coding simple macros where you teach the bot what to click, when to click, what data to scrape, and how to scrape it, an MLLM would essentially let you write a text prompt and have the bot figure out the entire workflow itself. That should speed up a ton of tasks: all you’ll have to do is communicate with your computer (by voice or text), and the computer will determine the best way to execute and automate the task.
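Kosmos-1 isn’t publicly available, so any code here is purely speculative, but the shape of the shift is easy to sketch. The `MultimodalModel` interface below is a made-up stand-in (not a real Microsoft or Kosmos-1 API); it exists only to contrast a hand-coded macro with instruction-driven automation.

```python
# Hypothetical sketch only: Kosmos-1 has no public API, so MultimodalModel is
# an invented stand-in that illustrates the move from scripted macros to
# "describe the goal, let the model plan the clicks" automation.
from dataclasses import dataclass

@dataclass
class Action:
    kind: str      # e.g. "click" or "type"
    target: str    # e.g. a button label the model found in the screenshot
    text: str = ""

# Old world: a brittle, hand-coded macro that breaks the moment the UI changes.
def macro_close_without_saving():
    return [Action("click", "Don't save"),
            Action("click", "Close")]

# Hypothetical MLLM world: hand over a screenshot and a plain-language goal,
# and let the model read the screen and plan the steps itself.
class MultimodalModel:  # stand-in for a future Kosmos-1-style interface
    def plan(self, screenshot_png: bytes, instruction: str) -> list[Action]:
        raise NotImplementedError("illustrative only")

def automate(model: MultimodalModel, screenshot_png: bytes, goal: str) -> None:
    for action in model.plan(screenshot_png, goal):
        print(f"{action.kind} -> {action.target} {action.text}".strip())

# Conceptual usage: automate(model, screenshot, "Close the window without saving")
```

The interesting part is what disappears: no selectors, no recorded click coordinates, no per-app scripts, just an instruction and a screenshot.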

Three, this is still just the research model. When Microsoft deems it ready to release to the public, there’s no telling how good this MLLM will be. It’s as if we’re giving ChatGPT and other LLMs another of the five senses. And that’s powerful.