Future Directions for Large Language Models

Large language models (LLMs) have taken the world by storm in the last year. ChatGPT was released less than a year ago, and we have already seen countless applications in business, education and entertainment.

In this post I’ll discuss 8 exciting developments in the field of LLMs that I think will be important in the next 1 to 3 years.

Prediction is very difficult, especially about the future. - Niels Bohr

Calling APIs

By calling APIs, LLMs can become actors in the real world.

Some examples of what can be done via API calls:

Provision a server
Send an email
Post a tweet
Buy a product and have it shipped
Operate a smart home device (lights, thermostat, lock, etc.)
Control a robot (vacuum, drone, etc.)
Send a task to a human worker via a crowdsourcing platform
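
To make this concrete, here is a minimal sketch of how a model's output can be turned into an API call. It assumes the model has been prompted to answer with a JSON object containing a tool name and arguments. The tools `send_email` and `set_thermostat` are hypothetical stand-ins that only return strings; a real system would call an actual service there.

```python
import json

# Hypothetical tools. Real versions would call an email or smart home API.
def send_email(to: str, subject: str) -> str:
    return f"email to {to}: {subject}"

def set_thermostat(celsius: float) -> str:
    return f"thermostat set to {celsius} C"

# Allowlist of functions the model is permitted to trigger.
TOOLS = {"send_email": send_email, "set_thermostat": set_thermostat}

def dispatch(model_output: str) -> str:
    """Parse a JSON tool call emitted by the model and run the matching function."""
    call = json.loads(model_output)
    name = call["name"]
    if name not in TOOLS:
        # Never execute anything that is not on the allowlist.
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**call["arguments"])

# Pretend the model produced this string.
print(dispatch('{"name": "set_thermostat", "arguments": {"celsius": 21.5}}'))
```

The allowlist is the important design choice here: the model only proposes a call, and the surrounding code decides whether it is allowed to run.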

As capabilities expand, the need for policy and regulation on this topic rises.

Better assistants

Siri feels rather underpowered compared to ChatGPT Plus. I expect that to change in the next few years so that phone voice assistants will be able to reliably do more than just set a timer or call a contact.

What sets Siri, Alexa and Google Assistant apart from ChatGPT is that they can control the phone. They can open apps, make calls and send messages, and they are deeply integrated into the phone’s operating system. ChatGPT, and especially ChatGPT Plus, is much smarter, but it’s trapped in an app.

A phone assistant with ChatGPT’s smarts, integration with the phone’s operating system and the ability to call functions would be a game changer.

In addition to assistants, I expect to see LLMs become a standard part of many apps, as has already happened in Microsoft 365, Notion, Photoshop and others.

LLM Agents

Today, most uses of LLMs treat the model as a source of information or as a copywriter.

A more powerful approach is to treat the model as an agent with a task. AutoGPT and BabyAGI are frameworks for this.

In this approach, the LLM is part of a larger AI system:

A human provides a directive
The directive is committed to memory, such as a text file or database
The LLM is called with the directive as input, along with the current state of the system and available choices
The LLM can call copies of itself recursively to work on subtasks (e.g. “look up a term on Wikipedia”, “find a photo on Unsplash”)
This continues until the task is achieved and the LLM returns a result
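
The steps above can be sketched as a small loop. This is an illustration under my own assumptions, not how AutoGPT or BabyAGI are implemented: the `FINISH:` and `SUBTASK:` prefixes are an invented protocol, `fake_llm` is a scripted stand-in for a real model call, and a Python list plays the role of the memory store.

```python
from typing import Callable

def run_agent(directive: str,
              llm: Callable[[str, list[str]], str],
              max_steps: int = 10) -> str:
    """Call the LLM in a loop until it reports a result or the budget runs out."""
    memory = [f"DIRECTIVE: {directive}"]  # stands in for a text file or database
    for _ in range(max_steps):
        action = llm(directive, memory)
        if action.startswith("FINISH:"):
            return action[len("FINISH:"):].strip()
        if action.startswith("SUBTASK:"):
            subtask = action[len("SUBTASK:"):].strip()
            # Recursive call: a fresh copy of the agent works on the subtask.
            # The smaller budget guarantees that the recursion terminates.
            result = run_agent(subtask, llm, max_steps - 1)
            memory.append(f"RESULT of '{subtask}': {result}")
        else:
            memory.append(f"NOTE: {action}")
    return "gave up: step budget exhausted"

def fake_llm(directive: str, memory: list[str]) -> str:
    """Scripted stand-in for a real model (an assumption for this sketch)."""
    if directive == "look up a term on Wikipedia":
        return "FINISH: an LLM is a large language model"
    if len(memory) == 1:
        return "SUBTASK: look up a term on Wikipedia"
    return "FINISH: " + memory[-1]

print(run_agent("explain LLMs", fake_llm))
```

The `max_steps` budget is a crude guard: without some limit like it, nothing stops the loop from running forever.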

The combination of LLM reasoning, recursive calls, memory and the ability to call APIs makes this approach very powerful.

However, real-world results have fizzled, for these reasons:

Never-ending loops
Needing too much babysitting to be useful, basically doing the easy part of any task and leaving the hard part to humans
Producing generic, lame results
Trouble with parsing information on the web

The potential is incredible, but there’s still a lot of work to be done.

A ceiling on the “bigger is better” trend

GPT-4, currently the most capable LLM overall, is rumored to have 1.7 trillion parameters. Will the “bigger is better” and “more data is better” trends continue? For text, the answer is probably no. GPT-4 is believed to have been trained on a large share of the text publicly available on the internet, so in terms of volume, there’s not much more text to train on.

An alternative to getting even more text is to improve the quality of the text used for training. Common Crawl, widely assumed to be a major component of GPT-4’s training data, is full of spam and low-quality content. With less noise, models may also need fewer parameters to achieve the same performance.

Multimodal models

While models are hitting the limit on text, there’s still a massive amount of images, video and audio available on the internet waiting to be used for training. Multimodal models, meaning models that can process multiple types of data, are already here. The addition of image recognition to ChatGPT has unlocked a new level of capabilities, such as interpreting diagrams, assisting blind people or diagnosing repair issues.

Multilingual or non-English LLMs

Current LLMs work best on English text. While other languages work decently with OpenAI’s GPT models, performance in open source models like Llama 2 is lacking.

The economic incentive to train LLMs on non-English text is huge. As an example, I’m excited about the recent publication of LeoLM, a German LLM, and the ongoing Aya project by Cohere.

Besides the models themselves, tokenization could benefit from a multilingual approach. As the majority of training data is in English and other languages written in the Latin alphabet, tokenization is optimized for those languages. As a result, Chinese, Arabic and other languages written in different scripts are split into more tokens for the same content, which makes them less efficient and more expensive to process.
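
A rough way to see where the gap starts: byte-level tokenizers, such as the byte-level BPE used by GPT-2, build tokens up from UTF-8 bytes, and non-Latin scripts need more bytes per character to begin with. The sketch below only counts bytes. It is a proxy, not a token count, since the real number of tokens depends on the merges a given tokenizer has learned. The sample words are my own.

```python
# UTF-8 bytes per character as a rough proxy for tokenizer cost.
samples = {
    "English": "hello",
    "Arabic": "مرحبا",
    "Chinese": "你好",
}

def bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes needed per character of text."""
    return len(text.encode("utf-8")) / len(text)

for language, word in samples.items():
    print(f"{language}: {bytes_per_char(word):.1f} bytes per character")
```

A tokenizer trained mostly on English will have learned merges that collapse common English byte sequences into single tokens, while leaving text in other scripts closer to this raw byte count.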

Edge computing and efficiency

The deployment of LLMs is currently held back by their compute demands. Running models like Llama 2 7B requires a top-of-the-line GPU, and larger models like Llama 2 70B require a GPU cluster. So LLMs are typically deployed on cloud servers rather than on edge devices.
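
A back-of-the-envelope calculation shows why. The sketch below estimates the memory needed just to hold the weights, assuming 16 bits per parameter as a common default and 4 bits as an example of a compressed format. It ignores activations and other runtime overhead, so real requirements are higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

for name, n_params in [("Llama 2 7B", 7e9), ("Llama 2 70B", 70e9)]:
    for bits in (16, 4):
        print(f"{name} at {bits}-bit: {weight_memory_gb(n_params, bits):.1f} GB")
```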

Developers and researchers are working on reducing the compute demands of LLMs through techniques such as quantization, sparse matrices, pruning and distillation. The MIT HAN Lab in particular is taking a lead on this.
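
To give a feel for the simplest of these techniques, here is a toy version of symmetric int8 quantization in plain Python. It is a sketch of the basic idea only: production methods work on tensors, often use per-channel or per-group scales, and are more careful about outliers. The example weights are made up.

```python
def quantize_int8(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale factor."""
    # Fall back to a scale of 1.0 if all weights are zero.
    scale = max(abs(w) for w in weights) / 127 or 1.0
    return [round(w / scale) for w in weights], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    """Recover approximate float weights from the integers."""
    return [v * scale for v in q]

weights = [0.5, -1.27, 0.0, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# The worst-case rounding error is about half of one quantization step.
print(q, max(abs(a - b) for a, b in zip(weights, restored)))
```

Each weight now fits in one byte instead of two or four, at the price of a small rounding error.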

I expect these techniques to become more widespread and more effective in the next few years, making it possible to deploy LLMs on edge devices like smartphones and laptops, at lower cost and without the privacy concerns of the cloud. Apple’s recent announcement of better text prediction in iOS 17 by using a transformer model on device is an example of this trend, though the model isn’t large enough to be considered an LLM.

Efficient training of specialized models

In “Against LLM maximalism”, spaCy creator Matthew Honnibal argues that LLMs are not the best choice for all NLP tasks, citing speed, cost, observability, lack of modularity and measurement difficulties as reasons. He argues that smaller models trained on specialized data are often a better choice.

In economic terms, running a 1.7T parameter model on a GPU cluster when a 10M parameter model on a CPU would do the job is wasteful.

But it’s not an either-or situation: LLMs can be used to accelerate the training of specialized models. I’m excited about Explosion AI’s work on integrating LLM-produced labels into labeling with Prodigy and expect to see similar developments in other labeling tools.

Rather than LLMs replacing specialized models, I expect to see them used to accelerate the training of specialized models and an overall increase in the number of models in production.
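
As a toy illustration of that workflow: an LLM labels raw text once, and a tiny model trained on those labels handles predictions from then on. Everything here is an assumption made for the sketch. `llm_label` is a keyword rule standing in for a real LLM call, the word-count classifier is deliberately simplistic, and none of this reflects Prodigy’s actual API. In a real labeling tool, a human would review the LLM’s suggestions before training.

```python
from collections import Counter, defaultdict

def llm_label(text: str) -> str:
    """Hypothetical stand-in for an LLM call that returns a sentiment label."""
    return "positive" if any(w in text for w in ("great", "love")) else "negative"

def train(texts: list[str], labels: list[str]) -> dict:
    """Count how often each word appears under each label."""
    counts = defaultdict(Counter)
    for text, label in zip(texts, labels):
        counts[label].update(text.lower().split())
    return counts

def predict(counts: dict, text: str) -> str:
    """Pick the label whose training texts shared the most words with the input."""
    words = text.lower().split()
    return max(counts, key=lambda label: sum(counts[label][w] for w in words))

unlabeled = ["great phone love it", "terrible battery",
             "love the screen", "broken and slow"]
labels = [llm_label(t) for t in unlabeled]  # the expensive step, done once
model = train(unlabeled, labels)            # the cheap model used afterwards
print(predict(model, "slow and terrible"))
```

The expensive model runs once per training example, while the cheap model runs on every request in production.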

Conclusion: Hype to quiet productivity

AI is whatever hasn’t been done yet. - Larry Tesler

In the long run, I expect that LLMs will follow the AI effect, much like spell checking and translation, which initially stood out as novel AI features but are now seen as standard features of software, quietly delivering value to users.