With so many large language models available today, it can be difficult to choose one that’s right for you. What aspects or features should you even consider?
This guide will shed some light on the topic.
We selected the best large language models in 2023 based on a number of factors, from model performance to open-source access. Let’s quickly review our criteria and jump right to the selection.
Criteria
From the beginning, we knew we wanted to consider several key criteria when evaluating different large language models, namely:
- the number of parameters,
- the size and quality of training data, and
- the overall model performance – model accuracy, toxicity risks, etc.
However, to make the list useful to organizations and individuals with varying needs, we’ve also taken note of special model-specific features.
For example, we’ve considered questions like:
- Can the large language model process only text, or can it handle other data formats as well?
- Does the model support multiple languages?
- Is the model open-source?
Other factors could have been considered as well, such as model interpretability or generalization abilities. However, information about these factors is usually not publicly or readily available, so we haven’t included them among our key criteria.
Additionally, we also wanted to include smaller, lightweight models. Although their performance may be significantly lower than that of large-scale models like GPT-4, they typically require fewer resources and may be a better fit for cost-conscious users. Including them was our attempt to cater to a wider range of user profiles.
13 Best Large Language Models In 2023
Without further ado, here’s our selection of the thirteen best large language models in 2023.
1. OpenAI: GPT-4
- Released in: March 2023
- Number of parameters: N/A (could be more than 1 trillion, according to some sources)
- Size of training data: N/A (could be around 45GB, according to some sources)
- Special features: multimodal, multilingual
GPT-4 is a transformer-based language model trained to predict the next token in a text. It is also a multimodal model, which means it can process different forms of data. More specifically, it can process text and image inputs.
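In practice, you access GPT-4 through OpenAI’s API. Here is a minimal sketch using the 2023-era openai Python package (pre-1.0 interface); the prompt and generation settings are illustrative, and you need your own API key and GPT-4 access:

```python
# Minimal sketch: querying GPT-4 via OpenAI's Python client (pre-1.0 interface).
# Assumes you have an API key and access to the "gpt-4" model.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

response = openai.ChatCompletion.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Summarize the benefits of multimodal language models in two sentences."},
    ],
    temperature=0.7,   # illustrative setting
    max_tokens=200,
)

print(response["choices"][0]["message"]["content"])
```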
When it comes to its performance, GPT-4 surpasses previous models in its ability to understand and act on user intent. For example, evaluators preferred its responses over those generated by GPT-3.5 on 70.2% of prompts.
GPT-4 has also shown human-like performance on multiple academic and professional exams, like the GRE and SAT, and even passed the Uniform Bar Examination with a score in the top 10% of test takers.
Besides showing exceptional performance on diverse English-centric tasks, GPT-4 has also demonstrated strong multilingual capabilities. It has surpassed other SOTA LLMs, like Chinchilla and PaLM, in 24 of 26 languages considered, including low-resource languages such as Latvian, Welsh, and Swahili.
Overall, GPT-4 is one of the most impressive and versatile LLMs on the market.
2. NVIDIA x Microsoft: MT-NLG
- Released in: October 2021
- Number of parameters: 530 billion
- Size of training data: 338.6 billion tokens
- Special features: SOTA training techniques (like multi-GPU model parallelism)
At the time of its release, the Megatron-Turing Natural Language Generation model (MT-NLG) was “the largest and the most powerful” monolithic LLM ever developed.
Training such a large model was challenging. The parameters couldn’t fit in the memory of even the largest GPU, and the training would take unrealistically long with “traditional” training methods.
NVIDIA and Microsoft thus pioneered breakthrough training techniques, like intra-layer model parallelism. These techniques enabled them to accelerate training and alleviate memory pressure without rewriting the model, setting a new standard for training transformer-based language models.
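To build a rough intuition for intra-layer (tensor) model parallelism, here is a toy NumPy sketch (not NVIDIA and Microsoft’s actual implementation): the layer’s weight matrix is split column-wise across simulated devices, each computes a partial output, and the shards are concatenated to recover the full result.

```python
# Toy illustration of intra-layer (tensor) model parallelism with NumPy.
# Each "device" holds only a column shard of the layer's weight matrix,
# computes its partial output, and the shards are concatenated at the end.
# Conceptual sketch only, not the Megatron-Turing training system.
import numpy as np

rng = np.random.default_rng(0)
batch, d_in, d_out, n_devices = 4, 8, 16, 2

x = rng.standard_normal((batch, d_in))   # layer input (replicated on every device)
W = rng.standard_normal((d_in, d_out))   # full weight matrix (too big for one device, in spirit)

# Split the weight matrix column-wise into one shard per device.
shards = np.split(W, n_devices, axis=1)

# Each device multiplies the same input by its own shard only.
partial_outputs = [x @ shard for shard in shards]

# Concatenating the partial outputs reproduces the full layer output.
y_parallel = np.concatenate(partial_outputs, axis=1)
assert np.allclose(y_parallel, x @ W)
```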
MT-NLG also outperformed other large-scale language models at the time on a wide range of NLP tasks. These tasks include:
- completion prediction,
- reading comprehension,
- commonsense reasoning,
- natural language inference, and
- word sense disambiguation.
For example, its reading comprehension performance was benchmarked against that of GPT-3 and Gopher, and its commonsense reasoning was assessed on three different benchmarks.
With that in mind, MT-NLG is one of the most accurate and powerful LLMs currently available.
3. Google: PaLM
- Announced in: April 2022
- Number of parameters: 540 billion
- Size of training data: 780 billion tokens
- Special features: SOTA training system (Pathways), multilingual, exceptional generalization abilities
The Pathways Language Model (PaLM) was trained using Google’s proprietary Pathways system, which allows for efficient training across multiple TPU v4 Pods.
The model’s few-shot performance surpasses that of GPT-3, MT-NLG, and Gopher on 28 out of 29 tested English NLP tasks, including natural language inference, question answering, and reading comprehension.
PaLM also showed strong performance on non-English natural language tasks, despite only 22% of its corpus being non-English. Additionally, it performed exceptionally well on coding tasks.
According to Google, PaLM’s few-shot performance on coding tasks was on par with Codex – OpenAI’s large language model specifically fine-tuned for coding – despite using 50 times less Python code for training.
This demonstrates PaLM’s remarkable generalization abilities and makes it a prime example of Google’s main objective for Pathways: training language models that can perform thousands or millions of different tasks instead of just a few.
4. AI21: Jurassic-1-Jumbo
- Released in: August 2021
- Number of parameters: 178 billion
- Size of training data: 300 billion tokens
- Special features: the first to use multi-word tokens
Like all of AI21’s Jurassic-1 models, Jurassic-1-Jumbo uses multi-word tokens such as expressions, phrases, and named entities alongside single-word tokens. This enables the models to represent a given amount of text with fewer tokens, which in turn significantly improves their computational efficiency and reduces latency.
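As a purely illustrative sketch of why multi-word tokens help (the toy greedy tokenizer and tiny vocabulary below are made up and are not AI21’s actual tokenizer), note how a single multi-word entry covers text that would otherwise cost several tokens:

```python
# Toy greedy longest-match tokenizer illustrating multi-word tokens.
# The vocabularies below are invented for illustration; this is not AI21's tokenizer.
def tokenize(text, vocab):
    words = text.lower().split()
    tokens, i = [], 0
    while i < len(words):
        # Try the longest possible span first (greedy longest match).
        for span in range(len(words), i, -1):
            candidate = " ".join(words[i:span])
            if candidate in vocab:
                tokens.append(candidate)
                i = span
                break
        else:
            tokens.append(words[i])  # fall back to a single-word token
            i += 1
    return tokens

single_word_vocab = {"new", "york", "stock", "exchange", "opened", "higher"}
multi_word_vocab = single_word_vocab | {"new york stock exchange"}

text = "New York Stock Exchange opened higher"
print(tokenize(text, single_word_vocab))  # 6 tokens
print(tokenize(text, multi_word_vocab))   # 3 tokens: fewer tokens for the same text
```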
Third-party tests show that Jurassic-1-Jumbo can answer Jeopardy-style WCT questions with a 55.4% accuracy rate. This seems impressive when compared to the 52% accuracy of human participants, but falls short of the 73% accuracy rate of GPT-3.
However, Jurassic-1-Jumbo outperforms GPT-3 on several popular benchmarks for evaluating large language models, including StoryCloze, RTE, and RACE-middle. These benchmarks test understanding of causal relationships and context, understanding of logical relationships, and reading comprehension respectively.
5. HuggingFace x BigScience: BLOOM 176B
- Released in: July 2022
- Number of parameters: 176 billion
- Size of training data: 1.6TB of pre-processed text, converted into 350 billion tokens
- Special features: Open-access, multilingual
The biggest model in the BLOOM series has 176 billion parameters and can produce text in 46 natural languages and 13 programming languages.
It is primarily aimed at text generation, information extraction, question answering, and summarization, but it can be used for other natural language processing tasks as well. In evaluations, the model performs almost on par with other, perhaps more popular LLMs in terms of fairness, robustness, and even accuracy.
However, the biggest advantage of BLOOM may be that it is open-source and publicly available to all. You can download a pre-trained BLOOM checkpoint here and run it yourself, provided you have sufficient hardware (or pick one of the smaller BLOOM variants).
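For example, here is a minimal sketch of running a BLOOM checkpoint with the Hugging Face transformers library. The full 176B model needs hundreds of gigabytes of memory, so the sketch uses the small bigscience/bloom-560m variant; the prompt and sampling settings are illustrative.

```python
# Minimal sketch: generating text with a BLOOM checkpoint via Hugging Face transformers.
# The full 176B model needs hundreds of GB of memory, so this uses the small
# bigscience/bloom-560m variant for illustration; the prompt is arbitrary.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Large language models are useful because"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```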
6. BAAI: Wu Dao 2.0
- Released in: June 2021
- Number of parameters: 1.75 trillion
- Size of training data: 4.9TB of text and image data
- Special features: primarily trained in the Chinese language, secondarily trained on English datasets; multimodal
Wu Dao 2.0 is the successor to Wu Dao 1.0, the first Chinese large-scale language model. It is likely the largest language model developed to date, although we don’t know for sure how many parameters GPT-4 has.
Such a large model is expected to show incredible performance on various NLP tasks, and this is certainly the case with Wu Dao 2.0. The model surpasses state-of-the-art (SOTA) levels on 9 benchmarks, namely:
- ImageNet zero-shot – surpassed OpenAI CLIP
- LAMA knowledge detection – surpassed AutoPrompt
- LAMBADA Cloze – surpassed Microsoft Turing-NLG
- SuperGLUE few-shot (FewGLUE) – SOTA, surpassing GPT-3
- UC Merced Land-Use zero-shot – SOTA, surpassing OpenAI CLIP
- MS COCO text-to-image generation – surpassed OpenAI DALL-E
- MS COCO English image retrieval – surpassed OpenAI CLIP and Google ALIGN
- MS COCO multilingual image retrieval – surpassed UC2 and M3P
- Multi30K multilingual image retrieval – surpassed UC2 and M3P
What’s more, Wu Dao 2.0 is one of the few truly multimodal language models. Not only can it process inputs in different formats, but it can also generate text and image outputs.
Today, it powers Hua Zhibing, China's first AI university student. Hua can paint, dance, compose music, and much more.
7. DeepMind: Gopher
- Released in: December 2021
- Number of parameters: 280 billion
- Size of training data: 10.5TB corpus
- Special features: companion retrieval model (RETRO) with an external memory database
According to DeepMind’s research, Gopher performed significantly better on a range of NLP tasks from the MMLU benchmark compared to other existing LLMs at the time of its release.
In fact, alongside Gopher, DeepMind introduced RETRO (Retrieval-Enhanced Transformer), a retrieval-enhanced model that incorporates an external memory database and, according to DeepMind, can match the performance of neural networks 25x its size.
The external database acts as a cheat sheet. The model can retrieve relevant information from it on the fly instead of storing all of that information in its internal parameters. That way, it can generate more accurate predictions while requiring less training than LLMs that rely solely on their internal knowledge.
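To make the cheat-sheet idea concrete, here is a deliberately simplified retrieval-augmented sketch, nothing like DeepMind’s actual RETRO architecture: a nearest-neighbour lookup over a tiny in-memory database whose best match is simply prepended to the prompt. The passages, query, and embedding model are illustrative choices.

```python
# Deliberately simplified retrieval-augmented sketch (not DeepMind's RETRO):
# embed a tiny "external database" of passages, retrieve the nearest one to the
# query, and prepend it to the prompt so the model need not memorize the fact.
from sentence_transformers import SentenceTransformer, util

database = [
    "Gopher is a 280-billion-parameter language model released by DeepMind in 2021.",
    "BLOOM is an open-access multilingual model with 176 billion parameters.",
    "The Eiffel Tower is located in Paris, France.",
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")   # small off-the-shelf embedding model
db_embeddings = encoder.encode(database, convert_to_tensor=True)

query = "How many parameters does Gopher have?"
query_embedding = encoder.encode(query, convert_to_tensor=True)

# Retrieve the most similar passage from the external database.
best_idx = util.cos_sim(query_embedding, db_embeddings).argmax().item()
retrieved = database[best_idx]

# Prepend the retrieved passage to the prompt for any generator model.
augmented_prompt = f"Context: {retrieved}\nQuestion: {query}\nAnswer:"
print(augmented_prompt)
```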
Between Gopher’s raw performance and RETRO’s retrieval-based efficiency, DeepMind’s lineup is one of a kind and among the most efficient available.
8. Meta: LLaMA 65B
- Released in: February 2023
- Number of parameters: 65 billion
- Size of training data: 1.4 trillion tokens
- Special features: open-source
Meta’s LLaMA series features four models ranging from 7 to 65 billion parameters. All LLaMA models were trained on text in 20 languages, using exclusively publicly available data.
LLaMA 65B surpassed PaLM on most code generation tasks and proved to be competitive with GPT-3, Gopher, and Chinchilla on closed-book question-answering tasks – despite being 5 to 10 times smaller.
With that in mind, LLaMA models may currently be the best choice for users seeking a highly efficient and resource-conscious model without compromising on performance. In fact, this was precisely Meta’s goal with LLaMA: to push the boundaries of what smaller models can do.
Unfortunately, at the time of writing, Meta grants access to LLaMA for research purposes only, but you can apply for access here.
9. LMSYS: Vicuna
- Released in: March 2023
- Number of parameters: 13 billion
- Size of training data: 70K samples of user-shared conversations
- Special features: open-source chatbot
Vicuna is an open-source chatbot trained by fine-tuning LLaMA on 70K user-shared ChatGPT conversations. It performs better than Stanford’s Alpaca on 8 out of 9 tested NLP tasks, including coding, writing, and commonsense reasoning tasks.
It also performs significantly better than both LLaMA 13B and Alpaca on math tasks.
However, LMSYS does note that current research is not definitive and that more needs to be done. You can make your own judgment about Vicuna’s performance by chatting with it here.
10. BAIR: Koala-All
- Released in: April 2023
- Number of parameters: 13 billion
- Size of training data: N/A
- Special features: chatbot
Koala is described as a dialogue model for academic research, fine-tuned from Meta’s LLaMA 13B. There are two models available: Koala-Distill and Koala-All.
Koala-All was trained using ChatGPT distillation data as well as open-source data, such as the OpenAI WebGPT and Stanford Alpaca datasets.
It was tested on two test sets: Stanford’s Alpaca set and BAIR’s own Koala Test Set. On the Alpaca set, Koala-All performed comparably to Alpaca, while on the Koala Test Set it exceeded Alpaca in nearly half the cases.
However, Koala’s biggest strength is that it performs comparably to ChatGPT in BAIR’s tests while being significantly lighter and open-source. The release also includes EasyLM, a framework for pre-training, fine-tuning, serving, and evaluating large language models.
You can try Koala yourself here.
11. Databricks: Dolly 2.0
- Released in: April 2023
- Number of parameters: 12 billion
- Size of training data: 15,000 human-generated prompt/response pairs
- Special features: open-source, instruction-following, open-source dataset
Dolly 2.0 is a large language model based on the EleutherAI Pythia model family and fine-tuned to follow human-generated instructions, i.e., for ChatGPT-like interactions.
Its performance has been assessed on the usual benchmarks, which include reading comprehension, commonsense reasoning, and similar tasks. While it performed well, its results may not be exactly up to par with many other models reviewed here.
However, Databricks openly states that Dolly 2.0 was never envisioned as a state-of-the-art model in the first place. Instead, it’s primarily created to give smaller teams a way to build LLMs without needing huge budgets or extensive resources – something that can’t be done with models like GPT-4.
The model is revolutionary because it’s completely open-source and available for both research and commercial use. This extends even to the model’s training dataset, called databricks-dolly-15k, which contains 15,000 human-generated prompt/response pairs. Anyone can use, modify, or extend it.
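As a quick sketch of how you could inspect the dataset and load the model with the Hugging Face libraries (the repository names below are the ones Databricks published on the Hugging Face Hub at the time of writing; the prompt is illustrative):

```python
# Sketch: inspecting the open databricks-dolly-15k dataset and loading Dolly 2.0
# with the Hugging Face libraries. Repository names are those Databricks
# published on the Hub at the time of writing; the prompt is illustrative.
from datasets import load_dataset
from transformers import pipeline

# 15,000 human-generated instruction/response pairs, free for commercial use.
dolly_data = load_dataset("databricks/databricks-dolly-15k", split="train")
print(dolly_data[0]["instruction"])
print(dolly_data[0]["response"])

# The 12B model itself (needs a large GPU; smaller Dolly v2 variants are also on the Hub).
generator = pipeline("text-generation", model="databricks/dolly-v2-12b", device_map="auto")
print(generator("Explain what an instruction-tuned language model is.")[0]["generated_text"])
```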
The project may well disrupt the industry, allowing teams to build LLMs faster and more cheaply than ever. You can find the model weights here.
12. LAION: Open Assistant
- Special features: chat-based, open-source
Open Assistant (OA) is a chat-based, community-driven assistant that aims to offer a viable open-source alternative to ChatGPT. The currently released models have been fine-tuned from Pythia and LLaMA, and are available in a browser-based chat format here.
Because OA is completely community-driven, there is no single parameter count, release date, or training dataset to report. Each individual or organization can fine-tune OA using their own datasets and contribute to its improvement, thus modifying its features.
Users can improve OA by completing tasks such as:
- Classifying assistant replies, which involves labeling the model responses as, for example, low- or high-quality, inappropriate, or spam.
- Replying as an assistant, which involves writing accurate, high-quality replies to given prompts.
- Ranking assistant replies, which involves ranking model responses from best to worst.
With some modifications, OA should also be able to connect with third-party apps and retrieve information from external databases or even the Internet. That way, it would overcome one of the most common LLM downfalls – the lack of specific or recent-enough data.
It certainly sounds like one of the most exciting LLM projects in 2023.
13. Google: Flan-UL2
- Released in: March 2023
- Number of parameters: 20 billion
- Size of training data: N/A
- Special features: open-source
Some consider Flan-UL2 to be the best available open-source LLM in 2023, and for good reason, too. Its performance on the MMLU and BigBench Hard benchmarks surpasses that of, for example, GPT-3.
The model has the same configuration as Google’s previous UL2 20B model, which excels at various NLP tasks, including language understanding, generation, and question answering.
However, it has been instruction fine-tuned with Flan, a technique that involves training a model on a collection of datasets phrased as instructions. The result? Flan-UL2 is more usable and generalizes far better than its base model.
Like all Flan models, Flan-UL2 is free to use and relatively inexpensive to run. However, it’s primarily tuned on academic tasks, which may limit its effectiveness in more personal or business use cases.
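Here is a minimal sketch of prompting Flan-UL2 with the Hugging Face transformers library. The checkpoint is published as google/flan-ul2; even at 20 billion parameters it needs tens of gigabytes of memory, and the instruction below is purely illustrative.

```python
# Minimal sketch: prompting the instruction-tuned Flan-UL2 checkpoint
# (google/flan-ul2) with transformers. The 20B model still needs tens of GB
# of memory; the instruction below is purely illustrative.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-ul2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

prompt = "Answer the following question. What is the capital of Latvia?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```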
Quick-Fire Tips: Considerations For Choosing A Model
Choosing the right model requires you to consider your needs and budgetary constraints first. Once you do, you’ll know exactly what to look for.
Here are a few guidelines to help you choose.
🔥 If you prioritize cost-effectiveness…
- Look for open-source models that can be accessed and customized without additional licensing fees.
- Choose models that require fewer computational resources or have lower ongoing operational expenses. Smaller models will usually require fewer computational resources and storage, but always check the model's documentation for more details (see the rough memory estimate below).
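As a rough back-of-the-envelope sketch (real memory use is higher once you count activations, context length, and serving overhead), you can estimate the weight footprint of a model from its parameter count and numeric precision:

```python
# Back-of-the-envelope estimate of how much memory a model's weights need.
# Real usage is higher (activations, KV cache, serving overhead); this only
# counts the parameters themselves at a given numeric precision.
def weight_memory_gb(n_parameters: float, bytes_per_parameter: int = 2) -> float:
    """bytes_per_parameter: 4 for fp32, 2 for fp16/bf16, 1 for int8."""
    return n_parameters * bytes_per_parameter / 1e9

for name, params in [("Dolly 2.0 (12B)", 12e9), ("LLaMA 65B", 65e9), ("MT-NLG (530B)", 530e9)]:
    print(f"{name}: ~{weight_memory_gb(params):.0f} GB in fp16")
```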
🔥 If you prioritize model performance…
- Choose models with a proven track record of high accuracy, fluency, and contextual understanding in specific tasks relevant to your use case.
- Consider models that allow fine-tuning based on your own data.
🔥 If you need a versatile model…
- Choose a large-scale model rather than a smaller model. Large-scale models usually have a higher capacity to learn and generalize from a wide range of data, making them more suitable for diverse tasks and domains.
- Consider multilingual or multimodal models. Multilingual models support a wide range of languages, while multimodal models can process and/or generate text alongside other data types, such as images, video, or audio.
🔥 If you’re worried about ethical considerations…
- Prioritize models that have undergone bias evaluation and demonstrate efforts to address fairness concerns.
- Check if the model provider has transparent privacy and data usage policies aligned with your ethical standards.
🔥 If you’re seeking long-term scalability and future-proofing…
- Consider models that can handle larger datasets.
- Choose providers that are committed to ongoing research, updates, and advancements in the NLP field.
🔥 If you have domain-specific requirements…
- Look for models that have been fine-tuned or designed specifically for your industry or field.
- Alternatively, work with companies that can customize LLMs for your specific needs – like us.
Customize The Best LLMs For Your Needs
Large language models work best when they’re customized to your specific requirements.
That’s why we take the best LLMs on the market and train them on your data and for your desired tasks, making them 10x more efficient.
Learn how we can help you develop your custom LLM without compromising your budget.