Caktus Builds the World’s Best Academic AI Model With Multimodal
5 billion tokens
Training dataset
2x increase
Maximum sequence length (from 4k to 8k)
89.73
Eguana's average performance across 4 academic benchmarks
“Partnering with Multimodal allowed us to put together the next-generation barrier-breaking system we needed to overcome the obstacles we faced and start growing in a sustainable way.”
Challenge
Using a generic LLM (GPT-3):
- limited Caktus’ scalability;
- hindered its ability to attract investors; and
- gave the company no control over its product’s intellectual property.
It also proved to be too inaccurate for academia.
Solution
- Continual pre-training of Llama 2 (7B) on 5 billion tokens
- Fine-tuning Llama 2 (7B) on academic downstream tasks
- Enabling additional capabilities, such as ingestion of user-uploaded materials
Results
- The world’s best academic AI model — Eguana outperforms all major LLMs on academic benchmarks
- Expanded MSL from 4k to 8k
- 3 specialist models for downstream academic tasks
- Tens of thousands of paid users acquired
- Product offering expanded — Caktus now serves both educators and students and has released Eguana on a standalone platform
Summary
Caktus is an edtech company providing a cutting-edge AI academic tool that helps students study, write papers, and prepare job applications. When it first launched, it was powered by OpenAI’s GPT-3.
Caktus wanted to replace GPT-3 with a proprietary Generative AI model specialized for academia. They partnered with Multimodal, and we developed the initial versions of the model in just 3 months.
Today, Eguana surpasses all major large language models (LLMs) on academic benchmarks — and is currently the world’s best LLM for academia.
Caktus’ founders, Tao Zhang and Harrison Leonard, first launched their GPT-3 powered academic tool in March 2022. Its goal was to help students solve academic tasks more efficiently using AI.
However, the pair soon realized that using an off-the-shelf model was hindering their progress. GPT-3 was the main bottleneck, and building a proprietary Generative AI model became Caktus’ main priority. In December 2022, they partnered with Multimodal to achieve that goal.
Off-The-Shelf Models Are Too Expensive, Volatile, and, Surprisingly, Readily Available
Caktus wanted to overcome the challenges that came with using off-the-shelf models developed and, essentially, owned by other companies. The three main challenges were as follows:
- Limited and overly expensive scalability. OpenAI, like many other LLM providers, uses token-based pricing. The more people used Caktus.ai, the more it cost the company. Caktus was quickly hitting token limits and was, in effect, forced to slow down in order to cut costs. From Caktus’ perspective, it was nearly impossible to scale and become profitable on a non-proprietary model.
- Lack of control. “When you’re using someone else’s model, you never know where it’s going,” says Zhang. He mentions feeling blindsided by OpenAI’s release of ChatGPT, which was similar to Caktus’ early-stage product. In a way, Caktus ended up competing with its own LLM provider, while lacking the resources to do so on an equal footing. More broadly, using third-party models strips companies of control over major components of their products.
- No differentiation. Finally, off-the-shelf models are available to everyone, which can become a significant obstacle for companies seeking outside investments. In Zhang’s words, “professional investors don't like to participate in companies that lack proprietary technology” since they expect the market to become too competitive soon. Caktus experienced this response from investors first-hand.
Some of Caktus’ features are still similar to ChatGPT’s. Since the product was also built on similar technology, Caktus urgently needed a point of differentiation to attract investors and customers.
It became clear that Caktus needed a Generative AI model they could evolve themselves and use without significant recurring expenses. This made them turn to Multimodal and start building Eguana — currently the world’s best academic large language model (LLM).
“Number one, we want to build a scalable system. OpenAI is a good starting point, but when you're getting millions of users, that adds up to a pretty big sum very quickly. So, we wanted to really figure out a system where we can optimize usage. And Eguana seemed like the perfect solution.” - Tao Zhang
Turning a Generic Model into Proprietary, Advanced AI
Caktus’ goals were twofold:
- Build a proprietary model specialized for academia.
- Outperform generalist models to deliver added value and attract investors and users.
To conserve resources, we decided against building a model from scratch and instead fine-tuned a generic LLM on academic tasks using academic data. This let us directly target the biggest weakness of generalist models: poor performance on specialist tasks, caused by all-purpose training on unvetted web data.
We aimed to mitigate these issues by improving upon Llama 2 (7B) in partnership with Cerebras and CORE. Llama 2 was chosen primarily for its accuracy and relatively small size; Cerebras acted as our hardware partner, and CORE as our dataset partner.
- Llama 2 (7B) is Meta’s open-source model. It performs on par with GPT-3.5 on various academic benchmarks and achieves a lower violation percentage than its competitors, all while being significantly smaller. That small size is one of its biggest assets: it requires fewer computational resources, enables faster inference, and lowers operational costs.
Llama 2 (7B) Architecture
- Cerebras is a training platform for Generative AI models. We used groundbreaking hardware from their Andromeda supercomputer to pre-train Llama 2 (7B).
- CORE is the world’s largest collection of open-access research papers. We used over 100 million academic papers from their repository to extend Llama 2 from a 4k Maximum Sequence Length (MSL) to an 8k MSL, as sketched below.
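The case study doesn’t specify the exact mechanism behind the 4k-to-8k extension. One common approach with Llama-family models, shown here purely as an assumption, is linear RoPE position interpolation followed by continued pre-training on long sequences; the Hugging Face transformers usage below is illustrative, not confirmed by Caktus or Multimodal.

```python
# Assumed approach, not confirmed by the source: double a Llama-2-style
# context window via linear RoPE scaling, then continue pre-training on
# long documents so the model adapts to the interpolated positions.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("meta-llama/Llama-2-7b-hf")
config.max_position_embeddings = 8192                    # 4k -> 8k MSL
config.rope_scaling = {"type": "linear", "factor": 2.0}  # stretch positions 2x

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf", config=config
)
# ...continued pre-training on long academic papers would follow here.
```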
Preparing CORE academic papers for model training required complex data engineering on our end. Here’s exactly what was done to turn the papers into input data (a simplified sketch follows the list):
Turning CORE research papers into model training data
1. Adjusting the data
2. Extracting relevant data
3. Performing OCR
4. Structuring the data
5. Cleaning the data
6. Normalizing the data
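To make these steps concrete, here is a minimal, runnable sketch in Python. The helper logic is assumed for illustration only (the actual Multimodal pipeline is not public), and OCR is treated as an upstream step that has already produced raw text.

```python
# Illustrative sketch of the preparation steps listed above; the real
# pipeline is not public. OCR (step 3) is assumed to have run upstream.
import re
import unicodedata

def prepare_paper(raw_text: str) -> str:
    """Turn one extracted paper into clean training text."""
    # Extract the relevant portion (e.g., drop the reference section).
    body = raw_text.split("\nReferences\n")[0]
    # Structure: repair hyphenated line breaks common in PDF extraction.
    body = re.sub(r"-\n(\w)", r"\1", body)
    # Clean: collapse repeated whitespace and excess blank lines.
    body = re.sub(r"[ \t]+", " ", body)
    body = re.sub(r"\n{3,}", "\n\n", body)
    # Normalize: fold unicode variants (ligatures, full-width chars).
    return unicodedata.normalize("NFKC", body).strip()

print(prepare_paper("Intro-\nduction  text\n\n\n\nReferences\n[1] ..."))
```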
For pre-training purposes, we used Cerebras’ hardware and a total of 5 billion tokens from CORE academic papers. We performed continual pre-training, i.e., pre-trained the model on a continuous stream of data over time, rather than pre-training it once on a static dataset.
This increased the model’s accuracy, enhanced its real-time responsiveness to changing data, and improved its learning abilities.
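As a hedged sketch of what one continual pre-training update can look like: the real run used Cerebras’ Andromeda hardware and its own software stack, so the PyTorch/transformers code below only illustrates the idea of updating the model on a stream of fresh batches rather than one pass over a frozen corpus.

```python
# Illustrative only: the actual run used Cerebras' stack, not this code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer.pad_token = tokenizer.eos_token  # Llama ships without a pad token
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def continual_step(texts: list[str]) -> float:
    """One causal-LM update on a fresh batch of academic text."""
    batch = tokenizer(texts, return_tensors="pt", padding=True,
                      truncation=True, max_length=8192)
    labels = batch["input_ids"].clone()
    labels[batch["attention_mask"] == 0] = -100  # don't score pad positions
    loss = model(**batch, labels=labels).loss    # next-token prediction loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Called repeatedly as new slices of the 5-billion-token corpus arrive.
```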
Next, we assembled a flexible GPU infrastructure spanning several providers, including Google Cloud Platform, CoreWeave, and Modal Labs, for model fine-tuning. We used thousands of well-curated instructions to fine-tune the model on several downstream tasks: short-form and long-form essay generation with citations, and chat interaction.
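The exact instruction format is not public. A common Alpaca-style template, shown purely as an assumption, illustrates how one curated instruction might be rendered into a supervised fine-tuning example:

```python
# Assumed Alpaca-style template; the actual format Multimodal used is not
# public. Each curated instruction becomes one supervised training example.
TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{context}\n\n"
    "### Response:\n{response}"
)

example = TEMPLATE.format(
    instruction="Write a short essay with citations on the given topic.",
    context="Topic: the role of peer review in scientific publishing.",
    response="Peer review serves as ...",  # target text the model learns
)
```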
Finally, we enhanced the model with additional capabilities that provide a better and more personalized user experience. For example, the chat model can access and leverage chat history and user-uploaded course materials to provide more accurate outputs.
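As a sketch of what such personalization can look like at inference time (the design and helper names below are hypothetical, not Caktus’ actual implementation), chat history and snippets of uploaded course material can simply be assembled ahead of the new question:

```python
# Hypothetical prompt assembly, not Caktus' actual implementation: inject
# chat history and user-uploaded course material ahead of the new question.
def build_prompt(question: str,
                 history: list[tuple[str, str]],
                 course_snippets: list[str]) -> str:
    materials = "\n".join(f"[Course material] {s}" for s in course_snippets)
    turns = "\n".join(f"Student: {q}\nEguana: {a}" for q, a in history)
    return f"{materials}\n\n{turns}\nStudent: {question}\nEguana:"

prompt = build_prompt(
    "How does this connect to last week's lecture?",
    history=[("What is entropy?", "Entropy measures disorder ...")],
    course_snippets=["Lecture 3: the second law of thermodynamics ..."],
)
```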
Eguana Outperforms All Major LLMs, and Caktus Is Seeing the Returns
The new base model we built for Caktus, named Eguana, is a significant improvement over Llama 2 (7B), with one of its main upgrades being the increased Maximum Sequence Length (MSL).
Compared to Llama 2 (7B), Eguana can handle two times longer sequences, which allows students to ask more detailed questions and receive more in-depth answers through a chat-like interface.
Eguana Architecture
Eguana also showed significantly better performance on the four most important metrics for academia: relevance, clarity, structure, and flow. It surpassed not only Llama 2 (7B), but also all other major LLMs released at the time, including GPT-3.5 Turbo, GPT-4, and Vicuna 1.5 (13B).
Eguana’s average score on all four benchmarks is 89.73, significantly better than all competitor LLMs we tested.
Further fine-tuning resulted in three models, each specializing in a specific downstream task and achieving greater accuracy.
As Zhang says, Eguana helped Caktus.ai become a premium product, providing the “much-needed tech backbone” to what used to be, essentially, a GPT-3 wrapper app. Caktus.ai was now able to offer users a higher-quality experience and better outputs. This, in turn, attracted investors, kickstarted the switch to a paid business model, and increased user retention.
Our joint efforts also allowed Caktus to expand its product offerings.
- Caktus.ai now also offers features for educators, allowing them to create classes, monitor student usage, track students’ progress, and upload course materials. This enables the model to answer students’ questions in each educator’s own teaching style.
- Caktus now also offers access to Eguana to anyone who wants to build academically correct AI apps without having to train AI models themselves.
According to Zhang, these results are slowly but surely helping Caktus transition into a technology company. Simultaneously, they’re enabling even non-experts to quickly build reliable AI apps for academia. “That’s easy to do when you have an excellent tech foundation, and we finally really have that,” concludes Zhang.
“Over the period of such a short time, Multimodal delivered something that is traditionally impossible for a software company of any scale. Plus, they have the know-how needed to build AI that outperforms even major industry players. That's why we intend to keep building together.” - Tao Zhang
Schedule a free, 30-minute call
Explore how our AI Agents can help you unlock enterprise-wide automation.
- See how AI Agents work in real time
- Learn how to apply them to your business
- Discuss pricing & project roadmap
- Get answers to all your questions