Top 5 open-source LLMs that can be used for question answering over private data

Exploring the Intricacies of the Top 5 LLMs: Uncovering Their Remarkable Contributions to Advancements in AI Technology

Mostafa Ibrahim
Avkalan.ai
8 min read · Jan 12, 2024


In the rapidly evolving landscape of artificial intelligence, Large Language Models (LLMs) have emerged as powerful tools for parsing and generating human-like text. These models, built on vast datasets and intricate algorithms, have revolutionized how we interact with data, especially in the realm of question-answering. This article delves into the top 5 open-source LLMs that are particularly effective for question answering over private data. Geared towards data scientists, engineers, and tech enthusiasts, this listicle provides a concise yet comprehensive overview of each model. By exploring these open-source marvels, we aim to provide a clear and informative guide to harness the power of LLMs for efficient and secure data handling in a variety of applications.

Top 5 LLMs

Here’s a curated list of some of the most groundbreaking LLMs available today, complete with download links and additional resources for the enthusiastic explorer.

  1. Llama2
  2. Falcon
  3. Bloom
  4. MPT
  5. Vicuna

Let’s dive deep into each one of them.

Llama2


Llama 2 represents a significant advancement in the field of large language models (LLMs), boasting training on 40% more data than its predecessor, Llama 1, which directly contributes to its enhanced capabilities. Released in sizes of 7, 13, and 70 billion parameters, Llama 2’s architecture is designed to handle a wide array of complex tasks with higher efficiency. The pretraining process involved a staggering 2 trillion tokens, ensuring a comprehensive understanding of language patterns and nuances. Notably, Llama 2 supports a context length of 4096 tokens, double that of Llama 1, thereby allowing for more intricate and extended dialogues. In addition to its extensive pretraining, Llama 2 has been fine-tuned for chat use cases with over 100,000 supervised fine-tuning examples and guided by more than 1 million human-preference examples.

The following are some technical details of the model, based on the research paper released by Meta; a short usage sketch follows the list.

  • Architecture: Llama 2 uses a standard transformer architecture with specific enhancements such as an increased context length and grouped-query attention (the latter in the larger variants).
  • Hyperparameters: The models were trained using the AdamW optimizer, with specific settings for learning rate, weight decay, and gradient clipping.
  • Tokenizer: Meta employed a byte pair encoding (BPE) algorithm with a vocabulary of 32k tokens.
  • Hardware Used: Pretraining was done on Meta’s Research Super Cluster and internal production clusters using NVIDIA A100s.
  • Carbon Footprint: The carbon emissions from pretraining were calculated and offset as part of Meta’s sustainability program.
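To connect this back to the article’s theme of question answering over private data, here is a minimal sketch of prompting the chat-tuned 13B checkpoint with a private snippet as context via Hugging Face transformers. The model ID, the example document, and the prompt wording are illustrative assumptions; the official weights are gated behind Meta’s license, and device_map="auto" assumes the accelerate package is installed.

```python
# Minimal sketch: question answering over a private snippet with Llama 2 Chat.
# Assumptions: access to the gated meta-llama checkpoint has been granted, and
# transformers, torch, and accelerate (for device_map="auto") are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-13b-chat-hf"  # assumed model ID; requires license acceptance

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# Illustrative "private data": in practice this would come from your own documents.
private_doc = "Q3 revenue grew 12% quarter over quarter, driven mainly by the EMEA region."
question = "What drove revenue growth in Q3?"

# Llama 2 chat models expect the [INST] / <<SYS>> prompt format used during fine-tuning.
prompt = (
    "[INST] <<SYS>>\nAnswer using only the provided context.\n<</SYS>>\n\n"
    f"Context: {private_doc}\n\nQuestion: {question} [/INST]"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=128, do_sample=False)
answer = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(answer)
```

The same pattern applies to the 7B and 70B chat variants by swapping the model ID.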

For more reading, check the research paper released by Meta, titled “Llama 2: Open Foundation and Fine-Tuned Chat Models”.

Explore and read this tutorial by “DataCamp” to see how to fine-tune Llama 2.

Download Llama 2 here.

Falcon


Falcon 180B is a highly powerful language model with 180 billion parameters, trained on 3.5 trillion tokens. At the time of its release it topped the Hugging Face leaderboard for pre-trained open large language models. Falcon 180B is part of a suite of models, including Falcon 40B, that prioritize open-source and open-access principles. The decision to make these models openly available aligns with the belief that innovation thrives in an environment that fosters accessibility. Falcon 40B is provided under the Apache 2.0 software license to both researchers and commercial users, ensuring security, availability, and a robust governance model for open-source software.

The following are some key characteristics of Falcon; a brief usage sketch follows the list.

  • Data Preparation: Falcon leverages a high-quality, five-trillion-token, web-only English pretraining dataset named RefinedWeb, focusing on web data to enhance model performance.
  • Data Filtering and Deduplication: Implements a MacroData Refinement (MDR) pipeline for filtering and deduplicating web data from CommonCrawl, ensuring high data quality.
  • Document-wise and Line-wise Filtering: Includes sophisticated filtering mechanisms like repetition removal, document-wise filtering for spam and non-natural language, and line-wise corrections for irrelevant content.
  • Deduplication Strategy: Falcon adopts an aggressive deduplication strategy that combines fuzzy document matches and exact sequence removal to improve data quality and model generalization.
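As a rough illustration, here is a minimal sketch of querying a Falcon instruct checkpoint over a private snippet with the Hugging Face pipeline API. The 7B instruct model ID, the context, and the question are assumptions chosen for brevity; the 40B and 180B variants follow the same pattern but require far more GPU memory.

```python
# Minimal sketch: querying a Falcon instruct checkpoint over a private snippet.
# The 7B instruct model is assumed here so the example fits on a single GPU.
import torch
from transformers import AutoTokenizer, pipeline

model_id = "tiiuae/falcon-7b-instruct"  # assumed model ID; 40B/180B follow the same pattern

tokenizer = AutoTokenizer.from_pretrained(model_id)
generator = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

context = "The internal wiki states that deployments are frozen every Friday after 3 pm."
question = "When are deployments frozen?"

result = generator(
    "Answer the question using only the context below.\n"
    f"Context: {context}\nQuestion: {question}\nAnswer:",
    max_new_tokens=64,
    do_sample=False,
)
print(result[0]["generated_text"])
```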

Explore and read this tutorial by “Lightning AI” to see how to fine-tune Falcon.

Check this blog on “Hugging Face” for a deeper dive.

To explore more deeply, here is the research paper by the Falcon LLM team.

Download Falcon here.

Bloom


BLOOM stands as a landmark in AI research as the world’s largest open-science, open-access multilingual large language model (LLM), boasting 176 billion parameters. It represents a significant stride in democratizing access to advanced AI, as it is the first LLM of its scale to be trained transparently and collaboratively by a global consortium of over 1000 researchers. BLOOM can generate text in 46 natural languages and 13 programming languages, offering unprecedented capabilities for academia, nonprofits, and small research labs. Its release empowers researchers to delve into the intricacies of LLM operations and further harness its capabilities for applications like text generation, summarization, classification, and language translation. Available within the Hugging Face ecosystem, BLOOM simplifies AI experimentation and innovation, heralding a new era of accessible, community-driven AI development.

The technical characteristics of BLOOM are as follows; a short loading sketch follows the list.

  • ALiBi Positional Embeddings: BLOOM employs ALiBi positional embeddings, which directly attenuate attention scores based on the distance between keys and queries, rather than adding positional information to the embedding layer.
  • Embedding Layer Normalization: The model includes an additional layer normalization immediately after the embedding layer, improving training stability.
  • Hardware Used: BLOOM was trained on the Jean Zay supercomputer, which is equipped with 48 nodes, each having 8 NVIDIA A100 80GB GPUs (totaling 384 GPUs).
  • Training Framework: The training utilized Megatron-DeepSpeed, a framework that combines Megatron-LM for the Transformer implementation, tensor parallelism, and data loading with DeepSpeed for the ZeRO optimizer, model pipelining, and general distributed training components. This framework enabled efficient training with 3D parallelism, combining data parallelism, tensor parallelism, and pipeline parallelism.
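Since the full 176-billion-parameter checkpoint is impractical to load on a single machine, here is a minimal sketch of the loading pattern using one of the smaller BLOOM checkpoints published by BigScience; the 560M variant and the prompt are assumptions chosen so the example runs on modest hardware.

```python
# Minimal sketch: loading a small BLOOM checkpoint and generating a completion.
# bigscience/bloom-560m is assumed so the example runs on modest hardware;
# the full 176B bigscience/bloom model exposes the same API but needs multi-GPU inference.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigscience/bloom-560m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# BLOOM is multilingual, so prompts do not have to be in English.
prompt = "Translate to French: The meeting is scheduled for Monday morning."
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```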

Check this tutorial by “E2E Networks Documentation” to learn about fine-tuning the model.

To explore more deeply, you can download the research paper here.

Download BLOOM here.

Vicuna


This innovative model is turning heads by reaching over 90% of the conversational prowess of renowned systems like OpenAI’s ChatGPT and Google Bard, all at a fraction of the usual cost. For those just stepping into the realm of Large Language Models (LLMs), Vicuna represents an exciting and accessible leap forward. Built by fine-tuning the LLaMA model with a treasure trove of 70,000 user-shared conversations from ShareGPT, Vicuna-13B showcases the power of community-driven development in the AI space. The technical crowd will appreciate its impressive cost-efficiency and the transparency of its training process, not to mention the preliminary evaluations — conducted using GPT-4 as a benchmark — that suggest Vicuna’s detailed and structured responses surpass those of its peers.

The technical characteristics of Vicuna are as follows; a short prompting sketch follows the list.

  • Memory Optimizations and Context Length: The training includes memory optimizations like gradient checkpointing and flash attention. Vicuna can handle a max context length of 2048, expanded from 512 in Alpaca.
  • Cost Reduction Strategies: The training used SkyPilot-managed spot instances to reduce costs. The cost of training the 13B model was around $300.
  • Serving Infrastructure: Vicuna-13B is served using a lightweight distributed system that supports GPU workers from both on-premise clusters and the cloud. It employs a fault-tolerant controller and utilizes cheaper spot instances for cost-effective serving.
  • Safety Measures: To mitigate potential toxicity or bias, the OpenAI moderation API is used to filter out inappropriate user inputs in the online demo.
  • Evaluation Framework: Vicuna-13B was evaluated using an automated assessment framework based on GPT-4. This framework involved generating a set of diverse questions and having GPT-4 judge the model outputs.
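Here is a minimal sketch of prompting a Vicuna checkpoint to answer a question grounded in a private note. The model ID (a later lmsys release) and the USER/ASSISTANT conversation format are assumptions based on Vicuna’s publicly documented prompt template.

```python
# Minimal sketch: grounding a Vicuna checkpoint's answer in a private note.
# The lmsys/vicuna-13b-v1.5 model ID and the USER/ASSISTANT template are assumptions
# based on Vicuna's publicly documented conversation format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "lmsys/vicuna-13b-v1.5"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

context = "The on-call rotation switches every Wednesday at 9 am UTC."
question = "When does the on-call rotation switch?"

prompt = (
    "A chat between a curious user and an artificial intelligence assistant.\n"
    f"USER: Using only this context, answer the question.\nContext: {context}\n"
    f"Question: {question}\nASSISTANT:"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```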

Check this tutorial by “Ray Docs” to learn about fine-tuning the model.

To explore more deeply, you can read the blog here.

Download Vicuna here.

MPT


MosaicML is reshaping the landscape of language AI with the introduction of MPT-7B, a game-changing open-source model with commercial usability that challenges the dominance of models like LLaMA-7B. MPT-7B is a transformer marvel, trained from scratch on a colossal dataset of 1 trillion tokens of text and code, boasting fast training and inference, and capable of handling context lengths up to 65k tokens thanks to ALiBi — a significant leap over the usual 2k-4k token limits. This model, a product of a seamless, human-free training process over 9.5 days on 440 GPUs, represents MosaicML’s commitment to democratizing access to powerful AI tools for businesses and developers alike.

Here are some technical characteristics of MPT; a short loading sketch follows the list.

  • Versatile Model Variants: MPT-7B includes specialized models like MPT-7B-StoryWriter-65k+ for long-form content creation, MPT-7B-Instruct for instruction-following tasks, and MPT-7B-Chat for dialogue generation, each addressing specific use cases.
  • Advanced Training Techniques: Utilizing a modified transformer architecture and innovations like FlashAttention and ALiBi, MPT-7B is optimized for efficient training and inference, capable of handling exceptionally long context lengths up to 65k tokens.
  • Extensive Dataset Training: The base model is trained on a diverse 1 trillion token dataset, including both text and code, ensuring robust performance across various domains and applications.
  • Commercial Usability and Licensing: MPT-7B is uniquely positioned for commercial use, released under open-source licenses like Apache 2.0 and CC-By-SA, allowing for wide-ranging applications including commercial deployment.
  • Integration and Deployment Ease: The model is available on Hugging Face, facilitating easy integration into existing projects and applications. MosaicML’s platform also supports custom training and deployment of these models, enhancing their accessibility and practicality.
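Here is a minimal sketch of loading MPT-7B-Instruct and extending its context window beyond the training length, which ALiBi makes possible at inference time. The model ID, the max_seq_len override, and the instruction prompt format are assumptions that follow the pattern documented on MosaicML’s Hugging Face model cards.

```python
# Minimal sketch: loading MPT-7B-Instruct and extending its context window,
# which ALiBi makes possible at inference time. The max_seq_len override and the
# "### Instruction / ### Response" prompt format are assumptions that follow the
# pattern shown on MosaicML's Hugging Face model cards.
import torch
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "mosaicml/mpt-7b-instruct"

config = AutoConfig.from_pretrained(model_id, trust_remote_code=True)
config.max_seq_len = 8192  # ALiBi lets MPT extrapolate past its 2048-token training length

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, config=config, torch_dtype=torch.bfloat16, trust_remote_code=True
)

prompt = (
    "Below is an instruction that describes a task.\n"
    "### Instruction:\nSummarize what ALiBi does in one sentence.\n### Response:\n"
)
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

The MPT-7B-StoryWriter-65k+ variant follows the same loading pattern with a much larger max_seq_len.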

Check this tutorial by “Paperspace” to learn about fine-tuning the model.

For further exploration, check how to fine-tune with Gretel GPT and how to do it on Amazon SageMaker.

To explore more deeply, you can read the blog here by MosaicML.

Download MPT-7B here.

As we have delved deep into the details of each LLM, here is a quick recap of the key technical details:

  • Llama 2: 7B to 70B parameters, 2 trillion pretraining tokens, 4096-token context, fine-tuned for chat with human-preference data.
  • Falcon 180B: 180 billion parameters, 3.5 trillion tokens prepared with the RefinedWeb/MDR pipeline, Apache 2.0 licensing for the 40B variant.
  • BLOOM: 176 billion parameters, 46 natural and 13 programming languages, trained on the Jean Zay supercomputer with Megatron-DeepSpeed.
  • Vicuna-13B: LLaMA fine-tuned on roughly 70,000 ShareGPT conversations, 2048-token context, trained for around $300.
  • MPT-7B: 1 trillion tokens of text and code, ALiBi-enabled context up to 65k tokens, commercially usable licenses.

Conclusion

In conclusion, the development of top Large Language Models like Llama2, Falcon, BLOOM, MPT, and Vicuna marks a significant milestone in the field of artificial intelligence and natural language processing. Each model, with its unique capabilities and innovations, contributes to the evolving landscape of AI technology. Collectively, these models not only demonstrate the rapid advancements in AI but also pave the way for a future where complex language tasks are handled with unprecedented ease and sophistication.
