Open Source Large Language Models (LLMs)

A large language model (LLM) is a type of machine learning model that can perform a variety of natural language processing (NLP) tasks, including generating and classifying text, answering questions in a conversational manner and translating text from one language to another.

The label "large" refers to the number of values (parameters) the model can change autonomously as it learns. Some of the most successful LLMs have hundreds of billions of parameters.

LLMs are trained on immense amounts of data and use self-supervised learning to predict the next token in a sentence, given the surrounding context. The process is repeated over and over until the model reaches an acceptable level of accuracy.
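
As a rough illustration of this training objective, here is a minimal sketch using the Hugging Face transformers library, with GPT-2 standing in for any causal language model (the model choice and sentence are purely illustrative):

```python
# Minimal sketch of the next-token objective. GPT-2 is used only as a small
# stand-in causal LM; the text is an arbitrary example.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

text = "Large language models predict the next token given the context."
inputs = tokenizer(text, return_tensors="pt")

# With labels equal to the input ids, the library shifts them internally and
# computes the cross-entropy of predicting each next token from its context.
outputs = model(**inputs, labels=inputs["input_ids"])
print(f"next-token cross-entropy loss: {outputs.loss.item():.3f}")
```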

Falcon-40B-Instruct

Falcon-40B-Instruct is a 40B-parameter causal decoder-only model built by TII, based on Falcon-40B and fine-tuned on a mixture of Baize data.
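
The model is published on the Hugging Face Hub as tiiuae/falcon-40b-instruct. A hedged loading example is shown below; the exact flags and hardware requirements vary by transformers version and setup, and the 40B weights need substantial GPU memory:

```python
# Hedged example of loading Falcon-40B-Instruct from the Hugging Face Hub.
# Requires large GPUs (or multi-GPU sharding via device_map="auto").
import torch
from transformers import AutoTokenizer, pipeline

model_id = "tiiuae/falcon-40b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
generator = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,   # Falcon originally shipped custom modeling code
    device_map="auto",
)

print(generator("Explain what a causal decoder-only model is.",
                max_new_tokens=100)[0]["generated_text"])
```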

The Technology Innovation Institute (TII) in Abu Dhabi has announced its open-source large language model (LLM), the Falcon 40B. With 40 billion parameters, Falcon 40B is the UAE's first large-scale AI model, indicating the country's ambition in the field of AI and its commitment to promoting innovation and research.

Unlike most LLMs, which are typically restricted to non-commercial use, Falcon 40B is open to both research and commercial usage. TII has also included the model's weights in the open-source release, which allows users to fine-tune the model more effectively for their own applications.

In addition to the launch of Falcon 40B, the TII has initiated a call for proposals from researchers and visionaries interested in leveraging the model to create innovative use cases or explore further applications. As a reward for exceptional research proposals, selected projects will receive "training compute power" as an investment, allowing for more robust data analysis and complex modeling. VentureOne, the commercialization arm of ATRC, will provide computational resources for the most promising projects.

TII's Falcon 40B has shown impressive performance since its unveiling in March 2023. When benchmarked with Stanford University's HELM tool, it was found to have used less training compute than other renowned LLMs such as OpenAI's GPT-3, DeepMind's Chinchilla AI, and Google's PaLM-62B.

Vicuna

Vicuna is an open-source chatbot trained by fine-tuning LLaMA on user-shared conversations collected from ShareGPT. Preliminary evaluation using GPT-4 as a judge shows that Vicuna-13B achieves more than 90% of the quality of OpenAI's ChatGPT and Google Bard while outperforming other models like LLaMA and Stanford Alpaca in more than 90% of cases. The cost of training Vicuna-13B is around $300. The code and weights, along with an online demo, are publicly available for non-commercial use.

After fine-tuning Vicuna with 70K user-shared ChatGPT conversations, we find that Vicuna becomes capable of generating more detailed and well-structured answers than Alpaca, with quality on par with ChatGPT.

Vicuna is created by fine-tuning a LLaMA base model using approximately 70K user-shared conversations gathered from ShareGPT.com with public APIs. To ensure data quality, we convert the HTML back to markdown and filter out some inappropriate or low-quality samples. Additionally, we divide lengthy conversations into smaller segments that fit the model's maximum context length.
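
A simplified sketch of that last preprocessing step is shown below; this is not the Vicuna team's actual code, and the tokenizer, the 2048-token limit, and the split_conversation helper are assumptions made here for illustration:

```python
# Split a long multi-turn conversation into chunks that fit a maximum context
# length measured in tokens, keeping turn boundaries intact. `tokenizer` can
# be any Hugging Face tokenizer; a turn longer than max_len is kept whole.
def split_conversation(turns, tokenizer, max_len=2048):
    """turns: list of strings, each one message in the conversation."""
    segments, current, current_len = [], [], 0
    for turn in turns:
        n_tokens = len(tokenizer.encode(turn))
        if current and current_len + n_tokens > max_len:
            segments.append(current)   # close the current segment
            current, current_len = [], 0
        current.append(turn)
        current_len += n_tokens
    if current:
        segments.append(current)
    return segments
```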

The training recipe builds on top of Stanford's Alpaca with the following improvements.

  • Memory Optimizations: To enable Vicuna's understanding of long context, we expand the max context length from 512 in Alpaca to 2048, which substantially increases GPU memory requirements. We tackle the memory pressure by utilizing gradient checkpointing and flash attention.
  • Multi-round conversations: We adjust the training loss to account for multi-round conversations and compute the fine-tuning loss solely on the chatbot's output (see the loss-masking sketch after this list).
  • Cost Reduction via Spot Instance: The 40x larger dataset and 4x sequence length pose a considerable challenge in training expenses. We employ SkyPilot managed spot instances to reduce the cost, leveraging cheaper spot instances with auto-recovery for preemptions and automatic zone switching. This solution slashes the cost of training the 7B model from $500 to around $140 and the 13B model from around $1K to $300.
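
The loss-masking idea from the multi-round conversation bullet can be sketched as follows. This is a minimal PyTorch illustration rather than the actual FastChat training code, and the masked_lm_loss and assistant_mask names are invented here:

```python
# Compute the fine-tuning loss only on the chatbot's tokens: user/prompt
# tokens get the label -100, which PyTorch's cross-entropy ignores.
import torch
import torch.nn.functional as F

IGNORE_INDEX = -100

def masked_lm_loss(logits, input_ids, assistant_mask):
    """logits: (batch, seq, vocab); assistant_mask: 1 where the token was
    produced by the chatbot, 0 for user/prompt tokens."""
    labels = input_ids.clone()
    labels[assistant_mask == 0] = IGNORE_INDEX

    # Standard causal shift: predict token t from tokens before t.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=IGNORE_INDEX,
    )
```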

Alpaca

Alpaca is fine-tuned from Meta's LLaMA 7B model on 52K instruction-following demonstrations generated in the style of self-instruct using text-davinci-003. On the self-instruct evaluation set, Alpaca shows many behaviors similar to OpenAI's text-davinci-003, yet it is also surprisingly small and easy/cheap to reproduce.

There are two important challenges to training a high-quality instruction-following model under an academic budget: a strong pretrained language model and high-quality instruction-following data. The first challenge is addressed with the recent release of Meta's new LLaMA models. For the second challenge, the self-instruct paper suggests using an existing strong language model to automatically generate instruction data. In particular, Alpaca is a language model fine-tuned using supervised learning from a LLaMA 7B model on 52K instruction-following demonstrations generated from OpenAI's text-davinci-003.

For the data, the instruction-following demonstrations were generated by building on the self-instruct method. We started with the 175 human-written instruction-output pairs from the self-instruct seed set. We then prompted text-davinci-003 to generate more instructions using the seed set as in-context examples. We improved over the self-instruct method by simplifying the generation pipeline (see details on GitHub) and significantly reduced the cost. The data generation process resulted in 52K unique instructions and corresponding outputs, which cost less than $500 using the OpenAI API.
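
A heavily simplified sketch of that generation step might look like the following. It assumes the legacy OpenAI completions API that text-davinci-003 used (both the model and this API have since been deprecated), and the seed examples and prompt wording are placeholders rather than Alpaca's actual prompt:

```python
# Ask text-davinci-003 for new instructions, using seed tasks as in-context
# examples. Assumes OPENAI_API_KEY is set in the environment and the legacy
# openai<1.0 client; the real pipeline adds batching, filtering, and dedup.
import openai

seed_examples = [
    "Instruction: Give three tips for staying healthy.\nOutput: ...",
    "Instruction: Summarize the plot of Hamlet in two sentences.\nOutput: ...",
]

prompt = (
    "You are asked to come up with diverse task instructions and outputs.\n\n"
    + "\n\n".join(seed_examples)
    + "\n\nInstruction:"
)

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=512,
    temperature=1.0,
)
print(response["choices"][0]["text"])
```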

Equipped with this instruction-following dataset, we then fine-tuned the LLaMA models using Hugging Face's training framework, taking advantage of techniques like Fully Sharded Data Parallel and mixed precision training. For our initial run, fine-tuning a 7B LLaMA model took 3 hours on 8 80GB A100s, which costs less than $100 on most cloud compute providers. We note that training efficiency can be improved to further reduce the cost.
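
For a sense of what that setup looks like in code, below is a hedged sketch of Hugging Face TrainingArguments with FSDP and mixed precision enabled; the values are illustrative, not the exact Alpaca configuration, and flag names can differ across transformers versions:

```python
# Illustrative training configuration: bf16 enables mixed precision and the
# fsdp flag enables Fully Sharded Data Parallel when launched with torchrun.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="alpaca-7b-finetune",       # hypothetical output directory
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,
    learning_rate=2e-5,
    bf16=True,                              # mixed precision training
    fsdp="full_shard auto_wrap",            # Fully Sharded Data Parallel
    logging_steps=10,
    save_strategy="epoch",
)
```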

LLaMA

LLaMA (Large Language Model Meta AI) is a state-of-the-art foundational large language model designed to help researchers advance their work in this subfield of AI. Smaller, more performant models such as LLaMA enable others in the research community who don't have access to large amounts of infrastructure to study these models, further democratizing access in this important, fast-changing field.

Training smaller foundation models like LLaMA is desirable in the large language model space because it requires far less computing power and resources to test new approaches, validate others' work, and explore new use cases. Foundation models train on a large set of unlabeled data, which makes them ideal for fine-tuning for a variety of tasks. We are making LLaMA available at several sizes (7B, 13B, 33B, and 65B parameters) and also sharing a LLaMA model card that details how we built the model in keeping with our approach to Responsible AI practices.

Smaller models trained on more tokens (which are pieces of words) are easier to retrain and fine-tune for specific potential product use cases. We trained LLaMA 65B and LLaMA 33B on 1.4 trillion tokens. Our smallest model, LLaMA 7B, is trained on one trillion tokens.

Like other large language models, LLaMA works by taking a sequence of words as input and predicting the next word, recursively generating text. To train our model, we chose text from the 20 languages with the most speakers, focusing on those with Latin and Cyrillic alphabets.
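
That recursive next-word prediction can be sketched as a simple greedy decoding loop; the snippet below uses GPT-2 purely as a small stand-in causal language model, since the same loop applies to LLaMA or any other decoder-only model:

```python
# Greedy autoregressive generation: repeatedly pick the most likely next
# token and append it to the input before the next step.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tokenizer("The quick brown fox", return_tensors="pt").input_ids
with torch.no_grad():
    for _ in range(20):                       # generate 20 tokens greedily
        logits = model(ids).logits
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)

print(tokenizer.decode(ids[0]))
```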

LLaMA-13B outperforms GPT-3 (175B) on most benchmarks, and LLaMA-65B is competitive with the best models, Chinchilla-70B and PaLM-540B.

GPT-J

GPT-J 6B is a transformer model trained using Ben Wang's Mesh Transformer JAX. "GPT-J" refers to the class of model, while "6B" represents the number of trainable parameters. The model consists of 28 layers with a model dimension of 4096 and a feedforward dimension of 16384. The model dimension is split into 16 heads, each with a dimension of 256. Rotary Position Embedding (RoPE) is applied to 64 dimensions of each head. The model is trained with a tokenization vocabulary of 50257, using the same set of BPEs as GPT-2/GPT-3. The model was released by EleutherAI.

The core functionality of GPT-J is taking a string of text and predicting the next token. While language models are widely used for tasks other than this, there are a lot of unknowns with this work. When prompting GPT-J, it is important to remember that the statistically most likely next token is often not the token that produces the most "accurate" text. Never depend upon GPT-J to produce factually accurate output.

GPT-J was trained on the Pile, a dataset known to contain profanity, lewd, and otherwise abrasive language. Depending upon the use case, GPT-J may produce socially unacceptable text. The model was trained for 402 billion tokens over 383,500 steps on a TPU v3-256 pod. It was trained as an autoregressive language model, using cross-entropy loss to maximize the likelihood of predicting the next token correctly.
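
A hedged usage example is shown below; the model is distributed on the Hugging Face Hub as EleutherAI/gpt-j-6B, and the float16 loading, device_map sharding, and sampling settings are illustrative choices rather than recommendations from EleutherAI:

```python
# Next-token generation with GPT-J 6B. float16 roughly halves memory use;
# device_map="auto" (via accelerate) places the weights on available devices.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B", torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("The Pile is a dataset that", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.8)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```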

Dolly

Databricks' Dolly-V2-12B is an instruction-following large language model trained on the Databricks machine learning platform that is licensed for commercial use. Based on Pythia-12B, Dolly is trained on ~15k instruction/response fine-tuning records (databricks-dolly-15k) generated by Databricks employees in capability domains from the InstructGPT paper, including brainstorming, classification, closed QA, generation, information extraction, open QA, and summarization. Dolly-V2-12B is not a state-of-the-art model, but it does exhibit surprisingly high-quality instruction-following behavior not characteristic of the foundation model on which it is based.

Dolly-V2-12B is a 12 billion parameter causal language model created by Databricks that is derived from EleutherAI's Pythia-12B and fine-tuned on a ~15K record instruction corpus generated by Databricks employees and released under a permissive license (CC-BY-SA).
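
The model card for databricks/dolly-v2-12b describes loading the model through the transformers pipeline with trust_remote_code enabled; a hedged example (the prompt and dtype here are illustrative) follows:

```python
# Run Dolly-v2-12b via the transformers pipeline; trust_remote_code loads
# the model's custom instruction-following pipeline from the Hub.
import torch
from transformers import pipeline

generate_text = pipeline(
    model="databricks/dolly-v2-12b",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

print(generate_text("Explain the difference between open QA and closed QA."))
```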

A comparison of these LLMs can be viewed here:

Open LLM Leaderboard

The Open LLM Leaderboard aims to track, rank and evaluate LLMs and chatbots as they are released.
