Dolly 2.0: The World’s First Open-Source LLM by Databricks That Allows Fine-Tuning and ChatGPT-Level Performance

Felix Reynoso
2 min read · Jun 22, 2023

Databricks CEO and co-founder Ali Ghodsi announced Dolly 2.0 on LinkedIn, describing it as the first open-source, instruction-following LLM fine-tuned on a human-generated instruction dataset that is licensed for commercial use.

Databricks shared more details about Dolly 2.0 in a blog post. According to the post, Dolly 2.0 can follow instructions, and businesses can create, own, and customize such LLMs to suit their particular requirements. For example, a business that wants to use an LLM for sentiment analysis of customer reviews doesn’t have to start from scratch: with Dolly, it could begin with an already-trained LLM and then fine-tune it on a dataset of customer reviews.
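As a rough illustration of the sentiment-analysis scenario, the sketch below turns labeled customer reviews into instruction-style records that a fine-tuning job could consume. The `Review` type, the `reviews_to_records` and `to_jsonl` helpers, the instruction wording, and the record field names are all assumptions made for this example; the Databricks post does not prescribe them.

```python
import json
from dataclasses import dataclass


@dataclass
class Review:
    """A labeled customer review (hypothetical type for this sketch)."""
    text: str
    sentiment: str  # e.g. "positive" or "negative"


# Assumed instruction wording; any consistent phrasing would do.
INSTRUCTION = "Classify the sentiment of this customer review as positive or negative."


def reviews_to_records(reviews):
    """Convert labeled reviews into instruction-style dicts (assumed schema)."""
    records = []
    for review in reviews:
        text = review.text.strip()
        if not text:
            continue  # skip reviews that contain only whitespace
        records.append(
            {
                "instruction": INSTRUCTION,
                "context": text,
                "response": review.sentiment,
            }
        )
    return records


def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(record) for record in records)


# Made-up reviews, used only to exercise the helpers.
sample_reviews = [
    Review("Arrived quickly and works great.", "positive"),
    Review("   ", "negative"),
    Review("Stopped working after two days.", "negative"),
]
records = reviews_to_records(sample_reviews)
jsonl = to_jsonl(records)
print(jsonl)
```

The resulting JSON Lines text is only the data-preparation step; the fine-tuning itself would depend on whichever training code a team chooses.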

Dolly 2.0 is a 12-billion-parameter model based on the EleutherAI Pythia model family and fine-tuned exclusively on a new, high-quality, human-generated instruction-following dataset called databricks-dolly-15k. Databricks describes this as the first open-source, human-generated instruction dataset designed to give LLMs the kind of human-like interactivity seen in ChatGPT. Databricks released the dataset under the Creative Commons Attribution-ShareAlike 3.0 Unported License and also made the training code and the model weights available to anyone for commercial use.

After releasing Dolly 1.0, which was trained on a dataset that the Stanford Alpaca team produced using the OpenAI API, Databricks received numerous requests to use its LLMs commercially. However, that dataset included output from ChatGPT, and OpenAI’s terms of service prohibit using such output to develop models that compete with OpenAI. Dolly 1.0 was therefore restricted to non-commercial use. To get around this restriction, Databricks built its own dataset by crowdsourcing it among its employees during March and April 2023.

The databricks-dolly-15k dataset contains 15,000 human-generated prompt/response pairs created specifically for instruction following, covering tasks that range from content generation and brainstorming to information extraction and summarization. By releasing Dolly 2.0 as open source, Databricks hopes to democratize access to LLMs, allowing businesses to build customized models without paying for API access or sharing their data with third parties.
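To make the idea of prompt/response pairs more concrete, here is a minimal sketch of how such records might be tallied by task type and rendered into training prompts. The sample records are invented for illustration and are not taken from the dataset. The field names, the category labels, and the prompt template are likewise assumptions, not necessarily what Databricks used.

```python
from collections import Counter

# Illustrative prompt templates (assumed, not confirmed as Databricks' own).
PROMPT_WITH_CONTEXT = (
    "### Instruction:\n{instruction}\n\n"
    "### Context:\n{context}\n\n"
    "### Response:\n"
)
PROMPT_NO_CONTEXT = "### Instruction:\n{instruction}\n\n### Response:\n"


def build_prompt(record):
    """Render one record as a prompt, adding the context section only if present."""
    context = record.get("context", "").strip()
    if context:
        return PROMPT_WITH_CONTEXT.format(
            instruction=record["instruction"], context=context
        )
    return PROMPT_NO_CONTEXT.format(instruction=record["instruction"])


# Made-up records in an assumed schema, for demonstration only.
sample_records = [
    {
        "instruction": "Suggest three names for a bakery.",
        "context": "",
        "response": "Rise and Shine, Crumb and Co., The Daily Loaf",
        "category": "brainstorming",
    },
    {
        "instruction": "Summarize the passage in one sentence.",
        "context": "Dolly 2.0 is a 12-billion parameter model released by Databricks.",
        "response": "Databricks released Dolly 2.0, a 12-billion parameter model.",
        "category": "summarization",
    },
]

# Count how many records fall under each task type.
category_counts = Counter(record["category"] for record in sample_records)
prompts = [build_prompt(record) for record in sample_records]

print(category_counts)
print(prompts[1] + sample_records[1]["response"])
```

During fine-tuning, each prompt would typically be paired with its `response` text as the target the model learns to produce.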

