Exclusive Leak: Internal Google Source Reveals Open Source AI to Challenge Google and OpenAI

Khalid Hossain
6 min read · May 5, 2023


Leaked Document Reveals Open-Source AI as Fierce Competitor Against Google in AI Race

Have you heard about open-source AI giving Google a run for its money in the race to be the top AI?

I came across an intriguing piece of news in the SemiAnalysis Substack newsletter. It suggests that open-source AI is gaining momentum fast enough to challenge the dominance of Google and OpenAI.

An anonymous insider recently leaked an internal Google document on a public Discord server, presenting open-source AI as a formidable rival to Google. After analyzing the document, SemiAnalysis reported that the open-source community may surpass both Google and OpenAI in AI.

According to SemiAnalysis, the leaked document has been confirmed to be genuine. The document states that neither Google nor OpenAI holds a durable competitive advantage in AI, and it highlights recent progress by open-source AI projects that has outpaced both companies.

The document lists remarkable accomplishments by open-source AI, including running foundation models on a Pixel 6 at five tokens per second, fine-tuning personalized AI on a laptop in an evening, and creating multimodal models in record time.

If you want to learn more about the leak, you can check out the full article by clicking on the link below.

One of the most fascinating elements of this document is the Timeline, which is presented below.

I have reproduced a section of the document below; note that this is an excerpt, not the complete document.

The Timeline

Feb 24, 2023 — LLaMA is Launched

Meta launches LLaMA, open sourcing the code, but not the weights. At this point, LLaMA is not instruction or conversation tuned. Like many current models, it is a relatively small model (available at 7B, 13B, 33B, and 65B parameters) that has been trained for a relatively large amount of time, and is therefore quite capable relative to its size.

March 3, 2023 — The Inevitable Happens

Within a week, LLaMA is leaked to the public. The impact on the community cannot be overstated. Existing licenses prevent it from being used for commercial purposes, but suddenly anyone is able to experiment. From this point forward, innovations come hard and fast.

March 12, 2023 — Language models on a Toaster

A little over a week later, Artem Andreenko gets the model working on a Raspberry Pi. At this point the model runs too slowly to be practical because the weights must be paged in and out of memory. Nonetheless, this sets the stage for an onslaught of minification efforts.

March 13, 2023 — Fine Tuning on a Laptop

The next day, Stanford releases Alpaca, which adds instruction tuning to LLaMA. More important than the actual weights, however, is Eric Wang’s alpaca-lora repo, which uses low-rank fine-tuning to do this training “within hours on a single RTX 4090”.

Suddenly, anyone could fine-tune the model to do anything, kicking off a race to the bottom on low-budget fine-tuning projects. Papers proudly describe their total spend of a few hundred dollars. What’s more, the low-rank updates can be distributed easily and separately from the original weights, making them independent of the original license from Meta. Anyone can share and apply them.
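
To make the “low-rank fine-tuning” idea concrete, here is a minimal sketch of a LoRA-style layer in PyTorch. This is my own illustration, not code from alpaca-lora (which builds on Hugging Face’s peft library): the base weights stay frozen, and only two small low-rank matrices are trained.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (the LoRA idea)."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the original layer entirely
        # Only these two small factors are trained.
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Output = frozen projection + scaled low-rank correction.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable:,}")  # ~65K trained vs. ~16.8M frozen
```

Because only the two small factors need to be saved, the resulting update is a few megabytes that can be shipped separately from Meta’s weights, which is exactly why the document notes these updates are independent of the original license.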

March 18, 2023 — Now It’s Fast

Georgi Gerganov uses 4-bit quantization to run LLaMA on a MacBook CPU. It is the first “no GPU” solution that is fast enough to be practical.
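
llama.cpp’s real Q4 formats are more elaborate, but a simplified sketch of blockwise absmax 4-bit quantization (my own illustration, not Gerganov’s code) shows the core trick:

```python
import numpy as np

def quantize_q4(weights: np.ndarray, block: int = 32):
    """Simplified blockwise absmax quantization to 4-bit signed integers."""
    w = weights.reshape(-1, block)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0  # map each block to [-7, 7]
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    # Real implementations pack two 4-bit values per byte; int8 is used here
    # only to keep the sketch readable.
    return q, scale.astype(np.float16)

def dequantize_q4(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return (q.astype(np.float32) * scale).ravel()

w = np.random.randn(4096).astype(np.float32)
q, s = quantize_q4(w)
print(f"mean abs error: {np.abs(w - dequantize_q4(q, s)).mean():.4f}")
```

At roughly 4.5 bits per weight (4-bit values plus one fp16 scale per 32 weights), a 7B-parameter model drops from roughly 14 GB in fp16 to about 4 GB, which is what makes CPU-only inference on a laptop plausible.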

March 19, 2023 — A 13B model achieves “parity” with Bard

The next day, a cross-university collaboration releases Vicuna, and uses GPT-4-powered eval to provide qualitative comparisons of model outputs. While the evaluation method is suspect, the model is materially better than earlier variants. Training Cost: $300.
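
The “GPT-4-powered eval” works by asking GPT-4 itself to grade two answers side by side. Here is a sketch of that pattern using the 2023-era openai client; the prompt is my own illustrative stand-in, not Vicuna’s actual rubric. The document calls the method “suspect” partly because an LLM judge can share the biases of the models it grades.

```python
import openai  # 2023-era client API: pip install openai==0.27

JUDGE_TEMPLATE = """Compare the two assistant responses to the question below.
Question: {question}
Response A: {a}
Response B: {b}
Score each response from 1 to 10 for helpfulness and accuracy, then explain briefly."""

def gpt4_judge(question: str, answer_a: str, answer_b: str) -> str:
    # Ask GPT-4 to act as the judge; Vicuna's actual prompts and rubric differ.
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": JUDGE_TEMPLATE.format(
            question=question, a=answer_a, b=answer_b)}],
        temperature=0,  # keep the grading as deterministic as possible
    )
    return response["choices"][0]["message"]["content"]
```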

Notably, they were able to use data from ChatGPT while circumventing restrictions on its API — they simply sampled examples of “impressive” ChatGPT dialogue posted on sites like ShareGPT.
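
In practice that means scraping shared conversations and flattening them into training pairs. A minimal sketch, assuming the commonly circulated ShareGPT JSON schema (a “conversations” list of turns with “from” and “value” fields; adjust to whatever dump you actually have):

```python
import json

def sharegpt_to_pairs(path: str) -> list[tuple[str, str]]:
    """Flatten ShareGPT-style dumps into (prompt, response) training pairs."""
    pairs = []
    with open(path) as f:
        conversations = json.load(f)
    for convo in conversations:
        turns = convo.get("conversations", [])
        # Pair each human turn with the assistant turn that follows it.
        for user, assistant in zip(turns, turns[1:]):
            if user.get("from") == "human" and assistant.get("from") == "gpt":
                pairs.append((user["value"], assistant["value"]))
    return pairs
```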

March 25, 2023 — Choose Your Own Model

Nomic creates GPT4All, which is both a model and, more importantly, an ecosystem. For the first time, we see models (including Vicuna) being gathered together in one place. Training Cost: $100.

March 28, 2023 — Open Source GPT-3

Cerebras (not to be confused with our own Cerebra) trains the GPT-3 architecture using the optimal compute schedule implied by Chinchilla, and the optimal scaling implied by μ-parameterization. This outperforms existing GPT-3 clones by a wide margin, and represents the first confirmed use of μ-parameterization “in the wild”. These models are trained from scratch, meaning the community is no longer dependent on LLaMA.
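
The Chinchilla “optimal compute schedule” boils down to a rule of thumb: train on roughly 20 tokens per parameter, with training compute commonly approximated as 6 × parameters × tokens. A back-of-envelope sketch (the 20:1 ratio is the widely cited approximation, not necessarily Cerebras’s exact schedule):

```python
def chinchilla_optimal(n_params: float, tokens_per_param: float = 20.0):
    """Back-of-envelope Chinchilla schedule: ~20 training tokens per parameter,
    with total training FLOPs approximated as C ~= 6 * N * D."""
    n_tokens = tokens_per_param * n_params
    flops = 6 * n_params * n_tokens
    return n_tokens, flops

# A few sizes from the Cerebras-GPT family:
for n in (111e6, 1.3e9, 13e9):
    tokens, flops = chinchilla_optimal(n)
    print(f"{n/1e9:5.2f}B params -> {tokens/1e9:6.1f}B tokens, ~{flops:.1e} FLOPs")
```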

March 28, 2023 — Multimodal Training in One Hour

Using a novel Parameter Efficient Fine Tuning (PEFT) technique, LLaMA-Adapter introduces instruction tuning and multimodality in one hour of training. Impressively, they do so with just 1.2M learnable parameters. The model achieves a new SOTA on multimodal ScienceQA.
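
The 1.2M figure is small enough to sanity-check by hand. LLaMA-Adapter keeps the 7B model frozen and learns only short “adaption prompts” inserted into the upper transformer layers, blended in through zero-initialized gates so training starts from the unmodified model’s behavior. A rough accounting (my reconstruction, so the prompt length and layer count are approximate):

```python
# Rough parameter accounting for LLaMA-Adapter's reported 1.2M figure:
# ~10 learnable prompt tokens inserted into the top ~30 layers of 7B LLaMA,
# whose hidden size is 4096 (plus a handful of zero-init gate scalars).
prompt_len, adapted_layers, hidden_size = 10, 30, 4096
learnable = prompt_len * adapted_layers * hidden_size
print(f"{learnable:,} learnable parameters")  # 1,228,800 ~= 1.2M
```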

April 3, 2023 — Real Humans Can’t Tell the Difference Between a 13B Open Model and ChatGPT

Berkeley launches Koala, a dialogue model trained entirely using freely available data.

They take the crucial step of measuring real human preferences between their model and ChatGPT. While ChatGPT still holds a slight edge, more than 50% of the time users either prefer Koala or have no preference. Training Cost: $100.
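
That “more than 50% prefer or tie” framing is easy to formalize. A small sketch with made-up vote counts (illustrative only, not Koala’s actual data):

```python
from collections import Counter

def preference_rates(votes: list[str]) -> dict[str, float]:
    """Turn raw pairwise human votes into preference rates."""
    counts = Counter(votes)
    return {choice: counts[choice] / len(votes)
            for choice in ("koala", "chatgpt", "tie")}

# Hypothetical votes: each rater picks 'koala', 'chatgpt', or 'tie'.
votes = ["koala"] * 42 + ["tie"] * 10 + ["chatgpt"] * 48
rates = preference_rates(votes)
print(rates, "->", rates["koala"] + rates["tie"] > 0.5)  # prefer-or-tie majority
```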

April 15, 2023 — Open Source RLHF at ChatGPT Levels

Open Assistant launches a model and, more importantly, a dataset for Alignment via RLHF. Their model is close (48.3% vs. 51.7%) to ChatGPT in terms of human preference. In addition to LLaMA, they show that this dataset can be applied to Pythia-12B, giving people the option to use a fully open stack to run the model. Moreover, because the dataset is publicly available, it takes RLHF from unachievable to cheap and easy for small experimenters.
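
To see why a public preference dataset matters, note that the standard reward-model objective in RLHF only needs pairs of (preferred, rejected) responses. A minimal sketch of that pairwise loss in PyTorch (the textbook formulation; Open Assistant’s exact training setup may differ):

```python
import torch
import torch.nn.functional as F

def reward_pair_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise reward-model loss: -log(sigmoid(r_chosen - r_rejected)).
    Minimizing it pushes the model to score preferred replies higher."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

# Illustrative scores a reward model might assign to reply pairs:
chosen = torch.tensor([1.3, 0.4, 2.1])
rejected = torch.tensor([0.2, 0.9, 1.5])
print(reward_pair_loss(chosen, rejected))  # shrinks as chosen outranks rejected
```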

Here is my perspective on the leaked information.

The leaked document suggests that open-source AI is gaining ground quickly enough to challenge the dominance of Google and OpenAI, and it makes clear that the landscape of the AI industry is constantly evolving. While Google and OpenAI have long been considered leaders in the field, the pace of the open-source community suggests that other players may be able to disrupt the status quo.

As someone who closely follows developments in AI, I find this leak incredibly intriguing. The prospect of open-source AI shaking up the industry is both exciting and potentially concerning. On the one hand, it could drive greater innovation and competition, ultimately benefiting consumers and society. On the other hand, it could lead to increased fragmentation and less collaboration, hindering progress in the field.

Here is an important excerpt from the document.

The current renaissance in open source LLMs comes hot on the heels of a renaissance in image generation. The similarities are not lost on the community, with many calling this the “Stable Diffusion moment” for LLMs.

And we must thank the SemiAnalysis team for publishing this document.

Ultimately, only time will tell how open-source AI will impact the industry. It may fizzle out and fail to gain traction, or it could become the next big thing in AI. Regardless of the outcome, I’ll keep a close eye on this development and continue to learn as much as I can about the evolving landscape of AI.

Important Note: Please be aware that the text above was obtained from a Substack newsletter and I am unable to verify the accuracy of the information provided. However, I am sharing this document along with the source information for your reference.


Khalid Hossain

Geeky Project Manager | Programmer | Market Analyst | Entrepreneur