Phi 1.5 and the Shift Towards Smaller Models with Curated Data: A Closer Look

azhar · azhar labs · Sep 22, 2023

Introduction

In the ever-evolving landscape of artificial intelligence and natural language processing, there are few moments that truly stand out as game-changers. The emergence of the Phi 1.5 model from the “Textbooks Are All You Need II” paper is one such moment.

This model, however, isn’t just another name in an endless list of models; it marks a genuine shift in the trend. While popular models like GPT-4, PaLM, and even predecessors like GPT-3 banked on enormous datasets, the spotlight is now on the effectiveness of smaller models paired with thoughtfully curated data.

In this article, we embark on a comprehensive exploration of this groundbreaking model, delving into its intricate details, capabilities, and potential implications for the wider AI community.

The Backdrop

Historically, the mantra seemed to be ‘the bigger, the better.’ We’ve seen models like LLaMA-2 trained on over 2 trillion tokens, with subsequent fine-tuning. While the results from these massive models have been impressive, it raises a pertinent question: Are we redundantly training these models on repetitive data? Are we feeding them lessons that are already learned and just reinforcing the same over and over?

A Different Path with ‘TinyStories’

Earlier in 2023, Microsoft Research introduced a fascinating paper titled ‘TinyStories.’ Its provenance is intriguing, given Microsoft’s collaboration with OpenAI, a company known for its gargantuan models. In a contrasting approach, Microsoft Research seems to advocate the potential of compact models trained on high-quality data.

‘TinyStories’ embodies this idea: it starts with very small models and trains them to craft brief narratives reminiscent of children’s stories.

This approach stems from a compelling argument: a child learns to articulate and weave basic stories after exposure to the equivalent of merely a few billion tokens. Why, then, do language models demand hundreds of billions, or even trillions, to achieve a similar capability?

This research didn’t just remain theoretical; the researchers created a collection of these ‘tiny stories.’ The results were promising, echoing the sentiment that maybe, size isn’t everything. For enthusiasts, the data set from this research is now available, offering a hands-on experience to explore this novel perspective.
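
For the curious, here is a minimal sketch of peeking at that data with the Hugging Face datasets library, assuming the publicly hosted roneneldan/TinyStories dataset and the text field it exposed at the time of writing:

```python
from datasets import load_dataset

# Stream the corpus so we can peek at a few stories without a full download.
tiny_stories = load_dataset("roneneldan/TinyStories", split="train", streaming=True)

for i, example in enumerate(tiny_stories):
    print(example["text"][:300])
    print("---")
    if i == 2:
        break
```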

Reflecting on Data Quality over Quantity

The crux of this paradigm shift rests on data curation. Instead of bombarding models with vast amounts of data, the focus is on feeding them rich, well-chosen data sets. It suggests that the right quality and type of data might be more crucial than sheer volume.
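
To make the idea concrete, here is a toy sketch of what curation-over-volume can look like in practice: score every raw sample with some quality function and keep only what clears a bar. The scoring heuristic below is purely illustrative; the papers discussed here rely on far more sophisticated filters, such as trained classifiers or LLM-based judges.

```python
from typing import Callable, Iterable

def curate(samples: Iterable[str],
           score_fn: Callable[[str], float],
           threshold: float = 0.8) -> list[str]:
    """Keep only samples whose quality score clears the threshold."""
    return [s for s in samples if score_fn(s) >= threshold]

def toy_quality_score(text: str) -> float:
    """Illustrative heuristic: reward substantive, well-formed samples.
    Real pipelines would use a trained classifier or an LLM-based judge."""
    words = text.split()
    if not words:
        return 0.0
    long_enough = min(len(words) / 50.0, 1.0)                      # prefer longer samples
    ends_cleanly = 1.0 if text.strip().endswith((".", "?", "!")) else 0.5
    return long_enough * ends_cleanly

raw_corpus = ["Some short noise", "A complete, well-formed explanation of a concept."]
curated = curate(raw_corpus, toy_quality_score, threshold=0.3)
```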

Introduction to Orca


Following the insights from ‘TinyStories’, another influential paper that has captured the attention of many is titled ‘Orca’. This research paper extends the narrative of ‘small models with curated data’ but brings its unique spin, particularly emphasizing the curated and synthetic data aspects.

Orca’s Proposition

While ‘TinyStories’ showcased the potential of small models to craft narratives, ‘Orca’ delved deeper into the mechanics of data curation and its role in enhancing the capabilities of compact models. A recurring theme here is the use of synthetic data to train smaller models, enabling them to punch above their weight, imitating the reasoning of giants like GPT-4 and approaching ChatGPT-level performance on several benchmarks.

What distinguishes ‘Orca’ is its emphasis on data creation. The paper provides a comprehensive account of the process they employed to produce their dataset. This transparent methodology galvanized many in the AI community to create open-source datasets inspired by Orca’s approach.
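
As a rough illustration of that recipe, each training record pairs an instruction with a system message that asks the teacher model for a step-by-step explanation rather than a bare answer. The prompts and field names below are invented for illustration, not the exact ones from the paper:

```python
# Illustrative sketch of Orca-style "explanation tuning" records.
EXPLANATION_SYSTEM_PROMPTS = [
    "You are a helpful assistant. Think step by step and justify your answer.",
    "Explain your reasoning as if teaching a five-year-old.",
]

def build_record(instruction: str, user_input: str, teacher_response: str, style: int = 0) -> dict:
    """Pack one instruction into the (system, question, response) layout
    that Orca-style datasets typically use."""
    return {
        "system_prompt": EXPLANATION_SYSTEM_PROMPTS[style],
        "question": f"{instruction}\n\n{user_input}".strip(),
        "response": teacher_response,  # collected from a strong teacher model (e.g. GPT-4)
    }
```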

However, there’s a caveat. As of now, Microsoft hasn’t publicly released the exact dataset used for the Orca model, leaving enthusiasts and researchers to rely on open-sourced versions inspired by the original.

Achievements of Orca

Orca’s prowess isn’t merely theoretical. The paper documents a 13-billion-parameter model which, though compact compared to some, delivers results that rival ChatGPT on several reasoning benchmarks, even if GPT-4 remains ahead. In essence, it’s a featherweight boxer landing heavyweight punches.

The methodology expounded in the ‘Orca’ paper has laid the groundwork for various ‘open Orca datasets.’ These datasets, following the formula from the paper, have empowered AI enthusiasts and researchers to fine-tune sophisticated LLaMA models. The outcome? Models that can emulate and, in some instances, rival the prowess of their much larger counterparts.
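
For anyone who wants to experiment, one such community dataset is hosted on Hugging Face as Open-Orca/OpenOrca. The sketch below streams it and flattens a record into a single training string; the prompt template is an assumption rather than a prescribed format, and the column names are those the dataset exposed at the time of writing.

```python
from datasets import load_dataset

# Stream the community OpenOrca dataset (assumed fields: system_prompt, question, response).
open_orca = load_dataset("Open-Orca/OpenOrca", split="train", streaming=True)

def to_prompt(example: dict) -> str:
    """Flatten one record into a single training string (template is illustrative)."""
    return (
        f"### System:\n{example['system_prompt']}\n\n"
        f"### User:\n{example['question']}\n\n"
        f"### Assistant:\n{example['response']}"
    )

sample = next(iter(open_orca))
print(to_prompt(sample)[:500])
```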

The Way Forward with Data Curation

‘Orca’ solidifies the belief that the future of machine learning might not necessarily lie in making bigger models but in refining and curating the data that feeds them. This trend, combined with synthetic data generation, has opened a new avenue of research and application. As Orca’s success shows, when you couple a well-designed small model with meticulously curated data, the results can be astonishingly effective.

The ‘Textbooks Are All You Need’ Paradigm

In the wake of groundbreaking research from the likes of ‘Orca’ and ‘TinyStories,’ Microsoft’s AI research team introduced yet another compelling piece titled ‘Textbooks Are All You Need.’ Building upon their previous methodologies, this paper pushes the envelope further and introduces the community to the Phi 1 model.

Model Specs and Training

Phi 1, with its 1.3 billion parameters, is not just compact; it’s diminutive compared to AI titans like GPT-3 and GPT-4 (it is roughly the size of GPT-2). But the magic isn’t just in its size. The model was trained for merely four days on eight A100 GPUs, which by today’s standards is a modest budget.
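
The arithmetic behind that claim is worth spelling out: four days on eight GPUs amounts to a few hundred A100-hours, a rounding error next to the budgets of frontier models.

```python
# Back-of-the-envelope compute for Phi 1's reported training run.
gpus = 8          # A100s
days = 4
gpu_hours = gpus * days * 24
print(f"Roughly {gpu_hours} A100-hours")  # -> Roughly 768 A100-hours
```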

What’s the secret sauce? Microsoft attributes the model’s prowess to its ‘textbook quality data.’ Instead of training on vast swathes of internet data, Phi 1’s training was grounded in high-quality, textbook-level content.

Benchmarking Success

While Phi 1 might not outrun GPT-4 on all fronts, it does manage to hold its ground impressively when benchmarked against models such as LLaMA-2, Falcon, and even GPT-4. What this paper brilliantly showcases is that with the right data, especially well-curated textbook quality data, and the application of methodologies like curriculum training, even smaller models can achieve commendable results.

My exploration of ‘Falcon 180B’ also hinted at this. There’s a growing suspicion in the AI community that leading companies possess sophisticated training strategies, intricacies in data selection, and novel ways to sequence training data. ‘Textbooks Are All You Need’ lends credence to this speculation.

Introducing Phi 1.5: The Evolution

Drawing from the success and insights of Phi 1, the research team recently unveiled the technical report for Phi 1.5. Touted as the sequel, or as some are calling it, ‘Textbooks Are All You Need II,’ this report delves deeper into the mechanics behind Phi 1’s success while introducing its evolved successor.

‘Textbooks Are All You Need II’ offers a fresh perspective on the importance of synthetic data in model training. If we were to classify the sequence of papers, ‘TinyStories’ would be the inauguration, introducing the AI community to models of roughly 10 million parameters capable of coherent English. Phi 1, as previously discussed, excelled in Python coding tasks. Phi 1.5, in turn, focuses on natural language reasoning.

Synthetic Data — The New Goldmine?

The continuous emphasis on ‘textbook quality data’ is noteworthy. In essence, Phi 1.5 was bootstrapped with the roughly 7 billion tokens of Phi 1’s training data. This was then augmented with synthetic data, intriguingly manufactured using GPT-4. Leveraging the power of a behemoth like GPT-4, the researchers synthesized data emulating the style of textbooks, seeded by 20,000 carefully selected topics.
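
A minimal sketch of what topic-seeded generation can look like is shown below, using the pre-1.0 openai package that was current when this article was written. The seed topics and prompt wording are invented for illustration; the paper reports roughly 20,000 carefully chosen topics.

```python
import openai  # assumes OPENAI_API_KEY is set in the environment

# A handful of illustrative seed topics; the actual list is far larger.
SEED_TOPICS = ["Newton's third law", "photosynthesis", "binary search", "supply and demand"]

def generate_textbook_passage(topic: str) -> str:
    """Ask a strong teacher model for a short, textbook-style passage on one topic.
    The prompt wording is illustrative, not the one used for Phi 1.5."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "You write clear, self-contained textbook sections."},
            {"role": "user", "content": f"Write a short textbook-style explanation of {topic}, "
                                        f"ending with two practice exercises."},
        ],
        temperature=0.7,
    )
    return response["choices"][0]["message"]["content"]

synthetic_corpus = [generate_textbook_passage(t) for t in SEED_TOPICS]
```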

In juxtaposition with models like LLaMA, the Phi 1.5 training was significantly leaner, both in terms of time and tokens. However, the brilliance of the paper isn’t in its architecture, which predominantly revolves around standard transformers. The true genius lies in its training data curation.

Benchmarks — How Reliable Are They?

Benchmark numbers aside, a pressing question remains: how much credence should we place in benchmarks? As models evolve and synthetic data becomes more prevalent, the authenticity and relevance of benchmarks may dwindle. The real metric of success is not just surpassing benchmarks but the model’s applicability and efficacy in real-world scenarios.

The Phi 1.5 model’s prowess on tasks like GSM8K (grade-school math word problems) suggests a practical utility that extends beyond mere benchmark tests.

How Does It Hold Up in Practice?

Given that Phi 1.5 is not fine-tuned for specific tasks, it is impressive how well it handles a variety of prompts. Whether it’s coding, instruction following, or just a chat, the results are quite remarkable, especially since this is just the base model and not a specialized variant like many models we’ve seen in the past.
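
For readers who want to see this for themselves, here is a minimal sketch of prompting the base model with the transformers library. The microsoft/phi-1_5 checkpoint is the one published on Hugging Face; the word problem is a made-up GSM8K-style example, and at release loading the model required trust_remote_code=True.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "microsoft/phi-1_5"  # checkpoint published on Hugging Face
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float32, trust_remote_code=True
)

# The base model is not instruction-tuned, so plain text continuation is the natural prompt style.
prompt = ("Question: A bookshelf has 4 shelves with 12 books each. "
          "If 9 books are removed, how many books remain?\nAnswer:")
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```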

However, there are clear instances where the model’s reasoning seems sound but the end result is not quite right. Part of the problem is that it tends to overgenerate, producing more than what was actually asked for. This verbosity can lead it astray.

Potential Areas of Improvement

Phi 1.5, despite its impressive capabilities, shows some gaps. Occasional arithmetic mistakes and overgeneration are obvious areas for improvement. Given its structure, one could imagine fine-tuning it for specific tasks, or applying post-processing to clean up its outputs.
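
As a small illustration of the post-processing idea, one cheap mitigation is to cut the completion at the first marker suggesting the model has moved on to something it was never asked. The stop markers below are assumptions about typical output patterns, not a documented fix:

```python
def trim_overgeneration(text: str,
                        stop_markers: tuple[str, ...] = ("\nQuestion:", "\nExercise", "\n##")) -> str:
    """Cut the completion at the first marker that suggests the model
    started answering something it was never asked."""
    cut = len(text)
    for marker in stop_markers:
        idx = text.find(marker)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut].rstrip()

# Example: keep only the first answer from a verbose completion.
raw = "The answer is 39.\nQuestion: What is 7 times 8?\nAnswer: 56"
print(trim_overgeneration(raw))  # -> "The answer is 39."
```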

Final Thoughts

The Phi 1.5 model, introduced in ‘Textbooks Are All You Need II,’ is undoubtedly an impressive feat in the world of NLP. Its synthetic data approach, combined with the immense power of GPT-4, offers a promising step towards making AI even more useful and widespread.

While it has its flaws, as any model does, the opportunities it presents are vast. With more research, user feedback, and iterations, we could be looking at the next big thing in AI.

I urge researchers, developers, and enthusiasts to take the model for a spin. Whether you’re looking at its capabilities in a specific domain or just having fun chatting, there’s a lot to uncover.

Finally, as the AI community continues to push the boundaries, it’s important to remember the role we all play. Let’s continue the dialogue, share findings, and collectively push for advancements that benefit everyone.

Stay connected! Hit that follow button on Medium and connect with me on LinkedIn for a steady dose of data science and deep learning magic. 🚀📊🤖

Until next time, keep experimenting and stay curious!


azhar · azhar labs
Data Scientist | Exploring interesting research papers and concepts. LinkedIn: https://www.linkedin.com/in/mohamed-azharudeen/