Snowflake Arctic Cookbook Series: Arctic’s Approach to Data

On April 24, we released Snowflake Arctic with a key goal in mind — to be truly open. In line with that goal, the Snowflake AI Research team is writing a series of cookbooks to describe how to pretrain, fine-tune, evaluate, and serve large-scale MoEs such as Arctic. We will share our journey of training the Arctic model, our findings related to sourcing and composing pre-training data, designing MoE architectures, co-designing models with training and inference systems in mind, and methods for fine-tuning and evaluating the models.

For the full series, please go to Snowflake Arctic cookbook catalog.

Data is the cornerstone of high-quality Large Language Models (LLMs). It is the fuel that powers the intricate engine of an LLM, enabling it to learn, adapt, and evolve.

Doing a great job with the data recipe for LLMs is full of hard challenges:

  • The Voracious Token Appetite of LLMs: The modern LLM training stack is insatiable, requiring several trillions of tokens. These tokens need to be high-quality, domain-specific, and diverse. To put this in perspective, one trillion tokens is roughly equivalent to the content of 15 million books. Simply repeating tokens to meet this colossal requirement can lead to performance degradation. The pressing question remains: where can we source such an immense volume of data?
  • The Quest for Quality in Enterprise Data: For the enterprise tasks that Arctic zeroes in on, high-quality code and SQL data are paramount. Assembling a corpus of high quality tokens for these data sources is not trivial.
  • Processing Massive Raw Data: Once a research team has a large corpus of tokens at hand, the challenge shifts to processing it effectively and scalably.
  • Deciphering Data Composition and Curriculum: Understanding the makeup of our data and structuring a learning curriculum for LLMs is a complex puzzle to solve.

Arctic is trained on 3.5 trillion tokens sourced from the public domain, encompassing web content, code & SQL, STEM, and more.

In this blog, we go into the origins of our data sources and the methodologies employed to elevate them to the desired quality. We provide an overview of the approaches we’ve taken to tackle the first three challenges head-on. We describe our strategies for 1) assembling vast quantities of web data, 2) gathering high quality enterprise-focused datasets, and 3) data processing techniques and pipeline enhancements to refine data quality. By sharing an insider’s view of the data sources, techniques, and configurations that have proven successful for us, we aim to provide valuable insights to our readers.

At the end, we offer a sneak peek at how we address the fourth challenge, data composition and curriculum, which we will dive into in an upcoming blog on Arctic data.

Assembling High Quality Web Data

Start with high precision web crawl data as the base

Web crawls are a great starting place for pre-training an LLM. We started with high quality data sources that have been used extensively in the research literature: C4 (originally described in the T5 paper) and RefinedWeb (used to train the Falcon LLM). Both are pre-processed from the Common Crawl dataset to improve performance by discarding low quality text, and they already include language filtering, deduplication, and quality filtering. On top of this, we performed further document-level quality filtering based on KenLM perplexity scores, using fast n-gram models trained on Wikipedia; documents that look unlike Wikipedia receive a high perplexity score. We filtered out any document with a score greater than 1000. This reduced the number of documents by roughly 20%, but improved commonsense reasoning in our ablations by ~2%. These two data sources combined gave us ~750B tokens.
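
To make the perplexity filter concrete, here is a minimal sketch using the kenlm Python bindings. The model file name, preprocessing, and threshold handling below are illustrative assumptions rather than our exact pipeline; the idea is simply to score each document against a Wikipedia-trained n-gram model and drop high-perplexity documents.

# Minimal sketch of KenLM-based perplexity filtering (illustrative; the model
# path and preprocessing are assumptions, not our exact pipeline).
import kenlm

model = kenlm.Model("wikipedia_5gram.arpa.bin")  # hypothetical Wikipedia n-gram model
PERPLEXITY_THRESHOLD = 1000

def keep_document(text: str) -> bool:
    # Text that looks unlike Wikipedia gets a high perplexity and is dropped.
    return model.perplexity(text) <= PERPLEXITY_THRESHOLD

documents = ["Geneva is a city in Switzerland situated on Lake Geneva.",
             "!!! cLiCk h3re 2 w1n $$$ fr33 stuff !!!"]
kept = [d for d in documents if keep_document(d)]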

Where do we get high recall web crawl data?

Merely using high precision web crawl data is not sufficient to provide the volume of tokens we want. To add more, we used an annotated set of 84 different Common Crawl (CC) web dumps from Together.AI. This dataset, while gigantic, is not of the same quality as RefinedWeb and C4 on average. At the same time, it does come with useful annotations, including a perplexity score, over 40 quality signals, partition labels such as “head” and “middle”, and precomputed minhashes. This leads us to our next question.

How do we filter the high recall web data to match the high precision web data in quality?

We ran several dataset ablations to come up with a good set of filtering criteria for the high recall web crawl dataset. Holding the high precision data as the gold standard, we trained an MoE model on 27B tokens of that data. Let’s call this model high-precision-MoE.

For every candidate set of filtering criteria, we sampled 27B tokens from random documents in the high recall web dataset that met the criteria and trained another MoE model on this data. Call this filter-high-recall-candidate-MoE. We then hill-climbed on the filtering criteria until the performance of filter-high-recall-candidate-MoE matched that of high-precision-MoE. As our MoE architecture for ablations, we used a nimble 350M x 64 expert model, and for comparisons, we used a stable eval harness (the average of 9 commonsense metrics). Here is our exact filtering config:

See the Together.AI V2 web dataset description for variable definitions.

rps_doc_word_count < 50
rps_doc_word_count > 100000
rps_doc_mean_word_length < 3
rps_doc_mean_word_length > 10
rps_doc_symbol_to_word_ratio > 0.1
rps_doc_frac_lines_end_with_ellipsis > 0.3
rps_doc_frac_no_alph_words > 0.2
ccnet_perplexity > 1000000
rps_doc_frac_chars_dupe_10grams > 0.1
rps_doc_frac_chars_dupe_9grams > 0.11
rps_doc_frac_chars_dupe_8grams > 0.12
rps_doc_frac_chars_dupe_7grams > 0.13
rps_doc_frac_chars_dupe_6grams > 0.14
rps_doc_frac_chars_dupe_5grams > 0.15
rps_doc_frac_chars_top_2gram > 0.2
rps_doc_frac_chars_top_3gram > 0.18
rps_doc_frac_chars_top_4gram > 0.16

# The following need to be calculated indirectly from the raw signals.
# If the url is in the ut1 blacklist
ut1_blacklist = true
# If there are no stopwords
num_stopwords = 0
# If the fraction of lines that start with a bullet is greater than 0.9
max_percent_lines_start_with_bullet > 0.9
# If the fraction of lines that have at most 1 word is greater than 0.05
max_percent_lines_min_num_words > 0.05
# If the fraction of lines that are purely numeric is greater than 0.05
max_percent_lines_purely_numeric > 0.05
# If the fraction of lines that contain only upper case characters is greater than 0.05
max_percent_lines_too_uppercase > 0.05
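
To show how such a config can be applied, here is a simplified sketch that checks a few of the thresholds above against a document’s quality signals, assuming the signals have been flattened into a dict of scalars (the released dataset stores several of them as per-span lists, so a real pipeline needs an aggregation step first).

# Illustrative filter application; only a subset of the thresholds above is
# spelled out, and the flat-dict signal format is an assumption.
REMOVE_IF = [
    ("rps_doc_word_count",                   lambda v: v < 50 or v > 100000),
    ("rps_doc_mean_word_length",             lambda v: v < 3 or v > 10),
    ("rps_doc_symbol_to_word_ratio",         lambda v: v > 0.1),
    ("rps_doc_frac_lines_end_with_ellipsis", lambda v: v > 0.3),
    ("rps_doc_frac_no_alph_words",           lambda v: v > 0.2),
    ("ccnet_perplexity",                     lambda v: v > 1_000_000),
    ("rps_doc_frac_chars_dupe_10grams",      lambda v: v > 0.10),
    ("rps_doc_frac_chars_dupe_5grams",       lambda v: v > 0.15),
    ("rps_doc_frac_chars_top_2gram",         lambda v: v > 0.20),
]

def keep(signals: dict) -> bool:
    # A document is kept only if none of its signals trip a removal rule.
    for name, is_bad in REMOVE_IF:
        value = signals.get(name)
        if value is not None and is_bad(value):
            return False
    return True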

Even with a reasonable filtering config, there was still a puzzling quality gap between the high precision web data and the high recall web data, even after filtering the latter. We found that this gap arose because we were always using the most recent Common Crawl dump when running ablations. We removed this restriction and pulled from Common Crawl dumps going back 10 years, and, voila, we matched (actually slightly exceeded) the high precision dataset’s performance. Our hypothesis is that more sites block Common Crawl in recent years, so crawl quality may be declining over time.

Following this work, we were able to extract around 2.5T tokens from the web to use in pre-training.

Figure 1: Quality improvements, measured by our commonsense eval average, from different filtering techniques applied to the high recall web data

Data for Enterprise Tasks

At Snowflake, we see a consistent pattern in AI needs and use cases from our enterprise customers. Enterprises want to use LLMs to build conversational SQL data copilots, code copilots, and RAG chatbots. From a metrics perspective, this translates to LLMs that excel at SQL, code, complex instruction following, and the ability to produce grounded answers.

To start making our model better at enterprise tasks, we focused on pulling in more SQL, programming, and math data.

Programming Data

Since we wanted our model to be good at complex programming tasks, it was important to train with a large amount of high quality coding data. We assembled our code dataset from several sources, including StarCoder, the public GitHub dataset, and PyPI. We used the GitHub API to collect a list of all repositories with at least 10 stars to ensure a minimum quality standard.
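
As a rough illustration of the star filter, the sketch below queries the public GitHub search API for repositories with at least 10 stars. The search endpoint caps how many results a single query can return, so a production pipeline would shard queries (for example by creation date or star buckets); the token environment variable is also an assumption.

# Hedged sketch: list repositories with >= 10 stars via the GitHub search API.
import os
import requests

def search_repos(min_stars: int = 10, page: int = 1):
    resp = requests.get(
        "https://api.github.com/search/repositories",
        params={"q": f"stars:>={min_stars}", "per_page": 100, "page": page},
        headers={"Authorization": f"token {os.environ['GITHUB_TOKEN']}"},  # assumed env var
        timeout=30,
    )
    resp.raise_for_status()
    return [item["full_name"] for item in resp.json()["items"]]

print(search_repos()[:5])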

Add file-level attributes to make data easily filterable

Following StarCoder, we added file-level attributes to the GitHub data we extracted to make it filterable for quality. We used the Snowflake cloud data warehouse to process our data, with one table for code metadata (filenames, stars, license) and another for the content itself, keyed by content hash. We added columns to the metadata table for any attribute we might want to use for filtering or sorting. Once the dataset was collected, we labeled each file’s metadata with the same filterable attributes that StarCoder uses, such as noisy_html (whether there is too much boilerplate in a .html file), percent_alphanum (what percentage of the file is alphanumeric characters), and pct_html_visible (what percentage of the HTML document would be visible if rendered). These attributes came in handy for creating the highest quality data.
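
For intuition, here is a small sketch of two of the pieces mentioned above: the percent_alphanum attribute and the content hash used to key the content table. The exact StarCoder definitions may differ slightly; this is illustrative.

# Illustrative file-level attributes (exact StarCoder definitions may differ).
import hashlib

def percent_alphanum(content: str) -> float:
    # Share of alphanumeric characters; very low values often indicate
    # minified, encoded, or otherwise noisy files.
    if not content:
        return 0.0
    return sum(c.isalnum() for c in content) / len(content)

def content_hash(content: str) -> str:
    # Key used to join the metadata table with the content table.
    return hashlib.sha256(content.encode("utf-8", errors="ignore")).hexdigest()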

Add repo-level attributes to identify duplicate repositories

In addition to file-level attributes, we labeled repositories with whether or not we believed they were duplicates. We used the GitHub fork bit wherever available. Unfortunately, it’s surprisingly common for repositories to have been forked via git clone instead of the fork UI in GitHub, so the fork signal is not always present in the GitHub API response. To address this, we also deduced whether two repositories were duplicates (or at least very similar) from their content overlap and any shared commit history.
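
The sketch below shows one way to express that idea: two repositories are flagged as likely duplicates if they share any commit SHAs, or if the Jaccard overlap of their file content hashes is high. The 0.8 threshold is an assumption for illustration, not the exact value we used.

# Hedged sketch of repo-level duplicate detection.
def likely_duplicates(repo_a, repo_b, min_content_overlap=0.8):
    # Each repo is represented as {"file_hashes": set[str], "commit_shas": set[str]}.
    if repo_a["commit_shas"] & repo_b["commit_shas"]:
        return True  # clone-and-push "forks" still share commit history
    union = repo_a["file_hashes"] | repo_b["file_hashes"]
    if not union:
        return False
    jaccard = len(repo_a["file_hashes"] & repo_b["file_hashes"]) / len(union)
    return jaccard >= min_content_overlap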

List of attributes collected and used in filtering.

Decontamination, and all that good stuff

We followed the DeepSeek Coder recipe for decontamination. We annotated all content in our code repository with an is_contaminated bit if it has textual overlap with any of our evaluation data. There are a surprisingly large number of MMLU snippets floating around on GitHub! It’s very easy for model trainers to accidentally train on the same data they later evaluate on.
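
Here is a simplified sketch of the is_contaminated check: a document is flagged if it shares any 10-word n-gram with a benchmark snippet. The n-gram size and whitespace tokenization are assumptions for illustration, not necessarily the exact DeepSeek Coder settings.

# Hedged decontamination sketch: flag documents that overlap with eval data.
def word_ngrams(text: str, n: int = 10):
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def build_eval_index(eval_snippets, n: int = 10):
    index = set()
    for snippet in eval_snippets:
        index |= word_ngrams(snippet, n)
    return index

def is_contaminated(document: str, eval_index: set, n: int = 10) -> bool:
    return not word_ngrams(document, n).isdisjoint(eval_index)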

Topological sorting

After generating the data, we followed the technique pioneered by DeepSeek-Coder, in which all the files within a repository are topologically sorted by the order in which they are imported. To do this, we first broke repositories up by their constituent programming languages. Then we constructed a dependency graph of which file imports which. We then traversed this graph to produce a single, large document per repository and language in which imported files appear before the files that import them. Whenever there were ties, we ordered the files lexicographically, taking their directory depth into account. Finally, experiments gave us a directional indication that led us to also add the data in its original, ungrouped form to our data recipe.
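
The sketch below illustrates the ordering step on a per-repository dependency graph: files with no in-repo imports come out first, ties are broken by directory depth and then lexicographically, and files caught in import cycles are appended at the end. Extracting the imports themselves is language-specific and is assumed to have already happened.

# Illustrative topological ordering of files within one repo and language.
import heapq

def topo_sort_files(deps):
    # deps maps each file path to the set of in-repo file paths it imports.
    indegree = {f: 0 for f in deps}
    dependents = {f: [] for f in deps}
    for f, imported in deps.items():
        for d in imported:
            if d in indegree:
                indegree[f] += 1
                dependents[d].append(f)
    # Ready files are ordered by directory depth, then lexicographically.
    ready = [(p.count("/"), p) for p, deg in indegree.items() if deg == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, path = heapq.heappop(ready)
        order.append(path)
        for nxt in dependents[path]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, (nxt.count("/"), nxt))
    # Anything left over sits in an import cycle; append it at the end.
    order += sorted(set(deps) - set(order))
    return order

print(topo_sort_files({"app/main.py": {"app/utils.py"}, "app/utils.py": set()}))
# ['app/utils.py', 'app/main.py']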

Tokenization and ablations

Each of the filter criteria we have mentioned is added to our code metadata tables. We then wrote a Snowflake query to select data per programming language with various quality and software-license filters.

At this point we can finally tokenize these documents for pre-training. At the end of the process, we have a cleaned, topologically sorted, and tokenized dataset per programming language. This allows us to specify the exact mixture of tokens we want, not just over code as a whole, but over each programming language in our dataset.
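
As a toy illustration of why per-language token counts are useful, the sketch below fills a per-language token budget from already-filtered documents. The weights and the stand-in tokenizer are placeholders, not Arctic’s actual mixture or vocabulary.

# Toy per-language token budgeting (weights and tokenizer are placeholders).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # stand-in tokenizer only
TOKEN_BUDGET = {"python": 60_000, "sql": 30_000, "java": 10_000}  # made-up numbers

def fill_budget(docs_by_language):
    selected = {lang: [] for lang in TOKEN_BUDGET}
    for lang, docs in docs_by_language.items():
        remaining = TOKEN_BUDGET.get(lang, 0)
        for doc in docs:
            if remaining <= 0:
                break
            ids = tokenizer.encode(doc)
            selected[lang].append(ids)
            remaining -= len(ids)
    return selected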

Coding, SQL, and Math topics from Web Corpus

To improve our model quality in math and coding (Python and SQL in particular), we selected subsets of web crawl data closely relevant to those topics: tutorials, blog posts, documentation and textbooks.

Surfacing high quality coding domains and URLs

Processing terabytes of web data and doing fuzzy matches to filter it is a hard problem. Luckily for us, we work at the best enterprise data cloud company in the world. Once terabytes of web data and metadata were ingested into Snowflake, it was easy to surface the relevant web pages using Snowflake SQL. For example, we were able to source coding data through simple queries:

  1. looking at URLs — e.g., does a URL contain “python”?
  2. looking at other signals and metadata — e.g., was it linked with “Python” in the title?

We went one step further and indexed the web data to allow us to search the web content itself. This meant we could search for web pages containing phrases such as “Python programming”.

For each topic at hand, we used an expansion process to generate a long list of URLs relevant to the topic. We started by establishing a base set of high precision URLs containing content relevant to the topic at hand (e.g., “math”). Then, we augmented this list by calculating the “lift” (or popularity score) of each domain for our topic using Snowflake SQL, determined by the frequency of each domain among topic-matched pages. If a domain’s lift surpassed a particular threshold, we included all pages from that domain in our set. Once we had our expanded set of URLs, we employed straightforward quality filters, again using SQL, to eliminate any low-quality web pages.
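
A stripped-down version of the domain-expansion step might look like the sketch below: count how often each domain appears among URLs already matched to the topic, and promote every domain above a threshold. The lift definition and the tiny threshold here are simplified assumptions for illustration; our production version ran as Snowflake SQL over the full corpus.

# Hedged sketch of domain expansion by "lift" over topic-matched URLs.
from collections import Counter
from urllib.parse import urlparse

def expand_domains(topic_urls, min_lift=2):
    domain_counts = Counter(urlparse(u).netloc for u in topic_urls)
    return {domain for domain, count in domain_counts.items() if count >= min_lift}

example_urls = ["https://mathsite.org/algebra", "https://mathsite.org/calculus",
                "https://blog.example.com/one-math-post"]
print(expand_domains(example_urls))  # {'mathsite.org'}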

An interesting sidebar: to get the actual content from an HTML web page, you can use an open-source Python library such as Beautiful Soup or boilerpy3 (the former is available in our Anaconda channel, whereas the latter can be imported via a stage), embedded as a UDF (user-defined function) in a Snowflake SQL statement. This makes it fast and easy to transform your data entirely within Snowflake.
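
The extraction function itself can be as small as the sketch below; this is the kind of logic one would wrap as a Snowflake Python UDF and call from SQL (registration details omitted here).

# HTML-to-visible-text helper using Beautiful Soup.
from bs4 import BeautifulSoup

def html_to_text(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    # Drop non-visible blocks, then collapse the remaining text.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())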

Other high quality datasets

Additionally, we added a few other sources of data that we found very useful for some targeted capabilities during the pre-training phase.

Data Processing Techniques

How did we filter the data to ensure it was high quality?

It is important for downstream performance to filter the data to ensure it is high quality and safe. We focused on deduplication, removing gibberish, and filtering out pornographic content.

We label a document as redundant if its content is very repetitive. To decide whether a document is redundant, we count the frequency of n-grams; if an n-gram appears too often relative to the length of the document, we consider that document too redundant. For example, if the 6-gram “data is still loading from source” made up 70% of a document, we would remove the document.
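
The sketch below captures the spirit of that check: estimate how much of a document is covered by its single most frequent word n-gram and drop the document above a cutoff. Both the coverage estimate and the 0.3 cutoff are illustrative assumptions rather than our exact settings.

# Hedged redundancy check based on the most frequent n-gram's coverage.
from collections import Counter

def top_ngram_coverage(text: str, n: int = 6) -> float:
    words = text.split()
    if len(words) < n:
        return 0.0
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    ngram, freq = counts.most_common(1)[0]
    covered_chars = freq * len(" ".join(ngram))
    return min(1.0, covered_chars / max(1, len(text)))

def too_redundant(text: str, threshold: float = 0.3) -> bool:
    return top_ngram_coverage(text) > threshold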

We can compute similar word statistics to throw out documents with garbage text, looking at word length, the number of ellipses, and the number of bullets. In addition to word statistics, we also used the Wikipedia-based KenLM model to filter out very high perplexity text. The full set of filters we used is listed in the web data section above.

For pornographic text, we used a blocklist of words and a separate blocklist of URLs to ignore. All of these filtering heuristics were inspired by RefinedWeb, C4, and the Gopher paper.

How did we deduplicate across documents?

We employed a MinHash + Locality Sensitive Hashing (LSH) pipeline to dedupe across documents. In particular, we followed the approach from the literature that reduces the quadratic problem of comparing all pairs of minhashes to a linear look-up via connected components. To do this, each minhash signature was partitioned into a fixed number of parts (call them sub-hashes), and any two documents sharing a sub-hash were considered duplicates and part of the same connected component. We determined the number of sub-hashes per minhash by optimizing the false positive and false negative rates on a holdout set with pre-calculated duplication information, which resulted in 14 sub-hashes as our default.
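
To make the sub-hash idea concrete, here is a simplified sketch: each document gets a MinHash signature, the signature is cut into 14 bands, and any two documents that share a band land in the same candidate-duplicate bucket. The signature length, shingling, and hash scheme below are illustrative assumptions, not our production implementation.

# Hedged MinHash + banding (sub-hash) sketch for near-duplicate candidates.
import hashlib
from collections import defaultdict

NUM_PERM = 112   # signature length (assumed); divisible by the 14 bands
NUM_BANDS = 14

def minhash_signature(text: str, num_perm: int = NUM_PERM):
    shingles = {text[i:i + 5] for i in range(max(1, len(text) - 4))}
    signature = []
    for seed in range(num_perm):
        signature.append(min(
            int.from_bytes(hashlib.sha1(f"{seed}:{s}".encode()).digest()[:8], "big")
            for s in shingles))
    return signature

def band_keys(signature):
    rows = len(signature) // NUM_BANDS
    for b in range(NUM_BANDS):
        yield (b, tuple(signature[b * rows:(b + 1) * rows]))

def candidate_duplicate_groups(docs):
    # docs: {doc_id: text}; documents sharing any band become candidates.
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        for key in band_keys(minhash_signature(text)):
            buckets[key].append(doc_id)
    return [ids for ids in buckets.values() if len(ids) > 1]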

If done without proper optimization, the total size of the minhashes exceeds 10TB for some of our datasets, which would not fit in memory. We leveraged (again) the data processing capabilities of Snowflake to address this. To perform a global fuzzy connected components computation in Snowflake, we first imported the minhashes and created a table where each row consisted of a document and a minhash partition. We then applied a window function to keep only one instance per sub-hash. Note that this doesn’t actually compute connected components, as one document can have multiple edges. However, we made a best effort to fix this: we first removed all min clusters that were connected to another, smaller min cluster. While this approach fails for potential clusters that have nodes at a distance greater than 2 and are not covered by the other removal strategies, we measured this to be an insignificant percentage of the docs that should be removed (<1%).

Preview of Arctic Data Composition

The final question we need to address is how we compose these datasets into a pre-training dataset and schedule. Arctic was trained with a three-stage curriculum, each stage with a different data composition: the first phase (1T tokens) focuses on generic skills, and the latter two phases (1.5T and 1T tokens) focus on enterprise skills. A high-level summary of our dynamic curriculum is shown in Table 1. In a follow-up blog, we will dive deep into the data composition of our model, what techniques worked, and, equally importantly, what didn’t.

Table 1. Dynamic data composition for three-phase training of Arctic with emphasis on Enterprise Intelligence.

Learn more in our Snowflake Arctic series

Check out our other blog posts that dive into Snowflake Arctic training, including data cleaning, training system design, and model/system co-design for optimal throughput. Stay tuned as more updates continue to drop in the Snowflake Arctic cookbook catalog.
