Decoding Grok-1: Rust-Powered AI, Benchmarks, and the Uncharted Territory of Real-Time Insights

Lucas Mazza
6 min read · Nov 23, 2023


Image credit: xAI

Today, I subscribed to X Premium+, expecting access to Grok as advertised by xAI, but I quickly found out this wasn’t the case. According to Musk, users like me will have access to the chatbot next week, but Musk is not known for accurate timelines.

However, time is of the essence because OpenAI is as vulnerable as ever after its recent leadership shake-up.

As a student, I find it essential to keep up with new technology, so I want to look into how Grok was built, its benchmark results, and what it plans to offer over other models.

Grok-0 / Grok-1:

xAI calls the engine that powers the chatbot Grok-1, the successor to its Grok-0 prototype, and positions it as able to compete with any other engine in its class. The model is an autoregressive transformer-based model that predicts the next token, limited to 8,192 context tokens. Grok-0 was trained at 33 billion parameters, fairly small compared to GPT-4’s rumored 1.76 trillion; xAI has not published Grok-1’s parameter count, but describes it as significantly more capable, and the model will continue to grow with time. Grok-1 is trained on internet data up to Q3 of 2023 as well as human-reviewed data. In addition, it has access to all of X.com, providing a vast amount of real-time data. More information can be found in xAI’s announcement at x.ai.
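For readers unfamiliar with the jargon: “autoregressive” simply means the model generates one token at a time, feeding each predicted token back in as input until it decides to stop or hits the context limit. Here is a minimal Python sketch of that loop; the toy next_token function is a stand-in for a real transformer, not anything from Grok.

```python
# Toy illustration of autoregressive decoding: predict a token,
# append it to the sequence, and repeat. Grok-1 works this way,
# with a hard limit of 8,192 context tokens.
CONTEXT_LIMIT = 8192

def next_token(tokens: list[str]) -> str:
    # Stand-in for a real transformer forward pass.
    return "world" if tokens[-1] == "hello" else "<eos>"

def generate(prompt: list[str]) -> list[str]:
    tokens = list(prompt)
    while len(tokens) < CONTEXT_LIMIT:
        token = next_token(tokens)
        if token == "<eos>":  # the model signals it is done
            break
        tokens.append(token)
    return tokens

print(generate(["hello"]))  # ['hello', 'world']
```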

The Peculiar Backbone of Grok:

The most surprising thing about Grok is the engineering team’s extensive use of Rust. This is a big contrast to GPT, whose stack is primarily written in Python.

xAI argues that Rust has an excellent track record for building scalable, reliable systems that have fewer bugs and require less oversight. They aren’t wrong, but Python has been the dominant language in machine learning for years. Only time will tell if this was the right decision, and it will be interesting to see whether it ultimately proves that Python is not the optimal language on the road to AGI.

Elon even joined the conversation on X and supported the language.

Benchmarks:

[Benchmark results table comparing Grok to other models; source: x.ai]

GSM8k: Middle school math word problems (Cobbe et al. 2021), using the chain-of-thought prompt.

MMLU: Multidisciplinary multiple-choice questions (Hendrycks et al. 2021), provided 5-shot in-context examples.

HumanEval: Python code completion task (Chen et al. 2021), zero-shot, evaluated for pass@1.

MATH: Middle school and high school mathematics problems written in LaTeX (Hendrycks et al. 2021), prompted with a fixed 4-shot prompt.

Source: https://x.ai/

I want this article to provide context for the tests xAI performed. So, before we get into the results, let me explain what N-shot learning is and add some background on each test.

A shot refers to the number of worked examples a model is given before it makes a prediction. For instance, 5-shot means the model was given five examples in its prompt before predicting, and 0-shot means it was given none.
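As a concrete illustration, here is a small Python sketch of how an N-shot prompt might be assembled; the example questions and answers are invented for demonstration.

```python
# Building N-shot prompts: the model sees `shots` worked examples
# before the real question. All example data here is made up.
EXAMPLES = [
    ("What is 7 + 5?", "12"),
    ("What is 9 - 3?", "6"),
    ("What is 4 * 6?", "24"),
    ("What is 18 / 2?", "9"),
    ("What is 11 + 8?", "19"),
]

def build_prompt(question: str, shots: int) -> str:
    parts = [f"Q: {q}\nA: {a}" for q, a in EXAMPLES[:shots]]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

print(build_prompt("What is 14 + 3?", shots=5))  # 5-shot prompt
print(build_prompt("What is 14 + 3?", shots=0))  # 0-shot prompt
```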

Now, time for the tests themselves.

GSM8k

As detailed by xAI, GSM8k is a dataset of middle school math problems; a few examples are provided in the GitHub repository.

According to the README, the dataset is a compilation of 8.5K problems. Each problem takes between two and eight steps to solve and requires only basic arithmetic: addition, subtraction, multiplication, and division.
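To make that concrete, here is a problem in the GSM8k style (my own illustrative example, not taken from the dataset), along with the kind of chain-of-thought reasoning the benchmark rewards: “A baker makes 24 muffins every morning and sells each for $2, spending $15 per day on ingredients. How much profit does she make in a 5-day week?” Working step by step: 24 × $2 = $48 of revenue per day; $48 - $15 = $33 of profit per day; $33 × 5 = $165 for the week.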

Next, we have MMLU, a dataset of multiple-choice questions spanning 57 subjects, from elementary material up to professional-level history, chemistry, and math. In the past, LLMs had relatively low success on this test, with early models scoring barely above random chance, according to the paper Measuring Massive Multitask Language Understanding (Hendrycks et al. 2021).

Here is an example of a question presented to the model:

Which factor will most likely cause a person to develop a fever?

a. a leg muscle relaxing after exercise

b. a bacterial population in the bloodstream

c. several viral particles on the skin

d. carbohydrates being digested in the stomach

The label provided would then be the correct answer choice, which in this case is B. Just for fun, I gave GPT-3.5 and Google Bard the question, and they were both able to select the correct answer choice.

HumanEval

The next benchmark is HumanEval, which consists of 164 programming problems. Think of easy or medium-level Leetcode questions. Here is an example featured in the dataset:

“Check if in given list of numbers, are any two numbers closer to each other than given threshold.
>>> has_close_elements([1.0, 2.0, 3.0], 0.5)
False
>>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
True”

The solution the model generates is then run against a set of unit tests that check its correctness. More info can be found in the official paper.
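To make the evaluation concrete, here is one plausible completion of that prompt (my own sketch, not Grok’s actual output). The benchmark executes hidden unit tests against whatever body the model writes, and pass@1 measures how often a single generated sample passes all of them.

```python
from typing import List

def has_close_elements(numbers: List[float], threshold: float) -> bool:
    """Check if any two numbers in the list are closer than threshold.

    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3)
    True
    """
    # Compare every unordered pair of numbers once.
    for i, a in enumerate(numbers):
        for b in numbers[i + 1:]:
            if abs(a - b) < threshold:
                return True
    return False
```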

Finally, the MATH benchmark features more complex problems than you would find in MMLU. According to the official paper, the dataset contains 12,500 problems taken from math competitions, each written in LaTeX; examples are featured in the project’s GitHub repository.
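To give a flavor of the format (an illustrative problem of my own in the dataset’s style, not an actual entry), each MATH item pairs a LaTeX-formatted question with a worked solution ending in a boxed answer:

```latex
Problem: If $f(x) = 3x + 2$, what is $f(f(1))$?

Solution: $f(1) = 3(1) + 2 = 5$, so
$f(f(1)) = f(5) = 3(5) + 2 = \boxed{17}$.
```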

Ultimately, math and scientific reasoning remain a weakness of LLMs, a problem that xAI hopes to address.

Let’s continue with the benchmark results.

Grok is … good?

I will display the results of the testing again for easy access.

[Benchmark results table, repeated; source: x.ai]

Based on their results, GPT-4 remains the king, followed closely by Claude 2, created by Anthropic, and PaLM 2, created by Google. Grok-1’s strongest category was MMLU, where it scored 73%, and its weakest was the MATH test. MATH is a weak spot for the stronger models too, though the more advanced models perform best on GSM8k rather than MMLU. It is important to note that Grok-1 is in a different class than GPT-4 and has not been trained on nearly as much data; xAI frames these results as significant progress relative to the computing power and training data the model was given.

In addition to the previous tests, xAI also ran a hand-graded exam designed to rule out data contamination (the possibility that these models were simply trained on the benchmark data). To do this, they used the 2023 Hungarian national high school finals in mathematics, which was published after the models’ training data was collected. The exam is available here:

https://dload-oktatas.educatio.hu/erettsegi/feladatok_2023tavasz_kozep/k_matang_23maj_fl.pdf

Here are the results:

[Hungarian national high school finals results table; source: x.ai]

Grok scored 59%, which is low in absolute terms but second only to the much more robust GPT-4, which scored 68%. Compared to its showing on the conventional benchmarks, this is an exciting result. It also suggests that Grok performs better than expected on reasoning questions it has never seen, which, according to Musk, is ultimately the goal.

That prompts the question: what advantages is Grok-1 expected to have over the rest of the industry?

The main advantage xAI will have is access to real-time information. Grok can draw on the entirety of X.com, which means it can shape its responses around breaking news, candid opinions, and trending topics. No other model has access to such a vast pool of up-to-date, real-world data, and this is where the real advantage will start to show. Grok can summarize news stories, report on trending themes, answer questions about specific X accounts, and more. The possibilities are endless.
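xAI has not published how this integration works under the hood, but a reasonable mental model is retrieval-augmented prompting: fetch fresh posts, then fold them into the model’s context. Here is a rough Python sketch; fetch_recent_posts and all the data in it are hypothetical stand-ins, not a real X or xAI API.

```python
from datetime import datetime, timezone

def fetch_recent_posts(topic: str) -> list[str]:
    # Hypothetical stand-in for a live X.com search API.
    return [
        "Launch pushed to Friday due to weather.",
        "Forecast looks clear for the new window.",
    ]

def build_grounded_prompt(question: str) -> str:
    # Fold retrieved posts and the current time into the prompt
    # so the model can answer with up-to-date context.
    posts = fetch_recent_posts(question)
    context = "\n".join(f"- {p}" for p in posts)
    now = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    return (
        f"Current time: {now}\n"
        f"Recent posts on X:\n{context}\n\n"
        f"Using the posts above, answer: {question}"
    )

print(build_grounded_prompt("When is the launch now scheduled?"))
```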

Conclusion

The future for Grok is exciting, and with time it should begin to provide some real advantages over other models. The unconventional use of Rust, a small development team, and a vision led by one of the most prominent entrepreneurs make the company’s future interesting to watch. I plan to look deeper into Grok next week (hopefully), when access is given to all X Premium+ members.

I hope this has given you a better perspective on Grok’s current state and the information available so far.


Lucas Mazza

I’m currently a student, and I write about topics I think are interesting. I hope to improve both myself and my readers through writing.