AI Showdown: Microsoft Copilot vs. Google Gemini vs. ChatGPT 3.5 vs. Mistral vs. Claude 3

Sorin Ciornei
Published in thereach.ai
16 min read · Mar 6, 2024


A Comprehensive Guide to the Best AI Assistants in 2024

With the upgrade of two AI chatbots in the first two months of this year, it’s time for a new comparison between the giants. If you remember, we already did a comparison between ChatGPT, Copilot and Gemini (this was before Gemini got upgraded). If you want to read that article, check it out here.

Now we have two more tools to compare: Claude, which recently got an upgrade to three new models (we will talk about what they are), and Mistral, from the French startup backed by Microsoft, which claims to be very close to ChatGPT 4.

This is what we tested:

  • Copilot in Creative mode, which uses ChatGPT 4
  • ChatGPT 3.5
  • Gemini free
  • Mistral free
  • Claude 2.1 and 3

I want to focus more on the free versions, since this is what most people will use. I believe people will trial them and, once they are happy with the results, possibly upgrade.

Image generated by Midjourney, prompt: painting in the style of kristoffer zetterstrand showing a five colorful robots having fun in a park, jumping holding hands

Test scenarios

  • Content writing (since we skipped it the first time)
  • Coding in python
  • Riddles test
  • Math (simple test)
  • Creative writing

If you want to stay connected with me and read my future articles, you can subscribe to my free newsletter. You can also reach out to me on Twitter, Facebook or Instagram. I’d love to hear from you!

Testing Content Writing For Copilot, Gemini, ChatGPT 3.5, Mistral and Claude 3

The prompt was to “write a blog post about the benefits of doing art workshops”. Here are the results:

Copilot Creative response
Gemini Free response
Claude 3 Sonnet response
Claude 2.1 response
ChatGPT 3.5 response
Mistral response

A few things about Claude. Anthropic is still locked to the US/UK exclusively, and since I live in the EU I would need a VPN to test it normally. What I did instead was to sign up for the workbench (which gives you API access), where you can still use all versions of Claude. This is also why the screenshots might look different from the usual Claude pictures you see online. Here is the dashboard:

Claude API workbench

What I noticed is that Opus, the best version of Claude (the one that surpasses ChatGPT 4 in Anthropic's benchmarks), does not respond to some of the prompts. It's very eager to answer coding and math questions, but not creative writing ones. For this content writing test we used Sonnet (the second-best Claude version). Read to the end of this article for what these versions are and how they differ.

As for the content writing test, Copilot and Gemini lead on formatting, since they numbered the sub-titles properly. Overall this is a matter of preference: whatever you like and works for you. I think all of them provided a reasonable response; the interesting part for me is that Claude 3 Sonnet's response reads more like a rhetorical marketing pitch. In this case it works, and is quite possibly a bit more compelling, while all the others gave a standard blog-post style response.

Testing Coding In Python For Copilot, Gemini, ChatGPT 3.5, Mistral and Claude 3

Here I tried to replicate a prompt I saw on Twitter, the prompt I used is:

“Using Python, code a currency converter app with USD, EUR, and CAD. Make it colorful. Can you write the code with fixed conversation rates please, for an online compiler. It also needs to include a GUI dashboard”

Now, this would be a LOT easier to test if you 1) know what you are doing and 2) have a local Python interpreter instead of an online one. But I wanted to keep it simple, after reading a lot of articles and watching YouTube videos of the "never coded and just created my app with AI" variety.

What would a person without coding experience want in terms of response and instructions? I had to update the prompt a few times, and (this is the funny part) ChatGPT 3.5 got everything correct in one go while Copilot with ChatGPT 4 did not. The reason I felt this might not be fair is that the code would have worked with a local interpreter instead of an online one. In any case, even after updating the prompt, here are the results:

Mistral gave an error, and Copilot did the same. Sure, you can find out why the error happened and fix it, but in this case the task is to get a currency converter in the fewest steps possible. ChatGPT 3.5 forgot about the color, but at least it got the app working. Claude 3 Opus forgot the button and forgot to display the results. ChatGPT 3.5 wins this round.
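For context, the core conversion logic the chatbots were asked for is tiny. Here is a minimal sketch with hypothetical fixed rates (the rates below are illustrative examples, not real market data; a GUI layer such as Tkinter would sit on top of this):

```python
# Minimal currency conversion logic with fixed, illustrative rates.
# The rates below are hypothetical examples, not real market data.
RATES_TO_USD = {"USD": 1.0, "EUR": 1.08, "CAD": 0.74}  # value of 1 unit in USD

def convert(amount, src, dst):
    """Convert an amount from src to dst currency via USD."""
    usd = amount * RATES_TO_USD[src]
    return round(usd / RATES_TO_USD[dst], 2)

print(convert(100, "USD", "EUR"))  # 92.59
```

A GUI on top of this would only need to read the amount and currency choices from input widgets and call `convert`, which is why getting a fully working app in one prompt is a reasonable ask.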

Riddle Test For Copilot, Gemini, ChatGPT 3.5, Mistral and Claude 3

Here are the riddles we used, same ones as before in our previous comparison.

  • Riddle 1 — I’m light as a feather, but not even the strongest person can hold me for more than 5 minutes. What am I? Breath
  • Riddle 2 — How many bananas can you eat if your stomach is empty? One
  • Riddle 3 — What is 3/7 chicken, 2/3 cat and 2/4 goat? Chicago
  • Riddle 4- Using only addition, add eight 8s to get the number 1,000. 888 + 88 + 8 + 8 + 8 = 1000
  • Riddle 5- If a hen and a half lay an egg and a half in a day and a half, how many eggs will half a dozen hens lay in half a dozen days? Two dozen. If you increase both the number of hens and the amount of time available four-fold (i.e., 1.5 x 4 = 6), the number of eggs increases 16 times: 16 x 1.5 = 24.
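The arithmetic behind riddle 5 is easy to verify; a quick check using exact fractions:

```python
from fractions import Fraction

# 1.5 hens lay 1.5 eggs in 1.5 days, so the per-hen-per-day rate is 2/3.
rate = Fraction(3, 2) / (Fraction(3, 2) * Fraction(3, 2))
eggs = rate * 6 * 6  # half a dozen hens over half a dozen days
print(eggs)  # 24, i.e. two dozen
```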

Here are the results:

  • All of them knew to answer “breath”.
  • Mistral explained how big a stomach is and tried to look at the riddle from a volume perspective. Claude 3 Opus didn’t accept the prompt, Claude 3 Sonnet told me it doesn’t eat, being an AI and all. I then changed the prompt to “how many bananas can I eat” — it told me to visit a nutritionist for a more tailored response.
  • Both Mistral and Claude tried to literally split the farm animals to figure out what I want. Claude Opus couldn’t (or wouldn’t) read the prompt, Claude Sonnet and Claude 2.1 did but they didn’t figure it out.
  • Mistral aced it, Claude Opus couldn’t read the prompt, Claude Sonnet failed and surprisingly, Claude 2.1 got it right!
  • Mistral got the eggs right, Claude 2.1 failed, Claude Sonnet failed, but again, surprisingly, Claude Opus finally got a prompt it liked and gave a correct response! Seems Claude Opus has a soft spot for a coding session while munching on some egg sandwiches.

Math Test For Copilot, Gemini, ChatGPT 3.5, Mistral and Claude 3

The prompt I used here is very simple: "is 89 a prime number?". The answer: yes, 89 is prime.
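For the record, 89's primality is trivial to verify with trial division; a minimal check:

```python
def is_prime(n):
    """Trial division up to the square root of n."""
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

print(is_prime(89))  # True: 89 is divisible only by 1 and itself
```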

I will list the outliers. Gemini, ChatGPT 3.5 and Mistral got the answer correct. Copilot got it correct as well, and apparently decided my line of questioning was a bit dry, so it even added a visual aid to spice things up!

Copilot Creative response

Then there is Claude. I sound like a broken record, but here is what I got: Claude Opus didn’t indulge in answering, Claude 2.1 got it wrong and Claude Sonnet said this:

Claude 3 Sonnet response

Creative Writing Test For Copilot, Gemini, ChatGPT 3.5, Mistral and Claude 3

I used the same prompt I saw people trying on Twitter, to see if I could replicate the answers. At the end of the day, one important criterion for these tools is consistency.

Prompt: “Write ten sentences that end with ‘th’”

Let’s look at the results:

ChatGPT 3.5 response
Gemini Free response
Copilot Creative response
Mistral response
Claude 3 Sonnet response

ChatGPT 3.5 got one correct answer, Gemini none, Copilot with GPT-4 all of them (though it does feel a bit like it took the easy way out), Mistral none, and Claude 3 Sonnet got three correct.
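Scoring this test by hand is tedious; a small helper can flag which lines qualify. The sample sentences below are made up for illustration, not the chatbots' actual outputs:

```python
def ends_with_th(sentence):
    """True if the sentence's last word ends in 'th', ignoring punctuation."""
    return sentence.rstrip(" .!?\"'").lower().endswith("th")

# Hypothetical sample outputs, not the chatbots' real answers.
samples = [
    "She finally told the truth.",
    "The hikers followed the narrow path.",
    "It was a sunny day.",
]
print([ends_with_th(s) for s in samples])  # [True, True, False]
```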

Bonus Test — Finance Test for Copilot, Gemini, ChatGPT 3.5, Mistral and Claude 3

I was trying to help a friend with some advice and needed to find a percentage. Easy task but I decided to try out our chatbots to see what they say, since I was right in the middle of writing this comparison.

The question is perhaps phrased in a tricky way, but that's the fun of it: "what is 3000 to 4000 in percentage, the difference being a discount". The results are interesting, mainly because of how each of them got to the answer. Here are the answers, with a quick disclaimer: I know the final answer, but I did not spend time checking whether the AIs reached it in a logical way:

ChatGPT 3.5 response
Gemini Free response
Copilot response
Mistral response
Claude 3 Sonnet response

You can see in Mistral's answer why the phrasing of this prompt was a bit tricky: if you calculate the other way around, you end up with 33%.
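The ambiguity is easy to see in code: the same 1,000 difference gives a different percentage depending on which number you treat as the base:

```python
difference = 4000 - 3000

# 4000 as the original price, 1000 as a discount off it:
discount_pct = difference / 4000 * 100   # 25.0
# 3000 as the base, asking how much bigger 4000 is:
increase_pct = difference / 3000 * 100   # ~33.33

print(round(discount_pct, 2), round(increase_pct, 2))
```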

Which one is for you?

1. ChatGPT

  • Ideal for: Trying the most popular AI chatbot
  • Pros: Simple interface, frequent updates, and various features
  • Cons: Users have been complaining that its performance has degraded over time
  • Upgrade Option: ChatGPT Plus offers GPT-4 access for $20/month, with a wide variety of features. Very useful for SEO, coders, and people who enjoy tweaking different personas to get what they want.

If you’re seeking an AI chatbot experience that stands at the forefront of technological prowess, then ChatGPT is your go-to choice. Developed by OpenAI and thrust into the limelight through a widespread preview in November 2022, this cutting-edge chatbot has swiftly amassed an impressive user base, boasting over 100 million enthusiasts. The website alone witnesses a staggering 1.8 billion monthly visitors, a testament to its widespread appeal and influence.

Not without its fair share of controversies, ChatGPT has stirred debates, particularly as users uncover its potential applications in academic tasks and workforce replacement. OpenAI has democratically opened the gates to ChatGPT, allowing users to harness its power through the GPT-3.5 model, free of charge, upon registration. For those seeking a heightened experience, the Plus version beckons, offering access to the formidable GPT-4 and an array of enhanced features, all for a reasonable $20 per month.

GPT-4, touted as the most capable Large Language Model (LLM) among all AI chatbots, takes the helm with training data up to January 2022 and the added ability to tap into the vast expanse of the internet, facilitated by Microsoft Bing. OpenAI has not disclosed GPT-4's parameter count; the often-quoted figure of 100 trillion parameters is an unconfirmed rumor, though the model is widely believed to be substantially larger than GPT-3.5 with its 175 billion parameters. The significance of these parameters lies in the model's ability to process and comprehend information: more parameters signify a broader scope of training data, resulting in heightened accuracy and reduced susceptibility to misinformation or misinterpretation. Opt for ChatGPT, and you immerse yourself in the epitome of AI sophistication.

2. Microsoft Copilot

  • Ideal for: Seeking up-to-date information, since it’s connected to the internet.
  • Pros: Access to GPT-4, free of charge, visual features
  • Cons: Limited to a number of responses per conversation, occasional delays
  • Perk: Integration with various Microsoft products on the way; already in Windows 11.

Opting for Microsoft Copilot is the way to go if you crave the latest and most up-to-date information at your fingertips. In stark contrast to the free version of ChatGPT, which is confined to offering insights leading up to early 2022, Copilot boasts the ability to access the internet. This feature allows Copilot to furnish users with real-time information, accompanied by relevant source links for added credibility.

One of Copilot’s standout features is its utilization of GPT-4, the pinnacle of OpenAI’s Large Language Models (LLM). Remarkably, Copilot provides this upgraded experience entirely free of charge. However, it comes with certain limitations, such as allowing only five responses per conversation and a character cap of 4,000 per prompt. While Copilot’s user interface may lack the straightforward simplicity of ChatGPT, it remains user-friendly and navigable.

In comparison to Bing Chat, which also taps into the internet for timely results but tends to lag in responsiveness, Copilot emerges as a more reliable option. Copilot’s responsiveness and prompt adherence set it apart from the competition, ensuring a smoother and more efficient conversational experience.

Beyond text-based interactions, Copilot goes the extra mile by offering image creation capabilities. Users can provide a description of the desired image, prompting Copilot to generate four options for them to choose from.

Adding to its versatility, Microsoft Copilot introduces different conversational styles — Creative, Balanced, and Precise. These styles allow users to tailor their interactions, determining the tone and level of detail in the responses. In summary, Microsoft Copilot proves to be a multifaceted and dynamic AI tool, providing users with not just information, but an interactive and customizable experience.

3. Google Gemini

  • Ideal for: Fast, almost unlimited experiences
  • Pros: Speedy responses, accurate answers, full Google integration
  • Cons: Previous iterations had shortcomings, now improved
  • Upgrade Option: Gemini is available in an Advanced/Ultra version for $20 a month, which will be integrated with all Google products in the future and includes 2TB of cloud storage.

If speed and an almost boundless conversational experience are at the top of your priorities, then Gemini, Google’s AI chatbot, deserves your attention. Having observed the evolution of various AI chatbots, I’ve witnessed Google Bard, now rebranded as Gemini, addressing and surpassing previous criticisms.

Gemini stands out for its impressive speed, delivering responses that have markedly improved in accuracy over time. While it may not outpace ChatGPT Plus in terms of speed, it often proves swifter in responses compared to Copilot and the free GPT-3.5 version of ChatGPT, although results may vary. Unlike its counterparts, Gemini avoids the restriction of a predetermined number of responses. Engaging in extensive conversations is seamless with Gemini, a stark contrast to Copilot’s 30-reply limit and ChatGPT Plus’s 40-message restriction every three hours.

Google has infused Gemini with a visual flair, incorporating more visual elements than its competitors. Users can harness Gemini’s image generation capabilities and even upload photos through integration with Google Lens. Additionally, Gemini extends its functionality with plugins for Kayak, OpenTable, Instacart, and Wolfram Alpha, enhancing the user experience.

The integration of Extensions further positions Gemini as a comprehensive Google experience. Users can augment their interactions by adding extensions for Google Workspace, YouTube, Google Maps, Google Flights, and Google Hotels.

Gemini also got an upgrade one week after its release. Called Gemini 1.5, it promises the following:

  • 🔄 Gemini 1.5 Pro has a 128,000 token context window, expandable to 1 million tokens for developers and enterprise customers.
  • 🧠 The increased context window allows processing vast amounts of information, enabling new capabilities.
  • 📊 Gemini 1.5 Pro outperforms Gemini 1.0 Pro on 87% of benchmarks, maintaining high performance with a longer context window.
  • 📧 Early testers can sign up for a no-cost 2 month trial of the 1 million token context window, with improvements in speed expected. This token context window is crazy big, Gemini 1.5 Pro can analyze and summarize the 402-page transcripts from Apollo 11’s mission to the moon. We will do a deep dive into Gemini Pro with a future article.

One thing to remember with Gemini: it recently came under backlash for being too "woke": it refused to generate images of white people, and in some tests it even refused to generate pictures of vanilla ice cream! If this doesn't impact whatever you need it to do, then Gemini looks like a great option.

4. Mistral

  • Ideal for: All-arounder; scored well in most of our tests
  • Pros: The model’s native fluency in English, French, Spanish, German, and Italian, coupled with a 32K tokens context window, ensures precise information recall from large documents.
  • Cons: While Mistral AI acknowledges shortcomings in previous iterations, the latest release of Mistral Large reflects significant improvements, addressing these concerns and making it a more reliable and robust solution. Even in their own tests, Mistral Large is still performing below ChatGPT 4
  • Upgrade Option: Users have the option to upgrade to Mistral Small, an optimized model designed for low-latency workloads. Additionally, Mistral AI offers competitive pricing for open-weight endpoints, including mistral-small-2402 and mistral-large-2402, catering to users seeking varied performance/cost tradeoffs.

In addition to Mistral Large, Mistral AI introduces Mistral Small, designed for low-latency workloads. This optimized model outperforms its counterparts, such as Mixtral 8x7B, and offers lower latency, making it a refined choice for users seeking an intermediary solution between the open-weight offering and the flagship Mistral Large model.

Mistral AI’s commitment to simplifying endpoint offerings and providing competitive pricing is evident in the diverse range of models available. Developers can take advantage of Mistral AI’s innovative features, including JSON format mode and function calling, which are currently available on mistral-small and mistral-large. These features enable natural interactions and complex interfacing with internal code, APIs, or databases, catering to a wide range of user needs.

Mistral's chat is free to use; for API and Azure integration, pricing ranges from $2 per 1 million input tokens to $8 per 1 million output tokens.
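At those rates, estimating a bill is straightforward; a quick sketch (the default prices are the ones quoted above and may change):

```python
def estimate_cost(input_tokens, output_tokens,
                  in_price_per_m=2.0, out_price_per_m=8.0):
    """Estimate API cost in dollars at per-million-token prices."""
    return (input_tokens / 1e6) * in_price_per_m \
         + (output_tokens / 1e6) * out_price_per_m

# e.g. 500k input tokens and 100k output tokens:
print(estimate_cost(500_000, 100_000))
```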

5. Claude 3 (Opus and Sonnet)

  • Ideal for: At this stage I would say coding. The coding capability of Opus seems to rival ChatGPT 4, in synthetic tests performed by Anthropic.
  • Pros: Competitive solution from Anthropic; remember that Amazon is backing them, which brings massive resources and a huge e-commerce platform to pilot in. This means it should improve quite fast.
  • Cons: Geographically limited.
  • Upgrade Option: Sonnet is powering the free experience on claude.ai, with Opus available for Claude Pro subscribers. You can get Claude Pro for a monthly price of $20 (US) or £18 (UK) plus applicable tax for your region. For APIs, Opus is $15 / $75, Sonnet is $3 / $15, and Haiku will be $0.25 / $1.25 per million input/output tokens.

The Claude 3 model family by Anthropic introduces three new models — Haiku (available at a later date), Sonnet, and Opus — setting industry benchmarks across various cognitive tasks. Opus, the most intelligent model, exhibits near-human levels of comprehension, while Sonnet and Haiku offer faster and more cost-effective solutions.

One of the things Claude 3 is intended for is Enterprise customers, as it promises near-instant results for live customer chats, auto-completions and data extraction tasks. Haiku, soon to be released, is the fastest and most cost-effective. Claude 3 Sonnet is 2x faster than Claude 2, with both Sonnet and Opus having an initial context window of 200K tokens with the capability to handle inputs exceeding 1 million tokens. Claude 3 Opus achieves near-perfect recall, surpassing 99% accuracy in the ‘Needle In A Haystack’ evaluation.

You can also try Claude 3 if you have a Perplexity Pro subscription, limited to 5 answers a day. If you want more than 5 answers, you can try it with another tool here. If you don’t know what Perplexity is, it’s a service that gives you access to most AI tools with a $20 monthly subscription.

https://thereach.ai/2024/02/02/perplexity-ai-vs-bing-copilot/

Conclusion

To be honest, ChatGPT 3.5 is still very competitive even as a one-year-old AI. Unless you have very demanding tasks, it can probably perform most, if not all, of the things you need. If you don't want to pay for any subscription, Copilot Creative, with ChatGPT 4 built in and free to use, gives very good results in all day-to-day activities.

If you do want to subscribe to something, the ecosystem you are already tied into will probably make the difference, with Copilot and Gemini being integrated into Microsoft Office and Google Workspace respectively. If you build or integrate a tool with an AI behind it, the price difference will matter, and based on the use case you would probably select Claude 3, Mistral or GPT-4.

Also remember that these tools will only get better, two years from now they will perform everything you throw at them in a shockingly fast, precise and efficient manner because guess what?

Today is apparently the day when AI is smarter than the average one of us.

I appreciate your time and attention to my latest article. Here at Medium and at LinkedIn I regularly write about AI, workplace, business and technology trends. If you enjoyed this article, you can also find it on www.thereach.ai, a website dedicated to showcasing AI applications and innovations.

