Gemini: A Beginner's Guide

Rahatara Ferdousi
5 min read · Dec 15, 2023


If you are still waiting for Gemini API access in your country but can't wait to explore it, this blog is for you. You can easily use Gemini, both as a user and as a developer, through Vertex AI on Google Cloud. To do that, you just need to follow these steps:

  1. Go to the Google Cloud Platform and open the console.
  2. Create a new project.
  3. Search for the Vertex AI API and enable it for the project.
  4. Select Multimodal; by default, you will see Gemini Pro under Model.

Google may ask you to provide payment information, but don't worry; you will not be charged unless you use Gemini as a service or exceed the free-tier limits. Learn about Vertex AI pricing and use the Pricing Calculator to generate a cost estimate based on your projected usage.
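If you plan to follow along in Python later in this post, the project you just enabled is what the Vertex AI SDK will point at. A minimal sketch, where the project ID and region are placeholders you should replace with your own:

```python
# pip install --upgrade google-cloud-aiplatform
import vertexai

# Placeholders: your own project ID and a region where Gemini is available
vertexai.init(project="your-project-id", location="us-central1")
```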

What is Gemini?

Oh sorry, I forgot to introduce Gemini :D

Gemini is a Generative AI model from Google DeepMind. It is designed to handle multimodal data like text, images, and speech, making it useful for a variety of applications.

What's the big deal?

One of the main objectives of Gemini is to push GenAI toward General AI, something still lacking in contemporary benchmark commercial GenAI tools. DeepMind claims that "Gemini is the first model to outperform human experts on MMLU (Massive Multitask Language Understanding), one of the most popular methods to test the knowledge and problem-solving abilities of AI models." Read more

Gemini Models

Google has released three models for different needs: Gemini Ultra for highly complex tasks, Gemini Pro for scaling across a wide range of tasks, and Gemini Nano, the most efficient model, for on-device tasks.

Among these models, Gemini Pro is currently the one fully accessible for testing and development through Vertex AI. There are two versions of Gemini Pro that you can customize and develop for your own application.

1. The Gemini Pro model (gemini-pro) is great for tasks involving natural language: for example, multi-turn conversations with text and code, as well as code generation.

2. The Gemini Pro Vision model (gemini-pro-vision) supports multimodal prompts. This means you can use text, images, and video in your prompts, and it will respond with either text or code.
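To make this concrete, here is a minimal sketch of calling both versions through the Vertex AI Python SDK. The project ID, region, bucket path, and prompts are placeholders, not values from this post:

```python
import vertexai
from vertexai.preview.generative_models import GenerativeModel, Part

# Placeholders: your own project ID and region
vertexai.init(project="your-project-id", location="us-central1")

# gemini-pro: text-only prompts
text_model = GenerativeModel("gemini-pro")
print(text_model.generate_content("Write a two-line poem about trains.").text)

# gemini-pro-vision: text plus image (or video) prompts
vision_model = GenerativeModel("gemini-pro-vision")
image = Part.from_uri("gs://your-bucket/photo.jpg", mime_type="image/jpeg")
print(vision_model.generate_content([image, "Describe this photo."]).text)
```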

What can the model ultimately do?

Gemini Pro can analyze, understand, and interact with various forms of multimodal data.

Curious? Let's put it to the test with Vertex AI.

Hands-On Demo

Gemini hands-on demo on Vertex AI

Can Gemini Pro interpret multimodal input?

For example, I often send my friends photos of my dresses, asking which would look best for a photo at a certain location.

I decided to see if the Gemini model could assist with this: I created a prompt including different dress options and the restaurant I plan to visit.

The model’s response was impressive — it focused on specific details and explained why a particular dress would be ideal for a great photo.

I've saved this prompt as 'Personal Stylist'. You might want to explore other fascinating use cases, such as showing a plant photo and asking what's wrong with it, or asking for help with interior decoration.

Gemini Pro's explanation capability
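If you would rather reproduce this kind of prompt in code than in the console, a sketch along these lines should work. The bucket paths, file names, and prompt are hypothetical:

```python
import vertexai
from vertexai.preview.generative_models import GenerativeModel, Part

vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-pro-vision")

# Hypothetical images uploaded to a Cloud Storage bucket
dresses = [
    Part.from_uri("gs://your-bucket/dress-red.jpg", mime_type="image/jpeg"),
    Part.from_uri("gs://your-bucket/dress-blue.jpg", mime_type="image/jpeg"),
]
venue = Part.from_uri("gs://your-bucket/restaurant.jpg", mime_type="image/jpeg")

prompt = "Which of these dresses would photograph best at this restaurant, and why?"
response = model.generate_content(dresses + [venue, prompt])
print(response.text)
```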

Can Gemini Pro extract and transform?

Another example I tried was testing the model's ability to extract information and transform it into my desired format.

I uploaded a photo of the fruit and vegetable section at Walmart, where each item had a price tag, and asked the model, using the microphone, to give me the information in JSON format. The output surprised me: first, the model extracted both visual and textual information, and second, it structured the output in my desired format.

So it gracefully handled audio, text, and image input and returned the output in JSON format. In the video, I forgot to turn on Markdown; if you keep Markdown on, you can copy the result directly from the response.
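Here is a typed-prompt version of the same extraction idea. The image path is hypothetical, and since the model is not guaranteed to return strictly valid JSON, it is worth parsing defensively:

```python
import json

import vertexai
from vertexai.preview.generative_models import GenerativeModel, Part

vertexai.init(project="your-project-id", location="us-central1")
model = GenerativeModel("gemini-pro-vision")

# Hypothetical photo of a produce shelf with visible price tags
shelf = Part.from_uri("gs://your-bucket/produce-shelf.jpg", mime_type="image/jpeg")
prompt = (
    "List every item visible in this photo along with its price tag. "
    "Return only a JSON array of objects with keys 'item' and 'price'."
)
response = model.generate_content([shelf, prompt])

try:
    print(json.loads(response.text))  # may fail if the model adds extra prose
except json.JSONDecodeError:
    print(response.text)  # fall back to the raw response
```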

Can Gemini Converse?

The idea of conversing is similar to creating a virtual agent that talks to different APIs. You can check the Google Store demo here and follow the README files and code to create one of your own.
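At the SDK level, the conversational piece is a chat session that keeps history across turns, which you could then wire up to your own API calls. A minimal sketch:

```python
import vertexai
from vertexai.preview.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")

# start_chat() returns a session that carries the conversation history
chat = GenerativeModel("gemini-pro").start_chat()

print(chat.send_message("What should I look for in a smart speaker?").text)
print(chat.send_message("Summarize that advice in one sentence.").text)
```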

How can I code it?

Coding Gemini on Google Colab

Well, so far we have seen Gemini's capabilities. Now let's see how you can customize the code and build with it.

If you have less coding experience:

  1. Click Get Code.
  2. Select your preferred coding language.
  3. If you are working with Python, the generated code may show errors. To avoid this, make a copy of the notebook and replace the project ID with your own; the extra setup steps you need are included in the code (see the sketch below).

Then modify the code as you want.
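On Colab, those extra steps usually amount to authenticating and pointing the SDK at your project. A minimal sketch, assuming you run it inside Google Colab (the project ID and region are placeholders):

```python
# Run inside Google Colab
from google.colab import auth

# Grant the notebook access to your Google Cloud account
auth.authenticate_user()

import vertexai

# Placeholders: replace with your own project ID and region
vertexai.init(project="your-project-id", location="us-central1")
```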

For more exploration, I highly recommend the official starter notebook to get a good grip on customizing the Gemini model programmatically.

Takeaways:

  • Gemini enables the handling of multimodal input with Generative AI, making it a significant advancement in the field of GenAI.
  • While Gemini currently supports multimodal input, it does not yet offer multimodal output for images.
  • The Vertex AI platform provides a user-friendly interface for non-code experts to experiment with Gemini’s capabilities.
  • GenAI, the underlying technology behind Gemini, holds promise for evolving into a more comprehensive General AI.

References:

Google DeepMind Blogs and Articles

Gemini GitHub Repo

Build with Gemini


Rahatara Ferdousi

Doctoral Researcher at the University of Ottawa. Exploring AI-integrated digital twins to automate railway defect inspection and maintenance.