Web Scraping with Google Gemini

Madhu Shree Aravindan · Published in Nerd For Tech · Jun 8, 2024

Introduction to Google Gemini

Google Gemini is a family of large language models (LLMs) created by Google AI, offering state-of-the-art AI capabilities. The Gemini models include:

  • Gemini Ultra — The largest and most powerful model, excelling in complex tasks like coding, logical reasoning, and creative collaboration. Available through Gemini Advanced (formerly Bard).
  • Gemini Pro — A mid-size model optimized for diverse tasks, offering performance comparable to Ultra. Available through the Gemini chatbot and in Google Workspace and Google Cloud. Gemini Pro 1.5 improves on this performance, including a breakthrough in long-context understanding of up to a million tokens spanning text, code, images, audio, and video.
  • Gemini Nano — A lightweight model designed for on-device use, bringing AI capabilities to mobile phones and small devices. Available on the Pixel 8 and Samsung S24 series.
  • Gemma — Open-source models inspired by Gemini that offer state-of-the-art performance at smaller sizes and are designed with responsible AI principles in mind.

In this blog, I will explain how to use the Gemini API to scrape any site and extract the necessary information.

For example, let’s scrape all the calls and joint calls for proposals from the following sites: DST (https://dst.gov.in/call-for-proposals) and BIRAC (https://birac.nic.in/cfp.php).

Let’s not forget to get a Gemini API key first.

Log in to Google AI Studio.

Scroll down to see “Get a Gemini API Key” and click “Start Now”.

Click “Continue”

Click on “Create API key”

Click on “Create API key in new project”

Now your Gemini API key is created!!

Now that that’s done, let’s start coding!!

I am using the PyCharm IDE. Make sure to install the streamlit, requests, beautifulsoup4, and google-generativeai libraries.
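If you’re using pip, the installation typically looks like this (the PyPI package for BeautifulSoup is beautifulsoup4, which is imported as bs4, and the Gemini SDK is published as google-generativeai):

pip install streamlit requests beautifulsoup4 google-generativeai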

Import the above libraries:

import streamlit as st
import requests
from bs4 import BeautifulSoup
import os
import google.generativeai as genai

Initialize the Google API key and load the Gemini Pro model.

st.title("Proposal Calls") # Title for the page

os.environ['GOOGLE_API_KEY'] = "********************************"
genai.configure(api_key = os.environ['GOOGLE_API_KEY'])

model = genai.GenerativeModel('gemini-pro')
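As a quick sanity check (optional), you can send a trivial prompt to confirm the key and model are configured correctly:

response = model.generate_content("Say hello.")  # simple test prompt
print(response.text)  # prints a short reply if the setup works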

Create a function read_input() to extract the raw data from each site and then feed it to the model as a prompt to structure the data.

def read_input():
    # Dictionary of all the links to be web-scraped.
    # You can add more if you want to.
    links = {
        "1": ["DST", "https://dst.gov.in/call-for-proposals"],
        "2": ["BIRAC", "https://birac.nic.in/cfp.php"]
    }
    for i in range(1, 3):
        url = links[str(i)][1]  # Get the URL of each organization
        r = requests.get(url)  # Request the data
        soup = BeautifulSoup(r.text, 'html.parser')  # Parse the HTML elements
        data = soup.text  # Get the raw data in string format
        link = soup.find_all('a', href=True)  # Get a list of all links on the site in HTML format
        l = ""
        for a in link:
            l = l + "\n" + a['href'][1:]  # Collect the actual links (dropping the leading character)
        # Create a query
        query = data + " The name of the organization is " + links[str(i)][0] + ". Jumbled links of calls for proposals: " + l + "\nCreate a table with the following columns: call for proposals or joint call for proposals along with its link, opening date, closing date, and the name of the organization."
        llm_function(query)
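Some sites reject requests that use the default Python User-Agent or respond slowly, so a more defensive version of the fetch step might add a browser-like header and a timeout (the header value here is just an illustrative assumption):

headers = {"User-Agent": "Mozilla/5.0"}  # illustrative browser-like User-Agent
r = requests.get(url, headers=headers, timeout=30)  # avoid hanging indefinitely
r.raise_for_status()  # raise an error for 4xx/5xx responses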

A glimpse of the unstructured data given to Gemini.

Create another function llm_function() to generate the response.

def llm_function(query):
    response = model.generate_content(query)  # Generate the response
    st.markdown(response.text)  # Render it using Streamlit
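Since the prompt asks Gemini for a table, the response usually comes back formatted as a markdown table, which st.markdown renders directly as a formatted table on the page.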

Finally, call read_input() from the script’s entry point.

if __name__ == "__main__":
    read_input()

Let’s run the following command in the terminal to launch the app.

streamlit run app.py

Now we can see how that unstructured data has been converted into clean, structured data. This is just the beginning; AI models may soon help us extract data from the web with far greater accuracy.

The above website is just a basic demo of how to leverage the Gemini model for web scraping. To make it more useful, we could add an option on the site for the user to provide the link of the website to be scraped along with a prompt; the model would then return the structured data as output.
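A minimal sketch of that extension might look like the following (the widget labels and the way the page text and prompt are combined are my own assumptions, not part of the original code):

url = st.text_input("Website URL to scrape")  # user supplies the target site
user_prompt = st.text_area("What should Gemini extract?")  # user supplies the instruction
if st.button("Scrape") and url and user_prompt:
    r = requests.get(url, timeout=30)
    soup = BeautifulSoup(r.text, 'html.parser')
    query = soup.text + "\n" + user_prompt  # raw page text plus the user's instruction
    llm_function(query)  # reuse the generation helper defined above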

I hope you found this tutorial useful. Happy coding!!
