Web Scraping with Google Gemini

Madhu Shree Aravindan · Published in Nerd For Tech · Jun 8, 2024

Introduction to Google Gemini

Google Gemini is a family of large language models (LLMs) created by Google AI, offering state-of-the-art AI capabilities. The Gemini models include:

  • Gemini Ultra — The largest and most powerful model, excelling in complex tasks like coding, logical reasoning, and creative collaboration. Available through Gemini Advanced (formerly Bard).
  • Gemini Pro — A mid-size model optimized for diverse tasks, offering performance comparable to Ultra. Available through the Gemini chatbot and in Google Workspace and Google Cloud. Gemini Pro 1.5 improves on this performance, including a breakthrough in long-context understanding of up to a million tokens spanning text, code, images, audio, and video.
  • Gemini Nano — A lightweight model designed for on-device use, bringing AI capabilities to mobile phones and small devices. Available on the Pixel 8 and Samsung S24 series.
  • Gemma — Open-source models inspired by Gemini that offer state-of-the-art performance at smaller sizes and are designed with responsible AI principles in mind.

In this blog, I will explain how to use the Gemini API to scrape any site and extract the necessary information.

For example, let’s scrape all the calls and joint calls for proposals from the following sites: DST (https://dst.gov.in/call-for-proposals) and BIRAC (https://birac.nic.in/cfp.php).

Let’s not forget to get a Gemini API key first.

Log in to Google AI Studio.

Scroll down to see “Get a Gemini API Key” and click “Start Now”.

Click “Continue”

Click on “Create API key”

Click on “Create API key in new project”

Now your Gemini API key is created!!

Now that that’s done, let’s start coding!!

I am using the PyCharm IDE. Make sure to install the streamlit, requests, beautifulsoup4, and google-generativeai libraries.
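If you’re using pip, the installation typically looks like this (the PyPI package for BeautifulSoup is beautifulsoup4, which is imported as bs4, and the Gemini SDK is published as google-generativeai):

pip install streamlit requests beautifulsoup4 google-generativeai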

Import the above libraries:

import streamlit as st
import requests
from bs4 import BeautifulSoup
import os
import google.generativeai as genai

Initialize the Google API key and load the Gemini Pro model.

st.title("Proposal Calls") # Title for the page

os.environ['GOOGLE_API_KEY'] = "********************************"
genai.configure(api_key = os.environ['GOOGLE_API_KEY'])

model = genai.GenerativeModel('gemini-pro')
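As a quick sanity check (optional), you can send a trivial prompt to confirm the key and model are configured correctly:

response = model.generate_content("Say hello.")  # simple test prompt
print(response.text)  # prints a short reply if the setup works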

Create a function read_input() to extract the raw data from each site and then feed it to the model as a prompt to structure the data.

def read_input():
    # Dictionary of all the links to be web-scraped.
    # You can add more if you want to.
    links = {
        "1": ["DST", "https://dst.gov.in/call-for-proposals"],
        "2": ["BIRAC", "https://birac.nic.in/cfp.php"]
    }
    for i in range(1, 3):
        url = links[str(i)][1]  # Get the URL of each organization
        r = requests.get(url)  # Request the data
        soup = BeautifulSoup(r.text, 'html.parser')  # Parse the HTML elements
        data = soup.text  # Get the raw data in string format
        link = soup.find_all('a', href=True)  # Get a list of all links on the site in HTML format
        l = ""
        for a in link:
            l = l + "\n" + a['href'][1:]  # Collect the actual links (dropping the leading character)
        # Create a query
        query = data + " The name of the organization is " + links[str(i)][0] + ". Jumbled links of calls for proposals: " + l + "\nCreate a table with the following columns: call for proposals or joint call for proposals along with its link, opening date, closing date, and the name of the organization."
        llm_function(query)
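Some sites reject requests that use the default Python User-Agent or respond slowly, so a more defensive version of the fetch step might add a browser-like header and a timeout (the header value here is just an illustrative assumption):

headers = {"User-Agent": "Mozilla/5.0"}  # illustrative browser-like User-Agent
r = requests.get(url, headers=headers, timeout=30)  # avoid hanging indefinitely
r.raise_for_status()  # raise an error for 4xx/5xx responses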

A glimpse of the unstructured data given to Gemini.

Create another function llm_function() to generate the response.

def llm_function(query):
    response = model.generate_content(query)  # Generate the response
    st.markdown(response.text)  # Render it using Streamlit
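Since the prompt asks Gemini for a table, the response usually comes back formatted as a markdown table, which st.markdown renders directly as a formatted table on the page.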

Finally, call read_input() from the script’s entry point.

if __name__ == "__main__":
    read_input()

Let’s run the following command in the terminal to launch the app.

streamlit run app.py

Now we can see how that unstructured data has been converted into clean, structured data. This is just the beginning; AI models may soon help us extract data from the web with far greater accuracy.

The above website is just a basic demo of how to leverage the Gemini model for web scraping. To make it more useful, we could add an option on the site for the user to provide the link of the website to be scraped along with a prompt; the model would then return the structured data as output.
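A minimal sketch of that extension might look like the following (the widget labels and the way the page text and prompt are combined are my own assumptions, not part of the original code):

url = st.text_input("Website URL to scrape")  # user supplies the target site
user_prompt = st.text_area("What should Gemini extract?")  # user supplies the instruction
if st.button("Scrape") and url and user_prompt:
    r = requests.get(url, timeout=30)
    soup = BeautifulSoup(r.text, 'html.parser')
    query = soup.text + "\n" + user_prompt  # raw page text plus the user's instruction
    llm_function(query)  # reuse the generation helper defined above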

I hope you found this tutorial useful. Happy coding!!
