Scrape Google Finance Ticker Quote Data in Python

What will be scraped

Prerequisites

Basic knowledge of scraping with CSS selectors

CSS selectors declare which part of the markup a style applies to, which also makes it possible to extract data from matching tags and attributes.
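
For example, with parsel (installed below), a selector like .price::text grabs the text of any tag with a price class; a tiny illustration with made-up markup:

from parsel import Selector

# made-up markup, just to illustrate the idea
html = '<span class="price">$2,665.75</span>'
print(Selector(text=html).css(".price::text").get())  # $2,665.75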

If you haven’t scraped with CSS selectors yet, there’s a dedicated blog post of mine
about how to use CSS selectors when web scraping that covers what they are, their pros and cons, and why they matter from a web-scraping perspective.

Separate virtual environment

In short, a virtual environment creates an independent set of installed libraries, including different Python versions, that can coexist on the same system, preventing library and Python version conflicts.
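
As a minimal example, here's how a virtual environment could be created and activated with Python's built-in venv module (Virtualenv and Poetry from the post linked below work similarly):

python -m venv env        # create a virtual environment in the "env" folder
source env/bin/activate   # activate it (env\Scripts\activate on Windows)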

If you haven’t worked with a virtual environment before, have a look at the
dedicated Python virtual environments tutorial using Virtualenv and Poetry blog post of mine to get a little more familiar.

📌Note: this is not a strict requirement for this blog post.

Install libraries:

pip install requests parsel

Reduce the chance of being blocked

There’s a chance that a request might be blocked. Have a look
at how to reduce the chance of being blocked while web scraping; it covers eleven methods to bypass blocks from most websites, one of which (rotating the User-Agent header) is sketched below.
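
A quick sketch of that User-Agent rotation (the user-agent strings here are just examples):

import random, requests

# a couple of example desktop user agents; any recent ones will do
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36",
]

headers = {"User-Agent": random.choice(user_agents)}  # pick a different one per request
html = requests.get("https://www.google.com/finance/quote/GOOGL:NASDAQ", headers=headers, timeout=30)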

Scraping Google Finance Ticker Quote Data

Explanation of Extracting Ticker Data

Import libraries:

import requests, json, re
from parsel import Selector
from itertools import zip_longest
  • requests to make a request to the website.
  • json to convert extracted data to a JSON object.
  • re to extract parts of the data via regular expression.
  • parsel to parse data from HTML/XML documents. Similar to BeautifulSoup.
  • zip_longest to iterate over several iterables in parallel. More on that below.
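
Since zip_longest shows up later, here's a quick illustration of how it differs from zip(): it iterates until the longest iterable is exhausted, padding missing values with None (or a custom fillvalue):

from itertools import zip_longest

keys = ["previous_close", "day_range", "year_range"]
values = ["$2,665.75", "$2,659.31 - $2,713.82"]  # one value missing

print(list(zip(keys, values)))           # stops at the shortest iterable
print(list(zip_longest(keys, values)))   # pads the missing value with None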

Define a function:

def scrape_google_finance(ticker: str):  # ticker should be a string
    ...  # further code

scrape_google_finance(ticker="GOOGL:NASDAQ")

Create request headers and URL parameters:

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "hl": "en"  # language
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
# https://www.whatismybrowser.com/detect/what-is-my-user-agent
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36",
}

Pass requests parameters and request headers, make a request and pass response to parsel:

html = requests.get(f"https://www.google.com/finance/quote/{ticker}", params=params, headers=headers, timeout=30)
selector = Selector(text=html.text)
  • f"https://www.google.com/finance/quote/{ticker}" is an f-string where {ticker} will be replaced by the actual ticker string, e.g. "GOOGL:NASDAQ".
  • timeout=30 to stop waiting for a response after 30 seconds.
  • Selector(text=html.text) where the HTML from the response will be processed by parsel.

Create an empty dictionary structure where all the data will be filled in:

# where all extracted data will be temporarily located
ticker_data = {
    "ticker_data": {},
    "about_panel": {},
    "news": {"items": []},
    "finance_performance": {"table": []},
    "people_also_search_for": {"items": []},
    "interested_in": {"items": []}
}

Extracting current price, quote and title data:

# current price, quote, title extraction
ticker_data["ticker_data"]["current_price"] = selector.css(".AHmHk .fxKbKc::text").get()
ticker_data["ticker_data"]["quote"] = selector.css(".PdOqHc::text").get().replace(" • ",":")
ticker_data["ticker_data"]["title"] = selector.css(".zzDege::text").get()

Extracting right panel data:

about_panel_keys = selector.css(".gyFHrc .mfs7Fc::text").getall()
about_panel_values = selector.css(".gyFHrc .P6K39c").xpath("normalize-space()").getall()

for key, value in zip_longest(about_panel_keys, about_panel_values):
    key_value = key.lower().replace(" ", "_")
    ticker_data["about_panel"][key_value] = value

Extracting description and extensions data from the right panel:

# description "about" and extensions extraction
ticker_data["about_panel"]["description"] = selector.css(".bLLb2d::text").get()
ticker_data["about_panel"]["extensions"] = selector.css(".w2tnNd::text").getall()

Extracting news results:

# news extraction
if selector.css(".yY3Lee").get():
    for index, news in enumerate(selector.css(".yY3Lee"), start=1):
        ticker_data["news"]["items"].append({
            "position": index,
            "title": news.css(".Yfwt5::text").get(),
            "link": news.css(".z4rs2b a::attr(href)").get(),
            "source": news.css(".sfyJob::text").get(),
            "published": news.css(".Adak::text").get(),
            "thumbnail": news.css("img.Z4idke::attr(src)").get()
        })
else:
    ticker_data["news"]["error"] = f"No news results for {ticker}."
  • if selector.css(".yY3Lee").get() to check if news results are present. There's no need to check if <element> is not None.
  • enumerate() to add a counter to an iterable and return it.
  • start=1 will start counting from 1, instead of the default value of 0.
  • ticker_data["news"]["items"].append({}) to append extracted data to a list as a dictionary.
  • ::attr(src) is also parsel pseudo-element support to get the src attribute from the node, equivalent to the XPath /@src, as the snippet below shows.
  • ticker_data["news"]["error"] to create a new "error" key with a message when an error occurs.
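
To make the ::attr() point concrete, here's a tiny parsel snippet (with made-up markup) showing that the CSS pseudo-element and its XPath equivalent return the same value:

from parsel import Selector

selector = Selector(text='<img class="Z4idke" src="https://example.com/thumb.png">')

print(selector.css("img.Z4idke::attr(src)").get())           # via the CSS pseudo-element
print(selector.xpath('//img[@class="Z4idke"]/@src').get())   # equivalent XPath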

Extracting Financial Performance table data:

# finance performance table
# checks if the finance table exists
if selector.css(".slpEwd .roXhBd").get():
    fin_perf_col_2 = selector.css(".PFjsMe+ .yNnsfe::text").get()           # e.g. Dec 2021
    fin_perf_col_3 = selector.css(".PFjsMe~ .yNnsfe+ .yNnsfe::text").get()  # e.g. Year/year change

    for fin_perf in selector.css(".slpEwd .roXhBd"):
        if fin_perf.css(".J9Jhg::text , .jU4VAc::text").get():

            """
            The if fin_perf.css().get() check is needed; otherwise the first dict key
            and sub-dict values would be None:

            "finance_performance": {
                "table": [
                    {
                        "null": {
                            "Dec 2021": null,
                            "Year/year change": null
                        }
                    }
                ]
            }
            """

            perf_key = fin_perf.css(".J9Jhg::text , .jU4VAc::text").get()   # e.g. Revenue, Net Income, Operating Income..
            perf_value_col_1 = fin_perf.css(".QXDnM::text").get()           # e.g. 60.3B, 26.40%..
            perf_value_col_2 = fin_perf.css(".gEUVJe .JwB6zf::text").get()  # e.g. 2.39%, -21.22%..

            ticker_data["finance_performance"]["table"].append({
                perf_key: {
                    fin_perf_col_2: perf_value_col_1,  # dynamically add key and value from the second (2) column
                    fin_perf_col_3: perf_value_col_2   # dynamically add key and value from the third (3) column
                }
            })
else:
    ticker_data["finance_performance"]["error"] = f"No 'finance performance' table for {ticker}."

Extracting "you may be interested in"/"people also search for" results:

# "you may be interested in" results
if selector.css(".HDXgAf .tOzDHb").get():
for index, other_interests in enumerate(selector.css(".HDXgAf .tOzDHb"), start=1):
ticker_data["interested_in"]["items"].append(discover_more_tickers(index, other_interests))
else:
ticker_data["interested_in"]["error"] = f"No 'you may be interested in` results for {ticker}"


# "people also search for" results
if selector.css(".HDXgAf+ div .tOzDHb").get():
for index, other_tickers in enumerate(selector.css(".HDXgAf+ div .tOzDHb"), start=1):
ticker_data["people_also_search_for"]["items"].append(discover_more_tickers(index, other_tickers))
else:
ticker_data["people_also_search_for"]["error"] = f"No 'people_also_search_for` in results for {ticker}"
# ...

def discover_more_tickers(index: int, other_data: str):
    """
    If price_change_formatted starts complaining, check beforehand for
    None values with try/except or an if statement and set it to 0.

    However, re.search(r"\d{1}%|\d{1,10}\.\d{1,2}%") should get the job done.
    """
    return {
        "position": index,
        "ticker": other_data.css(".COaKTb::text").get(),
        "ticker_link": f'https://www.google.com/finance{other_data.attrib["href"].replace("./", "/")}',
        "title": other_data.css(".RwFyvf::text").get(),
        "price": other_data.css(".YMlKec::text").get(),
        "price_change": other_data.css("[jsname=Fe7oBc]::attr(aria-label)").get(),
        # https://regex101.com/r/BOFBlt/1
        # Up by 100.99% -> 100.99%
        "price_change_formatted": re.search(r"\d{1}%|\d{1,10}\.\d{1,2}%", other_data.css("[jsname=Fe7oBc]::attr(aria-label)").get()).group()
    }

Return and print the data:

# def scrape_google_finance(ticker: str):
#     ticker_data = {
#         "ticker_data": {},
#         "about_panel": {},
#         "news": {"items": []},
#         "finance_performance": {"table": []},
#         "people_also_search_for": {"items": []},
#         "interested_in": {"items": []}
#     }
#     # extraction code...
#     return ticker_data

data_1 = scrape_google_finance(ticker="GOOGL:NASDAQ")
print(json.dumps(data_1, indent=2, ensure_ascii=False))

Full output:

Scrape Multiple Google Finance Tickers Quotes

for ticker in ["DAX:INDEXDB", "GOOGL:NASDAQ", "MSFT:NASDAQ"]:
    data = scrape_google_finance(ticker=ticker)
    print(json.dumps(data["ticker_data"], indent=2, ensure_ascii=False))

Outputs:

{
  "current_price": "14,178.23",
  "quote": "DAX:Index",
  "title": "DAX PERFORMANCE-INDEX"
}
{
  "current_price": "$2,665.75",
  "quote": "GOOGL:NASDAQ",
  "title": "Alphabet Inc Class A"
}
{
  "current_price": "$296.97",
  "quote": "MSFT:NASDAQ",
  "title": "Microsoft Corporation"
}

Extract Google Finance Chart Time-Series Data

Scraping time-series data is not a particularly good idea, so it's better to use a dedicated API to get the job done.

How do you find which API Google uses to build its time-series charts?

We can confirm that Google is using the Nasdaq API for time-series data by simply checking the Nasdaq chart for the GOOGL quote:

In this case, I used the Nasdaq Data Link API, which has support for Python, R, and Excel. I believe other platforms provide Python integration as well.

I’m assuming that you have already installed the nasdaq-data-link package, but if not, here's how you can do it. If you have set up a default version of Python:

# WSL
$ pip install nasdaq-data-link

If you don’t set up a default version of Python:

# WSL
$ python3.9 -m pip install nasdaq-data-link # change python to your version: python3.X

Get your API key at data.nasdaq.com/account/profile:

Create a .env file to store your API key there:

touch .nasdaq_api_key # change the file name to yours
# paste the API key inside the created file

Scraping Google Finance Time-Series Data

Time-Series Extraction Code Explanation

  • nasdaqdatalink.read_key(filename=".nasdaq_api_key") to read your API key.
  • ".nasdaq_api_key" is your .env variable with the secret API key. All secret variables (correct me if I'm wrong) start with a . symbol to mark them as hidden.
  • nasdaqdatalink.ApiConfig.api_key to test whether your API key is being recognized by the nasdaq-data-link package. Example output: 2adA_avd12CXauv_1zxs
  • nasdaqdatalink.get() to get the time-series data, which is a dataset structure, as sketched below.
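
The extraction snippet itself was embedded in the original post; here's a minimal sketch reconstructed from the bullet points above. The WIKI/GOOGL dataset code and the collapse="monthly" argument are my assumptions, inferred from the column names and month-end dates in the output below:

import nasdaqdatalink

def get_time_series_data():
    # read the API key from the file created earlier
    nasdaqdatalink.read_key(filename=".nasdaq_api_key")
    # check that the key was picked up, e.g. 2adA_avd12CXauv_1zxs
    print(nasdaqdatalink.ApiConfig.api_key)

    # assumption: the WIKI/GOOGL dataset collapsed to monthly rows,
    # which matches the DataFrame printed below
    timeseries_data = nasdaqdatalink.get("WIKI/GOOGL", collapse="monthly")
    print(timeseries_data)

get_time_series_data()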

Outputs a pandas DataFrame object:

               Open     High      Low    Close      Volume  Ex-Dividend  Split Ratio    Adj. Open    Adj. High     Adj. Low   Adj. Close  Adj. Volume
Date
2004-08-31  102.320   103.71   102.16   102.37   4917800.0          0.0          1.0    51.318415    52.015567    51.238167    51.343492    4917800.0
2004-09-30  129.899   132.30   129.00   129.60  13758000.0          0.0          1.0    65.150614    66.354831    64.699722    65.000651   13758000.0
2004-10-31  198.870   199.95   190.60   190.64  42282600.0          0.0          1.0    99.742897   100.284569    95.595093    95.615155   42282600.0
2004-11-30  180.700   183.00   180.25   181.98  15384600.0          0.0          1.0    90.629765    91.783326    90.404069    91.271747   15384600.0
2004-12-31  199.230   199.88   192.56   192.79  15321600.0          0.0          1.0    99.923454   100.249460    96.578127    96.693484   15321600.0
...             ...      ...      ...      ...         ...          ...          ...          ...          ...          ...          ...          ...
2017-11-30 1039.940  1044.14  1030.07  1036.17   2190379.0          0.0          1.0  1039.940000  1044.140000  1030.070000  1036.170000    2190379.0
2017-12-31 1055.490  1058.05  1052.70  1053.40   1156357.0          0.0          1.0  1055.490000  1058.050000  1052.700000  1053.400000    1156357.0
2018-01-31 1183.810  1186.32  1172.10  1182.22   1643877.0          0.0          1.0  1183.810000  1186.320000  1172.100000  1182.220000    1643877.0
2018-02-28 1122.000  1127.65  1103.00  1103.92   2431023.0          0.0          1.0  1122.000000  1127.650000  1103.000000  1103.920000    2431023.0
2018-03-31 1063.900  1064.54   997.62  1006.94   2940957.0          0.0          1.0  1063.900000  1064.540000   997.620000  1006.940000    2940957.0

[164 rows x 12 columns]

As you can see, there's no data for the years 2019–2022. That's because I'm using the free version, which, as Nasdaq says, is suitable for experimentation and exploration.

Nasdaq Rate Limits

Authenticated users have a limit of 300 calls per 10 seconds, 2,000 calls per 10 minutes and a limit of 50,000 calls per day. Premium data subscribers have a limit of 5,000 calls per 10 minutes and a limit of 720,000 calls per day.
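
If you need to stay under those limits in a long-running loop, a simple pause between batches is enough; a rough sketch for the 300-calls-per-10-seconds tier (the dataset codes are hypothetical):

import time
import nasdaqdatalink

dataset_codes = ["WIKI/GOOGL", "WIKI/MSFT"]  # hypothetical batch of dataset codes

for index, code in enumerate(dataset_codes, start=1):
    data = nasdaqdatalink.get(code, collapse="monthly")
    if index % 300 == 0:
        time.sleep(10)  # pause after every 300 calls to respect the 300/10s limit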

Additional Nasdaq API Resources

Links

Outro

If you have anything to share, any questions, suggestions, or something that isn’t working correctly, reach out via Twitter at @dimitryzub or @serp_api.

Yours,
Dmitriy, and the rest of SerpApi Team.

Join SerpApi on Twitter | YouTube

Add a Feature Request💫 or a Bug🐞
