How to book your COVID vaccine using Spark

Paul Scalli
Apr 14 · 4 min read
Photo by Lars Kienle on Unsplash

If you are like me and eager to get the COVID vaccine, here is a quick and easy way to find the closest appointment through data analysis. As you may have noticed when trying to book an appointment through the CVS website, or other recommended websites like My Turn, the search is often limited to 50 miles, and depending on where you live, you may not find any availability. The sun being out, warmer weather, and the prospect of meeting friends again motivate me enough to drive reasonably far to find the cure to ‘normal’ life again. I bet some of you will happily do the same.

With that in mind, I decided to use Apache Spark and a simple web scraping technique to return a list of all cities with available appointments, along with their point-to-point distance from my location. Let’s get started!

Retrieving the CVS Website Cookie:

Open the CVS Vaccination website in Google Chrome, then open the Developer Tools window by clicking View → Developer → Developer Tools. Open the Network tab (step 1), then find the covid-19-vaccine entry. I find it easiest to clear the results and re-click the state you are interested in (step 2). Finally, copy the cookie.

Creating the ETL Pipeline:

I find it simple and useful to run notebooks on Databricks Community Edition, a free version of the platform that bundles Spark into the runtime. This article assumes you already have a running cluster attached to the notebook (if not, you can follow these instructions).

Retrieve the JSON:

We are ready to make a request using urllib and retrieve the payload. You can use the code below, replacing the <COOKIE> placeholder with the cookie you copied above:

from urllib.request import Request, urlopen
import json
import gzip

url = "https://www.cvs.com//immunizations/covid-19-vaccine.vaccine-status.CA.json?vaccineinfo"
# Send a browser-like User-Agent and ask for a gzip-compressed response
req = Request(url, headers={'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip'})
req.add_header("Cookie", 'cookie=<COOKIE>')
req.add_header('Referer', 'https://www.cvs.com/immunizations/covid-19-vaccine?icid=cvs-home-hero1-link2-coronavirus-vaccine')
response = urlopen(req)
content = gzip.decompress(response.read())
# Decode each line of the decompressed payload into a string
decomp_req = content.splitlines()
temp = []
for line in decomp_req:
    temp.append(line.decode())
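
The snippet above assumes the server always honors the gzip request. As a minimal sketch, a small helper could check for the gzip magic bytes before decompressing, so that a plain-text response does not raise an error. The name decode_body is hypothetical and not part of the original code.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def decode_body(raw: bytes, encoding: str = "utf-8") -> str:
    """Return the response body as text, decompressing only if it is gzipped."""
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    return raw.decode(encoding)


# Usage with in-memory bytes (no network call is made here)
sample = b'{"status": "ok"}'
print(decode_body(gzip.compress(sample)))  # compressed input
print(decode_body(sample))                 # plain input
```

In the request code, this helper would take the place of the direct gzip.decompress call on response.read().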

Now that you have retrieved the payload, it’s a good idea to store it in a temporary location and then move it to the file system. Feel free to assign your own custom path:

dbutils.fs.rm("/tmp/covid.json")
dbutils.fs.put("/tmp/covid.json", temp[0])
dbutils.fs.mv("/tmp/covid.json","dbfs:/FileStore/CUSTOM_PATH/covid.json")

You are ready to read the JSON into a Spark DataFrame, process it to extract the payload, and finally keep only the available locations in a new DataFrame named ‘df_avail’:

from pyspark.sql import functions as F

df = spark.read.json('dbfs:/FileStore/CUSTOM_PATH/covid.json')
# Collect the metadata and the per-state payload into array columns
df = df.select(
    F.array(F.expr('responseMetaData.*')).alias('responseMetadata'),
    F.array(F.expr('responsePayloadData.data.*')).alias('payload'),
)
# Keep the first state's list of locations
df = df.select(df['payload'][0].alias('payload'))
# Produce one row per location
df = df.withColumn("new", F.explode("payload"))
df = df.withColumn('city', df['new']['city']) \
    .withColumn('state', df['new']['state']) \
    .withColumn('status', df['new']['status']) \
    .drop('payload') \
    .drop('new')
df_avail = df.filter(df['status'] == 'Available')
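
To make the select, explode, and filter chain easier to follow, here is a plain-Python sketch of the same reshaping on a tiny sample. The payload shape (a state code mapping to a list of city records under responsePayloadData.data) is inferred from the column expressions above, and the sample cities and statuses are invented for illustration, not real CVS data. The function name available_cities is hypothetical.

```python
import json

# Made-up sample mirroring the assumed shape of the response
sample_json = """
{
  "responseMetaData": {},
  "responsePayloadData": {
    "data": {
      "CA": [
        {"city": "FRESNO", "state": "CA", "status": "Available"},
        {"city": "OAKLAND", "state": "CA", "status": "Fully Booked"},
        {"city": "SAN JOSE", "state": "CA", "status": "Available"}
      ]
    }
  }
}
"""


def available_cities(payload: dict) -> list:
    """Flatten every state's records and keep only the available ones."""
    records = []
    # The Spark code keeps only the first state; with a single-state
    # file such as the CA one, flattening all states gives the same rows.
    for state_records in payload["responsePayloadData"]["data"].values():
        records.extend(state_records)  # plays the role of F.explode
    return [r for r in records if r["status"] == "Available"]  # df.filter


payload = json.loads(sample_json)
for row in available_cities(payload):
    print(row["city"], row["state"], row["status"])
```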

Geolocation API:

For this solution, we are using the OpenCage Geocoding API. Register an account, open your dashboard, then create a new project and generate a key. You will also need the geopy library. For this example, you can install both libraries from PyPI:

  • geopy == 2.0.1
  • opencage == 1.2.2

We will create a UDF to apply to the ‘df_avail’ DataFrame based on your current location, which you can change to the city or address where you reside:

from pyspark.sql import functions as F
from opencage.geocoder import OpenCageGeocode
from geopy.distance import geodesic, great_circle

# Suggest to use Secrets
open_cage_key = '<OPEN_CAGE_KEY>'
# Change to your location
current_location = 'San Francisco, CA'

def find_distance(B):
    """Return the geodesic distance in miles from current_location to city B."""
    geocoder = OpenCageGeocode(open_cage_key)
    A = current_location
    result_A = geocoder.geocode(A)
    lat_A = result_A[0]['geometry']['lat']
    lng_A = result_A[0]['geometry']['lng']

    B = str(B + ',CA')
    result_B = geocoder.geocode(B)
    lat_B = result_B[0]['geometry']['lat']
    lng_B = result_B[0]['geometry']['lng']
    return float("{:.3f}".format(geodesic((lat_A, lng_A), (lat_B, lng_B)).miles))

# Declare the return type; without it the UDF returns strings
distance = F.udf(find_distance, 'float')
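
Because find_distance geocodes the starting location again for every row, each row costs two API calls. One possible refinement, sketched below under the assumption that repeated lookups for the same place return the same result, is to memoize lookups with functools.lru_cache. The fake_geocode function and its approximate coordinates are hypothetical stand-ins for the OpenCage client, so the example runs without an API key.

```python
from functools import lru_cache

call_count = 0  # counts how often the (fake) geocoding service is hit


def fake_geocode(place):
    """Hypothetical stand-in for a geocoding API call."""
    global call_count
    call_count += 1
    lookup = {
        "San Francisco, CA": (37.7749, -122.4194),
        "FRESNO,CA": (36.7378, -119.7871),
    }
    return lookup[place]


@lru_cache(maxsize=None)
def cached_geocode(place):
    """Geocode a place, reusing the stored result on repeated lookups."""
    return fake_geocode(place)


# Three rows for the same city: only the first row triggers real lookups
for city in ["FRESNO,CA", "FRESNO,CA", "FRESNO,CA"]:
    origin = cached_geocode("San Francisco, CA")
    destination = cached_geocode(city)

print(call_count)  # 2: one call for the origin, one for the city
```

Inside a Spark UDF such a cache would live separately in each Python worker process, so every worker would still make its own first call for each place.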

Final Result:

In this last step, we can apply the UDF to the Dataframe:

df_avail.withColumn('distance', distance('city').cast("float")).orderBy('distance', ascending=True).show()

As a result, you get a final table with the straight-line, point-to-point distance from where you are. Remember, this is simply a guide: it does not reflect driving distance or real-time traffic:

Final Table showing Distance from current Location to All Available Cities
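
To give a feel for what a straight-line distance between two coordinates means, the sketch below uses the haversine formula, which treats the Earth as a sphere. This is a swapped-in approximation for illustration only; the geodesic function from geopy used above works on an ellipsoid model and will give slightly different numbers. The function name, the Earth radius constant, and the approximate city-center coordinates are assumptions.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an approximation


def haversine_miles(lat_a, lng_a, lat_b, lng_b):
    """Great-circle distance in miles between two (lat, lng) points in degrees."""
    phi_a, phi_b = math.radians(lat_a), math.radians(lat_b)
    d_phi = phi_b - phi_a
    d_lambda = math.radians(lng_b - lng_a)
    h = (math.sin(d_phi / 2) ** 2
         + math.cos(phi_a) * math.cos(phi_b) * math.sin(d_lambda / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


san_francisco = (37.7749, -122.4194)
los_angeles = (34.0522, -118.2437)
print(round(haversine_miles(*san_francisco, *los_angeles), 1))
```

For these two points the result is roughly 347 miles, noticeably shorter than the road distance, which is why the table is best read as a guide.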

Full code available here:

from urllib.request import Request, urlopen
import json
import gzip
from pyspark.sql import functions as F
from opencage.geocoder import OpenCageGeocode
from geopy.distance import geodesic, great_circle

# Set the API Key
open_cage_key = '<OPEN_CAGE_KEY>'
# Change to your location
current_location = 'San Francisco, CA'

# Retrieve the JSON payload
url = "https://www.cvs.com//immunizations/covid-19-vaccine.vaccine-status.CA.json?vaccineinfo"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0', 'Accept-Encoding': 'gzip'})
req.add_header("Cookie", 'cookie=<COOKIE>')
req.add_header('Referer', 'https://www.cvs.com/immunizations/covid-19-vaccine?icid=cvs-home-hero1-link2-coronavirus-vaccine')
response = urlopen(req)
content = gzip.decompress(response.read())
decomp_req = content.splitlines()
temp = []
for line in decomp_req:
    temp.append(line.decode())
print(temp)

# Store the payload and move it to the file system
dbutils.fs.rm("/tmp/covid.json")
dbutils.fs.put("/tmp/covid.json", temp[0])
dbutils.fs.mv("/tmp/covid.json", "dbfs:/FileStore/CUSTOM_PATH/covid.json")

# Load the JSON and keep only the available locations
df = spark.read.json('dbfs:/FileStore/CUSTOM_PATH/covid.json')
df = df.select(
    F.array(F.expr('responseMetaData.*')).alias('responseMetadata'),
    F.array(F.expr('responsePayloadData.data.*')).alias('payload'),
)
df = df.select(df['payload'][0].alias('payload'))
df = df.withColumn("new", F.explode("payload"))
df = df.withColumn('city', df['new']['city']) \
    .withColumn('state', df['new']['state']) \
    .withColumn('status', df['new']['status']) \
    .drop('payload') \
    .drop('new')
df_avail = df.filter(df['status'] == 'Available')

def find_distance(B):
    """Return the geodesic distance in miles from current_location to city B."""
    geocoder = OpenCageGeocode(open_cage_key)
    A = current_location
    result_A = geocoder.geocode(A)
    lat_A = result_A[0]['geometry']['lat']
    lng_A = result_A[0]['geometry']['lng']

    B = str(B + ',CA')
    result_B = geocoder.geocode(B)
    lat_B = result_B[0]['geometry']['lat']
    lng_B = result_B[0]['geometry']['lng']
    return float("{:.3f}".format(geodesic((lat_A, lng_A), (lat_B, lng_B)).miles))

# Declare the return type; without it the UDF returns strings
distance = F.udf(find_distance, 'float')
df_avail.withColumn('distance', distance('city').cast("float")).orderBy('distance', ascending=True).show()
