D.I.Y. Computer Science in Drug Discovery: How to Scrape an HTML Website for Chemical Information

Yusuf Adeshina

Since this is my first post of a long series, I think it's in order to tell you a little about what to expect. The purpose of this series is to demystify some of the seemingly arcane applications of computer science techniques in drug discovery, so that computational chemists and biologists, or anybody interested in applying computational life science techniques in their research, can benefit from these really cool techniques. I am really excited that I finally found time to share these posts with you. Sit back and enjoy them!

OK, let's get to the meat of my first post: WEB SCRAPING.

While this might be a mundane task for computer scientists and web developers, it is not for chemists and biologists who need important chemical information from a web page that doesn't have the 'ALMIGHTY' download button. To put it another way: if you have ever visited a website that wasn't designed for data download, where all you could do was browse without being able to download SMILES strings or other chemical information, and you wished there were a way to get the data you need, then this post is for you.

I have tried to keep the code as simple as possible. However, beginner-level knowledge of Python programming will be required to follow the logic. If you don't know how to code, I have provided a ready-to-use, user-friendly version of the code in my GitHub repository; please check it out.

So, let’s get started…

In this post, I will be scraping the ZINC15 database, a database of commercially available compounds.

Every compound in the ZINC database has its own page. We need to find that page, and hence its URL, in order to successfully scrape information about that compound. Let's find the page for ZINC19731973.

Go to the substances page, then search for ZINC19731973.

The URL for this page is http://zinc15.docking.org/substances/ZINC000019731973/

Therefore, if we can programmatically substitute other ZINC IDs into this URL, we can write a script that pulls the information specific to each ZINC ID from the website.
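As a minimal sketch of that idea, the substance URL can be built from any ZINC ID with simple string concatenation (the helper name `substance_url` is my own, not part of the ZINC site):

```python
# Build the ZINC15 substance-page URL for a given ZINC ID.
def substance_url(zinc_id):
    return "http://zinc15.docking.org/substances/" + str(zinc_id) + "/"

url = substance_url("ZINC000019731973")
print(url)  # http://zinc15.docking.org/substances/ZINC000019731973/
```

Looping this helper over a list of IDs gives one URL per compound, which is exactly what the scraping code below does.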

If you are using the Safari browser, like me, go to Develop > Show Page Source. You will see the page's HTML source.

Let's start by scraping the SMILES.

You need to install BeautifulSoup (pip install beautifulsoup4), requests (pip install requests), and pandas (pip install pandas).

Then…

Import requests, BeautifulSoup, pandas, and csv (csv is part of the standard library and is used later for parsing the vendor catalog):

import requests
from bs4 import BeautifulSoup
import pandas as pd
import csv

Then the body of the code:

# Create a dataframe of the ZINC IDs
df = pd.read_csv("files_containing_zinc_ids.txt")

# Open the output file in append mode
write_file = open("outfile.txt", "a")

i = 1
for cid in df["id"]:
    response = requests.get("http://zinc15.docking.org/substances/" + str(cid) + ".xml")
    # Naming the parser explicitly avoids a BeautifulSoup warning
    xml = BeautifulSoup(response.content, "html.parser")
    zinc = xml.zinc_id.text
    smi = xml.smiles.text
    vendor = xml.findAll("td")[1].text
    write_file.write("{0},{1},{2}\n".format(zinc, smi, vendor))
    print(zinc, smi, vendor)
    # Progress counter, redrawn in place via the carriage return
    b = "Downloaded " + str(i) + " from " + str(df["id"].count())
    print("%s" % b, end="\r")
    i += 1

write_file.close()
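To see how BeautifulSoup exposes the XML tags as attributes, here is the same parsing step applied to a hard-coded document (the snippet below is an invented stand-in, not real ZINC output):

```python
from bs4 import BeautifulSoup

# A minimal stand-in for the XML a substance page returns.
doc = "<substance><zinc_id>ZINC000019731973</zinc_id><smiles>CCO</smiles></substance>"

xml = BeautifulSoup(doc, "html.parser")
print(xml.zinc_id.text)  # ZINC000019731973
print(xml.smiles.text)   # CCO
```

Accessing `xml.zinc_id` returns the first tag with that name, and `.text` pulls out its contents, which is all the scraping loop above relies on.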

Output:

What if you are interested in getting the vendors for particular ZINC IDs?

import csv  # standard library, used to parse the catalog rows

df = pd.read_csv("files_containing_zinc_ids.txt")
df.head()

# Open the output file and write the header row
write_file = open("outfile.txt", "a")
print("ZINC_ID", ",", "Vendor", ",", "Purchasability", ",", "Updated")
write_file.write("{0},{1},{2},{3}\n".format("ZINC_ID", "Vendor", "Purchasability", "Updated"))

i = 1
for cid in df["id"]:
    response = requests.get("http://zinc15.docking.org/substances/" + str(cid) + "/catalogs/subsets/for-sale.csv")
    xml = BeautifulSoup(response.content, "html.parser")
    # The CSV text sits inside the first <p> tag of the page
    j = xml.findAll("p")[0].text
    myreader = csv.reader(j.splitlines())
    for row in myreader:
        # Keep only catalog entries that are in stock
        if row[2] == "In-Stock":
            p = [cid] + row
            write_file.write("{0},{1},{2},{3}\n".format(p[0], p[2], p[3], p[7]))
            print(p[0], ",", p[2], ",", p[3], ",", p[7])
    b = "Downloaded " + str(i) + " from " + str(df["id"].count())
    print("%s" % b, end="\r")
    i += 1

write_file.close()
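The csv.reader call above simply splits the text of the <p> tag into rows. On a hard-coded example (the catalog lines below are invented for illustration, not real vendor data), the in-stock filter behaves like this:

```python
import csv

# Invented catalog text in the comma-separated shape the for-sale endpoint returns.
text = "mcule,MCULE-123,In-Stock,...\nvendorx,VX-9,Agent,..."

rows = list(csv.reader(text.splitlines()))
# Keep only rows whose third field marks the compound as in stock.
in_stock = [row for row in rows if row[2] == "In-Stock"]
print(in_stock)  # [['mcule', 'MCULE-123', 'In-Stock', '...']]
```

Splitting on newlines first and then letting csv.reader handle the commas keeps the parsing robust even if a field ever contains quoted commas.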

Output:

Finally, you can check out the Jupyter notebook containing the full code in my GitHub repository. Also, be on the lookout for my next post on how to scrape an AJAX website for chemical data.

