Get data for your DB schoolwork with 6 lines of python

Romes
5 min readMay 18, 2020

--

For my university’s Database course project we can come up with our own idea for a database we’d like to implement (and then we do). When it comes to getting data to fill up those tables, I was told that some students in previous years wrote insert lines manually, and some other same-year friends planned to do the same.

Since I’m writing code to generate the database insertions (and actually, everything else (but I won’t be covering that)), I showed the friend who told me this, right away, a 6-lined script in python that generated 4000+ lines of SQL insert commands to fill a table named Music Genres with unique music genres.

I’ll go through an explanation of the following 6 lines of code, and how you can do something very similar to fill your database tables more effortlessly. (And I’ll include a second example, with Minecraft Items instead of Music Genres, after the explanation). In the end, you should understand the idea behind these lines in a way you can apply them to your own situation.

The script above outputs these SQL commands to the console (they’re too many to display here)

What’s happening?

The first two lines:

We first need to get the data from somewhere. We’re getting all the music genres from http://everynoise.com/everynoise1d.cgi?scope=all

import requests — importing the python library that will allow us to request that webpage.

everynoisetext = requests.get(the above URL).text — saving the text response property from the request made to the website.

The request

If you 1) navigate to the website, 2) open up the browser inspect window, 3) choose the Network tab from the inspector, 4) refresh the page, and 5) click the first item that appears in the network list, you should see something like this:

Request header

And then, on clicking “Response”, you should see:

Request response

When we make a get request (request.get), to the above URL, we receive, as a response, the HTML code of the page.

Since the page displays a list of genres, we can probably find these genres in the HTML. We just need a way to search for them.

First genres in Every Noise at Once

We have this HTML code as a string, we saved it as the variable everynoisetext . (The response.get(URL).text returned the content of the response. (if you print it: '<html>\n<head>\n<meta charset="utf-8">\n<title>Every Noise at Once</ti...)).

We’ll find what we need with the next two lines.

The next two lines:

import re — importing the python library that will allow us to search the huge string (the HTML code saved in everynoisetext) — re stands for regular expressions. (A regular expression is a sequence of characters that define a search pattern, we’re learning this in Theory of Computation so I’ll just explain the python syntax we need up ahead)

allgenres = re.findall(REGULAR EXPRESSION, everynoisetext) — finding all music genres inside everynoisetext through the regular expression. findall returns a list, and we’re saving it as allgenres.

The regular expression

Start by looking at the HTML code, and think about what pattern happens only near the word.

Looking at our HTML I found that each music genre was in a separate line.

HTML Sample

If you scroll right on the sample above, you might notice as well that right before every genre name, we see style=”color: #some-color-number"> ; and right after, we have < .

We want to search in that text for precisely style="color: #acolorcode">anynumberofletters< . So our regular expression will be exactly that — we just need to change “a color code” and “any number of letters” to a syntax that python’s regular expressions will understand.

In python re:

. means any character

* means any number of times

? after the * is a bit harder to comprehend. Python docs put it this way: The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<a> b <c>', it will match the entire string, and not just '<a>'. Adding ? after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using the RE <.*?> will match only '<a>'.

So replacing the expression with the correct syntax we get:

style="color: #.*?">.*?< — there’s just one thing left for it to be complete.

If we search for this pattern in the string we’ll get not only the music genres, but also the preceding style=”color …

What we want to capture is just the .*? in between the > and < .

Luckily, python re has a feature called groups. We can create a group for a part of the expression with () .

And the function findall will only append to the return list what is inside a group.

So our final expression will be style="color: #.*?">(.*?)< .

And after using it inside re.findall, allgenres will store [‘pop’, ‘danc…]

The last two lines

for genre in allgenres: — iterate over all genres in the list

print('insert into MusicGenres values("{}")'.format(genre)) — prints that SQL command to the console, and replaces the {} in the string with the genre, through the python format function.

To do this for your own project you can start by searching for a webpage with the data you are looking for, inspecting the request response, creating a regular expression to find your data in the response text, and then printing it out.

The second example: Minecraft Items

This will be a way less in depth example — you should get the idea from the first. This one generates a sequential table_id i and captures two groups in the regular expression, both (see following image) the item name and the minecraft:type text.

The website has the items:

Minecraft Items

The code:

For the output of this one just start up the python interpreter (type python in the console) and copy then paste this code.

Extra

To save the output as a file, just redirect the standard output to a file in the terminal:

python script.py > output.txt

BTW, if you liked this guide’s writing style, you might enjoy How I Recorded an Album On My Own, In My Room

Hope this helped, ~romes

PS: If anything’s wrong, let me know and I’ll fix it

--

--