How to use web scraping tools to gain insight into a specific sector?

An analysis of the children’s literature sector in Lithuania

Stéphanie Crêteur
Geek Culture
8 min read · Apr 20, 2022

For my work, I needed concrete data on the children’s literature sector in Lithuania. However, despite some online research, I could not find any precise figures beyond a few articles (listed in the sources of this document).

I am also training in data analysis with Python and specifically wanted to learn more about web scraping. This project let me work around the lack of data while developing my understanding of Beautiful Soup (a Python library for parsing HTML) and Selenium (a tool for controlling web browsers).

In this article, I want to show a concrete case of web scraping and how it can help us obtain data that is otherwise difficult to acquire, data that then forms the basis of a relevant analysis. My topic was very specific, so I could not settle for a generic overview of the field.

Photo by Stephen Andrews on Unsplash

I had concrete questions I wanted to answer: what is the average price of a children’s book, how many categories exist, which books are the most popular, and what is the proportion of Lithuanian authors in the selected books? In addition, I wanted to know whether people tended to buy more children’s books during the Covid crisis. Since parents had to stay at home with their children during the quarantine, it seemed likely that they would buy more books to keep them occupied.

To obtain this information I decided to retrieve the data available on the website knygos.lt (a Lithuanian version of Amazon). Indeed, on a single page, we can find almost 200 books for children and teenagers divided into 11 categories as well as a lot of other relevant information for our analysis.

1. Scraping the data

The first step is to confirm, by reading the robots.txt file, that the site allows scraping. Once this is confirmed, I can start the actual scraping. I’ll begin by importing the needed libraries and creating the soup object.
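
The robots.txt check can also be done in code. The sketch below uses the standard-library urllib.robotparser module; the robots.txt content and the URL paths are invented placeholders so that the example runs offline, and is_allowed is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Invented robots.txt rules (NOT the real file from knygos.lt).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
Disallow: /account/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Both paths are hypothetical and only illustrate the two possible outcomes.
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://www.knygos.lt/some-category/"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://www.knygos.lt/checkout/cart"))
```

Against the live site, one would instead call parser.set_url() with the address of the site's robots.txt and then parser.read() before can_fetch().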

Looking at the HTML file from the website, you can see that the class for the properties of the books is called “book-properties” divided into “.book-author”, “.book-title” and “.book-price”. To limit the selection to the grid containing youth literature, I need to select the class “.col-12” (otherwise we would include the promotions that are not related to our topic).

The Information Available Thanks to the Inspect Element Tool

There is a star rating system for each book (going from 0 to 5). However, those evaluations are not very diverse (between 4 and 5 stars for the vast majority). In order to analyse the popularity of a book, the number of reviews given seems more relevant. This is available under the class “.badge-secondary”.

For data storage, I used a dictionary whose key is the index and the value is a list containing the name of the author, the title, the number of reviews, and the price.
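
The selection and storage steps described above could look roughly like the sketch below. Only the class names come from the page; the HTML snippet is an invented stand-in for the real markup, and scrape_books and text_or_none are hypothetical helper names.

```python
from bs4 import BeautifulSoup

# Invented markup that reuses the class names mentioned in the article.
SAMPLE_HTML = """
<div class="promo">
  <div class="book-properties">
    <span class="book-author">Promo Author</span>
    <span class="book-title">Promo Title</span>
    <span class="book-price">1,00 €</span>
  </div>
</div>
<div class="col-12">
  <div class="book-properties">
    <span class="book-author">Author A</span>
    <span class="book-title">Title A</span>
    <span class="badge-secondary">12</span>
    <span class="book-price">9,99 €</span>
  </div>
  <div class="book-properties">
    <span class="book-author">Author B</span>
    <span class="book-title">Title B</span>
    <span class="book-price">14,50 €</span>
  </div>
</div>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None if it is absent."""
    node = parent.select_one(selector)
    return node.get_text(strip=True) if node else None


def scrape_books(html):
    """Map each book's index to [author, title, number of reviews, price]."""
    soup = BeautifulSoup(html, "html.parser")
    books = {}
    # Restricting the search to ".col-12" leaves out the promotional blocks.
    for index, item in enumerate(soup.select(".col-12 .book-properties")):
        books[index] = [
            text_or_none(item, ".book-author"),
            text_or_none(item, ".book-title"),
            text_or_none(item, ".badge-secondary"),
            text_or_none(item, ".book-price"),
        ]
    return books


books = scrape_books(SAMPLE_HTML)
print(books)
```

Books without any review simply get None in the third position, which can be filled in later during cleaning.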

The books are also divided into 11 categories: Knygos mažiausiems (books for toddlers), Knygos vaikams (books for children), Knygos paaugliams (books for teenagers), Pažintinė literatūra vaikams (educational literature for children), Pasakos (fairy tales), Kakė Makė (a popular Lithuanian children’s series featuring the character “Nelly Jelly”), Lavinamosios, užduočių knygelės (educational and activity books), Veiklos knygelės (workbooks), Kalėdinės knygelės (Christmas books), Mokiniams rekomenduojamos knygos (books recommended by schools), Smagioji edukacija (fun education).

On this web page, for each category there are 16 books, so we can easily add those categories to the dictionary we created earlier.
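
Since every category holds 16 books, the category of a book can be derived from its index by integer division. The sketch below assumes the dictionary is indexed from 0 in page order and that the categories appear in the order listed above; add_categories is a hypothetical helper and the book records are placeholders.

```python
CATEGORIES = [
    "Knygos mažiausiems",
    "Knygos vaikams",
    "Knygos paaugliams",
    "Pažintinė literatūra vaikams",
    "Pasakos",
    "Kakė Makė",
    "Lavinamosios, užduočių knygelės",
    "Veiklos knygelės",
    "Kalėdinės knygelės",
    "Mokiniams rekomenduojamos knygos",
    "Smagioji edukacija",
]
BOOKS_PER_CATEGORY = 16


def add_categories(books):
    """Append the category name to each record, based on the book's index."""
    for index, record in books.items():
        record.append(CATEGORIES[index // BOOKS_PER_CATEGORY])
    return books


# Placeholder records: [author, title, reviews, price] for 11 * 16 = 176 books.
books = {
    index: [f"Author {index}", f"Title {index}", None, "9,99 €"]
    for index in range(len(CATEGORIES) * BOOKS_PER_CATEGORY)
}
add_categories(books)
print(books[0][-1], "|", books[175][-1])
```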

Great, we now have everything to create our DataFrame! However, we first need to clean the data a bit (fill in the missing reviews with “0” and change the data type to float).

We are now left with this DataFrame, which contains 176 rows and 5 columns.
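
A possible version of this cleaning step is sketched below with pandas. The two records are invented, the column names follow the ones used later in the article ("price", "review"), and the price format with a decimal comma and a euro sign is an assumption about how the site displays prices.

```python
import pandas as pd

# Invented records in the shape [author, title, reviews, price, category].
books = {
    0: ["Author A", "Title A", "12", "9,99 €", "Pasakos"],
    1: ["Author B", "Title B", None, "14,50 €", "Pasakos"],
}

columns = ["author", "title", "review", "price", "category"]
df_lithuania = pd.DataFrame.from_dict(books, orient="index", columns=columns)

# Books without reviews get "0" before the column is converted to float.
df_lithuania["review"] = df_lithuania["review"].fillna("0").astype(float)

# Assumed price format "9,99 €": drop the currency sign, use a decimal point.
df_lithuania["price"] = (
    df_lithuania["price"]
    .str.replace("€", "", regex=False)
    .str.replace(",", ".", regex=False)
    .str.strip()
    .astype(float)
)

print(df_lithuania.dtypes)
print(df_lithuania.shape)
```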

2. Analysing the data

Using the describe() function will help us obtain some solid information regarding our quantitative data (reviews and price).
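
As a small illustration of what describe() returns, the sketch below runs it on invented numbers (only the maximum of 118 reviews echoes a figure from the article).

```python
import pandas as pd

# Invented values for illustration only.
df_lithuania = pd.DataFrame(
    {
        "review": [12.0, 0.0, 3.0, 118.0],
        "price": [9.99, 14.50, 6.00, 11.00],
    }
)

# describe() reports count, mean, std, min, quartiles and max per column.
summary = df_lithuania[["review", "price"]].describe()
print(summary)
```

The 25%, 50% and 75% rows are the ones that show how the prices are spread across the corpus.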

2.1. Analysing the reviews

Let’s check what books and what categories are the most reviewed in our selection.
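
One way to rank books and categories by reviews is sketched below. The rows are invented, and summing the reviews per category is an assumption; averaging them would be an equally reasonable choice.

```python
import pandas as pd

# Invented rows for illustration only.
df_lithuania = pd.DataFrame(
    {
        "title": ["Title A", "Title B", "Title C", "Title D"],
        "category": [
            "Knygos mažiausiems",
            "Pasakos",
            "Knygos mažiausiems",
            "Pasakos",
        ],
        "review": [10.0, 2.0, 7.0, 0.0],
    }
)

# The three most reviewed books.
top_books = df_lithuania.nlargest(3, "review")[["title", "review"]]

# Total number of reviews per category, highest first.
reviews_by_category = (
    df_lithuania.groupby("category")["review"].sum().sort_values(ascending=False)
)

print(top_books)
print(reviews_by_category)
```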

We can see that the most reviewed book is Between Shades of Gray by the American-Lithuanian author Ruta Sepetys. The categories receiving the most reviews are books for toddlers and books recommended by schools.

2.2. Analysing the price

The next part of our analysis will look at the average price of a book according to its category and which books are the most expensive.
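
A sketch of this price analysis with pandas follows; the rows are invented and only meant to show the groupby and nlargest calls.

```python
import pandas as pd

# Invented rows for illustration only.
df_lithuania = pd.DataFrame(
    {
        "title": ["Title A", "Title B", "Title C", "Title D"],
        "category": ["Pasakos", "Pasakos", "Knygos vaikams", "Knygos vaikams"],
        "price": [60.0, 10.0, 8.0, 12.0],
    }
)

# Average price per category, most expensive category first.
mean_price_by_category = (
    df_lithuania.groupby("category")["price"]
    .mean()
    .sort_values(ascending=False)
    .round(2)
)

# The two most expensive books.
most_expensive = df_lithuania.nlargest(2, "price")[["title", "category", "price"]]

print(mean_price_by_category)
print(most_expensive)
```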

The most expensive category is fairy tales (pasakos), and logically the most expensive book is also in this category. However, it should be noted that this book is actually a collection of volumes, hence the high price.

2.3. Correlation between price and reviews

I noticed that the more expensive books don’t seem to get many reviews, so I wanted to see whether the two are correlated. I first removed the collection of volumes, a clear outlier, with df_lithuania_no_outliers = df_lithuania.drop([73]), and then drew a scatter plot with df_lithuania_no_outliers.plot.scatter(x="price", y="review", c="DarkBlue").

We can see that the most reviewed books are in the 5 to 15 euro price range. However, as we noticed with the describe() function, almost 50% of our books are in this range. It is still interesting, though, that the books that fall outside this interval receive almost no reviews.
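
The scatter plot can be complemented with a number. The sketch below drops the outlier by its index label and computes a Pearson correlation coefficient with Series.corr; the coefficient is an addition of this example rather than something reported in the analysis, and the values are invented (only the index label 73 comes from the article).

```python
import pandas as pd

# Invented values; the row labelled 73 plays the role of the outlier.
df_lithuania = pd.DataFrame(
    {
        "price": [6.0, 9.0, 12.0, 25.0, 90.0],
        "review": [8.0, 15.0, 10.0, 1.0, 0.0],
    },
    index=[10, 11, 12, 13, 73],
)

# drop() takes index labels, so [73] removes the row labelled 73.
df_lithuania_no_outliers = df_lithuania.drop([73])

# Pearson correlation between price and number of reviews.
correlation = df_lithuania_no_outliers["price"].corr(
    df_lithuania_no_outliers["review"]
)
print(round(correlation, 2))


def plot_price_vs_reviews(df):
    """Draw the scatter plot (requires matplotlib to be installed)."""
    return df.plot.scatter(x="price", y="review", c="DarkBlue")
```

Calling plot_price_vs_reviews(df_lithuania_no_outliers) draws the same kind of plot as in the article.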

3. Nationality of the author

One question that interested me at the beginning of the analysis was the proportion of Lithuanian authors in this corpus. Answering it precisely would mean checking the authors one by one, which is tedious. However, we can get around this by using one of the “quirks” of the Lithuanian language: its fairly distinctive diacritical marks (“ąčęėįšųūž”). These can help flag authors who are likely to be Lithuanian (or of Lithuanian descent). I used the regular expression module (import re) to classify the authors into two groups.

I noticed that the search did mostly return Lithuanians, although at least two people who are not Lithuanian were included in the selection (Pavla Hanáčková and Ester Dobiášová, both of Czech origin). However, I kept this approach as it gave me a good approximation. I removed Mrs Hanáčková and Mrs Dobiášová with the .remove() function.

The last change to the list is the addition of some common names particular to Lithuania (Vytautas, Linas) and finally the addition of Ruta Sepetys (author of Between Shades of Gray) whose dual nationality is perhaps responsible for the absence of “ū”.

Despite the fact that she lives in the United States, the general theme of her book is directly related to Lithuania. Therefore, it seems logical to me to include her among the Lithuanian authors. Here is the complete code :
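
A minimal sketch of this classification is given below, assuming the authors are available as a plain list of strings. The names Vytautas Pavyzdys, Jonas Žalias and John Smith are invented placeholders, and likely_lithuanian is a hypothetical helper name.

```python
import re

# Lithuanian diacritics; IGNORECASE also catches capitals such as "Ž".
LITHUANIAN_MARKS = re.compile(r"[ąčęėįšųūž]", re.IGNORECASE)
FALSE_POSITIVES = ("Pavla Hanáčková", "Ester Dobiášová")
COMMON_FIRST_NAMES = ("Vytautas", "Linas")
MANUAL_ADDITIONS = ("Ruta Sepetys",)


def likely_lithuanian(authors):
    """Return the authors who are probably Lithuanian."""
    selected = [name for name in authors if LITHUANIAN_MARKS.search(name)]

    # Czech names share some of the diacritics, so they are removed by hand.
    for name in FALSE_POSITIVES:
        if name in selected:
            selected.remove(name)

    # Add authors without diacritics: common first names and manual picks.
    for name in authors:
        first_name = name.split()[0] if name.split() else ""
        is_extra = first_name in COMMON_FIRST_NAMES or name in MANUAL_ADDITIONS
        if is_extra and name not in selected:
            selected.append(name)
    return selected


authors = [
    "Pavla Hanáčková",
    "Ester Dobiášová",
    "Ruta Sepetys",
    "Vytautas Pavyzdys",  # invented name
    "Jonas Žalias",  # invented name
    "John Smith",  # invented name
]
lithuanian_authors = likely_lithuanian(authors)
print(lithuanian_authors)
```

The heuristic stays an approximation: it depends on how the shop spells each name, which is why the manual corrections are needed.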

Thanks to this list, we now know that the percentage of Lithuanian authors in our corpus is 19.32%. The next step is to add a new column to our DataFrame that contains “Y” if the author is Lithuanian and “N” otherwise. This new column allows us to see whether Lithuanian authors tend to receive more reviews than non-Lithuanians.

With this calculation we see that books written by Lithuanians receive on average 9.56 reviews, whereas books by non-Lithuanians receive only 4.39. However, we do know that there is an outlier: the book Between Shades of Gray, which has 118 reviews. Let’s see how the result changes if we take it out of our corpus.
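
The flag column and the percentage can be computed as sketched below. The author list and the rows are invented placeholders (apart from Ruta Sepetys), and the column name "lithuanian" is an assumption.

```python
import pandas as pd

# Invented list and rows for illustration only.
lithuanian_authors = ["Ruta Sepetys", "Vytautas Pavyzdys"]
df_lithuania = pd.DataFrame(
    {
        "author": [
            "Ruta Sepetys",
            "Author B",
            "Vytautas Pavyzdys",
            "Author D",
            "Author E",
        ]
    }
)

# "Y" if the author is in the list, "N" otherwise.
df_lithuania["lithuanian"] = (
    df_lithuania["author"].isin(lithuanian_authors).map({True: "Y", False: "N"})
)

# Share of rows flagged "Y", as a percentage.
share_lithuanian = (df_lithuania["lithuanian"] == "Y").mean() * 100
print(round(share_lithuanian, 2))
```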

Without Sepetys’ book, the average drops to 6.27 reviews which is still higher than the average for foreign authors.
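
The comparison with and without the outlier can be sketched as follows. Apart from the title and its 118 reviews, the rows are invented, so the printed averages do not match the figures above.

```python
import pandas as pd

# Invented rows, except for the outlier's title and review count.
df_lithuania = pd.DataFrame(
    {
        "title": ["Between Shades of Gray", "Title B", "Title C", "Title D", "Title E"],
        "lithuanian": ["Y", "Y", "Y", "N", "N"],
        "review": [118.0, 4.0, 8.0, 5.0, 3.0],
    }
)

# Average number of reviews per group, outlier included.
mean_by_group = df_lithuania.groupby("lithuanian")["review"].mean()

# Same calculation after filtering the outlier out by title.
without_outlier = df_lithuania[df_lithuania["title"] != "Between Shades of Gray"]
mean_without_outlier = without_outlier.groupby("lithuanian")["review"].mean()

print(mean_by_group)
print(mean_without_outlier)
```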

4. Comparing the top-selling books of 2019 with those of 2021

On knygos.lt, a fairly large number of children's books appear among this year's top-selling books. I wanted to know whether that share has changed over the years and, more specifically, whether the Covid crisis had any impact. However, knygos.lt does not give the top 100 from previous years, so we have to turn to another website, www.knyguklubas.lt. This site requires JavaScript to be enabled, which Beautiful Soup cannot handle: it only parses the HTML it is given and does not execute scripts. So we have to turn to another Python library: Selenium.

To compare 2019 to 2021, I decided to get a list of children’s authors from this site and then analyse the percentage of authors in the top sales of 2019 and 2021.

So I went to the page dedicated to children’s literature (set to 96 books per page, sorted by popularity). Then I scraped the authors from the first two pages, which gave me a list of 192 popular authors in the field of children’s literature. I removed the “Nėra Autoriaus” (no author) entries with a list comprehension, which left a final list of 172 authors.
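
A rough sketch of this step with Selenium is shown below. It assumes Firefox and its driver are installed, and the CSS selector passed to scrape_authors has to be read from the page with the inspect tool. The helper names scrape_authors and drop_missing_authors are hypothetical.

```python
NO_AUTHOR = "Nėra Autoriaus"


def scrape_authors(urls, css_selector):
    """Collect author names from JavaScript-rendered pages.

    Needs Selenium plus a local Firefox; css_selector must match the
    elements that hold the author names on the target pages.
    """
    # Imported here so the rest of the example runs without Selenium.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Firefox()
    driver.implicitly_wait(10)  # wait up to 10 s for rendered elements
    authors = []
    try:
        for url in urls:
            driver.get(url)
            elements = driver.find_elements(By.CSS_SELECTOR, css_selector)
            authors.extend(element.text.strip() for element in elements)
    finally:
        driver.quit()  # always close the browser, even after an error
    return authors


def drop_missing_authors(authors):
    """Remove the placeholder used for books without an author."""
    return [author for author in authors if author != NO_AUTHOR]


# Offline demonstration of the filtering step with invented names.
sample = ["Author A", "Nėra Autoriaus", "Author B", "Nėra Autoriaus"]
print(drop_missing_authors(sample))
```

A call could look like scrape_authors(["https://www.knyguklubas.lt/literatura-vaikams"], ".product-author"), where ".product-author" is a made-up selector that would need to be replaced with the real one.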

I then went to the page giving the most popular books of 2019 and added all the authors to a list and did the same for 2021.

Finally, I checked whether any of these children’s authors appear in the best-seller lists of 2019 or 2021.

So, drum roll: out of a total of 95 books for 2019, there are no children’s authors present, whereas, out of 97 books, there are 18 books written by children’s authors in the list of best-selling books in 2021.
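
The comparison itself needs no scraping library. The sketch below counts how many entries of a best-seller list were written by an author from the children's list; all names are invented placeholders and count_books_by_children_authors is a hypothetical helper.

```python
def count_books_by_children_authors(top_list_authors, children_authors):
    """Count best-seller entries whose author is a known children's author."""
    known = set(children_authors)  # set lookup keeps the check fast
    return sum(1 for author in top_list_authors if author in known)


# Invented lists for illustration only.
children_authors = ["Author A", "Author B", "Author C"]
top_2019 = ["Author X", "Author Y", "Author Z"]
top_2021 = ["Author A", "Author X", "Author B", "Author A"]

matches_2019 = count_books_by_children_authors(top_2019, children_authors)
matches_2021 = count_books_by_children_authors(top_2021, children_authors)
print(matches_2019, matches_2021)
```

Because the function iterates over the best-seller list, an author with several books in the ranking is counted once per book, which matches counting books rather than distinct authors.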

To be honest, this part with Selenium was quite time-consuming, and, to be completely frank, the result is obvious from a quick glance at the two lists on the website. However, this process allowed me to discover and get to grips with Selenium, which is still a positive point.

This brings us to the end of the analysis. I am quite happy with the result. Thanks to these web-scraping tools, I got data that was otherwise unavailable to me.

Sources

https://www.knyguklubas.lt/literatura-vaikams

Here is the code on GitHub:

Stéphanie Crêteur
Geek Culture

Python | Data analysis lover. Learning about AI and Natural Language Processing.