Prepare a corpus in Sinhala language by crawling the web

If you want to build an information retrieval system, the very first thing you need to do is collect a set of documents (a corpus). In the process of collecting the documents, you have to settle several questions:

  1. The unit of a document: e.g., a whole email thread, only the first email of the thread, an email with or without attachments, etc.
  2. The language: e.g., English, Sinhala, Tamil, Japanese, etc.
  3. The format: e.g., PDF, HTML, JSON, etc.

Having settled these questions, let's say we want to collect a set of Sinhala songs in JSON format. I found the website http://lyricslk.com/, which contains around 800 Sinhala lyrics. Let's crawl this site to extract the information we require.

Note: The following procedure can be applied to any other website that has a sitemap, and the language doesn't matter.

We are going to crawl the web using a tool called Scrapy. It is an application framework written in Python that is used to crawl websites and extract structured data for a wide range of useful applications.

1. Install Scrapy

As prerequisites, you need Python 2.7 or above installed, along with the pip or Anaconda (conda) package manager.

To install Scrapy using conda:

conda install -c conda-forge scrapy

To install Scrapy using pip:

pip install Scrapy
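
You can check that the installation succeeded by printing the installed version:

scrapy version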

2. Create a new Scrapy project

Navigate to where you want to create the project, open a terminal, and issue:

scrapy startproject lyrics

Here “lyrics” is the project name.

This command creates a new Scrapy project called "lyrics", which contains a folder named "lyrics" and a file called "scrapy.cfg".

(Screenshot: folders and files inside the inner "lyrics" folder.)
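
For reference, this is the standard layout that scrapy startproject generates (the exact set of files can vary slightly between Scrapy versions):

lyrics/
    scrapy.cfg            # deploy configuration file
    lyrics/               # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider and downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # the folder where your spiders live
            __init__.py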

3. Write a spider to crawl the web and extract data

Scrapy's spider classes define how a site (or a group of sites) will be crawled. Some of the generic spiders are CrawlSpider, XMLFeedSpider, CSVFeedSpider, and SitemapSpider. You can read more details in the Scrapy documentation.

In this post, I am using a SitemapSpider. A SitemapSpider allows us to crawl a site by discovering its URLs through sitemap.xml.

You can visit the sitemap of the lyricslk.com site at http://lyricslk.com/sitemap.xml.

Navigate to lyrics/lyrics/spiders and create a file “lyrics_spider.py” with the following content.

lyrics_spider.py
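
Here is a minimal sketch of what lyrics_spider.py can look like. The XPath expressions in it are illustrative assumptions; inspect the actual song pages of lyricslk.com and adjust them to the real markup.

# lyrics_spider.py -- a minimal sketch of the SitemapSpider
# The XPath expressions below are illustrative assumptions;
# adapt them to the actual markup of lyricslk.com.
from scrapy.spiders import SitemapSpider


class LyricsSpider(SitemapSpider):
    name = "lyrics"
    sitemap_urls = ["http://lyricslk.com/sitemap.xml"]
    # skip artist pages; send every other URL to parse()
    sitemap_rules = [('^(?!.*artist).*$', 'parse')]

    def parse(self, response):
        yield {
            # hypothetical XPaths -- replace with the real ones for the site
            "title": response.xpath("//h1/text()").extract_first(),
            "singer": response.xpath("//h2/text()").extract_first(),
            "song": " ".join(response.xpath("//div[@class='lyric']//text()").extract()),
        }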

sitemap_rules = [('^(?!.*artist).*$', 'parse')]

This sitemap_rule says that any URL containing the word "artist" is ignored; all other URLs are passed to the parse callback.

response.xpath is used to extract the required information from each page. Since all the pages behind the URLs extracted from the sitemap have a consistent structure, we can use a fixed set of XPaths to extract the song, the singer, and the title.
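
If you want to try out XPath expressions before hard-coding them in the spider, Scrapy's interactive shell is handy. The URL below is a placeholder; substitute any song page listed in the sitemap:

scrapy shell "http://lyricslk.com/<some-song-page>"
>>> response.xpath("//h1/text()").extract_first()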

4. Run the created spider

Navigate to the project’s top level directory and run:

scrapy crawl lyrics -o output.json

Here "lyrics" is the name defined in the spider class:

class LyricsSpider(SitemapSpider):
    name = "lyrics"

The extracted data will be written to the "output.json" file.

[
{"song": " \u0d9a\u0db3\u0dd4\u0dc5\u0dd4 \u0d9a\u0dd2\u0dbb\u0dd2 \u0db4\u0ddc\u0dc0\u0dcf \u0dad\u0dd4\u0dbb\u0dd4\u0dbd\u0dda \u0dc3\u0dd9\u0db1\u0dd9\u0dc4\u0dc3\u0dd2\u0db1\u0dca \u0dc4\u0daf\u0dcf \u0d9a\u0db3\u0dd4\u0dc5\u0dd4 \u0dc0\u0dd2\u0dbd \u0daf\u0dd2\u0d9c\u0dda \u0db1\u0ddc\u0db4\u0dd9\u0db1\u0dd3 \u0d9c\u0dd2\u0dba\u0dcf\u0daf\u0ddd \u0d85\u0db8\u0dca\u0db8\u0dcf \u0dc3\u0dad\u0dca \u0db4\u0dd2\u0dba\u0dd4\u0db8\u0dca \u0dc0\u0dd2\u0dbd\u0dda \u0dc0\u0dd2\u0dbd \u0db8\u0dd0\u0daf \u0dba\u0dc5\u0dd2 \u0db4\u0dd2\u0db4\u0dd3 \u0daf\u0dd2\u0dbd\u0dda \u0daf\u0dd4\u0da7\u0dd4 \u0dc3\u0db3 \u0daf\u0dbb\u0dd4\u0dc0\u0db1\u0dca \u0dc0\u0dd9\u0dad \u0db8\u0dc0\u0d9a\u0d9c\u0dda \u0dc3\u0dd9\u0db1\u0dda \u0db8\u0dd2\u0dc4\u0dd2\u0d9a\u0dad \u0dc0\u0dd4\u0dc0 \u0dc4\u0dac\u0dcf \u0dc0\u0dd0\u0da7\u0dda \u0db1\u0dd2\u0dc0\u0db1\u0dca \u0db8\u0db1\u0dca \u0db4\u0dd9\u0dad\u0dda \u0d94\u0db6 \u0d9c\u0dd2\u0dba \u0db8\u0db1\u0dca \u0dbd\u0d9a\u0dd4\u0dab\u0dd4 \u0db4\u0dd9\u0db1\u0dda \u0dc0\u0da9\u0dd2\u0db1\u0dcf \u0daf\u0dcf \u0db1\u0dd2\u0dc0\u0db1\u0dca \u0db4\u0dd4\u0dad\u0dd4 \u0dc3\u0dd9\u0db1\u0dda \u0dc0\u0da9\u0dcf \u0db8\u0dcf \u0daf\u0dd9\u0dc3 \u0db6\u0dbd\u0db1\u0dd4 \u0db8\u0da7 \u0daf\u0dd0\u0db1\u0dda ", "title": "\u0d9a\u0db3\u0dd4\u0dc5\u0dd4 \u0d9a\u0dd2\u0dbb\u0dd2 \u0db4\u0ddc\u0dc0\u0dcf", "singer": "\u0d85\u0db8\u0dbb\u0daf\u0dda\u0dc0 W.D."},
{"song": " \u0d89\u0dbb \u0dc4\u0db3 \u0db4\u0dcf\u0dba\u0db1 \u0dbd\u0ddd\u0d9a\u0dda \u0d86\u0dbd\u0ddd\u0d9a\u0dba \u0d85\u0dad\u0dbb\u0dda \u0dc3\u0dd0\u0db4 \u0daf\u0dd4\u0d9a \u0dc3\u0db8\u0db6\u0dbb \u0dc0\u0dda \u0db8\u0dda \u0da2\u0dd3\u0dc0\u0db1 \u0d9a\u0dad\u0dbb\u0dda // \u0dc3\u0dd0\u0db4 \u0daf\u0dd4\u0d9a \u0dc3\u0db8\u0db6\u0dbb \u0dc0\u0dda \u0d8b\u0d9a\u0dd4\u0dbd\u0dda \u0dc5\u0db8\u0dd0\u0daf\u0dda \u0dc3\u0db8\u0db6\u0dbb \u0d8b\u0dc3\u0dd4\u0dbd\u0db1 \u0d9c\u0dd0\u0db8\u0dd2 \u0dbd\u0dd2\u0dba \u0dba\u0db1 \u0d9c\u0db8\u0db1\u0dda \u0db8\u0dd4\u0daf\u0dd4 \u0db6\u0db3 \u0db1\u0dd0\u0dc5\u0dc0\u0dd9\u0db1 \u0dc3\u0dda \u0d9a\u0db3\u0dd4\u0dc0\u0dd0\u0da7\u0dd2 \u0d9c\u0d82\u0d9c\u0dcf \u0dc3\u0dcf\u0d9c\u0dbb \u0d91\u0d9a\u0dc3\u0dda \u0db4\u0ddc\u0dc5\u0ddc\u0dc0\u0da7 \u0dc3\u0db8\u0db6\u0dbb \u0dc0\u0dda \u0db8\u0dda \u0da2\u0dd3\u0dc0\u0db1 \u0d9a\u0dad\u0dbb\u0dda // \u0dc3\u0dd0\u0db4 \u0daf\u0dd4\u0d9a \u0dc3\u0db8\u0db6\u0dbb \u0dc0\u0dda \u0dc0\u0dd0\u0da9\u0dd2\u0dc0\u0db1 \u0d86\u0dc1\u0dcf \u0db8\u0dd0\u0dac\u0dbd\u0db1 \u0dc0\u0dda\u0d9c\u0dda \u0da2\u0dd3\u0dc0\u0db1 \u0db8\u0d9f \u0d9a\u0dd0\u0dc5\u0db8\u0dda \u0d92\u0d9a\u0db8 \u0dbb\u0dc3\u0db8\u0dd4\u0dc3\u0dd4 \u0dc0\u0dda \u0db8\u0dc4 \u0dc0\u0db1 \u0dc0\u0daf\u0dd4\u0dbd\u0dda \u0dc0\u0db1 \u0dc0\u0dd2\u0dbd\u0dca \u0db8\u0dad\u0dd4\u0dc0\u0dda \u0db4\u0dd2\u0dba\u0dd4\u0db8\u0dca \u0db4\u0dd2\u0db4\u0dd3 \u0db1\u0dd0\u0da7\u0dc0\u0dda \u0db8\u0dda \u0da2\u0dd3\u0dc0\u0db1 \u0d9a\u0dad\u0dbb\u0dda // \u0dc3\u0dd0\u0db4 \u0daf\u0dd4\u0d9a \u0dc3\u0db8\u0db6\u0dbb \u0dc0\u0dda", "title": "\u0d89\u0dbb \u0dc4\u0db3 \u0db4\u0dcf\u0dba\u0db1 \u0dbd\u0ddd\u0d9a\u0dda", "singer": "\u0d85\u0db8\u0dbb\u0daf\u0dda\u0dc0 W.D."}, ....

You can now see similar content in the output.json file.

5. Convert Unicode escapes to Sinhala characters

Navigate to the folder where your “output.json” file exists.

Write a Python script to convert the Unicode escape sequences into Sinhala characters and write the output to a separate file, as sketched below.
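
Here is a minimal sketch of such a converter, assuming it reads the output.json produced in step 4 and writes the decoded result to song_lyrics.json:

# unicode_converter.py -- a minimal sketch of the converter
import io
import json

# json.load turns the \uXXXX escape sequences into real (Sinhala) characters
with io.open("output.json", encoding="utf-8") as f:
    data = json.load(f)

# ensure_ascii=False writes the characters out directly instead of re-escaping them
with io.open("song_lyrics.json", "w", encoding="utf-8") as f:
    f.write(json.dumps(data, ensure_ascii=False))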

Execute the script by running the python -m unicode_converter command in a terminal.

Now you have the "song_lyrics.json" file with content similar to the following.

[{"song": " කඳුළු කිරි පොවා තුරුලේ සෙනෙහසින් හදා කඳුළු විල දිගේ නොපෙනී ගියාදෝ අම්මා  සත් පියුම් විලේ විල මැද යළි පිපී දිලේ දුටු සඳ දරුවන් වෙත මවකගේ සෙනේ මිහිකත වුව හඬා වැටේ  නිවන් මන් පෙතේ ඔබ  ගිය මන් ලකුණු පෙනේ වඩිනා දා නිවන් පුතු සෙනේ වඩා මා දෙස බලනු මට දැනේ ", "singer": "අමරදේව W.D.", "title": "කඳුළු කිරි පොවා"}, {"song": " ඉර හඳ පායන ලෝකේ ආලෝකය අතරේ සැප දුක සමබර වේ මේ ජීවන කතරේ // සැප දුක සමබර වේ  උකුලේ ළමැදේ සමබර උසුලන ගැමි ලිය යන ගමනේ මුදු බඳ නැළවෙන සේ කඳුවැටි ගංගා සාගර එකසේ පොළොවට සමබර වේ මේ ජීවන කතරේ // සැප දුක සමබර වේ  වැඩිවන ආශා මැඬලන වේගේ ජීවන මඟ කැළමේ ඒකම රසමුසු වේ මහ වන වදුලේ වන විල් මතුවේ පියුම් පිපී නැටවේ මේ ජීවන කතරේ // සැප දුක සමබර වේ", "singer": "අමරදේව W.D.", "title": "ඉර හඳ පායන ලෝකේ"}, ....

Cool 😎. Now you have a rich corpus to build your information retrieval system.
