Power BI — Scrap dictionary words from various web sources
Power BI can indeed be a convenient tool for web scraping, especially for those who prefer a no-code or low-code approach.
Power BI’s built-in web scraping capabilities, accessed through the “Get Data” feature, allow users to connect to various web sources, including HTML pages, PDF documents, and APIs, without the need for writing complex code. This can be particularly useful for business users, analysts, or report creators who are more comfortable with a visual, drag-and-drop interface rather than writing custom scripts.
In this article, we will be using Power BI to scrap dictionary words from various web sources.
[1] Scrap texts from (GitHub) text file
Get a url for the dictionary texts e.g. https://gist.githubusercontent.com/deekayen/4148741/raw/98d35708fa344717d8eee15d11987de6c8e26d7d/1-1000.txt
Paste the url into the From Web
dialog box.
Press OK.
Power BI shall display the preview of the text file content.
Press Transform Data
button.
Rename the column header as word
.
Click the drop down arrow on the right side of the column header and select Sort Ascending
.
Change to lower case.
In the Query Settings panel, rename the dataset as github_dictionary
.
In the File ribbon tab, Press Close & Apply
button.
[2] Scrap texts from Wikipedia
Get a url for the dictionary texts e.g.
https://simple.wikipedia.org/wiki/Wikipedia:List_of_1000_basic_words
Paste the url into the From Web
dialog box.
Press OK.
Power BI shall display the preview of the text file content.
Find the table that contains the words.
Press Transform Data
button.
Right-click the column that contains the words, select Remove Other Columns
.
Rename the column header as word
.
The column contains text with multiple words. Therefore, we need to split the text.
In the Transform ribbon tab, select Split_Column/By_Delimiter
.
In the Split Column by Delimiter dialog window, select comma
as delimiter character and split into Rows
option.
Rename the column header as word
.
Transform the column into lowercase.
Sort the column in ascending order.
Rename the dataset as wikipedia_dictionary
.
In the File ribbon tab, Press Close & Apply
button.
[3] Scrap texts from Wiktionary
Get a url for the dictionary texts e.g.
https://en.wiktionary.org/wiki/Appendix:1000_basic_English_words
Paste the url into the From Web
dialog box.
Press OK.
Power BI shall display the preview of the text file content.
This time, we will use the Add Table Using Examples
approach. Press the button.
In the Add Table Using Examples dialog window, type the first three words. Power BI will automatically load the remaining words for A.
Scroll further down to the last word in A i.e. away
.
Add the word baby
(which is the first word in B) below the word away
. Power BI will automatically add the rest.
If everything works well, all words down to Z will be added automatically.
Press OK
button.
We shall get a new table in the Navigator window
Press Transform Data
button.
Rename the column header as word
.
Transform the column into lowercase.
Sort the column in ascending order.
Rename the dataset as wiktionary_dictionary
.
In the File ribbon tab, Press Close & Apply
button.
[4] Scrap texts from PDF file
Get a url for the PDF texts e.g.
Paste the url into the From Web
dialog box.
Press OK.
Power BI shall display the preview of the text file content.
Select the tables for all the PDF pages.
Press Transform Data
button.
Select all columns for the Table001
. In the Transform ribbon tab, select Unpivot_Columns/Unpivot-Columns
.
Rename the column header as word
.
Transform the column into lowercase.
Sort the column in ascending order.
Repeat the same steps for the other 9 tables.
Some table may not be correctly processed e.g. Table005 contains a multi-word texts.
Apply Split Column by Delimiter to the table.
Power BI will try to guess the delimiter character e.g. #(If)
Press OK.
It looks like the split process has been successful.
Continue the unpivot columns process for the remaining tables.
Next, we will append all tables into a new table.
Select Table001 and then click the Append_Queries/Append_Queries_as_new
.
Select all tables.
Rename the new dataset as pdf_dictionary
.
In the File ribbon tab, Press Close & Apply
button.
All datasets shall appear in the Data Panel in the Power BI main window.
Power BI example file:
https://archive.org/download/dictionaryworld/scrap_dictionary_texts_from_various_web_sources.pbix