Power BI — Scrap dictionary words from various web sources

Mohamad Mahmood
Lexiconia
Published in
6 min readJun 16, 2024

Power BI can indeed be a convenient tool for web scraping, especially for those who prefer a no-code or low-code approach.

Power BI’s built-in web scraping capabilities, accessed through the “Get Data” feature, allow users to connect to various web sources, including HTML pages, PDF documents, and APIs, without the need for writing complex code. This can be particularly useful for business users, analysts, or report creators who are more comfortable with a visual, drag-and-drop interface rather than writing custom scripts.

In this article, we will be using Power BI to scrap dictionary words from various web sources.

[1] Scrap texts from (GitHub) text file

Get a url for the dictionary texts e.g. https://gist.githubusercontent.com/deekayen/4148741/raw/98d35708fa344717d8eee15d11987de6c8e26d7d/1-1000.txt

Paste the url into the From Web dialog box.

Press OK.

Power BI shall display the preview of the text file content.

Press Transform Data button.

Rename the column header as word.

Click the drop down arrow on the right side of the column header and select Sort Ascending.

Change to lower case.

In the Query Settings panel, rename the dataset as github_dictionary.

In the File ribbon tab, Press Close & Apply button.

[2] Scrap texts from Wikipedia

Get a url for the dictionary texts e.g.

https://simple.wikipedia.org/wiki/Wikipedia:List_of_1000_basic_words

Paste the url into the From Web dialog box.

Press OK.

Power BI shall display the preview of the text file content.

Find the table that contains the words.

Press Transform Data button.

Right-click the column that contains the words, select Remove Other Columns.

Rename the column header as word.

The column contains text with multiple words. Therefore, we need to split the text.

In the Transform ribbon tab, select Split_Column/By_Delimiter.

In the Split Column by Delimiter dialog window, select comma as delimiter character and split into Rows option.

Rename the column header as word.

Transform the column into lowercase.

Sort the column in ascending order.

Rename the dataset as wikipedia_dictionary.

In the File ribbon tab, Press Close & Apply button.

[3] Scrap texts from Wiktionary

Get a url for the dictionary texts e.g.

https://en.wiktionary.org/wiki/Appendix:1000_basic_English_words

Paste the url into the From Web dialog box.

Press OK.

Power BI shall display the preview of the text file content.

This time, we will use the Add Table Using Examples approach. Press the button.

In the Add Table Using Examples dialog window, type the first three words. Power BI will automatically load the remaining words for A.

Scroll further down to the last word in A i.e. away.

Add the word baby (which is the first word in B) below the word away. Power BI will automatically add the rest.

If everything works well, all words down to Z will be added automatically.

Press OK button.

We shall get a new table in the Navigator window

Press Transform Data button.

Rename the column header as word.

Transform the column into lowercase.

Sort the column in ascending order.

Rename the dataset as wiktionary_dictionary.

In the File ribbon tab, Press Close & Apply button.

[4] Scrap texts from PDF file

Get a url for the PDF texts e.g.

https://www.wgtn.ac.nz/lals/resources/paul-nations-resources/vocabulary-lists/general-service-list/general-service-list-headwords.pdf

Paste the url into the From Web dialog box.

Press OK.

Power BI shall display the preview of the text file content.

Select the tables for all the PDF pages.

Press Transform Data button.

Select all columns for the Table001. In the Transform ribbon tab, select Unpivot_Columns/Unpivot-Columns.

Rename the column header as word.

Transform the column into lowercase.

Sort the column in ascending order.

Repeat the same steps for the other 9 tables.

Some table may not be correctly processed e.g. Table005 contains a multi-word texts.

Apply Split Column by Delimiter to the table.

Power BI will try to guess the delimiter character e.g. #(If)

Press OK.

It looks like the split process has been successful.

Continue the unpivot columns process for the remaining tables.

Next, we will append all tables into a new table.

Select Table001 and then click the Append_Queries/Append_Queries_as_new.

Select all tables.

Rename the new dataset as pdf_dictionary.

In the File ribbon tab, Press Close & Apply button.

All datasets shall appear in the Data Panel in the Power BI main window.

Power BI example file:

https://archive.org/download/dictionaryworld/scrap_dictionary_texts_from_various_web_sources.pbix

🤓

--

--

Mohamad Mahmood
Lexiconia

Programming (Mobile, Web, Database and Machine Learning). Studies at the Center For Artificial Intelligence Technology (CAIT), FTSM, UKM, Malaysia.