How to Import Wikipedia Data into a Database

Kazuma Kusu
6 min read · Jul 29, 2024


Most well-known relational database management systems (RDBMS) now support the JSON type. With the rise of NoSQL databases, which cater to diverse data models, RDBMSs have evolved to meet various needs. This evolution allows a single cell in a table to store an object such as a JSON document rather than just a scalar value.

This article introduces how to store JSON files in the Oracle Database, a typical relational database system.
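As a concrete illustration, here is a minimal sketch of what storing JSON in Oracle Database looks like with the python-oracledb driver, assuming Oracle Database 21c or later (where the native JSON column type is available). The credentials, DSN, and table name are placeholders, not values from this article's setup.

import json

import oracledb

# Placeholder credentials and DSN; adjust to your own instance.
conn = oracledb.connect(user="scott", password="tiger",
                        dsn="localhost:1521/FREEPDB1")
with conn.cursor() as cur:
    # A single column of the native JSON type (Oracle Database 21c+).
    cur.execute("create table demo_json (data json)")
    # Textual JSON is implicitly converted to the JSON type on insert.
    cur.execute("insert into demo_json (data) values (:1)",
                [json.dumps({"title": "Anarchism", "revid": "1234"})])
conn.commit()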

If you want to practice quickly, please refer to this article.

Preliminaries

This article shows how to store Wikipedia articles in a database in JSON format.

Requirements

This article uses the following tools:

  • Docker (ver. 26.1.4 in my environment)
  • Git
  • Oracle Database 21c or 23ai

For this article, I prepared an instance of Oracle Database 23ai on Oracle Cloud Infrastructure. Note that a free-license instance cannot store all of the articles in the Wikipedia dump data due to resource limitations.

Therefore, the following measures are necessary:

  • A paid license of Oracle Database 23ai or 21c.
  • Limit the number of articles loaded from the Wikipedia dump data (see Section “2. Set up the connection information of your database”).

Program & Dataset

I have prepared benchmark-for-json, a program that converts Wikipedia dump data into JSON data and stores it in a database. To follow along with this article, you need both benchmark-for-json and the Wikipedia dump data. I will explain how to prepare each of them below.

Wikipedia Dump (enwiki)

Wikipedia dump data is available for each language, with a download page per language. English Wikipedia dump data (enwiki) can be obtained from its dump page. This article uses the dump data as of May 1, 2024, but you may download a newer version if one is available.

benchmark-for-json

I prepared a program to convert Wikipedia dump data to JSON format and store the JSON data into Oracle Database.

You can clone it with the following command.

git clone --branch task-single-insert --single-branch git@github.com:meer-consulting/benchmark-for-json.git

Place the downloaded Wikipedia dump data under the directory benchmark-for-json/data/. I will explain how to run it in detail in a later section.

How to execute benchmark-for-json

This section describes how to run the services defined in benchmark-for-json with docker commands to convert Wikipedia dump data to JSON-formatted files. After that, it explains how to import the JSON data into Oracle Database.

1. Convert Wikipedia dump data to JSON files

The benchmark-for-json repository provides the function to convert Wikipedia dump data to JSON files through the generate-json service. As described in a later section, this service uses WikiExtractor to convert the dump data to JSON data.

Note that the dump data to be converted should be stored in the directory “benchmark-for-json/data/”, where the docker-compose.yml file is located.

cp enwiki-20240501-pages-articles-multistream.xml.bz2 data/

Next, set up the environment variables for the service by creating a file “.env” in the directory “docker/development” with the following contents.

DUMP_FILENAME=enwiki-20240501-pages-articles-multistream.xml.bz2

You can copy and use the file “.env.example” in the same directory.

The WikiExtractor command refers to the environment variable DUMP_FILENAME, so if you want to use a different version of the dump data, change the value of DUMP_FILENAME to the name of that dump file.

Start the wikiextractor service with the following command.

docker compose --profile generate-json up

After executing the above command, the wikiextractor service creates a directory named “wikiextractor” in the repository root, where Wikipedia articles are sorted into multiple subdirectories and exported as many JSON files.

The schema of the generated JSON is as follows:

[
  {
    "id": string,
    "revid": string,
    "url": string,
    "title": string,
    "text": string
  }
]
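For illustration, here is a short sketch of how you might iterate over the generated articles in Python. It assumes the array layout shown above, and the wiki_* file pattern follows WikiExtractor's usual output naming.

import json
from pathlib import Path

# Walk the output tree created by the generate-json service.
for path in sorted(Path("wikiextractor").rglob("wiki_*")):
    articles = json.loads(path.read_text(encoding="utf-8"))
    for article in articles:
        print(article["id"], article["title"])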

2. Set up the connection information of your database

The benchmark-for-json locust-master and locust-worker services connect to the database and store the JSON data there. To connect to an Oracle Database 23ai instance, add the database connection information to “.env” in the following format.

DUMP_FILENAME=enwiki-20240501-pages-articles-multistream.xml.bz2
# Add the following
ORACLE_USER=<user name>
ORACLE_PASSWORD=<password>
ORACLE_DSN=<IP address of oracle db instance>:<port>/<SID>
ORACLE_MODE=thick
MAX_JSON_LOAD=3000000 # maximum read limit
TABLE_NAME=wiki_en
TASK_RATIO_1_WIKI_IMPORT=1

The meanings of variables other than database connection information are as follows:

  • MAX_JSON_LOAD: Limits the number of articles loaded. If you want a quick trial, set a small value (see the sketch after this list).
  • TABLE_NAME: The name of the table created in Oracle Database.
  • TASK_RATIO_1_WIKI_IMPORT: The ratio setting for the Locust task.
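To make the role of MAX_JSON_LOAD concrete, here is a sketch (not the repository's actual code) of how such a cap is typically applied while reading articles. Both iter_articles and store are hypothetical helpers standing in for the JSON-reading and insert logic.

import os

max_json_load = int(os.environ.get("MAX_JSON_LOAD", "3000000"))

loaded = 0
for article in iter_articles("wikiextractor"):  # hypothetical helper
    if loaded >= max_json_load:
        break  # stop once the configured limit is reached
    store(article)  # hypothetical insert routine
    loaded += 1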

3. Import Wikipedia data

The JSON data has been prepared and the database connection information has been set up, so the data is ready to be stored in the database. Using the benchmarking tool Locust, I created a Python program that stores Wikipedia data in Oracle Database.

The following command launches the container for Locust.

docker compose --profile locust up

If the database connection information has been set up properly, the Locust server starts, and you can open the URL “http://127.0.0.1:8089/” in your favorite Web browser. Press the Start button to begin the import.

(Screenshot: Locust’s chart graph)

Description of benchmark-for-json

This section describes the structure of the docker-compose.yml defined in benchmark-for-json and the libraries that are useful for storing JSON data.

Transform Wikipedia dump data to JSON format

The wikiextractor service performs the process of converting Wikipedia dump data to JSON format. This service uses WikiExtractor, which outputs the data in JSON format to files when the --json option is specified. When the service starts, the following command is executed:

wikiextractor --json --processes 60 --output /wikiextractor/ /data/${DUMP_FILENAME:?Set .env}

When the process completes properly, the directory benchmark-for-json/wikiextractor/ stores the converted data. The conversion takes a long time because the dump data is large, but you can speed it up by setting the --processes option to a value that matches the CPUs of the execution environment.
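If you are unsure what value to pass to --processes, a reasonable upper bound is the number of CPUs Python reports:

import os

print(os.cpu_count())  # upper bound for WikiExtractor's --processes option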

Insert JSON data into your database

The locust-master and locust-worker services perform the process of storing the JSON data in the database. These services use Locust, a benchmarking tool, and perform the database operations with the python-oracledb library.

I will write a separate article about Locust when I have a chance; in the meantime, to check the contents of the load test, refer to benchmark-for-json/locust/locustfile.py.
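For orientation, here is a minimal sketch, not the repository's actual locustfile, of a Locust user that inserts one article per task with python-oracledb. The article payload is a placeholder following the schema shown earlier, and the connection details come from the .env variables described above.

import json
import os

import oracledb
from locust import User, task

class WikiImportUser(User):
    def on_start(self):
        # Connect once per simulated user, using the .env settings.
        self.conn = oracledb.connect(
            user=os.environ["ORACLE_USER"],
            password=os.environ["ORACLE_PASSWORD"],
            dsn=os.environ["ORACLE_DSN"],
        )

    @task
    def wiki_import(self):
        # Placeholder payload following the schema shown earlier.
        article = {"id": "1", "revid": "1", "url": "https://en.wikipedia.org/",
                   "title": "Example", "text": "Example body."}
        table = os.environ.get("TABLE_NAME", "wiki_en")
        with self.conn.cursor() as cur:
            cur.execute(f"insert into {table} (data) values (:1)",
                        [json.dumps(article)])
        self.conn.commit()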

Summary

I have created “benchmark-for-json”, a toolset for storing Wikipedia dump data as JSON in Oracle Database. Since it uses Locust, it can also serve as an environment for defining tasks and evaluating their performance. If you want to store Wikipedia data in a database and test JSON-format queries or full-text search, please make use of this tool! I am planning to write a separate article on JSON storage, querying, and performance evaluation methods.

Put It into Practice

Thank you for reading this article. If you would like to put the contents of this article into practice, please try the following articles.
