Scribe-Data: A Guide to Open Source Language Data

Mahfuza Humayra Mohona
Apr 5, 2024


image from github.com/scribe-org

Welcome to the world of Scribe-Data, a fascinating project that’s all about making language data more accessible and useful for everyone. This blog post will take you through the basics of Scribe-Data, how it works, and how you can contribute to this exciting project.

What is Scribe-Data?

Scribe-Data is an open-source project that provides scripts and tools for extracting and formatting linguistic data from Wikidata and Wikipedia. This data powers the applications developed by Scribe, which aims to create tools that help people learn and use different languages. For example, it backs the commands in Scribe applications that let users conjugate verbs and translate from one language to another.

At its core, Scribe-Data contains Python scripts that interact with the Wikidata and Wikipedia APIs to retrieve and process information related to languages and writing systems. This data is then used to build language interfaces and input methods for Scribe’s cross-platform applications.

How Does Scribe-Data Work?

Scribe-Data uses SPARQL to query data from Wikidata. SPARQL is a bit like asking a librarian for books on a specific topic. In this case, it’s asking for information about languages, words, and how they’re used. This data is then organized and made available for Scribe applications to use.
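To make the librarian analogy concrete, here is a minimal Python sketch of what such a request can look like. It is not Scribe-Data's own code: the function names are made up for illustration, and the query (a handful of German noun lexemes) is only an assumed example of the kind of question you can ask. The endpoint is the public Wikidata Query Service, `Q188` is Wikidata's identifier for German, and `Q1084` is its identifier for "noun".

```python
import json
import urllib.parse
import urllib.request

# Public SPARQL endpoint of the Wikidata Query Service.
WIKIDATA_SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"


def build_noun_query(language_qid: str, limit: int = 5) -> str:
    """Build a SPARQL query for noun lexemes in one language (hypothetical helper)."""
    return f"""
    SELECT ?lexeme ?lemma WHERE {{
      ?lexeme dct:language wd:{language_qid} ;
              wikibase:lexicalCategory wd:Q1084 ;
              wikibase:lemma ?lemma .
    }}
    LIMIT {limit}
    """


def extract_lemmas(response: dict) -> list[str]:
    """Pull the lemma strings out of a SPARQL JSON result (hypothetical helper)."""
    return [row["lemma"]["value"] for row in response["results"]["bindings"]]


def fetch_lemmas(language_qid: str, limit: int = 5) -> list[str]:
    """Send the query to Wikidata and return the lemmas (needs network access)."""
    params = urllib.parse.urlencode(
        {"query": build_noun_query(language_qid, limit), "format": "json"}
    )
    request = urllib.request.Request(
        f"{WIKIDATA_SPARQL_ENDPOINT}?{params}",
        # Wikidata asks clients to identify themselves; this value is a placeholder.
        headers={"User-Agent": "scribe-data-blog-example/0.1"},
    )
    with urllib.request.urlopen(request) as reply:
        return extract_lemmas(json.load(reply))
```

Calling `fetch_lemmas("Q188")` should return a short list of German nouns. The sketch sticks to the Python standard library so it stays self-contained.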

The main data update process starts with a script named `update_data.py`, which runs the SPARQL queries that retrieve language data from Wikidata.

To run `update_data.py`, use the following command from the Scribe-Data folder:
```
python3 src/scribe_data/extract_transform/wikidata/update_data.py
```
This command starts the data update process, which keeps the language data behind Scribe's tools current.

Other scripts retrieve popular words from Wikipedia, along with the words that typically follow them, to power an autosuggestion feature that offers relevant suggestions as users type. Emojis, in turn, are sourced from Unicode for emoji suggestions.
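The idea behind "popular words and the words that follow them" can be sketched in a few lines of Python. This is a toy illustration under my own assumptions, not the project's actual pipeline: the function name, its parameters, and the sample sentence are all hypothetical, and real input would be a large body of Wikipedia text rather than a single string.

```python
from collections import Counter, defaultdict


def build_autosuggestions(
    text: str, top_words: int = 3, suggestions_per_word: int = 2
) -> dict[str, list[str]]:
    """Map the most frequent words to the words that most often follow them."""
    words = text.lower().split()

    # How often each word appears overall.
    word_counts = Counter(words)

    # For every word, count which words come directly after it.
    followers: defaultdict[str, Counter] = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        followers[current][following] += 1

    # Keep only the most popular words and their most common followers.
    return {
        word: [w for w, _ in followers[word].most_common(suggestions_per_word)]
        for word, _ in word_counts.most_common(top_words)
    }


suggestions = build_autosuggestions(
    "the cat sat on the mat the cat ran", top_words=1
)
print(suggestions)
```

In this sample, "the" is the most frequent word and is followed by "cat" twice and "mat" once, so those two become its suggestions.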

Accessing and Using Scribe-Data

The Scribe-Data project is hosted on GitHub at the following repository: https://github.com/scribe-org/Scribe-Data. Developers and users can access the project’s source code, scripts, and documentation from this central location.

The repository’s directory structure is organized as follows:

  • extract_transform/: Contains scripts and tools for extracting and transforming language data from sources like Wikidata and Wikipedia, preparing it for use in Scribe applications.
  • load/: Includes scripts and utilities for loading and processing language data into a usable format for Scribe applications.
  • resources/: Contains commands and metadata about supported languages, including their names, codes, scripts, writing directions, and plural rules, which are used to configure language tools in Scribe applications.

The project’s README file offers detailed guidance on setting up the development environment, running the data extraction scripts, and understanding the overall project structure. This documentation serves as a valuable resource for those who want to get started with Scribe-Data.

Benefits of Using Scribe-Data

Scribe-Data offers several benefits to developers and users:

- Rich Language Data: By leveraging the vast amount of language data from Wikidata and Wikipedia, Scribe-Data enables the creation of language applications that support a wide range of languages and writing systems.
- Cross-Platform Tools: The data and scripts provided by Scribe-Data support the development of language tools that work across different platforms, including iOS, Android, and desktop.
- Accessibility and Communication: Scribe-Data plays a crucial role in promoting accessibility and breaking down communication barriers. It provides effective language input and interface solutions, making it easier for people from different linguistic backgrounds to communicate and interact.

Contributing to Scribe-Data

Scribe-Data welcomes contributions from the community. Whether you’re reporting bugs, working on new features, or adding language data via Wikidata, there’s a place for you in the Scribe-Data project. The repository’s contributing guide and README provide detailed guidelines and resources for those interested in contributing. For first-time contributors, issues marked with `good first issue` are designed to help you get started.

Whether you’re a language enthusiast, a developer interested in open-source projects, or someone seeking to break down communication barriers, Scribe-Data is a project worth exploring and supporting. By getting involved, you can play a vital role in shaping the future of language-focused applications and making a positive impact on the way people learn, communicate, and connect across the world.
