DLx Data Format: v1.0.0!

Published in Digital Linguistics, Mar 7, 2020

The Digital Linguistics (DLx) project is excited to announce the release of v1.0.0 of the Data Format for Digital Linguistics (DaFoDiL)!

Acknowledgments: A huge thank you to Brock Wrobleski and Vade Kamenitsa-Hale for their work in preparing this release, as well as Monica Macaulay and Hunter Lockwood for many discussions spent working out the details of this format!

What is the DLx data format?

The Data Format for Digital Linguistics (DaFoDiL) is a set of recommendations (called “schemas”) for how to store linguistic data in JSON, a simple, human-readable text format that is supported by every major programming language and is widely used for data storage and interchange on the web. The format is useful for anybody who manages a linguistic database, and it includes a schema for every kind of linguistic entity (e.g. Language, Morpheme, Text). It is part of a broader project called Digital Linguistics (DLx), which aims to create web-based tools for managing linguistic data and to encourage best practice in digital linguistic data management.

Why is this format useful?

Tools that adhere to this recommended format will be interoperable, allowing users to migrate their data easily from one tool to another. In addition, the format is compatible with the modern web platform, making it easy to manage linguistic data online or in a browser. JSON, the format underlying DaFoDiL, is easy to read and write from scripts in most programming languages, which greatly reduces the time researchers need to spend writing them. For the same reason, even tools that do not adhere to the DLx data format will find data stored in it easy to work with or support.
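To give a feel for how little scripting this takes, here is a minimal sketch in Python using only the standard library. The record mirrors the Gusii example given later in this post; the variable names are just for illustration, and nothing here is part of an official DLx library.

```python
import json

# A small Language record, shaped like the Gusii example later in this post.
# Only the "name" field is taken from the post; no other fields are assumed.
raw = '{"name": {"eng": "Gusii", "swa": "Kisii"}}'

# Parse the JSON text into an ordinary Python dictionary.
language = json.loads(raw)

# Look up the English name like any other dictionary value.
english_name = language["name"]["eng"]
print(english_name)  # prints: Gusii
```

No special parser or converter is needed; the data arrives as plain dictionaries and strings.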

DaFoDiL is not intended to be the format that language scientists work in directly. It is a storage format designed for use in databases or when working with language data programmatically. That said, because the DLx format uses JSON, it is highly human-readable, and users can simply open the text document for the item they are interested in to examine and edit the data firsthand. Since JSON files are just plain text documents with Unicode encoding, the data can also outlive any particular tool or user interface.

This format also facilitates adherence to the Austin Principles for Data Citation in Linguistics by supporting the use of persistent identifiers, fields for identifying contributors to the data and their role(s), easy searchability, human-readability (in the form of human-readable keys in addition to opaque database IDs), and interoperability between different tools and web technologies more generally.

Tools that Use DaFoDiL

All of the Digital Linguistics projects use DaFoDiL as their data format. At present, this includes the following tools:

Concordance: A library for performing concordance-related tasks on a DaFoDiL corpus
Tagged Corpus > DLx library: Converts a tagged (monolinear) text to DaFoDiL
Scription > DLx conversion library: Converts data stored in Scription format to DaFoDiL
Transliteration library: A library for transliterating text from one writing system to another

Another project that utilizes DaFoDiL is Rezonator, a tool for visualizing resonance and engagement in dialogue.

You may also be interested in the Scription format, which standardizes the format of interlinear glossed texts in a way that makes them computer-readable (i.e. parseable by a programming language). Unlike DaFoDiL, this format is intended for direct use and editing by researchers, using the common interlinear gloss format that linguists are already familiar with. Scription is also 100% compatible with DaFoDiL; you can use the Scription > DLx library to convert Scription files to DaFoDiL files.

How can I use DaFoDiL in my project?

To use DaFoDiL in your project, all you have to do is store your data in JSON, using the recommended field names and formats listed in the specification. For example, if you were storing metadata about a language, you would go to the Language schema and see that each Language object must have a name property. The value of name must in turn follow the Multi-Language String schema: it can be either a plain string (if the text is in English) or an object containing strings in multiple languages. For the language called Gusii, it might look like this:

{
  "name": {
    "eng": "Gusii",
    "swa": "Kisii"
  }
}

You would then continue this process for other fields you have information for, and save this as a JSON file, perhaps Gusii.json or guz.json.
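If you'd rather script this step, the sketch below shows one way to do it in Python. It is an illustration under stated assumptions, not the official DLx tooling: the helper names normalize_mls and save_language are hypothetical, the check covers only the name rule described above (a string, or an object of strings keyed by language), and treating a bare string as "eng" simply follows the example in this post. For real validation, check your data against the published schemas.

```python
import json
import tempfile
from pathlib import Path


def normalize_mls(value, default_lang="eng"):
    """Hypothetical helper: return a Multi-Language String as a dict.

    A bare string is treated as English, keyed here as "eng" (an
    assumption based on the example in this post).
    """
    if isinstance(value, str):
        return {default_lang: value}
    if (
        isinstance(value, dict)
        and value
        and all(isinstance(k, str) and isinstance(v, str) for k, v in value.items())
    ):
        return dict(value)
    raise ValueError("name must be a string or an object of language -> string")


def save_language(language, path):
    """Hypothetical helper: check the name property, then write the JSON file."""
    if "name" not in language:
        raise ValueError("a Language object must have a name property")
    normalize_mls(language["name"])  # raises ValueError if the name is malformed
    # ensure_ascii=False keeps non-ASCII characters readable in the file.
    text = json.dumps(language, ensure_ascii=False, indent=2)
    Path(path).write_text(text + "\n", encoding="utf-8")


gusii = {"name": {"eng": "Gusii", "swa": "Kisii"}}

# Write to a temporary folder here; in a real project you would pick your own path.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "guz.json"
    save_language(gusii, path)
    reloaded = json.loads(path.read_text(encoding="utf-8"))
```

Writing the file as UTF-8 with indentation keeps it easy to open and edit by hand, which fits the format's goal of staying human-readable.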

You can also use one of the available converters to convert data to DLx format.

What’s new in this version?

Version 1.0.0 is a major release with many additions and breaking changes since the release of v0.29.0. Most of the individual schemas have undergone major version bumps.

See the complete summary of changes here.

More additions are likely in the coming weeks as well! We’ll also be fixing any bugs that arise as others begin to use the DLx format in their projects.

I want to contribute!

Great! The Digital Linguistics project is 100% open source! Anyone is welcome to contribute, whether through reporting bugs, requesting features, or writing code. Check out the DLx organization on GitHub, and the developer documentation for DaFoDiL.

I have other questions

Ask away! Feel free to open an issue on GitHub with your question.