Querying, presenting and analysing XML: 3 pilot web applications

Reflections on 3 Digital Humanities projects presenting collections of poems and letters, built using eXist-db, Node.js and Angular.

--

Over the last 2 years working as a Research Software Engineer at Newcastle University, I've worked on 3 pilot projects where researchers have asked for web applications to explore collections of texts transcribed using XML. These projects have been a learning journey in working with XML for a developer without a background in Digital Humanities and a complete beginner with eXist-db. This post reflects on these projects to provide an overview of issues to consider for a developer working with XML for the first time.

The project team included RSE Fiona Galston and researchers James Cummings and Tiago Sousa Garcia.

The Pilots

The pilot projects were completed as part of the Animating Text Newcastle University project between 2019 and 2021. This project aimed to equip researchers with a proof-of-concept application to assist them in acquiring further funding. Pilots 1 and 2 are publicly available (with researchers Mark Byers and Sinead Morrissey, and Jennifer Orr respectively) and pilot 3 is in development. This diagram shows the infrastructure and data used in each project.

Documents had been transcribed into XML, a markup language for encoding documents in a human- and machine-readable format. Databases included eXist-db, an open-source document-based database queried using XQuery submitted through an API, and TEI-Publisher, an eXist-db application that transforms XML documents into HTML.

Key development areas

The post provides an overview of 4 key development areas: querying the data; presenting the data; creating a spine index (defined below); and deployment. The diagram summarises how these development areas were implemented in each pilot:

Development area 1: Querying the data

Pilot 1: The server submits queries to the eXist-db database using its REST API. Query strings are written in XQuery and added to the HTTP request as a query parameter called _query.

http://<username>:<password>@localhost:8080/exist/rest/db/<db-name>?_query=<query>

It took me a lot of trial and error to get the queries right, so I developed them within eXide, eXist-db's web interface, before adding them to my server code.
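For context, here is a minimal sketch of how the Node server might submit such a query; the use of axios, the collection name and the credentials are placeholders rather than the pilot's actual code:

// Minimal sketch: submit an XQuery string to the eXist-db REST API.
import axios from 'axios';

async function runQuery(query: string): Promise<string> {
  const response = await axios.get('http://localhost:8080/exist/rest/db/poems', {
    auth: {
      username: process.env.EXIST_USER ?? 'admin',
      password: process.env.EXIST_PASSWORD ?? '',
    },
    params: { _query: query }, // the XQuery goes in the _query parameter
    responseType: 'text',
  });
  return response.data; // XML (or HTML) returned by eXist-db
}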

There are 3 key queries used in pilot 1.

1. An index of all texts was created by writing a query that returned XML (all three queries are sketched after this list).

2. Searching across all texts was implemented using Lucene full-text search (already built into eXist-db).

3. A single text can be requested as HTML, transformed using an XSLT file (see development area 2).
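Hedged sketches of what these three queries might look like, held as strings on the Node server; the element names, namespaces and collection paths are assumptions about the markup rather than the pilots' actual encoding:

// 1. An index of all texts, returned as XML (element names are assumptions)
const indexQuery = `
  declare namespace tei="http://www.tei-c.org/ns/1.0";
  for $doc in collection("/db/poems")/tei:TEI
  return <text file="{util:document-name($doc)}">{$doc//tei:titleStmt/tei:title/text()}</text>
`;

// 2. Full-text search across all texts using the built-in Lucene index
const searchQuery = (term: string) => `
  declare namespace tei="http://www.tei-c.org/ns/1.0";
  for $hit in collection("/db/poems")//tei:l[ft:query(., "${term}")]
  return $hit
`;

// 3. A single text transformed to HTML with an XSLT file stored in the database
const htmlQuery = (docPath: string) =>
  `transform:transform(doc("${docPath}"), doc("/db/xslt/poem-to-html.xsl"), ())`;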

Pilot 2: The queries to get text indexes and single versions remain the same. Full-text searching was not implemented; instead, the user can filter the documents based on multiple criteria using the form shown below.

Once the user has chosen filters, the client sends their selection to the server as a JSON object.

The server must then formulate a potentially long-winded query to submit to eXist-db. Multiple selections from the same filter need to be joined as 'or' queries, e.g. 'show me all letters where Person A or Person B are mentioned'. Where multiple filters have been selected, these need combining as 'and' queries, e.g. 'show me all letters from Person A and written in 1845'.

For each filter present, a string representing a section of XQuery is created. For example, if the user has requested to view letters from the sender with ID ppl:104, this string is created: data($letter/opener//persName[@role="sender"]/@ref="ppl:104"). These snippets of queries are then strung together to form the final complete query, as seen in the code here.
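As a hedged illustration, the server-side assembly might look something like the sketch below; the filter names and the date expression are hypothetical, and only the sender snippet mirrors the example above:

interface FilterSelection {
  senders?: string[]; // e.g. ["ppl:104"]
  years?: string[];   // e.g. ["1845"]
}

// Build the XQuery predicate: selections within a filter are joined with 'or',
// and the different filters are then combined with 'and'.
function buildPredicate(filters: FilterSelection): string {
  const clauses: string[] = [];
  if (filters.senders?.length) {
    clauses.push('(' + filters.senders
      .map(id => `data($letter/opener//persName[@role="sender"]/@ref="${id}")`)
      .join(' or ') + ')');
  }
  if (filters.years?.length) {
    // hypothetical date expression; the pilot's actual XPath differed
    clauses.push('(' + filters.years
      .map(year => `starts-with($letter//date/@when, "${year}")`)
      .join(' or ') + ')');
  }
  return clauses.join(' and ');
}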

Pilot 3: An index was maintained in Cosmos DB, a document-based database. Different document types were stored because, in this project, a distinction was made between archetypal versions and versions with only small changes; other files summarising changes were also stored. Although it would not have been impossible to store the document type in the XML, it was deemed more practical to store this information in JSON format. This meant the server returned JSON indexes to the client instead of XML, reducing the chance of transformation errors in the client when processing XML.
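A hedged sketch of what an entry in that JSON index might look like (the field names here are hypothetical, not the pilot's actual schema):

interface VersionIndexEntry {
  id: string;            // document ID, e.g. "M3" (hypothetical)
  title: string;
  documentType: 'archetype' | 'variant' | 'changeSummary'; // assumed type labels
  xmlPath: string;       // where the XML file lives in eXist-db
}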

Evaluation: Submitting convoluted queries via the eXist-db API is possible, but results in code that is difficult to follow and hard to modify. Compared to other databases, writing queries can be more verbose. However, XQuery is powerful for identifying specific sections of a document to search for and return, and being able to specify an XSLT file in the query to transform XML to HTML is highly advantageous. As an XQuery beginner I would have found code samples for different use cases beneficial.

Development area 2: Presenting the data

A core requirement of these projects is to present the text files in a way that reflects a reasonable amount of the formatting captured in the XML (and therefore of the original document).

Pilot 1: The server returns an HTML version of the XML to the client. This has been created using an XSLT file that specifies the rules of the transformation. For example, this rule dictates that lb (line break) elements in the XML should be transformed to an HTML line break:

<xsl:template match="lb">  <!-- select XML element -->
  <br/>                    <!-- replace with HTML element -->
</xsl:template>

In addition to structural elements, we can also preserve styling. For example, an XML rend attribute (indicating some kind of rendering) can be preserved as an HTML class and then styled in the client CSS: a rend="italic" attribute in the XML becomes class="italic" in the HTML, and a new CSS rule (.italic { font-style: italic; }) is created to style this section of text.

We can use Angular's property binding to insert HTML fetched from the server into a div. Here, document is a field in the component's TypeScript file that is populated with the results of a call to the server:

<div id="view-panel" [innerHTML]="document"></div>
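A minimal sketch of the component behind that binding, assuming a hypothetical /api route on the server:

import { Component, OnInit } from '@angular/core';
import { HttpClient } from '@angular/common/http';

@Component({
  selector: 'app-document-view',
  template: '<div id="view-panel" [innerHTML]="document"></div>',
})
export class DocumentViewComponent implements OnInit {
  document = '';

  constructor(private http: HttpClient) {}

  ngOnInit(): void {
    // fetch the transformed HTML from the server and bind it into the div
    this.http.get('/api/documents/poem-1', { responseType: 'text' }) // hypothetical route
      .subscribe(html => (this.document = html));
  }
}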

Pilot 2: We added functionality to the HTML text: the people, organisations and works mentioned in letters can be clicked to view a modal with more information about them. The information was retrieved with another database query, this time to central files containing all the people, organisations and works with short descriptions.

To open a modal in Angular, we would normally add click functionality to an HTML element that triggers an openModal() function like this:

<span class="persName" (click)="openModal(persID)">Joe Bloggs</span>

In this case, we would have to add this text to the HTML at the point of transformation from XML (i.e. create a rule in the XSLT file). We can do this, but the functionality gets removed by Angular sanitisation when the HTML is loaded using [innerHTML]. Therefore, we have to add this behaviour after we have loaded the HTML, using ElementRef to access the DOM elements. For example, the XSLT file specifies that all text marked as a person's name in the XML (<persName>) should be transformed to an anchor tag with the class persName when the HTML is created. We can then use the method below in the client to add an event listener, which triggers the modal to open by calling handlePersonClick():

// all anchor tags with a persName classname
const persAnchors = this.elRef.nativeElement.querySelectorAll('a.persName');
persAnchors.forEach((person: HTMLAnchorElement) => {
  person.addEventListener('click', this.handlePersonClick);
});
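The handler itself isn't shown above; a hedged sketch of what it might do inside the same component (the data-ref attribute and the openModal() call are assumptions, not the pilot's exact code):

// Arrow function keeps `this` bound to the component when used as an event listener.
// Assumes the XSLT writes the person's ID into a data-ref attribute on the anchor.
private handlePersonClick = (event: Event): void => {
  const persId = (event.currentTarget as HTMLAnchorElement).getAttribute('data-ref');
  if (persId) {
    this.openModal(persId); // fetches the person's details and opens the modal
  }
};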

Pilot 3: This pilot did not need the modal behaviour so was a good opportunity to try using TEI-Publisher's web components. These enable the client to connect directly to the database and request an HTML transformation of a particular XML file.

To add web components to an Angular project:

1. Add this to the <head> section of index.html:
<script src="https://unpkg.com/@webcomponents/webcomponentsjs@2.4.3/webcomponents-loader.js"></script>
<script type="module" src="https://unpkg.com/@teipublisher/pb-components@latest/dist/pb-components-bundle.js"></script>

2. Add this to the assets list within the architect section of angular.json:

{
  "glob": "{*loader.js,bundles/*.js}",
  "input": "node_modules/@webcomponents/webcomponentsjs/",
  "output": "node_modules/@webcomponents/webcomponentsjs"
}

3. Add a schema to app.module.ts:

@NgModule({
  declarations: [AppComponent],
  imports: [...],
  providers: [...],
  bootstrap: [AppComponent],
  schemas: [CUSTOM_ELEMENTS_SCHEMA] // add this line (CUSTOM_ELEMENTS_SCHEMA is imported from '@angular/core')
})

You can then use TEI-Publisher's web components as in the sketch below, and read about the other components you can use here. The TEI-Publisher endpoint is specified, along with a path element, where we have used Angular binding to insert a filename, and an odd element, where the ODD filename is specified.
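Since the original snippet is not reproduced here, below is a hedged sketch based on the pb-components documentation, written as an Angular component with an inline template; the endpoint URL, ODD name and exact attributes are assumptions rather than the pilot's actual markup:

import { Component, Input } from '@angular/core';

@Component({
  selector: 'app-tei-view',
  template: `
    <pb-page endpoint="https://teipublisher.example.org/exist/apps/tei-publisher">
      <pb-document id="doc" path="{{ filename }}" odd="myproject"></pb-document>
      <pb-view src="doc"></pb-view>
    </pb-page>
  `,
})
export class TeiViewComponent {
  // the XML file to display, inserted into the path attribute with Angular binding
  @Input() filename = '';
}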

An ODD (‘One Document Does it all’) file contains the rules for transforming the XML to HTML in a similar way to XSLT. TEI-Publisher offers a GUI to create the ODD document without writing code. This was useful although I struggled to find definitions for all the terms used in the GUI.

Evaluation: Manipulating the DOM after loading the HTML was a clumsy solution for adding click behaviour, but it served its purpose.

TEI-Publisher worked in the context in which we used it, as there were no queries needed beyond requesting and updating a specific file. For us to use it in other projects, TEI-Publisher would need to act as a layer on top of eXist-db so that we could use the web components of TEI-Publisher with the API querying flexibility of eXist-db. As far as I can tell, because the data for TEI-Publisher is not stored within eXist-db, this is not possible.

I haven't gone into detail here about writing a custom ODD file as I am still working it out, but there are a couple of resources you can check out: the official documentation and this workshop. Both XSLT and ODD have a steep learning curve and I could do with a beginner's guide; I find that the TEI-Publisher and related documentation assumes a level of pre-existing knowledge that I don't have. However, both are powerful in terms of the transformations that can be achieved and worth taking the time to get to grips with.

Development area 3: Generating and updating a spine index

The generation of a spine index is more of a self-contained feature than the other themes covered in this post, but I have picked it out because it illustrates the challenge of performing analysis on XML that is not possible using XQuery.

Across different versions of the same text, lines that are deemed to be essentially the same (i.e. with only small changes) are given the same spine index number. With each new version added, the spine index adjusts, either by adding another line reference to an existing spine index number or by inserting new lines as new numbers. In the example below, you can see how the spine index develops with each new version added, imagining a short poem ('This is the first line, And the second, Finally the last') and a sequence of 3 drafts:

Our aim was to automate some of the process of creating a spine index to help rather than replace a scholarly editor. Our approach does not take into account any meaning or context, only the words themselves.

Pilot 1: We created the spine index before the site was deployed. No additional documents would be added, so this only needed to happen once. To do this, the XML for each version was converted to JSON and the lines extracted as a list. Every line was compared to every other line in the other versions, and lines were then grouped based on being deemed 'similar enough'. We did this using cosine similarity and, through experimentation, chose 0.7 as a cut-off point that seemed to work in this instance. I won't go into the details of cosine similarity, other ways to compare texts or the obvious losses that this mechanistic comparison results in, although these are all considerations for us going forward. These sites are good starting points to read more about cosine similarity and alternative text comparison methods.
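For illustration, here is a minimal sketch of the comparison, assuming simple word-count vectors; the pilots' actual tokenisation and grouping code differed in detail:

// Build a word-count vector for a line of text
function wordCounts(line: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const word of line.toLowerCase().split(/\W+/).filter(Boolean)) {
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return counts;
}

// Cosine similarity: dot product of the two vectors divided by the product of their magnitudes
function cosineSimilarity(a: string, b: string): number {
  const va = wordCounts(a);
  const vb = wordCounts(b);
  let dot = 0;
  for (const [word, count] of va) dot += count * (vb.get(word) ?? 0);
  const magnitude = (v: Map<string, number>) =>
    Math.sqrt([...v.values()].reduce((sum, c) => sum + c * c, 0));
  return dot === 0 ? 0 : dot / (magnitude(va) * magnitude(vb));
}

// Lines scoring above the experimentally chosen cut-off are grouped under the same spine index
const similarEnough = (a: string, b: string) => cosineSimilarity(a, b) >= 0.7;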

The method resulted in a CSV representation of the spine index, an extract of which is below. The number on the left is the spine index. Each index has a list of codes, each referencing a line in a document (in the format docID.lineID). You can see from this index that the first line from document M1 is only present in that document and was removed in subsequent drafts of the poem.

1;M1.1
2;M1.2,M2.1,M3.1,M4.1,M6.2,M5.2
3;M1.3,M2.2,M3.2,M4.2,M5.3,M6.3
4;M1.4,M2.3,M3.8,M4.9
5;M1.5,M2.4,M4.10,M5.15,M6.15

Pilot 3: A primary aim of this pilot was to create a website where a researcher could add new transcriptions as and when they completed them. Therefore, the spine index needed to update after each new version was added. We approached this in a very similar way to pilot 1; however, a document-based database was used to maintain a working version of the spine index instead of a CSV file, to aid flexibility and querying.

When the first version of the poem is uploaded, a new spine index is created with a simple 1:1 relationship (line 1 is spine index 1, and so on). When the next version is uploaded, the spine index is retrieved from the database and each line of the new version is compared to the spine index. If a match is found, the line is added to the list of matching lines for that spine index number. If no match is found, the spine index adjusts to add a new entry.
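A hedged sketch of that update step; the SpineEntry shape and the similarEnough() helper are assumptions drawn from the description above, not the pilot's exact code:

interface SpineEntry {
  index: number;    // spine index number
  lines: string[];  // line references in the format docID.lineID, e.g. "M2.3"
  text: string;     // a representative reading used for comparison
}

function addVersion(
  spine: SpineEntry[],
  docId: string,
  versionLines: string[],
  similarEnough: (a: string, b: string) => boolean,
): SpineEntry[] {
  versionLines.forEach((line, i) => {
    const ref = `${docId}.${i + 1}`;
    const match = spine.find(entry => similarEnough(entry.text, line));
    if (match) {
      // a match: add this line to the existing spine index entry
      match.lines.push(ref);
    } else {
      // no match: adjust the spine by adding a new entry (appended here for simplicity;
      // the real index inserts new lines at the appropriate position)
      spine.push({ index: spine.length + 1, lines: [ref], text: line });
    }
  });
  return spine;
}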

Evaluation: Converting the XML to JSON and then extracting the poem lines made them easy to work with and compare. However, throughout these projects, converting to and from JSON could be a challenge as it results in deeply nested JSON that is prone to errors. It was vital to know the structure of the XML in advance, and for all documents to follow this structure accurately, in order to navigate this JSON.

Development area 4: Deployment

Pilot 1: The clients and servers used in these pilots were dockerised and then deployed using Azure App Service. Our instinct was to pursue the same route for eXist-db using the official Docker image. For the clients and servers we use Terraform, an infrastructure-as-code tool, so that if the infrastructure is destroyed we can bring everything back up easily. In line with this approach, the server contained code that put all the required data and config files into eXist-db on start-up using the API.
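For illustration, that start-up population step might look something like this sketch, using PUT requests to the REST API; the directory, collection name and credentials are placeholders:

import axios from 'axios';
import { promises as fs } from 'fs';
import * as path from 'path';

// Store every XML file from a local folder into an eXist-db collection at start-up.
async function populateDatabase(dataDir: string): Promise<void> {
  const files = (await fs.readdir(dataDir)).filter(f => f.endsWith('.xml'));
  for (const file of files) {
    const xml = await fs.readFile(path.join(dataDir, file), 'utf8');
    await axios.put(`http://localhost:8080/exist/rest/db/poems/${file}`, xml, {
      auth: {
        username: process.env.EXIST_USER ?? 'admin',
        password: process.env.EXIST_PASSWORD ?? '',
      },
      headers: { 'Content-Type': 'application/xml' },
    });
  }
}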

Pilot 2: Once deployed, we discovered that our eXist-db deployment using Azure App Service periodically lost all its data! Although data was entered when the server started, we had to repeatedly and manually repopulate the database after these losses. This was a high-priority issue to address. My colleague replaced the Azure App Service approach by installing eXist-db on a VM that we can use for all projects requiring it, and this has proved a much better option.

Next steps

Firstly, the spine index creation remains quite fragile given the involvement of deeply nested JSON. There are also obvious cases that are not currently provided for, for example if there are two lines in a version that are deemed similar to one line from a previous version. The comparison could also be improved by applying more sophisticated text comparison techniques. Further, an enhanced interaction between the editor and the computer-generated spine index would allow the editor to correct errors and add more insightful comparisons.

Secondly, improvements could be made in the presentation of the documents. On a personal level, XSLT and ODD are still somewhat of a mystery. I would like to become more competent at creating transformation rules that accurately reflect and bring to life the original documents that have been transcribed; a blog post focusing on both formats would be a good target. It would also be brilliant to capture changes between versions at different scales, from illustrating how a single sentence evolves across drafts to visualising the whole spine index to show the evolution of the entire document.
