Vokter v0.2: Software architecture & design philosophy

Job Management

Vokter runs two types of jobs: detection jobs and matching jobs. Both are scheduled periodically (using Quartz Scheduler) and executed concurrently.

Scaling

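Vokter was designed to scale and to remain future-proof. To that end, it is implemented to handle a high number of jobs, both in how jobs are batched and in how their data is persisted. Jobs that target the same URL are grouped into a cluster, and a cluster is cleaned up in two cases: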

  1. if the difference detection job fails to fetch content from a specific URL after 10 consecutive attempts, the entire cluster for that URL is expired. When a cluster expires, all of the associated client REST APIs receive a time-out call;
  2. every time a matching job is canceled by its client, Vokter checks whether any matching jobs remain in its cluster, and if not, the cluster is cleared from the workspace.
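
The two rules above can be sketched in a few lines of Java. This is only an illustration, not Vokter's actual code: the class JobCluster and its method names are hypothetical, and the only value taken from the text is the threshold of 10 consecutive failed fetches.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of one cluster: the detection job for a URL plus its matching jobs. */
class JobCluster {
    static final int MAX_CONSECUTIVE_FAILURES = 10; // threshold stated in the text

    private final String url;
    private final Set<String> matchingJobIds = new HashSet<>();
    private int consecutiveFailures = 0;
    private boolean expired = false;

    JobCluster(String url) {
        this.url = url;
    }

    void addMatchingJob(String jobId) {
        matchingJobIds.add(jobId);
    }

    /** Records the outcome of one fetch attempt; returns true if the cluster just expired. */
    boolean recordFetch(boolean success) {
        if (expired) {
            return false;
        }
        if (success) {
            consecutiveFailures = 0; // only consecutive failures count
            return false;
        }
        consecutiveFailures++;
        if (consecutiveFailures >= MAX_CONSECUTIVE_FAILURES) {
            expired = true; // a real system would also send a time-out call to every client here
            return true;
        }
        return false;
    }

    /** Cancels one matching job; returns true if the cluster is now empty and can be cleared. */
    boolean cancelMatchingJob(String jobId) {
        matchingJobIds.remove(jobId);
        return matchingJobIds.isEmpty();
    }

    boolean isExpired() {
        return expired;
    }

    String url() {
        return url;
    }

    public static void main(String[] args) {
        JobCluster cluster = new JobCluster("http://example.com/page");
        cluster.addMatchingJob("job-1");
        for (int attempt = 1; attempt <= MAX_CONSECUTIVE_FAILURES; attempt++) {
            if (cluster.recordFetch(false)) {
                System.out.println("Cluster expired after attempt " + attempt);
            }
        }
    }
}
```

Resetting the counter on every successful fetch is what makes the failures "consecutive": a URL that is flaky but occasionally reachable never expires its cluster.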

Persistence

Documents, indexing results and detected differences are all stored in MongoDB. To avoid repeated bulk operations on the database, every query (documents, tokens, occurrences and differences) is covered by an in-memory cache with an expiry duration between 20 seconds and 1 minute.
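
As a rough illustration of this caching layer, here is a minimal time-based cache in plain Java. It is a sketch under assumptions: ExpiringCache is a hypothetical name, the loader function stands in for a MongoDB query, and the clock is injected only so that expiry can be tested without sleeping.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.function.LongSupplier;

/** Hypothetical read-through cache whose entries expire after a fixed time-to-live. */
class ExpiringCache<K, V> {

    /** A cached value together with the instant at which it stops being valid. */
    private static final class Entry<V> {
        final V value;
        final long expiresAtMillis;

        Entry(V value, long expiresAtMillis) {
            this.value = value;
            this.expiresAtMillis = expiresAtMillis;
        }
    }

    private final Map<K, Entry<V>> entries = new ConcurrentHashMap<>();
    private final long ttlMillis;
    private final LongSupplier clock; // injected so tests can control time

    ExpiringCache(long ttlMillis, LongSupplier clock) {
        this.ttlMillis = ttlMillis;
        this.clock = clock;
    }

    /** Returns the cached value, calling the loader only if the entry is missing or expired. */
    V get(K key, Function<K, V> loader) {
        long now = clock.getAsLong();
        Entry<V> entry = entries.get(key);
        if (entry == null || entry.expiresAtMillis <= now) {
            entry = new Entry<>(loader.apply(key), now + ttlMillis);
            entries.put(key, entry);
        }
        return entry.value;
    }

    public static void main(String[] args) {
        // 30 seconds sits inside the 20 seconds to 1 minute range mentioned in the text.
        ExpiringCache<String, String> cache =
                new ExpiringCache<>(30_000L, System::currentTimeMillis);
        System.out.println(cache.get("doc-1", key -> "loaded " + key));
    }
}
```

A short expiry keeps memory usage bounded while still absorbing the bursts of identical queries that occur when many jobs for the same document run close together.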

Indexing

The text string that represents the document snapshot captured during the Reading phase is passed through a parser that tokenizes the text, filters stop-words and stems the remaining tokens. For every token found, its occurrences (positional index, starting character index and ending character index) in the document are stored. When a detected difference affects a token, the character indexes of its occurrences can be used to retrieve snippets of text. With this, Vokter can immediately show the user, along with the notifications of detected differences, the added text in the new snapshot or the removed text in the previous snapshot.
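
To make the occurrence data concrete, the sketch below indexes a string and then uses the stored character indexes to cut a snippet out of the original text. It is a simplified stand-in, not Vokter's parser: the class OccurrenceIndex, the tiny stop-word list and the regex tokenizer are assumptions, and stemming is left out to keep the example short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical token index that stores, per token, where it occurs in the text. */
class OccurrenceIndex {

    /** One occurrence: ordinal word position and the character span [start, end). */
    static final class Occurrence {
        final int position;
        final int start;
        final int end;

        Occurrence(int position, int start, int end) {
            this.position = position;
            this.start = start;
            this.end = end;
        }
    }

    private static final Pattern WORD = Pattern.compile("\\p{L}+");
    // Illustrative stop-word list only; a real parser would use a per-language list.
    private static final Set<String> STOP_WORDS = Set.of("the", "a", "an", "of", "and", "is");

    /** Tokenizes the text, drops stop-words and records every remaining occurrence. */
    static Map<String, List<Occurrence>> index(String text) {
        Map<String, List<Occurrence>> index = new LinkedHashMap<>();
        Matcher m = WORD.matcher(text);
        int nextPosition = 0;
        while (m.find()) {
            int position = nextPosition++; // stop-words still consume a position
            String token = m.group().toLowerCase(Locale.ROOT);
            if (STOP_WORDS.contains(token)) {
                continue;
            }
            index.computeIfAbsent(token, k -> new ArrayList<>())
                 .add(new Occurrence(position, m.start(), m.end()));
        }
        return index;
    }

    /** Returns the occurrence plus up to `context` characters on each side. */
    static String snippet(String text, Occurrence occ, int context) {
        int from = Math.max(0, occ.start - context);
        int to = Math.min(text.length(), occ.end + context);
        return text.substring(from, to);
    }

    public static void main(String[] args) {
        String text = "The cat and the hat";
        Occurrence hat = index(text).get("hat").get(0);
        System.out.println(snippet(text, hat, 4)); // prints "the hat"
    }
}
```

Because the character indexes point into the unmodified snapshot, the snippet shows the text exactly as it appeared in the document, even though the index itself holds lowercased tokens.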

OSGi-based architecture

Vokter's support for reading a given MediaType is provided by Reader modules, which convert raw content into a clean string stripped of non-informative data (e.g. XML tags). These modules are loaded in an OSGi-based architecture, meaning that compiled Reader classes can be loaded or unloaded without requiring a reboot. When needed, usually when reading a new document or snapshot, Vokter queries for available Readers by supported Content-Type.
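
The lookup by Content-Type can be pictured with the plain-Java registry below. Note the assumptions: this does not use the OSGi API at all, it only mimics the idea of readers being registered and unregistered at runtime, and the names ContentReader and ReaderRegistry are hypothetical rather than Vokter's own.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical reader contract: turn raw content into clean text. */
interface ContentReader {
    String read(String rawContent);
}

/** Hypothetical registry that mimics loading and unloading readers at runtime. */
class ReaderRegistry {
    private final Map<String, ContentReader> readersByContentType = new ConcurrentHashMap<>();

    void register(String contentType, ContentReader reader) {
        readersByContentType.put(normalize(contentType), reader);
    }

    void unregister(String contentType) {
        readersByContentType.remove(normalize(contentType));
    }

    /** Finds a reader for the given Content-Type header value, if one is currently loaded. */
    Optional<ContentReader> findReader(String contentType) {
        return Optional.ofNullable(readersByContentType.get(normalize(contentType)));
    }

    /** Drops parameters such as "; charset=UTF-8" and lowercases the media type. */
    private static String normalize(String contentType) {
        return contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        ReaderRegistry registry = new ReaderRegistry();
        // Naive tag stripper used purely for illustration.
        registry.register("text/html", raw -> raw.replaceAll("<[^>]+>", ""));
        String clean = registry.findReader("text/html; charset=UTF-8")
                .map(reader -> reader.read("<p>Hello</p>"))
                .orElse("(no reader available)");
        System.out.println(clean); // prints "Hello"
    }
}
```

Returning an Optional makes the "no reader is loaded for this type" case explicit, which matters when modules can disappear between two reads.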

Caveats / Future Work

Although every part of its architecture has been optimized to accommodate a massive number of parallel tasks, Vokter has only been used in an academic environment and has yet to be battle-tested in high-usage consumer software. If you’re using Vokter in your projects, let me know! 😀

i) Web crawling

One way to improve user experience is by integrating web crawling in Reader modules, allowing users to set their visit policy (e.g. number of nested documents accessed). Within the current architecture, where there is a unique detection job per document, detection jobs would have to be sorted by link hierarchy. For example, if document A links to a nested document B:

  • when differences are detected in A, only clients of A are notified;
  • when differences are detected in B, both clients of A & B are notified.
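
The notification rule in the two bullets above amounts to walking the link hierarchy upwards from the changed document. The following sketch shows one possible way to do that. Since this feature is future work, everything here is hypothetical: the class LinkHierarchy, its methods and the client names are invented for illustration, and the visit-depth policy is ignored.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical link hierarchy: which documents link to which, and who watches what. */
class LinkHierarchy {
    private final Map<String, Set<String>> linkedFrom = new HashMap<>(); // child -> parents
    private final Map<String, Set<String>> clientsByDocument = new HashMap<>();

    /** Records that the parent document contains a link to the child document. */
    void addLink(String parentUrl, String childUrl) {
        linkedFrom.computeIfAbsent(childUrl, k -> new HashSet<>()).add(parentUrl);
    }

    /** Records that a client watches a document (and, implicitly, what it links to). */
    void watch(String clientId, String documentUrl) {
        clientsByDocument.computeIfAbsent(documentUrl, k -> new HashSet<>()).add(clientId);
    }

    /** Collects the clients of the changed document and of every document linking to it. */
    Set<String> clientsToNotify(String changedUrl) {
        Set<String> clients = new TreeSet<>();
        Set<String> visited = new HashSet<>(); // guards against link cycles
        Deque<String> pending = new ArrayDeque<>();
        pending.push(changedUrl);
        while (!pending.isEmpty()) {
            String url = pending.pop();
            if (!visited.add(url)) {
                continue;
            }
            clients.addAll(clientsByDocument.getOrDefault(url, Set.of()));
            pending.addAll(linkedFrom.getOrDefault(url, Set.of()));
        }
        return clients;
    }

    public static void main(String[] args) {
        LinkHierarchy hierarchy = new LinkHierarchy();
        hierarchy.addLink("A", "B"); // document A links to document B
        hierarchy.watch("alice", "A");
        hierarchy.watch("bob", "B");
        System.out.println(hierarchy.clientsToNotify("A")); // prints [alice]
        System.out.println(hierarchy.clientsToNotify("B")); // prints [alice, bob]
    }
}
```

The visited set is there because web pages routinely link back to each other, and without it a cycle such as A linking to B and B linking to A would loop forever.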

ii) Orchestration for matching jobs

When loading a new snapshot of a document fails too many times, only detection jobs are timed out. However, the system can also fail to send a response to the client, and there is currently no way to expire matching jobs when the client has “disappeared and lost interest” without first canceling its jobs in Vokter. This means that a high number of active detection and matching jobs might be kept alive unnecessarily.
