Vesuvius Challenge Ink Detection — Part 1: Introduction
As of July 2024, the Vesuvius Challenge seems to have hit something of a roadblock. We’re not quite sure why, but the best ink detection models (including the 2023 Grand Prize winner) have not been able to adapt to reading Scrolls 2, 3, and 4. This domain adaptation problem isn’t entirely unexpected, and is very similar to the issues we encountered when we initially moved over from fragments to Scroll 1.
In trying to tackle this problem, I’ve been experimenting with the aforementioned Grand Prize model created by Youssef Nader, Luke Farritor, and Julian Schillinger. Specifically, I’ve been doing some testing using their ink detection script, and have developed a slightly extended version based on their original submission. I’ve used this new script to conduct a wide variety of test runs on multiple segments from different scrolls. My main goal has been to establish baseline results that will help me (and hopefully others, too) determine the best way to locate ink in Scrolls 2, 3, and 4.
Vesuvius GP+
This post serves as a brief introduction to the Vesuvius Grand Prize Plus repository, and is intended as a jumping-off point for newcomers looking to conduct their own experiments in ink detection. Whilst the vast majority of those currently working on the challenge are proficient programmers, I think it’s crucial that we continue to lower the barrier to entry for contributions. The next major breakthrough could very well come from experts in completely unrelated fields — think radiology, signal analysis, or image processing. If we make it as easy as possible for these experts to integrate their knowledge into our efforts, our progress will almost certainly see a significant boost.
Key Findings
- Optimal Number of Layers: Incorporating more layers in the ink detection process enhances readability and reduces background noise. However, including too many layers (>25) can start to negatively impact predictions.
- Impact of Starting Layer: Starting inferences from higher layers (around 30–40.tif) generally produces optimal results, although this can vary depending on the segment.
- Single-Layer Inferences: The TimeSformer model is built for multi-layer inputs; thus, single-layer inputs do not yield meaningful predictions.
- Data Augmentation: Applying pre-processing techniques to the dataset can greatly enhance the clarity of ink detection, and should be considered almost essential.
Parameters
My first order of business in extending the ink detection script was to expose some extra parameters for the user to control. In particular, I was very interested in being able to perform inference on a specific range of layers within a segment.
Number of Layers
The original script performs ink detection on around 20 images, which are usually taken from the central layers of the segment. Using the num_layers argument, users can now specify how many layers of the segment they want to perform inference on. As you’d probably expect, the number of layers used has a huge impact on the readability of the ink!
python inference_timesformer.py --segment_id <id> --num_layers <n>
Another nice advantage of including more layers is that it generally reduces background noise. I’ve found the sweet spot to be around 15–25 layers — any more than that, and you’ll start to get diminishing returns with longer processing times. Including too many layers will also introduce images with very little ink signal, diluting the dataset and worsening predictions.
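If you want to see this effect for yourself, a small driver loop makes it easy to sweep layer counts. Here’s a minimal sketch, assuming the flags shown above and a placeholder segment id:

```python
import subprocess

# Sweep the num_layers flag over a few values, running the inference
# script once per setting. The segment id below is a placeholder.
SEGMENT_ID = "<segment_id>"

for n in (1, 5, 15, 25, 35):
    subprocess.run(
        ["python", "inference_timesformer.py",
         "--segment_id", SEGMENT_ID,
         "--num_layers", str(n)],
        check=True,
    )
```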
One interesting thing you might notice above is that single-layer inferences produce an almost blank output; this has been consistent across every scroll and segment that I’ve tested on. Obviously the ink is still there — so why can’t the model find it?
Based on the training that it has undergone, the TimeSformer model expects to be fed multiple layers of scroll data at a time, and hasn’t ever encountered single-layer inputs. This is somewhat unsurprising, given that TimeSformer was originally created for classifying a multitude of frames from a video input. It’s a little unfortunate, because localising ink signal down to a single layer would have been a really easy way to characterise the ink distribution within a segment — but don’t worry, there are other methods that we can employ to accomplish this. I’ll cover a method for constraining ink distribution in a separate writeup, and it will help us determine the optimal number of layers to include in our inferences.
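To make that shape expectation concrete, here’s a rough illustration of how depth layers stand in for video frames; the exact tensor layout used by the Grand Prize model may differ.

```python
import numpy as np
import torch

# Twenty fake 64x64 layer crops standing in for real scroll data.
layers = [np.random.rand(64, 64).astype(np.float32) for _ in range(20)]

# Depth layers are stacked along the axis a video model would treat as
# "frames": (batch, frames, height, width).
x = torch.from_numpy(np.stack(layers)).unsqueeze(0)
print(x.shape)  # torch.Size([1, 20, 64, 64])

# With one layer, the frames axis collapses to length 1, so the temporal
# attention has no depth-wise context to compare against.
x_single = torch.from_numpy(np.stack(layers[:1])).unsqueeze(0)
print(x_single.shape)  # torch.Size([1, 1, 64, 64])
```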
Starting Layer
The choice of starting layer is just as important as the number of layers you use for ink detection. Using the start argument, users can now also select the ‘depth’ in the segment at which they want to perform inference.
python inference_timesformer.py --segment_id <id> --start <s>
To understand what’s happening here, we need to consider the physical nature of the ink and the papyrus. Each segment is not perfectly flat and has some ‘waviness’ between layers. The ink sinks into the papyrus from where the pen touched it, percolating through to different depths. This waviness means that ink within a segment will show up best on higher layers in some areas, while in other areas of the same segment, it will appear far more clearly on lower layers.
The edge layers (around 00–10.tif and 54–64.tif) generally start to move the segmentation volume out of the papyrus, and include a lot of air gaps. I typically find the best results when starting my inferences between layers 30–40.tif, though this entirely depends upon which segment you’re using.
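For illustration, here’s how a start depth and layer count might translate into a list of layer files. The directory layout and zero-padded filenames are my assumptions, not necessarily the repository’s actual structure.

```python
import os

# Hypothetical helper mirroring the --start and --num_layers flags:
# build the list of layer files from a chosen starting depth.
def pick_layer_range(segment_dir: str, start: int, num_layers: int) -> list[str]:
    return [
        os.path.join(segment_dir, "layers", f"{i:02d}.tif")
        for i in range(start, start + num_layers)
    ]

# e.g. twenty layers beginning at 30.tif: 30.tif .. 49.tif
print(pick_layer_range("segments/<segment_id>", start=30, num_layers=20))
```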
The remaining parameters have less to do with the physical nature of the ink, and are mostly about testing the functionality of the inference script.
Stride, Batch Size, and Size
In my testing, changing any of these parameters had no effect on either the script’s processing time or the ink detection results.
Workers
The number of workers also has no effect on the processing time of the script.
Filenames
The actual filename or ‘absolute numbering’ of each layer (e.g. 31.tif compared to 3100.tif) has no effect on the ink detection results.
Layer Order
Interestingly, relative numbering (i.e. the order in which layers are fed into the script) makes a significant difference to ink detection results. For this experiment, I tried out three different types of layer ordering.
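As a simple illustration of how such orderings can be constructed — ascending, descending, and shuffled are natural candidates, though not necessarily the exact variants tested:

```python
import random

# Three illustrative orderings of the same layer window.
layer_ids = list(range(30, 50))  # e.g. layers 30.tif .. 49.tif

ascending = sorted(layer_ids)                         # natural depth order
descending = sorted(layer_ids, reverse=True)          # reversed depth order
shuffled = random.sample(layer_ids, len(layer_ids))   # random permutation
```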
Image Processing
While traditional image processing techniques haven’t yielded significant results on their own, they have proven invaluable for augmenting the available data. By applying various pre-processing techniques to the layers before feeding them into the machine learning model, we can achieve much clearer ink results. This is particularly crucial for Scrolls 2, 3, and 4, where any ink signal might be extremely faint or difficult for current models to detect.
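As one example of what this can look like in practice, here’s a minimal sketch applying CLAHE (contrast-limited adaptive histogram equalisation) with OpenCV. This is just one candidate technique among many, and the path below is a placeholder.

```python
import cv2
import numpy as np

# Apply CLAHE to a single layer before feeding it to the model.
def preprocess_layer(path: str) -> np.ndarray:
    layer = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    return clahe.apply(layer)

enhanced = preprocess_layer("segments/<segment_id>/layers/32.tif")
cv2.imwrite("32_clahe.tif", enhanced)
```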
Currently, there isn’t a definitive answer on which pre-processing method is best — though certainly, some techniques have been shown to work better than others. Once I’ve finished conducting comprehensive testing, I plan on releasing a detailed comparison of the available methods. For the time being, I would encourage you to experiment! Whether successful or not, any new approaches you explore will provide us with valuable insights for future reference.
Conclusion
Ultimately, the goal of these experiments is to use what we learn to find ink in Scrolls 2, 3, and 4. There’s a near-infinite number of variables to be tweaked, so by all means conduct your own test runs and publish any interesting results in the challenge Discord! Improvements to the ink detection pipeline are always welcome. I’m currently working on a method to constrain the location of ink within a segment, so you can expect more on that topic in future posts.
Massive thanks to Youssef, Luke, and Julian for the 2023 Grand Prize submission. Their model is really incredible, and I’d encourage everybody to check it out and play around with it for themselves.
If you’ve got any questions, please feel free to reach out to me on Discord, @tetradrachm.
