Note: this series of blog posts introduces the concept, the design and the development that led to launching this platform which automates the process of transforming heterogeneous cultural datasets into standardised IIIF and EDM datasets, along with offering a public space for the revamped collections. You can find the 1st part HERE.
This post presents the architecture we have designed for this project. But first, we want to thank all the contributors to the open-source repositories we have used and tweaked. Thank you, everyone, this is tremendous work! We will give back something when it is ready for the community.
We also want to remind you that we have some free/demo Team accounts (with 30GB of storage for the images for 6 months) still available. To sign up, just click this link: https://bit.ly/2VXzfyY
Better safe than trendy
When starting a fresh data project, it can be tempting to use something new and shiny. And you might be right, as new frameworks and languages always promise improved performance, more features, and so on. But technology is ever-changing, and it is wise to future-proof by getting on board with something stable and reliable. Hence, components need to be chosen against a list of critical criteria:
- Maintainability: Is this tool difficult to install, set up, or monitor? Does it need a lot of temporary storage or use a lot of memory? Is it written in a language we are proficient with?
- Adoption: How large is the user base, and how many companies trust this tool? Is the trend regarding its future adoption positive?
- Flexibility: What can be done with this tool? Does it require a lot of workarounds to work with other services? Can it be easily tweaked if necessary?
- Support: How active is the GitHub repository? How often is it updated?
- Performance: Is this library fast? Are there limitations or performance drops?
We also had to stay humble in our development process due to our limited resources. No over-engineering, just an elegant service-oriented architecture that is flexible (e.g. ETL pipeline modifications) and whose building blocks can be modified or even replaced if limitations are encountered (load, asynchronicity, etc.). We wanted to stay lean with our design process, and not get stuck with rigid technical solutions.
For this preparation, we simply duckducked a lot, read many Medium posts, and browsed GitHub and StackShare based on hearsay. We also read the latest Stack Overflow developer survey(s) regarding technology adoption and trends.
Architecture
Rather than a long explanation, we think that a nice diagram will help you understand our architecture.

Backbone
We wanted to use a Python framework to stay within a data-driven development approach, as the application codebase could then be smoothly extended with pure data-engineering components. From both maintainability and workflow perspectives, this ensures solid foundations.
After comparing Django, Pyramid, and Flask, we chose the latter, as it is a minimal yet powerful toolbox for building a data-oriented platform backend. We are basically in love with Flask now. Along with this, we are using Celery (which happens to work nicely with Flask, so we love it too) and RabbitMQ to handle our ETL pipeline. This combo is a really solid foundation for a platform meant to process datasets: downloading pictures, calling external APIs, and performing data transformation/load operations.
Since this is beta software, we needed to stay pragmatic. At a later stage we may adopt a workflow manager such as Airflow, but for now it is sugar-coating we can spare, given our humble beginnings: there is just not enough data for that yet.
What about the IIIF and computer vision features?
We will tell you everything about them next week in our next post. In the meantime, feel free to sign up for our newsletter and follow us on Twitter to stay up to date with future announcements and releases :).
We hope to hear from you soon!
The muzz.app team
