Digitizing a Century of History with Data
Over the past several years, we’ve been working on an internal project codenamed “Vintage” to digitize our archives. The process involves several detailed steps, many of them manual and time-consuming, but it is well worth the effort.
Turning our “dusty” archives into digital artifacts in our data warehouse would enable us to leverage our legacy for a myriad of purposes:
- Make the history of Hong Kong and China searchable and accessible for educational institutions and research
- Increase efficiency and ease of reference for our newsroom
- Syndicate content to partners, news agencies, and businesses
- Make selected content available to SCMP readers
- License archival content to individuals, companies, or institutions for commercial purposes
The first step is taking the microfilm from the archives and turning it into high-resolution digital scans. We scanned at 300 DPI, although 600 DPI is generally recommended: the higher the resolution the better, within time and storage constraints, particularly for a large-format broadsheet. Distortion from wear and tear on the print copies over time, along with smudges on the newsprint, can make small fonts difficult to decipher.
Once the high-resolution scans are complete, we need to transform them into text via OCR (optical character recognition) so that we can begin mapping each article into a semi-structured or structured format. We chose XML (Extensible Markup Language) for this, since it is both human- and machine-readable.
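To make that step concrete, below is a minimal sketch of what an OCR-to-XML pass can look like, assuming Tesseract via the pytesseract library; the element names, attributes, and file path are illustrative placeholders rather than our actual archive schema.

```python
# Minimal OCR-to-XML sketch (illustrative only; not the production pipeline).
import xml.etree.ElementTree as ET

import pytesseract
from PIL import Image


def scan_to_xml(image_path: str, page_date: str) -> ET.Element:
    """OCR a single scanned page and wrap the raw text in a simple XML tree."""
    raw_text = pytesseract.image_to_string(Image.open(image_path))

    page = ET.Element("page", attrib={"date": page_date})   # hypothetical schema
    article = ET.SubElement(page, "article")
    body = ET.SubElement(article, "body")
    body.text = raw_text
    return page


if __name__ == "__main__":
    root = scan_to_xml("scans/1924-06-02_p01.png", "1924-06-02")  # hypothetical path
    print(ET.tostring(root, encoding="unicode")[:500])
```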
In practice, the mapping has some inconsistencies and requires further cleaning and transformation: removing extra spaces, special characters, and erroneous letters introduced by the OCR.
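As an illustration of the kind of cleanup involved, the snippet below collapses extra whitespace and strips stray special characters; the regular expressions are simplified stand-ins for the real rules, and correcting erroneous letters (for example an “l” misread as a “1”) generally needs dictionary-based post-processing beyond this sketch.

```python
# Simplified OCR cleanup (stand-in rules, not the production transformations).
import re


def clean_ocr_text(text: str) -> str:
    text = re.sub(r"[^\w\s.,;:!?'()-]", " ", text)  # drop unexpected special characters
    text = re.sub(r"\s+", " ", text)                # collapse runs of whitespace and newlines
    return text.strip()


print(clean_ocr_text("HONG  KONG \n  Government  announced %%  new measures ,"))
```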
The final step in the process is to convert that text into structured data and to transfer it to our data warehouse.
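The snippet below is a rough sketch of that conversion, assuming the cleaned XML is flattened into one row per article and written out as Parquet for a warehouse loader to pick up; the column names, file paths, and format are assumptions for illustration, not a description of our actual load process.

```python
# Illustrative XML-to-table step; paths, columns, and format are hypothetical.
import xml.etree.ElementTree as ET

import pandas as pd


def xml_to_rows(xml_path: str) -> pd.DataFrame:
    """Flatten a page-level XML file into one row per article."""
    tree = ET.parse(xml_path)
    rows = []
    for page in tree.iter("page"):
        for article in page.iter("article"):
            rows.append({
                "publish_date": page.get("date"),
                "body": (article.findtext("body") or "").strip(),
            })
    return pd.DataFrame(rows)


df = xml_to_rows("archive/1924-06-02.xml")                 # hypothetical path
df.to_parquet("archive/1924-06-02.parquet", index=False)   # ready for a warehouse loader
```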
In the past few months, our data engineering team (kudos to @chunchuck) has taken a century of our historical archives and transformed them into structured data, which now lives in our data warehouse. Digging into the archives, we found some interesting insights.
Plotting our average article output per week, we see a small dip during WWI and then a substantial drop from 1941 to 1945, when Hong Kong was under Japanese occupation during WWII. From there, the SCMP steadily grew its volume of coverage through the 1970s and into the late 1990s.
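For anyone who wants to reproduce a similar view, a chart like this can be built with a simple weekly resample, assuming the articles sit in a table with a publish_date column; the file and column names below are hypothetical.

```python
# Count articles per week and plot the trend (hypothetical file and column names).
import matplotlib.pyplot as plt
import pandas as pd

articles = pd.read_parquet("archive/articles.parquet")
articles["publish_date"] = pd.to_datetime(articles["publish_date"])

weekly = articles.set_index("publish_date").resample("W").size()

weekly.plot(figsize=(12, 4), title="Articles published per week")
plt.ylabel("articles")
plt.tight_layout()
plt.show()
```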
Once this data is available in our warehouse, we can run various NLP models against it, including sentiment analysis, readability scoring, keyword tagging, and topic analysis. However, archival content poses particular challenges: the news cycle is ever-changing alongside the world we live in, and training an algorithm across a century’s worth of topics, places, and people adds several layers of complexity.
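Purely as an illustration, off-the-shelf tools such as NLTK’s VADER (sentiment) and textstat (readability) can be pointed at the article bodies; they are stand-ins rather than the models we actually run, and their fit for century-old prose is exactly the kind of challenge described above.

```python
# Illustrative sentiment and readability pass with off-the-shelf libraries.
import pandas as pd
import textstat
from nltk.sentiment import SentimentIntensityAnalyzer   # requires nltk.download("vader_lexicon")

articles = pd.read_parquet("archive/articles.parquet")  # hypothetical export
sia = SentimentIntensityAnalyzer()

for body in articles["body"].head(5):
    sentiment = sia.polarity_scores(body)["compound"]   # -1 (negative) to +1 (positive)
    readability = textstat.flesch_reading_ease(body)    # higher score = easier to read
    print(f"sentiment={sentiment:+.2f}  readability={readability:5.1f}")
```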
Applying unsupervised keyword tagging to count recurring words (excluding “stop words” such as “the”, “is”, and “and”) may be a more effective approach to extracting recurring themes in our content over time. Doing so, we find, not surprisingly, that among our top keywords are China, Hong Kong, British, Chinese, government, and police.
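A minimal sketch of that stop-word-filtered count is below, using a plain Counter; the tokenization and the tiny stop-word list are simplified stand-ins for what a production pipeline would use.

```python
# Count recurring words across article bodies, skipping stop words (simplified).
import re
from collections import Counter

import pandas as pd

STOP_WORDS = {"the", "is", "and", "a", "an", "of", "to", "in", "for", "on", "that", "was"}


def top_keywords(bodies, n=10):
    counts = Counter()
    for body in bodies:
        tokens = re.findall(r"[a-z]+", body.lower())
        counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts.most_common(n)


articles = pd.read_parquet("archive/articles.parquet")  # hypothetical export
print(top_keywords(articles["body"]))
```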
Looking at the keyword “China”, we see that its usage on a per-article basis has fluctuated over time. Having only recently finished importing this data into our warehouse, we are just beginning to scratch the surface of the insights that digitizing a century of historical news coverage can reveal.
By bringing historical perspectives to life with today’s data technology, we look forward to sharing more insights, new findings, and learnings in the near future!