My Internship Journey in DSAID-DE @ Cloak — The Central Privacy Toolkit

Sean
AI Practice GovTech
9 min read · Jan 25, 2024

Hello everyone! I’m Sean, a final-year computer science major at Nanyang Technological University.

From May to December 2023, I had the opportunity to work as a software engineer intern with Cloak (formerly enCRYPT), part of the Data Privacy Protection Capability Centre (DPPCC) under the Data Engineering (DE) team of GovTech’s Data Science and Artificial Intelligence Division (DSAID).

Now, before we begin, here’s a short introduction about myself!

Background

As a final-year computer science major specialising in cybersecurity, I found myself at a crucial crossroads. Faced with two compelling internship offers, one in a cybersecurity role at another renowned public sector agency and the other focusing on privacy and software engineering at GovTech, I chose GovTech.

What tipped the scales for me wasn’t just the reputation of GovTech, but the compelling nature of the job description and the insights gleaned during the interview process. The role promised not only to challenge my technical skills but also to broaden my understanding of privacy in the digital landscape — an increasingly relevant area.

I decided to step outside my comfort zone, intrigued by the opportunity to explore this new facet of technology. This article is a reflection of that journey — from the initial curiosity sparked by an engaging interview to the rich learning experiences that followed, reshaping my understanding of privacy and its critical role in our daily lives.

Overview

So what is Cloak, the Central Privacy Toolkit? It is a one-stop, self-service web application that helps everyday public officers apply data anonymisation techniques to datasets and review re-identification risks. As a central privacy toolkit for Singapore’s public sector users, Cloak makes it simple to transform and anonymise sensitive data so that resources can be allocated to other tasks.

Cloak currently offers a wide range of privacy-enhancing tools:

  • Tabular Data Anonymisation: employs advanced techniques like k-anonymity to ensure individual privacy while maintaining the data’s usefulness for analysis
  • Free Text Anonymisation: detects Personally Identifiable Information (PII) in free text, PDF, CSV, or DOCX files and transforms it based on the user’s desired settings (Redact/Replace, etc.)
  • Mock Data Generation: creates realistic but non-real datasets, allowing developers and analysts to test systems and processes without using actual sensitive information, thereby ensuring data privacy and system integrity
  • Upcoming features: Synthetic Data Generation (SDG), Differential Privacy, and more!
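To make the k-anonymity idea above a little more concrete, here is a minimal sketch of how the "k" of a table can be measured: group the rows by their quasi-identifier columns and take the size of the smallest group. The column names and records are made up for illustration, and this is not Cloak's implementation.

```python
from collections import Counter


def k_anonymity(rows, quasi_identifiers):
    """Return the size of the smallest group of rows that share the same
    quasi-identifier values. The table is k-anonymous for that value of k."""
    groups = Counter(tuple(row[col] for col in quasi_identifiers) for row in rows)
    return min(groups.values()) if groups else 0


# Hypothetical records that have already been generalised
# (ages bucketed into bands, postal codes truncated to a prefix).
records = [
    {"age_band": "20-29", "postal_prefix": "52", "diagnosis": "flu"},
    {"age_band": "20-29", "postal_prefix": "52", "diagnosis": "asthma"},
    {"age_band": "30-39", "postal_prefix": "60", "diagnosis": "flu"},
    {"age_band": "30-39", "postal_prefix": "60", "diagnosis": "diabetes"},
    {"age_band": "30-39", "postal_prefix": "60", "diagnosis": "flu"},
]

k = k_anonymity(records, ["age_band", "postal_prefix"])
print(k)  # 2: every combination of age band and postal prefix covers at least two people
```

A higher k means each person hides in a larger crowd, usually at the cost of coarser and less useful data.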

My Time Being a Part of Cloak and What I Have Learnt

During my internship, I was involved in several tasks:

Synthetic Data Generation (SDG) Research and Prototyping

What is SDG?

SDG involves creating data that closely resembles real-world data in structure and statistical properties but is entirely artificial. The key purpose is to generate datasets that can be used for various purposes like testing, training machine learning models, or data analysis, without the privacy concerns or limitations associated with using real, sensitive data. By using synthetic data, organisations can ensure privacy and confidentiality, especially in fields where data is sensitive. The generated data is designed to be diverse and comprehensive enough to simulate real-world scenarios, allowing for robust testing and analysis without compromising real individuals’ data privacy.

I started learning what synthetic data was by watching various videos/lectures and reading research papers that were curated by our team’s privacy researcher. Next, I experimented with different SDG libraries (SDV, YData, etc.) and their respective models before evaluating them and sharing the results with her.

Following the research and experimentation, we designed several low-fidelity UI/UX prototypes in Figma, after which I built the high-fidelity prototypes in code using Next.js, TypeScript, and Mantine (a UI library similar to Material UI).

Through this task, I learnt about several useful libraries for machine-learning pipelines, such as HyperImpute for missing-value imputation and Optuna for hyperparameter tuning. It also solidified my understanding of how important data cleaning is in machine learning and gave me my first real-world application of machine and deep learning outside of coursework. The experience also let me sharpen my frontend skills with popular, modern technologies like Next.js and TypeScript.

Free Text Anonymisation (FTA)

Building on the work done by the team and a previous intern, I was tasked with improving the FTA module of Cloak. FTA’s analyser and anonymiser had already been developed, along with a simple UI in which users enter text into a rich text editor; the text is analysed for PII and then anonymised according to the user’s preferences.
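The analyse-then-anonymise flow can be pictured with a toy example. The sketch below is purely illustrative: it assumes a single regular expression for email addresses as the "analyser", whereas a real analyser needs far more robust detection, and the function names are hypothetical rather than Cloak's actual API.

```python
import re

# Illustrative pattern only; it stands in for a real PII analyser.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def analyse(text):
    """Return a (start, end, label) span for each detected entity."""
    return [(m.start(), m.end(), "EMAIL") for m in EMAIL.finditer(text)]


def anonymise(text, spans, mode="redact"):
    """Transform each span, working right to left so that the offsets of
    earlier spans stay valid after each substitution."""
    for start, end, label in sorted(spans, reverse=True):
        replacement = "*" * (end - start) if mode == "redact" else f"<{label}>"
        text = text[:start] + replacement + text[end:]
    return text


sample = "Contact alice@example.com for details."
spans = analyse(sample)
print(anonymise(sample, spans, mode="replace"))  # Contact <EMAIL> for details.
print(anonymise(sample, spans, mode="redact"))
```

Keeping detection and transformation as separate steps is what lets the user pick a different setting (redact, replace and so on) without re-running the analysis.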

Fig 2 Screenshots of the original version of the FTA module

However, there was a lot of demand for a file upload feature, which made sense as most users would not want to manually copy and paste text from hundreds or thousands of rows in a CSV file or manually edit Word documents or PDF files.

I mainly worked on the backend for this feature. We used Python and relied heavily on Amazon Web Services (AWS), including Lambda, SQS, DynamoDB and more, to architect and implement serverless solutions. This approach allowed me to deeply engage with cloud-native technologies and understand the intricacies of building scalable, efficient backend systems in a serverless environment.

Project Retention

Users now have the flexibility to create multiple projects concurrently, eliminating the need for manual copying or typing text from their sources into the rich text editor. This feature not only streamlines the transformation process but also preserves project details until file expiration. This means users can easily return to their projects without starting from scratch, even if they navigate away from the page.

To support this functionality, I implemented CRUD (Create, Read, Update, Delete) APIs using AWS Lambda functions and a DynamoDB table. To keep the code robust and maintainable, I separated the data layer, the service layer and the application layer, giving a clear modular structure in line with the Single Responsibility Principle (SRP).
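A rough sketch of what such a three-layer split can look like is shown below. It makes several assumptions: an in-memory dictionary stands in for the DynamoDB table so the example runs locally, the event is assumed to follow the API Gateway proxy shape (`httpMethod`, `pathParameters`, `body`), and all class, field and function names are hypothetical.

```python
import json
import uuid


class InMemoryProjectRepository:
    """Data layer: stores and fetches items. A dict stands in for the
    DynamoDB table so the sketch runs locally (an assumption)."""

    def __init__(self):
        self._items = {}

    def put(self, item):
        self._items[item["project_id"]] = item

    def get(self, project_id):
        return self._items.get(project_id)

    def delete(self, project_id):
        return self._items.pop(project_id, None) is not None


class ProjectService:
    """Service layer: validation and business rules, with no knowledge
    of HTTP events or of how items are stored."""

    def __init__(self, repository):
        self._repo = repository

    def create(self, name):
        return self._save(str(uuid.uuid4()), name)

    def read(self, project_id):
        return self._repo.get(project_id)

    def update(self, project_id, name):
        if self._repo.get(project_id) is None:
            return None
        return self._save(project_id, name)

    def delete(self, project_id):
        return self._repo.delete(project_id)

    def _save(self, project_id, name):
        if not name.strip():
            raise ValueError("project name must not be empty")
        project = {"project_id": project_id, "name": name.strip()}
        self._repo.put(project)
        return project


def respond(status, payload=None):
    return {"statusCode": status, "body": json.dumps(payload)}


def make_handler(service):
    """Application layer: maps a Lambda-style event onto service calls."""

    def handler(event, context=None):
        method = event["httpMethod"]
        project_id = (event.get("pathParameters") or {}).get("project_id")
        name = json.loads(event.get("body") or "{}").get("name", "")
        try:
            if method == "POST":
                return respond(201, service.create(name))
            if method == "GET":
                project = service.read(project_id)
            elif method == "PUT":
                project = service.update(project_id, name)
            elif method == "DELETE":
                found = service.delete(project_id)
                return respond(204 if found else 404)
            else:
                return respond(405, {"error": "method not allowed"})
        except ValueError as exc:
            return respond(400, {"error": str(exc)})
        if project is None:
            return respond(404, {"error": "project not found"})
        return respond(200, project)

    return handler


handler = make_handler(ProjectService(InMemoryProjectRepository()))
created = handler({"httpMethod": "POST", "body": json.dumps({"name": "Survey responses"})})
print(created["statusCode"])  # 201
```

Because the service only talks to the repository through `put`, `get` and `delete`, the in-memory store could be swapped for a DynamoDB-backed class without touching the other two layers.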

Fig 3 Sample of the FTA Projects page, where projects of different types are displayed

File Uploads

Based on our user studies, there was a clear demand for file upload support, primarily for CSV, PDF, and DOCX files.

For CSV files, a Lambda function reads and processes the file, then sends each row to an Amazon Simple Queue Service (SQS) queue. The queue triggers a second Lambda function, which reads the messages and calls the FTA analyser and anonymiser endpoints. Finally, this second function stores the anonymised CSV file in an S3 bucket.

PDF files are where things start to get a little complicated. There are three types of PDF files: Native PDFs, Searchable PDFs and Scanned PDFs.
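The producer side of this pipeline can be sketched roughly as follows. This is an illustration under assumptions, not the actual Lambda code: the message fields (`job_id`, `row_index`, `row`) are hypothetical, and the function that actually sends a batch is passed in so the example runs without AWS credentials. In a real Lambda it could wrap boto3's `send_message_batch`, which accepts at most 10 entries per call.

```python
import csv
import io
import json

SQS_MAX_BATCH = 10  # SendMessageBatch accepts at most 10 entries per call


def rows_to_batches(csv_text, job_id):
    """Yield lists of SQS batch entries, one entry per CSV row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    batch = []
    for index, row in enumerate(reader):
        batch.append({
            "Id": str(index),  # must be unique within a batch
            # row_index is an assumed field so a consumer could restore row order
            "MessageBody": json.dumps({"job_id": job_id, "row_index": index, "row": row}),
        })
        if len(batch) == SQS_MAX_BATCH:
            yield batch
            batch = []
    if batch:
        yield batch


def enqueue(csv_text, job_id, send_batch):
    """Send every row through send_batch and return the number of rows sent.
    send_batch is injected; with boto3 it might be something like
    lambda entries: sqs.send_message_batch(QueueUrl=queue_url, Entries=entries)."""
    sent = 0
    for batch in rows_to_batches(csv_text, job_id):
        send_batch(batch)
        sent += len(batch)
    return sent


# Local demonstration with a fake sender that only records batch sizes.
csv_text = "id,comment\n" + "".join(f"{i},note {i}\n" for i in range(23))
batch_sizes = []
sent = enqueue(csv_text, "job-123", lambda entries: batch_sizes.append(len(entries)))
print(sent, batch_sizes)  # 23 [10, 10, 3]
```

Sending rows in batches rather than one message at a time cuts the number of API calls, which matters for files with thousands of rows.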

  • Scanned PDFs: These are digital copies of physical documents that have been scanned and saved as PDF files, typically containing images of the pages without any selectable or searchable text.
  • Searchable PDFs: These PDFs are often created from scanned documents, but with the addition of an invisible text layer through Optical Character Recognition (OCR), making the text selectable and searchable while maintaining the original scanned image.
  • Native PDFs: Native PDFs are created directly from electronic documents (like a Word file) and contain selectable, searchable text with well-defined fonts and layouts, offering the highest quality in terms of text clarity and formatting.

I focused on Native PDFs and Searchable PDFs during my internship.

Handling Native PDFs presented a unique set of challenges, especially when it came to preserving the original format of the documents post-transformation. Maintaining the layout, fonts, and overall structure during the anonymisation process was crucial and highly requested, yet no single library we explored provided a satisfactory solution. Each tool or library we tested fell short of accurately retaining the intricate details of the PDFs — most were effective at merely extracting text.

Drawing upon my cybersecurity background, I approached the PDF formatting challenge from an unconventional angle. From my experience in a Capture The Flag challenge, where I converted Word documents into zipped folders to expose their underlying XML structure — a technique often used in steganography to conceal information — I recognised a potential solution. By converting the PDFs to Word documents (.docx files), we could leverage this XML structure to our advantage. This method allowed us to maintain the integrity and layout of the original documents, including relational data like tables and charts, as well as images, during the anonymisation process.
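The property this approach relies on is that a .docx file is a ZIP archive whose main text lives in `word/document.xml`, inside `<w:t>` elements. The sketch below shows the general idea using only Python's standard library: rewrite the text runs and copy every other part of the archive unchanged so the layout survives. It is a simplified illustration, not the code used in Cloak. The helper names are hypothetical, the demo archive is not a complete Word file, and a regular expression is used instead of a full XML parser to keep the example short.

```python
import io
import re
import zipfile

# Matches <w:t> or <w:t xml:space="preserve"> but not <w:tbl>, <w:tc>, <w:tab/>.
TEXT_RUN = re.compile(r"(<w:t(?:\s[^>]*)?>)(.*?)(</w:t>)", re.DOTALL)


def rewrite_docx_text(docx_bytes, transform):
    """Copy a .docx archive, applying transform to the text of each <w:t>
    element in word/document.xml and leaving all other parts untouched.
    transform receives XML-escaped text and must return XML-safe text."""
    source = zipfile.ZipFile(io.BytesIO(docx_bytes))
    output = io.BytesIO()
    with zipfile.ZipFile(output, "w", zipfile.ZIP_DEFLATED) as target:
        for item in source.infolist():
            data = source.read(item.filename)
            if item.filename == "word/document.xml":
                xml = data.decode("utf-8")
                xml = TEXT_RUN.sub(
                    lambda m: m.group(1) + transform(m.group(2)) + m.group(3), xml
                )
                data = xml.encode("utf-8")
            target.writestr(item, data)
    return output.getvalue()


def build_demo_archive():
    """Build just enough of the .docx layout to demonstrate the idea."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(
            "word/document.xml",
            '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
            "<w:body><w:tbl><w:tr><w:tc><w:p><w:r><w:t>Name: Alice Tan</w:t></w:r></w:p>"
            "</w:tc></w:tr></w:tbl></w:body></w:document>",
        )
        archive.writestr("word/styles.xml", "<w:styles/>")
    return buffer.getvalue()


anonymised = rewrite_docx_text(
    build_demo_archive(), lambda text: text.replace("Alice Tan", "[NAME]")
)
result_xml = zipfile.ZipFile(io.BytesIO(anonymised)).read("word/document.xml").decode("utf-8")
print("[NAME]" in result_xml, "<w:tbl>" in result_xml)  # True True
```

One caveat worth noting: Word can split a single word or name across several `<w:t>` runs, so a production version would need to handle entities that span run boundaries.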

Fig 4 Sample of anonymised PDF file

Searchable PDFs could not be converted to Word documents in the same way, as each page is typically an image (a scan or screenshot of a PDF or Word document) rather than real text. However, the file has previously been processed with OCR, which adds a text layer that can be selected, albeit with poorer accuracy. A PDF processing library (PyMuPDF) was used to extract this text and enqueue it into the same SQS queue used for CSV files. Since the original format could not be maintained, the output is a CSV file.

Support for Scanned PDFs could be a future improvement to the FTA tool, with Cloak using an OCR tool to process these files.

Through this task, I gained invaluable insights and practical knowledge. Working with SQS taught me how to prevent race conditions, a crucial aspect of ensuring data integrity and system reliability in concurrent processing environments. Additionally, using Docker to containerise the PDF processing Lambda function deepened my understanding of modern deployment practices, emphasising the importance of creating scalable and maintainable systems.

Reflections

Coming into my internship at DSAID, I carried with me the experience of a previous internship where I worked extensively with the LAMP (Linux, Apache, MySQL, PHP) stack as a full-stack developer intern. However, the technologies used for Cloak were notably different, presenting a new kind of challenge. Despite having some familiarity with these technologies through my coursework and personal projects, applying them in the context of a large-scale application was a different ball game — I was experiencing imposter syndrome.

Despite these initial challenges and feelings of imposter syndrome, the open and friendly culture of my team at Cloak and DE played a pivotal role in my adaptation and growth. The environment was one of collaboration and mutual support, where questions were encouraged and knowledge sharing was the norm. This supportive atmosphere not only helped ease my initial apprehensions but also fostered a space where I could learn and contribute effectively.

The culture extended beyond my immediate team to the wider department, where a sense of community and a collective drive towards innovation were palpable. Regular team meetings, knowledge-sharing sessions, and the encouragement of cross-departmental collaboration enhanced my understanding of the broader impact of our work. It was inspiring to be part of a group that was not only technically proficient but also deeply invested in nurturing a positive and inclusive work environment.

Fig 5 DE at a durian party organised by GovTech!

This culture of openness and support was instrumental in transforming my internship experience. It allowed me to feel like an integral part of the team, actively engaging in projects and discussions. Here, I learned that the technical aspects of a job are just one part of the equation — the work culture and team dynamics are equally crucial in shaping a fulfilling professional experience.

Fig 6 DE at Data Science Connect

Final Thoughts

Fig 7 The Cloak Team!

I have no regrets about choosing to intern at GovTech DSAID. As I look back on my time at Cloak, I am filled with a sense of profound gratitude and accomplishment. This journey has been more than just an internship; it has been a pivotal chapter in my professional and personal development. From confronting and overcoming imposter syndrome to learning modern technologies and thriving in a nurturing work culture, each day brought new opportunities for growth and learning. The support and mentorship provided by my team were invaluable, creating an environment where curiosity was encouraged and innovation thrived.

I’m grateful to have been a part of such a vibrant team and I wish them all the best in their future endeavours!

For those looking for a fulfilling and exciting internship, I 100% recommend exploring opportunities at GovTech DSAID!

Fig 8 DE interns at AWS!
