Advanced data analytics approaches now offer digital solutions that are transforming how foreign ministries carry out diplomacy. Manually analysing large volumes of text to extract meaningful information is an arduous task, so governments are exploring the use of Artificial Intelligence for a more efficient and systematic approach to information processing. Adapting to this data-driven diplomacy, the Indonesian Ministry of Foreign Affairs (MoFA) teamed up with Pulse Lab Jakarta (PLJ) and the Ministry of National Development Planning (Bappenas) to develop a machine learning visualisation tool. Using declassified documents, the tool analyses digital information received from the Ministry’s global outposts and extracts insights to inform diplomatic engagement.
To facilitate dialogue with foreign governments and other stakeholders, analysts within MoFA are typically required to produce per-country summaries of documents for diplomatic staff, government officials and the Minister for Foreign Affairs. These summaries are traditionally generated by analysts reading, annotating and summarising large volumes of documents, which is time-intensive and risks losing information that should be covered in the analysis reports.
Following the 2018 International Seminar on Digital Diplomacy co-organised by PLJ, MoFA and DiploFoundation, we worked with the Ministry via its Information and Media Department (known as Infomed) and Centre for Information Technology and Communication (known as Pustik KP) to explore possibilities in developing a tool with Natural Language Processing (NLP) capability. This would help shorten the time required for analysing the Ministry’s communications from around the world. The Lab initiated a collaborative process of ideation and design, leading to concepts and mockups presented at an early stage of the project to clarify needs, as well as ensure that the proposed solution is usable and will streamline the existing workflow of the staff within the Ministry.
Based on feedback from these mockups, changes were made to the proposed design leading to subsequent prototypes. A machine learning tool was developed, which reliably extracts metadata and text data from declassified documents shared within the Ministry, and automatically classifies new documents from Indonesian embassies and Representative Offices (ROs) around the world using a uniquely contextualised taxonomy.
Applying Natural Language Processing
As part of its day-to-day business operations, MoFA sends and receives scanned documents from its embassies and ROs. To analyse these documents (as inputs), we needed to perform image mining processes to extract text and information from the document images. This was followed by text preprocessing, carried out under the supervision of the Ministry, to remove HTML tags, numbers and punctuation from the text; split sentences into words; and filter out the most frequently used words in the declassified documents.
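The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration and not the project's actual code: the function names (`clean_text`, `frequent_words`, `preprocess`), the `top_n` cut-off and the choice to measure "most frequently used" by the number of documents a word appears in are all assumptions made for the example. It also assumes text in a plain Latin alphabet.

```python
import html
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")          # matches HTML tags such as <p> or </div>
NON_LETTER_RE = re.compile(r"[^a-z\s]")  # matches digits, punctuation and other symbols


def clean_text(raw: str) -> str:
    """Remove HTML tags, numbers and punctuation, and lowercase the text."""
    text = TAG_RE.sub(" ", raw)
    text = html.unescape(text).lower()
    text = NON_LETTER_RE.sub(" ", text)
    return " ".join(text.split())  # collapse repeated whitespace


def frequent_words(tokenized_docs: list[list[str]], top_n: int) -> set[str]:
    """Return the top_n words that occur in the largest number of documents."""
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(set(tokens))  # count each word once per document
    return {word for word, _ in doc_freq.most_common(top_n)}


def preprocess(raw_docs: list[str], top_n: int = 2) -> list[list[str]]:
    """Clean each document, split it into words and drop the most frequent words."""
    tokenized = [clean_text(doc).split() for doc in raw_docs]
    stop_words = frequent_words(tokenized, top_n)
    return [[t for t in tokens if t not in stop_words] for tokens in tokenized]


docs = [
    "<p>The embassy reports the trade deal.</p>",
    "The embassy reports the visa issue 2021",
    "The embassy hosts the culture fair",
]
print(preprocess(docs))
```

In practice the list of words to filter out would likely be reviewed by domain experts rather than taken straight from a frequency count.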
Relying on the domain expertise and knowledge of the staff within the Ministry, a set of labels was established for categorising the content of documents. These labels included economic and political issues, staff-related matters, and socio-cultural affairs. From here, the team at MoFA manually classified more than 5,000 documents into these categories based on similarity of keywords, and these documents then became the training dataset used to develop the machine learning model. To make sense of the analysis performed by the computational method, the results of the classifications are visualised in the form of easy-to-read maps and graphs, thus improving the staff’s ability to synthesise massive amounts of communications and provide more relevant and timely insights.
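To illustrate how manually labelled documents can become a training dataset, here is a small sketch of a multinomial Naive Bayes text classifier written in plain Python. The source does not say which model the project used, so Naive Bayes is only a stand-in chosen for its simplicity. The class name, the `alpha` smoothing parameter and the toy training data are hypothetical.

```python
import math
from collections import Counter, defaultdict


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with additive (Laplace) smoothing."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha                       # smoothing strength (assumed default)
        self.label_counts = Counter()            # number of training documents per label
        self.word_counts = defaultdict(Counter)  # word counts per label
        self.vocab = set()

    def fit(self, docs: list[list[str]], labels: list[str]) -> "NaiveBayesClassifier":
        for tokens, label in zip(docs, labels):
            self.label_counts[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens: list[str]) -> str:
        total_docs = sum(self.label_counts.values())
        vocab_size = len(self.vocab)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            score = math.log(n_docs / total_docs)  # log prior
            total_words = sum(self.word_counts[label].values())
            for token in tokens:
                if token not in self.vocab:
                    continue  # ignore words never seen during training
                count = self.word_counts[label][token]
                score += math.log(
                    (count + self.alpha) / (total_words + self.alpha * vocab_size)
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy labelled documents standing in for the manually classified set.
train_docs = [
    ["trade", "export", "tariff"],
    ["investment", "trade", "market"],
    ["election", "parliament", "treaty"],
    ["festival", "culture", "batik"],
]
train_labels = ["economic", "economic", "political", "socio-cultural"]

model = NaiveBayesClassifier().fit(train_docs, train_labels)
print(model.predict(["trade", "tariff", "talks"]))
```

A model of this kind only learns what the labels teach it, which is why the Ministry's manual classification is the critical input.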
The visualisations allow users to examine the distribution of documents for each country where a representative office resides. With the added data analytics function, it is also possible to analyse the number of documents and their distribution across the categories, and to identify important issues and trends that need to be prioritised for diplomatic engagement. Each visualisation can be further examined for more detailed information on a particular focus area.
The NLP method has clear benefits: it can be deployed in a limited-resource environment (such as one with low-specification personal computers), and it allows information to be extracted from the declassified documents and insights to be generated from the text classifications. Still, the process was not without challenges.
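The counts behind such maps and graphs can be produced with a simple aggregation. The sketch below assumes each classified document is reduced to a hypothetical `(country, category)` pair. The function names and sample records are illustrative only and do not reflect the tool's real data model.

```python
from collections import Counter, defaultdict


def distribution_by_country(records):
    """Count documents per category for each country.

    records: iterable of (country, category) pairs, one per classified document.
    """
    table = defaultdict(Counter)
    for country, category in records:
        table[country][category] += 1
    return table


def top_issues(table, country, n=3):
    """Return the n categories with the most documents for a country."""
    return table.get(country, Counter()).most_common(n)


def category_shares(table, country):
    """Return each category's share of a country's documents."""
    counts = table.get(country, Counter())
    total = sum(counts.values())
    return {cat: count / total for cat, count in counts.items()} if total else {}


# Illustrative records, not real data.
records = [
    ("Japan", "economic"),
    ("Japan", "economic"),
    ("Japan", "political"),
    ("Australia", "socio-cultural"),
]
table = distribution_by_country(records)
print(top_issues(table, "Japan", n=1))
```

A table like this can be passed to any charting or mapping library to draw the per-country views.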
Below are some of the challenges we encountered during the NLP process:
- The scanned documents consist of unstructured text in a non-machine-friendly format, therefore requiring additional steps for conversion before classification;
- The conversion process involves a number of steps: first, the scanned documents are converted into images, and then these images are converted into plain text. Inaccuracies often occurred in the conversion from image to text;
- Some documents were up to 10 pages long, and the number of pages affects the time needed to convert a scanned PDF to text;
- Manually labelling more than 5,000 declassified documents based on the set of predefined categories was a very labour-intensive task; and
- Determining accuracy and relevance of associated keywords for the categories required human resources for assessment, instead of only relying on a count of frequent occurrence.
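On the last point, one common heuristic for going beyond raw frequency is to down-weight words that appear in every category, in the spirit of TF-IDF. The sketch below is an assumption about how candidate keywords could be shortlisted for human review. It is not the method used in the project, and the function name and sample data are hypothetical.

```python
import math
from collections import Counter


def keyword_scores(docs_by_category):
    """Score words per category by frequency, weighted by category specificity.

    docs_by_category: dict mapping a category to a list of tokenized documents.
    Returns a dict mapping each category to (word, score) pairs, best first.
    """
    # Raw term counts within each category.
    term_counts = {
        category: Counter(token for doc in docs for token in doc)
        for category, docs in docs_by_category.items()
    }
    n_categories = len(term_counts)

    # Number of categories in which each word occurs at least once.
    category_freq = Counter()
    for counts in term_counts.values():
        category_freq.update(counts.keys())

    scores = {}
    for category, counts in term_counts.items():
        scored = [
            # A word present in every category gets weight log(1) = 0.
            (word, count * math.log(n_categories / category_freq[word]))
            for word, count in counts.items()
        ]
        scores[category] = sorted(scored, key=lambda pair: (-pair[1], pair[0]))
    return scores


# Illustrative data: "embassy" is frequent everywhere, so it should rank last.
sample = {
    "economic": [["embassy", "trade", "trade"], ["embassy", "tariff"]],
    "political": [["embassy", "election"], ["embassy", "embassy", "treaty"]],
}
scores = keyword_scores(sample)
print(scores["economic"][0][0])
```

Scores like these can only suggest candidates. Judging whether a keyword is actually relevant to a category still calls for the human assessment described above.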
Impact and Future Collaboration
Following the project’s completion, PLJ and MoFA engaged in follow-up discussions regarding the future development and sustainability of the system. The Ministry of Foreign Affairs expressed interest in furthering the collaboration, given the operational impact the system has had in classifying and categorising its documents in a more timely manner. Furthermore, the system’s agile design has allowed the Ministry to overlay external data collected from online and printed mass media, as well as social media, to gain more comprehensive insights on selected issues.
Meanwhile, the Lab, learning from this collaboration with MoFA, sees the importance of designing a whole product with agility, and not merely a functioning prototype, for system-wide adoption and scaling. Despite being computational in nature, the NLP method relies on human expertise, and during the first phase the collaboration provided capacity building for MoFA in the areas of machine learning and advanced data analytics. The role of MoFA as the domain expert remains important for improving classification results going forward, while also positioning the Ministry as a potential focal point for the development of machine learning knowledge within the Indonesian Government.
Future plans for this collaboration include integrating the system with a complementary automated machine that will be able to collect, read and recognise the documents in their entirety. The system will also be further developed to infer insights from analysis over particular periods of time, which requires time frames to be built as a new feature. In the meantime, our teams have agreed to continue text analysis using a machine learning approach, and the project is expected to finish in May 2021.
Authors: Annissa Zahara (Data Engineer), Utami Diah Kusumawati (Research Coordinator) and Dwayne Carruthers (Communication Manager)
Technical Team: Annissa Zahara (Data Engineer), Muhammad Rheza (Full Stack Engineer) and Sriganesh Lokanathan (Data Innovation and Policy Lead)
Pulse Lab Jakarta is grateful for the generous support from the Government of Australia.