Building Bridges between Africa and Latin America: Fellows Projects on Big Data for Sustainable Development

Sokhar Samb, Jamiil Touré Ali and Fredy Rodriguez

This article reports on the outcomes of research projects carried out by fellows in the student exchange program between the African Institute for Mathematical Sciences Next Einstein Initiative (AIMS-NEI) and the Centro de Pensamiento Estratégico Internacional (CEPEI). CEPEI is an independent, non-profit, data-driven think tank based in Bogotá, Colombia. Its work focuses on field-based analysis and high-level advocacy to scale up multi-stakeholder participation in the development process at all levels.

We began six months of intensive practice applying our theoretical knowledge, punctuated by three projects on real-life problems within the scope of the Sustainable Development Goals (SDGs). The days off for discovery flew by, and it was soon time for us to reflect on what is expected of a data scientist in a challenging professional setting. The use of big data and open data for monitoring the 2030 Agenda is a matter of international concern, and a very appealing one. At CEPEI, we had the immense privilege of being introduced to this work through the projects we carried out with the Data Area team. This was an opportunity to build capacity for professional life and to strengthen the partnership between AIMS-NEI and CEPEI, two partners of the Big Data for Development (BD4D) initiative.

The projects

During this six-month full professional internship, we had three main tasks to complete:

  1. Build an algorithm, using R, that scrapes data linked to the Sustainable Development Goals (SDGs) from selected web pages in Latin America, specifically in Colombia, Costa Rica and Mexico, and downloads it.
  2. Research the use of non-traditional data sources for measuring the Sustainable Development Goals (SDGs) and the data ecosystem of sustainable development in the Dominican Republic. The first part of this project was an overview report on how non-traditional data sources are, or can be, integrated into the process of SDG measurement in the Dominican Republic. The second part was a report on how the SDGs are approached in the Dominican Republic, giving the executive team and third parties working on the topic an analysis of the country's progress and of the future actions it plans towards attaining the 2030 Agenda.
  3. Use Python to carry out research on gender-based violence in Colombia using data from Twitter.

The process

Throughout these projects we gained new skills and strengthened existing ones. First and foremost, we gained one of the important skills of a data scientist: scraping (in data science parlance, searching for and extracting data from websites). We were also exposed to the SDGs and gained a better understanding of them, as certain concepts were very new to us at the start of the projects.

This exercise, which was not very technical at first, really helped with the different tasks we carried out, since one of CEPEI's main lines of action is to transfer knowledge on the SDGs by investigating and exploring data. Working in such an environment and organization made the work much easier for us, because we had the necessary support and guidance and were consistently engaged with the topic.

In fact, in the first month, we put in place a timeline for all the work we had to do until the end of our internship. This agenda allowed us to keep in mind the start and end dates of each task. To follow our progress, a meeting was scheduled every Tuesday or Thursday with our supervisor, Fredy Rodríguez, the Data Coordinator of CEPEI.

The last point to mention is the noticeable progress we made in building advanced capacities in R and Python for scraping, crawling, text mining, sentiment analysis and storytelling.


For the first task, completed within the given timeline of three months, we came up with an algorithm that allows us to download data linked to the Sustainable Development Goals. This algorithm could be used as a tool to enhance the data available on the CEPEI website. Below is a screenshot displaying a sample of the downloaded data in a txt file, which shows each link containing data and the SDGs related to it.

Source: CEPEI, 2018
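
To make the output format described above more concrete, here is a minimal sketch of pairing links with related SDGs. It is not the fellows' algorithm: the project used R, while Python is used here purely for illustration, and the keyword lists, the function names and the sample page are all hypothetical. The sketch assumes that a link is related to a goal when its URL or anchor text contains one of that goal's keywords.

```python
from html.parser import HTMLParser

# Hypothetical keyword lists; the matching rules actually used are not
# described in this article.
SDG_KEYWORDS = {
    "SDG 1": ["poverty", "pobreza"],
    "SDG 4": ["education", "educación"],
    "SDG 5": ["gender", "género"],
}


class LinkCollector(HTMLParser):
    """Collect (href, anchor text) pairs from an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def tag_links(html):
    """Return (link, [related SDGs]) for links matching any keyword."""
    parser = LinkCollector()
    parser.feed(html)
    tagged = []
    for href, text in parser.links:
        haystack = (href + " " + text).lower()
        goals = [goal for goal, words in SDG_KEYWORDS.items()
                 if any(word in haystack for word in words)]
        if goals:
            tagged.append((href, goals))
    return tagged


def to_lines(tagged):
    """Format one tab-separated line per link, ready to write to a txt file."""
    return [f"{href}\t{', '.join(goals)}" for href, goals in tagged]


# Invented sample page; example.org is a placeholder domain.
SAMPLE_HTML = """
<a href="https://example.org/datos/pobreza.csv">Indicadores de pobreza</a>
<a href="https://example.org/contacto">Contacto</a>
<a href="https://example.org/datos/genero.xlsx">Educación y género</a>
"""

if __name__ == "__main__":
    for line in to_lines(tag_links(SAMPLE_HTML)):
        print(line)
```

A real scraper would also need to fetch the pages, follow links and respect each site's terms of use; those steps are left out of this sketch.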

For the second task, we produced two reports on the data ecosystem and the use of non-traditional data sources in the Dominican Republic. These documents detail the approach the Dominican Republic is taking to attain the objectives of the 2030 Agenda, as well as its strategies for including open and big data when working with the SDGs. The documents are useful for CEPEI in that they give the Data Area a basis for engaging in discussions about the Dominican Republic on the topic of data for the SDGs.

The third and last task aimed at running a sentiment analysis on the problem of gender-based violence in Colombia, Costa Rica and Mexico using data from Twitter. This task was rounded off with a storytelling piece built with the data visualization tool Tableau. The main points of the storytelling show 1) how men and women tweet about the gender problem, and 2) which city of a chosen country should be investigated further to uncover findings on the crime pattern. This study is of interest to CEPEI as it allows them to keep working within their framework of action on the SDGs, in this case SDG 5 (Gender Equality) and SDG 10 (Reduced Inequalities).
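
As a rough illustration of what a sentiment analysis of tweets involves, the sketch below scores each text against a small word list and counts the resulting labels per author group. It is a simplified lexicon-based approach under stated assumptions, not the method used in the project, which is not detailed here. The word lists, the function names and the sample tweets are invented for the example, and the author group is assumed to be already known for each tweet.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon, for illustration only.
POSITIVE = {"support", "safe", "justice", "respect"}
NEGATIVE = {"violence", "abuse", "fear", "unsafe"}


def score(text):
    """Positive-word count minus negative-word count."""
    words = re.findall(r"[a-záéíóúñü]+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def label(text):
    """Map a score to a coarse sentiment label."""
    s = score(text)
    if s > 0:
        return "positive"
    if s < 0:
        return "negative"
    return "neutral"


def summarize(tweets):
    """Count sentiment labels per author group.

    `tweets` is an iterable of (group, text) pairs.
    """
    counts = {}
    for group, text in tweets:
        counts.setdefault(group, Counter())[label(text)] += 1
    return counts


# Invented sample tweets, not real data.
SAMPLE_TWEETS = [
    ("women", "We need justice and respect"),
    ("men", "Violence and abuse must stop"),
    ("women", "I feel unsafe at night"),
]

if __name__ == "__main__":
    for group, counter in summarize(SAMPLE_TWEETS).items():
        print(group, dict(counter))
```

Per-group counts like these are the kind of table that can then be loaded into a visualization tool to compare how different groups tweet about a topic.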