Building a learning module to work with data in an open and reproducible way

Published in CivicDataLab · Dec 1, 2022

The primary objective behind our work at CivicDataLab is to make public data more accessible. We do this by releasing datasets from inaccessible websites and PDF documents, documenting the variables, building open data repositories to archive these datasets, and ensuring that these datasets are discoverable on the internet. All this work takes us a lot closer to a desirable level of accessibility, but one question remains: do people have the skills and the capacity to work with these datasets? We can keep making datasets available on central data portals, but that will not move us forward if people are unable to use them.

Often, while speaking to our partners and the users of our open data platforms (Justice Hub, Open Budgets India, and Open Contracting India), we get to know the challenges they face in collecting or managing data, or in using the data for its intended purposes. These challenges may further lead to:

Under or over-utilization of resources while collecting data
Lack of responsible data management practices which can lead to unintended data uses
Dependency on external consultants for engaging with public administrative datasets
Creation of data dashboards which are not usable and harder to update or maintain because of vendor lock-ins
Datasets which may not be fit for research and analysis

Most people feel that the only way to overcome these challenges is by getting comfortable with the technical know-how required to deal with data.

We empathise with our partners on this, but at the same time we believe that one does not need the title of data analyst or data scientist to interact with data. Most of the challenges listed above can be overcome through a basic understanding of collecting, processing, and analysing data. These conversations helped us realise the importance and role of data literacy in enhancing access to public datasets.

Recently, our partners at Vidhi Centre for Legal Policy reached out to us to conduct a training program for a team of lawyers who undertake doctrinal and empirical research on the Indian Judiciary. This allowed us to curate a learning module about the best practices for working with data, which included topics such as:

Processes to handle quantitative datasets
Working with open data tools
Frameworks for data analysis and data visualisation
Working with databases
Processing datasets using SQL
Working with geospatial datasets
Working with qualitative datasets

A list of topics covered in the learning module

We have published the contents of this module under a Creative Commons license; they can be viewed or downloaded from this link.

Since then, we have conducted sessions at the Data Dialogues event in Assam, for students and faculty members at the National Law University, Orissa and for the Monitoring and Evaluation team at Society for Nutrition, Education & Health Action (SNEHA), Mumbai.

A session for students and faculty members at National Law University, Orissa

Session at SNEHA, Mumbai

We learned about this opportunity through the Tech4Dev network curated by the Chintu Gudiya Foundation. We got in touch with Vinod Rajasekaran, who is collaborating with SNEHA as part of “The Fractional CXO program”. He informed us about the requirements for organising SQL training sessions at SNEHA. This gave us a good opportunity to build upon our existing work, and after a couple of conversations with the team at SNEHA, we scheduled a two-day in-person session at their office in Mumbai.

Session Requirements

The M&E team at SNEHA comprises researchers, field staff and data entry operators. The team was already somewhat familiar with handling data, having worked with structured datasets and relational databases as part of their work. We wanted to cover the basics of data curation, share some insights on creating accessible datasets and then move on to data analysis with Excel and SQL. With support from the team, we finalised these topics for the workshop:

Identifying the features of the dataset
Creating accessible datasets
Analysing data in Excel or other spreadsheet-based software
Learning about Database Tools such as PostgreSQL, pgAdmin, and psql
Reading and Writing Structured Query Language (SQL)
Exploring data using SQL

To make it easier for the participants to evaluate the sessions and share feedback with us, we came up with a list of objectives. Our idea was that by the end of the two-day workshop, all participants:

Are aware of the basic structure of the dataset
Can describe any tabular dataset in terms of its features
Are knowledgeable about how each data point can be stored in a file
Can evaluate the data quality of any dataset
Have a basic understanding of databases
Can read and write basic SQL queries
Have a pathway to develop their skills
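
To make the objectives about describing features and evaluating data quality a little more concrete, here is a minimal Python sketch. It is not taken from the workshop material: the sample rows, the column names and the `describe_features` helper are all invented for illustration, and counting missing and non-numeric values is only one simple way to think about data quality.

```python
import csv
import io

# Hypothetical sample data, invented for illustration only.
SAMPLE_CSV = """district,year,households,literacy_rate
Alpha,2011,1200,71.5
Beta,2011,,64.2
Gamma,2011,950,not available
"""


def is_number(value):
    """Return True if the text can be parsed as a number."""
    try:
        float(value)
        return True
    except ValueError:
        return False


def describe_features(csv_text):
    """Summarise each column: row count, missing cells, non-numeric cells."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    summary = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        filled = [v for v in values if v.strip() != ""]
        numeric = [v for v in filled if is_number(v)]
        summary[column] = {
            "rows": len(values),
            "missing": len(values) - len(filled),
            "non_numeric": len(filled) - len(numeric),
        }
    return summary


summary = describe_features(SAMPLE_CSV)
for column, stats in summary.items():
    print(column, stats)
```

A column such as `district` is expected to be entirely non-numeric, while a non-numeric entry in `literacy_rate` points to a value that needs cleaning before analysis.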

We wanted to ensure that the participants get to practise both reading and writing SQL queries for a diverse set of use cases. For this, we worked on a couple of case studies. The first explored the population Census dataset downloaded from the National Data and Analytics Portal (NDAP), and the second was designed around the datasets collected by the SNEHA team.
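
As a rough illustration of the kind of query practice described above, the sketch below runs two basic SQL queries from Python. It makes several assumptions: the workshop used PostgreSQL, whereas this example swaps in Python's built-in `sqlite3` module so that it runs without any setup; the `census` table, its columns and all the numbers are hypothetical and are not taken from the NDAP dataset.

```python
import sqlite3

# In-memory database as a stand-in for PostgreSQL (assumption for this sketch).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE census (state TEXT, district TEXT, population INTEGER)"
)

# Hypothetical rows, invented for illustration only.
conn.executemany(
    "INSERT INTO census VALUES (?, ?, ?)",
    [
        ("State A", "District 1", 120000),
        ("State A", "District 2", 80000),
        ("State B", "District 3", 150000),
    ],
)

# Aggregation: total population per state, largest first.
totals = conn.execute(
    """
    SELECT state, SUM(population) AS total_population
    FROM census
    GROUP BY state
    ORDER BY total_population DESC
    """
).fetchall()

# Filtering: districts above a population threshold.
large = conn.execute(
    "SELECT district FROM census WHERE population > ? ORDER BY district",
    (100000,),
).fetchall()

print(totals)
print(large)
conn.close()
```

The same `SELECT`, `WHERE`, `GROUP BY` and `ORDER BY` clauses work in PostgreSQL, so queries like these can be carried over to pgAdmin or psql with little change.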

The slides created for this workshop can be viewed or downloaded at this link.

This is the poster for the session at SNEHA, Mumbai.

Overall, this was a good experience for us to interact and engage with the M&E team at SNEHA. It also enhanced our understanding of data accessibility and of what we need to focus on while designing and working on our initiatives at CivicDataLab.

We would like your suggestions on other topics we can include in the data capacity-building module. If you have any feedback for us or if you would like to collaborate with us in organising these sessions please write to us at info[at]civicdatalab.in.

Let us demystify data together!

Written by Apoorv Anand