Extraction of Collections and Datasets Using data.gov.sg API

Sze Zhong LIM · Published in Data And Beyond · Mar 2, 2024
A snapshot of Data.gov.sg

According to their website:

Data.gov.sg is Singapore’s one-stop portal for publicly available datasets from various public agencies. Users are able to explore national datasets, download raw data for research analysis, and access APIs for web development.

Features include:

High quality data presented via dashboards for quick reference

Shareable charts and tables of datasets

Data blogs of social trends and government policies explained using data analysis and visualisations.

data.gov.sg is not a new concept. A few years back, I used the Department of Statistics Singapore (SingStat) website to search for data. However, a lot has changed since then. I now see more processed and analyzed data on SingStat, whereas data.gov.sg has more raw data (in my opinion).

A friend of mine asked me to look for insights on companies in Singapore, and I decided to use data.gov.sg as my data source. One thing I noticed is that they have documentation on how to view and extract from their collections and datasets. Basically, a collection is a compilation of related datasets of the same kind.

Let’s head right into the extraction of data from this portal.

EXTRACTION OF INFORMATION

We can extract the information via the browser / links, or via the API.

via Browser / Links

I wanted to find data on companies in Singapore, so my first stop was ACRA. Upon typing ‘ACRA’ into the search bar, we can see the collections available.

I chose the first link, “ACRA Information on Corporate Entities”. Upon entering the link, on the left you can see many files, which are all the datasets available for this collection. In the middle, you can see the sample data for the selected dataset.

Just click on “Download files” and you will be able to choose which datasets to download. You may also click the “Select all” option to download the whole collection.

The download results in multiple csv files being downloaded instead of a single zip file. Since I prefer a single download, I would suggest the devs add an option to zip the files up and rename the zip file.

But what if we wanted to create dashboards and automatically update the data over time programmatically? It would be troublesome to write a script and scrape the site, especially if changes are made to the site interface over time.

Luckily, the devs provide two types of APIs for our use. The first type is real-time APIs, and the other is Collection and Dataset APIs. Our scope here covers only the Collection and Dataset APIs.

via Collection and Dataset APIs

data.gov.sg provides an API guide on their site.

Before using the API, we need two items: the collection_id and the dataset_id. How to identify these two is clearly explained in the guide.
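As a quick sanity check, the two metadata endpoints (the same ones listed in the config file later in this post) can be queried directly with requests. This is a minimal sketch, assuming collection_id 1 and the BCA dataset_id that appear later in the config; substitute your own ids.

import requests

# Minimal sketch: query the collection and dataset metadata endpoints directly.
# The ids below are the ones used later in this post; swap in your own.
collection_id = '1'
dataset_id = 'd_dcda79be4aded5f9e769b8e23ff69b47'

col_meta = requests.get(
    f'https://api-production.data.gov.sg/v2/public/api/collections/{collection_id}/metadata'
).json()
ds_meta = requests.get(
    f'https://api-production.data.gov.sg/v2/public/api/datasets/{dataset_id}/metadata'
).json()

print(col_meta['data']['collectionMetadata']['name'])
print(ds_meta['data']['name'])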

The APIs are simple and easy to use, and provide basic building blocks for customizing what you want. I decided to use these building blocks to create a downloader for a whole collection. This is more convenient than manually downloading csv files and converting them; instead, we can use the APIs to create a DataFrame directly and save the data as a csv backup.

WRAPPER FUNCTION TO DOWNLOAD

I created a wrapper function with the intent to achieve the below objectives:

  1. Download the information purely from the collection id, in a simple manner.
  2. Save the data into a csv file as a backup.
  3. Return the data as a pandas.DataFrame to be used for analysis / fed into a pipeline.
  4. Skip downloading datasets that have not been updated on the site and reuse the previously saved csv file (if it exists). The main purpose is to save bandwidth and resources.

Let’s start by importing all the relevant packages. I also created a TypeVar for pandas.DataFrame.

import pandas as pd
import requests
import os
from typing import Optional, List, Dict, Tuple, TypeVar
import src.config as CFG # Config file placed in src folder.

pdDF = TypeVar('pandas.core.frame.DataFrame')

We then create the class for Collections. By inputting the collection_id, it will initialize and fetch the collection name, the last updated date for the collection, and the list of dataset ids.

class Collection:
    def __init__(self, collection_id: str) -> None:
        '''
        Parameters
        ----------
        collection_id : str
            collection id from data.gov.sg
        '''
        self.collection_id = str(collection_id)
        self.col_url_info = CFG.api['collection'] + f'/{self.collection_id}/' + \
            CFG.api['meta']

        self.collection_name = None
        self.last_updated = None
        self.dataset_id_list = None

        self.collection_info()

    def collection_info(self) -> Tuple[str, str, List[str]]:
        '''
        Returns
        -------
        Tuple[str, str, List[str]]
            returns the collection name, last updated date, and a list of dataset ids
            within the collection.
        '''
        json_content = requests.get(self.col_url_info).json()
        content = json_content['data']['collectionMetadata']

        self.collection_name = content['name']
        self.last_updated = content['lastUpdatedAt'][:10]
        self.dataset_id_list = content['childDatasets']

        return (self.collection_name, self.last_updated, self.dataset_id_list)
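For example (a hypothetical quick check, using collection 1 from the ACRA entry in the config file later in this post), the class can be used like this:

# Hypothetical quick check of the Collection class (collection 1 = ACRA, per the config below).
col = Collection('1')
print(col.collection_name)        # collection name
print(col.last_updated)           # yyyy-mm-dd slice of lastUpdatedAt
print(len(col.dataset_id_list))   # number of datasets in the collection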

We then create the class for Datasets. This class takes more inputs, which give us the flexibility to create a pandas DataFrame or a csv backup file right when the class initializes.

class Dataset:
    def __init__(self, dataset_id: str, pdf: str = 'No', csv: str = 'No',
                 csvdir: Optional[str] = None):
        '''
        Parameters
        ----------
        dataset_id : str
            dataset id found in the url
        pdf : str
            Flag to generate a pandas DataFrame. Use 'Yes' to generate.
        csv : str
            Flag to generate a csv of the whole dataset. Use 'Yes' to generate.
        csvdir : str, optional
            csv directory name. The filename will be the date and dataset name.

        Returns
        -------
        None.
        '''
        self.dataset_id = dataset_id
        self.url_download = CFG.api['dataset_dl'] + f'{self.dataset_id}'
        self.url_info = CFG.api['dataset'] + f'/{self.dataset_id}/' + CFG.api['meta']

        self.dataset_name = None
        self.last_updated = None
        self.col_info = None
        self.dataframe_data = None
        self.total_results = 0
        self.csv_final_filepath = None

        self.dataset_info()

        if pdf == 'Yes':
            self.dataset_download_pdf()

        if csv == 'Yes':
            self.dataset_download_csv(directory=csvdir)

    def dataset_info(self) -> Tuple[str, str, Dict[str, List[str]]]:
        '''
        Returns
        -------
        Tuple[str, str, Dict[str, List[str]]]
            returns the dataset name, last updated date, and a dict of column info.
            The dict contains col_mapping_id, col_name, col_datatype, col_description
            and col_index as the keys.
        '''
        json_content = requests.get(self.url_info).json()
        content = json_content['data']
        content_inner = json_content['data']['columnMetadata']

        dataset_name = content['name']
        last_updated = content['lastUpdatedAt'][:10]

        col_mapping_id = content_inner['order']  # returns a list
        col_name = []
        col_datatype = []
        col_description = []
        col_index = []

        for x in col_mapping_id:
            col_attr = content_inner['metaMapping'][x]
            try:
                col_name.append(col_attr['name'])
            except KeyError:
                print("No column name attribute")
            try:
                col_datatype.append(col_attr['dataType'])
            except KeyError:
                print('No dataType attribute')
            try:
                col_index.append(col_attr['index'])
            except KeyError:
                print('No index attribute')

        col_info = {'col_mapping_id': col_mapping_id,
                    'col_name': col_name,
                    'col_datatype': col_datatype,
                    'col_description': col_description,
                    'col_index': col_index}

        self.dataset_name = dataset_name
        self.last_updated = last_updated
        self.col_info = col_info

        return (self.dataset_name, self.last_updated, self.col_info)
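As a quick illustration (hypothetical, using the BCA contractors dataset id from the config file shown later), initializing the class already fetches the metadata:

# Hypothetical quick check of Dataset metadata (dataset id taken from the config below).
ds = Dataset('d_dcda79be4aded5f9e769b8e23ff69b47')
print(ds.dataset_name)
print(ds.last_updated)
print(ds.col_info['col_name'])   # list of column names from columnMetadata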

I have also included the code to download the data into a pandas DataFrame or a csv file, as below. The code extracts the total number of rows from the API, then decides how many loops are needed to download everything. I set the limit to 1000 rows per request because when I tried larger numbers, the download failed. To be on the safe side, I settled on a limit of 1000. The code below is still part of the Dataset class.

    def dataset_download_csv(self, directory=None) -> pdDF:
        '''
        Parameters
        ----------
        directory : str, optional
            folder / directory where the csv file will be saved to.

        Returns
        -------
        pdf : pandas DataFrame
            Also produces a csv file in the directory, named with the
            last updated date of the dataset together with the dataset name.
        '''
        if self.dataset_name is None or self.last_updated is None:
            self.dataset_info()

        if directory is None:
            filepath = f"{self.last_updated} {self.dataset_name}.csv"
        else:
            filepath = f"{directory}/{self.last_updated} {self.dataset_name}.csv"

        self.csv_final_filepath = filepath

        if self.dataframe_data is None:
            self.dataframe_data = self.dataset_download_pdf()

        self.dataframe_data.to_csv(self.csv_final_filepath, index=False)
        print(f'{self.csv_final_filepath} downloaded')

        return self.dataframe_data

    def dataset_download_pdf(self) -> pdDF:
        '''
        Returns
        -------
        pdDF
            pd.DataFrame that contains the whole dataset.
        '''
        print(f"Downloading from {self.url_download}")
        ini_json_content = requests.get(self.url_download).json()

        self.total_results = ini_json_content['result']['total']
        print(f'There are a total of {self.total_results} rows in this dataset')

        offset_start, offset_end, limit = self.dataset_download_counter()
        for x in range(offset_start, offset_end, limit):
            if x == 0:
                offset_limit_str = f"&limit={limit}"
            else:
                offset_limit_str = f"&offset={x}&limit={limit}"
            loop_url = self.url_download + offset_limit_str
            loop_json_content = requests.get(loop_url).json()
            loop_record = loop_json_content['result']['records']
            loop_record_df = pd.DataFrame(loop_record)
            if x == offset_start:
                overall_loop_record_df = loop_record_df
            else:
                overall_loop_record_df = pd.concat([overall_loop_record_df, loop_record_df],
                                                   axis=0, ignore_index=True)
            if x > (offset_end - limit):
                print(f"Processed {self.total_results} / {self.total_results} rows.")
            else:
                print(f"Processed {x + limit} / {self.total_results} rows.")

        self.dataframe_data = overall_loop_record_df

        return self.dataframe_data

    def dataset_download_counter(self) -> Tuple[int, int, int]:
        '''
        Returns
        -------
        Tuple[int, int, int]
            A (start, stop, step) range to feed the for loop in dataset_download_pdf
            that downloads the whole dataset.
        '''
        # Cap each request at 1000 rows; larger limits proved unreliable.
        limit = 1000
        offset_end = int(self.total_results) + 1

        return (0, offset_end, limit)
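To make the offset / limit logic concrete, here is a small sketch (for illustration only, assuming a dataset with 2,500 rows) of the paginated URLs that dataset_download_pdf ends up requesting:

# Illustration only: the paginated request URLs for an assumed dataset of 2,500 rows.
base = ('https://data.gov.sg/api/action/datastore_search'
        '?resource_id=d_dcda79be4aded5f9e769b8e23ff69b47')
total_results = 2500
limit = 1000

for offset in range(0, total_results + 1, limit):
    if offset == 0:
        print(base + f"&limit={limit}")
    else:
        print(base + f"&offset={offset}&limit={limit}")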

Now that we have both the Collection class and Dataset class, we can create a wrapper function that will:

  1. Check if the file has been downloaded previously. If it has, we can choose to reuse the old files, or re-download and overwrite.
  2. Optionally create a combined dataset of the whole collection, in both pandas DataFrame and csv format (the file will be very big).

def download_collection(collection_id: str, chk: str = 'No', combcsv: str = 'No',
                        indcsv: str = 'No', csvdir: Optional[str] = None) -> pdDF:
    '''
    Wrapper function to download the whole collection.

    Parameters
    ----------
    collection_id : str
        collection id from data.gov.sg
    chk : str
        Flag for whether to check csvdir for existing csv's which are the same.
        Options of 'Yes' or 'No'.
        If 'Yes', won't download a csv with the same last updated date.
        If 'No', won't check and will just download, overwriting any existing file.
    combcsv : str, optional
        Flag for whether to combine the whole collection into one csv.
    indcsv : str, optional
        Flag for whether to download individual datasets in the collection.
    csvdir : Optional[str], optional
        The directory where csv's are stored.
        This directory is checked for existing files; if a file is missing,
        the dataset is re-downloaded.

    Returns
    -------
    pdDF
        A combined pandas DataFrame.
    '''
    collection_object = Collection(collection_id)
    countofdataset = len(collection_object.dataset_id_list)
    print(f'Collection {collection_id} contains {countofdataset} no(s) of dataset.')

    combinedpdf = None

    if chk == 'Yes':
        ### Confirm which datasets are missing
        # Current list of files in the working directory.
        listindir = []
        for root, dirs, files in os.walk(os.getcwd()):
            for file in files:
                file_path = os.path.relpath(os.path.join(root, file), os.getcwd())
                listindir.append(file_path.replace('\\', '/'))

        # Check for missing datasets.
        missingdataset = []
        for x in range(countofdataset):
            # Get the expected file path for this dataset.
            d_id = collection_object.dataset_id_list[x]
            d_obj = Dataset(d_id, 'No', 'No', csvdir)
            if csvdir is None:
                filepath = f"{d_obj.last_updated} {d_obj.dataset_name}.csv"
            else:
                filepath = f"{csvdir}/{d_obj.last_updated} {d_obj.dataset_name}.csv"

            if filepath in listindir:
                if combinedpdf is None:
                    combinedpdf = pd.read_csv(filepath)
                else:
                    pdf_not_missing = pd.read_csv(filepath)
                    combinedpdf = pd.concat([combinedpdf, pdf_not_missing],
                                            axis=0, ignore_index=True)
            else:
                missingdataset.append(d_id)
            print(f'Checked {x+1} out of {countofdataset} datasets.')

        if len(missingdataset) == 0:
            print('All files were downloaded previously.')
        else:
            print(f'{len(missingdataset)} out of {countofdataset} datasets missing.')
            for x in range(len(missingdataset)):
                print(f'Start of Missing Dataset {x+1} out of {len(missingdataset)}')
                d_id = missingdataset[x]

                if indcsv == 'Yes':
                    d_obj = Dataset(d_id, 'Yes', 'Yes', csvdir)
                elif indcsv == 'No':
                    d_obj = Dataset(d_id, 'Yes', 'No')

                if combinedpdf is None:
                    combinedpdf = d_obj.dataframe_data
                else:
                    combinedpdf = pd.concat([combinedpdf, d_obj.dataframe_data],
                                            axis=0, ignore_index=True)
                print(f'End of Missing Dataset {x+1} out of {len(missingdataset)}')

    elif chk == 'No':

        for x in range(countofdataset):
            print(f'Start of Dataset {x+1} out of {countofdataset}')
            dataset_id = collection_object.dataset_id_list[x]

            if indcsv == 'Yes':
                dataset_obj = Dataset(dataset_id, 'Yes', 'Yes', csvdir)
            elif indcsv == 'No':
                dataset_obj = Dataset(dataset_id, 'Yes', 'No')

            if x == 0:
                combinedpdf = dataset_obj.dataframe_data
            else:
                combinedpdf = pd.concat([combinedpdf, dataset_obj.dataframe_data],
                                        axis=0, ignore_index=True)
            print(f'End of Dataset {x+1} out of {countofdataset}')

    if combcsv == 'Yes':
        if csvdir is None:
            combcsvfilepath = f"Combined {collection_object.last_updated} {collection_object.collection_name}.csv"
        else:
            combcsvfilepath = f"{csvdir}/Combined {collection_object.last_updated} {collection_object.collection_name}.csv"

        combinedpdf.to_csv(combcsvfilepath, index=False)
        print(f'{combcsvfilepath} downloaded')

    return combinedpdf

The above code is placed in a folder called src, with the filename api_data_gov.py (i.e. src/api_data_gov.py).

I decided to keep the config file separate so that it is easy to change if the API links are modified. The code below is placed in the same src folder, with the filename config.py (i.e. src/config.py).

'''
###### Sample Links ######
collection_list = 'https://api-production.data.gov.sg/v2/public/api/collections?page=2'
collection_metadata = 'https://api-production.data.gov.sg/v2/public/api/collections/1/metadata'
dataset_list = 'https://api-production.data.gov.sg/v2/public/api/datasets?page=2'
dataset_metadata = 'https://api-production.data.gov.sg/v2/public/api/datasets/d_dcda79be4aded5f9e769b8e23ff69b47/metadata'
dataset_info = 'https://data.gov.sg/api/action/datastore_search?resource_id=d_dcda79be4aded5f9e769b8e23ff69b47'
'''

###### Datasets ######
datasets = {
    'acra_ent': 'datasets/d_3f960c10fed6145404ca7b821f263b87',
    'acra_reg_ent': 'collections/1',
    'bca_con': 'datasets/d_dcda79be4aded5f9e769b8e23ff69b47',
    'hdb_con': 'datasets/d_9973d2c119ed4dd1560aebf8f0829b86'
}

###### APIs ######
api = {
    'collection': 'https://api-production.data.gov.sg/v2/public/api/collections',
    'dataset': 'https://api-production.data.gov.sg/v2/public/api/datasets',
    'meta': 'metadata',
    'dataset_dl': 'https://data.gov.sg/api/action/datastore_search?resource_id=',
    'dataset_q_limit': '&limit=',
    'dataset_q_offset': '&offset='
}
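As a small sanity check (a sketch only), the pieces in CFG.api combine back into the sample links commented at the top of the config file:

# Sketch: reconstructing the sample metadata and download URLs from the api dict.
import src.config as CFG

print(CFG.api['collection'] + '/1/' + CFG.api['meta'])
# -> https://api-production.data.gov.sg/v2/public/api/collections/1/metadata

print(CFG.api['dataset_dl'] + 'd_dcda79be4aded5f9e769b8e23ff69b47')
# -> https://data.gov.sg/api/action/datastore_search?resource_id=d_dcda79be4aded5f9e769b8e23ff69b47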

To run the code, I created a main.py file in the same directory as the src folder. In this particular run, I download collection_id 2, check whether the existing files are already there, create a combined csv file, and download the individual csv files where necessary. The data files are stored in a directory called data.

from src.api_data_gov import download_collection

col_id = 2
chk = 'Yes'
combcsv = 'Yes'
indcsv = 'Yes'
csvdir = 'data'
df_combined = download_collection(col_id, chk, combcsv, indcsv, csvdir)
print(df_combined)

WRAPPER FUNCTION OUTPUT AND ISSUES

The output of my code when there are existing csv files and we do not need to re-download is as below:

Output (at the start) from running my code.
Output (at the end) from running my code

One of the problems is the dtype warning that appears when reading the csv's back into pandas. If the use case is more customized, I suggest modifying the code slightly to include the dtype when reading the csv. One method is to extract the dtypes from the pandas DataFrame when the data is first converted, and save them for later use.
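A minimal sketch of that idea (df and 'backup.csv' are placeholders for illustration; the next section wires this into the classes properly):

# Minimal sketch: capture dtypes from the API-built DataFrame, reuse them when reading the csv back.
dtype_dict = df.dtypes.apply(str).to_dict()
df_reloaded = pd.read_csv('backup.csv', dtype=dtype_dict)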

If we want to re-download the whole collection, we just have to change the chk parameter to ‘No’.

Output (at the start) from running my code
Output (in the middle) from running my code
Output (at the end) from running my code

As the collection is huge, it will take roughly 20 mins to download all the data.

SOLVING THE DTYPE ISSUE

To solve the dtype issue when loading the csv files into a pandas.DataFrame, the dtype dictionary is saved into a json file and downloaded together with each csv file.

When loading, the code will directly load the csv files with the pre-determined dtypes. The code below should be added to / modified in the Dataset class. Basically, I have added the json_dtype_filepath attribute and a function that saves the dtype dict to a json file (remember to also add import json to the imports at the top of src/api_data_gov.py).

class Dataset:
    def __init__(self, dataset_id: str, pdf: str = 'No', csv: str = 'No',
                 csvdir: Optional[str] = None):
        '''
        Parameters
        ----------
        dataset_id : str
            dataset id found in the url
        pdf : str
            Flag to generate a pandas DataFrame. Use 'Yes' to generate.
        csv : str
            Flag to generate a csv of the whole dataset. Use 'Yes' to generate.
        csvdir : str, optional
            csv directory name. The filename will be the date and dataset name.

        Returns
        -------
        None.
        '''
        self.dataset_id = dataset_id
        self.url_download = CFG.api['dataset_dl'] + f'{self.dataset_id}'
        self.url_info = CFG.api['dataset'] + f'/{self.dataset_id}/' + CFG.api['meta']

        self.dataset_name = None
        self.last_updated = None
        self.col_info = None
        self.dataframe_data = None
        self.total_results = 0
        self.csv_final_filepath = None
        self.json_dtype_filepath = None

        self.dataset_info()

        if pdf == 'Yes':
            self.dataset_download_pdf()

        if csv == 'Yes':
            self.dataset_download_csv(directory=csvdir)

    def dtype_to_json(self, pdf: pdDF) -> dict:
        '''
        Parameters
        ----------
        pdf : pandas.DataFrame
            pandas.DataFrame from which we extract the dtypes.

        Returns
        -------
        Dict
            The dtype dictionary used.

        Creates a json file which stores the pandas dtype dictionary for
        use when converting back from csv to pandas.DataFrame.
        '''
        dtype_dict = pdf.dtypes.apply(lambda x: str(x)).to_dict()

        with open(self.json_dtype_filepath, 'w') as json_file:
            json.dump(dtype_dict, json_file)

        return dtype_dict

    def dataset_download_csv(self, directory=None) -> pdDF:
        '''
        Parameters
        ----------
        directory : str, optional
            folder / directory where the csv file will be saved to.

        Returns
        -------
        pdf : pandas DataFrame
            Also produces a csv file and a dtype json file in the directory,
            named with the last updated date of the dataset together with the
            dataset name.
        '''
        if self.dataset_name is None or self.last_updated is None:
            self.dataset_info()

        if directory is None:
            filepath = f"{self.last_updated} {self.dataset_name}.csv"
            jsonfp = f"{self.last_updated} {self.dataset_name}_dtype.json"
        else:
            filepath = f"{directory}/{self.last_updated} {self.dataset_name}.csv"
            jsonfp = f"{directory}/{self.last_updated} {self.dataset_name}_dtype.json"

        self.csv_final_filepath = filepath
        self.json_dtype_filepath = jsonfp

        if self.dataframe_data is None:
            self.dataframe_data = self.dataset_download_pdf()

        self.dataframe_data.to_csv(self.csv_final_filepath, index=False)

        # Save the dtypes to json.
        self.dtype_to_json(self.dataframe_data)

        print(f'{self.csv_final_filepath} downloaded')
        print(f'{self.json_dtype_filepath} created')

        return self.dataframe_data

Similarly, the wrapper function needs to be modified so that if the json file is missing, it will simply re-download the whole dataset. Only if both the dataset csv AND the dtype dict json file are present will it skip the download process.
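The wrapper below calls a json_to_dtype helper that is not shown in this post; a minimal sketch of it (placed in the same src/api_data_gov.py module) could look like this:

import json

def json_to_dtype(jsonfp: str) -> dict:
    '''Minimal sketch of the helper used below: load the dtype dictionary
    that Dataset.dtype_to_json saved to a json file.'''
    with open(jsonfp, 'r') as json_file:
        return json.load(json_file)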

def download_collection(collection_id: str, chk: str = 'No', combcsv: str = 'No',
                        indcsv: str = 'No', csvdir: Optional[str] = None) -> pdDF:
    '''
    Wrapper function to download the whole collection.

    Parameters
    ----------
    collection_id : str
        collection id from data.gov.sg
    chk : str
        Flag for whether to check csvdir for existing csv's which are the same.
        Options of 'Yes' or 'No'.
        If 'Yes', won't download a csv with the same last updated date.
        If 'No', won't check and will just download, overwriting any existing file.
    combcsv : str, optional
        Flag for whether to combine the whole collection into one csv.
    indcsv : str, optional
        Flag for whether to download individual datasets in the collection.
    csvdir : Optional[str], optional
        The directory where csv's are stored.
        This directory is checked for existing files; if a file is missing,
        the dataset is re-downloaded.

    Returns
    -------
    pdDF
        A combined pandas DataFrame.
    '''
    collection_object = Collection(collection_id)
    countofdataset = len(collection_object.dataset_id_list)
    print(f'Collection {collection_id} contains {countofdataset} no(s) of dataset.')

    combinedpdf = None

    if chk == 'Yes':
        ### Confirm which datasets are missing
        # Current list of files in the working directory.
        listindir = []
        for root, dirs, files in os.walk(os.getcwd()):
            for file in files:
                file_path = os.path.relpath(os.path.join(root, file), os.getcwd())
                listindir.append(file_path.replace('\\', '/'))

        # Check for missing datasets.
        missingdataset = []
        for x in range(countofdataset):
            # Get the expected file paths for this dataset.
            d_id = collection_object.dataset_id_list[x]
            d_obj = Dataset(d_id, 'No', 'No', csvdir)
            if csvdir is None:
                filepath = f"{d_obj.last_updated} {d_obj.dataset_name}.csv"
                jsonfp = f"{d_obj.last_updated} {d_obj.dataset_name}_dtype.json"
            else:
                filepath = f"{csvdir}/{d_obj.last_updated} {d_obj.dataset_name}.csv"
                jsonfp = f"{csvdir}/{d_obj.last_updated} {d_obj.dataset_name}_dtype.json"

            if filepath in listindir and jsonfp in listindir:
                dtypedict = json_to_dtype(jsonfp)
                if combinedpdf is None:
                    combinedpdf = pd.read_csv(filepath, dtype=dtypedict)
                else:
                    pdf_not_missing = pd.read_csv(filepath, dtype=dtypedict)
                    combinedpdf = pd.concat([combinedpdf, pdf_not_missing],
                                            axis=0, ignore_index=True)
            else:
                missingdataset.append(d_id)
            print(f'Checked {x+1} out of {countofdataset} datasets.')

        if len(missingdataset) == 0:
            print('All files were downloaded previously.')
        else:
            print(f'{len(missingdataset)} out of {countofdataset} datasets missing.')
            for x in range(len(missingdataset)):
                print(f'Start of Missing Dataset {x+1} out of {len(missingdataset)}')
                d_id = missingdataset[x]

                if indcsv == 'Yes':
                    d_obj = Dataset(d_id, 'Yes', 'Yes', csvdir)
                elif indcsv == 'No':
                    d_obj = Dataset(d_id, 'Yes', 'No')

                if combinedpdf is None:
                    combinedpdf = d_obj.dataframe_data
                else:
                    combinedpdf = pd.concat([combinedpdf, d_obj.dataframe_data],
                                            axis=0, ignore_index=True)
                print(f'End of Missing Dataset {x+1} out of {len(missingdataset)}')

    elif chk == 'No':

        for x in range(countofdataset):
            print(f'Start of Dataset {x+1} out of {countofdataset}')
            dataset_id = collection_object.dataset_id_list[x]

            if indcsv == 'Yes':
                dataset_obj = Dataset(dataset_id, 'Yes', 'Yes', csvdir)
            elif indcsv == 'No':
                dataset_obj = Dataset(dataset_id, 'Yes', 'No')

            if x == 0:
                combinedpdf = dataset_obj.dataframe_data
            else:
                combinedpdf = pd.concat([combinedpdf, dataset_obj.dataframe_data],
                                        axis=0, ignore_index=True)
            print(f'End of Dataset {x+1} out of {countofdataset}')

    if combcsv == 'Yes':
        if csvdir is None:
            combcsvfilepath = f"Combined {collection_object.last_updated} {collection_object.collection_name}.csv"
        else:
            combcsvfilepath = f"{csvdir}/Combined {collection_object.last_updated} {collection_object.collection_name}.csv"

        combinedpdf.to_csv(combcsvfilepath, index=False)
        print(f'{combcsvfilepath} downloaded')

    return combinedpdf

The output shows that the pandas.DataFrame now loads without any dtype warnings showing up.

The full code can be found in the GitHub repo below.

We will use this extraction code to extract some info and do some further analysis in another post.
