Extraction of Collections and Datasets Using data.gov.sg API

Sze Zhong LIM · Published in Data And Beyond · Mar 2, 2024
A snapshot of Data.gov.sg

According to their website:

Data.gov.sg is Singapore’s one-stop portal for publicly available datasets from various public agencies. Users are able to explore national datasets, download raw data for research analysis, and access APIs for web development.

Features include:

High quality data presented via dashboards for quick reference

Shareable charts and tables of datasets

Data blogs of social trends and government policies explained using data analysis and visualisations.

data.gov.sg is not a new concept. A few years back, I used the Department of Statistics Singapore (SingStat) website to search for data. However, a lot has changed since then. I now see more processed and analyzed data on SingStat, whereas data.gov.sg has more raw data (in my opinion).

A friend of mine asked me to look for insights on companies in Singapore, and I decided to use data.gov.sg as my data source. One thing I noticed is that they have documentation on how to view and extract from their collections and datasets. Basically, a collection is a compilation of related datasets of the same kind.

Let’s head right into the extraction of data from this portal.

EXTRACTION OF INFORMATION

We can extract the information via the browser / links, or via the API.

via Browser / Links

I wanted to find data on companies in Singapore, so my first stop was ACRA. Upon typing ‘ACRA’ into the search bar, we can see the collections available.

I chose the first link, “ACRA Information on Corporate Entities”. Upon entering the link, on the left you can see many files, which are all the datasets available for this collection. In the middle, you can see the sample data for the selected dataset.

Just click on “Download files” and you will be able to choose which datasets to download. You may also click the “Select all” option to download the whole collection.

The download results in multiple csv files being downloaded instead of a single zip file. Since I prefer a single download, I would suggest the devs add an option to zip the files up and rename the zip file.

But what if we wanted to create dashboards and automatically update the data over time programmatically? It would be troublesome to write a script and scrape the site, especially if changes are made to the site interface over time.

Luckily, the devs provide two types of APIs for our use. The first type is real-time APIs, and the other is Collection and Dataset APIs. Our scope here covers only the Collection and Dataset APIs.

via Collection and Dataset APIs

data.gov.sg provides an API guide on their site.

Before using the API, we need two items: the collection_id and the dataset_id. How to identify these two is clearly explained in the guide.
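As a quick sanity check, the two metadata endpoints (the same ones listed in the config file later in this post) can be queried directly with requests. This is a minimal sketch, assuming collection_id 1 and the BCA dataset_id that appear later in the config; substitute your own ids.

import requests

# Minimal sketch: query the collection and dataset metadata endpoints directly.
# The ids below are the ones used later in this post; swap in your own.
collection_id = '1'
dataset_id = 'd_dcda79be4aded5f9e769b8e23ff69b47'

col_meta = requests.get(
    f'https://api-production.data.gov.sg/v2/public/api/collections/{collection_id}/metadata'
).json()
ds_meta = requests.get(
    f'https://api-production.data.gov.sg/v2/public/api/datasets/{dataset_id}/metadata'
).json()

print(col_meta['data']['collectionMetadata']['name'])
print(ds_meta['data']['name'])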

The APIs are simple and easy to use, and provide basic building blocks for customizing what you want. I decided to use these building blocks to create a downloader for a whole collection. This is more convenient than manually downloading csv files and converting them; instead, we can use the APIs to create a DataFrame directly and save the data as a csv backup.

WRAPPER FUNCTION TO DOWNLOAD

I created a wrapper function with the intent to achieve the below objectives:

  1. Download the information purely from the collection id, in a simple manner.
  2. Save the data into a csv file as a backup.
  3. Return the data as a pandas.DataFrame to be used for analysis / fed into a pipeline.
  4. Skip downloading datasets that have not been updated on the site and reuse the previously saved csv file (if it exists). The main purpose is to save bandwidth and resources.

Let’s start by importing all the relevant packages. I also created a TypeVar for pandas.DataFrame.

import pandas as pd
import requests
import os
from typing import Optional, List, Dict, Tuple, TypeVar
import src.config as CFG # Config file placed in src folder.

pdDF = TypeVar('pandas.core.frame.DataFrame')

We then create the class for Collections. By inputting the collection_id, it will initialize and fetch the collection name, the last updated date for the collection, and the list of dataset ids.

class Collection:
    def __init__(self, collection_id: str) -> None:
        '''
        Parameters
        ----------
        collection_id : str
            collection id from data.gov.sg
        '''
        self.collection_id = str(collection_id)
        self.col_url_info = CFG.api['collection'] + f'/{self.collection_id}/' + \
            CFG.api['meta']

        self.collection_name = None
        self.last_updated = None
        self.dataset_id_list = None

        self.collection_info()

    def collection_info(self) -> Tuple[str, str, List[str]]:
        '''
        Returns
        -------
        Tuple[str, str, List[str]]
            returns the collection name, last updated date, and a list of dataset ids
            within the collection.
        '''
        json_content = requests.get(self.col_url_info).json()
        content = json_content['data']['collectionMetadata']

        self.collection_name = content['name']
        self.last_updated = content['lastUpdatedAt'][:10]
        self.dataset_id_list = content['childDatasets']

        return (self.collection_name, self.last_updated, self.dataset_id_list)
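For example (a hypothetical quick check, using collection 1 from the ACRA entry in the config file later in this post), the class can be used like this:

# Hypothetical quick check of the Collection class (collection 1 = ACRA, per the config below).
col = Collection('1')
print(col.collection_name)        # collection name
print(col.last_updated)           # yyyy-mm-dd slice of lastUpdatedAt
print(len(col.dataset_id_list))   # number of datasets in the collection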

We then create the class for Datasets. This class takes more inputs, which give us the flexibility to create a pandas DataFrame or a csv backup file right when the class initializes.

class Dataset:
    def __init__(self, dataset_id: str, pdf: str = 'No', csv: str = 'No',
                 csvdir: Optional[str] = None):
        '''
        Parameters
        ----------
        dataset_id : str
            dataset id found in the url
        pdf : str
            Flag to generate a pandas DataFrame. Use 'Yes' to generate.
        csv : str
            Flag to generate a csv of the whole dataset. Use 'Yes' to generate.
        csvdir : str, optional
            csv directory name. The filename will be the date and dataset name.

        Returns
        -------
        None.
        '''
        self.dataset_id = dataset_id
        self.url_download = CFG.api['dataset_dl'] + f'{self.dataset_id}'
        self.url_info = CFG.api['dataset'] + f'/{self.dataset_id}/' + CFG.api['meta']

        self.dataset_name = None
        self.last_updated = None
        self.col_info = None
        self.dataframe_data = None
        self.total_results = 0
        self.csv_final_filepath = None

        self.dataset_info()

        if pdf == 'Yes':
            self.dataset_download_pdf()

        if csv == 'Yes':
            self.dataset_download_csv(directory=csvdir)

    def dataset_info(self) -> Tuple[str, str, Dict[str, List[str]]]:
        '''
        Returns
        -------
        Tuple[str, str, Dict[str, List[str]]]
            returns the dataset name, last updated date, and a dict of column info.
            The dict contains col_mapping_id, col_name, col_datatype, col_description
            and col_index as the keys.
        '''
        json_content = requests.get(self.url_info).json()
        content = json_content['data']
        content_inner = json_content['data']['columnMetadata']

        dataset_name = content['name']
        last_updated = content['lastUpdatedAt'][:10]

        col_mapping_id = content_inner['order']  # returns a list
        col_name = []
        col_datatype = []
        col_description = []
        col_index = []

        for x in col_mapping_id:
            col_attr = content_inner['metaMapping'][x]
            try:
                col_name.append(col_attr['name'])
            except KeyError:
                print("No column name attribute")
            try:
                col_datatype.append(col_attr['dataType'])
            except KeyError:
                print('No dataType attribute')
            try:
                col_index.append(col_attr['index'])
            except KeyError:
                print('No index attribute')

        col_info = {'col_mapping_id': col_mapping_id,
                    'col_name': col_name,
                    'col_datatype': col_datatype,
                    'col_description': col_description,
                    'col_index': col_index}

        self.dataset_name = dataset_name
        self.last_updated = last_updated
        self.col_info = col_info

        return (self.dataset_name, self.last_updated, self.col_info)
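As a quick illustration (hypothetical, using the BCA contractors dataset id from the config file shown later), initializing the class already fetches the metadata:

# Hypothetical quick check of Dataset metadata (dataset id taken from the config below).
ds = Dataset('d_dcda79be4aded5f9e769b8e23ff69b47')
print(ds.dataset_name)
print(ds.last_updated)
print(ds.col_info['col_name'])   # list of column names from columnMetadata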

I have also included the code to download the data into a pandas DataFrame or a csv file, as below. The code extracts the total number of rows from the API, then decides how many loops are needed to download everything. I set the limit to 1000 rows per request because when I tried larger numbers, the download failed. To be on the safe side, I settled on a limit of 1000. The code below is still part of the Dataset class.

    def dataset_download_csv(self, directory=None) -> pdDF:
        '''
        Parameters
        ----------
        directory : str, optional
            folder / directory where the csv file will be saved to.

        Returns
        -------
        pdf : pandas DataFrame
            Also produces a csv file in the directory, named with the
            last updated date of the dataset together with the dataset name.
        '''
        if self.dataset_name is None or self.last_updated is None:
            self.dataset_info()

        if directory is None:
            filepath = f"{self.last_updated} {self.dataset_name}.csv"
        else:
            filepath = f"{directory}/{self.last_updated} {self.dataset_name}.csv"

        self.csv_final_filepath = filepath

        if self.dataframe_data is None:
            self.dataframe_data = self.dataset_download_pdf()

        self.dataframe_data.to_csv(self.csv_final_filepath, index=False)
        print(f'{self.csv_final_filepath} downloaded')

        return self.dataframe_data

    def dataset_download_pdf(self) -> pdDF:
        '''
        Returns
        -------
        pdDF
            pd.DataFrame that contains the whole dataset.
        '''
        print(f"Downloading from {self.url_download}")
        ini_json_content = requests.get(self.url_download).json()

        self.total_results = ini_json_content['result']['total']
        print(f'There are a total of {self.total_results} rows in this dataset')

        offset_start, offset_end, limit = self.dataset_download_counter()
        for x in range(offset_start, offset_end, limit):
            if x == 0:
                offset_limit_str = f"&limit={limit}"
            else:
                offset_limit_str = f"&offset={x}&limit={limit}"
            loop_url = self.url_download + offset_limit_str
            loop_json_content = requests.get(loop_url).json()
            loop_record = loop_json_content['result']['records']
            loop_record_df = pd.DataFrame(loop_record)
            if x == offset_start:
                overall_loop_record_df = loop_record_df
            else:
                overall_loop_record_df = pd.concat([overall_loop_record_df, loop_record_df],
                                                   axis=0, ignore_index=True)
            if x > (offset_end - limit):
                print(f"Processed {self.total_results} / {self.total_results} rows.")
            else:
                print(f"Processed {x + limit} / {self.total_results} rows.")

        self.dataframe_data = overall_loop_record_df

        return self.dataframe_data

    def dataset_download_counter(self) -> Tuple[int, int, int]:
        '''
        Returns
        -------
        Tuple[int, int, int]
            A (start, stop, step) range to feed the for loop in dataset_download_pdf
            that downloads the whole dataset.
        '''
        # Cap each request at 1000 rows; larger limits proved unreliable.
        limit = 1000
        offset_end = int(self.total_results) + 1

        return (0, offset_end, limit)
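To make the offset / limit logic concrete, here is a small sketch (for illustration only, assuming a dataset with 2,500 rows) of the paginated URLs that dataset_download_pdf ends up requesting:

# Illustration only: the paginated request URLs for an assumed dataset of 2,500 rows.
base = ('https://data.gov.sg/api/action/datastore_search'
        '?resource_id=d_dcda79be4aded5f9e769b8e23ff69b47')
total_results = 2500
limit = 1000

for offset in range(0, total_results + 1, limit):
    if offset == 0:
        print(base + f"&limit={limit}")
    else:
        print(base + f"&offset={offset}&limit={limit}")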

Now that we have both the Collection class and Dataset class, we can create a wrapper function that will:

  1. Check if the file has been downloaded previously. If it has, we can choose to reuse the old files, or re-download and overwrite.
  2. Optionally create a combined dataset of the whole collection, in both pandas DataFrame and csv format (the file will be very big).

def download_collection(collection_id: str, chk: str = 'No', combcsv: str = 'No',
                        indcsv: str = 'No', csvdir: Optional[str] = None) -> pdDF:
    '''
    Wrapper function to download the whole collection.

    Parameters
    ----------
    collection_id : str
        collection id from data.gov.sg
    chk : str
        Flag for whether to check csvdir for existing csv's which are the same.
        Options of 'Yes' or 'No'.
        If 'Yes', won't download a csv with the same last updated date.
        If 'No', won't check and will just download, overwriting any existing file.
    combcsv : str, optional
        Flag for whether to combine the whole collection into one csv.
    indcsv : str, optional
        Flag for whether to download individual datasets in the collection.
    csvdir : Optional[str], optional
        The directory where csv's are stored.
        This directory is checked for existing files; if a file is missing,
        the dataset is re-downloaded.

    Returns
    -------
    pdDF
        A combined pandas DataFrame.
    '''
    collection_object = Collection(collection_id)
    countofdataset = len(collection_object.dataset_id_list)
    print(f'Collection {collection_id} contains {countofdataset} no(s) of dataset.')

    combinedpdf = None

    if chk == 'Yes':
        ### Confirm which datasets are missing
        # Current list of files in the working directory.
        listindir = []
        for root, dirs, files in os.walk(os.getcwd()):
            for file in files:
                file_path = os.path.relpath(os.path.join(root, file), os.getcwd())
                listindir.append(file_path.replace('\\', '/'))

        # Check for missing datasets.
        missingdataset = []
        for x in range(countofdataset):
            # Get the expected file path for this dataset.
            d_id = collection_object.dataset_id_list[x]
            d_obj = Dataset(d_id, 'No', 'No', csvdir)
            if csvdir is None:
                filepath = f"{d_obj.last_updated} {d_obj.dataset_name}.csv"
            else:
                filepath = f"{csvdir}/{d_obj.last_updated} {d_obj.dataset_name}.csv"

            if filepath in listindir:
                if combinedpdf is None:
                    combinedpdf = pd.read_csv(filepath)
                else:
                    pdf_not_missing = pd.read_csv(filepath)
                    combinedpdf = pd.concat([combinedpdf, pdf_not_missing],
                                            axis=0, ignore_index=True)
            else:
                missingdataset.append(d_id)
            print(f'Checked {x+1} out of {countofdataset} datasets.')

        if len(missingdataset) == 0:
            print('All files were downloaded previously.')
        else:
            print(f'{len(missingdataset)} out of {countofdataset} datasets missing.')
            for x in range(len(missingdataset)):
                print(f'Start of Missing Dataset {x+1} out of {len(missingdataset)}')
                d_id = missingdataset[x]

                if indcsv == 'Yes':
                    d_obj = Dataset(d_id, 'Yes', 'Yes', csvdir)
                elif indcsv == 'No':
                    d_obj = Dataset(d_id, 'Yes', 'No')

                if combinedpdf is None:
                    combinedpdf = d_obj.dataframe_data
                else:
                    combinedpdf = pd.concat([combinedpdf, d_obj.dataframe_data],
                                            axis=0, ignore_index=True)
                print(f'End of Missing Dataset {x+1} out of {len(missingdataset)}')

    elif chk == 'No':

        for x in range(countofdataset):
            print(f'Start of Dataset {x+1} out of {countofdataset}')
            dataset_id = collection_object.dataset_id_list[x]

            if indcsv == 'Yes':
                dataset_obj = Dataset(dataset_id, 'Yes', 'Yes', csvdir)
            elif indcsv == 'No':
                dataset_obj = Dataset(dataset_id, 'Yes', 'No')

            if x == 0:
                combinedpdf = dataset_obj.dataframe_data
            else:
                combinedpdf = pd.concat([combinedpdf, dataset_obj.dataframe_data],
                                        axis=0, ignore_index=True)
            print(f'End of Dataset {x+1} out of {countofdataset}')

    if combcsv == 'Yes':
        if csvdir is None:
            combcsvfilepath = f"Combined {collection_object.last_updated} {collection_object.collection_name}.csv"
        else:
            combcsvfilepath = f"{csvdir}/Combined {collection_object.last_updated} {collection_object.collection_name}.csv"

        combinedpdf.to_csv(combcsvfilepath, index=False)
        print(f'{combcsvfilepath} downloaded')

    return combinedpdf

The above code is placed in a folder called src, with the filename api_data_gov.py (i.e. src/api_data_gov.py).

I decided to keep the config file separate so that it is easy to change if the API links are modified. The code below is placed in the same src folder, with the filename config.py (i.e. src/config.py).

'''
###### Sample Links ######
collection_list = 'https://api-production.data.gov.sg/v2/public/api/collections?page=2'
collection_metadata = 'https://api-production.data.gov.sg/v2/public/api/collections/1/metadata'
dataset_list = 'https://api-production.data.gov.sg/v2/public/api/datasets?page=2'
dataset_metadata = 'https://api-production.data.gov.sg/v2/public/api/datasets/d_dcda79be4aded5f9e769b8e23ff69b47/metadata'
dataset_info = 'https://data.gov.sg/api/action/datastore_search?resource_id=d_dcda79be4aded5f9e769b8e23ff69b47'
'''

###### Datasets ######
datasets = {
    'acra_ent': 'datasets/d_3f960c10fed6145404ca7b821f263b87',
    'acra_reg_ent': 'collections/1',
    'bca_con': 'datasets/d_dcda79be4aded5f9e769b8e23ff69b47',
    'hdb_con': 'datasets/d_9973d2c119ed4dd1560aebf8f0829b86'
}

###### APIs ######
api = {
    'collection': 'https://api-production.data.gov.sg/v2/public/api/collections',
    'dataset': 'https://api-production.data.gov.sg/v2/public/api/datasets',
    'meta': 'metadata',
    'dataset_dl': 'https://data.gov.sg/api/action/datastore_search?resource_id=',
    'dataset_q_limit': '&limit=',
    'dataset_q_offset': '&offset='
}
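As a small sanity check (a sketch only), the pieces in CFG.api combine back into the sample links commented at the top of the config file:

# Sketch: reconstructing the sample metadata and download URLs from the api dict.
import src.config as CFG

print(CFG.api['collection'] + '/1/' + CFG.api['meta'])
# -> https://api-production.data.gov.sg/v2/public/api/collections/1/metadata

print(CFG.api['dataset_dl'] + 'd_dcda79be4aded5f9e769b8e23ff69b47')
# -> https://data.gov.sg/api/action/datastore_search?resource_id=d_dcda79be4aded5f9e769b8e23ff69b47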

To run the code, I created a main.py file in the same directory as the src folder. In this particular run, I download collection_id 2, check whether the existing files are already there, create a combined csv file, and download the individual csv files where necessary. The data files are stored in a directory called data.

from src.api_data_gov import download_collection

col_id = 2
chk = 'Yes'
combcsv = 'Yes'
indcsv = 'Yes'
csvdir = 'data'
df_combined = download_collection(col_id, chk, combcsv, indcsv, csvdir)
print(df_combined)

WRAPPER FUNCTION OUTPUT AND ISSUES

The output of my code when there are existing csv files and we do not need to re-download is as below:

Output (at the start) from running my code.
Output (at the end) from running my code

One of the problems is the dtype warning that appears when reading the csv's back into pandas. If the use case is more customized, I suggest modifying the code slightly to include the dtype when reading the csv. One method is to extract the dtypes from the pandas DataFrame when the data is first converted, and save them for later use.
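A minimal sketch of that idea (df and 'backup.csv' are placeholders for illustration; the next section wires this into the classes properly):

# Minimal sketch: capture dtypes from the API-built DataFrame, reuse them when reading the csv back.
dtype_dict = df.dtypes.apply(str).to_dict()
df_reloaded = pd.read_csv('backup.csv', dtype=dtype_dict)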

If we want to re-download the whole collection, we just have to change the chk parameter to ‘No’.

Output (at the start) from running my code
Output (in the middle) from running my code
Output (at the end) from running my code

As the collection is huge, it will take roughly 20 mins to download all the data.

SOLVING THE DTYPE ISSUE

To solve the dtype issue when loading the csv files into a pandas.DataFrame, the dtype dictionary is saved into a json file and downloaded together with each csv file.

When loading, the code will directly load the csv files with the pre-determined dtypes. The code below should be added to / modified in the Dataset class. Basically, I have added the json_dtype_filepath attribute and a function that saves the dtype dict to a json file (remember to also add import json to the imports at the top of src/api_data_gov.py).

class Dataset:
    def __init__(self, dataset_id: str, pdf: str = 'No', csv: str = 'No',
                 csvdir: Optional[str] = None):
        '''
        Parameters
        ----------
        dataset_id : str
            dataset id found in the url
        pdf : str
            Flag to generate a pandas DataFrame. Use 'Yes' to generate.
        csv : str
            Flag to generate a csv of the whole dataset. Use 'Yes' to generate.
        csvdir : str, optional
            csv directory name. The filename will be the date and dataset name.

        Returns
        -------
        None.
        '''
        self.dataset_id = dataset_id
        self.url_download = CFG.api['dataset_dl'] + f'{self.dataset_id}'
        self.url_info = CFG.api['dataset'] + f'/{self.dataset_id}/' + CFG.api['meta']

        self.dataset_name = None
        self.last_updated = None
        self.col_info = None
        self.dataframe_data = None
        self.total_results = 0
        self.csv_final_filepath = None
        self.json_dtype_filepath = None

        self.dataset_info()

        if pdf == 'Yes':
            self.dataset_download_pdf()

        if csv == 'Yes':
            self.dataset_download_csv(directory=csvdir)

    def dtype_to_json(self, pdf: pdDF) -> dict:
        '''
        Parameters
        ----------
        pdf : pandas.DataFrame
            pandas.DataFrame from which we extract the dtypes.

        Returns
        -------
        Dict
            The dtype dictionary used.

        Creates a json file which stores the pandas dtype dictionary for
        use when converting back from csv to pandas.DataFrame.
        '''
        dtype_dict = pdf.dtypes.apply(lambda x: str(x)).to_dict()

        with open(self.json_dtype_filepath, 'w') as json_file:
            json.dump(dtype_dict, json_file)

        return dtype_dict

    def dataset_download_csv(self, directory=None) -> pdDF:
        '''
        Parameters
        ----------
        directory : str, optional
            folder / directory where the csv file will be saved to.

        Returns
        -------
        pdf : pandas DataFrame
            Also produces a csv file and a dtype json file in the directory,
            named with the last updated date of the dataset together with the
            dataset name.
        '''
        if self.dataset_name is None or self.last_updated is None:
            self.dataset_info()

        if directory is None:
            filepath = f"{self.last_updated} {self.dataset_name}.csv"
            jsonfp = f"{self.last_updated} {self.dataset_name}_dtype.json"
        else:
            filepath = f"{directory}/{self.last_updated} {self.dataset_name}.csv"
            jsonfp = f"{directory}/{self.last_updated} {self.dataset_name}_dtype.json"

        self.csv_final_filepath = filepath
        self.json_dtype_filepath = jsonfp

        if self.dataframe_data is None:
            self.dataframe_data = self.dataset_download_pdf()

        self.dataframe_data.to_csv(self.csv_final_filepath, index=False)

        # Save the dtypes to json.
        self.dtype_to_json(self.dataframe_data)

        print(f'{self.csv_final_filepath} downloaded')
        print(f'{self.json_dtype_filepath} created')

        return self.dataframe_data

Similarly, the wrapper function needs to be modified so that if the json file is missing, it will simply re-download the whole dataset. Only if both the dataset csv AND the dtype dict json file are present will it skip the download process.
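The wrapper below calls a json_to_dtype helper that is not shown in this post; a minimal sketch of it (placed in the same src/api_data_gov.py module) could look like this:

import json

def json_to_dtype(jsonfp: str) -> dict:
    '''Minimal sketch of the helper used below: load the dtype dictionary
    that Dataset.dtype_to_json saved to a json file.'''
    with open(jsonfp, 'r') as json_file:
        return json.load(json_file)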

def download_collection(collection_id: str, chk: str = 'No', combcsv: str = 'No',
                        indcsv: str = 'No', csvdir: Optional[str] = None) -> pdDF:
    '''
    Wrapper function to download the whole collection.

    Parameters
    ----------
    collection_id : str
        collection id from data.gov.sg
    chk : str
        Flag for whether to check csvdir for existing csv's which are the same.
        Options of 'Yes' or 'No'.
        If 'Yes', won't download a csv with the same last updated date.
        If 'No', won't check and will just download, overwriting any existing file.
    combcsv : str, optional
        Flag for whether to combine the whole collection into one csv.
    indcsv : str, optional
        Flag for whether to download individual datasets in the collection.
    csvdir : Optional[str], optional
        The directory where csv's are stored.
        This directory is checked for existing files; if a file is missing,
        the dataset is re-downloaded.

    Returns
    -------
    pdDF
        A combined pandas DataFrame.
    '''
    collection_object = Collection(collection_id)
    countofdataset = len(collection_object.dataset_id_list)
    print(f'Collection {collection_id} contains {countofdataset} no(s) of dataset.')

    combinedpdf = None

    if chk == 'Yes':
        ### Confirm which datasets are missing
        # Current list of files in the working directory.
        listindir = []
        for root, dirs, files in os.walk(os.getcwd()):
            for file in files:
                file_path = os.path.relpath(os.path.join(root, file), os.getcwd())
                listindir.append(file_path.replace('\\', '/'))

        # Check for missing datasets.
        missingdataset = []
        for x in range(countofdataset):
            # Get the expected file paths for this dataset.
            d_id = collection_object.dataset_id_list[x]
            d_obj = Dataset(d_id, 'No', 'No', csvdir)
            if csvdir is None:
                filepath = f"{d_obj.last_updated} {d_obj.dataset_name}.csv"
                jsonfp = f"{d_obj.last_updated} {d_obj.dataset_name}_dtype.json"
            else:
                filepath = f"{csvdir}/{d_obj.last_updated} {d_obj.dataset_name}.csv"
                jsonfp = f"{csvdir}/{d_obj.last_updated} {d_obj.dataset_name}_dtype.json"

            if filepath in listindir and jsonfp in listindir:
                dtypedict = json_to_dtype(jsonfp)
                if combinedpdf is None:
                    combinedpdf = pd.read_csv(filepath, dtype=dtypedict)
                else:
                    pdf_not_missing = pd.read_csv(filepath, dtype=dtypedict)
                    combinedpdf = pd.concat([combinedpdf, pdf_not_missing],
                                            axis=0, ignore_index=True)
            else:
                missingdataset.append(d_id)
            print(f'Checked {x+1} out of {countofdataset} datasets.')

        if len(missingdataset) == 0:
            print('All files were downloaded previously.')
        else:
            print(f'{len(missingdataset)} out of {countofdataset} datasets missing.')
            for x in range(len(missingdataset)):
                print(f'Start of Missing Dataset {x+1} out of {len(missingdataset)}')
                d_id = missingdataset[x]

                if indcsv == 'Yes':
                    d_obj = Dataset(d_id, 'Yes', 'Yes', csvdir)
                elif indcsv == 'No':
                    d_obj = Dataset(d_id, 'Yes', 'No')

                if combinedpdf is None:
                    combinedpdf = d_obj.dataframe_data
                else:
                    combinedpdf = pd.concat([combinedpdf, d_obj.dataframe_data],
                                            axis=0, ignore_index=True)
                print(f'End of Missing Dataset {x+1} out of {len(missingdataset)}')

    elif chk == 'No':

        for x in range(countofdataset):
            print(f'Start of Dataset {x+1} out of {countofdataset}')
            dataset_id = collection_object.dataset_id_list[x]

            if indcsv == 'Yes':
                dataset_obj = Dataset(dataset_id, 'Yes', 'Yes', csvdir)
            elif indcsv == 'No':
                dataset_obj = Dataset(dataset_id, 'Yes', 'No')

            if x == 0:
                combinedpdf = dataset_obj.dataframe_data
            else:
                combinedpdf = pd.concat([combinedpdf, dataset_obj.dataframe_data],
                                        axis=0, ignore_index=True)
            print(f'End of Dataset {x+1} out of {countofdataset}')

    if combcsv == 'Yes':
        if csvdir is None:
            combcsvfilepath = f"Combined {collection_object.last_updated} {collection_object.collection_name}.csv"
        else:
            combcsvfilepath = f"{csvdir}/Combined {collection_object.last_updated} {collection_object.collection_name}.csv"

        combinedpdf.to_csv(combcsvfilepath, index=False)
        print(f'{combcsvfilepath} downloaded')

    return combinedpdf

The output shows that the pandas.DataFrame now loads without any dtype warnings showing up.

The full code can be found in the GitHub repo below.

We will use this extraction code to extract some info and do some further analysis in another post.
