Tool-Ally: Accelerating Data Science Development with your Trusted Ally

Josephlyr
Published in d*classified
8 min read · Jul 19, 2023

We introduce Tool-Ally, the Data Science Toolkit developed by the Data Science team from the Enterprise Digital Services (EDS) Programme Centre in collaboration with DSTA’s Distinguished Professor (Data Science) (hereinafter “DP(DS)”) Anthony Tung.

Tool-Ally was designed specifically for our internal teams to enhance their productivity in data science development and address challenges arising in the data science lifecycle. In addition to incorporating techniques from DP(DS)’s research, Tool-Ally offers a set of customizable and reusable components that streamline and automate various aspects of the data science lifecycle, helping data scientists perform common tasks more efficiently and overcome common challenges. Beyond this article, we will publish related articles that delve deeper into the solutions provided by Tool-Ally and explore their functionalities and benefits in detail.

Source: EAP Toolkit

Introduction

The data science development life cycle (see Figure 1) encompasses a dynamic and multifaceted journey, spanning various stages from data collection and preparation to exploration, modeling, and result interpretation. However, data scientists often find themselves mired in repetitive and time-consuming tasks, such as data cleaning and the laborious creation of charts for effective data exploration. These challenges can hinder productivity and impede the discovery of valuable insights gained from data analysis. Moreover, complex tasks like model explainability require advanced interpretability techniques to provide clear explanations for the predictions made by machine learning models.

During the development of Tool-Ally, we focused on the following objectives:

Objective 1: Increase Productivity and Standardize Development

One of the core objectives of Tool-Ally is to boost the productivity of data science development by providing standardized tools and techniques. By streamlining common data science tasks such as data cleaning, data exploration, and error analysis, Tool-Ally reduces repetitive effort and allows data scientists to focus on the core analysis and interpretation of data. Additionally, the inclusion of data science code templates expedites the development process, enabling data scientists to apply commonly used algorithms and methodologies efficiently.

Objective 2: Address Complex Data Science Challenges

In the ever-evolving field of data science, our team understands the complexity of the challenges that arise and strives to provide innovative solutions that address them. Tool-Ally aims to address common issues faced by data scientists, such as the lack of labelled data for training ML models, uncovering intrinsic patterns in data, preserving data privacy, and explaining models to enhance understanding of the inner workings of complex models. We scanned existing open-source algorithms, including research projects from academia, and incorporated selected algorithms into Tool-Ally. In collaboration with DP(DS), we integrated popular techniques such as SHAP and LIME for model explainability, and complemented them with a set of custom metrics to evaluate the results of the various techniques.

Objective 3: Accelerate Efficiency and Reusability through Modular Code Packaging

Tool-Ally is envisaged to encompass a comprehensive set of functionalities within a single package, including model explainability, data profiling, and synthetic data generation. By incorporating these functionalities within a single package, Tool-Ally allows data scientists to adapt and integrate the capabilities into their projects with little effort, enhancing efficiency by reusing existing code components across different projects. The streamlined implementation, facilitated by concise one-line calls, enables quick and efficient reuse of the toolkit’s functionalities. Furthermore, our strong emphasis on code reusability fosters collaboration and knowledge sharing, and contributes to a more effective and efficient workflow.

Developing Tool-Ally: Addressing Data Scientists’ Challenges

Drawing upon the experience and challenges faced by our team of data scientists in their daily work, we have developed a comprehensive array of solutions tailored to address challenges that span across the data science life cycle.

Let me guide you through some of the challenges that were shared by our team of data scientists and the first tranche of tools that we have developed to address these challenges!

Figure 1: Illustration of the data science development life cycle.

Addressing Challenge #1: Preserving Data Privacy in Synthetic Data Generation

With growing concerns about data privacy, synthetic data offers a way to share or analyze data without revealing sensitive information. Besides privacy protection, synthetic data are also used to augment existing datasets, especially in scenarios where acquiring new real data is expensive or time-consuming. This helps increase the size and diversity of training data for improved performance and generalization of machine learning models.

However, generating synthetic data, whether for sharing securely masked data or for enhancing machine learning model training, is challenging. The difficulty lies in balancing (1) the preservation of the dataset distribution and (2) the preservation of privacy.

Tool-Ally provides a comprehensive suite of synthetic data generation modules that incorporate data masking techniques and deep generative models such as Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs). The data masking module offers efficient column masking for secure data sharing, with the flexibility to retain or modify the original dataset distribution. The data synthesis module generates diverse synthetic data that accurately reflects real-world diversity, facilitating extensive model training, validation, and exploration without being limited by a lack of actual data or by privacy concerns. With these modules, our data scientists can tackle data masking challenges and access a wide range of synthetic data to drive machine learning model development and testing.
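To make the column masking idea concrete, here is a minimal sketch of two common masking strategies, using only the Python standard library. The helper names (`pseudonymize`, `shuffle_column`) and the sample records are hypothetical and are not Tool-Ally's API. Hashing replaces identifiers with consistent tokens, while shuffling keeps a column's overall distribution but breaks the link between a value and its row.

```python
import hashlib
import random


def pseudonymize(values, salt):
    """Replace each value with a truncated salted SHA-256 digest.

    Equal inputs map to equal tokens, so joins and group-bys still work.
    """
    return [
        hashlib.sha256((salt + str(v)).encode("utf-8")).hexdigest()[:12]
        for v in values
    ]


def shuffle_column(values, seed=0):
    """Permute a column: the marginal distribution is retained,
    but values are no longer tied to their original rows."""
    rng = random.Random(seed)
    shuffled = list(values)
    rng.shuffle(shuffled)
    return shuffled


# Hypothetical sample data, for illustration only.
records = {
    "name": ["Alice", "Bob", "Alice", "Chandra"],
    "salary": [5200, 4100, 5200, 6900],
}

masked = {
    "name": pseudonymize(records["name"], salt="demo-salt"),
    "salary": shuffle_column(records["salary"], seed=42),
}
print(masked)
```

Note that salted hashing is pseudonymization rather than full anonymization: if the salt leaks and the set of possible values is small, tokens can be recovered by brute force.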

Synthetic Data Generator

Figure 2.1: Overview of a Generative Adversarial Network (GAN).
Figure 2.2: Overview of a Variational Autoencoder (VAE).

Addressing Challenge #2: Profiling Data

Data profiling plays a pivotal role in understanding the completeness and consistency of the data, which serves as a building block for downstream analysis. However, a challenge within this area of the lifecycle is the repetitive and time-consuming task of customizing the data preparation process (spanning data cleaning, preparation, and exploration) for every new dataset.

Tool-Ally offers a powerful data profiling tool that provides data scientists with comprehensive insights into their datasets. By automatically identifying column types and generating the corresponding plots for each column, this tool illuminates hidden patterns, identifies issues within the dataset, and even suggests data cleaning and processing steps. With this automated tool, data scientists can bid farewell to laborious manual inspection and embrace efficiency in their exploratory work.

The tool provides three levels of functionality, each serving a different aspect of data exploration. L1 enables users to quickly identify major issues or trends in the data by providing (1) an alert system that highlights potential issues in the data, and (2) summary statistics of the data. L2 supports users with automatic visualization of data patterns, trends, and relationships through various chart types. L3 takes data exploration to a deeper level by providing interactive capabilities, allowing users to interact with the data, apply filters, and drill down into specific subsets of the data.
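As an illustration of what an L1-style pass could look like, the sketch below infers a simple column type, computes summary statistics, and raises alerts for missing values and constant columns. It is a simplified stand-in written for this article's readers, not Tool-Ally's implementation: the function name `profile_column`, the 20% missing-value threshold, and the sample dataset are all assumptions.

```python
import statistics


def profile_column(name, values, missing_threshold=0.2):
    """Return (summary, alerts) for one column; None marks a missing value."""
    present = [v for v in values if v is not None]
    missing_ratio = 1 - len(present) / len(values)

    # Treat a column as numeric only if every present value is an int or float.
    numeric = bool(present) and all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )

    summary = {
        "column": name,
        "type": "numeric" if numeric else "categorical",
        "missing_ratio": round(missing_ratio, 3),
        "distinct": len(set(present)),
    }
    if numeric:
        summary["mean"] = statistics.mean(present)
        summary["min"] = min(present)
        summary["max"] = max(present)

    alerts = []
    if missing_ratio > missing_threshold:
        alerts.append(f"{name}: {missing_ratio:.0%} missing values")
    if present and len(set(present)) == 1:
        alerts.append(f"{name}: constant column")
    return summary, alerts


# Hypothetical dataset, stored column-wise.
dataset = {
    "age": [34, 45, None, 29, None],
    "country": ["SG", "SG", "SG", "SG", "SG"],
}

report = {col: profile_column(col, vals) for col, vals in dataset.items()}
for col, (summary, alerts) in report.items():
    print(summary, alerts)
```

In this example the `age` column triggers a missing-value alert and the `country` column is flagged as constant, which are the kinds of issues an alert system can surface before any charts are drawn.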

Data Profiling Tool

Figure 3: Overview of Tool-Ally’s Data Profiling Tool

Addressing Challenge #3: Data Exploration to Identify Patterns

Data exploration is a mandatory phase in every data science project, giving data scientists a better understanding of the characteristics of a dataset and the potential issues within it. A common technique to aid exploration is clustering, which groups similar data points together so that the structure of the data becomes easier to interpret. However, data exploration is challenging in itself, and choosing, tuning, and interpreting clustering techniques adds further complexity.

Tool-Ally aids data scientists in exploring and analyzing datasets through clustering. The tool offers a range of clustering methods such as K-means, Agglomerative, and Gaussian Mixture. In addition to automatically selecting the optimal clustering algorithm, the tool evaluates the performance of the clustering algorithms and provides a set of visualization tools with which data scientists can explore various ways to visualize their datasets and interpret the clustering outcomes. This empowers data scientists to gain valuable insights and make informed decisions based on the results of clustering analysis. By streamlining algorithm selection and enhancing the interpretability of clustering outcomes, the toolkit improves the efficiency and effectiveness of data exploration and analysis.
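One plausible way to automate algorithm selection is to fit each candidate and keep the one with the best internal validity score. The sketch below does this with scikit-learn and the silhouette score. This article does not state which selection criterion Tool-Ally uses, so the silhouette score, the function name `select_clustering`, and the synthetic data are assumptions made for illustration.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.mixture import GaussianMixture


def select_clustering(X, n_clusters, random_state=0):
    """Fit several clustering algorithms and return the best by silhouette score."""
    candidates = {
        "kmeans": KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state),
        "agglomerative": AgglomerativeClustering(n_clusters=n_clusters),
        "gaussian_mixture": GaussianMixture(
            n_components=n_clusters, random_state=random_state
        ),
    }
    scores, labelings = {}, {}
    for name, model in candidates.items():
        labels = model.fit_predict(X)
        labelings[name] = labels
        # Silhouette ranges from -1 to 1; higher means tighter, better-separated clusters.
        scores[name] = silhouette_score(X, labels)
    best = max(scores, key=scores.get)
    return best, scores, labelings[best]


# Synthetic, well-separated data purely for demonstration.
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (0, 5), (5, -5)],
    cluster_std=0.6,
    random_state=42,
)

best_name, scores, best_labels = select_clustering(X, n_clusters=3)
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:>17}: silhouette = {score:.3f}")
print("selected:", best_name)
```

The silhouette score tends to favour compact, convex clusters, so a single metric like this is best treated as a starting point and checked against visualizations of the resulting clusters.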

Clustering for Data Exploration Tool

Figure 4: Overview of Tool-Ally’s Data Exploration (Clustering) Tool.

Addressing Challenge #4: Model Explainability

While machine learning models gain stronger predictive performance as their complexity increases, they also become less interpretable, as their inner workings become progressively harder to comprehend. The lack of transparency behind their predictions leaves stakeholders with limited understanding of how specific decisions are made, resulting in a lack of confidence in the model’s predictions. Such concerns about the black box nature of complex models pose a challenge to the adoption of machine learning models in critical decision-making domains.

Our model explainability toolkit integrates popular model explainability techniques and simplifies their usage, while enhancing the interpretation and evaluation of explanation results by providing additional evaluation metrics. With just a few lines of code, data scientists can use the package to gain valuable insights into their models’ inner workings.
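For readers unfamiliar with what SHAP-style explanations compute, the sketch below calculates exact Shapley values for a toy model by averaging each feature's marginal contribution over every feature ordering. It uses neither the `shap` package nor Tool-Ally's wrappers. The function name `shapley_values`, the toy model, and the all-zero baseline are assumptions chosen to keep the arithmetic easy to follow.

```python
from itertools import permutations


def shapley_values(predict, instance, baseline):
    """Exact Shapley values of `predict` at `instance`, relative to `baseline`.

    A feature that is "absent" takes its baseline value. The cost grows
    factorially with the number of features, so this is only practical
    for a handful of features.
    """
    n = len(instance)
    totals = [0.0] * n
    orderings = list(permutations(range(n)))
    for order in orderings:
        current = list(baseline)
        previous_output = predict(current)
        for feature in order:
            # Switch this feature from its baseline value to its actual value
            # and credit it with the resulting change in the prediction.
            current[feature] = instance[feature]
            output = predict(current)
            totals[feature] += output - previous_output
            previous_output = output
    return [total / len(orderings) for total in totals]


def model(x):
    # Toy "black box": a linear term plus an interaction between features 1 and 2.
    return 2.0 * x[0] + x[1] * x[2]


instance = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]

phi = shapley_values(model, instance, baseline)
print(phi)  # → [2.0, 3.0, 3.0]
```

The attributions sum to the difference between the prediction for the instance (8.0) and the prediction for the baseline (0.0), and the interaction term is split evenly between the two features involved. Libraries such as SHAP exist because this exact enumeration becomes infeasible beyond a small number of features.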

Model Explainability Tool

Figure 5.1: Model explainability techniques. Left: Local Interpretable Model-agnostic Explanations (LIME). Right: SHapley Additive exPlanations (SHAP).
Figure 5.2: Local Explanation evAluation Framework (LEAF). LEAF provides a set of metrics for the evaluation of Local Linear Explanations. Reiteration Similarity measures the similarity of a set of explanations of a single instance. Local Concordance measures how well the explanation model g approximates the black box model f under the conciseness constraint. Local Fidelity measures the faithfulness of g in approximating the behaviour of the black box model f for the target sample x around its synthetic neighbourhood.

Conclusion

By striving to increase productivity, standardize development practices, and address complex data science challenges, Tool-Ally aims to empower data scientists and elevate their ability to derive meaningful insights from data. It serves as a comprehensive resource that provides the necessary tools, techniques, and solutions to drive impactful data-driven decision-making. With the Data Science Toolkit as their ally, our data scientists can better navigate the intricacies of data science development and unlock the full potential of data.

Read more about our work on Synthetic Data Generation and Model Explainability as we delve deeper into each of the individual solutions provided by Tool-Ally, exploring their functionalities and benefits in detail!
