QueryCraft: Creating an Instruct Dataset Using an Annotation Tool

Step Zero: Crafting High-Quality Data for Enhanced Text-to-SQL Fine-Tuning and Model Evaluation

Himadri Talukder
Towards Generative AI
4 min read · Jun 10, 2024


What is an Instruct Dataset?

The ‘instruct’ dataset is also referred to, interchangeably, as the golden dataset. In the context of text-to-SQL with large language models (LLMs), creating an instruct dataset involves assembling a high-quality dataset that serves as a benchmark or reference point for evaluating and fine-tuning the performance of the language model. This dataset typically contains accurately labeled or annotated examples that cover a wide range of scenarios and tasks relevant to the intended use of the language model. The term “golden” implies that this dataset is of the utmost quality and serves as the gold standard for comparison.
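For illustration, a single record in such a dataset typically pairs a natural-language question with its verified SQL query and the schema context it runs against. The sketch below uses hypothetical field names and table definitions, not QueryCraft's actual schema:

```python
# A hypothetical instruct-dataset record (field names and schema are illustrative).
example_record = {
    "question": "How many orders were placed in March 2024?",
    "context": "CREATE TABLE orders (order_id INT, order_date DATE, amount DECIMAL(10, 2))",
    "query": "SELECT COUNT(*) FROM orders "
             "WHERE order_date BETWEEN '2024-03-01' AND '2024-03-31';",
}
```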

We developed the QueryCraft framework to provide an easy solution for fine-tuning Large Language Models (LLMs) to generate SQL queries from natural language (Text2SQL, Text2GraphQL, NL2Query). The framework simplifies the process of quickly building complete GenAI pipelines. This blog covers the first step in the series.

We were working on text-to-SQL use cases and needed to curate an instruct dataset of 500 examples from subject matter experts. These examples had to be related to the target domain. We implemented an automatic and iterative process to improve and collect quality data for model evaluation and benchmarking.

We implemented a tool with three different personas in mind:

  1. Annotator: Responsible for curating and inserting data by executing SQL queries.
  2. Reviewer: Conducts Inter-Rater Relevancy (IRR) checks to ensure consistency and accuracy.
  3. Administrator: Organizes, manages, and analyzes the data.
Process flow for curating the instruct dataset

Steps to curate an instruct dataset

Define Objectives

Clearly define the objectives, target domain and use cases for the language model. Our tool is employed for gathering data from subject matter experts (SMEs) and evaluating the dataset by providing feedback and ratings.

Data Collection

Gather data from various sources, encompassing a wide range of scenarios relevant to our intended use cases. We enlisted annotators from both technical and non-technical backgrounds, as well as from different yet related departments within the same company. This allowed us to grasp the types of questions or queries they might have, considering their diverse skill levels and domains. This approach ensures thorough coverage of the targeted domain’s boundaries.

Data curation process

Annotation and Labeling

Annotate the collected data with the appropriate labels or annotations, depending on the specific tasks. This may involve manual annotation by human annotators or automated techniques such as rule-based labeling or active learning. In our case, we label the dataset with the following tags (an example record is sketched after the list):

  • Query difficulty level
    - Simple
    - Moderate
    - Challenging
  • SQL Query type
    - Join
    - Select
    - Aggregate
    - Filter
  • Functional view/Area
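To make this concrete, here is a minimal sketch of an annotated record once these tags are attached; the field names and values are illustrative rather than the tool's actual schema:

```python
# Hypothetical annotated record after labeling (fields and values are illustrative).
annotated_record = {
    "question": "What is the total order amount per customer in 2024?",
    "query": (
        "SELECT c.customer_name, SUM(o.amount) AS total_amount "
        "FROM customers c JOIN orders o ON c.customer_id = o.customer_id "
        "WHERE o.order_date BETWEEN '2024-01-01' AND '2024-12-31' "
        "GROUP BY c.customer_name;"
    ),
    "difficulty": "Moderate",             # Simple | Moderate | Challenging
    "query_type": ["Join", "Aggregate"],  # Join | Select | Aggregate | Filter
    "functional_area": "Sales",           # functional view / area
}
```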

Quality Assurance

Conduct thorough quality assurance checks to ensure the accuracy and consistency of the annotated data. This may involve manual review, inter-annotator agreement analysis, and automated checks for inconsistencies or errors. We reviewed the dataset with multiple reviewers to obtain a quality dataset.
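As one example of an automated check, every curated query can be replayed against a copy of the target database and failures flagged for review. The sketch below assumes a SQLite copy of the schema and the illustrative record fields used earlier; it is not QueryCraft's internal tooling:

```python
import sqlite3

def validate_queries(records, db_path="analytics.db"):
    """Return records whose SQL fails to execute against db_path.

    A basic automated consistency check; the db_path and record fields
    ("question", "query") are illustrative assumptions.
    """
    failures = []
    conn = sqlite3.connect(db_path)
    try:
        for record in records:
            try:
                # Execute the annotated query; any SQL error marks the record for review.
                conn.execute(record["query"]).fetchall()
            except sqlite3.Error as exc:
                failures.append({"question": record["question"], "error": str(exc)})
    finally:
        conn.close()
    return failures
```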

Reviewer view: feedback and IRR

Iterative Improvement

Continuously iterate on the instruct dataset based on feedback and evaluation results. Update the dataset with new examples, refine annotations, and adapt to evolving requirements or challenges.

Analysis

After collecting and aggregating all the feedback, we can analyze the dataset to gain an overall perspective. We can calculate the Fleiss' Kappa score to determine the overall level of agreement among reviewers. We can assess progress and scores by filtering the dataset by rating and analyzing feedback to identify areas for improving overall data quality. Additionally, we can examine participant perspectives and their progress.
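As a reference point, here is a minimal sketch of computing Fleiss' Kappa from a table of reviewer ratings using the statsmodels implementation; the ratings matrix and category codes are made up for illustration:

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: one row per annotated example, one column per reviewer,
# with categorical ratings (0 = reject, 1 = needs revision, 2 = accept).
ratings = np.array([
    [2, 2, 2],
    [2, 1, 2],
    [0, 0, 1],
    [1, 1, 1],
    [2, 2, 1],
])

# Convert raw ratings into a subjects-by-categories count table, then compute kappa.
counts, _categories = aggregate_raters(ratings)
kappa = fleiss_kappa(counts, method="fleiss")
print(f"Fleiss' Kappa: {kappa:.3f}")
```

A score near 1 indicates strong agreement among reviewers, while values near or below 0 suggest the annotation guidelines need refinement.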

Fleiss' Kappa score
Analysis of human feedback on the instruct dataset

In conclusion, curating an instruct dataset for LLM-based text-to-SQL involves careful planning, data collection, annotation, and evaluation. By following best practices and leveraging appropriate approaches and technologies, researchers and practitioners can create high-quality datasets that serve as valuable resources for training, fine-tuning, evaluating, and improving language models for a wide range of applications.

Ready to take your NL2SQL models to the next level? Explore QueryCraft’s evaluation framework today and unlock the full potential of your LLMs!

To explore the specific functionalities of each evaluation component in greater detail, please refer to our detailed blog posts on Data Ingestion for NL2SQL, Context Retrieval, and Fine-Tuning for NL2SQL.

Follow Towards Generative AI for more content on the latest AI advancements.
