ChatGPT & Non-Public Data Experiments: Getting Started

David Greenfield
ChatGPT Experiments
3 min read · Jan 4, 2023

ChatGPT has caught the attention of the tech world with its combination of impressive response flexibility and AI-powered creativity, but the most valuable uses will expand beyond the standard training data to eventually include private, sensitive, and access-restricted data.

This mini-blog series is dedicated to sharing knowledge from experiments using ChatGPT, with a focus on creating custom output that reflects data not included in the training models.

Background and Key Challenges:

One of the key limitations of the out-of-the-box OpenAI models is that they rely exclusively on an open-internet training dataset that ends in 2021. Given this limitation, they cannot provide factual context on recent events or on data from private knowledge stores. The goal of my experiments is to look at how easy it is to adapt the generically trained model to produce useful output that requires context outside the initial training set.

Context:

To do that, I will use a scenario where the AI API is asked to generate cover letters for specific positions given knowledge about a candidate (myself). Doing this well requires novel information about both the job posting and the applicant, integrated into a cohesive response.
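To make that concrete, a prompt for the prompt-only experiments might stitch those two sources of information together roughly as sketched below; the structure, field names, and example content are my own placeholders, not a prescribed format.

```python
# Rough template for the prompt-only experiments. The structure and field
# names are illustrative placeholders, not a required format.
COVER_LETTER_PROMPT = """Write a professional cover letter for the position below.

Job posting:
{job_description}

Candidate background (from resume):
{resume_details}

Cover letter:"""

prompt = COVER_LETTER_PROMPT.format(
    job_description="Senior Data Analyst at Acme Corp ...",  # pasted from the posting
    resume_details="10 years of analytics experience ...",   # pasted from my resume
)
```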

One known limitation in this experiment set is that the training sets for fine tuning will be significantly smaller than ideal, closer to the 100-example minimum that OpenAI recommends. In general, producing large and well-structured training sets on recent events or specialized information will be a challenge for the ecosystem, but one that has great value if solved with even moderate success.
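For reference, OpenAI's fine-tuning endpoint (at the time of writing) expects training examples as JSONL prompt/completion pairs, so even a small training set gets assembled along these lines; the example content below is invented purely for illustration.

```python
import json

# Each fine-tuning example is a JSON object with "prompt" and "completion"
# keys, written one per line (JSONL). The content here is invented.
examples = [
    {
        "prompt": "Job posting: Senior Data Analyst at Acme Corp ...\n\nCover letter:",
        "completion": " Dear Hiring Manager, ...\n",
    },
    # ... ideally at least the ~100 examples OpenAI recommends
]

with open("cover_letter_training.jsonl", "w") as f:
    for example in examples:
        f.write(json.dumps(example) + "\n")
```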

Experiments Planned:

To test the results, I will try several methods of introducing novel data to the OpenAI API; a rough sketch of the API calls involved follows the list below.

  • Pre-Experiment: Setting Up the Plumbing
  • Experiment 1 — No Training, Prompt Only: Leveraging ChatGPT to write a cover letter, including both job description and author resume details.
  • Experiment 2 — Fine Tuning Only: No new information in the prompt
  • Experiment 3 — Hybrid: Fine Tuning + Information in the Prompt
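Here is roughly what those calls might look like with the openai Python library that was current when this was written; the model identifiers, parameters, and prompt text are placeholders I chose for illustration, not recommendations.

```python
import openai

openai.api_key = "sk-..."  # your API key

# A prompt containing the novel context (job posting + resume details),
# e.g. built from the template sketched earlier.
prompt = "Write a professional cover letter for the position below. ..."

# Experiment 1 (no training): send the full prompt to a base model.
baseline = openai.Completion.create(
    model="text-davinci-003",  # base model available at the time of writing
    prompt=prompt,
    max_tokens=500,
    temperature=0.7,
)

# Experiments 2 and 3: the same call pointed at a fine-tuned model. The model
# name is a placeholder for whatever identifier the fine-tune job returns.
# Experiment 2 would use a sparse prompt; Experiment 3 keeps the full context.
fine_tuned = openai.Completion.create(
    model="davinci:ft-personal-2023-01-01-00-00-00",
    prompt=prompt,
    max_tokens=500,
    temperature=0.7,
)

print(baseline.choices[0].text)
```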

Two approaches that I won't test, but that would potentially yield better results (with more effort), are:

  • Hybrid Approach 2: Fine Tuning with large generic cover letter dataset + providing job and applicant context within prompt
  • Train from Scratch: I don't believe this is currently available through OpenAI, and it would require a massive investment in dataset creation

Baseline Example: Before we start applying novel data, here's an example response based on a generic, made-up prompt:

We can't really evaluate this response other than to say that it's clear the AI has been trained to structurally understand cover letters and to make some guesses about what would be required based on a hypothetical job title. It does raise the question for the experiments: how do we measure whether the output is good?

Scoring Results:

Evaluating generative AI outputs is not as simple as evaluating predictive models with a definitive "correct" answer, and it likely must include some level of subjectivity from a human with contextual knowledge. For these experiments I'll look at results based on four factors, the first two of which take inspiration from:

Edit Distance (Sentence Level) — What % of sentences would be edited before using the AI generated output? This gives a measure of how much human intervention would be needed to leverage the answer.

Edit Distance (Word Level) — What % of words would be edited/added/removed before using the AI generated output? This is probably a less useful metric, but will give an indication of the language cohesiveness of the responses.
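For the word-level measure in particular, one rough way to approximate the percentage is to diff the raw AI output against the version I actually end up using, for example with Python's built-in difflib; the edit-counting rule here is my own choice, not a standard.

```python
import difflib

def word_edit_percentage(ai_text: str, edited_text: str) -> float:
    """Rough % of words edited/added/removed between the raw AI output and
    the human-edited version, based on a word-level diff."""
    ai_words = ai_text.split()
    edited_words = edited_text.split()
    matcher = difflib.SequenceMatcher(a=ai_words, b=edited_words)
    changed = sum(
        max(i2 - i1, j2 - j1)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    )
    return 100 * changed / max(len(ai_words), 1)
```

A sentence-level version would be the same idea with the texts split on sentence boundaries instead of whitespace.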

Accuracy — What % of sentences contain factually correct information, and how many incorrect statements are there? These metrics look at how often the language model generates false statements.

Completeness — What % of sentences lack critical information to support the fact or statement? This is probably the most subjective of the metrics, but one I thought would be interesting to track, given that anecdotally I've noticed many cases where the API gives factually correct but vague answers, to the point where they may be unusable or embarrassing to the consumer. For example, if you ask a human "Where were you yesterday?" and they respond "Earth", it would be factually correct but, I expect, not well received (unless you are a frequent space-travelling astronaut).
