Overview: What’s the Problem?
Dashbot has written on the concept of data enrichment before - adding new dimensions to your data so you can measure or view things in ways that were previously impossible. Customer frustration, predicted CSAT, and resolution rate are just a few examples. Yet the single biggest concern most end users have with generative AI as a whole remains unanswered: how accurate are these systems? And what about the fundamental problem of LLM hallucination?
Reviewing how many other companies explain their AI processes or value-add has made me realize that companies often don’t know how to answer that question - or worse, they obfuscate the answer. Answering ‘How accurate are these systems?’ requires explaining where hallucination harms these systems most and least, and how we at Dashbot measure this to deliver the most reliable generative AI strategies.
Context: Data Enrichment & LLM Hallucinations
Data enrichment is fast becoming a necessary component of any form of big data AI analysis. In the customer service use case, for example, categorizing your user-agent interactions by topic, effort, frustration, predicted CSAT, resolution status, and many other metrics gives you both simple, functional data overviews and the ability to slice your data any way you want (e.g. view only the topics with the shortest conversations and the highest resolution rate to identify automation opportunities).
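To make that slicing example concrete, here is a minimal sketch in Python with pandas; the enrichment columns (topic, turns, resolved, predicted_csat) and the data are purely hypothetical, not Dashbot’s actual schema.

```python
import pandas as pd

# Hypothetical enriched conversation data; column names are illustrative only.
conversations = pd.DataFrame({
    "topic":          ["billing", "returns", "shipping", "billing", "returns"],
    "turns":          [4, 12, 3, 5, 11],
    "resolved":       [True, False, True, True, False],
    "predicted_csat": [4.5, 2.1, 4.8, 4.2, 2.4],
})

# Per-topic view: average conversation length and resolution rate.
by_topic = conversations.groupby("topic").agg(
    avg_turns=("turns", "mean"),
    resolution_rate=("resolved", "mean"),
)

# Short conversations with a high resolution rate -> automation candidates.
automation_candidates = by_topic[
    (by_topic["avg_turns"] <= 5) & (by_topic["resolution_rate"] >= 0.9)
]
print(automation_candidates)
```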
Enriching data into this more functional form relies on Large Language Models to create it - typically without knowing how often they get it right. But what if we could know? Further, what if we could use that knowledge to prioritize low-hallucination strategies - empowering end users with the highest-fidelity LLM use cases?
LLM Accuracy Evaluation: Where Does Data Enrichment Actually Work?
We created a simple experimental design to determine how accurate our models are when producing insights or enriching data, then compared which approaches worked best. For those newer to LLMs, it's important to remember that using the same LLM for different tasks does NOT yield the same quality of result; an LLM's competence varies depending on the use case.
One of the experiments we performed determined how accurate our ‘Outcome’ metric is: given a customer service conversation, we predicted whether the conversation was successfully completed, escalated, or abandoned without resolution. Using multiple real human evaluators per data point across thousands of sample conversations, we observed a 99%+ accuracy rate for our ‘Outcome’ metric.
‘Outcome’ was selected for this test because it is one of the least subjective components of our data enrichment strategy, in contrast to metrics like a ‘predicted review’ or the ‘reason’ for a conversation. A human evaluation should therefore give us a clear idea of how competent our pipeline is at this form of data enrichment - without the potential confound of subjectivity across human evaluators.
The labels our pipeline attributes to customer service conversations regarding Outcome are:
- complete
- escalated
- abandoned
Human evaluators are presented with a human-agent conversation and the label we assigned to that conversation. Each data point must then be marked by the evaluator as correctly labelled or incorrectly labelled.
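A minimal sketch of what one evaluation task might look like under this protocol; the EvaluationTask structure and its field names are hypothetical, for illustration only.

```python
from dataclasses import dataclass
from typing import Optional

# Labels our pipeline assigns to a conversation's 'Outcome'.
OUTCOME_LABELS = ("complete", "escalated", "abandoned")

@dataclass
class EvaluationTask:
    """One data point presented to a human evaluator."""
    conversation: str               # the full human-agent transcript
    assigned_label: str             # the 'Outcome' label our pipeline attributed
    verdict: Optional[bool] = None  # True = correctly labelled, False = incorrectly labelled

# Hypothetical example of an evaluator confirming a label.
task = EvaluationTask(
    conversation="User: Where is my order?\nAgent: It arrives tomorrow.\nUser: Great, thanks!",
    assigned_label="complete",
)
task.verdict = True
```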
Results
Of all 4250+ data points (unique conversations):
- 3269 were attributed to ‘complete’ (76.8%)
- 186 were attributed to ‘escalated’ (4.3%)
- 799 were attributed to ‘abandoned’ (18.7%)
Each data point (unique conversation) was evaluated by 3 different humans.
The final result: every single label was judged correct except for 3 of the 4250+ data points. All 3 incorrect labels belonged to the ‘complete’ category.
Overall Accuracy: 99.9%
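For transparency, the arithmetic behind that figure follows directly from the category counts above; a quick sketch in Python:

```python
# Category counts from the human evaluation above.
counts = {"complete": 3269, "escalated": 186, "abandoned": 799}

total = sum(counts.values())   # 4254 unique conversations
incorrect = 3                  # all three in the 'complete' category
accuracy = (total - incorrect) / total

print(f"{incorrect} incorrect labels out of {total}")
print(f"Overall accuracy: {accuracy:.1%}")   # 99.9%
```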
Limitations
Having 3 evaluators per data point means that evaluators may sometimes disagree; in that case the most common answer is accepted, but confidence is lower where agreement is not unanimous. Ideally this test would have been performed with a greater evaluator count per data point, but the cost would have tripled, and given the end result this does not appear to be a substantial limitation. Experiments with more complex labels or a greater degree of subjectivity benefit more from a larger evaluator count per data point.
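Continuing the sketch above, here is a minimal illustration of that majority-vote rule over three boolean verdicts per conversation (again hypothetical, not our production tooling):

```python
from collections import Counter

def aggregate_verdicts(verdicts: list) -> tuple:
    """Return (majority verdict, unanimous?) for one conversation,
    where True means 'correctly labelled'."""
    tally = Counter(verdicts)
    majority, _ = tally.most_common(1)[0]
    return majority, len(tally) == 1

# Three evaluators judging one labelled conversation.
print(aggregate_verdicts([True, True, True]))   # (True, True)  - unanimous
print(aggregate_verdicts([True, False, True]))  # (True, False) - accepted, lower confidence
```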
Low-Hallucination Strategies
Because these are not traditional ML systems, generative AI users do not, in principle, have access to a traditional ‘accuracy’ score. However, comparing how well our models perform across different downstream tasks tells us which tasks we should use LLMs for and which are prone to high hallucination rates.
Across the experiments we have run so far, we’ve observed data enrichment strategies ranging from 81% to 99.9% accuracy (each confirmed by real human evaluators), depending on how subjective or mature an enrichment is. We use these findings to adjust our data/AI pipeline for better performance and to decide which tasks to drop entirely. In the case of our ‘Outcome’ metric, these results give us confidence that we can trust the data enrichment process practically all of the time.
What Next?
Greater transparency around generative AI capabilities is a step towards reassuring end users of the quality and veracity of the products they consume. Because LLMs operate within the inherently subjective world of language, we face not only a hallucination problem but, in some cases, a subjectivity problem. High-quality evaluations of generative AI that identify low-hallucination strategies are a strong path towards more robust systems. Finally, for more details, definitely download our E-Book and dive into Phase 2!