When we built Companion's prompting architecture, we needed a principled way to evaluate whether an AI conversation tool was actually doing what it claimed to do.
Standard software quality metrics – accuracy, latency, uptime – were necessary but far from sufficient. A chatbot could perform well on all three while still giving harmful advice, reflecting cultural biases, or eroding user trust through opacity.
The framework that gave us the most rigorous foundation was published in JMIR AI in 2025: "Standardizing and Scaffolding Health Care AI-Chatbot Evaluation: Systematic Review". The paper synthesised evaluation practices across dozens of healthcare chatbot studies and proposed a structured, multi-dimensional approach – what we refer to internally as the HAICEF (Health care AI-Chatbot Evaluation Framework) methodology.
Why evaluation frameworks matter
Healthcare AI chatbots face a challenge that productivity tools do not: the cost of failure is not inefficiency, it is harm.
A chatbot that hallucinates a project deadline causes frustration. A mental wellness chatbot that misidentifies a crisis signal, gives culturally inappropriate advice, or provides a fabricated therapeutic technique could cause real psychological harm to a vulnerable person.
The HAICEF systematic review identified that most evaluation studies in healthcare chatbot literature measured only one or two dimensions in isolation – typically accuracy or user satisfaction – leaving critical blind spots around safety and fairness. The paper proposed a more complete evaluative scaffold across four core dimensions.
The four HAICEF dimensions
1. Safety
Does the system actively avoid harm? This goes beyond avoiding offensive language:
- Detection of crisis signals (suicidal ideation, acute distress, self-harm risk)
- Appropriate refusal behaviours for requests outside scope
- Clear escalation paths to human support or emergency services
- Avoidance of fabricated clinical information
2. Fairness
Does the system perform equally well across different users?
- Does response quality degrade for non-native speakers or minority dialect users?
- Are cultural assumptions embedded in phrasing, examples, or advice?
- Do certain demographic groups receive responses that are less accurate or less empathetic?
3. Trustworthiness
Do users have accurate expectations of what the system can and cannot do?
- Transparency about the AI's limitations and scope
- Provenance indicators for factual claims
- Consistent behaviour across sessions
- Clear disclosure that the user is speaking to an AI
4. Usefulness
Does the system actually help?
- Task success rates (did the user achieve their stated goal?)
- User satisfaction and perceived helpfulness
- Longitudinal engagement – does the tool remain useful after novelty fades?
How Companion applies HAICEF
Each HAICEF dimension translated into concrete safeguards in Companion's prompting layer:
Safety – Multi-tier crisis detection. Certain patterns trigger escalation protocols routing toward human support or emergency services. The model is explicitly prompted to acknowledge its limits and never to present itself as a substitute for clinical care.
Fairness – Bias audits on prompt behaviour across simulated user profiles representing different cultural backgrounds, native languages, and life circumstances. Prompts that consistently produce quality gaps are revised.
Trustworthiness – Companion's prompts include explicit scope definitions: what the assistant will engage with, what it will not, and why. When the model is uncertain, it is prompted to say so.
Usefulness – Session-level satisfaction tracking and longitudinal engagement data to identify when the tool stops being helpful.
Why this matters
The HAICEF framework gave us a way to have honest internal conversations about the gap between what Companion claims to offer and what it demonstrably delivers.
It is easy to write marketing copy about empathy and evidence-based support. It is harder – and more valuable – to have a structured methodology for testing whether those claims hold up under real-world use, across diverse users, in high-stakes moments.
Source: "Standardizing and Scaffolding Health Care AI-Chatbot Evaluation: Systematic Review." JMIR AI 2025;1:e69006.