At a Glance
Generative AI is creating real opportunities in healthcare, but the gap between promising demos and safe clinical deployment remains wide. The most practical near-term use cases—documentation, summarization, and workflow triage—work because they keep clinicians in control while reducing cognitive and administrative burden. For healthcare technology teams, the path forward is not broad automation, but a staged roadmap built on retrieval, evaluation, uncertainty signaling, and mandatory human oversight.
Generative AI has arrived in healthcare with claims that range from the genuinely exciting to the dangerously premature. Proponents point to studies showing LLMs passing medical licensing exams and matching specialist performance on radiology reads. Sceptics point to documented hallucinations, demographic biases in training data, and the catastrophic consequences of an error that would be minor in another domain. Both are right, and the tension between them defines the engineering and product challenge for healthcare technology teams.
The productive path through this tension is not choosing between enthusiasm and caution — it is developing a rigorous framework for which clinical AI applications are ready for production deployment, which require additional infrastructure to be deployed responsibly, and which should not be deployed yet regardless of commercial pressure. This article proposes that framework, grounded in where the evidence is strongest and where the failure modes are most consequential.
Where Generative AI Helps Today
The clinical AI applications with the strongest current evidence base share a common characteristic: they assist with tasks where errors are catchable before they reach the patient, where human review is structurally part of the workflow, and where the volume and cognitive load of the task make AI assistance genuinely valuable rather than merely novel.
Clinical documentation is the clearest case. Physicians spend a disproportionate share of their working hours on documentation — structured notes, discharge summaries, referral letters, prior authorisation requests — tasks that are cognitively demanding, time-consuming, and far removed from the clinical work that drew them to medicine. Ambient AI systems that listen to a patient encounter and generate a structured draft note, which the clinician reviews, edits, and approves before it enters the record, demonstrably reduce documentation burden without introducing patient safety risk. The human remains the author of record. The AI reduces the blank-page problem.
- Nuance’s Dragon Ambient eXperience (DAX) and similar ambient documentation tools have shown consistent reductions in documentation time across specialties, with clinician satisfaction scores that reflect genuine relief from administrative burden
- Discharge summary generation, conditioned on the structured data in the patient record rather than on free-form generation, produces drafts that require editing but are materially faster to finalise than starting from scratch
- Prior authorisation letter generation — summarising clinical evidence for a specific treatment decision in the format required by a specific payer — is a high-volume, formulaic task where AI generation with clinician review is appropriate and valuable
Key characteristic: In documentation assistance, the clinician is reviewing AI-generated text before it enters the record. The AI’s role is to reduce effort, not to make clinical decisions. This is the right model for near-term deployment.
Clinical summarisation is the second well-supported application. A physician seeing a complex patient with a lengthy EHR history may spend fifteen minutes reading through prior notes, lab results, and medication changes before a ten-minute appointment. An AI system that synthesises the relevant clinical history into a structured pre-visit summary — conditioned on the patient’s actual record data through retrieval-augmented generation — reduces this cognitive load without removing clinical judgment from the loop.
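The retrieval-conditioning described above can be sketched in miniature. Everything in this sketch is an illustrative assumption, not a specific product's API: a production system would rank record entries with clinical embeddings and ontologies rather than the naive keyword overlap used here, but the shape — retrieve from the patient's own record, then constrain the model to those excerpts — is the point.

```python
from dataclasses import dataclass

@dataclass
class RecordEntry:
    date: str
    kind: str   # e.g. "note", "lab", "medication"
    text: str

def retrieve_relevant(entries, visit_reason, k=3):
    """Rank record entries by keyword overlap with the visit reason.
    Stand-in for a real clinical retrieval layer (embeddings, ontologies)."""
    reason_terms = set(visit_reason.lower().split())
    def score(entry):
        return len(reason_terms & set(entry.text.lower().split()))
    return sorted(entries, key=score, reverse=True)[:k]

def build_summary_prompt(entries, visit_reason):
    """Assemble a grounded prompt: the model is instructed to draw only on
    the retrieved excerpts, which is what limits free-form hallucination."""
    context = "\n".join(f"[{e.date}] {e.kind}: {e.text}" for e in entries)
    return (
        "Summarise the clinical history below into a pre-visit brief.\n"
        f"Visit reason: {visit_reason}\n"
        "Use ONLY the excerpts provided; cite the entry date for each claim.\n"
        f"---\n{context}"
    )
```

The prompt constraint ("use only the excerpts provided") is a mitigation, not a guarantee; it is why output monitoring and clinician review still sit downstream of retrieval.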
Radiology workflow triage represents a third validated application. AI systems that flag studies likely to contain urgent findings — a potential intracranial haemorrhage, a pulmonary embolism, a pneumothorax — for priority radiologist review have demonstrated both sensitivity and specificity sufficient for clinical deployment, with FDA 510(k) clearances providing the regulatory validation that distinguishes these systems from research prototypes.
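At its core, a triage layer of this kind is a re-ordering of the radiologist's worklist, never a filter. The sketch below is a simplified illustration with an assumed urgency score and flag threshold; in a cleared system those come from the validated detection model and site-specific thresholding, not hand-picked constants.

```python
def prioritised_worklist(studies, flag_threshold=0.8):
    """studies: list of (study_id, urgency_score) in arrival order.
    Studies at or above the threshold are flagged and surfaced first,
    highest score first; everything else keeps its arrival order.
    Nothing is dropped: triage re-orders review, it never replaces it."""
    flagged = [s for s in studies if s[1] >= flag_threshold]
    routine = [s for s in studies if s[1] < flag_threshold]
    flagged.sort(key=lambda s: -s[1])
    return flagged + routine
```

Keeping routine studies in arrival order matters: a false negative from the flagging model still reaches a radiologist, just without priority, which is what keeps the failure mode a workflow delay rather than a missed study.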
Where Caution Is Warranted
The applications where greater caution is warranted are those where generative AI output would be acted upon without reliable human verification, where the failure mode is a patient safety event rather than a workflow inefficiency, or where the training data and evaluation methodology are insufficient to establish the performance required for the clinical context.
Diagnostic suggestion — a system that proposes a differential diagnosis based on patient data — is the application that generates the most attention and the most concern. The concern is well-founded. LLMs trained on medical literature have broad but shallow clinical knowledge that does not generalise reliably to the presentation complexity of real patients. They have demonstrated systematic biases in performance across patient demographics. And the automation bias risk — a clinician anchoring on an AI-suggested diagnosis and underweighting contradictory evidence — is a documented phenomenon in clinical settings.
Safety boundary: A diagnostic suggestion tool that a clinician treats as a second opinion is a different product from one they treat as a starting point. The interface design, the way suggestions are framed, and the training provided to clinicians all influence which mode of use predominates — and must be engineered with this in mind.
Treatment recommendations carry similar risks at higher stakes. Drug selection, dosing, and the sequencing of therapeutic interventions involve the integration of clinical evidence, patient-specific factors, local formulary constraints, and clinical judgment in ways that current LLMs cannot reliably replicate. The FDA’s framework for Software as a Medical Device (SaMD) applies to AI systems that influence treatment decisions, and the clinical validation requirements it implies are not met by general-purpose model performance benchmarks.
The Safety Infrastructure for Responsible Deployment
For the applications where deployment is appropriate, the safety infrastructure required goes beyond what most non-healthcare AI deployments demand. Building it correctly is what separates a clinical AI product from a general-purpose AI product repackaged for a clinical audience.
- Retrieval-augmented generation scoped to verified clinical sources — peer-reviewed guidelines, the patient’s own record, approved drug information databases — rather than the model’s general training knowledge reduces hallucination risk in the clinical domain specifically
- Output confidence calibration and explicit uncertainty signalling — the system surfacing its confidence level and the evidence basis for a generated output — gives clinicians the context to apply appropriate scepticism rather than treating AI output as authoritative
- Clinical safety testing must include out-of-distribution evaluation: how does the system perform on patients whose demographics, comorbidities, or presentations are underrepresented in the training data? Aggregate benchmark performance can conceal subgroups where performance is systematically worse
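The subgroup evaluation in the last point can be made concrete. This is a minimal sketch, assuming labelled test cases tagged with a demographic subgroup; the mechanism is what matters: per-subgroup sensitivity surfaces skew that a single aggregate figure hides.

```python
from collections import defaultdict

def subgroup_sensitivity(cases):
    """cases: iterable of (subgroup, truth, prediction) with boolean
    truth/prediction. Returns sensitivity (true-positive rate) per
    subgroup, so a subgroup the model systematically misses is visible
    even when the aggregate number looks acceptable."""
    true_positives = defaultdict(int)
    positives = defaultdict(int)
    for group, truth, pred in cases:
        if truth:
            positives[group] += 1
            if pred:
                true_positives[group] += 1
    return {g: true_positives[g] / positives[g] for g in positives}
```

In a dataset dominated by one subgroup, overall sensitivity can sit comfortably above 0.8 while a minority subgroup's sits below 0.5; reporting the per-subgroup breakdown alongside the aggregate is the cheap, non-negotiable part of clinical evaluation.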
Human oversight must be structurally enforced, not just recommended. Workflow design that routes AI outputs through a mandatory, not optional, clinician review step is the difference between a system where oversight is the norm and one where it is the exception. Alert fatigue from low-quality AI suggestions is a real risk that undermines the entire clinical AI investment; precision matters as much as recall in a context where every false positive consumes clinician attention.
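Structural enforcement of review can live in the data model rather than in policy documents. The class and function names below are hypothetical, but the pattern is the point: an AI draft that has not been approved by a named clinician simply cannot be committed to the record.

```python
class DraftNote:
    """AI-generated draft that cannot enter the record until a named
    clinician approves it. The review step is enforced in code, so
    skipping it is an error, not a habit."""
    def __init__(self, text):
        self.text = text
        self.status = "draft"
        self.reviewer = None

    def submit_for_review(self):
        self.status = "pending_review"

    def approve(self, clinician_id, edited_text=None):
        if self.status != "pending_review":
            raise RuntimeError("note must be submitted for review first")
        if edited_text is not None:
            self.text = edited_text     # clinician remains author of record
        self.reviewer = clinician_id
        self.status = "approved"

def commit_to_record(note, record):
    """Refuses any note without a completed clinician review."""
    if note.status != "approved" or note.reviewer is None:
        raise PermissionError("unreviewed AI output cannot enter the record")
    record.append({"text": note.text, "approved_by": note.reviewer})
```

Capturing the reviewer's identity alongside the approved text also produces the audit trail that regulatory and clinical governance reviews will ask for.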
Regulatory positioning is an engineering decision as much as a legal one. The distinction between a clinical decision support tool that qualifies for enforcement discretion under FDA guidance and one that meets the definition of a medical device requiring clearance or approval depends on how the product is designed, how its outputs are framed, and what role it plays in clinical decision-making. Engaging with the regulatory framework during product design, not after, is what keeps the product on the right side of that line.
The Practical Roadmap
For healthcare technology teams deciding where to invest in generative AI, the sequencing is clearer than the market noise suggests. Start with documentation and summarisation — the evidence base is solid, the safety profile is manageable with appropriate workflow design, and the clinical value is immediate and demonstrable. Build the safety infrastructure — RAG pipelines, output monitoring, human review workflows, clinical evaluation frameworks — as shared platform capabilities rather than one-off implementations for each use case. Then expand to higher-stakes applications as the evidence base matures and the infrastructure has been validated in production.
The healthcare AI companies that will earn lasting clinical trust are not those that move fastest into the highest-stakes applications. They are those that demonstrate, through rigorous evaluation and transparent communication of limitations, that their systems perform as claimed on the patients who matter most — including the ones whose presentations are hardest.
At Nineleaps, we help healthcare technology companies build generative AI systems that meet the clinical and regulatory bar — designing the safety infrastructure, evaluation frameworks, and human oversight layers that responsible deployment requires.