AI Evaluation Results
Automated evaluation was performed by Claude AI using the 6-criteria rubric on all 20 reports (5 models × 4 cases). These scores provide a baseline assessment of report quality.
📊 Overall scores are based on Claude AI evaluation using the standardized rubric
Overall Average Scores Across All Cases
| Model | Case I Infection | Case II Organ Damage | Case III GFR | Case IV Kidney | Overall Average |
|---|---|---|---|---|---|
| Anthropic Claude | 3.83 | 3.83 | 3.83 | 3.83 | 3.83 |
| Google Gemini | 3.50 | 3.83 | 3.83 | 3.83 | 3.75 |
| DeepSeek R1 | 2.83 | 2.83 | 2.83 | 3.83 | 3.08 |
| Grok 4 | 2.50 | 2.83 | 3.00 | 3.83 | 3.04 |
| OpenAI GPT-4 | 2.67 | 2.50 | 2.50 | 2.83 | 2.63 |
Key Insights from Cross-Case Analysis:
- Most Consistent: Anthropic Claude maintained a 3.83 average across all four cases
- Most Improved: DeepSeek R1 and Grok 4 performed markedly better on Case IV (Kidney Disease)
- Best Innovation: Google Gemini excelled in Case II with a novel "slow burn" hypothesis
- Most Variable: Grok 4 ranged from 2.50 to 3.83 across cases
- Consistent Underperformer: OpenAI GPT-4 produced incomplete analyses across all cases
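For readers who want to verify the arithmetic, the short Python sketch below (the variable names are chosen here and are not part of the project's code) recomputes each model's overall average as the unweighted mean of its four per-case scores; the results match the table above to within rounding.

```python
# Per-case average scores, copied from the table above.
case_scores = {
    "Anthropic Claude": [3.83, 3.83, 3.83, 3.83],
    "Google Gemini":    [3.50, 3.83, 3.83, 3.83],
    "DeepSeek R1":      [2.83, 2.83, 2.83, 3.83],
    "Grok 4":           [2.50, 2.83, 3.00, 3.83],
    "OpenAI GPT-4":     [2.67, 2.50, 2.50, 2.83],
}

# Overall average = unweighted mean of the four case averages.
# (OpenAI GPT-4's mean is exactly 2.625, which the table rounds up to 2.63.)
for model, scores in case_scores.items():
    overall = sum(scores) / len(scores)
    print(f"{model}: {overall:.2f}")
```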
Case I - Infection: Detailed Scoring
| Model | Relevance | Structure | Understandability | Completeness | Innovation | Accuracy | Average |
|---|---|---|---|---|---|---|---|
| Anthropic Claude | 4 | 4 | 4 | 4 | 3 | 4 | 3.83 |
| DeepSeek R1 | 3 | 3 | 3 | 3 | 2 | 3 | 2.83 |
| Google Gemini | 4 | 4 | 4 | 3 | 3 | 3 | 3.50 |
| OpenAI GPT-4 | 3 | 3 | 3 | 3 | 2 | 2 | 2.67 |
| Grok 4 | 2 | 3 | 3 | 3 | 2 | 2 | 2.50 |
Case II - Organ Damage: Detailed Scoring
| Model | Relevance | Structure | Understandability | Completeness | Innovation | Accuracy | Average |
|---|---|---|---|---|---|---|---|
| Anthropic Claude | 4 | 4 | 4 | 4 | 3 | 4 | 3.83 |
| Google Gemini | 4 | 4 | 4 | 4 | 4 | 3 | 3.83 |
| DeepSeek R1 | 3 | 3 | 3 | 3 | 2 | 3 | 2.83 |
| Grok 4 | 3 | 3 | 3 | 3 | 2 | 3 | 2.83 |
| OpenAI GPT-4 | 3 | 3 | 3 | 2 | 2 | 2 | 2.50 |
Case III - Glomerular Filtration Rate: Detailed Scoring
| Model | Relevance | Structure | Understandability | Completeness | Innovation | Accuracy | Average |
|---|---|---|---|---|---|---|---|
| Anthropic Claude | 4 | 4 | 4 | 4 | 3 | 4 | 3.83 |
| Google Gemini | 4 | 4 | 4 | 4 | 3 | 4 | 3.83 |
| Grok 4 | 3 | 3 | 3 | 3 | 3 | 3 | 3.00 |
| DeepSeek R1 | 3 | 3 | 3 | 3 | 2 | 3 | 2.83 |
| OpenAI GPT-4 | 3 | 3 | 3 | 2 | 2 | 2 | 2.50 |
Case IV - Kidney Disease Progression: Detailed Scoring
| Model | Relevance | Structure | Understandability | Completeness | Innovation | Accuracy | Average |
|---|---|---|---|---|---|---|---|
| Anthropic Claude | 4 | 4 | 4 | 4 | 3 | 4 | 3.83 |
| DeepSeek R1 | 4 | 4 | 4 | 4 | 3 | 4 | 3.83 |
| Google Gemini | 4 | 4 | 4 | 4 | 3 | 4 | 3.83 |
| Grok 4 | 4 | 4 | 4 | 4 | 3 | 4 | 3.83 |
| OpenAI GPT-4 | 3 | 3 | 3 | 3 | 2 | 3 | 2.83 |
Scoring Framework for LLM-Generated Healthcare Process Mining Reports
All reports are evaluated based on a comprehensive rubric designed specifically for epidemiological and healthcare process mining contexts. The rubric assesses six key criteria:
| Criteria | Exemplary (4) | Proficient (3) | Needs Improvement (2) | Insufficient (1) |
|---|---|---|---|---|
| Relevance | Fully addresses key clinical and contextual issues. Strong alignment with process map and report purpose. | Addresses most relevant issues; minor gaps in alignment or scope. | Covers topic broadly but misses core clinical/contextual focus or the process framework. | Misaligned with clinical context; key issues not addressed. |
| Structure & Presentation | Clear, logical organization with defined sections. Effective use of tables/figures to support interpretation. | Generally well-structured; visuals used but may lack consistency or clarity. | Structure exists but is disjointed or difficult to follow. Visual aids are underused or unclear. | No discernible structure. Unformatted text and no visual supports. |
| Understandability | Clear, concise, jargon-free language. Accessible to a broad range of stakeholders. | Mostly clear with minor technical or dense sections. | Some sections are unclear or inconsistent in tone and terminology. | Poorly written throughout; impedes understanding. |
| Completeness | Comprehensive coverage of components: interpretation steps, clinical pathways, and KPIs. | Most components included but may lack depth in some areas. | Overview is present, but omits critical interpretive elements or performance metrics. | Lacks essential content. Missing interpretation or KPI references. |
| Innovation | Demonstrates creative approaches or novel clinical insights beyond standard practice. | Shows elements of creativity or innovation; may lack full development. | Limited originality; relies on conventional methods without new perspectives. | No evidence of innovation; basic, derivative, or rote output. |
| Accuracy | Clinically and contextually accurate. Terminology and figures aligned with process map and domain standards. | Mostly accurate with minor issues that don't affect the core message. | Noticeable errors in clinical interpretation, terms, or figure use. | Major inaccuracies or misinterpretations compromising validity. |
Scoring Scale: Exemplary (4) | Proficient (3) | Needs Improvement (2) | Insufficient (1)
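As a minimal illustration, and not the project's actual evaluation code, the sketch below encodes the six criteria and the 1-4 scale and shows that a report's average is simply the unweighted mean of its six criterion scores, using Anthropic Claude's Case I scores from the detailed table as the example.

```python
CRITERIA = [
    "Relevance", "Structure & Presentation", "Understandability",
    "Completeness", "Innovation", "Accuracy",
]
SCALE = {4: "Exemplary", 3: "Proficient", 2: "Needs Improvement", 1: "Insufficient"}

def report_average(scores: dict[str, int]) -> float:
    """Average of the six criterion scores on the 1-4 scale."""
    missing = set(CRITERIA) - scores.keys()
    if missing:
        raise ValueError(f"Missing criteria: {missing}")
    return round(sum(scores[c] for c in CRITERIA) / len(CRITERIA), 2)

# Example: Anthropic Claude, Case I (Infection), from the detailed table above.
claude_case1 = {
    "Relevance": 4, "Structure & Presentation": 4, "Understandability": 4,
    "Completeness": 4, "Innovation": 3, "Accuracy": 4,
}
print(report_average(claude_case1))  # 3.83
```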
Orchestrated Reports - Multi-Model Synthesis
These orchestrated reports consolidate insights from all 5 language models (Anthropic Claude, DeepSeek R1, Google Gemini, OpenAI GPT-4, and Grok 4) to provide comprehensive, unified analyses. Each report preserves the best insights from individual models while identifying consensus findings and areas of disagreement.
Case I - Infection Progression
Consolidated analysis from 5 models examining infection progression patterns in sepsis patients.
LLM Analysis Prompts
These are the prompts used to generate reports from different language models for each case, as well as the orchestration prompt used to consolidate multiple reports.
Case I - Infection Analysis
Prompt for analyzing infection progression patterns in sepsis patients
How These Prompts Work
- Analysis Prompts: Used with each LLM (Anthropic, DeepSeek, Gemini, OpenAI, Grok) to generate individual reports for each case
- Orchestration Prompt: Used with Claude to consolidate all 5 model reports into a unified analysis
- Evaluation Prompt: Used to score each report based on the 6-criteria rubric
- All prompts emphasize clinical relevance, actionable insights, and clear communication
- Process mining data (matrices and maps) are provided alongside these prompts; a simplified end-to-end sketch of this pipeline is shown below
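The overall flow can be summarized with the rough sketch below. The helper functions (`generate_report`, `orchestrate`, `evaluate`) are hypothetical stand-ins for the real LLM API calls and are defined here only as placeholders; the model names are taken from this page.

```python
MODELS = ["Anthropic Claude", "DeepSeek R1", "Google Gemini", "OpenAI GPT-4", "Grok 4"]

# Hypothetical stand-ins for real LLM API calls; names and signatures are assumptions.
def generate_report(model: str, analysis_prompt: str, process_data: str) -> str:
    return f"[{model} report]"

def orchestrate(orchestration_prompt: str, reports: list[str]) -> str:
    return f"[orchestrated synthesis of {len(reports)} reports]"

def evaluate(evaluation_prompt: str, report: str) -> float:
    return 0.0  # placeholder for a rubric-based score

def run_case(analysis_prompt: str, orchestration_prompt: str,
             evaluation_prompt: str, process_data: str):
    # 1. Each model generates an individual report from the analysis prompt
    #    plus the case's process mining matrices and maps.
    reports = {m: generate_report(m, analysis_prompt, process_data) for m in MODELS}

    # 2. Claude consolidates the five reports into one orchestrated report,
    #    keeping consensus findings and flagging disagreements.
    orchestrated = orchestrate(orchestration_prompt, list(reports.values()))

    # 3. Each individual report is scored against the 6-criteria rubric.
    scores = {m: evaluate(evaluation_prompt, r) for m, r in reports.items()}
    return reports, orchestrated, scores
```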
Expert Evaluation Results
Human expert evaluation of the AI-generated reports by clinical and epidemiological specialists.
Expert Evaluation In Progress
Clinical and epidemiological experts are currently reviewing the AI-generated reports. Expert scores will be available once the evaluation process is complete.
Expected Timeline: Results will be updated here once expert review is finalized
What Will Be Evaluated
- Clinical Accuracy: Verification of medical interpretations and clinical relevance
- Process Mining Validity: Assessment of process analysis accuracy and pathway interpretation
- Practical Value: Evaluation of actionability and implementation feasibility
- Innovation Quality: Assessment of novel insights and research hypotheses
Expert Review Panel
The evaluation will be conducted by a multidisciplinary panel including:
- Clinical specialists in sepsis and critical care
- Epidemiologists with process mining expertise
- Healthcare quality improvement specialists
- Medical informatics researchers
Each report will be independently scored by multiple experts using the same 6-criteria rubric used for AI evaluation.
Coming Soon: AI vs Expert Comparison
Once expert evaluation is complete, this section will include:
- Score comparisons between AI and expert evaluations (see the sketch below)
- Detailed analysis of scoring differences
- Insights on AI evaluation reliability
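Once expert scores arrive, a first-pass comparison could look like the sketch below; all score values shown are hypothetical placeholders, not actual expert ratings.

```python
# Hypothetical placeholder scores for one report; real expert scores are pending.
ai_scores     = {"Relevance": 4, "Structure & Presentation": 4, "Understandability": 4,
                 "Completeness": 4, "Innovation": 3, "Accuracy": 4}
expert_scores = {"Relevance": 3, "Structure & Presentation": 4, "Understandability": 4,
                 "Completeness": 3, "Innovation": 3, "Accuracy": 4}

# Per-criterion difference (AI minus expert) and mean absolute difference:
# a simple first look at how closely the AI evaluation tracks expert judgment.
diffs = {c: ai_scores[c] - expert_scores[c] for c in ai_scores}
mean_abs_diff = sum(abs(d) for d in diffs.values()) / len(diffs)
print(diffs)
print(f"Mean absolute difference: {mean_abs_diff:.2f}")
```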