AI Evaluation Results
Automated evaluation was performed by Claude AI using the 6-criteria rubric on all 20 reports (5 models × 4 cases). These scores provide a baseline assessment of report quality.
📊 Overall scores are based on Claude AI evaluation using the standardized rubric
Overall Average Scores Across All Cases
| Model | Case I Infection | Case II Organ Damage | Case III GFR | Case IV Kidney | Overall Average |
|---|---|---|---|---|---|
| Anthropic Claude | 3.83 | 3.83 | 3.83 | 3.83 | 3.83 |
| Google Gemini | 3.50 | 3.83 | 3.83 | 3.83 | 3.75 |
| DeepSeek R1 | 2.83 | 2.83 | 2.83 | 3.83 | 3.08 |
| Grok 4 | 2.50 | 2.83 | 3.00 | 3.83 | 3.04 |
| OpenAI GPT-4 | 2.67 | 2.50 | 2.50 | 2.83 | 2.63 |
Key Insights from Cross-Case Analysis:
- Most Consistent: Anthropic Claude maintained a 3.83 average across all four cases
- Most Improved: DeepSeek R1 and Grok 4 performed markedly better on Case IV (Kidney Disease)
- Best Innovation: Google Gemini excelled in Case II with a novel "slow burn" hypothesis
- Most Variable: Grok 4 ranged from 2.50 to 3.83 across cases
- Consistent Underperformer: OpenAI GPT-4 produced incomplete analyses across all cases
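For readers who want to verify the arithmetic, the short Python sketch below (the variable names are chosen here and are not part of the project's code) recomputes each model's overall average as the unweighted mean of its four per-case scores; the results match the table above to within rounding.

```python
# Per-case average scores, copied from the table above.
case_scores = {
    "Anthropic Claude": [3.83, 3.83, 3.83, 3.83],
    "Google Gemini":    [3.50, 3.83, 3.83, 3.83],
    "DeepSeek R1":      [2.83, 2.83, 2.83, 3.83],
    "Grok 4":           [2.50, 2.83, 3.00, 3.83],
    "OpenAI GPT-4":     [2.67, 2.50, 2.50, 2.83],
}

# Overall average = unweighted mean of the four case averages.
# (OpenAI GPT-4's mean is exactly 2.625, which the table rounds up to 2.63.)
for model, scores in case_scores.items():
    overall = sum(scores) / len(scores)
    print(f"{model}: {overall:.2f}")
```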
Case I - Infection: Detailed Scoring
| Model | Relevance | Structure | Understandability | Completeness | Innovation | Accuracy | Average |
|---|---|---|---|---|---|---|---|
| Anthropic Claude | 4 | 4 | 4 | 4 | 3 | 4 | 3.83 |
| DeepSeek R1 | 3 | 3 | 3 | 3 | 2 | 3 | 2.83 |
| Google Gemini | 4 | 4 | 4 | 3 | 3 | 3 | 3.50 |
| OpenAI GPT-4 | 3 | 3 | 3 | 3 | 2 | 2 | 2.67 |
| Grok 4 | 2 | 3 | 3 | 3 | 2 | 2 | 2.50 |
Case II - Organ Damage: Detailed Scoring
| Model | Relevance | Structure | Understandability | Completeness | Innovation | Accuracy | Average |
|---|---|---|---|---|---|---|---|
| Anthropic Claude | 4 | 4 | 4 | 4 | 3 | 4 | 3.83 |
| Google Gemini | 4 | 4 | 4 | 4 | 4 | 3 | 3.83 |
| DeepSeek R1 | 3 | 3 | 3 | 3 | 2 | 3 | 2.83 |
| Grok 4 | 3 | 3 | 3 | 3 | 2 | 3 | 2.83 |
| OpenAI GPT-4 | 3 | 3 | 3 | 2 | 2 | 2 | 2.50 |
Case III - Glomerular Filtration Rate: Detailed Scoring
| Model | Relevance | Structure | Understandability | Completeness | Innovation | Accuracy | Average |
|---|---|---|---|---|---|---|---|
| Anthropic Claude | 4 | 4 | 4 | 4 | 3 | 4 | 3.83 |
| Google Gemini | 4 | 4 | 4 | 4 | 3 | 4 | 3.83 |
| Grok 4 | 3 | 3 | 3 | 3 | 3 | 3 | 3.00 |
| DeepSeek R1 | 3 | 3 | 3 | 3 | 2 | 3 | 2.83 |
| OpenAI GPT-4 | 3 | 3 | 3 | 2 | 2 | 2 | 2.50 |
Case IV - Kidney Disease Progression: Detailed Scoring
| Model | Relevance | Structure | Understandability | Completeness | Innovation | Accuracy | Average |
|---|---|---|---|---|---|---|---|
| Anthropic Claude | 4 | 4 | 4 | 4 | 3 | 4 | 3.83 |
| DeepSeek R1 | 4 | 4 | 4 | 4 | 3 | 4 | 3.83 |
| Google Gemini | 4 | 4 | 4 | 4 | 3 | 4 | 3.83 |
| Grok 4 | 4 | 4 | 4 | 4 | 3 | 4 | 3.83 |
| OpenAI GPT-4 | 3 | 3 | 3 | 3 | 2 | 3 | 2.83 |
Scoring Framework for LLM-Generated Healthcare Process Mining Reports
All reports are evaluated based on a comprehensive rubric designed specifically for epidemiological and healthcare process mining contexts. The rubric assesses six key criteria:
| Criteria | Exemplary (4) | Proficient (3) | Needs Improvement (2) | Insufficient (1) |
|---|---|---|---|---|
| Relevance | Fully addresses key clinical and contextual issues. Strong alignment with process map and report purpose. | Addresses most relevant issues; minor gaps in alignment or scope. | Covers topic broadly but misses core clinical/contextual focus or the process framework. | Misaligned with clinical context; key issues not addressed. |
| Structure & Presentation | Clear, logical organization with defined sections. Effective use of tables/figures to support interpretation. | Generally well-structured; visuals used but may lack consistency or clarity. | Structure exists but is disjointed or difficult to follow. Visual aids are underused or unclear. | No discernible structure. Unformatted text and no visual supports. |
| Understandability | Clear, concise, jargon-free language. Accessible to a broad range of stakeholders. | Mostly clear with minor technical or dense sections. | Some sections are unclear or inconsistent in tone and terminology. | Poorly written throughout; impedes understanding. |
| Completeness | Comprehensive coverage of components: interpretation steps, clinical pathways, and KPIs. | Most components included but may lack depth in some areas. | Overview is present, but omits critical interpretive elements or performance metrics. | Lacks essential content. Missing interpretation or KPI references. |
| Innovation | Demonstrates creative approaches or novel clinical insights beyond standard practice. | Shows elements of creativity or innovation; may lack full development. | Limited originality; relies on conventional methods without new perspectives. | No evidence of innovation; basic, derivative, or rote output. |
| Accuracy | Clinically and contextually accurate. Terminology and figures aligned with process map and domain standards. | Mostly accurate with minor issues that don't affect the core message. | Noticeable errors in clinical interpretation, terms, or figure use. | Major inaccuracies or misinterpretations compromising validity. |
Scoring Scale: Exemplary (4) | Proficient (3) | Needs Improvement (2) | Insufficient (1)
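As a minimal illustration, and not the project's actual evaluation code, the sketch below encodes the six criteria and the 1-4 scale and shows that a report's average is simply the unweighted mean of its six criterion scores, using Anthropic Claude's Case I scores from the detailed table as the example.

```python
CRITERIA = [
    "Relevance", "Structure & Presentation", "Understandability",
    "Completeness", "Innovation", "Accuracy",
]
SCALE = {4: "Exemplary", 3: "Proficient", 2: "Needs Improvement", 1: "Insufficient"}

def report_average(scores: dict[str, int]) -> float:
    """Average of the six criterion scores on the 1-4 scale."""
    missing = set(CRITERIA) - scores.keys()
    if missing:
        raise ValueError(f"Missing criteria: {missing}")
    return round(sum(scores[c] for c in CRITERIA) / len(CRITERIA), 2)

# Example: Anthropic Claude, Case I (Infection), from the detailed table above.
claude_case1 = {
    "Relevance": 4, "Structure & Presentation": 4, "Understandability": 4,
    "Completeness": 4, "Innovation": 3, "Accuracy": 4,
}
print(report_average(claude_case1))  # 3.83
```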
Orchestrated Reports - Multi-Model Synthesis
These orchestrated reports consolidate insights from all 5 language models (Anthropic Claude, DeepSeek R1, Google Gemini, OpenAI GPT-4, and Grok 4) to provide comprehensive, unified analyses. Each report preserves the best insights from individual models while identifying consensus findings and areas of disagreement.
Case I - Infection Progression
Consolidated analysis from 5 models examining infection progression patterns in sepsis patients.
LLM Analysis Prompts
These are the prompts used to generate reports from different language models for each case, as well as the orchestration prompt used to consolidate multiple reports.
Case I - Infection Analysis
Prompt for analyzing infection progression patterns in sepsis patients
How These Prompts Work
- Analysis Prompts: Used with each LLM (Anthropic, DeepSeek, Gemini, OpenAI, Grok) to generate individual reports for each case
- Orchestration Prompt: Used with Claude to consolidate all 5 model reports into a unified analysis
- Evaluation Prompt: Used to score each report based on the 6-criteria rubric
- All prompts emphasize clinical relevance, actionable insights, and clear communication
- Process mining data (matrices and maps) are provided alongside these prompts; a simplified end-to-end sketch of this pipeline is shown below
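The overall flow can be summarized with the rough sketch below. The helper functions (`generate_report`, `orchestrate`, `evaluate`) are hypothetical stand-ins for the real LLM API calls and are defined here only as placeholders; the model names are taken from this page.

```python
MODELS = ["Anthropic Claude", "DeepSeek R1", "Google Gemini", "OpenAI GPT-4", "Grok 4"]

# Hypothetical stand-ins for real LLM API calls; names and signatures are assumptions.
def generate_report(model: str, analysis_prompt: str, process_data: str) -> str:
    return f"[{model} report]"

def orchestrate(orchestration_prompt: str, reports: list[str]) -> str:
    return f"[orchestrated synthesis of {len(reports)} reports]"

def evaluate(evaluation_prompt: str, report: str) -> float:
    return 0.0  # placeholder for a rubric-based score

def run_case(analysis_prompt: str, orchestration_prompt: str,
             evaluation_prompt: str, process_data: str):
    # 1. Each model generates an individual report from the analysis prompt
    #    plus the case's process mining matrices and maps.
    reports = {m: generate_report(m, analysis_prompt, process_data) for m in MODELS}

    # 2. Claude consolidates the five reports into one orchestrated report,
    #    keeping consensus findings and flagging disagreements.
    orchestrated = orchestrate(orchestration_prompt, list(reports.values()))

    # 3. Each individual report is scored against the 6-criteria rubric.
    scores = {m: evaluate(evaluation_prompt, r) for m, r in reports.items()}
    return reports, orchestrated, scores
```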
Expert Evaluation Results
Human expert evaluation of the AI-generated reports by clinical and epidemiological specialists.
Expert Evaluation In Progress
Clinical and epidemiological experts are currently reviewing the AI-generated reports. Expert scores will be available once the evaluation process is complete.
Expected Timeline: Results will be updated here once expert review is finalized
What Will Be Evaluated
- Clinical Accuracy: Verification of medical interpretations and clinical relevance
- Process Mining Validity: Assessment of process analysis accuracy and pathway interpretation
- Practical Value: Evaluation of actionability and implementation feasibility
- Innovation Quality: Assessment of novel insights and research hypotheses
Expert Review Panel
The evaluation will be conducted by a multidisciplinary panel including:
- Clinical specialists in sepsis and critical care
- Epidemiologists with process mining expertise
- Healthcare quality improvement specialists
- Medical informatics researchers
Each report will be independently scored by multiple experts using the same 6-criteria rubric used for AI evaluation.
Coming Soon: AI vs Expert Comparison
Once expert evaluation is complete, this section will include:
- Score comparisons between AI and expert evaluations (see the sketch below)
- Detailed analysis of scoring differences
- Insights on AI evaluation reliability
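Once expert scores arrive, a first-pass comparison could look like the sketch below; all score values shown are hypothetical placeholders, not actual expert ratings.

```python
# Hypothetical placeholder scores for one report; real expert scores are pending.
ai_scores     = {"Relevance": 4, "Structure & Presentation": 4, "Understandability": 4,
                 "Completeness": 4, "Innovation": 3, "Accuracy": 4}
expert_scores = {"Relevance": 3, "Structure & Presentation": 4, "Understandability": 4,
                 "Completeness": 3, "Innovation": 3, "Accuracy": 4}

# Per-criterion difference (AI minus expert) and mean absolute difference:
# a simple first look at how closely the AI evaluation tracks expert judgment.
diffs = {c: ai_scores[c] - expert_scores[c] for c in ai_scores}
mean_abs_diff = sum(abs(d) for d in diffs.values()) / len(diffs)
print(diffs)
print(f"Mean absolute difference: {mean_abs_diff:.2f}")
```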