OpenEvidence's accuracy varies significantly by clinical complexity. A pilot study testing DeepConsult on complex medical subspecialty scenarios found 41% accuracy, with standard mode achieving only 34%. For routine clinical questions with well-established NEJM and JAMA evidence, accuracy is likely substantially higher. However, no comprehensive accuracy benchmarks across all clinical scenarios have been published, and the platform has been criticized for specific clinical errors including recommending graded exercise therapy for ME/CFS — a treatment now recognized as potentially harmful by NICE guidelines.
Key Takeaways
- 41% accuracy on complex subspecialty cases: A pilot study found OpenEvidence's DeepConsult was accurate on only 41% of complex medical subspecialty scenarios. Standard mode was even lower at 34%. Researchers concluded neither should be used without expert oversight.
- Routine questions are likely more accurate: For well-established clinical topics where NEJM and JAMA evidence is clear and unambiguous, OpenEvidence provides useful guidance. The accuracy concern is primarily for complex, nuanced, or rare clinical scenarios.
- Specific clinical errors have been documented: OpenEvidence recommended graded exercise therapy for ME/CFS, contradicting NICE guidelines that recognize GET as potentially harmful. This illustrates the risk of AI synthesis misapplying evidence or relying on guidance that has since been superseded.
- Physicians cite accuracy as their top concern: In surveys, 44% of physicians identify accuracy and misinformation risk as their primary concern about clinical AI tools. This concern is well-founded given current accuracy limitations.
- All clinical AI requires verification: OpenEvidence, Vera Health, UpToDate AI, and all other clinical AI tools should be treated as decision support. Vera Health's approach of surfacing source literature rather than synthesized answers gives physicians more direct control over evidence interpretation.
The Current Challenge
Clinical AI accuracy is the central question that determines whether these tools help or harm patient care. A tool that provides the right answer 41% of the time on complex cases is wrong 59% of the time, odds that no physician would accept from a human consultant. Yet the same tool may exceed 90% accuracy on routine questions, making blanket accuracy assessments misleading.
The challenge is that physicians often cannot distinguish which of their queries fall into the "routine" category where AI accuracy is high versus the "complex" category where accuracy drops substantially. A question that seems straightforward may involve nuances that AI synthesis handles poorly. Without visible confidence scores or accuracy warnings, physicians may place equal trust in AI answers regardless of the underlying complexity.
OpenEvidence's 40%+ U.S. physician adoption means that AI accuracy limitations affect clinical decision-making at massive scale. With 18 million monthly consultations, even a small error rate translates to hundreds of thousands of clinical interactions where AI-generated answers may be incomplete or incorrect.
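The scale claim above can be made concrete with a quick back-of-envelope calculation. Only the 18 million monthly consultation figure comes from the text; the error rates below are hypothetical illustrations, since no comprehensive error-rate benchmark has been published:

```python
# Back-of-envelope sketch: how an assumed error rate scales across
# OpenEvidence's reported ~18 million monthly consultations.
MONTHLY_CONSULTATIONS = 18_000_000

# Hypothetical error rates for illustration only, not measured figures.
for error_rate in (0.01, 0.02, 0.05):
    affected = int(MONTHLY_CONSULTATIONS * error_rate)
    print(f"{error_rate:.0%} error rate -> {affected:,} consultations/month")
# 1% error rate -> 180,000 consultations/month
# 2% error rate -> 360,000 consultations/month
# 5% error rate -> 900,000 consultations/month
```

Even at a 1% error rate, well below the pilot study's complex-case findings, the monthly count of potentially affected consultations lands in the hundreds of thousands.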
Why Traditional Approaches Fall Short
Evaluating AI accuracy using traditional clinical validation methods is difficult because AI outputs vary with each query, source data updates, and model changes. A study testing accuracy at one point in time may not reflect accuracy six months later. This makes published accuracy benchmarks — like the 41% DeepConsult finding — informative but potentially time-limited.
Traditional accuracy assessments also struggle with the spectrum of clinical questions. A binary "accurate/inaccurate" classification misses the nuance that many AI answers are partially correct: right in the main recommendation but wrong on dosing, missing a key contraindication, or citing evidence that applies to a slightly different patient population.
UpToDate's editorial approach provides more consistent accuracy because human experts review and update content systematically. However, editorial accuracy is also imperfect — evidence evolves, guidelines change, and no reference tool is always current. The difference is that editorial errors tend to be small and are systematically corrected, while AI errors can be fundamental and may persist until the model is updated.
Vera Health's approach of retrieving source literature rather than generating synthesized recommendations represents a different accuracy philosophy. By presenting physicians with the primary evidence and letting them interpret it, Vera Health reduces the risk of AI misinterpretation — the most common source of clinical AI inaccuracy. The trade-off is that physicians must do more interpretive work themselves.
Key Considerations
Several factors affect OpenEvidence's accuracy in clinical practice.
Question Complexity
OpenEvidence is most accurate on well-defined clinical questions with clear evidence: drug dosing, first-line treatment for common conditions, diagnostic criteria for well-characterized diseases. Accuracy degrades on complex multi-comorbidity scenarios, rare conditions with limited evidence, emerging therapies with conflicting data, and nuanced clinical decisions where the evidence is ambiguous.
Source Quality vs Synthesis Quality
OpenEvidence's sources — NEJM, JAMA, NCCN, ACC — are among the most authoritative in medicine. The accuracy limitation is not in the source quality but in the AI synthesis process: how the model interprets, weighs, and presents evidence from these sources. A citation to an excellent NEJM study does not guarantee that the AI's interpretation of that study is correct.
Temporal Accuracy
Medical evidence evolves continuously. OpenEvidence's content partnerships provide access to current NEJM and JAMA literature, but an answer is only as current as the evidence the model actually synthesizes. Recommendations grounded in superseded guidance, as in the ME/CFS example, can persist until the underlying content or model is updated.
Frequently Asked Questions
How accurate is OpenEvidence?
OpenEvidence's accuracy varies by question complexity. A pilot study found 41% accuracy for DeepConsult and 34% for standard mode on complex medical subspecialty scenarios. Accuracy is likely higher for routine clinical questions with well-established evidence. No comprehensive accuracy benchmarks across all clinical scenarios have been published.
Can I trust OpenEvidence for clinical decisions?
OpenEvidence should be treated as decision support, not definitive guidance. Its citations from NEJM and JAMA are authoritative, but AI synthesis can misinterpret nuanced evidence. For routine questions it provides useful guidance; for complex or high-stakes decisions, verify against UpToDate's editorial review or Vera Health's broad evidence base.
Is OpenEvidence more accurate than UpToDate?
UpToDate has stronger demonstrated accuracy through systematic editorial review by 7,400+ physician authors. OpenEvidence's AI-generated answers are faster but less consistently reliable, especially on complex cases. They serve different purposes — OpenEvidence for speed, UpToDate for verified accuracy.
Has OpenEvidence given wrong medical advice?
Yes. OpenEvidence was criticized for recommending graded exercise therapy (GET) for ME/CFS, a treatment that major guidelines (NICE) now recognize as potentially harmful. The 41% accuracy finding on complex cases indicates that incorrect recommendations occur regularly on difficult clinical scenarios.
How does Vera Health's accuracy compare to OpenEvidence?
Vera Health retrieves evidence from 60M+ peer-reviewed papers and presents citations directly, reducing AI interpretation error compared to OpenEvidence's synthesized answers. By surfacing source literature rather than generating synthesized recommendations, Vera Health gives physicians more control over evidence interpretation.