OpenAI’s ChatGPT and Anthropic’s Claude are stepping deeper into personal health, using Apple Health data to issue cardiovascular “grades” that doctors say don’t hold up.

A new test by The Washington Post found that the AI systems confidently scored heart health using wearable metrics… and got it wrong. Using years of Apple Watch data, the test examined how both chatbots translate consumer health metrics into personalized judgments and how much confidence they place in those results.

How the test was set up

The experiment was carried out by Washington Post technology columnist Geoffrey A. Fowler, who joined limited-access programs for both ChatGPT Health and Anthropic’s Claude health tools shortly after they became available.

He connected the services to his Apple Health account, granting them visibility into long-term fitness and activity records, and later allowed access to select medical information to see how each system handled deeper context.

Fowler asked straightforward, consumer-style questions about overall cardiovascular health, the kind a typical user might pose when trying to make sense of years of tracked data. The goal was to observe how the chatbots processed real-world health inputs, how much weight they assigned to different signals, and how consistently they handled that information across interactions.

Different scores, similar conclusions

When the reporter asked both systems to grade his cardiovascular health, ChatGPT returned failing marks, at one point assigning an F, while Claude issued a more moderate C-range score. Despite the difference, both chatbots framed their responses as meaningful summaries of long-term heart health rather than as limited or uncertain estimates.

Those grades conflicted with a physician’s assessment of the same data. Both chatbots based much of their evaluation on Apple Watch fitness estimates, whereas the physician, reviewing the full medical record, found no cause for concern.

Experts flag shaky assumptions

Medical experts reviewing the results were blunt in their assessment.

Eric Topol, a cardiologist and founder of the Scripps Research Translational Institute, said the chatbots’ evaluations were not grounded in clinical practice and placed undue confidence in data that physicians typically treat with caution. He described the analysis as unreliable for medical decision-making and warned that the systems were not equipped to interpret long-term health data in a clinically meaningful way.

Topol took particular issue with how both ChatGPT and Claude elevated estimated fitness metrics from consumer wearables into apparent indicators of cardiovascular risk. Measures such as VO₂ max and heart-rate variability, he noted, can vary widely depending on device, calibration, and context, and are rarely used in isolation to assess heart health.

Without the ability to account for noise, uncertainty, and real clinical outcomes, Topol said, general-purpose AI models risk producing assessments that sound authoritative but lack medical validity.

When uncertainty exposes AI’s limits

AI chatbots can perform well when medical questions have clear, structured answers, often matching expectations on standardized knowledge tests. That strength, however, does not always carry over to situations where judgment and uncertainty play a central role.

A study in BMC Medical Education found that ChatGPT and Claude both underperformed family medicine residents on diagnostic-uncertainty questions, scoring 53.3% and 57.7%, respectively, compared with 61-63% for human clinicians.

Researchers said the models frequently made logical or information errors while still responding with confidence, a pattern that helps explain why AI-generated health assessments can sound authoritative while falling short when real-world ambiguity is involved.

That dynamic aligns with the concerns raised by The Washington Post’s test. Both point to the same limitation: while AI systems can process medical information, they struggle to apply judgment in ambiguous situations where human reasoning remains critical.

