Our latest health AI research updates

At The Check Up, we shared updates on our medical LLM research and ways AI can improve maternal care, cancer treatments and tuberculosis screening.

We’ve spent the past several years researching artificial intelligence (AI) for healthcare — exploring how it can help detect diseases early, expand access to care and more. We’ve taken a “move slow and test things” approach to prove efficacy, equity, helpfulness and safety above all. Today, at our annual health event, The Check Up, we shared health AI updates including our progress on our medical large language model (LLM) research, partnerships that are bringing solutions into real-world settings, and new ways AI can help with disease detection. Here’s a look at what’s new.

Ongoing research on Med-PaLM 2, our expert-level medical LLM

Recent progress in large language models (LLMs) — AI tools that demonstrate capabilities in language understanding and generation — has opened up new ways to use AI to solve real-world problems. However, unlike many other LLM use cases, applications of AI in medicine demand the utmost attention to safety, equity and bias mitigation to protect patient well-being. To work toward AI tools that can retrieve medical knowledge, accurately answer medical questions and reason about them, we’ve invested in medical LLM research.

Last year we built Med-PaLM, a version of PaLM tuned for the medical domain. Med-PaLM was the first model to obtain a “passing score” (>60%) on U.S. medical licensing-style questions. It not only answered multiple-choice and open-ended questions accurately, but also provided rationale for its answers and evaluated its own responses.

Recently, our next iteration, Med-PaLM 2, consistently performed at an “expert” doctor level on medical exam questions, scoring 85%. That’s an improvement of 18 percentage points over Med-PaLM’s previous performance, far surpassing similar AI models.

While this is exciting progress, there’s still a lot of work to be done to make sure this technology can work in real-world settings. Our models were tested against 14 criteria — including scientific factuality, precision, medical consensus, reasoning, bias and harm — and evaluated by clinicians and non-clinicians from a range of backgrounds and countries. Through this evaluation, we found significant gaps when it comes to answering medical questions and meeting our product excellence standards. We look forward to working with researchers and the global medical community to close these gaps and understand how this technology can help improve healthcare delivery.


