設計具備對話感知能力的醫療AI

Hacker News

大約 1 個月前

AI 生成摘要

本文提出一種創新的醫療AI建構方法，將醫病對話納入訓練，超越純粹的影像診斷。文章概述了利用兩個視覺語言模型（VLM）代理人來合成對話數據的方法，以建立更豐富的數據集來微調醫療AI模型。

Building Medical AI with Synthetic Dialogues — Jeevan Kumar

Jeevan Kumar

"I have noticed that worrying is like praying for what you don't want to happen." - Robert
Downey Jr

Designing a Dialogue-Aware Medical AI

Most current AI models detect diseases by looking at an image like an x-Ray or skin lesion and making a prediction.

However in the real world , doctors dont just look at a patient. They ask questions like "Does it Itch", "How long you had it" etc...

Patient details are crucial inputs that purely visual AI models ignore.

So a better approach would be to create a dataset containing an image along with some patient- doctor conversation dialogues and finetuning a VLM model on it.

A VLM is a model that accepts images and text and provides a text output.

First we need a dastaset. Instead of paying thousands of doctors and patients to manually record their conversations , which is both expensive and has a lot privacy problems, we can just generate the data synthetically.

The authors of this paper created two agents, a doctor agent and a patient agent. both are VLMS to simulate the conversations.

the Agent 1 is the doctor VLM . It looks at the image , and asks follow up questions. Its goal is to gather enogh informations to make a proper diagnosis

The agent 2 is the patientVLM. It knows the ground truth of the disease and symptomps associated with. Its goal is to answer the Doc Agent's questions , based on the symptomp profile

Lets see the system desgin for creation this dataset

As input , both agents get a medical image of a skin lesion as input. The system collects the symptom profile of that disease and gives this data as context to the patientVLM.

and the conversations happen like this :

So the outcome is that this generates a history of conversations paired witht hat image.

So now we have a massive dataset of medical images + dialogues.

And now they take this doctor VLM and finetune it with the new dataset.

The model learns that visual features (what it sees) + dialogue context (what the patient says) = Better Diagnosis.

When the system is deployed for a real human patient:

Lets design this in real life:

first we need to collect a dataset containing various images of diseases + the diseasename.we can use publicily available dataset like say SkinCon.

2- An image alone doesn't tell you if a rash is "itchy" or "painful."

So we need to map the syptomps associated with it we can make an ai too lookup textbooks/databases to map every disease to its common symptoms.

So If the image is "Melanoma," the cheat sheet includes: asymmetry, irregular borders, recent growth, bleeding.)

Now that we have curated the initial dataset, we need to make two agents.

We can use VLM like gpt-4V or or any open source models like LLAVA (which is a vlm trained on medical data) and give them specific system instructions to act like the agents.

So for the doctor agent, we give a system prompt like this : "You are a dermatologist. Look at the image. Ask a question to help distinguish between possible diseases. Focus on visual details or physical sensations (like pain or itch)." along with the image.

Now to create the the patient agent, we give it the image + this sysem prompt :"You are a patient. You are experiencing the symptoms listed here: [List]. Do NOT reveal the name of your disease. Answer the doctor's questions truthfully based on these symptoms and the image provided."" We input the symtomps on the prompt itself.

and we let the conversation happen and colelct the data and train the doctor agent on this.

Link to paper: https://www.arxiv.org/abs/2601.10945

Note: I co-wrote this with Chatgpt and Gemini 3 Pro.

Blog by Jeevan

Contact: [email protected]