Generative AI Supports Medical Interview Evaluation: AI Scoring Shows High Agreement with Instructor Evaluation
Juntendo University demonstrated the effectiveness of using generative AI for evaluating medical interviews, achieving high correlation (0.87-0.90) with human instructors and reducing scoring time by roughly 60%.
Published: April 10, 2026 at 20:00
Adjunct Lecturer Hiromizu Takahashi and Chief Professor Toshio Naito of the Department of General Medicine at Juntendo University School of Medicine examined how well AI scoring works for evaluating medical interview conversations. The study used conversation records of medical interviews that seven participants (medical students, residents, and supervising physicians) conducted with a generative AI simulated patient, a 27-year-old male with lower limb weakness, built as a custom GPT in ChatGPT. Generative AI (GPT-o1 Pro / GPT-5 Pro) and five clinical instructors scored the records on 25 items assessing patient-centered medical interview communication skills. The AI scores agreed closely with the human scores (r=0.87–0.90, CCC=0.86–0.88) and were stable under repeated scoring: the coefficient of variation was about half that of the human raters, and scoring time fell by 58 to 67.6%. Although this is a preliminary study based on a small number of participants and a single case, an evaluation model in which AI performs the primary scoring and instructors confirm the content is expected to save labor in evaluation tasks and expand opportunities for rapid, standardized feedback. Generalizability across multiple cases and institutions remains to be verified.
This paper was published online in JMIR Medical Education on February 17, 2026.
Key Points of this Research
- Compared automatic scoring by generative AI (GPT-o1 Pro/GPT-5 Pro) with scoring by five clinical instructors, using conversation records of medical interviews between an AI simulated patient and medical students, residents, and supervising physicians.
- Confirmed that AI scoring agrees closely with human scoring, with a small average point difference.
- Showed that AI scoring reduced evaluation time by approximately 60% and was highly stable in repeated scoring, which points to an evaluation model where AI performs the primary scoring and instructors verify it as a way to scale interview education and save labor.
Background
Knowledge is not the only thing required of a doctor. Equally crucial are the interviewing skills to organize a patient's complaints within a limited time, work through the differential diagnosis without oversights, and put the patient at ease. The quality of the interview directly affects diagnostic accuracy, medical safety, and patient satisfaction. In recent years, it has become increasingly important for medical education to evaluate students' interviewing skills objectively and to develop those skills according to each student's proficiency level. However, evaluation and feedback require securing instructors and simulated patients (actors) as well as grading work, which places a heavy burden on educators. It is also difficult to provide enough interview opportunities when teaching large numbers of students. In addition, scoring tends to vary and guidance tends to be delayed, which makes it hard to guarantee educational quality and secure educational opportunities. Reliable automatic evaluation would not only reduce the burden on educators but also make repeated practice and immediate feedback more widely available. Until now, however, it had not been fully verified whether AI evaluation of medical interview transcripts is as reliable as evaluation by instructors. This study therefore examined the degree of agreement, and the reduction in evaluation time, when AI and clinical instructors scored transcripts of medical interviews using the same criteria.
Content
In this study, seven people (two medical students, three residents, and two supervising physicians) conducted medical interviews with a generative AI simulated patient (a 27-year-old male with lower limb weakness) built as a custom GPT in ChatGPT. The transcripts generated automatically from the conversation logs were used for evaluation without manual correction. The interviews were scored on a 25-item scale, with a maximum total of 125 points, that evaluates patient-centered medical interview communication skills. The human evaluation was defined as the average of the scores given independently by five clinical instructors. The generative AI (GPT-o1 Pro, GPT-5 Pro) scored each conversation record five times under the same instruction conditions. The study then assessed agreement with the human evaluation and the stability of the scoring, that is, how little the scores varied when the same record was evaluated repeatedly.
The average human score was 53.7 points, and the AI averages were close at 52.1 and 53.2 points; the score trends also matched well (correlation coefficient 0.87 to 0.90). The point difference between AI and humans averaged 0.43 points (range of difference -4.87 to 5.72) and 1.54 points (-8.60 to 11.68), and no major bias was observed. As for scoring time, humans required an average of 10 minutes and 16 seconds per case, whereas the AI required 4 minutes and 19 seconds (a 58% reduction) and 3 minutes and 2 seconds.
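The agreement and stability measures named above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code. It assumes that CCC refers to Lin's concordance correlation coefficient and that the reported "range of difference" corresponds to Bland-Altman 95% limits of agreement (the reported ranges are symmetric about the mean difference, which is consistent with that reading). All function names are hypothetical, and the score lists are made-up placeholders, not data from the study.

```python
import math
from statistics import mean, stdev


def pearson_r(x, y):
    """Pearson correlation: do the two raters rank and space scores alike?"""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient (assumed meaning of CCC).

    Unlike Pearson r, it also penalizes a shift in mean or scale
    between the two raters.
    """
    n = len(x)
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    return 2 * sxy / (vx + vy + (mx - my) ** 2)


def limits_of_agreement(x, y):
    """Mean difference (x - y) and Bland-Altman 95% limits of agreement."""
    diffs = [a - b for a, b in zip(x, y)]
    d = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return d, d - 1.96 * sd, d + 1.96 * sd


def coefficient_of_variation(repeated_scores):
    """Spread of repeated scores of one record, relative to their mean."""
    return stdev(repeated_scores) / mean(repeated_scores)


def time_reduction(human_seconds, ai_seconds):
    """Fraction of scoring time saved relative to the human raters."""
    return 1 - ai_seconds / human_seconds


# Made-up total scores for seven interviews (NOT the study's data).
human_scores = [41, 47, 52, 55, 58, 61, 64]  # mean of five instructors
ai_scores = [43, 46, 50, 56, 57, 63, 62]     # mean of five AI runs

print(f"r   = {pearson_r(human_scores, ai_scores):.2f}")
print(f"CCC = {lin_ccc(human_scores, ai_scores):.2f}")
print("mean diff, lower, upper:", limits_of_agreement(ai_scores, human_scores))
print(f"CV  = {coefficient_of_variation([50, 52, 51, 49, 53]):.3f}")

# Times from the text: 10 min 16 s (human) vs 4 min 19 s (AI).
print(f"time saved = {time_reduction(10 * 60 + 16, 4 * 60 + 19):.0%}")  # 58%
```

Reporting CCC alongside r is a sensible design choice for this kind of comparison: two raters can correlate perfectly while one scores systematically higher, and only CCC and the mean difference would reveal that offset.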