“Is ChatGPT Smarter Than a PCP?”

Oct 25, 2023

Is ChatGPT smarter than a PCP?  According to this Medscape article, the answer is a resounding no.

Get DocBuddy’s take on why ChatGPT had a tough time passing the UK’s NHS’ Applied Knowledge Test and why a more domain-specific LLM would have a better chance to succeed.

Episode transcript below.

Erik Sunset: [00:00:00] Hey folks, welcome back. This is Erik Sunset, host of the DocBuddy Journal. This is episode 41, and we're recording on Wednesday, October 25th, 2023. Thanks for spending a little bit of time with us, whether this is your first time listening to the DocBuddy Journal or you're a repeat listener. We're really glad that you're here. If you haven't already, be sure you're subscribed on Apple Podcasts, Spotify, or even YouTube. That way we can be sure you always get the freshest, newest episodes of the DocBuddy Journal. And this is a busy week for everybody in the ambulatory surgery center world. Many of us are headed to Chicago to attend, to exhibit, to network, and to have a good time at the Becker's Business and Operations of ASCs Annual Conference. This is, I think, the 28th annual meeting, so it's getting pretty long in the tooth, and it's a favorite, definitely a crowd [00:01:00] pleaser. We're really excited to interact with all of the attendees and with our friends in the industry. And if you'd like to see DocBuddy up close and personal, visit us at booth number 210. I believe that's on the left-hand side of the exhibit hall. Again, stop by the DocBuddy booth, number 210, to see some of the DocBuddy crew. And with that, let's get right into it. Last week we recorded a really interesting episode that revolved around healthcare's "top deck of the Titanic" moment. It was a tweet thread that basically viewed LLMs, large language models, something like ChatGPT, as sort of a panacea that's going to bring about, or set the tone for, the next hundred years of healthcare in the US.

The more I've thought about the thesis in that tweet thread, [00:02:00] the more I find that I disagree with it. I think some of the positions, upon further reflection, are indefensible, especially when you talk about removing the responsibility of care from the provider and giving it to a software. I just don't really see how that's going to work. And, to go back to the more macro picture, to the LLM aspect of this: I certainly don't feel that LLMs can't, or won't, or won't ever be useful for healthcare. They're just going to require a really judicious application, where you can use them and where you can't. And for right now, they're probably going to need more oversight of their outputs than anyone would really want to give them, especially providers who are already crunched for time.

That point of view, that LLMs just aren't quite there [00:03:00] yet for general clinical use, is actually in agreement with a new study out of Scotland that pitted ChatGPT against the UK's national primary care examination. And that's the subject of this week's episode. Before we share a take on this article, on this study, we need to acknowledge a couple of things. First, ChatGPT is not the only large language model tool out there. Let's also acknowledge that it is not trained specifically in the medical domain. It is, however, the most broadly adopted LLM on the market today, or at least the one with the best marketing, and that makes it a solid peg point for expectations of LLMs in the medical arena. That peg point is just something you can make a standardized observation against. So [00:04:00] we're all familiar with ChatGPT to some extent, certainly. We need to go a little bit deeper on the acknowledgement that it is not trained specifically on medical data. It's got a broad knowledge base; it's been trained on all different sorts of content. And as a refresher, going back two episodes from the summer: what is an LLM? Well, it's a word predictor. You input a prompt, and even now, with ChatGPT 4, you can input an image, but you're inputting a prompt, and then the LLM decides what the most likely next word in a sentence is. We need to reiterate that these softwares don't know anything. They appear to know a lot of things, but what they're really doing is just predicting the next word in a sentence, the next sentence in a paragraph, and so on and so forth. So, all that out of the way: the article that we're going to look at is from Medscape, and we'll of course have a [00:05:00] link to that article in the show description. It's titled "Is ChatGPT Smarter Than a PCP?" PCP, of course, is a primary care provider. Let's cut right to the chase: the answer is no.
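Going back to that "word predictor" description for a moment, here's a minimal, purely illustrative sketch of the idea. This is a toy bigram lookup table, not anything resembling how ChatGPT actually works (a real LLM learns probabilities over tokens with a neural network), and the little corpus is made up for the example:

```python
# Toy illustration of next-word prediction using a bigram model.
# The corpus is invented for the example; real LLMs learn token
# probabilities with a neural network, not a frequency table.
from collections import Counter

corpus = "the patient reports knee pain the patient reports hip pain".split()

# Count which word follows which (pairs of adjacent words = bigrams).
bigrams = Counter(zip(corpus, corpus[1:]))

def predict_next(word):
    """Return the most frequent word to follow `word` in the toy corpus."""
    candidates = {nxt: n for (prev, nxt), n in bigrams.items() if prev == word}
    return max(candidates, key=candidates.get) if candidates else None

print(predict_next("patient"))  # -> reports
```

The point of the toy is just to show that "prediction" here means picking the statistically likeliest continuation, with no understanding of knees, hips, or patients involved.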
ChatGPT failed to pass the UK's national primary care exam in a new study highlighting how AI does not necessarily match human perceptions of medical complexity. ChatGPT also provided novel explanations, and "novel" is a nice way of saying that it frequently hallucinates by describing inaccurate info as if it were fact. This is according to one of the study authors, Shathar Mahmood, a fifth-year med student at the University of Cambridge School of Clinical Medicine in Cambridge, UK, who presented the findings at the Royal College of General Practitioners Annual Conference 2023. The study was also published in JMIR Medical Education earlier this [00:06:00] year. To go a little bit deeper here: the performance of AI on med school examinations has prompted much of this discussion about whether it's any good at this, often because exam performance does not reflect real-world clinical practice. The team tested ChatGPT on the Applied Knowledge Test, which you may see abbreviated as AKT. That allowed them to explore the potential and pitfalls of deploying LLMs in primary care, and to explore what further development of medical large language model applications is required.

The researchers investigated the strengths and weaknesses of ChatGPT in primary care using the Membership of the Royal College of General Practitioners Applied Knowledge Test. This computer-based, multiple-choice assessment is part of the UK specialty training to become a general practitioner, a GP, if you're across the pond; we don't really use that acronym much here. The test evaluates the knowledge behind general practice within the context of [00:07:00] the UK's National Health Service, or the NHS. This was a study done at pretty significant scale. The team of researchers entered a series of 674 questions into ChatGPT on two occasions, or two runs. By putting the questions into two separate dialogues, they hoped to avoid the influence of one dialogue on the other. To validate that the answers were correct, the ChatGPT responses were compared with the answers provided by the GP Selftest and past articles. Now, we started the examination of this study by saying that no, ChatGPT is not smarter than a primary care provider, and here's why. Overall performance of the algorithm was good across both runs, at 59.94% and 60.39%, and about four fifths, or 83%, of the questions produced the same answer on both runs. [00:08:00] But, and this is still according to the article, we're not freestyling here just yet, 17% of the answers didn't match, which is a statistically significant difference, and the overall performance of ChatGPT was 10 points lower than the average RCGP pass mark in the last few years. This informs the authors' conclusions about it not being very precise at expert-level recall and decision-making. Now, LLMs are only as good as the data that goes into training the model, right? So when we're saying that ChatGPT is not very precise at expert-level recall, I think that has more to do with the training dataset, you know, all the inputs that are put through the algorithm to give you an output.
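As a quick back-of-the-envelope check on those run-to-run numbers, here's a short sketch using only the figures quoted from the article (the arithmetic is ours, not the study's):

```python
# Sanity-check the figures quoted from the Medscape article.
run_scores = [59.94, 60.39]   # ChatGPT's overall score on each run (%)
same_answer_rate = 83.0       # % of questions answered identically across runs

mean_score = sum(run_scores) / len(run_scores)
mismatch_rate = 100.0 - same_answer_rate  # % of answers that changed between runs

print(f"mean score across runs: {mean_score:.1f}%")
print(f"answers that changed between runs: {mismatch_rate:.0f}%")
```

So the two numbers the article leans on, a mean score of roughly 60% and a 17% mismatch between runs, fall straight out of the reported figures.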
So I'm willing to cut it a little slack there, because it's not a medical tool at all. A good corollary would be using something like [00:09:00] Nuance on your desktop, or Siri, which is powered by the general Nuance engine, and it can get tripped up on some words; multi-syllable or medical terms don't do very well with Siri. But where they do well is with DocBuddy, because we have trained our models around a medical vocabulary. So it's not that the Siri on your phone isn't any good; it's just that it isn't trained to recognize things like a total knee arthroplasty. I mean, give it a try, see where you can stump it. But where ChatGPT does need to take a knock is on decision-making.

One of the smartest tech and software guys I've ever worked with burned something into my brain by saying that a computer can only do what you tell it to do. Software can only do what you ask it to do, and there is no gray area: computers operate on a yes-or-no level of understanding. So binary, right? One is yes, zero is [00:10:00] no, and it's either the presence or the absence of an electrical charge. But with medical decision-making, there's a huge amount of gray area that's not necessarily yes or no. That's where these LLMs are going to need a lot of training, a lot of fine-tuning, and a lot of research and understanding to make them better, because real life isn't always yes or no. Real life isn't binary. So, moving on. The study authors not only tested ChatGPT against the multiple-choice questions, but they also used it to create long-form responses. And this is where you're going to see that black-and-white versus gray-area Venn diagram not overlapping a whole lot for ChatGPT in this experiment. Quote: novel explanations were generated upon running a question through ChatGPT that then provided an extended answer. When the accuracy of the extended answer was checked against the correct [00:11:00] answers, no correlation was found. "ChatGPT can hallucinate answers, and there's no way a non-expert reading this could know it's incorrect," one of the study authors said, regarding the application of ChatGPT and similar algorithms to clinical practice.

Excuse me. "As they stand, AI systems will not be able to replace the healthcare professional workforce in primary care," at least, the study author continues. "I think larger and more medically specific datasets are required to improve their outputs in this field." Continuing on from another of the study's authors: the study clearly showed ChatGPT struggled to deal with the complexity of the exam questions, which are based on the primary care system. In essence, this is indicative of the human factors involved in decision-making in primary care. The Applied Knowledge Test is designed to test the knowledge required to be a generalist in the primary care setting, and as such, there are lots of nuances reflecting this within the questions. [00:12:00] The authors' conclusion is that ChatGPT may look at these in a more black-and-white way. Sound familiar? Whereas the generalist needs to be reflective of the complexities involved, the different possibilities that can present, rather than take a binary yes-or-no stance. In fact, this highlights a lot about the nature of general practice and managing uncertainty, and this is reflected in the questions asked in the exam. Being a generalist is about factoring in human emotion and human perception, as well as knowledge. So, a couple of quick notes on this exam, and this is from the NHS website: the median score sits at or above 80% for each of the domains tested as part of this exam, and the pass rate for the exam is approximately 80%. So now let's go back to a comment from the article: that the overall performance of the algorithm was good across both runs, at [00:13:00] 59.94% and 60.39%. So call it 60%. Is that actually any good at all, especially when we're talking about medical care here? The median score is 80, so that's a B-minus, right? Go back to your college or your high school days. Passing, it's fine. But the two ChatGPT runs produced a D-minus at 60%.
And depending on your professor, that 59 is probably an F; it's at least borderline F. That doesn't seem very good to me. That's certainly not close enough if I were to be a patient of this algorithm, which thankfully I'm not. But you know what I'd really like to see next? This is where it can get interesting, because ChatGPT is not meant to be a medical LLM. It has certainly taken in medical information; it's been inputted and provided medical information, but it's not purpose-built to be a medical LLM like some of the other tools that are undoubtedly in use. [00:14:00] So what I really want to see next, like, let's make this interesting: let's see an LLM trained on medical information take a test like this. Let's see it take the exact same test. Let's see if it can pass a med school entrance exam here in the US. And, you know, this is a little bit of an aside: we referenced AI, or the article referenced AI, a couple of times, but this is not artificial intelligence, because we all know LLMs are a simple word predictor, and they're very good at it. They can make it seem like they know a lot of things, but the fact is, it's a word predictor. True generalized AI may not ever be possible. Generalized AI is what you see in science fiction: a robot walking around, interacting with the world in the same way that a human would, built to the demands and the restrictions of [00:15:00] humans. Something like the movie Interstellar comes to mind, when they have the robots helping to fly the ships and be, you know, an assistant. That level of AI, according to actual data scientists, not LLM algorithm writers, and I think that's an important distinction to make, the real data scientists, the real PhDs in this type of stuff, are skeptical that general AI can ever happen.
The thing that I've heard is that you would need at least generally available, commercial-grade quantum computing, which is a whole other ball of wax. But these same experts who are skeptical of a true generalized AI ever taking form are a little more bullish on domain-specific AI, and that may actually be possible, according to this group. I think that's what we're closer to, if you can really say we're that close at all. But [00:16:00] even look at something like ChatGPT. It's a fantastic word predictor; it's very good at that. And that's a very specific domain, right? Converse with me in written form using the English language. It's pretty good at that pretty specific domain too, but that doesn't mean it can cross the boundary into being able to pass the NHS's AKT exam. So the fact that ChatGPT is this broad-based LLM, and the fact that it couldn't pass this exam, shouldn't be too much of a surprise. But I think a domain-specific medical LLM would have a much better shot at success. I would imagine it would handle the exam better. And you can see that based on the failing scores that ChatGPT got, that sort of D-minus, borderline-F level; that's not too far away from the median score if you look at it [00:17:00] percentage-wise. Now, for a human taking a test, that's not even close, in my opinion. I would be bummed out to expect an 80 and get a 60 instead. Maybe that's just me. But as we're wrapping up this episode of the DocBuddy Journal, we want to share a special thanks to the authors of the study, last names Mahmood, Thirunavukarasu, and their colleagues; apologies for any butchery of the names. Your work is appreciated and noted, and it gave us another interesting topic to dissect here on the DocBuddy Journal. And I'll say it one more time: we're heading to Chicago, taking the show on the road for one of the last times this year. Do be sure to stop by booth number 210.
Get yourself a good look at App Notes and see how it can absolutely revolutionize your operative report process by putting in place digital workflows that will make your surgeons happy, make your administrators happy, and make your rev cycle team maybe even happier than the other two, because they will not be waiting on a [00:18:00] single op report for signature ever again. Thank you for listening, on behalf of the entire DocBuddy team. Be sure you're subscribed across Apple Podcasts, Spotify, and YouTube. And until next time, I'm your host, Erik Sunset. We'll talk to you again soon.