Medical Record Synthesis using LLM

Nyein Chan Ko Ko
10 min read · Mar 17, 2024
Image: a robot writing on a chart (DALL-E 3)

Patient data are well protected under HIPAA, GDPR, and similar laws in many countries. Researchers, whether academic or independent, need access to openly available medical records for their machine learning, AI, and similar projects. Using synthetic medical data can be one way to stay compliant with privacy policies and laws.

This project explores medical record synthesis with the help of the GPT-3.5 language model.

In this project, prompt engineering was used to instruct the LLM to generate synthetic patients and their medical records.

Objectives

- To study the feasibility of synthetic medical record generation

- To study the correctness and relevance of the facts in the notes

- To study the whole process, from prompt writing to generating medical notes in a popular database format

- To apply synthetic medical notes to prototyping and future language model research

Summary

The first step in generating medical data is to test basic prompts in ChatGPT. Our goal is to get realistic medical notes in a semi-structured format. The following steps are used to generate medical records from a list of diseases (image 2.0).

Image 2.0: steps in creating synthetic medical records

Prompt engineering

Finding most common diseases

The prevalence of diseases can vary widely depending on where patients live, so this variable must be included to create a more accurate and relevant list of diseases. The prompt can be tailored by focusing on the most common presenting symptoms in clinics, the most prevalent types of cancer, or the most prevalent pediatric diseases, among other criteria. Where a reliable dataset of disease prevalence is available, it can serve as a starting point for generating patient records that are closer to real-world presentations.

Example prompts used in finding the most common diseases.

“List the 20 most common disease in USA”

“List 20 common reasons people visit general practitioner clinics in USA:”

Image 2.1: using ChatGPT to find the most common diseases in the USA.
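
For readers who prefer to script this step instead of pasting prompts into ChatGPT, the sketch below shows how such a prompt could be sent to GPT-3.5 through the openai Python client. The function name and prompt wording here are illustrative assumptions, not the exact code used in this project.

```python
# A minimal sketch of the disease-list step, assuming the openai Python client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def list_common_diseases(country: str, n: int = 20) -> list[str]:
    """Ask GPT-3.5 for the n most common reasons to visit a GP clinic."""
    prompt = (
        f"List {n} common reasons people visit general practitioner "
        f"clinics in {country}. Return one reason per line, no numbering."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.7,
    )
    text = response.choices[0].message.content
    return [line.strip() for line in text.splitlines() if line.strip()]


diseases_usa = list_common_diseases("USA")
```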

Generating patient information with symptoms

Once the list of diseases is obtained, it is used to generate individual patient profiles exhibiting symptoms characteristic of each disease. This step resembles a patient visiting a clinic with presenting symptoms. The generated symptoms must be controlled and accurately attributed to the corresponding diseases generated beforehand; otherwise, the large language model (LLM) can drift from its objective and produce repetitive symptoms for a single disease. Ensuring that the symptoms in each patient profile align with the respective disease keeps the LLM's output accurate and consistent.

Example prompt used in the patient generation.

“ I will provide a list of differential diseases and countries of origin.

You will answer me with names, ages, genders, race, country of origin , chief complaint(reason to visit the doctor) , and disease.

Your chief complaint must be complex with overlap symptoms and should be relevant to the country of origin. Separate each line by \n and separate each cell by |.

Do not write extra sentences. Do not index.Example format: 1. Sarah Fish |23|Female| White| USA| Fever, headache, joint pains, and vomiting|disease. My first request is “

Image 2.2: Generating patients from existing diseases.
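
A rough sketch of how this step could be automated is shown below: the instruction above goes into the system message, the disease list and country into the user message, and the pipe-separated reply is parsed into a table. The column names, helper name, and condensed instruction text are simplifying assumptions, not the project's published code.

```python
# Sketch: generate patients from a disease list and parse the "|"-separated reply.
import pandas as pd
from openai import OpenAI

client = OpenAI()

PATIENT_COLUMNS = ["name", "age", "gender", "race",
                   "country_of_origin", "chief_complaint", "disease"]

PATIENT_INSTRUCTION = (
    "I will provide a list of differential diseases and countries of origin. "
    "You will answer with name|age|gender|race|country of origin|"
    "chief complaint|disease, one patient per line, separated by '|'. "
    "Do not write extra sentences. Do not index."
)


def generate_patients(diseases: list[str], country: str) -> pd.DataFrame:
    """Ask GPT-3.5 for synthetic patients and return them as a DataFrame."""
    request = f"Diseases: {', '.join(diseases)}. Country of origin: {country}."
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": PATIENT_INSTRUCTION},
            {"role": "user", "content": request},
        ],
    )
    rows = [line.split("|") for line in
            response.choices[0].message.content.splitlines() if "|" in line]
    # Keep only well-formed rows (one cell per expected column).
    rows = [[c.strip() for c in r] for r in rows if len(r) == len(PATIENT_COLUMNS)]
    df = pd.DataFrame(rows, columns=PATIENT_COLUMNS)
    df.insert(0, "patient_id", [f"P{i:06d}" for i in range(1, len(df) + 1)])
    return df
```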

Generating full medical notes for each patient

The third stage of the process involves generating a comprehensive medical record for each patient. This step is similar to taking a patient's medical history and conducting a physical examination in a clinical setting; the goal is to capture all relevant information about the patient's health and symptoms in the electronic medical record (EMR) system. Traditionally, doctors spend significant time gathering information about a patient's health, conducting tests, and documenting their findings in clinical notes. Here, the LLM is used to generate the relevant notes automatically from its training data.

Specific instructions guide the LLM in generating accurate and relevant information for each patient, including the expected format of each field (such as a date, number, or text) and examples to ensure consistency across all patient records. Because a small number of examples is provided to steer the output, this approach is known as few-shot prompting.

Example prompt used in the patient medical note generation.

“Generate a medical note for the given patient .I will provide you the patient_id,name,age,gender,race, country of origin and chief complaint. You will answer me these columns: record_id (format : ddmmpatient_id)| date_of_visit(dd/mm/yyyy) |patient_id |patient_name |age|gender|race|country_of_origin| chief_complaint| history_of_present_illness| past_medical_and_surgical_history|immunization_history |allergy_history| currently_taking_drugs |social_personal_history |investigation_records| lab_tests_and_results| differential_diagnosis| treatment_and_management_plan| additional_notes. Each column must be separated by “|”. Each entry must be similar to a natural medical note can user + or — for present and absent. currently taking drugs must include name,dosage,form,frequency and duration.Do not write extra sentences. If more information is required to add , put in additional notes.If there is nothing to generate ,write Null. The medical note must be as complete as possible and must be relevant to the chief complaint. History must contain all relevant additional information related to the patient chief complaint.

Example format is : ‘record_id: date_of_visit+patient_id’|’date_of_visit: dd/mm/yyyy’| ‘patient_id: P000003’| ‘patient_name: Aung Soe’| ‘age: 55’|’gender: Male’| ‘race: Asian’| ‘country_of_origin: Myanmar’|’chief_complaint: Extreme thirst, frequent urination, numbness or tingling of feet.’|’history_of_present_illness: explain here’|’past_medical_and_surgical_history: explain here’| ‘immunization_history: explain here’| ‘allergy_history: explain here’| ‘currently_taking_drugs: ‘| ‘social_personal_history: explain here’| ‘investigation_records: explain here’| ‘lab_tests_and_results: explain here’, ‘differential_diagnosis: explain here’ , ‘treatment_and_management_plan: explain here’, ‘additional_notes: explain here.’

My first request is “

The following image illustrates how ChatGPT responds when given the previous prompt as the instruction and a second prompt containing the patient information. The model learns from the example and generates a record by referencing the input.

Image 2.3: Generating full medical notes from patients.
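
Below is a hedged sketch of how the note-generation call might look in code: the long instruction above (including the example record) serves as the system message and each patient's details as the user message. NOTE_INSTRUCTION is a placeholder for the full prompt, and the function is an illustration rather than the actual implementation.

```python
# Sketch: few-shot note generation for one patient.
from openai import OpenAI

client = OpenAI()

# Placeholder: paste the full note-generation prompt shown above here.
NOTE_INSTRUCTION = "Generate a medical note for the given patient. ..."


def generate_record(patient_row: dict) -> str:
    """Return one "|"-separated medical note for a single patient."""
    patient_info = (
        f"patient_id: {patient_row['patient_id']}, name: {patient_row['name']}, "
        f"age: {patient_row['age']}, gender: {patient_row['gender']}, "
        f"race: {patient_row['race']}, "
        f"country of origin: {patient_row['country_of_origin']}, "
        f"chief complaint: {patient_row['chief_complaint']}"
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": NOTE_INSTRUCTION},
            {"role": "user", "content": patient_info},
        ],
        temperature=0.7,
    )
    return response.choices[0].message.content
```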

These prompts had to be modified several times until the desired result was obtained. The following evaluation criteria were used to assess the LLM's performance; a minimal format check for criterion 2 is sketched after the list.

  1. Relevance of the history, tests, diagnosis, and treatment plan to the given symptoms
  2. Consistency of the format
  3. Accuracy of the information provided
  4. Clarity and coherence of the generated output
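
As a concrete illustration of criterion 2, a generated note can be accepted only when it splits into the expected number of pipe-separated fields. This helper is an assumed, minimal version of such a check; the real script may validate differently.

```python
# The note-generation prompt above lists 20 columns (record_id ... additional_notes).
EXPECTED_FIELDS = 20


def is_well_formed(note: str) -> bool:
    """Accept a generated note only if it has the expected number of '|' fields."""
    return len(note.split("|")) == EXPECTED_FIELDS
```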

Manually typing the prompts into ChatGPT and copying the results is inefficient. Besides, ChatGPT has limited conversation memory, so it cannot keep track of long conversations. A Python script is therefore used to automate the whole process.

Python program to automate the whole process

I wrote the following functions: generate_patients, save_patients, read_patients, generate_records, and save_records.

The generated records are separated by “|” and saved in a data frame. An error-control mechanism discards incorrectly formatted or failed generations and restarts from the last successful attempt. The detailed process is shown in image 2.4, and a sketch of the loop follows the diagram.

Image 2.4: sequence diagram to generate synthetic medical record
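
The sketch below approximates the loop in image 2.4 under the assumptions introduced earlier (generate_record and is_well_formed from the previous sketches): each patient's note is generated, validated, retried on failure, and the surviving records are saved. It illustrates the described flow, not the repository code.

```python
# Sketch of the automation loop: generate, validate, retry, and save.
import time
import pandas as pd


def generate_and_save_records(patients: pd.DataFrame,
                              out_path: str = "records.csv",
                              max_retries: int = 3) -> pd.DataFrame:
    """Generate one note per patient, keep only well-formed ones, save to CSV."""
    records = []
    for _, patient in patients.iterrows():
        for _attempt in range(max_retries):
            try:
                # generate_record and is_well_formed come from the sketches above.
                note = generate_record(patient.to_dict())
                if is_well_formed(note):
                    records.append([cell.strip() for cell in note.split("|")])
                    break  # success: move on to the next patient
            except Exception:
                time.sleep(5)  # back off, then retry from the last successful attempt
    df = pd.DataFrame(records)  # column names can be assigned from the prompt's field list
    df.to_csv(out_path, index=False)
    return df
```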

Observations

After the Python script was run, data for 1,079 patients were generated with the GPT-3.5 language model. Among the 1,079 records, 60.61% originate from the USA and 39.39% from Myanmar. Male patients make up 50.32% and female patients 49.68%, so the gender ratio can be considered balanced.

Image 2.5: gender and country distribution across generated patients.

According to the data, the youngest patient is 5 years old and the oldest is 79. The largest age group is 25 to 34 years old, and a pattern can be observed in the age distribution: the LLM appears to have generated more patients with ages divisible by 5, producing sudden spikes at those ages. This pattern may be useful for distinguishing synthetic records from real patient records; a quick way to check it is sketched after image 2.6.

Image 2.6: age and age group distribution in the patients.
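
The age pattern can be checked with a few lines of pandas, assuming the records are loaded into a DataFrame with an integer age column. This is an illustrative analysis, not the project's original notebook.

```python
# Sketch: age-group counts and the share of ages divisible by 5.
import pandas as pd


def age_summary(records: pd.DataFrame) -> None:
    """Print the 10-year age-group distribution and the divisible-by-5 share."""
    ages = records["age"].astype(int)
    groups = pd.cut(ages, bins=range(0, 91, 10), right=False)  # [0,10), [10,20), ...
    print(groups.value_counts().sort_index())
    share = (ages % 5 == 0).mean()
    print(f"Share of ages divisible by 5: {share:.1%}")  # ~20% expected if ages were uniform
```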

Additionally, the relevance of the differential diagnoses to the age groups was examined. Acute and chronic infections are more prevalent in patients younger than 34, while chronic and non-communicable diseases are more common in those aged 35 and above. Patients over 65 show a higher incidence of stroke, Alzheimer's disease, and arthritis-like conditions. Women's and children's diseases were not considered during the prompt-design stage, so they do not appear in the generated records for those populations.

Image 2.7: Differential diagnosis seen in the 25–34 age group.

GPT-3.5 proved capable of generating medical notes from a given prompt. However, when the format and instructions are unclear, the model may produce notes in an arbitrary format. The generated notes include relevant investigations and lab tests, but the corresponding results are usually missing from the answer.

It was also observed that the same symptoms presented by different patients could produce entirely different medical histories and differential diagnoses. To examine this, tests were run on two patients, David and Aye Chan, using the following parameters.

Image 2.8: two patients with the same symptoms, different demographics

GPT-3.5 generated the following history of present illness for Aye Chan:

Aye Chan, a 38-year-old female of Asian ethnicity and from Myanmar, presented with a complaint of high-grade fever, chills, and headache for the past 3 days. She reports a temperature of 39.5°C, which she managed with acetaminophen. She denies any cough, chest pain, shortness of breath, sore throat, nausea, vomiting, diarrhea, abdominal pain, or urinary symptoms. No recent travel or sick contacts.

GPT-3.5 generated the following history of present illness for David:

David, a 32-year-old male of White ethnicity and from the USA, presented with a complaint of high-grade fever, chills, and headache for the past 2 days. He reports a temperature of 39.8°C, which he managed with acetaminophen. He also reports mild sore throat, cough with yellow sputum production, and nasal congestion. No chest pain, shortness of breath, nausea, vomiting, diarrhea, abdominal pain, or urinary symptoms. No recent travel or sick contacts.

Image 2.9: Comparison table between patient Aye Chan and patient David.

GPT-3.5 explains that the prevalence of specific infectious diseases in the region and patient demographics can affect differential diagnosis. Aye Chan is from Myanmar, where malaria, dengue fever, and typhoid fever are endemic, leading to their inclusion in the differential diagnoses. David is from the USA, where these infectious diseases are not as prevalent and, therefore, not included in the differential diagnoses. Additionally, David’s differential diagnoses include seasonal respiratory infections, which are more common in the winter months in the USA.

Concerns

The accuracy and reliability of AI-generated medical records have been a topic of discussion in the healthcare industry. While these synthetic records typically read as polished and complete, they may not reflect the reality of emergency situations in hospitals and clinics. Doctors and medical personnel in these settings often prioritize speed over thoroughness, resorting to shorthand and abbreviations to document their notes quickly. Additionally, real-life cases are often complicated and multifaceted, with overlapping symptoms that can lead to misdiagnosis. In contrast, synthetic cases used for training AI models tend to be more straightforward and less complex.

Another concern with AI-generated medical records is the potential for bias and discrimination in the training data. For example, if the data used to train an AI language model only includes cases from certain demographics or regions, the resulting model may be inaccurate or incomplete. When using AI to generate lists of common diseases for a specific region, it is crucial to ensure that the underlying data is diverse and representative. One way to achieve this is to gather information from reliable sources such as the World Health Organization (WHO) and populate the patient list based on factual information.

In a recent test case, ChatGPT was asked to generate a list of the most common presenting complaints in general clinics in Myanmar. The generated list was irrelevant, and the model later apologized for the lack of available data. This highlights the importance of using trustworthy sources and datasets when training AI models for medical applications.

Uses of synthetic patient data

Synthetic patient data can be utilized in many ways.

  1. To train an open-source language model to generate better output.
  2. To improve NLP tasks such as named entity recognition, text retrieval, and classification.
  3. To build scenarios for medical education and research.
  4. As privacy-compliant health data for studying the potential of LLMs (e.g., summarization, tailored treatment plans).
  5. To be used in healthcare prototype software.

Conclusion

Language models such as GPT-3.5, the model behind ChatGPT, have opened up new possibilities for medical record synthesis. Even more advanced and powerful models are already available: recent research and news reports indicate that GPT-4 can pass the US medical licensing exam and provide responses rated as more empathetic than those of human doctors. This is a remarkable achievement that underscores the potential of language models in the medical field.

This work was part of my capstone project. The code is now available on GitHub.


Nyein Chan Ko Ko

Medical doctor, seasoned programmer, data enthusiast. Interested in AI, blockchain, politics, and health. MB,BS, MSc Healthcare Informatics