Structured Extraction of Real World Medical Knowledge using LLMs for Summarization and Search

Abstract

Creation and curation of knowledge graphs at scale can substantially accelerate the discovery, matching, and analysis of diseases in real-world data. While disease ontologies are useful for the annotation, integration, and analysis of biological data, codified disease and procedure categories (e.g., SNOMED-CT, ICD10, CPT) rarely capture all of the nuances of a patient's condition and, in the case of rare diseases, may not exist at all. Furthermore, data sources and publications use multiple disease definitions, each with its own structure and hierarchy. Mapping between ontologies, finding disease clusters, and building a representation of the chosen disease area are resource-intensive tasks that often require significant human capital. We propose the creation and curation of a patient knowledge graph using large language model (LLM) extraction techniques. To scale in volume, knowledge graphs built with generalized language capability allow data to be extracted using natural language rather than being constrained by the exact terminology or hierarchy of existing ontologies. We develop a method of mapping extracted entities back to existing ontologies such as MeSH, SNOMED-CT, RxNorm, and HPO, grounding them in entities known to the medical community.

We have access to one of the largest ambulatory care EHR databases in the country. To demonstrate the effectiveness of our method, we benchmark our extraction on a test set of over 33.6M unique patients in the area of patient search. In this case study, we perform a patient search for a rare disease: Dravet syndrome. Dravet syndrome was codified as an ICD10-recognized disease in October 2020. In the following research, we describe our method for constructing patient-specific knowledge graphs and subsequently searching for patients who exhibit symptoms of a particular disease. Using patients with confirmed ICD10 codes for Dravet syndrome as our ground truth, we apply our LLM-based entity extraction techniques and formalize an algorithmic way of characterizing patients in grounded ontology terms to assist in mapping patients to specific diseases. Finally, we present the results of a real-world discovery method on Beta-propeller protein-associated neurodegeneration (BPAN), identifying patients with a rare disease for which no ground truth currently exists.
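
As a rough illustration of the grounding and patient-search idea described above, the following minimal Python sketch maps free-text entity mentions to ontology concept identifiers and ranks candidate patients by their overlap with a profile built from reference (ground-truth) patients. It is not the method used in this work: the toy ontology, the concept IDs (`TOY:0001`, etc.), the synonym table, the Jaccard scoring, and all function names are assumptions made purely for illustration, and the LLM extraction step itself is assumed to have already produced the mention strings.

```python
from collections import Counter

# Hypothetical toy ontology: surface form -> concept ID.
# These IDs are made up and do not correspond to MeSH, SNOMED-CT, or HPO codes.
TOY_ONTOLOGY = {
    "seizure": "TOY:0001",
    "febrile seizure": "TOY:0002",
    "developmental delay": "TOY:0003",
    "ataxia": "TOY:0004",
}

# Hypothetical synonym table standing in for a real normalization step.
SYNONYMS = {
    "seizures": "seizure",
    "delayed development": "developmental delay",
}


def normalize(mention):
    """Lowercase, collapse whitespace, and apply the toy synonym table."""
    text = " ".join(mention.lower().split())
    return SYNONYMS.get(text, text)


def ground(mentions):
    """Map extracted mentions to concept IDs, dropping anything unmatched."""
    grounded = set()
    for mention in mentions:
        concept_id = TOY_ONTOLOGY.get(normalize(mention))
        if concept_id is not None:
            grounded.add(concept_id)
    return grounded


def build_profile(reference_patients, min_fraction=0.5):
    """Keep concepts seen in at least min_fraction of the reference patients."""
    counts = Counter(cid for concepts in reference_patients for cid in concepts)
    threshold = min_fraction * len(reference_patients)
    return {cid for cid, n in counts.items() if n >= threshold}


def jaccard(a, b):
    """Jaccard similarity of two concept sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_candidates(profile, candidates):
    """Return (patient_id, score) pairs sorted by descending similarity."""
    scored = [(pid, jaccard(profile, concepts)) for pid, concepts in candidates.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

In this sketch, patients with a confirmed diagnosis code would supply the `reference_patients` sets, and the ranked list would surface undiagnosed candidates whose grounded concepts resemble that profile. A real system would likely replace the exact-match lookup with fuzzy or embedding-based matching against the full ontologies.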