Characterizing Clinical Text and Sublanguage: A Case Study of the VA Clinical Notes

Qing T. Zeng; Doug Redd; Guy Divita*; Samah Jarad; Cynthia Brandt; Jonathan R. Nebeker

doi:10.4172/2157-7420.S3-001

Abstract

Objective: To characterize text and sublanguage in medical records in order to better address challenges in natural language processing (NLP) tasks such as information extraction, word sense disambiguation, information retrieval, and text summarization. Text and sublanguage analysis is needed to scale up NLP development for large and diverse free-text clinical data sets.

Design: This is a quantitative descriptive study that analyzes the text and sublanguage characteristics of a very large Veterans Affairs (VA) clinical note corpus (569 million notes) to guide the customization of NLP for VA notes.

Methods: We randomly sampled 100,000 notes from the 100 most frequently appearing document types. We examined surface features and used those features to identify sublanguage groups by unsupervised clustering.

Results: Using the text features, we were able to characterize each of the 100 document types and identify 16 distinct sublanguage groups. The identified sublanguages reflect different clinical domains and types of encounters within the sample corpus. We also found considerable variance within each document type. These characteristics will facilitate the tuning and crafting of NLP tools.

Conclusion: Using a large and diverse sample of clinical text, we showed that there are a relatively large number of sublanguages, with variance both within and between document types. These findings will guide NLP development toward more customizable and generalizable solutions across medical domains and sublanguages.
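The abstract describes computing surface features per note and grouping notes by unsupervised clustering, but it does not name the features or the algorithm. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the four features in `surface_features`, the min-max scaling, and the use of k-means with farthest-first initialization are all illustrative choices, and the four example notes are invented.

```python
import re


def surface_features(note):
    """Hypothetical surface features for one note (not the paper's feature set)."""
    tokens = re.findall(r"\S+", note)
    n = len(tokens) or 1  # avoid division by zero on empty notes
    return [
        float(len(tokens)),                                    # note length in tokens
        sum(len(t) for t in tokens) / n,                       # mean token length
        sum(t.isupper() for t in tokens) / n,                  # share of all-caps tokens
        sum(any(c.isdigit() for c in t) for t in tokens) / n,  # share of tokens with digits
    ]


def min_max_scale(rows):
    """Rescale each feature column to [0, 1] so no single feature dominates."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1.0 for c in columns]  # constant column: divide by 1
    return [[(v - lo) / s for v, lo, s in zip(row, lows, spans)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=20):
    """Plain k-means; k-means itself is an assumption, chosen for brevity."""
    # Farthest-first initialization keeps the sketch deterministic.
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        centroids.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j])) for p in points]
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Invented example notes: two narrative, two templated vital-sign lines.
notes = [
    "Patient reports mild chest pain after walking two blocks and denies any shortness of breath.",
    "Patient states the cough improved over the past week and denies fever or chills today.",
    "BP 120/80 HR 72 RR 16",
    "BP 135/85 HR 88 RR 18",
]
labels = kmeans(min_max_scale([surface_features(n) for n in notes]), k=2)
print(labels)  # [0, 0, 1, 1]
```

In this toy input the narrative notes and the templated vital-sign lines fall into separate groups. On a real corpus the number of clusters would have to be chosen empirically; the 16 groups reported in the abstract come from the authors' own analysis, not from this sketch.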


Journal of Health & Medical Informatics
