      Unified Medical Language System term occurrences in clinical notes: a large-scale corpus analysis

      research-article

          Abstract

          Objective

          To characterise empirical instances of Unified Medical Language System (UMLS) Metathesaurus term strings in a large clinical corpus, and to illustrate what types of term characteristics are generalisable across data sources.

          Design

          Based on the occurrences of UMLS terms in a 51 million document corpus of Mayo Clinic clinical notes, this study computes statistics about the terms' string attributes, source terminologies, semantic types and syntactic categories. Term occurrences in 2010 i2b2/VA text were also mapped; eight example filters were designed from the Mayo-based statistics and applied to i2b2/VA data.

          Results

          For the corpus analysis, negligible numbers of mapped terms in the Mayo corpus had over six words or 55 characters. Of source terminologies in the UMLS, the Consumer Health Vocabulary and Systematized Nomenclature of Medicine—Clinical Terms (SNOMED-CT) had the best coverage in Mayo clinical notes at 106 426 and 94 788 unique terms, respectively. Of 15 semantic groups in the UMLS, seven groups accounted for 92.08% of term occurrences in Mayo data. Syntactically, over 90% of matched terms were in noun phrases. For the cross-institutional analysis, using five example filters on i2b2/VA data reduces the actual lexicon to 19.13% of the size of the UMLS and only sees a 2% reduction in matched terms.
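
The length thresholds reported above suggest a simple lexicon filter. The following is a minimal sketch, not the authors' implementation: it assumes whitespace-delimited words, treats the six-word and 55-character figures as inclusive upper bounds, and uses hypothetical function names and made-up example strings that are not drawn from the UMLS.

```python
from typing import Iterable, List

# Thresholds taken from the abstract: negligible numbers of mapped terms
# had over six words or over 55 characters.
MAX_WORDS = 6
MAX_CHARS = 55


def within_length_limits(term: str,
                         max_words: int = MAX_WORDS,
                         max_chars: int = MAX_CHARS) -> bool:
    """Return True if the term is at most max_words words and max_chars characters.

    Assumption: words are separated by whitespace; real tokenisation may differ.
    """
    return len(term) <= max_chars and len(term.split()) <= max_words


def filter_lexicon(terms: Iterable[str]) -> List[str]:
    """Keep only the terms that pass the length filter, preserving order."""
    return [t for t in terms if within_length_limits(t)]


def retained_fraction(original_count: int, filtered_count: int) -> float:
    """Fraction of the original lexicon that remains after filtering."""
    if original_count <= 0:
        raise ValueError("original_count must be positive")
    return filtered_count / original_count


if __name__ == "__main__":
    # Illustrative strings only; they are not taken from the Metathesaurus.
    toy_lexicon = [
        "pain",
        "myocardial infarction",
        "closed fracture of shaft of femur with delayed healing of bone",
    ]
    kept = filter_lexicon(toy_lexicon)
    print(kept)
    print(retained_fraction(len(toy_lexicon), len(kept)))
```

The same `retained_fraction` ratio, applied to lexicon size and to matched-term counts separately, is one way to express a trade-off like the one reported for the i2b2/VA data.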

          Conclusion

          The corpus statistics presented here are instructive for building lexicons from the UMLS. Features intrinsic to Metathesaurus terms (well formedness, length and language) generalise easily across clinical institutions, but term frequencies should be adapted with caution. The semantic groups of mapped terms may differ slightly from institution to institution, but they differ greatly when moving to the biomedical literature domain.

                Author and article information

                Journal
                J Am Med Inform Assoc
                Journal IDs: jamia, amiajnl
                Journal of the American Medical Informatics Association : JAMIA
                BMJ Group (BMA House, Tavistock Square, London, WC1H 9JR)
                ISSN: 1067-5027 (print), 1527-974X (electronic)
                Published online: 4 April 2012
                Issue date: June 2012
                Volume: 19
                Issue: e1
                Pages: e149-e156
                Affiliations
                [1] Division of Biomedical Statistics and Informatics, Mayo Clinic, Rochester, MN, USA
                [2] Stanford Center for Biomedical Informatics Research, Stanford, CA, USA
                Author notes
                Correspondence to Dr Stephen T Wu, Department of Health Sciences Research, Mayo Clinic, 200 First Street SW, Rochester, MN 55905, USA; wu.stephen@mayo.edu
                Article
                Publisher ID: amiajnl-2011-000744
                DOI: 10.1136/amiajnl-2011-000744
                PMCID: PMC3392861
                PMID: 22493050
                © 2012, Published by the BMJ Publishing Group Limited. For permission to use (where not already granted under a licence) please go to http://group.bmj.com/group/rights-licensing/permissions.

                This is an open-access article distributed under the terms of the Creative Commons Attribution Non-commercial License, which permits use, distribution, and reproduction in any medium, provided the original work is properly cited, the use is non commercial and is otherwise in compliance with the license. See: http://creativecommons.org/licenses/by-nc/2.0/ and http://creativecommons.org/licenses/by-nc/2.0/legalcode.

                History
                Received: 2 December 2011
                Accepted: 12 March 2012
                Categories
                Research and Applications
                1506
                FOCUS on clinical research informatics

                Bioinformatics & Computational biology