Word Sense Disambiguation (WSD) Test Collections


Word sense ambiguity is a pervasive characteristic of natural language. For example, the word "cold" has several senses and may refer to a disease, a temperature sensation, or an environmental condition. The specific sense intended is determined by the textual context in which an instance of the ambiguous word appears. In "I am taking aspirin for my cold" the disease sense is intended; in "Let's go inside, I'm cold" the temperature sensation sense is meant; and "It's cold today, only 2 degrees" implies the environmental condition sense.
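The way context selects a sense can be illustrated with a minimal sketch. It assumes a hand-written list of cue words per sense and picks the sense whose cues overlap most with the sentence. The cue lists and the function names (`tokenize`, `guess_sense`) are hypothetical and serve only as illustration; this is not the method used to build or evaluate the test collections.

```python
import re

# Hypothetical cue words for each sense of "cold". These lists are
# illustrative assumptions, not drawn from any real lexicon.
SENSE_CUES = {
    "disease": {"aspirin", "cough", "flu", "medicine"},
    "temperature sensation": {"inside", "feel", "shivering", "blanket"},
    "environmental condition": {"today", "degrees", "weather", "outside"},
}


def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))


def guess_sense(sentence):
    """Return the sense whose cue words overlap most with the sentence."""
    tokens = tokenize(sentence)
    return max(SENSE_CUES, key=lambda sense: len(SENSE_CUES[sense] & tokens))


for example in (
    "I am taking aspirin for my cold",
    "Let's go inside, I'm cold",
    "It's cold today, only 2 degrees",
):
    print(example, "->", guess_sense(example))
```

Real systems replace the hand-written cue lists with statistics learned from labeled instances, which is the kind of data a test collection provides.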

It is convenient to refer to an ambiguous word along with all of its individual senses as an ambiguity case. Further, we call each textual occurrence of the ambiguity an instance. In the UMLS Metathesaurus, a large number of ambiguity cases are represented by separate concepts, each of which refers to one of the individual senses.
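The relationship between an ambiguity case and its instances can be sketched as a small data model. The class names, field names, and sense identifiers below are placeholders chosen for illustration; in particular, the identifiers are not real Metathesaurus concept identifiers, and the layout does not reflect the file format of either test collection.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class AmbiguityCase:
    """An ambiguous word together with all of its individual senses."""
    word: str
    senses: Tuple[str, ...]  # placeholder sense identifiers


@dataclass(frozen=True)
class Instance:
    """One textual occurrence of an ambiguity case."""
    case: AmbiguityCase
    text: str
    sense: Optional[str] = None  # None when no sense has been assigned


def count_by_sense(instances):
    """Count instances per (word, sense) pair."""
    counts = defaultdict(int)
    for inst in instances:
        counts[(inst.case.word, inst.sense)] += 1
    return dict(counts)


# Placeholder identifiers, not real concept identifiers.
COLD = AmbiguityCase("cold", ("SENSE_DISEASE", "SENSE_SENSATION", "SENSE_ENVIRONMENT"))

instances = [
    Instance(COLD, "I am taking aspirin for my cold", "SENSE_DISEASE"),
    Instance(COLD, "It's cold today, only 2 degrees", "SENSE_ENVIRONMENT"),
]
print(count_by_sense(instances))
```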

To support research on the automatic resolution of word sense ambiguity using natural language processing techniques, we manually constructed our original WSD Test Collection in 1999. Working with several colleagues, we have since updated that collection in several ways to increase its usability.

More recently, we have created a second, automatically constructed WSD test collection called the MSH WSD Test Collection that is larger and has broader semantic coverage than the original test collection. See the table below for further information about both collections.

To use either of the WSD test collections, you must have accepted the terms of the UMLS Metathesaurus License Agreement, which requires you to respect the copyrights of the constituent vocabularies and to file a brief annual report on your use of the UMLS. You must also have activated a UMLS Terminology Services (UTS) account.


Test Collection: MSH WSD Test Collection
Specifics:
  • 2009AB UMLS
  • 203 Ambiguous Words
  • 37,888 Ambiguity Cases
  • 37,090 MEDLINE Citations
  • 2010 MEDLINE Baseline
  • Requires UMLS Terminology Services (UTS) account
Description: This test collection was constructed automatically, without manual curation, by extracting instances of ambiguous terms from MEDLINE using the UMLS Metathesaurus and the manual MeSH® indexing of MEDLINE as resources. The resulting data set contains both biomedical terms and abbreviations.

Test Collection: Original WSD Test Collection
Specifics:
  • 1999 UMLS
  • 50 Ambiguous Words
  • 5,000 Ambiguity Cases
  • 5,000 MEDLINE Citations
  • 1998 MEDLINE Baseline
  • Requires UMLS Terminology Services (UTS) account
Description: This test collection was constructed using citations from the 1998 MEDLINE Baseline where the ambiguities were resolved by hand. Evaluators were asked to examine instances of an ambiguous word and determine the sense intended by selecting the Metathesaurus concept (if any) that best represents the meaning of that sense.

In June 2010, Bridget T. McInnes and Mark Stevenson developed a data set linking the WSD ambiguity choices to 2007AB UMLS Concept Unique Identifiers (CUIs). This data set can be accessed via our Collaborations Page.

A small utility package called nlm2sval2, which converts the WSD Test Collection into the Senseval-2 lexical sample format, was developed by Dr. Ted Pedersen and can be accessed via our Collaborations Page.

In May 2004, we created a version of this test collection that uses the PubMed Identifier (PMID) instead of the earlier form of unique identifier (MEDLINE UI). Direct link to PMID Original WSD Test Collection (Restricted).
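The description of the MSH WSD Test Collection above says only that instances are extracted automatically and that the manual MeSH indexing of MEDLINE is used as a resource. One way such labeling could work is sketched below, under a loudly stated assumption: a citation that mentions the ambiguous term is kept only when exactly one of the MeSH headings associated with the term's senses was assigned to it. The function name, the citation records, and the sense-to-heading mapping are all hypothetical, and the rule itself is an assumption rather than a description of the actual construction pipeline.

```python
def label_by_mesh(term, sense_to_heading, citations):
    """Label citations that mention `term` using their MeSH headings.

    Assumed rule (not confirmed by the source): keep a citation only
    when exactly one candidate heading appears among its MeSH headings.
    """
    labeled = []
    for citation in citations:
        # Simplification: plain substring match, no tokenization.
        if term.lower() not in citation["text"].lower():
            continue
        matches = [
            sense
            for sense, heading in sense_to_heading.items()
            if heading in citation["mesh"]
        ]
        if len(matches) == 1:
            labeled.append((citation["id"], matches[0]))
    return labeled


# Illustrative mapping from senses of "cold" to MeSH headings.
SENSES = {
    "disease": "Common Cold",
    "environmental condition": "Cold Temperature",
}

# Made-up citation records with made-up identifiers.
CITATIONS = [
    {"id": "c1", "text": "Zinc lozenges for the common cold",
     "mesh": {"Common Cold", "Zinc"}},
    {"id": "c2", "text": "Cold exposure and metabolism in mice",
     "mesh": {"Cold Temperature", "Mice"}},
    {"id": "c3", "text": "Cold weather and the common cold",
     "mesh": {"Common Cold", "Cold Temperature"}},
    {"id": "c4", "text": "Aspirin dosing in adults", "mesh": {"Aspirin"}},
]

print(label_by_mesh("cold", SENSES, CITATIONS))
# → [('c1', 'disease'), ('c2', 'environmental condition')]
```

Citation c3 is skipped because both candidate headings were assigned, so the indexing does not single out one sense; c4 is skipped because it does not mention the term.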


Last Modified: October 18, 2012