Fact Sheet
SPECIALIST Lexicon


Introduction

The SPECIALIST Lexicon provides the lexical information needed for the SPECIALIST Natural Language Processing (NLP) System. It includes commonly occurring English words and biomedical vocabulary. The Lexicon entry for each word or term records the syntactic, morphological, and orthographic information used with associated NLP tools.

The SPECIALIST Lexicon is one of the three Unified Medical Language System® (UMLS®) components, alongside the Metathesaurus® and the Semantic Network. The National Library of Medicine (NLM) updates the UMLS twice a year, in May and November.

Scope and Content

The Lexicon consists of a set of lexical entries, with one entry for each spelling or set of spelling variants in a particular part of speech. A multi-word term counts as a lexical item if it appears as a term in general English or medical dictionaries, or in medical thesauri such as MeSH®. The Lexicon also supports multi-word expansions of generally used acronyms and abbreviations.

A variety of sources contribute to the set of words selected for lexical coding. Approximately 20,000 words form the core of the words entered. This core originates from the UMLS Test Collection of MEDLINE® abstracts, together with words that appear in both the UMLS Metathesaurus and Dorland's Illustrated Medical Dictionary. The core also includes words from the general English vocabulary, the 10,000 most frequently used words listed in The American Heritage Word Frequency Book, and the list of 2,000 words used in definitions in Longman's Dictionary of Contemporary English. Since the majority of the words selected for coding are nouns, we have purposely included verbs and adjectives. We did so by identifying verbs in MEDLINE records, by using the Computer Usable Oxford Advanced Learner's Dictionary, and by identifying potential adjectives from Dorland's Illustrated Medical Dictionary using heuristics developed by McCray and Srinivasan (1990).

The process of coding lexical records uses a variety of reference sources. Coding is based on actual usage in the UMLS Test Collection and MEDLINE, on dictionaries of general English (primarily learner's dictionaries, which record the kind of syntactic information needed for NLP), and on medical dictionaries. Reference sources include Longman's Dictionary of Contemporary English, Dorland's Illustrated Medical Dictionary, Collins COBUILD Dictionary, the Oxford Advanced Learner's Dictionary, and Webster's Medical Desk Dictionary.

Distribution Formats

NLM provides the SPECIALIST Lexicon in two formats: a unit record format and a relational table format. The information associated with each lexical entry includes a unique identifier, a base form, a syntactic category code, certain agreement information, complementation information if relevant, and various other properties relevant to the particular lexical entry.

The unit record format is a frame structure consisting of slots and fillers. The slots are the basic lexical attributes, and the fillers express the possible values of those attributes for that particular lexical item.

The relational format represents the data for lexical entries as a set of relational tables. This format is not fully normalized. By design, there is duplication of data among different relations and within certain relations. Developers will need to decide whether to retain, reduce, or increase this redundancy for their applications. Among other tables, there are separate tables for agreement and inflection information, for complementation patterns, for spelling variants, and for abbreviations and acronyms and their fully expanded forms.
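The slot-and-filler idea, and the kind of redundancy the relational tables carry, can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration: the sample record, its layout (braces around a record, one slot=filler pair per line), the slot names, the identifier E0000001, and the row layout produced by spelling_rows are hypothetical and are not taken from the distributed files. Consult the UMLS Reference Manual for the actual formats.

```python
from collections import defaultdict

# Hypothetical sample record. The layout and slot names are assumptions
# for illustration, not a copy of the distributed Lexicon data.
SAMPLE = """\
{base=anesthetic
spelling_variant=anaesthetic
entry=E0000001
    cat=noun
    variants=reg
    variants=uncount
}
"""


def parse_unit_records(text):
    """Yield one dict per record, mapping each slot to its list of fillers."""
    record = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("{"):
            # An opening brace starts a new frame; the rest of the line
            # may already hold the first slot=filler pair.
            record = defaultdict(list)
            line = line[1:]
        if line == "}":
            if record is not None:
                yield dict(record)
            record = None
            continue
        if record is None or "=" not in line:
            continue
        slot, filler = line.split("=", 1)
        # A slot may repeat, so fillers are collected in a list.
        record[slot].append(filler)


def spelling_rows(record):
    """Flatten one record into (entry id, base form, spelling) rows.

    The entry id and base form repeat on every row, mirroring the
    deliberate duplication of a relational format that is not fully
    normalized. The row layout itself is hypothetical.
    """
    entry = record["entry"][0]
    base = record["base"][0]
    spellings = [base] + record.get("spelling_variant", [])
    return [(entry, base, spelling) for spelling in spellings]


records = list(parse_unit_records(SAMPLE))
rows = spelling_rows(records[0])
```

In this sketch each slot maps to a list because an attribute can take more than one value for a given lexical item. A developer who wants less redundancy than spelling_rows produces could keep the base form in one table and reference it by entry identifier instead.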

Downloading the SPECIALIST Lexicon

The SPECIALIST Lexicon is an open source resource, available for download as part of the SPECIALIST NLP Tools. Distribution is subject to terms and conditions.

NLM also distributes the SPECIALIST Lexicon as a component of the UMLS Knowledge Sources through the Downloads menu of the UMLS Terminology Services (UTS). You must have an active UMLS Metathesaurus License to access the UTS. For instructions on requesting a license and accessing the UTS, see How to License and Access the Unified Medical Language System® (UMLS®) Data.

Documentation

See the SPECIALIST NLP Tools Web site for information about the latest version of the SPECIALIST Lexicon.

See the SPECIALIST Lexicon and Lexical Tools chapter of the UMLS Reference Manual for detailed information about the Lexicon.

Other Fact Sheets in the UMLS series: UMLS MetamorphoSys, UMLS Metathesaurus, UMLS Semantic Network, UMLS Terminology Services, and Unified Medical Language System (UMLS).


For general information on NLM services, contact:

National Library of Medicine
Customer Service
8600 Rockville Pike
Bethesda, MD 20894
Telephone: 1-888-FINDNLM (1-888-346-3656)
email: http://www.nlm.nih.gov/contacts/contact.html
NLM Customer Service Form at http://apps.nlm.nih.gov/mainweb/siebel/nlm/index.cfm

A complete list of NLM Fact Sheets is available at:
(alphabetical list): http://www.nlm.nih.gov/pubs/factsheets/factsheets.html
(subject list): http://www.nlm.nih.gov/pubs/factsheets/factsubj.html

Or write to:

FACT SHEETS
Office of Communications and Public Liaison
National Library of Medicine
8600 Rockville Pike
Bethesda, MD 20894

Phone: (301) 496-6308
Fax: (301) 496-4450
email: publicinfo@nlm.nih.gov