LHNCBC: Document Abstract

Year: 2001
LHNCBC-2001-012
Approximate String Matching Algorithms for Limited-Vocabulary OCR Output Correction
Lasko TA, Hauser SE
Proc. SPIE, Document Recognition and Retrieval VIII. 2001 Jan;4307:232-40.
Five methods for matching words mistranslated by optical character recognition (OCR) to their most likely match in a reference dictionary were tested on data from the archives of the National Library of Medicine. The five methods were implemented and their accuracy rates compared: an adaptation of the cross-correlation algorithm, the generic edit distance algorithm, the edit distance algorithm with a probabilistic substitution matrix, Bayesian analysis, and Bayesian analysis on an actively thinned reference dictionary. Of the five, the Bayesian algorithm produced the most correct matches (87%) and had the advantage of producing scores with a useful and practical interpretation.
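As an illustration of the edit distance approach named in the abstract, the following is a minimal sketch of dictionary matching by Levenshtein distance, with an optional substitution cost function standing in for a probabilistic substitution matrix. It is not the authors' implementation: the function names (`edit_distance`, `best_match`), the unit insertion and deletion costs, the sample words, and the confusion costs are all assumptions made for this example.

```python
from typing import Callable, Iterable, Optional, Tuple

SubCost = Callable[[str, str], float]


def edit_distance(source: str, target: str,
                  sub_cost: Optional[SubCost] = None) -> float:
    """Levenshtein distance with unit insert/delete costs.

    sub_cost(a, b) gives the cost of reading character b as a.
    By default it is 0 for identical characters and 1 otherwise.
    """
    if sub_cost is None:
        sub_cost = lambda a, b: 0.0 if a == b else 1.0

    # prev[j] holds the distance between source[:i-1] and target[:j].
    prev = [float(j) for j in range(len(target) + 1)]
    for i, s_ch in enumerate(source, start=1):
        curr = [float(i)]
        for j, t_ch in enumerate(target, start=1):
            curr.append(min(
                prev[j] + 1.0,                       # delete s_ch
                curr[j - 1] + 1.0,                   # insert t_ch
                prev[j - 1] + sub_cost(s_ch, t_ch),  # substitute or match
            ))
        prev = curr
    return prev[-1]


def best_match(ocr_word: str, dictionary: Iterable[str],
               sub_cost: Optional[SubCost] = None) -> Tuple[str, float]:
    """Return the dictionary word with the lowest distance to ocr_word."""
    scored = ((word, edit_distance(ocr_word, word, sub_cost))
              for word in dictionary)
    return min(scored, key=lambda pair: pair[1])


# Hypothetical confusion costs: visually similar pairs are cheap to swap.
# These values are invented for illustration, not taken from the paper.
CONFUSION = {("1", "l"): 0.2, ("0", "o"): 0.2}


def weighted_cost(a: str, b: str) -> float:
    """Substitution cost that discounts the invented confusion pairs."""
    if a == b:
        return 0.0
    return CONFUSION.get((a, b), 1.0)


if __name__ == "__main__":
    words = ["medicine", "medical", "library"]
    print(best_match("rnedicine", words))
    print(best_match("1ibrary", words, weighted_cost))
```

With a weighted cost function, a likely OCR confusion such as `1` for `l` is penalized less than an arbitrary substitution, which is the general idea behind using a substitution matrix in place of uniform costs.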

Lister Hill National Center for Biomedical Communications
U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894
National Institutes of Health, Department of Health & Human Services