LHNCBC: Document Abstract
Year: 2001
LHNCBC-2001-012
Approximate String Matching Algorithms for Limited-Vocabulary OCR Output Correction
Lasko TA, Hauser SE
Proc. SPIE, Document Recognition and Retrieval VIII. 2001 Jan;4307:232-40.
Five methods for matching words misrecognized by optical character recognition (OCR) to their most likely match in a reference dictionary were tested on data from the archives of the National Library of Medicine. The five methods (an adaptation of the cross-correlation algorithm, the generic edit distance algorithm, the edit distance algorithm with a probabilistic substitution matrix, Bayesian analysis, and Bayesian analysis on an actively thinned reference dictionary) were implemented and their accuracy rates compared. Of the five, the Bayesian algorithm produced the most correct matches (87%) and had the advantage of producing scores that have a useful and practical interpretation.
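As an illustration of the simplest of these approaches, the sketch below matches an OCR output word to a reference dictionary using the generic edit distance (Levenshtein distance with unit costs). It is a minimal sketch and not the authors' implementation: the function names `edit_distance` and `best_match`, the sample word, and the three-entry dictionary are all assumptions made for this example, and the abstract does not say how ties were broken.

```python
from typing import Sequence


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit cost for insert, delete, and substitute."""
    # prev[j] holds the distance between the first i-1 chars of a and the first j chars of b.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from the first i chars of a to the empty string
        for j, cb in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if ca == cb else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def best_match(word: str, dictionary: Sequence[str]) -> str:
    """Return the dictionary entry closest to word (hypothetical helper).

    Ties go to the earliest entry, which is an assumption of this sketch.
    """
    return min(dictionary, key=lambda entry: edit_distance(word, entry))


# Hypothetical limited vocabulary and an OCR error where "m" was read as "rn".
vocabulary = ["medicine", "medical", "library"]
print(best_match("rnedicine", vocabulary))
```

A probabilistic substitution matrix, as named in the abstract, would replace the fixed substitution cost of 1 with a cost that depends on the pair of characters involved; the values of such a matrix are not given in the abstract, so they are left out here.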