Staff Bibliography
Document Abstract

Callaghan FM, Jackson MT, Demner-Fushman D, Abhyankar S, McDonald CJ.

NLP-derived Information Improves the Estimates of Risk of Disease Compared To Estimates Based On Manually Extracted Data Alone.

5th International Symposium on Semantic Mining in Biomedicine. Zurich. 2012.

Natural language processing (NLP) enables researchers to extract large quantities of information from free text that otherwise could only be extracted manually. This information can then be used to answer clinical research questions via statistical analysis. However, NLP extracts information with some degree of error (the sensitivity and specificity of state-of-the-art NLP methods are typically 80-90%), and most statistical methods assume that the information has been observed "without measurement error". As we show in this paper, if an NLP-derived smoking status predictor is used, for example, to estimate the risk of smoking-related cancer without any adjustment for measurement error, the estimate is biased. Conversely, if a smaller subset of manually extracted data is used alone, then the estimate is unbiased but imprecise, and the corresponding inference methods tend to have low power to detect significant relationships. We propose using a statistical measurement error method, specifically a maximum likelihood (ML) method, that combines information from NLP with manually validated data to produce unbiased estimates that also have good power to detect a significant signal. This method has the potential to open up large free-text databases to statistical analysis for clinical research. With a case study using smoking status to predict smoking-related cancer, and with simulations, we demonstrate that the ML method performs better under a variety of scenarios than using either NLP or manually extracted data alone.
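
The bias described in the abstract can be illustrated with a small calculation. The sketch below is not the paper's ML method; it only shows how replacing a true binary exposure (smoker or non-smoker) with an error-prone NLP classification shrinks an estimated risk difference. It assumes the classification errors do not depend on the outcome, and the function name, prevalence, and risk values are hypothetical numbers chosen for illustration, with sensitivity and specificity set inside the 80-90% range the abstract mentions.

```python
def observed_risk_difference(prevalence, sensitivity, specificity,
                             risk_exposed, risk_unexposed):
    """Risk difference seen when the true binary exposure X is replaced by
    an error-prone classification W whose errors are unrelated to the outcome."""
    # P(W = 1): true positives plus false positives
    p_w1 = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    # Share of truly exposed people within each classified group
    p_x1_given_w1 = sensitivity * prevalence / p_w1
    p_x1_given_w0 = (1 - sensitivity) * prevalence / (1 - p_w1)
    # Outcome risk in each classified group is a mixture of the true risks
    effect = risk_exposed - risk_unexposed
    risk_w1 = risk_unexposed + effect * p_x1_given_w1
    risk_w0 = risk_unexposed + effect * p_x1_given_w0
    return risk_w1 - risk_w0


# Illustrative (made-up) inputs: 30% smokers, 10% vs 2% cancer risk,
# and an NLP classifier with 85% sensitivity and 85% specificity.
true_rd = 0.10 - 0.02
naive_rd = observed_risk_difference(0.30, 0.85, 0.85, 0.10, 0.02)
print(f"true risk difference: {true_rd:.3f}, NLP-based: {naive_rd:.3f}")
```

Under these assumed inputs the NLP-based risk difference is smaller than the true one, which is the kind of bias a measurement error method that also uses manually validated data is meant to correct.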
