LHNCBC: Document Abstract
Year: 2006
LHNCBC-2006-018
Combining DOM Tree and Geometric Layout Analysis for Online Medical Journal Article Segmentation
Zou J, Le DX, Thoma GR
Proc. Joint Conference on Digital Libraries (JCDL), June 2006, Chapel Hill, NC; 119-28
We describe an HTML web page segmentation algorithm and apply it to online medical journal articles (regular HTML and PDF-converted HTML files). The web page content is modeled by a zone tree structure based primarily on the geometric layout of the page. For a given journal article, a zone tree is generated by combining DOM tree analysis with the recursive X-Y cut algorithm. Using additional visual cues, such as background color, font size, and font color, the page is then segmented into homogeneous regions. The evaluation uses 104 articles from 11 journals: of 9726 ground-truth zones, 9376 are correctly segmented, for an accuracy of 96.40%. Segmenting the entire web page into zones can significantly speed up the subsequent information retrieval steps and increase their accuracy.
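The recursive X-Y cut step mentioned in the abstract can be illustrated with a minimal sketch. This is a textbook version of the technique operating on rectangular bounding boxes, not the authors' implementation: the box format `(x0, y0, x1, y1)`, the `min_gap` threshold, the dictionary-based zone tree, and the function names are all assumptions made for illustration, and the DOM tree analysis and visual cues (background color, font size, font color) are left out.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical box format: (x0, y0, x1, y1) in page coordinates.
Box = Tuple[float, float, float, float]


def find_widest_gap(boxes: List[Box], axis: int, min_gap: float) -> Optional[float]:
    """Return the midpoint of the widest empty gap along `axis`, or None.

    axis=0 projects the boxes onto x, axis=1 projects them onto y.
    """
    lo, hi = axis, axis + 2
    ordered = sorted(boxes, key=lambda b: b[lo])
    best_width, best_cut = 0.0, None
    reach = ordered[0][hi]  # farthest extent covered so far
    for b in ordered[1:]:
        gap = b[lo] - reach
        if gap >= min_gap and gap > best_width:
            best_width, best_cut = gap, reach + gap / 2
        reach = max(reach, b[hi])
    return best_cut


def xy_cut(boxes: List[Box], min_gap: float = 10.0) -> Dict:
    """Build a zone tree by recursively cutting at whitespace gaps.

    A leaf is {"boxes": [...]}; an internal node records the cut
    direction, its position, and two child zones.
    """
    if len(boxes) > 1:
        # Try a horizontal cut (gap in y) first, then a vertical cut (gap in x).
        for axis, name in ((1, "horizontal"), (0, "vertical")):
            cut = find_widest_gap(boxes, axis, min_gap)
            if cut is not None:
                first = [b for b in boxes if b[axis] < cut]
                second = [b for b in boxes if b[axis] >= cut]
                return {
                    "cut": name,
                    "at": cut,
                    "children": [xy_cut(first, min_gap), xy_cut(second, min_gap)],
                }
    return {"boxes": sorted(boxes)}  # no usable gap: homogeneous zone


def leaves(node: Dict) -> List[List[Box]]:
    """Collect the leaf zones of a zone tree in reading order."""
    if "boxes" in node:
        return [node["boxes"]]
    return [zone for child in node["children"] for zone in leaves(child)]


# Toy page: a title above a two-column body.
title = (0.0, 0.0, 100.0, 10.0)
left = (0.0, 30.0, 45.0, 80.0)
right = (55.0, 30.0, 100.0, 80.0)
tree = xy_cut([left, title, right], min_gap=10.0)
zones = leaves(tree)
```

On this toy page the first cut is horizontal, separating the title from the body, and the second is vertical, separating the two columns, so `zones` holds three single-box zones. In a full system the leaf zones would presumably be refined further with the DOM structure and visual cues the abstract describes.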