MTI ML

Please Note: The MTI ML package is protected under the MetaMap Terms and Conditions. Please review prior to downloading the MTI ML package.

Introduction


This package provides machine learning algorithms optimized for large text categorization tasks and is able to combine several text categorization solutions. Compared to existing approaches, its advantages are: 1) speed, 2) the ability to work with a large number of categorization problems, and 3) the ability to compare several text categorization tools based on meta-learning. This website describes how to download, install, and run MTI ML. An example data set is provided to verify the installation of the tool. More detailed instructions on using the tool are available here.

The latest changes in the MTI ML package can be found in the release notes here.


Download


The main components of the MTI ML distribution can be downloaded via the single MTI_ML.tar.gz link below. MTI ML also requires a third-party package, monq-1.1.1.jar, which is available via the link below.

The monq-1.1.1.jar package is a third-party open source development resource package available from BerliOS. MTI ML uses it to parse XML and to provide server capabilities.

For best results, download and install the MTI ML distribution file MTI_ML.tar.gz and then download and install the monq-1.1.1.jar package in the new MTI_ML directory that was created by the MTI ML distribution file.


JAR files included in the MTI_ML distribution:



Installation


MTI ML has been compiled and verified to work with Java 1.6. If you need to install Java or update your current version, follow this link: http://www.java.com. Be sure that the path to the java program is in the PATH environment variable.

Move the downloaded files into a directory where you want to install MTI ML. When you uncompress and untar the MTI ML distribution file, it will create a subdirectory called "MTI_ML". That directory is referred to as <parent_directory> throughout the rest of the instructions. For example, if you create a directory "Project" and install MTI ML in Project, then <parent_directory> should be set to <path to Project>/Project/MTI_ML, or <path to Project>\Project\MTI_ML under Windows.

# Windows.

WinRAR or WinZip can be used to uncompress and untar the files.

# Linux and Mac OS (from Bourne/Bash Shell)

gunzip -c MTI_ML.tar.gz | tar -xf -
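On any of the three platforms, the same extraction can also be done with Python's standard tarfile module. The sketch below is not part of MTI ML, and the function name extract_distribution is hypothetical; it assumes that Python 3 is installed and that MTI_ML.tar.gz is in the current directory.

```python
import tarfile


def extract_distribution(archive="MTI_ML.tar.gz", dest="."):
    """Extract the gzip-compressed tar archive into dest; return member names."""
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        # Newer Python releases provide extraction filters that reject
        # unsafe members (absolute paths, links pointing outside dest).
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
        return names
```

Calling extract_distribution() from the directory that holds the downloaded file should create the MTI_ML subdirectory there, as the gunzip and tar command does.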

The various MTI ML jar files have to be added to the CLASSPATH environment variable or configured directly using the -cp parameter of the Java Virtual Machine (JVM).


Training and testing classifiers using MTI ML


The following section provides a tutorial on how to use MTI ML. All of the tutorial examples are provided for the three supported operating systems: Windows (XP and 7), Linux, and Mac OS. Before you start, please download the tutorial environment by saving all of the following files to your work area. With these instructions and the tutorial environment, you should be able to recreate the entire MTI ML sample run, from training the machine learning algorithms to evaluating the results.

The tutorial environment consists of sample training and test data sets, a sample configuration file, and a benchmark file with the gold standard annotations for the data sets to evaluate the results.

Please Note: The MEDLINE citations in the training and test data sets and the results in the benchmark file represent a static view of the MEDLINE database at the time the data was created. No attempt has been made to keep the data up-to-date.

Training, testing, and evaluation files included in the MTI_ML distribution:



Using MTI ML on Female, Humans, and Male MeSH Headings


Note: In Windows, the files citations.train.xml.gz and citations.test.xml.gz need to be uncompressed before proceeding. WinRAR or WinZip can be used to uncompress the files.


Prepare the command prompt


# Windows. Open a Windows command prompt. Go to <parent_directory>.

Move to the drive (e.g. C:) where the <parent_directory> is located

[Drive]:

cd <parent_directory>

# Linux and Mac OS. Open a terminal. Go to <parent_directory>.

cd <parent_directory>


Setting the CLASSPATH environment variable


# Windows

set CLASSPATH="<parent_directory>\monq-1.1.1.jar;<parent_directory>\utils.jar;<parent_directory>\mti_prod.jar"

# Linux and Mac OS

* in C Shell (csh or tcsh)

setenv CLASSPATH <parent_directory>/monq-1.1.1.jar:<parent_directory>/utils.jar:<parent_directory>/mti_prod.jar

* in Bourne Again Shell (bash)

export CLASSPATH=<parent_directory>/monq-1.1.1.jar:<parent_directory>/utils.jar:<parent_directory>/mti_prod.jar

* Bourne Shell (sh)

CLASSPATH=<parent_directory>/monq-1.1.1.jar:<parent_directory>/utils.jar:<parent_directory>/mti_prod.jar
export CLASSPATH
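The CLASSPATH value follows the same pattern in every shell: the three jar paths joined by the platform's path-list separator (";" on Windows, ":" on Linux and Mac OS). As an illustration only, the following Python sketch builds that string; the function name build_classpath is hypothetical and is not part of MTI ML.

```python
import os

# Jar names taken from the CLASSPATH commands in this section.
MTI_ML_JARS = ("monq-1.1.1.jar", "utils.jar", "mti_prod.jar")


def build_classpath(parent_directory, jars=MTI_ML_JARS, sep=os.pathsep):
    """Join the jar paths with the path-list separator.

    os.pathsep is ';' on Windows and ':' on Linux and Mac OS.
    """
    # Drop a trailing slash so the join below never doubles it.
    base = parent_directory.rstrip("/\\")
    # Use a backslash between directory and file for Windows-style lists.
    slash = "\\" if sep == ";" else "/"
    return sep.join(base + slash + jar for jar in jars)
```

The separator is a parameter so that a value for another platform can be produced, for example build_classpath("C:\\Project\\MTI_ML", sep=";").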


Training


To train the classifier, run the training tool. It generates a dictionary file stored in a trie structure (trie.gz) and a set of models (classifiers.gz) based on the definitions in the configuration.txt file.

The file configuration.txt contains the details of the training. In this case, we are training models for the Humans, Male, and Female MeSH headings.

Details of the training are sent to the standard output. In the example, this is redirected to out.txt and can be used to follow the training progress and understand the generated model.

The training will take several minutes. Error messages are redirected to out.log; check that file to ensure that there were no errors during training.

# Windows

type citations.train.xml | java -cp %CLASSPATH% -Xmx1G -Xms1G -ss6000k gov.nih.nlm.nls.mti.trainer.OVATrainer gov.nih.nlm.nls.mti.textprocessors.MEDLINEXMLTextProcessor "" gov.nih.nlm.nls.mti.featuresextractors.BinaryFeatureExtractor "-l -n -c -f1" configuration.txt trie.gz classifiers.gz 2> out.log > out.txt

# Linux and Mac OS (from Bourne/Bash Shell)

sh

gunzip -c citations.train.xml.gz | java -cp $CLASSPATH -Xmx1G -Xms1G -ss6000k gov.nih.nlm.nls.mti.trainer.OVATrainer gov.nih.nlm.nls.mti.textprocessors.MEDLINEXMLTextProcessor "" gov.nih.nlm.nls.mti.featuresextractors.BinaryFeatureExtractor "-l -n -c -f1" configuration.txt trie.gz classifiers.gz 2> out.log > out.txt
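For readers who prefer a single script on all three operating systems, the same pipeline can be sketched in Python. This is an illustration under stated assumptions, not part of MTI ML: the function names training_command and train are hypothetical, java is assumed to be on the PATH, and the script is assumed to run from <parent_directory>. The Java options and arguments are copied from the commands above. Because the gzip module decompresses the citations while streaming them, the .gz file does not have to be uncompressed first.

```python
import gzip
import shutil
import subprocess


def training_command(classpath):
    """Return the trainer invocation from this section as an argument list."""
    return [
        "java", "-cp", classpath, "-Xmx1G", "-Xms1G", "-ss6000k",
        "gov.nih.nlm.nls.mti.trainer.OVATrainer",
        "gov.nih.nlm.nls.mti.textprocessors.MEDLINEXMLTextProcessor", "",
        "gov.nih.nlm.nls.mti.featuresextractors.BinaryFeatureExtractor",
        "-l -n -c -f1",
        "configuration.txt", "trie.gz", "classifiers.gz",
    ]


def train(classpath, citations="citations.train.xml.gz"):
    """Stream the decompressed citations to the trainer; return its exit code."""
    # Open the input first so a missing file fails before the JVM starts.
    with gzip.open(citations, "rb") as src, \
            open("out.txt", "wb") as out, open("out.log", "wb") as log:
        proc = subprocess.Popen(training_command(classpath),
                                stdin=subprocess.PIPE, stdout=out, stderr=log)
        try:
            shutil.copyfileobj(src, proc.stdin)
            proc.stdin.close()
        except BrokenPipeError:
            pass  # the JVM exited early; out.log should say why
        return proc.wait()
```

A call such as train(os.environ["CLASSPATH"].strip('"')) (after import os) reuses the CLASSPATH variable set earlier; the strip removes the quotes that the Windows set command stores as part of the value.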



Testing


Given the trained model, we can then annotate a new set of citations. In the examples below, the outcome is stored in the annotation.txt file.

Annotating will take several minutes. Check the file annotation.log to ensure that there were no errors during annotation.

# Windows

type citations.test.xml | java -ss6000k -cp %CLASSPATH% gov.nih.nlm.nls.mti.annotator.OVAAnnotator gov.nih.nlm.nls.mti.textprocessors.MEDLINEXMLTextProcessor "" gov.nih.nlm.nls.mti.featuresextractors.BinaryFeatureExtractor "-l -n -c" trie.gz classifiers.gz > annotation.txt 2> annotation.log

# Linux and Mac OS (from Bourne/Bash Shell)

gunzip -c citations.test.xml.gz | java -ss6000k -cp $CLASSPATH gov.nih.nlm.nls.mti.annotator.OVAAnnotator gov.nih.nlm.nls.mti.textprocessors.MEDLINEXMLTextProcessor "" gov.nih.nlm.nls.mti.featuresextractors.BinaryFeatureExtractor "-l -n -c" trie.gz classifiers.gz > annotation.txt 2> annotation.log



Evaluation


The annotation can be evaluated using the following script. The file benchmark.test contains the MEDLINE MeSH indexing for each one of the citations in the test set and it is used as the Gold Standard. The output is stored in the file benchmark.txt.

# Windows

type annotation.txt | java -cp %CLASSPATH% gov.nih.nlm.nls.mti.evaluator.Evaluator benchmark.test > benchmark.txt

# Linux and Mac OS (from Bourne/Bash Shell)

cat annotation.txt | java -cp $CLASSPATH gov.nih.nlm.nls.mti.evaluator.Evaluator benchmark.test > benchmark.txt

grep "^Female|" benchmark.txt
grep "^Humans|" benchmark.txt
grep "^Male|" benchmark.txt

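The evaluation file benchmark.txt contains results for all the MeSH headings. Each line shows the result for a single MeSH heading, with fields separated by the pipe symbol. The fields are, in order: the MeSH heading name, the number of positives in the test set, the true positives, the false positives, precision, recall, and F-measure. Precision is the number of true positives divided by the sum of true positives and false positives; recall is the number of true positives divided by the number of positives.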

The result for the MeSH headings Humans, Male and Female should be:

Female|4616|3164|859|0.786477752920706|0.6854419410745234|0.7324921865956707
Humans|7688|7086|784|0.9003811944091487|0.9216961498439126|0.9109139992286925
Male|4396|2997|1004|0.7490627343164209|0.6817561419472248|0.7138263665594855
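The fields in each line can be cross-checked against each other: in the lines above, precision equals the third field divided by the sum of the third and fourth fields (which identifies the fourth field as the false-positive count), and recall equals the third field divided by the second. The Python sketch below parses one line and recomputes the three scores from the counts. It is an illustration only; the names HeadingResult, parse_result, and recompute are hypothetical and not part of MTI ML, and the F-measure is assumed to be the harmonic mean of precision and recall, which agrees with the sample results.

```python
from typing import NamedTuple


class HeadingResult(NamedTuple):
    heading: str
    positives: int        # citations carrying the heading in the gold standard
    true_positives: int
    false_positives: int
    precision: float
    recall: float
    f_measure: float


def parse_result(line):
    """Split one pipe-separated benchmark.txt line into typed fields."""
    name, pos, tp, fp, p, r, f = line.strip().split("|")
    return HeadingResult(name, int(pos), int(tp), int(fp),
                         float(p), float(r), float(f))


def _ratio(numerator, denominator):
    # Return 0.0 instead of failing when a heading has no predictions.
    return numerator / denominator if denominator else 0.0


def recompute(result):
    """Recompute (precision, recall, F-measure) from the counts."""
    tp = result.true_positives
    precision = _ratio(tp, tp + result.false_positives)
    recall = _ratio(tp, result.positives)
    f_measure = _ratio(2 * precision * recall, precision + recall)
    return precision, recall, f_measure
```

Applying recompute to each parsed line of benchmark.txt and comparing the result with the stored precision, recall, and F-measure fields is a quick way to confirm that an evaluation run produced consistent output.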

Now that you have managed to train and evaluate classifiers based on the example data set, you can learn more about this tool from the document available here.


Publications



Last Modified: September 27, 2012