This package provides machine learning algorithms optimized for large text categorization tasks and can combine several text categorization solutions. Compared to existing approaches, its advantages are: 1) speed, 2) the ability to work with a large number of categorization problems, and 3) the ability to compare several text categorization tools based on meta-learning. This website describes how to download, install, and run MTI ML. An example data set is provided to verify the installation of the tool. More detailed instructions on using the tool are available here.
The latest changes in the MTI ML package can be found in the release notes here.
The main components of the MTI ML distribution can be downloaded via the single MTI_ML.tar.gz link below. MTI ML also requires the third-party package monq-1.1.1.jar, which is available via the link below.
The monq-1.1.1.jar package is a third-party open source development resource available from BerliOS. MTI ML uses it to parse XML and to provide server capabilities.
For best results, download and install the MTI ML distribution file MTI_ML.tar.gz first, and then download the monq-1.1.1.jar package and place it in the new MTI_ML directory created by the MTI ML distribution file.
MTI ML has been compiled and verified to work with Java 1.6. If you need to install Java or update your current version, follow this link: http://www.java.com. Be sure that the path to the java program is in the PATH environment variable.
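One way to confirm that the java program is reachable through PATH on Linux and Mac OS is sketched below. The helper name require_cmd is hypothetical and not part of MTI ML; it relies only on the standard command -v shell builtin.

```shell
# Hypothetical helper (not part of MTI ML): succeed only if the named
# program can be found through the PATH environment variable.
require_cmd() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1 found at: $(command -v "$1")"
    else
        echo "$1 not found in PATH" >&2
        return 1
    fi
}

# Typical use: check for java, then print the installed version.
# require_cmd java && java -version
```

If the check fails, add the directory that contains the java program to PATH before continuing.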
Move the downloaded files into the directory where you want to install MTI ML. Uncompressing and untarring the MTI ML distribution file creates a subdirectory called "MTI_ML"; that directory is referred to as <parent_directory> throughout the rest of the instructions. For example, if you create a directory "Project" and install MTI ML in Project, then <parent_directory> is <path to Project>/Project/MTI_ML, or <path to Project>\Project\MTI_ML under Windows.
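For Linux and Mac OS, the unpacking step described above could look roughly like the following sketch. The function name install_mti_ml and its three arguments are hypothetical; the only facts taken from these instructions are the file names and the MTI_ML subdirectory that the distribution file creates.

```shell
# Hypothetical helper (not part of MTI ML).
# Usage: install_mti_ml ARCHIVE MONQ_JAR INSTALL_DIR
install_mti_ml() {
    archive=$1
    monq_jar=$2
    install_dir=$3
    mkdir -p "$install_dir" || return 1
    # Uncompress and untar the distribution; this creates INSTALL_DIR/MTI_ML.
    tar -xzf "$archive" -C "$install_dir" || return 1
    # Place the third-party jar next to the MTI ML jar files.
    cp "$monq_jar" "$install_dir/MTI_ML/" || return 1
    # Print the resulting <parent_directory>.
    printf '%s\n' "$install_dir/MTI_ML"
}

# Typical use, with both downloads in the current directory:
# install_mti_ml MTI_ML.tar.gz monq-1.1.1.jar "$HOME/Project"
```

The path printed at the end is the value to substitute for <parent_directory> in the remaining steps.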
The various MTI ML jar files have to be added to the CLASSPATH environment variable or configured directly using the -cp parameter of the Java Virtual Machine (JVM).
The following section provides a tutorial on how to use MTI ML. All of the tutorial examples are provided for the three supported operating systems: Windows (XP & 7), Linux, and Mac OS. Before you start, please download the tutorial environment by saving all of the following files to your work area. Between these instructions and the tutorial environment, you should be able to recreate the entire MTI ML sample run, from training the machine learning algorithms to evaluating the results.
The tutorial environment consists of sample training and test data sets, a sample configuration file, and a benchmark file with the gold standard annotations for the data sets to evaluate the results.
Please Note: The MEDLINE citations in the training and test data sets and the results in the benchmark file represent a static view of the MEDLINE database at the time the data was created. No attempt has been made to keep the data up-to-date.
Training, testing, and evaluation files included in the MTI_ML distribution:
Note: In Windows, the files citations.train.xml.gz and citations.test.xml.gz need to be uncompressed before proceeding. WinRAR or WinZip can be used to uncompress the files.
# Windows. Open a Windows command prompt. Go to <parent_directory>.
Move to the drive (e.g. C:) where the <parent_directory> is located
[Drive]:
cd <parent_directory>
# Linux and Mac OS. Open a terminal. Go to <parent_directory>.
cd <parent_directory>
# Windows
set CLASSPATH="<parent_directory>\monq-1.1.1.jar;<parent_directory>\utils.jar;<parent_directory>\mti_prod.jar"
# Linux and Mac OS
* in C Shell (csh or tcsh)
setenv CLASSPATH <parent_directory>/monq-1.1.1.jar:<parent_directory>/utils.jar:<parent_directory>/mti_prod.jar
* in Bourne Again Shell (bash)
export CLASSPATH=<parent_directory>/monq-1.1.1.jar:<parent_directory>/utils.jar:<parent_directory>/mti_prod.jar
* Bourne Shell (sh)
CLASSPATH=<parent_directory>/monq-1.1.1.jar:<parent_directory>/utils.jar:<parent_directory>/mti_prod.jar; export CLASSPATH
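As an alternative to typing the three jar paths by hand on Linux and Mac OS, the CLASSPATH value can be assembled by a small helper that also checks that each jar file is present. This is only a sketch: the function name build_classpath is hypothetical, and the jar names are the three listed in the settings above.

```shell
# Hypothetical helper (not part of MTI ML): print a CLASSPATH value for the
# three jar files named in this tutorial, checking that each one exists.
build_classpath() {
    parent_directory=$1
    classpath=""
    for jar in monq-1.1.1.jar utils.jar mti_prod.jar; do
        if [ ! -f "$parent_directory/$jar" ]; then
            echo "missing jar: $parent_directory/$jar" >&2
            return 1
        fi
        # Join entries with ":" (the Linux and Mac OS path separator).
        classpath="${classpath:+$classpath:}$parent_directory/$jar"
    done
    printf '%s\n' "$classpath"
}

# Typical use from sh or bash, run inside <parent_directory>:
# CLASSPATH=$(build_classpath "$(pwd)") && export CLASSPATH
```

A missing jar is reported on standard error and leaves CLASSPATH untouched, which makes an incomplete installation easier to spot before training starts.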
The first step is to train the classifier. The training tool generates a dictionary file stored in a trie structure (trie.gz) and a set of models (classifiers.gz) based on the definitions in the configuration.txt file.
The file configuration.txt contains the details of the training. In this case, we are training models for the Humans, Male, and Female MeSH headings.
Details of the training are sent to the standard output. In the example, this is redirected to out.txt and can be used to follow the training progress and understand the generated model.
The training will take several minutes. Check the file out.log to ensure that there were no errors during training.
# Windows
type citations.train.xml | java -cp %CLASSPATH% -Xmx1G -Xms1G -ss6000k gov.nih.nlm.nls.mti.trainer.OVATrainer gov.nih.nlm.nls.mti.textprocessors.MEDLINEXMLTextProcessor "" gov.nih.nlm.nls.mti.featuresextractors.BinaryFeatureExtractor "-l -n -c -f1" configuration.txt trie.gz classifiers.gz 2> out.log > out.txt
# Linux and Mac OS (from Bourne/Bash Shell)
sh
gunzip -c citations.train.xml.gz | java -cp $CLASSPATH -Xmx1G -Xms1G -ss6000k gov.nih.nlm.nls.mti.trainer.OVATrainer gov.nih.nlm.nls.mti.textprocessors.MEDLINEXMLTextProcessor "" gov.nih.nlm.nls.mti.featuresextractors.BinaryFeatureExtractor "-l -n -c -f1" configuration.txt trie.gz classifiers.gz 2> out.log > out.txt
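Once training has finished, a quick sanity check on Linux and Mac OS might look like the sketch below. The function name check_run is hypothetical, and the pattern used to spot problems in the log is an assumption: it presumes that JVM failures appear in out.log as text containing "Exception" or "Error", which these instructions do not specify.

```shell
# Hypothetical helper (not part of MTI ML): check a finished run.
# Usage: check_run LOG_FILE OUTPUT_FILE...
check_run() {
    log_file=$1
    shift
    status=0
    # Each expected output file should exist and be non-empty.
    for output in "$@"; do
        if [ ! -s "$output" ]; then
            echo "missing or empty output: $output" >&2
            status=1
        fi
    done
    # Assumption: JVM failures show up as "Exception" or "Error" in the log.
    if [ -f "$log_file" ] && grep -E 'Exception|Error' "$log_file" >&2; then
        status=1
    fi
    return $status
}

# Typical use after training:
# check_run out.log trie.gz classifiers.gz && echo "training looks fine"
```

A clean result from this check does not replace reading out.log, since the tool may report problems in a form the pattern does not match.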
Given the trained model, we can then annotate a new set of citations. In the examples below, the outcome is stored in the annotation.txt file.
Annotating will take several minutes. Check the file annotation.log to ensure that there were no errors during annotating.
# Windows
type citations.test.xml | java -ss6000k -cp %CLASSPATH% gov.nih.nlm.nls.mti.annotator.OVAAnnotator gov.nih.nlm.nls.mti.textprocessors.MEDLINEXMLTextProcessor "" gov.nih.nlm.nls.mti.featuresextractors.BinaryFeatureExtractor "-l -n -c" trie.gz classifiers.gz > annotation.txt 2> annotation.log
# Linux and Mac OS (from Bourne/Bash Shell)
gunzip -c citations.test.xml.gz | java -ss6000k -cp $CLASSPATH gov.nih.nlm.nls.mti.annotator.OVAAnnotator gov.nih.nlm.nls.mti.textprocessors.MEDLINEXMLTextProcessor "" gov.nih.nlm.nls.mti.featuresextractors.BinaryFeatureExtractor "-l -n -c" trie.gz classifiers.gz > annotation.txt 2> annotation.log
The annotation can be evaluated using the following commands. The file benchmark.test contains the MEDLINE MeSH indexing for each of the citations in the test set and is used as the gold standard. The output is stored in the file benchmark.txt.
# Windows
type annotation.txt | java -cp %CLASSPATH% gov.nih.nlm.nls.mti.evaluator.Evaluator benchmark.test > benchmark.txt
# Linux and Mac OS (from Bourne/Bash Shell)
cat annotation.txt | java -cp $CLASSPATH gov.nih.nlm.nls.mti.evaluator.Evaluator benchmark.test > benchmark.txt
grep "^Female|" benchmark.txt
grep "^Humans|" benchmark.txt
grep "^Male|" benchmark.txt
The evaluation file benchmark.txt contains results for all the MeSH headings. Each line shows the result for a single MeSH heading, with fields separated by the pipe symbol. The first field is the MeSH heading name, followed by the number of positives in the test set, the true positives, the false negatives, precision, recall, and F-measure.
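To illustrate how these fields relate to each other, the sketch below reads one line in that layout and recomputes recall and F-measure from the counts. It rests on several assumptions not stated in these instructions: that recall is true positives divided by true positives plus false negatives, that F-measure is the balanced F1 score, and that precision is written as a fraction between 0 and 1. The function name recompute_scores is hypothetical, and the sample line is made up rather than taken from real MTI ML output.

```shell
# Hypothetical helper (not part of MTI ML): read one benchmark line and
# recompute recall and F-measure from its fields.
# Fields: heading|positives|true positives|false negatives|precision|recall|F
recompute_scores() {
    printf '%s\n' "$1" | awk -F'|' '{
        tp = $3; fn = $4; precision = $5
        recall = (tp + fn > 0) ? tp / (tp + fn) : 0
        # Assumption: F-measure is the balanced F1 score.
        f1 = (precision + recall > 0) ? 2 * precision * recall / (precision + recall) : 0
        printf "%s recall=%.4f f1=%.4f\n", $1, recall, f1
    }'
}

# The sample line below is made up for illustration only.
recompute_scores 'Example|200|150|50|0.6000|0.7500|0.6667'
# prints: Example recall=0.7500 f1=0.6667
```

If the recomputed values differ from the ones in benchmark.txt, the most likely cause is that one of the assumptions above does not match how the evaluator defines its fields.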
The result for the MeSH headings Humans, Male and Female should be:
Now that you have managed to train and evaluate classifiers based on the example data set, you can learn more about this tool from the document available here.
Last Modified: September 27, 2012
Lister Hill National Center for Biomedical Communications | U.S. National Library of Medicine | National Institutes of Health | Department of Health and Human Services