CIT Course on Genetic Linkage Analysis

Cookbook Linkage Analysis, Ab Initio

Alejandro A. Schäffer

NCBI

e-mail: schaffer@helix.nih.gov

Customer service through CIT/Division of Computational Biology (CIT/DCB):

Jim Tomlin: jtomlin@helix.nih.gov

http://bimas.cit.nih.gov/linkage/index.html


Recipe for disease gene hunting

1. Identify 1 or more sets of related patients
2. Connect patients into pedigrees
3. Genotype patients and their families
4. Look for polymorphisms that cosegregate with the disease
5. When linkage is found, narrow the region with more families
6. Once the linkage region is small, either find candidate genes and skip to step 9, or clone DNA from the area and put it into artificial hosts (vectors).
7. Sequence the DNA in the region
8. Look for genes (e.g., by computer prediction, exon trapping, cDNA screen, etc.)
9. Screen genes for mutations
10. Prove mutations are causal (proof standards vary)

Ingredients for Cookbook Linkage Analysis

1 or more family trees (pedigrees) exhibiting a recurring disease (phenotype)
Genotypes for as many individuals at as many loci as feasible
1 or more UNIX, VMS, or DOS computer(s)
LINKAGE/FASTLINK and/or VITESSE software packages
1/2 cup of statistical theory

I will also briefly discuss the GENEHUNTER and SimIBD software.

Desired Results

Approximate location of disease gene
Placement of disease gene relative to multiple other loci
Exclusion of a genome region as a possible location of the disease gene

Caution: All conclusions are obtained with high probability, but not certainty

Goal of this lecture:

Reduce easy cases of linkage analysis to:

Family Ascertainment
S.M.O.G. (Simple Matter of Genotyping)
Looking up instructions
File manipulation and keystroking

References:

TO: J. D. Terwilliger and J. Ott, Handbook of Human Genetic Linkage, Johns Hopkins Press, 1994.

JO: J. Ott, Analysis of Human Genetic Linkage, Johns Hopkins Press, 1991.

LD: LINKAGE Documentation

FD: FASTLINK Documentation

AD: APM Documentation

CD: GenoCheck Documentation

GD: GENEHUNTER Documentation

SD: SimIBD Documentation

VD: VITESSE Documentation

Linkage analysis software packages

Parametric:

LINKAGE/FASTLINK/GenoCheck, VITESSE
LIPED, MAPMAKER, CRI-MAP
MENDEL, PAP, S.A.G.E.

Non-parametric:

APM, S.A.G.E., GENEHUNTER, SimIBD

[S.A.G.E. sages: Joan Bailey-Wilson and Alexander Wilson, at NHGRI.]

Parametric means that you specify a model of inheritance.

Many disease studies still use LINKAGE.

Reason: TO and associated mini-courses

Recombinant

Suppose loci A and B are conjectured to be on the same chromosome.

Individual i is a recombinant w.r.t. parent p, if i inherits the A allele from one chromatid of p and the B allele from the other chromatid of p.

Recombination Fraction:

The probability that a child is recombinant between loci A and B is $\theta$.

A low probability of being recombinant indicates linkage.

For ordered multiple loci:

A B C D

define a vector $\theta = (\theta_{AB}, \theta_{BC}, \theta_{CD})$,

which is an ordered list of recombination fractions between adjacent loci.

Recombination fractions vary for male/female parents, but published maps tend to be gender-averaged

Recombination Fraction vs. Genetic Distance

Genetic Distance, $x$, is the expected number of crossovers between locus A and locus B.

If A and B are on different chromosomes, $\theta = 1/2$, but $x$ is not defined.

$0 \le \theta \le 1/2$, while $x$ may be larger than 1.

On the same chromosome, $\theta \le x$ and they are close for small values.

$x$ is measured in Morgans, $\theta$ is unitless

1 Morgan means 1 expected crossover

Relationship between Morgans and Megabases is poorly understood.

Mapping Functions

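Mapping functions convert from $\theta$ to $x$ and back.

Canonical application: compute $\theta_{AC}$ knowing $\theta_{AB}$ and $\theta_{BC}$

Apply m.f. to get $x_{AB}$ and $x_{BC}$
Compute $x_{AC} = x_{AB} + x_{BC}$
Apply m.f. to get $\theta_{AC}$

Published maps show $x$ between adjacent markers and LINKAGE needs $\theta$, but you may not want to use loci that are adjacent on the map.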

JO: Chapter 1 and FD: README.mapfun
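
The three-step canonical application above can be sketched in Python with the two most commonly used mapping functions, Haldane and Kosambi. This is only an illustration: the function names are made up for this sketch and are not part of LINKAGE or FASTLINK, and $x$ is taken to be in Morgans.

```python
import math

def haldane_theta(x):
    """Haldane map function: distance x (Morgans) -> recombination fraction."""
    return 0.5 * (1.0 - math.exp(-2.0 * x))

def haldane_x(theta):
    """Inverse Haldane: recombination fraction -> distance (Morgans)."""
    if not 0.0 <= theta < 0.5:
        raise ValueError("theta must satisfy 0 <= theta < 0.5")
    return -0.5 * math.log(1.0 - 2.0 * theta)

def kosambi_theta(x):
    """Kosambi map function: distance x (Morgans) -> recombination fraction."""
    return 0.5 * math.tanh(2.0 * x)

def kosambi_x(theta):
    """Inverse Kosambi: recombination fraction -> distance (Morgans)."""
    if not 0.0 <= theta < 0.5:
        raise ValueError("theta must satisfy 0 <= theta < 0.5")
    return 0.25 * math.log((1.0 + 2.0 * theta) / (1.0 - 2.0 * theta))

def combine(theta_ab, theta_bc, to_x=haldane_x, to_theta=haldane_theta):
    """Compute theta_AC from theta_AB and theta_BC for the order A B C."""
    x_ac = to_x(theta_ab) + to_x(theta_bc)   # distances add, thetas do not
    return to_theta(x_ac)

if __name__ == "__main__":
    print(combine(0.1, 0.1))                            # Haldane
    print(combine(0.1, 0.1, kosambi_x, kosambi_theta))  # Kosambi
```

The two functions give different answers for the same inputs, so use the mapping function that the published map was built with.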

Essence of LINKAGE:

Given a fixed order of markers and disease loci and a candidate $\theta$ vector,

compute a number called the ``likelihood'' to quantify:

How well do the data fit the $\theta$ vector?

Higher likelihood is better.

What is FASTLINK?

New versions of LINKAGE main programs

Faster sequentially
Run in parallel
Robust against crashes and user errors
Substantial new documentation

What is VITESSE?

Program to replace MLINK and LINKMAP

Faster than FASTLINK
New running time diagnostics
``Simple'' pedigrees only

``Simple'' means each person has <=2 grandparents in pedigree, and no loops.

J. Tomlin has a utility program to combine FASTLINK (LINKMAP) and VITESSE output for mixed-simple data sets.

Reference: VD

Maximum Likelihood Estimation

Compare two candidates: $\theta_1$, $\theta_2$

$H_i$: $\theta_i$ is the correct value

Compute: $\mathrm{Like}(\theta_1)/\mathrm{Like}(\theta_2)$

If this ratio is above a prescribed threshold, then accept $H_1$, reject $H_2$; if below a different threshold, accept $H_2$, reject $H_1$.

Thresholds depend on the bad consequences to society from false negatives and false positives.

Testing for linkage

Unlinked means that $\theta$ between the disease and all other markers is 0.5

Linked means that $\theta$ between the disease and some other marker is (substantially) $< 0.5$. Use the best choice of $\theta$ for hypothesis $H_1$ (linked).

Standard threshold for autosomal linkage: 1000

Standard threshold for autosomal nonlinkage: 1/100

Reference: JO, Sec. 4.4
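
The two standard thresholds can be turned into a three-way decision rule. The sketch below is an illustration only: the function name and the "undecided" label are assumptions of this example, and the likelihood ratio is assumed to be (likelihood when linked) / (likelihood when unlinked).

```python
import math

LINKED_THRESHOLD = 1000.0        # odds of 1000:1, i.e. lod score of 3
UNLINKED_THRESHOLD = 1.0 / 100   # odds of 1:100, i.e. lod score of -2

def classify(likelihood_ratio):
    """Apply the standard autosomal thresholds to a likelihood ratio."""
    if likelihood_ratio <= 0.0:
        raise ValueError("a likelihood ratio must be positive")
    if likelihood_ratio >= LINKED_THRESHOLD:
        return "linked"
    if likelihood_ratio <= UNLINKED_THRESHOLD:
        return "unlinked"
    return "undecided"           # between thresholds: collect more data

if __name__ == "__main__":
    for ratio in (2000.0, 10.0, 0.005):
        print(ratio, round(math.log10(ratio), 2), classify(ratio))
```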

What's a lod score?

Let $L(\theta)$ be pedigree likelihood when disease is linked

Let $L(1/2)$ be pedigree likelihood when disease is unlinked

The lod score is $Z(\theta) = \log_{10} [L(\theta)/L(1/2)]$

Lod score of magnitude > z mathematically implies the statistical probability of error is at most $10^{-z}$, ignoring multiple testing issues.

Caution: What LINKAGE/FASTLINK prints for log(L) may be off by an additive term, but lod scores are correct.

Logarithms

Why do linkage analysis experts talk about lod scores rather than likelihoods?

Logarithms convert multiplication/division into addition/subtraction. This is very useful for multiple data sets.

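So, the lod score can be computed as: $\log_{10} L(\theta) - \log_{10} L(1/2)$, the log likelihood when linked minus the log likelihood when unlinked.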
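
To see why additivity matters for multiple data sets, here is a small sketch that computes lod scores for several families and sums them. It assumes the simplest textbook case, phase-known meioses in which recombinants can be counted directly, so that the likelihood is $\theta^k (1-\theta)^{n-k}$; real pedigrees need LINKAGE/FASTLINK or VITESSE. The function names and the family counts are invented for the example.

```python
import math

def lod_phase_known(recombinants, meioses, theta):
    """Lod score for k recombinants in n phase-known meioses at a given theta."""
    k, n = recombinants, meioses
    if not 0.0 <= theta <= 0.5:
        raise ValueError("theta must lie in [0, 0.5]")
    if theta == 0.0:
        # Any recombinant is impossible at theta = 0.
        return float("-inf") if k > 0 else n * math.log10(2.0)
    log_linked = k * math.log10(theta) + (n - k) * math.log10(1.0 - theta)
    log_unlinked = n * math.log10(0.5)
    return log_linked - log_unlinked    # division became subtraction

def total_lod(families, theta):
    """Families are independent, so their lod scores simply add."""
    return sum(lod_phase_known(k, n, theta) for k, n in families)

if __name__ == "__main__":
    families = [(1, 10), (0, 5), (2, 12)]   # (recombinants, meioses), made up
    for theta in (0.01, 0.05, 0.1, 0.2, 0.3, 0.4):
        print(theta, round(total_lod(families, theta), 3))
```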

Stages of a generic linkage study

Genome scan to find possible regions
Identify the correct region
Pin disease gene between two known genes (Fine Mapping/Testing for Order)

Genome Scan

Use 150-400 polymorphic probes at 8-20 cM spacing.

Always do two-locus analysis between disease and marker probe of known position.

Interested in lod scores above 1 or 2.

Standard marker sets suggested by CHLC.

Two-stage scans can reduce lab costs.

Never, never compare in print the lod score with marker A against the lod score with marker B.

Identify the Correct Region

Use multiple probes at 1-4 cM spacing.

Do two- or three-locus analysis between disease and marker probes of known position.

Interested in lod scores above 3.

Can find suitable markers from a combination of the Marshfield, MIT, and Généthon maps.

Use markers whose relative order is agreed upon.

Testing for marker order

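Fixed marker map: $M_1$ $M_2$ $M_3$

Find best $\theta$ vector for each of 4 placements of disease D.

Let the best likelihood for each order be $L_1$, $L_2$, $L_3$, $L_4$.

Compare differences between $\log_{10}(L_1)$, $\log_{10}(L_2)$, $\log_{10}(L_3)$, $\log_{10}(L_4)$

If $\log_{10}(L_i) - \log_{10}(L_j) = t$ then the odds that j is incorrect in favor of i are: $10^t : 1$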

Desired level for t: 3.0
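
The comparison of orders can be sketched as follows. The order labels, the function names, and the log10 likelihood values are all invented for illustration; in practice the values would come from the best likelihood reported for each placement of D.

```python
def compare_orders(log10_likelihoods):
    """Return the best order and the odds (as x:1) against every other order."""
    best = max(log10_likelihoods, key=log10_likelihoods.get)
    odds = {
        order: 10.0 ** (log10_likelihoods[best] - value)
        for order, value in log10_likelihoods.items()
        if order != best
    }
    return best, odds

def order_is_supported(log10_likelihoods, t=3.0):
    """True if the best order beats every alternative by at least t log10 units."""
    _, odds = compare_orders(log10_likelihoods)
    return all(value >= 10.0 ** t for value in odds.values())

if __name__ == "__main__":
    # Hypothetical log10 likelihoods for the 4 placements of D.
    results = {
        "D M1 M2 M3": -52.1,
        "M1 D M2 M3": -48.0,
        "M1 M2 D M3": -51.5,
        "M1 M2 M3 D": -55.0,
    }
    best, odds = compare_orders(results)
    print("best order:", best)
    for order, value in odds.items():
        print(f"odds against {order}: {value:.0f}:1")
    print("supported at t = 3.0:", order_is_supported(results))
```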

Marker order cautions

May have very high lod scores, yet uncertain marker order

Published maps may be incorrect over small regions

Published $\theta$ values are estimated from small samples

Limited variation among samples

Association between marker alleles and disease phenotypes (linkage disequilibrium) can confound the computations

LINKAGE/FASTLINK main programs:

Differ in: How thetas are chosen, output format

MLINK: Used to vary 1 $\theta$ at a time

LINKMAP: Used to vary 1 or 2 adjacent $\theta$ at a time

ILINK: Used to find a locally optimal $\theta$ vector.

Locally optimal means that it cannot be perturbed by a small amount to get a different $\theta$ vector with a better likelihood.

VITESSE coalesces LINKMAP and MLINK by taking the LINKAGE program name as a parameter.

When to use MLINK

You want a table of 2-point lod scores comparing disease against a variety of markers and for a variety of $\theta$ values.

You need to fill in the lod score at specific $\theta$ vectors.

Used for genome scan.

Caution: Avoid publishing positive results based solely on 2-point MLINK analysis. This error is common in the literature. It used to be excusable due to running time limitations.

When to use LINKMAP

You want to draw a multipoint lod score plot.

You want to find in which gap of a fixed marker order the disease gene lies.

Major Caution: LINKMAP gives relatively little information about whether the disease may be near but outside the set of fixed markers.

Note: LINKMAP will not draw anything, but it will yield a table of values that can be plotted.

When to use ILINK

Whenever you have time to wait

When you do not feel the published thetas are valid for your data set

When you wish to estimate allele frequencies (too advanced for this course)

In the context of GenoCheck to find genotyping errors.

Major Caution: The output of ILINK is hard to understand. See README.ILINK in FD.

ILINK avoids reliance on published distances and may lead to stronger results.

Disease Placement by Recombinants

Suppose the fixed map is $M_1$ $M_2$ ... $M_n$

Suppose we established linkage between D and this region.

Suppose that there are $M_i$-D and D-$M_{i+1}$ recombinants.

Conclude that D is between $M_i$ and $M_{i+1}$.

This method of inference is incomparable in power to the likelihood method.

Allele Frequencies

Published by CEPH/Généthon for many markers using a reference panel of families. See list of FTP/WWW sites.

Results are quite sensitive to allele frequencies

Allele frequencies can be estimated from input data.

Caution: All families in reference panel are from France and Utah.

Reference: TO, Chapter 22; Jim Tomlin's allele frequency estimation recipe

Genotype and Phenotype

Dominant, Recessive, Co-dominant

Notation:

DD: Homozygous for disease

Dd: Heterozygous for disease

dd: Homozygous for non-disease

Penetrance, f: Probability that an individual of a given genotype is affected

In practice you may know affection status, but not genotype

Penetrance examples

Dominant:

DD: 1.0 Dd: 1.0 dd: 0.0

Recessive:

DD: 1.0 Dd: 0.0 dd: 0.0

Pure Codominant:

DD: 1.0 Dd: 0 < f < 1 dd: 0.0

Must See: TO, Chapter 9
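
Because you usually know affection status but not genotype, penetrances are combined with genotype frequencies to reason about the hidden genotype. The sketch below assumes Hardy-Weinberg genotype frequencies and an invented disease allele frequency q; neither assumption comes from the course, and the function names are illustrative only.

```python
DOMINANT = {"DD": 1.0, "Dd": 1.0, "dd": 0.0}
RECESSIVE = {"DD": 1.0, "Dd": 0.0, "dd": 0.0}

def genotype_frequencies(q):
    """Hardy-Weinberg genotype frequencies for disease allele frequency q."""
    p = 1.0 - q
    return {"DD": q * q, "Dd": 2.0 * p * q, "dd": p * p}

def prevalence(q, penetrance):
    """Probability that a random individual is affected."""
    freqs = genotype_frequencies(q)
    return sum(freqs[g] * penetrance[g] for g in freqs)

def genotype_given_affected(q, penetrance):
    """Bayes' rule: genotype probabilities for an affected individual."""
    freqs = genotype_frequencies(q)
    total = prevalence(q, penetrance)
    if total == 0.0:
        raise ValueError("nobody is affected under this model")
    return {g: freqs[g] * penetrance[g] / total for g in freqs}

if __name__ == "__main__":
    q = 0.01  # hypothetical disease allele frequency
    print(prevalence(q, DOMINANT), genotype_given_affected(q, DOMINANT))
    print(prevalence(q, RECESSIVE), genotype_given_affected(q, RECESSIVE))
```

Under the dominant model with a rare allele, almost every affected person is Dd, which is why affected individuals are usually treated as heterozygotes in dominant pedigrees.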

Penetrance can be used to model:

Age-of-onset effect
Gender effect
Subset of symptoms
Diagnostic uncertainty
Errors in data
Environment

LINKAGE auxiliary programs

makeped: converts initial pedigree file into properly formatted file

preplink: interactively defines locus descriptions

lcp: prepares a script for execution of a sequence of main programs

lsp: used within script to extract desired loci from pedigree file

unknown: infers missing genotypes

lrp: converts the results of LINKMAP into a nice table suitable for multipoint graphing

IMPORTANT: Use the FASTLINK version of unknown

LINKAGE/FASTLINK Troubleshooting

``We advise you to watch your mail during the first few minutes after launching a run. If for any reason (and there are many!) your computation `crashes', you will not wait needlessly before fixing it.''
-Lucien Bachner (Infobiogen, Paris)

Reference: FD: README.trouble

PEDCHECK is a useful program to identify "inconsistent" genotypes. E.g., a child has allele 4, but neither parent does.

Mendelian autosomal inheritance

Each person has two alleles at each locus.

A child inherits one allele from the father and one allele from the mother.

The set of sibling alleles S can be written so that $S = M \cup F$

in such a way that M and F are of size 1 or 2 and each sibling has 1 allele in M and 1 allele in F.

Most linkage programs check these rules.
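
The sibling rule can be checked by brute force when the parents are untyped. The sketch below is only an illustration of the rule as stated here, not the algorithm used by PEDCHECK or any LINKAGE program; the function name is invented, and genotypes are assumed to be pairs of integer allele labels.

```python
from itertools import combinations

def sibship_consistent(sib_genotypes):
    """True if sets M and F (size 1 or 2) exist such that every sibling
    has one allele in M and the other allele in F."""
    if not sib_genotypes:
        return True
    alleles = sorted({a for genotype in sib_genotypes for a in genotype})
    # Every candidate parental set: 1 or 2 of the alleles seen in the sibship.
    candidates = [{a} for a in alleles]
    candidates += [set(pair) for pair in combinations(alleles, 2)]
    for m in candidates:
        for f in candidates:
            if all((a in m and b in f) or (b in m and a in f)
                   for a, b in sib_genotypes):
                return True
    return False

if __name__ == "__main__":
    print(sibship_consistent([(1, 2), (3, 4), (1, 4)]))   # M={1,3}, F={2,4}
    print(sibship_consistent([(1, 1), (2, 2), (3, 3)]))   # needs 3 alleles per parent
```

When parental genotypes are known, the candidate sets M and F would be fixed to the observed parental alleles instead of being searched.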

Mendelian X inheritance

A female has two alleles at a locus.

A male has one allele at a locus.

A female inherits her father's allele and one of her mother's alleles.

A male inherits one of his mother's alleles.

The set of sibling alleles S can be written so that: $S = M \cup F$

in such a way that M is of size 1, F is of size 1 or 2, each male sibling has 1 allele in F, each female sibling has 1 allele in M and 1 allele in F.
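
The X-linked rule can be checked the same way. In this sketch M is the father's single allele and F is the mother's set of 1 or 2 alleles, following the sizes given above; the function name is invented, sons are given as a single integer allele each, and daughters as pairs. It is an illustration of the rule, not the check performed by any particular program.

```python
from itertools import combinations

def x_sibship_consistent(sons, daughters):
    """sons: one allele per son; daughters: (allele, allele) pairs."""
    if not sons and not daughters:
        return True
    alleles = sorted(set(sons) | {a for g in daughters for a in g})
    # Candidate maternal sets F: any 1 or 2 alleles seen in the sibship.
    maternal_sets = [{a} for a in alleles]
    maternal_sets += [set(pair) for pair in combinations(alleles, 2)]
    # Candidate paternal allele (M has size 1). With no daughters the
    # father's allele is unconstrained, so use a placeholder.
    paternal = sorted({a for g in daughters for a in g}) or [None]
    for m in paternal:
        for f in maternal_sets:
            sons_ok = all(s in f for s in sons)
            daughters_ok = all((a == m and b in f) or (b == m and a in f)
                               for a, b in daughters)
            if sons_ok and daughters_ok:
                return True
    return False

if __name__ == "__main__":
    print(x_sibship_consistent([1, 2], [(3, 1), (3, 2)]))  # father 3, mother {1,2}
    print(x_sibship_consistent([], [(1, 2), (3, 4)]))      # no shared paternal allele
```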

Checking Genotypes Further

Genotypes can be wrong, but still consistent with Mendelian rules.

Wrong genotypes usually increase the estimate of $\theta$.

One can estimate $\theta$ between markers using ILINK.

Estimated $\theta$ >> published $\theta$ is a good clue that there are errors.

Use GenoCheck (derived from ILINK) to identify the most suspect person/locus pairs.

Convince laboratory personnel to follow-up based on GenoCheck output.

Reference: CD

Checking the status of a run

On UNIX: Use ps (with flags) and look for a process running one of: ilink, linkmap, mlink.

Make sure that process status is R

For FASTLINK

Each likelihood evaluation takes roughly the same amount of time.

Programs take a checkpoint after every likelihood evaluation.

Use: ls -lt check*

to see when last checkpoint was taken.

For VITESSE

Use parental pairs number

Reference: FD: README.time

Non-parametric analysis

Affected relatives have disease for the same reason

Basic method is affected sibpairs

Newer software considers affected relative pairs

Useful when mode of inheritance is unclear

APM, GENEHUNTER, SimIBD

Identical By State vs. Identical By Descent

Reference: TO, Chapter 26

APM

$Z_{ij}$ = probability that i,j share an allele IBS

A = Sum of $Z_{ij}$ over all affected i,j pairs

Can also weight A by allele frequency.

Compare observed A to expected.

Very sensitive to allele frequencies.

Empirical p-values.
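
A minimal sketch of the statistic A follows. It assumes one common reading of the pairwise term: the probability that an allele drawn at random from i matches, identical by state, an allele drawn at random from j. The optional weight function, the names, and the example genotypes are assumptions of this sketch and are not taken from the APM documentation (AD). Computing the expected value and the empirical p-value is not shown.

```python
from itertools import combinations

def ibs_similarity(genotype_i, genotype_j, weight=None):
    """Chance that random alleles from i and j match IBS; each match may be
    weighted by a function of the matching allele (e.g. of its frequency)."""
    total = 0.0
    for a in genotype_i:
        for b in genotype_j:
            if a == b:
                total += 1.0 if weight is None else weight(a)
    return total / 4.0          # 2 x 2 equally likely allele draws

def apm_statistic(genotypes, affected, weight=None):
    """Sum the pairwise similarity over all pairs of affected individuals."""
    return sum(
        ibs_similarity(genotypes[i], genotypes[j], weight)
        for i, j in combinations(sorted(affected), 2)
    )

if __name__ == "__main__":
    # Hypothetical genotypes at one marker for three affected relatives.
    genotypes = {"p1": (1, 2), "p2": (1, 3), "p3": (2, 2)}
    freq = {1: 0.5, 2: 0.3, 3: 0.2}     # made-up allele frequencies
    print(apm_statistic(genotypes, genotypes.keys()))
    # Rare shared alleles count for more when weighted by 1 / frequency.
    print(apm_statistic(genotypes, genotypes.keys(), lambda a: 1.0 / freq[a]))
```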

GENEHUNTER

General approach using inheritance vectors.

Inheritance vector: each allele is from grandpa/grandma?

Given a probability distribution of inheritance vectors, conditional on phenotypes, compare this to uniform distribution.

Two different statistics: $S_{pairs}$, $S_{all}$

somewhat analogous to A, but using IBD.

Obtain a normally distributed Z-statistic for p-values.

Reference: GD

SimIBD

A different statistic: SimIBD

Improves APM by conditioning simulation on affection status.

Improves APM by moving away from IBS to IBD.

You get empirical p-values.

Currently does only 2-point analysis

Reference: SD

Filing a bug report

A run of a program is an experiment.

An effective bug report enables the software developer to exactly reproduce the experiment without bias as to the outcome.

Use e-mail, not phone or FAX, but beware that mailers can mangle files.

Send files as ASCII, not BinHex

State what problems occurred, but describe only effects (phenotypes) that you observe, not causes (genotypes) that you conjecture.

Reference: FD: README.bugreport

Refereeing a linkage analysis paper

Look for:

List of marker loci used
Description and justification of penetrance function
Allele frequency for disease
Method for obtaining marker allele frequencies

Look for:

2-point (disease with each locus) LOD score table
Multipoint analyses strengthening 2-point results
Explanation of what software was used

These 7 items are not sufficient, but ought to be necessary for a "parametric" analysis. If any of the 7 items is missing, do not recommend the manuscript be accepted.

Installation Options:

Invite me to make a ``field trip'' to your lab.
Use CIT's SGI multiprocessor machine "galaxy" and contact Jim Tomlin (jtomlin@helix.nih.gov)
Retrieve via ftp and install yourself

ftp: login as anonymous, leave full e-mail as password

Useful FTP/WWW/Email addresses:

Local help and pointers:

http://bimas.cit.nih.gov/linkage/index.html

Software locally available on galaxy.nih.gov. Accessible with an account on helix.nih.gov and a request.

http://bimas.cit.nih.gov/linkage/galaxy1.html

All Software Catalog:

http://linkage.rockefeller.edu

Mailing List: fastlink-list@fastlink.nih.gov

APM:

ftp watson.hgen.pitt.edu

under pub/apm

FASTLINK:

ftp fastlink.nih.gov

under pub/fastlink

GENEHUNTER:

http://waldo.wi.mit.edu/ftp/distribution/software/genehunter/gh2

GENOCHECK:

ftp softlib.cs.rice.edu

under pub/GenoCheck

LINKAGE for DOS and utilities:

ftp linkage.rockefeller.edu

under software/linkage/DOS

under software/utilities

under software/linkage/UNIX

and other subdirectories also

PEDCHECK:

ftp watson.hgen.pitt.edu

under pub/pedcheck

SimIBD:

ftp watson.hgen.pitt.edu

under pub/simibd

VITESSE:

ftp watson.hgen.pitt.edu

under pub/vitesse

Généthon allele frequencies:

ftp://www.genethon.fr/pub/Gmap/Nature-1995

http://research.marshfieldclinic.org/genetics

Follow various links to find markers or build maps.

http://www.genome.wi.mit.edu

Click on Human Physical Mapping Project

Course Overheads by Alejandro A. Schäffer, NCBI
for "Cookbook Linkage Analysis, Ab Initio" course given at CIT on October 18, 2000

Formatted for the Web by CIT/DCB/BIMAS
Thu Oct 12 16:59:54 EDT 2000
