Cookbook Linkage Analysis, Ab Initio

Alejandro A. Schäffer

NCBI

e-mail: schaffer@helix.nih.gov






Customer service through CIT/Division of Computational Biology (CIT/DCB):

Jim Tomlin: jtomlin@helix.nih.gov

http://bimas.cit.nih.gov/linkage/index.html



Recipe for disease gene hunting

  1. Identify 1 or more sets of related patients
  2. Connect patients into pedigrees
  3. Genotype patients and their families
  4. Look for polymorphisms that cosegregate with the disease
  5. When linkage is found, narrow the region with more families
  6. Once the linkage region is small, either find candidate genes and skip to step 9, or clone DNA from the area and put into artificial hosts (vectors).
  7. Sequence the DNA in the region
  8. Look for genes (e.g., by computer prediction, exon trapping, cDNA screen, etc.)
  9. Screen genes for mutations
  10. Prove mutations are causal (proof standards vary)


Ingredients for Cookbook Linkage Analysis

I will also briefly discuss the GENEHUNTER and SimIBD software.


Desired Results

Caution: All conclusions are obtained with high probability, but not certainty


Goal of this lecture:

Reduce easy cases of linkage analysis to:

  1. Family Ascertainment

  2. S.M.O.G. (Simple Matter of Genotyping)

  3. Looking up instructions

  4. File manipulation and keystroking


References:

TO: J. D. Terwilliger and J. Ott, Handbook of Human Genetic Linkage, Johns Hopkins Press, 1994.

JO: J. Ott, Analysis of Human Genetic Linkage, Johns Hopkins Press, 1991.

LD: LINKAGE Documentation

FD: FASTLINK Documentation

AD: APM Documentation

CD: GenoCheck Documentation

GD: GENEHUNTER Documentation

SD: SimIBD Documentation

VD: VITESSE Documentation


Linkage analysis software packages

Parametric:

Non-parametric:

[S.A.G.E. sages: Joan Bailey-Wilson and Alexander Wilson, at NHGRI.]

Parametric means that you specify a model of inheritance

Many disease studies still use LINKAGE

Reason: TO and associated mini-courses


Recombinant

Suppose loci A and B are conjectured to be on the same chromosome.

Individual i is a recombinant w.r.t. parent p, if i inherits the A allele from one chromatid of p and the B allele from the other chromatid of p.

Recombination Fraction:

The probability that a child is recombinant between loci A and B is denoted $\theta$.

Low probability of recombination ($\theta$ well below 0.5) indicates linkage


For ordered multiple loci:

A B C D

define a vector $\theta = (\theta_{AB}, \theta_{BC}, \theta_{CD})$

which is an ordered list of recombination fractions between adjacent loci

Recombination fractions vary for male/female parents, but published maps tend to be gender-averaged


Recombination Fraction vs. Genetic Distance

Genetic Distance $x$ is the expected number of crossovers between locus A and locus B.

If A and B are on different chromosomes, $\theta = 0.5$, but $x$ is not defined.

$0 \le \theta \le 0.5$, while $x$ may be larger than 1.

On the same chromosome, $\theta \le x$, and they are close for small values.

$x$ is measured in Morgans, $\theta$ is unitless

1 Morgan means 1 expected crossover

Relationship between Morgans and Megabases is poorly understood.


Mapping Functions

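Mapping functions convert from $\theta$ to $x$ and back.

Canonical application: compute $\theta_{AC}$ knowing $\theta_{AB}$ and $\theta_{BC}$

  1. Apply m.f. to get $x_{AB}$ and $x_{BC}$

  2. Compute $x_{AC} = x_{AB} + x_{BC}$

  3. Apply m.f. to get $\theta_{AC}$

Published maps show $x$ between adjacent markers, whereas LINKAGE needs $\theta$, and you may not want to use loci that are adjacent on the map.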

JO: Chapter 1 and FD: README.mapfun
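
The three-step recipe above can be sketched in Python. This is only an illustration: the overheads do not say which mapping function is intended, so the sketch assumes the Haldane function (no interference) by default and offers Kosambi as an alternative. All function names are made up for this example, and distances are in Morgans.

```python
import math

def haldane_theta_to_x(theta):
    """Haldane map distance x (Morgans) for a recombination fraction theta."""
    if not 0.0 <= theta < 0.5:
        raise ValueError("theta must satisfy 0 <= theta < 0.5")
    return -0.5 * math.log(1.0 - 2.0 * theta)

def haldane_x_to_theta(x):
    """Recombination fraction for a Haldane map distance x (Morgans)."""
    return 0.5 * (1.0 - math.exp(-2.0 * x))

def kosambi_theta_to_x(theta):
    """Kosambi map distance x (Morgans) for a recombination fraction theta."""
    if not 0.0 <= theta < 0.5:
        raise ValueError("theta must satisfy 0 <= theta < 0.5")
    return 0.25 * math.log((1.0 + 2.0 * theta) / (1.0 - 2.0 * theta))

def kosambi_x_to_theta(x):
    """Recombination fraction for a Kosambi map distance x (Morgans)."""
    return 0.5 * math.tanh(2.0 * x)

def combine_adjacent(theta_ab, theta_bc,
                     to_x=haldane_theta_to_x, to_theta=haldane_x_to_theta):
    """Compute theta_AC from theta_AB and theta_BC for loci in order A B C."""
    x_ab = to_x(theta_ab)          # step 1: thetas to map distances
    x_bc = to_x(theta_bc)
    x_ac = x_ab + x_bc             # step 2: map distances are additive
    return to_theta(x_ac)          # step 3: back to a recombination fraction

# Two adjacent intervals with theta = 0.1 each give theta_AC of about 0.18
# under Haldane, noticeably less than the naive sum 0.2.
theta_ac = combine_adjacent(0.1, 0.1)
```

The point of the detour through $x$ is that recombination fractions are not additive along the map, while map distances are.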


Essence of LINKAGE:

Given a fixed order of markers and disease loci, e.g.:

D

and a candidate $\theta$ vector,

compute a number called the ``likelihood'' to quantify:

How well do the data fit the candidate $\theta$?

Higher likelihood is better.


What is FASTLINK?

New versions of LINKAGE main programs


What is VITESSE?

Program to replace MLINK and LINKMAP

``Simple'' means each person has <=2 grandparents in pedigree, and no loops.

J. Tomlin has a utility program to combine FASTLINK (LINKMAP) and VITESSE output for mixed-simple data sets.

Reference: VD


Maximum Likelihood Estimation

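Compare two candidates: $\theta_1$, $\theta_2$

$H_1$: $\theta_1$ is the correct value

$H_2$: $\theta_2$ is the correct value

Compute the likelihood ratio:

Like($\theta_1$)/Like($\theta_2$)

If this ratio is above a prescribed threshold, then accept $H_1$, reject $H_2$; if below a different threshold, accept $H_2$, reject $H_1$.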

Thresholds depend on the bad consequences to society from false negatives and false positives.


Testing for linkage

Unlinked means that $\theta$ between the disease and all other markers is 0.5

Linked means that $\theta$ between the disease and some other marker is (substantially) $< 0.5$. Use the best choice of $\theta$ for the linked hypothesis.

Standard threshold for autosomal linkage: 1000

Standard threshold for autosomal nonlinkage: 1/100

Reference: JO, Sec. 4.4
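
A minimal Python sketch of the two-threshold decision rule, using the standard autosomal thresholds quoted above. The function name and the three outcome labels are invented for this example; how to treat a ratio exactly equal to a threshold is a convention the overheads do not specify, so the sketch uses strict inequalities.

```python
LINKAGE_THRESHOLD = 1000.0        # likelihood ratio needed to accept linkage
NONLINKAGE_THRESHOLD = 1.0 / 100  # likelihood ratio needed to accept nonlinkage

def classify_likelihood_ratio(ratio):
    """Classify Like(linked) / Like(unlinked) against the two thresholds."""
    if ratio <= 0.0:
        raise ValueError("a likelihood ratio must be positive")
    if ratio > LINKAGE_THRESHOLD:
        return "linked"
    if ratio < NONLINKAGE_THRESHOLD:
        return "unlinked"
    return "inconclusive"         # between the thresholds: collect more data
```

Ratios that fall between the two thresholds support neither conclusion, which is why there are two thresholds rather than one.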


What's a lod score?

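Let $L(\theta)$ be the pedigree likelihood when the disease is linked at recombination fraction $\theta$

Let $L(0.5)$ be the pedigree likelihood when the disease is unlinked

The lod score is $Z(\theta) = \log_{10} [L(\theta)/L(0.5)]$

A lod score of magnitude > z mathematically implies that the statistical probability of error is at most $10^{-z}$, ignoring multiple testing issues.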

Caution: What LINKAGE/FASTLINK prints for log(L) may be off by an additive term, but lod scores are correct.
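
To make the definition concrete, here is a small Python sketch for the simplest textbook situation. It assumes n phase-known, fully informative meioses of which k are recombinant, so that the likelihood is $\theta^k (1-\theta)^{n-k}$. Real pedigree likelihoods as computed by LINKAGE/FASTLINK are far more involved; the function names and the grid of $\theta$ values are choices made for this example only.

```python
import math

def lod_phase_known(n_meioses, n_recombinants, theta):
    """Two-point lod score log10(L(theta) / L(0.5)).

    Assumes phase-known, fully informative meioses, so that
    L(theta) = theta**k * (1 - theta)**(n - k) with k recombinants.
    """
    if not 0.0 < theta <= 0.5:
        raise ValueError("theta must satisfy 0 < theta <= 0.5")
    if not 0 <= n_recombinants <= n_meioses:
        raise ValueError("need 0 <= n_recombinants <= n_meioses")
    n, k = n_meioses, n_recombinants
    log_l_theta = k * math.log10(theta) + (n - k) * math.log10(1.0 - theta)
    log_l_unlinked = n * math.log10(0.5)
    return log_l_theta - log_l_unlinked

def best_theta(n_meioses, n_recombinants, grid=None):
    """Grid value of theta with the highest lod score (default 0.01 to 0.50)."""
    if grid is None:
        grid = [i / 100.0 for i in range(1, 51)]
    return max(grid, key=lambda t: lod_phase_known(n_meioses, n_recombinants, t))

# Example: 1 recombinant in 10 meioses peaks at theta = 0.1 with a lod near 1.6,
# which is below the usual threshold of 3 for declaring linkage.
peak = best_theta(10, 1)
peak_lod = lod_phase_known(10, 1, peak)
```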


Logarithms

Why do linkage analysis experts talk about lod scores rather than likelihoods?

Logarithms convert multiplication/division into addition/subtraction. This is very useful for multiple data sets.

So, the lod score can be computed as: $\log_{10} L(\theta) - \log_{10} L(0.5)$
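
A short Python sketch of why this matters for multiple data sets: independent families multiply likelihoods, so their lod scores simply add. The input format (one pair of base-10 log likelihoods per family) and all numbers are hypothetical. The sketch also shows why an additive offset in the printed log likelihoods is harmless, provided the same offset affects both numbers for a family.

```python
def total_lod(families):
    """Sum per-family lod scores.

    families: list of (log10_L_theta, log10_L_unlinked) pairs, one pair per
    family, both evaluated on the same data and at the same theta.
    """
    return sum(log_l_theta - log_l_unlinked
               for log_l_theta, log_l_unlinked in families)

# Hypothetical log10 likelihoods for two families.
families = [(-10.0, -11.5), (-20.0, -21.0)]
combined = total_lod(families)    # 1.5 + 1.0 = 2.5

# Shifting both log likelihoods of a family by the same constant
# leaves its lod score, and therefore the total, unchanged.
shifted = [(lt + 7.0, l0 + 7.0) for lt, l0 in families]
```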


Stages of a generic linkage study

  1. Genome scan to find possible regions

  2. Identify the correct region

  3. Pin disease gene between two known genes (Fine Mapping/Testing for Order)


Genome Scan

Use 150-400 polymorphic probes at 8-20 cM spacing.

Always do two-locus analysis between the disease and a marker probe of known position.

Interested in lod scores above 1 or 2.

Standard marker sets suggested by CHLC.

Two-stage scans can reduce lab costs.

Never, never compare in print the lod score with marker A against the lod score with marker B.


Identify the Correct Region

Use multiple probes at 1-4 cM spacing.

Do two- or three-locus analysis between the disease and marker probes of known position.

Interested in lod scores above 3.

Can find suitable markers from a combination of the Marshfield, MIT, and Généthon maps.

Use markers whose relative order is agreed upon.


Testing for marker order

Fixed marker map:

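Find the best $\theta$ vector for each of the 4 placements of disease D.

Let the best likelihood for each order be $L_1$, $L_2$, $L_3$, $L_4$.

Compare differences between $\log_{10}(L_1)$, $\log_{10}(L_2)$, $\log_{10}(L_3)$, $\log_{10}(L_4)$

If $\log_{10}(L_i) - \log_{10}(L_j) = t$, then the odds that j is incorrect in favor of i are $10^t : 1$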

Desired level for t: 3.0
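
The comparison of orders can be sketched in Python as follows. The order labels and log likelihood values are hypothetical, base-10 logarithms are assumed (by analogy with lod scores), and the function name is invented for this example.

```python
def compare_orders(log10_likelihoods, desired_t=3.0):
    """Compare the best log10 likelihood of each candidate order.

    log10_likelihoods: dict mapping an order label to log10 of its best
    likelihood. Returns (best_order, t_values, is_supported), where
    t_values[j] is the log10 odds against order j in favor of the best order.
    """
    best = max(log10_likelihoods, key=log10_likelihoods.get)
    t_values = {order: log10_likelihoods[best] - value
                for order, value in log10_likelihoods.items()
                if order != best}
    is_supported = all(t >= desired_t for t in t_values.values())
    return best, t_values, is_supported

# Hypothetical results for the 4 placements of D among three markers.
orders = {
    "D-M1-M2-M3": -52.0,
    "M1-D-M2-M3": -48.0,
    "M1-M2-D-M3": -51.5,
    "M1-M2-M3-D": -55.25,
}
best, t_values, ok = compare_orders(orders)
odds_against = {order: 10.0 ** t for order, t in t_values.items()}
```

Here the best order beats every alternative by t of at least 3.5, so each alternative is rejected at odds of more than 1000:1.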


Marker order cautions

May have very high lod scores, yet uncertain marker order

Published maps may be incorrect over small regions

Published $\theta$ values are estimated from small samples

Limited variation among samples

Association between marker alleles and disease phenotypes (linkage disequilibrium) can confound the computations


LINKAGE/FASTLINK main programs:

Differ in: How thetas are chosen, output format

MLINK: Used to vary 1 $\theta$ at a time

LINKMAP: Used to vary 1 or 2 adjacent $\theta$ values at a time

ILINK: Used to find a locally optimal $\theta$ vector.

Locally optimal means that it cannot be perturbed by a small amount to get a different $\theta$ vector with a better likelihood.

VITESSE coalesces LINKMAP and MLINK by taking the LINKAGE program name as a parameter.


When to use MLINK

You want a table of 2-point lod scores comparing the disease against a variety of markers and for a variety of theta values.

You need to fill in the lod score at specific theta vectors.

Used for genome scan.

Caution: Avoid publishing positive results based solely on 2-point MLINK analysis. This error is common in the literature. It used to be excusable due to running time limitations.


When to use LINKMAP

You want to draw a multipoint lod score plot.

You want to find in which gap of a fixed marker order the disease gene lies.

Major Caution: LINKMAP gives relatively little information about whether the disease may be near but outside the set of fixed markers.

Note: LINKMAP will not draw anything, but it will yield a table of values that can be plotted.


When to use ILINK

Whenever you have time to wait

When you do not feel the published thetas are valid for your data set

When you wish to estimate allele frequencies (too advanced for this course)

In the context of GenoCheck to find genotyping errors.

Major Caution: The output of ILINK is hard to understand. See README.ILINK in FD.

ILINK avoids reliance on published distances and may lead to stronger results.


Disease Placement by Recombinants

Suppose the fixed map is

Suppose we established linkage between D and this region.

Suppose that there are and recombinants.

Conclude that D is between and .

This method of inference is incomparable in power to the likelihood method.


Allele Frequencies

Published by CEPH/Généthon for many markers using a reference panel of families. See list of FTP/WWW sites.

Results are quite sensitive to allele frequencies

Allele frequencies can be estimated from input data.

Caution: All families in reference panel are from France and Utah.

Reference: TO, Chapter 22; Jim Tomlin's allele frequency estimation recipe


Genotype and Phenotype

Dominant, Recessive, Co-dominant

Notation:

DD: Homozygous for disease

Dd: Heterozygous for disease

dd: Homozygous for non-disease



Penetrance, f: Probability that an individual of a given genotype is affected

In practice you may know affection status, but not genotype


Penetrance examples

Dominant:

DD: 1.0 Dd: 1.0 dd: 0.0

Recessive:

DD: 1.0 Dd: 0.0 dd: 0.0

Pure Codominant:

DD: 1.0 Dd: 0 < f < 1 dd: 0.0

Must See: TO, Chapter 9
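
Because you usually observe affection status rather than genotype, the penetrances are what connect the two. The Python sketch below uses the dominant and recessive tables above and adds two assumptions that are not in the overheads: a hypothetical disease allele frequency q, and Hardy-Weinberg proportions for the genotype frequencies. Function names are invented for this example.

```python
DOMINANT = {"DD": 1.0, "Dd": 1.0, "dd": 0.0}
RECESSIVE = {"DD": 1.0, "Dd": 0.0, "dd": 0.0}

def genotype_priors(q):
    """Genotype frequencies for disease allele frequency q (Hardy-Weinberg assumed)."""
    return {"DD": q * q, "Dd": 2.0 * q * (1.0 - q), "dd": (1.0 - q) ** 2}

def genotype_posterior(penetrance, q, affected=True):
    """P(genotype | affection status) by Bayes' rule."""
    priors = genotype_priors(q)
    weights = {}
    for genotype, prior in priors.items():
        f = penetrance[genotype]                  # P(affected | genotype)
        weights[genotype] = prior * (f if affected else 1.0 - f)
    total = sum(weights.values())
    if total == 0.0:
        raise ValueError("this affection status is impossible under the model")
    return {genotype: w / total for genotype, w in weights.items()}

# With a rare dominant allele (q = 0.01, hypothetical), an affected person
# is almost certainly heterozygous Dd rather than homozygous DD.
posterior = genotype_posterior(DOMINANT, 0.01, affected=True)
```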


Penetrance can be used to model:


LINKAGE auxiliary programs

makeped: converts initial pedigree file into properly formatted file

preplink: interactively defines locus descriptions

lcp: prepares a script for execution of a sequence of main programs

lsp: used within script to extract desired loci from pedigree file

unknown: infers missing genotypes

lrp: converts the results of LINKMAP into a nice table suitable for multipoint graphing

IMPORTANT: Use the FASTLINK version of unknown


LINKAGE/FASTLINK Troubleshooting



``We advise you to keep an eye on your mail for the first few minutes after launching. Indeed, if for whatever reason (and there are many!) your computation `crashes', you will not wait needlessly before fixing it.''

-Lucien Bachner (Infobiogen, Paris)



Reference: FD: README.trouble


PEDCHECK is a useful program to identify "inconsistent" genotypes. E.g., a child has allele 4, but neither parent does.


Mendelian autosomal inheritance

Each person has two alleles at each locus.

A child inherits one allele from the father and one allele from the mother.

The set of sibling alleles S can be written so that

$S = M \cup F$

in such a way that M and F are of size 1 or 2 and each sibling has 1 allele in M and 1 allele in F.

Most linkage programs check these rules.
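
The sibship rule can be checked by brute force, as in the Python sketch below. It is an illustration of the rule as stated, not the algorithm used by any particular linkage program; the function name and the encoding of a genotype as a pair of allele labels are assumptions made for this example.

```python
from itertools import combinations

def sibship_is_mendelian(sibling_genotypes):
    """Check the autosomal sibship rule.

    sibling_genotypes: list of (allele, allele) pairs, one per sibling.
    Returns True if parental allele sets M and F exist, each of size 1 or 2,
    such that every sibling has one allele in M and one allele in F.
    """
    alleles = list({a for genotype in sibling_genotypes for a in genotype})
    if not alleles:
        return True                     # nothing typed, nothing to contradict
    candidates = [set(c) for size in (1, 2)
                  for c in combinations(alleles, size)]
    for m in candidates:
        for f in candidates:
            if all((a in m and b in f) or (a in f and b in m)
                   for a, b in sibling_genotypes):
                return True
    return False

# Consistent: parents could be 1/4 and 2/3.
consistent = sibship_is_mendelian([(1, 2), (3, 4), (1, 3)])
# Inconsistent: allele 1 would need a partner set holding 2, 3 and 4.
inconsistent = sibship_is_mendelian([(1, 2), (1, 3), (1, 4), (3, 4)])
```

Searching only over alleles seen in the siblings loses nothing, because parental alleles that no child inherited cannot affect the check.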


Mendelian X inheritance

A female has two alleles at a locus.

A male has one allele at a locus.

A female inherits her father's allele and one of her mother's alleles.

A male inherits one of his mother's alleles.

The set of sibling alleles S can be written so that:

$S = M \cup F$

in such a way that M is of size 1, F is of size 1 or 2, each male sibling has 1 allele in F, each female sibling has 1 allele in M and 1 allele in F.
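
The X-linked version of the sibship rule can be sketched the same way in Python. Here M is read as the father's single allele and F as the mother's allele set, which is an interpretation of the notation above; the function name and the input encoding are assumptions made for this example.

```python
from itertools import combinations

def x_sibship_is_mendelian(brother_alleles, sister_genotypes):
    """Check the X-linked sibship rule.

    brother_alleles: list with one allele per male sibling.
    sister_genotypes: list of (allele, allele) pairs, one per female sibling.
    Returns True if a single paternal allele and a maternal set of 1 or 2
    alleles exist such that every brother's allele is maternal and every
    sister has the paternal allele plus one maternal allele.
    """
    alleles = list(set(brother_alleles) |
                   {a for genotype in sister_genotypes for a in genotype})
    if not alleles:
        return True
    mother_sets = [set(c) for size in (1, 2)
                   for c in combinations(alleles, size)]
    for father in alleles:
        for mother in mother_sets:
            sons_ok = all(a in mother for a in brother_alleles)
            daughters_ok = all((a == father and b in mother) or
                               (b == father and a in mother)
                               for a, b in sister_genotypes)
            if sons_ok and daughters_ok:
                return True
    return False

# Consistent: father carries 3, mother carries 1 and 2.
ok = x_sibship_is_mendelian([1, 2], [(3, 1), (3, 2)])
# Inconsistent: the two sisters share no allele that could be paternal.
bad = x_sibship_is_mendelian([], [(1, 2), (3, 4)])
```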


Checking Genotypes Further

Genotypes can be wrong, but still consistent with Mendelian rules.

Wrong genotypes usually increase the estimate of $\theta$.

One can estimate $\theta$ between markers using ILINK.

Estimated $\theta$ >> published $\theta$ is a good clue that there are errors.

Use GenoCheck (derived from ILINK) to identify the most suspect person/locus pairs.

Convince laboratory personnel to follow-up based on GenoCheck output.

Reference: CD


Checking the status of a run

On UNIX: Use ps (with flags) and look for a process running one of: ilink, linkmap, mlink.

Make sure that process status is R

For FASTLINK

Each likelihood evaluation takes roughly the same amount of time.

Programs take a checkpoint after every likelihood evaluation.

Use: ls -lt check*

to see when last checkpoint was taken.
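
If you would rather not read timestamps by eye, the same check can be scripted. The Python sketch below only assumes what the `ls -lt check*` command above implies, namely that checkpoint files match the pattern `check*` in the run directory; the function name is invented for this example.

```python
import glob
import os
import time

def newest_checkpoint_age(directory=".", pattern="check*", now=None):
    """Return (path, age_in_seconds) for the most recently modified
    checkpoint file, or None if no file matches the pattern."""
    paths = glob.glob(os.path.join(directory, pattern))
    if not paths:
        return None
    newest = max(paths, key=os.path.getmtime)
    if now is None:
        now = time.time()
    return newest, now - os.path.getmtime(newest)

result = newest_checkpoint_age()
if result is None:
    print("no checkpoint files found")
else:
    path, age = result
    print("%s was last written %.0f seconds ago" % (path, age))
```

Since each likelihood evaluation takes roughly the same time, an age much larger than the usual gap between checkpoints suggests the run has stalled.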

For VITESSE

Use parental pairs number

Reference: FD: README.time


Non-parametric analysis

Affected relatives have disease for the same reason

Basic method is affected sibpairs

Newer software considers affected relative pairs

Useful when mode of inheritance is unclear

APM, GENEHUNTER, SimIBD

Identical By State vs. Identical By Descent

Reference: TO, Chapter 26


APM

$Z_{ij}$ = probability that i,j share an allele IBS

A = Sum of $Z_{ij}$ over all affected i,j pairs

Can also weight A by allele frequency.

Compare observed A to expected.

Very sensitive to allele frequencies.

Empirical p-values.
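
A minimal Python sketch of the unweighted statistic. It assumes one common formalization of IBS sharing, namely the probability that a randomly chosen allele of i matches a randomly chosen allele of j; the APM documentation (AD) is the authority on the exact definition and on the allele-frequency weights. Function names and the genotype encoding are invented for this example, and the comparison with the expected value (by simulation) is not shown.

```python
from itertools import combinations

def ibs_sharing(genotype_i, genotype_j):
    """Probability that a random allele from i is identical by state
    to a random allele from j (each genotype is a pair of alleles)."""
    matches = sum(1 for a in genotype_i for b in genotype_j if a == b)
    return matches / 4.0

def apm_statistic(affected_genotypes):
    """Sum of pairwise IBS sharing over all pairs of affected individuals."""
    return sum(ibs_sharing(gi, gj)
               for gi, gj in combinations(affected_genotypes, 2))

# Three affected relatives typed at one marker (hypothetical genotypes).
a_observed = apm_statistic([(1, 2), (1, 2), (3, 4)])   # 0.5 + 0 + 0
```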


GENEHUNTER

General approach using inheritance vectors.

Inheritance vector: each allele is from grandpa/grandma?

Given a probability distribution of inheritance vectors, conditional on phenotypes, compare this to uniform distribution.

Two different statistics: $S_{pairs}$, $S_{all}$

$S_{pairs}$ is somewhat analogous to A, but using IBD.

Obtain a normally distributed Z-statistic for p-values.

Reference: GD


SimIBD

A different statistic: SimIBD

Improves APM by conditioning simulation on affection status.

Improves APM by moving away from IBS to IBD.

You get empirical p-values.

Currently does only 2-point analysis

Reference: SD


Filing a bug report

A run of a program is an experiment.

An effective bug report enables the software developer to exactly reproduce the experiment without bias as to the outcome.

Use e-mail, not phone or FAX, but beware that mailers can mangle files.

Send files as ASCII, not BinHex

State what problems occurred, but describe only effects (phenotypes) that you observe, not causes (genotypes) that you conjecture.

Reference: FD: README.bugreport


Refereeing a linkage analysis paper

Look for:

Look for:

These 7 items are not sufficient, but ought to be necessary for a "parametric" analysis. If any of the 7 items is missing, do not recommend the manuscript be accepted.


Installation Options:

  1. Invite me to make a ``field trip'' to your lab.

  2. Use CIT's SGI multiprocessor machine "galaxy" and contact Jim Tomlin (jtomlin@helix.nih.gov)

  3. Retrieve via ftp and install yourself

ftp: login as anonymous, leave full e-mail as password


Useful FTP/WWW/Email addresses:


Local help and pointers:

http://bimas.cit.nih.gov/linkage/index.html

Software locally available on galaxy.nih.gov. Accessible with an account on helix.nih.gov and a request.

http://bimas.cit.nih.gov/linkage/galaxy1.html

All Software Catalog:

http://linkage.rockefeller.edu

Mailing List: fastlink-list@fastlink.nih.gov



APM:

ftp watson.hgen.pitt.edu

under pub/apm



FASTLINK:

ftp fastlink.nih.gov

under pub/fastlink



GENEHUNTER:

http://waldo.wi.mit.edu/ftp/distribution/software/genehunter/gh2



GENOCHECK:

ftp softlib.cs.rice.edu

under pub/GenoCheck



LINKAGE for DOS and utilities:

ftp linkage.rockefeller.edu

under software/linkage/DOS

under software/utilities

under software/linkage/UNIX

and other subdirectories also




PEDCHECK:
ftp watson.hgen.pitt.edu

under pub/pedcheck



SimIBD:

ftp watson.hgen.pitt.edu

under pub/simibd



VITESSE:

ftp watson.hgen.pitt.edu

under pub/vitesse



Généthon allele frequencies:

ftp://www.genethon.fr/pub/Gmap/Nature-1995



http://research.marshfieldclinic.org/genetics

Follow various links to find markers or build maps.



http://www.genome.wi.mit.edu
Click on Human Physical Mapping Project



Course Overheads by Alejandro A. Schäffer, NCBI
for "Cookbook Linkage Analysis, Ab Initio" course given at CIT on October 18, 2000


Formatted for the Web by CIT/DCB/BIMAS
Thu Oct 12 16:59:54 EDT 2000