Abstract
Various expressions related to the length of a conserved haplotype around a polymorphism of known frequency are derived. We obtain exact expressions for the probability that no recombination has occurred in a sample or subsample. We obtain an approximation for the probability that no recombination that could give rise to a detectable recombination event (through the four-gamete test) has occurred. The probabilities can be used to obtain approximate distributions for the length of variously defined haplotypes around a polymorphic site. The implications of our results for data analysis, and in particular for detecting selection, are discussed.
CHROMOSOMES that have inherited a particular mutation from a common ancestral chromosome must also have inherited a short chromosomal region surrounding the mutation. The length of the piece that is identical by descent to the ancestral chromosome will vary between the chromosomes in a complex pattern. The phenomenon is not directly observable, but gives rise to linkage disequilibrium and sharing of marker haplotypes, which can be observed in polymorphism data (reviewed, for example, in Nordborg and Tavaré 2002).
There is currently tremendous interest in linkage disequilibrium and haplotype sharing, the primary reason being their importance in association mapping of human disease loci (Daly et al. 2001; Patil et al. 2001; Dawson et al. 2002; Gabriel et al. 2002; Phillips et al. 2003). The extent of linkage disequilibrium (LD) and haplotype sharing plays an important role in determining the density of single-nucleotide polymorphisms (SNPs) that is required for mapping. In addition, the distribution of haplotype sharing can be important in actual mapping methods (McPeek and Strahs 1999; Morris et al. 2000, 2002; Liu et al. 2001).
Outside of mapping, the distribution of haplotype lengths is of importance to evolutionary biologists, because it is needed to evaluate claims about particular haplotypes being “too long” to fit neutrality. A mutation that has been driven by directional selection (either positive or negative) to its current population frequency will typically have reached that frequency much faster than if it had reached it through genetic drift alone (Maruyama 1977). This means that recombination will have had less time to break up the ancestral haplotype (Kaplan et al. 1989). Mutations that are surrounded by regions of haplotype sharing that are unusually extensive given the frequency of the mutation are therefore candidates for having been influenced by selection. The problem is deciding what “unusual” means.
In this article we derive some basic results concerning the extent of LD and haplotype sharing surrounding a focal mutation (e.g., a SNP) that has been determined to have a certain frequency in a sample. We assume a standard neutral model: our results can therefore serve as a “null model” with which data can be compared.
RESULTS
The model: Consider a sample of n sequences from a population of N diploid individuals that evolves according to a standard neutral model. The coalescent approximation is employed throughout (see, e.g., Nordborg 2001). Imagine further that the sample contains a diallelic polymorphism at a particular site 0 (e.g., a SNP locus). We assume that the polymorphism was created by a unique mutational event so that the mutation rate at the focal site is zero. It is also assumed that the ancestral allelic state is known. Let i and j = n – i represent the number of the sampled haplotypes with the ancestral allele and those with the mutant allele, respectively. Denote this state A(i, j). We describe the genealogical history of a sample conditional on this configuration. An example genealogy for A(7, 3) is shown in Figure 1. A mutation from “A” to “T” creates the mutant allelic class, which consists of sequences 8, 9, and 10. During the time between the mutation and the present, no coalescence can occur between the two allelic classes. The coalescent conditional on A(i, j) differs from the standard theory (Kingman 1982; Hudson 1983; Tajima 1983) in many ways. Relevant theory has been developed by Innan and Tajima (1997), Griffiths and Tavaré (1998, 1999), and Wiuf and Donnelly (1999).
In this article, we consider how long the genealogical relationship among the sampled sequences at site 0 is conserved along the chromosome. In the absence of recombination, exactly the same genealogy is obtained at any site along the chromosome; recombination causes genealogies to change gradually as we move farther away from site 0.
The probability of no recombination: Consider the genealogy at a site L, located L sites away from site 0. Recombination occurs with probability r per site per generation: in the coalescent setting, we use the scaled rate ρ = lim_{N→∞} 4Nr (e.g., Nordborg 2001). To simplify expressions, we also define R = Lρ.
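The scaling above can be illustrated with a short calculation. This is only a sketch: the function name and the numerical values of N, r, and L are hypothetical choices for illustration and do not come from this article.

```python
def scaled_distance(N, r, L):
    """Return R = L * rho, where rho = 4 * N * r is the scaled per-site rate.

    N: diploid population size
    r: recombination probability per site per generation
    L: distance from the focal site, in sites
    """
    rho = 4 * N * r  # scaled recombination rate per site
    return L * rho


# Hypothetical values, chosen only to show the order of magnitude of R.
R = scaled_distance(N=10_000, r=1e-8, L=50_000)
print(f"R = {R:.3f}")
```

With these illustrative values R is close to 20; halving either N, r, or L halves R.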
We first seek the probability that no recombination has occurred between 0 and L in the history of the sample, i.e., before the sample reaches its most recent common ancestor (MRCA). In the absence of recombination the genealogies at these two sites must be identical. We derive this probability using the methods of Innan and Tajima (1997).
Consider the coalescent process at site 0 starting at A(i, j), j > 1. When the first coalescence event occurs, A(i, j) changes to either A(i – 1, j) or A(i, j – 1) with probabilities (i – 1)/(n – 1) and j/(n – 1), respectively (Innan and Tajima 1997, Figure 3A). The waiting time until the first coalescence is exponentially distributed with rate (
The above results concern the genealogy of the whole sample. If we are interested in questions like the extent of haplotype sharing between members of a particular allelic class, we need results for the genealogy of the appropriate subsample. First we consider the mutant allelic class. Let P_{M}(R | i, j) be the probability of no recombination between 0 and L in the ancestry of the mutant allelic class, so that the genealogies of the mutant allelic class at the two sites are exactly the same. The same reasoning that led to Equation 2 leads to the following recursion for P_{M}(R | i, j), j > 1,
Similarly, we consider P_{A}(R | i, j) to be the probability of no recombination between 0 and L in the ancestry of the ancestral allelic class. For j > 1 we have
Figure 2 shows P_{M}(R | i, j) and P_{A}(R | i, j) for A(40, 10) and A(10, 40). When the mutant allelic class is common (Figure 2A), the genealogy of the mutant allelic class decays more quickly than that of the ancestral allelic class. When the mutant allelic class is rare (Figure 2B), the genealogy of the mutant allelic class is conserved over quite a long distance, while that of the ancestral allelic class decays very quickly. Note that P_{M}(40, 10) > P_{A}(10, 40): this reflects the fact that the mutant allelic class is younger.
The probability of tree compatibility: In the previous section, we considered the extent of haplotype sharing in the sense of identity by descent with respect to recombination. This cannot of course be observed directly, but must be inferred from data. Under the infinite-site model, unless recombination has occurred between two sites in the history of the sample, there can be at most three out of four possible haplotypes (the “four-gamete” test of Hudson and Kaplan 1985) and D′ (Lewontin 1964) must always be 1. These two statistics can be viewed as tests for recombination (in which case they amount to the same thing), but it must be remembered that they have very low power: most recombination events will not be detected. One reason for this is that mutations may not have occurred on the appropriate branches on the genealogy: even if recombination has occurred, any particular pair of SNPs may not reveal it. Another reason is that the underlying genealogy may be such that recombination cannot be detected, even with infinitely many polymorphic sites (Hudson and Kaplan 1985). This is particularly important in the present context because closely linked sites will tend to have similar genealogies.
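The four-gamete test described above can be sketched in a few lines. The sketch assumes two diallelic sites coded 0/1 (ancestral/mutant) across the same ordered set of sampled haplotypes; the function name and the example data are ours, for illustration only.

```python
def four_gamete_test(site_a, site_b):
    """Return True if all four two-site haplotypes (gametes) are present.

    site_a, site_b: sequences of 0/1 alleles, one entry per sampled haplotype,
    in the same haplotype order. Under the infinite-site model, True implies
    that recombination has occurred between the two sites.
    """
    if len(site_a) != len(site_b):
        raise ValueError("both sites must be scored on the same haplotypes")
    gametes = set(zip(site_a, site_b))  # distinct two-site haplotypes observed
    return len(gametes) == 4


# Three gametes only: no recombination is detected.
print(four_gamete_test([0, 0, 1, 1], [0, 1, 1, 1]))
# All four gametes present: recombination is detected.
print(four_gamete_test([0, 0, 1, 1], [0, 1, 0, 1]))
```

A result of False does not show that recombination is absent; as noted above, most recombination events leave no trace in a given pair of sites.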
A general analytical treatment of the extent of haplotype sharing visible in data would be nice, but seems extremely difficult. We derive a heuristic approximation for a particular problem, namely the probability that the genealogy at site L is “compatible” with the genealogy at site 0 in the sense that recombination between them cannot possibly be detected using either of the tests described above. Our definition of tree compatibility is illustrated in Figure 3. The allelic state at site 0 is A(7, 3) because haplotypes 8, 9, and 10 share a mutation. Now consider the genealogy at site L. We define this genealogy as compatible if 8, 9, and 10 are monophyletic at L, i.e., if they are related more closely to each other than to any other sampled haplotype at L. Genealogies that do not have this property are incompatible: with infinitely many polymorphic sites, the recombination between 0 and L would be detected.
Let P_{C}(R | i, j) be the probability that the genealogy at site L is compatible with that at site 0 where the allelic state is A(i, j). An approximate expression for this probability can be obtained by modifying the equations developed in the previous section. Our approximation is best explained through an example. Consider a sample where the allelic state at 0 is A(7, 3) (Figure 4A). Seven haplotypes belong to the ancestral class and three belong to the mutant class. Going backward in time, a recombination might occur before the first coalescence. Assume that an ancestral haplotype (number 7, say) undergoes recombination and breaks into two haplotypic lineages (for more on the ancestral recombination graph, see, e.g., Nordborg 2001). Recombination occurs with either an ancestral haplotype or a mutant haplotype. The probabilities of these two events are 1 – X and X, respectively, where X is the frequency of the mutant haplotype (allelic class) in the population. Given that the recombination occurred in an ancestral haplotype, only recombination with a mutant haplotype could make the genealogy at site L incompatible with that of site 0. Of course we do not know X, and will use X̄, its expected value given the sample configuration A(i, j), as a proxy. As is shown in the appendix, X̄ = j/(n + 1).
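The value X̄ = j/(n + 1) can be checked numerically. The sketch below rests on an assumption that is ours and is labeled as such: that under the standard neutral model the population frequency x of a mutant allele has density proportional to 1/x, and that the sample is drawn binomially, so that the density of X given A(i, j) is proportional to x^(j – 1)(1 – x)^i. The appendix gives the authors' own derivation; the function name here is hypothetical.

```python
def expected_mutant_frequency(i, j, steps=20_000):
    """Midpoint-rule estimate of E[X | A(i, j)].

    ASSUMPTION (ours, for illustration): prior density of X proportional to
    1/x, binomial sampling of j mutant alleles among n = i + j haplotypes.
    """
    h = 1.0 / steps
    numerator = 0.0
    denominator = 0.0
    for k in range(steps):
        x = (k + 0.5) * h                       # midpoint of the k-th interval
        weight = x ** (j - 1) * (1.0 - x) ** i  # conditional density, unnormalized
        numerator += x * weight
        denominator += weight
    return numerator / denominator


i, j = 7, 3
print(expected_mutant_frequency(i, j), j / (i + j + 1))
```

Under the stated assumption the two printed numbers agree to several decimal places, which is what the closed form j/(n + 1) predicts for the mean of this density.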
Suppose this latter type of recombination occurred, so that there are six unrecombined haplotypes of the ancestral class, three unrecombined haplotypes of the mutant class, and two recombinants derived from sequence 7. As shown in Figure 4A, haplotypes 8–10 and 7rec2 now belong to the mutant allelic class, whereas 1–6 and 7rec1 belong to the ancestral class. Assuming that no further recombination occurs, haplotypes 8–10 and 7rec2 must now be monophyletic. Depending on the order in which they coalesce, the resulting genealogy is either compatible or incompatible in the sense of Figure 3. An example of an incompatible genealogy is shown in Figure 4B; an example of a compatible one is shown in Figure 4C. The latter kind of genealogy occurs if and only if haplotypes 8–10 are monophyletic with respect to 7rec2. Let α be the probability of the outcome exemplified in Figure 4B, given that recombination occurs as just described when there are i ancestral and j nonancestral haplotypes. It is easy to show that α = 1 – 2/[j(j + 1)] (Saunders et al. 1984). Therefore, when the process is in A(i, j), recombination that leads to an incompatible genealogy at site L occurs approximately at rate iX̄αR/2 among the i haplotypes in the ancestral class. The approximation assumes that recombination occurs only once before the lineages coalesce and is therefore likely to work best for small R. It also relies on using X̄ instead of integrating over the distribution of the random variable X.
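The expression for α can be checked by simulating the order of coalescence among the j mutant haplotypes and the single recombinant lineage. The sketch below is a minimal Monte Carlo check and not the authors' derivation; it assumes that each pair of the j + 1 lineages is equally likely to coalesce next, and all function names are ours.

```python
import random


def alpha_exact(j):
    """Probability of an incompatible genealogy: 1 - 2 / [j (j + 1)]."""
    return 1.0 - 2.0 / (j * (j + 1))


def alpha_simulated(j, reps=100_000, seed=1):
    """Monte Carlo estimate of alpha.

    Lineage 0 is the recombinant; lineages 1..j are the mutant haplotypes.
    The genealogy is compatible only if the recombinant stays out of every
    coalescence until exactly two lineages remain.
    """
    rng = random.Random(seed)
    incompatible = 0
    for _ in range(reps):
        k = j + 1  # current number of lineages
        while k > 2:
            a, b = rng.sample(range(k), 2)  # random pair coalesces
            if a == 0 or b == 0:
                incompatible += 1  # recombinant joined before the mutants were one clade
                break
            k -= 1  # two mutant lineages merged; relabel the rest as 0..k-1
    return incompatible / reps


print(alpha_exact(3), alpha_simulated(3))
```

For j = 3 the exact value is 5/6, and the simulated estimate should fall close to it, with Monte Carlo error that shrinks as reps grows.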
Next, we consider recombination in a mutant haplotype. Each such haplotype undergoes recombination with an ancestral haplotype at rate (1 – X)R/2. When this happens, incompatibility is highly likely since the mutant haplotypes must coalesce first. Therefore, when the process is in A(i, j), recombination that leads to an incompatible genealogy at L occurs approximately at rate j(1 – X̄)R/2 among the j haplotypes in the mutant class. Again, this approximation will work best for small R.
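The two approximate rates just described can be collected in a small helper. This is a sketch that simply evaluates the expressions in the text, with X̄ = j/(n + 1) and α = 1 – 2/[j(j + 1)]; the function name is hypothetical.

```python
def incompatible_recombination_rates(i, j, R):
    """Approximate rates of recombination leading to incompatibility in A(i, j).

    Returns (rate among the i ancestral haplotypes,
             rate among the j mutant haplotypes).
    """
    n = i + j
    x_bar = j / (n + 1)                  # expected mutant frequency given A(i, j)
    alpha = 1.0 - 2.0 / (j * (j + 1))    # P(incompatible | ancestral lineage recombines with mutant)
    rate_ancestral = i * x_bar * alpha * R / 2.0
    rate_mutant = j * (1.0 - x_bar) * R / 2.0
    return rate_ancestral, rate_mutant


rate_a, rate_m = incompatible_recombination_rates(i=7, j=3, R=1.0)
print(rate_a, rate_m)
```

For A(7, 3) and R = 1 the rates evaluate to 105/132 and 12/11, so in this configuration the three mutant haplotypes contribute more to incompatibility than the seven ancestral ones.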
Putting all this together, the probability that there is no recombination that gives rise to an incompatible genealogy at site L before the first coalescence in A(i, j) is given by
Simulations indicate that our heuristic approximation works rather well, at least for reasonably large sample size and moderate R. Figure 5 shows some results for n = 50. As expected, the lower the frequency of the mutation, the larger the region in which recombination cannot be detected is likely to be. Our approximation tends to be smaller than the real value because we ignore multiple recombination events, which may return an incompatible tree to the compatible state.
The theoretical results shown in this section are based on recursion equations. We have also obtained closed forms for the four probabilities, P_{W}(R | i, j), P_{M}(R | i, j), P_{A}(R | i, j), and P_{C}(R | i, j) (available upon request), although they are not shown in this article.
DISCUSSION
We have derived several probabilities related to the preservation of an ancestral haplotype along the chromosome. We consider a sample of n haplotypes and focus on a polymorphic site at which i haplotypes carry the ancestral allele, and j = n – i carry a mutant allele. First, we derive the probability of there having been no recombination in the genealogy of the sample (or of a subsample, such as all the members of a particular allelic class), between this site and an arbitrary linked site. The distribution of the length of the segment (on one side of the focal site) in which no recombination occurred is readily obtained from this probability. By assuming independence of recombination on either side of the focal site, we can also obtain the distribution of the total length. For example, the probability density of the length of this region for the mutant allelic class would be given by the convolution
Second, we derive an approximation for the probability that a tree at a given distance from the focal site is such that recombination between it and the focal site cannot be detected, in the sense that there will always be fewer than four gametes, and that D′ = 1. This probability can of course also be used to find an approximation for the distribution of the length of the region in which recombination cannot be detected.
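The step from a probability to a length distribution can be sketched numerically. If S(l) denotes the probability that the event of interest (no recombination, or no detectable recombination) holds out to distance l on one side of the focal site, the one-sided length has density –dS/dl, and under the independence assumption the total length is the sum of two such lengths. The survival function used below is a hypothetical exponential placeholder, not one of the probabilities derived in this article, and the function names are ours.

```python
import math


def binned_density(survival, h, n_bins):
    """One-sided density on bins of width h: mass in [k h, (k + 1) h) divided by h."""
    return [(survival(k * h) - survival((k + 1) * h)) / h for k in range(n_bins)]


def two_sided_density(f, h):
    """Discrete self-convolution of f.

    Bin a is represented by its midpoint (a + 0.5) h, so two bins a and b sum
    to (a + b + 1) h. g[m] therefore approximates the density of the total
    length at l = m h.
    """
    g = [0.0] * (len(f) + 1)
    for m in range(1, len(f) + 1):
        g[m] = h * sum(f[a] * f[m - 1 - a] for a in range(m))
    return g


# Hypothetical placeholder: S(l) = exp(-l), i.e., unit rate of decay with distance.
h = 0.01
f = binned_density(lambda l: math.exp(-l), h, n_bins=500)
g = two_sided_density(f, h)
print(g[100], 1.0 * math.exp(-1.0))  # numeric vs. analytic density at l = 1
```

For the exponential placeholder the sum of two independent lengths has density l·exp(–l), which gives an analytic check on the discretization; any decreasing function of distance could be substituted for the placeholder.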
Our results are relevant for understanding the extent of linkage disequilibrium and haplotype sharing around a particular polymorphism. This is important in association mapping of human diseases, where we might be interested in how the pattern of haplotype sharing around a disease allele is expected to depend on the frequency of that allele. It should be noted in this context that whereas other treatments of this problem (Kaplan et al. 1995; Thompson and Neel 1997; Slatkin and Bertorelle 2001) are based on assumptions that are valid only for rare alleles, our results are valid for all frequencies. We do, on the other hand, assume a constant population size.
Another application of our results is in evaluating claims of past selection. There is currently great interest in detecting past selection at polymorphic sites by looking at the extent of haplotype sharing surrounding each site and determining whether it is too extensive to be compatible with neutrality (e.g., Andolfatto et al. 1999; Sabeti et al. 2002). Our results are relevant to such questions. For example, Innan et al. (2003) found a local region with a very high level of linkage disequilibrium on human chromosome 21, in the data of Patil et al. (2001). In this ∼50-kb region, there is a high-frequency (15 out of 20) haplotype. Since there is almost no variation within this haplotype, it is likely to be derived, as opposed to ancestral. Missing data complicate the interpretation (Innan et al. 2003), but the region suffices as an example. Figure 6 shows the probability density of the length of the region in which recombination cannot be detected in this case. An unbroken 50-kb haplotype seems unlikely under the standard neutral model. Alternative explanations include selection, demography, or a local decrease in recombination (Innan et al. 2003).
However, our results should not be used to test for selection by rejecting neutrality. The density in Figure 6 is for the length of the region in which recombination could not possibly be detected, irrespective of the number of markers. With finitely many marker loci, it is necessary to take into account the fact that many loci will not reveal recombination even if recombination has occurred in such a manner that marker loci could potentially reveal it. Figure 7 illustrates the distinction by comparing the distribution of the length of haplotypes in which no recombination (1) occurred (from Equation 2); (2) could have been detected, irrespective of the mutation rate (from Equation 12); or (3) was in fact detected, given a particular mutation rate. The events in 3 are a subset of those in 2, which are a subset of those in 1 (Hudson and Kaplan 1985).
The utility of our results lies in the fact that they provide a lower bound for the length of haplotype conservation that might be observed. The difference between this bound and case 3 above is determined by the density of markers, which in turn depends on the mutation rate (Figure 8). Our results allow us to determine whether there is any reason to consider selection as an explanation for a particular data set. They also provide a very simple method for exploring the effect of the allele frequency on the distribution of haplotype lengths. Figure 9 illustrates the crucial role played by the frequency of the mutant allele in determining the length of the surrounding haplotype. Haplotype sharing surrounding a low-frequency mutant allele can clearly be very extensive even in the absence of selection.
Acknowledgments
We thank N. Rosenberg and two anonymous reviewers for comments on the manuscript. We also thank P. Donnelly and C. Wiuf for many discussions and for sharing an unpublished manuscript in which they, inter alia, study P_{M}(R | i, j) by simulation.
APPENDIX
In a constant-size population at equilibrium, the probability density function (pdf) of the frequency of the mutant allelic class is given by
Footnotes

Communicating editor: J. Hein
 Received December 11, 2002.
 Accepted May 22, 2003.
 Copyright © 2003 by the Genetics Society of America