Olfactory genetics seeds

A genetic variant near olfactory receptor genes influences cilantro preference




The leaves of the Coriandrum sativum plant, known as cilantro or coriander, are widely used in many cuisines around the world. However, far from being a benign culinary herb, cilantro can be polarizing—many people love it while others claim that it tastes or smells foul, often like soap or dirt. This soapy or pungent aroma is largely attributed to several aldehydes present in cilantro. Cilantro preference is suspected to have a genetic component, yet to date nothing is known about specific mechanisms.


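Here, we present the results of a genome-wide association study of cilantro soapy-taste detection in 14,604 participants, with replication in a separate group of 11,851 participants who reported whether they liked cilantro. The one SNP that was genome-wide significant and replicated, rs72921001 (p = 6.4 × 10⁻⁹, odds ratio 0.81 per A allele), lies within a cluster of olfactory receptor genes on chromosome 11. Among these olfactory receptor genes is OR6A2, which has a high binding specificity for several of the aldehydes that give cilantro its characteristic odor. We also estimate the heritability of cilantro soapy-taste detection in our cohort, showing that the heritability tagged by common SNPs is low, about 0.087.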


These results confirm that there is a genetic component to cilantro taste perception and suggest that cilantro dislike may stem from genetic variants in olfactory receptors. We propose that one of a cluster of olfactory receptor genes, perhaps OR6A2, may be the olfactory receptor that contributes to the detection of a soapy smell from cilantro in European populations.


The Coriandrum sativum plant has been cultivated since at least the second millennium BCE[1]. Its fruits (commonly called coriander seeds) and leaves (called cilantro or coriander) are important components of many cuisines. In particular, South Asian cuisines use both the leaves and the seeds prominently, and Latin American food often incorporates the leaves.

The desirability of cilantro has been debated for centuries. Pliny claimed that coriander had important medicinal properties: ‘vis magna ad refrigerandos ardores viridi’ (‘while green, it is possessed of very cooling and refreshing properties’)[2]. The Romans used the leaves and seeds in many dishes, including moretum (a herb, cheese, and garlic spread similar to today’s pesto)[3]; the Mandarin word for cilantro, 香菜 (xiāngcài), literally means ‘fragrant greens.’ However, the leaves in particular have long inspired passionate hatred as well; for example, John Gerard called it a ‘very stinking herbe’ with leaves of ‘venemous quality’[4, 5].

It is not known why cilantro is so differentially perceived. The proportion of people who dislike cilantro varies widely by ancestry[6]; however, it is not clear to what extent this may be explained by differences in environmental factors, such as frequency of exposure. In a twin study, the heritability of cilantro dislike has been estimated as 0.38 (confidence interval (CI) 0.22–0.52) for odor and 0.52 (CI 0.38–0.63) for flavor[7].

The smell of cilantro is often described as pungent or soapy. It is suspected, although not proven, that cilantro dislike is largely driven by the odor rather than the taste. The key aroma components in cilantro consist of various aldehydes, in particular (E)-2-alkenals and n-aldehydes[8, 9]. The saturated aldehydes (mostly decanal and dodecanal) in cilantro are described as fruity, green, and pungent; the (E)-2-alkenals (mostly (E)-2-decenal and (E)-2-dodecenal) as soapy, fatty, ‘like cilantro,’ or pungent[8, 9].

Several families of genes are important for taste and smell. The TAS1R and TAS2R families form sweet, umami, and bitter taste receptors[10, 11]. The olfactory receptor family contains about 400 functional genes in the human genome. Each receptor binds to a set of chemicals, enabling one to recognize specific odorants or tastants. Genetic differences in many of these receptors are known to play a role in how we perceive tastes and smells[12–15].

Results and discussion

Here, we report on a genome-wide association study (GWAS) of cilantro soapy-taste detection. Briefly, the GWAS was conducted in 14,604 unrelated participants of primarily European ancestry who responded to an online questionnaire asking whether they thought cilantro tasted like soap (Table 1). Two single-nucleotide polymorphisms (SNPs) were genome-wide significant (p < 5 × 10⁻⁸) in this population. One SNP, in a cluster of olfactory receptors, replicated in a non-overlapping group of 11,851 participants (again, unrelated and of primarily European ancestry) who reported whether they liked or disliked cilantro (see the ‘Methods’ section for full details). Figure 1 shows p values across the whole genome; Figure 2 shows p values near the most significant associations. A quantile-quantile plot (Additional file 1) shows little (λ = 1.007) global inflation of p values. Index SNPs with p values under 10⁻⁶ are shown in Table 2 (along with replication p values); all SNPs with p values under 10⁻⁴ are shown in Additional file 2.
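The inflation factor λ quoted above is conventionally the median observed 1-degree-of-freedom chi-square statistic divided by its expected median under the null (about 0.455). The following is a minimal sketch of that convention, not the authors' code; the function name and the synthetic p values are assumptions made for illustration.

```python
from statistics import NormalDist, median

def genomic_inflation(p_values):
    """Return lambda: median observed 1-df chi-square over its null median."""
    normal = NormalDist()
    # A two-sided p value p corresponds to a 1-df chi-square of z**2,
    # where z is the standard normal quantile at p / 2.
    chi_squares = [normal.inv_cdf(p / 2) ** 2 for p in p_values]
    null_median = normal.inv_cdf(0.25) ** 2  # about 0.455
    return median(chi_squares) / null_median

# Hypothetical, perfectly uniform p values (not study data): lambda is 1.
uniform_p = [(i + 0.5) / 1001 for i in range(1001)]
lam = genomic_inflation(uniform_p)
```

A value near 1, such as the reported 1.007, indicates that the bulk of the test statistics follow the null distribution.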

Figure 1. Manhattan plot of association with cilantro soapy-taste. Negative log₁₀ p values across all SNPs tested. SNPs shown in red are genome-wide significant (p < 5 × 10⁻⁸). Regions are named with the postulated candidate gene.

Figure 2. Associations with cilantro soapy-taste near rs72921001 (A) and rs78503206 (B). Negative log₁₀ p values for association (left axis) with recombination rate (right axis). Colors depict the squared correlation (r²) of each SNP with the most associated SNP ((A) rs72921001 and (B) rs78503206, shown in purple). Gray indicates SNPs for which r² information was missing.

We found one significant association for cilantro soapy-taste that was confirmed in the cilantro preference population. The SNP rs72921001 (p_discovery = 6.4 × 10⁻⁹, odds ratio (OR) = 0.81, p_repl = 0.0057) lies on chromosome 11 within a cluster of eight olfactory receptor genes: OR2AG2, OR2AG1, OR6A2, OR10A5, OR10A2, OR10A4, OR2D2, and OR2D3. The C allele is associated with both detecting a soapy smell and disliking cilantro. Of the olfactory receptors encoded in this region, OR6A2 appears to be the most promising candidate underlying the association with cilantro odor detection. It is one of the most studied olfactory receptors (often as the homologous olfactory receptor I7 in rats)[16–19]. A wide range of odorants have been found to activate this receptor, all of which are aldehydes[17]. Among the saturated aldehydes, octanal binds best to rat I7[18]; however, compounds ranging from heptanal to undecanal also bind to this receptor[17]. Several singly unsaturated n-aldehydes also show high affinity, including (E)-2-decenal[17]. These aldehydes include several of those playing a key role in cilantro aroma, such as decanal and (E)-2-decenal. Thus, this gene is particularly interesting as a candidate for cilantro odor detection. The index SNP is also in high LD (r² > 0.9) with three non-synonymous SNPs in OR10A2, namely rs3930075, rs10839631, and rs7926083 (H43R, H207R, and K258T, respectively). Thus, OR10A2 may also be a reasonable candidate gene in this region.

The second significant association, with rs78503206 (p_discovery = 3.2 × 10⁻⁸, OR = 0.68, p_repl = 0.49), lies in an intron of the gene SNX9 (sorting nexin-9; see Figure 2). SNX9 encodes a multifunctional protein involved in intracellular trafficking and membrane remodeling during endocytosis[20]. It has no known function in taste or smell and did not show association with liking cilantro in the replication population. This SNP is located about 80 kb upstream of SYNJ2, an inositol 5-phosphatase thought to be involved in membrane trafficking and signal transduction pathways. In candidate gene studies, SYNJ2 SNPs were found to be associated with agreeableness and symptoms of depression in the elderly[21] and with cognitive abilities[22]. In mice, a Synj2 mutation causes recessive non-syndromic hearing loss[23]. Given recent evidence that the perception of flavor may be influenced by multiple sensory inputs (cf.[24, 25]), we cannot exclude the SYNJ2-linked SNP as conveying a biologically meaningful association. While this SNP may be a false positive, it could also be the case that this SNP is associated only with detecting a soapy smell in cilantro (and not with liking cilantro). In addition, we were unable to replicate the SNPs that were found to be nominally significant for cilantro dislike in[26] (we saw p values in the GWAS of 0.53, 0.41, and 0.53 for rs11988795, rs1524600, and rs10772397, respectively).

We have used two slightly different phenotypes in our discovery and replication, soapy-taste detection and cilantro preference, which are correlated (r² ≈ 0.33). Detection of a soapy taste is reportedly one of the major reasons people seem to dislike cilantro. Despite having over 10,000 more people reporting cilantro preference, we have used soapy-taste detection as our primary phenotype because it is probably influenced by fewer environmental factors. Indeed, we see a stronger effect of rs72921001 on soapy-taste detection than on cilantro preference (OR of 0.81 versus 0.92). A GWAS on the replication set gave no genome-wide significant associations. SNPs with p values under 10⁻⁶ for this analysis are shown in Additional file 3.

We find significant differences by sex and ancestral population in soapy-taste detection (Tables 1 and 3). Women are more likely to detect a soapy taste (and to dislike cilantro) (OR for soapy-taste detection 1.36, p = 2.5 × 10⁻¹⁰; Table 1). African-Americans, Latinos, East Asians, and South Asians are all significantly less likely to detect a soapy taste compared to Europeans (ORs of 0.676, 0.637, 0.615, and 0.270, respectively).

We calculated the heritability for cilantro soapy-taste detection using the GCTA software[27]. We found a low heritability of 0.087 (p = 0.08, 95% CI −0.037 to 0.211). This estimate is a lower bound for the true heritability, as our estimate only takes into account heritability due to SNPs genotyped in this study. While this calculation does not exclude a heritability of zero, the existence of the association with rs72921001 does give a non-zero lower bound on the heritability. Despite the strength of the association of the SNP near OR6A2, it explains only about 0.5% of the variance in perceiving that cilantro tastes soapy. Our heritability estimate is lower than those given in a recent twin study (0.38 for odor and 0.52 for flavor)[7]. This could be due to the differences in phenotypes measured between the two studies, or it could be that other genetic factors not detected here influence cilantro preference. For example, there could be rare variants not typed in this study (possibly in partial linkage disequilibrium with rs72921001) that have a larger effect on cilantro preference. Such rare variants could cause the true heritability of this phenotype to be larger than we have calculated. For example, the heritability of height is estimated to be about 0.8; however, the heritability tagged by common SNPs is calculated at about 0.45[26]. We note that there can be epigenetic modifiers of taste as well; for example, food preferences can even be transmitted to the fetus in utero through the mother’s diet[24].
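The quoted 95% confidence interval is consistent with a normal-approximation (Wald) interval built from the autosomal estimate of 0.0869 and its standard error of 0.0634 given in the Methods. The sketch below reproduces that arithmetic; the function name is an assumption, and the p value is not recomputed here.

```python
from statistics import NormalDist

def wald_interval(estimate, standard_error, level=0.95):
    """Normal-approximation confidence interval: estimate +/- z * SE."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    return estimate - z * standard_error, estimate + z * standard_error

# Autosomal heritability estimate and standard error from the Methods.
lower, upper = wald_interval(0.0869, 0.0634)
print(round(lower, 3), round(upper, 3))
```

Because the interval is symmetric around the estimate, it extends below zero even though a heritability cannot be negative.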

Survey responses, while very efficient for collecting large amounts of data, can only approximately measure the detection and/or perception of the chemicals in cilantro. This has implications for the interpretation of our results. For example, it is possible that the SNP rs72921001 could have a large effect on detection of a specific chemical in cilantro, but that the resulting effect on liking cilantro is much weaker, being modulated by environmental factors. For example, many people might initially dislike cilantro yet later come to appreciate it. This environmental component could also be the reason that our heritability estimates are low. It would thus be interesting to study the genetics of cilantro taste/odor perception in a group without prior exposure to cilantro to reduce the environmental effect, using more direct measures of cilantro perception (i.e., having the subjects actually taste and smell cilantro).


Through a GWAS, we have shown that a SNP, rs72921001, near a cluster of olfactory receptors is significantly associated with detecting a soapy taste to cilantro. One of the genes near this SNP encodes an olfactory receptor, OR6A2, that detects the aldehydes that may make cilantro smell soapy and thus is a compelling candidate gene for the detection of the cilantro odors that give cilantro its divisive flavor.

Availability of supporting data

We have shared full summary statistics for all SNPs with p values under 10⁻⁴ in Additional file 2. Due to privacy concerns, under our IRB protocol, we are unable to openly share statistics for all SNPs analyzed in the study.

Methods



Participants were drawn from the customer base of 23andMe, Inc., a consumer genetics company. This cohort has been described in detail previously[15, 28]. Participants provided informed consent and participated in the research online, under a protocol approved by an external AAHRPP-accredited IRB, Ethical and Independent Review Services (E&I Review).

Phenotype data collection

On the 23andMe website, participants contribute information through a combination of research surveys (longer, more formal questionnaires) and research ‘snippets’ (multiple-choice questions appearing as part of various 23andMe webpages). In this study, participants were asked two questions about cilantro via research snippets:

‘Does fresh cilantro taste like soap to you?’ (Yes/No/I’m not sure)

‘Do you like the taste of fresh (not dried) cilantro?’ (Yes/No/I’m not sure)

Among all 23andMe customers, 18,495 answered the first question (as either yes or no), 29,704 the second, and 15,751 both. Participants also reported their age. Sex and ancestry were determined on the basis of their genetic data. In both the GWAS set and the replication set, all participants were of European ancestry. In either group, no two shared more than 700 cM of DNA identical by descent (IBD, approximately the lower end of sharing between a pair of first cousins). In total, we were left with a set of 14,604 participants who answered the ‘soapy’ question for GWAS and 11,851 who answered only the taste preference question for a replication set. IBD was calculated using the methods described in[29]; the principal component analysis was performed as in[15]. To determine European and African-American ancestry, we used local-ancestry methods (as in[30]). Europeans had over 97% of their genome painted European, and African-Americans had at least 10% African and at most 10% Asian ancestry. Other groups were built using ancestry-informative markers trained on a subset of 23andMe customers who reported having four grandparents of a given ancestry.


Subjects were genotyped on one or more of three chips, two based on the Illumina HumanHap550+ BeadChip and the third based on the Illumina OmniExpress+ BeadChip (San Diego, CA, USA). The platforms contained 586,916, 584,942, and 1,008,948 SNPs. Totals of 291, 5,394, and 10,184 participants (for the GWAS population) were genotyped on the platforms, respectively. A total of 1,265 individuals were genotyped on multiple chips. For all participants, we imputed genotypes in batches of 8,000–10,000 using Beagle and Minimac[31–33] against the August 2010 release of the 1000 Genomes reference haplotypes[34], as described in[35].

A total of 11,914,767 SNPs were imputed. Of these, 7,356,559 met our thresholds of 0.001 minor allele frequency, average r² across batches of at least 0.5, and minimum r² across batches of at least 0.3. The minimum r² requirement was added to filter out SNPs that imputed less well in the batches consisting of the less dense platform. Positions and alleles are given relative to the positive strand of build 37 of the human genome.
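The three thresholds above can be read as a per-SNP filter. The sketch below is one way to express it, assuming each SNP carries a minor allele frequency and one imputation r² per batch; the record type, field names, and example values are hypothetical, and the treatment of the frequency threshold as inclusive is also an assumption.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class ImputedSnp:
    rsid: str              # hypothetical identifier field
    maf: float             # minor allele frequency
    batch_r2: List[float]  # imputation r^2, one value per imputation batch

def passes_qc(snp, min_maf=0.001, min_avg_r2=0.5, min_batch_r2=0.3):
    """Apply the three thresholds stated in the text to one SNP."""
    average_r2 = sum(snp.batch_r2) / len(snp.batch_r2)
    return (snp.maf >= min_maf
            and average_r2 >= min_avg_r2
            and min(snp.batch_r2) >= min_batch_r2)

# Made-up examples: the second SNP imputes poorly in one batch.
good = ImputedSnp("snp_a", 0.12, [0.95, 0.90, 0.88])
weak_batch = ImputedSnp("snp_b", 0.12, [0.95, 0.90, 0.25])
rare = ImputedSnp("snp_c", 0.0005, [0.95, 0.90, 0.88])
kept = [s.rsid for s in (good, weak_batch, rare) if passes_qc(s)]
```

The second example shows why the minimum is checked separately: its average r² is 0.7, yet one batch falls below 0.3.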

Statistical analysis

For the GWAS, p values were calculated using a likelihood ratio test for the genotype term in the logistic regression model:

Y ~ G + age + sex + pc1 + pc2 + pc3 + pc4 + pc5

where Y is the vector of phenotypes (coded as 1 = thinks cilantro tastes soapy or 0 = does not), G is the vector of genotypes (coded as a dosage 0–2 for the estimated number of minor alleles present), and pc1, …, pc5 are the projections onto the principal components. The same model was used for the replication, with the phenotype coded as 1 = dislikes cilantro or 0 = likes. We used the standard cutoff for genome-wide significance of 5 × 10⁻⁸ to correct for the multiple tests in the GWAS. ORs and p values for the differences in soapy-taste detection between sexes and population were calculated directly, without any covariates. Table 3 uses a proxy SNP for rs72921001, as our imputation was done only in Europeans, so we did not have data for rs72921001 in other populations.
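A likelihood ratio test for the genotype term compares the maximized log-likelihood of the model with the genotype against the model without it; twice the difference is referred to a chi-square distribution with one degree of freedom. The sketch below illustrates this under strong simplifications that are assumptions, not the study's pipeline: it drops the age, sex, and principal-component covariates, fits by plain Newton-Raphson, and uses made-up counts.

```python
import math

def fit_logistic(g, y, use_genotype=True, max_iter=50):
    """Fit logit P(y=1) = b0 + b1*g by Newton-Raphson; return (b0, b1, loglik)."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        s0 = s1 = h00 = h01 = h11 = 0.0  # score vector and information matrix
        for gi, yi in zip(g, y):
            x = gi if use_genotype else 0.0
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            s0 += yi - p
            s1 += (yi - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        if use_genotype:
            det = h00 * h11 - h01 * h01
            d0 = (h11 * s0 - h01 * s1) / det
            d1 = (h00 * s1 - h01 * s0) / det
        else:
            d0, d1 = s0 / h00, 0.0  # intercept-only null model
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < 1e-12:
            break
    loglik = 0.0
    for gi, yi in zip(g, y):
        eta = b0 + (b1 * gi if use_genotype else 0.0)
        loglik += yi * eta - math.log1p(math.exp(eta))
    return b0, b1, loglik

def likelihood_ratio_p(g, y):
    """Return (odds ratio per unit of dosage, LRT p value with 1 df)."""
    _, b1, ll_full = fit_logistic(g, y, use_genotype=True)
    _, _, ll_null = fit_logistic(g, y, use_genotype=False)
    statistic = max(2.0 * (ll_full - ll_null), 0.0)
    # Survival function of a 1-df chi-square: erfc(sqrt(x / 2)).
    return math.exp(b1), math.erfc(math.sqrt(statistic / 2.0))

# Made-up data: 30/100 cases at dosage 0, 20/100 cases at dosage 1.
dosage = [0] * 100 + [1] * 100
soapy = [1] * 30 + [0] * 70 + [1] * 20 + [0] * 80
odds_ratio, p_value = likelihood_ratio_p(dosage, soapy)
genome_wide_significant = p_value < 5e-8
```

With these toy counts the fitted odds ratio equals the cross-product ratio of the 2 × 2 table, and the p value is nowhere near the genome-wide threshold.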

For the heritability calculations, we used the GCTA software[27]. The calculations were done on genotyped SNPs only within a group of 13,628 unrelated Europeans. Unrelated filtering here was done using GCTA to remove individuals with estimated relatedness larger than 0.025. Thus, this group is slightly different from the GWAS set, as the GWAS set’s relatedness filtering was done using IBD. We assumed a prevalence for soapy-taste detection of 0.13 for the transformation of heritability from the 0–1 scale to the liability scale. Otherwise, default options were used. We calculated heritability for autosomal and X chromosome SNPs separately; the estimates were 0.0869 (standard error 0.0634, p value 0.0805) for autosomal SNPs and 2 × 10⁻⁶ (standard error 0.010753, p value 0.5) for the X chromosome.
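The transformation from the 0–1 (observed) scale to the liability scale is commonly written as a multiplicative factor K(1 − K)/z², where K is the prevalence and z is the height of the standard normal density at the liability threshold, with an extra term when the sample case fraction differs from K. The sketch below implements that commonly used formula as an illustration; it is an assumption that this matches what the software applied here, and the function name is hypothetical.

```python
from statistics import NormalDist

def liability_scale_factor(prevalence, sample_prevalence=None):
    """Multiplier converting observed-scale h^2 to liability-scale h^2."""
    k = prevalence
    # If no sample case fraction is given, assume it equals the prevalence.
    p = k if sample_prevalence is None else sample_prevalence
    normal = NormalDist()
    threshold = normal.inv_cdf(1.0 - k)  # liability threshold for prevalence k
    z = normal.pdf(threshold)            # normal density at that threshold
    return (k * (1.0 - k) / z ** 2) * (k * (1.0 - k) / (p * (1.0 - p)))

# Factor for the prevalence of 0.13 assumed in the text.
factor = liability_scale_factor(0.13)
```

For a prevalence of 0.13 the factor is roughly 2.5, so a given observed-scale estimate corresponds to a liability-scale value about two and a half times larger.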


AAHRPP: Association for the Accreditation of Human Research Protection Programs


Complex Evolution of 7E Olfactory Receptor Genes in Segmental Duplications


Large segmental duplications (SDs) constitute at least 3.6% of the human genome and have increased its size, complexity, and diversity. SDs can mediate ectopic sequence exchange resulting in gross chromosomal rearrangements that could contribute to speciation and disease. We have identified and evaluated a subset of human SDs that harbor an 88-member subfamily of olfactory receptor (OR)-like genes called the 7Es. At least 92% of these genes appear to be pseudogenes when compared to other OR genes. The 7E-containing SDs (7E SDs) have duplicated to at least 35 regions of the genome via intra- and interchromosomal duplication events. In contrast to many human SDs, the 7E SDs are not biased towards pericentromeric or subtelomeric regions. We find evidence for gene conversion among 7E genes and larger sequence exchange between 7E SDs, supporting the hypothesis that long, highly similar stretches of DNA facilitate ectopic interactions. The complex structure and history of the 7E SDs necessitate extension of the current model of large-scale DNA duplication. Despite their appearance as pseudogenes, some 7E genes exhibit a signature of purifying selection, and at least one 7E gene is expressed.

[Supplemental material is available online.]

Large segmental duplications (SDs) are defined as duplicated blocks of genomic DNA that contain both interspersed high-copy repeat elements, such as Alus, and the intervening coding and intergenic sequences (IHG Sequencing Consortium 2001). A recent comprehensive survey by Bailey et al. found that SDs of ≥90% identity and ≥1 kb comprise at least 3.6% of the human genome (Bailey et al. 2001). They found SDs as large as 300 kb. Approximately 86% of these duplications appeared to involve the transfer of material within, rather than between, chromosomes (Bailey et al. 2001, 2002).

The mechanism by which SDs are generated has not been determined. The process is thought to involve replicative transposition or nonreciprocal recombination (Lundin 1993; Venter et al. 2001; Samonte and Eichler 2002). A possible clue to the duplicative mechanism is the observation that SDs are found more often in pericentromeric and subtelomeric regions than expected by chance (Bailey et al. 2001). One explanation for this finding is that these regions are less gene-dense than typical euchromatic regions, and insertion of a large segment of DNA is less likely to cause disruption of critical loci. However, SDs are found in euchromatic sequence as well as near genes in pericentromeric and subtelomeric regions, suggesting that multiple types of insertion sites for duplication events are tolerated in the genome (Hattori et al. 2000; Bailey et al. 2001).

Mounting evidence indicates that SDs mediate ectopic (i.e., homologous, but nonallelic) interaction of loci that can result in chromosomal rearrangements such as duplications, deletions, and inversions (Mazzarella and Schlessinger 1998). Some recurring SD-mediated rearrangements cause human disease, such as Velocardiofacial, Smith-Magenis, Prader-Willi, and Angelman syndromes (Ji et al. 2000; Emanuel and Shaikh 2001; Stankiewicz and Lupski 2002). The frequency of detrimental genomic rearrangements mediated by SDs is high, estimated at 0.7 per 1000 births, making the propensity of SDs to interact an important factor in human disease (Mazzarella and Schlessinger 1998).

Duplication of genomic segments containing genes can also be beneficial. This process can generate or expand the membership and diversity of gene families. After the duplication of a gene, selective pressure on one of the two copies is relieved only after it accumulates mutation that renders it nonfunctional. Once relieved of selective pressure, a gene may acquire further mutation, which, in some cases, gives it function distinct from the other copy (Hughes 2002; Kondrashov et al. 2002; Prince and Pickett 2002; Zhang et al. 2002). This model may explain the expansion of large gene families such as the olfactory receptors (ORs). ORs comprise the largest gene family in the human genome, with ∼900 members, and encode the proteins responsible for odorant binding and discrimination (Buck and Axel 1991; Glusman et al. 2001; Zozulya et al. 2001). New ORs generated by duplication and subsequent sequence divergence could increase the repertoire of perceived odorants and/or acquire new functions beyond olfaction.

Most OR genes have arisen by local duplication, but some, especially in humans, have duplicated interchromosomally (Trask et al. 1998; Brand-Arpon et al. 1999; Glusman et al. 2000b; Young et al. 2002). A subfamily of OR genes, called the 7Es (Glusman et al. 2000a), have expanded extensively in the human genome as part of large segmental duplications (Trask et al. 1998), such that 7Es account for ∼10% of all the human OR gene sequences (Glusman et al. 2001). The 7E SDs also account for ∼50% of the locations where ORs are found, demonstrating the significant contribution that 7E SDs have made to the genomic landscape of the human OR gene family (Trask et al. 1998; Glusman et al. 2001; Young et al. 2002). The 7E genes have been reported to be predominantly pseudogenes (Glusman et al. 2001) and therefore are unlikely to confer a selectively beneficial function. Moreover, there is evidence that 7E SDs can be disadvantageous, as they can mediate harmful genomic rearrangements (Giglio et al. 2001, 2002). So far, 7E SDs have been found at the breakpoints of multiple large intrachromosomal rearrangements of 8p causing mental handicap and a common translocation between 4p and 8p that leads to either Wolf-Hirschhorn syndrome or a variety of dysmorphic phenotypes (Giglio et al. 2001, 2002).

Using publicly available sequence databases and custom computational tools, we have identified human segmental duplications that contain 7E genes and evaluated their structure and genomic location. Our analyses provide insight into the dispersal of 7E genes in the genome via the 7E SDs, their subsequent ectopic interaction, and their potential for function.


We identified 7E gene sequences in the UCSC August 2001 human genome assembly by using the 112 unique 7E genes described in the Human Olfactory Receptor Data Exploratorium (HORDE) database (Glusman et al. 2001) as BLAT queries. This process yielded >350 ORs in the genome that matched a query gene with 60%–100% nucleotide identity. […] Our set of 88 genes includes three 7E genes found in two finished BACs (AL360083 and AC073648) that are not included in the August 2001 assembly (details in the Fig. 1 legend) but are mapped to a chromosomal location in later assemblies. Of the 88 genes in our set, 60 are wholly contained within finished sequence and therefore are expected to contain less than one error in 10⁴ nucleotides.

Figure 1. A parsimony tree of the 88 7E nucleotide sequences. The two major clades of the tree are labeled (A) and (B). Bootstrap values (% of 1000 iterations) are indicated when >75% on the major branches and marked with black dots when ≥85% on the minor branches. The 7E genes are labeled by their position in the UCSC August 2001 assembly of human draft sequence (see Methods); Supplementary Table A gives the corresponding names assigned to the genes by Glusman et al. (2000a) and/or in HORDE. We also included in our set of 88 the sequences for three 7E genes that are found in two finished BACs (AL360083 and AC073648) included in later assemblies (the names of these genes carry the prefix Dec). Genes in bold type are those found in the ancestral locus on chromosome 19. Genes with names in red contain a common substitution resulting in a stop codon in TM6, and those in green contain a common frame shift leading to a stop codon in TM3. The gene marked by a red dot encodes an ORF containing seven TM regions and also encodes a methionine at the beginning of the first predicted extracellular region. Genes marked by a gray dot have ORFs that are predicted to encode six TM regions. Genes marked + have a Ks/Ka value ≥5 on average when compared to 75% of the other 7E genes. “EST” designates genes that match (≥98%) human ESTs. “EST*” designates a gene that matches spliced ESTs. Gene names followed by a dash are not part of 7E SDs.

Forty HORDE 7E genes are not mapped in the August 2001 assembly. Thirty-two of these 40 genes are GenBank entries of single sequencing reads from PCR products. These genes differ by 2%–4% from their best match in our set of 88 genes, possibly due to some combination of sequencing errors, artifacts, and allelic variation. Some of these genes might represent paralogues not yet in the current draft assembly. We excluded them from further analysis, because they lack flanking genomic sequence and map information. An additional eight 7E sequences in the HORDE database were identified in five unfinished BACs that are not included in the UCSC August 2001 assembly or in the most recent June 2002 assembly. We did not include these eight genes in our analysis, but when the sequence and assembly of these BACs become reliable, there may be an opportunity to analyze at most two additional 7E clusters and three additional orphan 7E genes in the genome.

The 88 7E genes are 87%–99% identical to one another. Their relationships are depicted in Figure 1 by a parsimony tree based on the alignment of their nucleotide sequences. The tree has two major clades, A and B (Fig. 1), that are ∼7% divergent at the nucleotide level on average. The A clade is slightly larger than the B clade, and contains 55% of the 7E genes. For approximately half of the 7E genes on branches supported by bootstrap values >85% (numbers and black dots in Fig. 1), the closest phylogenetic neighbor is located on a different chromosome, indicating that 7E genes are as likely to duplicate interchromosomally as intrachromosomally and/or undergo gene conversion with distant neighbors.

Protein-Coding Potential of 7E Genes

Only seven of the 88 7E sequences have predicted ORFs exceeding 300 amino acids (Fig. 1, gray and red dots). The single-exon ORF of one of these seven genes is predicted to encode seven transmembrane (TM) domains, as is typical for most intact OR genes (Fig. 1, red dot) (Sosinsky et al. 2000). An N-terminal glycosylation site, another common sequence element of ORs (Gat et al. 1994), is located in the first TM domain, 22 amino acids […]. The other six genes (Fig. 1, gray dots) encode their first methionine at a position usually found within the first TM region of OR genes, a highly atypical location. These six genes also encode a putative N-terminal glycosylation site seven amino acids downstream of this methionine, but hydrophobicity plots of these six sequences predict six TM regions and place the N-terminus in the first intracellular region (data not shown). Alternatively, mRNA of these genes might include a 5′ coding exon(s), and an earlier starting methionine could be included through splicing, as has been observed in other OR genes (Walensky et al. 1998; Linardopoulou et al. 2001).

Of the remaining 81 genes with shorter predicted 7E ORFs, 24 contain the same nucleotide substitution, which results in a stop codon in TM6 (Fig. 1, red names). Except for this stop codon, 15 of the 24 genes would encode an ORF of 293–304 amino acids. Another 19 genes share a common frameshift that leads to a stop codon in TM3 (Fig. 1, green names). Seventeen of these genes have at least one other deleterious mutation further downstream. All the TM3-truncated genes except one are in clade B. The proteins encoded by the remaining 38 genes would be prematurely truncated because of a variety of mutations causing early stop codons.

We also determined the longest ORF for each of 30 nonhuman primate 7E gene sequences reported by Rouquier et al. (2000) and compared these to the 88 human 7E genes (not shown). The nonhuman sequences were obtained from seven hominid and New and Old World monkey species. The TM6 stop mutation is present in some, but not all, of the 7E genes reported for chimpanzee (1 of 7), orangutan (1 of 8), and gibbon (3 of 6). Thus, the TM6 mutation predates the last common ancestor of humans and gibbons. The TM3 frameshift (but not the TM3 stop mutation) was seen in three of the seven chimpanzee 7E sequences collected by Rouquier et al., but was not found in any of their other nonhuman primate 7E sequences. Because the sequences of nonhuman primates are far from complete, we cannot rule out the presence of the TM3 and TM6 stop mutations in more distantly related species.
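Determining the longest ORF of a single-exon gene amounts to scanning each reading frame for the longest run of codons from a start codon to the next in-frame stop. The sketch below shows one simple way to do this; it is not the authors' tool, and it assumes forward-strand frames only, ATG as the only start codon, and that an ORF must end in a stop codon to count. The example sequences are made up.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def longest_orf_length(sequence):
    """Length in amino acids of the longest ATG-to-stop ORF on the given strand."""
    sequence = sequence.upper()
    best = 0
    for frame in range(3):
        start = None  # index of the first ATG seen in the current open stretch
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, (i - start) // 3)  # stop codon not counted
                start = None
    return best

# Made-up sequence with two ORFs in the same frame (2 and 5 amino acids).
toy = "CCATGAAATAGATGCCCGGGAAATTTTGA"
longest = longest_orf_length(toy)
```

Applied to real OR sequences, a threshold such as the 300 amino acids mentioned earlier could then be used to flag candidates with near full-length ORFs.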

The Ancestral 7E Locus Is in 19p13.2

Through comparative analysis of the mouse genome, we have determined that the entire set of 7E genes in humans descended from a single locus on chromosome 19p13.2, in general agreement with a previous analysis (Glusman et al. 2001). Pair-wise comparisons of each human 7E gene to all known mouse OR genes (Young et al. 2002) reveal that every human 7E gene is more similar at the nucleotide level to two mouse OR genes (AY073534 and AY073536) than to any of the other ∼1500 mouse OR genes. These two genes are the only 7E-like genes in the public and Celera mouse genome sequence available as of August 2001. They map to mouse chromosome 9 in a location that is syntenic to the 7E locus on human chromosome 19. Both the mouse and the human chromosome 19 7E clusters are neighbored by orthologous non-7E OR genes on both sides (Fig. 2). The 7E genes on human chromosome 19 are found in both the A and B clades (Fig. 1, names in bold), and several are among the human genes phylogenetically closest to the mouse 7E-like genes. One of the five chromosome 19 7E genes encodes a full-length ORF (Fig. 1, 19.12479301, red dot). Both of the mouse 7E orthologs are expressed in mouse olfactory epithelium (J. Young, unpubl.) and are predicted to encode full-length ORFs of ≥310 amino acids.

The complex structure of 7E SDs. Each 7E SD is shown with its constituent repeat-element structure (colored bars) around 7E gene(s) (black bars). The arrows, which are not drawn to scale, represent the 5′ to 3′ transcriptional direction of each 7E gene; black and red arrows designate members of clade A and clade B, respectively. The labels to the left of each 7E SD give the chromosome and position in the UCSC August 2001 assembly at which the SD begins. The top line shows the putative ancestral locus on human chromosome 19 and includes non-SD material. The brackets a–d, a′, and b′ denote features discussed in the text. The four SDs that include unfinished sequence are marked with asterisks.

We next collected genomic sequences to determine the extent of similarity between 7E-containing regions of the genome. We used an 11-kb sequence centered around a 7E gene on chromosome 3 (3.142575647 [OR7E130P], Fig. 1) to probe the August 2001 UCSC human genome assembly and the BAC sequences AL360083 and AC073648 mentioned above. We downloaded ∼2 Mbp of sequence around each of 44 matches to this probe (≥1000 bp and ≥70% identical). Many of these 2-Mbp regions contained more than one 7E gene. Cursory examination also indicated that the 44 regions contain varying amounts of common sequence elements, which are not always in the same relative orientation. Additionally, these regions have independently acquired many new interspersed repeat elements, such as Alus, since the original duplication events that formed them, making multiple alignments of even small spans of sequence difficult. Therefore, we adopted the technique of “fuguization,” that is, excising the repeat elements (Bailey et al. 2001), to decrease the computational time needed to compare the regions and identify common sequences. We applied a custom algorithm to process the output of cross_match analyses of the fuguized regions to identify paralogous segments among the 7E SDs. Any fuguized segment of ≥1 kb shared by two or more of the 44 regions was considered part of a 7E SD. The boundaries of the SDs were defined as the positions where a continuous match with high (≥70%) similarity to the sequence of any other 7E SD decreased markedly (typically from ≥70% to unalignable). Sequence segments at the same locus sharing sufficient similarity and length to other loci, but interrupted by >5 kb of low similarity, were divided into separate SDs. The 2-Mb window around each of the 44 matches to our probe was sufficient to contain each 7E SD.
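The segmentation rules in this paragraph can be sketched in Python. This is a minimal illustration under stated assumptions, not the custom algorithm used in the study: it assumes the cross_match output for one region has already been reduced to (start, end, identity) tuples, and the names `split_into_sds`, `MIN_SEGMENT`, `MIN_IDENTITY`, and `MAX_GAP` are hypothetical.

```python
MIN_SEGMENT = 1_000    # bp; shortest shared segment counted as SD material
MIN_IDENTITY = 0.70    # minimum similarity to another 7E region
MAX_GAP = 5_000        # bp; longer low-similarity gaps separate two SDs

def split_into_sds(hits, min_segment=MIN_SEGMENT,
                   min_identity=MIN_IDENTITY, max_gap=MAX_GAP):
    """Group matches on one region into SD intervals.

    hits: iterable of (start, end, identity) with end > start.
    Returns a list of (start, end) tuples, one per inferred SD.
    """
    # Keep only matches that are long enough and similar enough.
    kept = sorted((s, e) for s, e, ident in hits
                  if e - s >= min_segment and ident >= min_identity)
    sds = []
    for start, end in kept:
        if sds and start - sds[-1][1] <= max_gap:
            # Gap to the previous block is small: extend the current SD.
            sds[-1] = (sds[-1][0], max(sds[-1][1], end))
        else:
            # First block, or a gap of more than max_gap: start a new SD.
            sds.append((start, end))
    return sds
```

With the made-up input `[(0, 2000, 0.9), (3000, 9000, 0.85), (20000, 22000, 0.75)]`, the first two matches merge into one SD and the third, lying more than 5 kb away, becomes a second SD. Whether the 5-kb gap should be measured in fuguized or genomic coordinates is not stated in the text; the sketch simply uses whatever coordinates it is given.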

We defined 35 7E-containing SDs within the 44 regions (Fig. 2). These 7E SDs range in size from 10 to 800 kb, with an average length of 113 kb. The remaining nine regions contained no 7E gene and shared only 3–5 kb of repeats with our probe (data not shown). Of the sequence in the 35 SDs, 94% was finished at the time of our analysis, and draft sequence is confined to only 4 SDs (Fig. 2). The 7E SDs contain 70 (79%) of the 88 7E genes (Fig. 1). Each 7E SD contains at least one, and as many as six (e.g., 3_142.3, Fig. 2), 7E genes. The genes are unevenly distributed within SDs, but the average amount of SD material per gene is ∼66 kb. Sixteen of the 17 SDs that contain multiple 7E genes contain representatives of both clades A and B (Figs. 1 and 2). Only one SD other than the ancestral locus on chromosome 19 contains ORs from a different family (11_61.5 contains a member of the 5F family, Fig. 2).

Portions of 23 non-OR genes are annotated in the UCSC Genome Browser as lying within the boundaries of the regions defined as SDs. Although none of these genes is annotated at more than one locus, we found paralogues for each in one to five other 7E SDs, depending on the gene (not shown). The annotated genes are distributed among the SDs on chromosomes 2, 3, 7, 10, 11, and 13. The described functions of the non-OR genes vary widely and include a hypothetical zinc finger gene, an oxytocin receptor, a HERV-H protein, and an A-kinase anchoring protein (AKAP)-binding sperm ropporin (Kimura et al. 1992; Lindeskog and Blomberg 1997; Carr et al. 2001). PC3–96, an autophagy-like protein, is the only non-OR gene that lies within 50 kb of a 7E gene (∼9 kb 5′ of the 7E gene at 3.125919193).

Locations of the 7E SDs Correlate Well With Locations of FISH Signals

The 35 7E SDs are distributed across 12 human chromosomes (Fig. 3), indicating that the 7Es have been part of at least 11 interchromosomal duplicative transfers. The 7E genes not found in SDs are distributed on the same 12 chromosomes (Fig. 3). The 7E SDs are not biased for pericentromeric or subtelomeric regions, and no 7E SDs lie within 500 kb of subtelomeric or pericentromeric sequence motifs (data not shown).

The 7E SDs (tall gray vertical bars) and 7E genes (short black vertical bars) in the human draft sequence assembly of August 2001, and estimates of positions of cytogenetic bands showing cross-hybridization to 7E SD-containing clones by FISH (black horizontal bars) (Trask et al. 1998). Eighty-five 7E genes and 35 7E SDs are depicted, but because of their proximity to each other, not all sites are independently resolved. SDs are drawn to scale, but genes are not. The three 7E genes from BACs not included in the August 2001 assembly are not depicted in this figure. The number of 7E genes in each SD is indicated. The correlation between the UCSC map and the FISH results is imperfect, because FISH signals are mapped only with the precision of a chromosome band (∼6 Mb), and band boundaries are only approximately defined in the draft sequence (BAC Consortium 2001; Kent et al. 2002).

We compared the placement of 7E SDs to results published by Trask et al. (1998), who used fluorescence in situ hybridization (FISH) with probes containing portions of 7E SDs to survey the human genome for homology. FISH signals of varying intensities were found at 20 cytogenetically resolved locations on 13 chromosomes. Using UCSC’s correlation of cytogenetic bands to genome sequence (BAC Consortium 2001; Kent et al. 2002), we compared the coordinates of the bands showing FISH signals to the locations of the 7E SDs in the August 2001 assembly. Almost all 7E SDs have corresponding FISH signals. Overall, 15 of the 20 FISH-positive locations overlap the sequence locations of 7E SDs, and all but one signal (on chromosome 16) are within 10 Mb of a 7E SD (within the precision with which the two maps are correlated).
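The comparison described here amounts to an interval-overlap test between band coordinates and SD coordinates. The Python sketch below shows one way to tally overlapping, nearby, and unmatched FISH signals. It is illustrative only: the function names and the dictionary layout are assumptions, and the coordinates in the usage note are invented, not taken from the study.

```python
def gap(a, b):
    """Distance in bp between two (start, end) intervals; 0 if they overlap or touch."""
    return max(0, max(a[0], b[0]) - min(a[1], b[1]))

def classify_fish_signals(bands, sds, tolerance=10_000_000):
    """Count FISH-positive bands that overlap, lie near, or lie far from any SD.

    bands, sds: dicts mapping a chromosome name to a list of (start, end).
    tolerance: maximum distance in bp still counted as "near" (10 Mb here).
    """
    counts = {"overlap": 0, "near": 0, "unmatched": 0}
    for chrom, band_list in bands.items():
        for band in band_list:
            # Only SDs on the same chromosome can match this band.
            gaps = [gap(band, sd) for sd in sds.get(chrom, [])]
            if gaps and min(gaps) == 0:
                counts["overlap"] += 1
            elif gaps and min(gaps) <= tolerance:
                counts["near"] += 1
            else:
                counts["unmatched"] += 1
    return counts
```

For example, with invented bands on "chr3" at 0–6 Mb and 50–56 Mb, a band on "chr16", and invented SDs on "chr3" at 5.0–5.1 Mb and 60.0–60.1 Mb, the first band overlaps an SD, the second lies 4 Mb from one, and the chromosome 16 band has no SD on its chromosome.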

The Structures of the 7E SDs Are Complex

Each 7E SD, except for a highly similar pair of SDs on chromosome 3, is a different complex mosaic of repetitive elements, 7E gene(s), and nonrepetitive sequence. On average, ∼50% of 7E SD sequence is occupied by interspersed repeat elements. Alu, satellite, LTR/ERV1, and L1 elements make up ∼70% of the repeat sequence, with component percentages of 18, 16, 14, and 21%, respectively. The densities of the first three repeat classes are notably higher than the human genome average (International Human Genome Sequencing Consortium 2001).

Two relatively common sequence patterns can be observed among 7E SDs. First, 30 7E genes are flanked on their 3′ side […] (Fig. 2). Second, 31 genes are flanked on their 5′ side […] (Fig. 2). Other repeat patterns are conserved in a smaller number of SDs, such as the interspersed Alu and MIR repeats seen on chromosomes 3_142.3 and 8_15.6 (Fig. 2, bracket c).

There is great variability in the elements contained in each 7E SD (Fig. 2). In some cases, very large segments have duplicated to generate 7E SDs. For example, ∼100 kb of the SDs 3_10.3 and 3_17.0 are nearly identical. Other 7E SDs show only partial or disrupted blocks of similarity to other SDs. The structural diversity among 7E SDs suggests that no specific sequence elements are necessary or important for duplication. Indeed, we found no common sequence or repeat elements just inside or just outside the break points of the SDs or in the regions directly flanking the common Alu/L1 and ERV/SAT patterns around many 7E genes.

The ancestral locus on human chromosome 19 contains rudiments of the common repeat patterns of ERV/SATR1/SATR2 and Alu/L1 elements seen in other 7E SDs (Fig. 2, brackets a′ and b′), but these are arranged differently than they are in all other 7E SDs. The mouse ancestral locus has no interspersed repeat patterns in common with any of the 7E SDs in humans.