associate professor of biology
Fields of Interest
DNA sequence variation, genome data mining and informatics, population genetics, medical genetics.
The genetic blueprint of our species, the sequence of human DNA, is nearly identical from person to person. Genetic variations, the features of our genome that do differ among individuals, result from mutation events and are passed down from generation to generation. Genetic variations are important because they carry the patterns imprinted by the inheritance of these mutations and hence permit the reconstruction of our origins and demographic past. Just as importantly, a subset of genetic variations alters gene expression or function and results in phenotypic variation, such as differences in height. In some cases, a variant form (or allele) causes disease. My laboratory is interested in several aspects of sequence variation research.
Discovery of single-nucleotide polymorphisms (SNPs) in DNA sequence data
Efficient polymorphism detection requires that sequences representing the same locus from multiple individuals be correctly clustered and accurately aligned base by base. Apparent sequence differences are then examined to determine whether they represent true polymorphisms rather than sequencing errors. We have developed a set of algorithms and a corresponding software package, PolyBayes (http://genome.wustl.edu/gsc/polybayes), that implements these steps. PolyBayes is one of the primary methods used in genome-scale as well as locus-scale SNP discovery. Current work focuses on polymorphism mining in model organisms and on building pipelines for public use.
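To make the "true polymorphism versus sequencing error" step concrete, here is a minimal Python sketch of a Bayesian test on a single alignment column. It is a toy model, not the PolyBayes implementation: the function names, the uniform spreading of errors over the three other bases, the equal mixing of two alleles, and the prior of 0.001 are all assumptions made for illustration.

```python
from itertools import combinations

BASES = "ACGT"

def phred_to_error(q):
    """Convert a Phred quality value to a base-call error probability."""
    return 10 ** (-q / 10)

def base_likelihood(observed, true_base, error):
    """P(observed call | true base); errors are spread evenly over the 3 other bases (assumption)."""
    return 1 - error if observed == true_base else error / 3

def snp_posterior(column, prior_snp=0.001):
    """Posterior probability that an alignment column is polymorphic.

    column: list of (base, phred_quality) pairs, one per aligned sequence.
    prior_snp: assumed prior probability that a site is polymorphic (hypothetical value).
    """
    # Monomorphic hypothesis: all sequences share one true base (uniform over A, C, G, T).
    mono = 0.0
    for b in BASES:
        like = 1.0
        for obs, q in column:
            like *= base_likelihood(obs, b, phred_to_error(q))
        mono += like / len(BASES)

    # Polymorphic hypothesis: two alleles, each sequence carries either one
    # with probability 1/2 (a deliberate simplification).
    pairs = list(combinations(BASES, 2))
    poly = 0.0
    for a, b in pairs:
        like = 1.0
        for obs, q in column:
            e = phred_to_error(q)
            like *= 0.5 * base_likelihood(obs, a, e) + 0.5 * base_likelihood(obs, b, e)
        poly += like / len(pairs)

    numerator = prior_snp * poly
    return numerator / (numerator + (1 - prior_snp) * mono)

# Four high-quality A calls and four high-quality G calls: strong evidence for a SNP.
clean_split = [("A", 40)] * 4 + [("G", 40)] * 4
# Seven high-quality A calls and one low-quality G call: more likely a sequencing error.
one_bad_call = [("A", 40)] * 7 + [("G", 10)]
```

Under these assumptions, `snp_posterior(clean_split)` comes out close to 1, while `snp_posterior(one_bad_call)` stays far below the prior-adjusted threshold one would use to call a SNP, because a single low-quality mismatch is better explained by base-calling error.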
Population genetic theory development
Random genetic drift, the mutation process, recombination, long-term demography, and selection act collectively to shape the landscape of human polymorphism structure. Higher mutation rates lead to more polymorphisms; random drift drives most novel mutations to extinction while preserving some; recombination shuffles mutations that originally arose on the same chromosome and breaks down allelic association; population bottlenecks reduce genetic diversity; selection promotes the spread of an advantageous allele, allowing non-functional alleles in close proximity to “hitchhike” with it. Based on the powerful methodology termed the “coalescent”, we have developed mathematical models and simulation procedures to describe the shape of two characteristic SNP distributions, marker density and the allele frequency spectrum, under complex scenarios of demographic history and realistic recombination rates. Current research is aimed at refining these models and at describing other characteristic distributions, such as the distribution of inter-marker spacing.
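As a point of reference for the two distributions named above, the following Python sketch gives their textbook expectations under the simplest coalescent scenario: constant population size, neutrality, and the infinite-sites mutation model. This baseline is an assumption chosen for illustration; it is not the complex-demography model developed in the lab. The function names and the parameter `theta` (the population-scaled mutation rate, 4Nμ, for the region considered) are likewise hypothetical.

```python
def expected_sfs(n, theta):
    """Expected allele frequency spectrum for a sample of n chromosomes.

    Entry i-1 is the expected number of sites at which the derived allele
    appears exactly i times (i = 1 .. n-1): E[xi_i] = theta / i.
    """
    return [theta / i for i in range(1, n)]

def expected_segregating_sites(n, theta):
    """Expected number of polymorphic sites (marker density times region length):
    E[S] = theta * sum_{i=1}^{n-1} 1/i.
    """
    return theta * sum(1 / i for i in range(1, n))

# Example: 10 sampled chromosomes, theta = 5 for the region.
spectrum = expected_sfs(10, 5.0)
density = expected_segregating_sites(10, 5.0)
```

In this baseline, rare alleles dominate (the expected count falls off as 1/i), and the expected number of markers grows only logarithmically with sample size. Departures from this shape are what more complex demographic models aim to explain.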
Reconstruction of human demographic history
Changes in long-term population size, such as expansion, collapse, or bottleneck, leave an imprint on genome-wide SNP distributions. For example, an expansion gives rise to many rare alleles, whereas a collapse preferentially weeds out rare alleles, leading to an over-representation of high-frequency (common) alleles. Human polymorphism and genotype data now available on the genome scale are sufficient to infer these patterns for large world populations. Our own results show that long-term demographic history differed among some of these populations: European and Asian groups have undergone a population bottleneck, an event that was not observed in African samples. Current research aims at a better understanding of these population-specific differences and at describing the spatial aspects of human variation structure as observed along the human chromosomes.
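The skew toward rare or common alleles described above can be summarized with a single number. The Python sketch below computes Tajima's D from an allele frequency spectrum. This is a standard textbook statistic used here for illustration and is not necessarily the inference method used in the lab; the function name and input layout are assumptions. Negative values indicate an excess of rare alleles (as after an expansion), and positive values indicate an excess of intermediate-frequency alleles (as after a collapse or bottleneck).

```python
from math import sqrt

def tajimas_d(sfs):
    """Tajima's D from an unfolded allele frequency spectrum.

    sfs[i-1] = number of sites where the derived allele is seen i times,
    for i = 1 .. n-1, where n is the number of sampled chromosomes.
    Requires n >= 4; for smaller samples the variance term is zero.
    """
    n = len(sfs) + 1
    if n < 4:
        raise ValueError("need at least 4 sampled chromosomes")
    s = sum(sfs)  # number of segregating sites
    if s == 0:
        return 0.0

    # Mean number of pairwise differences, computed from the spectrum.
    pi = sum(c * 2 * i * (n - i) for i, c in enumerate(sfs, start=1)) / (n * (n - 1))

    # Constants from Tajima (1989).
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i ** 2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1 ** 2
    e1 = c1 / a1
    e2 = c2 / (a1 ** 2 + a2)

    return (pi - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))

# Ten sites, all singletons, in a sample of 10 chromosomes: excess of rare alleles.
rare_excess = tajimas_d([10] + [0] * 8)
# Ten sites, all at frequency 5/10: excess of common alleles.
common_excess = tajimas_d([0] * 4 + [10] + [0] * 4)
```

Here `rare_excess` is negative and `common_excess` is positive, matching the qualitative expectations for expansion and collapse respectively.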
Human haplotype structure and the HapMap
A haplotype is a combination of alleles at adjacent marker locations, co-inherited from generation to generation. Co-inheritance can be disrupted by recombination events that occur between markers. Recent results indicate that human haplotype structure is characterized by long regions (tens of kilobases) where allelic association, and hence haplotypes, are preserved. These regions, termed “haplotype blocks”, are interrupted by regions of minimal allelic association. Haplotypes within blocks can be described by a small subset of the markers that define them, permitting substantial savings in genotyping cost. This fact prompted the HapMap initiative, a project aimed at describing human haplotype structure at a fine scale in multiple populations. Haplotypes are governed by the same forces that give rise to polymorphism structure, hence the same principles can be used in their analysis. Current research in this lab focuses on understanding how general haplotype blocks are, how uniform they are across different human population groups, and how deep a sampling is required to find them in a stable fashion. Answering these questions is critical to finding the right experimental design for the HapMap project and to ensuring the generality and utility of this costly resource.
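Allelic association between two markers is commonly quantified with the squared correlation coefficient r². The Python sketch below computes it from phased haplotypes. The 0/1 encoding, the function name, and the example data are assumptions for illustration only, and r² is one of several association measures that could be used.

```python
def r_squared(haplotypes, i, j):
    """Squared allelic correlation (r^2) between markers i and j.

    haplotypes: list of phased haplotypes, each a list of 0/1 alleles
    (0 = reference allele, 1 = alternate allele; an assumed encoding).
    """
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n            # frequency of allele 1 at marker i
    p_b = sum(h[j] for h in haplotypes) / n            # frequency of allele 1 at marker j
    p_ab = sum(h[i] * h[j] for h in haplotypes) / n    # frequency of the 1-1 haplotype
    d = p_ab - p_a * p_b                               # linkage disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # at least one marker is monomorphic; r^2 is undefined, report 0
    return d * d / denom

# Four haplotypes over three markers: markers 0 and 1 always travel together,
# while marker 2 is independent of marker 0.
haps = [
    [0, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
    [1, 1, 1],
]
```

With this data, `r_squared(haps, 0, 1)` is 1.0, so either marker fully predicts the other and only one needs to be genotyped, while `r_squared(haps, 0, 2)` is 0.0, indicating no association.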
The main driving force behind public and private investment in variation resources is the promise that these resources will be useful in tracking down the genetic causes of heritable diseases. The goal is either to find the specific functional mutations that cause disease or to find molecular markers that are predictive of disease susceptibility, response to treatment, and possible side effects. The difficulty is that common diseases affecting millions of people are thought to be multifactorial, i.e., susceptibility depends on a possibly very large number of genes, each with a possibly very modest individual effect. This means that the effects of a given locus are very difficult to measure. This lab is interested in discovering those features in genome variation data that can be interpreted as signatures of causative loci. We are also interested in developing tools that bring the fruits of the HapMap project to the specialized laboratories involved in hunting down disease genes.
Sackton, T.B., Kulathinal, R.J., Bergman, C.M., Quinlan, A.R., Dopman, E.B., Carneiro, M., Marth, G.T., Hartl, D.L., and Clark, A.G. 2009. Population genomic inferences from sparse high-throughput sequencing of two populations of Drosophila melanogaster. Genome Biology and Evolution 1: 449–465.
Smith, D.R., Quinlan, A.R., Peckham, H.E., Makowsky, K., Tao, W., Woolf, B., Shen, L., Donahue, W.F., Tusneem, N., Stromberg, M.P., Stewart, D.A., Zhang, L., Ranade, S.S., Warner, J.B., Lee, C.C., Coleman, B.E., Zhang, Z., McLaughlin, S.F., Malek, J.A., Sorenson, J.M., Blanchard, A.P., Chapman, J., Hillman, D., Chen, F., Rokhsar, D.S., McKernan, K.J., Jeffries, T.W., Marth, G.T., and Richardson, P.M. 2008. Rapid whole-genome mutational profiling using next-generation sequencing technologies. Genome Research 18(10): 1638–1642.
Huang, W., and Marth, G. 2008. EagleView: A genome assembly viewer for next-generation sequencing technologies. Genome Research 18(9): 1538–1543.
Hillier, L.W., Marth, G.T., Quinlan, A.R., Dooling, D., Fewell, G., Barnett, D., Fox, P., Glasscock, J.I., Hickenbotham, M., Huang, W., Magrini, V.J., Richt, R.J., Sander, S.N., Stewart, D.A., Stromberg, M., Tsung, E.F., Wylie, T., Schedl, T., Wilson, R.K., and Mardis, E.R. 2008. Whole-genome sequencing and variant discovery in C. elegans. Nature Methods 5: 183–188.
Quinlan, A.R., Stewart, D.A., Strömberg, M.P., and Marth, G.T. 2008. Pyrobayes: an improved base caller for SNP discovery in pyrosequences. Nature Methods 5: 179–181.
Quinlan, A.R., and Marth, G.T. 2007. Primer-site SNPs mask mutations. Nature Methods 4: 192.
Marth, G.T., Czabarka, E., Murvai, J., and Sherry, S.T. 2004. The allele frequency spectrum reveals differential demographic histories in three large world populations. Genetics 166: 351–372.
Marth, G.T., Schuler, G., Yeh, R., Davenport, R., Agarwala, R., Church, D., Wheelan, S., Baker, J., Ward, M., Kholodov, M., Phan, L., Czabarka, E., Murvai, J., Cutler, D., Wooding, S., Rogers, A., Chakravarti, A., Harpending, H.C., Kwok, P.Y., and Sherry, S.T. 2003. Sequence variations in the public human genome data reflect a bottlenecked population history. Proceedings of the National Academy of Sciences of the USA 100: 376–381.
Marth, G.T. 2003. Computational SNP discovery in DNA sequence data. In: Single Nucleotide Polymorphisms: Methods and Protocols (Ed. Kwok, P.Y.), Humana Press.
Weber, J.L., David, D., Heil, J., Fan, Y., Zhao, C., and Marth, G.T. 2002. Human diallelic insertion/deletion polymorphisms. American Journal of Human Genetics 71: 854–862.
Marth, G., Yeh, R., Minton, M., Donaldson, R., Li, Q., Duan, S., Davenport, R., Miller, R.D., and Kwok, P.Y. 2001. Single-nucleotide polymorphisms in the public domain: how useful are they? Nature Genetics 27: 371–372.
Sachidanandam, R., Weissman, D., Schmidt, S.C., Kakol, J.M., Stein, L.D., Marth, G., Sherry, S., Mullikin, J.C., Mortimore, B.J., Willey, D.L., Hunt, S.E., Cole, C.G., Coggill, P.C., Rice, C.M., Ning, Z., Rogers, J., Bentley, D.R., Kwok, P.Y., Mardis, E.R., Yeh, R.T., Schultz, B., Cook, L., Davenport, R., Dante, M., Fulton, L., Hillier, L., Waterston, R.H., McPherson, J.D., Gilman, B., Schaffner, S., Van Etten, W.J., Reich, D., Higgins, J., Daly, M.J., Blumenstiel, B., Baldwin, J., Stange-Thomann, N., Zody, M.C., Linton, L., Lander, E.S., Altshuler, D. The international SNP map working group. 2001. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409: 928–933.
Marth, G.T., Yandell, M.D., Korf, I., Gu, Z., Yeh, R.T., Zakeri, H., Stitziel, N.O., Hillier, L., Kwok, P.Y., and Gish, W. 1999. A general approach to single-nucleotide polymorphism discovery. Nature Genetics 23: 452–456.
Dear, S., Durbin, R., Hillier, L., Marth, G., Thierry-Mieg, J., and Mott, R. 1998. Sequence assembly with CAFTOOLS. Genome Research 8: 260–267.