About

I am a postdoc at the New York Genome Center, jointly advised by Drs. David Knowles and Tuuli Lappalainen. For the first two years of my postdoc, I was funded as a Data Science Institute Fellow at Columbia University, where I hold a joint appointment. I am broadly interested in the development and application of statistical and computational methods in genetics and genomics, with a focus on complex traits. I am particularly interested in large-scale exploratory data analysis, causal inference, 'omic data integration, and cross-ancestry analysis.

I received my PhD from the Department of Computer Science at the University of California, Berkeley in 2016, where I was jointly supervised by Drs. Lior Pachter and Noah Zaitlen. Prior to joining NYGC, I worked as a computational biologist at Verily Life Sciences. In a past life, I studied quantum complexity theory.

Research Highlights

Multi-omic data integration

Multi-omic studies are becoming commonplace, with studies simultaneously generating RNA sequencing, DNA methylation, chromatin accessibility, and other data. However, integration of these data remains challenging. I am interested in the development of methods to extract meaningful shared signal from these data and their application to the biology of human complex traits and disease. To this end, I have spent substantial time studying the application of Canonical Correlation Analysis (CCA) to genomics. As a graduate student, I helped develop Principal Component Correlation Analysis, which we used to find under-appreciated population structure in the GEUVADIS study. More recently, we developed a multi-modal extension of CCA that uses a probabilistic graphical model to simultaneously estimate shared and private features in multi-omic data.
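
For readers unfamiliar with CCA, the following is a minimal sketch of classical two-view CCA on simulated data. It is not the Principal Component Correlation Analysis or the probabilistic multi-modal model mentioned above. The function name `canonical_correlations`, the simulated matrices, and all dimensions are assumptions made purely for illustration.

```python
import numpy as np

def canonical_correlations(X, Y):
    """Classical CCA via QR + SVD (hypothetical helper, illustration only).

    Assumes X and Y have the same number of rows (samples), more samples
    than features, and full column rank.
    """
    # Center each view so that correlations are computed about the mean.
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    # Orthonormal bases for the column spaces of the two views.
    Qx, _ = np.linalg.qr(Xc)
    Qy, _ = np.linalg.qr(Yc)
    # Singular values of Qx^T Qy are the canonical correlations.
    s = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
    return np.clip(s, 0.0, 1.0)

# Simulated example: one latent factor shared by two "omic" views.
rng = np.random.default_rng(0)
n_samples = 500
z = rng.normal(size=(n_samples, 1))               # shared signal
X = z @ np.ones((1, 5)) + rng.normal(size=(n_samples, 5))
Y = z @ np.ones((1, 4)) + rng.normal(size=(n_samples, 4))

corrs = canonical_correlations(X, Y)
print(corrs)
```

In this toy setting only the first canonical correlation should be large, reflecting the single shared factor; the remaining ones capture sampling noise.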

Causal network inference and Mendelian randomization

Recently, the role of network structure in complex trait genetics has received renewed attention. Network structure estimation (also called causal discovery) is a challenging problem; however, progress can be made by leveraging perturbation-response data, where one or more nodes in the network are perturbed and the response of the remaining variables is observed. We consider two sources of large-scale perturbation-response data: CRISPR-based inhibition data and genetic association data. Genetic variants provide a natural source of perturbations, which can be used to estimate putatively causal effects using a technique called Mendelian randomization. We have developed Welch-weighted Egger regression (WWER), a technique for fast estimation of pairwise bi-directed causal effects in biobank-scale data, as well as inverse sparse regression (inspre), a strategy for turning a dense network of pairwise causal effects into a sparse directed graph. We have applied these techniques to the genome-wide Perturb-seq and UK Biobank datasets.
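
To make the Mendelian randomization idea concrete, here is a minimal sketch of standard inverse-variance-weighted Egger regression on simulated summary statistics. It is not WWER or inspre. The function name `egger_regression`, the simulated effect sizes, and the parameter values are all assumptions chosen for illustration.

```python
import numpy as np

def egger_regression(beta_x, beta_y, se_y):
    """Standard MR-Egger regression (hypothetical helper, illustration only).

    beta_x: per-variant effect estimates on the exposure
    beta_y: per-variant effect estimates on the outcome
    se_y:   standard errors of beta_y
    Returns (intercept, slope): the intercept absorbs average directional
    pleiotropy and the slope estimates the causal effect of exposure on outcome.
    """
    # Orient each variant so that its exposure effect is non-negative.
    sign = np.where(beta_x < 0, -1.0, 1.0)
    bx, by = beta_x * sign, beta_y * sign
    # Weighted least squares with inverse-variance weights.
    w = 1.0 / se_y**2
    X = np.column_stack([np.ones_like(bx), bx])
    XtW = X.T * w
    coef = np.linalg.solve(XtW @ X, XtW @ by)
    return coef[0], coef[1]

# Simulated summary statistics (all values are made up for this sketch).
rng = np.random.default_rng(0)
n_variants = 200
true_effect, pleiotropy = 0.5, 0.02
beta_x = rng.uniform(0.05, 0.2, n_variants)
se_y = np.full(n_variants, 0.01)
beta_y = pleiotropy + true_effect * beta_x + rng.normal(0.0, se_y)

intercept, slope = egger_regression(beta_x, beta_y, se_y)
print(f"intercept={intercept:.3f}, slope={slope:.3f}")
```

The variants act as the perturbations: the slope relating their outcome effects to their exposure effects is the putative causal effect, while a nonzero intercept signals effects on the outcome that bypass the exposure.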

Cross-population complex trait analysis

Human phenotypes vary in their global distributions due to a combination of genetic and environmental factors; however, the vast majority of genetic studies of disease focus on individuals of European ancestry. We asked a simple question: does the same genetic variant have the same phenotypic effect in two ancestral populations? This led to the development of the cross-population genetic correlation, and to one of the first studies to reveal practical concerns with applying European-derived genetic information to other populations in the context of complex traits. The resulting tool, popcorn, is now widely used.

Note