3D Chromatin Organization

The role that chromatin architecture plays in the mechanism of gene regulation has undergone a dramatic transformation with the emergence of Hi-C and other 3C-derived high throughput technology for interrogating the three-dimensional configurations of the genome and identifying regions that are in close spatial proximity in a genome-wide fashion. As the sequencing depth grows, the resolution to investigate the 3D chromatin interactions goes from Megabase scale to kilobase or even hundred-base genomic scale. Furthermore, with the development of other “C”-related techniques, for instance, HiChIP, DNaseHiC, Capture Hi-C, etc., we will be able to investigate all kinds of genomic interactions at ultra-high resolution across the whole genome from the anchor regions that can be associated with clinically relevant loci such as regulatory elements, single nucleotide polymorphisms (SNPs) from GWAS studies, chromatin domain boundaries or promoters. Analysis of 3D genome data will resolve many fundamental and long-standing questions involving the mechanism of gene regulation from distal non-coding regulatory variants, linking novel SNPs to putative disease-associated genes related to all sorts of disorders, the connection between chromatin domain structure variation and cancer or other genetic diseases.

Such technological advances in 3D genomics lead to great needs for new statistical models and computational tools to harness the embedding conformation information present in Hi-C data. To achieve decent interaction resolution, namely, how specific the region can be if we zoom in the genome to the largest extent, one study usually has billions of reads for each biological replicate. Hence, it poses a tough challenge on the model which should be theoretically sound and practically applicable and implementable within a short time. Realizing the demand in the field, I have developed a few fundamental methods, mHi-C, FreeHiC and TreeHiC, with the corresponding software packages or pipelines to address many basic needs in dealing with Hi-C data.

te

Due to the importance of sequencing depth in Hi-C analysis, I developed mHi-C (Zheng et al. 2018 bioRxiv) to utilize multi-mapping reads that are discarded in Hi-C benchmark analysis pipeline. Multi-mapping reads are reads originated from repetitive regions of the genome hence they can be aligned to multiple positions. The ambiguity of multi-reads alignment renders them to be deleted wasting ~20% of the sequencing depth which can be translated into hundreds of millions of reads for one biological replicate. Through a hierarchical model that probabilistically allocates Hi-C multi-reads to their most likely genomic origins by utilizing specific characteristics of the paired-end reads of the Hi-C assay, the chromatin domain boundaries are refined and we can also detect novel promoter-enhancer interactions that involve disease-associated genes. Moreover, mHi-C, itself is programmed into a python pipeline with C acceleration, can process from raw data and output valid genomic interactions. Another urgent demand from both experimental labs and computational groups is a tool to fast simulate Hi-C contact matrices to work as biological replicates to improve the detection power or employed as a test of models under development. Also, the tool should be easy to use for researchers from all kinds of background. Therefore, I have developed FreeHiC (Zheng et al. 2018 bioRxiv) which empirically learns the interacting fragment behaviors, sequence characters, according to the Hi-C experimental protocol procedures. Sequently, with the user-controlled level of noise being inserted into the simulation process, a similar Hi-C interaction matrix is generated. Additionally, under the scenarios where researchers want to compare tissues or experimental conditions, differential interaction testing is requisite. TreeHiC (Nguyen et al. 2018 manuscript, co-first author), built on a hierarchical multiple testing procedure, has demonstrated that this framework can detect differential interactions while assuring the control of the False Discovery Rate in complex large-scale Hi-C studies under a wide range of settings.

te

Avatar
Ye Zheng
Ph.D. Candidate in Statistics

Publications

Current Hi-C analysis approaches are unable to account for reads that align to multiple locations, and hence underestimate biological …

In genomics, statisticians and computational groups rely on data-driven simulations to benchmark analysis approaches and develop best …

Hi-C sequencing technology provides key insights into the 3D structures of the human genome. Although peak detection from Hi-C …