Integration of multiple epigenomic marks improves prediction of variant impact in saturation mutagenesis reporter assay

Dustin Shigaki, Orit Adato, Aashish N Adhikari, Shengcheng Dong, Alex Hawkins-Hooker, Fumitaka Inoue, Tamar Juven-Gershon, Henry Kenlay, Beth Martin, Ayoti Patra, Dmitry D Penzar, Max Schubach, Chenling Xiong, Zhongxia Yan, Alan P Boyle, Anat Kreimer, Ivan V Kulakovskiy, John Reid, Ron Unger, Nir Yosef, Jay Shendure, Nadav Ahituv, Martin Kircher, Michael A Beer

Correspondence should be addressed to: Michael A. Beer (mbeer AT jhu DOT edu)

The integrative analysis of high-throughput reporter assays, machine learning, and profiles of epigenomic chromatin state in a broad array of cells and tissues has the potential to significantly improve our understanding of noncoding regulatory element function and its contribution to human disease. Here, we report results from the CAGI 5 regulation saturation challenge where participants were asked to predict the impact of nucleotide substitution at every base pair within five disease-associated human enhancers and nine disease-associated promoters. A library of mutations covering all bases was generated by saturation mutagenesis and altered activity was assessed in a massively parallel reporter assay (MPRA) in relevant cell lines. Reporter expression was measured relative to plasmid DNA to determine the impact of variants. The challenge was to predict the functional effects of variants on reporter expression. Comparative analysis of the full range of submitted prediction results identifies the most successful models of transcription factor binding sites, machine learning algorithms, and ways to choose among or incorporate diverse datatypes and cell-types for training computational models. These results have the potential to improve the design of future studies on more diverse sets of regulatory elements and aid the interpretation of disease-associated genetic variation.

Citation

If you use this data, please cite as:
Shigaki D, Adato O, Adhikari AN, Dong S, Hawkins-Hooker A, Inoue F, Juven-Gershon T, Kenlay H, Martin B, Patra A, Penzar D, Schubach M, Xiong C, Yan Z, Boyle A, Kreimer A, Kulakovskiy IV, Reid J, Unger R, Yosef N, Shendure J, Ahituv N, Kircher M, and Beer MA. Integration of multiple epigenomic marks improves prediction of variant impact in saturation mutagenesis reporter assay. Human Mutation 2019. doi:10.1002/humu.23797, and
Lee D, Gorkin DU, Baker M, Strober BJ, Asoni AL, McCallion AS, Beer MA. A method to predict the impact of regulatory variants from DNA sequence. Nat Genet. 2015. doi:10.1038/ng.3331

DeltaSVM models trained on ENCODE/Roadmap DHS enhancer, promoter, and TF ChIP-seq data are available below:

  • Sample label index sample_list.txt
  • Variant scoring script (usage example in comments and below) score_snp_seq.py
  • ENCODE2 enhancer models (159) deltasvm_models_e2e.tar.gz
  • ENCODE2 promoter models (163) deltasvm_models_e2p.tar.gz
  • ENCODE2 TF models (345) deltasvm_models_e2tf.tar.gz
  • ENCODE3 enhancer models (182) deltasvm_models_e3e.tar.gz
  • ENCODE3 promoter models (196) deltasvm_models_e3p.tar.gz
  • ENCODE3 TF models (699) deltasvm_models_e3tf.tar.gz
  • Roadmap enhancer models (313) deltasvm_models_rme.tar.gz
  • Roadmap promoter models (317) deltasvm_models_rmp.tar.gz
  • Note: the promoter models are quite similar to one another, since the set of promoter TFs does not vary much across cell types.

    If you have any questions, please contact Mike Beer at mbeer AT jhu DOT edu.


    This example is the as-yet-uncharacterized SNP rs1498232, associated with neuropsychiatric disease, which we predict disrupts an RFX family binding site after training on brain DHS or RFX ChIP-seq.


    python score_snp_seq.py CCCGTTTCCATGGCAACCAGA CCCGTTTCCACGGCAACCAGA TF_E3_530_hg38_300_top10k_vs_neg1x_avg_weights.out
    CCCGTTTCCAT 0.84 CCCGTTTCCAC -0.10
    CCGTTTCCATG 1.86 CCGTTTCCACG 0.52
    CGTTTCCATGG 3.60 CGTTTCCACGG 1.19
    GTTTCCATGGC 4.91 GTTTCCACGGC 1.80
    TTTCCATGGCA 2.96 TTTCCACGGCA 1.17
    TTCCATGGCAA 3.03 TTCCACGGCAA 1.16
    TCCATGGCAAC 6.23 TCCACGGCAAC 2.32
    CCATGGCAACC 6.92 CCACGGCAACC 2.90
    CATGGCAACCA 4.30 CACGGCAACCA 1.44
    ATGGCAACCAG 2.26 ACGGCAACCAG 0.65
    TGGCAACCAGA 1.46 CGGCAACCAGA 0.37
    CCCGTTTCCATGGCAACCAGA 38.38 CCCGTTTCCACGGCAACCAGA 13.43
    deltaSVM=-24.94

    python score_snp_seq.py CCCGTTTCCATGGCAACCAGA CCCGTTTCCACGGCAACCAGA DHS_E3_100_300_noproms_nc30_hg38_top10k_vs_neg1x_avg_weights.out
    CCCGTTTCCAT 0.06 CCCGTTTCCAC -0.51
    CCGTTTCCATG 0.77 CCGTTTCCACG -0.20
    CGTTTCCATGG 1.54 CGTTTCCACGG 0.15
    GTTTCCATGGC 2.46 GTTTCCACGGC 0.56
    TTTCCATGGCA 1.86 TTTCCACGGCA 0.39
    TTCCATGGCAA 1.88 TTCCACGGCAA 0.21
    TCCATGGCAAC 3.23 TCCACGGCAAC 0.66
    CCATGGCAACC 3.63 CCACGGCAACC 0.80
    CATGGCAACCA 1.74 CACGGCAACCA 0.20
    ATGGCAACCAG 0.64 ACGGCAACCAG -0.07
    TGGCAACCAGA 0.20 CGGCAACCAGA -0.23
    CCCGTTTCCATGGCAACCAGA 18.01 CCCGTTTCCACGGCAACCAGA 1.96
    deltaSVM=-16.05
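
The output above suggests how the score is assembled: each 11-mer overlapping the variant gets a weight, the weights are summed per allele, and deltaSVM is the alternate sum minus the reference sum. The following is a minimal Python sketch of that arithmetic, not the actual score_snp_seq.py. The function names and the in-memory WEIGHTS dictionary are hypothetical, and the weights are the rounded values printed in the DHS example rather than values read from a model file (the on-disk weight file format is not assumed here).

```python
# Illustrative sketch only: reproduces the arithmetic visible in the DHS
# example output above. Names (kmers, sequence_score, delta_svm, WEIGHTS)
# are hypothetical and not taken from score_snp_seq.py.

def kmers(seq, k=11):
    """Return every overlapping k-mer of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def sequence_score(seq, weights, k=11):
    """Sum the per-k-mer weights over all k-mers in seq."""
    return sum(weights[kmer] for kmer in kmers(seq, k))


def delta_svm(ref, alt, weights, k=11):
    """Score of the alternate allele minus score of the reference allele."""
    return sequence_score(alt, weights, k) - sequence_score(ref, weights, k)


# Rounded 11-mer weights copied from the DHS example output above.
WEIGHTS = {
    # reference-allele 11-mers
    "CCCGTTTCCAT": 0.06, "CCGTTTCCATG": 0.77, "CGTTTCCATGG": 1.54,
    "GTTTCCATGGC": 2.46, "TTTCCATGGCA": 1.86, "TTCCATGGCAA": 1.88,
    "TCCATGGCAAC": 3.23, "CCATGGCAACC": 3.63, "CATGGCAACCA": 1.74,
    "ATGGCAACCAG": 0.64, "TGGCAACCAGA": 0.20,
    # alternate-allele 11-mers
    "CCCGTTTCCAC": -0.51, "CCGTTTCCACG": -0.20, "CGTTTCCACGG": 0.15,
    "GTTTCCACGGC": 0.56, "TTTCCACGGCA": 0.39, "TTCCACGGCAA": 0.21,
    "TCCACGGCAAC": 0.66, "CCACGGCAACC": 0.80, "CACGGCAACCA": 0.20,
    "ACGGCAACCAG": -0.07, "CGGCAACCAGA": -0.23,
}

REF = "CCCGTTTCCATGGCAACCAGA"
ALT = "CCCGTTTCCACGGCAACCAGA"

if __name__ == "__main__":
    print(f"ref={sequence_score(REF, WEIGHTS):.2f}")
    print(f"alt={sequence_score(ALT, WEIGHTS):.2f}")
    print(f"deltaSVM={delta_svm(REF, ALT, WEIGHTS):.2f}")
```

With a 21 bp sequence and the variant at the center, all eleven 11-mers contain the variant base, which is why each row of the output differs between the two alleles. Because the sketch sums rounded weights, its totals can differ from the script's in the last decimal place, as in the TF example where the rounded rows give -24.95 against the reported -24.94.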