Genomic research has long benefited from using diverse population panels, increasing the statistical power of association studies for participants from admixed populations [1–3]. However, mass spectrometry-based proteomic workflows often map all data onto a single set of reference protein sequences. Consequently, we obscure a portion of the proteome, restricting our ability to fully analyze complex samples [4–6]. Moreover, we risk introducing a bias against populations whose haplotypic structure differs from that represented in the reference.
Alleles co-occurring in the protein-coding regions of the same gene produce a unique protein sequence, termed a protein haplotype [5]. These haplotypes are present in biological samples and are detectable by mass spectrometry. We have demonstrated that thousands of amino acid substitutions can be discovered in a single sample, sometimes featuring alleles in linkage disequilibrium within the same peptide after tryptic digestion of the protein [6]. We have recently released ProHap, a bioinformatic pipeline for building proteomic databases from genetic reference panels [7].
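As an illustration of how co-occurring alleles can end up in the same tryptic peptide, the following minimal Python sketch applies two amino acid substitutions to a toy reference sequence and digests both versions in silico. It is not ProHap's implementation: the sequence, the substitutions, and the function names `apply_substitutions` and `tryptic_digest` are hypothetical, and the digestion uses a simplified trypsin rule (cleavage after K or R unless followed by P, no missed cleavages).

```python
import re

def apply_substitutions(reference, substitutions):
    """Return the protein haplotype obtained by applying all substitutions.

    substitutions: iterable of (position, ref_aa, alt_aa), positions 1-based.
    Hypothetical helper for illustration only.
    """
    residues = list(reference)
    for position, ref_aa, alt_aa in substitutions:
        if residues[position - 1] != ref_aa:
            raise ValueError(f"Expected {ref_aa} at position {position}")
        residues[position - 1] = alt_aa
    return "".join(residues)

def tryptic_digest(sequence):
    """Simplified trypsin rule: cleave after K or R unless followed by P."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", sequence) if p]

# Toy reference sequence (invented, not a real protein).
reference = "MSLAKTEGDVNPLRSWAK"

# Two substitutions assumed to be co-inherited on one haplotype.
# Neither creates nor removes a cleavage site.
haplotype = apply_substitutions(reference, [(8, "G", "S"), (11, "N", "D")])

reference_peptides = tryptic_digest(reference)
haplotype_peptides = tryptic_digest(haplotype)

# Peptides that a reference-only database would not contain.
novel_peptides = [p for p in haplotype_peptides if p not in reference_peptides]

print(reference_peptides)
print(haplotype_peptides)
print(novel_peptides)
```

In this toy case both substitutions fall within one peptide, so a database listing each variant separately would still lack the peptide carrying both; a haplotype-aware database includes it.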
We generated proteomic databases from the 1000 Genomes Project [1] and showed that participants of the African superpopulation diverge from the reference proteome more than others, while all the included ancestry groups show notable differences from the reference proteome [7]. ProHap alleviates this bias by creating databases that capture the diversity of human proteomes, allowing protein haplotypes to compete fairly during proteomic searches. The pipeline can be run on public as well as local reference panels, with great flexibility in the types of genetic variants included and the haplotype frequency threshold, empowering researchers to tailor their proteomic studies to the populations they investigate.
To provide rapid insight into the complexity of such proteogenomic datasets, we have developed ProHap Explorer, a web-based visual interface that maps identified peptides to genes, haplotypes, and spliced transcripts. ProHap Explorer allows researchers to browse the influence of common haplotypes on any gene of interest and to view the coverage of the resulting proteoforms in public mass spectrometry data sets.