About

EXPANSION is a new pipeline and webserver to explore the functional consequences of an input list of protein-coding alternative splice variants, for example differentially expressed (DE) instances from transcriptomics datasets. We combined information on DE protein-coding transcripts from cancer genomics with information on domain architecture, protein interaction networks and gene enrichment analysis.

We retrieved all protein-coding Ensembl transcripts using the ensembldb package (version 2.22.0, R4.2) (1). We pooled all the protein sequences encoded by each gene and clustered them with CD-HIT (2), using 0.6 as the similarity cutoff. Clustered sequences were aligned with Clustal Omega (3) using default parameters. The multiple sequence alignments (MSAs) were used to detect insertions, deletions or divergent positions of alternatively spliced protein isoforms with respect to UniProt canonical sequences. We also mapped InterPro domain definitions, as well as post-translational modifications (PTMs) from PhosphoSitePlus, onto UniProt canonical sequences to identify splicing events affecting these structural and functional sites. Further, we retrieved protein isoform-specific interactions from IntAct (version 1.38.0, R4.2) (4). As a use example, we considered RSEM transcript abundances from corresponding tissues from TCGA and GTEx, reprocessed through the Toil pipeline (5) and available from the UCSC Xena Browser (6). We computed differential expression of transcript isoforms via EBSeq (7), considering as significant those instances with a posterior probability of being DE (PPDE) > 0.95.
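As an illustration of how insertions, deletions and divergent positions can be read off an alignment, the following is a minimal sketch and not the EXPANSION implementation. It assumes two rows taken from the same MSA (the canonical sequence and one isoform), with `-` as the gap character. The function name `classify_events` and the coordinate convention (1-based canonical positions, with an insertion reported at the canonical residue that precedes it) are hypothetical choices made for this example.

```python
from typing import List, Tuple

GAP = "-"  # assumed gap character in the aligned rows


def classify_events(canonical: str, isoform: str) -> List[Tuple[str, int, int]]:
    """Group alignment columns into (event, start, end) tuples.

    Coordinates are 1-based positions on the ungapped canonical sequence.
    Illustrative sketch only; the event definitions are assumptions.
    """
    if len(canonical) != len(isoform):
        raise ValueError("aligned rows must have the same length")

    events: List[Tuple[str, int, int]] = []
    current = None  # open event as [kind, start, end]
    pos = 0         # canonical residues seen so far

    for c, i in zip(canonical, isoform):
        if c != GAP:
            pos += 1
        if c == GAP and i == GAP:
            continue  # column is populated only by other cluster members
        if c == GAP:
            kind = "insertion"    # residue present only in the isoform
        elif i == GAP:
            kind = "deletion"     # canonical residue missing in the isoform
        elif c != i:
            kind = "divergence"   # both present but different
        else:
            kind = None           # identical residues

        if kind is not None and current is not None and current[0] == kind:
            current[2] = pos      # extend the open event
            continue
        if current is not None:
            events.append((current[0], current[1], current[2]))
            current = None
        if kind is not None:
            current = [kind, pos, pos]

    if current is not None:
        events.append((current[0], current[1], current[2]))
    return events


example_events = classify_events("MKT-AYIAKQR", "MKTGAY---QK")
```

For the toy alignment above, `example_events` holds one insertion after canonical position 3, one deletion spanning positions 6 to 8, and one divergent residue at position 10. Intervals of this kind can then be intersected with domain or PTM coordinates on the canonical sequence.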

EXPANSION allows gene-centric queries of the protein alternative splice forms derived from a given gene, clustered by sequence similarity. In the results page, a Summary section describes the most relevant findings, such as the number of significantly regulated transcripts and whether, and how many, splice variants affect functional sites (i.e. domains, PTM sites and PPI binding regions). A bubble plot provides information about the differential expression of transcripts (rows) in each cancer tissue (columns), where each circle is colored by log-fold change (TCGA over GTEx) and its diameter is proportional to significance. Optionally, the user can upload a transcript DE dataset of choice, with predefined fields (i.e. ENST IDs, log-fold change and adjusted P-value), to be analyzed through our pipeline. Results are shown only for transcript IDs that match our internal database. For each protein-coding transcript in the DE dataset, cartoon panels on the right show the splicing variation (i.e. insertion, deletion or divergence) affecting domain architecture and PTMs, represented respectively as colored boxes and lollipops on the canonical protein sequence. A central panel shows the interaction network (IntAct) mediated by the protein isoform considered. The number of interactors can be tuned using an MI score toggle (default MI score = 0.2). Each node represents a protein and can be expanded to visualize its domain architecture. If binding region information is provided by IntAct, region-specific interaction edges are drawn on the network. If a splicing event affects a binding region, the corresponding edge is highlighted in red to ease interpretation of the functional consequences of splicing. It is also possible to visualize an over-representation analysis (ORA) of functional categories for the genes in the network, computed via the g:Profiler Python library (8). Whenever isoform-specific interactors are present in the network, ORA can be calculated for each isoform-specific group, allowing comparison of the distinct biological processes mediated by isoform-specific interactors.
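The network filtering and edge highlighting described above can be sketched as follows. This is a hedged example under stated assumptions, not the server's code: the `Interaction` class, the `build_edges` function, the field names and the partner identifiers are all hypothetical. It also assumes that interactions with an MI score at or above the cutoff are kept, and that a splicing event "affects" a binding region when the two intervals overlap on the canonical sequence.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive, on the canonical sequence


@dataclass
class Interaction:
    """Hypothetical record for one interaction of the queried isoform."""
    partner: str
    mi_score: float
    binding_region: Optional[Interval] = None  # None if no region is annotated


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two closed intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]


def build_edges(
    interactions: List[Interaction],
    splicing_events: List[Interval],
    mi_cutoff: float = 0.2,  # default MI score quoted in the text
) -> List[Dict[str, object]]:
    """Keep interactions passing the MI cutoff and flag affected edges."""
    edges = []
    for inter in interactions:
        if inter.mi_score < mi_cutoff:
            continue  # assumption: the cutoff is an inclusive lower bound
        affected = inter.binding_region is not None and any(
            overlaps(inter.binding_region, event) for event in splicing_events
        )
        edges.append({
            "partner": inter.partner,
            "region_specific": inter.binding_region is not None,
            "color": "red" if affected else "default",
        })
    return edges


example_edges = build_edges(
    [
        Interaction("PARTNER_A", 0.45, (100, 180)),
        Interaction("PARTNER_B", 0.10, (10, 40)),   # below the cutoff
        Interaction("PARTNER_C", 0.60),             # no binding region
        Interaction("PARTNER_D", 0.30, (300, 350)),
    ],
    splicing_events=[(150, 200)],
)
```

In this toy input only the edge to `PARTNER_A` is colored red, because its binding region (100 to 180) overlaps the splicing event (150 to 200). `PARTNER_B` is dropped by the cutoff, and the other two edges keep the default color.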

Libraries

Flask (v2.0.2)
D3 (v4)
jQuery (v3.2.1)
DataTables (v2.3.2)
Biopython (v1.79)
gprofiler-official (v1.0.0)
mysql-connector-python (v8.0.31)

References

Rainer,J., Gatto,L. and Weichenberger,C.X. (2019) ensembldb: an R package to create and use Ensembl-based annotation resources. Bioinformatics, 35, 3151–3153. https://doi.org/10.1093/BIOINFORMATICS/BTZ031, http://www.ncbi.nlm.nih.gov/pubmed/30689724.
Li,W., Jaroszewski,L. and Godzik,A. (2001) Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics, 17, 282–283. https://doi.org/10.1093/BIOINFORMATICS/17.3.282, http://www.ncbi.nlm.nih.gov/pubmed/11294794.
Sievers,F., Wilm,A., Dineen,D., Gibson,T.J., Karplus,K., Li,W., Lopez,R., McWilliam,H., Remmert,M., Söding,J., et al. (2011) Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol. Syst. Biol., 7, 539. https://doi.org/10.1038/msb.2011.75, http://www.ncbi.nlm.nih.gov/pubmed/21988835.
Orchard,S., Ammari,M., Aranda,B., Breuza,L., Briganti,L., Broackes-Carter,F., Campbell,N.H., Chavali,G., Chen,C., Del-Toro,N., et al. (2014) The MIntAct project--IntAct as a common curation platform for 11 molecular interaction databases. Nucleic Acids Res., 42. https://doi.org/10.1093/NAR/GKT1115, http://www.ncbi.nlm.nih.gov/pubmed/24234451.
Vivian,J., Rao,A.A., Nothaft,F.A., Ketchum,C., Armstrong,J., Novak,A., Pfeil,J., Narkizian,J., Deran,A.D., Musselman-Brown,A., et al. (2017) Toil enables reproducible, open source, big biomedical data analyses. Nat. Biotechnol., 35, 314–316. https://doi.org/10.1038/nbt.3772, http://www.ncbi.nlm.nih.gov/pubmed/28398314.
Goldman,M.J., Craft,B., Hastie,M., Repečka,K., McDade,F., Kamath,A., Banerjee,A., Luo,Y., Rogers,D., Brooks,A.N., et al. (2020) Visualizing and interpreting cancer genomics data via the Xena platform. Nat. Biotechnol., 38, 675–678. https://doi.org/10.1038/S41587-020-0546-8, http://www.ncbi.nlm.nih.gov/pubmed/32444850.
Leng,N., Dawson,J.A., Thomson,J.A., Ruotti,V., Rissman,A.I., Smits,B.M.G., Haag,J.D., Gould,M.N., Stewart,R.M. and Kendziorski,C. (2013) EBSeq: an empirical Bayes hierarchical model for inference in RNA-seq experiments. Bioinformatics, 29, 1035–1043. https://doi.org/10.1093/BIOINFORMATICS/BTT087, http://www.ncbi.nlm.nih.gov/pubmed/23428641.
Reimand,J., Arak,T., Adler,P., Kolberg,L., Reisberg,S., Peterson,H. and Vilo,J. (2016) g:Profiler—a web server for functional interpretation of gene lists (2016 update). Nucleic Acids Res., 44, W83–W89. https://doi.org/10.1093/nar/gkw199, http://www.ncbi.nlm.nih.gov/pubmed/27098042.

Cite

Chakit Arora, Natalia De Oliveira Rosa, Marin Matic, Mariastella Cascone, Pasquale Miglionico, Francesco Raimondi, EXPANSION: a webserver to explore the functional consequences of protein-coding alternative splice variants in cancer genomics, Bioinformatics Advances, Volume 3, Issue 1, 2023, vbad135, https://doi.org/10.1093/bioadv/vbad135

Contact

Francesco Raimondi - francesco.raimondi@sns.it

Chakit Arora - chakit.arora@sns.it

Natalia de Oliveira Rosa - natalia.deoliveirarosa@sns.it