Introduction
PIK3CA is a frequently mutated oncogene in human cancers, with mutations found in various cancer types, including endometrial, ovarian, colorectal, breast, cervical, squamous cell cancer of the head and neck, chondroma, and thyroid carcinoma1. The PIK3CA gene encodes the p110α catalytic subunit of phosphatidylinositol 3-kinase2.
Somatic missense mutations in PIK3CA have been reported in many human cancer types, and are predominantly found in the kinase and helical domains of the PIK3CA subunit3.
According to karakas_2006 at al, some of the most common PIK3CA mutations associated with cancer include:
- Exon 9 (G1624A:E542K)
- Exon 20 (A3140G:H1047R)
- Exon 1 (E545K)
PIK3CA mutations are more common in tumors with mutant PTEN genes than in tumors with wild-type PTEN genes4. The discovery of oncogenic mutations in PIK3CA has emphasized the important role of PI3K in cancer pathogenesis and made it possible to quickly and easily identify tumors with activation of PI3K signaling by virtue of mutations in PIK3CA4.
In this vignette we use agvgd to classify the pathogenecity of the missence mutations/variations reported in Correia and Magno et al (2022)5. Here they report several mutations found in breast cancer, from which we selected the 17 variants found in the metabric dataset. These will be used to showcase how the usage of the agvgd package can complement the findings from other biomedical data analyses.
Running agvgd
Data sets
To use the agvgd package to predict the protein impact of the metabric variants reported in Correia and Magno et al5, we need two data sets:
- The list of the 17 metabric PIK3CA missense variants;
- The protein sequence alignment of PIK3CA.
These first dataset is readily available with the agvgd package, and the protein alignment is provide within the companion package protean. More information regarding the protean package can be found in its GitHub repository.
1. Get the alignment dataset from {protean}
To retrieve the alignment of PIK3CA from the protean
package, we will use the functions read_profile()
and
profile_path()
using the gene name
"PIK3CA"
:
# Load the required packages
library(protean)
library(dplyr)
# Using protean
# Check all oncoKB proteins that have alignment files in the protean package
head(protean::oncokb_genes)
#> [1] "ABL1" "AKT1" "ALK" "AMER1" "APC" "AR"
# Confirm that PIK3CA is present
grep("PIK3CA", protean::oncokb_genes, value = TRUE)
#> [1] "PIK3CA"
# Get the alignment profile information for PIK3CA
pik3ca_profile <- protean::read_profile(protean::profile_path("PIK3CA"))
head(pik3ca_profile)
#> # A tibble: 6 × 11
#> timestamp human_prot_id ortho_prot_id ortho_species human_align_seq
#> <chr> <chr> <chr> <chr> <chr>
#> 1 2023-11-26 21:55:44… ENSP00000263… ENSGGOP00000… gorilla_gori… MPPRPSSGELWGIH…
#> 2 2023-11-26 21:55:44… ENSP00000263… ENSMFAP00000… macaca_fasci… MPPRPSSGELWGIH…
#> 3 2023-11-26 21:55:44… ENSP00000263… ENSMMUP00000… macaca_mulat… MPPRPSSGELWGIH…
#> 4 2023-11-26 21:55:44… ENSP00000263… ENSMNEP00000… macaca_nemes… MPPRPSSGELWGIH…
#> 5 2023-11-26 21:55:44… ENSP00000263… ENSPANP00000… papio_anubis MPPRPSSGELWGIH…
#> 6 2023-11-26 21:55:44… ENSP00000263… ENSMLEP00000… mandrillus_l… MPPRPSSGELWGIH…
#> # ℹ 6 more variables: ortho_align_seq <chr>, human_ortho_perc_id <dbl>,
#> # ortho_human_perc_id <dbl>, cigar <chr>, human_profile_seq <chr>,
#> # ortho_profile_seq <chr>
2. Format the sequence profile to an alignment with
{agvgd}
Next, the sequence profile must be converted to an alignment format.
For this we must use the profile_to_alignment()
function
from the agvgd package.
# Using agvgd
# Convert the protean sequence profile to an alignment
pik3ca_alignment <- agvgd::profile_to_alignment(pik3ca_profile)
# Check the first lines and columns of the alignment
pik3ca_alignment[1:15, 1:30]
#> homo_sapiens_ENSP00000263967 1 MPPRPSSGELWGIHLMPPRILVECLLPNGM
#> gorilla_gorilla_ENSGGOP00000009158 1 MPPRPSSGELWGIHLMPPRILVECLLPNGM
#> macaca_fascicularis_ENSMFAP00000027509 1 MPPRPSSGELWGIHLMPPRILVECLLPNGM
#> macaca_mulatta_ENSMMUP00000006866 1 MPPRPSSGELWGIHLMPPRILVECLLPNGM
#> macaca_nemestrina_ENSMNEP00000020318 1 MPPRPSSGELWGIHLMPPRILVECLLPNGM
#> papio_anubis_ENSPANP00000023010 1 MPPRPSSGELWGIHLMPPRILVECLLPNGM
#> mandrillus_leucophaeus_ENSMLEP00000019594 1 MPPRPSSGELWGIHLMPPRILVECLLPNGM
#> chlorocebus_sabaeus_ENSCSAP00000009123 1 MPPRPSSGELWGIHLMPPRILVECLLPNGM
#> pongo_abelii_ENSPPYP00000041486 1 MPPRPSSGELWGIHLMPPRILVECLLPNGM
#> rhinopithecus_roxellana_ENSRROP00000011898 1 MPPRPSSGELWGIHLMPPRILVECLLPNGM
#> equus_asinus_ENSEASP00005039543 1 MPPRPSSGELWGIHLMPPRILVECLLPNGM
#> capra_hircus_ENSCHIP00000015546 1 MPPRPSSGELWGIHLMPPRILVECLLPNGM
#> panthera_leo_ENSPLOP00000019775 1 MPPRPSSGELWGIHLMPPRILVECLLPNGM
#> ailuropoda_melanoleuca_ENSAMEP00000011350 1 MPPRPSSGELWGIHLMPPRILVECLLPNGM
#> bos_indicus_hybrid_ENSBIXP00005013124 1 MPPRPSSGELWGIHLMPPRILVECLLPNGM
3. Read the file with substitutions of interest
The substitutions file should have one substitution per line, and the substitutions should be formatted in the following way:
H1047Y where H (Histidine) is the original aminoacid, 1047 is the protein position for this aminoacid, and Y (Tyrosine) is the new aminoacid originated by the missence mutation.
To read in the 17 PIK3CA missense variants, we use the function
read_substitutions()
from the agvgd
package:
# Read the variants file
path <- system.file("extdata/correia_magno2022.txt", package = 'agvgd', mustWork = TRUE)
missense_variants <- agvgd::read_substitutions(path)
missense_variants
#> # A tibble: 17 × 4
#> res poi ref sub
#> <int> <int> <chr> <chr>
#> 1 88 NA R Q
#> 2 111 NA K E
#> 3 345 NA N K
#> 4 385 NA E K
#> 5 420 NA C R
#> 6 453 NA E K
#> 7 542 NA E K
#> 8 545 NA E A
#> 9 545 NA E K
#> 10 546 NA Q R
#> 11 726 NA E K
#> 12 909 NA F L
#> 13 1021 NA Y H
#> 14 1043 NA M V
#> 15 1047 NA H L
#> 16 1047 NA H R
#> 17 1047 NA H Y
The list of missense variants comes with the indication of the
residue number in the protein sequence, i.e. column res
in
missense_variants
. As we’ll see next, the function
agvgd()
uses the alignment positions (column
poi
in missense_variants
) instead to refer to
those positions in the alignment.
The difference between res
and poi
(position of interest) is that res
counts the residues in
the protein primary sequence reference, while poi
refers to
the positions in the alignment, accounting for gaps.
So, before we proceed, we will update the
missense_variants
tibble and replace the NA
values with the corresponding poi
values. For that we use
the function res_to_poi()
from agvgd:
missense_variants$poi <- agvgd::res_to_poi(pik3ca_alignment, missense_variants$res)
head(missense_variants)
#> # A tibble: 6 × 4
#> res poi ref sub
#> <int> <int> <chr> <chr>
#> 1 88 88 R Q
#> 2 111 111 K E
#> 3 345 345 N K
#> 4 385 385 E K
#> 5 420 420 C R
#> 6 453 453 E K
4. Calculate the scores with agvgd
Run Align-GVGD algorithm with the function agvgd()
:
scores <- agvgd::agvgd(alignment = pik3ca_alignment,
poi = missense_variants$poi,
sub = missense_variants$sub)
print(scores, n = Inf)
#> # A tibble: 17 × 7
#> res poi ref sub gv gd prediction
#> <int> <int> <chr> <chr> <dbl> <dbl> <chr>
#> 1 88 88 R Q 354. 0 C0
#> 2 111 111 K E 354. 0 C0
#> 3 345 345 N K 354. 0 C0
#> 4 385 385 E K 354. 0 C0
#> 5 420 420 C R 354. 0 C0
#> 6 453 453 E K 354. 0 C0
#> 7 542 542 E K 354. 0 C0
#> 8 545 545 E A 354. 0 C0
#> 9 545 545 E K 354. 0 C0
#> 10 546 546 Q R 354. 0 C0
#> 11 726 726 E K 354. 0 C0
#> 12 909 909 F L 354. 0 C0
#> 13 1021 1021 Y H 354. 0 C0
#> 14 1043 1043 M V 354. 0 C0
#> 15 1047 1047 H L 354. 0 C0
#> 16 1047 1047 H R 354. 0 C0
#> 17 1047 1047 H Y 354. 0 C0
Critical analysis
These results show that using all the sequences present in the
ENSEMBL orthologous database will not provide insightful information for
agvgd, since this method is very sensitive to the alignment. Here, it
has classified all the variants with a score of C0, meaning that none of
the changes of interest are clinically relevant.
This is because using sequences that are so distantly related in the
tree of life it is no longer reasonable to assume that those distant
orthologous proteins share the same functions as the human protein of
interest. Accordingly, the alignment must be curated to include a
reasonable amount of sequence variation, but keeping only a subset of
the proteins that are still evolutionarily close enough, so that they
still share the same function, and therefore the same sequence
constraints.
5. Curate the alignment and rerun the score calculation
A first round of curation can be done automatically by removing sequences so distant that they are over 90% different from the protein of interest (bidirectionally, i.e., between human and the orthologous, and between the orthologous and the human).
# Remove distant orthologs and redundant seqs
pik3ca_profile |>
dplyr::filter(human_ortho_perc_id >= 90 & ortho_human_perc_id >= 90) -> profile2
# Format the profile to an alignment
alignment2 <- agvgd::profile_to_alignment(profile2)
# Calculate the scores for the filtered alignment
scores2 <- agvgd::agvgd(alignment = alignment2,
poi = missense_variants$poi,
sub = missense_variants$sub)
print(scores2, n = Inf)
#> # A tibble: 17 × 7
#> res poi ref sub gv gd prediction
#> <int> <int> <chr> <chr> <dbl> <dbl> <chr>
#> 1 88 88 R Q 0 42.8 C35
#> 2 111 111 K E 0 56.9 C55
#> 3 345 345 N K 0 93.9 C65
#> 4 385 385 E K 223. 39.9 C0
#> 5 420 420 C R 0 180. C65
#> 6 453 453 E K 223. 39.9 C0
#> 7 542 542 E K 56.9 0 C0
#> 8 545 545 E A 0 107. C65
#> 9 545 545 E K 0 56.9 C55
#> 10 546 546 Q R 0 42.8 C35
#> 11 726 726 E K 239. 0 C0
#> 12 909 909 F L 0 21.8 C15
#> 13 1021 1021 Y H 354. 0 C0
#> 14 1043 1043 M V 232. 3.24 C0
#> 15 1047 1047 H L 260. 4.86 C0
#> 16 1047 1047 H R 260. 1.62 C0
#> 17 1047 1047 H Y 260. 4.05 C0
Critical analysis after alignment filtering
After removal of distantly related proteins, the results show enough variability to allow the classification of variants that are know to be clinically relevant (for example, E545K with C65 score), showing the relevance of this step for the accuracy of the algorithm.
Additional rounds of curation can be done manually, by visually
inspecting the alignment (for example using Jalview), and re-running the
agvgd()
score calculation function using the manually
curated alignments as input.