PGS Catalog Entity Subsetting
Source:vignettes/pgs-cat-ent-subsetting.Rmd
Introduction
PGS Catalog entities are represented in quincunx as S4 objects. In this article, we explain how to subset these objects using the [ operator. In a nutshell, subsetting works either by position or by the identifier of each entity. The main entities/objects are:
- Scores
- Publications
- Traits
- Sample Sets
- Performance Metrics
The general approach to subsetting is the same for all of these S4 objects. Hence, to avoid repetition, we provide a comprehensive set of examples only for the scores object. For the other objects we only illustrate subsetting by identifier, to emphasise that each type of object has its own kind of identifier.
If you do not know how to subset the tables included in the S4 objects, please take a look at Subsetting tibbles.
Start by loading quincunx:
Subsetting scores
Subsetting scores by position
For illustrative purposes, let us get some arbitrary polygenic scores objects, say, the first 10 PGSs in the catalog:
pgs_ids <- sprintf('PGS%06d', 1:10)
my_scores <- get_scores(pgs_ids)
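As a quick check of what the sprintf() call produces: the format 'PGS%06d' zero-pads each integer to six digits, which matches the PGS identifier pattern. The snippet below uses only base R, and the variable name example_ids is illustrative.

```r
# '%06d' zero-pads an integer to a width of six digits.
example_ids <- sprintf('PGS%06d', c(1, 5, 10))
example_ids
#> [1] "PGS000001" "PGS000005" "PGS000010"
```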
The object my_scores is an S4 object of class scores; see class?scores for details. In quincunx, each S4 object contains at least one table (the first) in which each observation refers to an entity. To access tables in S4 objects, use the @ operator. The first table in my_scores is scores:
my_scores@scores
#> # A tibble: 10 × 12
#> pgs_id pgs_name scori…¹ match…² repor…³ trait…⁴ pgs_m…⁵ pgs_m…⁶ n_var…⁷
#> <chr> <chr> <chr> <lgl> <chr> <chr> <chr> <chr> <int>
#> 1 PGS000001 PRS77_BC https:… TRUE Breast… NA SNPs p… P<5x10… 77
#> 2 PGS000002 PRS77_ERpos https:… TRUE ER-pos… NA SNPs p… P<5x10… 77
#> 3 PGS000003 PRS77_ERneg https:… TRUE ER-neg… NA SNPs p… P<5x10… 77
#> 4 PGS000004 PRS313_BC https:… TRUE Breast… NA Hard-T… p < 10… 313
#> 5 PGS000005 PRS313_ERp… https:… TRUE ER-pos… NA Hard-T… p < 10… 313
#> 6 PGS000006 PRS313_ERn… https:… TRUE ER-neg… NA Hard-T… p < 10… 313
#> 7 PGS000007 PRS3820_BC https:… TRUE Breast… NA LASSO … p < 0.… 3820
#> 8 PGS000008 PRS3820_ER… https:… TRUE ER-pos… NA LASSO … p < 0.… 3820
#> 9 PGS000009 PRS3820_ER… https:… TRUE ER-neg… NA LASSO … p < 0.… 3820
#> 10 PGS000010 GRS27 https:… TRUE Corona… NA Genome… NA 27
#> # … with 3 more variables: n_variants_interactions <int>, assembly <chr>,
#> # license <chr>, and abbreviated variable names ¹scoring_file,
#> # ²matches_publication, ³reported_trait, ⁴trait_additional_description,
#> # ⁵pgs_method_name, ⁶pgs_method_params, ⁷n_variants
#> # ℹ Use `colnames()` to see all variable names
nrow(my_scores@scores)
#> [1] 10
This table has as many rows as there are polygenic scores, so counting its rows is one way of knowing how many scores the object holds. Alternatively, you can use the function n() on the object:
quincunx::n(my_scores)
#> [1] 10
It is important to know the number of scores if you plan to subset the my_scores object by position. In this case there are 10 scores. If you want the first, fifth, and tenth scores, you could do:
my_scores[c(1, 5, 10)]@scores[1:2]
#> # A tibble: 3 × 2
#> pgs_id pgs_name
#> <chr> <chr>
#> 1 PGS000001 PRS77_BC
#> 2 PGS000005 PRS313_ERpos
#> 3 PGS000010 GRS27
This returns a new object containing only the data for the scores "PGS000001", "PGS000005" and "PGS000010".
Notice that this operation automatically traverses all tables in the my_scores object and subsets each of them accordingly, keeping only the rows that correspond to the first, fifth and tenth scores. For example, compare the table samples from the my_scores object before and after the subsetting.
Before subsetting:
my_scores@samples[1:4]
#> # A tibble: 16 × 4
#> pgs_id sample_id stage sample_size
#> <chr> <int> <chr> <int>
#> 1 PGS000001 1 gwas 22627
#> 2 PGS000002 1 gwas 22627
#> 3 PGS000003 1 gwas 22627
#> 4 PGS000004 1 gwas 158648
#> 5 PGS000004 2 dev 10444
#> 6 PGS000005 1 gwas 87368
#> 7 PGS000005 2 dev 5159
#> 8 PGS000006 1 gwas 87368
#> 9 PGS000006 2 dev 5159
#> 10 PGS000007 1 gwas 158648
#> 11 PGS000007 2 dev 10444
#> 12 PGS000008 1 gwas 87368
#> 13 PGS000008 2 dev 5159
#> 14 PGS000009 1 gwas 87368
#> 15 PGS000009 2 dev 5159
#> 16 PGS000010 1 gwas 86995
After subsetting with c(1, 5, 10):
my_scores[c(1, 5, 10)]@samples[1:4]
#> # A tibble: 4 × 4
#> pgs_id sample_id stage sample_size
#> <chr> <int> <chr> <int>
#> 1 PGS000001 1 gwas 22627
#> 2 PGS000005 1 gwas 87368
#> 3 PGS000005 2 dev 5159
#> 4 PGS000010 1 gwas 86995
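To make the idea of a subset propagating across tables concrete, here is a minimal sketch of an S4 [ method on a toy class. This is not quincunx's implementation: the class toy_scores, its two slots and the id column are hypothetical, and the method assumes valid, non-repeated positions.

```r
library(methods)

# Hypothetical toy class, NOT quincunx's implementation:
# two tables linked by the column `id`.
setClass('toy_scores',
         representation(scores = 'data.frame', samples = 'data.frame'))

# Positional `[`: subset the main table, then keep the matching rows
# in the other table. Assumes valid, non-repeated positions in `i`.
setMethod('[',
  signature(x = 'toy_scores', i = 'numeric', j = 'missing', drop = 'missing'),
  function(x, i, j, ..., drop = FALSE) {
    kept_ids <- x@scores$id[i]
    x@scores <- x@scores[i, , drop = FALSE]
    x@samples <- x@samples[x@samples$id %in% kept_ids, , drop = FALSE]
    x
  })

toy <- new('toy_scores',
  scores = data.frame(id = c('A', 'B', 'C'), stringsAsFactors = FALSE),
  samples = data.frame(id = c('A', 'B', 'B', 'C'),
                       sample_id = c(1L, 1L, 2L, 1L),
                       stringsAsFactors = FALSE))

# Keep the first and third entities; `samples` follows along.
toy_subset <- toy[c(1, 3)]
toy_subset@samples$id
#> [1] "A" "C"
```

The design choice worth noting is that the method returns an object of the same class, so the result can be subset again or passed to any function expecting that class.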
Subsetting scores by identifier
To subset by identifier, simply use a character vector with the identifiers of interest. Say you now want two identifiers: "PGS000002" and "PGS000008". Then all you need to do is:
my_scores[c('PGS000002', 'PGS000008')]@scores[1:2]
#> # A tibble: 2 × 2
#> pgs_id pgs_name
#> <chr> <chr>
#> 1 PGS000002 PRS77_ERpos
#> 2 PGS000008 PRS3820_ERpos
Subsetting using repeated positions or identifiers
Please note that if you repeat the same position or identifier, you will get that score repeated:
my_scores[c('PGS000003', 'PGS000003')]@scores[1:2]
#> # A tibble: 2 × 2
#> pgs_id pgs_name
#> <chr> <chr>
#> 1 PGS000003 PRS77_ERneg
#> 2 PGS000003 PRS77_ERneg
quincunx::n(my_scores[c('PGS000003', 'PGS000003')])
#> [1] 2
Or using the third position twice:
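A sketch of the positional equivalent, assuming my_scores holds the same ten scores as above. By analogy with the identifier case, it should return "PGS000003" twice. Running it requires internet access to query the PGS Catalog.

```r
library(quincunx)

# Same ten scores as above (queries the PGS Catalog).
my_scores <- get_scores(sprintf('PGS%06d', 1:10))

# Request the third position twice.
my_scores[c(3, 3)]@scores[1:2]
quincunx::n(my_scores[c(3, 3)])
```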
Subsetting using negative positions
Just like with base R objects, you can use negative indices to drop elements; quincunx’s S4 objects support this too. For example, to drop the first, fifth and tenth scores:
# Notice the minus sign before c(1, 5, 10)
my_scores[-c(1, 5, 10)]@scores[1:2]
#> # A tibble: 7 × 2
#> pgs_id pgs_name
#> <chr> <chr>
#> 1 PGS000002 PRS77_ERpos
#> 2 PGS000003 PRS77_ERneg
#> 3 PGS000004 PRS313_BC
#> 4 PGS000006 PRS313_ERneg
#> 5 PGS000007 PRS3820_BC
#> 6 PGS000008 PRS3820_ERpos
#> 7 PGS000009 PRS3820_ERneg
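Negative indexing applies to positions only; in R, negating a character vector is an error. To drop scores by identifier, one option is to compute the identifiers to keep with base::setdiff() and subset with those. A minimal sketch, using a plain character vector as a stand-in for my_scores@scores$pgs_id (the variable names are illustrative):

```r
# Stand-in for my_scores@scores$pgs_id.
all_ids <- sprintf('PGS%06d', 1:10)
ids_to_drop <- c('PGS000001', 'PGS000005', 'PGS000010')

# base::setdiff() keeps the elements of all_ids that are not in
# ids_to_drop, in their original order.
ids_to_keep <- base::setdiff(all_ids, ids_to_drop)
ids_to_keep
#> [1] "PGS000002" "PGS000003" "PGS000004" "PGS000006" "PGS000007" "PGS000008"
#> [7] "PGS000009"

# With the real object this would be: my_scores[ids_to_keep]
```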
Subsetting with non-existing positions or identifiers
If you request a position or identifier that does not exist in the object, the result is an empty object. For example, there is no 11th position in my_scores, so the returned object is empty:
my_scores[11]@scores[1:2]
#> # A tibble: 0 × 2
#> # … with 2 variables: pgs_id <chr>, pgs_name <chr>
#> # ℹ Use `colnames()` to see all variable names
quincunx::n(my_scores[11])
#> [1] 0
Please note that the returned object is still a valid scores object and that it contains all the expected tables of such an object; the tables simply have no rows. The same behaviour is to be expected if you try to subset with non-existing identifiers:
my_scores['PGS000011']@scores[1:2]
#> # A tibble: 0 × 2
#> # … with 2 variables: pgs_id <chr>, pgs_name <chr>
#> # ℹ Use `colnames()` to see all variable names
quincunx::n(my_scores['PGS000011'])
#> [1] 0
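Because subsetting with a non-existing identifier returns an empty object rather than raising an error, it can be worth checking a request beforehand. A minimal sketch with %in%, again using a plain character vector as a stand-in for my_scores@scores$pgs_id (the variable names are illustrative):

```r
# Stand-in for my_scores@scores$pgs_id.
available_ids <- sprintf('PGS%06d', 1:10)
requested_ids <- c('PGS000002', 'PGS000011')

# TRUE where the requested identifier is present in the object.
is_present <- requested_ids %in% available_ids
is_present
#> [1]  TRUE FALSE

# Identifiers that would match nothing.
requested_ids[!is_present]
#> [1] "PGS000011"
```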
Subsetting Publications
Subsetting publications objects, or any other S4 object in quincunx, works exactly the same way as described for scores. The only difference is that identifiers have to be changed accordingly. So in the next sections we only show how to subset using the respective identifiers.
# Get all publications where Abraham G is an author
my_publ <- get_publications(author = 'Abraham G')
# Note that the column `author_fullname` corresponds to the first author.
my_publ@publications[c('pgp_id', 'pubmed_id', 'publication_date', 'author_fullname')]
#> # A tibble: 8 × 4
#> pgp_id pubmed_id publication_date author_fullname
#> <chr> <chr> <date> <chr>
#> 1 PGP000005 27655226 2016-09-21 Abraham G
#> 2 PGP000007 30309464 2018-10-01 Inouye M
#> 3 PGP000027 31862893 2019-12-20 Abraham G
#> 4 PGP000028 24550740 2014-02-13 Abraham G
#> 5 PGP000029 26244058 2015-07-16 Abraham G
#> 6 PGP000052 32887683 2020-09-04 Cánovas R
#> 7 PGP000137 34750571 2021-11-08 Ritchie SC
#> 8 PGP000209 34039031 2021-05-27 Neumann JT
By visual inspection we can see that Abraham G is the first author in PGP000005, PGP000027, PGP000028, and PGP000029.
To keep only those publications, we subset the publications object my_publ by those PGP identifiers:
my_publ[c('PGP000005', 'PGP000027', 'PGP000028', 'PGP000029')]@publications[c('pgp_id', 'pubmed_id', 'publication_date', 'author_fullname')]
#> # A tibble: 4 × 4
#> pgp_id pubmed_id publication_date author_fullname
#> <chr> <chr> <date> <chr>
#> 1 PGP000005 27655226 2016-09-21 Abraham G
#> 2 PGP000027 31862893 2019-12-20 Abraham G
#> 3 PGP000028 24550740 2014-02-13 Abraham G
#> 4 PGP000029 26244058 2015-07-16 Abraham G
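The identifiers picked by visual inspection can also be derived programmatically by filtering on the author_fullname column. A minimal sketch, using a small data frame copied from the output above as a stand-in for my_publ@publications (the names publications and first_author_ids are illustrative):

```r
# Stand-in for my_publ@publications, copied from the output above.
publications <- data.frame(
  pgp_id = c('PGP000005', 'PGP000007', 'PGP000027', 'PGP000028',
             'PGP000029', 'PGP000052', 'PGP000137', 'PGP000209'),
  author_fullname = c('Abraham G', 'Inouye M', 'Abraham G', 'Abraham G',
                      'Abraham G', 'Cánovas R', 'Ritchie SC', 'Neumann JT'),
  stringsAsFactors = FALSE
)

# Identifiers of the publications whose first author is Abraham G.
first_author_ids <-
  publications$pgp_id[publications$author_fullname == 'Abraham G']
first_author_ids
#> [1] "PGP000005" "PGP000027" "PGP000028" "PGP000029"

# With the real object this would be: my_publ[first_author_ids]
```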
Subsetting Traits
To illustrate subsetting of a traits object with EFO identifiers, let us say you’d like to create a traits object with the traits whose name contains the keyword "lymph". To do this, we start by downloading all traits into a traits object. Then we look for the term "lymph" in the trait column and find which EFO identifiers are matched. Finally, we use those identifiers to create a traits object containing only the matched traits.
Get all traits:
all_traits <- get_traits(interactive = FALSE)
Find which traits have the term "lymph" in their name (trait column of the traits table); we use grep for this:
lymph_traits_positions <- grep('lymph', all_traits@traits$trait)
all_traits[lymph_traits_positions]@traits[c('efo_id', 'trait')]
#> # A tibble: 8 × 2
#> efo_id trait
#> <chr> <chr>
#> 1 EFO_0000095 chronic lymphocytic leukemia
#> 2 EFO_0000403 diffuse large B-cell lymphoma
#> 3 MONDO_0018906 follicular lymphoma
#> 4 EFO_0000183 Hodgkins lymphoma
#> 5 EFO_0004587 lymphocyte count
#> 6 EFO_0007993 lymphocyte percentage of leukocytes
#> 7 EFO_0004289 lymphoid leukemia
#> 8 EFO_0005952 non-Hodgkins lymphoma
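Keep in mind that grep() is case-sensitive by default, so a trait name starting with a capital "L" would be missed by the pattern above. A minimal sketch of the difference, using three trait names from the output above plus one made-up capitalised name ('Lymphoma, hypothetical') that is not a real catalog entry:

```r
# Three trait names from the output above, plus one made-up capitalised
# name that is NOT a real catalog entry.
trait_names <- c('chronic lymphocytic leukemia', 'Hodgkins lymphoma',
                 'lymphocyte count', 'Lymphoma, hypothetical')

# Default: case-sensitive matching.
grep('lymph', trait_names)
#> [1] 1 2 3

# Case-insensitive matching also catches the capitalised name.
grep('lymph', trait_names, ignore.case = TRUE)
#> [1] 1 2 3 4
```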
Select only those EFO identifiers whose trait name contains "lymph":
my_efo_ids <- all_traits[lymph_traits_positions]@traits$efo_id
my_efo_ids
#> [1] "EFO_0000095" "EFO_0000403" "MONDO_0018906" "EFO_0000183"
#> [5] "EFO_0004587" "EFO_0007993" "EFO_0004289" "EFO_0005952"
Finally, create a new traits object (traits_only_lymph) with only those traits matching "lymph" by subsetting by identifier:
traits_only_lymph <- all_traits[my_efo_ids]
Confirm that indeed only those traits with "lymph" in the name are present:
traits_only_lymph@traits[c(1, 4)]
#> # A tibble: 8 × 2
#> efo_id trait
#> <chr> <chr>
#> 1 EFO_0000095 chronic lymphocytic leukemia
#> 2 EFO_0000403 diffuse large B-cell lymphoma
#> 3 MONDO_0018906 follicular lymphoma
#> 4 EFO_0000183 Hodgkins lymphoma
#> 5 EFO_0004587 lymphocyte count
#> 6 EFO_0007993 lymphocyte percentage of leukocytes
#> 7 EFO_0004289 lymphoid leukemia
#> 8 EFO_0005952 non-Hodgkins lymphoma
You might have noticed that we could have used lymph_traits_positions to subset all_traits by position instead, to the same effect. That would have been more straightforward, but the point here is to illustrate subsetting with EFO identifiers. Moreover, as an exercise, you might want to compare the results of this example with:
# Get traits containing the term 'lymph' in the name or its description
get_traits(trait_term = 'lymph', exact_term = FALSE)
# Get traits whose name is exactly 'lymph'
get_traits(trait_term = 'lymph', exact_term = TRUE)
Subsetting Sample Sets
To subset PGS Sample Sets, use identifiers of the form "PSS000000". Here’s a simple example where we download two Sample Sets ("PSS000008" and "PSS000042") and then keep only "PSS000008":
my_sample_sets <- get_sample_sets(pss_id = c('PSS000008', 'PSS000042'))
# Table `samples` contains the samples that comprise this Sample Set
my_sample_sets['PSS000008']@samples[1:6]
#> # A tibble: 3 × 6
#> pss_id sample_id stage sample_size sample_cases sample_controls
#> <chr> <int> <chr> <int> <int> <int>
#> 1 PSS000008 1 eval 6978 149 6829
#> 2 PSS000008 2 eval 27271 NA NA
#> 3 PSS000008 3 eval 8749 108 8641
Subsetting Performance Metrics
Similarly, you subset Performance Metrics objects with identifiers of the form "PPM000000". Example:
my_perf_metrics <- get_performance_metrics(ppm_id = c('PPM000001', 'PPM000002'))
# Table `samples` contains the samples associated with this Performance Metric
my_perf_metrics['PPM000002']@samples[1:6]
#> # A tibble: 1 × 6
#> ppm_id pss_id sample_id stage sample_size sample_cases
#> <chr> <chr> <int> <chr> <int> <int>
#> 1 PPM000002 PSS000003 1 eval 53923 21365