Cohorts, Samples and Sample Sets
Cohorts
A cohort is a group of individuals with a shared characteristic. Cohorts are identified in quincunx by the cohort_symbol variable. See vignette('getting-cohorts') for how to find the polygenic scores associated with a cohort.
Use get_cohorts() to retrieve the PGS identifiers associated with the cohort "PROMIS":
library(quincunx)
get_cohorts('PROMIS')
#> An object of class "cohorts"
#> Slot "cohorts":
#> # A tibble: 1 × 2
#> cohort_symbol cohort_name
#> <chr> <chr>
#> 1 PROMIS The Pakistan Risk Of Myocardial Infarction Study
#>
#> Slot "pgs_ids":
#> # A tibble: 14 × 3
#> cohort_symbol pgs_id stage
#> <chr> <chr> <chr>
#> 1 PROMIS PGS000011 gwas/dev
#> 2 PROMIS PGS000012 gwas/dev
#> 3 PROMIS PGS000013 gwas/dev
#> 4 PROMIS PGS000018 gwas/dev
#> 5 PROMIS PGS000019 gwas/dev
#> 6 PROMIS PGS000020 gwas/dev
#> 7 PROMIS PGS000058 gwas/dev
#> 8 PROMIS PGS000349 gwas/dev
#> 9 PROMIS PGS000746 gwas/dev
#> 10 PROMIS PGS000747 gwas/dev
#> 11 PROMIS PGS000748 gwas/dev
#> 12 PROMIS PGS000749 gwas/dev
#> 13 PROMIS PGS000818 gwas/dev
#> 14 PROMIS PGS000899 gwas/dev
Samples
A sample is a group of participants associated with zero, one or more catalogued cohorts. The selection from a cohort can be either a subset or the whole cohort. Samples are not identified in the PGS Catalog with a globally unique identifier, but quincunx assigns a surrogate identifier (sample_id) to allow relations between tables.
Sample composition is provided in the cohorts slot of the scores object returned by the get_scores() function.
library(dplyr, warn.conflicts = FALSE)
# PGS000011 is one of the polygenic scores that is based upon participants from
# cohort PROMIS
pgs_11 <- get_scores('PGS000011')
# Cohort PROMIS is included in sample no. 2, along with LOLIPOP
filter(pgs_11@cohorts, sample_id == 2L)
#> # A tibble: 2 × 4
#> pgs_id sample_id cohort_symbol cohort_name
#> <chr> <int> <chr> <chr>
#> 1 PGS000011 2 PROMIS The Pakistan Risk Of Myocardial Infarction …
#> 2 PGS000011 2 LOLIPOP London Life Sciences Population Study
For more details about the samples, look into the samples slot of the scores object:
filter(pgs_11@samples, sample_id == 2L)
#> # A tibble: 1 × 15
#> pgs_id sampl…¹ stage sampl…² sampl…³ sampl…⁴ sampl…⁵ pheno…⁶ ances…⁷ ances…⁸
#> <chr> <int> <chr> <int> <int> <int> <dbl> <chr> <chr> <chr>
#> 1 PGS0000… 2 gwas 8653 4394 4259 NA NA South … NA
#> # … with 5 more variables: country <chr>,
#> # ancestry_additional_description <chr>, study_id <chr>, pubmed_id <chr>,
#> # cohorts_additional_description <chr>, and abbreviated variable names
#> # ¹sample_id, ²sample_size, ³sample_cases, ⁴sample_controls,
#> # ⁵sample_percent_male, ⁶phenotype_description, ⁷ancestry_category, ⁸ancestry
#> # ℹ Use `colnames()` to see all variable names
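Because the cohorts and samples tables share the pgs_id and sample_id columns, they can be joined to see each cohort next to the details of its sample. The following is only a minimal sketch: cohorts_tbl and samples_tbl are hypothetical stand-ins built by hand from a few columns of the output above, so that the example runs without querying the PGS Catalog. With real data, one would pass pgs_11@cohorts and pgs_11@samples instead.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical stand-ins for pgs_11@cohorts and pgs_11@samples, restricted to
# sample no. 2; the values are copied from the output shown above.
cohorts_tbl <- tibble::tibble(
  pgs_id = 'PGS000011',
  sample_id = 2L,
  cohort_symbol = c('PROMIS', 'LOLIPOP')
)
samples_tbl <- tibble::tibble(
  pgs_id = 'PGS000011',
  sample_id = 2L,
  stage = 'gwas',
  sample_size = 8653L
)

# One row per cohort, annotated with the details of the sample it belongs to
cohorts_with_samples <- left_join(
  cohorts_tbl,
  samples_tbl,
  by = c('pgs_id', 'sample_id')
)
cohorts_with_samples
```

Joining on both pgs_id and sample_id matters because sample_id is a surrogate identifier assigned by quincunx, not a global one, so the same number may be reused for a different score.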
Sample sets
A sample set is a group of samples used in a polygenic score evaluation. Each sample set is identified in the PGS Catalog by a unique sample set identifier (pss_id).
To find the sample sets that included a specific cohort, we start by getting the PGS identifiers associated with a cohort, e.g. MHI:
# Note that by the definition of sample set, samples included in sample sets
# are only used at PGS evaluation stages.
filter(get_cohorts('MHI')@pgs_ids, stage == 'eval')
#> # A tibble: 2 × 3
#> cohort_symbol pgs_id stage
#> <chr> <chr> <chr>
#> 1 MHI PGS000013 eval
#> 2 MHI PGS000018 eval
PGS000013 is one of the polygenic scores whose evaluation used participants from the cohort MHI. We now retrieve the sample sets used in the evaluation of PGS000013:
# Sample sets used in the evaluation of the PGS000013
pgs_13_sset <- get_sample_sets(pgs_id = 'PGS000013')
glimpse(pgs_13_sset@sample_sets)
#> Rows: 45
#> Columns: 1
#> $ pss_id <chr> "PSS000015", "PSS000019", "PSS000020", "PSS000021", "PSS000022"…
One of the sample sets used to evaluate PGS000013 is PSS000020. We can retrieve a sample_sets object that contains its composition, i.e., the samples and cohorts included, along with other details:
get_sample_sets('PSS000020')
#> An object of class "sample_sets"
#> Slot "sample_sets":
#> # A tibble: 1 × 1
#> pss_id
#> <chr>
#> 1 PSS000020
#>
#> Slot "samples":
#> # A tibble: 2 × 15
#> pss_id sampl…¹ stage sampl…² sampl…³ sampl…⁴ sampl…⁵ pheno…⁶ ances…⁷ ances…⁸
#> <chr> <int> <chr> <int> <int> <int> <dbl> <chr> <chr> <chr>
#> 1 PSS0000… 1 eval 862 446 416 NA Recurr… Europe… French…
#> 2 PSS0000… 2 eval 2333 937 1396 NA Recurr… Europe… French…
#> # … with 5 more variables: country <chr>,
#> # ancestry_additional_description <chr>, study_id <chr>, pubmed_id <chr>,
#> # cohorts_additional_description <chr>, and abbreviated variable names
#> # ¹sample_id, ²sample_size, ³sample_cases, ⁴sample_controls,
#> # ⁵sample_percent_male, ⁶phenotype_description, ⁷ancestry_category, ⁸ancestry
#> # ℹ Use `colnames()` to see all variable names
#>
#> Slot "demographics":
#> # A tibble: 0 × 11
#> # … with 11 variables: pss_id <chr>, sample_id <int>, variable <chr>,
#> # estimate_type <chr>, estimate <dbl>, unit <chr>, variability_type <chr>,
#> # variability <dbl>, interval_type <chr>, interval_lower <dbl>,
#> # interval_upper <dbl>
#> # ℹ Use `colnames()` to see all variable names
#>
#> Slot "cohorts":
#> # A tibble: 2 × 4
#> pss_id sample_id cohort_symbol cohort_name
#> <chr> <int> <chr> <chr>
#> 1 PSS000020 1 MHI Montreal Heart Institute Biobank
#> 2 PSS000020 2 MHI Montreal Heart Institute Biobank
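Since a sample set groups several samples, it can be useful to summarise them per pss_id. The sketch below is only illustrative: samples_tbl is a hypothetical stand-in built by hand from the samples slot printed above, so that the example runs offline. With real data, one would use the samples slot of the object returned by get_sample_sets('PSS000020'). Adding sizes across samples assumes the samples do not share participants, which the output above does not confirm.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical stand-in for the samples slot of PSS000020; the values are
# copied from the output shown above.
samples_tbl <- tibble::tibble(
  pss_id = 'PSS000020',
  sample_id = 1:2,
  sample_size = c(862L, 2333L),
  sample_cases = c(446L, 937L),
  sample_controls = c(416L, 1396L)
)

# Number of samples and summed sizes per sample set
# (assumes the samples are disjoint)
pss_totals <- summarise(
  group_by(samples_tbl, pss_id),
  n_samples = n(),
  total_size = sum(sample_size),
  total_cases = sum(sample_cases),
  total_controls = sum(sample_controls)
)
pss_totals
```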