Cohorts, Samples and Sample Sets • quincunx

Cohorts

A cohort is a group of individuals with a shared characteristic. Cohorts are identified in quincunx by the cohort_symbol variable. See vignette('getting-cohorts') for how to find the polygenic scores associated with a cohort.

Use get_cohorts() to retrieve the PGS identifiers associated with the cohort "PROMIS":

library(quincunx)

get_cohorts('PROMIS')
#> An object of class "cohorts"
#> Slot "cohorts":
#> # A tibble: 1 × 2
#>   cohort_symbol cohort_name                                     
#>   <chr>         <chr>                                           
#> 1 PROMIS        The Pakistan Risk Of Myocardial Infarction Study
#> 
#> Slot "pgs_ids":
#> # A tibble: 14 × 3
#>    cohort_symbol pgs_id    stage   
#>    <chr>         <chr>     <chr>   
#>  1 PROMIS        PGS000011 gwas/dev
#>  2 PROMIS        PGS000012 gwas/dev
#>  3 PROMIS        PGS000013 gwas/dev
#>  4 PROMIS        PGS000018 gwas/dev
#>  5 PROMIS        PGS000019 gwas/dev
#>  6 PROMIS        PGS000020 gwas/dev
#>  7 PROMIS        PGS000058 gwas/dev
#>  8 PROMIS        PGS000349 gwas/dev
#>  9 PROMIS        PGS000746 gwas/dev
#> 10 PROMIS        PGS000747 gwas/dev
#> 11 PROMIS        PGS000748 gwas/dev
#> 12 PROMIS        PGS000749 gwas/dev
#> 13 PROMIS        PGS000818 gwas/dev
#> 14 PROMIS        PGS000899 gwas/dev
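
The pgs_ids slot is a regular tibble, so it can be summarised with dplyr like any other table. The following is a minimal sketch, not part of the original vignette: the variable names promis and promis_pgs_ids are illustrative, and the call queries the live PGS Catalog, so the counts may differ from the output shown above.

```r
library(quincunx)
library(dplyr, warn.conflicts = FALSE)

# Retrieve the cohort object for PROMIS (this queries the PGS Catalog REST API)
promis <- get_cohorts('PROMIS')

# Tally the associated polygenic scores by the stage in which the cohort was used
count(promis@pgs_ids, stage)

# Unique PGS identifiers as a plain character vector (illustrative name)
promis_pgs_ids <- unique(promis@pgs_ids$pgs_id)
```

A character vector of identifiers such as promis_pgs_ids is convenient when the same scores need to be looked up again later.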

Samples

A sample is a group of participants associated with zero, one or more catalogued cohorts. The selection from a cohort can be either a subset or the entire cohort. Samples are not identified in the PGS Catalog by a globally unique identifier, so quincunx assigns a surrogate identifier (sample_id) to allow relations between tables.

Sample composition is provided in the cohorts slot of the scores object returned by the get_scores() function.

library(dplyr, warn.conflicts = FALSE)

# PGS000011 is one of the polygenic scores that is based upon participants from
# cohort PROMIS
pgs_11 <- get_scores('PGS000011')

# Cohort PROMIS is included in sample no. 2, along with LOLIPOP
filter(pgs_11@cohorts, sample_id == 2L)
#> # A tibble: 2 × 4
#>   pgs_id    sample_id cohort_symbol cohort_name                                 
#>   <chr>         <int> <chr>         <chr>                                       
#> 1 PGS000011         2 PROMIS        The Pakistan Risk Of Myocardial Infarction …
#> 2 PGS000011         2 LOLIPOP       London Life Sciences Population Study

For a few more details about the samples, look into the samples slot of the scores object:

filter(pgs_11@samples, sample_id == 2L)
#> # A tibble: 1 × 15
#>   pgs_id   sampl…¹ stage sampl…² sampl…³ sampl…⁴ sampl…⁵ pheno…⁶ ances…⁷ ances…⁸
#>   <chr>      <int> <chr>   <int>   <int>   <int>   <dbl> <chr>   <chr>   <chr>  
#> 1 PGS0000…       2 gwas     8653    4394    4259      NA NA      South … NA     
#> # … with 5 more variables: country <chr>,
#> #   ancestry_additional_description <chr>, study_id <chr>, pubmed_id <chr>,
#> #   cohorts_additional_description <chr>, and abbreviated variable names
#> #   ¹sample_id, ²sample_size, ³sample_cases, ⁴sample_controls,
#> #   ⁵sample_percent_male, ⁶phenotype_description, ⁷ancestry_category, ⁸ancestry
#> # ℹ Use `colnames()` to see all variable names
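
Because sample_id is a surrogate key shared by the cohorts and samples tables, the two slots can be combined with a join. The sketch below is an illustration rather than part of the original vignette: the names sample_details and cohort_samples are made up for the example, and the selected columns are taken from the output shown above.

```r
library(quincunx)
library(dplyr, warn.conflicts = FALSE)

pgs_11 <- get_scores('PGS000011')

# Keep a handful of sample-level columns (illustrative selection)
sample_details <- pgs_11@samples %>%
  select(pgs_id, sample_id, stage, sample_size, ancestry_category)

# Attach the sample details to each cohort row; within a scores object,
# sample_id is only meaningful together with pgs_id, so join on both
cohort_samples <- left_join(
  pgs_11@cohorts,
  sample_details,
  by = c('pgs_id', 'sample_id')
)

cohort_samples
```

Joining on both pgs_id and sample_id matters when the scores object holds more than one polygenic score, since sample_id values are assigned per score.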

Sample sets

A sample set is a group of samples used in a polygenic score evaluation. Each sample set is identified in the PGS Catalog by a unique sample set identifier (pss_id).

To find the sample sets that included a specific cohort, we start by getting the PGS identifiers associated with a cohort, e.g. MHI:

# Note that by the definition of sample set, samples included in sample sets
# are only used at PGS evaluation stages.
filter(get_cohorts('MHI')@pgs_ids, stage == 'eval')
#> # A tibble: 2 × 3
#>   cohort_symbol pgs_id    stage
#>   <chr>         <chr>     <chr>
#> 1 MHI           PGS000013 eval 
#> 2 MHI           PGS000018 eval

PGS000013 is one of the polygenic scores whose evaluation used participants from the cohort MHI. We now retrieve the sample sets used in the evaluation of PGS000013:

# Sample sets used in the evaluation of the PGS000013
pgs_13_sset <- get_sample_sets(pgs_id = 'PGS000013')
glimpse(pgs_13_sset@sample_sets)
#> Rows: 45
#> Columns: 1
#> $ pss_id <chr> "PSS000015", "PSS000019", "PSS000020", "PSS000021", "PSS000022"…

One of the sample sets used to evaluate PGS000013 is PSS000020. We can retrieve a sample_sets object that contains its composition, i.e., the samples and cohorts included, along with other details:

get_sample_sets('PSS000020')
#> An object of class "sample_sets"
#> Slot "sample_sets":
#> # A tibble: 1 × 1
#>   pss_id   
#>   <chr>    
#> 1 PSS000020
#> 
#> Slot "samples":
#> # A tibble: 2 × 15
#>   pss_id   sampl…¹ stage sampl…² sampl…³ sampl…⁴ sampl…⁵ pheno…⁶ ances…⁷ ances…⁸
#>   <chr>      <int> <chr>   <int>   <int>   <int>   <dbl> <chr>   <chr>   <chr>  
#> 1 PSS0000…       1 eval      862     446     416      NA Recurr… Europe… French…
#> 2 PSS0000…       2 eval     2333     937    1396      NA Recurr… Europe… French…
#> # … with 5 more variables: country <chr>,
#> #   ancestry_additional_description <chr>, study_id <chr>, pubmed_id <chr>,
#> #   cohorts_additional_description <chr>, and abbreviated variable names
#> #   ¹sample_id, ²sample_size, ³sample_cases, ⁴sample_controls,
#> #   ⁵sample_percent_male, ⁶phenotype_description, ⁷ancestry_category, ⁸ancestry
#> # ℹ Use `colnames()` to see all variable names
#> 
#> Slot "demographics":
#> # A tibble: 0 × 11
#> # … with 11 variables: pss_id <chr>, sample_id <int>, variable <chr>,
#> #   estimate_type <chr>, estimate <dbl>, unit <chr>, variability_type <chr>,
#> #   variability <dbl>, interval_type <chr>, interval_lower <dbl>,
#> #   interval_upper <dbl>
#> # ℹ Use `colnames()` to see all variable names
#> 
#> Slot "cohorts":
#> # A tibble: 2 × 4
#>   pss_id    sample_id cohort_symbol cohort_name                     
#>   <chr>         <int> <chr>         <chr>                           
#> 1 PSS000020         1 MHI           Montreal Heart Institute Biobank
#> 2 PSS000020         2 MHI           Montreal Heart Institute Biobank
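
The steps in this section can be chained to go from a cohort symbol to the sample sets that contain it. The following is a rough sketch under stated assumptions, not the vignette's own code: it assumes that get_sample_sets() accepts a vector of PGS identifiers in its pgs_id argument, and the names target_cohort, eval_pgs_ids, ssets and mhi_pss_ids are illustrative. It issues many requests to the PGS Catalog, so it may take a while to run.

```r
library(quincunx)
library(dplyr, warn.conflicts = FALSE)

target_cohort <- 'MHI'

# 1. Polygenic scores whose evaluation involved the cohort
eval_pgs_ids <- get_cohorts(target_cohort)@pgs_ids %>%
  filter(stage == 'eval') %>%
  pull(pgs_id) %>%
  unique()

# 2. Sample sets used in the evaluation of those scores
#    (assumption: pgs_id may be a vector of identifiers)
ssets <- get_sample_sets(pgs_id = eval_pgs_ids)

# 3. Keep only the sample sets with at least one sample drawn from the cohort
mhi_pss_ids <- ssets@cohorts %>%
  filter(cohort_symbol == target_cohort) %>%
  distinct(pss_id) %>%
  pull(pss_id)

mhi_pss_ids
```

Step 3 is needed because a score can be evaluated on many sample sets, and only some of them include participants from the cohort of interest.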