Sample sets

A sample set is a group of samples used in a polygenic score evaluation. Each sample set is identified in the PGS Catalog by a unique sample set identifier (pss_id). See vignette('cohorts-samples-sample-sets') for more details on the relationship between cohorts, samples, and sample sets.

Getting sample sets

To get information on sample sets, you can search either by the associated polygenic score identifiers or by the sample set identifiers themselves (if you know them beforehand).

By the PGS identifier

# Sample sets used in the evaluation of PGS000013
(pgs_13_sset <- get_sample_sets(pgs_id = 'PGS000013'))
#> An object of class "sample_sets"
#> Slot "sample_sets":
#> # A tibble: 45 × 1
#>    pss_id   
#>    <chr>    
#>  1 PSS000015
#>  2 PSS000019
#>  3 PSS000020
#>  4 PSS000021
#>  5 PSS000022
#>  6 PSS000219
#>  7 PSS000227
#>  8 PSS000228
#>  9 PSS000229
#> 10 PSS000230
#> # … with 35 more rows
#> # ℹ Use `print(n = ...)` to see more rows
#> 
#> Slot "samples":
#> # A tibble: 49 × 15
#>    pss_id  sampl…¹ stage sampl…² sampl…³ sampl…⁴ sampl…⁵ pheno…⁶ ances…⁷ ances…⁸
#>    <chr>     <int> <chr>   <int>   <int>   <int>   <dbl> <chr>   <chr>   <chr>  
#>  1 PSS000…       1 eval   288978    8676  280302    NA   CAD as… Europe… NA     
#>  2 PSS000…       1 eval     5762     173    5589    41.3 Preval… Europe… French…
#>  3 PSS000…       1 eval      862     446     416    NA   Recurr… Europe… French…
#>  4 PSS000…       2 eval     2333     937    1396    NA   Recurr… Europe… French…
#>  5 PSS000…       1 eval     1964     974     976    72.7 Preval… Europe… French…
#>  6 PSS000…       1 eval     3309    2492     817    72.4 Preval… Europe… French…
#>  7 PSS000…       1 eval    11010     126   10884    17.1 Phenot… Europe… NA     
#>  8 PSS000…       1 eval      544      40     504    NA   NA      Asian … NA     
#>  9 PSS000…       1 eval     1298     336     962    NA   NA      Africa… NA     
#> 10 PSS000…       1 eval      919     168     751    NA   NA      Hispan… NA     
#> # … with 39 more rows, 5 more variables: country <chr>,
#> #   ancestry_additional_description <chr>, study_id <chr>, pubmed_id <chr>,
#> #   cohorts_additional_description <chr>, and abbreviated variable names
#> #   ¹​sample_id, ²​sample_size, ³​sample_cases, ⁴​sample_controls,
#> #   ⁵​sample_percent_male, ⁶​phenotype_description, ⁷​ancestry_category, ⁸​ancestry
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
#> 
#> Slot "demographics":
#> # A tibble: 16 × 11
#>    pss_id  sampl…¹ varia…² estim…³ estim…⁴ unit  varia…⁵ varia…⁶ inter…⁷ inter…⁸
#>    <chr>     <int> <chr>   <chr>     <dbl> <chr> <chr>     <dbl> <chr>     <dbl>
#>  1 PSS000…       1 age     mean       34   years NA           NA iqr        30  
#>  2 PSS000…       2 age     mean       33   years NA           NA iqr        30  
#>  3 PSS000…       1 age     mean       54   years NA           NA iqr        46  
#>  4 PSS000…       2 age     mean       55   years NA           NA iqr        49  
#>  5 PSS000…       1 age     mean       60.6 years NA           NA iqr        54.4
#>  6 PSS000…       2 age     mean       52.8 years NA           NA iqr        46.3
#>  7 PSS000…       1 follow… median      9.2 years NA           NA iqr         5.5
#>  8 PSS000…       1 follow… median      9.2 years NA           NA iqr         5.5
#>  9 PSS000…       1 follow… median     11.7 years NA           NA iqr         6  
#> 10 PSS000…       1 follow… median     11.7 years NA           NA iqr         6  
#> 11 PSS000…       1 follow… median     10.4 years NA           NA iqr         5.7
#> 12 PSS000…       1 follow… median     10.4 years NA           NA iqr         5.7
#> 13 PSS000…       1 follow… median     21.3 years NA           NA iqr        16.1
#> 14 PSS000…       1 follow… median     23.2 years NA           NA iqr        17.6
#> 15 PSS000…       1 follow… median      8.1 years NA           NA iqr         7.4
#> 16 PSS001…       1 follow… median     14   years NA           NA iqr        14  
#> # … with 1 more variable: interval_upper <dbl>, and abbreviated variable names
#> #   ¹​sample_id, ²​variable, ³​estimate_type, ⁴​estimate, ⁵​variability_type,
#> #   ⁶​variability, ⁷​interval_type, ⁸​interval_lower
#> # ℹ Use `colnames()` to see all variable names
#> 
#> Slot "cohorts":
#> # A tibble: 114 × 4
#>    pss_id    sample_id cohort_symbol cohort_name                                
#>    <chr>         <int> <chr>         <chr>                                      
#>  1 PSS000015         1 UKB           UK Biobank                                 
#>  2 PSS000019         1 CARTaGENE     CARTaGENE cohort (CHU Sainte-Justine, Queb…
#>  3 PSS000020         1 MHI           Montreal Heart Institute Biobank           
#>  4 PSS000020         2 MHI           Montreal Heart Institute Biobank           
#>  5 PSS000021         1 MHI           Montreal Heart Institute Biobank           
#>  6 PSS000022         1 MHI           Montreal Heart Institute Biobank           
#>  7 PSS000219         1 CG            Color Genomics                             
#>  8 PSS000227         1 VIRGO         Variation in Recovery: Role of Gender on O…
#>  9 PSS000227         1 MESA          Multi-Ethnic Study of Atherosclerosis      
#> 10 PSS000228         1 VIRGO         Variation in Recovery: Role of Gender on O…
#> # … with 104 more rows
#> # ℹ Use `print(n = ...)` to see more rows
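
The printed object above is an S4 object whose slots are tibbles sharing the keys pss_id and sample_id. The following is a minimal sketch of how one might extract individual slots with `@` and combine them. It assumes the dplyr package is installed and that the PGS Catalog REST API is reachable; the object name samples_cohorts is introduced here for illustration only.

```r
library(quincunx)
library(dplyr)

pgs_13_sset <- get_sample_sets(pgs_id = 'PGS000013')

# Each slot is a tibble; access it with `@`.
samples <- pgs_13_sset@samples
cohorts <- pgs_13_sset@cohorts

# Attach cohort information to each sample using the shared keys.
# A sample drawn from several cohorts yields one row per cohort.
samples_cohorts <-
  samples %>%
  select(pss_id, sample_id, sample_size, ancestry_category) %>%
  left_join(cohorts, by = c('pss_id', 'sample_id'))

samples_cohorts
```

Because a sample may be linked to more than one cohort, the joined table can have more rows than the samples slot.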

By the sample set identifier

# Sample set PSS000020
(pss_20 <- get_sample_sets(pss_id = 'PSS000020'))
#> An object of class "sample_sets"
#> Slot "sample_sets":
#> # A tibble: 1 × 1
#>   pss_id   
#>   <chr>    
#> 1 PSS000020
#> 
#> Slot "samples":
#> # A tibble: 2 × 15
#>   pss_id   sampl…¹ stage sampl…² sampl…³ sampl…⁴ sampl…⁵ pheno…⁶ ances…⁷ ances…⁸
#>   <chr>      <int> <chr>   <int>   <int>   <int>   <dbl> <chr>   <chr>   <chr>  
#> 1 PSS0000…       1 eval      862     446     416      NA Recurr… Europe… French…
#> 2 PSS0000…       2 eval     2333     937    1396      NA Recurr… Europe… French…
#> # … with 5 more variables: country <chr>,
#> #   ancestry_additional_description <chr>, study_id <chr>, pubmed_id <chr>,
#> #   cohorts_additional_description <chr>, and abbreviated variable names
#> #   ¹​sample_id, ²​sample_size, ³​sample_cases, ⁴​sample_controls,
#> #   ⁵​sample_percent_male, ⁶​phenotype_description, ⁷​ancestry_category, ⁸​ancestry
#> # ℹ Use `colnames()` to see all variable names
#> 
#> Slot "demographics":
#> # A tibble: 0 × 11
#> # … with 11 variables: pss_id <chr>, sample_id <int>, variable <chr>,
#> #   estimate_type <chr>, estimate <dbl>, unit <chr>, variability_type <chr>,
#> #   variability <dbl>, interval_type <chr>, interval_lower <dbl>,
#> #   interval_upper <dbl>
#> # ℹ Use `colnames()` to see all variable names
#> 
#> Slot "cohorts":
#> # A tibble: 2 × 4
#>   pss_id    sample_id cohort_symbol cohort_name                     
#>   <chr>         <int> <chr>         <chr>                           
#> 1 PSS000020         1 MHI           Montreal Heart Institute Biobank
#> 2 PSS000020         2 MHI           Montreal Heart Institute Biobank
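
The pgs_id argument is shown later in this vignette receiving a vector of identifiers. Assuming pss_id accepts a character vector in the same way (an assumption worth checking against ?get_sample_sets), several sample sets could be requested in one call, as in this sketch:

```r
library(quincunx)

# Two sample set identifiers taken from the PGS000013 results above.
pss_ids <- c('PSS000020', 'PSS000021')

# Assumption: `pss_id` is vectorised, like `pgs_id`.
pss_20_21 <- get_sample_sets(pss_id = pss_ids)

# The sample_sets slot lists the identifiers that were retrieved.
pss_20_21@sample_sets
```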

By trait or disease

If you wish to search by criteria other than the PGS identifier or the PSS identifier, you will need to do it in several steps. The general approach is to first map your criteria to matching PGS identifiers, and then pass those PGS identifiers to get_sample_sets().

Let’s say that you want to retrieve all sample sets used in the evaluation of polygenic scores for the disease vitiligo (a loss of skin melanocytes that causes areas of skin depigmentation).

Vitiligo of the hands in a person with dark skin. Source (CC BY-SA 3.0): https://pt.wikipedia.org/wiki/Vitiligo.

We start by searching for this disease in the PGS Catalog with get_traits():

(traits_vitiligo <- get_traits(trait_term = 'Vitiligo'))
#> An object of class "traits"
#> Slot "traits":
#> # A tibble: 1 × 6
#>   efo_id      parent_efo_id is_child trait    description                  url  
#>   <chr>       <chr>         <lgl>    <chr>    <chr>                        <chr>
#> 1 EFO_0004208 NA            FALSE    Vitiligo Generalized well circumscri… http…
#> 
#> Slot "pgs_ids":
#> # A tibble: 3 × 4
#>   efo_id      parent_efo_id is_child pgs_id   
#>   <chr>       <chr>         <lgl>    <chr>    
#> 1 EFO_0004208 NA            FALSE    PGS000738
#> 2 EFO_0004208 NA            FALSE    PGS000760
#> 3 EFO_0004208 NA            FALSE    PGS001536
#> 
#> Slot "child_pgs_ids":
#> # A tibble: 0 × 4
#> # … with 4 variables: efo_id <chr>, parent_efo_id <chr>, is_child <lgl>,
#> #   child_pgs_id <chr>
#> # ℹ Use `colnames()` to see all variable names
#> 
#> Slot "trait_categories":
#> # A tibble: 1 × 4
#>   efo_id      parent_efo_id is_child trait_categories      
#>   <chr>       <chr>         <lgl>    <chr>                 
#> 1 EFO_0004208 NA            FALSE    Immune system disorder
#> 
#> Slot "trait_synonyms":
#> # A tibble: 1 × 4
#>   efo_id      parent_efo_id is_child trait_synonyms
#>   <chr>       <chr>         <lgl>    <chr>         
#> 1 EFO_0004208 NA            FALSE    vitiligo      
#> 
#> Slot "trait_mapped_terms":
#> # A tibble: 14 × 4
#>    efo_id      parent_efo_id is_child trait_mapped_terms
#>    <chr>       <chr>         <lgl>    <chr>             
#>  1 EFO_0004208 NA            FALSE    DOID:12306        
#>  2 EFO_0004208 NA            FALSE    ICD10:L80         
#>  3 EFO_0004208 NA            FALSE    ICD10CM:L80       
#>  4 EFO_0004208 NA            FALSE    ICD9:709.01       
#>  5 EFO_0004208 NA            FALSE    MESH:D014820      
#>  6 EFO_0004208 NA            FALSE    MONDO:0008661     
#>  7 EFO_0004208 NA            FALSE    MeSH:D014820      
#>  8 EFO_0004208 NA            FALSE    MedDRA:10047642   
#>  9 EFO_0004208 NA            FALSE    NCIT:C26915       
#> 10 EFO_0004208 NA            FALSE    NCIt:C26915       
#> 11 EFO_0004208 NA            FALSE    OMIM:193200       
#> 12 EFO_0004208 NA            FALSE    Orphanet:247871   
#> 13 EFO_0004208 NA            FALSE    SNOMEDCT:56727007 
#> 14 EFO_0004208 NA            FALSE    UMLS:C0042900

The slot pgs_ids contains the polygenic score identifiers associated with Vitiligo.

traits_vitiligo@pgs_ids
#> # A tibble: 3 × 4
#>   efo_id      parent_efo_id is_child pgs_id   
#>   <chr>       <chr>         <lgl>    <chr>    
#> 1 EFO_0004208 NA            FALSE    PGS000738
#> 2 EFO_0004208 NA            FALSE    PGS000760
#> 3 EFO_0004208 NA            FALSE    PGS001536

To retrieve the sample sets, we pass those PGS identifiers to get_sample_sets():

(pss_vitiligo <- get_sample_sets(pgs_id = traits_vitiligo@pgs_ids$pgs_id))
#> An object of class "sample_sets"
#> Slot "sample_sets":
#> # A tibble: 7 × 1
#>   pss_id   
#>   <chr>    
#> 1 PSS000907
#> 2 PSS000970
#> 3 PSS004173
#> 4 PSS004174
#> 5 PSS004175
#> 6 PSS004176
#> 7 PSS004177
#> 
#> Slot "samples":
#> # A tibble: 7 × 15
#>   pss_id   sampl…¹ stage sampl…² sampl…³ sampl…⁴ sampl…⁵ pheno…⁶ ances…⁷ ances…⁸
#>   <chr>      <int> <chr>   <int>   <int>   <int>   <dbl> <chr>   <chr>   <chr>  
#> 1 PSS0009…       1 eval     4008    1827    2181      NA Cases … Europe… NA     
#> 2 PSS0009…       1 eval     1584      NA      NA      NA NA      Europe… NA     
#> 3 PSS0041…       1 eval     6497      17    6480      NA NA      Africa… NA     
#> 4 PSS0041…       1 eval     1704       6    1698      NA NA      East A… NA     
#> 5 PSS0041…       1 eval    24905      45   24860      NA NA      Europe… NA     
#> 6 PSS0041…       1 eval     7831      71    7760      NA NA      South … NA     
#> 7 PSS0041…       1 eval    67425     131   67294      NA NA      Europe… NA     
#> # … with 5 more variables: country <chr>,
#> #   ancestry_additional_description <chr>, study_id <chr>, pubmed_id <chr>,
#> #   cohorts_additional_description <chr>, and abbreviated variable names
#> #   ¹​sample_id, ²​sample_size, ³​sample_cases, ⁴​sample_controls,
#> #   ⁵​sample_percent_male, ⁶​phenotype_description, ⁷​ancestry_category, ⁸​ancestry
#> # ℹ Use `colnames()` to see all variable names
#> 
#> Slot "demographics":
#> # A tibble: 0 × 11
#> # … with 11 variables: pss_id <chr>, sample_id <int>, variable <chr>,
#> #   estimate_type <chr>, estimate <dbl>, unit <chr>, variability_type <chr>,
#> #   variability <dbl>, interval_type <chr>, interval_lower <dbl>,
#> #   interval_upper <dbl>
#> # ℹ Use `colnames()` to see all variable names
#> 
#> Slot "cohorts":
#> # A tibble: 6 × 4
#>   pss_id    sample_id cohort_symbol cohort_name                                 
#>   <chr>         <int> <chr>         <chr>                                       
#> 1 PSS000970         1 GNEHGI2020Q2  Genentech Human Genetics Initiative Cancer …
#> 2 PSS004173         1 UKB           UK Biobank                                  
#> 3 PSS004174         1 UKB           UK Biobank                                  
#> 4 PSS004175         1 UKB           UK Biobank                                  
#> 5 PSS004176         1 UKB           UK Biobank                                  
#> 6 PSS004177         1 UKB           UK Biobank
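
The two steps above (trait term to PGS identifiers, then PGS identifiers to sample sets) can be wrapped in a small helper. This is only a sketch: the function name sample_sets_for_trait is hypothetical and not part of quincunx, the early return on an empty match is a defensive assumption, and dplyr is assumed to be installed for the summary that follows.

```r
library(quincunx)
library(dplyr)

# Hypothetical helper (not part of quincunx): trait term -> sample sets.
sample_sets_for_trait <- function(trait_term) {
  traits <- get_traits(trait_term = trait_term)
  pgs_ids <- unique(traits@pgs_ids$pgs_id)

  # Defensive assumption: skip the second query if nothing matched.
  if (length(pgs_ids) == 0L) return(NULL)

  get_sample_sets(pgs_id = pgs_ids)
}

pss_vitiligo <- sample_sets_for_trait('Vitiligo')

# Summarise the retrieved samples by ancestry category.
# dplyr::n() is written out in full to avoid any name clash.
pss_vitiligo@samples %>%
  group_by(ancestry_category) %>%
  summarise(
    n_samples   = dplyr::n(),
    total_size  = sum(sample_size, na.rm = TRUE),
    total_cases = sum(sample_cases, na.rm = TRUE)
  ) %>%
  arrange(desc(total_size))
```

Note that the totals are sums over sample records. Sample sets drawn from the same cohort (for example UK Biobank) may overlap, so these figures should not be read as counts of distinct individuals.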