Sample sets
A sample set is a group of samples used in a polygenic score evaluation. Each sample set is identified in the PGS Catalog by a unique sample set identifier (pss_id). See vignette('cohorts-samples-sample-sets') for more details on the relationship between cohorts, samples, and sample sets.
Getting sample sets
To get information on sample sets you can either search by the associated polygenic score identifiers, or by the sample set identifiers themselves (if you know them beforehand).
By the PGS identifier
# Sample sets used in the evaluation of PGS000013
(pgs_13_sset <- get_sample_sets(pgs_id = 'PGS000013'))
#> An object of class "sample_sets"
#> Slot "sample_sets":
#> # A tibble: 45 × 1
#> pss_id
#> <chr>
#> 1 PSS000015
#> 2 PSS000019
#> 3 PSS000020
#> 4 PSS000021
#> 5 PSS000022
#> 6 PSS000219
#> 7 PSS000227
#> 8 PSS000228
#> 9 PSS000229
#> 10 PSS000230
#> # … with 35 more rows
#> # ℹ Use `print(n = ...)` to see more rows
#>
#> Slot "samples":
#> # A tibble: 49 × 15
#> pss_id sampl…¹ stage sampl…² sampl…³ sampl…⁴ sampl…⁵ pheno…⁶ ances…⁷ ances…⁸
#> <chr> <int> <chr> <int> <int> <int> <dbl> <chr> <chr> <chr>
#> 1 PSS000… 1 eval 288978 8676 280302 NA CAD as… Europe… NA
#> 2 PSS000… 1 eval 5762 173 5589 41.3 Preval… Europe… French…
#> 3 PSS000… 1 eval 862 446 416 NA Recurr… Europe… French…
#> 4 PSS000… 2 eval 2333 937 1396 NA Recurr… Europe… French…
#> 5 PSS000… 1 eval 1964 974 976 72.7 Preval… Europe… French…
#> 6 PSS000… 1 eval 3309 2492 817 72.4 Preval… Europe… French…
#> 7 PSS000… 1 eval 11010 126 10884 17.1 Phenot… Europe… NA
#> 8 PSS000… 1 eval 544 40 504 NA NA Asian … NA
#> 9 PSS000… 1 eval 1298 336 962 NA NA Africa… NA
#> 10 PSS000… 1 eval 919 168 751 NA NA Hispan… NA
#> # … with 39 more rows, 5 more variables: country <chr>,
#> # ancestry_additional_description <chr>, study_id <chr>, pubmed_id <chr>,
#> # cohorts_additional_description <chr>, and abbreviated variable names
#> # ¹sample_id, ²sample_size, ³sample_cases, ⁴sample_controls,
#> # ⁵sample_percent_male, ⁶phenotype_description, ⁷ancestry_category, ⁸ancestry
#> # ℹ Use `print(n = ...)` to see more rows, and `colnames()` to see all variable names
#>
#> Slot "demographics":
#> # A tibble: 16 × 11
#> pss_id sampl…¹ varia…² estim…³ estim…⁴ unit varia…⁵ varia…⁶ inter…⁷ inter…⁸
#> <chr> <int> <chr> <chr> <dbl> <chr> <chr> <dbl> <chr> <dbl>
#> 1 PSS000… 1 age mean 34 years NA NA iqr 30
#> 2 PSS000… 2 age mean 33 years NA NA iqr 30
#> 3 PSS000… 1 age mean 54 years NA NA iqr 46
#> 4 PSS000… 2 age mean 55 years NA NA iqr 49
#> 5 PSS000… 1 age mean 60.6 years NA NA iqr 54.4
#> 6 PSS000… 2 age mean 52.8 years NA NA iqr 46.3
#> 7 PSS000… 1 follow… median 9.2 years NA NA iqr 5.5
#> 8 PSS000… 1 follow… median 9.2 years NA NA iqr 5.5
#> 9 PSS000… 1 follow… median 11.7 years NA NA iqr 6
#> 10 PSS000… 1 follow… median 11.7 years NA NA iqr 6
#> 11 PSS000… 1 follow… median 10.4 years NA NA iqr 5.7
#> 12 PSS000… 1 follow… median 10.4 years NA NA iqr 5.7
#> 13 PSS000… 1 follow… median 21.3 years NA NA iqr 16.1
#> 14 PSS000… 1 follow… median 23.2 years NA NA iqr 17.6
#> 15 PSS000… 1 follow… median 8.1 years NA NA iqr 7.4
#> 16 PSS001… 1 follow… median 14 years NA NA iqr 14
#> # … with 1 more variable: interval_upper <dbl>, and abbreviated variable names
#> # ¹sample_id, ²variable, ³estimate_type, ⁴estimate, ⁵variability_type,
#> # ⁶variability, ⁷interval_type, ⁸interval_lower
#> # ℹ Use `colnames()` to see all variable names
#>
#> Slot "cohorts":
#> # A tibble: 114 × 4
#> pss_id sample_id cohort_symbol cohort_name
#> <chr> <int> <chr> <chr>
#> 1 PSS000015 1 UKB UK Biobank
#> 2 PSS000019 1 CARTaGENE CARTaGENE cohort (CHU Sainte-Justine, Queb…
#> 3 PSS000020 1 MHI Montreal Heart Institute Biobank
#> 4 PSS000020 2 MHI Montreal Heart Institute Biobank
#> 5 PSS000021 1 MHI Montreal Heart Institute Biobank
#> 6 PSS000022 1 MHI Montreal Heart Institute Biobank
#> 7 PSS000219 1 CG Color Genomics
#> 8 PSS000227 1 VIRGO Variation in Recovery: Role of Gender on O…
#> 9 PSS000227 1 MESA Multi-Ethnic Study of Atherosclerosis
#> 10 PSS000228 1 VIRGO Variation in Recovery: Role of Gender on O…
#> # … with 104 more rows
#> # ℹ Use `print(n = ...)` to see more rows
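Several column names in the samples table are abbreviated in the printed output. Each slot can be extracted with `@`, so one way to inspect those columns is to pull out the tibble and subset it. The following is a minimal sketch, assuming the quincunx package is installed and the PGS Catalog REST API is reachable; the variable names samples and samples_subset are introduced here for illustration only.

```r
library(quincunx)

# Assumption: the PGS Catalog REST API is reachable from this session.
pgs_13_sset <- get_sample_sets(pgs_id = 'PGS000013')

# Extract the samples table from its slot.
samples <- pgs_13_sset@samples

# Keep a handful of the columns whose names were abbreviated when printed.
samples_subset <- samples[, c('pss_id', 'sample_id', 'sample_size', 'ancestry_category')]
print(samples_subset, n = 5)

# All column names of the samples table.
colnames(samples)
```

The same pattern should apply to the other slots (sample_sets, demographics, and cohorts).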
By the sample set identifier
# Sample set PSS000020
(pss_20 <- get_sample_sets(pss_id = 'PSS000020'))
#> An object of class "sample_sets"
#> Slot "sample_sets":
#> # A tibble: 1 × 1
#> pss_id
#> <chr>
#> 1 PSS000020
#>
#> Slot "samples":
#> # A tibble: 2 × 15
#> pss_id sampl…¹ stage sampl…² sampl…³ sampl…⁴ sampl…⁵ pheno…⁶ ances…⁷ ances…⁸
#> <chr> <int> <chr> <int> <int> <int> <dbl> <chr> <chr> <chr>
#> 1 PSS0000… 1 eval 862 446 416 NA Recurr… Europe… French…
#> 2 PSS0000… 2 eval 2333 937 1396 NA Recurr… Europe… French…
#> # … with 5 more variables: country <chr>,
#> # ancestry_additional_description <chr>, study_id <chr>, pubmed_id <chr>,
#> # cohorts_additional_description <chr>, and abbreviated variable names
#> # ¹sample_id, ²sample_size, ³sample_cases, ⁴sample_controls,
#> # ⁵sample_percent_male, ⁶phenotype_description, ⁷ancestry_category, ⁸ancestry
#> # ℹ Use `colnames()` to see all variable names
#>
#> Slot "demographics":
#> # A tibble: 0 × 11
#> # … with 11 variables: pss_id <chr>, sample_id <int>, variable <chr>,
#> # estimate_type <chr>, estimate <dbl>, unit <chr>, variability_type <chr>,
#> # variability <dbl>, interval_type <chr>, interval_lower <dbl>,
#> # interval_upper <dbl>
#> # ℹ Use `colnames()` to see all variable names
#>
#> Slot "cohorts":
#> # A tibble: 2 × 4
#> pss_id sample_id cohort_symbol cohort_name
#> <chr> <int> <chr> <chr>
#> 1 PSS000020 1 MHI Montreal Heart Institute Biobank
#> 2 PSS000020 2 MHI Montreal Heart Institute Biobank
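In the output above the demographics slot has zero rows, meaning that no demographic variables were returned for this sample set. A quick way to see which tables are populated is to count the rows in every slot. This is only a sketch, assuming the object is a regular S4 object whose slots are all tibbles, as the printed output suggests; slot_rows is a name made up for this example.

```r
library(quincunx)

# Assumption: the PGS Catalog REST API is reachable from this session.
pss_20 <- get_sample_sets(pss_id = 'PSS000020')

# Names of the slots (tables) held by the S4 object.
slots <- methods::slotNames(pss_20)

# Number of rows in each slot; a zero marks an empty table.
slot_rows <- vapply(slots, function(s) nrow(methods::slot(pss_20, s)), integer(1))
slot_rows
```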
By trait or disease
To search by criteria other than the PGS identifier or the PSS identifier, you need to proceed in several steps. The general approach is to map your criteria to matching PGS identifiers, and then map those PGS identifiers to sample sets using get_sample_sets().
Let’s say that you want to retrieve all sample sets used in the evaluation of polygenic scores for the disease Vitiligo (loss of skin melanocytes that causes areas of skin depigmentation).
We start by searching for this disease in the PGS Catalog with get_traits():
(traits_vitiligo <- get_traits(trait_term = 'Vitiligo'))
#> An object of class "traits"
#> Slot "traits":
#> # A tibble: 1 × 6
#> efo_id parent_efo_id is_child trait description url
#> <chr> <chr> <lgl> <chr> <chr> <chr>
#> 1 EFO_0004208 NA FALSE Vitiligo Generalized well circumscri… http…
#>
#> Slot "pgs_ids":
#> # A tibble: 3 × 4
#> efo_id parent_efo_id is_child pgs_id
#> <chr> <chr> <lgl> <chr>
#> 1 EFO_0004208 NA FALSE PGS000738
#> 2 EFO_0004208 NA FALSE PGS000760
#> 3 EFO_0004208 NA FALSE PGS001536
#>
#> Slot "child_pgs_ids":
#> # A tibble: 0 × 4
#> # … with 4 variables: efo_id <chr>, parent_efo_id <chr>, is_child <lgl>,
#> # child_pgs_id <chr>
#> # ℹ Use `colnames()` to see all variable names
#>
#> Slot "trait_categories":
#> # A tibble: 1 × 4
#> efo_id parent_efo_id is_child trait_categories
#> <chr> <chr> <lgl> <chr>
#> 1 EFO_0004208 NA FALSE Immune system disorder
#>
#> Slot "trait_synonyms":
#> # A tibble: 1 × 4
#> efo_id parent_efo_id is_child trait_synonyms
#> <chr> <chr> <lgl> <chr>
#> 1 EFO_0004208 NA FALSE vitiligo
#>
#> Slot "trait_mapped_terms":
#> # A tibble: 14 × 4
#> efo_id parent_efo_id is_child trait_mapped_terms
#> <chr> <chr> <lgl> <chr>
#> 1 EFO_0004208 NA FALSE DOID:12306
#> 2 EFO_0004208 NA FALSE ICD10:L80
#> 3 EFO_0004208 NA FALSE ICD10CM:L80
#> 4 EFO_0004208 NA FALSE ICD9:709.01
#> 5 EFO_0004208 NA FALSE MESH:D014820
#> 6 EFO_0004208 NA FALSE MONDO:0008661
#> 7 EFO_0004208 NA FALSE MeSH:D014820
#> 8 EFO_0004208 NA FALSE MedDRA:10047642
#> 9 EFO_0004208 NA FALSE NCIT:C26915
#> 10 EFO_0004208 NA FALSE NCIt:C26915
#> 11 EFO_0004208 NA FALSE OMIM:193200
#> 12 EFO_0004208 NA FALSE Orphanet:247871
#> 13 EFO_0004208 NA FALSE SNOMEDCT:56727007
#> 14 EFO_0004208 NA FALSE UMLS:C0042900
The slot pgs_ids contains the polygenic score identifiers associated with Vitiligo:
traits_vitiligo@pgs_ids
#> # A tibble: 3 × 4
#> efo_id parent_efo_id is_child pgs_id
#> <chr> <chr> <lgl> <chr>
#> 1 EFO_0004208 NA FALSE PGS000738
#> 2 EFO_0004208 NA FALSE PGS000760
#> 3 EFO_0004208 NA FALSE PGS001536
Now, to search for the sample sets, we can pass those PGS identifiers to get_sample_sets():
(pss_vitiligo <- get_sample_sets(pgs_id = traits_vitiligo@pgs_ids$pgs_id))
#> An object of class "sample_sets"
#> Slot "sample_sets":
#> # A tibble: 7 × 1
#> pss_id
#> <chr>
#> 1 PSS000907
#> 2 PSS000970
#> 3 PSS004173
#> 4 PSS004174
#> 5 PSS004175
#> 6 PSS004176
#> 7 PSS004177
#>
#> Slot "samples":
#> # A tibble: 7 × 15
#> pss_id sampl…¹ stage sampl…² sampl…³ sampl…⁴ sampl…⁵ pheno…⁶ ances…⁷ ances…⁸
#> <chr> <int> <chr> <int> <int> <int> <dbl> <chr> <chr> <chr>
#> 1 PSS0009… 1 eval 4008 1827 2181 NA Cases … Europe… NA
#> 2 PSS0009… 1 eval 1584 NA NA NA NA Europe… NA
#> 3 PSS0041… 1 eval 6497 17 6480 NA NA Africa… NA
#> 4 PSS0041… 1 eval 1704 6 1698 NA NA East A… NA
#> 5 PSS0041… 1 eval 24905 45 24860 NA NA Europe… NA
#> 6 PSS0041… 1 eval 7831 71 7760 NA NA South … NA
#> 7 PSS0041… 1 eval 67425 131 67294 NA NA Europe… NA
#> # … with 5 more variables: country <chr>,
#> # ancestry_additional_description <chr>, study_id <chr>, pubmed_id <chr>,
#> # cohorts_additional_description <chr>, and abbreviated variable names
#> # ¹sample_id, ²sample_size, ³sample_cases, ⁴sample_controls,
#> # ⁵sample_percent_male, ⁶phenotype_description, ⁷ancestry_category, ⁸ancestry
#> # ℹ Use `colnames()` to see all variable names
#>
#> Slot "demographics":
#> # A tibble: 0 × 11
#> # … with 11 variables: pss_id <chr>, sample_id <int>, variable <chr>,
#> # estimate_type <chr>, estimate <dbl>, unit <chr>, variability_type <chr>,
#> # variability <dbl>, interval_type <chr>, interval_lower <dbl>,
#> # interval_upper <dbl>
#> # ℹ Use `colnames()` to see all variable names
#>
#> Slot "cohorts":
#> # A tibble: 6 × 4
#> pss_id sample_id cohort_symbol cohort_name
#> <chr> <int> <chr> <chr>
#> 1 PSS000970 1 GNEHGI2020Q2 Genentech Human Genetics Initiative Cancer …
#> 2 PSS004173 1 UKB UK Biobank
#> 3 PSS004174 1 UKB UK Biobank
#> 4 PSS004175 1 UKB UK Biobank
#> 5 PSS004176 1 UKB UK Biobank
#> 6 PSS004177 1 UKB UK Biobank
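The steps above (trait term to PGS identifiers, then PGS identifiers to sample sets) can be wrapped in a small convenience function. The sketch below is one possible way to do it; get_sample_sets_by_trait() is a hypothetical helper defined here, not a function exported by quincunx, and it only considers the pgs_ids slot, ignoring child_pgs_ids.

```r
library(quincunx)

# Hypothetical helper (not part of quincunx):
# trait term -> PGS identifiers -> sample sets.
get_sample_sets_by_trait <- function(trait_term) {
  # Step 1: look up the trait in the PGS Catalog.
  traits <- get_traits(trait_term = trait_term)

  # Step 2: collect the PGS identifiers directly associated with the trait.
  pgs_ids <- unique(traits@pgs_ids$pgs_id)
  if (length(pgs_ids) == 0L) {
    stop("No polygenic scores found for trait term: ", trait_term)
  }

  # Step 3: retrieve the sample sets used to evaluate those scores.
  get_sample_sets(pgs_id = pgs_ids)
}

# Assumption: the PGS Catalog REST API is reachable from this session.
pss_vitiligo <- get_sample_sets_by_trait('Vitiligo')
pss_vitiligo@sample_sets$pss_id
```

Stopping with an error when no PGS identifiers are found is a design choice made for this sketch; returning an empty result may suit other workflows better.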