quincunx vs REST API direct access
Source: vignettes/quincunx-s-advantages.Rmd
In this vignette we present a set of advantages of using quincunx over accessing the PGS Catalog REST API directly.
Handling of lower-level tasks
quincunx automatically handles lower-level tasks related to the GET requests, such as pagination and error handling. When a user of quincunx calls a get function, e.g., get_scores(), this translates to GET requests on one of the following endpoints: /rest/score/all, /rest/score/{pgs_id}, or /rest/score/search. The first and last endpoints return paginated JSON responses, meaning that client-side logic is needed to iterate over all the paginated resources, making the respective GET requests so that all the data is fetched. quincunx automatically detects whether the endpoint returns a paginated response (because it is not always paginated) and iterates accordingly. Also, quincunx’s get functions gracefully handle HTTP errors that might arise along the process: execution does not break, data from the successful responses is collected, and warnings are generated for the requests that failed. Moreover, users can control these warnings, and a verbose flag is provided in our functions for easy debugging of the underlying requests. These are menial features that are nevertheless important for a smooth usage of the REST API.
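Below is a minimal sketch of this behaviour. The warnings and verbose arguments correspond to the controls described above (argument names as documented in quincunx; the PGS identifiers are merely illustrative):

```r
library(quincunx)

# Request two scores; with warnings enabled, a request that fails produces a
# warning instead of stopping execution, and the data from the successful
# requests is still returned. verbose = TRUE prints information about the
# underlying GET requests, which is handy for debugging.
my_scores <- get_scores(pgs_id = c("PGS000001", "PGS000002"),
                        warnings = TRUE,
                        verbose = TRUE)
```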
Endpoint abstraction
quincunx’s set of get_<entity>() functions provides an API that abstracts away the construction of REST API endpoints and that best matches user expectations, i.e., one get function for each type of Catalog entity. For example, a quincunx user only needs to know get_scores() to retrieve metadata about Polygenic Scores. Accessing the REST API directly implies knowing the three endpoints and how to pass them their parameters. Also, having one function for all related endpoints means that we can provide a single interface for all search criteria. For example, in the case of get_scores(), the user may search by PGS identifier (pgs_id), Trait identifier (efo_id) or PubMed identifier (pubmed_id), or even get all polygenic scores by not supplying any criteria. quincunx accepts all these options in a single interface, and provides an extra argument (set_operation) that controls how the results obtained with the different criteria are combined. Without a client like quincunx, all of this logic would have to be implemented anew.
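The sketch below shows how different search criteria can be passed in a single call. The identifier values are illustrative, and the 'union' value for set_operation is an assumption about the accepted options:

```r
library(quincunx)

# Search by PGS identifier and by PubMed identifier in the same call.
# set_operation controls how the two result sets are combined, e.g.
# 'union' (assumed value) keeps scores matched by either criterion.
scores_either <- get_scores(pgs_id = "PGS000001",
                            pubmed_id = "25855707",
                            set_operation = "union")

# Omitting all criteria retrieves all polygenic scores in the Catalog.
all_scores <- get_scores()
```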
Automatic JSON deserialization
Retrieving data from a REST API into an environment such as the R programming language requires converting JSON text into R objects, i.e., JSON deserialization. Here, we hinge on the R package tidyjson. A more direct access to the REST API would require users either to learn a similar tool or to implement this step themselves.
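As an illustration of what this deserialization step involves, here is a toy example with tidyjson (the JSON snippet is made up and much simpler than a real PGS Catalog response):

```r
library(tidyjson)

# A made-up, simplified JSON snippet standing in for an API response.
json <- '{"id": "PGS000001", "name": "PRS77_BC", "variants_number": 77}'

# spread_all() spreads the scalar JSON values into columns of a tibble-like
# object, one step of the kind of work quincunx performs behind the scenes.
spread_all(json)
```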
Relational database representation of Catalog entities
Most importantly, besides JSON deserialization, we chose to represent each Catalog entity as a relational database (quincunx’s S4 objects). This representation does not follow automatically from the structure of the JSON responses. It required careful analysis of the Catalog data and of the relationships between the different data structures, since this information is not explicit in the Catalog documentation. Specifically, we partly reverse engineered these relationships by studying the JSON responses and by communicating frequently with the Catalog team during the development of our package.
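A quick sketch of what this looks like in practice (the samples slot name is an assumption; inspect your own object with slotNames()):

```r
library(quincunx)

# An S4 object returned by get_scores(): each slot is a tibble playing the
# role of a table in an in-memory relational database.
my_scores <- get_scores(pgs_id = "PGS000001")

# List the tables (slots) available in the object.
slotNames(my_scores)

# Access one of the tables, e.g. the samples table (assumed slot name),
# with the standard @ operator.
my_scores@samples
```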
Tidyverse friendly for easy data wrangling and visualisation
The in-memory relational databases are implemented as lists of tibbles, an interface now widely used in the R community, i.e., so-called tidy data. This allows users to take advantage of the tidyverse toolkit for data analysis, modeling and visualisation, e.g. with ggplot2.
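For example, a table can be wrangled and plotted straight away with dplyr and ggplot2. This is a sketch only: the column name variants_number is an assumption, so check colnames(my_scores@scores) for the actual names.

```r
library(quincunx)
library(dplyr)
library(ggplot2)

# Fetch a handful of scores and plot the number of variants per score.
my_scores <- get_scores(pgs_id = c("PGS000001", "PGS000002", "PGS000003"))

my_scores@scores %>%
  select(pgs_id, variants_number) %>%
  ggplot(aes(x = pgs_id, y = variants_number)) +
  geom_col() +
  labs(x = "Polygenic score", y = "Number of variants")
```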
Variable name harmonisation
Each table in quincunx’s relational databases is not simply the automatic result of parsing the JSON responses. The majority of the column names have been unified to follow a common naming scheme and, whenever possible, simplified to better communicate their biological meaning. For example, nearly all JSON responses contain an id element which, depending on the endpoint, can stand for a PGS id, PGP id, PSS id, PPM id, Trait id, etc.; in quincunx, these have been aptly mapped to pgs_id, pgp_id, pss_id, ppm_id or efo_id for clarity and to prevent id mix-ups. Moreover, new identifiers have been created in quincunx’s objects, allowing information to be connected between tables. These new identifiers are indicated as having a “local” scope (see vignette('identifiers')).
Correct variable type coercion
Basic data types, such as booleans, strings, integers and doubles, are correctly mapped to R’s equivalent basic types in quincunx’s tables. Note that JSON makes no distinction between integers and doubles, an ambiguity that can only be resolved by analysing the meaning of the data variables. Also, various representations of missing data (e.g., "NR", "Not Reported", null, "", etc.) are correctly mapped to NA in R. In the case of null JSON objects that have a nested schema, the corresponding relational tables are recreated with the correct columns and types (albeit empty) so that scripts do not break. This consistency of data structure is an important feature enabling subsequent data wrangling, such as combining results that have the same structure. So although the tables inside our S4 objects might seem raw, they are actually the result of data tidying and cleaning, and of reverse engineering the relationships between objects so that they can be made into relational tables.
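A small sketch of why this consistency matters (the samples slot name is an assumption, and the identifiers are illustrative): because empty tables keep their columns and types, results can be combined without special-casing missing data.

```r
library(quincunx)
library(dplyr)

# Retrieve two scores separately.
s1 <- get_scores(pgs_id = "PGS000001")
s2 <- get_scores(pgs_id = "PGS000002")

# Because the samples tables always have the same columns and types, even
# when one of them happens to be empty, they can be stacked without checks.
bind_rows(s1@samples, s2@samples)
```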
Database level subsetting of tables
A very useful feature of our S4 objects is subsetting based on indices or identifiers. Moreover, subsetting a quincunx object conveniently propagates to all its tables (slots).
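A sketch of both subsetting modes described above (identifiers are illustrative):

```r
library(quincunx)

my_scores <- get_scores(pgs_id = c("PGS000001", "PGS000002", "PGS000003"))

# Subset by position: keeps the first two scores in every table (slot).
my_scores[1:2]

# Subset by identifier: keeps only the rows related to PGS000002 across
# all tables.
my_scores["PGS000002"]
```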
Part of the hapiverse
quincunx is a sibling package of gwasrapidd, which provides access to the GWAS Catalog, and incorporates the same design principles, particularly when it comes to data representation and wrangling. Users working in this field who have already used one of these tools will find that their knowledge transfers easily to the other.