Overview


The OSDR biological data API is an application programming interface enabling granular access to GeneLab and ALSDA data contained in the NASA Open Science Data Repository (NASA OSDR). It provides two modes of (meta)data interrogation:
  • a REST interface, which exposes a higher-level, structured view of the objects in the database, with JSON being the primary output format;
  • a query interface, which allows for complex querying and filtering of both the metadata and the data;
    outputs are provided as tables (CSV/TSV and tables converted to JSON).
The REST interface can be seen as a means for database traversal, while the query interface is a means for metadata and data interrogation.
TL;DR

Organization of datasets, assays, samples, and their metadata as presented via the REST interface:
[Figure: screenshots of implied metadata structure]
Sample-level metadata addressable in a flatter format as used in the query interfaces:
  • id.accession
  • id.sample%20name=Mmus_C57-6J_LVR_GC_I_Rep1_M31
  • investigation.ontology%20source%20reference
  • study.characteristics
  • study.characteristics.strain
  • study.factor%20value.spaceflight!=basal%20control
  • file.data%20type=pca
  • etc.
Data-level columns as addressable in the data query interface:
  • column.*
  • column.ENTREZID
  • column.Mmus_C57-6J_LVR_GC_I_Rep1_M31!=0
  • etc.
TL;DR: examples

Hierarchical metadata


  • The database consists of datasets uniquely identifiable by their accession numbers, e.g. OSD-48.
  • Each dataset contains assays and samples whose metadata are organized according to the ISA model.
  • The API represents these metadata in a hierarchical fashion, extracting the relevant ISA sections for each sample upon request.
For example, if a sample has been assayed with both microscopy and RNA-Seq, the metadata describing that sample can be found under each of the respective assays (note that any sample may be associated with some, but not all, assays in a dataset). E.g., to request the ISA (investigation, study, assay) metadata of the OSD-48 sample Mmus_C57-6J_LVR_GC_I_Rep1_M31 as an analysis object in a microscopy assay, the following hierarchy is applicable: […]; and, respectively, for the same sample, but in an RNA-Seq assay: […].
Note: the syntax of the two example links is discussed further down in the REST interface section.
The REST interface also serves as a means to identify exact assay and sample names of interest.
On-demand metadata combinations

  • Given a scope of dataset × assay × sample, the API output is a representation of all relevant ISA metadata for a sample, combined on-demand from ISA entries (investigation, assay, study tabs) that are associated with this sample, rather than simply a reflection of a single record in the database.
This is exemplified by the two links above producing JSONs that share identical sections (investigation, study) and only differ in the assay section: microscopy vs. RNA-Seq.
  • This ensures that each such representation synthesizes complete investigation, assay, and study metadata for a given sample within the scope of a given dataset and assay.
As a consequence, the same section may be repeated in full across outputs for multiple samples; in return, the output for any one sample in one assay of one dataset always contains the complete relevant metadata.

The REST interface


  • The REST interface employs progressive disclosure: by starting from the listing of all datasets (/v2/datasets/) and exploring the REST URLs provided within each output, one can learn the entire scope of the REST interface without reading the next few subsections of this documentation. If you choose to do so, skip to the section REST syntax extensions.
  • The syntax of REST URLs follows the hierarchy outlined above.
  • The returned JSON always consists of branches starting at the top-level object (dataset).
  • All REST endpoints accept an optional parameter format (see Output formats). If omitted, the format defaults to plain JSON.
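The progressive-disclosure approach described above can be sketched as follows: given an already-parsed JSON output, collect every string value that looks like a REST path, so that it can be requested next. This is only an illustration. The toy `listing` fragment, its key names, the placeholder paths, and the "/v2/" marker test are all assumptions, since the exact layout of the API outputs is not described in this section.

```python
import json


def collect_rest_paths(node, marker="/v2/"):
    """Recursively gather string values that contain the given path marker."""
    found = []
    if isinstance(node, dict):
        for value in node.values():
            found.extend(collect_rest_paths(value, marker))
    elif isinstance(node, list):
        for item in node:
            found.extend(collect_rest_paths(item, marker))
    elif isinstance(node, str) and marker in node:
        found.append(node)
    return found


# Toy output fragment: key names and paths are placeholders, not real API output.
listing = json.loads("""
{
  "OSD-48": {"link": "/v2/example-path/OSD-48/", "title": "not a link"},
  "OSD-4": {"links": ["/v2/example-path/OSD-4/"]}
}
""")

paths = collect_rest_paths(listing)
print(paths)
```

Each collected path could then be requested in turn, optionally with the format parameter, to walk the hierarchy one level at a time.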
Metadata REST endpoints
File records REST endpoints
REST syntax extensions

The query interface


  • The query interface facilitates the process of filtering by values of multiple metadata fields and/or data columns at once.
  • Field selectors and filters are accepted as key-value GET parameters (e.g. study.characteristics.strain=S288C) to the respective
    endpoints: /v2/query/metadata/ and /v2/query/data/.
All query endpoints accept an optional parameter format (see Output formats). If omitted, the format defaults to CSV.
Sample-level metadata query endpoints

  • This is positioned as the primary entry point into the query interface.
  • The root endpoint for the sample-level metadata query interface is /v2/query/metadata/.
  • id.accession, id.assay name, id.sample name are always returned as the first three fields, whether or not they are requested.
    • When applied to e.g. RNA-Seq analysis, this ensures that these three columns can be used as row names that parallel data query column names, if the latter are requested with a format.header.multi modifier.
  • The query interface operates on the same metadata combinations as the REST interface, providing access to metadata fields and values under branches:
    • investigation (part of ISA)
    • study (part of ISA)
    • assay (part of ISA)
    • id (accession number, assay name, sample name, aggregated into a branch)
    • file (metadata for files associated with the sample, aggregated into a branch)
  • The metadata structure is flattened further; for example, a branch:
    • study → characteristics → strain → "S288C"
    becomes a field:
    • study.characteristics.strain
    with a value:
    • "S288C"
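The flattening described above can be mimicked locally with a few lines of Python. This is a sketch of the naming convention only, not of how the API itself is implemented; the `nested` dictionary is a toy stand-in for a metadata branch.

```python
def flatten(branch, parent=""):
    """Turn nested dictionaries into {period-separated field: value} pairs."""
    flat = {}
    for key, value in branch.items():
        field = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            # Descend into sub-branches, carrying the accumulated prefix along.
            flat.update(flatten(value, field))
        else:
            flat[field] = value
    return flat


# Toy branch mirroring the example above.
nested = {"study": {"characteristics": {"strain": "S288C"}}}
flat = flatten(nested)
print(flat)  # {'study.characteristics.strain': 'S288C'}
```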
Providing fields as such period-separated keys, optionally paired with values, as components of a GET request (e.g. study.characteristics.strain=S288C), allows for filtering by values of multiple metadata fields at once. The GET request syntax understands the following conventions:
Format example
    Meaning

study.characteristics.strain
    Include this field in the output, regardless of its value, for all implicated samples.

=study.characteristics.strain
    Only include samples that have this field annotated with a non-null (non-NaN) value; also include the field itself in the output.

study.characteristics.strain=S288C
    Only include samples whose value of this field is equal to the provided value; also include the field itself in the output.

study.characteristics.strain=S288C|BY4743
    Only include samples whose value of this field is equal to either of the provided values (separated by a vertical pipe, i.e. a logical "OR"); also include the field itself in the output.

study.characteristics.strain!=S288C
    Only include samples whose value of this field is not equal to the provided value; still include the field itself in the output.
    Note that this excludes null (NaN) values, since NaNs are not equal to anything (not even to themselves) by definition.

study.characteristics.strain=/^BY\d+$/
study.characteristics.strain=/^BY\d+$/i
study.characteristics.strain=/^BY\d+$/c
    Only include samples whose value of this field matches the provided regular expression (in this case: ^BY\d+$, i.e. a leading "BY" followed by a number); also include the field itself in the output.
    The /i flag invokes case-insensitive matching, while the /c flag enforces case sensitivity.
    Note: the flags override the behavior.matchcase modifier.

study.characteristics
    Include all fields in the given section that are present for any of the samples in the request.
    This wildcard syntax is applicable to any ISA field starting from the 2nd level (e.g., assay.parameter value, study.factor value, investigation.study assays, etc.)
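The conventions above can be assembled into a request URL programmatically. The following is a minimal sketch using only the Python standard library. The helper names `selector` and `query_url` are hypothetical, the URL is kept relative because the host is not given in this section, and percent-encoding every reserved character in values (including those of a regular expression) is an assumption about what the server accepts after decoding.

```python
from urllib.parse import quote


def selector(field, values=None, negate=False, non_null=False):
    """Render one field selector or filter following the conventions above."""
    key = quote(field, safe="*")  # spaces become %20; periods are left as-is
    if values is None:
        # "=field" keeps only samples with a non-null value; bare "field" keeps all.
        return f"={key}" if non_null else key
    if isinstance(values, str):
        values = [values]
    # Several values are joined with "|", which acts as a logical OR.
    joined = "|".join(quote(value, safe="") for value in values)
    return f"{key}{'!=' if negate else '='}{joined}"


def query_url(endpoint, *selectors, fmt=None):
    """Join selectors (and an optional format parameter) into a GET request URL."""
    parts = list(selectors)
    if fmt is not None:
        parts.append(f"format={quote(fmt, safe='')}")
    return endpoint + "?" + "&".join(parts)


url = query_url(
    "/v2/query/metadata/",
    selector("id.accession"),
    selector("study.characteristics.strain", ["S288C", "BY4743"]),
    selector("study.factor value.spaceflight", "basal control", negate=True),
    fmt="tsv",
)
print(url)
```

The combination of filters in the usage example is purely illustrative and is not expected to match any particular dataset.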

Example usage
Assay-grouped metadata query endpoints

  • The root endpoint for the assay-grouped metadata query interface is /v2/query/assays/.
  • These endpoints provide a condensed representation of sample-level metadata, grouped by assay.
  • id.accession and id.assay name are always returned as the first two fields, whether or not they are requested.
  • The same query syntax as for sample-level metadata applies.
  • However, the id.sample name column is omitted, and the duplicate rows that result from this omission are collapsed.
  • In essence, this lists each combination of metadata values once per assay, which can help assess whether a given assay is usable for the analysis a user has in mind.
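The relationship between sample-level and assay-grouped output can be illustrated locally: dropping the id.sample name field from sample-level rows and collapsing the resulting duplicates yields one row per distinct combination of values. This sketch emulates the described behavior on toy rows (the accession "OSD-0" and assay names are made up) and is not a description of the server-side implementation.

```python
def group_by_assay(rows):
    """Drop 'id.sample name' and collapse duplicate rows, keeping first-seen order."""
    seen = set()
    grouped = []
    for row in rows:
        reduced = {k: v for k, v in row.items() if k != "id.sample name"}
        key = tuple(reduced.items())  # hashable fingerprint of the reduced row
        if key not in seen:
            seen.add(key)
            grouped.append(reduced)
    return grouped


# Toy sample-level rows; all values are hypothetical.
sample_rows = [
    {"id.accession": "OSD-0", "id.assay name": "assay-A", "id.sample name": "s1",
     "study.characteristics.strain": "S288C"},
    {"id.accession": "OSD-0", "id.assay name": "assay-A", "id.sample name": "s2",
     "study.characteristics.strain": "S288C"},
    {"id.accession": "OSD-0", "id.assay name": "assay-A", "id.sample name": "s3",
     "study.characteristics.strain": "BY4743"},
]

assay_rows = group_by_assay(sample_rows)
print(len(assay_rows))  # 2
```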
Example usage
Data query endpoints

  • The root endpoint for the data query interface is /v2/query/data/.
  • This endpoint provides direct access to data in the OSDR repository:
    • If a metadata query resolves to a single file or a subset of mergeable files, the matching data query will resolve to the representation of the underlying data.
    • For tabular data, querying and filtering by columns and column values is also implemented;
    • Non-tabular data are provided as a direct download.
  • For tabular data, accession, assay name, and sample name are included as column names.
  • The same query syntax as for sample-level metadata applies;
  • One of file.data type or file.file name must be included in order to narrow down the search to specific file data;
  • If a matching sample-level metadata query resolves to a single file, its data will be returned;
  • If a matching sample-level metadata query resolves to several files that can be unambiguously merged on the fly (currently, this only applies to file.data type=unnormalized counts), their data will be returned as a merged table.
If the data are in an arbitrary (e.g. binary) format and/or have been requested with format=raw, they will be retrieved directly as a file.
If the data type is understood by the API as being tabular, additional GET key-value pairs can be provided, addressing columns as column.COLUMN_NAME and otherwise following the same conventions as those outlined for sample-level metadata. Note: by default, providing any column query component constrains the output to only the requested column(s); to display all columns regardless, the wildcard column.* is to be used.
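A data query can be assembled in the same spirit. The sketch below checks the stated requirement that file.data type or file.file name be present, and appends bare column selectors (with "*" as the wildcard). The helper name `data_query_url` is hypothetical, only equality filters and bare column names are covered, and the particular accession and data type combination in the usage example is illustrative rather than a verified query.

```python
from urllib.parse import quote

DATA_ENDPOINT = "/v2/query/data/"
REQUIRED_ANY = ("file.data type", "file.file name")


def data_query_url(filters, columns=(), fmt=None):
    """Build a data query URL from {metadata field: value} and column names."""
    if not any(field in filters for field in REQUIRED_ANY):
        raise ValueError("include file.data type or file.file name to select file data")
    parts = [f"{quote(k, safe='')}={quote(v, safe='')}" for k, v in filters.items()]
    # Any column selector limits the output to the named columns; "*" keeps all.
    parts += [f"column.{quote(name, safe='*')}" for name in columns]
    if fmt is not None:
        parts.append(f"format={quote(fmt, safe='')}")
    return DATA_ENDPOINT + "?" + "&".join(parts)


data_url = data_query_url(
    {"id.accession": "OSD-48", "file.data type": "unnormalized counts"},
    columns=["*"],
)
print(data_url)
```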
Example usage
Query interface modifiers

Output formats


            | REST endpoints | Query endpoints | Notes                                                            | Examples
default     | json           | csv             | see conventions below                                            | REST, query
interactive | html           | html            | see conventions below                                            | REST, query
raw         | -              | raw             | only for query data endpoints: retrieves the original data file  | query
alternative | -              | tsv             | see conventions below                                            | query
            | -              | json.split      | format: {"columns": […], "data": [[…], …]}                       | query
            | -              | json.records    | format: [{"field": value, …}, …]                                 | query
            | -              | json.table      | format: {"schema": …, "data": [{"field": value, …}, …]}          | query
auxiliary   | browser        | browser         | resolves to html if possible to visualize; to raw otherwise      | -
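The json.split and json.records layouts listed above carry the same information, and one can be converted to the other on the client side. A minimal sketch, assuming a json.split payload with exactly the "columns" and "data" keys shown in the notes; the payload text itself is a made-up example.

```python
import json


def split_to_records(payload):
    """Convert {"columns": [...], "data": [[...], ...]} into a list of dicts."""
    columns = payload["columns"]
    return [dict(zip(columns, row)) for row in payload["data"]]


# Hypothetical json.split payload; the accession and values are placeholders.
text = """
{"columns": ["id.accession", "study.characteristics.strain"],
 "data": [["OSD-0", "S288C"], ["OSD-0", null]]}
"""

records = split_to_records(json.loads(text))
print(records)
```

Null values arrive as JSON null and therefore appear as None in the resulting Python dictionaries.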

Output format conventions

  • JSON outputs are produced in accordance with the strict specification:
    • null/NaN values are represented as null;
    • positive and negative infinities, which are not valid JSON, are converted to strings ("Infinity" and "-Infinity").
  • In tabular formats (CSV, TSV) the following holds:
    • Column names are unquoted strings (e.g. study.characteristics);
    • All cell values of type string are quoted (e.g. "S288C");
    • All cell values of other types are not quoted;
      • This includes null/NaN values, which are represented as unquoted NaN in order to avoid any possible confusion with a string value "NaN".
  • In tabular formats, the following holds for column names:
    • Metadata column names are represented by period-separated field names as seen in the metadata hierarchy:
      e.g. study → characteristics → strain becomes: study.characteristics.strain;
    • Data column names are represented by forward-slash-separated accession, assay name, and sample name values:
      e.g. a sample entry identified by accession number OSD-48, assay name OSD-48_molecular-cellular-imaging_microscopy_pannoramic scan (3d histech), and sample name Mmus_C57-6J_LVR_GC_I_Rep1_M31 becomes:
      OSD-48/OSD-48_molecular-cellular-imaging_microscopy_pannoramic scan (3d histech)/Mmus_C57-6J_LVR_GC_I_Rep1_M31.
    • In the interactive (browser) format, the column names are broken up into multiple rows instead, purely to avoid taking up too much horizontal screen space with such names, which can often be rather long.
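The tabular conventions above map well onto Python's standard csv module: with QUOTE_NONNUMERIC, the reader converts unquoted cells to floats (so an unquoted NaN becomes a float NaN) and leaves quoted cells as strings (so "NaN" stays a string). The sketch below also splits a data column name into its three components. The sample table is invented, the helper names are hypothetical, and the code assumes that cells contain no embedded newlines and that the accession, assay name, and sample name contain no forward slashes.

```python
import csv
import math


def read_table(text, delimiter=","):
    """Parse CSV/TSV text: unquoted cells become floats, quoted cells stay strings."""
    lines = text.splitlines()
    # Column names are unquoted strings, so the header is parsed without conversion.
    header = next(csv.reader(lines[:1], delimiter=delimiter))
    body = csv.reader(lines[1:], delimiter=delimiter, quoting=csv.QUOTE_NONNUMERIC)
    return header, list(body)


def split_data_column(name):
    """Split 'accession/assay name/sample name' into its three components."""
    accession, assay_name, sample_name = name.split("/")
    return accession, assay_name, sample_name


# Invented table: row s2 has a missing strain, row s3 has the literal string "NaN".
text = (
    "id.accession,id.sample name,study.characteristics.strain\n"
    '"OSD-0","s1","S288C"\n'
    '"OSD-0","s2",NaN\n'
    '"OSD-0","s3","NaN"\n'
)
header, rows = read_table(text)
missing = [isinstance(row[2], float) and math.isnan(row[2]) for row in rows]
print(header)
print(missing)  # [False, True, False]

parts = split_data_column(
    "OSD-48/OSD-48_molecular-cellular-imaging_microscopy_pannoramic scan (3d histech)"
    "/Mmus_C57-6J_LVR_GC_I_Rep1_M31"
)
print(parts)
```

Note that with this reader setting all unquoted numeric cells, including integers, are returned as floats.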