
Dataset


Provides access to chemical compounds and their features (e.g. structural, physico-chemical, biological and toxicological properties).

REST operations

Dataset

Get a list of available datasets
  • GET /dataset
  • Parameters: query parameters (optional, to be defined by service providers)
  • Result: list of dataset URIs, or RDF for the metadata only
  • Status codes: 200, 404, 503

Get a dataset
  • GET /dataset/{id}
  • Parameters: none
  • Result: representation of the dataset in a supported MIME type
  • Status codes: 200, 404, 503

Query a dataset
  • GET /dataset/{id}
  • Parameters: compound_uris[] and/or feature_uris[] to select compounds and features; further query parameters may be defined by service providers
  • Result: representation of the query result in a supported MIME type
  • Status codes: 200, 404, 503

Get metadata for a dataset
  • GET /dataset/{id}/metadata
  • Parameters: none
  • Result: representation of the dataset metadata in a supported MIME type
  • Status codes: 200, 404, 503

Get a list of all compounds in a dataset
  • GET /dataset/{id}/compounds
  • Parameters: none
  • Result: list of compound URIs
  • Status codes: 200, 404, 503

Get a list of all features in a dataset
  • GET /dataset/{id}/features
  • Parameters: none
  • Result: RDF, or a list of feature URIs (pointing to feature definitions/ontologies)
  • Status codes: 200, 404, 503

Create a new dataset
  • POST /dataset
  • Parameters: dataset representation in a supported MIME type, to be specified via the Content-type header:
      • Content-type: application/x-www-form-urlencoded - the dataset_uri, feature_uris[] and compound_uris[] parameters specify a subset of a dataset, as in the GET operation
      • File upload via Content-type: multipart/form-data - file parameter
      • File upload metadata: parameters as in opentox.owl
  • Result: new URI /dataset/{id}, or a redirect to a task URI (for large uploads)
  • Status codes: 200, 202, 400, 503

Update a dataset
  • PUT /dataset/{id}
  • Parameters: dataset representation in a supported MIME type; entries for existing compound/feature pairs are overwritten, entries for new compounds/features are added:
      • Content-type: application/x-www-form-urlencoded - the dataset_uri, feature_uris[] and compound_uris[] parameters specify a subset of a dataset, as in the GET operation
      • File upload via Content-type: multipart/form-data - file parameter
      • File upload metadata: Dublin Core annotation parameters, as in opentox.owl#Dataset
  • Result: dataset URI or task URI
  • Status codes: 200, 202, 400, 404, 503

Remove a dataset
  • DELETE /dataset/{id}
  • Parameters: none
  • Result: none
  • Status codes: 200, 404, 503

Remove a part of the dataset
  • DELETE /dataset/{id}
  • Parameters: compound_uris[] and/or feature_uris[]; further query parameters may be defined to select the data to be deleted
  • Result: none
  • Status codes: 200, 404, 503
  • NOTE: HTTP DELETE does not allow a request body (at least in Restlet), so this functionality can only be implemented with compound_uris[] and feature_uris[] as query parameters, which may result in a long URL. How should partial delete be redesigned?
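The PUT semantics above (overwrite existing compound/feature pairs, add new ones, leave the rest untouched) can be sketched with an in-memory stand-in for a dataset; this is only an illustration of the merge behaviour, not part of the API itself:

```python
# A dataset modeled as {(compound_uri, feature_uri): value} - an illustrative
# in-memory stand-in for a dataset service, not an OpenTox data structure.
dataset = {
    ("c1", "f1"): 0.5,
    ("c1", "f2"): "active",
}

update = {
    ("c1", "f1"): 0.7,   # existing compound/feature pair: value is overwritten
    ("c2", "f1"): 0.1,   # new compound: entry is added
}

# PUT semantics per the table above: overwrite existing pairs, add new
# entries; untouched entries are preserved.
dataset.update(update)
print(dataset)
```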

Dataset representation

RDF specification

Metadata

  • RDF dataset representation (Dublin Core properties only)
  • File upload metadata via Content-type: multipart/form-data: parameters as in opentox.owl (to verify: whether fully qualified dc: properties can be used as parameter names)

Features

RDF dataset representation


Supported MIME types:

Mandatory:

  • application/rdf+xml (default)
  • application/x-www-form-urlencoded
  • multipart/form-data

Optional:

  • other RDF serialization formats
  • application/xml
  • text/xml
  • text/x-yaml
  • text/x-json
  • application/json
  • text/csv
  • text/arff
  • text/html
  • chemical/x-mdl-sdfile
  • ...
  • multipart/form-data for upload (open issue: the name of the file upload field still needs to be fixed)

HTTP status codes

  • 200 OK - success
  • 202 Accepted - asynchronous task started
  • 400 Bad Request - incorrect MIME type
  • 404 Not Found - dataset not found
  • 503 Service Unavailable - service not available

Queries

Subsets of a dataset (e.g. all data for a certain feature, or all data for a set of compounds) are accessed through query parameters. This allows full URIs to be passed as parameters and circumvents the problem of non-unique IDs (e.g. for /dataset/{id}/compound/{compound_id} URIs). The query parameters compound_uris[] and feature_uris[] are mandatory; more advanced queries (e.g. similarity searches) may be implemented by individual services.
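A query URL for such a subset can be assembled as follows; this is a minimal sketch, and the service and compound/feature URIs are placeholders, not real endpoints:

```python
from urllib.parse import urlencode

# Placeholder URIs - my_dataset_service etc. are illustrative, not real hosts.
dataset = "http://my_dataset_service/dataset/1"
params = [
    ("compound_uris[]", "http://my_compound_service/compound/7"),
    ("feature_uris[]", "http://my_feature_service/feature/42"),
]

# urlencode percent-encodes the full URIs so they can safely be passed as
# query-parameter values (':' and '/' are reserved characters in queries).
query_url = f"{dataset}?{urlencode(params)}"
print(query_url)
```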

Examples:

Get all features of two compounds
curl -X GET "http://my_dataset_service/dataset_id?compound_uris[]=compound1_uri&compound_uris[]=compound2_uri"
Get a single feature of a single compound
curl -X GET "http://my_dataset_service/dataset_id?compound_uris[]=compound_uri&feature_uris[]=feature_uri"
Remove a compound from a dataset
curl -X DELETE "http://my_dataset_service/dataset_id?compound_uris[]=compound_uri"
Upload an SDF file to the AMBIT server
 curl -X POST -H 'Content-Type:chemical/x-mdl-sdfile' --data-binary @filename.sdf http://ambit.uni-plovdiv.bg:8080/ambit2/dataset
Get compound URIs of a dataset
 curl -X GET -H 'Accept:text/uri-list' http://ambit.uni-plovdiv.bg:8080/ambit2/dataset/dataset_id
Combined with a little RDF processing, queries can also be used for set operations (e.g. subset, split, merge, intersection).

Note: take care to URL-encode parameters that are sent via GET.
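The set operations mentioned above can be performed directly on the compound URI lists that the service returns as text/uri-list (one URI per line, '#' starting a comment line, per RFC 2483). A small sketch with made-up payloads:

```python
# Parse a text/uri-list body: one URI per line, '#' lines are comments
# (RFC 2483). The sample payloads below are made up for illustration.
def parse_uri_list(body: str) -> set:
    return {
        line.strip()
        for line in body.splitlines()
        if line.strip() and not line.startswith("#")
    }

dataset_a = parse_uri_list("""# compounds of dataset A
http://example.org/compound/1
http://example.org/compound/2
http://example.org/compound/3
""")
dataset_b = parse_uri_list("""http://example.org/compound/2
http://example.org/compound/4
""")

# Set operations on the compound URI lists: intersection and merge.
common = dataset_a & dataset_b
merged = dataset_a | dataset_b
print(len(common), len(merged))
```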

 

Proposal

Introduce a copy/clone operation on datasets.

Feature ontologies

The feature URI points to a Feature object, which can be retrieved as RDF and provides information about the name, units, source and type of the feature. The feature type is denoted by a mandatory link to an ontology, either via owl:sameAs or by directly subclassing a class from that ontology.

 

This allows a Feature URI to point either directly to an existing (fixed) ontology, or to a web service providing access to dynamically created Feature objects.

 

Conformers

Conformer URIs (see the Compound API) can be used instead of compound URIs. Resolving the parent structure should be done via the compound web service.

Efficient creation of datasets for validation purposes

Posted by Martin Gütlein at Sep 28, 2009 02:06 PM
I propose to
* remove the split, merge and subset dataset-options
* add the following commands:

desc: copy a dataset while excluding compounds of the orig dataset
method: POST
uri: /dataset/{i}/copy
params: exclude_compounds (comma-separated list of compound-ids)
return: uri of new dataset

desc: copy a dataset while including compounds of the orig dataset
method: POST
uri: /dataset/{i}/copy
params: include_compounds (comma-separated list of compound-ids)
return: uri of new dataset

The old split and merge functions have the disadvantage that each dataset service has to provide these functions with exactly the same functionality.
The new copy functions allow efficient creation of test and training datasets (you do not have to add/remove each compound on its own), and ensure that the dataset splits have to be implemented only once (by the validation component) and are easy to reproduce.

Efficient creation of datasets for validation purposes

Posted by Martin Gütlein at Sep 28, 2009 02:06 PM
As discussed with Nina on Friday, the changes I suggested have two shortcomings:
* it should be possible to create a new dataset by copying sets of compounds from various (more than one) other datasets
* instead of using compound ids, one should use a list of compound URIs

Therefore my new suggestion would be to simply extend the already existing add and remove POST commands to accept a list of compound URIs instead of just one. It should further be possible to 'add' a complete dataset (the parameter is a dataset URI; all compounds of this dataset are added).

This would still allow, for example, the creation of a test dataset with only a few HTTP requests, independent of the number of compounds.
The disadvantage is that a lot of redundant information (the URI prefix) is transferred if all compounds have the same location.

Any comments?
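As an illustration only (this endpoint and payload are Martin's proposal, not part of the published API), the suggested list-of-URIs request body could be composed as a text/uri-list, which also makes the redundant-prefix overhead he mentions easy to quantify:

```python
# Sketch of a text/uri-list body for the proposed extended 'add' command,
# one compound URI per line. The URIs below are placeholders.
compound_uris = [
    "http://example.org/compound/1",
    "http://example.org/compound/2",
    "http://example.org/compound/3",
]
body = "\n".join(compound_uris) + "\n"

# The redundancy Martin mentions: the shared prefix dominates the payload
# when all compounds live on the same service.
shared_prefix = "http://example.org/compound/"
redundant_fraction = len(shared_prefix) * len(compound_uris) / len(body)
print(round(redundant_fraction, 2))
```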

Efficient creation of datasets for validation purposes

Posted by Helma Christoph at Sep 30, 2009 03:12 PM
Do you think that would give you a significant performance benefit over getting the complete dataset and re-inserting the splits? You will have the same number of database reads/inserts (even more, in fact, because you have to find the features for the re-inserted compounds), but you may save a little on the transfer of features (which you might need for y-scrambling anyway).

In the long term we might need set operations not only for validation, but also e.g. for the aggregation of datasets of different origins, the definition of composite endpoints, ... I am not sure whether they should belong to the dataset API or to the algorithm API.

Efficient creation of datasets for validation purposes

Posted by Martin Gütlein at Oct 05, 2009 04:54 PM
> Do you think, that would give you a significant performance benefit [...]

I would guess so, but that may depend on the connection between the client and the database service. I think we can test it later; I don't think it's too much effort to implement this.

Efficient creation of datasets for validation purposes

Posted by Jeliazkova Nina at Oct 01, 2009 05:29 PM
Martin,

In your suggestion to extend the already existing add and remove POST-commands to accept a list of compound URIs instead of just one, what do you think should be the preferred format for the list - uri-list, xml (as in dataset schema), or something else?

Efficient creation of datasets for validation purposes

Posted by Martin Gütlein at Oct 05, 2009 04:56 PM
I think we should use the same default output format as when requesting a set of compounds from the service via GET. I would prefer simple URI lists to XML.

Dataset Features ?

Posted by Surajit Ray at Sep 19, 2010 08:34 PM
In our workflow we seem to require the ability to create a feature for a dataset as a whole. Can this be accommodated?

For example, when we generate an MCSS dictionary for a particular set of compounds, it cannot be called a compound feature. The feature is valid only for that particular dataset, not for the individual compounds.