# hpdeapi


# 1. Review of Recent Activities

• Finished URI Template document in October 2015
• Started discussion of Bobby's Recommended file and data collection naming practices document as a follow-up. The motivation is that, if these practices are followed, the collections can be described by a URI Template. Todd noted that this document has similarities to http://spase-group.org/docs/ (which, and in what sense?). Action item: What next for this document?
• Bob gave a presentation on serving a "pile of files" through the TSDS API.
• This particular data set was very challenging (as documented in link)
• We discussed follow-up of trying to specify configuration information that TSDS needed in SPASE. Todd set up a sandbox for testing.
• There was a suggestion that we standardize an API for serving data. Although TSDS can easily serve data through its interface using data from a diverse set of APIs and file collections, the idea is that some data providers may want to serve data using their own code instead of TSDS. This led to meetings on APIs.
• Jon gave a presentation on his prototype data server http://datashop.elasticbeanstalk.com
• Decided on extension to Heliophysics Event List Format, "Plain Text Data Format" for ASCII output and associated metadata. Draft v1

# 2. HPDE API

A discussion of a Standardized Data Access API for Heliophysics Time Series Data

Objective: an API that a data provider must implement to allow computer-to-computer communication. Use-case: Someone like Jeremy who has developed the Das2 server and wants to allow data to be accessed using a more standard API than the one they developed 10 years ago.

For data providers that provide an API with the capabilities given below, TSDS can easily connect to it and provide all of the extra "bells and whistles". TSDS has already done the work of making data from existing services available via an API like the one below (but making each connection required custom code).

The metadata and data will follow an extension to the Heliophysics Event List Format, named "Plain Text Data Format": Draft v1

1. List all datasets available: TSDS Example | CDAWeb Example | Jon V Example | Das2 Example (datasetName list in same format as 2. and 3., possibly with a description. Discuss next week how to allow one parser for 1-3.)
2. List metadata for a given dataset (must include maximum time range across all variables and variable names) including all variables in a dataset and metadata for a given variable (must include time range if it differs from that in the dataset). Next extension is SPASE ID for more information. Another extension is to allow query for only metadata for a variable name (more granularity). Should allow constraint for variable names so that we can have same header as in 3.
3. List data for a set of variables (default all) in a common format ASCII.
• Binary will be optional. Option for noheader=true. Binary always has noheader.

Item 2. is same as Item 3. with a request for all parameters and start = stop.

Thus far, the primary agreement has been that the spec should require data to be served in an ASCII format. There was a lot of discussion about which standard to use to allow more efficient communication, including Protocol Buffers, OpenDAP, and a simple binary format. See the following.

## 2.1. Binary

(Bob) I was able to implement a Protocol Buffer demo in about 20 minutes: https://github.com/tsds/tsds2/tree/gh-pages/proto. I looked for OpenDAP clients for MATLAB and IDL and found broken links. Both OpenDAP and Protocol buffers require a schema definition, e.g.,

I suggest that we start the discussion with a simple binary format (one that can be described in a few sentences) and rule that out first if needed. We ruled out the simple binary format because it would have difficulty handling strings: a format of "all doubles", or of 24 chars for ISO time + doubles, could not handle them. We ruled out CDF as it is not a streaming format. Bob will look into OpenDAP + Protocol Buffers more and test how reading into IDL/MATLAB would work.

It was suggested that the ASCII format could be that of the Event List. Action item: document agreement on ASCII format. Are two time stamps per record required?

## 2.2. Telecon Notes

### 2.2.1. 2016-04-19

Attendees: Todd King, Nand Lal, Bob Weigel, Jeremy Faden, Chris Piker, Bernie Harris, Aaron Roberts, Bobby Candey, Jon Vandegriff

Topics

Implementations of the API are not yet ready for presentation.

Chris Piker would like to see a spec to see if he can implement it.

We also had a discussion about how to make our server findable in Google. We can eventually support an informative landing page (with enough metadata for Google to chew on), but for now, we are focusing on making a service.

Question about identifiers: can the service API be hierarchical?

Answer: The identifiers used by this service should be the "native" ones, i.e., the ones that the data provider is using. If these names already have slashes in them, then those slashes need to be escaped in the calls to the data request API, since the slashes in the API calls indicate specific parts of the REST style request.

So if a dataset had a name like: cassini/mag/szs then requests related to this dataset would look like this:

http://server.com/datasets/cassini\/mag\/szs/filelist
http://server.com/datasets/cassini\/mag\/szs/datastream?start=2004-182&end=2004-185


Alternatively, the service could translate the identifier into something without the slashes, like this:

http://server.com/datasets/cassini_mag_szs/filelist


(Internally, the server would then need to know how to translate this back into what it wants to use to get to the data.)

The point is that to users of the service, there is no hierarchy of data, just a list of identifiers.
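The two identifier strategies above can be sketched as follows. This is only an illustration, assuming the `/datasets/<id>/<endpoint>` layout and the `start`/`end` query parameters shown in the example URLs; the function names and the lookup table are hypothetical and not part of any agreed spec.

```python
from urllib.parse import urlencode

BASE = "http://server.com/datasets"  # placeholder host from the examples above


def escape_id(native_id):
    """Backslash-escape slashes so they are not read as REST path separators."""
    return native_id.replace("/", "\\/")


def flatten_id(native_id):
    """Alternative: replace slashes with underscores (server must map back)."""
    return native_id.replace("/", "_")


def request_url(dataset_id, endpoint, **params):
    """Build a request URL; endpoint is 'filelist' or 'datastream'."""
    url = f"{BASE}/{dataset_id}/{endpoint}"
    if params:
        url += "?" + urlencode(params)
    return url


# A server using flattened IDs needs a lookup table to recover the native ID.
# (Assumption: blindly reversing "_" -> "/" is unsafe, because native IDs
# such as GEOTAIL_MAG may already contain underscores.)
native_ids = ["cassini/mag/szs"]
flat_to_native = {flatten_id(n): n for n in native_ids}

print(request_url(escape_id("cassini/mag/szs"), "datastream",
                  start="2004-182", end="2004-185"))
print(request_url(flatten_id("cassini/mag/szs"), "filelist"))
```

Either way, the client only ever sees a flat identifier; the lookup table is one possible way for the server to get back to its native name.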

A SPASE record can indicate an access mechanism for the data, and there it would provide the "native" identifier that is needed to invoke the service for the relevant dataset. The SPASE record also has a SPASE ID for a dataset, but this is a regularized identifier created to conform with the SPASE standards, so it may not be what the provider uses for the data.

Question about what to use for the action verb in the service API.

We settled on these names for the file listing and data streaming endpoints. (These endpoints are verbs or nouns, depending on how you view it. The official RESTful approach is to think of them as nouns, although every implementor must think about them as verbs.)

server.com/datasets/GEOTAIL_MAG/filelist
server.com/datasets/GEOTAIL_MAG/datastream


Discussion about the connection of SPASE to this service API:

Aaron will send a short email with: a definitive set of SPASE criteria that implementors of this service will need. Response below.

We seem to be close to having all the required elements for access in the SPASE descriptions. Specifically, we have:

• AccessURL This includes the Product Key, but note that the typical URLs listed here are to web pages that provide data access rather than to machines that provide a service. The syntax will be basically the same, but the system response will not.
• ProductKey This is under AccessURL, but is the name of the product for the provider at this URL, i.e., what is needed in the web service call.
• ProviderResourceName While this sounds like the ProductKey, it is a one liner used by the provider to refer to the resource; useful for display, but not needed in our access scheme.

Parameter elements:

• Name This should be informative, but is not part of the access
• Description Again, useful to identify the variable, but not part of access
• ParameterKey Essential for accessing a subset of the variables in the product
• Units In our access scheme, essential
• FillValue Certainly really useful
• Structure Mostly not essential for the simple task we have defined, with “Size” being the one subelement that could be most useful
• CoordinateSystem Usually part of the Name and/or Description; certainly required to understand the variable.

The one thing we are missing altogether in SPASE is “Template” that will tell the system how to parse file names to get, e.g., time ranges. We have to add this.

I have proposed to SPASE that Parameter Type (not actually a term) be optional since it is often redundant with the Name/Description, it is not essential for access, and it makes it hard to do parameter descriptions.

Still some things to work in detail, but there is a path to success.

Bob's TODO:

• Finish Example w/ Weygand dataset.
• Extend items 2 & 3 to include examples
• Report on Das2 issues when trying to connect to TSDS
• Find out about interfacing TDAS with TSDS

### 2.2.2. 2016-04-05

attendees: Aaron Roberts, Bernie Harris, Bob Weigel, Jeremy Faden, Bobby Candey, Todd King, Nand Lal, Jon Vandegriff

A. Plain Text Data Format Specification:

1. allow multiple vectors or spectrograms in each file
2. find a suitable units standard to refer to; one suggestion is to have two parts to the units standard with the first part indicating a REQUIRED structure to the relationships between units quantities and the second being a suggested list of what to use for the actual units quantities;

Possible units standards: UDUNITS from Unidata or some SI units system

Action item: for Todd: provide another update with these suggestions and send it around

B. Data Server Request API

Collapse the 5 request steps down to these 3:

1. list datasets
2. show metadata for one dataset
3. get data for one dataset

These contain only the information needed to frame a valid request. I.e., there is no support for data discovery.

The header information for the dataset in #3 has the same format as the dataset info returned in item #2. In other words, only use one format for metadata from this server. Even the list of datasets in item #1 can be in the same format, so that only one parser is needed.

Action item: for Bob and Jon - Make an implementation at Amazon and at TSDS based on our existing standard.

### 2.2.3. 2016-03-08

Data Server API Telecon

Bob Weigel, Jeremy Faden, Jon Vandegriff, Bobby Candey, Bernie Harris, Aaron Roberts, Todd King, Nand Lal

For an ASCII data description, we should use something like the event-list format: http://spase-group.org/docs/conventions/HDMC-Event-List-Specification-v1.0.pdf

The data server output will initially be fixed (no choice of delimiter), but the format it uses should be expressible in terms of the modified event list format (it will be that format with some of the choices fixed).

Use some ideas from the data description mechanism Bob has: http://tsds.org/dd but use "Records" and "Fields" as much as possible (instead of "Channels" and "DEPEND_1", etc.). Use "bin" instead of channel.

action: Todd: Modify the event list format to make it suitable for describing CSV output data from the data server API; items to do include:

• take out second time column;
• add FieldBinMin, FieldBinMax, FieldBinCenter (for DEPEND_1 values)

action: everyone: improve file naming conventions document

1. emphasize what is essential for automatic interpretation versus nice-to-haves (like extensions)
2. clean up the part about versioning - mention the ones that work well with file templates
3. add comment about uniform hierarchy based on time
4. mention about concrete resolving service for naming conventions

discussion about binary format

(This is more stream of consciousness, with conclusions at the end.)

• OPeNDAP - clients not fully supported
• Google ProtoBuf - well supported, but no native support for IDL or Matlab - would need Java bridge
• custom, simple binary blobs - easy to interpret; we write native readers in all client languages
• issues with this: same folding issues as with flat ASCII; different data types: try to support just a minimal few (times, doubles, strings); hmm... strings - fixed width?
• one option for time: encode the time as a 24-byte ASCII string (what about nanoseconds?); compromise: go to nanoseconds .... lots of plumbing .... might not be worth it ...
• one issue with OPeNDAP - it needs to know
• conclusion for now: probably not worth it for us to develop this on our own - too much plumbing and we are not plumbers

suggestion: allow output of CDF, maybe at a later date; issue: it's not a streaming format

action: Bob

• look at the Java interface for OPeNDAP and/or ProtoBuf to see if we could use this to interface this with IDL and Matlab

#### 2.2.3.1. Bob's Follow-Up

Our requirements for the binary option for data output from an HPDE API are

1. Easy for person creating a server to create the stream
2. Easy to read in a script
3. Fast
4. A certain amount of flexibility
5. Streamable

I am basing my opinions on reading and experimenting with some of the following

All of the above tell me that there is nothing that satisfies (1) and (2). To prove this to yourself, (a) try reading any of the binary formats into MATLAB/IDL/Python. When you are done, imagine the maintenance that would be required as users upgrade versions or as new library versions are created and (b) try to write a program that converts an ASCII table to any of the suggested formats.

Based on requirement (5), OpenDAP/NetCDF and CDF are not an option.

On the last telecon, we noted that the "simple binary" option may not work because some of the columns may be strings. I think we can deal with this while still keeping things simple.

Suppose that we have a file containing columns of

Time, instrumentID, Bx, By, Bz


e.g.,

2015-01-01T00:00:00.000000, A, 9.0, 10.0, 11.0
2015-01-01T00:00:00.000000, B, 9.0, 10.0, 11.0


The following information would be needed in the ASCII header (we are discussing how to express this in the header of the spec for the ASCII format that is a generalization of the HPDE Event List):

units       null    null            nT      nT      nT
labels      Time    instrumentID    Bx      By      Bz


To create a binary version of this, we would need one bit of additional information, columnTypes (Todd is working on how to actually express this in the header of the spec for the ASCII format that is a generalization of the HPDE Event List; question - one could express this information using SPASE. Why are we defining a free-text metadata format? Answer: Because event list is a subset of existing standard. We are generalizing.):

columnTypes char32  char1           double  double  double
units       null    null            nT      nT      nT
labels      Time    instrumentID    Bx      By      Bz
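To show why the columnTypes row is enough to build a binary record, here is a minimal sketch using Python's `struct` module. The mapping from `charN` to N bytes of null-padded ASCII, from `double` to an 8-byte IEEE float, and the little-endian byte order are all assumptions made for this illustration; none of them has been agreed on.

```python
import re
import struct

column_types = ["char32", "char1", "double", "double", "double"]


def struct_format(types, byte_order="<"):
    """Translate columnTypes entries into a struct format string.

    Assumption: 'charN' is N bytes of ASCII and 'double' is a float64;
    little-endian is an arbitrary choice for this sketch.
    """
    parts = []
    for t in types:
        m = re.fullmatch(r"char(\d+)", t)
        if m:
            parts.append(f"{m.group(1)}s")
        elif t == "double":
            parts.append("d")
        else:
            raise ValueError(f"unsupported column type: {t}")
    return byte_order + "".join(parts)


def pack_row(fmt, types, line):
    """Pack one comma-separated ASCII record into a fixed-width binary record."""
    fields = [f.strip() for f in line.split(",")]
    values = [float(f) if t == "double" else f.encode("ascii")
              for t, f in zip(types, fields)]
    return struct.pack(fmt, *values)  # 's' fields are null-padded to width


def unpack_row(fmt, types, blob):
    """Inverse of pack_row: strip the null padding from the char columns."""
    values = struct.unpack(fmt, blob)
    return [v if t == "double" else v.rstrip(b"\0").decode("ascii")
            for t, v in zip(types, values)]


fmt = struct_format(column_types)
record = pack_row(fmt, column_types,
                  "2015-01-01T00:00:00.000000, A, 9.0, 10.0, 11.0")
print(fmt, len(record))
print(unpack_row(fmt, column_types, record))
```

Because every record has the same width (57 bytes here), a reader can consume the stream record by record without any schema beyond the header rows.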


To handle spectrograms, we need additional information in the header:

Suppose that we have a file containing columns of

Time, instrumentID, Bx, By, Bz, Px, Py, Pz


e.g.,

2015-01-01T00:00:00.000000, A, 9.0, 10.0, 11.0, 5.0, 10.0, 11.0
2015-01-01T00:00:00.000000, B, 9.0, 10.0, 11.0, 5.0, 10.0, 11.0


where Px, Py, and Pz are the powers in a frequency bin centered on 3.0 Hz. The following information would be needed in the ASCII header

Difficult to read

groups=[3-5],{6-8}
groupNames=B,P


Better, easy to read and about equally easy to parse

B:[Bx,By,Bz]
P:{P1,P2,P3}

vector B:Bx,By,Bz
spectrum P:P1,P2,P3

binValue    null    null            null    null    null    3.0     3.0     3.0
binUnits    null    null            null    null    null    Hz      Hz      Hz
units       null    null            nT      nT      nT      nT      nT      nT
labels      Time    instrumentID    Bx      By      Bz      P1      P2      P3


To create a binary version of this, we would need one bit of additional information, columnTypes:

columnTypes char32  char1           double  double  double  double  double  double
binValue    null    null            null    null    null    3.0     5.0     7.0
binUnits    null    null            null    null    null    Hz      Hz      Hz
units       null    null            nT      nT      nT      nT      nT      nT
labels      Time    instrumentID    Bx      By      Bz      P1      P2      P3
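A header laid out as keyword rows like the one above is straightforward to parse. The sketch below assumes that rows are whitespace-separated, that no value contains a space, and that `null` means "not applicable"; these are assumptions for illustration, since the header syntax is still under discussion.

```python
header_text = """\
columnTypes char32  char1           double  double  double  double  double  double
binValue    null    null            null    null    null    3.0     5.0     7.0
binUnits    null    null            null    null    null    Hz      Hz      Hz
units       null    null            nT      nT      nT      nT      nT      nT
labels      Time    instrumentID    Bx      By      Bz      P1      P2      P3
"""


def parse_header(text):
    """Turn keyword rows into one dict per column; 'null' becomes None."""
    rows = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, *values = line.split()
        rows[key] = [None if v == "null" else v for v in values]
    widths = {len(v) for v in rows.values()}
    if len(widths) != 1:
        raise ValueError("header rows have different column counts")
    n = widths.pop()
    return [{key: rows[key][i] for key in rows} for i in range(n)]


columns = parse_header(header_text)
# Columns that carry a binValue are spectrum bins; collect their centers.
bins = {c["labels"]: float(c["binValue"]) for c in columns if c["binValue"]}
print(bins)
```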


### 2.2.4. 2016-02-09

The overall point of the data server API discussion is to get a group of people to agree on a common API for serving relatively simple time series data, including up to something like 2D spectrograms.

For most of these items below, we had a large amount of agreement.

1. Minimize the pain of data providers in implementing a new service over their data.

The interface should be very bare bones in terms of actual requirements. It only requires that you return CSV, with an optional binary format as well. We don't want to require that binary be supported, but we want to have a binary format that people can use if efficiency is an issue. People will think about extending the service to make it more efficient, and we should provide guidance on that, at least in terms of where we will be enhancing it, especially in terms of adding binary streaming support.

2. Provide a uniformly formatted ISO8601 time value as the first column; do not offer other time formats.

This keeps the server as simple as possible. Fractional year is prone to inaccuracies for leap years and years with leap seconds. When accessing the data through a client program, there can still be the option to change the time format to something else, but this happens in the client software. This puts a larger responsibility on the people who write client software to provide more useful options, but it keeps the work required of data providers to a minimum.
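As an illustration of doing time conversion on the client side, the sketch below parses the ISO 8601 form used in the examples on this page and converts it to two other representations. Treating the time stamps as UTC and the choice of output formats are assumptions for this example.

```python
from datetime import datetime, timezone


def parse_iso(stamp):
    """Parse the ISO 8601 form used in this page's examples, treated as UTC."""
    return datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S.%f").replace(
        tzinfo=timezone.utc)


def to_unix_seconds(t):
    """Seconds since 1970-01-01T00:00:00 UTC (leap seconds not counted)."""
    return t.timestamp()


def to_year_doy(t):
    """Year and day-of-year, e.g. for requests written like start=2004-182."""
    return f"{t.year:04d}-{t.timetuple().tm_yday:03d}"


t = parse_iso("2015-01-01T00:00:00.000000")
print(to_unix_seconds(t))
print(to_year_doy(t))
```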

3. In terms of meta-data, only provide what is needed to use the data values.

The minimum metadata required for each parameter includes:

- a name
- units
- FILL value; if none is specified, then there is a default FILL value for that type
- optional: flag indicating this parameter is part of a group (it's one channel in a spectrogram)
- a type?
- a precision?


Bobby recommended to use the ISTP FILL value: -1e31

4. Use an existing standard for the metadata format.

The best option would be some kind of machine-readable format that is also readable by humans. YAML was suggested. I do not want to use YAML since it is too hard for humans to generate, if that is ever needed.

XML is not very human readable. JSON seems like a good balance: since our metadata will not be very highly dimensional, the JSON representation would still be fairly human readable.
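To make the JSON option concrete, here is a sketch of what a minimal metadata record might look like and how a client could use it to mask fill values. The field names (`parameters`, `name`, `units`, `fill`) are hypothetical, and using -1e31 as the default fill for floating-point columns is an assumption based on the ISTP recommendation above.

```python
import json
import math

# Hypothetical metadata record; field names are illustrative only.
metadata_json = """
{
  "parameters": [
    {"name": "Time", "units": null},
    {"name": "Bx", "units": "nT", "fill": -1e31},
    {"name": "By", "units": "nT"}
  ]
}
"""

DEFAULT_FILL = -1e31  # ISTP fill value, assumed here as the default for doubles


def mask_fill(row, parameters):
    """Replace fill values in numeric columns with NaN; leave Time as-is."""
    out = [row[0]]
    for value, p in zip(row[1:], parameters[1:]):
        fill = p.get("fill", DEFAULT_FILL)
        x = float(value)
        out.append(math.nan if x == fill else x)
    return out


params = json.loads(metadata_json)["parameters"]
row = ["2015-01-01T00:00:00.000000", "-1e31", "10.0"]
print(mask_fill(row, params))
```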

What about using the OPeNDAP variable description language? It's very simple and would provide a natural extension to using binary OPeNDAP data transport for larger data volumes.

5. For ASCII CSV output, unroll higher dimensional variables into multiple columns. 2D or higher data can still be represented in ASCII if you create multiple variables with a standard indexing mechanism in their names. If you have a CDF that has a variable that is a proton spectrum with 8 channels in a variable named "pspec", then this turns into 8 columns:

pspec_0
pspec_1
pspec_2
...
pspec_7


Potential issue: more than 10 channels cause the names to sort strangely unless the indices are zero padded. We require that this index have enough leading zeros to sort properly, so if you had 12 channels, the server should name them properly so that they still follow simple sorting rules:

pspec_00
pspec_01
pspec_02
...
pspec_09
pspec_10
pspec_11


Clients need to be smart to re-roll the data back into the correct spectrogram.
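The naming rule and the client-side re-rolling can be sketched as below. The helper names are hypothetical, and the sketch assumes that unrolled columns are identified purely by the `<name>_<index>` pattern shown above.

```python
def unrolled_names(name, n_channels):
    """Column names with enough leading zeros to sort correctly."""
    width = len(str(n_channels - 1))
    return [f"{name}_{i:0{width}d}" for i in range(n_channels)]


def reroll(labels, row, name):
    """Collect the values of name_<index> columns, in index order."""
    prefix = name + "_"
    picked = [(int(label[len(prefix):]), value)
              for label, value in zip(labels, row)
              if label.startswith(prefix) and label[len(prefix):].isdigit()]
    return [value for _, value in sorted(picked)]


names = unrolled_names("pspec", 12)
print(names[:3], names[-1])

labels = ["Time"] + names
row = ["2015-01-01T00:00:00.000000"] + [float(i) for i in range(12)]
print(reroll(labels, row, "pspec"))
```

With 8 channels the same function yields `pspec_0` through `pspec_7`, so the padding only appears when it is needed for sorting.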

For spectrograms where the energy (or DEPEND_1) values change, these are really different channels and should have different labels.

6. Allow for comment characters or some easy way to skip the first line or indicate that it is a header line (prefix with # character).
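A reader that honors a `#` comment prefix needs only a few lines. The sketch below separates header lines from data rows; the content of the sample header line is made up for illustration, since the header syntax had not been settled.

```python
import csv
import io


def split_stream(text, comment="#"):
    """Separate comment-prefixed header lines from CSV data rows."""
    header, data = [], []
    for line in io.StringIO(text):
        line = line.rstrip("\n")
        if line.startswith(comment):
            header.append(line[len(comment):].strip())
        elif line.strip():
            data.append(line)
    rows = [[field.strip() for field in rec] for rec in csv.reader(data)]
    return header, rows


# Hypothetical response: one '#' header line followed by two records.
sample = """\
# labels Time, instrumentID, Bx, By, Bz
2015-01-01T00:00:00.000000, A, 9.0, 10.0, 11.0
2015-01-01T00:00:00.000000, B, 9.0, 10.0, 11.0
"""
header, rows = split_stream(sample)
print(header)
print(rows[0])
```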

Other items we did not talk about:

a. other required metadata values: mb per day, format strings for parameters

b. what binary format to use for binary streaming

c. how to specify the DEPEND_1 ranges for unrolled data?

### 2.2.5. 2016-02-02

Below is a summary of some of the points we talked about last week.

Some details about the API that we need to figure out:

• the exact structure of the query API
• metadata content and format - what to require for each dataset and how to present it in a machine readable way
• how to capture what we know as DEPEND_1 info (bin ranges for the second dimension, which we decided should be unrolled in the CSV files)

One thing to consider is to just use the existing OPeNDAP API and be fully OPeNDAP compliant. We can do this even if the servers we write only ever return one of the possible OPeNDAP types, namely the sequence data type, which is best for timeseries data. There are some OPeNDAP things that we will have to consider carefully, like their constraint expressions, which we will only support for the time variable. But I think not supporting these for all variables is still OK.

Then all the servers we create could be used by existing OPeNDAP clients, and we could add binary data distribution later in a natural way. I'm reading up on OPeNDAP and trying to get a feel for whether this would work. If we could tap into this existing standard mechanism, I think we would not have to re-decide everything in terms of how to represent the metadata and the binary streaming of data.
