# presentations/2016-SPASE-January

Note that links that start with http://localhost:8004 will not work. Over the next day or so after a TSDS server update, I'll replace http://localhost:8004 with http://tsds.org/get/ in this document.

# 1. Definitions

• TSDS - Time Series Data Server - A "back-end" data server. Handles concatenation of granules (both files in directories and requests from data services). Uses Autoplot file processor libraries and custom wrapper code written in node.js. Handles things like optimizing the delivery speed of data and minimizing duplicate or redundant requests to servers it pulls data from.
• TSDSFE - Time Series Data Server Front-End - Takes an input of catalog, dataset, and timerange; allows visualization in Autoplot and ViViz; and creates ~10-line "fill my array" scripts in IDL/Python/MATLAB. Also monitors the back-end servers that it needs and reports on problems. Implemented in node.js (essentially JavaScript plus OS interface functions).
• TSDS Catalog - Contains information needed by TSDSFE to do everything (uses URI Templates and a THREDDS-like syntax). Can be exported as XML or JSON.
• TSDS DD - TSDS Dataset Description - Used for communication between TSDSFE and Autoplot. Also a simple way of creating a TSDS Catalog from a single URL string.

# 2. Overview of Demo

Use the Weygand Bow Shock Data Base to show the process of

• using a DD to create a TSDSFE Catalog for a subset of the database; and
• using a TSDSFE Catalog

to enable all of the value-added services that TSDSFE connects. A second goal is to document the SPASE and non-SPASE metadata issues encountered.

The database is composed of ASCII files (granules) that span one month and have records with a cadence of 60 seconds.

Preview of things we can do given a TSDSFE Catalog for the Weygand Bow Shock Data Base:

• Download one or more parameters in a dataset into an IDL, MATLAB, or Python data structure; view instructions for viewing the dataset in Autoplot. (TODO: Add fill value variable and add code to download README with links to metadata in scripts; check IDL script.)
• View thumbnail overview plots or page-able full-size plots of the entire dataset (one day per plot). (TODO: This is the wrong dataset. Fix link.)
• View and export data for a given parameter in a dataset with a single click in Autoplot; import a list of bookmarks for all datasets into Autoplot
• View a web-based plot of a given parameter in a specified timerange as PNG or SVG, or download a PDF
• View the URLs used to fulfill a data request
• View the numbers used to generate the plot of a given parameter in a web page or ASCII file
• Explore other things that TSDSFE can do with this dataset

# 3. Steps

These are the approximate steps that were taken to create the TSDS Catalog.

## 3.1. Preliminary

1. Search for weygand bow shock data set in Google
2. Inspect first link, which is a SPASE record
3. Find reference to
http://vmo.igpp.ucla.edu/data1/Weygand/PropagatedSolarWindGSE/weimer/Wind/TAP
4. Inspect the link above and find
http://vmo.igpp.ucla.edu/data1/Weygand/PropagatedSolarWindGSE/weimer/Wind/TAP/V3/
(note that the V3 directory was not mentioned in the SPASE record).
5. Look for the start/stop time in the SPASE record and attempt to verify it by inspecting the data directory. The directory has data through 2014-09, but the SPASE record has End = 2010-06-30T23:59:00.000. This is why I advocate the autogeneration and nightly updating of SPASE records for details like this. At one point in time, the End date was correct. But once the record is written, updating it requires manual inspection by a human, which probably won't occur after the grant expires.

After some initial testing, I decided to start not with the above dataset but with the Geotail/mag dataset in the database, which has a SPASE record that I found through a Google search.

With the directory link for Geotail/mag, I now have enough information to serve this data through TSDSFE. I am going to do this using three approaches:

1. By telling TSDSFE the minimal amount of information it needs to serve the numbers without useful metadata from the dataset.
2. By telling TSDSFE the minimal amount of information it needs to allow it to create plots with appropriate labels.
3. By writing a TSDS catalog with the full information about all of the data sets in the Weygand database (in this document, I only describe the catalog for four of the datasets; the others have either not been tested or not been written).

Then, I'll discuss how a hypothetical SPASE service could have been used to do 3. above instead.

## 3.2. 1.

Based on inspection of the dataset directory, I decide that the URI template is

http://vmo.igpp.ucla.edu/data1/Weygand/PropagatedSolarWindGSM/weimer/Geotail/mag/$Y/geotailmagP$Y$m.dat

I could enter this in Autoplot and it would give me a GUI for selecting the columns that I want to plot, and I could plot it over an arbitrary time range. (I would need to manually determine the column labels as the data files do not contain a header.) All that I need is to use this DD string:

uri=http://vmo.igpp.ucla.edu/data1/Weygand/PropagatedSolarWindGSM/weimer/Geotail/mag/$Y/geotailmagP$Y$m.dat


with the TSDSFE service by appending the DD string to http://localhost:8004/# and entering the following into a browser (link does not work for reasons described below)

http://localhost:8004/#uri=http://vmo.igpp.ucla.edu/data1/Weygand/PropagatedSolarWindGSM/weimer/Geotail/mag/$Y/geotailmagP$Y$m.dat

Internally, if this had worked, TSDSFE would have created a TSDS Catalog based on this information by inspecting the directory tree starting at

http://vmo.igpp.ucla.edu/data1/Weygand/PropagatedSolarWindGSM/weimer/Geotail/mag/

to find the start/end time of the granules and the timeformat for each granule by checking whether the first few columns look like a sensible timeformat. (In reality, one would want to specify the timeformat in the string, but for quick tests the capability of omitting it is useful.)

The error occurs because the timeformat changes partway through the dataset: from the first file (1993-10) through the 2007-05 file, it is $Y $m $d $H $M $S; starting with the 2007-06 file, it is $d $m $Y $H $M $S. For this reason, I will need to treat weimer/Geotail/mag as two datasets instead of one. In addition, instead of automatically determining the start/end time of the datasets (to future-proof against time coverage expansion or contraction), I will hard-wire the stop time for the first part of the full dataset; the start time will be determined by code inspecting the directory as before.
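
The mixed timeformats can be detected by trying each candidate format against the leading columns of a record. The following is a minimal sketch in Python, not the TSDSFE implementation. It assumes the time occupies the first six whitespace-separated columns; the function name `detect_time_format` and the 1993 sample record are hypothetical, while the 2014 sample is the record quoted in this document. Python's `strptime` uses `%` codes in place of the `$` codes of URI Templates.

```python
from datetime import datetime

# Candidate timeformats: URI Template notation -> Python strptime notation.
CANDIDATES = {
    "$Y $m $d $H $M $S": "%Y %m %d %H %M %S",
    "$d $m $Y $H $M $S": "%d %m %Y %H %M %S",
}

def detect_time_format(record, n_time_columns=6):
    """Return the first candidate format that parses the leading columns."""
    fields = record.split()[:n_time_columns]
    if len(fields) < n_time_columns:
        return None
    # Drop fractional seconds; strptime's %S accepts at most two digits.
    fields[-1] = fields[-1].split(".")[0]
    text = " ".join(fields)
    for label, fmt in CANDIDATES.items():
        try:
            datetime.strptime(text, fmt)
            return label
        except ValueError:
            continue
    return None

# Record quoted in this document (file 2014-12).
print(detect_time_format("30 12 2014 15 28 00.000 5.175e+00"))
# Hypothetical record in the older layout.
print(detect_time_format("1993 10 01 04 00 00.000 1.0e+00"))
```

Running the check on the first record of every granule, rather than only the first granule, would reveal the point at which the format changes.
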

Note also that the SPASE record has

 <TemporalDescription>
   <TimeSpan>
     <StartDate>1992-09-01T00:00:00.000</StartDate>
     <StopDate>2010-07-31T23:59:00.000</StopDate>
     <Note>
       Time format in data files is: Day Month Year Hour Minute Second (DD MM YYYY HH MM SS.SSS)
     </Note>
   </TimeSpan>
 </TemporalDescription>

A few notes:

• TSDSFE found that the StartDate of the dataset is 1993-10-01T04:00:00.000, whereas the SPASE record claims 1992-09-01T00:00:00.000.
• TSDSFE found that the StopDate of the dataset is 2015-01-01T04:59:00.000Z, whereas the SPASE record claims 2010-07-31T23:59:00.000.
• The SPASE record has a note that says the timeformat is DD MM YYYY HH MM SS.SSS. This is Java-style date notation, whereas the URI Templates document uses Unix date notation (%d %m %Y %H %M %S). We should decide on using only one notation for SPASE records.
• I found the SPASE record by searching Weygand/PropagatedSolarWindGSM/weimer/Geotail spase in Google. It could be that a more correct version exists but is not indexed by Google.
• The dataset README has different variable names than the SPASE record. I personally think it is important to allow a mapping from the original data provider's variable naming to the one used in SPASE. I think it is impudent for a SPASE author to override an author's original naming of a parameter without noting that he did so. I see this often in SPASE records where the SPASE author is different from the dataset author. Here, strangely, the creator of the dataset and original metadata is also the author of the SPASE record.

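
The comparison between the SPASE time span and the coverage found in the data directory could be automated. The following is a minimal sketch, assuming the TemporalDescription fragment quoted above has already been extracted from the record (a full SPASE record declares an XML namespace, which this fragment omits). The function names are hypothetical, and the observed times are the ones reported by TSDSFE above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

FRAGMENT = """
<TemporalDescription>
  <TimeSpan>
    <StartDate>1992-09-01T00:00:00.000</StartDate>
    <StopDate>2010-07-31T23:59:00.000</StopDate>
  </TimeSpan>
</TemporalDescription>
"""

def parse_time(text):
    """Parse an ISO 8601 time with milliseconds and an optional trailing Z."""
    return datetime.strptime(text.rstrip("Z"), "%Y-%m-%dT%H:%M:%S.%f")

def time_span_mismatch(fragment, observed_start, observed_stop):
    """Return (observed - SPASE) differences for the start and stop times."""
    span = ET.fromstring(fragment).find("TimeSpan")
    spase_start = parse_time(span.findtext("StartDate"))
    spase_stop = parse_time(span.findtext("StopDate"))
    return (parse_time(observed_start) - spase_start,
            parse_time(observed_stop) - spase_stop)

start_diff, stop_diff = time_span_mismatch(
    FRAGMENT, "1993-10-01T04:00:00.000", "2015-01-01T04:59:00.000Z")
print("start differs by", start_diff.days, "days")
print("stop differs by", stop_diff.days, "days")
```

A nightly job that runs a check like this and rewrites StartDate and StopDate is one way the autogeneration argued for earlier could work.
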

Because of the start/end time issue described above, the DD string needs to be constrained to be:

uri=http://vmo.igpp.ucla.edu/data1/Weygand/PropagatedSolarWindGSM/weimer/Geotail/mag/$Y/geotailmagP$Y$m.dat&timeFormat&start=2007-06-01&timeFormat=$d $m $Y $H $M $S.$(millis)

I append this string to http://localhost:8004/# and enter the following into a browser:

http://localhost:8004/#uri=http://vmo.igpp.ucla.edu/data1/Weygand/PropagatedSolarWindGSM/weimer/Geotail/mag/$Y/geotailmagP$Y$m.dat&timeFormat&start=2007-06-01&timeFormat=$d $m $Y $H $M $S.$(millis)

When the above URL was entered in the browser, it sent the query string to the TSDS server, which created a TSDS catalog. To view the catalog that was created and that the GUI is now working with, select the link named "Catalog configuration" near the bottom of the page in the Catalog section. Note that the catalog created has one dataset with id=1 and one parameter with id=1 (it assumed the first non-time column was the only parameter and gave it a name of 1; in principle it could detect all columns).

The above DD is the minimal amount of information that I needed to start serving numbers for one of the datasets in the database (for non-ASCII files, a bit more information would be needed; more on that in another demonstration).

Next, I selected an output format of ascii-0 from the GUI and verified that the first non-fill line returned,

30 12 2014 15 28 00.000 5.175e+00

is found in the source file, which I located by selecting the urilist link in the GUI. That link shows the list of URLs used to fulfill a request over a time range. There is only one file in this case, and I CTRL+F searched it for the timestamp 30 12 2014 15 28.

## 3.3. 2.

I considered the dataset README and the SPASE record to find parameter names. I decided to use those in the SPASE record (Bx-GSM, By-GSM, Bz-GSM, x-GSM, y-GSM, z-GSM) because they do not have spaces, and variable IDs in a TSDS Catalog may not contain spaces and certain other characters. In the future, I'll use the README variable names as the variable names in the TSDS Catalog to avoid being impudent.

Next, I append additional information to the previous DD string:

&columns=7,8,9,10,11,12&columnIDs=Bx-GSM,By-GSM,Bz-GSM,x-GSM,y-GSM,z-GSM&columnUnits=nT,nT,nT,Re,Re,Re&columnFills=1e34&catalogLabel=Weygand SW Propagation Data Set&datasetID=Weygand/PropagatedSolarWind

and finally append this to the URL used in #1. to create

http://localhost:8004/#uri=http://vmo.igpp.ucla.edu/data1/Weygand/PropagatedSolarWindGSM/weimer/Geotail/mag/$Y/geotailmagP$Y$m.dat&timeFormat&start=2007-06-01&timeFormat=$d+$m+$Y+$H+$M+$S.$(millis)&columns=7,8,9,10,11,12&columnIDs=Bx-GSM,By-GSM,Bz-GSM,x-GSM,y-GSM,z-GSM&columnUnits=nT,nT,nT,Re,Re,Re&columnFills=1e34&catalogLabel=Weygand+SW+Propagation+Data+Set&datasetID=Weygand/PropagatedSolarWind


and use it to form requests for data. Of course, this URL is cumbersome and we have still only described one dataset in the database with this method. This "DD" method is useful for quick checks of a single data set and follows the style used by Autoplot in adding modifiers to a URL in order to add additional information about the dataset so that a scientifically sensible plot can be made.

The DD method is also useful as a starting point for a full TSDS Catalog: the dataset node in the TSDS catalog that TSDSFE generates internally from the DD can be used as a template.
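
Because the URL is cumbersome to write by hand, it could be assembled from a list of DD parameters. The following is a minimal sketch using Python's standard `urllib.parse`. The function name `dd_url` is hypothetical, the parameter names are those used in the URL above, and leaving `$`, `/`, `:`, `,`, `(` and `)` unescaped is an assumption that mirrors the hand-written URL rather than documented TSDSFE behavior. The sketch covers only parameters that carry a value.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8004/#"

def dd_url(params, base=BASE):
    """Join DD parameters and append them to the TSDSFE address."""
    # Spaces become "+"; the characters listed in `safe` are left as-is
    # (an assumption, chosen to match the hand-written URL).
    return base + urlencode(params, safe="$/:,()")

params = [
    ("uri", "http://vmo.igpp.ucla.edu/data1/Weygand/PropagatedSolarWindGSM/"
            "weimer/Geotail/mag/$Y/geotailmagP$Y$m.dat"),
    ("start", "2007-06-01"),
    ("timeFormat", "$d $m $Y $H $M $S.$(millis)"),
    ("columns", "7,8,9,10,11,12"),
    ("columnIDs", "Bx-GSM,By-GSM,Bz-GSM,x-GSM,y-GSM,z-GSM"),
    ("columnUnits", "nT,nT,nT,Re,Re,Re"),
    ("columnFills", "1e34"),
    ("catalogLabel", "Weygand SW Propagation Data Set"),
    ("datasetID", "Weygand/PropagatedSolarWind"),
]

print(dd_url(params))
```

Keeping the parameters in a list also makes it straightforward to generate one DD per dataset directory by changing only the `uri` and `datasetID` entries.
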

## 3.4. 3.

To begin work on the full database, I started inspecting http://vmo.igpp.ucla.edu/data1/Weygand/, which has subdirectories of

ProcessedSolarWindGSE/
ProcessedSolarWindGSM/
PropagatedSolarWindGSE/
PropagatedSolarWindGSM/


I'll start with the last one, which contains the dataset used previously and has a directory structure of

 PropagatedSolarWindGSM/
   parallel/
     ACE/
     Geotail/
     IMP8/
     ISEE1/
     ISEE2/
     ISEE3/
     Wind/
   weimer/
     Geotail/
       mag/
       mag_cpi/
       plasma/
       plasma_cpi/
     Wind/
       mag/
       mag_swe/
       plasma/
       swe/
     ace/
     imp8/
     isee1/
     isee2/
     ...


Note the inconsistent use of capitalization for S/C names.

NB: It would be very helpful if the READMEs in these directories pointed to the SPASE records and vice-versa.

DDs are used to generate a baseline TSDS catalog, and the baseline catalog is then enhanced with a small amount of code (this is not always the method used for generating a TSDS Catalog, but it was in this case).

The basic approach was to write a DD for each bottom-level directory using the method in #2., combine all of the dataset nodes into a single catalog, and then add additional dataset documentation information in a post-processing step. Below are the DDs that were used by code for all of the datasets under weimer/Geotail/ along with links to other documentation that was found by inspecting the directory tree or doing Google searches:

 weimer/
   Geotail/
     mag/
       - header4dat
       - (found) SPASE: http://vmo.nasa.gov/mission/metadata/VMO/NumericalData/Weygand/Geotail/MGF/Processed/GSM/PT60S.xml
     mag_cpi/
       - No header4dat file (see note 1).
       - (could not find) SPASE.
     plasma/
       - header4dat
       - (could not find) SPASE.
     plasma_cpi/
       - No header4dat
       - (found) SPASE (see note 2): http://vmo.nasa.gov/mission/metadata/VMO/NumericalData/Weygand/Geotail/CPI/Propagated.CPI/GSM/PT60S.xml

Note 1: Found parameters by looking at variable names in the MATLAB binaries posted alongside the ASCII files. Parameter names are the same as in mag/; I am not sure what the difference is. Has the same problem with the change in timeformat as mag/.

Note 2: Contains a dead link to http://vmo.igpp.ucla.edu/data1/Weygand/PropagatedSolarWindGSM/weimer/Geotail/cpi_cpi (it seems the link should end in plasma_cpi).

Based on inspection of the SPASE records, I decided it would be easier (or required) to

• write code to generate the TSDS Catalog rather than code to extract information from SPASE records; and
• write additional code/metadata to supplement what is found in a SPASE record (and code to verify that what is in a SPASE record is consistent with what is found in the directories).

The TSDS catalog has pointers to the URL of the definitive SPASE record and to the header4dat files; the user will need to combine the information as needed to learn about the dataset.

The catalog containing the datasets under weimer/Geotail is located at [1], which is linked to in the Catalog Information section of [2].

## 3.5. 4.

I would query a (hypothetical) service, e.g.,

http://spase.org/?catalog=spase://VMO/NumericalData/Weygand/


or

http://spase.org/?database=spase://VMO/NumericalData/Weygand/


and get a list of datasets under Weygand, such as

spase://VMO/NumericalData/Weygand/Geotail/GSM/mag
spase://VMO/NumericalData/Weygand/Geotail/GSM/mag_cpi
spase://VMO/NumericalData/Weygand/Geotail/GSM/plasma
spase://VMO/NumericalData/Weygand/Geotail/GSM/plasma_cpi
spase://VMO/NumericalData/Weygand/Wind/GSM/mag
spase://VMO/NumericalData/Weygand/Wind/GSM/mag_cpi
spase://VMO/NumericalData/Weygand/Wind/GSM/plasma
spase://VMO/NumericalData/Weygand/Wind/GSM/swe
...


and within each of the above SPASE records would be a URI Template.

I could then use this SPASE service to do everything in #3. above provided that the SPASE record had structure information as in the SPASE record considered here (which is rare).

Note that by default I would ignore the start/stop times in the SPASE records because datasets often expand long before the SPASE record is updated.

This procedure applies to datasets obtainable from files; the service case is a bit more complicated but do-able.
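
Given a URI Template from such a record, the list of granule URLs for a time range follows by substituting the year and month. The following is a minimal sketch, not the TSDS implementation. It assumes monthly granules and handles only the `$Y` and `$m` fields used in this document; the function name `expand_monthly` is hypothetical, and whether each URL exists on the server is not checked.

```python
def expand_monthly(template, start, stop):
    """Expand $Y and $m for each month from start to stop, inclusive.

    start and stop are (year, month) tuples.
    """
    urls = []
    year, month = start
    while (year, month) <= stop:
        url = template.replace("$Y", f"{year:04d}").replace("$m", f"{month:02d}")
        urls.append(url)
        # Advance one month, rolling over into the next year after December.
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return urls

TEMPLATE = ("http://vmo.igpp.ucla.edu/data1/Weygand/PropagatedSolarWindGSM/"
            "weimer/Geotail/mag/$Y/geotailmagP$Y$m.dat")

for url in expand_monthly(TEMPLATE, (2014, 11), (2015, 1)):
    print(url)
```

Because the stop time is an argument rather than a value read from the SPASE record, it can come from a directory inspection, consistent with ignoring the record's start/stop times by default.
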

# 4. Discussion

Major SPASE- and metadata-related issues encountered (some are issues, some are problems):

• SPASE records have incorrect start/stop times for some datasets.
• Granules have timeformats and fill values that vary within some datasets. SPASE record does not indicate this; README found in dataset directory indicates timeformat should be same for all granules in dataset.
• Directory naming convention for granules varies and is inconsistent with SPASE IDs.
• Lack of a pointer in dataset READMEs to SPASE record and vice-versa.
• Some READMEs are missing. (As discussed in this document, I was not able to determine if any SPASE records are missing.)
• Variable naming convention in some dataset READMEs is different from that used in SPASE record.
• Fill values not documented in SPASE record but appear in READMEs when they exist.

None of the SPASE-related problems is surprising. The job of writing a SPASE record is typically considered complete if the record passes schema validation tests. However, it appears that no tests are performed to validate other things needed for automated processing as attempted here, and no checks are made for potential points of confusion for human consumers of the SPASE record who click on some of the links within it.

I am not sure how many people prefer the README vs. the SPASE record. Based on experience, I usually go directly to the README when using data, as I know that the transformation of information in a README to that in a SPASE record can often be lossy.

# 5. Core Activities?

I think that there are the following core activities:

1. Finalize the TSDS + TSDSFE reference implementation.
2. Autoplot maintenance of core parts used by TSDS/TSDSFE (file readers and plot servlet at minimum).
3. Development and finalization of how to map to/from TSDS Catalog to SPASE. Writing SPASE records that can be easily used by TSDS using the procedure in #4.; writing code to auto-update and auto-generate SPASE records.
4. Development of a workflow for organizing and communicating with all of the data providers who maintain servers or services used by TSDS + TSDSFE (could be informal, similar to how Jeremy has monthly meetings with the CDAWeb group). For example, if people started using the Weygand dataset through TSDSFE, we would need coordination/communication with Todd if he made changes to the dataset or if we encountered inconsistencies that should be addressed.
5. Finalization of the DD specification. This would probably take about as much effort as URI Template specification. I am not sure how relevant this is for SPASE. It is useful for TSDSFE communication with Autoplot, but I am not sure where else it would be used.