Recommended file and data collection naming practices

From TSDS

Version 1.1

Original: 2015 November 13 by Robert Candey

Revised: 2016 February 16 by Robert Candey

Contents

  1. Summary
  2. Directory Hierarchy Conventions
  3. Instrument Directory Naming Conventions
  4. Data Collection Directory Naming Conventions
  5. Data File Naming Conventions
    1. Field descriptions used in filenaming and data collection naming
  6. Appendix: 00README.TXT template

1. Summary

Use data file and directory names that express the information researchers need to locate data collections of interest and to tell collections apart (both current and possible future ones). The following are some hard-won practices for naming your datasets and files and for adding information to a dataset readme file (hopefully without being too legalistic). All data files should be made publicly available in FTP or HTTP directories for easy access, even if they are also available through other services and databases. FTP directories have the advantage of allowing easy download (using mget) of whole directories. Each instrument team should strive to capture the full scientific content of their instrument in a self-describing scientific format (CDF, netCDF, HDF, FITS) with full documentation and metadata, sufficient for someone in the far future to make full use of its scientific value.

The following sections provide recommendations for laying out the directory hierarchy, naming the dataset or collection, naming the data files, and finally creating a Readme file for the collection. In general, use all lowercase text, except for specific files and subdirectories, and with a limited character set, to ensure maximum compatibility across computer platforms and processing languages. Use times in one of the ISO 8601 formats, such as 2016-02-15T05:03:57.123Z, 20160215T050357.123Z, or 2016046T050357.123Z. Use well-known extensions, spacecraft and instrument short names, and datatypes where available.
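
The three ISO 8601 spellings mentioned above can be produced from a single timestamp. The following is a minimal sketch using only the Python standard library; the function name iso_variants and the dictionary keys are illustrative assumptions, not part of these recommendations.

```python
from datetime import datetime, timezone

def iso_variants(t):
    """Return three ISO 8601 spellings of a timezone-aware time, in UTC."""
    t = t.astimezone(timezone.utc)
    # Millisecond precision, zero padded to three digits.
    ms = f"{t.microsecond // 1000:03d}"
    return {
        # Extended calendar form: 2016-02-15T05:03:57.123Z
        "extended": t.strftime("%Y-%m-%dT%H:%M:%S") + f".{ms}Z",
        # Basic calendar form: 20160215T050357.123Z
        "basic": t.strftime("%Y%m%dT%H%M%S") + f".{ms}Z",
        # Basic ordinal form (%j is the zero-padded day of year).
        "ordinal": t.strftime("%Y%jT%H%M%S") + f".{ms}Z",
    }

example = iso_variants(
    datetime(2016, 2, 15, 5, 3, 57, 123000, tzinfo=timezone.utc)
)
for label, text in example.items():
    print(label, text)
```

The basic forms (without hyphens and colons) are the ones suited to filenames, since colons are not valid on all file systems.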

2. Directory Hierarchy Conventions

Directory hierarchy should flow from high level to specific: project/mission/spacecraft, instrument, data collections, time ranges. Time range directories (yearly, monthly, daily) should be chosen to keep the number of files per directory below 1000 to avoid delays in directory display.

For instance, a data file in the DE-1 magnetometer high resolution dataset on SPDF's archive is at http://spdf.gsfc.nasa.gov/pub/data/de/de1/magnetic_fields_maga/gms_62ms_vmsbin/1981/1981258_de1_maga_gms_62ms.vmsbin
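
The rule for choosing time range directories can be sketched as a small helper. This is only an illustration under simple assumptions: the function name choose_time_directory is hypothetical, the file rate is taken to be roughly constant, and only the yearly, monthly, and daily levels named above are considered.

```python
def choose_time_directory(files_per_day, limit=1000):
    """Pick the coarsest time range directory that stays below `limit` files.

    Worst-case lengths are used: 366 days per year and 31 days per month.
    """
    if files_per_day * 366 < limit:
        return "yearly"
    if files_per_day * 31 < limit:
        return "monthly"
    # Daily is the finest level considered here, even if it exceeds the limit.
    return "daily"

for rate in (1, 24, 1440):
    print(rate, "files/day ->", choose_time_directory(rate))
```

In the DE-1 example above, one file per day fits comfortably in yearly directories such as 1981/.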

3. Instrument Directory Naming Conventions

Instrument directories can be named with the instrument acronym (adding the institution if needed) but, where possible, should be expanded to include the instrument type for users not familiar with the specific spacecraft, for example waves_pwi or particles_epact. SPASE Measurement Types are recommended. Datasets from multiple instruments can be named with combinations, such as magnetic_electric_fields, or with "combined" or "merged" if they involve a number of instruments. Composite/combined/merged data collections are best stored in directories at the instrument level (de/de2/combined_magnetic_electric_fields/ and de/de2/combined_plasma_neutrals/). Data collections for housekeeping, engineering, ephemeris, orbits, attitude, or combinations of these go at the same level as the instruments. Directories should use written-out words and underscores, and avoid non-obvious abbreviations where possible.

Include a readme file (perhaps named 00README.TXT, see below) with brief explanations of the data collections and their relationships. A 00README.TXT may be placed in every directory for navigation and content identification (perhaps aliased to the top-level one). Connect data collections across several instruments with links in the 00README.TXT files and web pages. Combined data collections at the higher level can have placeholder directories under each instrument involved, each containing a 00README.TXT that points to the actual data directory. Instrument directories contain data collection directories and may also include other directories, such as SOFTWARE, DOCUMENTS, CATALOGS, and ATTRIBUTES (upper case to distinguish them from data directories and to push them to the top of directory listings). Archived web sites may be stored under the DOCUMENTS directory, perhaps with the name WEBSITE. The name 00README.TXT was selected so that it will generally appear first in directory listings.

4. Data Collection Directory Naming Conventions

Data collection names should include information on parameters, temporal and spatial resolution, compression, and file format, with enough specificity to allow other variations later (so that data collections differing in only one of these can be added later), using the order:

project_instrument_parameters_resolution_format_compression

These fields or sections are separated by underscores, and parts of a section can use hyphens to improve readability. Use English words where meaningful, except for short forms listed below.
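
As an illustration, a data collection name can be assembled and checked mechanically. The sketch below is one possible reading of the rules above, not a definitive implementation: the function name collection_name, the validation pattern, and the example field values are all assumptions. Periods are accepted inside a field only so that a compression such as tar.gz can be expressed.

```python
import re

# One field: lowercase letters and digits, with parts optionally joined by
# hyphens (or periods, e.g. "tar.gz"). Underscores are reserved as separators.
_FIELD = re.compile(r"[a-z0-9]+(?:[-.][a-z0-9]+)*")

def collection_name(project, instrument, parameters, resolution,
                    file_format, compression=None):
    """Join the fields in the recommended order, separated by underscores."""
    fields = [project, instrument, parameters, resolution, file_format]
    if compression:
        fields.append(compression)
    for field in fields:
        if not _FIELD.fullmatch(field):
            raise ValueError(f"invalid field for a collection name: {field!r}")
    return "_".join(fields)

# Hypothetical field values, chosen only to show the ordering.
print(collection_name("de1", "maga", "dc-avg", "62ms", "cdf", "gz"))
```

Rejecting underscores inside a field keeps the name unambiguous to split back into its parts later.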

5. Data File Naming Conventions

Filenames contain the data collection name and add time and version information and file extension, such as

project_instrument_dataform_time_version.fileformat

Alternatively, the ISTP format uses a slightly different ordering:

project_dataform_instrument_time_version.fileformat

Filenames must carry the project and instrument for uniqueness and clarity. Use only alphabetic characters, numbers, hyphens, underscores, and periods, so that the names are valid on all common file systems. Use all lowercase filenames, except for the uppercase 'T' between date and time.
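
The character and case rules above lend themselves to an automated check. The following is a minimal sketch; the function name filename_problems and the message wording are assumptions, and the test for the date/time 'T' (an uppercase T with a digit on each side) is a heuristic rather than a full time parser.

```python
import re

# Letters, digits, hyphens, underscores, and periods only.
_ALLOWED = re.compile(r"[A-Za-z0-9._-]+")

def filename_problems(name):
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []
    if not _ALLOWED.fullmatch(name):
        problems.append("characters outside letters, digits, '-', '_', '.'")
    # Ignore an uppercase 'T' that sits between two digits (date 'T' time),
    # then require everything that remains to be lowercase.
    without_t = re.sub(r"(?<=\d)T(?=\d)", "", name)
    if without_t != without_t.lower():
        problems.append("uppercase letters other than the date/time 'T'")
    return problems

# Hypothetical filenames used only to exercise the check.
for candidate in ("de1_maga_gms-62ms_19810915T000000_v01.cdf", "DE1 maga.cdf"):
    print(candidate, filename_problems(candidate))
```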

5.1. Field descriptions used in filenaming and data collection naming

project/source/mission/spacecraft
shortest string that clearly describes the project and distinguishes it from other spacecraft and projects.
(Dictionary: ISTP: "Source", SPASE: "Observatory")
instrument
instrument name; 'ALL' for all instruments of a project, "combined" or "merged" for subsets
(Dictionary: ISTP: "Descriptor", SPASE: "Instrument")
dataform/datatype
characteristics of the dataset that distinguish it from others (including plausible ones created later) using some meaningful combination of parameters, resolution, format, compression:
Parameters: AC vs DC, AVG for average, or ISTP K0, K1, H0, etc.
Resolution: temporal resolution using time codes: ms, min, s, hr, day, week, month; round off resolutions for varying resolutions; preferably note all time resolutions in the data collection. Examples: 500ms for 0.5sec resolution, 6s for 6sec, 5min, 2hr for 2 hour, 1day for daily
Format: ascii, cdf (include in data collection directory name, but data files use this as file extension)
Compression: zip, gz, tar.gz (include in data collection directory name, but data files use this as file extension)
(Dictionary: ISTP: "Datatype", SPASE: "ProviderResourceName", "ProviderProcessingLevel")
time
begin time (or begin and end time if required) in ISO 8601 https://en.wikipedia.org/wiki/ISO_8601 format ("T" between day and hour) and always 4-digit years. Time preferably uses MMDD rather than day of year (DDD) for consistency.
  • YYYYMMDDTHHMMSS (truncating where sensible)
  • YYYYDDDTHHMMSS (where DDD is day of year, with 001 = Jan 1)
  • YYYYqx (for quarter year: q1, q2, q3, q4)

Add: Ideally time string has fixed span, don't drop SS when they are zero?

Remove quarter?

version
reprocessing version; preferably uses a format of "vNN" where NN=01, 02, 03, etc. Some projects use a more complicated versioning scheme.
(Dictionary: ISTP: "Data Version", SPASE: "")
file format
(Dictionary: SPASE: "Format")
usually in the file extension, including these:
  • standard science format: ".cdf", ".hdf", ".fits"
  • ASCII data format: ".asc" (not ".txt" which is reserved for text descriptions)
  • binary data format: ".vmsbin", ".os2bin", ".idl", ".xdr", ".ieeebin", ".intelbin"? (reserve ".dat" for unknown or uncommon binary data files)
  • software: ".for", ".c", ".pro", ".class", ".pas", ".pl"
  • document text format: ".txt" (".doc" for MS Word files only!)
  • graphics format: ".gif", ".jpeg", ".ps", ".png", ".tiff"
  • appended compression/collection: ".gz", ".tar.gz", ".tar.Z", ".zip", ".sit" (".bin" is MacBinary format or WordPerfect)
(Dictionary: SPASE: "Encoding")
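
The resolution time codes listed above (500ms, 6s, 5min, 2hr, 1day) can be derived from a sampling interval in seconds. This is a rough sketch under stated assumptions: the function name resolution_code is hypothetical, the largest unit that divides the interval evenly is chosen, and "month" is left out because it has no fixed length in seconds.

```python
def resolution_code(seconds):
    """Convert a sampling interval in seconds to a short code such as '5min'."""
    units = [("week", 604800), ("day", 86400), ("hr", 3600), ("min", 60), ("s", 1)]
    for name, size in units:
        # Use the largest unit that divides the interval without remainder.
        if seconds >= size and seconds % size == 0:
            return f"{int(seconds // size)}{name}"
    # Anything else (sub-second or fractional seconds) is rounded to whole ms.
    return f"{round(seconds * 1000)}ms"

for interval in (0.5, 6, 300, 7200, 86400):
    print(interval, "->", resolution_code(interval))
```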

It is very important to distinguish text files from binary ones when the user is transferring them via FTP or the user wants to examine them. EBCDIC, BCD and 36bit file formats are discouraged.
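
Putting the field descriptions in this section together, a filename in the project_instrument_dataform_time_version.fileformat order might be generated as follows. This is a sketch only: the function name data_filename and the example field values are assumptions, and the time string is always written at full YYYYMMDDTHHMMSS length rather than truncated.

```python
from datetime import datetime

def data_filename(project, instrument, dataform, start, version, extension,
                  day_of_year=False):
    """Build project_instrument_dataform_time_version.fileformat."""
    # %j is the zero-padded day of year (001 = Jan 1); %m%d is month and day.
    date_format = "%Y%j" if day_of_year else "%Y%m%d"
    time_text = start.strftime(date_format + "T%H%M%S")
    # Version uses the recommended "vNN" form: v01, v02, ...
    return f"{project}_{instrument}_{dataform}_{time_text}_v{version:02d}.{extension}"

# Hypothetical field values; 1981-09-15 is day 258 of 1981.
start = datetime(1981, 9, 15)
print(data_filename("de1", "maga", "gms-62ms", start, 1, "cdf"))
print(data_filename("de1", "maga", "gms-62ms", start, 1, "cdf", day_of_year=True))
```

Keeping the time string at a fixed length means the files of a collection sort chronologically in a plain directory listing.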

6. Appendix: 00README.TXT template

The 00README.TXT file in each directory describes the directory contents and points to the directories below, back to higher directories, and to other information.

  • Keep it short (point to 00README_LONG.TXT when a longer file is required to contrast various instruments or datasets).
  • Use <> around URLs and "/" at end for directories.
  • Use URLs to point to software and documentation.
  • Users may come into the hierarchy at any level, so each 00README.TXT should stand alone and place the user in the hierarchy.

Example

Data Directory: (location of this file) Example: <http://spdf.gsfc.nasa.gov/pub/data/de/de2/magnetic_fields_magb/00readme.txt>

1-line title

Keywords: Short list of keywords for search engines Example: space physics, triaxial fluxgate magnetometer data, ionosphere, magnetosphere

short description of directory {up to four lines}

Subdirectories: one line for each sub-directory name and short descriptive title

Contact: {staff person name, phone, e-mail}

Additional related information and data services on NSSDC's DE-1 magnetometer Master Catalog: <http://nssdc.gsfc.nasa.gov/nmc/experimentDisplay.do?id=1981-070B-01> and the Heliophysics Data Portal <http://heliophysicsdata.gsfc.nasa.gov/websearch/dispatcher?action=RESULT_LIST_PANE_ACTION&command=ProductViewCmd&pid=1143>

Pointers to documentation and software

Please acknowledge "the NASA Space Physics Data Facility (SPDF)" for data usage.
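
For archives with many directories, the template above could be filled in by a script. The sketch below is only one possible rendering: the function name render_readme, its parameters, the exact line layout, and the placeholder contact and subdirectory values are assumptions, not a prescribed format.

```python
def render_readme(url, title, keywords, description, subdirectories,
                  contact, acknowledgement):
    """Render a short 00README.TXT body following the template above."""
    lines = [
        f"Data Directory: <{url}>",   # URLs are wrapped in <>
        "",
        title,
        "",
        "Keywords: " + ", ".join(keywords),
        "",
        description,
        "",
        "Subdirectories:",
    ]
    # One line per subdirectory; a trailing "/" marks a directory.
    lines += [f"  {name}/  {summary}" for name, summary in subdirectories]
    lines += ["", f"Contact: {contact}", "", acknowledgement]
    return "\n".join(lines) + "\n"

# Placeholder values for illustration only.
text = render_readme(
    url="http://spdf.gsfc.nasa.gov/pub/data/de/de2/magnetic_fields_magb/00readme.txt",
    title="1-line title",
    keywords=["space physics", "ionosphere", "magnetosphere"],
    description="short description of directory",
    subdirectories=[("example_subdirectory", "short descriptive title")],
    contact="{staff person name, phone, e-mail}",
    acknowledgement='Please acknowledge "the NASA Space Physics Data Facility (SPDF)" for data usage.',
)
print(text)
```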
