TSDS: A software and data package with Time Series Data Sets in a single file and metadata format, a set of programs for creating them from their original format, and a set of programs for reading, exporting, and exchanging the data.
Note: Versions of TSDS earlier than 1.0 may differ significantly in metadata structure and content, and small corrections may have been made to the data sets. For version 1.0, the data will be final (as described in the Data section) and the metadata will change only through additions or modifications to the README, Note, Warning, and Acknowledgements.

Introduction

This program TSDS and its associated data (see the FAQ section for a list) were developed to simplify the analysis and exchange of long time series data sets. Using the TSDS program, a data base of standard and often-used space-physics-related time series from many data sources is available using only a few commands. For example, the commands
>> DATA = ts_slice('NAME',[1995,1,1],[2001,12,31],'hour');
>> ts_export(DATA,'DATA.txt')
return an Nx1 matrix DATA with 1-hour values of the time series NAME in the 7-year interval (N=24*365*5+2*24*366) and write a text file named DATA.txt. If data are not available in any part of this interval, fill values are used. If data for NAME are not found on the user's disk, an attempt is made to automatically download the requested file. If NAME is only available as daily-averaged measurements, the daily values are repeated to put the measurements on a 1-hour time grid. The string NAME can specify a 1-D time series, for example, amplitude vs. L-shell. NAME can also be a list (a cell array of strings, e.g., {'NAME1', 'NAME2'}) of M time series; in this case DATA is an NxM matrix.
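The repetition rule described above (coarse-cadence values repeated onto a finer time grid, with a fill value where no source data exist) can be illustrated with a minimal Python sketch. This is not the TSDS implementation; the function name and fill convention are illustrative only:

```python
FILL = None  # illustrative fill value; TSDS uses its own flag values

def regrid_repeat(values, src_step_hours, dst_step_hours, n_dst):
    """Repeat coarse-cadence samples onto a finer uniform time grid.

    values: samples on a uniform grid with spacing src_step_hours
    n_dst:  number of points on the destination grid
    Slots past the end of the source data receive the fill value.
    """
    ratio = src_step_hours / dst_step_hours   # e.g. 24 for daily -> hourly
    out = []
    for i in range(n_dst):
        j = int(i // ratio)                   # coarse sample covering slot i
        out.append(values[j] if j < len(values) else FILL)
    return out

# Two daily values expanded onto a 1-hour grid spanning three days;
# the third day has no source data, so it is filled:
hourly = regrid_repeat([10.0, 20.0], 24, 1, 72)
```

Each daily value is repeated 24 times, and the final 24 hourly slots, which fall outside the source data, are set to the fill value.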

TSDS was developed in part to make large statistical analysis problems easier. I encourage you to use some of the time it saves you to verify that the transcription of data from the original source matches what is in TSDS. Many spot checks have been made to verify that TSDS data match the original data files. However, errors can occur if the original data files have nonuniformities. Many nonuniformities have been corrected and are noted in the comments of the .m and .pl files in the subdirectory ./make_fns.

TSDS would not be possible without data providers who provide easy access to their data sets and have generously given permission (or have been allowed to give us permission) for their data to be available through TSDS. We have made extended efforts to obtain permissions from the original data providers. It is important that you acknowledge the original data sources if you present or publish results based on their data; in many cases their continued funding depends in part on evidence that their data are actually used. The metadata of the TSDS files contain information about how the data sources prefer to be acknowledged, and when you call a data set this acknowledgment information is printed to the screen.

Getting Started

First, follow the download, install, and startup directions in the "Download" section. See the FAQ for a list of all data that is available through TSDS.

To extract a data set from the TSDS data base, only the time series name is required. To list all of the available time series (this may take a while), type

>> ts_list
To list only time series names that contain the string "FMI", type

>> ts_list('FMI')
(The directory contains subdirectories with names that correspond to the data source, e.g., "FMI" and "CDAWeb". Use these strings with ts_list to restrict the list to a given data source.)

To get all of the data associated with the time series named "OUJ X (FMI 1-min)", type

>> A = ts_slice('OUJ X (FMI 1-min)'); 
To get all of the metadata associated with the time series named "OUJ X (FMI 1-min)", type

>> A_info = ts_get('OUJ X (FMI 1-min)')
(The absence of a ";" at the end of the above command tells the interpreter to list the output to the screen.)

To view elements 20-100 of array A, type

>> A(20:100)
To save the array A to a text file, type
>> save -text A.txt A
To determine the location and size of this file, type
>> pwd ; ls -lh
To get data in a time interval (say, during the Halloween Storms), type

>> A = ts_slice('OUJ X (FMI 1-min)',[2003 10 29],[2003 11 1]); 
The demos ts_get_demo.m and ts_slice_demo.m in the directory time_series_fns can be modified with any text editor and executed from the Octave or Matlab command line.

Creating a Data Set

You can insert your own time series into your TSDS data set so that it can be called using the ts_* functions. To do this, edit the text files [tsds_create.m] and [tsds_create_meta.m]. After editing, type tsds_create at the command line.

Also see the demo program [ts_insert_demo.m] for a more advanced example. Type which ts_insert_demo to see the location of this file.

How TSDS data files are created

For all of the data available through TSDS, the original data and the code required to make the final data file are available. For examples of how to do this, see the subdirectories of the directory make_fns/. The programs in these subdirectories (1) auto-download the data from the data provider (using wget), (2) parse the data sets and put them into the TSDS format, (3) add metadata, and (4) create the final set of files that are read with calls to ts_list, ts_get, and ts_slice. To run some of these programs you will need Perl. They should work without problems on Unix systems where perl is located in /bin. On Windows you will need to find the location of a perl binary (usually one is distributed with Matlab) and edit the TSDS_PERL variable in TSDS_GLOBALS.m.
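The shape of such a make program (download if missing, then parse) can be sketched in Python. This is a sketch only, not one of the actual make_fns programs; the function names and the whitespace-delimited provider format are hypothetical:

```python
import os
import subprocess

def parse_provider_file(lines):
    """Parse a hypothetical provider text format: lines starting with '#'
    are comments; data lines are 'timestamp value'."""
    rows = []
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        t, v = line.split()[:2]
        rows.append((t, float(v)))
    return rows

def make_dataset(url, raw_path):
    """Sketch of the make pipeline shape: auto-download the raw file with
    wget if it is not already on disk, then parse it. The real TSDS make
    programs also add metadata and write the final data files."""
    if not os.path.exists(raw_path):
        subprocess.run(["wget", "-O", raw_path, url], check=True)
    with open(raw_path) as f:
        return parse_provider_file(f)
```

The actual TSDS make programs do the parsing step in Perl and Matlab/Octave; the point here is only the download-then-parse structure.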

Contributing/Sharing a Data Set

This program was made with the eventual goal of facilitating data sharing between scientists. Ideally there will be two ways to contribute data. Method 1 is best for data sets that are derived from files that are not readily available from an ftp or http server. Method 2 is best for data files that are available on an ftp or http server and for data sets that will need to be updated; this method is used for all data sets that are a part of the main TSDS distribution.

Method 0 (the current method) is simply to have the contributor put the files named TSDS_username_SourceTag.EXT (created by ts_insert) in the recipient's data directory. The recipient then executes ts_db('user',1) to update their data base.

Method 1

This functionality does not exist yet, but this is how I would prefer to exchange data. This method involves simply creating a data file as in the Creating a Data Set section. To exchange this new (preliminary) data with someone who has TSDS, the contributor first executes

>> ts_upload(NAME,FTP_SERVER_NAME) % (This function is not yet available)

which uploads the files TSDS_username_SourceTag.EXT and TSDS_username_tag-metadataonly.EXT to an ftp server. The string username is the last name of the contributor (you must edit the variable in TSDS_GLOBALS.m). The string SourceTag is the short string used in the ts_insert command. All users will have access to this file after executing

>> ts_update(FTP_SERVER_NAME) % (This function is not yet available)

FAQ

Q: What data are available through TSDS?
A: Over 4 GB (compressed) of time series data drawn from Augsburg, CDAWeb, COHOWeb, OMNIWeb, DMI WDC, FMI, ISGI, Kyoto, and WSO. Also, time series of space physics coordinate transform matrices, planetary ephemerides (JPL DE), and stock market indices (from Yahoo). A (short) list with names only [.txt] or the full (long) list of time series with metadata [.txt].

Q: Why is a 20 MB file downloaded if only 1 day of data is requested?
A: The data are stored in 20 MB file chunks. If you are running out of disk space, delete any large file in ~/.tsds/data (Unix) or tsds/data (Windows) that does not have the string "metadataonly" in its name. It will be re-downloaded as needed.

Q: What is the data file format?
A: Data are stored in Matlab binary format (V6 .mat), although this could be changed. The entire data set could be exported to HDF, H5, CDF, netCDF, or text; see the function ts_export and the discussion in the Data section. Originally I planned to put the data set in either HDF (.hdf), HDF 5 (.h5), CDF, or netCDF. These data formats have (1) essentially the same capabilities (for purposes relevant to TSDS), (2) varying levels of usability in different programs (for example, although Matlab has CDF support, the reader is terribly inefficient), and (3) varying levels of support on different platforms. I settled on Matlab V6 binary because it can be read by Octave, which is free (as in price) and is easily installed on the major operating systems.

Q: What language is TSDS written in?
A: The programs for reading and exporting the data are Matlab/Octave scripts. Knowledge of these programs is not required for using TSDS. The wget utility is used for downloading original data and Perl is used for parsing original data files in text format.

Q: What about updating the data?
A: In principle a user should be able to update the time series by executing the programs in the make_fns directory. However, I have not tested these programs for cross-platform and Matlab/Octave compatibility as much as the main TSDS programs, so their use may require some effort.

Download

TSDS requires Matlab or Octave (Free/Libre). Extensive experience with Matlab or Octave is not required. After installation, instructions and tutorial information are displayed to the screen. All path and configuration information is stored in the text file TSDS_GLOBALS.m; you should not need to change the parameters in this file. You can install TSDS into any directory; the subdirectory tsds-0.9.35/data (or ~/.tsds under Linux systems) will be used to store all of the downloaded data. If you are running out of disk space, delete any .mat file that does not have the string "metadataonly" in its name. It will be re-downloaded as needed.

Windows XP:

  • If you have Matlab 6.5 or greater, download the package [~5 MB zip], unzip, start Matlab, cd to the tsds-0.9.35/ directory and then type tsds at the Matlab command line; at this point you should see instructions printed to the screen.
  • If you do not have a Matlab license, first install Cygwin+Octave (both are free). Then, download the package [~1 MB zip], unzip, cd to the tsds-0.9.35/ directory, start Octave, and then type tsds at the Octave command line; at this point you should see instructions printed to the screen. (This combination should work, but is the least tested. It is quite likely that you will encounter problems such as missing dependencies or problems that are Octave-version dependent.)

Linux:

  • If you do not have Matlab, download the package [~10 MB zip], unzip, cd to the tsds-0.9.35/ directory, and execute ./tsds_octave (or PATH="$PATH:.";./tsds_octave) from the system command line. This script starts a slightly modified version of Octave 2.1.71 that is included in the zip file you download.
  • If you have Matlab 6.5 or greater, download the package [~1 MB zip], unzip, start Matlab, cd to the tsds-0.9.35/ directory and then type tsds at the Matlab command line; at this point you should see instructions printed to the screen.

Source code: [~1 MB zip]

Multi-user installation (not recommended): If you are a system administrator for a multi-user Linux system, you can install tsds into a system directory (e.g., /usr/local), but downloaded data will be stored in each user's home directory. You may want to have each user's ~/.tsds/data directory linked symbolically to a common directory.

The program tsds_test_files will try to download all of the TSDS data files and load each time series into memory.

Data Sources

TSDS was created to address several issues we have encountered when attempting to do statistical analyses that require time series from data providers:

  1. Many data providers have a web interface that is only useful for short time intervals.
  2. Time series data are stored in numerous formats, including HDF, H5, netCDF, CDF, WDC, and IAGA2000, along with many text formats. This can make basic exploration, idea testing, cross validation, and comparison time consuming and difficult.
  3. There are many valuable time series data sets, in even more file formats, that are not available from a data center. These data sets are typically posted on a scientist's web page or are provided upon request.
  4. Data provided on the web are not always static. If one attempts to reproduce your results starting with downloading the data from your cited data source, there is no guarantee that they will receive the files with the same content that you used.

Note that as data centers evolve with user demands, a program such as TSDS will not be needed for their holdings. However, the problems listed above will always exist to some degree, and we expect that in the future TSDS will only be needed for data sets that are either too new or too specialized for the large data centers or virtual observatories.

The TSDS package contains two parts: a set of programs that interface with the TSDS data sets and a set of "make" programs that download data from the data providers and transform them into the TSDS format. The TSDS data format is not intended to be a new data format. Instead it is a temporary format; a number of examples are given for transforming all TSDS data into more common scientific data formats such as CDF or H5. In dealing with issues 1 and 2, we have found that writing a program that automatically downloads and extracts data from all available files in a given format is only slightly more complex than writing one to extract a day or two of data for a given research task. The TSDS programs are simply a generalized and organized collection of such parsing files that we have used at one time or another.

The following set of commands will plot every time series in the data set to the screen. (You probably don't want to do this; it is noted here only to make the point that this is how I would prefer to access data). The for loop will take a very long time to execute because it will attempt to download all of the TSDS data files (> ~ 4 GB compressed) if they have not been downloaded previously. (The TSDS install package contains only metadata and no data files, so an internet connection is needed to do anything more than browse the metadata using the ts_list function.)

>> NAMES = ts_list;
>> for i = 1:length(NAMES)
>>   D = ts_slice(NAMES{i}); 
>>   I = ts_get(NAMES{i}); 
>>   plot(D) 
>>   I
>> end

Data are downloaded automatically, as needed. However, if you want access to all of the data when you are offline (> 4 GB; you probably do not want to do this), you can download all of the TSDS data files by executing the following shell commands from the tsds-0.9.35 directory:

% cd ./data/final-s0
% wget -nH -p "http://www.scs.gmu.edu/~rweigel/tsds_data/final-s0"

An important feature of TSDS and its associated data set is its extensibility, which was developed to address issue 3. To add a time series, one only needs to execute a few commands after creating two text files, one with the time series data and the other with metadata. Exchanging a time series with someone running TSDS requires only a few TSDS commands.

To address issue 4, TSDS data sets are versioned in two ways. There is a snapshot number that corresponds to the date the data files were downloaded. Each snapshot has version 0 time series, which are exact copies of what was extracted from the original data files. If we find a problem with a time series, either a time series with a new version number is created or a metadata tag is added that informs the user of the problem. If a data source changes the data in a file used in a previous snapshot, a time series with a new snapshot number is created; the data from a previous time will always be available, which is important to ensure that someone doing a validation and metric analysis five years from now is able to reproduce the results of today, even when today's version of the data is no longer available from the original data source.

Directory original-sX/

The directory contains subdirectories with names that correspond to the data source. The online data directory contains all of the raw data used to create the versioned time series that are available through TSDS. Most of its subdirectories were created using a "wget" call to the URL of a data provider. The web location of the data can be inferred from the directory structure created by wget. The *_get.m programs in the subdirectories of make_fns/ usually contain the "wget" command used to download the data. The subdirectory manual_dl/ contains data files that could not be downloaded using "wget".

All of the data in the original-sX/ (where s means snapshot) directories represent a snapshot in time of a data source's directory. Many data sources will change file contents without changing filenames. For this reason, TSDS data may differ from what currently exists, which may or may not matter depending on your application.

The X in the directory name "original-sX/" indicates the snapshot number. When a new snapshot is taken, the original-sX directory is copied to original-s(X+1). Then the "wget" command is initiated starting in the directory original-s(X+1). A feature of "wget" is that it will not re-download a file from the server unless the server copy is newer than the same-named file on the local disk. If any files containing data before Year_f in an sX time series were changed, the relevant make_fns/ command is run using data in the s(X+1) directory and a new time series is generated with label s(X+1).
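The snapshot refresh relies on wget's timestamping behavior, which amounts to a simple decision rule. The sketch below illustrates that rule in Python (the function name is illustrative; wget applies this logic internally when its timestamping option is used):

```python
def needs_download(local_mtime, remote_mtime):
    """Timestamping rule used when refreshing a snapshot: fetch a file
    when there is no local copy, or when the server copy is newer than
    the same-named file on the local disk. Times are, e.g., Unix epochs."""
    if local_mtime is None:        # no local copy yet
        return True
    return remote_mtime > local_mtime
```

Under this rule, unchanged provider files are skipped on a refresh, so taking a new snapshot only transfers files whose contents the provider has updated.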

Directory data/final/

The .m files in the make_fns directory take data from the subdirectory original-sX of the data directory and transform them to a uniform time grid. (A few of the .m files are not yet Octave-compatible and will require Matlab.) The original data files are usually parsed and transformed by .pl and .m files that are part of the TSDS source code and create the files

  • TSDS_base_1_day.EXT
  • TSDS_base_1_hour.EXT
  • TSDS_base_ACE.part_1.EXT
  • TSDS_base_gnd_mag.part_1.EXT
  • TSDS_base_gnd_mag.part_2.EXT
  • etc.

that are located in the final-sX/ subdirectory of data. Currently EXT=mat (V6), but this is easily changed by modifying the program ts_mat.m. For example, see ts_hdf.m, which deals with EXT=HDF. For a discussion of data file formats see the FAQ.

All of these files will have an associated *-metadataonly.* file in the same directory. Some of the larger data files are not distributed with the standard TSDS package. When a user requests data that is in a file that is not part of the standard data distribution, a message appears stating that the data are being downloaded with the program "wget", along with the estimated download time. The data files are usually 20 MB or less in size.

Versioning notes
  • The version number is indicated in the time series name in parentheses only if it is a version other than zero.
  • Version zero means the time series is an exact copy of what was found from a data provider (with the exception of sub-minute data which is nearest-neighbor interpolated to a 1-minute time grid).
  • The only time a time series with a given version number is allowed to change is if data previously filled with FLAG values in the year listed as Year_f in the metadata becomes available.
  • The data available through TSDS are based on snapshots in time of a data provider's ftp or http site. If we take a new snapshot and the resulting time series is significantly different (in any of the years before Year_f), a new version of the time series will be created and distributed with the next release of TSDS, and the old version will be labeled as deprecated. (Note that to save space, only a diff from the original time series may be stored in the .EXT files.)

Transcription notes

  • The data in these files are all on a uniformly-spaced time grid. Fill values were added so that all time series start on the first time interval of a year and end on the last time interval of a year. The only allowed time grids are minute, hour, and day. (Data are stored compressed, so the storage space required for time series with many fill values is small.)
  • Data at sub-minute resolution were put on a 1-minute time grid using nearest neighbor interpolation.
  • 1-minute averaged CDAWeb data with time stamp HH:MM:SS=00:00:30 were assumed to be from data in the interval 00:00:00 to 00:00:59.99999.
  • Note that OMNIWeb text data files use a different time stamp convention than the same data in CDAWeb files. In OMNIWeb text files, the rows are labeled by hour, such that data with hour label H are based on averages of data in the interval [H,H+1].
  • CDAWeb data that were not on a 1-minute time grid (for example, 16-second or 64-second) were put on a 1-minute time grid by taking the measurement closest to the middle of the minute. If the minute interval had no data, the fill value was used.
  • Data at 3-hour cadence were put on 1-hour time grid by repeating values.
  • Generally, (original) 1-hour data with time stamps such as 00:00 or 0 are based on data in the interval 00:00-00:59:59.999. There is often an ambiguity regarding what the averaging interval for data is. That is, a number with a time stamp of 00:00 could be the result of averaging data in an interval of +/- 30 minutes or a result of averaging data in the interval 23:00-23:59:59 (which makes sense for time stamping real-time data). We have made an effort to make sure that our time stamping convention is consistent, but checks with original data sources and files should always be made. Note that if a time series was filtered before averaging, the average value may be based on data outside of the averaging window.
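The regridding rule for sub-minute CDAWeb data (take the measurement closest to the middle of each minute; use the fill value for empty minutes) can be sketched in Python. This is an illustration of the rule as stated above, not the TSDS code; the fill value and function name are illustrative:

```python
FILL = -99999.0  # illustrative flag value, not the actual TSDS fill

def to_minute_grid(samples, n_minutes, fill=FILL):
    """Place (seconds_since_start, value) samples on a 1-minute grid by
    choosing, for each minute, the sample closest to the middle of that
    minute (second 30). Minutes with no samples get the fill value."""
    out = []
    for m in range(n_minutes):
        mid = 60 * m + 30
        in_minute = [(abs(t - mid), v) for t, v in samples
                     if 60 * m <= t < 60 * (m + 1)]
        out.append(min(in_minute)[1] if in_minute else fill)
    return out

# 16-second-cadence samples placed on a 3-minute grid; the third
# minute has no samples and so receives the fill value:
grid = to_minute_grid([(0, 1.0), (16, 2.0), (32, 3.0), (48, 4.0),
                       (64, 5.0), (80, 6.0), (96, 7.0)], 3)
```

For minute 0 the sample at t=32 s is closest to second 30 and is selected; for minute 1 the sample at t=96 s is closest to second 90.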

Caution: The data in these files have been tested by (1) visually inspecting a plot of the sorted time series (e.g., with the Octave command plot(sort(X))) and (2) comparing select time intervals with plots available from the original data sources on the web and with the original data files. (You can see what tests were done by looking at the *_test.m functions in the directory tsds/make_fns.) It is still possible that transcription errors exist. Before presenting these data, do quality checks of your own and report any errors you find to rweigel@gmu.edu. This data set was created to make analysis of data from many sources easier. However, always do your own checking and make sure you read the notes and information given by the original data providers (a web link to the original data source is provided in the metadata and is viewable using ts_list(TS_NAME)).
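A programmatic analogue of the plot(sort(X)) inspection is to summarize a series numerically: counting fill values and reporting the range of the valid data makes flat steps (repeated flag values) and wild outliers easy to spot. The following Python sketch is one way to do such a quality check on your own data; it is illustrative and not part of TSDS:

```python
def summarize_series(x, fill):
    """Summarize a time series for a quick quality check: count the fill
    values and report the min/max of the remaining (valid) data."""
    valid = [v for v in x if v != fill]
    return {"n": len(x),
            "n_fill": len(x) - len(valid),
            "min": min(valid) if valid else None,
            "max": max(valid) if valid else None}
```

A min or max far outside the physically plausible range for the quantity, or an unexpectedly large fill count, is a cue to compare against the original data files.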

The data files contain extensive metadata including information about where the data were obtained, the location of relevant README files, a string indicating the first year of data, and the geographic locations of the measurement instruments (for ground magnetometers).
