LAT Data Catalog: Virtual File System
The LAT Data Catalog is a virtual file system maintained in an Oracle database. Data may be stored at several locations (e.g., SLAC, University of Washington (UW), Lyon (IN2P3), and elsewhere). The files themselves may be stored:
- On disk in AFS-, NFS-, or XROOTD-managed servers.
- In one of several tape archive systems.
- Or in any combination of the above.
The Data Catalog simplifies access to data by providing a uniform view of files that is independent of their physical location, and it provides features that are not available in standard file systems, including tagging files with:
- Meta-data attributes (typed name/value pairs) that provide additional information about the data they contain.
- Several physical locations, allowing a file to exist in multiple places for more convenient access.
In addition, the Data Catalog maintains a conventional folder structure and also provides a group structure, which allows files of different pedigree to be kept separate while coexisting within the same folder.
The Data Catalog also lets you request:
- A file, or a set of files, at a specific location (folder / group).
- A set of files via a meta data query.
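The ideas above (one logical entry, several physical locations, typed meta-data, and lookup by folder/group or by query) can be illustrated with a toy in-memory model. This is only a sketch in standard Python 3; the class and method names (Dataset, ToyCatalog, find_in, query) are hypothetical and are not part of the Data Catalog API, which is backed by Oracle and accessed through the Java API.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """One logical catalog entry (illustrative names, not the real schema)."""
    name: str
    folder: str
    group: str = ""                                 # empty string means "no group"
    metadata: dict = field(default_factory=dict)    # meta-data name -> value
    locations: dict = field(default_factory=dict)   # site name -> physical path


class ToyCatalog:
    def __init__(self):
        self.datasets = []

    def register(self, dataset):
        self.datasets.append(dataset)

    def find_in(self, folder, group=None):
        """Select by logical location: a folder, optionally narrowed to a group."""
        return [d for d in self.datasets
                if d.folder == folder and (group is None or d.group == group)]

    def query(self, **criteria):
        """Select by meta-data: every given name must equal the given value."""
        return [d for d in self.datasets
                if all(d.metadata.get(k) == v for k, v in criteria.items())]


catalog = ToyCatalog()
catalog.register(Dataset(
    name="000002",
    folder="/ServiceChallenge/Interleave3h-GR-v11r17/runs",
    group="merit",
    metadata={"nEvt": 2500},
    locations={
        "SLAC": "/nfs/farm/g/glast/u43/MC-tasks/Interleave3h-GR-v11r17/"
                "data/merit/Interleave3h-GR-v11r17-000002-merit.root",
        "SLAC_XROOT": "root://glastrdr//glast/mc/ServiceChallenge/"
                      "Interleave3h-GR-v11r17/merit/"
                      "Interleave3h-GR-v11r17-000002-merit.root",
    },
))
```

The same entry is reachable either through its folder and group or through its meta-data, and its physical path is chosen per site only after the logical lookup succeeds.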
JAVA API. Access to the Data Catalog is provided via a Java API that is under continued development, and features are added regularly. Any Java program running within the SLAC firewall may use this API to take advantage of the full Data Catalog feature set. The Java API is available on Confluence (see Data Catalog Java API). The API can be accessed by:
- A line-mode client, available from SLAC UNIX machines.
- Jython scriptlet processes in the GLAST Pipeline.
Line-mode Client
The Line-mode client is available from the UNIX command line at SLAC and exposes a subset of the full Data Catalog API.
The Data Catalog Line-mode executable is available at:
/afs/slac.stanford.edu/g/glast/ground/bin/datacat
Help:
- To display the help screen, invoke the executable with no parameters.
- Command-specific help can be obtained by executing:
/afs/slac.stanford.edu/g/glast/ground/bin/datacat -h <command>
Commands Currently Available
The following commands are currently available (usage is similar to CVS):
- registerDataset - Adds a new dataset to the catalog.
datacat registerDataset [-options] <data type> <logical folder> <file path>
Required Parameters:
<data type> | Type of data in the file (merit, MC, DIGI, RECON, etc.) See Java API child page for a full list. |
<logical folder> | Dataset Folder Path under which to create the new dataset. |
<file path> | Physical location of file to add to Data Catalog. |
Optional Parameters:
Long Form | Short Form | Parameter | Default Value | Description |
--name | -n | dataset name | file name | Name to give new dataset in the catalog. |
--group | -G | group name | none | Group under which to store the dataset. |
--format | -F | file format | file extension | Format of the file (root, fits, etc.) |
--site | -S | site name | SLAC | Site where dataset physically exists (SLAC, SLAC_XROOT, etc.) |
--define | -D | "name=value" | none | Define a meta data name/value pair for the new dataset. This option may be used more than once. For naming rules, see the Java API child page |
Example:
datacat registerDataset -n 000002 -G merit -D nEvt=2500 -S SLAC -F root merit /ServiceChallenge/Interleave3h-GR-v11r17/runs /nfs/farm/g/glast/u43/MC-tasks/Interleave3h-GR-v11r17/data/merit/Interleave3h-GR-v11r17-000002-merit.root
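When registrations are scripted, it can help to assemble the command from named values rather than by string pasting. The following is a minimal sketch, assuming the option and argument order used in the example above; the helper name register_dataset_argv and its keyword names are hypothetical, and only the flags documented in the table are used.

```python
def register_dataset_argv(data_type, logical_folder, file_path,
                          name=None, group=None, metadata=None,
                          site=None, file_format=None):
    """Build the argument list for `datacat registerDataset` (hypothetical helper)."""
    argv = ["datacat", "registerDataset"]
    # Options are emitted only when set, so the documented defaults apply otherwise.
    if name is not None:
        argv += ["-n", name]
    if group is not None:
        argv += ["-G", group]
    # -D may be repeated, once per meta-data name/value pair.
    for key, value in (metadata or {}).items():
        argv += ["-D", "%s=%s" % (key, value)]
    if site is not None:
        argv += ["-S", site]
    if file_format is not None:
        argv += ["-F", file_format]
    # Positional arguments come last: data type, logical folder, file path.
    return argv + [data_type, logical_folder, file_path]


argv = register_dataset_argv(
    "merit",
    "/ServiceChallenge/Interleave3h-GR-v11r17/runs",
    "/nfs/farm/g/glast/u43/MC-tasks/Interleave3h-GR-v11r17/data/merit/"
    "Interleave3h-GR-v11r17-000002-merit.root",
    name="000002", group="merit", metadata={"nEvt": 2500},
    site="SLAC", file_format="root",
)
print(" ".join(argv))
```

The resulting list can be passed to subprocess.run on a SLAC UNIX machine where datacat is on the path; passing a list rather than a single string avoids shell quoting problems in meta-data values.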
- addLocation - Adds an additional physical location to an existing dataset. Use this command to record that a dataset exists in more than one physical location (e.g., on SLAC NFS and in SLAC XROOT). Except for <file path>, all of the parameters and options identify the existing dataset entry to which the additional physical location is added.
datacat addLocation [-options] <dataset name> <logical folder> <file path>
Required Parameters:
<dataset name> | Name of existing dataset. |
<logical folder> | Data Catalog Folder Path under which the dataset lives. |
<file path> | Additional physical location of file to add to the dataset entry. |
Optional Parameters:
Long Form | Short Form | Parameter | Default Value | Description |
--group | -G | group name | none | Dataset Group in the Data Catalog under which the dataset lives. |
--site | -S | site name | SLAC | Site at which the additional physical location exists. |
Example:
datacat addLocation -G merit -S SLAC_XROOT 000002 /ServiceChallenge/Interleave3h-GR-v11r17/runs root://glastrdr//glast/mc/ServiceChallenge/Interleave3h-GR-v11r17/merit/Interleave3h-GR-v11r17-000002-merit.root
- addMetaData - Adds one or more meta data entries to an existing dataset.
datacat addMetaData [-options] <logical folder>
Required Parameters:
<logical folder> | Logical Folder Path where the group or dataset lives, or the folder itself to tag with meta data if no dataset or group is specified. |
Optional Parameters:
Long Form | Short Form | Parameter | Default Value | Description |
--dataset | -n | dataset name | file name | Name of existing dataset. |
--group | -G | group name | none | Dataset Group in the Data Catalog under which the dataset lives. |
--define | -D | "name=value" | none | Define a new meta data name/value pair for the dataset. This option may be used more than once (and must be used at least once). For naming rules, see the Java API child page. |
Example:
datacat addMetaData -d 000002 -G merit -D nEvt=2500 /ServiceChallenge/Interleave3h-GR-v11r17/runs
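Because addMetaData requires at least one name/value pair, a wrapper script can check that before invoking the client. The sketch below is illustrative only: the helper name add_metadata_argv is hypothetical, and it assumes that the long option forms listed in the table (--dataset, --group, --define) take their value as the following argument, which this page does not show explicitly.

```python
def add_metadata_argv(logical_folder, metadata, dataset=None, group=None):
    """Build the argument list for `datacat addMetaData` (hypothetical helper)."""
    if not metadata:
        # The --define option must be used at least once.
        raise ValueError("addMetaData needs at least one name=value pair")
    argv = ["datacat", "addMetaData"]
    # Assumption: long options are followed by their value as a separate argument.
    if dataset is not None:
        argv += ["--dataset", dataset]
    if group is not None:
        argv += ["--group", group]
    for key, value in metadata.items():
        argv += ["--define", "%s=%s" % (key, value)]
    # The logical folder is the only positional argument.
    return argv + [logical_folder]


argv = add_metadata_argv(
    "/ServiceChallenge/Interleave3h-GR-v11r17/runs",
    {"nEvt": 2500},
    dataset="000002", group="merit",
)
```

Omitting both dataset and group produces a command that tags the folder itself, matching the description of the <logical folder> parameter.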
Pipeline Jython Scriptlets
Jython scriptlet processes within the pipeline have access to the full Java API. Access to the Data Catalog is provided via an object named "datacatalog".
Example: Dataset registration is performed by calling:
datacatalog.registerDataset(DATA_TYPE, DATA_CATALOG_LOCATION,
DISK_LOCATION [, META_DATA])
where:
- DATA_TYPE is the type of data within the file.
- Typical values are MERIT, MC, RECON, ...
- (See the Java API link below for a full list.)
- DATA_CATALOG_LOCATION has the following form: <logical folder path>[<dataset group name>:]<dataset name>
- <logical folder path> is required and has the form: /folder1/sub-folder/.../
- It denotes the location within the Data Catalog folder-tree where the dataset will be registered.
- The folder need not exist; it will be created if necessary.
- <dataset group name> is optional.
- If present, it must be followed by a ":" (colon) character.
- The name is a simple alphanumeric string (spaces are not permitted.)
- A dataset group is used to bundle together datasets which are fragments of a larger dataset.
- For example, all merit files of a large Monte Carlo task are generally cataloged together using a dataset group.
- <dataset name> is required.
- It is simply the name of the dataset.
- It is an alphanumeric string (spaces are not permitted.)
- It must be unique within the folder or group where it will be placed.
- DISK_LOCATION has the following form: <disk file path>[@<site name>]
- <disk file path> is required.
- It is the full path on disk (or in XRootd, etc.) to the file that is being registered.
- <site name> is optional.
- If specified, it must be preceded by a "@" (at sign) character.
- The site name tells the data catalog where to find the physical file.
- Currently it may be one of:
- SLAC, SLAC_XROOT, IN2P3, IN2P3_HPSS, UW
- If no site name is specified, a default of "SLAC" is assumed.
- META_DATA is optional. If specified, the supplied meta-data will be attached to the dataset upon registration. Meta-data provide a basis for searching the Data Catalog for datasets. A META_DATA expression has the following form: <name>=<value>[:<name2>=<value2>[...]]
- <name> is required.
- It is simply the name of the meta-data object, but its form is significant because it denotes the object type of the <value> parameter. The Data Catalog will perform a type conversion and store the <value> parameter internally based on the type specified by the name:
- n[A-Z]+.* (ex: nEvents, nSecondsMET) indicates a numeric value
- t[A-Z]+.* (ex: tStartDate, tEndDate) indicates a timestamp value
- Anything else (ex: RunStatus, myDogsName) indicates a string value
- <value> is required and must be separated from <name> by a single '=' (equals) character.
- The value must reflect the type specified by <name> or an error will be thrown, and the registration will fail.
- Numeric values have 38 decimal digits of precision for integers and 18 for floats. Leading and trailing zeros will be removed during conversion.
- Timestamp values must be supplied in the following format: yyyy-mm-dd hh:mm:ss.[fff...]
(fff... is an optional, fractional seconds component with nanosecond precision.)
- String values are simply ASCII strings. Put whatever you want in there, even numbers.
- Multiple <name>=<value> pairs may be supplied if separated by ":" (colon) characters.
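The naming convention for meta-data can be checked before a registration is attempted. The following is a minimal sketch in standard Python 3, not the Data Catalog's own conversion code: the function names are hypothetical, the name patterns are copied from the list above, and the timestamp pattern assumes that the fractional part (including its dot) is optional and limited to nine digits.

```python
import re
from decimal import Decimal, InvalidOperation

NUMERIC_NAME = re.compile(r"n[A-Z]+.*")     # e.g. nEvents, nSecondsMET
TIMESTAMP_NAME = re.compile(r"t[A-Z]+.*")   # e.g. tStartDate, tEndDate
# yyyy-mm-dd hh:mm:ss, optionally followed by "." and up to nine digits (assumption).
TIMESTAMP_VALUE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(\.\d{0,9})?")


def metadata_kind(name):
    """Return the value type implied by a meta-data name."""
    if NUMERIC_NAME.fullmatch(name):
        return "numeric"
    if TIMESTAMP_NAME.fullmatch(name):
        return "timestamp"
    return "string"


def value_matches_name(name, value):
    """Check that a string value is plausible for the type its name implies."""
    kind = metadata_kind(name)
    if kind == "numeric":
        try:
            return Decimal(value).is_finite()
        except InvalidOperation:
            return False
    if kind == "timestamp":
        return TIMESTAMP_VALUE.fullmatch(value) is not None
    return True  # string values may contain anything
```

This check only covers the shape of a value; it does not validate calendar dates or the precision limits described above.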
Below is an example. The parameters are interpreted as follows:
- It registers a "merit" type dataset.
- The dataset is placed under the Data Catalog folder "/ServiceChallenge/Interleave3h-GR-v11r17/runs/", in the group "merit", with a name of "000002".
- The file is found on disk at:
"/nfs/farm/g/glast/u43/MC-tasks/Interleave3h-GR-v11r17/data/merit/Interleave3h-GR-v11r17-000002-merit.root"
and is assumed to be located at SLAC (because no site name was specified).
datacatalog.registerDataset(
    "merit",
    "/ServiceChallenge/Interleave3h-GR-v11r17/runs/merit:000002",
    "/nfs/farm/g/glast/u43/MC-tasks/Interleave3h-GR-v11r17/data/merit/Interleave3h-GR-v11r17-000002-merit.root")
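To see how the two location strings in this call decompose, the forms <logical folder path>[<dataset group name>:]<dataset name> and <disk file path>[@<site name>] can be split with a few lines of Python. This is an illustrative sketch rather than the parser the Data Catalog uses; the function names are hypothetical, and it assumes that disk paths contain no "@" of their own and that dataset names contain no "/" or ":".

```python
def split_catalog_location(location):
    """Split a DATA_CATALOG_LOCATION into (folder path, group or None, dataset name)."""
    # Everything up to the last "/" is the logical folder path.
    folder, _, tail = location.rpartition("/")
    # An optional group name precedes the dataset name, separated by ":".
    group, _, name = tail.rpartition(":")
    return folder + "/", group or None, name


def split_disk_location(location, default_site="SLAC"):
    """Split a DISK_LOCATION into (file path, site name); the site defaults to SLAC."""
    path, sep, site = location.rpartition("@")
    if not sep:
        return location, default_site
    return path, site


folder, group, name = split_catalog_location(
    "/ServiceChallenge/Interleave3h-GR-v11r17/runs/merit:000002")
path, site = split_disk_location(
    "/nfs/farm/g/glast/u43/MC-tasks/Interleave3h-GR-v11r17/data/merit/"
    "Interleave3h-GR-v11r17-000002-merit.root")
```

For the example call, this yields the folder "/ServiceChallenge/Interleave3h-GR-v11r17/runs/", the group "merit", the dataset name "000002", and the default site "SLAC".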
Last updated by: Chuck Patterson, 12/13/2007