URI Fragment Identifier for Binary Data
Status: Draft
This version: http://dfdf.inesc-id.pt/tr/doc/fragid/20070930
Latest version: http://dfdf.inesc-id.pt/tr/fragid
Editors
- Xiaoshu Wang (xiao kdbio.inesc-id.pt)
- Jonas S. Almeida (jalmeida mdanderson.org)
- Arlindo L. Oliveira (aml inesc-id.pt)
Abstract
This document defines the syntax and semantics of URI fragment identifiers for the content type "application/dfdf+octet-stream" (or "application/x-dfdf+octet-stream" as the experimental, non-standard type). Such identifiers can be used within the Data Format Description Framework (http://dfdf.inesc-id.pt) to refer to parts of binary data.
Table of Contents
1.1. Terminology and notations
1.3. Binary Representation of df:InfoSpace
2. Fragment Identification Methods
2.1.2. Fragment Definition Block
3. Item Selection for df:Stream
3.2.3. Item Selection for df:Stream
3.2.4. Semantics of fragment definition block
3.4. A few potentially confusing cases
1. Introduction
1.1. Terminology and notations
The terminology and notations used to describe the Data Format Description Framework (DFDF) are defined in a separate document at "http://dfdf.inesc-id.pt/tr/terms". The key words MUST, MUST NOT, REQUIRED, SHALL, SHALL NOT, SHOULD, SHOULD NOT, RECOMMENDED, MAY, and OPTIONAL, when EMPHASIZED, are to be interpreted as described in IETF RFC 2119 [1].
1.2. Background
The URI is at the heart of web architecture. A URI gives a resource an identity on the web and enables software agents to retrieve its representation, as well as its description, from the web. Generic URI syntax sanctions the use of a fragment identifier, formed by appending a number sign ('#') and the name of the fragment to the end of the primary URI. A fragment identifier allows a resource secondary to the primary resource to be identified. However, the identification method of a fragment identifier depends on the media type of the returned representation of the primary resource [2]. For most media types, a fragment identifier refers to a subcomponent of the primary resource. But for some media types, notably the RDF type, fragment identifiers are mostly used to refer to external entities.
A typical DFDF application deals with documents of two different media types. One is the RDF document and the other is the binary data file. The semantics of fragment identifiers for RDF documents are well documented in the RDF specifications (http://www.w3.org/RDF/), but those for a binary file are left undefined. The reason is that a binary file has neither a data model nor markup tags that could delineate the boundaries of binary fragments. DFDF overcomes this limitation by binding a data model to the data document. This allows fragment identifiers to be built from model components, which, in turn, can be used to refer to fractions of binary data.
1.3. Binary Representation of df:InfoSpace
In DFDF, a df:InfoSpace holds information items, whose meaning is expressed through df:Transformations that relate them to the information items of other df:InfoSpaces. If, by a series of transformations, the information content of a df:InfoSpace is ultimately transformed from a collection of bytes on a binary resource, we define this collection of bytes to be the binary representation of the df:InfoSpace at that resource. Take the following description as an example.
@prefix x: <http://example.com/data#> .
<http://example.com/data> a df:Stream; df:about df:byte.
x:str1 df:destOf [a df:RegMapping;
    df:src <http://example.com/data>;
    df:srcIndices [a df:NumberStream;
        df:generator "1, 3, 5"]].
x:str2 df:destOf [a df:RegMapping;
    df:src x:str1;
    df:srcIndices [a df:NumberStream;
        df:generator "1,2"]].
Stream x:str1 is transformed directly from the 1st, 3rd and 5th bytes of "http://example.com/data". Hence, these three bytes are the binary representation of x:str1 at "http://example.com/data". Similarly, x:str2 is transformed from the first two elements of x:str1, which are, in turn, built from the 1st and 3rd bytes of "http://example.com/data". Hence, the 1st and 3rd bytes of "http://example.com/data" are the binary representation of x:str2 at "http://example.com/data".
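The chaining described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper name select and the 6-byte resource are hypothetical, and indices are treated as 1-based, as in the example.

```python
def select(items, indices):
    """Pick items by 1-based indices, mimicking df:srcIndices in the example."""
    return [items[i - 1] for i in indices]

# Byte positions (1-based) of a hypothetical 6-byte resource.
data_positions = [1, 2, 3, 4, 5, 6]

# x:str1 is built from bytes 1, 3 and 5 of the resource.
str1 = select(data_positions, [1, 3, 5])
# x:str2 is built from the first two items of x:str1.
str2 = select(str1, [1, 2])

print(str1)  # [1, 3, 5]
print(str2)  # [1, 3]
```

Tracing each stream back to byte positions of the resource is what yields its binary representation at that resource.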
A df:InfoSpace, however, is not limited to a single binary representation. In the following description, for instance, x:str3 has two binary representations at two different locations.
x:str3 df:destOf [a df:RegMapping;
        df:src <http://example.com/data/1>;
        df:for _:property1];
    df:destOf [a df:RegMapping;
        df:src <http://example.com/data/2>;
        df:for _:property2].
But even at the same location, a df:InfoSpace can have multiple binary representations because a df:InfoSpace may be transformed differently in different models. Consider the following two sets of statements.
# Statement 1
x:str4 df:destOf [a df:RegMapping;
    df:for _:foo ;
    df:src <http://example.com/data>;
    df:srcIndices [a df:NumberStream;
        df:generator "1,2"]].
# Statement 2
x:str4 df:destOf [a df:RegMapping;
    df:for _:bar ;
    df:src <http://example.com/data>;
    df:srcIndices [a df:NumberStream;
        df:generator "3,4"]].
With statement 1 alone, the binary representation of x:str4 at "http://example.com/data" is the first two bytes of the byte stream. With statement 2 alone, it is the 3rd and 4th bytes. With both statements, it is the first four bytes of the byte stream. On the other hand, some df:InfoSpaces have no binary representation at all. An instance of df:NumberStream, for example, has no binary representation at any location.
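As a rough illustration of this model dependence, the representation can be viewed as the union of the byte positions contributed by whichever statements are part of the model. The function name and the set-based encoding below are assumptions made for the sketch, not part of DFDF.

```python
# Byte positions (1-based) contributed by each statement in the example above.
statement_1 = {1, 2}   # df:generator "1,2"
statement_2 = {3, 4}   # df:generator "3,4"

def binary_representation(*statements):
    """Union of the byte positions of every statement present in the model."""
    positions = set()
    for contributed in statements:
        positions |= contributed
    return sorted(positions)

print(binary_representation(statement_1))               # [1, 2]
print(binary_representation(statement_2))               # [3, 4]
print(binary_representation(statement_1, statement_2))  # [1, 2, 3, 4]
print(binary_representation())                          # [] (no binary representation)
```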
The exact composition of a binary representation is therefore determined by three factors: (1) the location of the primary resource, (2) the RDF model, and (3) the selection of df:InfoSpaces. The URI of the primary resource naturally becomes the primary URI of a binary fragment identifier. The subsequent sections define how the fragment identifier is used to select models and df:InfoSpaces.
2. Fragment Identification Methods
2.1. Syntax
2.1.1. Production rules
The syntax of a binary fragment identifier is defined by the following production rules.
FragIdentifier = FragmentDef, {FragmentDef} ;
FragmentDef = "!", {NSDefinition}, {InfoSpaceDesc} ;
NSDefinition = [qnamePrefix], '(', URI-reference, ')' ;
InfoSpaceDesc = [Operator], [InfoSpaceName], [ItemSelection] ;
Operator = "+" | "-" ;
qnamePrefix = NCName ;
InfoSpaceName = NCName | QName ;
ItemSelection = undefined | StreamSelection ;
StreamSelection = "[", GeneratorString, "]" ;
2.1.2. Fragment Definition Block
A binary fragment identifier is composed of one or more definition blocks. Each block starts with an exclamation point "!" followed by zero or more namespace definitions and zero or more df:InfoSpace descriptions.
2.1.3. Namespace definition
A namespace definition is composed of an optional namespace prefix followed by a URI reference enclosed within a pair of parentheses. The syntax for "URI-reference" is defined in section 4.1 of the URI specification [2].
2.1.4. InfoSpace Description
A df:InfoSpace description is composed of three parts. The first part is an optional operator component, which, if present, can be either "+" or "-". The second part is the df:InfoSpace name component, which should appear in the form of either an NCName or a QName. The third part is an optional component for selecting individual items from the preceding df:InfoSpace. However, both the syntax and the semantics of item selection are left undefined here because they depend on the type of the df:InfoSpace.
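The production rules can be exercised with a small parser. The following Python sketch is not a reference implementation: the function name parse_fragment and the output structure are hypothetical. Because "-" is a legal NCName character but also serves as an operator here, the sketch assumes names are restricted to letters, digits and underscores. It also assumes URI references contain no unescaped parentheses or "!".

```python
import re

# Assumption: simplified names (letters, digits, underscore), not full NCName.
NAME = r"[A-Za-z_][A-Za-z0-9_]*"
# NSDefinition = [qnamePrefix], '(', URI-reference, ')'
NS_DEF = re.compile(rf"({NAME})?\(([^()]*)\)")
# InfoSpaceDesc = [Operator], [InfoSpaceName], [ItemSelection]
SPACE_DESC = re.compile(rf"([+-])?({NAME}(?::{NAME})?)?(?:\[([^\]]*)\])?")

def parse_fragment(fragment):
    """Split a fragment identifier into definition blocks."""
    if not fragment.startswith("!"):
        raise ValueError("a fragment identifier must start with '!'")
    blocks = []
    for body in fragment[1:].split("!"):
        pos, namespaces, spaces = 0, [], []
        # Namespace definitions come first: (prefix or None, URI reference).
        while (m := NS_DEF.match(body, pos)):
            namespaces.append((m.group(1), m.group(2)))
            pos = m.end()
        # Then InfoSpace descriptions: (operator, name, item selection).
        while pos < len(body):
            m = SPACE_DESC.match(body, pos)
            if m.end() == pos:
                raise ValueError(f"cannot parse {body[pos:]!r}")
            spaces.append((m.group(1), m.group(2), m.group(3)))
            pos = m.end()
        blocks.append({"namespaces": namespaces, "spaces": spaces})
    return blocks

print(parse_fragment("!a(foo)a:space1+space2"))
print(parse_fragment("!(foo)space1!(bar)-space1"))
```

Parts that are absent from the input are reported as None, so the result mirrors the syntax only. Any defaults are left to the semantic rules.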
3. Item Selection for df:Stream
This section defines the syntax and semantics of selecting items from a df:Stream. The df:Stream item selection is enclosed within a pair of brackets. The syntax and semantics of the selection value must follow those defined for the number generator.
3.1. Example
The following examples are all valid URIs that can be used to refer to a fraction of a binary data file.
1. a) http://example.com/data#!a(foo)a:space
b) http://example.com/data#!a(http://example.com/foo)a:space
c) http://example.com/data#!a(foo/)
d) http://example.com/data#!a(foo%2F)
2. http://example.com/data#!(foo)space
3. http://example.com/data#!(foo)(bar)()
4. http://example.com/data#!a(foo)a:space1+space2
5. http://example.com/data#!space1-space2
6. http://example.com/data#!foo[2]
7. http://example.com/data#![1:2:1]
8. http://example.com/data#!(foo)space1!(bar)-space1
9. http://example.com/data#!(foo)bar(bar)space1-bar:space1
3.2. Semantics
3.2.1. Namespace Definition
A namespace definition serves two functions. First, it defines a namespace prefix so that a QName can be used to refer to a df:InfoSpace. Second, it defines the model base for the ensuing df:InfoSpace descriptions.
Prefix Definition
A namespace definition is composed of an optional prefix followed by a URI reference enclosed within a pair of parentheses. If the URI reference is a relative reference, it must be resolved with the primary URI as the base URI. For instance, in Example 1a, "foo" is a relative URI reference and is resolved against "http://example.com/data". The result is "http://example.com/foo", so Example 1a is equivalent to Example 1b. A namespace typically ends with either a "/" or a "#", but a "#" should not appear within a fragment identifier (unless it is "%" escaped, which makes the syntax ugly), so an arbitrary rule is needed. The rule is as follows: if the namespace URI does not end with a "/" or a "#", a "#" sign must be appended. In both Example 1a and 1b, "a" is therefore the namespace prefix for "http://example.com/foo#". In the rare case where the namespace does end with "/" and a "#" should still be appended, the last "/" must be percent-encoded as defined in the URI specification [2]. Therefore, the namespaces for the prefix "a" in Example 1c and 1d are:
1c) http://example.com/foo/
1d) http://example.com/foo/#
For any fragment definition block, the default namespace is initially set to be the primary URI. A prefix-less namespace declaration overwrites this default. For instance, in Example 2, the default namespace is "http://example.com/foo#".
Namespace definitions are evaluated from left to right. If a prefix is defined more than once in a definition block, the definition on the right overwrites those on its left. In Example 3, for instance, the default namespace is first set to "http://example.com/foo#", then to "http://example.com/bar#", and then reset to the primary URI.
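The prefix rules above can be sketched with Python's standard urllib.parse module. The function names are hypothetical. The order of operations is one reading of the text rather than something it spells out: the trailing-character test is applied to the still-encoded URI and percent-decoding happens last. The sketch also assumes that a default namespace equal to the primary URI takes the "#" suffix, as Case 1 in section 3.4 suggests.

```python
from urllib.parse import urljoin, unquote

def resolve_namespace(primary_uri, reference):
    """Resolve a URI reference from a namespace definition to a namespace URI."""
    ns = urljoin(primary_uri, reference)   # relative references use the primary URI as base
    if not ns.endswith(("/", "#")):        # rule: append "#" unless it ends with "/" or "#"
        ns += "#"
    return unquote(ns)                     # assumption: decode "%2F" only after the test

def prefix_table(primary_uri, definitions):
    """Evaluate (prefix or None, reference) pairs left to right; later ones overwrite."""
    table = {None: resolve_namespace(primary_uri, "")}   # default: the primary URI
    for prefix, reference in definitions:
        table[prefix] = resolve_namespace(primary_uri, reference)
    return table

primary = "http://example.com/data"
print(resolve_namespace(primary, "foo"))      # http://example.com/foo#   (Example 1a)
print(resolve_namespace(primary, "foo/"))     # http://example.com/foo/   (Example 1c)
print(resolve_namespace(primary, "foo%2F"))   # http://example.com/foo/#  (Example 1d)
print(prefix_table(primary, [(None, "foo"), (None, "bar"), (None, "")])[None])
# http://example.com/data#  (Example 3: the default is reset to the primary URI)
```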
Model Import
In addition to setting a namespace prefix, a namespace definition also functions as an "import" statement. Within a given fragment definition block, a df:InfoSpace is interpreted according to the joined model of all namespace URIs. Therefore, for every URI enclosed within parentheses, a processor should execute an HTTP GET operation requesting the RDF media type. The RDF statements imported from all namespace URIs form a joined model, with which the binary representation of any subsequent df:InfoSpace is interpreted.
One question remains: should the processor attempt to retrieve the RDF model from the primary URI? The answer is that it should do so on an as-needed basis. In Example 2, it should not, because the only df:InfoSpace mentioned does not use the primary URI as its namespace. In Example 4, however, it should, because the namespace of "space2" is the primary URI. To make the intention explicit, the retrieval can always be forced with an empty namespace definition "()", as in Example 3.
The rationale for the as-needed import of the primary URI is that not all data files are bound to a data model. If the model at the primary URI were always requested, it would be impossible to refer to a fraction of such data. The as-needed processing logic allows third-party descriptions to be used to identify part of a data file.
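The as-needed logic can be sketched as a function that lists the model URIs a processor would request for one definition block. The function name and input shapes are hypothetical, no HTTP request is actually made, and the primary URI is assumed to carry no fragment of its own.

```python
from urllib.parse import urljoin

def models_to_fetch(primary_uri, definitions, space_names):
    """
    definitions: (prefix or None, URI reference) pairs, left to right.
    space_names: InfoSpace names as written (NCName or QName).
    Returns the model URIs to request, in order of first appearance.
    """
    fetch = []
    default_is_primary = True              # the default namespace starts as the primary URI
    for prefix, reference in definitions:
        uri = urljoin(primary_uri, reference)
        if uri not in fetch:
            fetch.append(uri)              # every URI in parentheses is imported
        if prefix is None:
            default_is_primary = (uri == primary_uri)
    # An unprefixed name uses the default namespace; the primary model is
    # needed only if that default is still the primary URI.
    uses_default = any(":" not in name for name in space_names)
    if default_is_primary and uses_default and primary_uri not in fetch:
        fetch.append(primary_uri)
    return fetch

primary = "http://example.com/data"
print(models_to_fetch(primary, [(None, "foo")], ["space"]))               # Example 2
print(models_to_fetch(primary, [("a", "foo")], ["a:space1", "space2"]))   # Example 4
print(models_to_fetch(primary, [(None, "foo"), (None, "bar"), (None, "")], []))  # Example 3
```

For Example 2 only "http://example.com/foo" is listed, while Examples 3 and 4 also include the primary URI.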
3.2.2. InfoSpace Selection
As introduced in the previous section, a df:InfoSpace description is composed of three parts: operator, InfoSpace name, and item selection. The operator is optional; if absent, it is assumed to be "+". The InfoSpace name should be either an NCName or a QName. If it is an NCName, the default namespace is used to construct the InfoSpace URI. If it is a QName, the namespace bound to its prefix in the current block is used. The InfoSpace name indicates the object to which the operator is applied. If an item selection is present, it further indicates which part of the InfoSpace the operator is applied to.
A "+" operator indicates the inclusion, and a "-" the exclusion, of the binary representation of the ensuing df:InfoSpace. For instance, the URI shown in Example 5 identifies the bytes at "http://example.com/data" that are in the binary representation of "#space1" but not in that of "#space2". InfoSpace descriptions are evaluated from left to right, so their order is significant. For instance, "#!-space1+space1" is the same as "#!space1", but "#!space1-space1" selects nothing.
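The left-to-right evaluation can be modeled with set operations on byte positions. In this sketch the function name and the byte positions assigned to "space1" and "space2" are made up for illustration.

```python
def evaluate(selections, representations):
    """Apply (operator, name) pairs left to right to a set of byte positions."""
    selected = set()
    for op, name in selections:
        positions = representations[name]
        if op == "-":
            selected -= positions      # exclusion
        else:                          # "+" or absent operator: inclusion
            selected |= positions
    return selected

# Hypothetical binary representations (1-based byte positions).
reps = {"space1": {1, 2, 3}, "space2": {3, 4}}

print(sorted(evaluate([(None, "space1"), ("-", "space2")], reps)))  # [1, 2]     "#!space1-space2"
print(sorted(evaluate([("-", "space1"), ("+", "space1")], reps)))   # [1, 2, 3]  "#!-space1+space1"
print(sorted(evaluate([(None, "space1"), ("-", "space1")], reps)))  # []         "#!space1-space1"
```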
3.2.3. Item Selection for df:Stream
To select a part of a df:Stream, the syntax and semantics of the number generator are used. In brief, processing the number generator string yields a list of integers, which are the indices of the information items of the df:Stream in question. For instance, assuming "http://example.com/data#foo" is a df:Stream, Example 6 denotes the binary representation of the second item of "http://example.com/data#foo". Similarly, Example 7 denotes all the odd-numbered bytes of "http://example.com/data".
3.2.4. Semantics of fragment definition block
Each fragment definition block has its own model and namespace prefixes. Hence, whenever it encounters an exclamation point "!", a processor must reset the namespace definitions, both the prefix definitions and the imported RDF models. However, the binary representations selected in previous blocks remain valid. For instance, Example 8 refers to the bytes that are in the binary representation of "http://example.com/foo#space1" but not in the binary representation of "http://example.com/bar#space1". Note that Example 8 and Example 9 do not necessarily express the same binary representation because their model contexts are different. In Example 8, "http://example.com/foo#space1" is interpreted by the model "http://example.com/foo" alone. But in Example 9, it is interpreted by the joined model of "http://example.com/foo" and "http://example.com/bar", which can contain statements that alter the interpretation of "http://example.com/foo#space1".
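The difference between Example 8 and Example 9 can be made concrete with a toy evaluation in which the model context is rebuilt for every block while the selected bytes carry over. Everything below is hypothetical: the function name, the data layout, and above all the byte positions, which simply assume that the model at "http://example.com/bar" contributes one extra statement about "http://example.com/foo#space1".

```python
def evaluate_blocks(blocks, representation):
    """Evaluate definition blocks in order; the selection persists across blocks."""
    selected = set()
    for block in blocks:
        model = frozenset(block["models"])    # model context is reset at every "!"
        for op, space_uri in block["spaces"]:
            positions = representation[(model, space_uri)]
            if op == "-":
                selected -= positions
            else:
                selected |= positions
    return selected

FOO = "http://example.com/foo"
BAR = "http://example.com/bar"

# Hypothetical byte positions, keyed by (joined model, InfoSpace URI).
representation = {
    (frozenset({FOO}), FOO + "#space1"): {1, 2, 3, 4},
    (frozenset({BAR}), BAR + "#space1"): {3, 4},
    (frozenset({FOO, BAR}), FOO + "#space1"): {1, 2, 3, 4, 5},
    (frozenset({FOO, BAR}), BAR + "#space1"): {3, 4},
}

# Example 8: two blocks, each with its own model.
example_8 = [
    {"models": [FOO], "spaces": [("+", FOO + "#space1")]},
    {"models": [BAR], "spaces": [("-", BAR + "#space1")]},
]
# Example 9: one block interpreted with the joined model.
example_9 = [
    {"models": [FOO, BAR],
     "spaces": [("+", FOO + "#space1"), ("-", BAR + "#space1")]},
]

print(sorted(evaluate_blocks(example_8, representation)))  # [1, 2]
print(sorted(evaluate_blocks(example_9, representation)))  # [1, 2, 5]
```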
3.3. Character escape
If any component of a fragment identifier contains a character whose semantics are defined in this document, that character must be "%" escaped as defined in the URI syntax [2]. Currently, the reserved characters are: '#', '!', ':', '(', ')', '[', ']'.
3.4. A few potentially confusing cases
Case 1: What does "foo" denote in the following URI?
http://example.com/data#!foo
Is it a namespace prefix or a QName for a df:InfoSpace?
Answer: It is the name of a df:InfoSpace, denoting "http://example.com/data#foo". A namespace prefix, if present, must be followed by a URI reference enclosed within a pair of parentheses.
Case 2: What does the following URI denote, all or none of the bytes?
http://example.com/data#!
Answer: None of the bytes. First, no df:InfoSpace is selected. Second, if we took it to select the default namespace of the primary URI, use cases like "Case 1" would become meaningless. Third, if we want to refer to the entire byte stream, we can always use the primary URI.
4. References
1. Bradner, S., Key words for use in RFCs to Indicate Requirement Levels. IETF RFC 2119, 1997. http://www.ietf.org/rfc/rfc2119.txt
2. Berners-Lee, T., R. Fielding, and L. Masinter, Uniform Resource Identifier (URI): Generic Syntax. IETF RFC 3986, 2005.

