DFDF Primer
Status: Draft
This version: http://dfdf.inesc-id.pt/tr/doc/primer/20070925
Latest version: http://dfdf.inesc-id.pt/tr/primer
Editors
- Xiaoshu Wang (xiao kdbio.inesc-id.pt)
- Jonas S. Almeida (jalmeida mdanderson.org)
- Arlindo L. Oliveira (aml inesc-id.pt)
Abstract
This document describes the Data Format Description Framework (DFDF), a system that uses semantic web technology to describe the format of data resources. The framework also serves as the basis for defining URI fragment identifiers that can be used to address, and subsequently access, parts of binary resources. As with all other DFDF documents, the terminology, special notations and syntax are defined in a separate document at "http://dfdf.inesc-id.pt/tr/terms".
1. Introduction
A format language defines a formatting model, which, in turn, determines the arrangement of data elements on an electronic medium. Most format languages are designed to communicate data to a specific application. To make this communication efficient, the application's data structure usually takes the place of the formatting model, and data is organized in a format that is native to the application's platform. While efficient to process, data formatted in this fashion has limited interoperability. Because no explicit connection can be established between the data and its formatting model, ad hoc communication between a data provider and its consumers is unlikely to take place without prior coordination. To overcome this limitation, a collection of so-called self-descriptive format languages has been developed. The best known of these is the Extensible Markup Language (XML). XML uses tags to mark up the structure of data so that a data document can be parsed by an XML processor into a generic document data model without adhering to a fixed, prescribed format. This effectively makes XML data self-descriptive, and the success of XML in the recent past has shown the importance of a self-describing format in improving the interoperability of data.
All self-descriptive languages, however, have the drawback of imposing overhead on their data. This is especially true for XML, whose overhead in storing, parsing and accessing data has made it unsuitable for data domains where space/time efficiency is the top priority[1]. It is worth noting that embedding binary data in an XML document as base64-typed data does not offer a true solution. The reason is not only the increased space consumption incurred by the base64 encoding, which can be partially circumvented by the W3C-recommended XOP packaging mechanism[2]. Rather, it is the inability of the XML model to describe the internal structure of the embedded data. As a language designed to serialize text-based data, XML can only treat base64-typed data as a collection of character information items. The internal structure of the binary data therefore cannot be described to an XML parser, so none of the standard XML-based technologies, such as XPointer, XPath, XSLT and XQuery, can be used to process the data. In other words, the embedding approach makes the data no longer self-descriptive, so the approach is only sensible if the embedded data is to be treated as an atomic object.
There is, apparently, a conundrum over whether a formatting model should be bound to its data. On the one hand, the binding is necessary for improving the transparency of data so that ad hoc communication can take place without any explicit coordination effort. On the other hand, the diverse spectrum of application needs calls for data to be free from arbitrary constraints and unnecessary overhead[3]. The challenge, therefore, lies in finding an approach that can balance these opposing forces.
What causes the above conflict is, in fact, not the binding itself but the form of binding that a data language engages in. For most self-describing languages, the binding occurs in situ in the data document. Such a physical form of binding assigns two responsibilities to one document, and once fulfilling one responsibility gets in the way of fulfilling the other, conflict inevitably occurs. In the case of XML, the conflict is over the usage of space. To improve data transparency, the language asks for more space to insert more human-readable tags; to improve processing efficiency, it needs to conserve space and use binary coding. If the responsibilities are handed to two separate documents, however, the conflict is easily resolved. The question then becomes how these separate documents can be meaningfully bound without being physically tied together. Within the current web architecture, this is possible.
Everything on the web is a resource identified by a Uniform Resource Identifier (URI)[4]. Data is no exception. It should also have a URI which, upon being dereferenced, returns a representation of the data. A resource, however, is not limited to just one representation. Dereferencing an HTTP URI, for instance, can return different representations of the resource depending on how the content is negotiated[5]. This one-to-many relationship between a resource's URI and its representations allows data to be physically separated from, while still logically connected to, its model. It is this form of logical binding that establishes the foundation of DFDF[a].
In DFDF, two documents instead of one are used to represent a data resource, with each document handling a different responsibility. The first document stores the actual data. Its sole responsibility is to improve the processing efficiency of the data, and any kind of encoding method can be used toward this objective. The second document is the model document, which is responsible for describing the data document by giving a detailed account of the data arrangement on the medium and its relation to the domain knowledge. The language used in the model document is the Resource Description Framework (RDF)[6]. RDF is chosen for its modular and extensible structure as well as its role as the language for knowledge representation in the semantic web[7].
The subsequent sections introduce the basic design of DFDF. To keep the discussion concrete, one particular example is used throughout the document. The example is drawn from an earlier published article[7], in which the shape of a hypothetical spot on a two-dimensional electrophoresis (2DE) gel is discussed. This example is chosen for three reasons. First, the cited article discusses the advantages of an RDF-based approach in detail, so we do not have to elaborate on the topic in this document. Second, the example uses a domain ontology that is already deployed on the web (http://www.charlestoncore.org/ontology/example). Although the ontology is something of a toy, its conceptualization is both concrete and correct, so reusing it saves us the unnecessary work of developing another exemplary ontology. Third, the straightforward RDF/XML approach introduced in [7] can be compared with the DFDF approach introduced here.
2. Data Description with DFDF
2.1. Core Conceptualization
Format is the structure that organizes data. In DFDF, the formatting structure is collectively described by the concepts of df:InfoSpace and df:Transformation. Here, a df:InfoSpace is considered an unlimited expanse in which information items reside. The number of items contained in a df:InfoSpace can be described by the df:size property; the kind of information items held within the df:InfoSpace can be described by df:about.
As a container, however, a df:InfoSpace only contains information; it does not produce any. Information is generated only when the information items of one df:InfoSpace are turned into information items of another df:InfoSpace. Take the 32-byte array shown in Figure 1a as an example. The byte stream by itself is not very meaningful, in the sense that it can represent many different kinds of information. Two of them are shown in Figure 1b and 1c, respectively: one is an array of floating point numbers and the other an array of integers. Without further description, no one can be certain which set of values the byte stream encodes. However, if the relationship between the byte stream and the data arrays is described with native datatypes and byte orders, the meaning of the byte stream becomes clear.

Figure 1 - InfoSpace and its transformation. The figure illustrates how information is generated through the process of transforming an InfoSpace: (a) a 32-byte InfoSpace; (b) an array of four real numbers; (c) an array of four integers. Both (b) and (c) can be represented by (a).
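The ambiguity illustrated in Figure 1 can be reproduced in a few lines of Python. The sketch below is only an illustration: the first value, 5.2820, is taken from the running example, while the other three values and the choice of signed 64-bit big-endian integers for panel (c) are assumptions made here, not details given by the figure.

```python
import struct

# Four example values; only 5.2820 comes from the primer, the rest are made up.
values = [5.2820, 1.5, 2.25, 0.5]

# Serialize them as big-endian IEEE 754 doubles: 4 x 8 = 32 bytes.
raw = struct.pack(">4d", *values)

# Interpretation (b): four double-precision floating point numbers.
as_doubles = list(struct.unpack(">4d", raw))

# Interpretation (c): four signed 64-bit integers read from the same bytes.
as_integers = list(struct.unpack(">4q", raw))

print(len(raw), as_doubles, as_integers)
```

Both lists are decoded from the same 32 bytes; only the description of the native datatype tells a consumer which reading is intended.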
If both the byte stream and the data arrays are modeled as df:InfoSpaces, the relationship between them can be modeled as a process of transforming one df:InfoSpace into another. Such a process is denoted in DFDF as a df:Transformation. As shown in Figure 2, each df:Transformation has two required properties: df:src indicates the source infospace, whereas df:dest indicates the destination infospace. The domain semantics of a df:Transformation can be described by the df:for property, which relates a df:Transformation to an rdf:Property.

Figure 2 - DFDF Core ontology. The ontology establishes the core conceptualization of df:InfoSpace and df:Transformation.
It is worth noting that a df:InfoSpace contains information about the kind of resource that is related to it by the df:about property; a df:InfoSpace does not hold actual instances of that kind of resource. A df:InfoSpace that is df:about byte, for instance, should not be mistaken for a collection of byte instances. The latter semantics, if desired, should be described by an rdf:List or an RDF container. On the surface, the distinction seems arbitrary. In fact, the former semantics is chosen explicitly to avoid a serious modeling dilemma. For instance, were a df:InfoSpace defined to contain actual instances of the resource, a byte could easily be defined as an ordered collection of eight bits. Such a definition may seem advantageous because a byte stream could then be automatically reasoned into a bit stream and vice versa. The problem is that when this line of thought is followed to define the concept of a bit, we are immediately trapped in Russell's paradox, because we must decide whether a bit is an infospace of itself.
Answering such hard philosophical questions is not the objective of DFDF. DFDF is designed to facilitate the efficient storing, accessing and processing of large quantities of data. Almost always, data described by DFDF will encode the properties of some domain objects that exist elsewhere. The resource shown in Figure 1a, for instance, is developed to hold the information of a hypothetical 2DE spot discussed in [7]. The two resources have completely different natures: one is a byte stream and the other is a 2DE spot, and they are located at different places as well. It is therefore not only consistent but also natural to think that a data resource contains the information, but not the actual instance, of some domain objects. The purpose of the df:about property, therefore, is to relate a data fragment to a domain concept. It is for the same reason that the df:for property is designed to relate a df:Transformation to an rdf:Property. With these two properties in place, using DFDF to describe a df:Transformation can be equivalent to using a domain ontology to describe domain objects. For instance, in the example shown in Figure 3, _:shape is an instance of df:Transformation with data:ellipse as the source and data:spot2 as the destination space. Because data:ellipse and data:spot2 are described as df:InfoSpaces that are df:about cce:Ellipse and cce:Spot, respectively, Figure 3 essentially says that an instance of cce:Spot assumes a cce:shape of a cce:Ellipse. Furthermore, the actual subject of interest can be entailed via the df:dataSource property[b]. In this particular example, the space data:spot2 is declared as the df:dataSource of a cce:Spot, ex:spot2. The entire message shown in Figure 3, therefore, is that the information about the cce:shape of ex:spot2 resides in data:ellipse. Using the same process to continue the transformation of data:ellipse will eventually lead us to the numerical values of this particular ellipse.

Figure 3 - Connecting formatting knowledge to domain knowledge. The upper gray area indicates the domain knowledge, whereas the lower gray area shows the formatting knowledge. Shapes and lines with solid borders indicate statements that are explicitly made, and dotted lines show what is implied by the semantics of DFDF.
With the DFDF approach, instead of describing ex:spot2 as illustrated in [7], the spot can also be described as follows. To make it comparable to the original document, the following code is written in RDF/XML, with some headers omitted.
<cce:Spot rdf:about="http://www.charlestoncore.org/ontology/example/spot2">
  <df:dataSource rdf:resource="http://dfdf.inesc-id.pt/ex/1"/>
</cce:Spot>
At first glance, this new approach does not seem to offer much saving in space consumption. Considering the extra statements made in the formatting document, the approach appears to consume even more space than the original methodology. But the given example describes the shape of only one spot. As more spots are described, the advantages of the new approach become more and more significant, because the complexity of the data format description is independent of the size of the data. A more reasonable comparison, therefore, is between the size of the original RDF fragment and that of the binary data document. In [7], a total of 246 characters are used to describe the shape of ex:spot2, but with the DFDF approach only 16 bytes (for IEEE 754 single precision) or 32 bytes (for IEEE 754 double precision) are needed to encode those values. Hence, in space alone, the DFDF approach offers roughly an 8- to 16-fold saving in this particular case. Because the data is already stored in binary form, so that only minimal parsing is needed for an application to handle it, the overall processing efficiency improves even more with this new approach.
2.2. Stream
The core DFDF ontology defines a high-level conceptual model of the framework and should be used as the top ontology for all DFDF work. However, as a top ontology, its concepts have a very coarse semantic granularity. Notably, the structure of df:InfoSpace is undefined, which renders the concept of little use in practice. This void is left on purpose, because the core ontology is designed to establish the general workflow of the framework rather than to target a specific use case. Defining the structure of df:InfoSpace would severely limit the applicability of the framework.
The stream ontology, therefore, is developed to give a detailed account of a particular set of df:InfoSpaces, termed df:Stream (Figure 4). In this ontology, a df:Stream is considered to have a one-dimensional structure and to contain only homogeneous items, which are sequentially positioned according to a one-based index. The df:Stream is the modeling primitive in DFDF for two reasons. First, most electronic storage and transmission takes the form of a stream. Second, many heterogeneous spaces of higher dimensionality can be modeled via a series of df:Transformations of df:Streams. Hence, building on the concept of df:Stream gives us the most flexibility with minimal conceptual complexity.

Figure 4 - Ontology of Stream
The stream ontology also defines two other important streams. The first is df:ByteStream, which is defined to be a df:Stream of bytes. df:ByteStream plays an important role in bootstrapping information content because, unlike most other df:InfoSpaces, which fill their content through a df:Transformation, the content of a df:ByteStream can be acquired by dereferencing the URI that identifies the stream via a df:WebTransfer.
The second special type of df:Stream is the df:DataStream, which is defined to contain the information about primitive data, such as character, numerical and Boolean values.
With the stream ontology, we can, then, describe the byte-stream shown in Figure 1a with the following code.
data:bytes a df:ByteStream ;
    df:size 32 ;
    df:destOf [ a df:WebTransfer ;
        df:mimeType "application/dfdf+octet-stream" ;
        df:src <http://dfdf.inesc-id.pt/ex/1> ] .
The set of numerical values shown in Figure 1b can be described as:
data:doubles a df:DataStream; df:about xsd:double; df:size 4 .
2.3. Data Encoding
The first step of data processing is almost always to convert bytes into various data types. In DFDF, this process is modeled as a df:DataEncoding, which transforms a df:ByteStream into a df:DataStream with the help of the information provided by its df:ntype property (Figure 5). To avoid an unnecessary ontology dependency, the range of df:ntype is intentionally left undefined in the encoding ontology. df:NativeType is developed for this purpose in a separate ontology, the native datatype ontology.

Figure 5 - Data Encoding Ontology
2.4. Native Datatypes
A datatype is often represented in computer memory by a number of bytes, which collectively define the value of a specific datum. The native datatype ontology is designed to model the various types of native representation of datatypes.
The number of bits or bytes used to represent a datatype is described by df:bitSize and df:byteSize, respectively. Obviously, the semantics of these two properties overlap, since df:byteSize can always be expressed in terms of df:bitSize. df:byteSize is provided for convenience because it is more customary to refer to the size of a datatype in bytes than in bits. If both properties are specified for a given instance of df:NativeType, the two values must agree with each other. If neither property is defined, a variable-length representation is implied, for instance a UTF-8 string (see Figure 6).
Within a multi-byte datatype, the bytes can be arranged in different orders on different computer platforms, which gives rise to different data values, so the byte order must also be specified. In general, two types of byte order are used. If the bytes are arranged from low to high memory address in ascending order of significance, the order is called little-endian; its inverse is called big-endian. Two instances of the df:Endian class, df:little and df:big, are defined for describing the df:byteOrder of a native datatype.
Although bits can also be arranged differently within a byte, this is uncommon in practice. A different bit ordering may appear during the transmission of a bit stream over a serial medium, but most hardware converts the bit order automatically, and for storage the bits are almost always arranged in the normal "up" order, i.e., the least significant bit at the lowest address and more significant bits at higher addresses. For this reason, no bit-order property is defined for df:NativeType, which is always assumed to have the normal bit order. On the rare occasions where the "down" bit ordering is used, a transformation process can be used to reverse the bit order before the data encoding process.

Figure 6 - Ontology for Native Datatype
It is helpful to understand the difference between rdfs:Datatype and df:NativeType. The former is a virtual type, whereas the latter is a concrete representation of the former. Take the number 5.2820 as an example. The number itself is an abstract concept, whose value can be represented by a set of characters such as "5.2820", by the hexadecimal number "401520C49BA5E354", or by a bit string. A df:DataStream holds information about data values in the abstract sense, whereas a df:NativeType describes the way that data values are represented in an electronic medium.
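The three representations of 5.2820 mentioned above can be derived with Python's standard struct module. This is a small illustration rather than part of DFDF; it assumes the hexadecimal form quoted in the text is the big-endian IEEE 754 double-precision encoding.

```python
import struct

value = 5.2820

# Character representation: the decimal digits as text.
as_text = "5.2820"

# Native representation: 8 bytes of a big-endian IEEE 754 double, shown as hex.
as_hex = struct.pack(">d", value).hex().upper()

# The same 8 bytes written out as a 64-character bit string.
as_bits = format(int(as_hex, 16), "064b")

print(as_text, as_hex, as_bits)
```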
With all the native datatypes being specified, we can now further define the data encoding process of the example shown in Figure 1b.
data:doubles a df:DataStream; df:about xsd:double .
_:enc a df:DataEncoding;
    df:src <http://dfdf.inesc-id.pt/ex/1> ;
    df:dest data:doubles ;
    df:ntype [ a df:Double;
        df:byteOrder df:big ] .
The above statements first define data:doubles as a data stream of xsd:double. They then describe the stream of doubles as the result of a data encoding process that uses a big-endian df:Double as its native type.
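To make the role of df:ntype concrete, the following Python sketch decodes a byte stream according to a description of its native type. The NativeType class and the decode function are hypothetical stand-ins invented for this illustration; they are not part of any DFDF library, and the mapping onto Python struct format codes is an assumption.

```python
import struct
from dataclasses import dataclass

@dataclass
class NativeType:
    """Hypothetical stand-in for a df:NativeType description."""
    struct_code: str   # struct format character, e.g. "d" for an 8-byte double
    byte_size: int     # plays the role of df:byteSize
    byte_order: str    # "big" or "little", plays the role of df:byteOrder

def decode(byte_stream: bytes, ntype: NativeType) -> list:
    """Turn a byte stream into a list of values, as a df:DataEncoding would."""
    if len(byte_stream) % ntype.byte_size != 0:
        raise ValueError("byte stream length is not a multiple of the type size")
    count = len(byte_stream) // ntype.byte_size
    prefix = ">" if ntype.byte_order == "big" else "<"
    return list(struct.unpack(f"{prefix}{count}{ntype.struct_code}", byte_stream))

# A big-endian double, matching [a df:Double; df:byteOrder df:big].
big_double = NativeType(struct_code="d", byte_size=8, byte_order="big")
doubles = decode(bytes.fromhex("401520C49BA5E354"), big_double)
```

Changing only byte_order to "little" makes the same function read the bytes in reverse significance, which is why the byte order has to be part of the description.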
2.5. Number Generator
There are many circumstances in which we need to generate a set of numbers. Of course, we can encode the numbers in a df:ByteStream and use a df:DataEncoding to transform them into numerical values, but that approach is too cumbersome. df:Numbers is therefore designed to simplify the process. A df:Numbers is a virtual df:DataStream (see Figure 7). The exact values in the data stream are described by df:gstr, a string whose syntax and semantics are described by the specification for the Number Generator String. In brief, the list of numbers in a df:Numbers can be described by df:gstr in one of the following ways. First, the numbers can be listed explicitly, separated by commas, as in "1, 3, 5". Second, the numbers can be described by a loop-like syntax. For instance, the string "(1:2:3)" describes the production of three numbers with a starting value of 1 and an increment of 2. Third, the two methods can be combined, as in "1, (3:2:2)". For more complex cases, please see the specification for the Number Generator String.

Figure 7 - Number Generator
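The three forms of df:gstr described above can be expanded with a short function. The sketch below handles only those three forms and only integers; the function name expand_gstr is made up for this illustration, and the full Number Generator String specification may define syntax that this sketch does not cover.

```python
def expand_gstr(gstr: str) -> list:
    """Expand a number generator string into a list of integers.

    Handles only the forms shown in this primer:
      "1, 3, 5"    explicit values
      "(1:2:3)"    start 1, increment 2, three numbers
      "1, (3:2:2)" a mix of both
    """
    numbers = []
    for token in gstr.split(","):
        token = token.strip()
        if not token:
            continue
        if token.startswith("(") and token.endswith(")"):
            # Loop-like form: (start:increment:count)
            start, step, count = (int(p) for p in token[1:-1].split(":"))
            numbers.extend(start + i * step for i in range(count))
        else:
            numbers.append(int(token))
    return numbers

print(expand_gstr("1, (3:2:2)"))
```

Under this reading, all three example strings from the text describe the same list of numbers.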
2.6. Stream Mapping
The mapping ontology describes how the information content of one df:Stream is related to the information content of another. The mapping is modeled as a df:Transformation that forms a many-to-one association between the information items of the source stream and those of the destination stream. The participating information items of each stream are described by the properties df:srcIndices and df:destIndices (see Figure 8). If either property is not specified, the natural indices of the respective stream are assumed.

Figure 8 - Mapping Ontology
Two kinds of mapping are defined in this ontology, depending on whether the items on the "many" side are evenly grouped when mapped to the "one" side. df:RegMapping defines an evenly distributed mapping, whereas df:IregMapping defines an unevenly distributed one. To facilitate the discussion, we call the stream with more participating items the many-stream and the other stream the one-stream.
In a df:RegMapping, the information items from the many-stream are first arranged according to df:srcIndices and then evenly divided into groups, each of size df:groupSize. If df:groupSize is absent, its value is the number of participating items of the many-stream divided by that of the one-stream. For example, if the many-stream has size 99 and the one-stream has size 33, then the df:groupSize of the df:RegMapping is 3.
In a df:IregMapping, the groups may have different sizes. The size of each group is described by the property df:groupSizes, which ranges over a df:DataStream of non-negative integers.
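The grouping rules of the two mapping kinds can be sketched as follows. The function names are hypothetical, and the error handling for sizes that do not divide or add up evenly is an assumption of this sketch; the primer does not say how such cases are treated.

```python
def regular_groups(indices, one_size, group_size=None):
    """Split the many-stream indices into equal groups (df:RegMapping)."""
    if group_size is None:
        # Default: participating many-stream items divided by one-stream items.
        if len(indices) % one_size != 0:
            raise ValueError("many-stream size is not a multiple of one-stream size")
        group_size = len(indices) // one_size
    return [indices[i:i + group_size] for i in range(0, len(indices), group_size)]

def irregular_groups(indices, group_sizes):
    """Split the many-stream indices into groups of the given sizes (df:IregMapping)."""
    if sum(group_sizes) != len(indices):
        raise ValueError("group sizes do not add up to the number of indices")
    groups, start = [], 0
    for size in group_sizes:
        groups.append(indices[start:start + size])
        start += size
    return groups

many = list(range(1, 100))                  # one-based indices 1..99
reg = regular_groups(many, one_size=33)     # 33 groups of 3 items each
irr = irregular_groups([1, 2, 3, 4, 5, 6], [1, 0, 3, 2])
```

Because df:groupSizes ranges over non-negative integers, a group of size zero is allowed in the irregular case, which the second call illustrates.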
Please remember that a df:Mapping is a kind of df:Transformation, so the meaning of the mapping is described by its df:for property. For instance, we can describe the x-position property of a cce:Point by mapping a stream of double values to a stream of points as follows.
data:point a df:Stream; df:about cce:Point ; df:size 1 .
_:x-position a df:RegMapping;
    df:for cce:x-position;
    df:src data:doubles;
    df:dest data:point;
    df:srcIndices [a df:Numbers; df:gstr "1"] .
In the above description, the number 5.2820, the first number of the stream of doubles, is mapped to a cce:Point, and the mapping is declared to be df:for cce:x-position. In plain words, the RDF statements say that a point has an x-position of 5.2820. All other attributes of the point can be described in the same fashion. Interested readers can retrieve the description from the data source at "http://dfdf.inesc-id.pt/ex/1".
3. Data Identification and Access with DFDF
The above sections have shown how DFDF can be used to describe the format of a data file. In this section, we introduce how fractions of the described data can be identified and accessed in DFDF.
Most data described by DFDF will be quite large, because otherwise a straightforward RDF description may offer a simpler solution. Obviously, if an application needs to process the entire data set, it has to request the data in its entirety. However, if an application only needs to process part of the described data, substantial savings can be made if the data can be sent selectively over the internet. Achieving this goal requires two additional mechanisms. First, fractions of binary data need to be identified. For this purpose, URI fragment identifiers for binary data described by DFDF are defined, allowing the components of the DFDF data model to be used to refer to parts of binary data. Second, the availability and the programming interface of server support for returning fractions of binary data must also be specified so that a client can discover and invoke the service. For this purpose, an ontology along with its usage is specified. The details of these two specifications can be found, respectively, at "http://dfdf.inesc-id.pt/tr/fragid" and "http://dfdf.inesc-id.pt/ont/da". In the subsequent paragraphs, we use an example to give a brief overview of the basic process.
Assume an application is given the following description and is asked to retrieve information about the cce:center of ex:spot2.
ex:spot2 a cce:Spot; df:dataSource <http://dfdf.inesc-id.pt/ex/1#spot> .
The above description shows that ex:spot2 is a cce:Spot and that its data is located at an infospace identified by "http://dfdf.inesc-id.pt/ex/1#spot". The application can then obtain the model document of the data source by executing an HTTP GET for an RDF representation of the respective URI. From the returned RDF document, the application should make the following discoveries. First, the information about the cce:center of ex:spot2 is contained in an instance of df:InfoSpace, data:point. Second, the property df:daqn of the primary data resource "http://dfdf.inesc-id.pt/ex/1" is set to "brof". The presence of the df:daqn property indicates that the server supports the return of partial binary data. What the application needs now is to construct the URI for the binary representation of data:point. In this particular case, the construction is simple: it only needs to insert an exclamation point "!" after the number sign "#" of the URI for data:point. The resulting URI becomes
http://dfdf.inesc-id.pt/ex/1#!point
To request the partial binary stream, a URI query component is constructed by using the value of df:daqn as the name of the query parameter and the fragment ID of the binary representation as the value. In the given example, the full URI for the request would be:
http://dfdf.inesc-id.pt/ex/1?brof=!point
The "Accept" header should be set to "application/brof+octet-stream", which is also the content type of the response if the server grants the request. If for any reason the server cannot grant the request, it should return an RDF document to explain the failure.
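The URI construction described above is mechanical enough to sketch in Python. The three helper functions below are hypothetical names introduced for this illustration. Deriving the expected media type from the df:daqn value mirrors the "brof" example only and is an assumption, since the media type may vary from server to server. No network request is made.

```python
from urllib.parse import quote, urldefrag

def binary_fragment_uri(infospace_uri: str) -> str:
    """Insert "!" after "#" to name the binary representation of an infospace."""
    base, fragment = urldefrag(infospace_uri)
    return f"{base}#!{fragment}"

def partial_request_uri(infospace_uri: str, daqn: str) -> str:
    """Build the request URI: the df:daqn value names the query parameter."""
    base, fragment = urldefrag(infospace_uri)
    # Keep "!" literal so the result matches the form shown in the primer.
    return f"{base}?{daqn}={quote('!' + fragment, safe='!')}"

def is_partial_response(content_type: str, daqn: str) -> bool:
    """Check whether the response carries the requested partial binary data."""
    media_type = content_type.split(";")[0].strip().lower()
    return media_type == f"application/{daqn}+octet-stream"

point = "http://dfdf.inesc-id.pt/ex/1#point"
print(binary_fragment_uri(point))
print(partial_request_uri(point, "brof"))
```

A client would send the second URI with the matching "Accept" header and treat any other returned content type as a sign that the request was not granted.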
Please note that the media type used for requesting and returning fractions of binary data is not a standard registered MIME type and may vary from server to server. The reason behind this design is that the DFDF data access API follows the REST architectural style[9] and uses the returned MIME type to signal the success or failure of a request; hence it needs a mechanism to safeguard against careless practice. For instance, almost all existing web servers ignore query strings that they do not understand. Dereferencing "http://www.w3.org?foo=bar" returns the same response as dereferencing "http://www.w3.org" (at least as of 09/03/2007). If a common registered MIME type, such as "application/octet-stream", were used to signal the return of partial binary data, unintended mistakes could easily be made. Consider what happens if a client misspells the name of the query parameter, for example using "broff" instead of "brof" in the above example. The request is sent to the server, which ignores "broff" and returns the entire data stream. Upon receiving the data, the client is unable to detect its own or the server's mistake and treats the response as normal, which eventually leads to processing errors. Of course, careful server implementation is always the key, but using a dynamically generated media type offers an additional mechanism to prevent miscommunication.
4. Data Processing with DFDF
In the above sections, we have shown how arbitrary binary data can be conceptually described, individually identified and selectively accessed. In this section, we describe how data described by DFDF can be bound to and processed by an Object-Oriented Programming (OOP) language.
Because DFDF is a system based on RDF, its data binding must be built on top of an RDF binding as well. Most existing RDF bindings are created for building generic RDF parsers, so only bindings for the generic RDF concepts, such as resources, classes and properties, are defined. Automating data processing in DFDF, however, requires a binding between domain resources and specific object types of a programming language. With such a binding, given an RDF data set, an application would be able to automatically generate a set of objects and subsequently manipulate them in a programming environment.

Figure 9 - Object Resource Binding Ontology
The binding for the basic datatypes of RDF, such as xsd:float and xsd:string, should simply follow the XML binding specification of the respective programming language. What needs to be defined is how domain knowledge defined in RDF, such as cce:Spot and cce:Ellipse, can be mapped to data types in a programming language. Owing to the similar syntax of the two languages, the mapping is in fact quite straightforward. As shown in Figure 9, a simple one-to-one mapping with a few name tags suffices.
But a shared syntax must not be equated with shared semantics. In many key respects, the semantics of RDF differs fundamentally from that of OOP[10]. The interpretation of an RDF model, for instance, is based on open-world semantics, where the unknown is not considered false. For an RDF resource, neither the absence of an optional property nor the presence of an unknown one denies its membership in its declared class. An instance of cce:Ellipse, for example, does not need to have a cce:center to be a cce:Ellipse. Conversely, being a cce:Ellipse does not mean the resource cannot have an unknown property such as a _:color or a dc:creator. The same cannot be said of an OOP object. The definition of an OOP class carries closed-world semantics, in the sense that what is not defined should not exist. An ellipse instance, therefore, must either have a center or not. In other words, it can be an instance of either Ellipse1 or Ellipse2 shown in Figure 10, but it cannot be an instance of both at the same time.

Figure 10 - Two OO interpretations of cce:Ellipse. The attribute name of Ellipse1 is intentionally made different from that of Ellipse2 to show that the attribute is only meaningful within a class definition.
The semantics of a programming language concerns what a program does when its code is executed. Hence, what matters for an OOP class is not what kind of properties an object can carry, but what kind of behavior the object can exhibit. Whether a property should be defined in a class is determined solely by its necessity in fulfilling the class's defined behavior. The semantics of RDF and OWL are different. They are logic languages that use consistency checking to determine the meaning of a knowledge base. The presence or absence of a property, therefore, is important in validating a resource's membership in a class. These are two different semantics, and they are orthogonal to each other. In principle, therefore, an RDF resource can be mapped to any OOP object and vice versa, because there is no objective notion of equivalence that can be used to evaluate an RDF-OOP binding. In practice, however, a binding can be judged by whether the exhibited behavior of an OOP object makes sense for the bound RDF resource. For instance, if the class Ellipse2 shown in Figure 10b is modeled as a real-world ellipse, binding cce:Ellipse to Ellipse2 naturally makes sense. On the other hand, if the task at hand is to study the shape distribution of some cce:Spots, it makes sense as well for a program to bind cce:Ellipse to a Point class. An RDF-to-OOP binding, therefore, is always contextual, and one RDF resource can be bound to different OOP objects in different contexts.
Once a binding is defined, it can be used to compile new OOP class definitions or to reuse those from an existing software library. In the case of DFDF, a binding would also allow a particular implementation of df:Transformation to be shared for generating programming objects. For instance, assume that all df:Stream resources are modeled as object arrays. If a particular implementation of df:Transformation exposes the following interface,
Object[] Xformation.do(Object[] srcStream);
we can describe the interface with the ontology shown in Figure 9.
_:xform a df:Transformation;
    df:oopCls [df:name "Xformation"];
    df:apply [a df:Method; df:names "do"] .
Hence, with this description and an additional description of the location of the software library, an agent should be able to dynamically retrieve the library and invoke the appropriate method to generate a collection of domain objects from a given DFDF description.
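The dynamic invocation can be sketched as follows. This is an illustration only: the description is reduced to a plain dictionary instead of parsed RDF, the `Xformation` class is a local stand-in for a downloaded library, and the `invoke` helper and the namespace argument are hypothetical.

```python
class Xformation:
    """Stand-in for a library class; a real agent would load it
    from the described library location."""

    def do(self, src_stream):
        # Placeholder behavior: turn each raw item into a string.
        return [str(item) for item in src_stream]


# Class and method names as they appear in the description above.
description = {"oopCls": "Xformation", "apply": "do"}


def invoke(description, namespace, src_stream):
    """Look up the described class and method by name, then call the method."""
    cls = namespace[description["oopCls"]]
    method = getattr(cls(), description["apply"])
    return method(src_stream)


result = invoke(description, {"Xformation": Xformation}, [1, 2])
```

The agent never hard-codes the class or method name; both come from the description, which is what allows one transformation implementation to be shared.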
An additional benefit of such a model is that it would also allow proprietary data to be mixed with open data. Proprietary data is data for which the information about its df:Transformation is not fully specified. For instance, to make the example data source private, we can use the following description.
_:proprietary a df:Transformation;
    df:src <http://dfdf.inesc-id.pt/ex/data/1>;
    df:dest [a df:Stream; df:about cce:Spot];
    df:lib <http://example.com/lib> .
By making the meaning of df:Transformation opaque and by guarding access to the library URI, proprietary data can still be described in DFDF without losing protection of the data owner's intellectual property. As this is a primer, only the simplest case is introduced. For more detailed information about data binding and processing in DFDF, please see the full specification at "http://dfdf.inesc-id.pt/ont/orb".
5. Summary
DFDF is a semantic-web-based data standardization technology. Unlike earlier standardization technologies, which prescribe a specific format for data encoding, DFDF standardizes the description of the data encoding instead. Such a descriptive approach allows data to be freely encoded while improving the data's interoperability through the use of domain ontologies.
The general architecture of a DFDF application involves the interaction of three kinds of resources. The first kind is the domain knowledge document, in which domain ontologies are used to describe domain objects. The example RDF fragment illustrated in [7] belongs to this category. Typically, if the described domain object is too large to be described in plain RDF, a df:dataSource can be specified to indicate the binary encoding of the knowledge. The df:dataSource property points to the second kind of resource, a data resource. Typically, a data resource has at least two different kinds of representation. The first is the binary representation of the domain knowledge, whose content can be retrieved by dereferencing the URI of the data resource with the content type "application/dfdf+octet-stream". The second should be an RDF document in which the arrangement of bits and bytes in the resource's binary representation is described with DFDF ontologies. A DFDF application should be able to use the collective representations of the data resource to transform the stored bits and bytes into domain objects like cce:Gel and cce:Spot. The third kind of resource is the software library resource. Like the data resource, the library resource can have at least two representations. Its RDF representation should describe the programming language used and the necessary binding between the programming objects and the domain objects. The other representation is the binary code, which can be downloaded and executed according to the RDF description (see Figure 12).

Figure 12 - Basic architecture of DFDF.
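The two representations of a data resource could be requested from the same URI by varying the HTTP Accept header. The sketch below only builds the requests and sends nothing over the network. The `representation_request` helper is hypothetical, and the use of "application/rdf+xml" for the RDF representation is an assumption; only "application/dfdf+octet-stream" is named in this primer.

```python
from urllib.request import Request

# Example data resource URI used earlier in this primer.
DATA_URI = "http://dfdf.inesc-id.pt/ex/data/1"


def representation_request(uri, media_type):
    """Build (but do not send) a GET request for one representation of uri."""
    return Request(uri, headers={"Accept": media_type})


# Binary representation of the domain knowledge.
binary_request = representation_request(DATA_URI, "application/dfdf+octet-stream")

# RDF description of the binary layout (media type assumed for illustration).
rdf_request = representation_request(DATA_URI, "application/rdf+xml")
```

Passing either request to `urllib.request.urlopen` would perform the actual retrieval, assuming the server supports content negotiation for that URI.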
Although the DFDF ontologies introduced above focus primarily on binary data representation, other types of data encoding can be handled in a similar fashion. For instance, it is possible to develop an XML infospace that acquires information from an XML InfoSet [11] and subsequently uses a domain ontology together with XPointer to describe the semantics of the XML data elements. Any existing data format, open or proprietary, can in fact be modeled as a special df:InfoSpace, for which a specialized df:Transformation can be developed to turn the raw data into domain knowledge.
Because the design of DFDF was motivated by data of large quantity, the effort is primarily focused on the situation where the raw data resource is physically separated from its RDF description, so as to prevent a user from unknowingly requesting a huge amount of data. The workflow of a DFDF application, however, does not depend on the separation of documents. As long as the subject of the DFDF description is clear, the RDF description and its raw data can be packed within the same document. The single-document approach may be useful in certain circumstances, such as returning the query result of an RDF store or a relational database. But whether such an approach should be chosen over a straightforward RDF description should be evaluated on a case-by-case basis. In either case, however, there should be no need to design a specific document format. DFDF relieves us of the burden of data syntax and allows us to focus more on data semantics by developing and sharing domain ontologies.
6. References
1. W3C. Working Group Note, XML Binary Characterization Use Cases, Cokus, M. and S. Pericas-Geertsen, 2005, http://www.w3.org/TR/xbc-use-cases/
2. W3C. Recommendation, XML-binary Optimized Packaging, Gudgin, M., et al., 2005, http://www.w3.org/TR/xop10/
3. W3C. Working Group Note, XML Binary Characterization Properties, Cokus, M. and S. Pericas-Geertsen, 2005, http://www.w3.org/TR/xbc-properties/
4. W3C. Recommendation, Architecture of the World Wide Web, Jacobs, I. and N. Walsh, 2004, http://www.w3.org/TR/webarch/
5. IETF. RFC2616, Hypertext Transfer Protocol -- HTTP/1.1, Fielding, R., et al., 1999, http://www.ietf.org/rfc/rfc2616.txt
6. W3C. Resource Description Framework Specifications, http://www.w3.org/RDF/#specs
7. Wang, X., R. Gorlitsky, and J.S. Almeida, From XML to RDF: How semantic web technologies will change the design of 'omic' standards. Nature Biotechnology, 2005. 23: p. 1099-1103.
8. IETF. RFC2045, Multipurpose Internet Mail Extensions (MIME) Part ONE: Format of Internet Message Bodies, Freed, N. and N. Borenstein, 1996, http://tools.ietf.org/html/rfc2045
9. Fielding, R.T., Architectural Styles and the Design of Network-based Software Architectures. 2000, University of California, Irvine.
10. W3C. Working Group Note, A Semantic Web Primer for Object-Oriented Software Developers, Knublauch, H., et al., 2006, http://www.w3.org/TR/sw-oosd-primer/
11. W3C. Recommendation, XML Information Set (Second Edition), Cowan, J. and R. Tobin, 2004, http://www.w3.org/TR/xml-infoset/
[a] It is worth noting that the "one-to-many" relationship between a URI and its representations is not supported by all URI schemes. A "file" or "ftp" URI, for instance, is associated with only one representation. Obviously, resources identified by this type of URI cannot use the identifier to bind a model with the data. Nevertheless, the logical binding can still be made with a proxy document. For example, a simple XML document of the following structure may serve the purpose.
<dfdf data="URI Reference" model="URI Ref"/>
However, in order to create a more integrated web, the W3C has recently been converging on the idea of making HTTP the default URI scheme (see http://www.w3.org/2001/tag/doc/URNsAndRegistries-50). Hence, in this project, all data documents are assumed to be deployed on the web with an HTTP URI. The proxy-document approach and other alternatives will not be discussed.
[b] There is an implicit relation between df:dataSource and df:about: For any given InfoSpace S, if S is the df:dataSource of x, and Y is the df:about of S, then it must be true that x is an instance of Y. However, I am not sure how this can be explicitly specified in OWL.
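Independently of how the rule might be written in OWL, the implicit relation in note [b] can be applied procedurally. The following is a rough sketch under stated assumptions: triples are plain tuples, the `infer_types` helper is hypothetical, and the sample resource `_:gel1` is invented for illustration.

```python
def infer_types(triples):
    """Apply the rule: x df:dataSource S and S df:about Y imply x rdf:type Y."""
    about = {}
    for subject, predicate, obj in triples:
        if predicate == "df:about":
            about.setdefault(subject, set()).add(obj)
    return {
        (x, "rdf:type", y)
        for x, predicate, s in triples
        if predicate == "df:dataSource"
        for y in about.get(s, ())
    }


graph = {
    ("_:gel1", "df:dataSource", "http://dfdf.inesc-id.pt/ex/data/1"),
    ("http://dfdf.inesc-id.pt/ex/data/1", "df:about", "cce:Gel"),
}

inferred = infer_types(graph)
```

Here the only inferred statement is that `_:gel1` is an instance of `cce:Gel`.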

