Revision History | |
---|---|
Revision 0.1 | 20 July 2006 |
Initial version. | |
Revision 0.2 | 16 November 2006 |
Added section on Comments. | |
Revision 0.3 | 3 December 2006 |
Added section on Packages and Imports. | |
Revision 0.4 | 11 January 2007 |
Added section on Subtypes. | |
Revision 0.5 | 23 May 2007 |
More details on packages. Scope of enum items changed from global to local. More background in introduction. | |
Revision 0.6 | 28 June 2007 |
New sum operator. Functions. New section on Relational Extensions. | |
Revision 0.7 | 29 November 2007 |
New choice type. Functions may return compound types. Explicit parameters. bitsizeof operator. align(n) modifier. UTF-8 string encoding. |
Table of Contents
There are dozens of languages for modelling abstract datatypes. For some of these, the binary representation of the defined types is implementation dependent and of no concern. Others do provide a binary encoding, but there is usually no way to retrofit an abstract specification to an existing binary format.
DataScript is a formal language for modelling binary datatypes, bitstreams or file formats. Using a formal language for defining such binary datatypes resolves all ambiguities typically found in textual or tabular specifications.
In addition, one can automatically generate encoders and decoders for a given binary format from such a formal specification, so that application developers do not have to worry about serialization and can focus on application logic instead.
The present document describes a DataScript dialect supported by
our implementation called rds
(short for Relational
DataScript), which is available from SourceForge [rds]
under a BSD License. This Language Overview is more of a User's Manual
than a formal language specification.
DataScript was designed by Godmar Back [Back]. His reference implementation of a DataScript compiler also includes a Java code generator producing Java classes which are able to read and write a binary stream that complies with a DataScript specification.
rds
and the DataScript dialect defined in this
document are a spin-off of the Physical Storage Format
Standardization Initiative (PSI), a joint effort of about 20
partner companies in the automotive industry to standardize a car
navigation database format.
After evaluating different data modelling languages and toolsets, including ASN.1, CSN.1, UML, XML Schema and others, DataScript was selected by the PSI as the closest match to our requirements.
While Back's reference implementation [DataScript] provided a great start, we found that some
language extensions were desirable to better support our specific
requirements. For this reason, we branched off our own DataScript Tools
or dstools
project from Back's reference
implementation.
As a major addition to the DataScript language, we introduced
relational extensions, which permit the definition of hybrid data
models, where the high-level access structures are implemented by
relational tables and indices, whereas the bulk data are stored in
single columns as BLOBs with a format defined in DataScript, hence the
name Relational DataScript. rds
is currently built on top of the SQLite embedded database [SQLite].
The DataScript syntax for literal values is similar to the Java syntax. There are no character literals, only string literals with the usual escape syntax. Integer literals can use decimal, hexadecimal, octal or binary notation.
Examples:
Decimal: 100, 4711, 255
Hexadecimal: 0xCAFEBABE, 0Xff
Octal: 044, 0377
Binary: 111b, 110b, 001B
String: "You"
Hexadecimal digits and the x
prefix as well as
the b
suffix for binary types are
case-insensitive.
String literals correspond to zero-terminated UTF-8-encoded strings.
Thus, the literal "You"
corresponds to a sequence of 4
bytes equal to the binary representation of the integer literal
0x596F7500
. Other character encodings (e.g. ISO 8859-1
or UTF-16) are not supported.
DataScript supports the following integer base types
Unsigned Types: uint8, uint16, uint32,
uint64
Signed Types: int8, int16, int32,
int64
These types correspond to unsigned or signed integers represented
as sequences of 8, 16, 32 or 64 bits, respectively. Negative values are
represented in two's complement, i.e. the hex byte FF
is 255 as uint8
or -1 as
int8
.
The default byte order is big endian. Thus, for multi-byte integers, the most significant byte comes first. Within each byte, the most significant bit comes first.
Example: The byte stream 02 01
(hex)
interpreted as int16
has the decimal value 513. As a
bit stream, this looks like 0000 0010 0000 0001
. Bit
0 is 0
, bit 15 is 1
.
A bit field type is denoted by bit:1, bit:2
,
... The colon must be followed by a positive integer literal, which
indicates the length of the type in bits. The length is not limited. A
bit field type corresponds to an unsigned integer of the given length.
Thus, bit:16
and uint16
are
equivalent.
Signed bit field types are not supported.
Variable length bit field types can be specified as
bit<
expr>
,
where expr is an expression of integer type to be
evaluated at run-time.
A string type is denoted by string
. It is
represented by a sequence of bytes in UTF-8 encoding, terminated by a
zero byte. Thus, the encoded size of a string with
n characters is at least n+1
bytes, and it may even be larger if some of the characters have a
multibyte UTF-8 encoding.
Since DataScript models arbitrary bitstreams, the term byte should not be taken too literally in this context: A byte in this sense is a group of 8 successive bits, where the offset of the first bit of each group from the enclosing type or the beginning of the stream is not necessarily divisible by 8.
An enumeration type has a base type which is an integer type or a bit field type. The members of an enumeration have a name and a value which may be assigned explicitly or implicitly. A member that does not have an initializer gets assigned the value of its predecessor incremented by 1, or the value 0 if it is the first member.
enum bit:3 Color { NONE = 000b, RED = 010b, BLUE, BLACK = 111b };
In this example, BLUE
has the value 3. When
decoding a member of type Color
, the decoder will read
3 bits from the stream and report an error when the integer value of these
3 bits is not one of 0, 2, 3 or 7.
An enumeration type provides its own lexical scope, similar to Java
and dissimilar to C++. The member names must be unique within each
enumeration type, but may be reused in other contexts with different
meanings. Referring to the example, any other enumeration type
Foo
may also contain a member named
NONE
.
In expressions outside of the defining type, enumeration members
must always be prefixed by the type name and a dot, e.g.
Color.NONE
.
A sequence type is the concatenation of its members. There is no padding or alignment between members. Example:
MySequence { bit:4 a; uint8 b; bit:4 c; };
This type has a total length of 16 bits or 2 bytes. As a bit
stream, bits 0-3 correspond to member a
, bits 4-11
represent an unsigned integer b
, followed by member
c
in bits 12-15. Note that member
b
overlaps a byte boundary, when the entire type is
byte aligned. But MySequence
may also be embedded
into another type where it may not be byte-aligned.
A union type corresponds to exactly one of its members, which are also called branches.
union VarCoordXY { CoordXY8 coord8 : width == 8; CoordXY16 coord16 : width == 16; CoordXY24 coord24 : width == 24; CoordXY16 coord32 : width == 32; };
In this example, the union VarCoordXY
has two
branches coord8
and coord16
. The
syntax of a member definition is the same as in sequence types. However,
each member should be followed by a constraint. This is a boolean
expression introduced by a colon. The terms involved in the constraint
must be visible in the scope of the current type at compile time and
must have been decoded at runtime before entering the branch.
The decoding semantics of a union type is a trial-and-error method. The decoder tries to decode the first branch. If a constraint fails, it proceeds with the second branch, and so on. If all branches fail, a decoder error is reported for the union type.
A branch without constraints will never fail, so any following branches will never be matched. This can be used to express a default branch of a union, which should be the last member.
When all constraints of a union depend on the same member, a choice type is usually more convenient.
A choice type is a shorthand notation for a union where all constraints compare a given member or parameter to one or more constant values.
choice VarCoordXY on Coord.width { case 8: CoordXY8 coord8; case 16: CoordXY16 coord16; case 24: CoordXY24 coord24; case 32: CoordXY32 coord32; };
A choice type depends on a selector
expression following the on
keyword. Each
branch of the choice type is preceded by one or more case labels with a
literal value. After evaluating the selector expression, the decoder
will directly select the branch labelled with a literal value equal to
the selector value. This is more efficient than the trial-and-error
method applied to union types. Loosely speaking, a union type
corresponds to a chain of if ... else if ... else if ...
else
statements, whereas a choice type is equivalent to a
single switch
statement.
In the example above, the selector expression refers to a member
width
of a type named Coord
containing the current choice type (see Section 10, “Member Access and Contained Types”). Alternatively, the selector may be a
parameter of the choice type (see Section 11, “Parameterized Types”):
choice VarCoordXY(uint8 width) on width { case 8: CoordXY8 coord8; case 16: CoordXY16 coord16; case 24: CoordXY24 coord24; case 32: CoordXY32 coord32; };
A given branch of a choice may have more than one case label. In this case, the branch is selected when the selector value is equal to any of the case label values. A choice type may have a default branch which is selected when no case label matches the selector value. The decoder will throw an exception when there is no default branch and the selector does not match any case label. Any branch, including the default branch, may be empty, with a terminating semicolon directly following the label. It is good practice to insert a comment in this case. When the selector expression has an enumeration type, the enumeration type prefix may be omitted from the case label literals.
choice AreaAttributes(AreaType type) on type { case AreaType.COUNTRY: // The prefix "AreaType." is optional case STATE: case CITY: RegionAttributes regionAttr; case MAP: /* empty */ ; case ROAD: RoadAttributes roadAttr; default: DefaultAttributes defaultAttr; };
A constraint may be specified for any member of a compound type, not just for selecting a branch of a union. In a sequence type, after decoding a member with a constraint, the decoder checks the constraint and reports an error if the constraint is not satisfied.
There is a shorthand syntax for a constraint that tests a field
for equality. Type fieldName =
expr;
is equivalent to Type
fieldName :
fieldName ==
expr;
A sequence type may have optional members:
ItemCount { uint8 count8; uint16 count16 if count8 == 0xFF; };
An optional member has an if
clause with a
boolean expression. The member will be decoded only if the expression
evaluates to true at run-time.
Optional members are a more compact and convenient alternative to a union with two branches one of which is empty.
A compound type may contain functions:
ItemCount { uint8 count8; uint16 count16 if count8 == 0xFF; function uint16 getValue() { return (count8 == 0xFF) ? count16 : count8; } };
The return type of a function has to be a standard integer or compound type, and the function parameter list must be empty. The function body may contain nothing but a return statement with an expression matching the return type.
Functions are intended to provide no more than simple expression semantics. There are no plans to add more advanced type conversion or even procedural logic to DataScript. (For complex logic, it would be more sensible to bind native functions to DataScript.)
An array type is like a sequence of members of the same type. The element type may be any other type, except an array type. (Two dimensional arrays can be emulated by wrapping the element type in a sequence type.)
The length of an array is the number of elements, which may be fixed (i.e. set at compile-time) or variable (set at run-time). The elements of an array have indices ranging from 0 to n-1, where n is the array length.
The notation for array types and elements is similar to C:
ArrayExample { uint8 header[256]; int16 numItems; Element list[numItems]; };
header
is a fixed-length array of 256 bytes;
list is an array with n elements, where
n is the value of numItems
.
Individual array elements may be referenced in expressions with the
usual index notation, e.g. list[2]
is the third
element of the list
array.
Constraints on all elements of an array can be expressed with the
forall
operator, see Section 8.3.2, “Quantified Expression”.
An array type may have an implicit length indicated by an empty pair of brackets. In this case, the decoder will continue matching instances of the element type until a constraints fail or the end of the stream is reached.
ImplicitArray { Element list[]; };
The length of the list array can be referenced as
lengthof list
, see Section 8.2.5, “lengthof Operator”.
The name of a member of integral type may be used as a label on another member to indicate its byte offset in the enclosing sequence:
Tile { TileHeader header; uint32 stringOffset; uint16 numFeatures; stringOffset: StringTable stringTable; };
In this example, the byte offset of member
stringTable
from the beginning of the
Tile
instance is given by the value of
stringOffset
.
The offset of a label is relative to the enclosing sequence by default. If the offset is relative to some other type containing the current one, this is indicated by a global label, where the type name is used as a prefix, followed by a double colon:
Database { uint32 numTiles; Tile tiles[numTiles]; }; Tile { TileHeader header; uint32 stringOffset; uint16 numFeatures; Database::stringOffset: StringTable stringTable; };
Since labels always refer to byte offsets, a given member within a sequence type cannot be labelled if it is not guaranteed to be byte-aligned. To overcome this restriction, an alignment can be specified explicitly:
Tile { TileHeader header; uint32 stringOffset; uint16 numBits; bit:1 bits[numBits]; align(8): stringOffset: StringTable stringTable; };
The align(n) modifier causes the decoder to skip 0..n-1 bits so that the bit offset from the beginning of the stream is divisible by n. n may be any integer literal. Alignment modifiers may be used in any sequence type, independent of labels:
AlignmentExample { bit:11 a; align(32): uint32 b; };
The size of the AlignmentExample type is 64 bits; without the alignment modifier, the size would be 43 bits.
The semantics of expression and the precedence rules for operators
is the same as in Java, except where stated otherwise. DataScript has a
number of special operators sizeof, lengthof, is
and
forall
that will be explained in detail below.
The following Java operators have no counterpart in DataScript:
++, --, >>>, instanceof
.
The integer arithmetic operations include +
(addition), -
(subtraction), *
(multiplication), /
(division),
%
(modulo). In addition, there are the shift
operators <<
and
>>
.
There are the following relational operators for integer
expressions: ==
(equal to), !=
(not equal to), <
(less than),
<
(less than or equal), >
(greater than), >=
(greater than or
equal).
The equality operators ==
and
!=
may be applied to any type
The boolean operators &&
(and) and
||
(or) may be applied to boolean
expressions.
The bit operators &
(bitwise and),
|
(bitwise or), ^
(bitwise
exclusive or) may be applied to integer types.
The assignment operator =
and the combined
assignment operators *=, /=, %=, +=, -=, <<=,
>>=, &=, ^=, |=
have the usual semantics.
For integer expressions, there are +
(unary
plus), -
(unary minus) and ~
(bitwise complement).
The sizeof
operator returns the size of a
type or an expression in bytes. sizeof
may not be
used, when the size in bits is not divisible by 8. When
sizeof
is applied to a type name, the size of the
type must be fixed and known at compile time. When
sizeof
is a applied to a member, it refers to the
actual size of the member after decoding.
The bitsizeof
operator returns the size of a
type or an expression in bits. When bitsizeof
is
applied to a type name, the size of the type must be fixed and known
at compile time. When bitsizeof
is a applied to a
member, it refers to the actual size of the member after
decoding.
The lengthof
operator may be applied to an
array member and returns the actual length (i.e. number of elements of
an array.Thus, given int32 a[5]
, the expression
lengthof a
evaluates to 5. This is not particularly
useful for fixed or variable length arrays, but it is the only way to
refer to the length of an implicit length array.
The is
operator can be applied to two field
names, e.g. x is y
. x
must be a
member of union type, and y
must be one of the
branch names of that union. The expression is true if and only if the
decoder has selected branch y for the union.
A conditional expression booleanExpr ? expr1 : expr2 has the value of expr1 when booleanExpr is true. Otherwise, it has the value of expr2.
A quantified expression has the form forall
indexIdentifier in
arrayExpr :
booleanExpr. The quantified expression is true if
and only if the booleanExpr is true for all
indices of the array. This is only useful when the boolean expression
after the colon involves the array expression and the index identifier
from the left hand side.
Example: The constraint
forall i in a : (i == 0) || (a[i] == a[i-1]+1)
means the elements of a
are a sequence of
consecutive integers.
In the following list, operators are grouped by precedence in ascending order. Operators on the bottom line have the highest precedence and are evaluated first. All operators on the same line have the same precedence and are evaluated left to right, except assignment operators which are evaluated right to left.
comma
assignment
forall
? :
||
&&
|
^
&
== !=
< > <= >=
<< >>
+ -
* / %
cast
unary + - ~ !
sizeof bitsizeof lengthof sum
[] () . is
DataScript syntax permits the definition of nested types, however, it is not easy to define the semantics of such types in a consistent way. For the time being, the only supported use is a sequence type definition within a sequence or union field definition, or a union type definition within a sequence field definition, and even this should be avoided in favour of a reference to a type defined at global scope. Example:
VarCoord { uint8 width; union { { int16 x; int16 y; } coord16 : width == 16; { int32 x; int32 y; } coord32 : width == 32; } coord; };
The sequence type VarCoord
contains the member
coord
which has a nested union type definition. This
union type has two members each of which is a nested sequence type. All
nested types in this example are anonymous, but this it not
necessary.
The nested type definitions can be avoided as follows:
VarCoord { uint8 width; Coords coords; }; union Coords { Coord16 coord16 : VarCoord.width == 16; Coord32 coord32 : VarCoord.width == 32; }; Coord16 { int16 x; int16 y; }; Coord32 { int32 x; int32 y; };
Note that the constraints for the members of the
Coords
union refer to the containing type
VarCoord
. This is explained in more detail in the
following section.
The dot operator can be used to access a member of a compound type: the expression f.m is valid if
f is a field of a compound type C
The type T of f is a compound type.
T has a member named m.
The value of the expression f.m can be evaluated at run-time only if the member f has been evaluated before.
There is a second use of the dot operator involving a type name:
At run-time, each compound type C (except the root type) is contained in a type P which has a member of type C which is currently being decoded. Within the scope of C, members of the parent type P may be referenced using the dot operator P.m.
The containment relation is extended recursively: If C is contained in P and P is contained in Q, then Q.m is a valid expression in the scope of C, denoting the member m of the containing type Q.
Example:
Header { uint32 version; uint16 numItems; }; Message { Header h; Item items[h.numItems]; }; Item { uint16 p; uint32 q if Message.h.version >= 10; };
Within the scope of the Message
type,
header
refers to the field of type
Header
, and header.numItems
is a
member of that type. Within the scope of the Item
type,
the names h
or Header
are not
defined. But Item
is contained in the Message type, and
h
is a member of Message
, so
Message.h
is a valid expression of type
Header
, and
references the
Message.h
.versionversion
member of the Header
type.
The definition of a compound type may be augmented with a parameter list, similar to a parameter list in a Java method declaration. Each item of the parameter list has a type and a name. Within the body of the compound type definition, parameter names may be used as expressions of the corresponding type.
To use a parameterized type as a field type in another compound type, the parameterized type must be instantiated with an argument list matching the types of the parameter list.
For instance, the previous example can be rewritten as
Header { uint32 version; uint16 numItems; }; Message { Header h; Item(h) items[h.numItems]; }; Item(Header header) { uint16 p; uint32 q if header.version >= 10; };
When the element type of an array is parameterized, a special notation can be used to pass different arguments to each element of the array:
Database { uint16 numBlocks; BlockHeader headers[numBlocks]; Block(headers[blocks$index]) blocks[numBlocks]; }; BlockHeader { uint16 numItems; uint32 offset; }; Block(BlockHeader header) { Database::header.offset: Item items[header.numItems]; };
blocks$index
denotes the current index of the
blocks
array. The use of this expression in the
argument list for the Block
reference indicates that
the i-th element of the blocks
array is of type
Block
instantiated with the i-th header
headers[i]
.
A subtype definition defines a new name for a given type, optionally
in combination with a constraint. When the constraint is omitted, this is
rather like a typedef
in C:
subtype uint16 BlockIndex; Block { BlockIndex index; BlockData data; };
A constraint in the subtype definition is, as usual, a boolean
expression introduced by a colon which may contain the keyword
this
to refer to the current type:
subtype uint16 BlockIndex : 1 <= this && this < 1024;
Subtype constraints will be checked by the decoder for every occurrence of the given subtype in a field definition.
Implementation note: Subtypes were introduced in version rds 0.7. Subtype constraints are not yet implemented.
DataScript supports the standard comment syntax of Java or C++.
Single line comments start with //
and extend to the
end of the line. A comments starting with /*
is
terminated by the next occurrence of */
, which may or
may not be on the same line.
// This is a single-line comment. /* This is an example of a multi-line comment spanning three lines. */
To support inline documentation within a DataScript module,
multi-line comments starting with /**
are treated as
special documentation comments. The idea and syntax are borrowed from
Java(doc). A documentation comment is associated to the following type
or field definition. The documentation comment and the corresponding
definition may only be separated by whitespace.
/** * Traffic flow on links. */ enum bit:2 Direction { /** No traffic flow allowed */ NONE, /** Traffic allowed from start to end node. */ POSITIVE, /** Traffic allowed from end to start node. */ NEGATIVE, /** Traffic allowed in both directions. */ BOTH };
The content of a documentation comment, excluding its delimiters,
is parsed line by line. Each line is stripped of leading whitespace, a
sequence of asterisks (*
), and more whitespace, if
present. After stripping, a comment is composed of one or more
paragraphs, followed by zero or more tag blocks. Paragraphs are
separated by lines.
A line starting with whitespace and a keyword preceded by an
at-sign (@
) is the beginning of a tag block and also
indicates the end of the preceding tag block or comment
paragraph.
The only tag currently supported is @param
,
which is used for documenting the arguments of a parameterized type. The
documentation comment should contain one @param
block
for each argument in the correct order. The @param
tag is followed by the parameter name and the parameter description. The
parameter name must be enclosed by whitespace.
/** * This type takes two arguments. * @param arg1 The first argument. * @param arg2 The second argument. */ ParamType(Foo arg1, Blah arg2) { ... };
Complex DataScript specifications should be split into multiple packages stored in separate source files. Every user-defined type belongs to a unique package. For backward compatibility, there is an unnamed default package used for files without an explicit package declaration. It is strongly recommended to use a package declaration in each DataScript source file.
A package provides a lexical scope for types. Type names must be
unique within a package, but a given type name may be defined in more
than one package. If a type named Coordinate
is
defined in package com.acme.foo
, the type can be
globally identified by its fully qualified name
com.acme.foo.Coordinate
, which is obtained by
prefixing the type name with the name of the defining package, joined by
a dot. Another package com.acme.foo.bar
may also
define a type named Coordinate
, having the fully
qualified name com.acme.bar.Coordinate
.
By default, types from other packages are not visible in the current package, unless there are imported explicitly. The package and import syntax and semantics follow the Java example.
package map; import common.geometry.*; import common.featuretypes.*;
Import declarations only have any effect when there is a reference
to a type name not defined in the current package. If package
map
defines its own Coordinate
type, any reference to that within package map will be resolved to the
local type map.Coordinate
, even when one or more of
the imported packages also define a type named
Coordinate
.
On the other hand, if package map
references a
Coordinate
type but does not define it, the import
declarations are used to resolve that type in one of the imported
packages. In that case, the referenced type must be matched by exactly
one of the imported packages. It is obviously a semantic error if the
type name is defined in none of the packages. It is also an error if the
type name is defined in two or more of the imported packages. The order
of the import declarations does not matter.
Individual types can be imported using their fully qualified name:
import common.geometry.Geometry
;
This single import has precedence over any wildcard import. It
prevents an ambiguity with
common.featuretypes.Geometry
. However, it would be a
semantic error to have another single import of the same type name from
a different package.
Implementation note: rds 0.8 supports wildcard imports. Single imports are not yet implemented but will be added in the near future. There are currently no plans to implement references to types by their fully qualified names, as this would cause parser ambiguities with the nested type syntax, see Section 9, “Nested Types”.
Package and file names are closely related. Each package must be
located in a separate file. The above example declares a package
map
stored in a source file
map.ds
. The import declarations direct the parser to
locate and parse source files common/geometry.ds
and
common/featuretypes.ds
.
Imported files may again contain import declarations. Cyclic import relations between packages are supported but should be avoided. The DataScript parser takes care to parse each source file just once.
With its basic language features presented in the previous sections, DataScript provides a rich language for modelling binary data streams, which are intended to be parsed sequentially. Direct access to members in the stream is usually not possible, except for labels specifying the offset of a given member. Navigation between semantically related members at different positions in the stream cannot be expressed at the stream level. Member insertions or updates are not supported..
All in all, the stream model is not an adequate approach for updatable databases in the gigabyte size range with lots of internal cross-references where fast access to individual members is required. In a desktop or server environment, it would be a natural approach to model such a database as a relational database using SQL.
However, in an embedded environment with limited storage space and processing resources, a full-fledged relational schema is too heavy-weight. To have the best of both worlds, i.e. compact storage on the one hand and direct member access including updates on the other hand, one can adopt a hybrid data model: In this hybrid model, the high-level access structures are strictly relational, but most of the low-level data are stored in binary large objects (BLOBs), where the internal structure of each BLOB is modelled in DataScript.
For example, we can model a digital map database as a collection of tiles resulting from a rectangular grid where the tiles are numbered row-wise. The database has a rather trivial schema:
CREATE TABLE europe ( tileNum INT NOT NULL PRIMARY KEY, tile BLOB NOT NULL);
Accessing or updating any given tile can simply be delegated to the relational DBMS. Assuming that the tile BLOBs have a reasonable size, each tile can be decoded on the fly to access the individual members within the tile.
For seamless modelling of this hybrid approach, we decided to add relational extensions to DataScript. Some SQL concepts have been translated to DataScript, others are transparent to DataScript and can be embedded as literal strings to be passed to the SQL backend. Since it is hard to find a natural border between native DataScript and embedded SQL, the relational extensions of DataScript should be regarded as preliminary.
An SQL table type is a special case of a compound type, where
the members of the type correspond to the columns of a relational
table. Members of integer or string type translate to the
corresponding SQL column types. Members of compound or array type
correspond to BLOB columns. Members of type
uint8[n]
correspond to a CHAR(n)
type. In Relational DataScript, we can express the above example as
follows:
sql_table GeoMap { int32 tileId sql "PRIMARY KEY"; Tile tile; }; GeoMap europe; GeoMap america;
It is important to note that the GeoMap
is a
table type and not a table. A table is defined by
the instance
of type
europe
GeoMap
. Table types have no direct equivalent in
SQL. They can be used to create tables with identical structure and
column names. Each instance of an
type in DataScript
translates to an SQL table where the table name in SQL is equal to the
instance name in DataScript. A member definition may include an SQL
constraint introduced by the keyword sql_table
sql
, followed
by a literal string which is passed transparently to the SQL
engine.
Thus, the DataScript instance america
gives
rise to the following SQL table:
CREATE TABLE america ( tileNum INT NOT NULL PRIMARY KEY, tile BLOB NOT NULL);
Note: The current definition of sql_tables is not consistent with the abstract data type and value semantics of plain DataScript. Actually, a table instance is itself a composite type, represented by a set of rows, and each row item is an instance of the corresponding column type. To model this more adequately and to avoid embedded SQL for the primary key property, we are planning to change our definitions.
An SQL table is a map container type mapping (primary) keys to values: sql_map<int32, Tile> GeoMap is a container type. An instance of this type is an SQL table. A map entry is realized as a table row, corresponding to a key-value pair. Both key and value may be composed of one or more columns.
An sql_table
type differs from a sequence
type in that there is no decoder function that automatically reads all
members of a table row. The application has to build its own queries
an invoke a decoder function on each BLOB column explicitly. In the
case where an sql_table
member is an instance of a
parameterized type, the application may want to derive the parameter
values from the context (e.g. other table columns), which is not
available to the DataScript decoder. In this case, the type arguments
shall be marked with the keyword explicit
to
indicate that these values will be set explicitly be the application.
Otherwise, the decoder would complain about not being able to evaluate
the type arguments.
Tile(uint8 level, uint8 width) { ... }; sql_table TileTable { uint32 tileId; uint32 version; Tile(explicit level, explicit width) tile; };
Since an SQL table is always contained in an SQL database, we
introduce an sql_database
type in DataScript to model
databases. sql_table
instances may only be created as
members of an sql_database
.
sql_table GeoMap { // see above }; sql_database TheWorld { GeoMap europe; GeoMap america; };
Some SQL engines internally always use an integer key or rowid. If
the user-defined primary key is an integer, it can be used as row id. If
the primary key is composite, it is mapped internally to an integer
rowid. To avoid this indirection which requires additional storage space
and increases access times, we introduce SQL integer types. An
sql_integer
is a sequence type whose members are of
integer base type such that the total size of the
sql_integer
type does not exceed 64 bits. Members of
an sql_integer
may not be optional.
sql_integer TileId { uint8 levelNr; int32 tileNr; };
In this example, the value used as SQL key is (levelNr
<< 32 + (uint32) tileNr)
.
[Back] DataScript - a Specification and Scripting Language for Binary Data. Proceedings of the ACM Conference on Generative Programming and Component Engineering Proceedings (GPCE 2002), published as LNCS 2487. ACM. Pittsburgh, PA. October 2002. pp. 66-77.
[DataScript] DataScript Reference Implementation.
[rds] Relational DataScript.
[SQLite] SQLite Embedded Database.