Changes between Initial Version and Version 1 of Bithenge/Syntax


Ignore:
Timestamp:
2012-08-20T18:25:20Z (12 years ago)
Author:
Sean Bartell
Comment:

Legend:

Unmodified
Added
Removed
Modified
  • Bithenge/Syntax

    v1 v1  
     1= Bithenge Syntax =
     2
     3[[PageOutline(2-3)]]
     4
     5This page describes the syntax and semantics of [[Bithenge]] scripts.
     6
     7== Simple example ==
     8
     9{{{
     10transform item(scale) = struct {
     11    .name_len <- uint8;
     12    .name <- ascii <- known_length(.name_len);
     13    .value <- (in * scale) <- uint32le;
     14};
     15
     16transform main = repeat {item(100)};
     17}}}
     18
     19More complicated examples can be found in
     20[https://bazaar.launchpad.net/~wtachi/helenos/bithenge/files/head:/uspace/dist/src/bithenge uspace/dist/src/bithenge].
     21
     22== Tokenization ==
     23
     24Bithenge scripts consist of ASCII text. When a script is read, it is first
     25divided into pieces known as ''tokens''. As in most programming languages, the
     26longest possible token is read, so `==` is always a single token rather than
     27two. Whitespace (space, form‐feed, newline, carriage return, horizontal tab,
     28and vertical tab) and comments (`#` until newline) are allowed between tokens;
     29both LF (Unix) and CRLF (DOS) newline conventions can be used.
     30
     31There are several types of tokens:
     32 Symbols:: `&& ++ == >= // <- <= != || < > = + - * % ; : { } ( ) .`
     33 Keywords:: `do else false if in partial repeat struct switch transform true while`
     34 Identifiers:: Any sequence of letters, digits, and underscores, beginning with a letter, that is not a keyword.
     35 Integer literals:: A sequence of decimal digits; negative literals are not yet possible.
     36
     37Keywords and identifiers are case‐sensitive. Tokens can be at most
     38256 characters long.
     39
     40== Scripts ==
     41
     42- script → *definition
     43- definition → "`transform`" identifier [parameters] "`=`" transform "`;`"
     44- parameters → "`(`" [identifier *("`,`" identifier)] "`)`"
     45
     46A Bithenge script consists of a series of transform definitions. It must define
     47a transform named `main`, which is applied when the script is used. Transforms
     48may be defined with parameters, in which case corresponding arguments must be
     49provided when the transform is invoked.
     50
     51Examples:
     52- `transform main = uint8;` defines an extremely simple main transform.
     53- `transform is_odd(val) = (val % 2 == 1);` defines a transform that checks
     54  whether its argument is odd.
     55
     56== Transform invocation ==
     57
     58- transform → identifier [arguments]
     59- arguments → "`(`" [ expression *("`,`" expression) ] "`)`"
     60
     61A built‐in or previously defined transform can be invoked, with arguments if
     62necessary.
     63
     64Examples:
     65- `is_odd(17)` is a transform that applies `is_odd` with 17 as the argument.
     66
     67== Transform composition ==
     68
     69- transform → transform "`<-`" transform
     70
     71A transform can be created by composing two existing transforms. It will work
     72by first applying the right transform to the input, then applying the left
     73transform to the result.
     74
     75Examples:
     76- `ascii <- zero_terminated` decodes a zero‐terminated ASCII string.
     77- `repeat{uint32le} <- known_length(8)` decodes two 32‐bit integers.
     78
     79== Expression transforms ==
     80
     81- transform → "`(`" expression "`)`"
     82
     83Expression transforms produce their output by calculating an expression. If
     84"`in`" is used in the expression, it refers to the expression transform’s
     85input; otherwise, the input must be an empty blob. In either case, the
     86parentheses are mandatory.
     87
     88Examples:
     89- `(in + 1)` adds 1 to its input.
     90- `(.bytes_per_sector * .sectors_per_cluster)` calculates the number of bytes
     91  per cluster.
     92
     93== Struct transforms ==
     94
     95- transform → "`struct`" "`{`" *struct-field "`}`"
     96- struct-field → [ "`.`" identifier ] "`<-`" transform "`;`"
     97
     98A struct transform applies each of its subtransforms in sequence to the input
     99blob. The result is an internal node, with the specified key used for the
     100result of each subtransform. If a subtransform has no corresponding key, its
     101result must be an internal node; the result’s keys will be merged into the
     102struct transform’s result.
     103
     104It must be possible to determine the size of the subblob each subtransform
     105consumes. Bithenge currently only handles the simple cases: it knows `uint32le`
     106needs 4 bytes, but not that `known_length(2) <- repeat{uint32le}` needs 8
     107bytes.
     108
     109Example:
     110{{{
     111struct {
     112    .min <- uint32le;
     113    .max <- uint32le;
     114    .mid <- ((.min + .max) // 2);
     115    <- repeat(3) {uint8};
     116}
     117}}}
     118One possible output is `{"min": 0, "max": 8, "mid": 4, 0: 127, 1: 126, 2: 125}`.
     119
     120== Partial transforms ==
     121
     122- transform → "`partial`" [ "`(`" expression "`)`" ] "`{`" transform "`}`"
     123
     124A partial transform applies its subtransform to a prefix of a blob. An
     125expression can be given to specify the offset within the blob at which the
     126subtransform is applied.
     127
     128Examples:
     129- `partial {uint8}` decodes the first byte of the arbitrary‐length input as an
     130  integer.
     131- `partial(1) {uint8}` decodes the second byte of the arbitrary‐length input as
     132  an integer.
     133
     134== Conditional transforms ==
     135
     136- transform → "`if`" "`(`" expression "`)`" "`{`" transform "`}`" "`else`" "`{`" transform "`}`"
     137- struct-field → "`if`" "`(`" expression "`)`" "`{`" *struct-field "`}`" [ "`else`" "`{`" *struct-field "`}`" ]
     138
     139An if transform applies its first subtransform if the expression evaluates to
     140true, and the second subtransform if the expression evaluates to false. A
     141second form can be used in struct transforms:
     142`struct { ... if (...) {...} ... }` is equivalent to
     143`struct { ... <- if (...) { struct { ... } } else { struct { } }; ... }`.
     144
     145- transform → "`switch`" "`(`" expression "`)`" "`{`" *( (expression/"`else`") "`:`" transform "`;`" ) "`}`"
     146- struct-field → "`switch`" "`(`" expression "`)`" "`{`" *( (expression/"`else`") "`:`" "`{`" *struct-field "`}`" "`;`" ) "`}`"
     147
     148A switch transform compares the first expression to each of the case
     149expressions in turn, stopping when they compare equal and using that
     150subtransform. The last case can be "`else`", which will always be used if none
     151of the previous cases matched. The second form can be used in struct
     152transforms, much like the second form of "`if`".
     153
     154Examples:
     155- `if (little_endian) { uint32le } else { uint32be }` decodes a little‐ or
     156  big‐endian integer, depending on `little_endian`.
     157- `struct { switch (.type) { 0: {.x <- uint8; .y <- uint8;}; 1: {.val <- uint16;}; } }`
     158  decodes two 8‐bit integers or one 16‐bit integer depending on the value of
     159  `.type`.
     160
     161== Repetition transforms ==
     162
     163- transform → "`repeat`" [ "`(`" expression "`)`" ] "`{`" transform "`}`"
     164
     165A repeat transform applies its subtransform repeatedly to sequential subblobs
     166of the input. The output is an internal node with a member for each repetition,
     167with keys starting at `0`. If an expression is given, it determines the number
     168of times to repeat; otherwise, the subtransform is applied until it fails or
     169there is no input remaining.
     170
     171- transform → "`do`" "`{`" transform "`}`" "`while`" "`(`" expression "`)`"
     172
     173A do/while transform applies its subtransform repeatedly until the expression
     174evaluates to `false`. The output has the same format as repeat transform
     175output. Scope member expressions should be used in the expression to access the
     176output of the subtransform and determine when to stop.
     177
     178Examples:
     179- `repeat{uint8}` decodes each byte as an integer.
     180- `repeat(256) {uint32le}` decodes an array of 256 integers.
     181- `do { struct { .type <- uint8; .value <- uint32le; } } while (.type != 0)`
     182  decodes types and values, stopping after a type of 0 is seen.
     183
     184== Expressions ==
     185
     186- expression → "`true`"
     187- expression → "`false`"
     188- expression → integer
     189
     190Expressions can be boolean or integer literals.
     191
     192- expression → identifier
     193
     194Expressions can use parameters.
     195
     196- expression → "`in`"
     197
     198In an expression transform, "`in`" refers to the transform’s input. Otherwise,
     199it refers to the input of the whole transform in the definition.
     200
     201- expression → "`.`" identifier
     202
     203A scope member expression looks in each containing struct transform for a
     204member with the given key. It starts from the innermost struct transform and
     205searches as far as the whole transform being defined.
     206
     207- expression → expression "`.`" identifier
     208- expression → expression "`[`" expression "`]`"
     209
     210An expression can look for a member of another expression with a given key.
     211
     212- expression → expression "`[`" expression "`:`" "`]`"
     213- expression → expression "`[`" expression "`:`" expression "`]`"
     214- expression → expression "`[`" expression "`,`" expression "`]`"
     215
     216An expression can create a subblob of another expression. The first expression
     217specifies the blob and the second expression specifies the start offset within
     218the blob. If a comma is used, the third expression specifies the length of the
     219subblob; otherwise it specifies the end offset. If no third expression is
     220given, the subblob extends to the end of the blob.
     221
     222- expression → "`(`" expression "`)`"
     223- expression → expression operator expression
     224
     225Expressions can, of course, use binary operators. The operators and their
     226precedences are as follows:
     227
     228||= Operator =||= Precedence =||= Associativity =||= Description                 =||
     229|| "`.`"      || 5            || Left‐to‐right   || Member (see above)            ||
     230|| "`[]`"     || 5            || Left‐to‐right   || Member or subblob (see above) ||
     231|| "`*`"      || 4            || Left‐to‐right   || Integer multiplication        ||
     232|| "`//`"     || 4            || Left‐to‐right   || [https://en.wikipedia.org/wiki/Modulo_operation Floored/euclidean] integer division; divisor must be positive ||
     233|| "`%`"      || 4            || Left‐to‐right   || [https://en.wikipedia.org/wiki/Modulo_operation Floored/euclidean] modulo operation; divisor must be positive ||
     234|| "`+`"      || 3            || Left‐to‐right   || Integer addition              ||
     235|| "`-`"      || 3            || Left‐to‐right   || Integer subtraction           ||
     236|| "`++`"     || 3            || Left‐to‐right   || Blob concatenation            ||
     237|| "`<`"      || 2            || Left‐to‐right   || Integer less‐than             ||
     238|| "`<=`"     || 2            || Left‐to‐right   || Integer less‐than‐or‐equal    ||
     239|| "`>`"      || 2            || Left‐to‐right   || Integer greater‐than          ||
     240|| "`>=`"     || 2            || Left‐to‐right   || Integer greater‐than‐or‐equal ||
     241|| "`==`"     || 1            || Left‐to‐right   || Equal‐to (not supported for internal nodes) ||
     242|| "`!=`"     || 1            || Left‐to‐right   || Unequal‐to (not supported for internal nodes) ||
     243|| "`&&`"     || 0            || Left‐to‐right   || Logical or ||
     244|| "`||`"     || 0            || Left‐to‐right   || Logical and ||
     245
     246== Built‐in transforms ==
     247
     248These transforms are implemented in C and included with Bithenge. Note that
     249precise names are preferred; scripts can define shorter aliases if necessary.
     250
     251||= name =||= input =||= output =||= description =||= example =||
     252||ascii             ||byte blob node   ||string        ||decodes some bytes as ASCII characters ||  `hex:6869` becomes `"hi"` ||
     253||bit               ||1‐bit blob node  ||boolean       ||decodes a single bit || `1` becomes `true` ||
     254||bits_be           ||byte blob node   ||bit blob node ||decodes bytes as bits, starting with the most‐significant bit || `hex:0f` becomes `bit:00001111` ||
     255||bits_le           ||byte blob node   ||bit blob node ||decodes bytes as bits, starting with the least‐significant bit || `hex:0f` becomes `bit:11110000` ||
     256||known_length(len) ||blob node        ||blob node     ||requires the input to have a known length || ||
     257||nonzero_boolean   ||integer          ||boolean       ||decodes a boolean where nonzero values are true || `0` becomes `false` ||
     258||uint8             ||1‐byte blob node ||integer node  ||decodes a 1‐byte unsigned integer ||  `hex:11` becomes `17` ||
     259||uint16be          ||2‐byte blob node ||integer node  ||decodes a 2‐byte big‐endian unsigned integer ||  `hex:0201` becomes `513` ||
     260||uint16le          ||2‐byte blob node ||integer node  ||decodes a 2‐byte little‐endian unsigned integer ||  `hex:0101` becomes `257` ||
     261||uint32be          ||4‐byte blob node ||integer node  ||decodes a 4‐byte big‐endian unsigned integer ||  `hex:00000201` becomes `513` ||
     262||uint32le          ||4‐byte blob node ||integer node  ||decodes a 4‐byte little‐endian unsigned integer ||  `hex:01010000` becomes `257` ||
     263||uint64be          ||8‐byte blob node ||integer node  ||decodes a 8‐byte big‐endian unsigned integer ||  `hex:0000000000000201` becomes `513` ||
     264||uint64le          ||8‐byte blob node ||integer node  ||decodes a 8‐byte little‐endian unsigned integer ||  `hex:0101000000000000` becomes `257` ||
     265||uint_be(len)      ||bit blob node    ||integer node  ||decodes bits as an unsigned integer, starting with the most‐significant bit || ||
     266||uint_le(len)      ||bit blob node    ||integer node  ||decodes bits as an unsigned integer, starting with the least‐significant bit || ||
     267||zero_terminated   ||byte blob node   ||byte blob node||takes bytes up until the first `00` ||  `hex:7f0400` becomes `hex:7f04` ||