= Bithenge Syntax =

[[PageOutline(2-3)]]

This page describes the syntax and semantics of [[Bithenge]] scripts.

== Simple example ==

{{{
transform item(scale) = struct {
    .name_len <- uint8;
    .name <- ascii <- known_length(.name_len);
    .value <- (in * scale) <- uint32le;
};

transform main = repeat {item(100)};
}}}

More complicated examples can be found in
[https://bazaar.launchpad.net/~wtachi/helenos/bithenge/files/head:/uspace/dist/src/bithenge uspace/dist/src/bithenge].
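
To make the simple example concrete, the following Python sketch mimics what `main` would produce for a byte string. It is an illustrative model only, not how Bithenge is implemented: `decode_item` and `decode_main` are hypothetical names, and the sketch assumes the input contains only complete items.

```python
def decode_item(data, offset, scale):
    """Model of item(scale); returns (node, next_offset)."""
    name_len = data[offset]                                # .name_len <- uint8
    start = offset + 1
    name = data[start:start + name_len].decode("ascii")    # ascii <- known_length(.name_len)
    start += name_len
    raw = int.from_bytes(data[start:start + 4], "little")  # uint32le
    value = raw * scale                                    # (in * scale)
    return {"name_len": name_len, "name": name, "value": value}, start + 4


def decode_main(data):
    """Model of repeat {item(100)}: one member per item, keys counting up from 0."""
    out, offset = {}, 0
    while offset < len(data):
        node, offset = decode_item(data, offset, 100)
        out[len(out)] = node
    return out


sample = b"\x02hi\x03\x00\x00\x00" + b"\x01x\x01\x00\x00\x00"
result = decode_main(sample)
print(result)
```

The second item shows the effect of the `scale` parameter: the stored integer 1 is reported as 100.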

== Tokenization ==

Bithenge scripts consist of ASCII text. When a script is read, it is first
divided into pieces known as ''tokens''. As in most programming languages, the
longest possible token is read, so `==` is always a single token rather than
two `=` tokens. Whitespace (space, form‐feed, newline, carriage return, horizontal tab,
and vertical tab) and comments (`#` until newline) are allowed between tokens;
both LF (Unix) and CRLF (DOS) newline conventions can be used.

There are several types of tokens:
 Symbols:: `&& ++ == >= // <- <= != || < > = + - * % ; : , { } ( ) [ ] .`
 Keywords:: `do else false if in partial repeat struct switch transform true while`
 Identifiers:: Any sequence of letters, digits, and underscores, beginning with a letter, that is not a keyword.
 Integer literals:: A sequence of decimal digits; negative literals are not yet possible.

Keywords and identifiers are case‐sensitive. Tokens can be at most
256 characters long.
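
The longest-match rule can be sketched with a small Python tokenizer. This is an assumption-laden model rather than the real lexer: `tokenize` is a hypothetical name, the 256-character limit is not enforced, and `,`, `[` and `]` are treated as symbols because the grammar on this page uses them.

```python
import re

SYMBOLS = ["&&", "++", "==", ">=", "//", "<-", "<=", "!=", "||",
           "<", ">", "=", "+", "-", "*", "%", ";", ":", ",",
           "{", "}", "(", ")", "[", "]", "."]
KEYWORDS = {"do", "else", "false", "if", "in", "partial", "repeat",
            "struct", "switch", "transform", "true", "while"}

# Longer symbols are tried first so that "==" wins over "=" (longest match).
_symbol_re = "|".join(re.escape(s) for s in sorted(SYMBOLS, key=len, reverse=True))
TOKEN_RE = re.compile(
    r"(?P<skip>[ \f\n\r\t\v]+|#[^\n]*)"      # whitespace or a comment
    r"|(?P<word>[A-Za-z][A-Za-z0-9_]*)"      # keyword or identifier
    r"|(?P<integer>[0-9]+)"
    r"|(?P<symbol>" + _symbol_re + ")"
)


def tokenize(text):
    """Split a script into (kind, text) pairs, dropping whitespace and comments."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise ValueError(f"unexpected character {text[pos]!r} at offset {pos}")
        pos = m.end()
        kind = m.lastgroup
        if kind == "skip":
            continue
        if kind == "word":
            kind = "keyword" if m.group() in KEYWORDS else "identifier"
        tokens.append((kind, m.group()))
    return tokens


tokens = tokenize(".x <- (in == 1); # trailing comment\n")
print(tokens)
```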

== Scripts ==

- script → *definition
- definition → "`transform`" identifier [parameters] "`=`" transform "`;`"
- parameters → "`(`" [identifier *("`,`" identifier)] "`)`"

A Bithenge script consists of a series of transform definitions. It must define
a transform named `main`, which is applied when the script is used. Transforms
may be defined with parameters, in which case corresponding arguments must be
provided when the transform is invoked.

Examples:
- `transform main = uint8;` defines an extremely simple main transform.
- `transform is_odd(val) = (val % 2 == 1);` defines a transform that checks
  whether its argument is odd.

== Transform invocation ==

- transform → identifier [arguments]
- arguments → "`(`" [ expression *("`,`" expression) ] "`)`"

A built‐in or previously defined transform can be invoked, with arguments if
necessary.

Examples:
- `is_odd(17)` is a transform that applies `is_odd` with 17 as the argument.

== Transform composition ==

- transform → transform "`<-`" transform

Two transforms can be composed into a new transform. The composed transform
first applies the right transform to the input, then applies the left
transform to the result.

Examples:
- `ascii <- zero_terminated` decodes a zero‐terminated ASCII string.
- `repeat{uint32le} <- known_length(8)` decodes two 32‐bit integers.
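
The right-to-left order is the same as ordinary function composition. A minimal Python sketch, in which `compose`, `ascii_` and `zero_terminated` are stand-ins written for this illustration and not Bithenge's own code:

```python
def compose(left, right):
    """Model of `left <- right`: apply right first, then left."""
    return lambda data: left(right(data))


def zero_terminated(blob):
    # Bytes up to, but not including, the first 00 byte.
    return blob[:blob.index(0)]


def ascii_(blob):
    return blob.decode("ascii")


decode_cstring = compose(ascii_, zero_terminated)   # ascii <- zero_terminated
print(decode_cstring(b"hi\x00"))
```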

== Expression transforms ==

- transform → "`(`" expression "`)`"

Expression transforms produce their output by calculating an expression. If
"`in`" is used in the expression, it refers to the expression transform’s
input; otherwise, the input must be an empty blob. In either case, the
parentheses are mandatory.

Examples:
- `(in + 1)` adds 1 to its input.
- `(.bytes_per_sector * .sectors_per_cluster)` calculates the number of bytes
  per cluster.

== Struct transforms ==

- transform → "`struct`" "`{`" *struct-field "`}`"
- struct-field → [ "`.`" identifier ] "`<-`" transform "`;`"

A struct transform applies each of its subtransforms in sequence to the input
blob. The result is an internal node, with the specified key used for the
result of each subtransform. If a subtransform has no corresponding key, its
result must be an internal node; the result’s keys will be merged into the
struct transform’s result.

It must be possible to determine the size of the subblob each subtransform
consumes. Bithenge currently only handles the simple cases: it knows `uint32le`
needs 4 bytes, but not that `known_length(2) <- repeat{uint32le}` needs 8
bytes.

Example:
{{{
struct {
    .min <- uint32le;
    .max <- uint32le;
    .mid <- ((.min + .max) // 2);
    <- repeat(3) {uint8};
}
}}}
One possible output is `{"min": 0, "max": 8, "mid": 4, 0: 127, 1: 126, 2: 125}`.
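
As a cross-check of that output, here is a Python sketch that decodes an 11-byte input the same way. `decode_example` is a hypothetical helper with the field sizes hard-coded; it only models the behaviour described above.

```python
def decode_example(blob):
    node = {}
    node["min"] = int.from_bytes(blob[0:4], "little")   # .min <- uint32le
    node["max"] = int.from_bytes(blob[4:8], "little")   # .max <- uint32le
    node["mid"] = (node["min"] + node["max"]) // 2      # expression field, consumes no input
    # Keyless field: the members of the repeat result are merged into the struct.
    for i in range(3):
        node[i] = blob[8 + i]                            # uint8
    return node


blob = bytes.fromhex("00000000" "08000000" "7f7e7d")
decoded = decode_example(blob)
print(decoded)
```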

== Partial transforms ==

- transform → "`partial`" [ "`(`" expression "`)`" ] "`{`" transform "`}`"

A partial transform applies its subtransform to a prefix of a blob. An
expression can be given to specify the offset within the blob at which the
subtransform is applied.

Examples:
- `partial {uint8}` decodes the first byte of the arbitrary‐length input as an
  integer.
- `partial(1) {uint8}` decodes the second byte of the arbitrary‐length input as
  an integer.
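
In Python terms, a partial transform behaves roughly like slicing before decoding. In the sketch below the `size` argument is an assumption made for the model: Bithenge works out how much input the subtransform consumes, whereas this hypothetical `partial` helper has to be told.

```python
def uint8(blob):
    if len(blob) != 1:
        raise ValueError("uint8 needs exactly 1 byte")
    return blob[0]


def partial(sub, size, offset=0):
    """Apply `sub` to `size` bytes starting at `offset`; the rest is ignored."""
    def transform(blob):
        return sub(blob[offset:offset + size])
    return transform


data = b"\x10\x20\x30"
first = partial(uint8, 1)(data)        # like partial {uint8}
second = partial(uint8, 1, 1)(data)    # like partial(1) {uint8}
print(first, second)
```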

== Conditional transforms ==

- transform → "`if`" "`(`" expression "`)`" "`{`" transform "`}`" "`else`" "`{`" transform "`}`"
- struct-field → "`if`" "`(`" expression "`)`" "`{`" *struct-field "`}`" [ "`else`" "`{`" *struct-field "`}`" ]

An if transform applies its first subtransform if the expression evaluates to
true, and the second subtransform if the expression evaluates to false. A
second form can be used in struct transforms:
`struct { ... if (...) {...} ... }` is equivalent to
`struct { ... <- if (...) { struct { ... } } else { struct { } }; ... }`.

- transform → "`switch`" "`(`" expression "`)`" "`{`" *( (expression/"`else`") "`:`" transform "`;`" ) "`}`"
- struct-field → "`switch`" "`(`" expression "`)`" "`{`" *( (expression/"`else`") "`:`" "`{`" *struct-field "`}`" "`;`" ) "`}`"

A switch transform compares the first expression to each of the case
expressions in turn, stopping when they compare equal and using that
subtransform. The last case can be "`else`", which will always be used if none
of the previous cases matched. The second form can be used in struct
transforms, much like the second form of "`if`".

Examples:
- `if (little_endian) { uint32le } else { uint32be }` decodes a little‐ or
  big‐endian integer, depending on `little_endian`.
- `struct { switch (.type) { 0: {.x <- uint8; .y <- uint8;}; 1: {.val <- uint16le;}; } }`
  decodes two 8‐bit integers or one 16‐bit integer depending on the value of
  `.type`.
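
The case-matching order of a switch can be modelled in a few lines of Python. Everything named here (`ELSE`, `switch`, `decode_by_type`) is hypothetical; the sketch assumes a little-endian 16-bit value, adds an `else` case that the example above does not have, and assumes that a switch with no matching case fails.

```python
ELSE = object()   # stands in for the `else` case label


def switch(value, cases):
    """Try (label, result) pairs in order; ELSE matches anything."""
    for label, result in cases:
        if label is ELSE or label == value:
            return result
    raise ValueError(f"no case matched {value!r}")


def decode_by_type(type_, blob):
    decoder = switch(type_, [
        (0, lambda b: {"x": b[0], "y": b[1]}),
        (1, lambda b: {"val": int.from_bytes(b[:2], "little")}),
        (ELSE, lambda b: {}),
    ])
    return decoder(blob)


print(decode_by_type(0, b"\x01\x02"), decode_by_type(1, b"\x01\x02"))
```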

== Repetition transforms ==

- transform → "`repeat`" [ "`(`" expression "`)`" ] "`{`" transform "`}`"

A repeat transform applies its subtransform repeatedly to sequential subblobs
of the input. The output is an internal node with a member for each repetition,
with keys starting at `0`. If an expression is given, it determines the number
of times to repeat; otherwise, the subtransform is applied until it fails or
there is no input remaining.

- transform → "`do`" "`{`" transform "`}`" "`while`" "`(`" expression "`)`"

A do/while transform applies its subtransform repeatedly until the expression
evaluates to `false`. The output has the same format as repeat transform
output. Scope member expressions should be used in the expression to access the
output of the subtransform and determine when to stop.

Examples:
- `repeat{uint8}` decodes each byte as an integer.
- `repeat(256) {uint32le}` decodes an array of 256 integers.
- `do { struct { .type <- uint8; .value <- uint32le; } } while (.type != 0)`
  decodes types and values, stopping after a type of 0 is seen.
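
The two loop forms differ mainly in when they stop. The Python sketch below models both for fixed-size subtransforms; the function names are made up for this illustration, and raising an error when a counted repeat runs out of input is an assumption.

```python
def repeat_uint32le(blob, count=None):
    """Model of repeat {uint32le} (count=None) or repeat(count) {uint32le}."""
    out, offset = {}, 0
    while len(out) != count:
        chunk = blob[offset:offset + 4]
        if len(chunk) < 4:
            if count is None:
                break            # no count given: stop when the input runs out
            raise ValueError("not enough input for the requested count")
        out[len(out)] = int.from_bytes(chunk, "little")
        offset += 4
    return out


def do_while_type_value(blob):
    """Model of do { struct { .type <- uint8; .value <- uint32le; } } while (.type != 0)."""
    out, offset = {}, 0
    while True:
        item = {"type": blob[offset],
                "value": int.from_bytes(blob[offset + 1:offset + 5], "little")}
        out[len(out)] = item     # the item is kept even if it ends the loop
        offset += 5
        if item["type"] == 0:    # condition is checked after each item
            break
    return out


numbers = repeat_uint32le(bytes.fromhex("0100000002000000"))
records = do_while_type_value(bytes([1, 10, 0, 0, 0,   0, 20, 0, 0, 0,   7, 30, 0, 0, 0]))
print(numbers, records)
```

Note that the record with type 0 is part of the output, and the trailing record after it is never decoded.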

== Expressions ==

- expression → "`true`"
- expression → "`false`"
- expression → integer

Expressions can be boolean or integer literals.

- expression → identifier

Expressions can use parameters.

- expression → "`in`"

In an expression transform, "`in`" refers to the transform’s input. Otherwise,
it refers to the input of the whole transform in the definition.

- expression → "`.`" identifier

A scope member expression looks in each containing struct transform for a
member with the given key. It starts from the innermost struct transform and
searches as far as the whole transform being defined.

- expression → expression "`.`" identifier
- expression → expression "`[`" expression "`]`"

An expression can look for a member of another expression with a given key.

- expression → expression "`[`" expression "`:`" "`]`"
- expression → expression "`[`" expression "`:`" expression "`]`"
- expression → expression "`[`" expression "`,`" expression "`]`"

An expression can create a subblob of another expression. The first expression
specifies the blob and the second expression specifies the start offset within
the blob. If a comma is used, the third expression specifies the length of the
subblob; otherwise it specifies the end offset. If no third expression is
given, the subblob extends to the end of the blob.
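
The three subblob forms map onto Python slices as sketched below. `subblob` is a hypothetical helper, and the sketch assumes non-negative offsets that lie within the blob.

```python
def subblob(blob, start, end=None, length=None):
    if length is not None:
        return blob[start:start + length]   # blob[start, length]
    if end is not None:
        return blob[start:end]              # blob[start:end]
    return blob[start:]                     # blob[start:]


data = b"abcdef"
print(subblob(data, 2), subblob(data, 2, end=4), subblob(data, 2, length=3))
```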

- expression → "`(`" expression "`)`"
- expression → expression operator expression

Expressions can use binary operators. Operators with a higher precedence
number bind more tightly. The operators and their precedences are as follows:

||= Operator =||= Precedence =||= Associativity =||= Description                 =||
||"`.`"       || 5            || Left‐to‐right   || Member (see above)            ||
||"`[]`"      || 5            || Left‐to‐right   || Member or subblob (see above) ||
||"`*`"       || 4            || Left‐to‐right   || Integer multiplication        ||
||"`//`"      || 4            || Left‐to‐right   || [https://en.wikipedia.org/wiki/Modulo_operation Floored/euclidean] integer division; divisor must be positive ||
||"`%`"       || 4            || Left‐to‐right   || [https://en.wikipedia.org/wiki/Modulo_operation Floored/euclidean] modulo operation; divisor must be positive ||
||"`+`"       || 3            || Left‐to‐right   || Integer addition              ||
||"`-`"       || 3            || Left‐to‐right   || Integer subtraction           ||
||"`++`"      || 3            || Left‐to‐right   || Blob concatenation            ||
||"`<`"       || 2            || Left‐to‐right   || Integer less‐than             ||
||"`<=`"      || 2            || Left‐to‐right   || Integer less‐than‐or‐equal    ||
||"`>`"       || 2            || Left‐to‐right   || Integer greater‐than          ||
||"`>=`"      || 2            || Left‐to‐right   || Integer greater‐than‐or‐equal ||
||"`==`"      || 1            || Left‐to‐right   || Equal‐to (not supported for internal nodes) ||
||"`!=`"      || 1            || Left‐to‐right   || Unequal‐to (not supported for internal nodes) ||
||"`&&`"      || 0            || Left‐to‐right   || Logical and ||
||"`||`"      || 0            || Left‐to‐right   || Logical or ||
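
Python's `//` and `%` are also floored, so they can be used to explore the arithmetic rows of the table. The sketch covers only the arithmetic operators; Python's comparison and logical operators follow different precedence rules than the ones listed above.

```python
# Floored division rounds toward negative infinity, so with a positive
# divisor the remainder is never negative.
quotient = -7 // 2        # -4, not -3
remainder = -7 % 2        # 1, not -1


def is_odd(val):
    # Mirrors `transform is_odd(val) = (val % 2 == 1);`
    return val % 2 == 1


# `//` binds more tightly than `+`, so the parentheses matter here.
lo, hi = 2, 8
mid = (lo + hi) // 2      # 5
not_mid = lo + hi // 2    # 2 + 4 = 6

# Operators of equal precedence group left to right.
chained = 100 // 10 // 2  # (100 // 10) // 2 = 5

print(quotient, remainder, is_odd(-7), mid, not_mid, chained)
```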

== Built‐in transforms ==

These transforms are implemented in C and included with Bithenge. Note that
precise names are preferred; scripts can define shorter aliases if necessary.

||= name =||= input =||= output =||= description =||= example =||
||ascii             ||byte blob node   ||string        ||decodes some bytes as ASCII characters ||  `hex:6869` becomes `"hi"` ||
||bit               ||1‐bit blob node  ||boolean       ||decodes a single bit || `1` becomes `true` ||
||bits_be           ||byte blob node   ||bit blob node ||decodes bytes as bits, starting with the most‐significant bit || `hex:0f` becomes `bit:00001111` ||
||bits_le           ||byte blob node   ||bit blob node ||decodes bytes as bits, starting with the least‐significant bit || `hex:0f` becomes `bit:11110000` ||
||known_length(len) ||blob node        ||blob node     ||requires the input to have a known length || ||
||nonzero_boolean   ||integer          ||boolean       ||decodes a boolean where nonzero values are true || `0` becomes `false` ||
||uint8             ||1‐byte blob node ||integer node  ||decodes a 1‐byte unsigned integer ||  `hex:11` becomes `17` ||
||uint16be          ||2‐byte blob node ||integer node  ||decodes a 2‐byte big‐endian unsigned integer ||  `hex:0201` becomes `513` ||
||uint16le          ||2‐byte blob node ||integer node  ||decodes a 2‐byte little‐endian unsigned integer ||  `hex:0101` becomes `257` ||
||uint32be          ||4‐byte blob node ||integer node  ||decodes a 4‐byte big‐endian unsigned integer ||  `hex:00000201` becomes `513` ||
||uint32le          ||4‐byte blob node ||integer node  ||decodes a 4‐byte little‐endian unsigned integer ||  `hex:01010000` becomes `257` ||
||uint64be          ||8‐byte blob node ||integer node  ||decodes an 8‐byte big‐endian unsigned integer ||  `hex:0000000000000201` becomes `513` ||
||uint64le          ||8‐byte blob node ||integer node  ||decodes an 8‐byte little‐endian unsigned integer ||  `hex:0101000000000000` becomes `257` ||
||uint_be(len)      ||bit blob node    ||integer node  ||decodes bits as an unsigned integer, starting with the most‐significant bit || ||
||uint_le(len)      ||bit blob node    ||integer node  ||decodes bits as an unsigned integer, starting with the least‐significant bit || ||
||zero_terminated   ||byte blob node   ||byte blob node||takes bytes up until the first `00` ||  `hex:7f0400` becomes `hex:7f04` ||
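
The example column can be reproduced with a few Python one-liners. These are independent re-implementations written for this page, not the C code shipped with Bithenge, and bit blobs are represented as strings of `0` and `1` for readability.

```python
def uint(blob, byteorder):
    """Model of the fixed-width unsigned integer transforms."""
    return int.from_bytes(blob, byteorder)


def bits_be(blob):
    # Most-significant bit of each byte first.
    return "".join(format(byte, "08b") for byte in blob)


def bits_le(blob):
    # Least-significant bit of each byte first.
    return "".join(format(byte, "08b")[::-1] for byte in blob)


def zero_terminated(blob):
    # Bytes up to, but not including, the first 00 byte.
    return blob[:blob.index(0)]


h = bytes.fromhex
print(h("6869").decode("ascii"))               # ascii
print(bits_be(h("0f")), bits_le(h("0f")))      # bits_be, bits_le
print(uint(h("0201"), "big"))                  # uint16be
print(uint(h("01010000"), "little"))           # uint32le
print(zero_terminated(h("7f0400")).hex())      # zero_terminated
```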