= Structured Binary Data = [[PageOutline(2-3)]] As part of [wiki:GSOC Google Summer of Code 2012], Bithenge is being created to address #317. This page describes the project’s design and implementation. The code is at [https://code.launchpad.net/~wtachi/helenos/bithenge lp:~wtachi/helenos/bithenge] and periodic updates are posted to [http://lists.modry.cz/cgi-bin/listinfo/helenos-devel HelenOS-devel]. == Overview == Exploring and working with structured binary data is necessary in many different situations in a project like HelenOS. For instance, when implementing a file format or filesystem, it is first necessary to explore preexisting files and disks and learn the low‐level details of the format. Debugging compiled programs, working with core dumps, and exploring network protocols also require some way of interpreting binary data. The most basic tool for exploring binary data is the hex editor. Using a hex editor is inefficient and unpleasant because it requires manual calculation of lengths and offsets while constantly referring back to the data format. General‐purpose scripting languages can be used instead, so a structure can be defined once and decoded as often as necessary. However, even with useful tools like Python’s struct module, the programmer must specify how to read the input data, calculate lengths and offsets, and provide useful output, so there’s much more work involved than simply specifying the format of the data. This extra code will probably be rewritten every time a new script is made, due to slightly differing requirements. The Bithenge project involves creating a powerful library and tools that will make working with structured binary data faster and easier. It will consist of: * A core library that manages structured data and provides basic building blocks for binary data interpretation. * Data providers to access various sources of raw binary data. * Format providers, which can load and save complex format specifications. In particular, there will be a domain‐specific language for format specifications. * Clients, programs which use the library to work with binary data. For instance, there will be an interactive browser. The initial goals for the project are an interactive browser for filesystem structures as well as a debugger backend that can interpret core dumps and task memory. == Requirements == * Work in HelenOS—this means the code must be in C and/or an easily ported language like Lua. * View on different layers. For instance, when viewing a FAT directory entry, it should be possible to switch between viewing the formatted date and time, the integers, and the original bytes. * Check whether data is valid; handle broken data reasonably well. * Parse pieces of the data lazily; don’t try to read everything at once. * Work in both directions (parsing and building) without requiring too much extra effort when specifying the format. * Support full modifications. Ideally, allow creation of a whole filesystem from scratch. == Trees == Bithenge represents all data in the form of a data structure called a “tree,” similar to the data structure represented by JSON. A tree consists of a boolean node, integer node, string node, or blob node, or an internal node with children. A boolean node holds a boolean value, an integer node holds a signed integer, and a string holds a Unicode string. A blob node represents an arbitrary sequence of raw bytes or bits. Blob nodes are polymorphic, allowing any source of raw binary data to be used. Bithenge includes blob node implementations for in‐memory buffers, files, and block devices. An implementation has also been written that reads another task’s virtual memory, but it hasn’t been committed because it’s unclear whether it will be useful in its current form. An internal node has an arbitrary number of children, each with a unique key. The key can be any node other than an internal node. Arrays can be represented by internal nodes with integer keys starting at 0. The tree node can provide children in an arbitrary order; the order will be used when displaying the tree, but should have no semantic significance. Internal nodes are polymorphic and can delay creation of child nodes until necessary, so keeping the whole tree in memory can be avoided. Internal nodes are currently responsible for freeing their own children. In the future, it should be possible for there to be multiple references to the same node, but it isn’t clear whether this should be implemented with symbolic links, an acyclic graph with reference counting, or a full graph. Note that all interpreted data is represented in Bithenge with nodes. Therefore, the word “blob” usually refers to a blob node, and so on. A decoded tree for a FAT filesystem might look something like this: {{{ ○───bits─▶16 │ ├───fat──▶○ │ ├───0───▶0xfff0 │ ├───1───▶0xffff │ └───2───▶0x0000 │ └───root──▶○ ├───0───▶○ │ ├───name───▶README.TXT │ └───size───▶0x1351 │ └───1───▶○ ├───name───▶KERNEL.ELF └───size───▶0x38e9a2 }}} == Transforms == A transform is a function from a tree to a tree. One example is `uint32le`, which takes a 4‐byte blob node as the input tree and provides an integer node as the output tree. Another example would be `FAT16_filesystem`, a transform that takes a byte blob node as the input tree and provides a complex output tree with various decoded information about the filesystem. Some transforms, like `uint32le`, are built in to Bithenge; more complicated transforms can be loaded from a script file. Transforms are represented in Bithenge with a polymorphic object. The primary method is `apply`, which applies a transform to an input tree and creates an output tree. When a transform takes a blob node as input, it is sometimes necessary to determine the prefix of a given blob that can be used as input to the transform; the method `prefix_length` can be used for this. == Built‐in transforms == These transforms are implemented in C and included with Bithenge. Note that precise names are preferred; scripts can define shorter aliases if necessary. ||= name =||= input =||= output =||= description =||= example =|| ||ascii ||byte blob node ||string ||decodes some bytes as ASCII characters || `hex:6869` becomes `"hi"` || ||bit ||1‐bit blob node ||boolean ||decodes a single bit || `1` becomes `true` || ||bits_be ||byte blob node ||bit blob node ||decodes bytes as bits, starting with the most‐significant bit || `hex:0f` becomes `bit:00001111` || ||bits_le ||byte blob node ||bit blob node ||decodes bytes as bits, starting with the least‐significant bit || `hex:0f` becomes `bit:11110000` || ||known_length(len) ||blob node ||blob node ||requires the input to have a known length || || ||nonzero_boolean ||integer ||boolean ||decodes a boolean where nonzero values are true || `0` becomes `false` || ||uint8 ||1‐byte blob node ||integer node ||decodes a 1‐byte unsigned integer || `hex:11` becomes `17` || ||uint16be ||2‐byte blob node ||integer node ||decodes a 2‐byte big‐endian unsigned integer || `hex:0201` becomes `513` || ||uint16le ||2‐byte blob node ||integer node ||decodes a 2‐byte little‐endian unsigned integer || `hex:0101` becomes `257` || ||uint32be ||4‐byte blob node ||integer node ||decodes a 4‐byte big‐endian unsigned integer || `hex:00000201` becomes `513` || ||uint32le ||4‐byte blob node ||integer node ||decodes a 4‐byte little‐endian unsigned integer || `hex:01010000` becomes `257` || ||uint64be ||8‐byte blob node ||integer node ||decodes a 8‐byte big‐endian unsigned integer || `hex:0000000000000201` becomes `513` || ||uint64le ||8‐byte blob node ||integer node ||decodes a 8‐byte little‐endian unsigned integer || `hex:0101000000000000` becomes `257` || ||uint_be(len) ||bit blob node ||integer node ||decodes bits as an unsigned integer, starting with the most‐significant bit || || ||uint_le(len) ||bit blob node ||integer node ||decodes bits as an unsigned integer, starting with the least‐significant bit || || ||zero_terminated ||byte blob node ||byte blob node||takes bytes up until the first `00` || `hex:7f0400` becomes `hex:7f04` || == Basic syntax == Script files are used to define complicated transforms. Transforms (including built‐in transforms) can be referenced by name: `uint32le`. Transforms can be given a new name: `transform u32 = uint32le;` defines a shorter alias for `uint32le`. Transforms can be composed to create a new transform that applies them in order. The transform `ascii <- zero_terminated` first removes the 0x00 from the end of the blob, then decodes it as ascii. Note that the order of composition is consistent with function composition and nested application in mathematics, and also consistent with the general idea that data moves from right to left as it is decoded. Transforms can have parameters that affect how they decode data: {{{ transform u16(little_endian) = if (little_endian) { uint16le } else { uint16be }; }}} When such a transform is used, expressions must be given to calculate its parameters. Currently, the only possible expressions are parameters given to the current transform, boolean and integer literals, and previously decoded fields in the current `struct` or an outer `struct`: {{{ transform item(little_endian) = struct { .len <- u16(little_endian); .text <- ascii <- known_length(.len); .data <- known_length(8); }; }}} == Structs == Structs are used when a blob contains multiple data fields in sequence. A struct transform applies each subtransform to sequential parts of the blob and combines the results to create an internal node. The result of each subtransform is either assigned a key or has its keys and values merged into the final tree. Each subtransform must support `prefix_length`, so the lengths and positions of the data fields can be determined. === Example === {{{ transform point = struct { .x <- uint32le; .y <- uint32le; }; transform labeled_point = struct { .id <- uint32le; .label <- ascii <- zero_terminated; <- point; }; }}} If `labeled_point` is applied to `hex:0600000041000300000008000000`, the result is `{"id": 6, "label": "A", "x": 3, "y": 8}`. == Flow Control == Boolean conditions can use `if (expression) { transform-if-true } else { transform-if-false }`, for example: `if (little_endian) { uint32le } else { uint32be }`. There is also syntax for switches: `switch (expression) { expression: transform; ... }`. Both of these have syntactic sugar for use in `struct`s; in a `struct`, `if (expression) { fields... }` is equivalent to `<- if (expression) { struct { fields... } };`. Three kinds of repetition are supported. `repeat(expression) {transform}` applies the transform a given number of times. `repeat {transform}` applies the transform until it fails or the end of the data is reached. `do {transform} while(expression)` applies the transform until the expression, evaluated within the result of the transform, is false; this can be used for things like `do { struct { .keep_going <- nonzero_boolean <- uint8; .val <- uint8; } } while(.keep_going)`. For each type of repetition, the result is an internal node with sequential keys starting at `0`. == Using Bithenge == The Bithenge source code is in `uspace/app/bithenge` and is built along with HelenOS. It can be built for Linux instead with `make -f Makefile.linux`. The program can be run with `bithenge