Fork us on GitHub Follow us on Facebook Follow us on Twitter

Changes between Version 22 and Version 23 of StructuredBinaryData

2012-08-24T06:49:33Z (9 years ago)
Sean Bartell

replace with link to Bithenge


  • StructuredBinaryData

    v22 v23  
    11= Structured Binary Data =
    3 [[PageOutline(2-3)]]
    5 As part of [wiki:GSOC Google Summer of Code 2012], Bithenge is being created to
    6 address #317. This page describes the project’s design and implementation.
    7 The code is at
    8 [ lp:~wtachi/helenos/bithenge]
    9 and periodic updates are posted to
    10 [ HelenOS-devel].
    12 == Overview ==
    14 Exploring and working with structured binary data is necessary in many
    15 different situations in a project like HelenOS. For instance, when implementing
    16 a file format or filesystem, it is first necessary to explore preexisting files
    17 and disks and learn the low‐level details of the format. Debugging compiled
    18 programs, working with core dumps, and exploring network protocols also require
    19 some way of interpreting binary data.
    21 The most basic tool for exploring binary data is the hex editor. Using a hex
    22 editor is inefficient and unpleasant because it requires manual calculation of
    23 lengths and offsets while constantly referring back to the data format.
    24 General‐purpose scripting languages can be used instead, so a structure can be
    25 defined once and decoded as often as necessary. However, even with useful tools
    26 like Python’s struct module, the programmer must specify how to read the input
    27 data, calculate lengths and offsets, and provide useful output, so there’s much
    28 more work involved than simply specifying the format of the data. This extra
    29 code will probably be rewritten every time a new script is made, due to
    30 slightly differing requirements.
    32 The Bithenge project involves creating a powerful library and tools that will
    33 make working with structured binary data faster and easier. It will consist of:
    35 * A core library that manages structured data and provides basic building
    36   blocks for binary data interpretation.
    37 * Data providers to access various sources of raw binary data.
    38 * Format providers, which can load and save complex format specifications. In
    39   particular, there will be a domain‐specific language for format
    40   specifications.
    41 * Clients, programs which use the library to work with binary data. For
    42   instance, there will be an interactive browser.
    44 The initial goals for the project are an interactive browser for filesystem
    45 structures as well as a debugger backend that can interpret core dumps and task
    46 memory.
    48 == Requirements ==
    50 * Work in HelenOS—this means the code must be in C and/or an easily ported
    51   language like Lua.
    52 * View on different layers. For instance, when viewing a FAT directory entry,
    53   it should be possible to switch between viewing the formatted date and time,
    54   the integers, and the original bytes.
    55 * Check whether data is valid; handle broken data reasonably well.
    56 * Parse pieces of the data lazily; don’t try to read everything at once.
    57 * Work in both directions (parsing and building) without requiring too much
    58   extra effort when specifying the format.
    59 * Support full modifications. Ideally, allow creation of a whole filesystem
    60   from scratch.
    62 == Trees ==
    64 Bithenge represents all data in the form of a data structure called a “tree,”
    65 similar to the data structure represented by JSON. A tree consists of a boolean
    66 node, integer node, string node, or blob node, or an internal node with
    67 children. A boolean node holds a boolean value, an integer node holds a signed
    68 integer, and a string holds a Unicode string.
    70 A blob node represents an arbitrary sequence of raw bytes or bits. Blob nodes
    71 are polymorphic, allowing any source of raw binary data to be used. Bithenge
    72 includes blob node implementations for in‐memory buffers, files, and block
    73 devices. An implementation has also been written that reads another task’s
    74 virtual memory, but it hasn’t been committed because it’s unclear whether it
    75 will be useful in its current form.
    77 An internal node has an arbitrary number of children, each with a unique key.
    78 The key can be any node other than an internal node. Arrays can be represented
    79 by internal nodes with integer keys starting at 0. The tree node can provide
    80 children in an arbitrary order; the order will be used when displaying the
    81 tree, but should have no semantic significance. Internal nodes are polymorphic
    82 and can delay creation of child nodes until necessary, so keeping the whole
    83 tree in memory can be avoided.
    85 Internal nodes are currently responsible for freeing their own children. In the
    86 future, it should be possible for there to be multiple references to the same
    87 node, but it isn’t clear whether this should be implemented with symbolic
    88 links, an acyclic graph with reference counting, or a full graph.
    90 Note that all interpreted data is represented in Bithenge with nodes.
    91 Therefore, the word “blob” usually refers to a blob node, and so on.
    93 A decoded tree for a FAT filesystem might look something like this:
    94 {{{
    95 ○───bits─▶16
    97 ├───fat──▶○
    98 │         ├───0───▶0xfff0
    99 │         ├───1───▶0xffff
    100 │         └───2───▶0x0000
    102 └───root──▶○
    103            ├───0───▶○
    104            │        ├───name───▶README.TXT
    105            │        └───size───▶0x1351
    106            │
    107            └───1───▶○
    108                     ├───name───▶KERNEL.ELF
    109                     └───size───▶0x38e9a2
    110 }}}
    112 == Transforms ==
    114 A transform is a function from a tree to a tree. One example is `uint32le`,
    115 which takes a 4‐byte blob node as the input tree and provides an integer node
    116 as the output tree. Another example would be `FAT16_filesystem`, a transform
    117 that takes a byte blob node as the input tree and provides a complex output
    118 tree with various decoded information about the filesystem. Some transforms,
    119 like `uint32le`, are built in to Bithenge; more complicated transforms can be
    120 loaded from a script file.
    122 Transforms are represented in Bithenge with a polymorphic object. The primary
    123 method is `apply`, which applies a transform to an input tree and creates an
    124 output tree. When a transform takes a blob node as input, it is sometimes
    125 necessary to determine the prefix of a given blob that can be used as input to
    126 the transform; the method `prefix_length` can be used for this.
    128 == Built‐in transforms ==
    130 These transforms are implemented in C and included with Bithenge. Note that
    131 precise names are preferred; scripts can define shorter aliases if necessary.
    133 ||= name =||= input =||= output =||= description =||= example =||
    134 ||ascii             ||byte blob node   ||string        ||decodes some bytes as ASCII characters ||  `hex:6869` becomes `"hi"` ||
    135 ||bit               ||1‐bit blob node  ||boolean       ||decodes a single bit || `1` becomes `true` ||
    136 ||bits_be           ||byte blob node   ||bit blob node ||decodes bytes as bits, starting with the most‐significant bit || `hex:0f` becomes `bit:00001111` ||
    137 ||bits_le           ||byte blob node   ||bit blob node ||decodes bytes as bits, starting with the least‐significant bit || `hex:0f` becomes `bit:11110000` ||
    138 ||known_length(len) ||blob node        ||blob node     ||requires the input to have a known length || ||
    139 ||nonzero_boolean   ||integer          ||boolean       ||decodes a boolean where nonzero values are true || `0` becomes `false` ||
    140 ||uint8             ||1‐byte blob node ||integer node  ||decodes a 1‐byte unsigned integer ||  `hex:11` becomes `17` ||
    141 ||uint16be          ||2‐byte blob node ||integer node  ||decodes a 2‐byte big‐endian unsigned integer ||  `hex:0201` becomes `513` ||
    142 ||uint16le          ||2‐byte blob node ||integer node  ||decodes a 2‐byte little‐endian unsigned integer ||  `hex:0101` becomes `257` ||
    143 ||uint32be          ||4‐byte blob node ||integer node  ||decodes a 4‐byte big‐endian unsigned integer ||  `hex:00000201` becomes `513` ||
    144 ||uint32le          ||4‐byte blob node ||integer node  ||decodes a 4‐byte little‐endian unsigned integer ||  `hex:01010000` becomes `257` ||
    145 ||uint64be          ||8‐byte blob node ||integer node  ||decodes a 8‐byte big‐endian unsigned integer ||  `hex:0000000000000201` becomes `513` ||
    146 ||uint64le          ||8‐byte blob node ||integer node  ||decodes a 8‐byte little‐endian unsigned integer ||  `hex:0101000000000000` becomes `257` ||
    147 ||uint_be(len)      ||bit blob node    ||integer node  ||decodes bits as an unsigned integer, starting with the most‐significant bit || ||
    148 ||uint_le(len)      ||bit blob node    ||integer node  ||decodes bits as an unsigned integer, starting with the least‐significant bit || ||
    149 ||zero_terminated   ||byte blob node   ||byte blob node||takes bytes up until the first `00` ||  `hex:7f0400` becomes `hex:7f04` ||
    151 == Basic syntax ==
    153 Script files are used to define complicated transforms.
    155 Transforms (including built‐in transforms) can be referenced by name:
    156 `uint32le`.
    158 Transforms can be given a new name: `transform u32 = uint32le;` defines a
    159 shorter alias for `uint32le`.
    161 Transforms can be composed to create a new transform that applies them in
    162 order. The transform `ascii <- zero_terminated` first removes the 0x00 from the
    163 end of the blob, then decodes it as ascii. Note that the order of composition
    164 is consistent with function composition and nested application in mathematics,
    165 and also consistent with the general idea that data moves from right to left as
    166 it is decoded.
    168 == Expressions ==
    170 Transforms can have parameters that affect how they decode data:
    171 {{{
    172 transform u16(little_endian) =
    173     if (little_endian) {
    174         uint16le
    175     } else {
    176         uint16be
    177     };
    178 }}}
    179 When such a transform is used, expressions must be given to calculate its
    180 parameters. The basic terms used in expressions are parameters given to the
    181 current transform, boolean and integer literals, and previously decoded fields
    182 in the current `struct` or an outer `struct`:
    183 {{{
    184 transform item(little_endian) = struct {
    185     .len <- u16(little_endian);
    186     .text <- ascii <- known_length(.len);
    187     .data <- known_length(8);
    188 };
    189 }}}
    190 You can also use `+`, `-`, `*`, and parentheses as you would expect.
    192 Expressions can also be used as transforms themselves, in two ways. First, an
    193 expression referring to `in` can be used to decode a value, such as
    194 `.num_bytes <- (in * 4) <- uint32le;`. Second, an expression that doesn’t refer
    195 to `in` can be used as a transform that takes an empty blob and evaluates the
    196 expression: for example, `.num_total <- (.num_short + .num_long);` could be
    197 used in a `struct`. In both cases, the parentheses are mandatory.
    199 == Structs ==
    201 Structs are used when a blob contains multiple data fields in sequence. A
    202 struct transform applies each subtransform to sequential parts of the blob and
    203 combines the results to create an internal node. The result of each
    204 subtransform is either assigned a key or has its keys and values merged into
    205 the final tree. Each subtransform must support `prefix_length`, so the lengths
    206 and positions of the data fields can be determined.
    208 === Example ===
    210 {{{
    211 transform point = struct {
    212     .x <- uint32le;
    213     .y <- uint32le;
    214 };
    216 transform labeled_point = struct {
    217     .id <- uint32le;
    218     .label <- ascii <- zero_terminated;
    219     <- point;
    220 };
    221 }}}
    223 If `labeled_point` is applied to `hex:0600000041000300000008000000`, the result
    224 is `{"id": 6, "label": "A", "x": 3, "y": 8}`.
    226 == Flow Control ==
    228 Boolean conditions can use
    229 `if (expression) { transform-if-true } else { transform-if-false }`, for
    230 example: `if (little_endian) { uint32le } else { uint32be }`. There is also
    231 syntax for switches: `switch (expression) { expression: transform; ... }`. Both
    232 of these have syntactic sugar for use in `struct`s; in a `struct`,
    233 `if (expression) { fields... }` is equivalent to
    234 `<- if (expression) { struct { fields... } };`.
    236 Three kinds of repetition are supported. `repeat(expression) {transform}`
    237 applies the transform a given number of times. `repeat {transform}` applies the
    238 transform until it fails or the end of the data is reached.
    239 `do {transform} while(expression)` applies the transform until the expression,
    240 evaluated within the result of the transform, is false; this can be used for
    241 things like
    242 `do { struct { .keep_going <- nonzero_boolean <- uint8; .val <- uint8; } } while(.keep_going)`.
    243 For each type of repetition, the result is an internal node with sequential
    244 keys starting at `0`.
    246 == Using Bithenge ==
    248 The Bithenge source code is in `uspace/app/bithenge` and is built along with
    249 HelenOS. It can be built for Linux instead with `make -f Makefile.linux`.
    251 The program can be run with `bithenge <script file> <source>`. The script file
    252 must define a transform called `main`. The source can start with one of the
    253 following prefixes:
    255 ||= Prefix =||= Example =||= Description =||
    256 || file: || file:/textdemo || Read the contents of a file. This is the default if no prefix is used. ||
    257 || block: || block:bd/initrd || Read the contents of a block device. (HelenOS only.) ||
    258 || hex: || hex:01000000 || Use a string of hexadecimal characters to create a blob node. ||
    260 There are some example files in `uspace/dist/src/bithenge`.
    262 === Using the API ===
    264 An overview of the API will be written later.
    266 Nodes, expressions, and transforms use reference counting.  Functions that
    267 produce such objects (through a `bithenge_xxx_t **` argument) create a new
    268 reference to the object; you are responsible for ensuring the reference count
    269 is eventually decremented. If a function’s documentation says it “takes
    270 [ownership of] a reference” to an object, the function guarantees the object’s
    271 reference count will eventually be decremented, even if an error occurs.
    272 Therefore, if you create an object only to immediately pass it to such a
    273 function, you do not need to worry about its reference count.
    275 == Future language ideas ==
    277 In approximate order of priority.
    279 === Other ideas ===
    281  Subblobs:: When there are pointers to other offsets in the blob, the script
    282    could pass the whole blob as a parameter and apply transforms to subblobs.
    283    This is essential for non‐sequential blobs like filesystems.
    284  Infinite loop detection:: When decoding transforms like
    285    `struct { if (.non_existent) { } }`, an infinite loop occurs. This can also
    286    happen if the field exists, but in an outer `struct`. An error should be
    287    printed instead; Bithenge should not try to look in the `if` when searching
    288    for `.non_existent`.
    289  Complex expressions:: More operators, and expressions that call transforms.
    290  Better error reporting:: errno.h is intended for system call errors; when
    291    other errors occur, a more helpful error message should be printed, like
    292    "field .foo not found" or "cannot apply nonzero_boolean to blobs".
    293  Assertions:: These could be implemented as transforms that don't actually
    294    change the input. There could be multiple levels, ranging from “warning” to
    295    “fatal error”.
    296  Enumerations:: An easier way to handle many constant values, like
    297    `enum { 0: "none", 1: "file", 2: "directory", 3: "symlink" }`.
    298  Recursive transforms:: Although simple cases are handled by `do...while`, in
    299    some cases transforms need to recursively refer to themselves or each other.
    300  Merge blobs and internal nodes:: Currently, `struct`, `repeat`, and so on only
    301    work with blobs, which must be either byte sequences or bit sequences.
    302    Numbered internal nodes (such as those made by `repeat`) should be supported
    303    as well.
    304  Transforming internal nodes:: After binary data is decoded into a tree, it
    305    should be possible to apply further transforms to interpret the data
    306    further. For instance, after the FAT and directory entries of a FAT
    307    filesystem have been decoded, a further transform could determine the data
    308    for each file.
    309  More information in repeat subtransforms:: Repeat subtransforms should have
    310    access to the current index and previously decoded items.
    311  Hidden fields:: Some fields, such as length fields, are no longer interesting
    312    after the data is decoded, so they should be hidden by default.
    313  Search:: Decoding may require searching for a fixed sequence of bytes in the
    314    data.
    315  Automatic parameters:: It could be useful to automatically pass some
    316    parameters rather than computing and passing them explicitly. For instance,
    317    a version number that affects the format of many different parts of the file
    318    could be passed automatically, without having to write it out every time. A
    319    more advanced automatic parameter could keep track of current offset being
    320    decoded within a blob. There would need to be some sort of grouping to
    321    determine which transforms have the automatic parameters.
    322  Smarter length calculation:: Bithenge should automatically detect the length
    323    of certain composed transforms, such as `repeat(8) {bit} <- bits_le`. This
    324    would also be addressed by the constraint‐based version.
    326 === Constraint‐based version ===
    328 This and most other projects use an imperative design, where the format
    329 specification is always used in a fixed order, one step at a time. The
    330 imperative design causes problems when the user wants to modify a field,
    331 because arbitrary changes to other fields may be necessary that cannot be
    332 determined from the format specification.
    334 It may be possible to solve this with a constraint-based design, where the
    335 format specification consists of statements that must be true about the raw and
    336 interpreted data, and the program figures out how to solve these constraints.
    337 Unfortunately, this approach seems too open-ended and unpredictable to fit
    338 within GSoC.
    340 == Interesting formats ==
    342 These formats will be interesting and/or difficult to handle. I will keep them
    343 in mind when designing the library.
    345 * Filesystem allocation tables, which should be kept consistent with the actual
    346   usage of the disk.
    347 * Filesystem logs, which should be applied to the rest of the disk before
    348   interpreting it.
    349 * Formats where the whole file can have either endianness depending on a field
    350   in the header.
    351 * The [ Blender file format]
    352   is especially dynamic. When Blender saves a file, it just copies the
    353   structures from memory and translates the pointers. Since each Blender
    354   version and architecture will have different structures, the output file
    355   includes a header describing the fields and binary layout of each structure.
    356   When the file is loaded, the header is read first and the structures will be
    357   translated as necessary.
    358 * If the language is powerful enough, it might be possible to have a native
    359   description of Zlib and other compression formats.
    360 * It could be interesting to parse ARM or x86 machine code.
    362 == Existing Tools ==
    364 I researched existing tools related to my project, so they can be used for
    365 inspiration.
    367 === Construct ===
    369 [ Construct] is a Python library for creating
    370 declarative structure definitions. Each instance of the `Construct` class has a
    371 name, and knows how to read from a stream, write to a stream, and determine its
    372 length. Some predefined `Construct` subclasses use an arbitrary Python function
    373 evaluated at runtime, or behave differently depending on whether
    374 sub‐`Construct`s throw exceptions. `Const` uses a sub‐`Construct` and makes
    375 sure the value is correct. Also has lazy `Construct`s.
    377 Unfortunately, if you change the size of a structure, you still have to change
    378 everything else manually.
    380 === !BinData ===
    382 [ BinData] makes good use of Ruby syntax; it
    383 mostly has the same features as Construct.
    385 === Imperative DSLs ===
    387 DSLs in this category are used in an obvious, deterministic manner, and complex
    388 edits (changing the length of a structure) are difficult or impossible. They
    389 are simple imperative languages in which fields, structures, bitstructures, and
    390 arrays can be defined. The length, decoded value, and presence of fields can be
    391 determined by expressions using any previously decoded field, and structures
    392 can use `if`/`while`/`continue`/`break` and similar statements. Structures can
    393 inherit from other structures, meaning that the parent’s fields are present at
    394 the beginning of the child. Statements can move to a different offset in the
    395 input data. There may be a real programming language that can be used along
    396 with the DSL.
    398  [ DWARF]::
    399   Uses a simple stack‐based VM to calculate variable locations.
    400  [ “Grammar‐based specification and parsing of binary file formats”]::
    401   Actually uses an attribute grammar, but it isn’t terribly different from an
    402   imperative language.
    403  [ PyFFI]::
    404   Lets you create or modify files instead of just reading them. Fields can
    405   refer to blocks of data elsewhere in the file. Uses an XML format.
    406  [ QuickBMS]::
    407   A popular tool for extracting files from video game archives. Its main
    408   strength is the broad number of compression formats supported. It can put
    409   modified files back in the archive in trivial situations.
    410  [ Synalize It!]::
    411   Not completely imperative; if you declare optional structs where part of the
    412   data is constant, the correct struct will be displayed. Has a Graphviz export
    413   of file structure. Uses an XML format.
    414  Other free::
    415   [ BinPAC],
    416   [ Data::ParseBinary],
    417   [ DataScript],
    418   [ DataWorkshop],
    419   [ Wireshark Generic Dissector],
    420   [ Metafuzz BinStruct], and
    421   [ PADS].
    422  Other proprietary::
    423   [ 010 Editor],
    424   [ Andys Binary Folding Editor],
    425   [ Hackman Suite],
    426   [ Hex Editor Neo],
    427   [ iBored], and
    428   [ WinHext].
    430 === Less interesting tools ===
    432  Simple formats in hex editors::
    433   These support static fields and dynamic lengths only:
    434   [ FlexHex],
    435   [ HexEdit],
    436   [ Hexplorer],
    437   [ Hex Workshop], and
    438   [ Okteta].
    439  Simple formats elsewhere::
    440   [ CTF],
    441   [ ffe],
    442   [ Node Packet],
    443   [ Scapy], and
    444   [ stabs]
    445   can only handle trivial structures.
    446   [ lpack],
    447   [ Perl’s pack],
    448   [ Python’s struct], and
    449   [ VStruct]
    450   use concise string formats to describe simple structures.
    451   [ Hachoir]
    452   uses Python for most things.
    453  Protocol definition formats::
    454   [ ASN.1],
    455   [ MIDL],
    456   [ Piqi],
    457   and other IPC implementations go in the other direction: they generate a
    458   binary format from a text description of a structure. ASN.1 in particular
    459   has many features.
    460  [ Wireshark] and [ tcpdump]::
    461   As the Construct wiki notes, you would expect these developers to have some
    462   sort of DSL, but they just use C for everything. Wireshark does use ASN.1,
    463   Diameter, and MIDL for protocols developed with them.
    465 == Miscellaneous ideas ==
    467 === Code exporter ===
    469 A tool could generate C code to read and write data given a specification. A
    470 separate file could be used to specify which types should be used and which
    471 things should be read lazily or strictly.
    473 === Diff ===
    475 A diff tool could show differences in the interpreted data.
    477 === Space‐filling curves ===
    479 [ Space‐filling curves]
    480 look cool, but this project is about ''avoiding'' looking at raw binary data.
     3Now that the project has a name, everything here has been moved to [[Bithenge]].