Changes between Version 8 and Version 9 of StructuredBinaryData

2012-06-20T21:03:17Z (12 years ago)
Sean Bartell

add description of project and initial design


  • StructuredBinaryData

    v8 v9  
    5 This page will document my thoughts and design ideas for the structured binary
    6 data project. The project aims to address #317; a description of my overall
    7 approach can be found on the
    8 [ GSoC project page].
     5As part of [wiki:GSOC Google Summer of Code 2012], Bithenge is being created to
     6address #317. This page describes the project’s design and implementation.
     7The code is at
     8[ lp:~wtachi/helenos/bithenge]
     9and periodic updates are posted to
     10[ HelenOS-devel].
     12== Overview ==
     14Exploring and working with structured binary data is necessary in many
     15different situations in a project like HelenOS. For instance, when implementing
     16a file format or filesystem, it is first necessary to explore preexisting files
     17and disks and learn the low‐level details of the format. Debugging compiled
     18programs, working with core dumps, and exploring network protocols also require
     19some way of interpreting binary data.
     21The most basic tool for exploring binary data is the hex editor. Using a hex
     22editor is inefficient and unpleasant because it requires manual calculation of
     23lengths and offsets while constantly referring back to the data format.
     24General‐purpose scripting languages can be used instead, so a structure can be
     25defined once and decoded as often as necessary. However, even with useful tools
     26like Python’s struct module, the programmer must specify how to read the input
     27data, calculate lengths and offsets, and provide useful output, so there’s much
     28more work involved than simply specifying the format of the data. This extra
     29code will probably be rewritten every time a new script is made, due to
     30slightly differing requirements.
     32The Bithenge project involves creating a powerful library and tools that will
     33make working with structured binary data faster and easier. It will consist of:
     35* A core library that manages structured data and provides basic building
     36  blocks for binary data interpretation.
     37* Data providers to access various sources of raw binary data.
     38* Format providers, which can load and save complex format specifications. In
     39  particular, there will be a domain‐specific language for format
     40  specifications.
     41* Clients, programs which use the library to work with binary data. For
     42  instance, there will be an interactive browser.
     44The initial goals for the project are an interactive browser for filesystem
     45structures and a debugger backend that can interpret core dumps and task
    1048== Requirements ==
    1250* Work in HelenOS—this means the code must be in C and/or an easily ported
    1351  language like Lua.
    14 * View on different layers; for instance, switch between viewing the formatted
    15   date and time for a FAT directory entry, the integers, and the original
    16   bytes.
     52* View on different layers. For instance, when viewing a FAT directory entry,
     53  it should be possible to switch between viewing the formatted date and time,
     54  the integers, and the original bytes.
    1755* Check whether data is valid; handle broken data reasonably well.
    1856* Parse pieces of the data lazily; don’t try to read everything at once.
    1957* Work in both directions (parsing and building) without requiring too much
    20   extra effort.
     58  extra effort when specifying the format.
    2159* Support full modifications. Ideally, allow creation of a whole filesystem
    2260  from scratch.
     62== Trees ==
     64Bithenge represents all data in the form of a data structure called a “tree,”
     65similar to the data structure used by JSON. A tree consists of a boolean node,
     66integer node, string node, or blob node, or an internal node with children. A
     67boolean node holds a boolean value, an integer node holds a signed integer, and
     68a string holds a Unicode string.
     70A blob node represents an arbitrary sequence of raw bytes. Blob nodes are
     71polymorphic, allowing any source of raw binary data to be used. Bithenge
     72includes blob node implementations for in‐memory buffers, files, and block
     73devices. An implementation has been written that reads another task’s virtual
     74memory, but it hasn’t been committed because it’s unclear whether it will be
     77An internal node has an arbitrary number of children, each associated with a
     78unique key. The key can be any node other than an internal node. An array can
     79be represented by an internal node with integer keys starting at 0. The tree
     80node can provide children in an arbitrary order; the order will be used when
     81displaying the tree, but should have no semantic significance. Internal nodes
     82are polymorphic and can delay creation of child nodes until necessary, so
     83keeping the whole tree in memory can be avoided.
     85Internal nodes are currently responsible for freeing their own children. In the
     86future, it should be possible for there to be multiple references to the same
     87node, but it isn’t clear whether this should be implemented with symbolic
     88links, an acyclic graph with reference counting, or a full graph.
     90Note that all interpreted data is represented in Bithenge with nodes.
     91Therefore, the word “blob” usually refers to a blob node, &c.
     98│         ├───0───▶0xfff0
     99│         ├───1───▶0xffff
     100│         └───2───▶0x0000
     103           ├───0───▶○
     104           │        ├───name───▶README.TXT
     105           │        └───size───▶0x1351
     106           │
     107           └───1───▶○
     108                    ├───name───▶KERNEL.ELF
     109                    └───size───▶0x38e9a2
     112== Programs ==
     114The only program currently available is a simple test that prints some trees as
     115JSON and Python literals.
     117== Transforms ==
     119A transform is a function from a tree to a tree. One example is `uint32le`,
     120which takes a 4‐byte blob node as the input tree and provides an integer node
     121as the output tree. Another example would be `FAT16_filesystem`, a transform
     122that takes a blob node as the input tree and provides a complex output tree
     123with various decoded information about the filesystem. Some transforms, like
     124`uint32le`, are built in to Bithenge; more complicated transforms can be loaded
     125from a script file.
     127Transforms are represented in Bithenge with a polymorphic object. The primary
     128method is `apply`, which applies a transform to an input tree and creates an
     129output tree. When a transform takes a blob node as input, it is sometimes
     130necessary to determine the prefix of a given blob that can be used as input to
     131the transform; the method `prefix_length` can be used for this.
     133== Built‐in transforms ==
     135These transforms are implemented in C and included with Bithenge. Note that
     136fully specific names are preferred; scripts can define aliases if necessary.
     138||= name =||= input =||= output =||= description =||= example =||
     139||uint32le        ||4‐byte blob node ||integer node ||decodes a 4‐byte little‐endian unsigned integer ||  `x"01010000"` becomes `257` ||
     140||zero_terminated ||blob node        ||blob node    ||takes bytes up until the first `00` ||  `x"7f0400"` becomes `x"7f04"` ||
     141||ascii           ||blob node        ||string       ||decodes some bytes as ASCII characters ||  `x"6869"` becomes `"hi"` ||
     142||padded_with_spaces_at_end ||string ||string       ||removes spaces from the end of a string ||  `"README  "` becomes `"README"`
     144== Basic syntax ==
     146Script files used to define new transforms.
     148Transforms (including built‐in transforms) can be referenced by name:
     151Transforms can be given a new name: `transform u32 = uint32le;` defines a
     152shorter alias for `uint32le`.
     154Transforms can be composed to create a new transform that applies them in
     155order. The transform `padded_with_spaces_at_end . ascii . zero_terminated`
     156first removes the 0x00 from the end of the blob, then decodes it as ascii and
     157removes spaces from the end. Note that the order of composition is consistent
     158with function composition and nested application in mathematics, and also
     159consistent with the general idea that data moves from right to left as it is
     162== Structs ==
     164Structs are used when a blob contains multiple data fields in sequence. A
     165struct transform applies each subtransform to sequential parts of the blob and
     166combines the results to create an internal node. The result of each
     167subtransform is either assigned a key or has its keys and values merged into
     168the final internal node. Each subtransform must support `prefix_length`, so the
     169lengths and positions of the data fields can be determined.
     171=== Example ===
     174transform point = struct {
     175    .x = uint32le;
     176    .y = uint32le;
     179transform labeled_point = struct {
     180    .id = uint32le;
     181    .label = ascii . zero_terminated;
     182    point;
     186If `labeled_point` is applied to `x"06000000 4100 03000000 08000000"`, the
     187result is `{"id": 6, "label": "A", "x": 3, "y": 8}`.
     189== Future features ==
     191* Parameters for transforms
     192  * Keyword parameters only?
     193* Expressions depending on previously decoded values
     194* Enumerations
     195* Variables
     196* Transforming internal nodes
     197* Assertions
     198  * Transforms that return their input
     199  * Different levels (expected, required, mandatory)
     200* Error handling
     201* Hidden fields
     202* Iteration/recursion/repetition
     203* Seeking and detecting position
     204* Checking alignment
     205* Reference to structures at other offsets
     206  * How to know what blob to go within?
     207  * How to know current offset within that blob?
     208  * Could be relative to multiple things at once...
     209  * Blob node can be an inherited parameter
     210    * This is also useful for endianness
     211  * Offset could be an automatically incremented parameter
     212* Ad hoc tweaks at runtime
     214=== Constraint‐based version ===
     216This and most other projects use an imperative design, where the format
     217specification is always used in a fixed order, one step at a time. The
     218imperative design causes problems when the user wants to modify a field,
     219because arbitrary changes to other fields may be necessary that cannot be
     220determined from the format specification.
     222It may be possible to solve this with a constraint-based design, where the
     223format specification consists of statements that must be true about the raw and
     224interpreted data, and the program figures out how to solve these constraints.
     225Unfortunately, this approach seems too open-ended and unpredictable to fit
     226within GSoC.
    24228== Interesting formats ==
    46250== Existing Tools ==
    48 I am researching existing tools related to my project, so they can be used for
     252I researched existing tools related to my project, so they can be used for
    62266everything else manually.
    64 === BinData ===
     268=== !BinData ===
    66270[ BinData] makes good use of Ruby syntax; it
    163367[ Space‐filling curves]
    164368look cool, but this project is about ''avoiding'' looking at raw binary data.
     372The next step is to design and implement the domain-specific language. I
     373will do this incrementally: start with basic features, design them,
     374implement them, and make an example, then move on to more advanced
     375features, and so on. I will post an update after each step, especially
     376after each part of the design. This is different from the schedule I
     377gave in my proposal, but my goal for July 1st is the same: a program
     378that can use a format specification file to interpret data and dump it
     379in JSON format.