= Structured Binary Data =

[[PageOutline(2-3)]]

As part of [wiki:GSOC Google Summer of Code 2012], Bithenge is being created to
address #317. This page describes the project’s design and implementation.
The code is at
[https://code.launchpad.net/~wtachi/helenos/bithenge lp:~wtachi/helenos/bithenge]
and periodic updates are posted to
[http://lists.modry.cz/cgi-bin/listinfo/helenos-devel HelenOS-devel].

== Overview ==

Exploring and working with structured binary data is necessary in many
different situations in a project like HelenOS. For instance, when implementing
a file format or filesystem, it is first necessary to explore preexisting files
and disks and learn the low‐level details of the format. Debugging compiled
programs, working with core dumps, and exploring network protocols also require
some way of interpreting binary data.

The most basic tool for exploring binary data is the hex editor. Using a hex
editor is inefficient and unpleasant because it requires manual calculation of
lengths and offsets while constantly referring back to the data format.
General‐purpose scripting languages can be used instead, so a structure can be
defined once and decoded as often as necessary. However, even with useful tools
like Python’s struct module, the programmer must specify how to read the input
data, calculate lengths and offsets, and provide useful output, so there’s much
more work involved than simply specifying the format of the data. This extra
code will probably be rewritten every time a new script is made, due to
slightly differing requirements.

The Bithenge project involves creating a powerful library and tools that will
make working with structured binary data faster and easier. It will consist of:

* A core library that manages structured data and provides basic building
  blocks for binary data interpretation.
* Data providers to access various sources of raw binary data.
* Format providers, which can load and save complex format specifications. In
  particular, there will be a domain‐specific language for format
  specifications.
* Clients, programs which use the library to work with binary data. For
  instance, there will be an interactive browser.

The initial goals for the project are an interactive browser for filesystem
structures and a debugger backend that can interpret core dumps and task
memory.

== Requirements ==

* Work in HelenOS—this means the code must be in C and/or an easily ported
  language like Lua.
* View on different layers. For instance, when viewing a FAT directory entry,
  it should be possible to switch between viewing the formatted date and time,
  the integers, and the original bytes.
* Check whether data is valid; handle broken data reasonably well.
* Parse pieces of the data lazily; don’t try to read everything at once.
* Work in both directions (parsing and building) without requiring too much
  extra effort when specifying the format.
* Support full modifications. Ideally, allow creation of a whole filesystem
  from scratch.

== Trees ==

Bithenge represents all data in the form of a data structure called a “tree,”
similar to the data structure used by JSON. A tree consists of a boolean node,
integer node, string node, or blob node, or an internal node with children. A
boolean node holds a boolean value, an integer node holds a signed integer, and
a string holds a Unicode string.

A blob node represents an arbitrary sequence of raw bytes. Blob nodes are
polymorphic, allowing any source of raw binary data to be used. Bithenge
includes blob node implementations for in‐memory buffers, files, and block
devices. An implementation has been written that reads another task’s virtual
memory, but it hasn’t been committed because it’s unclear whether it will be
useful.

An internal node has an arbitrary number of children, each associated with a
unique key. The key can be any node other than an internal node. An array can
be represented by an internal node with integer keys starting at 0. The tree
node can provide children in an arbitrary order; the order will be used when
displaying the tree, but should have no semantic significance. Internal nodes
are polymorphic and can delay creation of child nodes until necessary, so
keeping the whole tree in memory can be avoided.

Internal nodes are currently responsible for freeing their own children. In the
future, it should be possible for there to be multiple references to the same
node, but it isn’t clear whether this should be implemented with symbolic
links, an acyclic graph with reference counting, or a full graph.

Note that all interpreted data is represented in Bithenge with nodes.
Therefore, the word “blob” usually refers to a blob node, &c.

{{{

○───bits─▶16
│
├───fat──▶○
│         ├───0───▶0xfff0
│         ├───1───▶0xffff
│         └───2───▶0x0000
│
└───root──▶○
           ├───0───▶○
           │        ├───name───▶README.TXT
           │        └───size───▶0x1351
           │
           └───1───▶○
                    ├───name───▶KERNEL.ELF
                    └───size───▶0x38e9a2
}}}

== Programs ==

The only program currently available is a simple test that prints some trees as
JSON and Python literals.

== Transforms ==

A transform is a function from a tree to a tree. One example is `uint32le`,
which takes a 4‐byte blob node as the input tree and provides an integer node
as the output tree. Another example would be `FAT16_filesystem`, a transform
that takes a blob node as the input tree and provides a complex output tree
with various decoded information about the filesystem. Some transforms, like
`uint32le`, are built in to Bithenge; more complicated transforms can be loaded
from a script file.

Transforms are represented in Bithenge with a polymorphic object. The primary
method is `apply`, which applies a transform to an input tree and creates an
output tree. When a transform takes a blob node as input, it is sometimes
necessary to determine the prefix of a given blob that can be used as input to
the transform; the method `prefix_length` can be used for this.

== Built‐in transforms ==

These transforms are implemented in C and included with Bithenge. Note that
fully specific names are preferred; scripts can define aliases if necessary.

||= name =||= input =||= output =||= description =||= example =||
||uint32le        ||4‐byte blob node ||integer node ||decodes a 4‐byte little‐endian unsigned integer ||  `x"01010000"` becomes `257` ||
||zero_terminated ||blob node        ||blob node    ||takes bytes up until the first `00` ||  `x"7f0400"` becomes `x"7f04"` ||
||ascii           ||blob node        ||string       ||decodes some bytes as ASCII characters ||  `x"6869"` becomes `"hi"` ||
||padded_with_spaces_at_end ||string ||string       ||removes spaces from the end of a string ||  `"README  "` becomes `"README"`

== Basic syntax ==

Script files used to define new transforms.

Transforms (including built‐in transforms) can be referenced by name:
`uint32le`.

Transforms can be given a new name: `transform u32 = uint32le;` defines a
shorter alias for `uint32le`.

Transforms can be composed to create a new transform that applies them in
order. The transform `padded_with_spaces_at_end . ascii . zero_terminated`
first removes the 0x00 from the end of the blob, then decodes it as ascii and
removes spaces from the end. Note that the order of composition is consistent
with function composition and nested application in mathematics, and also
consistent with the general idea that data moves from right to left as it is
decoded.

== Structs ==

Structs are used when a blob contains multiple data fields in sequence. A
struct transform applies each subtransform to sequential parts of the blob and
combines the results to create an internal node. The result of each
subtransform is either assigned a key or has its keys and values merged into
the final internal node. Each subtransform must support `prefix_length`, so the
lengths and positions of the data fields can be determined.

=== Example ===

{{{
transform point = struct {
    .x = uint32le;
    .y = uint32le;
};

transform labeled_point = struct {
    .id = uint32le;
    .label = ascii . zero_terminated;
    point;
};
}}}

If `labeled_point` is applied to `x"06000000 4100 03000000 08000000"`, the
result is `{"id": 6, "label": "A", "x": 3, "y": 8}`.

== Future features ==

* Parameters for transforms
  * Keyword parameters only?
* Expressions depending on previously decoded values
* Enumerations
* Variables
* Transforming internal nodes
* Assertions
  * Transforms that return their input
  * Different levels (expected, required, mandatory)
* Error handling
* Hidden fields
* Iteration/recursion/repetition
* Seeking and detecting position
* Checking alignment
* Reference to structures at other offsets
  * How to know what blob to go within?
  * How to know current offset within that blob?
  * Could be relative to multiple things at once...
  * Blob node can be an inherited parameter
    * This is also useful for endianness
  * Offset could be an automatically incremented parameter
* Ad hoc tweaks at runtime

=== Constraint‐based version ===

This and most other projects use an imperative design, where the format
specification is always used in a fixed order, one step at a time. The
imperative design causes problems when the user wants to modify a field,
because arbitrary changes to other fields may be necessary that cannot be
determined from the format specification.

It may be possible to solve this with a constraint-based design, where the
format specification consists of statements that must be true about the raw and
interpreted data, and the program figures out how to solve these constraints.
Unfortunately, this approach seems too open-ended and unpredictable to fit
within GSoC.

== Interesting formats ==

These formats will be interesting and/or difficult to handle. I will keep them
in mind when designing the library.

* Filesystem allocation tables, which should be kept consistent with the actual
  usage of the disk.
* Filesystem logs, which should be applied to the rest of the disk before
  interpreting it.
* Formats where the whole file can have either endianness depending on a field
  in the header.
* The [http://www.blender.org/development/architecture/blender-file-format/ Blender file format]
  is especially dynamic. When Blender saves a file, it just copies the
  structures from memory and translates the pointers. Since each Blender
  version and architecture will have different structures, the output file
  includes a header describing the fields and binary layout of each structure.
  When the file is loaded, the header is read first and the structures will be
  translated as necessary.
* If the language is powerful enough, it might be possible to have a native
  description of Zlib and other compression formats.
* It could be interesting to parse ARM or x86 machine code.

== Existing Tools ==

I researched existing tools related to my project, so they can be used for
inspiration.

=== Construct ===

[http://construct.wikispaces.com/ Construct] is a Python library for creating
declarative structure definitions. Each instance of the `Construct` class has a
name, and knows how to read from a stream, write to a stream, and determine its
length. Some predefined `Construct` subclasses use an arbitrary Python function
evaluated at runtime, or behave differently depending on whether
sub‐`Construct`s throw exceptions. `Const` uses a sub‐`Construct` and makes
sure the value is correct. Also has lazy `Construct`s.

Unfortunately, if you change the size of a structure, you still have to change
everything else manually.

=== !BinData ===

[http://bindata.rubyforge.org/ BinData] makes good use of Ruby syntax; it
mostly has the same features as Construct.

=== Imperative DSLs ===

DSLs in this category are used in an obvious, deterministic manner, and complex
edits (changing the length of a structure) are difficult or impossible. They
are simple imperative languages in which fields, structures, bitstructures, and
arrays can be defined. The length, decoded value, and presence of fields can be
determined by expressions using any previously decoded field, and structures
can use `if`/`while`/`continue`/`break` and similar statements. Structures can
inherit from other structures, meaning that the parent’s fields are present at
the beginning of the child. Statements can move to a different offset in the
input data. There may be a real programming language that can be used along
with the DSL.

 [http://dwarfstd.org/ DWARF]::
  Uses a simple stack‐based VM to calculate variable locations.
 [http://ijdc.net/index.php/ijdc/article/view/207 “Grammar‐based specification and parsing of binary file formats”]::
  Actually uses an attribute grammar, but it isn’t terribly different from an
  imperative language.
 [http://pyffi.sourceforge.net/ PyFFI]::
  Lets you create or modify files instead of just reading them. Fields can
  refer to blocks of data elsewhere in the file. Uses an XML format.
 [http://aluigi.altervista.org/quickbms.htm QuickBMS]::
  A popular tool for extracting files from video game archives. Its main
  strength is the broad number of compression formats supported. It can put
  modified files back in the archive in trivial situations.
 [http://www.synalysis.net/ Synalize It!]::
  Not completely imperative; if you declare optional structs where part of the
  data is constant, the correct struct will be displayed. Has a Graphviz export
  of file structure. Uses an XML format.
 Other free::
  [http://www-old.bro-ids.org/wiki/index.php/BinPAC_Userguide BinPAC],
  [https://metacpan.org/module/Data::ParseBinary Data::ParseBinary],
  [http://datascript.berlios.de/DataScriptLanguageOverview.html DataScript],
  [http://www.dataworkshop.de/ DataWorkshop],
  [http://wsgd.free.fr/ Wireshark Generic Dissector],
  [http://metafuzz.rubyforge.org/binstruct/ Metafuzz BinStruct], and
  [http://www.padsproj.org/ PADS].
 Other proprietary::
  [http://www.sweetscape.com/010editor/#templates 010 Editor],
  [http://www.nyangau.org/be/be.htm Andys Binary Folding Editor],
  [https://www.technologismiki.com/prod.php?id=31 Hackman Suite],
  [http://www.hhdsoftware.com/doc/hex-editor/language-reference-overview.html Hex Editor Neo],
  [http://apps.tempel.org/iBored/ iBored], and
  [https://www.x-ways.net/winhex/templates.html#User_Templates WinHext].

=== Less interesting tools ===

 Simple formats in hex editors::
  These support static fields and dynamic lengths only:
  [http://www.flexhex.com/ FlexHex],
  [http://hexedit.com/ HexEdit],
  [http://sourceforge.net/projects/hexplorer/ Hexplorer],
  [http://www.hexworkshop.com/ Hex Workshop], and
  [http://kde.org/applications/utilities/okteta/ Okteta].
 Simple formats elsewhere::
  [http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/sys/ctf.h CTF],
  [http://ff-extractor.sourceforge.net/ ffe],
  [http://bigeasy.github.com/node-packet/ Node Packet],
  [https://www.secdev.org/projects/scapy/ Scapy], and
  [http://sourceware.org/gdb/current/onlinedocs/stabs/ stabs]
  can only handle trivial structures.
  [http://www.tecgraf.puc-rio.br/~lhf/ftp/lua/#lpack lpack],
  [http://perldoc.perl.org/functions/pack.html Perl’s pack],
  [http://docs.python.org/library/struct.html Python’s struct], and
  [https://github.com/ToxicFrog/vstruct VStruct]
  use concise string formats to describe simple structures.
  [https://bitbucket.org/haypo/hachoir Hachoir]
  uses Python for most things.
 Protocol definition formats::
  [https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One ASN.1],
  [https://en.wikipedia.org/wiki/Microsoft_Interface_Definition_Language MIDL],
  [http://piqi.org/ Piqi],
  and other IPC implementations go in the other direction: they generate a
  binary format from a text description of a structure. ASN.1 in particular
  has many features.
 [https://www.wireshark.org/ Wireshark] and [http://www.tcpdump.org/ tcpdump]::
  As the Construct wiki notes, you would expect these developers to have some
  sort of DSL, but they just use C for everything. Wireshark does use ASN.1,
  Diameter, and MIDL for protocols developed with them.

== Miscellaneous ideas ==

=== Code exporter ===

A tool could generate C code to read and write data given a specification. A
separate file could be used to specify which types should be used and which
things should be read lazily or strictly.

=== Diff ===

A diff tool could show differences in the interpreted data.

=== Space‐filling curves ===

[http://corte.si/posts/visualisation/binvis/index.html Space‐filling curves]
look cool, but this project is about ''avoiding'' looking at raw binary data.



The next step is to design and implement the domain-specific language. I
will do this incrementally: start with basic features, design them,
implement them, and make an example, then move on to more advanced
features, and so on. I will post an update after each step, especially
after each part of the design. This is different from the schedule I
gave in my proposal, but my goal for July 1st is the same: a program
that can use a format specification file to interpret data and dump it
in JSON format.