Changes between Version 22 and Version 23 of StructuredBinaryData

Timestamp:: 2012-08-24T06:49:33Z
Author:: Sean Bartell
Comment:: replace with link to Bithenge

StructuredBinaryData

-              v22
+              v23
 = Structured Binary Data =
+[[PageOutline(2-3)]]
+As part of [wiki:GSOC Google Summer of Code 2012], Bithenge is being created to
+address #317. This page describes the project’s design and implementation.
+The code is at
+[https://code.launchpad.net/~wtachi/helenos/bithenge lp:~wtachi/helenos/bithenge]
+and periodic updates are posted to
+[http://lists.modry.cz/cgi-bin/listinfo/helenos-devel HelenOS-devel].
+== Overview ==
+Exploring and working with structured binary data is necessary in many
+different situations in a project like HelenOS. For instance, when implementing
+a file format or filesystem, it is first necessary to explore preexisting files
+and disks and learn the low‐level details of the format. Debugging compiled
+programs, working with core dumps, and exploring network protocols also require
+some way of interpreting binary data.
+The most basic tool for exploring binary data is the hex editor. Using a hex
+editor is inefficient and unpleasant because it requires manual calculation of
+lengths and offsets while constantly referring back to the data format.
+General‐purpose scripting languages can be used instead, so a structure can be
+defined once and decoded as often as necessary. However, even with useful tools
+like Python’s struct module, the programmer must specify how to read the input
+data, calculate lengths and offsets, and provide useful output, so there’s much
+more work involved than simply specifying the format of the data. This extra
+code will probably be rewritten every time a new script is made, due to
+slightly differing requirements.
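+As a rough illustration of that extra work, here is a minimal sketch of a
+hand-written decoder using Python's `struct` module. The record layout (a
+32-bit id, a 16-bit name length, then the name) and the name `decode_record`
+are invented for this example; the point is only that offsets and lengths
+have to be tracked by hand.

```python
import struct

# Hypothetical record layout, invented for illustration:
#   uint32 id (little-endian), uint16 name length, then the name bytes.
def decode_record(data, offset=0):
    """Decode one record and return (fields, offset of the next record)."""
    rec_id, name_len = struct.unpack_from("<IH", data, offset)
    offset += struct.calcsize("<IH")  # offsets are advanced by hand
    name = data[offset:offset + name_len].decode("ascii")
    offset += name_len                # lengths are applied by hand too
    return {"id": rec_id, "name": name}, offset


raw = bytes.fromhex("07000000" "0200" "6869")
record, next_offset = decode_record(raw)
print(record, next_offset)
```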
+The Bithenge project involves creating a powerful library and tools that will
+make working with structured binary data faster and easier. It will consist of:
+* A core library that manages structured data and provides basic building
+  blocks for binary data interpretation.
+* Data providers to access various sources of raw binary data.
+* Format providers, which can load and save complex format specifications. In
+  particular, there will be a domain‐specific language for format
+  specifications.
+* Clients, programs which use the library to work with binary data. For
+  instance, there will be an interactive browser.
+The initial goals for the project are an interactive browser for filesystem
+structures as well as a debugger backend that can interpret core dumps and task
+memory.
+== Requirements ==
+* Work in HelenOS—this means the code must be in C and/or an easily ported
+  language like Lua.
+* View on different layers. For instance, when viewing a FAT directory entry,
+  it should be possible to switch between viewing the formatted date and time,
+  the integers, and the original bytes.
+* Check whether data is valid; handle broken data reasonably well.
+* Parse pieces of the data lazily; don’t try to read everything at once.
+* Work in both directions (parsing and building) without requiring too much
+  extra effort when specifying the format.
+* Support full modifications. Ideally, allow creation of a whole filesystem
+  from scratch.
+== Trees ==
+Bithenge represents all data in the form of a data structure called a “tree,”
+similar to the data structure represented by JSON. A tree consists of a boolean
+node, integer node, string node, or blob node, or an internal node with
+children. A boolean node holds a boolean value, an integer node holds a signed
+integer, and a string node holds a Unicode string.
+A blob node represents an arbitrary sequence of raw bytes or bits. Blob nodes
+are polymorphic, allowing any source of raw binary data to be used. Bithenge
+includes blob node implementations for in‐memory buffers, files, and block
+devices. An implementation has also been written that reads another task’s
+virtual memory, but it hasn’t been committed because it’s unclear whether it
+will be useful in its current form.
+An internal node has an arbitrary number of children, each with a unique key.
+The key can be any node other than an internal node. Arrays can be represented
+by internal nodes with integer keys starting at 0. The tree node can provide
+children in an arbitrary order; the order will be used when displaying the
+tree, but should have no semantic significance. Internal nodes are polymorphic
+and can delay creation of child nodes until necessary, so keeping the whole
+tree in memory can be avoided.
+Internal nodes are currently responsible for freeing their own children. In the
+future, it should be possible for there to be multiple references to the same
+node, but it isn’t clear whether this should be implemented with symbolic
+links, an acyclic graph with reference counting, or a full graph.
+Note that all interpreted data is represented in Bithenge with nodes.
+Therefore, the word “blob” usually refers to a blob node, and so on.
+A decoded tree for a FAT filesystem might look something like this:
+{{{
+○───bits─▶16
+│
+├───fat──▶○
+│         ├───0───▶0xfff0
+│         ├───1───▶0xffff
+│         └───2───▶0x0000
+│
+└───root──▶○
+           ├───0───▶○
+           │        ├───name───▶README.TXT
+           │        └───size───▶0x1351
+           │
+           └───1───▶○
+                    ├───name───▶KERNEL.ELF
+                    └───size───▶0x38e9a2
+}}}
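+The lazy creation of child nodes described above can be sketched in Python as
+follows. This is only a model under stated assumptions: the class
+`InternalNode` and the callback `make_fat_entry` are hypothetical names, and
+leaf nodes are represented by plain Python values rather than node objects.

```python
class InternalNode:
    """Internal node whose children are created on first access."""

    def __init__(self, keys, make_child):
        self._keys = list(keys)        # order is for display only
        self._make_child = make_child  # callback: key -> child node
        self._cache = {}

    def keys(self):
        return list(self._keys)

    def get(self, key):
        if key not in self._keys:
            raise KeyError(key)
        if key not in self._cache:
            self._cache[key] = self._make_child(key)
        return self._cache[key]


created = []  # records which FAT entries were actually built


def make_fat_entry(index):
    created.append(index)
    return [0xFFF0, 0xFFFF, 0x0000][index]


fat = InternalNode(range(3), make_fat_entry)
tree = InternalNode(["bits", "fat"], {"bits": 16, "fat": fat}.__getitem__)

print(hex(tree.get("fat").get(1)))  # only entry 1 of the FAT is created
print(created)
```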
+== Transforms ==
+A transform is a function from a tree to a tree. One example is `uint32le`,
+which takes a 4‐byte blob node as the input tree and provides an integer node
+as the output tree. Another example would be `FAT16_filesystem`, a transform
+that takes a byte blob node as the input tree and provides a complex output
+tree with various decoded information about the filesystem. Some transforms,
+like `uint32le`, are built in to Bithenge; more complicated transforms can be
+loaded from a script file.
+Transforms are represented in Bithenge with a polymorphic object. The primary
+method is `apply`, which applies a transform to an input tree and creates an
+output tree. When a transform takes a blob node as input, it is sometimes
+necessary to determine the prefix of a given blob that can be used as input to
+the transform; the method `prefix_length` can be used for this.
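+A minimal Python sketch of such an object is shown below. The method names
+`apply` and `prefix_length` come from the text above; the class name
+`FixedUintLE` and the use of `bytes` for blob nodes are assumptions made for
+illustration, not Bithenge's C interface.

```python
class FixedUintLE:
    """Transform from a fixed-size byte blob to an unsigned integer."""

    def __init__(self, size):
        self.size = size

    def prefix_length(self, blob):
        # Number of leading bytes of `blob` this transform would consume.
        if len(blob) < self.size:
            raise ValueError("blob too short")
        return self.size

    def apply(self, blob):
        if len(blob) != self.size:
            raise ValueError("expected exactly %d bytes" % self.size)
        return int.from_bytes(blob, "little")


uint32le = FixedUintLE(4)
data = bytes.fromhex("0101000099")

n = uint32le.prefix_length(data)   # how much of the blob to hand over
value = uint32le.apply(data[:n])   # decode just that prefix
rest = data[n:]                    # remaining bytes for later transforms
print(value, rest.hex())
```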
+== Built‐in transforms ==
+These transforms are implemented in C and included with Bithenge. Note that
+precise names are preferred; scripts can define shorter aliases if necessary.
+||= name =||= input =||= output =||= description =||= example =||
+||ascii             ||byte blob node   ||string        ||decodes some bytes as ASCII characters ||  `hex:6869` becomes `"hi"` ||
+||bit               ||1‐bit blob node  ||boolean       ||decodes a single bit || `1` becomes `true` ||
+||bits_be           ||byte blob node   ||bit blob node ||decodes bytes as bits, starting with the most‐significant bit || `hex:0f` becomes `bit:00001111` ||
+||bits_le           ||byte blob node   ||bit blob node ||decodes bytes as bits, starting with the least‐significant bit || `hex:0f` becomes `bit:11110000` ||
+||known_length(len) ||blob node        ||blob node     ||requires the input to have a known length || ||
+||nonzero_boolean   ||integer          ||boolean       ||decodes a boolean where nonzero values are true || `0` becomes `false` ||
+||uint8             ||1‐byte blob node ||integer node  ||decodes a 1‐byte unsigned integer ||  `hex:11` becomes `17` ||
+||uint16be          ||2‐byte blob node ||integer node  ||decodes a 2‐byte big‐endian unsigned integer ||  `hex:0201` becomes `513` ||
+||uint16le          ||2‐byte blob node ||integer node  ||decodes a 2‐byte little‐endian unsigned integer ||  `hex:0101` becomes `257` ||
+||uint32be          ||4‐byte blob node ||integer node  ||decodes a 4‐byte big‐endian unsigned integer ||  `hex:00000201` becomes `513` ||
+||uint32le          ||4‐byte blob node ||integer node  ||decodes a 4‐byte little‐endian unsigned integer ||  `hex:01010000` becomes `257` ||
+||uint64be          ||8‐byte blob node ||integer node  ||decodes an 8‐byte big‐endian unsigned integer ||  `hex:0000000000000201` becomes `513` ||
+||uint64le          ||8‐byte blob node ||integer node  ||decodes an 8‐byte little‐endian unsigned integer ||  `hex:0101000000000000` becomes `257` ||
+||uint_be(len)      ||bit blob node    ||integer node  ||decodes bits as an unsigned integer, starting with the most‐significant bit || ||
+||uint_le(len)      ||bit blob node    ||integer node  ||decodes bits as an unsigned integer, starting with the least‐significant bit || ||
+||zero_terminated   ||byte blob node   ||byte blob node||takes bytes up until the first `00` ||  `hex:7f0400` becomes `hex:7f04` ||
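+The examples in the table can be cross-checked with plain Python equivalents.
+The function names below are illustrative stand-ins rather than the C
+implementations, byte blobs are modeled as `bytes`, and bit blobs are modeled
+as strings of `0` and `1`.

```python
def ascii_(blob):
    return blob.decode("ascii")


def bits_be(blob):
    # Most-significant bit of each byte comes first.
    return "".join(format(byte, "08b") for byte in blob)


def bits_le(blob):
    # Least-significant bit of each byte comes first.
    return "".join(format(byte, "08b")[::-1] for byte in blob)


def uint_bytes_be(blob):
    return int.from_bytes(blob, "big")


def uint_bytes_le(blob):
    return int.from_bytes(blob, "little")


def zero_terminated(blob):
    # Everything before the first 0x00 byte.
    return blob[:blob.index(0)]


h = bytes.fromhex
print(ascii_(h("6869")))
print(bits_be(h("0f")), bits_le(h("0f")))
print(uint_bytes_be(h("0201")), uint_bytes_le(h("0101")))
print(zero_terminated(h("7f0400")).hex())
```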
+== Basic syntax ==
+Script files are used to define complicated transforms.
+Transforms (including built‐in transforms) can be referenced by name:
+`uint32le`.
+Transforms can be given a new name: `transform u32 = uint32le;` defines a
+shorter alias for `uint32le`.
+Transforms can be composed to create a new transform that applies them in
+order. The transform `ascii <- zero_terminated` first takes the bytes before the
+first 0x00, then decodes them as ASCII. Note that the order of composition
+is consistent with function composition and nested application in mathematics,
+and also consistent with the general idea that data moves from right to left as
+it is decoded.
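+The right-to-left order can be modeled with ordinary function composition. In
+this sketch, `compose` is a hypothetical helper and the two transforms are
+simplified Python stand-ins for the built-ins of the same name.

```python
from functools import reduce


def compose(*transforms):
    """compose(f, g)(x) == f(g(x)), mirroring the script syntax `f <- g`."""
    def composed(tree):
        # Apply the rightmost transform first, as data moves right to left.
        return reduce(lambda acc, t: t(acc), reversed(transforms), tree)
    return composed


def zero_terminated(blob):
    return blob[:blob.index(0)]


def ascii_(blob):
    return blob.decode("ascii")


# Python model of the script expression `ascii <- zero_terminated`.
ascii_string = compose(ascii_, zero_terminated)
print(ascii_string(bytes.fromhex("686900")))
```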
+== Expressions ==
+Transforms can have parameters that affect how they decode data:
+{{{
+transform u16(little_endian) =
+    if (little_endian) {
+        uint16le
+    } else {
+        uint16be
+    };
+}}}
+When such a transform is used, expressions must be given to calculate its
+parameters. The basic terms used in expressions are parameters given to the
+current transform, boolean and integer literals, and previously decoded fields
+in the current `struct` or an outer `struct`:
+{{{
+transform item(little_endian) = struct {
+    .len <- u16(little_endian);
+    .text <- ascii <- known_length(.len);
+    .data <- known_length(8);
+};
+}}}
+You can also use `+`, `-`, `*`, and parentheses as you would expect.
+Expressions can also be used as transforms themselves, in two ways. First, an
+expression referring to `in` can be used to decode a value, such as
+`.num_bytes <- (in * 4) <- uint32le;`. Second, an expression that doesn’t refer
+to `in` can be used as a transform that takes an empty blob and evaluates the
+expression: for example, `.num_total <- (.num_short + .num_long);` could be
+used in a `struct`. In both cases, the parentheses are mandatory.
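+To make the role of parameters and field references concrete, here is a hedged
+Python model of the `item` transform above and of the `(in * 4)` expression.
+The helper names `decode_item` and `times_four` are hypothetical, and
+`known_length` is modeled by simple slicing.

```python
def u16(little_endian):
    # The parameter selects which built-in decoder is used.
    byteorder = "little" if little_endian else "big"
    return lambda blob: int.from_bytes(blob, byteorder)


def decode_item(blob, little_endian):
    length = u16(little_endian)(blob[0:2])    # .len
    text = blob[2:2 + length].decode("ascii")  # uses the decoded .len
    data = blob[2 + length:2 + length + 8]     # known_length(8)
    return {"len": length, "text": text, "data": data}


def uint32le(blob):
    return int.from_bytes(blob, "little")


def times_four(value):
    # Stands in for the expression transform `(in * 4)`.
    return value * 4


payload = "6869" "0102030405060708"
item_le = decode_item(bytes.fromhex("0200" + payload), True)
item_be = decode_item(bytes.fromhex("0002" + payload), False)
num_bytes = times_four(uint32le(bytes.fromhex("03000000")))
print(item_le, num_bytes)
```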
+== Structs ==
+Structs are used when a blob contains multiple data fields in sequence. A
+struct transform applies each subtransform to sequential parts of the blob and
+combines the results to create an internal node. The result of each
+subtransform is either assigned a key or has its keys and values merged into
+the final tree. Each subtransform must support `prefix_length`, so the lengths
+and positions of the data fields can be determined.
+=== Example ===
+{{{
+transform point = struct {
+    .x <- uint32le;
+    .y <- uint32le;
+};
+transform labeled_point = struct {
+    .id <- uint32le;
+    .label <- ascii <- zero_terminated;
+    <- point;
+};
+}}}
+If `labeled_point` is applied to `hex:0600000041000300000008000000`, the result
+is `{"id": 6, "label": "A", "x": 3, "y": 8}`.
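+The same decoding can be reproduced with a small Python model of `struct`.
+This is a sketch, not Bithenge's implementation: each subtransform here
+returns a `(value, bytes_consumed)` pair in place of separate `apply` and
+`prefix_length` methods, and a key of `None` marks a subtransform whose keys
+are merged into the result.

```python
def uint32le(blob):
    return int.from_bytes(blob[:4], "little"), 4


def ascii_zero_terminated(blob):
    end = blob.index(0)
    return blob[:end].decode("ascii"), end + 1  # the 0x00 is consumed too


def struct(*fields):
    """Build a transform from (key, subtransform) pairs."""
    def transform(blob):
        result, offset = {}, 0
        for key, sub in fields:
            value, used = sub(blob[offset:])
            offset += used
            if key is None:
                result.update(value)   # merge keys, as with `<- point;`
            else:
                result[key] = value
        return result, offset
    return transform


point = struct(("x", uint32le), ("y", uint32le))
labeled_point = struct(
    ("id", uint32le),
    ("label", ascii_zero_terminated),
    (None, point),
)

tree, used = labeled_point(bytes.fromhex("0600000041000300000008000000"))
print(tree, used)
```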
+== Flow Control ==
+Boolean conditions can use
+`if (expression) { transform-if-true } else { transform-if-false }`, for
+example: `if (little_endian) { uint32le } else { uint32be }`. There is also
+syntax for switches: `switch (expression) { expression: transform; ... }`. Both
+of these have syntactic sugar for use in `struct`s; in a `struct`,
+`if (expression) { fields... }` is equivalent to
+`<- if (expression) { struct { fields... } };`.
+Three kinds of repetition are supported. `repeat(expression) {transform}`
+applies the transform a given number of times. `repeat {transform}` applies the
+transform until it fails or the end of the data is reached.
+`do {transform} while(expression)` applies the transform until the expression,
+evaluated within the result of the transform, is false; this can be used for
+things like
+`do { struct { .keep_going <- nonzero_boolean <- uint8; .val <- uint8; } } while(.keep_going)`.
+For each type of repetition, the result is an internal node with sequential
+keys starting at `0`.
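+The three kinds of repetition can be modeled in Python as shown below. As
+before, this is a sketch under assumptions: subtransforms return
+`(value, bytes_consumed)`, failure is signalled with `ValueError`, and the
+helper names `repeat`, `do_while`, and `entry` are chosen for illustration.

```python
def uint8(blob):
    if not blob:
        raise ValueError("end of data")
    return blob[0], 1


def repeat(sub, count=None):
    """count=None: apply until the subtransform fails or data runs out."""
    def transform(blob):
        items, offset = {}, 0
        while count is None or len(items) < count:
            try:
                value, used = sub(blob[offset:])
            except ValueError:
                if count is None:
                    break
                raise
            items[len(items)] = value  # sequential keys starting at 0
            offset += used
        return items, offset
    return transform


def do_while(sub, condition):
    def transform(blob):
        items, offset = {}, 0
        while True:
            value, used = sub(blob[offset:])
            items[len(items)] = value
            offset += used
            if not condition(value):   # evaluated within the result
                return items, offset
    return transform


def entry(blob):
    if len(blob) < 2:
        raise ValueError("end of data")
    return {"keep_going": blob[0] != 0, "val": blob[1]}, 2


data = bytes.fromhex("010a010b000cff")
counted = repeat(uint8, 2)(data)
greedy = repeat(uint8)(bytes.fromhex("010203"))
chained = do_while(entry, lambda e: e["keep_going"])(data)
print(counted, greedy, chained)
```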
+== Using Bithenge ==
+The Bithenge source code is in `uspace/app/bithenge` and is built along with
+HelenOS. It can be built for Linux instead with `make -f Makefile.linux`.
+The program can be run with `bithenge <script file> <source>`. The script file
+must define a transform called `main`. The source can start with one of the
+following prefixes:
+||= Prefix =||= Example =||= Description =||
+|| file: || file:/textdemo || Read the contents of a file. This is the default if no prefix is used. ||
+|| block: || block:bd/initrd || Read the contents of a block device. (HelenOS only.) ||
+|| hex: || hex:01000000 || Use a string of hexadecimal characters to create a blob node. ||
+There are some example files in `uspace/dist/src/bithenge`.
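+The prefix rules in the table above amount to a small dispatch step. The
+following Python sketch shows one way to express them; `parse_source` is a
+hypothetical helper and does not describe how the `bithenge` program parses
+its arguments internally.

```python
def parse_source(source):
    """Split a source argument into (kind, rest); `file` is the default."""
    for kind in ("file", "block", "hex"):
        prefix = kind + ":"
        if source.startswith(prefix):
            return kind, source[len(prefix):]
    return "file", source


def blob_from_hex(rest):
    # A `hex:` source becomes an in-memory blob.
    return bytes.fromhex(rest)


for arg in ("file:/textdemo", "/textdemo", "block:bd/initrd", "hex:01000000"):
    print(arg, "->", parse_source(arg))
```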
+=== Using the API ===
+An overview of the API will be written later.
+Nodes, expressions, and transforms use reference counting. Functions that
+produce such objects (through a `bithenge_xxx_t **` argument) create a new
+reference to the object; you are responsible for ensuring the reference count
+is eventually decremented. If a function’s documentation says it “takes
+[ownership of] a reference” to an object, the function guarantees the object’s
+reference count will eventually be decremented, even if an error occurs.
+Therefore, if you create an object only to immediately pass it to such a
+function, you do not need to worry about its reference count.
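+The ownership convention can be illustrated with a toy model. Python manages
+memory itself, so the explicit counter below only simulates what the C code
+does; the class `Node`, the function `new_parent`, and the `freed` flag are
+all hypothetical.

```python
class Node:
    """Toy reference-counted object; `freed` stands in for deallocation."""

    def __init__(self, children=()):
        self.refs = 1                 # the creator owns one reference
        self.freed = False
        self.children = list(children)

    def inc_ref(self):
        self.refs += 1

    def dec_ref(self):
        self.refs -= 1
        if self.refs == 0:
            self.freed = True
            for child in self.children:
                child.dec_ref()       # release references the node owned


def new_parent(child, fail=False):
    """Takes a reference to `child`, whether or not it succeeds."""
    if fail:
        child.dec_ref()               # the error path still releases it
        raise RuntimeError("simulated failure")
    return Node(children=[child])     # the parent now owns the reference


# Success: the caller never touches the child's count again.
parent = new_parent(Node())
child = parent.children[0]
parent.dec_ref()

# Failure: the child is still released by the callee.
orphan = Node()
try:
    new_parent(orphan, fail=True)
except RuntimeError:
    pass

print(child.freed, orphan.freed)
```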
+== Future language ideas ==
+These ideas are listed in approximate order of priority.
+=== Other ideas ===
+ Subblobs:: When there are pointers to other offsets in the blob, the script
+   could pass the whole blob as a parameter and apply transforms to subblobs.
+   This is essential for non‐sequential blobs like filesystems.
+ Infinite loop detection:: When decoding transforms like
+   `struct { if (.non_existent) { } }`, an infinite loop occurs. This can also
+   happen if the field exists, but in an outer `struct`. An error should be
+   printed instead; Bithenge should not try to look in the `if` when searching
+   for `.non_existent`.
+ Complex expressions:: More operators, and expressions that call transforms.
+ Better error reporting:: errno.h is intended for system call errors; when
+   other errors occur, a more helpful error message should be printed, like
+   "field .foo not found" or "cannot apply nonzero_boolean to blobs".
+ Assertions:: These could be implemented as transforms that don't actually
+   change the input. There could be multiple levels, ranging from “warning” to
+   “fatal error”.
+ Enumerations:: An easier way to handle many constant values, like
+   `enum { 0: "none", 1: "file", 2: "directory", 3: "symlink" }`.
+ Recursive transforms:: Although simple cases are handled by `do...while`, in
+   some cases transforms need to recursively refer to themselves or each other.
+ Merge blobs and internal nodes:: Currently, `struct`, `repeat`, and so on only
+   work with blobs, which must be either byte sequences or bit sequences.
+   Numbered internal nodes (such as those made by `repeat`) should be supported
+   as well.
+ Transforming internal nodes:: After binary data is decoded into a tree, it
+   should be possible to apply further transforms to interpret the data
+   further. For instance, after the FAT and directory entries of a FAT
+   filesystem have been decoded, a further transform could determine the data
+   for each file.
+ More information in repeat subtransforms:: Repeat subtransforms should have
+   access to the current index and previously decoded items.
+ Hidden fields:: Some fields, such as length fields, are no longer interesting
+   after the data is decoded, so they should be hidden by default.
+ Search:: Decoding may require searching for a fixed sequence of bytes in the
+   data.
+ Automatic parameters:: It could be useful to automatically pass some
+   parameters rather than computing and passing them explicitly. For instance,
+   a version number that affects the format of many different parts of the file
+   could be passed automatically, without having to write it out every time. A
+   more advanced automatic parameter could keep track of the current offset being
+   decoded within a blob. There would need to be some sort of grouping to
+   determine which transforms have the automatic parameters.
+ Smarter length calculation:: Bithenge should automatically detect the length
+   of certain composed transforms, such as `repeat(8) {bit} <- bits_le`. This
+   would also be addressed by the constraint‐based version.
+=== Constraint‐based version ===
+This and most other projects use an imperative design, where the format
+specification is always used in a fixed order, one step at a time. The
+imperative design causes problems when the user wants to modify a field,
+because arbitrary changes to other fields may be necessary that cannot be
+determined from the format specification.
+It may be possible to solve this with a constraint-based design, where the
+format specification consists of statements that must be true about the raw and
+interpreted data, and the program figures out how to solve these constraints.
+Unfortunately, this approach seems too open-ended and unpredictable to fit
+within GSoC.
+== Interesting formats ==
+These formats will be interesting and/or difficult to handle. I will keep them
+in mind when designing the library.
+* Filesystem allocation tables, which should be kept consistent with the actual
+  usage of the disk.
+* Filesystem logs, which should be applied to the rest of the disk before
+  interpreting it.
+* Formats where the whole file can have either endianness depending on a field
+  in the header.
+* The [http://www.blender.org/development/architecture/blender-file-format/ Blender file format]
+  is especially dynamic. When Blender saves a file, it just copies the
+  structures from memory and translates the pointers. Since each Blender
+  version and architecture will have different structures, the output file
+  includes a header describing the fields and binary layout of each structure.
+  When the file is loaded, the header is read first and the structures will be
+  translated as necessary.
+* If the language is powerful enough, it might be possible to have a native
+  description of Zlib and other compression formats.
+* It could be interesting to parse ARM or x86 machine code.
+== Existing Tools ==
+I researched existing tools related to my project, so they can be used for
+inspiration.
+=== Construct ===
+[http://construct.wikispaces.com/ Construct] is a Python library for creating
+declarative structure definitions. Each instance of the `Construct` class has a
+name, and knows how to read from a stream, write to a stream, and determine its
+length. Some predefined `Construct` subclasses use an arbitrary Python function
+evaluated at runtime, or behave differently depending on whether
+sub‐`Construct`s throw exceptions. `Const` uses a sub‐`Construct` and makes
+sure the value is correct. Construct also has lazy `Construct`s.
+Unfortunately, if you change the size of a structure, you still have to change
+everything else manually.
+=== !BinData ===
+[http://bindata.rubyforge.org/ BinData] makes good use of Ruby syntax; it
+mostly has the same features as Construct.
+=== Imperative DSLs ===
+DSLs in this category are used in an obvious, deterministic manner, and complex
+edits (changing the length of a structure) are difficult or impossible. They
+are simple imperative languages in which fields, structures, bitstructures, and
+arrays can be defined. The length, decoded value, and presence of fields can be
+determined by expressions using any previously decoded field, and structures
+can use `if`/`while`/`continue`/`break` and similar statements. Structures can
+inherit from other structures, meaning that the parent’s fields are present at
+the beginning of the child. Statements can move to a different offset in the
+input data. There may be a real programming language that can be used along
+with the DSL.
+ [http://dwarfstd.org/ DWARF]::
+  Uses a simple stack‐based VM to calculate variable locations.
+ [http://ijdc.net/index.php/ijdc/article/view/207 “Grammar‐based specification and parsing of binary file formats”]::
+  Actually uses an attribute grammar, but it isn’t terribly different from an
+  imperative language.
+ [http://pyffi.sourceforge.net/ PyFFI]::
+  Lets you create or modify files instead of just reading them. Fields can
+  refer to blocks of data elsewhere in the file. Uses an XML format.
+ [http://aluigi.altervista.org/quickbms.htm QuickBMS]::
+  A popular tool for extracting files from video game archives. Its main
+  strength is the large number of compression formats supported. It can put
+  modified files back in the archive in trivial situations.
+ [http://www.synalysis.net/ Synalize It!]::
+  Not completely imperative; if you declare optional structs where part of the
+  data is constant, the correct struct will be displayed. Has a Graphviz export
+  of file structure. Uses an XML format.
+ Other free::
+  [http://www-old.bro-ids.org/wiki/index.php/BinPAC_Userguide BinPAC],
+  [https://metacpan.org/module/Data::ParseBinary Data::ParseBinary],
+  [http://datascript.berlios.de/DataScriptLanguageOverview.html DataScript],
+  [http://www.dataworkshop.de/ DataWorkshop],
+  [http://wsgd.free.fr/ Wireshark Generic Dissector],
+  [http://metafuzz.rubyforge.org/binstruct/ Metafuzz BinStruct], and
+  [http://www.padsproj.org/ PADS].
+ Other proprietary::
+  [http://www.sweetscape.com/010editor/#templates 010 Editor],
+  [http://www.nyangau.org/be/be.htm Andys Binary Folding Editor],
+  [https://www.technologismiki.com/prod.php?id=31 Hackman Suite],
+  [http://www.hhdsoftware.com/doc/hex-editor/language-reference-overview.html Hex Editor Neo],
+  [http://apps.tempel.org/iBored/ iBored], and
+  [https://www.x-ways.net/winhex/templates.html#User_Templates WinHex].
+=== Less interesting tools ===
+ Simple formats in hex editors::
+  These support static fields and dynamic lengths only:
+  [http://www.flexhex.com/ FlexHex],
+  [http://hexedit.com/ HexEdit],
+  [http://sourceforge.net/projects/hexplorer/ Hexplorer],
+  [http://www.hexworkshop.com/ Hex Workshop], and
+  [http://kde.org/applications/utilities/okteta/ Okteta].
+ Simple formats elsewhere::
+  [http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/sys/ctf.h CTF],
+  [http://ff-extractor.sourceforge.net/ ffe],
+  [http://bigeasy.github.com/node-packet/ Node Packet],
+  [https://www.secdev.org/projects/scapy/ Scapy], and
+  [http://sourceware.org/gdb/current/onlinedocs/stabs/ stabs]
+  can only handle trivial structures.
+  [http://www.tecgraf.puc-rio.br/~lhf/ftp/lua/#lpack lpack],
+  [http://perldoc.perl.org/functions/pack.html Perl’s pack],
+  [http://docs.python.org/library/struct.html Python’s struct], and
+  [https://github.com/ToxicFrog/vstruct VStruct]
+  use concise string formats to describe simple structures.
+  [https://bitbucket.org/haypo/hachoir Hachoir]
+  uses Python for most things.
+ Protocol definition formats::
+  [https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One ASN.1],
+  [https://en.wikipedia.org/wiki/Microsoft_Interface_Definition_Language MIDL],
+  [http://piqi.org/ Piqi],
+  and other IPC implementations go in the other direction: they generate a
+  binary format from a text description of a structure. ASN.1 in particular
+  has many features.
+ [https://www.wireshark.org/ Wireshark] and [http://www.tcpdump.org/ tcpdump]::
+  As the Construct wiki notes, you would expect these developers to have some
+  sort of DSL, but they just use C for everything. Wireshark does use ASN.1,
+  Diameter, and MIDL for protocols developed with them.
+== Miscellaneous ideas ==
+=== Code exporter ===
+A tool could generate C code to read and write data given a specification. A
+separate file could be used to specify which types should be used and which
+things should be read lazily or strictly.
+=== Diff ===
+A diff tool could show differences in the interpreted data.
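+One possible shape for such a tool, sketched in Python with decoded trees
+modeled as nested dictionaries. The function `diff_trees` is hypothetical, and
+a missing key is reported as `None`, which this sketch does not distinguish
+from a leaf whose value is actually `None`.

```python
def diff_trees(old, new, path=()):
    """Yield (path, old_value, new_value) for every differing subtree."""
    if isinstance(old, dict) and isinstance(new, dict):
        # Sort by string form so mixed integer and string keys can coexist.
        for key in sorted(set(old) | set(new), key=str):
            yield from diff_trees(old.get(key), new.get(key), path + (key,))
    elif old != new:
        yield path, old, new


before = {"root": {0: {"name": "README.TXT", "size": 0x1351}}}
after = {"root": {0: {"name": "README.TXT", "size": 0x1400},
                  1: {"name": "KERNEL.ELF", "size": 0x38E9A2}}}

changes = list(diff_trees(before, after))
for path, old, new in changes:
    print(path, old, new)
```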
+=== Space‐filling curves ===
+[http://corte.si/posts/visualisation/binvis/index.html Space‐filling curves]
+look cool, but this project is about ''avoiding'' looking at raw binary data.
+Now that the project has a name, everything here has been moved to [[Bithenge]].