wiki:Bithenge/Design

Bithenge

Various aspects of the Bithenge design are discussed here.

Motivation

Exploring and working with structured binary data is necessary in many different situations in a project like HelenOS. For instance, when implementing a file format or filesystem, it is first necessary to explore preexisting files and disks and learn the low‐level details of the format. Debugging compiled programs, working with core dumps, and exploring network protocols also require some way of interpreting binary data.

The most basic tool for exploring binary data is the hex editor. Using a hex editor is inefficient and unpleasant because it requires manual calculation of lengths and offsets while constantly referring back to the data format. General‐purpose scripting languages can be used instead, so a structure can be defined once and decoded as often as necessary. However, even with useful tools like Python’s struct module, the programmer must specify how to read the input data, calculate lengths and offsets, and provide useful output, so there’s much more work involved than simply specifying the format of the data. This extra code will probably be rewritten every time a new script is made, due to slightly differing requirements.

Requirements

  • Work in HelenOS—this means the code must be in C and/or an easily ported language like Lua.
  • View on different layers. For instance, when viewing a FAT directory entry, it should be possible to switch between viewing the formatted date and time, the integers, and the original bytes.
  • Check whether data is valid; handle broken data reasonably well.
  • Parse pieces of the data lazily; don’t try to read everything at once.
  • Work in both directions (parsing and building) without requiring too much extra effort when specifying the format.
  • Support full modifications. Ideally, allow creation of a whole filesystem from scratch.

Rationales

Purity: Pure functions are a natural basis for converting data, and the overall purpose of a script does not involve side effects. With a possible constraint‐based version (see below) in mind, it seemed best to avoid side effects entirely.

Composition: Bithenge’s emphasis on transform composition is intended to help view data on different “layers.” It should be relatively simple to view the result partway through a composition rather than at the end. Note that the order of composition is consistent with function composition and nested application in mathematics, and also consistent with the general idea that data moves from right to left as it is decoded.

Issues

  • Poor error handling in transform application—where and why did the error occur? scope_error should help, but it isn’t widely used and it can’t give the full context.
  • Mixing transforms and expressions, and using "in", are confusing. It may be best to combine them and eliminate the mandatory parentheses.
  • There may be too much emphasis on single‐input, single‐output transforms; it’s too early to tell.
  • The internal handling of parameters and scopes seems awkward.
  • The language isn’t very powerful yet.

Future language ideas

In approximate order of priority.

Infinite loop detection
When decoding transforms like struct { if (.non_existent) { } }, an infinite loop occurs. This can also happen if the field exists, but in an outer struct. An error should be printed instead; Bithenge should not try to look in the if when searching for .non_existent.
Code exporter
A tool could generate C code to read and write data given a specification. A separate file or annotations could be used to specify which types should be used and which things should be read lazily or strictly.
Complex expressions
More operators, and expressions that call transforms.
Better error reporting
errno.h is intended for system call errors; when other errors occur, a more helpful error message should be printed using scope_error, like "field .foo not found" or "cannot apply nonzero_boolean to blobs".
Assertions
These could be implemented as transforms that don't actually change the input. There could be multiple levels, ranging from “warning” to “fatal error”.
Enumerations
An easier way to handle many constant values, like enum { 0: "none", 1: "file", 2: "directory", 3: "symlink" }.
Recursive transforms
Although simple cases are handled by do...while or self‐recursion, in some cases transforms need to recursively refer to each other.
Merge blobs and internal nodes
Currently, struct, repeat, and so on only work with blobs, which must be either byte sequences or bit sequences. Numbered internal nodes (such as those made by repeat) should be supported as well.
Transforming internal nodes
After binary data is decoded into a tree, it should be possible to apply further transforms to interpret the data further. For instance, after the FAT and directory entries of a FAT filesystem have been decoded, a further transform could determine the data for each file.
More information in repeat subtransforms
Repeat subtransforms should have access to the current index and previously decoded items.
Hidden fields
Some fields, such as length fields, are no longer interesting after the data is decoded, so they should be hidden by default.
Search
Decoding may require searching for a fixed sequence of bytes in the data.
Automatic parameters
It could be useful to automatically pass some parameters rather than computing and passing them explicitly. For instance, a version number that affects the format of many different parts of the file could be passed automatically, without having to write it out every time. A more advanced automatic parameter could keep track of current offset being decoded within a blob. There would need to be some sort of grouping to determine which transforms have the automatic parameters.
Smarter length calculation
Bithenge should automatically detect the length of certain composed transforms, such as repeat(8) {bit} <- bits_le. This would also be addressed by the constraint‐based version.
Diff tool
Could show differences in the interpreted data.

Constraint‐based version

This and most other projects use an imperative design, where the format specification is always used in a fixed order, one step at a time. The imperative design causes problems when the user wants to modify a field, because arbitrary changes to other fields may be necessary that cannot be determined from the format specification.

It may be possible to solve this with a constraint-based design, where the format specification consists of statements that must be true about the raw and interpreted data, and the program figures out how to solve these constraints. Unfortunately, this approach seems too open-ended and unpredictable to fit within GSoC.

Existing Tools

I researched existing tools related to my project, so they can be used for inspiration.

Construct

Construct is a Python library for creating declarative structure definitions. Each instance of the Construct class has a name, and knows how to read from a stream, write to a stream, and determine its length. Some predefined Construct subclasses use an arbitrary Python function evaluated at runtime, or behave differently depending on whether sub‐Constructs throw exceptions. Const uses a sub‐Construct and makes sure the value is correct. Also has lazy Constructs.

Unfortunately, if you change the size of a structure, you still have to change everything else manually.

BinData

BinData makes good use of Ruby syntax; it mostly has the same features as Construct.

Imperative DSLs

DSLs in this category are used in an obvious, deterministic manner, and complex edits (changing the length of a structure) are difficult or impossible. They are simple imperative languages in which fields, structures, bitstructures, and arrays can be defined. The length, decoded value, and presence of fields can be determined by expressions using any previously decoded field, and structures can use if/while/continue/break and similar statements. Structures can inherit from other structures, meaning that the parent’s fields are present at the beginning of the child. Statements can move to a different offset in the input data. There may be a real programming language that can be used along with the DSL.

DWARF
Uses a simple stack‐based VM to calculate variable locations.
“Grammar‐based specification and parsing of binary file formats”
Actually uses an attribute grammar, but it isn’t terribly different from an imperative language.
PyFFI
Lets you create or modify files instead of just reading them. Fields can refer to blocks of data elsewhere in the file. Uses an XML format.
QuickBMS
A popular tool for extracting files from video game archives. Its main strength is the broad number of compression formats supported. It can put modified files back in the archive in trivial situations.
Synalize It!
Not completely imperative; if you declare optional structs where part of the data is constant, the correct struct will be displayed. Has a Graphviz export of file structure. Uses an XML format.
Other free
BinPAC, Data::ParseBinary, DataScript, DataWorkshop, Wireshark Generic Dissector, Metafuzz BinStruct, and PADS.
Other proprietary
010 Editor, Andys Binary Folding Editor, Hackman Suite, Hex Editor Neo, iBored, and WinHext.

Less interesting tools

Simple formats in hex editors
These support static fields and dynamic lengths only: FlexHex, HexEdit, Hexplorer, Hex Workshop, and Okteta.
Simple formats elsewhere
CTF, ffe, Node Packet, Scapy, and stabs can only handle trivial structures. lpack, Perl’s pack, Python’s struct, and VStruct use concise string formats to describe simple structures. Hachoir uses Python for most things.
Protocol definition formats
ASN.1, MIDL, Piqi, and other IPC implementations go in the other direction: they generate a binary format from a text description of a structure. ASN.1 in particular has many features.
Wireshark and tcpdump
As the Construct wiki notes, you would expect these developers to have some sort of DSL, but they just use C for everything. Wireshark does use ASN.1, Diameter, and MIDL for protocols developed with them.

Miscellaneous ideas

Space‐filling curves look cool, but this project is about avoiding looking at raw binary data.

Interesting formats

These formats will be interesting and/or difficult to handle. They should be kept in mind when designing the language.

  • Filesystem allocation tables, which should be kept consistent with the actual usage of the disk.
  • Filesystem logs, which should be applied to the rest of the disk before interpreting it.
  • Formats where the whole file can have either endianness depending on a field in the header.
  • The Blender file format is especially dynamic. When Blender saves a file, it just copies the structures from memory and translates the pointers. Since each Blender version and architecture will have different structures, the output file includes a header describing the fields and binary layout of each structure. When the file is loaded, the header is read first and the structures will be translated as necessary.
  • If the language is powerful enough, it might be possible to have a native description of Zlib and other compression formats.
  • It could be interesting to parse ARM or x86 machine code.
Last modified 12 years ago Last modified on 2012-08-24T06:41:27Z
Note: See TracWiki for help on using the wiki.