Changes between Initial Version and Version 1 of Bithenge/Design


Ignore:
Timestamp:
2012-08-24T06:41:27Z (12 years ago)
Author:
Sean Bartell
Comment:

Legend:

Unmodified
Added
Removed
Modified
  • Bithenge/Design

    v1 v1  
     1= Bithenge =
     2
     3[[PageOutline(2-3)]]
     4
     5Various aspects of the [[Bithenge]] design are discussed here.
     6
     7== Motivation ==
     8
     9Exploring and working with structured binary data is necessary in many
     10different situations in a project like HelenOS. For instance, when implementing
     11a file format or filesystem, it is first necessary to explore preexisting files
     12and disks and learn the low‐level details of the format. Debugging compiled
     13programs, working with core dumps, and exploring network protocols also require
     14some way of interpreting binary data.
     15
     16The most basic tool for exploring binary data is the hex editor. Using a hex
     17editor is inefficient and unpleasant because it requires manual calculation of
     18lengths and offsets while constantly referring back to the data format.
     19General‐purpose scripting languages can be used instead, so a structure can be
     20defined once and decoded as often as necessary. However, even with useful tools
     21like Python’s struct module, the programmer must specify how to read the input
     22data, calculate lengths and offsets, and provide useful output, so there’s much
     23more work involved than simply specifying the format of the data. This extra
     24code will probably be rewritten every time a new script is made, due to
     25slightly differing requirements.
     26
     27== Requirements ==
     28
     29* Work in HelenOS—this means the code must be in C and/or an easily ported
     30  language like Lua.
     31* View on different layers. For instance, when viewing a FAT directory entry,
     32  it should be possible to switch between viewing the formatted date and time,
     33  the integers, and the original bytes.
     34* Check whether data is valid; handle broken data reasonably well.
     35* Parse pieces of the data lazily; don’t try to read everything at once.
     36* Work in both directions (parsing and building) without requiring too much
     37  extra effort when specifying the format.
     38* Support full modifications. Ideally, allow creation of a whole filesystem
     39  from scratch.
     40
     41== Rationales ==
     42
     43'''Purity''': Pure functions are a natural basis for converting data, and the
     44overall purpose of a script does not involve side effects. With a possible
     45constraint‐based version (see below) in mind, it seemed best to avoid side
     46effects entirely.
     47
     48'''Composition''': Bithenge’s emphasis on transform composition is intended to
     49help view data on different “layers.” It should be relatively simple to view
     50the result partway through a composition rather than at the end. Note that the
     51order of composition is consistent with function composition and nested
     52application in mathematics, and also consistent with the general idea that data
     53moves from right to left as it is decoded.
     54
     55== Issues ==
     56
     57- Poor error handling in transform application—where and why did the error
     58  occur? `scope_error` should help, but it isn’t widely used and it can’t give
     59  the full context.
     60- Mixing transforms and expressions, and using "`in`", are confusing. It may be
     61  best to combine them and eliminate the mandatory parentheses.
     62- There may be too much emphasis on single‐input, single‐output transforms;
     63  it’s too early to tell.
     64- The internal handling of parameters and scopes seems awkward.
     65- The language isn’t very powerful yet.
     66
     67== Future language ideas ==
     68
     69In approximate order of priority.
     70
     71 Infinite loop detection:: When decoding transforms like
     72   `struct { if (.non_existent) { } }`, an infinite loop occurs. This can also
     73   happen if the field exists, but in an outer `struct`. An error should be
     74   printed instead; Bithenge should not try to look in the `if` when searching
     75   for `.non_existent`.
     76 Code exporter:: A tool could generate C code to read and write data given a
     77   specification. A separate file or annotations could be used to specify which
     78   types should be used and which things should be read lazily or strictly.
     79 Complex expressions:: More operators, and expressions that call transforms.
     80 Better error reporting:: errno.h is intended for system call errors; when
     81   other errors occur, a more helpful error message should be printed using
     82   `scope_error`, like "field .foo not found" or "cannot apply nonzero_boolean
     83   to blobs".
     84 Assertions:: These could be implemented as transforms that don't actually
     85   change the input. There could be multiple levels, ranging from “warning” to
     86   “fatal error”.
     87 Enumerations:: An easier way to handle many constant values, like
     88   `enum { 0: "none", 1: "file", 2: "directory", 3: "symlink" }`.
     89 Recursive transforms:: Although simple cases are handled by `do...while` or
     90   self‐recursion, in some cases transforms need to recursively refer to each
     91   other.
     92 Merge blobs and internal nodes:: Currently, `struct`, `repeat`, and so on only
     93   work with blobs, which must be either byte sequences or bit sequences.
     94   Numbered internal nodes (such as those made by `repeat`) should be supported
     95   as well.
     96 Transforming internal nodes:: After binary data is decoded into a tree, it
     97   should be possible to apply further transforms to interpret the data
     98   further. For instance, after the FAT and directory entries of a FAT
     99   filesystem have been decoded, a further transform could determine the data
     100   for each file.
     101 More information in repeat subtransforms:: Repeat subtransforms should have
     102   access to the current index and previously decoded items.
     103 Hidden fields:: Some fields, such as length fields, are no longer interesting
     104   after the data is decoded, so they should be hidden by default.
     105 Search:: Decoding may require searching for a fixed sequence of bytes in the
     106   data.
     107 Automatic parameters:: It could be useful to automatically pass some
     108   parameters rather than computing and passing them explicitly. For instance,
     109   a version number that affects the format of many different parts of the file
     110   could be passed automatically, without having to write it out every time. A
     111   more advanced automatic parameter could keep track of current offset being
     112   decoded within a blob. There would need to be some sort of grouping to
     113   determine which transforms have the automatic parameters.
     114 Smarter length calculation:: Bithenge should automatically detect the length
     115   of certain composed transforms, such as `repeat(8) {bit} <- bits_le`. This
     116   would also be addressed by the constraint‐based version.
     117 Diff tool:: Could show differences in the interpreted data.
     118
     119=== Constraint‐based version ===
     120
     121This and most other projects use an imperative design, where the format
     122specification is always used in a fixed order, one step at a time. The
     123imperative design causes problems when the user wants to modify a field,
     124because arbitrary changes to other fields may be necessary that cannot be
     125determined from the format specification.
     126
     127It may be possible to solve this with a constraint-based design, where the
     128format specification consists of statements that must be true about the raw and
     129interpreted data, and the program figures out how to solve these constraints.
     130Unfortunately, this approach seems too open-ended and unpredictable to fit
     131within GSoC.
     132
     133== Existing Tools ==
     134
     135I researched existing tools related to my project, so they can be used for
     136inspiration.
     137
     138=== Construct ===
     139
     140[http://construct.wikispaces.com/ Construct] is a Python library for creating
     141declarative structure definitions. Each instance of the `Construct` class has a
     142name, and knows how to read from a stream, write to a stream, and determine its
     143length. Some predefined `Construct` subclasses use an arbitrary Python function
     144evaluated at runtime, or behave differently depending on whether
     145sub‐`Construct`s throw exceptions. `Const` uses a sub‐`Construct` and makes
     146sure the value is correct. Also has lazy `Construct`s.
     147
     148Unfortunately, if you change the size of a structure, you still have to change
     149everything else manually.
     150
     151=== !BinData ===
     152
     153[http://bindata.rubyforge.org/ BinData] makes good use of Ruby syntax; it
     154mostly has the same features as Construct.
     155
     156=== Imperative DSLs ===
     157
     158DSLs in this category are used in an obvious, deterministic manner, and complex
     159edits (changing the length of a structure) are difficult or impossible. They
     160are simple imperative languages in which fields, structures, bitstructures, and
     161arrays can be defined. The length, decoded value, and presence of fields can be
     162determined by expressions using any previously decoded field, and structures
     163can use `if`/`while`/`continue`/`break` and similar statements. Structures can
     164inherit from other structures, meaning that the parent’s fields are present at
     165the beginning of the child. Statements can move to a different offset in the
     166input data. There may be a real programming language that can be used along
     167with the DSL.
     168
     169 [http://dwarfstd.org/ DWARF]::
     170  Uses a simple stack‐based VM to calculate variable locations.
     171 [http://ijdc.net/index.php/ijdc/article/view/207 “Grammar‐based specification and parsing of binary file formats”]::
     172  Actually uses an attribute grammar, but it isn’t terribly different from an
     173  imperative language.
     174 [http://pyffi.sourceforge.net/ PyFFI]::
     175  Lets you create or modify files instead of just reading them. Fields can
     176  refer to blocks of data elsewhere in the file. Uses an XML format.
     177 [http://aluigi.altervista.org/quickbms.htm QuickBMS]::
     178  A popular tool for extracting files from video game archives. Its main
     179  strength is the broad number of compression formats supported. It can put
     180  modified files back in the archive in trivial situations.
     181 [http://www.synalysis.net/ Synalize It!]::
     182  Not completely imperative; if you declare optional structs where part of the
     183  data is constant, the correct struct will be displayed. Has a Graphviz export
     184  of file structure. Uses an XML format.
     185 Other free::
     186  [http://www-old.bro-ids.org/wiki/index.php/BinPAC_Userguide BinPAC],
     187  [https://metacpan.org/module/Data::ParseBinary Data::ParseBinary],
     188  [http://datascript.berlios.de/DataScriptLanguageOverview.html DataScript],
     189  [http://www.dataworkshop.de/ DataWorkshop],
     190  [http://wsgd.free.fr/ Wireshark Generic Dissector],
     191  [http://metafuzz.rubyforge.org/binstruct/ Metafuzz BinStruct], and
     192  [http://www.padsproj.org/ PADS].
     193 Other proprietary::
     194  [http://www.sweetscape.com/010editor/#templates 010 Editor],
     195  [http://www.nyangau.org/be/be.htm Andys Binary Folding Editor],
     196  [https://www.technologismiki.com/prod.php?id=31 Hackman Suite],
     197  [http://www.hhdsoftware.com/doc/hex-editor/language-reference-overview.html Hex Editor Neo],
     198  [http://apps.tempel.org/iBored/ iBored], and
     199  [https://www.x-ways.net/winhex/templates.html#User_Templates WinHext].
     200
     201=== Less interesting tools ===
     202
     203 Simple formats in hex editors::
     204  These support static fields and dynamic lengths only:
     205  [http://www.flexhex.com/ FlexHex],
     206  [http://hexedit.com/ HexEdit],
     207  [http://sourceforge.net/projects/hexplorer/ Hexplorer],
     208  [http://www.hexworkshop.com/ Hex Workshop], and
     209  [http://kde.org/applications/utilities/okteta/ Okteta].
     210 Simple formats elsewhere::
     211  [http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/sys/ctf.h CTF],
     212  [http://ff-extractor.sourceforge.net/ ffe],
     213  [http://bigeasy.github.com/node-packet/ Node Packet],
     214  [https://www.secdev.org/projects/scapy/ Scapy], and
     215  [http://sourceware.org/gdb/current/onlinedocs/stabs/ stabs]
     216  can only handle trivial structures.
     217  [http://www.tecgraf.puc-rio.br/~lhf/ftp/lua/#lpack lpack],
     218  [http://perldoc.perl.org/functions/pack.html Perl’s pack],
     219  [http://docs.python.org/library/struct.html Python’s struct], and
     220  [https://github.com/ToxicFrog/vstruct VStruct]
     221  use concise string formats to describe simple structures.
     222  [https://bitbucket.org/haypo/hachoir Hachoir]
     223  uses Python for most things.
     224 Protocol definition formats::
     225  [https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One ASN.1],
     226  [https://en.wikipedia.org/wiki/Microsoft_Interface_Definition_Language MIDL],
     227  [http://piqi.org/ Piqi],
     228  and other IPC implementations go in the other direction: they generate a
     229  binary format from a text description of a structure. ASN.1 in particular
     230  has many features.
     231 [https://www.wireshark.org/ Wireshark] and [http://www.tcpdump.org/ tcpdump]::
     232  As the Construct wiki notes, you would expect these developers to have some
     233  sort of DSL, but they just use C for everything. Wireshark does use ASN.1,
     234  Diameter, and MIDL for protocols developed with them.
     235
     236== Miscellaneous ideas ==
     237
     238[http://corte.si/posts/visualisation/binvis/index.html Space‐filling curves]
     239look cool, but this project is about ''avoiding'' looking at raw binary data.
     240
     241== Interesting formats ==
     242
     243These formats will be interesting and/or difficult to handle. They should be
     244kept in mind when designing the language.
     245
     246* Filesystem allocation tables, which should be kept consistent with the actual
     247  usage of the disk.
     248* Filesystem logs, which should be applied to the rest of the disk before
     249  interpreting it.
     250* Formats where the whole file can have either endianness depending on a field
     251  in the header.
     252* The [http://www.blender.org/development/architecture/blender-file-format/ Blender file format]
     253  is especially dynamic. When Blender saves a file, it just copies the
     254  structures from memory and translates the pointers. Since each Blender
     255  version and architecture will have different structures, the output file
     256  includes a header describing the fields and binary layout of each structure.
     257  When the file is loaded, the header is read first and the structures will be
     258  translated as necessary.
     259* If the language is powerful enough, it might be possible to have a native
     260  description of Zlib and other compression formats.
     261* It could be interesting to parse ARM or x86 machine code.