= Structured Binary Data =

[[PageOutline(2-3)]]

This page will document my thoughts and design ideas for the structured binary data project. The project aims to address #317; a description of my overall approach can be found on the [https://www.google-melange.com/gsoc/project/google/gsoc2012/wtachi/46005 GSoC project page].

== Requirements ==

* Work in HelenOS; this means the code must be in C and/or an easily ported language like Lua.
* View on different layers; for instance, switch between viewing the formatted date and time for a FAT directory entry, the integers, and the original bytes.
* Check whether data is valid; handle broken data reasonably well.
* Parse pieces of the data lazily; don’t try to read everything at once.
* Work in both directions (parsing and building) without requiring too much extra effort.
* Support full modifications. Ideally, allow creation of a whole filesystem from scratch.

== Interesting formats ==

These formats will be interesting and/or difficult to handle. I will keep them in mind when designing the library.

* Filesystem allocation tables, which should be kept consistent with the actual usage of the disk.
* Filesystem logs, which should be applied to the rest of the disk before interpreting it.
* Formats where the whole file can have either endianness depending on a field in the header.
* The [http://www.blender.org/development/architecture/blender-file-format/ Blender file format] is especially dynamic. When Blender saves a file, it just copies the structures from memory and translates the pointers. Since each Blender version and architecture will have different structures, the output file includes a header describing the fields and binary layout of each structure. When the file is loaded, the header is read first and the structures will be translated as necessary.
* If the language is powerful enough, it might be possible to have a native description of Zlib and other compression formats.
* It could be interesting to parse ARM or x86 machine code.

== Existing Tools ==

I am researching existing tools related to my project, so they can be used for inspiration.

=== [http://construct.wikispaces.com/ Construct] ===

A Python library for creating declarative structure definitions. Each instance of the `Construct` class has a name, and knows how to read from a stream, write to a stream, and determine its length. Some predefined `Construct` subclasses use an arbitrary Python function evaluated at runtime, or behave differently depending on whether sub‐`Construct`s throw exceptions. `Const` uses a sub‐`Construct` and makes sure the value is correct. Also has lazy `Construct`s. Unfortunately, if you change the size of a structure, you still have to change everything else manually.

=== [http://bindata.rubyforge.org/ BinData] ===

Makes good use of Ruby syntax; mostly has the same features as Construct.

=== Imperative DSLs ===

DSLs in this category are used in an obvious, deterministic manner, and complex edits (changing the length of a structure) are difficult or impossible. They are simple imperative languages in which fields, structures, bitstructures, and arrays can be defined. The length, decoded value, and presence of fields can be determined by expressions using any previously decoded field, and structures can use `if`/`while`/`continue`/`break` and similar statements. Structures can inherit from other structures, meaning that the parent’s fields are present at the beginning of the child. Statements can move to a different offset in the input data. There may be a real programming language that can be used along with the DSL.

 [http://dwarfstd.org/ DWARF]:: Uses a simple stack‐based VM to calculate variable locations.
 [http://ijdc.net/index.php/ijdc/article/view/207 “Grammar‐based specification and parsing of binary file formats”]:: Actually uses an attribute grammar, but it isn’t terribly different from an imperative language.
 [http://pyffi.sourceforge.net/ PyFFI]:: Lets you create or modify files instead of just reading them. Fields can refer to blocks of data elsewhere in the file. Uses an XML format.
 [http://aluigi.altervista.org/quickbms.htm QuickBMS]:: A popular tool for extracting files from video game archives. Its main strength is the large number of compression formats it supports. It can put modified files back in the archive in trivial situations.
 [http://www.synalysis.net/ Synalyze It!]:: Not completely imperative; if you declare optional structs where part of the data is constant, the correct struct will be displayed. Has a Graphviz export of file structure. Uses an XML format.
 Other free:: [http://www-old.bro-ids.org/wiki/index.php/BinPAC_Userguide BinPAC], [https://metacpan.org/module/Data::ParseBinary Data::ParseBinary], [http://datascript.berlios.de/DataScriptLanguageOverview.html DataScript], [http://www.dataworkshop.de/ DataWorkshop], [http://wsgd.free.fr/ Wireshark Generic Dissector], [http://metafuzz.rubyforge.org/binstruct/ Metafuzz BinStruct], and [http://www.padsproj.org/ PADS].
 Other proprietary:: [http://www.sweetscape.com/010editor/#templates 010 Editor], [http://www.nyangau.org/be/be.htm Andys Binary Folding Editor], [https://www.technologismiki.com/prod.php?id=31 Hackman Suite], [http://www.hhdsoftware.com/doc/hex-editor/language-reference-overview.html Hex Editor Neo], [http://apps.tempel.org/iBored/ iBored], and [https://www.x-ways.net/winhex/templates.html#User_Templates WinHex].

=== Less interesting tools ===

 Simple formats in hex editors:: These support static fields and dynamic lengths only: [http://www.flexhex.com/ FlexHex], [http://hexedit.com/ HexEdit], [http://sourceforge.net/projects/hexplorer/ Hexplorer], [http://www.hexworkshop.com/ Hex Workshop], and [http://kde.org/applications/utilities/okteta/ Okteta].
 Simple formats elsewhere:: [http://src.opensolaris.org/source/xref/onnv/onnv-gate/usr/src/uts/common/sys/ctf.h CTF], [http://ff-extractor.sourceforge.net/ ffe], [http://bigeasy.github.com/node-packet/ Node Packet], [https://www.secdev.org/projects/scapy/ Scapy], and [http://sourceware.org/gdb/current/onlinedocs/stabs/ stabs] can only handle trivial structures. [http://www.tecgraf.puc-rio.br/~lhf/ftp/lua/#lpack lpack], [http://perldoc.perl.org/functions/pack.html Perl’s pack], [http://docs.python.org/library/struct.html Python’s struct], and [https://github.com/ToxicFrog/vstruct VStruct] use concise string formats to describe simple structures. [https://bitbucket.org/haypo/hachoir Hachoir] uses Python for most things.
 Protocol definition formats:: [https://en.wikipedia.org/wiki/Abstract_Syntax_Notation_One ASN.1], [https://en.wikipedia.org/wiki/Microsoft_Interface_Definition_Language MIDL], [http://piqi.org/ Piqi], and other IPC implementations go in the other direction: they generate a binary format from a text description of a structure. ASN.1 in particular has many features.
 [https://www.wireshark.org/ Wireshark] and [http://www.tcpdump.org/ tcpdump]:: As the Construct wiki notes, you would expect these developers to have some sort of DSL, but they just use C for everything. Wireshark does use ASN.1, Diameter, and MIDL for protocols developed with them.

== Miscellaneous ideas ==

=== Code exporter ===

A tool could generate C code to read and write data given a specification. A separate file could be used to specify which types should be used and which things should be read lazily or strictly.

=== Diff ===

A diff tool could show differences in the interpreted data.

=== Space‐filling curves ===

[http://corte.si/posts/visualisation/binvis/index.html Space‐filling curves] look cool, but this project is about ''avoiding'' looking at raw binary data.
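
=== Layered view sketch ===

To make the layered-view requirement concrete, here is a minimal hand-written sketch in C of the three layers mentioned under Requirements for a FAT directory entry timestamp: the original bytes, the integers, and the decoded date and time. It is not part of any existing library or of the project's design. Every type and function name (`fat_stamp_raw_t`, `fat_stamp_to_ints`, and so on) is hypothetical and chosen only for illustration. The bit layout follows the usual FAT convention: little-endian 16-bit fields, years counted from 1980, seconds stored in 2-second units.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Layer 1: the original bytes as stored on disk (little-endian). */
typedef struct {
	uint8_t time[2];
	uint8_t date[2];
} fat_stamp_raw_t;

/* Layer 2: the same data as two 16-bit integers. */
typedef struct {
	uint16_t time;
	uint16_t date;
} fat_stamp_ints_t;

/* Layer 3: the decoded calendar fields. */
typedef struct {
	unsigned year, month, day;
	unsigned hour, minute, second;
} fat_stamp_fields_t;

/* Parsing: bytes -> integers. */
fat_stamp_ints_t fat_stamp_to_ints(const fat_stamp_raw_t *raw)
{
	fat_stamp_ints_t ints;
	ints.time = (uint16_t) (raw->time[0] | (raw->time[1] << 8));
	ints.date = (uint16_t) (raw->date[0] | (raw->date[1] << 8));
	return ints;
}

/* Parsing: integers -> fields (seconds have 2-second resolution). */
fat_stamp_fields_t fat_stamp_to_fields(fat_stamp_ints_t ints)
{
	fat_stamp_fields_t f;
	f.year = 1980u + (ints.date >> 9);
	f.month = (ints.date >> 5) & 0x0fu;
	f.day = ints.date & 0x1fu;
	f.hour = ints.time >> 11;
	f.minute = (ints.time >> 5) & 0x3fu;
	f.second = (ints.time & 0x1fu) * 2u;
	return f;
}

/* Building: fields -> integers. Assumes the fields passed the range check. */
fat_stamp_ints_t fat_stamp_from_fields(const fat_stamp_fields_t *f)
{
	fat_stamp_ints_t ints;
	ints.date = (uint16_t) (((f->year - 1980u) << 9) | (f->month << 5) | f->day);
	ints.time = (uint16_t) ((f->hour << 11) | (f->minute << 5) | (f->second / 2u));
	return ints;
}

/* Building: integers -> bytes. */
fat_stamp_raw_t fat_stamp_from_ints(fat_stamp_ints_t ints)
{
	fat_stamp_raw_t raw;
	raw.time[0] = (uint8_t) (ints.time & 0xffu);
	raw.time[1] = (uint8_t) (ints.time >> 8);
	raw.date[0] = (uint8_t) (ints.date & 0xffu);
	raw.date[1] = (uint8_t) (ints.date >> 8);
	return raw;
}

/* Validity: a coarse range check; it does not know the month lengths. */
bool fat_stamp_is_valid(const fat_stamp_fields_t *f)
{
	return f->year >= 1980 && f->year <= 2107 &&
	    f->month >= 1 && f->month <= 12 &&
	    f->day >= 1 && f->day <= 31 &&
	    f->hour < 24 && f->minute < 60 && f->second < 60;
}

/* Formatted view, "YYYY-MM-DD hh:mm:ss". Returns snprintf's result. */
int fat_stamp_format(const fat_stamp_fields_t *f, char *buf, size_t size)
{
	return snprintf(buf, size, "%04u-%02u-%02u %02u:%02u:%02u",
	    f->year, f->month, f->day, f->hour, f->minute, f->second);
}
```

A viewer built this way could keep only the `fat_stamp_raw_t` and derive the other layers on request, which also fits the lazy-parsing requirement. The sketch also shows the cost of doing this by hand: each layer needs a separately written parsing function and building function that must be kept in agreement. That duplication is the kind of extra effort a declarative description would be expected to avoid.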