Introduction

The “library reference” contains several documents describing the core model of this framework. Having completed the tutorial, you can now dive deeper into the functionality of this library.

What exactly is a caterpillar?

Caterpillars (/ˈkætərpɪlər/ 🐛) are the wormlike larvae of butterflies and moths. [1] Just as caterpillars undergo metamorphosis, caterpillar facilitates the metamorphosis of data structures into runtime objects.

This document aims to address burning questions regarding design and availability. It provides an overview of the aspects this framework covers and those it does not. In general, this library was designed to enhance the process of reverse engineering binary structures using readable and shareable code. The use of “static” [2] class definitions brings advantages but also raises some problems that we need to discuss.

Why use Caterpillar?

There are several reasons to incorporate this library into your code. Some scenarios where Caterpillar can be beneficial include:

  • Quick Reverse Engineering: When you need to rapidly reverse engineer a binary structure.

  • Creating Presentable Binary Structures: When there’s a task to create a binary structure, and the result should be presentable.

  • Having Some Fun: When you just want to experiment and play around in Python.

The biggest advantage of Caterpillar is its lack of external dependencies (although extensions may rely on additional dependencies). Additionally, structures can be defined in very few lines of code, as demonstrated in the following example from examples/formats/caf:

@struct(order=BigEndian)
class CAFChunk:
    # Include other structs just like that
    chunk_header: CAFChunkHeader
    # Built-in support for switch-case structures
    data: Field(this.chunk_header.chunk_type) >> {
        b"desc": CAFAudioFormat,
        b"info": CAFStringsChunk,
        b"pakt": CAFPacketTable,
        b"data": CAFData,
        b"free": padding[this.chunk_header.chunk_size],
        # the fallback struct given with a default option
        DEFAULT_OPTION: Bytes(this.chunk_header.chunk_size),
    }
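
To make the switch-case idea concrete, here is a minimal hand-written sketch using only Python's standard `struct` module. It is not how Caterpillar implements the feature; it only illustrates the dispatch pattern that the `>>` mapping expresses. The header layout (a 4-byte type code followed by a big-endian signed 64-bit size) and the names `HANDLERS`, `parse_free`, `parse_raw` and `parse_chunk` are assumptions made for this illustration.

```python
import struct
from typing import Callable

# Assumed chunk header layout: 4-byte type code, then a big-endian
# signed 64-bit payload size (hypothetical stand-in for CAFChunkHeader).
HEADER = struct.Struct(">4sq")


def parse_free(payload: bytes) -> None:
    # "free" chunks only carry padding, so there is nothing to return
    return None


def parse_raw(payload: bytes) -> bytes:
    # Fallback, comparable to DEFAULT_OPTION: keep the payload as raw bytes
    return payload


# Hypothetical handler table: one entry per known chunk type
HANDLERS: dict[bytes, Callable[[bytes], object]] = {
    b"free": parse_free,
}


def parse_chunk(buf: bytes, offset: int = 0) -> tuple[bytes, object, int]:
    """Return (chunk_type, parsed_payload, next_offset)."""
    chunk_type, chunk_size = HEADER.unpack_from(buf, offset)
    start = offset + HEADER.size
    payload = buf[start : start + chunk_size]
    # Select the parser by chunk type, falling back to raw bytes
    handler = HANDLERS.get(chunk_type, parse_raw)
    return chunk_type, handler(payload), start + chunk_size


blob = HEADER.pack(b"free", 4) + bytes(4) + HEADER.pack(b"xyzw", 3) + b"abc"
first_type, first_value, offset = parse_chunk(blob)
second_type, second_value, end = parse_chunk(blob, offset)
print(first_type, first_value, second_type, second_value)
```

The declarative definition above folds the header read, the lookup table and the fallback into a single annotated field, which is where most of the saved lines come from.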

How does this even work?

Caterpillar uses Python’s annotations to build its model by processing class definitions. With Python 3.12, there are no conflicts in using annotations for defining fields.

@struct
class Format:
    # <name> : <field> [ = <default_value> ]

Because the field definition lives in the annotation, the assignment remains free for an optional default value, so no additional syntax is needed and the code does not become more complex.

Pros & Cons

TODO

Comparison

TODO: add links

Here, we present a comparison of Caterpillar with Construct and Kaitai, using the struct from the initial benchmark in the construct-docs repository as a base. Since Kaitai’s generated code can only parse a format and not build it, the comparison focuses on Construct. The files used in the benchmark are provided below:

from caterpillar.shortcuts import struct, LittleEndian, bitfield, unpack, pack
from caterpillar.fields import uint8, UInt, CString, Prefixed, uint32


@bitfield(order=LittleEndian)
class Flags:
    bool1 : 1
    num4  : 3
    # padding is generated automatically


@struct(order=LittleEndian)
class Item:
    num1: uint8
    num2: UInt(24)
    flags: Flags
    fixedarray1: uint8[3]
    name1: CString(encoding="utf-8")
    name2: Prefixed(uint8, encoding="utf-8")


Format = LittleEndian + Item[uint32::]

Note

It makes no measurable difference in timing whether we define the Format field directly or put it into a struct class.

from construct import *

d = Struct(
    "count" / Int32ul,
    "items"
    / Array(
        this.count,
        Struct(
            "num1" / Int8ul,
            "num2" / Int24ul,
            "flags"
            / BitStruct(
                "bool1" / Flag,
                "num4" / BitsInteger(3),
                Padding(4),
            ),
            "fixedarray1" / Array(3, Int8ul),
            "name1" / CString("utf8"),
            "name2" / PascalString(Int8ul, "utf8"),
        ),
    ),
)
d_compiled = d.compile()

meta:
  id: comparison_1_kaitai
  encoding: utf-8
  endian: le
seq:
  - id: count
    type: u4
  - id: items
    repeat: expr
    repeat-expr: count
    type: item
types:
  item:
    seq:
      - id: num1
        type: u1
      - id: num2_lo
        type: u2
      - id: num2_hi
        type: u1
      - id: flags
        type: flags
      - id: fixedarray1
        repeat: expr
        repeat-expr: 3
        type: u1
      - id: name1
        type: strz
      - id: len_name2
        type: u1
      - id: name2
        type: str
        size: len_name2
    instances:
      num2:
        value: 'num2_hi << 16 | num2_lo'
    types:
      flags:
        seq:
          - id: bool1
            type: b1
          - id: num4
            type: b3
          - id: padding
            type: b4

from hachoir.field import *
from hachoir.stream import StringInputStream
from hachoir.core.bits import LITTLE_ENDIAN


class Entry(FieldSet):
    endian = LITTLE_ENDIAN

    def createFields(self):
        yield UInt8(self, "num1")
        yield UInt24(self, "num2")
        yield Bit(self, "bool1")
        yield Bits(self, "num4", 3)
        yield PaddingBits(self, "_", 4)
        for _ in range(3):
            yield UInt8(self, "fixedarray[]")
        yield CString(self, "name1")
        yield PascalString8(self, "name2")


class Format(Parser):
    endian = LITTLE_ENDIAN

    def createFields(self):
        yield UInt32(self, "count")
        for _ in range(self["count"].value):
            yield Entry(self, "entry[]")

from mrcrowbar.fields import Bits8, UInt24_LE, UInt32_LE, BlockField, UInt8
from mrcrowbar.models import Block, CString, CStringN
from mrcrowbar.refs import Ref


class Entry(Block):
    num1 = UInt8()
    num2 = UInt24_LE()
    bool1 = Bits8(0x0004, bits=0b10000000, endian="little")
    num4 = Bits8(0x0004, bits=0b01110000, endian="little")
    fixedarray1 = UInt8(0x0005, count=3)
    name1 = CString(0x0008)
    name2 = CString(length_field=UInt8)


class Format(Block):
    count = UInt32_LE(0x0000)
    entries = BlockField(Entry, 0x0004, count=Ref("count"))
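
For reference, the byte layout that all of the definitions above describe can also be written out by hand. The sketch below uses only the standard library and is not part of the benchmark; the helper names `build_item`, `parse_item` and `parse_format` are hypothetical. It assumes that `bool1` occupies the most significant bit of the flags byte and `num4` the next three bits, as in the Construct and mrcrowbar listings.

```python
import struct


def build_item(num1, num2, bool1, num4, fixedarray1, name1, name2):
    """Serialize one item: u8, u24le, flags byte, 3 x u8, C string, Pascal string."""
    flags = (int(bool1) << 7) | ((num4 & 0b111) << 4)  # low 4 bits are padding
    raw2 = name2.encode("utf-8")
    return (
        bytes([num1])
        + num2.to_bytes(3, "little")
        + bytes([flags])
        + bytes(fixedarray1)
        + name1.encode("utf-8") + b"\x00"   # NUL-terminated string
        + bytes([len(raw2)]) + raw2         # u8 length prefix, then the bytes
    )


def parse_item(buf, pos):
    """Parse one item starting at pos; return (item, next_pos)."""
    flags = buf[pos + 4]
    end = buf.index(b"\x00", pos + 8)       # terminator of name1
    length = buf[end + 1]
    item = {
        "num1": buf[pos],
        "num2": int.from_bytes(buf[pos + 1 : pos + 4], "little"),
        "bool1": bool(flags & 0x80),
        "num4": (flags >> 4) & 0b111,
        "fixedarray1": list(buf[pos + 5 : pos + 8]),
        "name1": buf[pos + 8 : end].decode("utf-8"),
        "name2": buf[end + 2 : end + 2 + length].decode("utf-8"),
    }
    return item, end + 2 + length


def parse_format(buf):
    """Parse the u32le item count followed by that many items."""
    (count,) = struct.unpack_from("<I", buf, 0)
    pos, items = 4, []
    for _ in range(count):
        item, pos = parse_item(buf, pos)
        items.append(item)
    return items


blob = (
    struct.pack("<I", 2)
    + build_item(1, 0x010203, True, 5, [7, 8, 9], "héllo", "wörld")
    + build_item(2, 0xFFFFFF, False, 0, [0, 0, 0], "", "x")
)
items = parse_format(blob)
print(items)
```

Compared with the declarative definitions, the manual version has to track offsets and bit masks explicitly, which is the kind of bookkeeping these libraries are meant to remove.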

The test involved one thousand iterations of packing and unpacking the structure. Kaitai achieves the fastest time since it directly reads all data from the stream. Caterpillar and Construct show similar performance in their initial (uncompiled) form. The compilation feature of Construct makes it comparable to Kaitai, but since compilation is not a primary goal of caterpillar, the compiled results are not the main point of comparison.
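
A per-call figure of this kind can be obtained with the standard `timeit` module. The sketch below is an assumption about the general approach, not a copy of the benchmark scripts; the workload `unpack_once` and the helper `seconds_per_call` are stand-ins invented for this illustration.

```python
import struct
import timeit

ITERATIONS = 1000  # one thousand iterations, as in the test described above

# Stand-in workload (an assumption): decode a tiny little-endian record
RECORD = struct.Struct("<I3B")
blob = RECORD.pack(3, 1, 2, 3)


def unpack_once() -> tuple:
    return RECORD.unpack(blob)


def seconds_per_call(func, iterations: int = ITERATIONS) -> float:
    """Average wall-clock time of one call, measured over `iterations` runs."""
    return timeit.timeit(func, number=iterations) / iterations


per_call = seconds_per_call(unpack_once)
print("Timeit measurements:")
print(f"unpack {per_call:.10f} sec/call")
```

Averaging over many iterations smooths out timer resolution and one-off costs, but results still vary between runs and machines.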

Note

All tests were performed on a Windows VM using the latest stable Python implementation (Python 3.12.1 (tags/v3.12.1:2305ca5) [MSC v.1937 64 bit (AMD64)] on win32) and PyPy ([PyPy 7.3.15 with MSC v.1929 64 bit (AMD64)]).

(venv-3.12.1)> python3 ./examples/comparison/comparison_1_caterpillar.py ./blob
Timeit measurements:
unpack 0.0097203362 sec/call
pack   0.0078892448 sec/call

(pypy-venv-3.10)> pypy ./examples/comparison/comparison_1_caterpillar.py ./blob
Timeit measurements:
unpack 0.0044497086 sec/call
pack   0.0021923946 sec/call

(venv-3.12.1)> python3 ./examples/comparison/comparison_1_construct.py ./blob
Parsing measurements:
default  0.0145166325 sec/call
compiled 0.0085910592 sec/call

Building measurements:
default  0.0125181926 sec/call
compiled 0.0098681578 sec/call

(pypy-venv-3.10)> pypy ./examples/comparison/comparison_1_construct.py ./blob
Parsing measurements:
default  0.0051257158 sec/call
compiled 0.0033772090 sec/call

Building measurements:
default  0.0037924385 sec/call
compiled 0.0031346225 sec/call

(venv-3.12.1)> python3 ./examples/comparison/comparison_1_kaitai.py ./blob
Parsing measurements:
default  0.0034705456 sec/call

(pypy-venv-3.10)> pypy ./examples/comparison/comparison_1_kaitai.py ./blob
Parsing measurements:
default  0.0008136422 sec/call

(venv-3.12.1)> python3 ./examples/comparison/comparison_1_hachoir.py ./blob
Parsing measurements:
default  0.0260070809 sec/call

(pypy-venv-3.10)> pypy ./examples/comparison/comparison_1_hachoir.py ./blob
Parsing measurements:
default  0.0063716554 sec/call

(venv-3.12.1)> python3 ./examples/comparison/comparison_1_mrcrowbar.py ./blob
Parsing measurements:
default  0.0555872261 sec/call

Building measurements:
default  0.0898006975 sec/call

(pypy-venv-3.10)> pypy ./examples/comparison/comparison_1_mrcrowbar.py ./blob
Parsing measurements:
default  0.0110391153 sec/call

Building measurements:
default  0.0126670091 sec/call

In this benchmark, run on CPython 3.12.1, caterpillar needs approximately 33.04% less time per call for unpacking and approximately 36.98% less time per call for packing than Construct (not compiled).

Compared to compiled Construct on the same interpreter, caterpillar takes approximately 13.14% more time per call when unpacking, but approximately 20.05% less time per call when packing. It’s important to note that these figures reflect a trade-off between performance and other considerations such as simplicity and ease of use.
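
The percentages can be recomputed from the CPython 3.12.1 measurements listed above. The short sketch below does the arithmetic, taking the Construct time as the baseline in each case; the helper name `percent_change` is introduced only for this illustration.

```python
def percent_change(ours: float, baseline: float) -> float:
    """Relative difference of `ours` to `baseline` in percent.

    Negative values mean less time per call than the baseline.
    """
    return (ours - baseline) / baseline * 100


# sec/call figures copied from the CPython 3.12.1 runs above
caterpillar = {"unpack": 0.0097203362, "pack": 0.0078892448}
construct_default = {"unpack": 0.0145166325, "pack": 0.0125181926}
construct_compiled = {"unpack": 0.0085910592, "pack": 0.0098681578}

results = {}
for label, baseline in (("default", construct_default), ("compiled", construct_compiled)):
    for op in ("unpack", "pack"):
        results[(label, op)] = percent_change(caterpillar[op], baseline[op])
        print(f"{op:6} vs Construct ({label}): {results[(label, op)]:+.2f}%")
```

The same calculation can be repeated with the PyPy figures, where the ratios between the libraries differ.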

Caution

While this small benchmark provides a starting point, it does not claim to be complete or precise. Instead, it serves as an initial reference for further benchmarking.

Users are advised to interpret the results with caution!