Introduction#
The “library reference” contains several documents describing the core model of this framework. Having finished the tutorial, you are now ready to dive deeper into the functionality of this library.
What exactly is a caterpillar?
Caterpillars (/ˈkætərpɪlər/ - 🐛) are the wormlike larvae of butterflies or moths. [1] Just as caterpillars undergo a metamorphosis, caterpillar facilitates the metamorphosis of data structures into runtime objects.
This document aims to address burning questions regarding design and availability. It provides an overview of the aspects this framework covers and those it doesn’t. In general, this library was designed to enhance the process of reverse engineering binary structures using readable and shareable code. The use of “static” [2] class definitions delivers advantages but also raises some problems that we need to discuss.
Why use Caterpillar?#
There are several reasons to incorporate this library into your code. Some of the scenarios where Caterpillar can be beneficial include:
Quick Reverse Engineering: When you need to rapidly reverse engineer a binary structure.
Creating Presentable Binary Structures: When there’s a task to create a binary structure, and the result should be presentable.
Having some fun: When you just want to experiment and play around in Python.
The biggest advantage of Caterpillar is the lack of external dependencies (though extensions can be integrated using dependencies). Additionally, the minimal lines of code required to define structures speak for themselves, as demonstrated in the following example from examples/formats/caf:
@struct(order=BigEndian)
class CAFChunk:
    # Include other structs just like that
    chunk_header: CAFChunkHeader
    # Built-in support for switch-case structures
    data: Field(this.chunk_header.chunk_type) >> {
        b"desc": CAFAudioFormat,
        b"info": CAFStringsChunk,
        b"pakt": CAFPacketTable,
        b"data": CAFData,
        b"free": padding[this.chunk_header.chunk_size],
        # the fallback struct given with a default option
        DEFAULT_OPTION: Bytes(this.chunk_header.chunk_size),
    }
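Conceptually, the switch-case above dispatches on the already-parsed chunk_type: an exact match selects the corresponding struct, and DEFAULT_OPTION acts as the fallback. The dispatch logic can be sketched in plain Python (the names and values here are illustrative, not caterpillar’s actual implementation):

```python
# Illustrative sketch of switch-case dispatch, NOT caterpillar's internals:
# pick a handler based on the parsed chunk type, with a registered fallback.
DEFAULT_OPTION = object()  # sentinel key for the fallback case

cases = {
    b"desc": "CAFAudioFormat",
    b"info": "CAFStringsChunk",
    DEFAULT_OPTION: "Bytes(chunk_size)",
}

def resolve(chunk_type: bytes):
    # exact match first, otherwise the registered fallback
    return cases.get(chunk_type, cases[DEFAULT_OPTION])

print(resolve(b"desc"))  # matches a case
print(resolve(b"wxyz"))  # unknown type falls back to the default
```

The fallback keyed by a sentinel object is what lets known and unknown chunk types share one lookup table.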
How does this even work?#
Caterpillar utilizes Python’s annotations to build its model while processing class definitions. As of Python 3.12, there are no conflicts in using annotations to define fields.
@struct
class Format:
    # <name> : <field> [ = <default_value> ]
Because fields live in annotations, a default value can simply be assigned at class level if desired, eliminating the need to make the code more complex with extra assignments.
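The underlying mechanism can be observed with plain Python, without caterpillar involved: class annotations are collected in `__annotations__`, while a class-level assignment supplies the default value. The field types below are placeholder strings purely for illustration:

```python
# Plain-Python sketch of the annotation mechanism: field names and their
# "types" end up in __annotations__, defaults stay ordinary class attributes.
class Format:
    magic: "uint32"           # <name> : <field>
    version: "uint8" = 1      # <name> : <field> = <default_value>

# a decorator like @struct can read the model from here:
print(Format.__annotations__)  # {'magic': 'uint32', 'version': 'uint8'}
print(Format.version)          # 1  (the default value)
```

A class decorator only needs to walk `__annotations__` in order to build the field model, which is why no extra assignment syntax is required.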
Pros & Cons#
TODO
Comparison#
TODO: add links
Here, we present a comparison of Caterpillar with Construct and Kaitai, using the struct from the initial benchmark in the construct-docs repository as a base. Since Kaitai’s generated code can only parse a format, not build it, the comparison focuses on Construct. The files used in the benchmark are provided below:
from caterpillar.shortcuts import struct, LittleEndian, bitfield, unpack, pack
from caterpillar.fields import uint8, UInt, CString, Prefixed, uint32

@bitfield(order=LittleEndian)
class Flags:
    bool1 : 1
    num4  : 3
    # padding is generated automatically

@struct(order=LittleEndian)
class Item:
    num1: uint8
    num2: UInt(24)
    flags: Flags
    fixedarray1: uint8[3]
    name1: CString(encoding="utf-8")
    name2: Prefixed(uint8, encoding="utf-8")

Format = LittleEndian + Item[uint32::]
Note
It actually makes no time-related difference whether we define the Format field directly or put it into a struct class.
from construct import *

d = Struct(
    "count" / Int32ul,
    "items" / Array(
        this.count,
        Struct(
            "num1" / Int8ul,
            "num2" / Int24ul,
            "flags" / BitStruct(
                "bool1" / Flag,
                "num4" / BitsInteger(3),
                Padding(4),
            ),
            "fixedarray1" / Array(3, Int8ul),
            "name1" / CString("utf8"),
            "name2" / PascalString(Int8ul, "utf8"),
        ),
    ),
)
d_compiled = d.compile()
meta:
  id: comparison_1_kaitai
  encoding: utf-8
  endian: le
seq:
  - id: count
    type: u4
  - id: items
    repeat: expr
    repeat-expr: count
    type: item
types:
  item:
    seq:
      - id: num1
        type: u1
      - id: num2_lo
        type: u2
      - id: num2_hi
        type: u1
      - id: flags
        type: flags
      - id: fixedarray1
        repeat: expr
        repeat-expr: 3
        type: u1
      - id: name1
        type: strz
      - id: len_name2
        type: u1
      - id: name2
        type: str
        size: len_name2
    instances:
      num2:
        value: 'num2_hi << 16 | num2_lo'
    types:
      flags:
        seq:
          - id: bool1
            type: b1
          - id: num4
            type: b3
          - id: padding
            type: b4
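Because Kaitai Struct has no native 24-bit integer type, the schema reads num2 as a two-byte low part plus a one-byte high part and recombines them in the num2 value instance. The recombination expression can be checked in plain Python:

```python
# Little-endian 24-bit value 0x123456 is stored on disk as bytes 56 34 12.
raw = bytes([0x56, 0x34, 0x12])

num2_lo = int.from_bytes(raw[0:2], "little")  # the u2 part
num2_hi = raw[2]                              # the u1 part

# same expression as the Kaitai instance: 'num2_hi << 16 | num2_lo'
num2 = num2_hi << 16 | num2_lo
print(hex(num2))  # 0x123456
```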
from hachoir.field import *
from hachoir.stream import StringInputStream
from hachoir.core.bits import LITTLE_ENDIAN

class Entry(FieldSet):
    endian = LITTLE_ENDIAN

    def createFields(self):
        yield UInt8(self, "num1")
        yield UInt24(self, "num2")
        yield Bit(self, "bool1")
        yield Bits(self, "num4", 3)
        yield PaddingBits(self, "_", 4)
        for _ in range(3):
            yield UInt8(self, "fixedarray[]")
        yield CString(self, "name1")
        yield PascalString8(self, "name2")

class Format(Parser):
    endian = LITTLE_ENDIAN

    def createFields(self):
        yield UInt32(self, "count")
        for _ in range(self["count"].value):
            yield Entry(self, "entry[]")
from mrcrowbar.fields import Bits8, UInt24_LE, UInt32_LE, BlockField, UInt8
from mrcrowbar.models import Block, CString, CStringN
from mrcrowbar.refs import Ref

class Entry(Block):
    num1 = UInt8()
    num2 = UInt24_LE()
    bool1 = Bits8(0x0004, bits=0b10000000, endian="little")
    num4 = Bits8(0x0004, bits=0b01110000, endian="little")
    fixedarray1 = UInt8(0x0005, count=3)
    name1 = CString(0x0008)
    name2 = CString(length_field=UInt8)

class Format(Block):
    count = UInt32_LE(0x0000)
    entries = BlockField(Entry, 0x0004, count=Ref("count"))
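Unlike Construct’s BitStruct or Kaitai’s bit types, mrcrowbar addresses the bit field with explicit masks on the flags byte. The extraction those masks describe amounts to ordinary masking and shifting, which can be reproduced in plain Python (independent of mrcrowbar; the sample byte is made up for illustration):

```python
# Sample flags byte: top bit set (bool1), next three bits 0b011 (num4),
# low nibble is padding.
flags_byte = 0b1011_0000

bool1 = bool(flags_byte & 0b10000000)  # mask the top bit
num4 = (flags_byte & 0b01110000) >> 4  # mask the next three bits, shift down

print(bool1, num4)  # True 3
```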
The test involved one thousand iterations of packing and unpacking the structure. Kaitai achieves the fastest time since it reads all data directly from the stream. caterpillar and Construct show similar performance in their default form. Construct’s compilation feature makes it comparable to Kaitai, but since compilation is not a primary goal of caterpillar, those results are not considered further.
Note
All tests have been performed on a Windows VM using the latest stable Python implementation (Python 3.12.1 (tags/v3.12.1:2305ca5) [MSC v.1937 64 bit (AMD64)] on win32) and PyPy (PyPy 7.3.15 with MSC v.1929 64 bit (AMD64)).
(venv-3.12.1)> python3 ./examples/comparison/comparison_1_caterpillar.py ./blob
Timeit measurements:
unpack 0.0097203362 sec/call
pack 0.0078892448 sec/call
(pypy-venv-3.10) pypy ./examples/comparison/comparison_1_caterpillar.py ./blob
Timeit measurements:
unpack 0.0044497086 sec/call
pack 0.0021923946 sec/call
(venv-3.12.1)> python3 ./examples/comparison/comparison_1_construct.py ./blob
Parsing measurements:
default 0.0145166325 sec/call
compiled 0.0085910592 sec/call
Building measurements:
default 0.0125181926 sec/call
compiled 0.0098681578 sec/call
(pypy-venv-3.10)> pypy ./examples/comparison/comparison_1_construct.py ./blob
Parsing measurements:
default 0.0051257158 sec/call
compiled 0.0033772090 sec/call
Building measurements:
default 0.0037924385 sec/call
compiled 0.0031346225 sec/call
(venv-3.12.1)> python3 ./examples/comparison/comparison_1_kaitai.py ./blob
Parsing measurements:
default 0.0034705456 sec/call
(pypy-venv-3.10)> pypy ./examples/comparison/comparison_1_kaitai.py ./blob
Parsing measurements:
default 0.0008136422 sec/call
(venv-3.12.1)> python3 ./examples/comparison/comparison_1_hachoir.py ./blob
Parsing measurements:
default 0.0260070809 sec/call
(pypy-venv-3.10)> pypy ./examples/comparison/comparison_1_hachoir.py ./blob
Parsing measurements:
default 0.0063716554 sec/call
(venv-3.12.1)> python3 ./examples/comparison/comparison_1_mrcrowbar.py ./blob
Parsing measurements:
default 0.0555872261 sec/call
Building measurements:
default 0.0898006975 sec/call
(pypy-venv-3.10)> pypy ./examples/comparison/comparison_1_mrcrowbar.py ./blob
Parsing measurements:
default 0.0110391153 sec/call
Building measurements:
default 0.0126670091 sec/call
In this benchmark, caterpillar demonstrates a performance advantage over Construct (not compiled): it is approximately 33.04% faster at unpacking data and approximately 36.97% faster at packing data.
Against compiled Construct, the picture is mixed: caterpillar is approximately 13.14% slower at unpacking data, but approximately 20.05% faster at packing data. It’s important to note that these figures reflect a trade-off between performance and other considerations such as simplicity and ease of use.
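These percentages follow directly from the sec/call timings listed above; the arithmetic can be reproduced in a few lines of Python:

```python
# sec/call timings copied from the CPython benchmark output above
cat_unpack, cat_pack = 0.0097203362, 0.0078892448
con_unpack, con_pack = 0.0145166325, 0.0125181926        # Construct, default
con_unpack_c, con_pack_c = 0.0085910592, 0.0098681578    # Construct, compiled

# caterpillar vs. Construct (not compiled): how much faster is caterpillar?
unpack_gain = (con_unpack - cat_unpack) / con_unpack * 100   # ~33.04 %
pack_gain = (con_pack - cat_pack) / con_pack * 100           # ~36.97 %

# caterpillar vs. compiled Construct
unpack_loss = (cat_unpack - con_unpack_c) / con_unpack_c * 100  # ~13.14 % slower
pack_gain_c = (con_pack_c - cat_pack) / con_pack_c * 100        # ~20.05 % faster

print(unpack_gain, pack_gain, unpack_loss, pack_gain_c)
```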
Caution
While this small benchmark provides a useful starting point, it makes no claim to rigor or completeness; it serves only as an initial reference for further benchmarking.
Users are advised to interpret the results with caution!