2. Basic Concepts#

In this section, we’ll explore some common techniques used in binary file formats, setting the stage for more advanced topics in the next chapter.

Note

Some examples using the interpreter prompts make use of a shortcut to define Field objects:

>>> from caterpillar.shortcuts import F
>>> field = F(uint8)

2.1. Standard Types#

2.1.1. Numbers#

When dealing with binary data, numbers play a crucial role. Besides the default integer types (e.g., uint8, uint16, etc.), caterpillar introduces some special integer formats. The default types are listed below, first for the Python API and then for the C API (caterpillar.c):

  • Unsigned (u...) and signed:

int8, uint8, int16, uint16, int24, uint24, int32, uint32, int64, uint64, ssize_t, size_t

  • Floating point:

float16, float32, float64

  • Special primitives:

boolean, char, void_ptr

  • Unsigned (u...) and signed:

i8, u8, i16, u16, i24, u24, i32, u32, i64, u64, i128, u128

  • Floating point:

f16, f32, f64

  • Special primitives:

boolean, char

2.1.1.1. Custom-sized integer#

It’s also possible to use integers with a custom size (in bits). However, it’s important to note that you have to define the struct with the bit count, and internally, only the occupied bytes will be used. For example:

>>> field = F(Int(24))      # three-byte signed integer
>>> field = F(UInt(40))     # five-byte unsigned integer
>>> i48 = Int(48)               # six-byte signed integer
>>> u40 = Int(40, signed=False) # five-byte unsigned integer
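To make "only the occupied bytes will be used" concrete, here is a minimal standard-library sketch of what a custom-sized integer looks like on the wire. The helper names `pack_int` and `unpack_int` are hypothetical and not part of caterpillar, and big-endian byte order is assumed purely for illustration.

```python
def byte_size(bits: int) -> int:
    """Number of bytes occupied by an integer of the given bit count."""
    return (bits + 7) // 8


def pack_int(value: int, bits: int, signed: bool = True) -> bytes:
    # Assumption: big-endian byte order, chosen only for this sketch.
    return value.to_bytes(byte_size(bits), "big", signed=signed)


def unpack_int(data: bytes, bits: int, signed: bool = True) -> int:
    size = byte_size(bits)
    if len(data) < size:
        raise ValueError(f"need {size} bytes, got {len(data)}")
    return int.from_bytes(data[:size], "big", signed=signed)


i24_value = unpack_int(b"\xff\xff\xfe", 24)      # signed three-byte integer
u40_bytes = pack_int(1, 40, signed=False)        # unsigned five-byte integer
```

A 24-bit value consumes exactly three bytes and a 40-bit value exactly five, independent of the machine word size.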

2.1.1.2. Variable-sized integer#

The built-in struct py.VarInt/c.VarInt supports parsing and building integers with variable length. Its documentation provides a detailed explanation of all different configurations.

>>> field = F(vint) # or F(VarInt())
>>> # use 'varint' directly or use VarInt()
>>> be_varint = BIG_ENDIAN + varint
>>> le_varint = VarInt(little_endian=True)
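For readers unfamiliar with variable-length integers, the following sketch shows one widespread scheme: base-128 with a continuation bit, least significant group first (as used by LEB128). Whether this matches a particular VarInt configuration in caterpillar is an assumption; the functions `encode_varint` and `decode_varint` are hypothetical helpers written with the standard library only.

```python
from typing import Tuple


def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, low group first."""
    if value < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)   # high bit set: more bytes follow
        else:
            out.append(group)          # high bit clear: last byte
            return bytes(out)


def decode_varint(data: bytes) -> Tuple[int, int]:
    """Return (value, number of bytes consumed)."""
    result = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * index)
        if not byte & 0x80:
            return result, index + 1
    raise ValueError("truncated variable-length integer")
```

Small values take a single byte, while larger values grow one byte per additional seven bits.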

2.1.2. Enumerations#

Enums are essential when working with binary file formats, and caterpillar integrates standard Python enumerations - classes extending enum.Enum - with ease.

Let’s revisit the pHYS chunk to add an enum to the last field.

Simple enumeration in a struct definition#
import enum

class PHYSUnit(enum.IntEnum): # <-- the enum value doesn't have to be int
    __struct__ = uint8        # <-- to make the code even more compact, use this
    UNKNOWN = 0
    METRE = 1

@struct(order=BigEndian)         # <-- same as before
class PHYSChunk:
    pixels_per_unit_x: uint32
    pixels_per_unit_y: uint32
    unit: PHYSUnit               # <-- now we have an auto-enumeration

Simple enumeration in a struct definition#
import enum

class PHYSUnit(enum.IntEnum):       # <-- the enum value doesn't have to be int
    UNKNOWN = 0
    METRE = 1

@struct(endian=BIG_ENDIAN)          # <-- same as before
class PHYSChunk:
    pixels_per_unit_x: u32
    pixels_per_unit_y: u32
    unit: enumeration(u8, PHYSUnit) # <-- atom is required here

Important

It’s worth noting that a default value can be specified for the field as a fallback. If none is provided, and an unpacked value not in the enumeration is encountered, an error will be triggered.
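The fallback behaviour described above can be sketched with the standard library alone. This is not caterpillar's implementation; `to_enum` and the `_MISSING` sentinel are hypothetical names used only to illustrate the rule "use the default if given, otherwise raise".

```python
import enum


class PHYSUnit(enum.IntEnum):
    UNKNOWN = 0
    METRE = 1


_MISSING = object()  # sentinel so that None remains a valid default


def to_enum(model, value, default=_MISSING):
    """Map a raw unpacked value onto an enum member."""
    try:
        return model(value)
    except ValueError:
        if default is _MISSING:
            raise              # no fallback configured: report the error
        return default         # fallback configured: use it instead


known = to_enum(PHYSUnit, 1)
fallback = to_enum(PHYSUnit, 7, default=PHYSUnit.UNKNOWN)
```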

2.1.3. Arrays/Lists#

Binary formats often require storing multiple objects of the same type sequentially. Caterpillar simplifies this with item access for defining arrays of static or dynamic size.

We started with the PLTE chunk, which stores three-byte sequences. We can define an array of RGB objects as follows:

>>> PLTEChunk = RGB[this.length / 3]
>>> PLTEChunk = RGB.__struct__[ContextPath("obj.length") / 3]

Added in version 2.2.0: The syntax will be changed once __class_getitem__ is implemented by any c.Struct instance.

Since this chunk has only one field, the array specifier is used to make it a list type. The length is calculated based on the chunk’s length field divided by three because the RGB class occupies three bytes.
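The length arithmetic can be checked with a small standard-library sketch. The function `split_palette` is a hypothetical helper, not caterpillar code; it assumes the payload is exactly the chunk body.

```python
from typing import List, Tuple


def split_palette(payload: bytes) -> List[Tuple[int, ...]]:
    """Split a PLTE payload into (r, g, b) tuples of three bytes each."""
    if len(payload) % 3:
        raise ValueError("PLTE payload length must be divisible by three")
    count = len(payload) // 3          # chunk length divided by entry size
    return [tuple(payload[3 * i:3 * i + 3]) for i in range(count)]


palette = split_palette(bytes([255, 0, 0, 0, 255, 0]))
```

Six payload bytes yield two palette entries, mirroring the element count computed from the chunk's length field.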

2.1.4. String Types#

2.1.4.1. CString#

The CString in this library extends beyond a mere reference to C strings. It provides additional functionality, as demonstrated in the structure of the next chunk.

The tEXt chunk structure#
from caterpillar.py import *
from caterpillar.shortcuts import lenof

@struct
class TEXTChunk:
    # dynamic sized string that ends with a null-byte
    keyword: CString(encoding="ISO-8859-1")
    # static sized string based on the current context. some notes:
    #   - parent.length is the current chunk's length
    #   - lenof(...) is the runtime length of the context variable
    #   - 1 because of the extra null-byte that is stripped from keyword
    text: CString(encoding="ISO-8859-1", length=parent.length - lenof(this.keyword) - 1)

The tEXt chunk structure#
from caterpillar.c import *                 # <-- main difference
from caterpillar.shortcuts import lenof
# NOTE: lenof works here, because Caterpillar C's Context implements
# the 'Context Protocol'.

parent = ContextPath("parent.obj")
this = ContextPath("obj")

@struct
class TEXTChunk:
    # dynamic sized string that ends with a null-byte
    keyword: cstring(encoding="ISO-8859-1")
    # static sized string based on the current context. some notes:
    #   - parent.length is the current chunk's length
    #   - lenof(...) is the runtime length of the context variable
    #   - 1 because of the extra null-byte that is stripped from keyword
    text: cstring(encoding="ISO-8859-1", length=parent.length - lenof(this.keyword) - 1)
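To see why the text length is the chunk length minus the keyword length minus one, here is a hand-written parser for the same layout using only the standard library. `parse_text_chunk` is a hypothetical helper and not how caterpillar parses the chunk internally.

```python
from typing import Tuple


def parse_text_chunk(payload: bytes, length: int) -> Tuple[str, str]:
    """Parse a tEXt body: null-terminated keyword followed by the text."""
    end = payload.index(b"\x00")                 # position of the terminator
    keyword = payload[:end].decode("ISO-8859-1")
    # ISO-8859-1 is a single-byte encoding, so len(keyword) equals the
    # number of bytes consumed; the extra 1 is the stripped null byte.
    text_length = length - len(keyword) - 1
    start = end + 1
    text = payload[start:start + text_length].decode("ISO-8859-1")
    return keyword, text


body = b"Title\x00Hello"
parsed = parse_text_chunk(body, len(body))
```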

Challenge

You are now ready to implement the iTXt chunk. Try it yourself!

Solution

This solution serves as an example and isn’t the only way to approach it!

@struct
class ITXTChunk:
    keyword: CString(encoding="utf-8")
    compression_flag: uint8
    # we actually don't need an Enum here
    compression_method: uint8
    language_tag: CString(encoding="ASCII")
    translated_keyword: CString(encoding="utf-8")
    # length is calculated with parent.length - len(keyword)+len(b"\x00") - ...
    # the constant 5 covers three null-bytes and the two uint8 fields
    text: CString(
        encoding="utf-8",
        length=parent.length
        - lenof(this.keyword)
        - lenof(this.language_tag)
        - lenof(this.translated_keyword)
        - 5,
    )

from caterpillar.c import *                 # <-- main difference
from caterpillar.shortcuts import lenof

parent = ContextPath("parent.obj")
this = ContextPath("obj")

@struct
class ITXTChunk:
    keyword: cstring() # default encoding is "utf-8"
    compression_flag: u8
    # we actually don't need an Enum here
    compression_method: u8
    language_tag: cstring(encoding="ASCII")
    translated_keyword: cstring(...) # explicit greedy parsing
    # length is calculated with parent.length - len(keyword)+len(b"\x00") - ...
    # the constant 5 covers three null-bytes and the two u8 fields
    text: cstring(
        parent.length
        - lenof(this.keyword)
        - lenof(this.language_tag)
        - lenof(this.translated_keyword)
        - 5,
    )

You can also apply your own termination character, for example:

>>> struct = CString(pad="\x0A")
>>> s = cstring(sep="\x0A")

This struct will use a line feed (0x0A) as the termination character and strip all trailing padding bytes.

2.1.4.2. String#

Besides the special C strings, there’s a default String class that implements the basic behaviour of a string. It’s crucial to specify the length for this struct.

>>> struct = String(100 or this.length) # static integer or context lambda
>>> # takes static length, context lambda, another atom or ... for greedy parsing
>>> s = cstring(100)

2.1.4.3. Prefixed#

The Prefixed class introduces so-called Pascal strings for raw bytes and strings. It first reads a length using the given struct and then reads that many bytes from the stream. If no encoding is specified, the returned value will be of type bytes.

>>> field = F(Prefixed(uint8, encoding="utf-8"))
>>> pack("Hello, World!", field)
b'\rHello, World!'
>>> unpack(field, _)
'Hello, World!'
>>> s = pstring(u8)
>>> pack("Hello, World!", s)
b'\rHello, World!'
>>> unpack(_, s)
'Hello, World!'
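The byte layout in the output above (a one-byte length, 13 or `\r`, followed by the payload) can be reproduced with the standard struct module. The helpers `pack_prefixed` and `unpack_prefixed` are hypothetical and only sketch the idea of a length-prefixed string with a single unsigned byte as prefix.

```python
import struct


def pack_prefixed(text: str, encoding: str = "utf-8") -> bytes:
    data = text.encode(encoding)
    if len(data) > 0xFF:
        raise ValueError("payload too long for a one-byte length prefix")
    return struct.pack("B", len(data)) + data    # prefix, then payload


def unpack_prefixed(data: bytes, encoding: str = "utf-8") -> str:
    (length,) = struct.unpack_from("B", data)    # read the prefix first
    body = data[1:1 + length]                    # then exactly `length` bytes
    if len(body) != length:
        raise ValueError("stream ended before the announced length")
    return body.decode(encoding)


packed = pack_prefixed("Hello, World!")
```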

2.1.5. Byte Sequences#

2.1.5.1. Memory#

When dealing with data that can be stored in memory and you intend to print out your unpacked object, the Memory struct is recommended.

>>> m = F(Memory(5)) # static size; dynamic size is allowed too
>>> pack(bytes([i for i in range(5)]), m)
b'\x00\x01\x02\x03\x04'
>>> unpack(m, _)
<memory at 0x00000204FDFA4411>

The C API does not support Memory yet.

2.1.5.2. Bytes#

If direct access to the bytes is what you need, the Bytes struct comes in handy. It converts the memoryview to bytes. Additionally, as mentioned earlier, you can use the Prefixed class to unpack bytes of a prefixed size.

>>> field = F(Bytes(5)) # static, dynamic and greedy size allowed
>>> b = octetstring(5) # static, dynamic size allowed
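The difference between the two result types can be seen with plain Python objects. This sketch uses only built-ins and makes no claim about how caterpillar creates its views internally.

```python
data = bytes(range(5))

view = memoryview(data)     # wraps the existing buffer without copying
window = view[1:4]          # slicing a memoryview does not copy either
copied = bytes(window)      # converting to bytes makes an explicit copy

summary = repr(window)      # compact representation, e.g. "<memory at 0x...>"
```

A memoryview keeps printed output short for large payloads, while bytes gives direct access to the content at the cost of a copy.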

With the gained knowledge, let’s implement the struct for the fDAT chunk of our PNG format. It should look like this:

Implementation for the frame data chunk#
@struct(order=BigEndian)                    # <-- endianness as usual
class FDATChunk:
    sequence_number: uint32
    # We prefer a Memory instance here over Bytes()
    frame_data: Memory(parent.length - 4)

Implementation for the frame data chunk#
parent = ContextPath("parent.obj")

@struct(endian=BIG_ENDIAN)
class FDATChunk:
    sequence_number: u32
    frame_data: octetstring(parent.length - 4)

Challenge

If you feel ready for a more advanced structure, try implementing the zTXt chunk for compressed textual data.

Solution

Python API only:

Sample implementation of the zTXt chunk#
@struct                             # <-- actually, we don't need a specific byteorder
class ZTXTChunk:
    keyword: CString(...)           # <-- variable length
    compression_method: uint8
    # Okay, we haven't introduced this struct yet, but Memory() or Bytes()
    # would have been okay, too.
    text: ZLibCompressed(parent.length - lenof(this.keyword) - 1)

2.1.6. Padding#

In certain scenarios, you may need to apply padding to your structs. caterpillar doesn’t store any data associated with paddings. If you need to retain the content of a padding, you can use Bytes or Memory again. For example:

>>> field = padding[10]  # padding always with a length

Tip

That was a lot of input to take, time for a coffee break! ☕

2.2. Context#

Caterpillar uses a special Context to keep track of the current packing or unpacking process. A context contains special variables, which are discussed in the Context reference in detail.

The current object that is being packed or parsed can be referenced with the shortcut this. Additionally, the parent object (if any) can be referenced by using parent.

Understanding the context#
@struct
class Format:
    length: uint8
    foo: CString(this.length)   # <-- just reference the length field

Understanding the context#
this = ContextPath("obj")

@struct
class Format:
    length: u8
    foo: cstring(this.length)

Note

You can apply any operation on context paths. However, be aware that conditional branches must be encapsulated by lambda expressions.
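The reason behind this restriction is a general Python one: arithmetic operators can be overloaded to build a deferred expression, but `a if cond else b` evaluates its condition immediately and cannot be overloaded. The toy `Path` class below is a hypothetical stand-in, not caterpillar's ContextPath, and supports only attribute access and subtraction.

```python
from types import SimpleNamespace


class Path:
    """Toy deferred expression that is evaluated against a context later."""

    def __init__(self, getter):
        self._getter = getter

    def __call__(self, context):
        return self._getter(context)

    def __getattr__(self, name):
        if name.startswith("_"):
            raise AttributeError(name)
        return Path(lambda ctx: getattr(self(ctx), name))

    def __sub__(self, other):
        # Operators return a new deferred expression instead of a value.
        return Path(lambda ctx: self(ctx) - resolve(other, ctx))


def resolve(value, context):
    return value(context) if isinstance(value, Path) else value


this = Path(lambda ctx: ctx)

body_length = this.length - 4                    # deferred arithmetic
# A conditional must be wrapped in a lambda so that it runs per context:
clamped = lambda ctx: body_length(ctx) if this.length(ctx) > 4 else 0

ctx = SimpleNamespace(length=10)
```

Writing `body_length if this.length > 4 else 0` without the lambda would be evaluated once, at definition time, rather than for each context.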

2.2.1. Runtime length of objects#

To retrieve the runtime length of a variable that is accessible in the current context, there is a special class designed for that use case: lenof.

You might have seen this special class before when calculating the length of some strings. It simply applies the len(...) function to the retrieved variable.

Tip

To access elements of a sequence within the context, you can just use this.foobar[...].

2.3. Standard Structs#

We still have some important struct types to discuss before we can start defining complex structs.

2.3.1. Constants#

Proprietary and other binary file formats often store magic bytes, usually at the start of the data stream. Constant values are validated against the parsed data and applied to the class automatically, eliminating the need to pass them to the constructor every time.

2.3.1.1. ConstBytes#

These constants can be defined implicitly by annotating a field in a struct class with bytes. For example, in the case of starting the main PNG struct:

Starting the main PNG struct#
@struct(order=BigEndian) # <-- will be relevant later on
class PNG:
    magic: b"\x89PNG\x0D\x0A\x1A\x0A"
    # other fields will be defined at the end of this tutorial.

2.3.1.2. Const#

Raw constant values require a struct to be defined to parse or build the value. For example:

>>> field = F(Const(0xbeef, uint32))
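The validation step of a constant can be sketched with the standard struct module. `pack_const` and `unpack_const` are hypothetical helpers; the big-endian 32-bit format string is an assumption made for this example only.

```python
import struct


def pack_const(expected: int, fmt: str = ">I") -> bytes:
    """Always write the expected value; the caller supplies nothing."""
    return struct.pack(fmt, expected)


def unpack_const(data: bytes, expected: int, fmt: str = ">I") -> int:
    """Read a value and validate it against the expected constant."""
    (value,) = struct.unpack_from(fmt, data)
    if value != expected:
        raise ValueError(f"expected {expected:#x}, got {value:#x}")
    return value


magic = pack_const(0xBEEF)
```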

2.3.2. Compression#

This library also supports default compression formats like zlib, lzma, bz2 and, if installed via pip, lzo (using lzallright).

>>> field = ZLibCompressed(100) # length or struct here applicable
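As a rough picture of what such a struct does, the sketch below round-trips data through the standard zlib module. It assumes that the length argument denotes the number of compressed bytes to read from the stream; `pack_compressed` and `unpack_compressed` are hypothetical helpers.

```python
import zlib


def pack_compressed(payload: bytes) -> bytes:
    return zlib.compress(payload)


def unpack_compressed(stream: bytes, length: int) -> bytes:
    # Assumption: `length` is the size of the compressed region.
    return zlib.decompress(stream[:length])


original = b"caterpillar " * 20
compressed = pack_compressed(original)
# Trailing bytes after the compressed region are ignored by the slice.
restored = unpack_compressed(compressed + b"next-field", len(compressed))
```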

2.3.3. Specials#

All of the following structs may be used in special situations where all other previously discussed structs can’t be used.

2.3.3.1. Computed#

A runtime computed variable that does not pack any data. It is rarely recommended to use this struct, because you can simply define a @property or method for what this struct represents, unless you need the value later on while packing or unpacking.

>>> struct = Computed(this.foobar) # context lambda or constant value

Challenge

Implement the gAMA chunk for our PNG format and use a Computed struct to calculate the real gamma value.

Solution
Example implementation of the gAMA chunk#
@struct(order=BigEndian)    # <-- same as usual
class GAMAChunk:
    gamma: uint32
    gamma_value: Computed(this.gamma / 100000)

Note

Question: Do we really need to introduce the gamma_value using a Computed struct here or can we just define a method?
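One possible answer, sketched with a plain dataclass rather than a caterpillar struct: if the value is only needed after unpacking, a property computes it on demand and nothing extra has to be stored. Whether a property can be combined with @struct in exactly this way is not shown here and would need to be checked against the library.

```python
from dataclasses import dataclass


@dataclass
class GAMAChunk:
    gamma: int  # stored value: real gamma multiplied by 100000

    @property
    def gamma_value(self) -> float:
        # Derived on access instead of being kept as a separate field.
        return self.gamma / 100000


chunk = GAMAChunk(gamma=45455)
```

A Computed field remains the better fit when other fields need the value while packing or unpacking is still in progress.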

2.3.3.2. Pass#

In case nothing should be done, just use Pass. This struct won’t affect the stream in any way.


Important

Congratulations! You have successfully mastered the basics of caterpillar! Are you ready for the next level? Brace yourself for some breathtaking action!