2. Basic Concepts#
In this section, we’ll explore some common techniques used in binary file formats, setting the stage for more advanced topics in the next chapter.
Note
Some examples using the interpreter prompts make use of a shortcut to define Field objects:
>>> from caterpillar.shortcuts import F
>>> field = F(uint8)
2.1. Standard Types#
2.1.1. Numbers#
When dealing with binary data, numbers play a crucial role. Besides the default integer types (e.g., uint8, uint16, etc.), caterpillar introduces some special integer formats. The default types include:
Unsigned (u...) and signed: int8, uint8, int16, uint16, int24, uint24, int32, uint32, int64, uint64, ssize_t, size_t
Floating point: float16, float32, float64
Special primitives: boolean, char, void_ptr
2.1.1.1. Custom-sized integer#
Integers with a custom size are also supported. The size is given in bits when the struct is defined; internally, only the bytes needed to hold that many bits are used. For example:
Python:
>>> field = F(Int(24)) # three-byte signed integer
>>> field = F(UInt(40)) # five-byte unsigned integer
Caterpillar C:
>>> i48 = Int(48) # six-byte signed integer
>>> u40 = Int(40, signed=False) # five-byte unsigned integer
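To make the byte-level effect of a custom bit count concrete, here is a minimal standard-library sketch. It is not caterpillar's implementation: the helper names pack_int and unpack_int are hypothetical, and big-endian byte order is an assumption chosen for the example.

```python
def pack_int(value: int, bits: int, signed: bool = True) -> bytes:
    """Encode `value` into bits // 8 bytes (big-endian, chosen for illustration)."""
    if bits % 8:
        raise ValueError("this sketch only handles whole-byte widths")
    return value.to_bytes(bits // 8, "big", signed=signed)


def unpack_int(data: bytes, bits: int, signed: bool = True) -> int:
    """Decode an integer that occupies exactly bits // 8 bytes."""
    if len(data) != bits // 8:
        raise ValueError("unexpected number of bytes")
    return int.from_bytes(data, "big", signed=signed)


encoded = pack_int(-2, 24)         # a 24-bit value occupies three bytes
decoded = unpack_int(encoded, 24)  # round-trips back to -2
```

A 40-bit unsigned value would occupy five bytes in the same way, which matches the comments in the examples above.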
2.1.1.2. Variable-sized integer#
The built-in struct py.VarInt/c.VarInt supports parsing and building integers of variable length. Its documentation explains the different configurations in detail.
Python:
>>> field = F(vint) # or F(VarInt())
Caterpillar C:
>>> # use 'varint' directly or use VarInt()
>>> be_varint = BIG_ENDIAN + varint
>>> le_varint = VarInt(little_endian=True)
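For readers new to variable-length integers, the sketch below shows one widely used scheme: seven payload bits per byte, least significant group first, with the high bit marking that more bytes follow. This is an assumption made for illustration; the configurations supported by VarInt may differ, and the function names are hypothetical.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer, 7 bits per byte, least significant group first."""
    if value < 0:
        raise ValueError("only non-negative integers are supported")
    out = bytearray()
    while True:
        group = value & 0x7F
        value >>= 7
        if value:
            out.append(group | 0x80)  # high bit set: more bytes follow
        else:
            out.append(group)         # high bit clear: this is the last byte
            return bytes(out)


def decode_varint(data: bytes) -> int:
    """Decode the integer produced by encode_varint."""
    result = 0
    for index, byte in enumerate(data):
        result |= (byte & 0x7F) << (7 * index)
        if not byte & 0x80:
            return result
    raise ValueError("truncated varint")


encoded = encode_varint(300)       # two bytes are enough for 300
decoded = decode_varint(encoded)
```

Small values fit in a single byte, while larger values grow one byte at a time, which is the main appeal of this family of encodings.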
2.1.2. Enumerations#
Enums are essential when working with binary file formats, and caterpillar integrates standard Python enumerations (classes extending enum.Enum) with ease.
Let’s revisit the pHYS chunk to add an enum to the last field.
Python:
import enum

class PHYSUnit(enum.IntEnum): # <-- the enum value doesn't have to be int
    __struct__ = uint8 # <-- to make the code even more compact, use this
    UNKNOWN = 0
    METRE = 1

@struct(order=BigEndian) # <-- same as before
class PHYSChunk:
    pixels_per_unit_x: uint32
    pixels_per_unit_y: uint32
    unit: PHYSUnit # <-- now we have an auto-enumeration
Caterpillar C:
import enum

class PHYSUnit(enum.IntEnum): # <-- the enum value doesn't have to be int
    UNKNOWN = 0
    METRE = 1

@struct(endian=BIG_ENDIAN) # <-- same as before
class PHYSChunk:
    pixels_per_unit_x: u32
    pixels_per_unit_y: u32
    unit: enumeration(u8, PHYSUnit) # <-- atom is required here
Important
It’s worth noting that a default value can be specified for the field as a fallback. If none is provided, and an unpacked value not in the enumeration is encountered, an error will be triggered.
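The fallback behaviour described in this note can be pictured with a plain-Python sketch. It uses only the standard library (struct and enum), so it illustrates the idea rather than caterpillar's own code; the function parse_phys and its default parameter are hypothetical.

```python
import enum
import struct


class PHYSUnit(enum.IntEnum):
    UNKNOWN = 0
    METRE = 1


def parse_phys(data: bytes, default=None):
    """Parse a 9-byte payload: two big-endian uint32 values and one uint8 unit."""
    x, y, raw_unit = struct.unpack(">IIB", data)
    try:
        unit = PHYSUnit(raw_unit)
    except ValueError:
        if default is None:
            raise  # no fallback given: an unknown value is an error
        unit = default
    return x, y, unit


known = parse_phys(bytes.fromhex("00000b13" "00000b13" "01"))
fallback = parse_phys(
    bytes.fromhex("00000b13" "00000b13" "07"), default=PHYSUnit.UNKNOWN
)
```

Without a default, the second call would raise a ValueError because 7 is not a member of PHYSUnit.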
2.1.3. Arrays/Lists#
Binary formats often require storing multiple objects of the same type sequentially. Caterpillar simplifies this with item access for defining arrays of static or dynamic size.
We started with the PLTE chunk, which stores three-byte sequences. We can define an array of RGB objects as follows:
Python:
>>> PLTEChunk = RGB[this.length / 3]
Caterpillar C:
>>> PLTEChunk = RGB.__struct__[ContextPath("obj.length") / 3]
Added in version 2.2.0: The syntax will be changed once __class_getitem__ is implemented by any c.Struct instance.
Since this chunk has only one field, the array specifier is used to make it a list type. The length is calculated based on the chunk’s length field divided by three because the RGB class occupies three bytes.
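The length calculation can be checked by hand with a small standard-library sketch. The RGB named tuple and parse_palette function below are hypothetical stand-ins written for this example; they are not the classes from the tutorial.

```python
from typing import NamedTuple


class RGB(NamedTuple):
    r: int
    g: int
    b: int


def parse_palette(payload: bytes, chunk_length: int) -> list[RGB]:
    """Split a PLTE payload into RGB entries of three bytes each."""
    if chunk_length % 3:
        raise ValueError("PLTE length must be a multiple of three")
    count = chunk_length // 3  # number of array elements
    return [RGB(*payload[i * 3 : i * 3 + 3]) for i in range(count)]


palette = parse_palette(bytes([255, 0, 0, 0, 255, 0]), 6)
```

A chunk length of 6 therefore yields two entries, one per three-byte group.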
2.1.4. String Types#
2.1.4.1. CString#
The CString in this library extends beyond a mere reference to C strings. It provides additional functionality, as demonstrated in the structure of the next chunk.
Python:
from caterpillar.py import *
from caterpillar.shortcuts import lenof

@struct
class TEXTChunk:
    # dynamic sized string that ends with a null-byte
    keyword: CString(encoding="ISO-8859-1")
    # static sized string based on the current context. some notes:
    # - parent.length is the current chunk's length
    # - lenof(...) is the runtime length of the context variable
    # - 1 because of the extra null-byte that is stripped from keyword
    text: CString(encoding="ISO-8859-1", length=parent.length - lenof(this.keyword) - 1)
Caterpillar C:
from caterpillar.c import * # <-- main difference
from caterpillar.shortcuts import lenof

# NOTE: lenof works here, because Caterpillar C's Context implements
# the 'Context Protocol'.
parent = ContextPath("parent.obj")
this = ContextPath("obj")

@struct
class TEXTChunk:
    # dynamic sized string that ends with a null-byte
    keyword: cstring(encoding="ISO-8859-1")
    # static sized string based on the current context. some notes:
    # - parent.length is the current chunk's length
    # - lenof(...) is the runtime length of the context variable
    # - 1 because of the extra null-byte that is stripped from keyword
    text: cstring(encoding="ISO-8859-1", length=parent.length - lenof(this.keyword) - 1)
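The length arithmetic from the comments above can be traced with a standard-library sketch. It is only an illustration of the calculation, not how caterpillar parses the chunk; parse_text_chunk is a hypothetical helper.

```python
def parse_text_chunk(payload: bytes, chunk_length: int) -> tuple[str, str]:
    """Split a payload into a null-terminated keyword and the remaining text."""
    end = payload.index(b"\x00")  # raises ValueError if no terminator exists
    keyword = payload[:end].decode("ISO-8859-1")
    # chunk length minus the keyword minus the null byte that was stripped
    text_length = chunk_length - len(keyword) - 1
    start = end + 1
    text = payload[start : start + text_length].decode("ISO-8859-1")
    return keyword, text


payload = b"Title\x00Hello"
keyword, text = parse_text_chunk(payload, len(payload))
```

With an 11-byte chunk and a 5-character keyword, 11 - 5 - 1 leaves 5 bytes for the text.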
Challenge
You are now ready to implement the iTXt chunk. Try it yourself!
Solution
This solution serves as an example and isn’t the only way to approach it!
Python:
@struct
class ITXTChunk:
    keyword: CString(encoding="utf-8")
    compression_flag: uint8
    # we actually don't need an Enum here
    compression_method: uint8
    language_tag: CString(encoding="ASCII")
    translated_keyword: CString(encoding="utf-8")
    # length is calculated with parent.length - len(keyword)+len(b"\x00") - ...
    text: CString(
        encoding="utf-8",
        length=parent.length - lenof(this.translated_keyword) - lenof(this.keyword) - 5,
    )
Caterpillar C:
from caterpillar.c import * # <-- main difference
from caterpillar.shortcuts import lenof

parent = ContextPath("parent.obj")
this = ContextPath("obj")

@struct
class ITXTChunk:
    keyword: cstring() # default encoding is "utf-8"
    compression_flag: u8
    # we actually don't need an Enum here
    compression_method: u8
    language_tag: cstring(encoding="ASCII")
    translated_keyword: cstring(...) # explicit greedy parsing
    # length is calculated with parent.length - len(keyword)+len(b"\x00") - ...
    text: cstring(
        parent.length - lenof(this.translated_keyword) - lenof(this.keyword) - 5,
    )
You can also apply your own termination character, for example:
Python:
>>> struct = CString(pad="\x0A")
Caterpillar C:
>>> s = cstring(sep="\x0A")
This struct will use a line feed (\x0A) as the termination character and strip all trailing padding bytes.
2.1.4.2. String#
Besides the special C strings, there’s a default String class that implements the basic behaviour of a string. It’s crucial to specify the length for this struct.
Python:
>>> struct = String(100 or this.length) # static integer or context lambda
Caterpillar C:
>>> # takes static length, context lambda, another atom or ... for greedy parsing
>>> s = cstring(100)
2.1.4.3. Prefixed#
The Prefixed class introduces so-called Pascal strings for raw bytes and strings. If no encoding is specified, the returned value will be of type bytes. This class first reads a length using the given struct and then reads that many bytes from the stream.
Python:
>>> field = F(Prefixed(uint8, encoding="utf-8"))
>>> pack("Hello, World!", field)
b'\rHello, World!'
>>> unpack(field, _)
'Hello, World!'
Caterpillar C:
>>> s = pstring(u8)
>>> pack("Hello, World!", s)
b'\rHello, World!'
>>> unpack(_, s)
'Hello, World!'
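The leading \r in the output above is the length byte: the string is 13 bytes long, and 13 is the code of the carriage return character. A standard-library sketch of the same layout, with hypothetical helper names and a one-byte prefix assumed, looks like this:

```python
import struct


def pack_prefixed(text: str, encoding: str = "utf-8") -> bytes:
    """Write a one-byte length followed by the encoded text."""
    raw = text.encode(encoding)
    if len(raw) > 0xFF:
        raise ValueError("payload too long for a one-byte length prefix")
    return struct.pack("B", len(raw)) + raw


def unpack_prefixed(data: bytes, encoding: str = "utf-8") -> str:
    """Read the length byte, then decode exactly that many bytes."""
    (length,) = struct.unpack_from("B", data, 0)
    return data[1 : 1 + length].decode(encoding)


packed = pack_prefixed("Hello, World!")
unpacked = unpack_prefixed(packed)
```

A wider prefix struct, such as a 16-bit integer, would simply raise the maximum payload size.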
2.1.5. Byte Sequences#
2.1.5.1. Memory#
When dealing with data that can be stored in memory and you intend to print out your unpacked object, the Memory struct is recommended.
Python:
>>> m = F(Memory(5)) # static size; dynamic size is allowed too
>>> pack(bytes([i for i in range(5)]), m)
b'\x00\x01\x02\x03\x04'
>>> unpack(m, _)
<memory at 0x00000204FDFA4411>
Caterpillar C:
Not supported yet.
2.1.5.2. Bytes#
If direct access to the bytes is what you need, the Bytes struct comes in handy. It converts the memoryview to bytes. Additionally, as mentioned earlier, you can use the Prefixed class to unpack bytes of a prefixed size.
Python:
>>> field = F(Bytes(5)) # static, dynamic and greedy size allowed
Caterpillar C:
>>> b = octetstring(5) # static, dynamic size allowed
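The practical difference between a memoryview and bytes can be seen in plain Python, independent of caterpillar: a view shares the underlying buffer, while converting it to bytes makes an independent copy. The variable names below are chosen for this sketch only.

```python
buffer = bytearray(b"\x00\x01\x02\x03\x04")

view = memoryview(buffer)[1:4]  # no copy: a window onto the same buffer
copy = bytes(view)              # independent, immutable copy of those bytes

buffer[1] = 0xFF                # modify the underlying buffer afterwards

first_in_view = view[0]         # reflects the change
first_in_copy = copy[0]         # still holds the old value
```

This is why a view is cheap for large payloads, whereas bytes is the safer choice when the value must not change later.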
With the gained knowledge, let’s implement the struct for the fDAT chunk of our PNG format. It should look like this:
Python:
@struct(order=BigEndian) # <-- endianness as usual
class FDATChunk:
    sequence_number: uint32
    # We rather use a memory instance here instead of Bytes()
    frame_data: Memory(parent.length - 4)
Caterpillar C:
parent = ContextPath("parent.obj")

@struct(endian=BIG_ENDIAN)
class FDATChunk:
    sequence_number: u32
    frame_data: octetstring(parent.length - 4)
Challenge
If you feel ready for a more advanced structure, try implementing the zTXt chunk for compressed textual data.
Solution
Python API only:
@struct # <-- actually, we don't need a specific byteorder
class ZTXTChunk:
    keyword: CString(...) # <-- variable length
    compression_method: uint8
    # Okay, we haven't introduced this struct yet, but Memory() or Bytes()
    # would have been okay, too.
    text: ZLibCompressed(parent.length - lenof(this.keyword) - 1)
2.1.6. Padding#
In certain scenarios, you may need to apply padding to your structs. caterpillar doesn’t store any data associated with padding. If you need to retain the content of a padding region, you can use Bytes or Memory again. For example:
>>> field = padding[10] # padding always with a length
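In stream terms, padding that is not stored amounts to skipping bytes while reading and emitting filler while writing. The sketch below uses io.BytesIO from the standard library; the helper names are hypothetical, and zero bytes as filler are an assumption made for this example.

```python
import io


def write_with_padding(stream, pad_length: int, value: bytes) -> None:
    """Emit `pad_length` filler bytes (zeros assumed here), then the value."""
    stream.write(b"\x00" * pad_length)
    stream.write(value)


def read_with_padding(stream, pad_length: int, value_length: int) -> bytes:
    """Skip the padding without keeping it, then read the value."""
    stream.seek(pad_length, io.SEEK_CUR)
    return stream.read(value_length)


out = io.BytesIO()
write_with_padding(out, 10, b"AB")
value = read_with_padding(io.BytesIO(out.getvalue()), 10, 2)
```

The padding content never reaches the parsed result, which is why Bytes or Memory is needed when it must be preserved.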
Tip
That was a lot of input to take, time for a coffee break! ☕
2.2. Context#
Caterpillar uses a special Context to keep track of the current packing or unpacking process. A context contains special variables, which are discussed in detail in the Context reference.
The object that is currently being packed or parsed can be referenced with the shortcut this. Additionally, the parent object (if any) can be referenced by using parent.
Python:
@struct
class Format:
    length: uint8
    foo: CString(this.length) # <-- just reference the length field
Caterpillar C:
this = ContextPath("obj")

@struct
class Format:
    length: u8
    foo: cstring(this.length)
Note
You can apply any operation on context paths. However, be aware that conditional branches must be encapsulated by lambda expressions.
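One way to picture why operations work directly on context paths while conditionals need a lambda is a small lazy-expression object. The Path class below is a hypothetical stand-in written for this sketch; it is not caterpillar's ContextPath and only supports two operators.

```python
import operator
from types import SimpleNamespace


class Path:
    """Hypothetical stand-in: records a lookup and evaluates it later."""

    def __init__(self, resolve):
        self._resolve = resolve

    def __getattr__(self, name):
        # Only called for names not found normally, i.e. field names.
        return Path(lambda ctx: getattr(self._resolve(ctx), name))

    def _binary(self, other, op):
        def resolve(ctx):
            rhs = other(ctx) if isinstance(other, Path) else other
            return op(self._resolve(ctx), rhs)
        return Path(resolve)

    def __sub__(self, other):
        return self._binary(other, operator.sub)

    def __floordiv__(self, other):
        return self._binary(other, operator.floordiv)

    def __call__(self, ctx):
        return self._resolve(ctx)


this = Path(lambda ctx: ctx.obj)
expr = this.length - 4  # nothing is computed yet; the subtraction is recorded

# An `if` would be evaluated immediately at definition time, so a
# conditional has to be deferred explicitly with a lambda.
conditional = lambda ctx: 0 if ctx.obj.length < 4 else ctx.obj.length - 4

ctx = SimpleNamespace(obj=SimpleNamespace(length=12))
value = expr(ctx)
```

Operators can be overloaded to build a deferred expression, but Python offers no hook for if/else, which is the reason for the lambda requirement.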
2.2.1. Runtime length of objects#
In cases where you want to retrieve the runtime length of a variable that is accessible from the current context, there is a special class designed for that use case: lenof.
You might have seen this class before when calculating the length of some strings. It simply applies the len(...) function to the retrieved variable.
Tip
To access elements of a sequence within the context, you can just use this.foobar[...].
2.3. Standard Structs#
We still have some important struct types to discuss before we can start defining complex structs.
2.3.1. Constants#
Proprietary and other binary file formats often store magic bytes, usually at the start of the data stream. Constant values are validated against the parsed data and applied to the class automatically, eliminating the need to pass them to the constructor every time.
2.3.1.1. ConstBytes#
These constants can be defined implicitly by annotating a field in a struct class with bytes. For example, in the case of starting the main PNG struct:
@struct(order=BigEndian) # <-- will be relevant later on
class PNG:
    magic: b"\x89PNG\x0D\x0A\x1A\x0A"
    # other fields will be defined at the end of this tutorial.
2.3.1.2. Const#
Raw constant values require a struct to be defined to parse or build the value. For example:
>>> field = F(Const(0xbeef, uint32))
2.3.2. Compression#
This library also supports common compression formats such as zlib, lzma, bz2 and, if installed via pip, lzo (using lzallright).
>>> field = ZLibCompressed(100) # length or struct here applicable
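As a reminder of what happens to the wrapped bytes, here is the zlib round trip using Python's standard zlib module. It shows only the compression step itself, not how ZLibCompressed is implemented, and the sample data is made up for this sketch.

```python
import zlib

original = b"caterpillar " * 20

compressed = zlib.compress(original)    # what would be written to the stream
restored = zlib.decompress(compressed)  # what unpacking would hand back
```

Because the compressed size is not known in advance, such a field needs a length or a struct that delimits it.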
2.3.3. Specials#
All of the following structs may be used in special situations where all other previously discussed structs can’t be used.
2.3.3.1. Computed#
A runtime computed variable that does not pack any data. This struct is rarely needed, because you can simply define a @property or method for what this struct represents, unless you need the value later on while packing or unpacking.
>>> struct = Computed(this.foobar) # context lambda or constant value
Challenge
Implement the gAMA chunk for our PNG format and use a Computed struct to calculate the real gamma value.
Solution
@struct(order=BigEndian) # <-- same as usual
class GAMAChunk:
    gamma: uint32
    gamma_value: Computed(this.gamma / 100000)
Note
Question: Do we really need to introduce the gamma_value using a Computed struct here or can we just define a method?
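One possible answer, sketched with a plain dataclass rather than a caterpillar struct: a property derives the real gamma on demand, so nothing extra has to be stored or packed. The from_bytes helper is hypothetical and exists only to keep the example self-contained.

```python
import struct
from dataclasses import dataclass


@dataclass
class GAMAChunk:
    gamma: int  # stored value: real gamma multiplied by 100000

    @property
    def gamma_value(self) -> float:
        """Real gamma, derived from the stored integer on access."""
        return self.gamma / 100000

    @classmethod
    def from_bytes(cls, data: bytes) -> "GAMAChunk":
        (gamma,) = struct.unpack(">I", data)  # big-endian uint32
        return cls(gamma)


chunk = GAMAChunk.from_bytes((45455).to_bytes(4, "big"))
real_gamma = chunk.gamma_value
```

A Computed field remains the better fit when the derived value is needed by other fields while packing or unpacking.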
2.3.3.2. Pass#
In case nothing should be done, just use Pass. This struct won’t affect the stream in any way.
Important
Congratulations! You have successfully mastered the basics of caterpillar! Are you ready for the next level? Brace yourself for some breathtaking action!