2.2. String Types
In binary file formats, string handling can be complex due to varying lengths, encodings, and termination methods. Caterpillar provides several specialized string types to manage these intricacies with ease.
2.2.1. Default Strings
The String type is the standard choice for regular string handling. It requires a length, which can be either a fixed integer or a value resolved from the context at runtime.
>>> # Python API: the length is a static integer or a context lambda
>>> s = String(100)
>>> s = String(this.length)
>>> # Caterpillar C API: takes a static length, a context lambda, or ... for greedy parsing
>>> s = string(100)
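To make the difference between a static and a dynamic length concrete, the following standard-library sketch shows what reading a fixed-size string field amounts to. It does not use Caterpillar; `read_fixed_string` is a hypothetical helper introduced only for illustration, and it makes no claim about how the library handles padding or short input.

```python
from typing import Tuple


def read_fixed_string(data: bytes, length: int, encoding: str = "utf-8") -> Tuple[str, bytes]:
    """Decode exactly `length` bytes from the front of `data`.

    Hypothetical helper for illustration; not part of Caterpillar.
    Returns the decoded string and the unconsumed remainder.
    """
    if length < 0:
        raise ValueError("length must be non-negative")
    if len(data) < length:
        raise ValueError(f"need {length} bytes, got {len(data)}")
    return data[:length].decode(encoding), data[length:]


# Static length: always consume five bytes.
value, rest = read_fixed_string(b"helloworld", 5)

# Dynamic length: the size comes from a field that was parsed earlier,
# which is the role a context lambda such as this.length plays.
header_length = 3
dynamic_value, _ = read_fixed_string(b"abcdef", header_length)
```

In both cases the caller must know the length before the string is read, which is exactly what the length argument expresses.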
2.2.2. CString
The CString type is used to handle strings that end with a null byte. It extends beyond simple C-style strings. Here’s how you might define a structure using a CString:
from caterpillar.py import *
from caterpillar.shortcuts import lenof

@struct
class TEXTChunk:
    # dynamically sized string that ends with a null byte
    keyword: CString(encoding="ISO-8859-1")
    # statically sized string based on the current context. Some notes:
    # - parent.length is the current chunk's length
    # - lenof(...) is the runtime length of the context variable
    # - 1 because of the extra null byte that is stripped from keyword
    text: CString(encoding="ISO-8859-1", length=parent.length - lenof(this.keyword) - 1)
from caterpillar.c import * # <-- main difference
from caterpillar.shortcuts import lenof

# NOTE: lenof works here, because Caterpillar C's Context implements
# the 'Context Protocol'.
parent = ContextPath("parent.obj")
this = ContextPath("obj")

@struct
class TEXTChunk:
    # dynamically sized string that ends with a null byte
    keyword: cstring(encoding="ISO-8859-1")
    # statically sized string based on the current context. Some notes:
    # - parent.length is the current chunk's length
    # - lenof(...) is the runtime length of the context variable
    # - 1 because of the extra null byte that is stripped from keyword
    text: cstring(encoding="ISO-8859-1", length=parent.length - lenof(this.keyword) - 1)
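The length arithmetic in the structs above can be checked by hand. The following standard-library sketch splits a tEXt chunk body without Caterpillar; `parse_text_chunk` is a hypothetical helper, and `len(body)` stands in for `parent.length` under the assumption that the function receives exactly the chunk's data bytes.

```python
from typing import Tuple


def parse_text_chunk(body: bytes, encoding: str = "ISO-8859-1") -> Tuple[str, str]:
    """Split a tEXt chunk body into (keyword, text).

    Hypothetical helper for illustration; not part of Caterpillar.
    """
    # The keyword runs up to the first null byte (ValueError if there is none).
    end = body.index(b"\x00")
    keyword_raw = body[:end]

    # Same arithmetic as in the struct definition:
    # chunk length - keyword length - 1 byte for the stripped terminator.
    text_length = len(body) - len(keyword_raw) - 1
    text_raw = body[end + 1 : end + 1 + text_length]

    return keyword_raw.decode(encoding), text_raw.decode(encoding)


keyword, text = parse_text_chunk(b"Title\x00Sunset over the bay")
```

The text itself carries no terminator, which is why its size has to be derived from the chunk length rather than searched for.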
Challenge
Try implementing the iTXt chunk from the PNG format. This chunk uses a combination of strings and fixed-length fields. Here’s a possible solution:
Solution
This solution serves as an example and isn’t the only way to approach it!
from caterpillar.py import *
from caterpillar.shortcuts import lenof

@struct
class ITXTChunk:
    keyword: CString(encoding="utf-8")
    compression_flag: uint8
    # we actually don't need an Enum here
    compression_method: uint8
    language_tag: CString(encoding="ASCII")
    translated_keyword: CString(encoding="utf-8")
    # text length = parent.length minus everything stored before the text:
    # the three strings, their three null bytes and the two uint8 fields (3 + 2 = 5)
    text: CString(
        encoding="utf-8",
        length=parent.length
        - lenof(this.keyword)
        - lenof(this.language_tag)
        - lenof(this.translated_keyword)
        - 5,
    )
from caterpillar.c import * # <-- main difference
from caterpillar.shortcuts import lenof

parent = ContextPath("parent.obj")
this = ContextPath("obj")

@struct
class ITXTChunk:
    keyword: cstring() # default encoding is "utf-8"
    compression_flag: u8
    # we actually don't need an Enum here
    compression_method: u8
    language_tag: cstring(encoding="ASCII")
    translated_keyword: cstring(...) # explicit greedy parsing
    # text length = parent.length minus everything stored before the text:
    # the three strings, their three null bytes and the two u8 fields (3 + 2 = 5)
    text: cstring(
        parent.length
        - lenof(this.keyword)
        - lenof(this.language_tag)
        - lenof(this.translated_keyword)
        - 5,
    )
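To see where the constant 5 comes from, it can help to walk through an iTXt body by hand. The sketch below uses only the standard library and assumes an uncompressed chunk; `parse_itxt_body` is a hypothetical helper, the encodings mirror the ones used in the solution, and `len(body)` plays the role of `parent.length`. Note that every field stored before the text, including the language tag, counts toward the bytes that must be subtracted.

```python
def parse_itxt_body(body: bytes) -> dict:
    """Parse an uncompressed iTXt chunk body by hand (illustrative only)."""
    offset = 0

    def take_until_null() -> bytes:
        # Read up to the next null byte and step over the terminator.
        nonlocal offset
        end = body.index(b"\x00", offset)  # ValueError if no terminator
        raw = body[offset:end]
        offset = end + 1
        return raw

    keyword = take_until_null()
    compression_flag = body[offset]
    compression_method = body[offset + 1]
    offset += 2
    if compression_flag != 0:
        raise ValueError("compressed text is not handled in this sketch")
    language_tag = take_until_null()
    translated_keyword = take_until_null()

    # 3 null terminators + 2 single-byte fields = 5 bytes of overhead.
    text_length = (
        len(body) - len(keyword) - len(language_tag) - len(translated_keyword) - 5
    )
    text = body[offset : offset + text_length]

    return {
        "keyword": keyword.decode("utf-8"),
        "compression_method": compression_method,
        "language_tag": language_tag.decode("ascii"),
        "translated_keyword": translated_keyword.decode("utf-8"),
        "text": text.decode("utf-8"),
    }


sample = (
    b"Title\x00"          # keyword + terminator
    + b"\x00\x00"         # compression flag, compression method
    + b"en\x00"           # language tag + terminator
    + b"Titel\x00"        # translated keyword + terminator
    + "Gr\u00fc\u00dfe".encode("utf-8")  # text, not null-terminated
)
parsed = parse_itxt_body(sample)
```

All lengths here are byte counts, so multi-byte UTF-8 characters in the text do not disturb the calculation.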
You can also customize the string’s termination character if needed:
>>> # Python API
>>> s = CString(pad="\x0A")
>>> # Caterpillar C API
>>> s = cstring(sep="\x0A")
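As a rough illustration of what a custom terminator means at the byte level, the following standard-library sketch reads a string up to an arbitrary delimiter. `read_terminated` is a hypothetical helper, not a Caterpillar function, and it only approximates the idea of terminator-based parsing.

```python
from typing import Tuple


def read_terminated(
    data: bytes, terminator: bytes = b"\x00", encoding: str = "utf-8"
) -> Tuple[str, bytes]:
    """Read a string up to `terminator` and return (value, remaining bytes).

    Hypothetical helper for illustration; not part of Caterpillar.
    """
    end = data.find(terminator)
    if end == -1:
        raise ValueError("terminator not found")
    # The terminator itself is consumed but not included in the value.
    return data[:end].decode(encoding), data[end + len(terminator):]


# Use a line feed (0x0A) instead of the default null byte.
line, rest = read_terminated(b"first line\nsecond line\n", terminator=b"\x0A")
```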
2.2.3. Pascal Strings
The Prefixed class implements Pascal strings, where the length of the string is prefixed to the actual data. This is useful when dealing with raw bytes or strings with a length indicator.
>>> # Python API
>>> s = Prefixed(uint8, encoding="utf-8")
>>> pack("Hello, World!", s, as_field=True)
b'\rHello, World!'
>>> unpack(s, _, as_field=True)
'Hello, World!'
>>> # Caterpillar C API (note the different argument order of unpack)
>>> s = pstring(u8)
>>> pack("Hello, World!", s)
b'\rHello, World!'
>>> unpack(_, s)
'Hello, World!'
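The leading `\r` in the packed output is the length prefix: the string is 13 bytes long, and byte 13 (0x0D) is the carriage return character. A minimal standard-library sketch of the same layout, assuming a one-byte unsigned prefix, might look like this; `pack_pascal` and `unpack_pascal` are hypothetical names and not part of Caterpillar.

```python
import struct


def pack_pascal(value: str, encoding: str = "utf-8") -> bytes:
    """Encode `value` with a one-byte length prefix (illustrative only)."""
    raw = value.encode(encoding)
    if len(raw) > 0xFF:
        raise ValueError("string too long for a one-byte length prefix")
    return struct.pack("B", len(raw)) + raw


def unpack_pascal(data: bytes, encoding: str = "utf-8") -> str:
    """Decode a string that carries a one-byte length prefix."""
    (length,) = struct.unpack_from("B", data, 0)
    raw = data[1 : 1 + length]
    if len(raw) != length:
        raise ValueError("data is shorter than the length prefix claims")
    return raw.decode(encoding)


packed = pack_pascal("Hello, World!")
restored = unpack_pascal(packed)
```

The prefix counts encoded bytes rather than characters, so a wider prefix type is needed once the payload can exceed 255 bytes.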