2.2. String Types

In binary file formats, string handling can be complex due to varying lengths, encodings, and termination methods. Caterpillar provides several specialized string types to manage these intricacies with ease.

2.2.1. Default Strings

String is the standard type for regular string handling. It requires a length, which can be either fixed or dynamic (resolved from the context while parsing).

>>> s = String(100)          # static integer length
>>> s = String(this.length)  # or a context lambda resolved while parsing
>>> # Caterpillar C: takes a static length, a context lambda, or ... for greedy parsing
>>> s = string(100)
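
To make the two length modes concrete, the following sketch decodes a string field by hand in plain Python, without using Caterpillar at all. The helper `read_string` and the sample payload are hypothetical and only illustrate the difference between a static length and a length taken from a previously parsed field.

```python
from __future__ import annotations

import struct


def read_string(data: bytes, offset: int, length: int, encoding: str = "utf-8") -> tuple[str, int]:
    """Decode `length` bytes starting at `offset`; return the text and the new offset."""
    end = offset + length
    if end > len(data):
        raise ValueError("not enough data for the requested string length")
    return data[offset:end].decode(encoding), end


# Hypothetical record: a one-byte length field, that many bytes of text,
# then a second field whose size is known in advance.
payload = bytes([5]) + b"helloEXTRA"

# Dynamic length: the value comes from a field parsed earlier,
# which is the role a context lambda such as this.length plays.
(length,) = struct.unpack_from("B", payload, 0)
dynamic_text, offset = read_string(payload, 1, length)

# Static length: the size is fixed up front, like a constant integer argument.
static_text, _ = read_string(payload, offset, 5)
```

In both cases the decoding step is identical; only the source of the length differs.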

2.2.2. CString

The CString type handles strings that end with a null byte. It goes beyond plain C-style strings: it also accepts an encoding and an optional length. Here’s how you might define a structure using a CString, modeled on the PNG tEXt chunk:

The tEXt chunk structure (Python)
from caterpillar.py import *
from caterpillar.shortcuts import lenof

@struct
class TEXTChunk:
    # dynamic sized string that ends with a null-byte
    keyword: CString(encoding="ISO-8859-1")
    # static sized string based on the current context. some notes:
    #   - parent.length is the current chunk's length
    #   - lenof(...) is the runtime length of the context variable
    #   - 1 because of the extra null-byte that is stripped from keyword
    text: CString(encoding="ISO-8859-1", length=parent.length - lenof(this.keyword) - 1)

The tEXt chunk structure (Caterpillar C)
from caterpillar.c import *                 # <-- main difference
from caterpillar.shortcuts import lenof
# NOTE: lenof works here, because Caterpillar C's Context implements
# the 'Context Protocol'.

parent = ContextPath("parent.obj")
this = ContextPath("obj")

@struct
class TEXTChunk:
    # dynamic sized string that ends with a null-byte
    keyword: cstring(encoding="ISO-8859-1")
    # static sized string based on the current context. some notes:
    #   - parent.length is the current chunk's length
    #   - lenof(...) is the runtime length of the context variable
    #   - 1 because of the extra null-byte that is stripped from keyword
    text: cstring(encoding="ISO-8859-1", length=parent.length - lenof(this.keyword) - 1)
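
The length arithmetic used for the text field can be checked by hand. The sketch below splits tEXt chunk data in plain Python, without Caterpillar; the function `split_text_chunk` and the sample bytes are hypothetical, and it assumes the input corresponds to the chunk's data of size `parent.length`.

```python
from __future__ import annotations


def split_text_chunk(data: bytes) -> tuple[str, str]:
    """Split tEXt chunk data into (keyword, text), both decoded as Latin-1."""
    keyword_raw, separator, text_raw = data.partition(b"\x00")
    if not separator:
        raise ValueError("missing null terminator after keyword")
    # Same budget as in the struct definition:
    # total length - length of the keyword - 1 byte for the null terminator.
    expected_text_length = len(data) - len(keyword_raw) - 1
    if len(text_raw) != expected_text_length:
        raise ValueError("text length does not match the remaining chunk data")
    return keyword_raw.decode("ISO-8859-1"), text_raw.decode("ISO-8859-1")


# Hypothetical chunk data: keyword, null byte, then the text.
chunk_data = b"Title\x00Caterpillar demo"
keyword, text = split_text_chunk(chunk_data)
```

The null byte belongs to neither string, which is why one extra byte is subtracted.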

Challenge

Try implementing the iTXt chunk from the PNG format. This chunk uses a combination of strings and fixed-length fields. Here’s a possible solution:

Solution

This solution serves as an example and isn’t the only way to approach it!

@struct
class ITXTChunk:
    keyword: CString(encoding="utf-8")
    compression_flag: uint8
    # we actually don't need an Enum here
    compression_method: uint8
    language_tag: CString(encoding="ASCII")
    translated_keyword: CString(encoding="utf-8")
    # length is parent.length minus everything parsed so far:
    #   - the three strings above, each followed by one null byte (3 bytes)
    #   - the two uint8 fields (2 bytes)
    text: CString(
        encoding="utf-8",
        length=parent.length
        - lenof(this.keyword)
        - lenof(this.language_tag)
        - lenof(this.translated_keyword)
        - 5,
    )

from caterpillar.c import *                 # <-- main difference
from caterpillar.shortcuts import lenof

parent = ContextPath("parent.obj")
this = ContextPath("obj")

@struct
class ITXTChunk:
    keyword: cstring() # default encoding is "utf-8"
    compression_flag: u8
    # we actually don't need an Enum here
    compression_method: u8
    language_tag: cstring(encoding="ASCII")
    translated_keyword: cstring(...) # explicit greedy parsing
    # length is parent.length minus everything parsed so far:
    #   - the three strings above, each followed by one null byte (3 bytes)
    #   - the two u8 fields (2 bytes)
    text: cstring(
        parent.length
        - lenof(this.keyword)
        - lenof(this.language_tag)
        - lenof(this.translated_keyword)
        - 5,
    )

You can also customize the string’s termination character if needed:

>>> s = CString(pad="\x0A")   # Python
>>> s = cstring(sep="\x0A")   # Caterpillar C
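
As an illustration of what a custom terminator means at the byte level, here is a minimal plain-Python sketch that does not use Caterpillar. The helper `read_terminated` and the sample input are hypothetical; the sketch reads up to the first terminator byte, drops it, and returns whatever data follows.

```python
from __future__ import annotations


def read_terminated(
    data: bytes, terminator: bytes = b"\x00", encoding: str = "utf-8"
) -> tuple[str, bytes]:
    """Return the string before the first terminator and the remaining bytes."""
    end = data.find(terminator)
    if end == -1:
        raise ValueError("terminator not found")
    # The terminator itself is consumed but not included in the result.
    return data[:end].decode(encoding), data[end + len(terminator):]


# Newline (0x0A) used as the terminator instead of the default null byte.
line, rest = read_terminated(b"first line\nsecond line\n", terminator=b"\x0A")
```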

2.2.3. Pascal Strings

The Prefixed class implements Pascal strings, where the length of the string is stored as a prefix in front of the actual data. This is useful for raw bytes or strings that carry their own length indicator.

>>> s = Prefixed(uint8, encoding="utf-8")
>>> pack("Hello, World!", s, as_field=True)
b'\rHello, World!'
>>> unpack(s, _, as_field=True)
'Hello, World!'
>>> # Caterpillar C equivalent
>>> s = pstring(u8)
>>> pack("Hello, World!", s)
b'\rHello, World!'
>>> unpack(_, s)
'Hello, World!'
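
The leading `\r` in the packed output is the length prefix: the string is 13 bytes long, and byte value 13 (0x0D) is displayed by Python as `\r`. The following plain-Python sketch reproduces that layout with the standard `struct` module. The helpers `pack_pstring` and `unpack_pstring` are hypothetical names and not part of Caterpillar; they assume a one-byte unsigned length prefix.

```python
import struct


def pack_pstring(text: str, encoding: str = "utf-8") -> bytes:
    """Encode text as a one-byte length prefix followed by the encoded bytes."""
    raw = text.encode(encoding)
    if len(raw) > 0xFF:
        raise ValueError("string too long for a one-byte length prefix")
    return struct.pack("B", len(raw)) + raw


def unpack_pstring(data: bytes, encoding: str = "utf-8") -> str:
    """Read the one-byte length prefix, then decode exactly that many bytes."""
    (length,) = struct.unpack_from("B", data, 0)
    raw = data[1:1 + length]
    if len(raw) != length:
        raise ValueError("data is shorter than the length prefix claims")
    return raw.decode(encoding)


packed = pack_pstring("Hello, World!")
restored = unpack_pstring(packed)
```

A wider prefix type raises the maximum string length at the cost of extra bytes per string.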