Data Model#
Structs, Sequences and Fields#
Structs serve as the foundation of this library. All data within the framework
undergoes the process of packing and unpacking using structs or _StructLike
objects. There are three possible types of structs:
- Sequences:
These structs operate independently of fields, making them field-agnostic. As such, they do not need to be attached to a
Field
instance. Typically, they are are combined with specific requirements, which will be discussed later on.
- Primitive Structs:
All defined primitive structs depent on being linked to a
Field
instance. They are designed to incorporate all attributes that can be set on a field.
- Partial Structs:
For scenarios where selective functionality is paramount, it is recommended to implement structs that focus solely on parsing, writing, or the calculation of the struct’s size. These specialized structs, referred to as partial structs, provide a modular approach for extending the library. Consideration of partial structs is essential when aiming to extend the capabilities of this framework.
Standard Types#
Below is a list of types provided by Caterpillar. These types are designed to maintain compatibility with older versions of the library, making them particularly important.
Sequence#
As previously explained, a sequence functions independently of fields. The library introduces
the Sequence
as a named finite collection of Field
objects. A Sequence
operates on a model, which is a string-to-field mapping by default. Later, we will discuss
the distinctions between a Sequence and a Struct regarding the model representation.
A sequence definition entails the specification of a Sequence
object by directly
indicating the model to use. Inheritance poses a challenge with sequences, as they are not
designed to operate on a type hierarchy. The default instantiation with all default options
involves passing the dictionary with all fields directly:
>>> Format = Sequence({"a": uint8, "b": uint32})
Programmers Note:
All sequence types introduced by this library can also store so-called unnamed fields.
These fields are not visible in the unpacked result and are automatically packed, removing
concerns about them when the option S_DISCARD_UNNAMED
is active. Their names usually
begin with an underscore and must solely contain numbers (e.g., _123
).
The sequence follows the Field
configuration model, allowing sequence and
field-related options to be set. As mentioned earlier, the S_DISCARD_UNNAMED
option can
be used for example to exclude all unnamed fields from the final representation. A complete
list of all configuration options and their impact can be found in Options.
All sequences store a configurable ByteOrder
and Arch
as architecture,
which are passed to all fields in the current model. For more information on why these
classes are not specified as an enum class, please refer to Byteorder and Architecture.
Inheritance in sequences is intricate, as a Sequence
is constructed from a dictionary
of elements. We can attempt to simulate a chain of extended base sequences using the
concatenation of two sequences. The __add__()
method will import all fields
from the other specified sequence. The only disadvantage is the placement required by the
operator. For instance:
>>> BaseFormat = Sequence({"magic": b"MAGIC", "a": uint8})
>>> Format = Sequence({"b": uint32, "c": uint16}) + BaseFormat
will result in the following field order:
>>> list(Format.get_members())
['b', 'c', 'magic', 'a']
which is not the intended order. The correct order should be ['magic', 'a', 'b', 'c']
.
This can be achieved by using the BaseFormat
instance as the first operand.
Warning
This will alter the BaseFormat sequence, making it unusable elsewhere as the base for
all sub-sequences. Therefore, it is not recommended to use inheritance within sequences.
The Struct
class resolves this issue with ease.
Nesting sequences is allowed by default and can be achieved by incorporating another
Sequence
into the model. It is important to note that nesting is distinct from
inheritance, adding an additional layer of packing and unpacking.
>>> Format = Sequence({"other": BaseFormat, "b": uint32})
Struct#
A struct describes a finite collection of named fields. In contrast to a sequence, a struct
utilizes Python classes as its model. The annotation feature in Python enables the definition of
custom types as annotations, enabling this special struct class to create a model solely based on
class annotations. Additionally, it generates a dataclass
of the provided model, offering a
standardized string representation.
Several differences exist between a Sequence
and a
Struct
, with the most significant ones highlighted below:
Sequence |
Struct |
|
---|---|---|
Model Type |
dict |
type |
Inheritance |
No |
Yes |
Attribute Access |
|
|
Unpacked Type (also needed to pack) |
dict [*] |
instance of model |
Documentation |
No |
Yes |
As evident from the comparison, the Struct
class introduces new features such as
inheritance and documentation support. It’s crucial to note that inheritance uses
struct types exclusively.
The Sequence
class implements a specific process for creating an internal representation
of the given model. The Struct
class enhances this process by handling default values, replacing
types for documentation purposes, or removing annotation fields directly from the model. Additionally,
this class adds __struct__
to the model afterward.
Implementation Note
If you decide to use the annotation
feature from the __future__
module, it is necessary to
enable S_EVAL_ANNOTATIONS
since it “Stringizes” all annotations. inspect
then
evaluates all strings, introducing a potential security risk. Exercise with caution when evaluating code!
Specifying structs is as simple as defining Python Classes:
>>> @struct
... class BaseFormat:
... magic: b"MAGIC"
... a: uint8
...
Internally, a representation with all required fields and their corresponding names is
created. As b"MAGIC"
or uint8
are instances of types, the type replacement
for documentation purposes should be enabled, as shown in Customizing the struct’s type.
As described above, this class introduces an easy-to-use inheritance system using the method resolution order of Python:
>>> @struct
... class Format(BaseFormat):
... b: uint32
... c: uint16
...
>>> list(Format.__struct__.get_members())
['magic', 'a', 'b', 'c']
Programmers Note
As the Struct
class is a direct subclass of Sequence
, nesting is supported
by default. That means, so-called anonymous inner structs can be defined within a class
definition.
>>> @struct
... class Format:
... a: uint32
... b: {"c": uint8}
...
It is not recommended to use this technique as the inner structs can’t be used anywhere else.
Anonymous inner union definitions are tricky and are not officially supported yet. There are
workarounds to that problem, which are discussed in the API documentation of Sequence
.
Union#
Internally constructing unions in the library poses challenges. The current implementation uses
the predefined behavior of the Sequence
class for union types. It selects the field with
the greatest length as its representational size. Unions, much like BitFields, must store a static
size.
In essence, they behave similarly to C unions. A traditional function hook will be installed on the model to capture field assignments. What that means will be illustrated by the following example:
>>> @union
... class Format:
... foo: uint16
... bar: uint32
... baz: boolean
...
>>> obj = Format() # union does not need any values
Right now, all attributes store the default value (None
). If we assign a new value to one field, it
will be applied to all others. Hence,
>>> obj.bar = 0xFF00FF00
will result in
>>> obj
Format(foo=65280, bar=4278255360, baz=False)
Implementation Detail
The constructor is the only place where there is no synchronization between fields. Additionally, the current implementation may produce some overhead, because every refresh will first pack the new value and then executes unpack on all other fields.
BitField#
A BitField, despite its name suggesting a field of bits, is a powerful structure designed for
detailed byte inspection. Similar to other structures, it is a finite collection of named fields. This
section will introduce potential challenges associated with the implementation of a BitField
and explains its behavior.
Caution
This class is still experimental, and caution is advised. For a list of known disadvantages or problems, refer to the information provided below.
As mentioned earlier, a BitField allows the inspection of individual bits within parsed bytes. Its
internal model relies on a special function or attribute, namely __bits__()
. Consequently,
a bitfield has a predefined length and will always possess a length that can be represented in bytes.
The BitField
class not only stores the existing model representation with a name-to-field
mapping and a collection of all fields but also introduces a special organizational class:
BitFieldGroup
. Each group defines its bit size, the absolute bit position in the bitfield,
and a mapping of fields to their relative bit position in the current group, along with the field’s
width. In the following example, three groups are created:
>>> @bitfield
... class Format:
... a : uint8 # Group 1, pos=0, size=8
... _ : 0 # Group 2, pos=8, size=8
... b : 15 - uint16 # \
... c : 1 # \ Group 3, pos=16, size=16
...
a
: The first field creates a group with a size of eight bits at position zero._
: Next, a zero-sized field indicates that padding until the end of the current byte should be added. As we start from bit position0
, one byte will be filled with zeros.b
: The third field only uses 15 bits of a 16-bit wide field (2 bytes inferred usinguint16
)c
: The last field uses the final bit of our current group.
TODO: describe process of collecting fields, packing and unpacking
Field#
The next core element of this library is the Field. It serves as a context storage to store configuration data
about a struct. Even sequences and structs can be used as fields. The process is straightforward: each custom operator
creates an instance of a Field
with the applied configuration value. Most of the time, this value can be
static or a Context lambda. A field implements basic behavior that should not be duplicated, such as
conditional execution, exception handling with default values, and support for a built-in switch-case structure.
As mentioned earlier, some primitive structs depend on being linked to a Field
. This is because all
configuration elements are stored in a Field
instance rather than in the target struct instance. More
information about each supported configuration can be found in Operators.
Greedy#
This library provides direct support for greedy parsing. Leveraging Python’s syntactic features, this special form
of parsing is enabled using the Ellipsis (...
). All previously introduced structs implement greedy parsing
when enabled.
>>> field = uint8[...]
This special type can be used in places where a length has to be specified. Therefore, it can be applied to all array
[]
declarations and constructors that take the length as an input argument, such as CString
, for
example.
>>> field = Field(CString(...))
>>> unpack(field, b"abcd\x00")
'abcd'
Prefixed#
In addition to greedy parsing, this library supports prefixed packing and unpacking as well. With prefixed, we refer
to the length of an array of elements that should be parsed. In this library, the slice
class is to achieve a
prefix option.
>>> field = CString[uint32::]
Context#
The context is another core element of this framework, utilized to store all relevant variables needed during the
process of packing or unpacking objects. The top-level unpack()
and pack()
methods are designed to
create the context themselves with some pre-defined (internal) fields.
Implementation Note
Context
objects are essentially dict
objects with enhanced capabilities. Therefore, all
operations supported on dictionaries are applicable.
The context enables special attribute-like access using getattr
if the attribute wasn’t defined in the
instance directly. All custom attributes are stored in the dictionary representation of the instance.
- CTX_PARENT = "_parent"#
All
Context
instances SHOULD contain a reference to the parent context. If the returned reference isNone
, it can be assumed that the current context is the root context. If this attribute is set, it MUST point to aContext
instance.
- CTX_OBJECT = "_obj"#
When packing or unpacking objects, the current object attributes are stored within the object context. This is a special context that allows access to previously parsed fields or attributes of the input object. To minimize the number of calls using this attribute, a shortcut named
this
was defined, which automatically inserts a path to the object context.
- CTX_STREAM = "_io"#
The input or output stream MUST be set in each context instance to prevent access errors on missing stream objects.
See also
Discussion on Github why this attribute has to be set in every context instance.
- CTX_PATH = "_path"#
Although it is optional to provide the current parsing or building path, it is recommended. All nesting structures implement a behavior that automatically adds a sub-path while packing or unpacking. Special names are
"<root>"
for the starting path and"<NUMBER>"
for greedy sequence elements.
- CTX_FIELD = "_field"#
In case a struct is linked to a field, the
Field
instance will always set this context variable to be accessible from within the underlying struct.
- CTX_INDEX = "_index"#
When packing or unpacking collections of elements, the current working index is given under this context variable. This variable is set only in this specific situation.
- CTX_VALUE = "_value"#
In case a switch-case statement is activated in a field, the context will receive the parsed value in this context variable temporarily.
- CTX_POS = "_pos"#
Currently undefined.
- CTX_OFFSETS = "_offsets"#
Internal use only: This special member is only set in the root context and stores all packed objects that should be placed at an offset position.
Context lambda#
Dynamic sized structs are supported by this library using the power of so-called context lambdas. This library
introduces a special callable _ContextLambda
, that takes a Context
instance and returns the
desired result. To mimic a context lambda, the __call__()
method has to be implemented.
Dynamic-sized structs are supported by this library using the power of so-called context lambdas. This library
introduces a special callable _ContextLambda
that takes a Context
instance and returns the #
desired result. To mimic a context lambda, the __call__()
method has to be implemented.
- object.__call__(self, context)#
This library does not distinguish between callable objects and context lambdas. They are treated as the same class (this aspect is under subject to changes).
Context path#
The path of a context is a specialized form of a Context lambda and supports lazy evaluation of most
operators (conditional ones excluded). Once called, they try to retrieve the requested value from within
the given Context
instance. Below is a list of default paths designed to provide a relatively easy
way to access the context variables.
- ctx = ""#
This special path acts as a wrapper to access all variables within the top-level
Context
object.
- this = "_obj"#
As described before, a special object context is created when packing or unpacking structs that store more than one field.
- parent = "_parent._obj"#
A shortcut to access the object context of the parent context.
Templates#
A specialized form of structs are templates, which are basically generic Python classes. Think of them
as blueprints for your final classes/structs that contain placeholders for actual types. As in C++, a
template needs type arguments, in this case we will name them TemplateTypeVar
.
Actually, there are two different types of type variables:
- Required:
These variables are required when creating a new struct based on the template and they can be used as positional arguments within the type derivation.
- Positional:
These arguments are usable only as keyword arguments and are may be optional if a default value is supplied.
These template type variables can be created using simple variable definitions:
>>> A = TemplateTypeVar("A")
Important
A template class is not a struct definition. It specifies a blueprint for the final class.
A template class is defined like a struct, union or bitfield class, but without being a dataclass nor storing a struct instance.
>>> @template(A, "B")
... class FormatTemplate:
... foo: A
... bar: B
... baz: uint32
...
The defined class then can be used to create new classes based on the provided class structure. For instance,
>>> Format = derive(FormatTemplate, A=uint32, B=uint8)
>>> Format
<class '__main__.__4BE4F2562B65393CFormatTemplate'>
will return an anonymous class (in this case). Normally, caterpillar tries to infer the
variable name from the current module (if name=...
). In summary, every time
derive()
is called, a new class will be created if not already
defined.
The current implementation will place template information about the current class using
a special class attribute: __template__
.
To support sub-classes of templates, we can declare a derived class as partial:
>>> Format32 = derive(FormatTemplate, A=uint32, partial=True)
Again, the resulting class is not a struct, but another template class.
Developer’s note
By now, a template won’t copy existing field documentation comments. Therefore, you can’t display inherited members using sphinx.
Special method names#
A class can either extend _StructLike
or implement the special methods needed
to act as a struct. The subsequent sections provide an overview of all special methods
and attributes introduced by this library. Further insights into extending structs with
custom operators can be found in Operators.
Emulating Struct Types#
- object.__pack__(self, obj, context)#
Invoked to serialize the given object into an output stream,
__pack__()
is designed to implement the behavior necessary for packing a collection of elements or a single element. Accordingly, the input obj may be anIterable
or a singular element.The absence of a standardized implementation for deserializing a collection of elements is deliberate. For example, all instances of the
FormatField
utilize the Python library struct internally to pack and unpack data. To optimize execution times, a collection of elements is packed and unpacked in a single call, rather than handling each element individually.The context must incorporate specific members, mentioned in Context. Any data input verification is implemented by the corresponding class.
__pack__()
is invoked by thepack()
method defined within this library. Its purpose is to dictate how input objects are written to the stream. It is crucial to note that the outcome of this function is ignored.Changed in version beta: The stream parameter has been removed and was instead moved into the context.
- object.__unpack__(self, context)#
Called to desersialize objects from an input stream (the stream is stored in the given context). The result of
__unpack__()
is not going to be ignored.Every implementation is tasked with the decision of whether to support the deserialization of multiple elements concurrently. By default, the
Field
class stores all essential attributes required to determine the length of elements set for unpacking. The__unpack__()
method is activated through theunpack()
operation, integrated with the default struct classes — namely,Sequence
,Struct
, andField
.Changed in version beta: The stream parameter has been removed and was instead moved into the context.
- object.__size__(self, context)#
This method serves the purpose of determining the space occupied by this struct, expressed in bytes. The availability of a context enables the execution of a
_ContextLambda
, offering support for dynamically sized structs. Furthermore, for the explicit definition of dynamic structs, the option to raise aDynamicSizeError
is provided.
Customizing the struct’s type#
- object.__type__(self)#
The configuration of Structs incorporates type replacement before a dataclass is created. This feature was specifically introduced for documentation purposes. The optional
__type__()
method allows for the specification of a type, with the default beingAny
if not explicitly defined.Note
The implementation of the
__type__()
method is optional and, therefore, not mandatory as per the library’s specifications.The following example demonstrates the use of the sphinx-autodoc extension to document struct classes with the
S_REPLACE_TYPE
option enabled. Only documented members are displayed... autoclass:: examples.formats.nibarchive.NIBHeader() :members:
Will be displayed as:
- class NIBHeader#
Example class doc comment
- NIBHeader.magic: bytes#
example field doc comment
- NIBHeader.unknown_1: int#
second field doc comment
In this illustration, the extra parentheses at the end are included to prevent the automatic creation of constructors.
Struct containers#
- class.__struct__#
All models annotated with either
@struct
or@bitfield
fall into the category of struct containers. These containers store the additional class attribute__struct__()
.Internally, any types utilizing this attribute can be employed within a struct, bitfield, or sequence definition. The type of the stored value must be a subclass of
_StructLike
.
Template Containers#
- class.__template__#
All template classes store information about the used template type variables. Whether they are required or just positional. In addition, default inferred types are stored as well.
BitField specific methods#
The introduced BitField
class is special in many different ways. One key
attribute is its fixed size. To determine the size of a struct, it leverages a special
member, which can be either a function or an attribute.
- object.__bits__(self)#
Called to measure the bit count of the current object.
__bits__()
serves as the sole requirement for the defined fields in the current implementation of theBitField
class.Note
This class member can also be expressed as an attribute. The library automatically adapts to the appropriate representation based on the context.
Customizing the object’s byteorder#
- object.__byteorder__#
The byteorder of a struct can be temporarily configured using the corresponding operator. It is important to note that this attribute is utilized internally and should not be used elsewhere.
>>> struct = BigEndian | struct # Automatically sets __byteorder__
- object.__set_byteorder__(self, byteorder)#
In contrast to the attribute
__byteorder__
, the__set_byteorder__()
method is invoked to apply the current byteorder to a struct. The default behavior, as described inFieldMixin
, is to return a newField
instance with the byteorder applied. Note the use of another operator here.>>> field = BigEndian + struct