Rice Fields

Brief Introduction to Physics Simulation

2024-10-11T00:00:00+07:00

I’ve been working on a custom physics engine for a while now. It’s been quite a journey, and I aim to document it through a series of blog posts. In this post we’ll mostly be talking about rigid body simulation, so for now, pretend that soft bodies don’t exist. First, we’ll be going through the basic architecture of a physics engine, focusing on a 2D rigid body simulation with code examples to keep everything neat and simple.

Modeling motion

Let’s start with the most straightforward part of a physics simulation: modeling the motion of objects. For this, we’ll use the equations of motion and Newton’s second law.

F = ma

v = v + a * t
v = v + F/m * t

v is velocity
a is acceleration
t is time

This gives us the velocity of an object after a time step t. With this equation, we can compute the new velocity of our object given the force we want to apply and its mass. Now, to find the new position (displacement) of the object, we can simply do the following:

s = s + v * t;

This is also known as Euler integration: f(t + dt) = f(t) + f'(t) * dt. It approximates the change in an object’s position over a small interval t, and the approximation gets better as t gets smaller. To apply a force we also need the mass, which is an object’s resistance to changes in its linear motion. We can compute mass using an object’s density and volume.

volume = width * height * depth
mass = density * volume

Here is a code example for simulating this model: Link
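As a rough sketch of the equations above (not the linked example itself), the snippet below integrates a single body under a constant force. The names `createBody` and `integrate` and the chosen numbers are assumptions made for illustration.

```javascript
// Minimal sketch: one body advanced with the update rules from the text.
// createBody and integrate are illustrative names, not from the linked example.
function createBody(density, width, height, depth) {
  const volume = width * height * depth;
  return {
    mass: density * volume,
    position: { x: 0, y: 0 },
    velocity: { x: 0, y: 0 },
  };
}

function integrate(body, force, dt) {
  // v = v + F/m * t
  body.velocity.x += (force.x / body.mass) * dt;
  body.velocity.y += (force.y / body.mass) * dt;
  // s = s + v * t (uses the velocity we just updated)
  body.position.x += body.velocity.x * dt;
  body.position.y += body.velocity.y * dt;
}

const box = createBody(2, 1, 1, 1); // density 2, unit cube => mass 2
for (let i = 0; i < 10; i++) {
  integrate(box, { x: 4, y: 0 }, 0.1); // constant push for 1 second
}
console.log(box.velocity.x, box.position.x);
```

After one simulated second the velocity is about 2 and the position about 1.1, while the exact answer for constant acceleration is 1.0. The gap is the integration error, and it shrinks as the time step gets smaller.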

Angular motion (rotation)

The above equations of motion are enough for modeling linear motion, but ideally, in a simulation, we also need to model angular motion. When a force is applied to a body along a line that does not pass through its center of mass, the body will experience both linear motion and angular motion (rotation).

Before going into modeling angular motion, let’s first talk about the mass moment of inertia. The mass moment of inertia is a measure of an object’s resistance to changes in its angular motion, which depends on both the mass and how the mass is distributed relative to the axis of rotation.

Every unique shape has its own mass moment of inertia, which can be computed by combining the known mass moments of inertia of simple shapes. Deriving the mass moment of inertia of different shapes is not the goal of this post, so I will just link some resources for interested readers: How to Find Mass Moment of Inertia

The moment of inertia for commonly used shapes can be found here: List of moments of inertia

3D objects have three axes of rotation, and thus we will have a moment of inertia for each of these axes: Ixx, Iyy and Izz. This is generally represented using a 3x3 matrix called the inertia tensor.

Finally, we can model angular motion as:

w = w + a * t
w = w + to/I * t

w is angular velocity
a is angular acceleration
to is torque
I is mass moment of inertia
t is delta time


// new rotation
r = r + w * t

// wrap angle around [0, 2*PI)
if (r < 0) r += 2 * Math.PI;
if (r >= 2 * Math.PI) r -= 2 * Math.PI;

Updated example with Angular motion: Link
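To make the angular update concrete, here is a small sketch for a solid disc. It assumes the standard moment of inertia of a solid disc about its center (I = 1/2 * m * r^2) and computes the torque as the 2D cross product of the lever arm and the force. The helper names are made up for this example.

```javascript
// 2D cross product of two vectors gives a scalar (the z component).
function cross2(a, b) {
  return a.x * b.y - a.y * b.x;
}

// Assumes a solid disc rotating about its center: I = 1/2 * m * r^2.
function createDisc(mass, radius) {
  return {
    mass,
    inertia: 0.5 * mass * radius * radius,
    position: { x: 0, y: 0 },
    rotation: 0,
    angularVelocity: 0,
  };
}

function applyForceAtPoint(body, force, point, dt) {
  // lever arm from the center of mass to the point of application
  const arm = { x: point.x - body.position.x, y: point.y - body.position.y };
  const torque = cross2(arm, force);

  // w = w + to/I * t, then r = r + w * t
  body.angularVelocity += (torque / body.inertia) * dt;
  body.rotation += body.angularVelocity * dt;

  // wrap the angle into [0, 2*PI)
  const TWO_PI = 2 * Math.PI;
  body.rotation = ((body.rotation % TWO_PI) + TWO_PI) % TWO_PI;
}

const disc = createDisc(2, 1); // inertia = 1
applyForceAtPoint(disc, { x: 0, y: 3 }, { x: 1, y: 0 }, 0.5);
console.log(disc.angularVelocity, disc.rotation);
```

A force that points straight through the center of mass gives a zero cross product, so it produces no torque and only linear motion.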

Resolving collision

Our motion simulation works quite well, but it has a major issue: the bodies overlap upon collision. Ideally, rigid bodies (bodies that don’t deform under force) do not overlap on collision; they exert forces on each other and may lose some energy due to friction and restitution. To address this, we first need to detect collisions and then resolve them. The entire process of detecting and resolving collisions can be divided into four major stages:

Broadphase
Narrowphase
Constraints
Solver

Broadphase

In this stage, we compute a list of possibly colliding shape pairs. Possibly colliding means we don’t actually compute the precise contact points, but rather determine whether a collision is possible. This can be done in a number of ways. The most widely used method involves using bounding volumes, usually an AABB (Axis Aligned Bounding Box), along with a space-partitioning structure, typically a binary tree BVH (Bounding Volume Hierarchy). Since directly computing contact between two shapes can be very expensive, the broadphase helps us avoid unnecessary computations for objects that obviously cannot collide. By using simple bounding volumes and specialized structures to keep track of those volumes, broadphase checks become relatively inexpensive.

For our simple JS example, we can get away with hash grids. A hash grid is a straightforward space-partitioning structure in which the space is divided into an n × m grid, and for each cell, we keep track of the objects that intersect with it. If you are interested in implementing a BVH, you might want to check out Erin Catto’s DynamicBVH slides.
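The grid idea can be sketched in a few lines. This is a simplified version that rebuilds the grid on every call and assumes circles described by `x`, `y` and `r`. The function name `broadphase` and the `cellSize` parameter are choices made for this sketch, not taken from the linked example.

```javascript
// Uniform grid broadphase for circles.
// Returns index pairs of circles that share at least one grid cell.
function broadphase(circles, cellSize) {
  const grid = new Map(); // "cx,cy" -> indices of circles touching that cell

  circles.forEach((c, i) => {
    // range of cells covered by the circle's bounding box
    const minX = Math.floor((c.x - c.r) / cellSize);
    const maxX = Math.floor((c.x + c.r) / cellSize);
    const minY = Math.floor((c.y - c.r) / cellSize);
    const maxY = Math.floor((c.y + c.r) / cellSize);

    for (let cx = minX; cx <= maxX; cx++) {
      for (let cy = minY; cy <= maxY; cy++) {
        const key = cx + "," + cy;
        if (!grid.has(key)) grid.set(key, []);
        grid.get(key).push(i);
      }
    }
  });

  const seen = new Set(); // avoids reporting a pair twice
  const pairs = [];
  for (const bucket of grid.values()) {
    for (let a = 0; a < bucket.length; a++) {
      for (let b = a + 1; b < bucket.length; b++) {
        const key = bucket[a] + ":" + bucket[b];
        if (!seen.has(key)) {
          seen.add(key);
          pairs.push([bucket[a], bucket[b]]);
        }
      }
    }
  }
  return pairs;
}

const pairs = broadphase(
  [
    { x: 5, y: 5, r: 2 },
    { x: 8, y: 5, r: 2 },
    { x: 95, y: 95, r: 2 },
  ],
  10
);
console.log(pairs); // only the first two circles share a cell
```

The cell size is a trade-off: cells much smaller than the objects put each object into many cells, while very large cells put many unrelated objects into the same bucket.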

Updated Example

Narrowphase

After we get a list of possibly colliding pairs through broadphase, we need to perform the actual collision tests between the shapes. This stage is relatively straightforward for basic shapes like spheres, but for slightly more complex shapes such as boxes, convex hulls, or meshes, we’ll need to employ a combination of well-documented algorithms like SAT (Separating Axis Theorem) and GJK (Gilbert–Johnson–Keerthi) with EPA (Expanding Polytope Algorithm).

Narrowphase generates contact information such as contact points, contact normals, penetration depth, which we’ll need to resolve the contact.

Resources for these algorithms:

GJK: Implementing GJK
SAT: SAT for OOBB

For our small JS example, we are only using circles, and computing contact between two circles is fairly trivial.

// C1 and C2 be two possibly colliding circles
// r1 is radius of C1, r2 is radius of C2

d = distance(C1, C2);

if (d > r1 + r2) => no collision

normal = (C2 - C1) / d
point = C1 + normal * r1
penetration = (r1 + r2) - d;
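Written out as JavaScript, the circle test above might look like the following sketch. The fallback normal for two circles with identical centers is an assumption added here to avoid dividing by zero.

```javascript
// Returns contact info for two circles, or null if they do not touch.
function circleContact(c1, c2) {
  const dx = c2.x - c1.x;
  const dy = c2.y - c1.y;
  const d = Math.hypot(dx, dy);

  if (d > c1.r + c2.r) return null; // no collision

  // Assumption: when the centers coincide there is no meaningful direction,
  // so we pick an arbitrary normal instead of dividing by zero.
  const normal = d > 0 ? { x: dx / d, y: dy / d } : { x: 1, y: 0 };

  return {
    normal, // points from c1 towards c2
    point: { x: c1.x + normal.x * c1.r, y: c1.y + normal.y * c1.r },
    penetration: c1.r + c2.r - d,
  };
}

const hit = circleContact({ x: 0, y: 0, r: 1 }, { x: 1.5, y: 0, r: 1 });
const miss = circleContact({ x: 0, y: 0, r: 1 }, { x: 5, y: 0, r: 1 });
console.log(hit, miss);
```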

Updated Example

Constraints

We can resolve the collision (separate the penetrating shapes) based on the information generated by the narrowphase. One way to solve collisions is through contact constraints. Constraints are limitations on a body’s degrees of freedom: a set of rules that dictate how a body can move. We can model a constraint with an equation of the form C = 0, where C is a function of the body’s state that must equal zero; otherwise, the constraint is violated. For example, if C is the position of a body, C = 0 means that the position of the body should always remain at zero (essentially setting the position to zero every frame). Similarly, an inequality constraint C >= 0 only requires that C never drops below zero.

Constraints allow us to model rules that control the behavior of the physics simulation. They can also be used to simulate various things, such as collision responses, joints, and springs.

For our example, let’s implement a simple ground constraint (objects shouldn’t fall through the ground). We can define our ground as y = 200, so for a circle our ground constraint will be (pos.y - radius) >= 200.

if((pos.y - radius) <= 200) { // if the constraint is violated
    const bias = (200 - (pos.y - radius));
    pos.y += bias; // correct the positional error
    velocity.y = 0; // remove y velocity  
}

Updated Example

Solver

A solver is essentially a piece of logic that resolves constraints so that they remain valid. The two popular ways to solve constraints are by addressing positions (as in the example above) or solving velocities. While position constraints work, they are not particularly physically accurate.

For our contacts, we will focus on velocity constraints. A velocity constraint is essentially the first derivative of the position constraint with respect to time (i.e., C’ = 0). When solving velocity constraints, we apply small impulses to the bodies until the constraint is satisfied; solvers that work this way are also known as impulse solvers.

Since impulse solvers address continuous dynamics using discrete time steps, we need to solve them iteratively. Each iteration applies small impulses to correct the velocities of the bodies until the constraints are satisfied. As with most numerical methods, increasing the number of iterations generally leads to a more refined solution, although there are diminishing returns after a certain point. A physics simulation might involve multiple constraint solvers for different types of constraints. Before diving into contact constraints, let’s go through the formulas we’ll use to compute the impulses. We already have the direction for our impulse (the contact normal); now all we need to compute is the coefficient of impulse (its magnitude).

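Given a velocity constraint Jv + b = 0, we can compute the coefficient of impulse as:

l = -(Jv + b) / eff_mass
eff_mass = J * M^-1 * J^T

J is the Jacobian of the constraint (a row vector)
v is the vector of velocities of the bodies involved
M is the mass matrix of those bodies (masses and moments of inertia on the diagonal)
b is the bias

Solving for velocities alone only stops the bodies from moving further into each other; it does not undo any penetration that already exists. To correct that positional error, we can slightly boost the impulse using a bias b (e.g., based on penetration depth for contacts).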

Contact Constraint

Now, let’s model our contact in terms of a velocity constraint: (relative_velocity).n >= 0 (the relative velocity projected onto the contact normal must not be negative).

rv = V2 - V1 // relative velocity

jv = dot(rv, n) // n is contact normal

if(jv < 0) => bodies moving closer to each other
if(jv > 0) => bodies moving apart from each other
if(jv == 0) => no relative motion along the normal

b = -penetration / dt // penetration is computed in Narrowphase

l = -(jv + b) / eff_mass
impulse = l * contact_normal

But how do we compute the effective mass? To find it, we first need J, which we get by writing the constraint rv.n = 0 in the generic form Jv + b = 0.

Jv = rv.n; // dot(rv, n)
Jv = (V2 - V1).n;

Jv = (v2 + w2 x c2).n - (v1 + w1 x c1).n  // x - cross product
// c1 = (contact_point - pos_a)
// c2 = (contact_point - pos_b)
// v1, w1 = velocities of body a
// v2, w2 = velocities of body b
// v = [v1 w1 v2 w2]

With this, we can now solve for J, resulting in: J = [-n -(c1 x n) n (c2 x n)]. Finally, using the effective mass equation above we can compute our effective mass as:

eff_mass = inv_m1 + dot(inv_I1 * c1n, c1n) + inv_m2 + dot(inv_I2 * c2n, c2n)

c1n = c1 x n // x - cross product
c2n = c2 x n

Applying the impulse:

// shown for body b; body a receives the opposite impulse (-impulse)
v += impulse / mass
w += cross(point - centroid, impulse) / I
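Putting the pieces of this section together, a single solve of one contact could look like the sketch below. It assumes each body stores its inverse mass and inverse inertia (`invMass`, `invInertia`), and that the contact comes from the narrowphase with a normal pointing from body a to body b. The helper names are illustrative and no clamping is applied here.

```javascript
// Small 2D vector helpers (illustrative names).
const add = (a, b) => ({ x: a.x + b.x, y: a.y + b.y });
const sub = (a, b) => ({ x: a.x - b.x, y: a.y - b.y });
const scale = (a, s) => ({ x: a.x * s, y: a.y * s });
const dot = (a, b) => a.x * b.x + a.y * b.y;
const cross = (a, b) => a.x * b.y - a.y * b.x; // vector x vector -> scalar
const crossSV = (w, v) => ({ x: -w * v.y, y: w * v.x }); // scalar x vector -> vector

// Solves one contact once and returns the coefficient of impulse.
function solveContact(a, b, contact, dt) {
  const n = contact.normal;
  const c1 = sub(contact.point, a.position);
  const c2 = sub(contact.point, b.position);

  // Jv: relative velocity of the contact point projected onto the normal
  const va = add(a.velocity, crossSV(a.angularVelocity, c1));
  const vb = add(b.velocity, crossSV(b.angularVelocity, c2));
  const jv = dot(sub(vb, va), n);

  const c1n = cross(c1, n);
  const c2n = cross(c2, n);
  const effMass =
    a.invMass + a.invInertia * c1n * c1n +
    b.invMass + b.invInertia * c2n * c2n;

  const bias = -contact.penetration / dt;
  const l = -(jv + bias) / effMass;
  const impulse = scale(n, l);

  // body a receives -impulse, body b receives +impulse
  a.velocity = sub(a.velocity, scale(impulse, a.invMass));
  a.angularVelocity -= a.invInertia * cross(c1, impulse);
  b.velocity = add(b.velocity, scale(impulse, b.invMass));
  b.angularVelocity += b.invInertia * cross(c2, impulse);
  return l;
}

const makeBody = (x, vx) => ({
  position: { x, y: 0 },
  velocity: { x: vx, y: 0 },
  angularVelocity: 0,
  invMass: 1,
  invInertia: 2,
});

const bodyA = makeBody(0, 1); // moving right
const bodyB = makeBody(2, -1); // moving left
const lambda = solveContact(
  bodyA,
  bodyB,
  { normal: { x: 1, y: 0 }, point: { x: 1, y: 0 }, penetration: 0 },
  1 / 60
);
console.log(lambda, bodyA.velocity, bodyB.velocity);
```

For two equal bodies approaching head-on with no penetration, the impulse removes the approaching velocity entirely, so both bodies come to rest along the normal. Bounciness (restitution) is not modeled in this sketch.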

Updated Example

Clamping impulse

When our impulse is negative, it will pull the two bodies towards each other instead of pushing them apart, so we’ll need to clamp our impulse. We will go with the clamping method suggested by Erin Catto, which clamps the accumulated impulse rather than each individual impulse.

l = -(jv + b) / eff_mass

// clamping
old_l = accumulated_l
accumulated_l = max(0, old_l + l)
l = accumulated_l - old_l
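The effect of clamping the accumulated value is easiest to see with a few numbers. The sketch below feeds three candidate impulses through the clamp; the `accumulated` field and the function name are assumptions made for this example.

```javascript
// Clamps the running total to be non-negative and returns the part of `l`
// that may actually be applied this iteration.
function clampImpulse(contact, l) {
  const old = contact.accumulated;
  contact.accumulated = Math.max(0, old + l);
  return contact.accumulated - old;
}

const contactState = { accumulated: 0 };
const applied = [1.5, -0.5, -2.0].map((l) => clampImpulse(contactState, l));
console.log(applied, contactState.accumulated);
```

The second correction (-0.5) passes through unchanged even though it is negative, because the total stays positive. The third is cut down to -1.0, which brings the total to exactly zero. Clamping each impulse to zero individually would have rejected both corrections.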

Bias smoothing

If we add the bias b as it is, it will immediately correct the positional error induced by the collision. This is usually not desirable, as we would prefer our bodies to correct themselves smoothly over multiple frames. We can dampen our bias with a smoothing factor to achieve this. This is also known as Baumgarte stabilization.

bias = -0.3 * (penetration / dt); // 0.3 is our smoothing factor

Friction constraint

Solving for friction is similar to solving for contacts; instead of addressing the contact along the contact normal, we will resolve it along the tangent to the contact normal.

tangent = [-normal.y, normal.x] // perpendicular to the normal

rv = V2 - V1; // relative velocity

jv = dot(rv, tangent)
l = -jv / eff_mass // eff_mass is computed with the tangent in place of the normal
impulse = l * tangent

Since we are only solving for friction, we won’t need any bias term. Additionally, the clamping logic for friction is slightly different:

max_friction = friction * total_l // total_l is the accumulated normal impulse
f_old_l = f_total_l
f_total_l = clamp(f_total_l + l, -max_friction, max_friction)
f_l = f_total_l - f_old_l;
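Here is a small numeric sketch of the friction clamp. It assumes the contact keeps two running totals, `normalImpulse` and `tangentImpulse`, and uses a friction coefficient of 0.5; these names and values are chosen only for illustration.

```javascript
// Perpendicular to the contact normal (rotated by 90 degrees).
function tangentOf(normal) {
  return { x: -normal.y, y: normal.x };
}

// Keeps the accumulated friction impulse within
// [-friction * normalImpulse, +friction * normalImpulse].
function clampFriction(contact, l, friction) {
  const maxFriction = friction * contact.normalImpulse;
  const old = contact.tangentImpulse;
  const total = Math.min(Math.max(old + l, -maxFriction), maxFriction);
  contact.tangentImpulse = total;
  return total - old;
}

const t = tangentOf({ x: 0, y: 1 });
const frictionState = { normalImpulse: 2, tangentImpulse: 0 };
const first = clampFriction(frictionState, 0.5, 0.5); // inside the limit
const second = clampFriction(frictionState, 3, 0.5); // hits the upper limit
const third = clampFriction(frictionState, -5, 0.5); // hits the lower limit
console.log(t, first, second, third);
```

Unlike the normal impulse, friction can act in either direction along the tangent, which is why the limit is symmetric around zero. Its size depends on how hard the bodies are being pushed together.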

Updated Example

Warm starting

Warm starting is an optimization technique that can lead to better convergence for our solvers. This involves persisting the solver output (accumulated coefficient of impulse) across multiple frames. Warm starting can facilitate proper stable stacking of objects in the simulation.
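One possible shape for this, assuming each contact object survives between frames and keeps its `accumulated` normal impulse, is to re-apply that impulse before the solver iterations begin. The sketch below only handles linear velocity and uses made-up names; a full version would also apply the angular part.

```javascript
// Re-applies last frame's accumulated impulse so the solver starts
// close to the previous solution instead of starting from zero.
function warmStart(a, b, contact) {
  const n = contact.normal; // points from body a to body b
  const l = contact.accumulated; // persisted from the previous frame
  a.velocity.x -= n.x * l * a.invMass;
  a.velocity.y -= n.y * l * a.invMass;
  b.velocity.x += n.x * l * b.invMass;
  b.velocity.y += n.y * l * b.invMass;
}

const ground = { velocity: { x: 0, y: 0 }, invMass: 0 }; // static body
const ball = { velocity: { x: 0, y: -1 }, invMass: 1 }; // falling onto the ground
const restingContact = { normal: { x: 0, y: 1 }, accumulated: 1 };

warmStart(ground, ball, restingContact);
console.log(ball.velocity.y);
```

In a resting stack the impulses barely change from frame to frame, so starting from last frame’s values means the iterations only have to make small corrections.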

Wrap up

This article has covered the important steps for creating a simple physics simulation. Starting with modeling linear and angular motion, we explored broadphase and narrowphase collision detection, as well as methods for resolving collisions and handling friction. While these concepts provide a good starting point, there are many more techniques we can apply to enhance the simulation further. The same principles apply to 3D physics; the addition of a third axis makes rotations a bit different, since we’ll have three rotation axes to consider.

References

Sparse Set - stable pointers

2024-08-16T00:00:00+07:00

This is a continuation of my last article, ‘Sparse Set - A Flexible, Cache-Friendly Data Structure’.

In the last article, we implemented a basic sparse set. Now, let’s improve our implementation by adding support for stable pointers. To achieve this, we can make a slight adjustment: instead of storing data in the dense array, we’ll store it directly in the sparse array.

For this to work, we would need to change the structure of the sparse array from an array of integers to an array of pages. Each page will hold N dense indices along with N data items. The sparse index would then be a combination of the page index and the page offset (the index of the data within a specific page).

As shown in the diagram, the sparse array is paged and directly holds the data. Since we allocate and deallocate one page at a time, we avoid the need for complete reallocation each time we run out of capacity, as would be required with a dynamic array.

You might have noticed that this implementation of a sparse set is not fully contiguous. However, this isn’t a significant concern in practical use if an optimal page size is chosen. Crossing OS page boundaries often results in a cache miss anyway, so we still benefit from spatial locality while maintaining stable pointers to the elements.

Implementation

Let’s start by redefining our sparse and dense arrays.

// T is the element type stored in the set
struct SparsePage {
    size_t* sparse; // array of dense indices
    T* data;        // array of data
};

std::vector<SparsePage> pages;
std::vector<size_t> dense;

Now, we’ll need some logic to map the sparse index to the page index and page offset, which we can easily implement using bitwise operations.

const unsigned log2_page_size = 6;
const size_t page_size = size_t{1} << log2_page_size; // 64 entries per page

size_t toPageIndex(size_t sparse_index) {
    return sparse_index >> log2_page_size;
}

size_t toPageOffset(size_t sparse_index) {
    return sparse_index & (page_size - 1);
}

The page index is an index into the pages array, while the page offset is the index into the data and sparse arrays within the corresponding page.

auto page_index = toPageIndex(sparse_index);
auto page_offset = toPageOffset(sparse_index);

auto data = pages[page_index].data[page_offset];
auto dense_index = pages[page_index].sparse[page_offset];
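To see what the shift and mask actually compute, here is the same arithmetic as a tiny JavaScript sketch (used here only to illustrate the bit operations; the constant and function names are chosen for this example). With a page size of 64, sparse index 200 lives in page 3 at offset 8, because 200 = 3 * 64 + 8.

```javascript
const LOG2_PAGE_SIZE = 6;
const PAGE_SIZE = 1 << LOG2_PAGE_SIZE; // 64

// High bits select the page, low bits select the slot inside the page.
const toPageIndex = (sparseIndex) => sparseIndex >>> LOG2_PAGE_SIZE;
const toPageOffset = (sparseIndex) => sparseIndex & (PAGE_SIZE - 1);

// Inverse mapping, handy for checking the round trip.
const toSparseIndex = (pageIndex, pageOffset) =>
  (pageIndex << LOG2_PAGE_SIZE) | pageOffset;

const pageOf200 = toPageIndex(200);
const offsetOf200 = toPageOffset(200);
console.log(pageOf200, offsetOf200, toSparseIndex(pageOf200, offsetOf200));
```

The mask trick only works because the page size is a power of two; for any other page size we would need division and modulo instead.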

With these changes, we can move on to a proper implementation of a sparse set with stable pointers to its elements.

#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

const unsigned LG2_PAGE_SIZE = 6;
const size_t PAGE_SIZE = size_t{1} << LG2_PAGE_SIZE;

size_t toPageIndex(size_t sparse_index) {
    return sparse_index >> LG2_PAGE_SIZE;
}

size_t toPageOffset(size_t sparse_index) {
    return sparse_index & (PAGE_SIZE - 1);
}

template <typename T>
struct SparseSet {
    struct SparsePage {
        size_t* sparse;
        T* data;
    };

    struct SetEntry {
        T *data;
        size_t index;
    };

    size_t size = 0;
    size_t max_sparse_idx = 0;
    std::vector<std::optional<SparsePage>> pages;
    std::vector<size_t> dense;

    ~SparseSet() {
        for(auto page_opt : this->pages) {
            if(auto page = page_opt) {
                delete[] page->sparse;
                delete[] page->data;
            }
        }
    }

    SetEntry add(T item) {
        auto dense_idx = this->size;
        this->size += 1;

        // try to reuse last freed index
        if(dense_idx < this->dense.size()) {
            auto sparse_idx = this->dense[dense_idx];
            auto page_idx= toPageIndex(sparse_idx);
            auto page_offset = toPageOffset(sparse_idx);
            
            if(auto page = this->pages[page_idx]) {
                auto data_ptr = &page->data[page_offset];
                *data_ptr = item;
                return { data_ptr, sparse_idx };
            }
        }

        // allocate new index
        auto sparse_idx = this->max_sparse_idx;
        this->max_sparse_idx += 1;

        auto page_idx = toPageIndex(sparse_idx);
        auto page_offset = toPageOffset(sparse_idx);

        ensurePageIndex(page_idx);
        this->dense.push_back(sparse_idx);

        auto [sparse_ptr, data_ptr] = getSparseEntryPtr(sparse_idx);
        *sparse_ptr = dense_idx;
        *data_ptr = item;

        return {data_ptr, sparse_idx};
    }

    bool remove(size_t idx) {
        if(!contains(idx)) return false;

        this->size -= 1;

        auto [sparse_ptr, data_ptr] = getSparseEntryPtr(idx);

        auto end_dense_idx = this->size;
        auto end_sparse_idx = this->dense[end_dense_idx];
        auto [end_sparse_ptr, _] = getSparseEntryPtr(end_sparse_idx);

        // swap remove
        this->dense[*sparse_ptr] = end_sparse_idx;
        *end_sparse_ptr = *sparse_ptr;

        // update end dense idx for reuse
        *sparse_ptr = end_dense_idx;
        this->dense[end_dense_idx] =  idx;

        return true;
    }

    std::optional<T*const> getPtr(size_t idx) {
        if(!contains(idx)) return std::nullopt;

        auto page_idx = toPageIndex(idx);
        auto page_offset = toPageOffset(idx);

        return { &this->pages[page_idx].value().data[page_offset] };
    }

    bool contains(size_t idx) {
       auto page_idx = toPageIndex(idx);
       auto page_offset = toPageOffset(idx);

       if(page_idx >= this->pages.size()) return false;

       if(auto page = this->pages[page_idx]) {
           auto dense_idx = page->sparse[page_offset];
           // check bounds first, then verify the dense entry points back at idx
           return dense_idx < this->size && this->dense[dense_idx] == idx;
       }
        
       return false; 
    }

    void ensurePageIndex(size_t page_index) {
        auto pages_size = this->pages.size();
        if(page_index >= pages_size) {
            this->pages.resize(page_index + 1, std::nullopt);
        }
        if(!this->pages[page_index]) {
            // zero-initialize the sparse indices so contains() never reads garbage
            this->pages[page_index] = { new size_t[PAGE_SIZE](), new T[PAGE_SIZE] };
        }
    }

    std::pair<size_t*, T*> getSparseEntryPtr(size_t idx) {
        auto page_idx = toPageIndex(idx);
        auto page_offset = toPageOffset(idx);
        auto page = this->pages[page_idx].value();
        return { &page.sparse[page_offset], &page.data[page_offset] };
    }
};

Complete Implementation on Compiler Explorer

This implementation does not include iterator support, so it is up to the reader to implement their own iterator. Note: iterators can be implemented using the dense array.

Sparse Set - flexible cache friendly data structure

2024-08-15T00:00:00+07:00

The first time I heard about a sparse data structure (specifically, sparse arrays) was in a computer science class. Back then, I didn’t think much of them and didn’t encounter them again until recent years. I’ve been working a lot with contiguous memory and dynamic arrays. While dynamic arrays are extremely cache-friendly and allow code to take advantage of the spatial locality of the CPU cache, operating on them often gets quite tricky. Let’s jump into the problems with plain dynamic arrays first.

The problem

Arrays have O(1) access time, but an index is only useful as a handle when it is stable. Maintaining a stable index for an element in an array can be challenging, especially when we continuously add and remove elements. If we could maintain stable indices, we could use the index as a key or pointer to the element in the array without relying on other cache-inefficient data structures like linked lists or hashmaps. This key would remain valid even when the length of the array changes.

Stable indices here mean that the index of an element in an array does not change when other elements are added or removed from the array.

Sparse Set

A sparse set solves the issue of stable indices with the help of two arrays: one for the indices (sparse) and the other for the actual data (dense). The basic idea is to use the index into the sparse array as a stable index; the element at that index in the sparse array holds the index of an element in the dense array where the actual data is stored.

As shown in the diagram above, the sparse array stores indices pointing to elements in the dense array where the actual data is stored. With this mapping, we can easily update the dense array without losing index stability.

If we were to remove the element with index 3 (sparse index), we could simply perform a swap-remove in the corresponding dense array and update the appropriate indices in the sparse array.

After swapping the elements in the dense array, we would then update the corresponding indices in the sparse array. Sparse index 5 now points to dense index 1, and sparse index 3 points to dense index 3.

Finally, just pop the element from the dense array.

This way, the actual index of the element, which is the sparse index, remains stable.

Implementation

Now that we have the basic idea behind sparse sets, let’s look at a proper implementation.

template <typename T>
struct SparseSet {
    struct DenseElement {
        size_t sparse_idx;
        T value;
    };

    size_t size = 0;
    std::vector<size_t> sparse;
    std::vector<DenseElement> dense;

    size_t add(T item) {
        auto dense_idx = this->size;
        this->size += 1;

        // try to reuse last freed index
        if(dense_idx < this->dense.size()) {
           auto dense_element = &this->dense[dense_idx];
           dense_element->value = item;
           return dense_element->sparse_idx;
        }

        // allocate new index
        auto sparse_idx = this->sparse.size();

        this->dense.push_back({sparse_idx, item});
        this->sparse.push_back(dense_idx);

        return sparse_idx;
    }

    bool remove(size_t idx) {
        if(!contains(idx)) return false;
        this->size -= 1;

        auto dense_idx = this->sparse[idx];
        auto end_dense_idx = this->size;

        // swap remove
        auto end_element = this->dense[end_dense_idx];
        this->dense[dense_idx] = end_element;
        this->sparse[end_element.sparse_idx] = dense_idx;

        // update end dense idx for reuse
        this->dense[end_dense_idx].sparse_idx = idx;
        this->sparse[idx] = end_dense_idx;

        return true;
    }

    std::optional<T*const> getPtr(size_t idx) {
        if(!contains(idx)) return std::nullopt;

        return {&this->dense[this->sparse[idx]].value};
    }

    bool contains(size_t idx) {
        if(idx >= this->sparse.size()) return false;
        auto dense_idx = this->sparse[idx];
        auto current_idx = this->dense[dense_idx].sparse_idx;
        return dense_idx < this->size && idx == current_idx;
    }

    std::pair<typename std::vector<DenseElement>::const_iterator, 
        typename std::vector<DenseElement>::const_iterator>
    iterator() {
        auto start = this->dense.begin();
        auto end = this->dense.begin() + this->size;
        return { start, end };
    }
};

Complete Implementation on Compiler Explorer

Stable Pointers

Stable indices are useful, but we can also achieve stable pointers for elements in the sparse set. The second part of this post goes into the details of implementing a sparse set with stable pointers. Link to the post

Wrap-up

The sparse set is a useful data structure that combines the benefits of an array with stable indices and stable pointers (details in the next article). The major downside of a sparse set is that it uses more memory, as we maintain a sparse array, which isn’t particularly memory-efficient. Like most algorithms in computer science, this is a trade-off that can be really useful in many scenarios.

References

Flecs

SIMD - Vector Primitives and Operations

2024-06-09T00:00:00+07:00

For quite a while, all major CPU architectures have included support for SIMD instruction sets. Consequently, system programming languages are now beginning to offer support for SIMD, either through libraries or as first-class language primitives. SIMD provides an optimization window for modern software through data parallelism, greatly accelerating computation. This article aims to provide a detailed overview of SIMD vector primitives and operations supported by modern languages, along with some real-world examples.

SIMD (Single Instruction, Multiple Data)

SIMD stands for Single Instruction Multiple Data, which basically boils down to applying the same operation on multiple data or an array of primitives (such as integers, floats, or boolean masks). Let’s explore this concept deeper with an example.

Consider two mathematical vectors with four components each. If we want to perform element-wise addition of these vectors, it requires four separate addition operations.

let v1: [f32; 4] = [1.0, 2.0, 3.0, 4.0];
let v2: [f32; 4] = [5.0, 6.0, 7.0, 8.0];

let add: [f32; 4] = [
    v1[0] + v2[0],
    v1[1] + v2[1],
    v1[2] + v2[2],
    v1[3] + v2[3],
];

In the example, we’re performing element-wise addition on two float arrays. Imagine if there were native support for directly adding float arrays like this. That’s exactly what SIMD provides: the ability to execute a single operation, such as addition in our case, on multiple values, like two float arrays with 4 elements each.

use std::simd::f32x4;

//....

let v1 = f32x4::from_array([1.0, 2.0, 3.0, 4.0]);
let v2 = f32x4::from_array([5.0, 6.0, 7.0, 8.0]);

let add = v1 + v2;

This is a SIMD version of our example. In SIMD addition of two vectors, multiple operations are combined into a single one. There are no loops under the hood, as long as the target CPU supports SIMD. We’re essentially applying the same operation on multiple data in parallel, which is known as data-level parallelism. Instead of four separate add operations, we’re adding four numbers in parallel. This can significantly improve performance in software that does a lot of calculation or processes large amounts of sequential data.

Vector Registers

SIMD operations are backed by vector registers, which are registers capable of holding 128, 256 or even 512 bits of data. We have the ability to perform operations on these registers, such as our example above, where we use two 128-bit registers to add two arrays of 32-bit floats.

Modern system programming languages provide support for these registers through vector primitives or structs such as f32x4 in Rust, @Vector in Zig, and std::experimental::native_simd in C++. For documentation on SIMD support in each of these languages, you can follow the hyperlinks: Rust, Zig, C++.

These vector registers also support 64-bit floating-point and integer data types, as well as boolean masks ranging from 8-bit to 64-bit per element.

CPU Support

Not all CPUs support vector registers, especially the larger ones like 512-bit. However, all widely used CPU architectures do support 128-bit vector registers, making it important for programmers to be aware of their availability and utilize them effectively. In this article, we’ll mainly work with 128-bit registers, as they are widely supported. Here is the documentation for 128-bit SIMD support on various architectures:

Vector Operations

Let’s go through some of the fundamental operations you can perform with these vector registers:

Arithmetic

These don’t need any explanation; they are just your old regular arithmetic operations. The only difference being, in the case of SIMD vectors, they’re element-wise operations. So, if you multiply, add, divide, or subtract two SIMD vectors, the operations will be done element-wise. Here’s a simple diagram for multiplication; all other operations work the same way.

Comparison, Logical and Masks

Like regular primitives, SIMD vectors also support logical (AND, OR, NOT, XOR) and comparison (EQUAL, LESS, GREATER) operations. However, they work slightly differently because we’re operating on multiple values simultaneously.

The comparison operation returns a packed mask, where four 32-bit masks are packed into a single 128-bit SIMD vector. Each mask contains all 1’s for true and all 0’s for false, and the masks are stored and represented as a set of integers (four 32-bit ints in our example). For example, SIMD A > B essentially boils down to the following pseudocode:

A = [a0, a1, a2, a3]
B = [b0, b1, b2, b3]

// M = A > B
M = [
    (a0 > b0) ? 0xFFFFFFFF : 0x00000000,
    (a1 > b1) ? 0xFFFFFFFF : 0x00000000,
    (a2 > b2) ? 0xFFFFFFFF : 0x00000000,
    (a3 > b3) ? 0xFFFFFFFF : 0x00000000,
]

Logical operations, like arithmetic operations, are applied element-wise.

A = [a0, a1, a2, a3]
B = [b0, b1, b2, b3]

// M = A | B
M = [
    a0 | b0,    
    a1 | b1,
    a2 | b2,
    a3 | b3,
]
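Comparison masks and logical operations are usually combined to pick values lane by lane without branching. The sketch below emulates that with plain typed arrays in JavaScript, so there is no real SIMD involved; it uses integer lanes and made-up helper names, and only illustrates why a mask of all 1’s or all 0’s is convenient.

```javascript
// Scalar emulation of a SIMD compare on four 32-bit integer lanes.
// -1 is 0xFFFFFFFF in two's complement, i.e. all bits set.
function greaterThan(a, b) {
  return Int32Array.from(a, (x, i) => (x > b[i] ? -1 : 0));
}

// Lane-wise select: takes a[i] where the mask is all 1's, b[i] otherwise.
function select(mask, a, b) {
  return Int32Array.from(mask, (m, i) => (a[i] & m) | (b[i] & ~m));
}

const A = Int32Array.of(1, 7, 3, 9);
const B = Int32Array.of(5, 2, 8, 4);

const mask = greaterThan(A, B);
const laneMax = select(mask, A, B); // element-wise maximum of A and B
console.log(Array.from(mask), Array.from(laneMax));
```

Because every bit of a lane’s mask is the same, a single AND either keeps the whole value or clears it, and the OR merges the two candidates.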

Data Movement

Data movement involves swizzling or shuffling, where you can create a new SIMD vector by combining two input vectors based on a user-defined mask. For example:

A = [a0, a1, a2, a3]
B = [b0, b1, b2, b3]

R = shuffle(A, B, [0, 1, 6, 7]) // R = [a0, a1, b2, b3]

Here, we are copying the first two elements (a0, a1) from A and the last two elements (b2, b3) from B based on our mask [0, 1, 6, 7]. The mask is represented by an array of indices from the concatenation of A and B, i.e. [a0, a1, a2, a3, b0, b1, b2, b3].

The representation of the mask differs based on the programming language and its SIMD library. Rust uses concatenated array indices for masks, while Zig uses positive indices to select elements from the first input and negative indices to select elements from the second input.
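The two index conventions describe the same operation. As a plain JavaScript illustration (again a scalar emulation with made-up helper names, not a real SIMD API), the sketch below implements shuffle over the concatenation of both inputs and converts a mask with negative indices into concatenated indices, assuming -1 refers to the first element of the second input.

```javascript
// Picks elements from the concatenation [a..., b...].
function shuffle(a, b, indices) {
  const both = [...a, ...b];
  return indices.map((i) => both[i]);
}

// Converts a mask that uses negative indices for the second input
// (-1 is b[0], -2 is b[1], ...) into concatenated indices.
function toConcatIndices(mask, lanes) {
  return mask.map((i) => (i >= 0 ? i : lanes + ~i));
}

const va = ["a0", "a1", "a2", "a3"];
const vb = ["b0", "b1", "b2", "b3"];

const direct = shuffle(va, vb, [0, 1, 6, 7]);
const converted = toConcatIndices([0, 1, -3, -4], 4);
const viaNegative = shuffle(va, vb, converted);
console.log(direct, converted, viaNegative);
```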

Rust example. Rust uses the term swizzle for data movement operations. Rust Docs

let v1 = f32x4::from_array([1.0, 2.0, 3.0, 4.0]);
let v2 = f32x4::from_array([5.0, 6.0, 7.0, 8.0]);

let m: f32x4 = simd_swizzle!(v1, v2, [0, 1, 6, 7]); // m = [1.0, 2.0, 7.0, 8.0]

Zig example. Zig uses negative indices to select from the second vector (-1 selects its first element). Zig Docs

const v1 = @Vector(4, f32){1.0, 2.0, 3.0, 4.0};
const v2 = @Vector(4, f32){5.0, 6.0, 7.0, 8.0};

const m = @shuffle(f32, v1, v2, [_]i32{0, 1, -3, -4}); // m = [1.0, 2.0, 7.0, 8.0]

You can also rearrange the order of elements in a SIMD vector using shuffle/swizzle operations.

use std::simd::f32x4;
use std::simd::simd_swizzle;

// ...

let v1 = f32x4::from_array([1.0, 2.0, 3.0, 4.0]);
let v2 = simd_swizzle!(v1, v1, [3, 0, 2, 1]); // v2 = [4.0, 1.0, 3.0, 2.0]

Reduction

Until now, all the vector operations we explored were mostly element-wise operations on two input vectors, known as vertical operations. Another type of operation works across the elements of a single SIMD vector, known as a horizontal operation. For example, adding all elements together reduces the vector to a single 32-bit float value.

A = [a0, a1, a2, a3];

r = a0 + a1 + a2 + a3 ; // reduce sum

All three languages offer macros or helper functions for reduction.

Rust example, Rust Docs

use std::simd::f32x4;
use std::simd::num::SimdFloat;

//....

let v1 = f32x4::from_array([1.0, 2.0, 3.0, 4.0]);
let sum = v1.reduce_sum(); // 10

Zig example, Zig Docs

const v1 = @Vector(4, f32){1.0, 2.0, 3.0, 4.0};
const sum = @reduce(.Add, v1); // 10

Practical Examples

Now, let’s explore some practical use cases for these SIMD vectors and operations we’ve just covered.

Dot Product

The dot product is a mathematical operation on vectors that involves element-wise multiplication followed by the summation of the elements of the multiplication result.

let v1 = f32x4::from_array([1.0, 2.0, 3.0, 4.0]);
let v2 = f32x4::from_array([-1.0, -2.0, -3.0, -4.0]);

let dot_product = (v1 * v2).reduce_sum();

Similarly, SIMD can be applied to other linear algebra operations such as matrix multiplication, transposition, decomposition, etc. Vectors and matrices are widely used in computer graphics and image processing, making SIMD essential for accelerating computation in these areas.

Sarrus Rule

The rule of Sarrus is a scheme for computing the determinant of a 3x3 matrix; the same pattern of products also gives the cross product of two 3D vectors.

// 3x3 matrix, assuming the 4th component to be zero
fn determinant(mat: [f32x4; 3]) -> f32 {
    let m0 = mat[0]
        * simd_swizzle!(mat[1], [1, 2, 0, 3])
        * simd_swizzle!(mat[2], [2, 0, 1, 3]);

    let m1 = simd_swizzle!(mat[0], [1, 0, 2, 3])
        * simd_swizzle!(mat[1], [0, 2, 1, 3])
        * simd_swizzle!(mat[2], [2, 1, 0, 3]);

    return (m0 - m1).reduce_sum();
}

// Assuming the 4th element to be 0 for both a and b
fn cross(a: f32x4, b: f32x4) -> f32x4 {
    let temp0 = simd_swizzle!(a, [1, 2, 0, 3]);
    let temp1 = simd_swizzle!(b, [1, 2, 0, 3]);
    let temp2 = simd_swizzle!(a, [2, 0, 1, 3]);
    let temp3 = simd_swizzle!(b, [2, 0, 1, 3]);

    return (temp0 * temp3) - (temp2 * temp1);
}

Example on Compiler Explorer

String Search

At this point, it’s pretty clear that SIMD can be effectively utilized to optimize mathematical calculations. But what about other use cases? Another area where SIMD has proven its effectiveness is in parsing data. SIMD can significantly speed up something like JSON parsing. Take a look at the benchmark on simdjson.

Let’s return to our string search example. When searching for a substring in a lengthy text, we can leverage SIMD vectors to compare multiple bytes simultaneously, allowing us to implement more efficient string searching algorithms.

// Assuming both strings are ASCII (8 bits per character)
pub fn contains_substr(text: &str, substr: &str) -> bool {
    if substr.is_empty() || substr.len() > text.len() {
        return false;
    } else {
        let substr_bytes = substr.as_bytes();
        let substr_len = substr_bytes.len();

        let first_char = substr_bytes[0];
        let last_char = substr_bytes[substr_len - 1];

        // fingerprint from first and last character of the substring
        let first_fing = u8x16::splat(first_char);
        let last_fing = u8x16::splat(last_char);

        // create 16-byte chunks for fingerprint checks
        let text_bytes = pad_text(text.as_bytes(), substr_len);
        // Count chunks from the unpadded length: counting the padded length would
        // add extra chunks for substrings longer than 16 bytes and read out of bounds.
        let total_chunks = text.len().div_ceil(16);
        for (i, chunk) in text_bytes.chunks(16).enumerate().take(total_chunks) {
            // blocks to compare fingerprints with
            let first_block = u8x16::from_slice(chunk);
            // second_block start from start + offset where offset = substr_len - 1
            let sb_start = (i * 16) + substr_len - 1;
            let second_block = u8x16::from_slice(&text_bytes[sb_start..sb_start + 16]);

            let eq_a = first_block.simd_eq(first_fing);
            let eq_b = last_fing.simd_eq(second_block);

            let mut mask = eq_a & eq_b;

            // fingerprint match
            while mask.any() {
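                // actual comparison, we can replace this with SIMD as well but this should be
                // trivial enough for compiler optimization
                let set_index = mask.first_set().unwrap();
                let f = set_index + (i * 16) + 1;
                // substrings of length 1 or 2 are fully covered by the fingerprint;
                // skipping the slice comparison also avoids an invalid range for length 1
                if substr_len <= 2
                    || text_bytes[f..f + substr_len - 2] == substr_bytes[1..substr_len - 1]
                {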
                    return true;
                }
                // f - 1 starting index of substring in the text
                mask.set(set_index, false);
            }
        }
        return false;
    }
}

// Padding could be done in a better way
fn pad_text(data: &[u8], substr_len: usize) -> Vec<u8> {
    // Determine the padding needed
    let padding_needed = (16 - (data.len() % 16)) % 16 + substr_len - 1;
    let mut padded_data = Vec::with_capacity(data.len() + padding_needed);
    padded_data.extend_from_slice(data);
    padded_data.resize(data.len() + padding_needed, b'\0');
    padded_data
}

Example on Compiler Explorer

This substring search example is based on the SIMD-friendly Rabin-Karp modification. While I haven’t benchmarked the algorithm myself, the referenced article does contain benchmarks demonstrating its effectiveness.

Auto Vectorization

Auto vectorization is a compiler optimization technique where the compiler automatically vectorizes array operations to some extent. While it’s generally beneficial to let the compiler handle most optimization tasks, auto vectorization doesn’t always yield the desired outcome, particularly in cases where vectorization is nontrivial. In such situations, you may need to manually write your own vectorized code. This is where the support for SIMD in modern languages truly shines, offering developers the flexibility to optimize performance-critical code.
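To make this concrete, here is a minimal sketch of a loop that compilers can often auto-vectorize. The function name `scale_add` is made up for illustration, and whether the loop is actually vectorized depends on the target, the optimization level, and the compiler version, so it is worth checking the generated assembly rather than assuming.

```rust
/// Element-wise `out[i] = a[i] * scale + b[i]`.
/// Iterating with `zip` removes per-index bounds checks, which tends to make
/// loops like this easier for the optimizer to turn into SIMD instructions.
/// Only as many elements as the shortest slice are processed.
pub fn scale_add(a: &[f32], b: &[f32], scale: f32, out: &mut [f32]) {
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = *x * scale + *y;
    }
}
```

Each output element depends only on the inputs at the same index, which is the easy case. A floating-point sum over the whole slice is harder, because vectorizing it changes the order of the additions.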

Wrap-up

Vector primitives are incredibly powerful tools for speeding up computations. System programming languages are now incorporating support for them, whether through libraries or as first-class language features. This support gives developers the ability to leverage SIMD technology in a more portable manner, enabling us to write more efficient software.

Hey, you made it to the end! You might want to check out a linear algebra library I recently wrote in Zig called zig_matrix. I’m extensively using Zig’s @Vector SIMD support in my implementation of some of the most widely known and utilized linear algebra operations. Feel free to email me with any feedback or questions!


Two-Level Segregated Fit Memory Allocator

2024-04-20T00:00:00+07:00

Last week, I decided to develop a simple memory allocator for Vulkan. Initially, it was meant to be a quick combination of a pool allocator and a free-list allocator (with the free list backed by pools of memory). However, I was not satisfied with it and started looking into improving my allocator, which led me to a paper titled “TLSF: a New Dynamic Memory Allocator for Real-Time Systems”. So here we are.

Introduction

The most basic data structure for memory allocation is the free list. As the name suggests, it involves maintaining a linked list of free memory blocks. Depending on our strategy (first-fit, best-fit, good-fit), we traverse the linked list to find a suitable block for a memory allocation. While very simple to implement, this approach results in an O(N) search in the worst case, where N is the number of free blocks. That might not be a problem for most applications, but it’s a different story for embedded or real-time systems.

Two-Level Segregated Fit (TLSF)

The TLSF memory allocation algorithm provides O(1) memory allocation and deallocation with a good-fit strategy. TLSF utilizes a two-level segregated data structure to optimize lookups on the free list. As with many data structures in computer science, fast lookup is achieved through binning or bucketing.

The basic idea is to have M blocks or bins, each of which is further divided into N blocks or subbins. These subbins then store our free lists.

To determine the bin sizes, we follow a specific approach. The idea is to arrange the first-level bins in power-of-two intervals (2^bin_idx), and then divide each bin further into N subbins, with the subbin division being linear.

For example, free memory blocks with a size in the range [2⁴, 2⁵) will be placed inside the bin with index 4. To determine the subbin index, we take the block size, subtract 2^bin_idx from it, and divide the result by the subbin size, bin_interval / subbin_count.

Given the memory size, we can compute bin and subbin index with:
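As a minimal sketch of the computation described above, written in Rust: the subbin count of 4 is an assumption chosen so that the [2⁴, 2⁵) example divides evenly, and the names `SUB_BIN_LOG2` and `bin_indices` are illustrative.

```rust
/// log2 of the number of subbins per bin (an assumed value for this sketch).
const SUB_BIN_LOG2: u32 = 2; // 4 subbins per bin

/// Maps a block size to (bin index, subbin index).
/// Assumes `size >= 1 << SUB_BIN_LOG2`, so every subbin is at least one byte wide.
pub fn bin_indices(size: u64) -> (u32, u32) {
    assert!(size >= 1 << SUB_BIN_LOG2);
    // bin index = floor(log2(size)), i.e. the position of the highest set bit
    let bin_idx = 63 - size.leading_zeros();
    // each bin spans 2^bin_idx bytes, split linearly into equally sized subbins
    let sub_bin_size = (1u64 << bin_idx) >> SUB_BIN_LOG2;
    let sub_bin_idx = (size - (1u64 << bin_idx)) / sub_bin_size;
    (bin_idx, sub_bin_idx as u32)
}
```

With these assumptions, a block of size 20 lands in bin 4 (the range [16, 32)) and in subbin 1 (the range [20, 24)).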

With this, we know where to store or find memory blocks given their size. But how do we determine which bin and subbins have free memory to allocate from? For this, we can use a lookup table that maps the bin index to a boolean value indicating whether it’s free or not. This can be easily implemented as a bitset. We’ll need a bitset for the first-level bin and bitsets for each subbin under each bin.

With this, our data structure is nearly complete. All that remains is to solve a minor edge case.

In the first-level bin, the starting bin intervals are very small (2⁰, 2¹, 2², …). Since they can only hold a very small set of sizes, we can optimize them away by making the first bin (index 0) a linear, fixed-size bin and using it for all small allocations. As you can see, the first bin looks different. Now the first bin has a fixed size of 2⁷, and the second bin covers the interval that starts at 2⁷. This also implies that we need to subtract the exponent of this fixed size from the power-of-two exponent in order to compute the actual bin index.

First, we define our fixed linear interval (Linear), and then we compute our bin and sub-bin index accordingly.

This ensures that all blocks in the range [2^0, 2^7) exist in bin 0, and the range [2^7, …) starts from bin 1. With these adjustments, we are ready to implement our own TLSF allocator.
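Here is a hedged Rust sketch of that adjusted mapping (rounding down, as used when storing a block). The constants `LINEAR = 7` and `SUB_BIN = 5` (32 subbins per bin) and the name `map_down` are assumptions made for this example.

```rust
const LINEAR: u32 = 7; // log2 of the linear range covered by bin 0
const SUB_BIN: u32 = 5; // log2 of the subbin count (32 subbins per bin)
const SUB_BIN_COUNT: u64 = 1 << SUB_BIN;

/// Maps a block size to (bin index, subbin index), rounding down.
pub fn map_down(size: u64) -> (u32, u32) {
    // OR-ing in 2^LINEAR clamps the highest set bit to at least LINEAR,
    // so every size below 2^LINEAR is handled by the linear first bin.
    let msb = 63 - (size | (1 << LINEAR)).leading_zeros();
    let log2_sub_bin_size = msb - SUB_BIN;
    // For sizes >= 2^LINEAR this value is in [32, 64): the leading bit is still set.
    let sub = size >> log2_sub_bin_size;
    // Subtract LINEAR, then let the leading bit (0 or 1) push
    // non-linear sizes one bin up, past the linear bin 0.
    let bin_idx = (msb - LINEAR) + (sub >> SUB_BIN) as u32;
    let sub_bin_idx = (sub & (SUB_BIN_COUNT - 1)) as u32;
    (bin_idx, sub_bin_idx)
}
```

For instance, size 100 maps to bin 0, subbin 25 (bin 0 has 4-byte subbins), while size 128 is the first size that maps to bin 1.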

Implementation Details

Here are some Zig code snippets for implementing a TLSF allocator. Let’s start by defining our constants:

const linear: u8 = 7; // log2(min_allocation_size)
const sub_bin: u8 = 5; // log2(sub_bin_count)
const bin_count: u32 = 64 - linear; // 57 first-level bins for a 64-bit size
const sub_bin_count: u32 = 1 << sub_bin;
const min_alloc_size: u32 = 1 << linear;

Bin Mapping

We need to map allocation and memory block sizes to the proper bin and subbin indices. Two types of mapping are required here: map up and map down. Whenever we want to search for free blocks in order to allocate memory, we need to map up, which is achieved by rounding the size up to the next subbin. This is necessary because we need to look for a subbin whose blocks can at least fit the requested size.

fn binmap_up(size: vk.DeviceSize) BlockMap {
    const bin_idx: u32 = bit_scan_msb(size | min_alloc_size);
    const log2_subbin_size: u6 = @intCast(bin_idx - sub_bin);
    const next_subbin_offset = (@as(u64, 1) << (log2_subbin_size)) - 1; // block_size - 1
    const rounded = size +% next_subbin_offset;
    const sub_bin_idx = rounded >> log2_subbin_size; // rounded_size / block_size
    
    const adjusted_bin_idx: u32 = @intCast((bin_idx - linear) + (sub_bin_idx >> sub_bin)); // adjust bin_idx with linear
    const adjusted_sub_bin_idx: u32 = @intCast(sub_bin_idx & (sub_bin_count - 1)); // sub_bin_idx % sub_bin_count
    const rounded_size = (rounded) & ~next_subbin_offset;
    
    std.debug.assert(adjusted_bin_idx < bin_count);
    std.debug.assert(adjusted_sub_bin_idx < sub_bin_count);
    
    return .{
        .bin_idx = adjusted_bin_idx,
        .sub_bin_idx = adjusted_sub_bin_idx,
        .rounded_size = rounded_size,
    };
}

For other operations, like inserting a new free block, we’ll map down.

fn binmap_down(size: vk.DeviceSize) BlockMap {
    const bin_idx: u32 = bit_scan_msb(size | min_alloc_size);
    const log2_subbin_size: u6 = @intCast(bin_idx - sub_bin);
    const sub_bin_idx = size >> log2_subbin_size; // size / block_size

    const adjusted_bin_idx: u32 = @intCast((bin_idx - linear) + (sub_bin_idx >> sub_bin));
    const adjusted_sub_bin_idx: u32 = @intCast(sub_bin_idx & (sub_bin_count - 1));
    const rounded_size = size;

    std.debug.assert(adjusted_bin_idx < bin_count);
    std.debug.assert(adjusted_sub_bin_idx < sub_bin_count);

    return .{
        .bin_idx = adjusted_bin_idx,
        .sub_bin_idx = adjusted_sub_bin_idx,
        .rounded_size = rounded_size,
    };
}

Free Block Lookup

// bit sets for our lookup table
bin_bitmap: u32,
sub_bin_bitmap: [bin_count]u32,

We first map the input size to bin and subbin indices and then perform a lookup on the bitsets to check whether the mapped bin and subbin have free blocks or not. If not, we look up the next free bin, whose blocks are all large enough by construction.

fn findFreeBlock(self: TSFLAllocator, size: vk.DeviceSize) !BlockMap {
    var map = binmap_up(size);
    // look up with mapped bin and sub_bin
    var sub_bin_bitmap = self.sub_bin_bitmap[map.bin_idx] & (~@as(u32, 0) << @intCast(map.sub_bin_idx));

    // not found
    if (sub_bin_bitmap == 0) {
        // search for next free bin
        const bin_bitmap = self.bin_bitmap & (~@as(u32, 0) << @intCast(map.bin_idx + 1));
        // no free bins
        if (bin_bitmap == 0) return error.OutOfFreeBlock;
        // convert bitset flag to bin index
        map.bin_idx = @ctz(bin_bitmap);
        // any subbin will suffice
        sub_bin_bitmap = self.sub_bin_bitmap[map.bin_idx];
    }

    // get index of free block
    map.sub_bin_idx = @ctz(sub_bin_bitmap);

    return BlockMap{
        .bin_idx = map.bin_idx,
        .sub_bin_idx = map.sub_bin_idx,
        .rounded_size = map.rounded_size,
    };
}

Insert or Remove Block

Whenever we insert or remove a free block from our TLSF structure, we’ll need to update the lookup table as well.

fn insertFreeBlock(self: *TSFLAllocator, block: *Block) void {
    const map = binmap_down(block.size);

    //////////////////////////////////////////////////////
    //  You'd be updating your freelist here
    //////////////////////////////////////////////////////
 
    // set bin and subbin bitset
    self.bin_bitmap |= @as(u32, 1) << @intCast(map.bin_idx);
    self.sub_bin_bitmap[map.bin_idx] |= @as(u32, 1) << @intCast(map.sub_bin_idx);
}
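Removal is the mirror image of insertion. The following Rust sketch shows only the bitmap bookkeeping for both directions, under the assumption of 32 bins so that each bitmap fits in a u32. The type `FreeBitmaps` and its method names are hypothetical, and the free-list handling itself is left out.

```rust
/// One bit per bin; 32 bins is an assumption that keeps each bitmap a single u32.
const BIN_COUNT: usize = 32;

pub struct FreeBitmaps {
    bin_bitmap: u32,
    sub_bin_bitmap: [u32; BIN_COUNT],
}

impl FreeBitmaps {
    pub fn new() -> Self {
        Self { bin_bitmap: 0, sub_bin_bitmap: [0; BIN_COUNT] }
    }

    /// Call after pushing a block onto the (bin, sub_bin) free list.
    pub fn mark_non_empty(&mut self, bin: usize, sub_bin: usize) {
        self.sub_bin_bitmap[bin] |= 1u32 << sub_bin;
        self.bin_bitmap |= 1u32 << bin;
    }

    /// Call after the (bin, sub_bin) free list has become empty.
    pub fn mark_empty(&mut self, bin: usize, sub_bin: usize) {
        self.sub_bin_bitmap[bin] &= !(1u32 << sub_bin);
        // the bin bit is cleared only once none of its subbins has free blocks
        if self.sub_bin_bitmap[bin] == 0 {
            self.bin_bitmap &= !(1u32 << bin);
        }
    }

    pub fn bin_has_free(&self, bin: usize) -> bool {
        self.bin_bitmap & (1u32 << bin) != 0
    }

    pub fn sub_bin_has_free(&self, bin: usize, sub_bin: usize) -> bool {
        self.sub_bin_bitmap[bin] & (1u32 << sub_bin) != 0
    }
}
```

The important detail is the order of checks on removal: a subbin bit is cleared only when its free list is empty, and a bin bit only when all of its subbin bits are clear.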

Good-fit

Since we are rounding up the size to the next subbin during free block lookup, TLSF will try to return the smallest chunk of memory big enough to hold the requested block. This makes the algorithm almost best-fit but not exactly best-fit, also called good-fit.
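To get a feel for how close good-fit is to best-fit, here is a small Rust sketch of just the rounding step. It assumes a linear exponent of 7 and 32 subbins per bin, and the function name `round_up_to_sub_bin` is made up for this example. It also ignores overflow for sizes close to the maximum of u64.

```rust
const LINEAR: u32 = 7; // log2 of the linear range covered by bin 0
const SUB_BIN: u32 = 5; // log2 of the subbin count (32 subbins per bin)

/// Rounds a requested size up to the start of the next subbin.
/// Every block stored in that subbin is at least this large.
pub fn round_up_to_sub_bin(size: u64) -> u64 {
    let msb = 63 - (size | (1 << LINEAR)).leading_zeros();
    let sub_bin_size = 1u64 << (msb - SUB_BIN);
    // round up to the next multiple of the subbin size
    (size + sub_bin_size - 1) & !(sub_bin_size - 1)
}
```

A request for 257 bytes falls in a bin with 8-byte subbins, so it is rounded up to 264. The rounding itself therefore adds less than one subbin size on top of the request.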

Wrap-Up

And that’s it. Everything from here on involves managing the free lists that are associated with our subbins. Since searching, inserting and removing are now O(1) with the help of our fast bitset lookup, the resulting allocation or free operation is also O(1). This kind of binning algorithm has multiple use cases, with optimizing memory allocation being one of them.

Complete Example

Zig Implementation


Code Principle - Data & Transformation

2022-08-12T00:00:00+07:00

Hello there. It’s been a while since my last post. There goes my goal of writing at least one article every month. This time it’s a lot more theoretical, as opposed to my previous pile of technical turd. So let us begin.

Introduction

Most often we tend to only think about code in terms of logic or structure. When designing software, we often worry about languages and frameworks, data structure and algorithms, and classes and modules. We tend to forget about one of the most important aspects of software – Data and Transformation. Let us explore how taking Data and Transformation into account ultimately leads to a better software design.

We shall begin by forming a concise yet flexible criterion for ‘good design’.

What is Good Software Design?

Software – as its name suggests – is meant to change and extend. It is ‘soft’, unlike hardware, which is generally meant to be fixed (hard). Software is expected to be susceptible to changes. When the requirements change, it falls upon us the developers to make sure that our software effectively adapts to the proposed changes. This often boils down to diving into the codebase and making required changes. Based on this, we can argue that good software design is design that is susceptible to change - i.e. easier to change.

There are many patterns, principles and guidelines that help in designing software that is easier to change. Taking Data and Transformation into account while designing is one such guideline that we’ll be exploring today.

Data & Transformation

Data is a vital part of any program. A program, in a basic sense, is a set of instructions that operates, communicates, processes, stores and/or presents data. When data flows through a program, it passes through operations that might introduce changes to the data in different ways, which can be defined as a Transformation. When a program operates on or changes a piece of data, we can basically say that the program transforms the data.

Programming is About Code, But Programs Are About Data.
- The Pragmatic Programmer

// A basic example of data & transformation

const dataA = [1203, 2123, 2134, 2323];

const dataB = dataA.map((x) => x.toString());

Here, the array of numbers dataA is transformed into the array of strings dataB via map, which is a function that defines the transformation.

Thinking in Terms of Transformation

We can start thinking of a program as a series of transformations. The data passes through one or more transformations, each one of which operates on the data in order to produce the desired output. Let’s make this concept clear with a quick example.

// A program that parses a user.csv file and sends a notification
// to each user with an email, collecting failures if any.

const users = CSV.parse("user.csv");
// filter users with valid email
const usersWithValidEmail = users.filter((u) => u.email);
// extract email array
const emails = usersWithValidEmail.map((u) => u.email);
// send notification
const results = sendNotification(emails);
// filter failure
const failures = results.filter((r) => !r.success);

The user.csv file here contains our data. Our program basically does 4 operations on this data:

Acquire through parse,
Filter valid emails,
Send notification to those emails,
Filter failures.

To carry out these operations, we move our data through a series of transformations until we reach our end goal.

This approach in programming, where we chain multiple transformations by passing the output of one as the input to another, is known as pipelining. Pipelining is mostly provided as a feature in functional languages, where a pipeline operator |> automatically pipes the output of one transformation as input to the next.

CSV.parse('user.csv')
  |> Enum.filter(& !is_nil(&1.email))
  |> Enum.map(& &1.email)
  |> send_notification()
  |> Enum.filter(& !&1.success)

This is the same example, but with the pipe operator. Another way to do this kind of chaining is with composition. Composition is somewhat similar to pipelining but true to its mathematical meaning.
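To illustrate composition, here is a small sketch in Rust. The `compose` helper, the `User` type and the two transformations are all hypothetical and not part of any library. `compose(g, f)` follows the mathematical convention, so the resulting function evaluates `g(f(x))`.

```rust
/// Composes two functions: `compose(g, f)(x)` evaluates `g(f(x))`.
pub fn compose<A, B, C>(g: impl Fn(B) -> C, f: impl Fn(A) -> B) -> impl Fn(A) -> C {
    move |x| g(f(x))
}

#[derive(Clone, Debug, PartialEq)]
pub struct User {
    pub name: String,
    pub email: Option<String>,
}

/// Transformation 1: keep only the users that have an email.
pub fn with_email(users: Vec<User>) -> Vec<User> {
    users.into_iter().filter(|u| u.email.is_some()).collect()
}

/// Transformation 2: extract the email of every user.
/// A user without an email yields an empty string.
pub fn emails(users: Vec<User>) -> Vec<String> {
    users
        .into_iter()
        .map(|u| u.email.unwrap_or_default())
        .collect()
}
```

With these in place, `compose(emails, with_email)` builds a single new transformation out of the two smaller ones, and it can be passed around and reused like any other function. A pipeline applies the steps to a value right away, whereas composition gives us the combined function first.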

The goal is not to hoard state, but to pass them around and have our code conduct required operations, which in turn produce new state that’s essentially derived from the old state. Note that we try not to mutate state, but instead we derive new state from the previous one.

It’s always a good idea to avoid mutation whenever possible. Avoiding mutation deserves its own separate article, but I digress.

Since ‘Thinking in terms of transformation’ is a mouthful, let’s just call this Transformative Programming.

Transformative Programming as a Good Design Approach

As we have learned, good software design is design that’s easier to change, but we haven’t yet explored what makes bad software design. Or to rephrase, what makes software difficult to change?

Coupling.

Coupling ties things together, so that it's harder to change just one thing.
- The Pragmatic Programmer

Transformative Programming helps reduce coupling. Transformations are by design independent of each other. In fact, a transformation doesn’t even need to know about the existence of any other transformation. A transformation is only concerned with a specific operation that needs to be performed on its input.

Besides reduced coupling, Transformative Programming provides other goodies, such as increased code readability and better DRY (don’t repeat yourself). As the program is divided into similar, well-defined transformations, readability increases. The fact that the code is being broken into smaller transformations means that they can be effectively reused as needed.

Let’s look at an example of a method sendNotification, which is responsible for sending notifications to a given user.

function sendNotification(notification, user) {
  const results = [];
  const devices = user.getDevices();
  // loop through user devices
  for (const device of devices) {
    const registrationId = device.registrationId;
    // pre-process notification content
    const body = processTemplate(notification.body, user);
    const title = processTemplate(notification.title, user);
    // queue notification
    const result = NotificationClientApi.queueNotification(registrationId, {
      body: body,
      title: title,
    });
    // store result
    results.push(result);
  }

  return results;
}

Now let’s write this code with a transformative approach.

// A function to pre-process notification content,
function processNotificationContent(notification, user) {
  const body = processTemplate(notification.body, user);
  const title = processTemplate(notification.title, user);

  return { body: body, title: title };
}

// A function that actually queues the notification, using some client library.
function queueNotification(registrationIds, notificationContent) {
  return registrationIds.map((regId) =>
    NotificationClientApi.queueNotification(regId, notificationContent)
  );
}

function sendNotification(notification, user) {
  // 1 - processNotificationContent
  const notificationContent = processNotificationContent(notification, user);
  // 2 - get registrationIds
  const registrationIds = user.getDevices().map((d) => d.registrationId);
  // 3 - queue notification
  return queueNotification(registrationIds, notificationContent);
}

We can immediately notice that the code is a lot more readable, along with better separation of concern, as each transformation only applies a specific operation on the data. The code has become much easier to change.

For example, if we add an additional templated field on the notification object, just by having a glance at the piece of code, we find that the only thing that needs to be updated is processNotificationContent.

Transformative Programming in Real World

At this point, it’s possible to develop the impression that transformative programming is only applicable to functional programming. We will discover this to not be the case, since transformative programming can find its way into your code base and improve it regardless of the style you follow.

In fact, you might already be using aspects of transformative programming, with utility functions like map, reduce, filter etc. that most programming languages now provide within their standard library.

Let’s have a look at yet another contrived example:

Let’s say we have an e-commerce application and need to calculate discounts for product orders in a shopping cart, based on a known fixed discount value. The discount might be applied to select orders or the whole cart (all orders).

// Example:1 - Normal OOP
class Order {
  /* Everything that a typical order contains */

  // Applies discount and calculates final cost
  calculateCostWithDiscount(discountValue) {
    // calculate new cost
    const newCost = this.cost - discountValue;
    // check if its within range
    return newCost > this.lowestPossiblePrice
      ? newCost
      : this.lowestPossiblePrice;
  }

  // Applies discount to the order, mutating the order
  applyDiscount(discountValue) {
    this.costAfterDiscount = this.calculateCostWithDiscount(discountValue);
  }
}

class Cart {
  /* Everything that a typical cart contains */

  // Calculates total cost with discount applied to all orders in the cart
  calculateTotalCostWithDiscount(discountValue) {
    let totalCost = 0;
    // calculate cost after the discount has been applied
    for (const order of this.orders) {
      totalCost += order.calculateCostWithDiscount(discountValue);
    }
    return totalCost;
  }

  // Applies discount to each order in the cart, mutating the cart
  applyDiscount(discountValue) {
    for (const order of this.orders) {
      order.applyDiscount(discountValue);
    }
  }

  // Applies discount to all orders in the cart-
  // which match the given orderIds, mutating the cart
  applyDiscountOnOrders(orderIds, discountValue) {
    for (const cartOrder of this.orders) {
      for (const applicableOrderId of orderIds) {
        if (cartOrder.id === applicableOrderId) {
          cartOrder.applyDiscount(discountValue);
        }
      }
    }
  }
}

Here we have a basic cart-order system written with an Object-Oriented approach, which basically calculates or applies discount to the orders in cart. Everything is well-contained and the coupling is managed, but we could improve it with some Transformative Programming.

// Example:2 - OOP with aspect of transformation
class Order {
  /* Everything that a typical order contains */

  calculateCostWithDiscount(discountValue) {
    // we leverage Math.max instead of comparing values
    return Math.max(this.cost - discountValue, this.lowestPossiblePrice);
  }

  applyDiscount(discountValue) {
    this.costAfterDiscount = this.calculateCostWithDiscount(discountValue);
  }
}

class Cart {
  /* Everything that a typical cart contains */

  calculateTotalCostWithDiscount(discountValue) {
    // return this.orders.reduce((acc, curr) => acc + curr.calculateCostWithDiscount(discountValue), 0);

    // 1 - we first transform orders into numbers (cost with discount)
    const costs = this.orders.map((order) => order.calculateCostWithDiscount(discountValue));
    // 2 - we transform the numbers using reduce to a single summed value
    return costs.reduce((acc, curr) => acc + curr, 0);
  }

  applyDiscount(discountValue) {
    this.orders.forEach((order) => order.applyDiscount(discountValue));
  }

  applyDiscountOnOrders(orderIds, discountValue) {
    // 1 - we transform cart orders into applicableOrders using filter + includes
    const applicableOrders = this.orders.filter((cartOrder) => orderIds.includes(cartOrder.id));
    // 2 - apply discount to each order, mutating it in the process
    applicableOrders.forEach((order) => order.applyDiscount(discountValue));
  }
}

Now the code is much more readable. It can be further improved by minimizing mutations and switching to structured types, which in turn makes our code reusable in a way that wasn’t possible before. Also, I recommend using something like lodash to compensate for the lack of proper utilities in JavaScript.

/**
 * order
 **/
type Order = {
  /* Everything that a typical order contains */
};

// Applies discount to the order, without mutating the order
function orderWithDiscount(order: Order, discountValue: number): Order {
  const costAfterDiscount = Math.max(
    order.cost - discountValue,
    order.lowestPossiblePrice
  );

  // return a new order with discount applied
  // (clone is assumed to copy its first argument and override the given fields)
  return clone(order, {
    costAfterDiscount: costAfterDiscount,
    discountValue: discountValue,
  });
}

function orderCostAfterDiscount(order: Order): number {
  return order.costAfterDiscount;
}

/**
 * Cart
 **/
type Cart = {
  /* Everything that a typical cart contains */
};

// Calculate total cost including the discount
function cartCalcTotalCostWithDiscount(
  cart: Cart,
  discountValue: number
): number {
  // Since orderWithDiscount doesn't modify the order, we can use it to calculate the total
  const orders = cart.orders.map((order) =>
    orderWithDiscount(order, discountValue)
  );
  return orders.reduce((acc, curr) => acc + orderCostAfterDiscount(curr), 0);
}

// Applies discount to orders in a cart, without mutating the cart
function cartWithDiscount(cart: Cart, discountValue: number): Cart {
  const orders = cart.orders.map((order) =>
    orderWithDiscount(order, discountValue)
  );

  // return a new cart with updated orders
  return clone(cart, { orders: orders });
}

// using lodash
// Applies discount to orders that match the given order ids,
// without mutating the supplied cart
function cartWithDiscountOnOrders(
  cart: Cart,
  orderIds: OrderId[],
  discountValue: number
): Cart {
  // 1 - filter applicable orders
  const applicableOrders = _.filter(cart.orders, (order) =>
    _.includes(orderIds, order.id)
  );
  // 2 - apply discount to applicable orders
  const discountedOrders = _.map(applicableOrders, (order) =>
    orderWithDiscount(order, discountValue)
  );
  // 3 - orders that are not applicable (cart orders minus applicable orders)
  const filteredOrders = _.difference(cart.orders, applicableOrders);

  // return new cart with updated orders
  return clone(cart, { orders: _.concat(filteredOrders, discountedOrders) });
}

Let’s add some additional functionality, like payment processing and a condition for the discount to be applicable. Both are just more transformations that contain the steps required to process a payment.

The discount should be applied only if the total cost of the cart is greater than a certain threshold.

function processPayment(cart: Cart, payment: Options): Cart {
  const totalCost = cart.orders.reduce((acc, order) => acc + orderCostAfterDiscount(order), 0);
  /**
   * Apply necessary transformations, call payment service APIs
  **/
  const paymentStatus = ...;
  return clone(cart, {paymentStatus: paymentStatus})
}


/**
 Apply discount and process payment
**/
const config = ..;
const paymentOptions = ..;

// 1 - calculate total cost
const totalCost = cart.orders.reduce((acc, curr) => acc + curr.cost, 0);
// 2 - check if discount is applicable
const discountValue = totalCost > config.discountThreshold ? config.discountValue : 0;
// 3 - apply the discount
const discountedCart = cartWithDiscount(cart, discountValue);
// 4 - process payment
const paymentProcessedCart = processPayment(discountedCart, paymentOptions);
/**
  Somewhere down the line, we save our most recent state
 */
db.persist(paymentProcessedCart);

Here, we are able to represent our business logic in terms of data and transformation, where each transformation is self-descriptive and isolated. There is no mutation; each transformation leads to a new state that the caller can use as they prefer. For example, we use orderWithDiscount inside cartCalcTotalCostWithDiscount to calculate the total cost with the discount applied, while avoiding mutation of the orders in the existing cart.
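The same idea carries over to other languages. As a rough sketch, here is how the non-mutating order and cart transformations could look in Rust. The struct fields and function names are assumptions that mirror the example above, with costs kept as plain integers for simplicity.

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct Order {
    pub cost: i64,
    pub lowest_possible_cost: i64,
    pub cost_after_discount: i64,
}

#[derive(Clone, Debug, PartialEq)]
pub struct Cart {
    pub orders: Vec<Order>,
}

/// Returns a new order with the discount applied; the input is left untouched.
pub fn order_with_discount(order: &Order, discount: i64) -> Order {
    Order {
        cost_after_discount: (order.cost - discount).max(order.lowest_possible_cost),
        ..order.clone()
    }
}

/// Returns a new cart in which every order has the discount applied.
pub fn cart_with_discount(cart: &Cart, discount: i64) -> Cart {
    Cart {
        orders: cart
            .orders
            .iter()
            .map(|order| order_with_discount(order, discount))
            .collect(),
    }
}

/// Sums the discounted cost of all orders in the cart.
pub fn cart_total(cart: &Cart) -> i64 {
    cart.orders.iter().map(|order| order.cost_after_discount).sum()
}
```

Because the functions take shared references and return new values, the caller still holds the original cart after computing a discounted one, and can compare or discard either of them.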

Conclusion

We now understand that thinking in terms of data transformation leads to code that is easier to change and understand. Even though transformative programming has its roots in functional programming, we can easily adapt its aspects to any programming approach.

References

The Pragmatic Programmer by David Thomas & Andrew Hunt

Understanding Static Variables in Rust

2021-05-29T00:00:00+07:00

Hello there, I hope you are doing ok. Today I would like to talk about static variables in Rust, compare them with static variables in C++ and also try to reason about the rules imposed by Rust on static variables.

Introduction

Static variables are variables declared with the static keyword. Each one represents a specific global memory location, which is why they are also known as global variables. Static variables have the static lifetime: they never go out of scope and are guaranteed to outlive any other variable. Even when they are declared inside a scope, their lifetime does not begin or end with that scope.

fn func() -> &'static i32 {
   // global variable, same global memory location for every call.
   static SOMETHING: i32 = 0;

   return &SOMETHING;
}

// every call to func() returns reference to same memory location.
func();
func();

In Rust,

static variables must be initialized at compile time (meaning they cannot be initialized with state that is only known at runtime),
the type of a static variable must implement the Sync trait (meaning the type is safe to share between threads; Sync is automatically derived, with some exceptions), and
mutating a static variable is only possible in an unsafe context.

In this post we’ll try to reason about these rules that are imposed by Rust on the static variables and also talk about why such rules are important.

// OK - 0 is known at compile time.
static SOME_THING: i32 = 0;

// Error - heap allocation is only possible at runtime.
static MEM: Box<i32> = Box::new(0);

To understand rules behind static variables let us take a short dive into the land of assembly.

Land of Assembly

Rust is a native language that compiles down to assembly. An assembly program is generally divided into three sections:

data
bss
text

The data section contains all the initialized static variables with their initial value, bss section contains all uninitialized/zero-initialized static variables and finally the text section contains all our code in assembly. You can read more about assembly layout here. (This stuff is platform dependent so take it with a grain of salt.)

Back to Rust

Rust does not allow uninitialized static variables, so the data and bss sections may contain only initialized or zero-initialized static variables. Also, since the assembly is generated by compiling the Rust code and must already contain the static variables in these special sections, a static variable must be initialized at compile time.
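Compile-time initialization is not limited to literals. Anything the compiler can evaluate, such as a call to a const fn, is a valid initializer. Here is a small sketch; square, squares, AREA and TABLE are hypothetical names used only for illustration.

```rust
// A const fn can be evaluated by the compiler, so its result is a
// valid initializer for a static.
const fn square(x: u32) -> u32 {
    x * x
}

// Builds a small lookup table entirely at compile time.
const fn squares() -> [u32; 4] {
    let mut out = [0; 4];
    let mut i = 0;
    // `for` loops are not available in const fn, so use `while`.
    while i < 4 {
        out[i] = square(i as u32);
        i += 1;
    }
    out
}

const SIDE: u32 = 12;

// Both values are computed by the compiler and baked into the binary.
static AREA: u32 = square(SIDE);
static TABLE: [u32; 4] = squares();
```

Here AREA holds 144 and TABLE holds [0, 1, 4, 9] before the program even starts.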

This does not mean you cannot have a static variable that stores state which can only be known at runtime. It just means that you need to initialize the static with a state or value that is known at compile time. An easy way to store a value that can only be known at runtime is to use an enum (variant) or something like Option: set it to a compile-time known value and update it later at runtime.

// Ok - initialized with compile-time known state/value.
static mut MEM: Option<Box<i32>> = None;

// ....... somewhere ........ //

// Ok
unsafe { MEM = Some(Box::new(5)) };

It is not recommended to use a mutable static, since it is quite easy to run into undefined behavior with it.
I recommend using lazy_static instead, or checking the end of this article for a slightly better implementation.

As one of Rust’s goals is to make concurrency bugs harder to run into, reading or writing a mutable static is unsafe: static variables are shared between threads, so a mutable static might run into race conditions in a concurrent program. This is why it is particularly important to guard a mutable static with a lock. For the same reason, the type of a non-mutable static variable must only allow thread-safe access.
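One way to follow this advice without any unsafe code is to put the lock inside a non-mutable static. The sketch below assumes Rust 1.63 or newer, where Mutex::new can be used in a static initializer; TOTAL and add_concurrently are hypothetical names.

```rust
use std::sync::Mutex;
use std::thread;

// Mutex<u64> is Sync, and Mutex::new is a const fn (Rust 1.63+),
// so it can initialize a static directly.
static TOTAL: Mutex<u64> = Mutex::new(0);

// Hypothetical helper: spawns `threads` workers that each add 1 to
// TOTAL `per_thread` times, then returns the current total.
fn add_concurrently(threads: u64, per_thread: u64) -> u64 {
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            thread::spawn(move || {
                for _ in 0..per_thread {
                    // The lock serializes access, so no update is lost.
                    *TOTAL.lock().unwrap() += 1;
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    let total = *TOTAL.lock().unwrap();
    total
}
```

Calling add_concurrently(4, 1000) raises the total by exactly 4000, which would not be guaranteed with an unguarded static mut.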

Let us now move our focus to C++.

Static Initialization in C++

C++ allows a static variable to be initialized even with state which can only be known at runtime. This is possible mainly for two reasons:

First, C++ allows uninitialized variables.
Second, C++ can do static initialization at runtime, before main executes, if necessary.

Since C++ can carry out static initialization before main executes, it can lead to an extremely hard-to-detect problem known as the static initialization order fiasco. It is also not clear from the declaration whether a variable is initialized at compile time or at runtime. C++20 addresses the latter with constinit, which makes sure that a static variable is initialized at compile time. That being said, there is still no general solution for the static initialization order fiasco in C++.

#include <memory>

struct Test {
   // std::unique_ptr is a smart pointer similar to Box in Rust.
   static std::unique_ptr<ComplexType> st_ptr;
};

// std::make_unique is similar to Box::new().
// This runs before main to initialize the static st_ptr.
std::unique_ptr<ComplexType> Test::st_ptr = std::make_unique<ComplexType>();

In C++, local static variables (static variables declared inside a function, whose value persists across function calls) are initialized by the first call to the function, so the compiler has to implicitly guard the initialization with a lock. This avoids race conditions that might otherwise occur when two or more threads try to initialize the same local static variable.

auto some_function() -> ComplexType {
   // First call to some_function initializes ct.
   // Other calls will share the same ct initialized by the first call.
   // Compiler adds lock guard to avoid any race conditions.
   static ComplexType ct = ComplexType();

   return ct;
}

Rust avoids these issues by making mutable static variables unsafe and, at the same time, allowing static variables to be initialized only with state that is known at compile time.

struct SomeStruct { a: i32 }

fn some_function() -> &'static SomeStruct {
   // ST is initialized at compile time (stored in the data/bss section).
   // All calls share the same ST.
   static ST: SomeStruct = SomeStruct { a: 0 };

   return &ST;
}

Hence, when it comes to static variables, Rust has fairly good reasons to restrict how a static variable can be initialized. However, we can easily work around these restrictions and store pretty much anything in a static variable safely with the help of a lock and a proper abstraction.

Better Example

As promised, here is a better example of a static variable that stores a value which can only be known at runtime. Try it live on godbolt.

use std::sync::Once;
use std::cell::Cell;
use std::hint::unreachable_unchecked;

struct Test {
  pub a : Box<i32>,
}

fn get_static() -> &'static Test {

   // struct that stores our data + a lock guard
   struct Stt {
      data: Cell<Option<Test>>,
      once: Once // lock guard to make sure static is set only once
   }

   // static variable type must have the Sync trait bound.
   // and we also make sure that Stt can only be accessed in a thread safe manner.
   unsafe impl Sync for Stt {}

   // static variable
   static A: Stt = Stt{data: Cell::new(None), once: Once::new() };

   // lock, call_once makes sure that the block is executed only once
   A.once.call_once(|| {

      // init static with a state at runtime - Heap allocation
      A.data.set(Some(Test{a: Box::new(5)}));
   });

   // get reference, dereferencing a raw pointer is unsafe
   let v = unsafe { match *A.data.as_ptr() {
      Some(ref a) => a,
      None => {
         // unreachable code, we are sure that data is never None
         unreachable_unchecked();
      }
   }};

   return v;
}

pub fn main() {
   let a = get_static(); // reference to static
   let b = get_static(); // another reference to static
}

In this example, we are using Cell instead of a mutable static in order to update the state of the static variable once at runtime (on the first function call). This is much safer than the mutable static approach. We are also using a lock guard to avoid any race conditions.

Also, because of Cell (Cell is not a thread-safe type), Rust doesn’t automatically implement the Sync trait for our type Stt. We have to implement Sync manually and make sure that Stt can only be accessed in a thread-safe manner.

As I mentioned earlier, you should use lazy_static. Under the hood, behind all its macro magic, lazy_static also uses a similar approach.
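If you would rather stay within the standard library, std::sync::OnceLock (available since Rust 1.70) packages the same once-plus-cell idea without any unsafe code. The sketch below is an alternative to the hand-written version above, not a description of how lazy_static is implemented; it reuses the Test struct and get_static name from the example.

```rust
use std::sync::OnceLock;

struct Test {
    pub a: Box<i32>,
}

fn get_static() -> &'static Test {
    // OnceLock::new is a const fn, so the static starts out empty
    // at compile time.
    static A: OnceLock<Test> = OnceLock::new();

    // The closure runs at most once, even if several threads race
    // to call get_static at the same time.
    A.get_or_init(|| Test { a: Box::new(5) })
}
```

Every call to get_static() returns a reference to the same Test, and no manual Sync implementation is needed because OnceLock<Test> is already Sync.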

Conclusion

Static variables in Rust are quite different from those in programming languages such as C++, because they can be used in a much safer way. At first, it may seem like Rust’s static variables are somewhat limited, but with the help of a library like lazy_static, we can use statics safely and effectively.

This is my first blog post, so I would love to receive some feedback. You can reach me at