Optimizing Kobold for Wasm size and performance

Before I get on with this, two announcements:

I suppose I'm not the only person to often be disappointed when I see how large the Wasm files of my various Rust to WebAssembly experiments are. Demonstrating that we can have really good runtime performance in a tiny Wasm binary is one of the major factors that motivates me to work on Kobold.

Since the announcement couple weeks back I've been adding some features and optimizing things to make sure the foundations of the framework are rock-solid. The 0.5 build of the TodoMVC example featured a gzipped Wasm file of just 18.08kB, already by far the smallest in the Rust ecosystem.

The latest release of 0.8 gets the gzipped Wasm file size down to just 15.86kB. You can go lower than that if you wish. Swapping the allocator to rlsf pushes the size all the way to 12.47kB.

A good way to optimize code not just for binary size but also runtime performance is finding avoidable allocations and getting rid of them. This latest release takes this approach to the extreme.

Step 1: allocation-free event listeners

It is common for a single component to declare multiple DOM event listeners. In the aforementioned TodoMVC every entry needs at least 3:

For uninitiated in DOM these are added to the elements by (usually) passing a callback function to the addEventListener method.

In Kobold the 3 listeners I described above are just Rust closures:

bind! {
    state:

    let onchange = move |_| state.toggle(idx);
    let edit = move |_| state.edit_entry(idx);
    let remove = move |_| state.remove(idx);
}

Using wasm-bindgen there are two ways of passing Rust closures to JavaScript:

  1. Short-lived closures can be passed as &dyn _.
  2. Long-lived closures need to be boxed and turned into a Closure<dyn _>.

The short-lived closures are valid only within immediate execution block of the function we pass them into. The event listener callback however can be called at any time for as long as the element is alive, so we can't use the first option.

This leaves us with the Closure type. The way Closure is implemented in wasm-bindgen is pretty impressive. Using Closure requires a Box allocation to guarantee pointer stability when calling into this Rust closure from JavaScript.

All of our 3 closures have a size of 8 bytes in wasm32: 4 bytes for the pointer to the state, and 4 bytes for the idx (usize). Since these are all small allocations the default allocator is pretty good at dealing with them. That said, and I keep repeating myself here, doing something very fast is still slower than doing nothing.

⚠️ DIY unsafe closures in Wasm

To have long-lived un-boxed closures we need a different way. Using only integers as data we can actually describe a closure as two values:

  1. A pointer to the closure itself, or more precisely a pointer to the data captured by the closure.
  2. A function pointer that knows the concrete type of the closure and can execute it.

Since the event listener always takes only one argument of a known type this is actually not that difficult.

First we need a JavaScript factory function for our listeners:

export function makeEventHandler(closure, vcall) {
    // Return a new JS function. Since all JS functions are closures
    // this will capture the `closure` and `vcall` values.
    return event => koboldCallback(event, closure, vcall);
}

We'll get back to koboldCallback in a moment, bear with me.

We import this function in Rust:

#[wasm_bindgen(js_name = "makeEventHandler")]
pub(crate) fn make_event_handler(closure: *mut (), vcall: usize) -> JsValue;

The vcall here is going to be our function pointer. Since fn pointers are not supported by wasm-bindgen ABI, we'll just have to cast it to usize. I'm not sure if fn pointers have any provenance that must be retained, but I'm also not sure we can avoid erasing it here since these need to be handled as JavaScript numbers anyway.

Now comes the fun part. Rust functions can't capture generics from outside of their scope, but we can still create the fn pointer by using a zero-sized closure:

impl<F, E> ListenerHandle for ListenerProduct<F, E>
where
    F: FnMut(E) + 'static,
    E: EventCast,
{
    fn js(&mut self) -> JsValue {
        let vcall: fn(E, *mut ()) = |e, ptr| unsafe {
            // `vcall` will know which `F` to cast the pointer into
            (*(ptr as *mut F))(e)
        };

        make_event_handler(
            // Erase the type from the pointer to closure
            (&mut self.closure) as *mut F as *mut (),
            // Cast `vcall` into `usize` to pass it to JS
            vcall as usize,
        )
    }
}

I'm not sorry.

Both the *mut () pointer to the original closure F and the vcall will be passed to JavaScript as plain 32-bit integers in the makeEventHandler defined above. Yes, JavaScript technically doesn't have an integer type, except it kind of does.

The final piece of the puzzle is the koboldCallback function that the JavaScript handler invokes. This itself is a short Rust function:

#[wasm_bindgen(js_name = "koboldCallback")]
pub fn kobold_callback(event: web_sys::Event, closure: *mut (), vcall: usize) {
    let vcall: fn(web_sys::Event, *mut ()) = unsafe { std::mem::transmute(vcall) };

    vcall(event, closure);
}

That's it! The chain of events (pun not intended) is:

  1. Our JavaScript listener function is called.
  2. It calls the koboldCallback with two integers that are Rust pointers.
  3. vcall is transmuted back into an fn pointer and called with the pointer to the closure.

There is just one, teeny-tiny, really insignificant problem:

That pointer to the closure is not stable making it all Undefined Behavior. 🪦

Step 0: building views in-place

As a reminder, Kobold is building the DOM using the View trait with two required methods:

pub trait View {
    type Product: Mountable;

    fn build(self) -> Self::Product;

    fn update(self, p: &mut Self::Product);
}

Event listeners use a Listener trait that's almost identical.

The problem is that the &mut self.closure reference is being cast into *mut () raw pointer in the build part of the parent View. Since the product of the view (which includes all the ListenerProducts) is returned at the end of the build that raw pointer will immediately become invalid. Situations like these is why Closure requires a Box.

There is an alternative though.

We could refactor everything and build all products (both View and Listener) in pre-allocated uninitialized memory with a guaranteed pointer stability, a pattern known as build-in-place or initialize-in-place.

One crude way to do it would be to use MaybeUninit:

fn build(self, p: &mut MaybeUninit<Self::Product>);

Problem here is the caller has no guarantee that we're actually going to initialize the memory.

We could try to force the use of the write method by expecting a return value of &mut Self::Product:

// trait signature
fn build(self, p: &mut MaybeUninit<Self::Product>) -> &mut Self::Product;

// mocked implementation
fn build(self, p: &mut MaybeUninit<u32>) -> &mut u32 {
    p.write(42)
}

This is better, however there is still no guarantee that memory will be initialized since write is not the only way to get a mutable reference. This code is also safe:

fn build(self, p: &mut MaybeUninit<u32>) -> &mut u32 {
    Box::leak(Box::new(42))
}

In & Out

To enforce soundness via the type system I've defined two new types:

/// Uninitialized stable pointer to `T`.
#[must_use]
#[repr(transparent)]
pub struct In<'a, T>(&'a mut MaybeUninit<T>);

/// Initialized stable pointer to `T`.
#[repr(transparent)]
pub struct Out<'a, T>(&'a mut T);

The names aren't exactly descriptive of what these pointers are, but they do immediately describe how to use them. I also really like them because they are short.

In action:

// signature
fn build(self, p: In<Self::Product>) -> Out<Self::Product>;

// mock implementation
fn build(self, p: In<u32>) -> Out<u32> {
    p.put(42)
}

The only safe way to obtain an Out pointer is by calling the put method which consumes the In pointer.

One In goes in, one Out goes out.

You could still do something weird like nesting calls and return the wrong Out for the wrong In, but that would be merely incorrect, not Undefined Behavior.

With that in place and a week of refactoring later the allocation-free event listeners are done.

Step 2: flatten allocations in lists

I wish that was the end of the journey, but alas, it is not. Requiring that all the view products have stable pointers meant that I had to switch how list products are stored from a Vec<T> to a Vec<Box<T>>. All things considered that was still a major win, after all 1 allocation is better than 3 (or 6), but you know what they say:

Good is the enemy of perfect.

Or something like that. 😉

One simple improvement here would be to get rid of the Vec and just make a simple one-directional linked list. If you have to separately allocate all the view products anyway, might as well use those allocations as nodes, right?

Sure, but we can do better. I ended up coding about 5 different prototypes for storage and eventually settled on a linked list variant that can store a variable number of elements per node:

#[repr(C)]
struct Node<T> {
    /// Pointer to the next `Node`.
    next: Option<NonNull<Node<T>>>,

    /// Number of initialized elements in `data`.
    len: usize,

    /// All the elements of the `Node`.
    data: [MaybeUninit<T>],
}

The data field makes this is an usized type that needs to be manually allocated, but otherwise this is a relatively simple data structure. Miri doesn't seem to complain about all the unsafe code, so I won't either.

The actual collection type is simply:

pub struct PageList<T> {
    /// First `Node` in the list
    first: Option<NonNull<Node<T>>>,
}

This ends up working really well and it checks multiple boxes for my very narrow purpose:

  1. Wasm file is smaller when using PageList<T> than it was with Vec<Box<T>>.
  2. There are no re-allocations so all references to T are stable pointers.
  3. Creating a list from an exact-size iterator requires just one allocation for all elements.
  4. Growing the list by one element at a time is amortized by Node having a minimum allocation size depending on the exact size of T.

That's it!

Kobold can now render a list of any number of elements, each with multiple event listeners, using just one allocation in the Rust code. All of this is just changing the internal nuts and bolts, the actual code of the TodoMVC has not changed.

I got my performance cake and ate the size too.