7.3 - Serializations

The logical pages are serialized before storing in physical pages. This is a process that should obviously be reversible. In this chapter, we will describe how to serialize a Leaf and Node.

Leaf serialization

A Leaf contains some metadata, and a set of keys and values. A key can have many values, and we have as many keys as values. The internal format of a serialized leaf contains:

    | revision (long)                  | 8 bytes
    | nbElems (int)                    | 4 bytes
    | dataSize (int)                   | 4 bytes = sum[0..nbElems]( valueLength, keyLength ) + 8 x nbElems
    | +------------------------------+ |
    | | valueLength[0] (int)         | | 4 bytes \
    | +------------------------------+ |           > n+4 bytes
    | | value[0] (byte[])            | | n bytes /
    | +------------------------------+ |
    | | keyLength[0] (int)           | | 4 bytes \
    | +------------------------------+ |           > n+4 bytes
    | | key[0] (byte[])              | | n bytes /
    | +------------------------------+ |
    ...                              ...
    | +------------------------------+ |
    | | valueLength[nbElems-1] (int) | | 4 bytes \
    | +------------------------------+ |           > n+4 bytes
    | | value[nbElems-1] (byte[])    | | n bytes /
    | +------------------------------+ |
    | | keyLength[nbElems-1] (int)   | | 4 bytes \
    | +------------------------------+ |           > n+4 bytes
    | | key[nbElems-1] (byte[])      | | n bytes /
    | +------------------------------+ |

The length of each serialized key and value is stored so that a complete byte[] can be passed to the key or value deserializer (ie, the exact number of bytes needed to be able to deserialize the key or the value).

The dataSize value is used to know how many bytes needs to be read - thus the number of physical pages to read - in order to get a full page.

Leaf deserialization

When a leaf needs to be deserialized, not all the keys and the values are deserialized at once. It would be too expensive, if the leaf is discarded from memory immediately, when we only need to read one single key and value, hence the keys and values are kept in byte[] form, and will be deserialized on demand.

Two data structures are used to store a key and a value : a KeyHolder for the key a ValueHolder for the value


The KeyHolder data structure holds the key in two ways : serialized (raw (byte[])) deserialized (key)

The keys are not deserialized immediately after the data was read from disk instead they are deserialized on demand.


Mavibot supports multiple values for a key. The ValueHolder data structure will store a set of values associated with a key. This is a complex data structure when compared with the KeyHolder.

An array is used as the default container to hold these values. In some cases, the number of values to be stored is really big, and using an array will impact the performance. In such cases the array is replaced by a BTree, this helps in improving the retrieval time, and also page copying becomes more efficient.

When the number of the values stored reaches a threshold, the array is replaced with a BTree, likewise if the number of values stored are is below the threshold then BTree is replaced by an array. In order to avoid many array <-> btree transformations (e.g. when continusously adding and deleting a value), the array -> btree threshold is bigger than the btree -> array threshold.

   >-------------Array----------->>---BTree---... When we add new values.
                     |////////////|               These values will remain in an array or a BTree until
                                                  we reach oe of the threshold values.
   <-----Array-----<<--------BTree------------... When we delete values.

It's important to note that the values inserted into the sub-BTree will be stored as keys, and all the values of this sub-BTree will contain nulls. The sub-btree Keys will be the values of the key present in the parent-BTree.

Raw/deserialized values

One key for obtaining good performances is to avoid any useless deserialization. This is easy to implement for the KeyHolder, as we only store one single key. For values, it's slightly more complex, as we may have more than one value. The following rules should be followed :

  • don't deserialized until necessary (ie, until one needs to get a value)
  • don't serialize until writing (i.e, until the value(s) must be written to disk)

In fact, a value may be in three different states :

  • all the values are serialized (when we just read the page from disk)
  • none of the values are serialized (when we just created a ValueHolder with new values)
  • somewhere in the middle, when we are modifying a ValueHolder which has been read from the disk

The third case is the complex one. We should consider two different cases though : the values are stored in a sub-BTree : we don't have to deal with this problem, it's up to the sub-btree to deal with it the values are stored in an array : we don't want to store half of the values as byte[], and half of the values as java instances. We must pick either one form or the other. In this case, as soon as we have to manipulate values in Java, we must deserialize all the values.

ValueHolder operations

The possible operations on a ValueHolder are the following :

  • add( value ) : Insert a new value into the ValueHolder. If we reach the upper threshold, then the array is converted into a BTree. In any case, we inject the new value into the array(in the correct order using the comparator present in the serializer) or the BTree.

As we need to compare values, they must be deserialised, so we need to do it if it's not already done (the values are not deserialiezed when the page is read from the disk). Note that it's not necessary for the sub-btree, as it's up to the sub-btree to deserialize the keys on the fly.

Thus the add algorithm will be :

  if the values are not yet deserialized
    then deserialize all the values