7.3 - Serializations

The logical pages are serialized before being stored on physical pages. This is a process that should obviously be reversible. W will describe in this chapter how we serialize Leaf and Node.

Leaf serialization

A Leaf contains some medata data, and a set of keys and values. A key can have many values, and we have as many keys as values. That being said, let's describe the internal format of a serialized leaf :

    +----------------------------------+
    | revision (long)                  | 8 bytes
    +----------------------------------+
    | nbElems (int)                    | 4 bytes
    +----------------------------------+
    | dataSize (int)                   | 4 bytes = sum[0..nbElems]( valueLength, keyLength ) + 8 x nbElems
    +----------------------------------+
    | +------------------------------+ |
    | | valueLength[0] (int)         | | 4 bytes \
    | +------------------------------+ |           > n+4 bytes
    | | value[0] (byte[])            | | n bytes /
    | +------------------------------+ |
    | | keyLength[0] (int)           | | 4 bytes \
    | +------------------------------+ |           > n+4 bytes
    | | key[0] (byte[])              | | n bytes /
    | +------------------------------+ |
    ...                              ...
    | +------------------------------+ |
    | | valueLength[nbElems-1] (int) | | 4 bytes \
    | +------------------------------+ |           > n+4 bytes
    | | value[nbElems-1] (byte[])    | | n bytes /
    | +------------------------------+ |
    | | keyLength[nbElems-1] (int)   | | 4 bytes \
    | +------------------------------+ |           > n+4 bytes
    | | key[nbElems-1] (byte[])      | | n bytes /
    | +------------------------------+ |
    +----------------------------------+

We keep the length of each serialized key and value just because we need to provide a complete byte[] to the key or value deserializer (ie, the exact number of bytes needed to be able to deserialize the key or the value).

The dataSize value is used so that we can know how many bytes we will have to read - thus the number of physical pages to read - in order to get a full page.

Leaf deserialization

On the other way, when we deserialize a leaf, we won't deserialize the keys and the values (it would be too expensive, if the leaf is discarded from memory immediately, when we only need to read one single key and value). We thus keep the byte[] form of each keys and each values, which will be deserialized on demand.

We use two data structures to store a key and a value : a KeyHolder for the key a ValueHolder for the value

KeyHolder

The KeyHolder data structure holds the key in two ways : serialized (raw) deserialized (key)

When we just read the data from disk, we don't deserialize the keys. We do that when needed.

Here is a description of this class :

public class KeyHolder
{
    /** The deserialized key */
    private K key;

    /** The ByteBuffer storing the key */
    private byte[] raw;

    /** The Key serializer */
    private ElementSerializer keySerializer;
}

KeyHolder operations

Here is a description of the available methods in the KeyHolder class.

Constructors

We have two constructors for this class, one which takes a deserialized key, the other which takes a byte[].

  • KeyHolder( ElementSerializer keySerializer, K key )

Here, we need to serailize the key immediately, as we may have to flush the key to the disk. We then serialize the Key immediately and store the resulting byte[] into the raw field.

  • KeyHolder( ElementSerializer keySerializer, byte[] raw )

Here, we just get the serialized form. We don't need to deserialize it, as the key might not be used anytime soon. We thus just update the raw field, and the key field remains null.

getKey()

This method retuns the deserialized key. If it does not exist, then we deserialize it on the fly using the raw field.

setKey()

This method set the key. We immediately serialize it, and store the results in the raw field.

getRaw()

Returns the raw field. This method is only visible from the classes in the same package.

ValueHolder

The ValueHolder data structure will store the list of values associated with a key. As we may have more than one value, we use an internal structure for that purpose. This is a complex data structure, if we compare it with the KeyHolder one.

In some case, the number of values to store is really big, this we need to use an internal data structure that allows a quick retrieval of a value, plus we need to be able to copy a page containing such a value in an efficient way. For these reasons, we use two different internal data structures : an array up to a threshold a sub-BTree above this threshold

When we reach the threshold, the array is transformed into a BTree, and the way back if we get below this number. In order to avoid many array <-> btree transformations if we continusously add and delete a value, the array -> btree threshold is bigger than the btree -> array threshold.

   0---1---2---...---TH-low--...--TH-high---...
   >-------------Array----------->>---BTree---... When we add new values.
                     |////////////|               These values will remain in an array or a BTree until
                                                  we reach oe of the threshold values.
   <-----Array-----<<--------BTree------------... When we delete values.

It's important to know that the sub-BTree will hold only keys, and no values. The sub-btree Keys will be the values we have to store.

Here is the description of this class :

public class ValueHolder implements Cloneable
{
    /** The deserialized value */
    private V[] valueArray;

    /** The BTree storing multiple value, if we have moe than a threashold values */
    private BTree valueBtree;

    /** The serialized value */
    private byte[] raw;

    /** A flag set to true if the values are stored in a BTree */
    private boolean isSubBtree = false;

    /** The RecordManager */
    private BTree btree;

    /** The Value serializer */
    private ElementSerializer valueSerializer;

    /** An internal flag used when the values are not yet deserialized */
    private boolean isRaw = true;
}

As we can see, we use two different fields to store the data, either the valueArray field or the valueBtree field. The raw field contains the serialized values when the values are stored in an array; It's null either if the values are stored in a BTree, or if we already have deserialized the values.

Raw/deserialized values

One key for obtaining good performances is to avoid any useless deserialization. This is easy to implement for the KeyHolder, as we only store one single key. For values, it's slightly more complex, as we may have more than one value. The following rules should be followed :

  • don't deserialized until necessary (ie, when one need to get one value)
  • don't serialize when unnedded (ie until the values must be written back to disk)

In fact, we may be in three different states :

  • all the values are serialized (when we just read the page from disk)
  • non of the values are serialized (when we just created a ValueHolder with new values)
  • somewhere in the middle, when we are modifying a ValueHolder which has been read from the disk

The third case is the complex one. We should consider two different cases though : the values are stored in a sub BTree : we don't have to deal with this problem, it's up to the sub-btree to deal with it the values are stored in an array : we don't want to store half of the values as byte[], and half of the values as java instances. We must pick either one form or the other. In this case, as soon as we have to manipulate values in Java, then we need to deserialize all the values.

ValueHolder operations

The possible operations on a ValueHolder are the following :

  • add( value ) : Insert a new value into the ValueHolder. If we reach the upper threshold, then the array is converted into a BTree. In any case, we inject the new value into the array or the BTree so that we keep all the value ordered (the ValueSerializer must have a Comparator).

As we need to compare values, they must be deserialised, so we need to do it if it's not already done (the values are not deserialiezed when the page is read from the disk). Note that it's not necessary for the sub BTree, as it's up to the sub-btree to deserialize the keys on the fly

The add algorithm will thus be :

  if the values are not yet deserialized
    then deserialize all the values