Edit this page

This page does not reflect the current implementation, nor the targeted implementation of the backed. It just contains some ideas to share them with the other developpers. By no mean it should be consider as valid !

Work in progress

This site is in the process of being reviewed and updated.

This page contains all the information relative to the backend improvment we want to add in ADS 2.0

Database structure

The first thing is to define the overall structure of the database, whatever kind of physical implementation we will choose.

Data 

We are manipulating those kind of data :

  • Partition : They are complete trees, and they should not be considered as data per se. A partition is a root from which we store entries
  • DN : this is the key of  all entries. The Distinguish Name identify uniquely all the elements.
  • Entry : an entry store all the attributes associated to a element named by a DN.
  • Alias : An alias is a refernece to an existing entry
  • Index : an index is a ordered relation built on an attribute. It allows a faster lookup of data

Partition

A partition is the root of a tree. It stores entries. Considering the DN uid=user.1, ou=users, dc=example, dc=com, the partition will be dc=example, dc=com. One can parse the tree by going down level by level, but the dc=com level is not accessible.

The DN is seen as :
[uid=user.1] : 3rd level, [ou=users] : 2nd level, [dc=example, dc=com] first level

A Ldap server can manage N partitions, and each partition can be managed in a specific database. We can imagine that the ou=system is managed by a in-memory database, backed by a b-tree on a file, dc=example, dc=com partition use JDBM and that dc=company, dc=com partition is stored on Oracle.

A partition contains a few data :

  • An ID, which is an integer, uniquely identifying the partition internally
  • A external name, which is used by managing tools
  • A suffix which is a LdapDN object (dc=example, dc=com, for instance)
  • A type, which identify the kind of backend it uses
  • A cacheSize used to set a specific amount of memory to be used for this partition cache
  • And some specific data associated with the backend, like indices, parameters (syncOnWrite, OptimizerEnabled, etc).

We can imagine more informations, like replication information, transaction support, etc.

Lookup in partition table must be fast. We use the normalized form of the DN to find the partition ID in a hashmap. We also store the partition table into the configuration file (server.xml)

To be clear, Partition is a very specific beast.

 The following page go farther into Partition structuration : Partitions

DN

The Distinguished Name is the key used to describe an entry. Each entry has one and only one DN. A DN points to one and only one entry. A DN has at least three representations :

  1. Internal structure : a DN is a list of RDN which is a list of AttributeTypeAndValue. This is used internally to the server, and never exposed to the client
  2. Normalized form (normDN) : it's a String which represent the DN after having normalized all the RDN, attribyteType and attributeValue. We used the specific rules definied for each attribute to normalize the DN, and spaces are removed. (see RFC 2253 and RFC 2252)
  3. User Provided form (upDN) : this is the form which is used by a client, before normalization.

It could be interesting to keep a cache which will store a relation between the upDN and the normDN. This is actually the case, but we should question this option : to avoid a normalization, we will have to do a lookup into a hashMap which will keep the LdapDN. It will be very interesting to keep in this cache some DN which are really used widely (like partion's DN), but storing all the DN may be overdone : the cost of managing a synchronize LRUtable, plus the memory dedicated to this table, may be higher than a parsing of this DN and a simple lookup. (Note that I'm not sure of what I'm currently writing. Some tests need to be done, on real world data)

DN will be stored into a specific table, and each DN will have a unique ID. DNs will be ordered using the normDN, and following the alphanumeric order. 

A stored DN contains those data :

  • ID, the unique numeric identifier for this DN. This is an integer
  • normDN, the normalized form of the DN
  • javaDN, which is the serialized form of the Java representation of the DN. It can be usefull if we want to modify the tree with a modifyDN operation, sparing a parsing.
  • the partitionID, so we will be able to directly point to it without having to do a lookup.
  • the entryId, which is the ID of the entry pointed by the DN.
  • the suffix, which will be used to avoid a lookup in the partition table to get it. (not sure this is necessary ...)

Entry

An entry stores all its attributes and values. It has an ID, and a DN. There is a 1-1 relation with a DN, but an entry can be referenced by more than one alias

The entry  is backed in java by the BasicAttributesclass, and we will store a serialized form of this class into a special table, which will contain all the entries of a partition (as a partition can be backed by differnent database system, that makes perfect sense).

A stored entry contains those data :

  • ID, the unique identifier of this entry
  • javaEntry, the serialized form of the entry instance. It will speed up the manipulation of this object
  • DNid, the reference to the DN associated with this entry
  • partitionID, the partition this entry is associated with

Alias

TODO : define the alias structure (is it needed ?) An alias is a reference to and entry, and is a DN

Indices 

Indices are used to speed up operations on tha base. We use system indixes, and user defined indices. We have two kind of indices :

  • Hashed indices
  • Tree indices

The first kind of indices are used when the ordering is not important, but when we need to find an object immediatly. A good example is the DN index : it is better to use an hashed index because we generally know the DN we are looking for.

The second kind of indices are used when the ordering is important. We use them for attributes, like name, givenname, etc.

As a rule of thumb, when a value is supposed to be unique (like for uid), an hashed index is good. Otherwise, we will use a tree index. The choice is given to the user.

Sometime, we will need to keep those two kind of indices on a value, for instance when we want a fast access, or an ordered index. Looking for a name must be fast if the user use the equals filter (like '(cn=apache)'), but we may need an ordered index on the same attribute if we use the substring filter : '(cn=apa*)'

An index contains the following informations :

  • value, the attribute value
  • ID, the unique entry which attribute's value is taken from

Here, we see that an ID cannot be unique, as a value can be found in many entries (think about ObjectClass attribute : all entries contains the top value). However, we want it to point to an unique entry, so that we don't have to store a list of entries associated with the value.

From a Database point of view, we have this relation between an entry and an attribute value :


This is a N-N relation, which means that an intermediate table is created to be able to keep the database to be in 3rd normal form :

For instance, here are the values and the entries we want to play with :
GivenName Attribute index :

givenanme Value ID
alex 1
emmanuel 2
ersin 3
stefan 4

Apache committers entries :

entry givenname ID
alex Karasulu alex 1
emmanuel Lécharny emmanuel 2
emmanuel Venisse emmanuel 3
ersin Er ersin 4
stefan Szörner stefan
5

Now, we see that if we want to search for an entry which givename is Emmanuel, then I have two candidates. In the third table, we will have those values :

Value ID
Entry ID
1 1
2 2
2
3
3 4
4 5

This is good if we use a relationnal database. In our case, we can't assume it will be the case. The actual solution stores a treeset of entries identifiants into the index table, like that :

Value entries ID
alex 1
emmanuel
treeset(2,3)
ersin 4
stefan 5

If we have thousands of entries associated with a value, we are having troubles (and this is the case for values like top, person, etc) : the treeset is getting huge, and as we have to put it in memory, we have a really bad impact on performance. It may even be totally impossible to have a working server if we want to store millions of entries. We have a workaround for this case : we create an intermediate Btree, which is like the N-N table added in a RDBMS

Now, let's imagine that this table is a b-tree (t's not complicated to imagine this, because it *is* a b-tree). We can find all the associated values simply if we change the key. IDs are the keys, and they are integers. We can combine the value's ID with the related entry's ID in a long. The attribute index table will then looks like this one (the valueID is the key):

If we have thousands of entries associated with a value, we are having troubles (and this is the case for values like top, person, etc) : the treeset is getting huge, and as we have to put it in memory, we have a really bad impact on performance. It may even be totally impossible to have a working server if we want to store millions of entries.

Now, let's imagine that this table is a b-tree (actually, it's not complicated to imagine, because it *is* a b-tree). We can find all the associated values simply if we change the key. IDs are the keys, and they are integers. We can combine the value's ID with the related entry's ID in a long. The attribute index table will then looks like this one (the valueID is the key):

ID
4 294 967 297 = 1 * 2^32 + 1
8 589 934 594 = 2 * 2^32 + 2
8 589 934 595 = 2 * 2^32 + 3
12 884 901 891 = 3 * 2^32 + 4
17 179 869 188 = 4 * 2^32 + 5

The trick is to associate the entry id with the value , but we take the value factor 2^32, in  order to avoid collisions. Looking for all entries associatied with the value emmanuel is just a question of looking for all the data which value is the interval [ 2 * (2^32), 3 * (2^32) -1] (here, 2 is the ID associated with emmanuel, and 3 is the ID associated with ersin ).

There are two drawbacks with this method :

  • we can't store more than MAX_INT values : more than 4 billion entries however. We can also create a special kay which is a <long x long> so that we have a higher limit, at the price of lower performances.
  • we have to know the integer associated whith each value. This is easy for ObjectClass, because we have a set of all the possible values, and we can store this association in the server, but for user values, we need an association table. At the end, this is exactly the same as storing a N-N table

Search filter indices 

We have  at least 6 kind of search filters :

filter operation
index
attribute >= value
ordered BTree(attribute)
attribute <= value ordered BTree(attribute)
attribute = value ordered BTree(attribute)
attribute =* (present)
ordered BTree(attribute), value is not stored
attribute = * any
ordered BTree(attribute)
attribute = any *
ordered BTree(attribute), value is stored mirrored
attribute = [init] any [final] full scan on master table, using the =* BTree(attribute)
extensible ???
hierarchy ordered BTree(DN)
~= (approx)
We don't implement this index

Some of the indices are mandatory :

  • Hierarchy index is used to limit the search to a specific scope (the search is applied starting at some point in the DN)
  • the Present  index does not store the values. We only need to keep a relation between an attribute type and all the entries which contain it. The key is just the entry IDs
  • the reverted index should be optionnal
  • the BTree(attribute) is also optionnal, but if declared, it should be used for all the operation. The key is a combinaison of known values which are combined with entry IDs

Search algorithm

An ldap server is all about searching. This should be the fastest operation, even if Adding data is slow. We should also favor some searches above other (approx filter is not essential)

Many factors should be considered, like :

  • caching data
  • disk speed
  • filter ordering
  • memory
  • data structure

 Of course, caching data allows faster search, because reading on the disk is just a lot more slower than reading in memory. But this is balanced by the limited size of the cache.

Usually, searches are done for a specific value in a specific branch, using a limited scope (One level). The scope helps a lot, because it split the data in two categories :

  1. data in the selected subtree
  2. data outside this subtree

As we have a hierarchy index, we will use it to limit the search to the selected scope. Here, we have two options :

  1. considering that the scope is seen as a big set of entries into this scope
  2. or considering that we have a set of hierarchically set of data, which should be filtered recursively

Using the first option seems appealing because we can work on a simple set, but there are many drawbacks :

  • the set must be build first
  • it can be huge, when only a few entries will be returned
  •  and to build this set, we must have a specific storage, like a BTree for each level, so data will be duplicated

The second option, even if it looks a little bit more complex, looks like a better one. The idea is to walk the tree recursively, handling chunks of data. As the result is build and return on the fly, this is a natural way to select matching entries.

Now, this is not the best approach, when the filter can select smallest results. For instance, sarching for all entries which attribute CN equals "administrator" might return only one element, and using the indexed attribute 'CN' will just limit the numbe rof element to test to this single entry.

Basically, the idea is to select the smallest possible set of result given by all the filters. Suppose we have this filter :

(&(ObjectClass=person)(cn=test)) applied to the dc=example,dc=com with a SUBTREE scope

then looking first for all the entries containing cn=test will just return a few of them, when the first filter, ObjectClass=person might returns thousand of entries.

We must add some counters to each kind of indexed attributes so that we can quickly know which is the smallest set of selected data.

For OR filters, this is not true : each filter must be analyzed.

NOT filters is just a special case : we must transfer the NOT operation down the tree to the attributes. For instance, for the (! ( ( a | b ) & ( c | d ) ) ) filter, where a, b, c, d are filter operations, we will translate it to ( ( ! ( a | b )  ) |  ( ! ( c | d ) ) ) first, and then to  ( ( !a & !b ) | ( !c & !d ) ), so that the negation is applied directly to leaves, not to the result of a combined selection.