Sunday 14 March 2010

Tree structures and DAGs in SQL with efficient querying using transitive closures

This post
is about

SQL

Motivation

It’s a familiar situation to anyone who has tried representing tree structures (for example, a folder hierarchy) in an SQL database. You might have a table that tells you the parent-child relationships:

CREATE TABLE EntityParents (
    EntityID nvarchar(64) NOT NULL,
    ParentEntityID nvarchar(64) NOT NULL
)

But now how do you retrieve a list of not only an entity’s immediate children, but their children too, all the way down — i.e. a list of all descendants of an entity? You might need to do that to prevent the user from creating cycles in the structure.

A common strategy is to retrieve all the immediate children first, then all their children in another query, and to keep going until no more descendants are returned. Unfortunately, the number of queries increases with the log of the size of the tree, and even worse, the size of each query increases linearly. It’s a time-bomb waiting to go off when your tree becomes so large that the RDBMS begins to refuse processing queries with such a long list of constants in it.

In this blog entry, I will consider a more general case than tree structures, namely directed acyclic graphs (DAGs). This means every entity can have multiple parents, not just one. We only need to keep one restriction: no cycles. Hence acyclic. All trees are also DAGs, so everything works exactly the same for tree structures. My examples will also assume that you have a parent-child table just like the one above, which you will need for DAGs; for tree structures, you don’t need it (you can just have a ParentEntityID column in the entity's primary table), but you can easily adapt the examples to work accordingly.

Introducing transitive closures

The solution involves an extra table, whose contents are strictly speaking redundant (because they can theoretically be derived from the parent-child table), but which will enable very efficient queries of the kind I mentioned earlier. This extra table will contain the transitive closure of the parent-child relationship. Sounds complicated? It’s actually very simple: The transitive-closure table contains a row for every pair of entities that are ancestor and descendant of each other. In other words, if [EntityA, EntityB] is a row in your table, it means that EntityB is either a child of EntityA, or a grandchild, or a great-grandchild, etc. Conversely, it also means that EntityA is either a parent of EntityB, or a grandparent, etc.

To make querying this table even simpler, we additionally throw in all pairs of the same entity. In order words, [EntityA, EntityA] will also be a row in the table; so if [EntityA, EntityB] is a row, it could be that EntityA and EntityB are the same entity. Mathematically speaking, that makes the table the transitive-reflexive closure of the parent-child relationship, but I will sacrifice total accuracy for brevity and call them just transitive closures.

Finally, I’ll also have a Distance column which will tell us how far away two entities are in the hierarchy, i.e. how many generations there are between them:

CREATE TABLE EntityTC (
    Ancestor nvarchar(64) NOT NULL,
    Descendant nvarchar(64) NOT NULL,
    Distance int NOT NULL
)

To illustrate all this with a simple example, let’s consider a situation with four entities whose parent-child relationships are shown below. In that scenario, the transitive closure would look like this:

Notice the last entry, [Cedric, Adrian, 2], which tells us that there is a link from Cedric to Adrian via something else (Lilo in this example).

Keeping the transitive closure up to date efficiently

Generating such a transitive-closure table from the parent-child relationship table can take a while if the structure is large. Therefore, we cannot afford to re-generate it every time anything changes. We need to keep it up to date incrementally. It turns out that it’s actually possible to do that pretty efficiently. We’ll consider four cases and see how we can update the transitive closure in no time:

  1. when a new entity is created, initially with no parents or children;
  2. when an entity with no parents or children is deleted;
  3. when a parent-child link is added;
  4. when a parent-child link is removed.

Let’s go through these in order.

1. when a new entity is created

This case is trivial. If the entity has no parents or children, then it has no ancestors or descendents either. We only need to insert the one row into the transitive-closure table that actually makes it the transitive-reflexive closure, remember?

INSERT INTO EntityTC (Ancestor, Descendant, Distance)
    VALUES (<ID>, <ID>, 0)

Of course, you will have to replace “<ID>” with the ID of the new entity, properly SQL-escaped if it’s a string (wouldn’t want your software to end up having SQL injection vulnerabilities).

2. when an entity is deleted

If the entity has no parents or children, this case is also trivial. The reflexive row is the only one left, so just delete it.

If the entity has parents, but no children, it is similarly easy. You can safely delete all rows in the transitive closure where the descendant is the entity. Since the entity has no children, it provides no links to any other entities, so all the distance values remain correct.

If the entity does have parents and/or children, the case is not simple. If you want to maintain the distance column, you need to delete each of those parent/child relationships individually and run step #4 (below) for each. But if you don’t use the distance column, you can make life easier for yourself with the second option: connect the parents and children up — i.e., add a [P, C] row to EntityParents for every direct parent P and every direct child C of the deleted entity:

INSERT INTO EntityParents (EntityID, ParentEntityID)
    SELECT Children.EntityID, Parents.ParentEntityID
    FROM EntityParents Children, EntityParents Parents
    WHERE Children.ParentEntityID=<ID> AND Parents.EntityID=<ID>

The idea here is that we maintain all the links from all the ancestors to all the descendants, so we don’t need to touch them in the transitive-closure table. We only need to delete all references to the deleted entity:

DELETE FROM EntityTC WHERE Ancestor=<ID> OR Descendant=<ID>

3. when a parent-child link is added

At this point I want to start introducing some notation. I’m going to refer to specific entities by letters from now on. Additionally, I’ll use the following symbols:

  • X → Y refers to a link from entity X to entity Y. In other words, it means X is an ancestor of Y, and Y is a descendant of X.
  • dist(X → Y) refers to the distance of that link. If X and Y are the same entity, the distance is 0. If they are parent and child, it is 1, etc.
  • An(X) refers to the set of ancestors of any particular entity X, including X itself.
  • De(X) refers to the set of descendants of any particular entity X, including X itself.

Now let’s think about this case carefully. Let’s call the parent entity P and the child entity C. There are only two effects on the transitive closure that the new P → C link can have:

  • The new link could have provided a “faster” route from some ancestor X to some descendent Y that already has a “longer” path. For those existing links, we might need to update the Distance column. Any such “faster” route would have to be X → P → C → Y, so we’ll just calculate the distance of that and update it in the table if it is smaller than whatever’s already there. Calculating dist(X → P → C → Y) is easy: we know that dist(X → P) and dist(C → Y) are already correct in the table, and dist(P → C) is 1. Thus:
    UPDATE EntityTC SET Distance = (
        SELECT XtoP.Distance + CtoY.Distance + 1 FROM EntityTC XtoP, EntityTC CtoY WHERE
        XtoP.Ancestor=EntityTC.Ancestor AND XtoP.Descendant=<ParentID> AND
        CtoY.Ancestor=<ChildID> AND CtoY.Descendant=EntityTC.Descendant
    ) WHERE EXISTS (
        SELECT * FROM EntityTC XtoP, EntityTC CtoY WHERE
        XtoP.Ancestor=EntityTC.Ancestor AND XtoP.Descendant=<ParentID> AND
        CtoY.Ancestor=<ChildID> AND CtoY.Descendant=EntityTC.Descendant AND
        EntityTC.Distance > XtoP.Distance + CtoY.Distance + 1
    )
  • The new link will have created a new connection from all X ∈ An(P) (including P itself) to all Y ∈ De(C) (including C). Simply add the ones that aren’t already there. Calculating the distance is again easy: the formula for dist(X → P → C → Y) is exactly the same as before, and we know that it’s correct because there is no other link that could be shorter. Thus:
    INSERT INTO EntityTC (Ancestor, Descendant, Distance)
        SELECT XtoP.Ancestor, CtoY.Descendant, XtoP.Distance + CtoY.Distance + 1
        FROM EntityTC XtoP, EntityTC CtoY
        WHERE XtoP.Descendant=<ParentID> AND CtoY.Ancestor=<ChildID>
        AND NOT EXISTS (
            SELECT * FROM EntityTC test
            WHERE test.Ancestor=XtoP.Ancestor AND test.Descendant=CtoY.Descendant
        )

4. when a parent-child link is removed

Once again, let P be the parent entity, C be the child. In order to update the transitive closure after a parent-child link has been deleted, we observe the following invariants:

  1. The only links that could potentially be affected are links X → Y for which X ∈ An(P) and Y ∈ De(C).
  2. Since the graph is acyclic, An(P) and De(C) are necessarily disjoint, so we know that Y ∉ An(P).
  3. We can trust that all the links that we know are not affected (and their distance values) are correct. In particular, any link between two entities is fine if they are either both in An(P) or both outside An(P).
  4. For any affected link X → Y, it needs to be kept in the table if there is some other path between those entities that doesn't go through the P → C link.
  5. Any such other link X → Y must go from inside An(P) (where X is) to outside An(P) (where Y is) somewhere other than at P → C. (Equivalently, one could consider the boundary from outside De(C) to inside De(C).)
  6. Therefore, a link X → Y must be retained if and only if there exists a parent-child pair G → H such that G ∈ An(P), H ∉ An(P), and X → G and H → Y exist in the transitive closure. (Notice that it is possible that G = P or that H = C, but we must disallow for both to be the case simultaneously because we’re trying to remove the P → C link. Thus, remove [P, C] from EntityParents first.)

Therefore, any link X → Y that doesn’t satisfy the criterion in #6 needs to be deleted:

DELETE FROM EntityTC WHERE
    Ancestor IN (SELECT Ancestor FROM EntityTC WHERE Descendant=<ParentID>) AND
    Descendant IN (SELECT Descendant FROM EntityTC WHERE Ancestor=<ChildID>) AND
    NOT EXISTS (
        SELECT * FROM EntityTC XtoG, EntityTC HtoY WHERE
        XtoG.Ancestor=EntityTC.Ancestor AND
        XtoG.Descendant IN (SELECT Ancestor FROM EntityTC WHERE Descendant=<ParentID>) AND
        HtoY.Ancestor IN (
            SELECT EntityID FROM EntityParents
            WHERE ParentEntityID=XtoG.Descendant
        ) AND
        HtoY.Ancestor NOT IN (SELECT Ancestor FROM EntityTC WHERE Descendant=<ParentID>) AND
        HtoY.Descendant=EntityTC.Descendant
    )

Any link X → Y that does satisfy criterion #6 may need to have its distance corrected. We know that X and G are both in An(P), and H and Y are both outside An(P), so by criterion #3 we can trust their existing distances in the table. So we simply take the minimum distance over all possible G → H’s:

UPDATE EntityTC SET Distance = (
    SELECT MIN(XtoG.Distance + HtoY.Distance + 1) FROM EntityTC XtoG, EntityTC HtoY WHERE
    XtoG.Ancestor=EntityTC.Ancestor AND
    XtoG.Descendant IN (SELECT Ancestor FROM EntityTC WHERE Descendant=<ParentID>) AND
    HtoY.Ancestor IN (
        SELECT EntityID FROM EntityParents
        WHERE ParentEntityID=XtoG.Descendant
    ) AND
    HtoY.Ancestor NOT IN (SELECT Ancestor FROM EntityTC WHERE Descendant=<ParentID>) AND
    HtoY.Descendant=EntityTC.Descendant
) WHERE
    Ancestor IN (SELECT Ancestor FROM EntityTC WHERE Descendant=<ParentID>) AND
    Descendant IN (SELECT Descendant FROM EntityTC WHERE Ancestor=<ChildID>)

As an aside, an alternative would be to run the UPDATE query first, which would set the Distance column to NULL wherever there is no link left, and then the DELETE query is very simple (just delete all rows with a NULL distance). However, that approach requires for the Distance column to be nullable, which pains the soul of a true purist.

Conclusion

You can have a DAG in your database and query it efficiently. Each modification to the DAG requires only a few queries to keep the auxiliary transitive-closure table up to date. The number and size of queries does not grow with the amount of data.

2 comments:

  1. do you have some code implementations...i wanna compare this solution to others!

    ReplyDelete
  2. Have you been able to manage efficient multiple parents? Say Cedric is also Adrian's parent?

    ReplyDelete