*communicate the approach*. Feel free to check out our conference paper, and stay tuned for the journal paper. In this blog post, I will first give an introduction to de Bruijn graphs and how they are used in DNA assembly. Then I will briefly explain some previous implementations before reaching the main topic of this post: our new succinct representation. The explanation first explains some preliminaries, such as how we were inspired by the Burrows Wheeler Transform, and what rank and select are, which are required to understand the construction method, and traversal interface respectively. For those following along at home, I have implemented a demo version in Python. It doesn’t use efficient implementations of rank and select, nor provide any compression – it is merely meant to demonstrate the key ideas with (hopefully) readable high level code. An optimised version will be made available at some point. Okay, here we go…

## De Bruijn Graphs

De Bruijn graphs are a beautifully simple, yet useful combinatoric object which I challenge you not to lose sleep over. Since their discovery around 1946, they have been used in a variety of applications, such as encryption, psychology, chess and even card tricks. Quite recently they have become a popular data structure for DNA assembly of short read data. They are defined as a directed graph, where each node $u$ represents a fixed-length string (say, of length $k$, and an edge exists from $u$ to $v$ iff they overlap by $k–1$ symbols. That is, $u[2..k] = v[1..k–1]$. Let’s make this concrete with a diagram:*feel good*to look at? Now tilt your head and look at it this way: it is essentially a Finite State Machine with additional bounded memory of the last $k$ visited states, if you added a few more states to allow for incomplete input at the start. For example, if X represents blank input, from a starting state XXX we might have the transition chain: XXX -> XX1 -> X10 -> 101 (which is already a node in the graph). Put this way, it is easy to see that the $k$ previous edges define the current node. This perspective will make things easier to understand later, I promise. Still with me? Good (: Then let’s also consider that if

*one node*is defined by a $k$-length string, then a

*pair of nodes*(i.e. an edge) can be identified by a $k+1$ length string, since they overlap and differ by 1 symbol. This will also be important later. And of course, this can be extended to larger alphabet sizes than binary (say, 4…).

## DNA Assembly

First suggested in 2001 by Pevzner et al.[1], we can use de Bruijn graphs to represent a network of overlapping*short read data*. The long and short of it (heh heh) is that a DNA molecule is currently too difficult to sequence (that is, read it into a computer) in its entirety. Special methods must be used so we can sequence parts of the molecule, and hand off the putting-back-together process (assembly) to an algorithm. One current popular sequencing method is

*shotgun sequencing*, which clones the genome a bunch of times, then randomly breaks each clone into short segments. If we can sequence the short segments, then the fact that we randomly cut up the clones should lead us to have overlapping reads. Of course it is a bit more complicated than this in reality, but this is the essence of it.

*every single k-mer*present, and there is usually repeated regions. This means our data will be sparse. Consider the following contrived example. Take the sequence

`TACGACGTCGACT`

. If we set \(k=3\), our k-mers will be `TAC`

, `ACG`

, `CGA`

, and so on.
We would end up with this de Bruijn graph:
*subgraph*, these things still grow pretty big. It is worthwhile considering how to handle this scalability issue, if only to reduce hardware requirements of sequencing (thus proliferating personal genomics), and potentially improve traversal speed (due to better memory locality). Increased efficiency might also enable richer multiple-genomic analysis.

## Previous Representations

One of the first approaches to this was to scale “horizontally”. Simpson et al.[2] introduced ABySS in 2009. The graph for reads from a human genome (HapMap: NA18507), which used a distributed hash table, reached 336 GB. In 2011, Conway and Bromage[3] instead approached this problem from a “vertical” scaling perspective (that is, scaling to make better use of a single system’s resources), by using a sparse bitvector (by Okanohara and Sadakane[4]) to represent the $(k + 1)$-mers (the edges), and used rank and select (to be described shortly) to traverse it. As a result, their representation took 32 GB for the same data set. Minia, by Cikhi and Rizk (2012)[5], proposed yet another approach by using a bloom filter (with additional structure to avoid false positive edges that would affect the assembly). They traverse by generating possible edges and testing for it in the bloom filter. Using this approach, the graph was reduced to 5.7 GB.## Our Succinct Representation

As stated, we were able to represent the same graph in 2.5 GB (after some further compression techniques, which I will save for a future post). The key insight is that the edges define overlapping node labels. This is similar to that of Conway and Bromage, although they have some redundancy, since some*nodes*are represented more times than necessary. We further exploit the mutual information of edges by taking inspiration from the Burrows Wheeler Transform[6].

### Inspiration from the Burrows Wheeler Transform

The Burrows Wheeler Transform[6] is a reversible string permutation that can be searched directly and has the admirable quality of having long strings of repeated characters (great for compression). The easiest way to calculate the BWT of a string is to sort each symbol by their prefixes in colex order (that is, alphabetic order of the reverse of the string, not reverse alphabetic!) More information can be found on Wikipedia and this Dr. Dobbs article. The XBW is a generalisation of the BWT that applies to rooted, labeled trees[7]. The idea is that instead of taking all suffixes, we sort all paths from the root to each node, and support tree navigation (since it isn’t a linearly shaped string) with auxiliary bit vectors indicating which edges are leaves, and which are the last edges (of their siblings) of internal nodes. I won’t go into detail, but in the next section you should be able to see glimpses of these two ideas.### Construction

The simplest construction method[8] is to take every <node, edge> pair and sort them based on the reverse of the node label (colex order), removing duplicates[9]. We also padding to ensure every node has an incoming and an outgoing edge. This maintains the fact that a node is defined by its previous k edges. An example can be seen below:`ACG`

and `CGA`

both have two incoming edges.
*last*edge exiting a node. This means that each node will have a sequence of zero-or-more 0-bits, followed by a single 1-bit.

### Rank and Select

Rank and select are the bread and butter of succinct data structures, because so many operations can be implemented using them alone. $rank_c(i)$ returns the number of occurences of symbol $c$ on the closed range $[0, i]$, whereas $select_c(i)$ returns the position of the $i^{th}$ occurence of symbol $c$.*kind of*like inverse functions, although $rank()$ is not injective, so cannot have a true inverse. For this reason, if you want to find the

*position*of the left-nearest $c$, you would have to use $select()$ in orchestra with $rank()$ (a common pattern). Speaking of patterns of use, it may also help to keep this in mind: rank is for counting, and select is for searching, or can be thought of as an indirect addressing technique (such as addressing a node using the $L$ array). Two rank queries can count over a range, whereas two select queries can find a range. A rank and a select query can find a range where either a start point or end point are fixed. Rank, select and standard array access can all be done in $\mathcal{O}(1)$ time when $\sigma = polylog(N)$[11], if represent a bitvector using the structure described by Raman, Raman and Rao in 2007[12] (which I explained in an earlier blog post), and for larger alphabets use the index described by Ferragina, Manzini, Makinen, and Navarro in 2006[13]. In our implementation, we use modified versions to get it down to 3 bits per edge.

### Interface Overview

While it might not be obvious, these three arrays provides support for a full suite of navigation operations. An overview is given in these tables, which link to the implementation details that follow. First, to navigate each of the edges, we define two internal functions (using rank and select calls):Operation | Description | Complexity |
---|---|---|

\(forward(i)\) | Return index of the last edge of the node pointed to by edge \(i\). |
$\mathcal{O}(1)$ |

\(backward(i)\) | Return index of the first edge that points to the node that the edge at \(i\) exits. |
\(\mathcal{O}(1)\) |

Operation | Description | Complexity |
---|---|---|

\(outdegree(v)\) | Return number of outgoing edges from node \(v\). | \(\mathcal{O}(1)\) |

\(outgoing(v,c)\) | From node \(v\), follow the edge labeled by symbol \(c\). | \(\mathcal{O}(1)\) |

\(label(v)\) | Return (string) label of node \(v\). | \(\mathcal{O}(k)\) |

\(indegree(v)\) | Return number of incoming edges to node \(v\). | \(\mathcal{O}(1)\) |

\(incoming(v,c)\) | Return predecessor node starting with symbol \(c\), that has an edge to node \(v\). | \(\mathcal{O}(k \log \sigma) \) |

### Forward

In order to support the public interface, we create for ourselves a simpler way to work with edges: the complementing forward and backward functions. Recall that all node labels are defined by predecessor edges, then we have represented each edge in two different places: the \(F\) array (which is equivalent to the last column of the “Node” array), and the edge array \(W\). It follows that, since it is sorted, the node labels maintain the same*relative order*as the edge labels. This can be seen in the following figure:

`C`

s in the last column of Node is different from the number of `C`

s in \(W\), because the first `C`

in \(W\) points to two edges.
For this reason, we ignore the first edge from node `GAC`

(although it doesn’t affect the relative order). In fact, we ignore any edge that doesnt have L[i] == 1.
Then, following an edge is simply finding the corresponding relatively positioned node! All it takes is some creative counting, using rank and select, as pictured
below:
*edges*with this label. Let’s call this relative index \(r\). In our example we are following the 2nd C-labeled edge. To find the 2nd occurence of C in F, first we need to know where the first occurence is. We can use F to find that. Then we can select to the 2nd one, using the last array. Because the last array is binary only, this requires us to count how many 1s there are before the run of Cs (using rank), then adding 2 to land us at the 2nd C. The W access, rank over W, rank and select over L, accessing F, and the addition are all done in O(1) time, so forward also takes O(1) time.

### Backward

Backward is very similar to forward, but calculated in a different order: we find the relative index of the node label first, and use that to find the corresponding edge (which may not be the only edge to point to this node, but we define it to point to the first one, that is one that isn’t flagged with a minus). We can find our relative index of the node label by issuing two rank queries instead, and using select on W to find the first incoming edge.### Outdegree

This is an easy one. This function accepts a**node**(not edge!) index

`v`

, and returns the number of outgoing edges from that node. Why did I say this was easy?
Well, remember that all our outgoing edges from the same node are contiguous in our data structure. See the diagrams below, paying attention to the node `ACG`

.
*except the last*edge. All we need to do is count how many 0s there are, then add 1. Put another way, we need to measure the distance between 1s (since the previous 1 will belong to the previous node). Since we know the node index, we can use select for that! In the above example, we are querying the outdegree of node 6 (the 7th node due to zero-basing). First we select to find the position of the 7th 1, which gives us the last edge of that node. Then we simply subtract the position of the previous node (node 5, the 6th node): \(select(7) – select(6) = 8 – 6 = 2\). Boom. Select queries can be answered in O(1) time, so outdegree is also O(1).

### Outgoing

Outgoing(v,c) returns the target node after traversing edge c from node v, which might not exist. The hard part is finding the correct edge index to follow; after that we can conveniently use the forward() function we defined earlier. To ease the explanation, consider a simple bit vector`00110100`

. To count how many 1s there are up to and including the 7th position, we would use rank.
The answer is 3, but at this stage we still don’t know the position of the 3rd 1 (we can’t see the bitvector). In general, it may or may not be in the 7th position.
We could scan backwards, or we could just use select(3) to find the position (since this returns the first position i that has rank(i) = 3).
So essentially we can count how many of those edges there are before the node we are interested in, then use select to find the position. If the position
is inside this nodes range (of contiguous edges), then we follow it.
### Label

At some point (e.g. during traversal) we are probably going to want to print the node labels out. Let’s work out how to do that (: Remember, we aren’t storing the node labels explicitly. The F array will come in handy: We can use the position of our node (found using select) as a reverse lookup into F. This can be done in constant time with a sparse bit vector, or in logarithmic time using binary search, or we can use a linear scan if our alphabet is small. In any case, lets assume it is O(1) time, although a linear scan might be faster in practice (fewer function calls may yield a lower constant coefficient).### Indegree

In a similar manner to outdegree, all we need to do is count the edges that point to the current node label. Take for example the graph:### Incoming

Incoming, which returns the predecessor node that begins with the provided symbol, is probably the most difficult operation to implement. However, it does use approaches similar to the previous functions. Consider this: from indegree(), we already know how to count each of the predecessor nodes. We can access these nodes if we instead use select to iterate over the predecessor nodes, rather than using rank to simply count. Then, to disambiguate them by first character, we can use label(). A linear scan over the predecessors in this fashion would work, but for large alphabets we can use binary search (with a select call before each array access) to support this in $O(k \log \sigma)$ time; $\log \sigma$ for the binary search, where each access is a $O(1)$ select, followed by $O(k)$ to compute the label. This is demonstrated in the example below:## Conclusion

In conclusion, by using memory more efficiently, hopefully the cost of genome seqeuencing can be reduced, both proliferating the technology, but also giving way to more advanced population analysis. To that end, we have described a novel approach to representing de Bruijn graphs efficiently, while supporting a full suite of navigation operations quickly. Much of the (BWT-inspired) construction can be done efficiently on disk, but we intend to improve this soon to compete with Minia. The total space is a theoretical $m(2 + \log{\sigma} + o(1))$ bits in general, or $4m + o(m)$ bits for DNA, given $m$ edges. Using specially modified indexes we can lower this to around 3 bits per edge. I apologize that an efficient implementation isn’t available, nor have I provided experimental results. But if you found this post interesting you can get your hands dirty with the Python implementation I have provided. The results and efficient implementation are on their way (: And as always, follow me on Twitter, and feel free to send me questions, criticisms, and especially pull requests!- Pevzner, P. A., Tang, H., and Waterman, M. S. (2001). An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences, 98(17):9748–9753. ↩
- Simpson, J. T., Wong, K., Jackman, S. D., Schein, J. E., Jones, S. J. M., and Birol, I. (2009). ABySS: A parallel assembler for short read sequence data. Genome Research, 19(6):1117–1123. ↩
- Conway, T. C. and Bromage, A. J. (2011). Succinct data structures for assembling large genomes. Bioinformatics, 27(4):479–486. ↩
- Okanohara, D. and Sadakane, K. (2006). Practical Entropy-Compressed Rank/Select Dictionary. ↩
- R. Chikhi, G. Rizk. (2012) Space-efficient and exact de Bruijn graph representation based on a Bloom filter. WABI. ↩
- M. Burrows and D. J. Wheeler. A Block-sorting Lossless Data Compression Algorithms. Technical Report 124, Digital SRC Research Report, 1994. ↩
- P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Compressing and indexing labeled trees, with applications. Journal of the ACM, 57(1):4:1–4:33, 2009. ↩
- The paper also describes an online construction method (where we can update by appending), and an iterative method that builds the k-dimensional de Bruijn graph from an existing (k–1)-dimensional de Bruijn graph. ↩
- We currently use an external merge sort, but intend to optimise as this is where Minia beats us time-wise. ↩
- This is “little o” notation, which may be unfamiliar to some people. Intuitively it means “grows much slower than”, and is stricter than big O. A formal definition can be found on Wikipedia. ↩
- http://stackoverflow.com/questions/1801135/what-is-the-meaning-of-o-polylogn-in-particular-how-is-polylogn-defined ↩
- R. Raman, V. Raman, and S. R. Satti. Succinct indexable dictionaries with applications to encoding k-ary trees, prefix sums and multisets. ACM Trans. Algorithms, 3(4), November 2007. ↩
- P. Ferragina, G. Manzini, V. M ̈akinen, and G. Navarro. Compressed Representations of Sequences and Full-Text Indexes. ACM Transactions on Algorithms, 3(2):No. 20, 2006. ↩