Aho-Corasick String Searching

By lexicalscope.dev at

I usually don't think about string searching. Standard libraries and modern languages have very efficient string searching algorithms built-in. In most cases, this is all that is needed to complete the task at hand. However, understanding these algorithms is fun and often helpful when applying the algorithmic pattern to other problems. This post explores the Aho-Corasick string searching algorithm.

There are many different string searching algorithms; a great resource exists here here. All the string searching algorithms in the previous link are for searching for a single string within a text. All have a runtime in the range of O(n+σ) and O(m * n) where m is the length of the pattern and n is the length of the text. What about finding two or more string patterns within a text? How can this be done efficiently, on the order of O(n)? Let's dig in.

Aho-Corasick string searching algorithm uses finite automata to process each character from the text. Each character is an input into the automata, which transitions the state.

This algorithms cleverness is in interconnecting nodes with what they call "suffix" and "dictionary suffix" links. Another common name is "failure" and "alternative" links respectively.

A failure link is always to the longest suffix of any other pattern in the trie. If we have two patterns foobar and baz. When the automata transitions with the input chars fooba it goes nearly to the end of the foobar pattern. If at fooba, a z char is input, the node cannot transition to the r at the end, it must follow a failure link. And the longest correct suffix is ba in the baz branch, so the a node from foobar links there. And then the z is processed and a transition is made to node 9 in the diagram which completes a match of baz.

Alternative links point to a node where another match has occurred. In our example below, the word "rep" is a subset of the word "grep", so the end "p" node will have itself a match, and a link to another branch in the trie containing "rep". Any number of alternative matches can exist; each is followed (but not transitioned) until a nil alternative link.

The steps for building the automata are as follows:

  1. Build a simple trie graph of all patterns.
  2. Breadth-first traversal of the graph making failure links and alternative match links.
  3. A search function to transition the automata to a new node, checking for failure links and alternative match links at each node.

The first step is building the trie graph containing the patterns to search for. This forms the base of the automata. In this example, the patterns will be the strings gre, rep, grep, and fgrep. And the input text will be foobar fgrep prepping.

First, define how a Node will be represented.

type Node struct {
    id          int
    data        rune
    children    map[rune]*Node
    fail        *Node
    alternative *Node
    match       int
}

A match greater than or equal to 0 means this node completes a pattern and the integer is the index into the slice of patterns.

Define a function for building the trie, which will construct a base trie graph with no links, then add the failure and alternative links. This initial portion of the build function defines the nodes slice and adds a root node.

func build(input []string) []*Node {

    nodes := make([]*Node, 0)

    root := &Node{
        id:          0,
        data:        '\000',
        children:    make(map[rune]*Node),
        fail:        nil,
        alternative: nil,
        match:       -1,
    }

    nodes = append(nodes, root)

    cur := root
    ids := 1
    ...

Next, iterate over each input pattern character, adding nodes where appropriate. Branches can overlap; each branch need not be unique to one pattern.

...
for i, word := range input {
    cur = root

    for j, letter := range word {

        if node, ok := cur.children[letter]; ok {
        
            // if this is the end, set the match index
            if j == len(word)-1 {
                node.match = i
            }

            cur = node
            continue
        }

        node := &Node{
            id:       ids,
            data:     letter,
            children: make(map[rune]*Node),
            fail:     nil,
            match:    -1,
        }

        nodes = append(nodes, node)
        ids++

        // if this is the end, set the match index
        if j == len(word)-1 {
            node.match = i
        }

        cur.children[letter] = node
        cur = node
    }
}
...

The initial trie graph looks like the picture below. Each node with a double circle indicates an "accepted" node. A match is triggered if the automata transitions to one of those nodes.

This initial trie graph isn't that effective because it does not have a function to cover each potential transition. At the very least, in this simple example, without failure and alternate links, each node should have a link back to the root or starting node if a pattern fails. More on this next.

Adding the failure and alternative nodes requires a breadth-first traversal. Setting up that traversal, we create a queue and add the first level of nodes or children of the root.

...
root.fail = root

failQueue := make([]*Node, 0)

// set the first node of each branch fail point back to root
for _, child := range root.children {
    child.fail = root
    failQueue = append(failQueue, child)
}
...

Loop over the queue until it is empty. Adding new nodes as we traverse the trie.

for {
    // if the initial branch nodes are all consumed
    if len(failQueue) == 0 {
        break
    }

    // pop one item off the queue
    cur, failQueue = failQueue[0], failQueue[1:]

    // for each child in this branch find the largest suffix
    for _, child := range cur.children {
        temp := cur.fail

        for {
            _, ok := temp.children[child.data]

            // we're at the largest suffix so far
            if !ok && temp != root {
                temp = temp.fail
            }

            // if we're back at the root then we already passed 
            // largest suffix
            if temp == root {
                break
            }

        }

        // if there is a suffix found, link to it. Otherwise, link 
        // back to the root node
        if node, ok := temp.children[child.data]; ok {
            child.fail = node // proper suffix
        } else {
            child.fail = root // no suffix
        }

        // add this node to the queue for processing
        failQueue = append(failQueue, child)
    }

    // add an alternative match link
    if cur.fail != nil && cur.fail.match >= 0 {
        cur.alternative = cur.fail
    } else {
        cur.alternative = cur.fail.alternative
    }
}

return nodes
}
// end build function

At this point, the entire automata should look like the image below. Solid lines are the pattern links, dashed lines are the failure links, and dotted lines are the alternative links.

Searching is transitioning the automata based on the input. For each input character, follow the transition rules below:

  1. if the current node contains a branch matching the input char, traverse to that node.
  2. If the new node contains a match, log it.
  3. If the new node contains an alternative match, log it. Additionally, follow each alternative link from node to node until a nil alternative link. Do not move from current node.
  4. If the current node does not contain a match, move to each failure node and test if that node has a good transition. If no transition exists, continue moving until a successful transition can be made or the root node is reached.

The search function will return a slice of integers with the same length as the input patterns. Each item in the new slice will count how often that pattern exists in the text.

func search(root *Node, patterns []string, input string) []int {

    node := root
    matches := make([]int, len(patterns))

    for i := 0; i < len(input); i++ {
        c := rune(input[i])

        // if there is a child node that matches the input
        if new, ok := node.children[c]; ok {

            // transition to the next node
            node = new

            // check if this node completes a match
            if node.match >= 0 {
                matches[node.match]++
            }

            // jump to each alternative matching node if they exist
            temp := node.alternative
            for {
                if temp == nil {
                    break
                }

                matches[temp.match]++
                temp = temp.alternative
            }

        } else {

            // jump to each failure node until a next matching
            // transition node is found, or back at the root.
            for {
                _, ok := node.children[c]

                if node == root {
                    break
                }

                if ok {
                    break
                }

                node = node.fail
            }

            // rerun this input char again as we moved nodes
            if _, ok := node.children[c]; ok {
                i--
            }
        }
    }

    return matches
}