Aho-Corasick String Searching
By lexicalscope.dev at
I usually don't think about string searching. Standard libraries and modern languages have very efficient string searching algorithms built-in. In most cases, this is all that is needed to complete the task at hand. However, understanding these algorithms is fun and often helpful when applying the algorithmic pattern to other problems. This post explores the Aho-Corasick string searching algorithm.
There are many different string searching algorithms; a great resource exists
here here. All
the string searching algorithms in the previous link are for searching for a
single string within a text. All have a runtime in the range of O(n+σ)
and
O(m * n)
where m
is the length of the pattern and n
is the length of the
text. What about finding two or more string patterns within a text? How can this
be done efficiently, on the order of O(n)
? Let's dig in.
Aho-Corasick string searching algorithm uses finite automata to process each character from the text. Each character is an input into the automata, which transitions the state.
This algorithms cleverness is in interconnecting nodes with what they call "suffix" and "dictionary suffix" links. Another common name is "failure" and "alternative" links respectively.
A failure link is always to the longest suffix of any other pattern in the trie.
If we have two patterns foobar
and baz
. When the automata transitions with
the input chars fooba
it goes nearly to the end of the foobar
pattern. If at
fooba
, a z
char is input, the node cannot transition to the r
at the end,
it must follow a failure link. And the longest correct suffix is ba
in the
baz
branch, so the a
node from foobar
links there. And then the z
is
processed and a transition is made to node 9 in the diagram which completes a
match of baz
.
Alternative links point to a node where another match has occurred. In our
example below, the word "rep" is a subset of the word "grep", so the end "p"
node will have itself a match, and a link to another branch in the trie
containing "rep". Any number of alternative matches can exist; each is followed
(but not transitioned) until a nil
alternative link.
The steps for building the automata are as follows:
- Build a simple trie graph of all patterns.
- Breadth-first traversal of the graph making failure links and alternative match links.
- A search function to transition the automata to a new node, checking for failure links and alternative match links at each node.
The first step is building the trie graph containing the patterns to search for.
This forms the base of the automata. In this example, the patterns will be the
strings gre
, rep
, grep
, and fgrep
. And the input text will be
foobar fgrep prepping
.
First, define how a Node
will be represented.
type Node struct {
id int
data rune
children map[rune]*Node
fail *Node
alternative *Node
match int
}
A match
greater than or equal to 0 means this node completes a pattern and the
integer is the index into the slice of patterns.
Define a function for building the trie, which will construct a base trie graph with no links, then add the failure and alternative links. This initial portion of the build function defines the nodes slice and adds a root node.
func build(input []string) []*Node {
nodes := make([]*Node, 0)
root := &Node{
id: 0,
data: '\000',
children: make(map[rune]*Node),
fail: nil,
alternative: nil,
match: -1,
}
nodes = append(nodes, root)
cur := root
ids := 1
...
Next, iterate over each input pattern character, adding nodes where appropriate. Branches can overlap; each branch need not be unique to one pattern.
...
for i, word := range input {
cur = root
for j, letter := range word {
if node, ok := cur.children[letter]; ok {
// if this is the end, set the match index
if j == len(word)-1 {
node.match = i
}
cur = node
continue
}
node := &Node{
id: ids,
data: letter,
children: make(map[rune]*Node),
fail: nil,
match: -1,
}
nodes = append(nodes, node)
ids++
// if this is the end, set the match index
if j == len(word)-1 {
node.match = i
}
cur.children[letter] = node
cur = node
}
}
...
The initial trie graph looks like the picture below. Each node with a double circle indicates an "accepted" node. A match is triggered if the automata transitions to one of those nodes.
This initial trie graph isn't that effective because it does not have a function to cover each potential transition. At the very least, in this simple example, without failure and alternate links, each node should have a link back to the root or starting node if a pattern fails. More on this next.
Adding the failure and alternative nodes requires a breadth-first traversal. Setting up that traversal, we create a queue and add the first level of nodes or children of the root.
...
root.fail = root
failQueue := make([]*Node, 0)
// set the first node of each branch fail point back to root
for _, child := range root.children {
child.fail = root
failQueue = append(failQueue, child)
}
...
Loop over the queue until it is empty. Adding new nodes as we traverse the trie.
for {
// if the initial branch nodes are all consumed
if len(failQueue) == 0 {
break
}
// pop one item off the queue
cur, failQueue = failQueue[0], failQueue[1:]
// for each child in this branch find the largest suffix
for _, child := range cur.children {
temp := cur.fail
for {
_, ok := temp.children[child.data]
// we're at the largest suffix so far
if !ok && temp != root {
temp = temp.fail
}
// if we're back at the root then we already passed
// largest suffix
if temp == root {
break
}
}
// if there is a suffix found, link to it. Otherwise, link
// back to the root node
if node, ok := temp.children[child.data]; ok {
child.fail = node // proper suffix
} else {
child.fail = root // no suffix
}
// add this node to the queue for processing
failQueue = append(failQueue, child)
}
// add an alternative match link
if cur.fail != nil && cur.fail.match >= 0 {
cur.alternative = cur.fail
} else {
cur.alternative = cur.fail.alternative
}
}
return nodes
}
// end build function
At this point, the entire automata should look like the image below. Solid lines are the pattern links, dashed lines are the failure links, and dotted lines are the alternative links.
Searching is transitioning the automata based on the input. For each input character, follow the transition rules below:
- if the current node contains a branch matching the input char, traverse to that node.
- If the new node contains a match, log it.
- If the new node contains an alternative match, log it. Additionally, follow each alternative link from node to node until a nil alternative link. Do not move from current node.
- If the current node does not contain a match, move to each failure node and test if that node has a good transition. If no transition exists, continue moving until a successful transition can be made or the root node is reached.
The search function will return a slice of integers with the same length as the input patterns. Each item in the new slice will count how often that pattern exists in the text.
func search(root *Node, patterns []string, input string) []int {
node := root
matches := make([]int, len(patterns))
for i := 0; i < len(input); i++ {
c := rune(input[i])
// if there is a child node that matches the input
if new, ok := node.children[c]; ok {
// transition to the next node
node = new
// check if this node completes a match
if node.match >= 0 {
matches[node.match]++
}
// jump to each alternative matching node if they exist
temp := node.alternative
for {
if temp == nil {
break
}
matches[temp.match]++
temp = temp.alternative
}
} else {
// jump to each failure node until a next matching
// transition node is found, or back at the root.
for {
_, ok := node.children[c]
if node == root {
break
}
if ok {
break
}
node = node.fail
}
// rerun this input char again as we moved nodes
if _, ok := node.children[c]; ok {
i--
}
}
}
return matches
}