ahocorasick

An extremely fast implementation of Aho Corasick algorithm based on Double Array Trie.Can be used for fuzzy matching, prohibited word marking.

中文

Common usage scenarios are:

1.When chatting, block when the user-entered qury hits a prohibited word.

2.Determine if the user input word hits the prohibited word blacklist,and then follow up with logic.

3.other scenarios that require high-performance fuzzy matching.

Characteristic

In addition to the construction based on double arrays, the algorithm also has additional optimization to ensure higher efficiency

1.When constructing the algorithm, the double array and the fail pointer and output items are built at the same time, greatly increasing the speed of construction and reducing the time it takes to build a dictionary.

2.Compress the output items to reduce build time and memory useage and gc.

3.Compressed leaf nodes, reduced memory consumption, improve retrieval efficiency.

Explanation of compressed output items:

e.g. the prohibited word dictionary is:

he 
she

The input is: ushe,then the returned hit item is: she

Explanation:

If we don't do output compression, the hit item is:he,she;

Considering that in practice, whether it is blocking prohibited words (e.g., ushe above, blockingprohibited words afteru,) or just determining whether a prohibited word has been hit,

there is no need for redundant output items,

compression of output items, not only can reduce the build time, but also reduce memory consumption,

for go, can also reduce gc scanning.

Performance Compare

A popular project cloudflare/ahocorasick and return the same results (using api:MultiPatternIndexes(content) (see test file forcomparison code).

Dictionary:

dictionary.txt

Number of strings: 153151.

Number of characters: 401552 (average 2.6 characters per article)

File to be retrieved:

text.txt

Characters: 815006

Dictionary and file to be retrieved are already in the project

Running machine:

Lenovo Small New Pro13 Ryzen 5 3550H

Match the file to be retrieved 100 times

go Version: 1.15

gc before matching (runtime. Gc())

Warehouse	Dictionary BuildTime(ms)	One gc Time(ms)	100 full-text matches(ms)	inuse_space	inuse_objects
cloudflare/ahocorasick	59	686	14910	4.67G	360455
TheFutureIsOurs/ahocorasick	2431	0	5341	14.2M	4

As you can see, performance is 64% faster than cloudflare/ahocorasick (retrieving 81.5w characters takes only 54ms), thanks to double array implementations and optimization of output items, the time to perform a gc can be ignored (only four slice headers are held). It also uses only 14.2M of memory.

Thanks to the various optimizations above, the build time is optimized to 2.4s, although it is still longer than cloudflare/ahocorasick,

but considering that the build time is more complex, and most applications only need to be built once at startup, the second stage for the sensory time is short, the build time is controllable (the author with a business 61w line black word test, build time 3.5s), but the higher quantifiable indicators of the return, this is worth it.

How to use

Download

go get -u github.com/TheFutureIsOurs/ahocorasick

Useage

import "github.com/TheFutureIsOurs/ahocorasick"

First step: construct an algorithm.

It can be constructed from a list of strings:

dictionary := []string{"hers", "his", "she", "he"}

ac, err := ahocorasick.Build(dictionary)

or by file construction:

dictionary.txt file content is:

hers
his
she
he

ac, err := BuildFromFile("./dictionary.txt")

Step 2: Match.

api list:

MultiPatternSearch(content []rune) []Hit

returns all hit strings at the beginning and end of the original string andstrings.

MultiPatternIndexes(content []rune) []int

returns the starting point of all hit strings in the original character.

MultiPatternHit(content []rune) bool

returns whether the content hit the dictionary. If hit, true is returned immediately and the query is not continued.

A complete example is as follows:

// Constructs dictionary from the list of 

dictionary := []string{"hers", "his", "she", "he"}

ac, err := ahocorasick.Build(dictionary)

search := ac.MultiPatternSearch([]rune("ushers")) // returns the longest string of all hits

/*

// Returns a list of all strings hit as follows:
// The starting position of the string at the beginning and end of the original string and the hit string
// The startingvalue of the original string is 0
1	3	she
2	5	hers

*/

for _, v := range search {

    fmt.Printf("%d\t%d\t%s\n", v.Begin, v.End, string(v.Value))

}

Thanks

The double array trie was inspired by the open source project of the darts-java

Name		Name	Last commit message	Last commit date
Latest commit History 24 Commits
.gitignore		.gitignore
LICENSE		LICENSE
README-ZH.md		README-ZH.md
README.md		README.md
ac.go		ac.go
ac_test.go		ac_test.go
dictionary.txt		dictionary.txt
go.mod		go.mod
text.txt		text.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ahocorasick

Characteristic

Performance Compare

How to use

Download

Useage

Thanks

About

Releases

Packages

Languages

License

TheFutureIsOurs/ahocorasick

Folders and files

Latest commit

History

Repository files navigation

ahocorasick

Characteristic

Performance Compare

How to use

Download

Useage

Thanks

About

Topics

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages