update readme
Lips7 committed Jul 8, 2024
1 parent 2e63057 commit c4611e5
Showing 3 changed files with 53 additions and 47 deletions.
52 changes: 29 additions & 23 deletions README.md
@@ -184,33 +184,17 @@ bench fastest │ slowest
```

## Roadmap
- [x] Cache get_process_matcher results globally, instead cache result inside SimpleMatcher.
- [x] Expose reduce_process_text to Python.

### Performance
- [x] ~~Cache middle results during different SimpleMatchType reduce_process_text function calling. (failed, too slow)~~
- [x] More detailed and rigorous benchmarks.
- [x] More detailed and rigorous tests.
- [x] Try more aho_corasick library to improve performance and reduce memory usage.
- [x] Try more aho-corasick library to improve performance and reduce memory usage.
- [x] ~~https://github.com/daac-tools/crawdad (produce char-wise index, not byte-wise index, it's not acceptable)~~
- [x] https://github.com/daac-tools/daachorse (use it when Fanjian, PinYin or PinYinChar transformation is performed)
- [x] ~~Test char-wise HashMap transformation for Chinese Characters. (Too slow)~~
- [x] Add a new function that can handle single simple match type.
- [x] `text_process` now is available.
- [x] More detailed simple match type explanation.
- [x] Add fuzzy matcher, https://github.com/lotabout/fuzzy-matcher.
- [x] Use `rapidfuzz` instead.
- [x] More precise and convenient MatchTable.
- [x] Make SimpleMatcher and Matcher serializable.
- [x] Make aho-corasick serializable.
- [x] See https://github.com/Lips7/aho-corasick.
- [x] Implement NOT logic word-wise.
- [ ] More detailed [DESIGN](./DESIGN.md).
- [x] Support stable rust.
- [x] Unsafe aho-corasick crate implement.
- [x] Faster and faster!
- [x] Make aho-corasick unsafe.
- [x] See https://github.com/Lips7/aho-corasick.
- [ ] Support iterator.
- [ ] Optimize NOT logic word-wis
- [x] Optimize regex matcher with RegexSet.
- [ ] Optimize NOT logic word-wise.
- [x] Optimize regex matcher using RegexSet.
- [ ] Optimize simple matcher when multiple simple match types are used.
1. Consider if there are multiple simple match types
* None
@@ -225,4 +209,26 @@ bench fastest │ slowest
- [ ] Optimize process matcher when perform reduce text processing.
1. Consider FanjianDeleteNormalize: we must perform Fanjian first, then Delete, then Normalize. Three Process Matchers are needed to perform the replacements or deletions, so the text is scanned 3 times.
2. What if we construct a single Process Matcher whose patterns contain all of the Fanjian, Delete and Normalize patterns? We could then scan the text only once to find every position where a replacement or deletion should be performed.
3. The byte indices change after each replacement or deletion, so we need to take the offset changes into account.
3. The byte indices change after each replacement or deletion, so we need to take the offset changes into account.
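
The single-scan idea in the roadmap item above can be sketched in a few lines of Rust. This is only an illustration under stated assumptions: `single_pass_process` and its rule table are hypothetical names, not part of `matcher_rs`, and a naive longest-prefix check stands in for the Aho-Corasick automaton the library actually uses. Writing into a fresh `String` sidesteps the offset bookkeeping, because positions are only ever read from the original text.

```rust
/// Hypothetical sketch: apply every (pattern, replacement) rule in one
/// left-to-right scan. An empty replacement deletes the pattern.
fn single_pass_process(text: &str, rules: &[(&str, &str)]) -> String {
    let mut out = String::with_capacity(text.len());
    let mut pos = 0; // byte offset into the ORIGINAL text, never the output

    while pos < text.len() {
        let rest = &text[pos..];

        // Pick the longest rule whose pattern starts at the current offset.
        let hit = rules
            .iter()
            .filter(|(pat, _)| !pat.is_empty() && rest.starts_with(*pat))
            .max_by_key(|(pat, _)| pat.len());

        match hit {
            Some((pat, rep)) => {
                // Emit the replacement and skip the matched bytes.
                out.push_str(rep);
                pos += pat.len();
            }
            None => {
                // No rule applies: copy one whole character and move on.
                let ch = rest.chars().next().unwrap();
                out.push(ch);
                pos += ch.len_utf8();
            }
        }
    }
    out
}
```

For example, with the rules `("國", "国")`, `("-", "")` and `("A", "a")`, the call `single_pass_process("中國-A", &rules)` returns `"中国a"`. One caveat of this sketch: a single scan is not equivalent to three sequential passes when the output of one rule would itself match another rule, so the combined rule table would have to be pre-composed for such cases.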

### Flexibility
- [x] Cache get_process_matcher results globally, instead of caching result inside SimpleMatcher.
- [x] Expose reduce_process_text to Python.
- [x] Add a new function that can handle single simple match type.
- [x] `text_process` now is available.
- [x] Add fuzzy matcher, https://github.com/lotabout/fuzzy-matcher.
- [x] Use `rapidfuzz` instead.
- [x] Make SimpleMatcher and Matcher serializable.
- [x] Make aho-corasick serializable.
- [x] See https://github.com/Lips7/aho-corasick.
- [x] Implement NOT logic word-wise.
- [x] Support stable rust.
- [ ] Support iterator.
- [ ] A real java package.

### Readability
- [x] More precise and convenient MatchTable.
- [x] More detailed and rigorous benchmarks.
- [x] More detailed and rigorous tests.
- [x] More detailed simple match type explanation.
- [ ] More detailed [DESIGN](./DESIGN.md).
2 changes: 1 addition & 1 deletion matcher_c/matcher_c.h
@@ -3,7 +3,7 @@ bool matcher_is_match(void* matcher, char* text);
char* matcher_word_match(void* matcher, char* text);
void drop_matcher(void* matcher);

void* init_simple_matcher(char* simple_wordlist_dict_bytes);
void* init_simple_matcher(char* simple_match_type_word_map_bytes);
bool simple_matcher_is_match(void* simple_matcher, char* text);
char* simple_matcher_process(void* simple_matcher, char* text);
void drop_simple_matcher(void* simple_matcher);
46 changes: 23 additions & 23 deletions matcher_c/src/lib.rs
@@ -1,6 +1,6 @@
use std::{
ffi::{c_char, CStr, CString},
str::from_utf8,
str,
};

use matcher_rs::{MatchTableMap, Matcher, SimpleMatchTypeWordMap, SimpleMatcher, TextMatcherTrait};
@@ -128,7 +128,7 @@ pub unsafe extern "C" fn matcher_is_match(matcher: *mut Matcher, text: *const c_
matcher
.as_ref()
.unwrap()
.is_match(from_utf8(CStr::from_ptr(text).to_bytes()).unwrap_or(""))
.is_match(str::from_utf8_unchecked(CStr::from_ptr(text).to_bytes()))
}
}

@@ -158,7 +158,7 @@ pub unsafe extern "C" fn matcher_is_match(matcher: *mut Matcher, text: *const c_
/// ```
/// use std::collections::HashMap;
/// use std::ffi::{CStr, CString};
/// use std::str::from_utf8;
/// use std::str;
///
/// use matcher_c::*;
/// use matcher_rs::{MatchTable, MatchTableType, SimpleMatchType};
@@ -184,29 +184,29 @@ pub unsafe extern "C" fn matcher_is_match(matcher: *mut Matcher, text: *const c_
/// let not_match_text_bytes = CString::new("test").unwrap();
///
/// assert_eq!(
/// from_utf8(
/// unsafe {
/// unsafe {
/// str::from_utf8_unchecked(
/// CStr::from_ptr(
/// matcher_word_match(
/// matcher_ptr,
/// match_text_bytes.as_ptr()
/// )
/// ).to_bytes()
/// }
/// ).unwrap_or(""),
/// )
/// },
/// r#"{"1":[{"table_id":1,"word":"hello"},{"table_id":1,"word":"world"}]}"#
/// );
/// assert_eq!(
/// from_utf8(
/// unsafe {
/// unsafe {
/// str::from_utf8_unchecked(
/// CStr::from_ptr(
/// matcher_word_match(
/// matcher_ptr,
/// not_match_text_bytes.as_ptr()
/// )
/// ).to_bytes()
/// }
/// ).unwrap_or(""),
/// )
/// },
/// r#"{}"#
/// );
///
@@ -222,7 +222,7 @@ pub unsafe extern "C" fn matcher_word_match(
matcher
.as_ref()
.unwrap()
.word_match_as_string(from_utf8(CStr::from_ptr(text).to_bytes()).unwrap_or("")),
.word_match_as_string(str::from_utf8_unchecked(CStr::from_ptr(text).to_bytes())),
)
.unwrap()
};
@@ -391,7 +391,7 @@ pub unsafe extern "C" fn simple_matcher_is_match(
simple_matcher
.as_ref()
.unwrap()
.is_match(from_utf8(CStr::from_ptr(text).to_bytes()).unwrap_or(""))
.is_match(str::from_utf8_unchecked(CStr::from_ptr(text).to_bytes()))
}
}

@@ -419,7 +419,7 @@ pub unsafe extern "C" fn simple_matcher_is_match(
/// ```
/// use std::collections::HashMap;
/// use std::ffi::{CStr, CString};
/// use std::str::from_utf8;
/// use std::str;
///
/// use matcher_c::*;
/// use matcher_rs::{SimpleMatcher, SimpleMatchType};
@@ -436,29 +436,29 @@ pub unsafe extern "C" fn simple_matcher_is_match(
/// let non_match_text_bytes = CString::new("test").unwrap();
///
/// assert_eq!(
/// from_utf8(
/// unsafe {
/// unsafe {
/// str::from_utf8_unchecked(
/// CStr::from_ptr(
/// simple_matcher_process(
/// simple_matcher_ptr,
/// match_text_bytes.as_ptr()
/// )
/// ).to_bytes()
/// }
/// ).unwrap_or(""),
/// )
/// },
/// r#"[{"word_id":1,"word":"hello&world"}]"#
/// );
/// assert_eq!(
/// from_utf8(
/// unsafe {
/// unsafe {
/// str::from_utf8_unchecked(
/// CStr::from_ptr(
/// simple_matcher_process(
/// simple_matcher_ptr,
/// non_match_text_bytes.as_ptr()
/// )
/// ).to_bytes()
/// }
/// ).unwrap_or(""),
/// )
/// },
/// r#"[]"#
/// );
///
Expand All @@ -475,7 +475,7 @@ pub unsafe extern "C" fn simple_matcher_process(
&simple_matcher
.as_ref()
.unwrap()
.process(from_utf8(CStr::from_ptr(text).to_bytes()).unwrap_or("")),
.process(str::from_utf8_unchecked(CStr::from_ptr(text).to_bytes())),
)
.unwrap(),
)
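
The hunks in `matcher_c/src/lib.rs` all swap `from_utf8(...).unwrap_or("")` for `str::from_utf8_unchecked(...)`. A minimal sketch of what that trade-off means for callers is below; the two helper names are hypothetical and exist only for this illustration, while the standard-library calls are the same ones used in the diff. The checked form validates the bytes and silently falls back to an empty string, whereas the unchecked form skips validation and is undefined behavior if the C caller passes bytes that are not valid UTF-8.

```rust
use std::ffi::CStr;

/// Hypothetical helper mirroring the conversion BEFORE this commit:
/// validate the bytes and fall back to "" when they are not UTF-8.
fn text_or_empty(text: &CStr) -> &str {
    std::str::from_utf8(text.to_bytes()).unwrap_or("")
}

/// Hypothetical helper mirroring the conversion AFTER this commit:
/// skip validation entirely.
///
/// # Safety
/// The caller must guarantee that `text` holds valid UTF-8; otherwise the
/// returned `&str` violates a language invariant (undefined behavior).
unsafe fn text_unchecked(text: &CStr) -> &str {
    unsafe { std::str::from_utf8_unchecked(text.to_bytes()) }
}
```

With valid input such as `b"hello\0"` both helpers return `"hello"`. With invalid input such as `b"\xff\xfe\0"` the checked helper returns `""`, and the unchecked helper must not be called at all, so after this change the UTF-8 guarantee rests on the C side of the FFI boundary.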
