CRAN version 0.5
Changes in Version 0.5 (2015-04-29)
- Fix: edit_dict() on Mac
- New function: filter_segment() to filter segmentation result
- New function: vector_keywords() to extract keywords from a character vector
- Enhancement: Segmentation support: Vector input => List output
- Enhancement: Segmentation support: Input by lines => Output by lines
- Enhancement: Add option write = "NOFILE"
- Enhancement: New rules for "English word + Numbers"
- Update documentation
1. Added filter_segment(), a method for filtering segmentation results, similar to the stop-word feature used in keyword extraction.
cutter = worker()
result_segment = cutter["我是测试文本,用于测试过滤分词效果。"]
result_segment
[1] "我" "是" "测试" "文本" "用于" "测试" "过滤" "分词" "效果"
filter_words = c("我","你","它","大家")
filter_segment(result_segment,filter_words)
[1] "是" "测试" "文本" "用于" "测试" "过滤" "分词" "效果"
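For this example, the effect of the filter can be reproduced in base R by dropping every token that exactly matches a filter word. This is only a sketch of the observable behaviour, not jiebaR's actual implementation, and the segmentation result is typed in by hand from the output above rather than produced by a worker.

```r
# Segmentation result copied by hand from the example above (assumption: not produced by jiebaR here)
result_segment <- c("我", "是", "测试", "文本", "用于", "测试", "过滤", "分词", "效果")
filter_words <- c("我", "你", "它", "大家")

# Keep only the tokens that do not exactly match any filter word
filtered <- result_segment[!result_segment %in% filter_words]
print(filtered)
```

Exact matching means a filter word such as "大家" removes only the token "大家", never part of a longer token.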
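2. Segmentation now supports "character vector input => list output" and "file input by lines => list output".
The bylines option controls whether results are returned line by line; the default is bylines = FALSE.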
cutter = worker(bylines = TRUE)
cutter
Worker Type: Mix Segment
Detect Encoding : TRUE
Default Encoding: UTF-8
Keep Symbols : FALSE
Output Path :
Write File : TRUE
By Lines : TRUE
Max Read Lines : 1e+05
....
cutter[c("这是非常的好","大家好才是真的好")]
[[1]]
[1] "这是" "非常" "的" "好"
[[2]]
[1] "大家" "好" "才" "是" "真的" "好"
cutter$write = FALSE
# The input file contains:
# 这是一个分行测试文本
# 用于测试分行的输出结果
cutter["files.path"]
[[1]]
[1] "这是" "一个" "分行" "测试" "文本"
[[2]]
[1] "用于" "测试" "分行" "的" "输出" "结果"
# Write the result to a file line by line
cutter$write = TRUE
cutter$bylines = TRUE
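With bylines = TRUE the result is a list holding one character vector per input line, so the usual base R list tools apply. The sketch below uses the list printed for the vector input above, typed in by hand as an assumption instead of calling jiebaR, and shows per-line token counts and re-joining each line.

```r
# List output copied by hand from the vector-input example above (not produced by jiebaR here)
segmented <- list(
  c("这是", "非常", "的", "好"),
  c("大家", "好", "才", "是", "真的", "好")
)

# Number of tokens in each input line
token_counts <- lengths(segmented)

# Re-join the tokens of each line with spaces, one string per line
rejoined <- vapply(segmented, paste, character(1), collapse = " ")

print(token_counts)
print(rejoined)
```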
3. vector_keywords() can be used to extract keywords from a character vector.
keyworker = worker("keywords")
cutter = worker()
vector_keywords(cutter["这是一个比较长的测试文本。"],keyworker)
8.94485 7.14724 4.77176 4.29163 2.81755
"文本" "测试" "比较" "这是" "一个"
vector_keywords(c("今天","天气","真的","十分","不错","的","感觉"),keyworker)
6.45994 6.18823 5.64148 5.63374 4.99212
"天气" "不错" "感觉" "真的" "今天"
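The printed output suggests that the result is a named character vector whose names are the keyword scores. Under that assumption, the scores can be recovered as numbers and combined with the keywords in a data frame. The values below are typed in by hand from the first example, not computed by jiebaR.

```r
# Keywords and scores copied by hand from the first example above (assumption: names hold the scores)
keywords <- c("文本", "测试", "比较", "这是", "一个")
names(keywords) <- c("8.94485", "7.14724", "4.77176", "4.29163", "2.81755")

# Convert the names back to numeric scores
scores <- as.numeric(names(keywords))

# Combine keywords and scores into a data frame for filtering or sorting
keyword_table <- data.frame(
  keyword = unname(keywords),
  score = scores,
  stringsAsFactors = FALSE
)

# Keywords whose score exceeds 7
top_two <- keyword_table$keyword[keyword_table$score > 7]
print(top_two)
```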
4. Added the write = "NOFILE" option, which skips the file-path check, so the input is always segmented as text.
cutter = worker(write = "NOFILE",symbol = TRUE)
cutter["./test.txt"] # a test.txt file exists in the directory
[1] "." "/" "test" "." "txt"