
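Note: none of the models have been hyperparameter-tuned, so their accuracy is not the best achievable.

Project name:

Chinese text classification with a Word2Vec-TextCNN model.

Project environment:

PyTorch, Python
Install the required libraries: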

pip install -r requirement.txt

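Project directory:

Word2Vec-TextCNN         
    |-- data                 dataset
    |-- img                  related images
    |-- model                saved models
    |-- config.py            configuration file
    |-- main.py              main entry point
    |-- model.py             model definition
    |-- predict.py           prediction script
    |-- requirement.txt      required packages
    |-- TextCNN.pdf          the TextCNN paper
    |-- utils.py             data processing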

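Dataset

The dataset consists of train.txt, test.txt, and dev.txt from THUCNews and is a ten-class classification problem. The training set contains 180,000 samples, the validation set 10,000, and the test set 10,000. The ten classes are finance, realty, stocks, education, science, society, politics, sports, game, and entertainment.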
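As a rough illustration of reading one of these files, the sketch below assumes each line has the form `text<TAB>label_id` and that label ids follow the order in which the classes are listed above. Both are assumptions about the file layout, not something this README states, and `parse_line` and `load_split` are hypothetical helper names.

```python
# Class names in the order listed in this README (the id-to-name order is an assumption).
CLASSES = ["finance", "realty", "stocks", "education", "science",
           "society", "politics", "sports", "game", "entertainment"]


def parse_line(line):
    """Split one 'text<TAB>label_id' line into (text, class_name)."""
    text, label_id = line.rstrip("\n").rsplit("\t", 1)
    return text, CLASSES[int(label_id)]


def load_split(path):
    """Read a whole split (e.g. data/train.txt), skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [parse_line(line) for line in f if line.strip()]


sample = parse_line("某公司股价今日上涨\t2\n")
print(sample)
```

If the files use a different separator or store class names instead of ids, only `parse_line` needs to change.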

Model overview

For details, see: Introduction to TextCNN Text Classification
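For orientation, a TextCNN can be sketched in PyTorch roughly as follows. This is a minimal sketch, not the code in model.py: the class name `TextCNNSketch`, the kernel sizes (2, 3, 4), the filter count, and the 100-dimensional vectors are all assumed values, and the network takes already-embedded word vectors as input rather than token ids.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextCNNSketch(nn.Module):
    """TextCNN over precomputed word vectors (all hyperparameters are assumptions)."""

    def __init__(self, embed_dim=100, num_filters=64,
                 kernel_sizes=(2, 3, 4), num_classes=10, dropout=0.5):
        super().__init__()
        # One convolution per kernel size; each kernel spans the full vector width.
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, num_filters, (k, embed_dim)) for k in kernel_sizes]
        )
        self.dropout = nn.Dropout(dropout)
        self.fc = nn.Linear(num_filters * len(kernel_sizes), num_classes)

    def forward(self, x):
        # x: (batch, seq_len, embed_dim) -> add a channel dimension
        x = x.unsqueeze(1)
        pooled = []
        for conv in self.convs:
            h = F.relu(conv(x)).squeeze(3)      # (batch, num_filters, seq_len - k + 1)
            pooled.append(h.max(dim=2).values)  # max-over-time pooling
        features = torch.cat(pooled, dim=1)     # (batch, num_filters * len(kernel_sizes))
        return self.fc(self.dropout(features))


model = TextCNNSketch()
model.eval()
logits = model(torch.randn(4, 32, 100))  # 4 texts, 32 tokens each, 100-d vectors
print(logits.shape)
```

Each convolution acts as an n-gram detector, and max-over-time pooling keeps the strongest response per filter, so the classifier input has a fixed size regardless of sequence length.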

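Modifications

Compared with the original TextCNN model, this project replaces the Embedding layer with pretrained Word2Vec vectors. For how the word vectors are trained with Word2Vec, see: Word2Vec Character & Word Vectors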

# Append "<PAD>" and "<UNK>" rows to the embedding matrix:
# {"<PAD>": np.zeros(self.embedding), "<UNK>": np.random.randn(self.embedding)}
self.Embedding = self.model.vectors
self.Embedding = np.insert(self.Embedding, self.num_word, [np.zeros(self.embedding), np.random.randn(self.embedding)], axis=0)

# Copy the vocabulary so the Word2Vec model's own dict is not mutated
self.word_2_index = dict(self.model.key_to_index)
self.word_2_index.update({"<PAD>": self.num_word, "<UNK>": self.num_word + 1})

# Map each token to its id (unknown tokens map to "<UNK>"), then pad up to max_len
text_id = [self.word_2_index.get(i, self.word_2_index["<UNK>"]) for i in text]
text_id = text_id + [self.word_2_index["<PAD>"]] * (self.max_len - len(text_id))

# Look up the vector for every id
wordEmbedding = np.array([self.Embedding[i] for i in text_id])

# Add a leading batch dimension before passing the vectors to the model
text_id = torch.tensor(wordEmbedding).unsqueeze(dim=0)
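The excerpts above can be tried end to end on toy data. The sketch below is a self-contained approximation: the three-word vocabulary and random vectors stand in for a trained Word2Vec model, `encode` is a hypothetical helper, and the truncation to `max_len` is an addition that does not appear in the excerpts.

```python
import numpy as np

rng = np.random.default_rng(0)
embed_dim, max_len = 4, 6

# Toy stand-ins for model.vectors and model.key_to_index (hypothetical values).
vectors = rng.standard_normal((3, embed_dim)).astype(np.float32)
word_2_index = {"我": 0, "爱": 1, "你": 2}
num_word = len(word_2_index)

# Append a zero row for "<PAD>" and a random row for "<UNK>".
pad_row = np.zeros((1, embed_dim), dtype=np.float32)
unk_row = rng.standard_normal((1, embed_dim)).astype(np.float32)
embedding = np.vstack([vectors, pad_row, unk_row])
word_2_index.update({"<PAD>": num_word, "<UNK>": num_word + 1})


def encode(text):
    """Map tokens to ids, truncate to max_len (an addition), then pad."""
    ids = [word_2_index.get(ch, word_2_index["<UNK>"]) for ch in text][:max_len]
    ids += [word_2_index["<PAD>"]] * (max_len - len(ids))
    return ids


ids = encode("我爱他")     # "他" is out of vocabulary
features = embedding[ids]  # one vector per position
print(ids)             # [0, 1, 4, 3, 3, 3]
print(features.shape)  # (6, 4)
```

Because "<PAD>" maps to an all-zero row, padded positions contribute nothing to the convolution inputs, while "<UNK>" gets a random vector shared by every out-of-vocabulary token.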

Model training

python main.py

Model prediction

python predict.py
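The internals of predict.py are not shown in this README. As a sketch of what the last step of a ten-class prediction typically looks like, the snippet below turns a vector of ten scores into a class name and a probability. The `softmax` and `decode` helpers, the example scores, and the id-to-name order are assumptions for illustration.

```python
import math

# Class names in the order listed in this README (the id order is an assumption).
CLASSES = ["finance", "realty", "stocks", "education", "science",
           "society", "politics", "sports", "game", "entertainment"]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(logits):
    """Return (class_name, probability) for the highest-scoring class."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return CLASSES[best], probs[best]


# Made-up scores for one text; index 7 has the largest value.
label, prob = decode([0.1, 0.2, 0.0, 0.3, 0.1, 0.0, 0.2, 3.0, 0.1, 0.4])
print(label)  # sports
```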

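WeChat discussion group

We run a WeChat discussion group that anyone interested is welcome to join for shared learning. After following the official account you will receive a personal WeChat contact; add that contact with the note "join group" and you will be invited in.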
