Commit

xtonev committed Oct 24, 2019
2 parents 2d3e594 + bf202b2 commit 22dd19b
Showing 2 changed files with 54 additions and 103 deletions.
12 changes: 6 additions & 6 deletions README-rus.md
The ```topicnet``` library helps build topic models by automating routine modelling processes.

### How to work with the library?
First, you initialize a ```TopicModel``` object from an existing ARTM model, or construct a first model
with the help of the ```model_constructor``` module.
The resulting model is used to initialize an instance of the ```Experiment``` class, which keeps track of the training stages
and of the models obtained during those stages.
All currently available types of training stages live in ```cooking_machine.cubes```, and the resulting models
can be inspected with the ```viewers``` module, which offers a wide range of ways to display information about a model.

### Who might find this library useful?
This project will be of interest to two categories of users.

---
## How to install TopicNet
**Most** of the TopicNet functionality relies on the BigARTM library, which requires manual installation.
To make this process easier, you can use [Docker images with preinstalled BigARTM](https://hub.docker.com/r/xtonev/bigartm/tags).
If for some reason Docker images do not suit you, a detailed description of the BigARTM installation can be found here: [BigARTM installation manual](https://bigartm.readthedocs.io/en/stable/installation/index.html).
Inside the resulting image with BigARTM, download this repository or install it with the command ```pip install topicnet```.

---
## A quick guide to working with TopicNet
145 changes: 48 additions & 97 deletions README.md

---
### What is TopicNet?
TopicNet is a high-level interface running on top of BigARTM.

The ```TopicNet``` library was created to assist in the task of building topic models. It aims at automating the model training routine, freeing more time for the artistic process of constructing a target functional for the task at hand.

Consider using TopicNet if:

* you want to explore BigARTM functionality without writing an overhead.
* you need help with rapid solution prototyping.
* you want to build a good topic model quickly (out of the box, with default parameters).
* you have an ARTM model at hand and you want to explore its topics.

```TopicNet``` provides an infrastructure for your prototyping (the ```Experiment``` class) and helps you observe the results of your actions via the ```viewers``` module.

### How to start?
Define a `TopicModel` from an ARTM model at hand or with the help of the `model_constructor` module. Then create an `Experiment`, assigning a root position to this model. Further, you can define a set of training stages using the functionality provided by the `cooking_machine.cubes` module.

---
## How to install TopicNet
**Core library functionality is based on the BigARTM library**, which requires manual installation.
To avoid it, you can use [docker images](https://hub.docker.com/r/xtonev/bigartm/tags) with the BigARTM library preinstalled.


#### Using docker image
```
docker pull xtonev/bigartm:v0.10.0
```
Run Python inside the container and check that BigARTM is importable:
```
import artm
artm.version()
```

Alternatively, you can follow [BigARTM installation manual](https://bigartm.readthedocs.io/en/stable/installation/index.html).
After setting up the environment you can fork this repository or use ```pip install topicnet``` to install the library.

---
## How to use TopicNet
Let's say you have a handful of raw texts mined from some source and you want to perform some topic modelling on them. Where should you start?
### Data Preparation
Every ML problem starts with a data preprocessing step. TopicNet does not perform data preprocessing itself. Instead, it expects the data to be prepared by the user and loaded via the [Dataset (no link yet)]() class.
Here is a basic example of how one can achieve that:
```
import string

import artm
import nltk
import pandas as pd
from glob import glob

# Folder with raw Wikipedia plain-text files, one document per .txt file
WIKI_DATA_PATH = '/Wiki_raw_set/raw_plaintexts/'
files = glob(WIKI_DATA_PATH + '*.txt')
```
Load all texts from the files, keeping only alphabetical characters and spaces:
```
right_symbols = string.ascii_letters + ' '
data = []
for path in files:
    entry = {}
    # Use the file name (without extension) as the document id
    entry['id'] = path.split('/')[-1].split('.')[0]
    with open(path, 'r') as f:
        # Keep only ASCII letters and spaces
        text = ''.join([char for char in f.read() if char in right_symbols])
    entry['raw_text'] = ''.join(text.split('\n'))
    data.append(entry)

wiki_texts = pd.DataFrame(data)
```
#### Perform tokenization:
```
# word_tokenize may require a one-time nltk.download('punkt')
tokenized_text = []
for text in wiki_texts['raw_text'].values:
    tokenized_text.append(' '.join(nltk.word_tokenize(text)))

wiki_texts['tokenized'] = tokenized_text
```
#### Perform lemmatization:
```
# WordNetLemmatizer may require a one-time nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer

lemmatized_text = []
wnl = WordNetLemmatizer()
for text in wiki_texts['raw_text'].values:
    lemmatized = [wnl.lemmatize(word) for word in text.split()]
    lemmatized_text.append(lemmatized)

wiki_texts['lemmatized'] = lemmatized_text
```
#### Get bigrams:
```
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# Collect collocations that occur at least 5 times,
# ranked by PMI, skipping the 100 highest-scoring pairs
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_documents(wiki_texts['lemmatized'])
finder.apply_freq_filter(5)
set_dict = set(finder.nbest(bigram_measures.pmi, 32100)[100:])

documents = wiki_texts['lemmatized']
bigrams = []
for doc in documents:
    entry = ['_'.join([word_first, word_second])
             for word_first, word_second in zip(doc[:-1], doc[1:])
             if (word_first, word_second) in set_dict]
    bigrams.append(entry)

wiki_texts['bigram'] = bigrams
```

#### Write them all to Vowpal Wabbit format and save the result to disk:
```
vw_text = []
for index, row in wiki_texts.iterrows():
    doc_id = row.id
    lemmatized = '@lemmatized ' + ' '.join(row.lemmatized)
    bigram = '@bigram ' + ' '.join(row.bigram)
    # One document per line: "doc_id |@modality token token ..."
    vw_string = ' |'.join([doc_id, lemmatized, bigram])
    vw_text.append(vw_string)

wiki_texts['vw_text'] = vw_text
wiki_texts[['id', 'raw_text', 'vw_text']].to_csv('/Wiki_raw_set/wiki_data.csv')
```
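With the loop above, each entry of the `vw_text` column is a single Vowpal-Wabbit-style line. For an illustrative (made-up) document, such a line would look like this:
```
Anarchism |@lemmatized anarchism is a political philosophy |@bigram political_philosophy
```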
A complete version of this preprocessing example can be found in [rtl_wiki_preprocessing (no link yet)]().

### Training topic model
Here we can finally get to the main part: making your own, best-of-them-all, manually crafted topic model.
#### Get your data
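The file saved at the preprocessing step can now be loaded. Below is a minimal sketch, assuming the `Dataset` class lives in `cooking_machine.dataset` and accepts the path to the prepared CSV:
```
from topicnet.cooking_machine.dataset import Dataset

# CSV with 'id', 'raw_text' and 'vw_text' columns prepared earlier
data = Dataset('/Wiki_raw_set/wiki_data.csv')
```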
#### Make initial model
In case you want to start from a fresh model, we suggest you use this code:
```
from topicnet.cooking_machine.model_constructor import init_simple_default_model

model_artm = init_simple_default_model(
    dataset=data,
    modalities_to_use={'@lemmatized': 1.0, '@bigram': 0.5},
    main_modality='@lemmatized',
    n_specific_topics=14,
)
```
Further, if needed, one can define a custom score to be calculated during the model training:
```
from topicnet.cooking_machine.models.base_score import BaseScore


class CustomScore(BaseScore):
    def __init__(self):
        super().__init__()
    # ... the method that computes the actual score value goes here
```
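As an illustration of what the omitted score computation might look like, here is a hedged sketch of a sparsity-style score. The `call` method name, its signature, and the use of `model.get_phi()` are assumptions rather than confirmed API:
```
import numpy as np

from topicnet.cooking_machine.models.base_score import BaseScore


class CustomScore(BaseScore):
    def __init__(self):
        super().__init__()

    def call(self, model, eps=1e-5, n_specific_topics=14):
        # Fraction of near-zero entries among the columns of the specific topics
        phi = model.get_phi().values[:, :n_specific_topics]
        return float(np.sum(phi < eps) / phi.size)
```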
Now, `TopicModel` with custom score can be defined:
```
from topicnet.cooking_machine.models.topic_model import TopicModel

custom_score_dict = {'SpecificSparsity': CustomScore()}
tm = TopicModel(model_artm, model_id='Groot', custom_scores=custom_score_dict)
```
#### Define experiment
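A minimal sketch of creating the experiment around the model defined above; the keyword argument names and the `experiment_id`/`save_path` values are placeholders and assumptions, not confirmed API:
```
from topicnet.cooking_machine.experiment import Experiment

# The experiment records every training stage and every model it produces
experiment = Experiment(
    experiment_id='simple_experiment',  # assumed keyword, placeholder id
    save_path='experiments',            # assumed keyword, placeholder path
    topic_model=tm,
)
```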
Next, define a training stage: a cube that modifies a regularizer and tries several values of its coefficient:
```
from topicnet.cooking_machine.cubes import RegularizersModifierCube

my_first_cube = RegularizersModifierCube(
    num_iter=5,
    tracked_score_function='PerplexityScore@lemmatized',
    regularizer_parameters={
        'regularizer': artm.DecorrelatorPhiRegularizer(name='decorrelation_phi', tau=1),
        'tau_grid': [0, 1, 2, 3, 4, 5],
    },
)
```
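Once the cube is defined, it can be applied to the root model and the resulting models inspected with a viewer. The snippet below is only a sketch of that flow: the value returned by the cube call, the `TopTokensViewer` location and methods, and the dataset variable are assumptions rather than confirmed API:
```
from IPython.display import display_html

from topicnet.viewers.top_tokens_viewer import TopTokensViewer

# Apply the training stage to the root model
# (assumes the cube is called with the model and the dataset
#  and returns the list of trained models)
trained_models = my_first_cube(tm, data)

# Render the top tokens of the first resulting model as HTML
viewer = TopTokensViewer(trained_models[0], num_top_tokens=10)
first_model_html = viewer.to_html(viewer.view())
for line in first_model_html:
    display_html(line, raw=True)
```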
---
## FAQ

#### In the example we write the vw modality like **@modality**. Is it the Vowpal Wabbit format?

Designating modalities in the data with the @ sign is a convention that TopicNet takes from BigARTM.

#### CubeCreator helps to perform a grid search over initial model parameters. How can I do it with modalities?

Modality search space can be defined using the standard library logic like:
```
from topicnet.cooking_machine.cubes import CubeCreator

class_ids_cube = CubeCreator(
    num_iter=5,
    parameters=[
        {
            'name': 'class_ids',
            'values': {
                '@text': [1, 2, 3],
                '@ngrams': [4, 5, 6],
            },
        },
    ],
    reg_search='grid',
    verbose=True,
)
```
However, for the case of modalities a couple of slightly more convenient forms are available:

```
parameters=[
    {'name': 'class_ids@text', 'values': [1, 2, 3]},
    {'name': 'class_ids@ngrams', 'values': [4, 5, 6]},
]

# or, equivalently:

parameters=[
    {
        'class_ids@text': [1, 2, 3],
        'class_ids@ngrams': [4, 5, 6],
    }
]
```
