Merge branch 'release/0.0.5'
giacbrd committed Dec 30, 2016
2 parents a1133e6 + 65b5dae commit 34af386
Showing 18 changed files with 2,894 additions and 406 deletions.
4 changes: 3 additions & 1 deletion .travis.yml
@@ -16,10 +16,12 @@ before_install:
- ./miniconda.sh -b -p $HOME/miniconda
- export PATH=/home/travis/miniconda/bin:$PATH
- conda update --yes conda
- export PYTHONHASHSEED=1
install:
- conda install --yes python=$TRAVIS_PYTHON_VERSION numpy scipy
- pip install cython
- python setup.py install
script:
- ./scripts/test.sh
- python scripts/document_classification_20newsgroups.py --chi2_select 80
- python scripts/plot_out_of_core_classification.py
11 changes: 11 additions & 0 deletions CHANGELOG.rst
@@ -1,6 +1,17 @@
Changelog
=========

`0.0.5 <https://github.com/giacbrd/ShallowLearn/releases/tag/0.0.5>`_ (2016-12-30)
----------------------------------------------------------------------------------

* Online learning and better pre-training in GensimFastText:
- Hashing trick for building the vocabulary, similar to the original fastText approach
- It is possible to pre-fit word embeddings from a dataset with word2vec
- True online learning with ``partial_fit``; the vocabulary is incrementally updated
* New version of fastText.py: 0.8.2
* New version of Gensim: 0.13.4
* Fixed ``predict_proba`` output format

`0.0.4 <https://github.com/giacbrd/ShallowLearn/releases/tag/0.0.4>`_ (2016-11-05)
----------------------------------------------------------------------------------

3 changes: 2 additions & 1 deletion MANIFEST.in
@@ -6,4 +6,5 @@ include shallowlearn/voidptr.h
include shallowlearn/word2vec_inner.c
include shallowlearn/word2vec_inner.pyx
include shallowlearn/word2vec_inner.pxd
include scripts/document_classification_20newsgroups.py
include scripts/plot_out_of_core_classification.py
54 changes: 44 additions & 10 deletions README.rst
@@ -21,26 +21,35 @@ Install the latest version:
pip install shallowlearn
Import models from ``shallowlearn.models``; they implement the standard scikit-learn methods for supervised learning,
e.g., ``fit(X, y)``, ``predict(X)``, etc.
e.g., ``fit(X, y)``, ``predict(X)``, ``predict_proba(X)``, etc.

Data is raw text, each sample in the iterable ``X`` is a list of tokens (words of a document),
while each element in the iterable ``y`` (corresponding to an element in ``X``) can be a single label or a list in case of a multi-label training set. Obviously, ``y`` must be of the same size of ``X``.
while each element in the iterable ``y`` (corresponding to an element in ``X``) can be a single label, or a list of
labels in case of a multi-label training set. ``y`` must have the same length as ``X``.
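
As a minimal sketch of the data format described above, the snippet below builds ``X`` and ``y`` by hand.
The toy documents, the whitespace tokenization and the helper ``check_dataset`` are assumptions made for
illustration; they are not part of ShallowLearn.

```python
# Hypothetical toy corpus; splitting on whitespace is an assumption,
# any tokenizer producing lists of tokens would do.
raw_documents = [
    "the match ended in a draw",
    "parliament approved the budget",
    "the minister attended the final",
]

# Each sample in X is a list of tokens (the words of one document).
X = [doc.split() for doc in raw_documents]

# Single-label training set: one label per document.
y_single = ["sport", "politics", "sport"]

# Multi-label training set: one list of labels per document.
y_multi = [["sport"], ["politics"], ["sport", "politics"]]


def check_dataset(X, y):
    """Return True if X and y have the same length and every sample is a token sequence."""
    X, y = list(X), list(y)
    if len(X) != len(y):
        return False
    return all(isinstance(sample, (list, tuple)) for sample in X)


print(check_dataset(X, y_single))
print(check_dataset(X, y_multi))
```

Either ``y_single`` or ``y_multi`` could then be passed together with ``X`` to ``fit(X, y)``.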

Models
------

GensimFastText
~~~~~~~~~~~~~~
**Choose this model if your goal is classification with fastText!** (it is going to be the most stable and feature-rich)

A supervised learning model based on the fastText algorithm [1]_.
The code is mostly taken and rewritten from `Gensim <https://radimrehurek.com/gensim>`_,
it takes advantage of its optimizations (e.g. Cython) and support.

It is possible to choose the Softmax loss function (default) or one of its two "approximations":
Hierarchical Softmax and Negative Sampling. It is also possible to load pre-trained word vectors at initialization,
Hierarchical Softmax and Negative Sampling.

The parameter ``bucket`` configures the feature hashing space, i.e., the *hashing trick* described in [1]_.
Using the hashing trick together with ``partial_fit(X, y)`` yields a powerful *online* text classifier (see `Online learning`_).
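
The idea behind the hashing trick can be sketched in a few lines of plain Python.
This is only an illustration: ``zlib.crc32`` is a stand-in chosen because it is deterministic, not the hash
function used by fastText or ShallowLearn, and the function names are hypothetical.

```python
import zlib
from collections import Counter


def bucket_index(token, bucket):
    """Map a token to one of `bucket` feature slots with a deterministic hash."""
    return zlib.crc32(token.encode("utf-8")) % bucket


def hashed_bag_of_words(tokens, bucket):
    """Count tokens per hashed slot; no vocabulary has to be built beforehand."""
    return Counter(bucket_index(token, bucket) for token in tokens)


# Words never seen before still get a slot, so new batches need no vocabulary rebuild.
features = hashed_bag_of_words(["a", "new", "unseen", "word", "a"], bucket=2 ** 20)
print(sum(features.values()))
```

A smaller ``bucket`` saves memory at the cost of more collisions, i.e., distinct words sharing the same slot.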

It is possible to load pre-trained word vectors at initialization,
passing a Gensim ``Word2Vec`` or a ShallowLearn ``LabeledWord2Vec`` instance (the latter is retrievable from a
``GensimFastText`` model by the attribute ``classifier``).
With method ``fit_embeddings(X)`` it is possible to pre-train word vectors, using the current parameter values of the model.

Constructor argument names are a mix between the ones of Gensim and the ones of fastText (see this class docstring).
Constructor argument names are a mix between the ones of Gensim and the ones of fastText (see this `class docstring <https://github.com/giacbrd/ShallowLearn/blob/develop/shallowlearn/models.py#L74>`_).

.. code:: python
@@ -54,13 +63,13 @@ FastText
~~~~~~~~
The supervised algorithm of fastText implemented in `fastText.py <https://github.com/salestock/fastText.py>`_,
which exposes an interface on the original C++ code.
The current advantages of this class over ``GensimFastText`` are the *subwords* ant the *n-gram features* implemented
The current advantages of this class over ``GensimFastText`` are the *subwords* and the *n-gram features* implemented
via the *hashing trick*.
The constructor arguments are equivalent to the original `supervised model
<https://github.com/salestock/fastText.py#supervised-model>`_, except for ``input_file``, ``output`` and
``label_prefix``.

**WARNING**: The only way of loading datasets in fastText.py is through the filesystem (as of version 0.8.0),
**WARNING**: The only way of loading datasets in fastText.py is through the filesystem (as of version 0.8.2),
so data passed to ``fit(X, y)`` will be written in temporary files on disk.

.. code:: python
@@ -81,14 +90,16 @@ DeepAveragingNetworks

Exclusive Features
------------------
Next cool features will be listed as Issues in Github
Upcoming features will be listed as Issues on GitHub; for now:

Persistence
~~~~~~~~~~~
Any model can be serialized and de-serialized with the two methods ``save`` and ``load``.
They overload the `SaveLoad <https://radimrehurek.com/gensim/utils.html#gensim.utils.SaveLoad>`_ interface of Gensim,
so it is possible to control the cost on disk usage of the models, instead of simply *pickling* the objects.
``save`` can create multiple files with names prefixed by the name given to the serialized model.
The original interface also allows compressing the serialization outputs.

``save`` may create multiple files with names prefixed by the name given to the serialized model.

.. code:: python
@@ -99,24 +110,47 @@ so it is possible to control the cost on disk usage of the models, instead of si
Benchmarks
----------

Text classification
~~~~~~~~~~~~~~~~~~~

The script ``scripts/document_classification_20newsgroups.py`` refers to this
`scikit-learn example <http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html>`_
in which text classifiers are compared on a reference dataset;
we added our models to the comparison.
**The current results, even if still preliminary, are comparable with other
approaches, achieving the best performance in speed**.

Results as of release `0.0.4 <https://github.com/giacbrd/ShallowLearn/releases/tag/0.0.4>`_,
Results as of release `0.0.5 <https://github.com/giacbrd/ShallowLearn/releases/tag/0.0.5>`_,
with *chi2_select* option set to 80%.
The times take into account the *tf-idf* vectorization in the “classic” classifiers, and the I/O operations for the
training of fastText.py.
The evaluation measure is *macro F1*.
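
For readers unfamiliar with the measure, macro F1 is the unweighted mean of the per-class F1 scores.
The function below is a small sketch written for this explanation (it is not the code used by the benchmark script),
and the labels are made up.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, where F1 = 2*TP / (2*TP + FP + FN)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        scores.append(2.0 * tp / denominator if denominator else 0.0)
    return sum(scores) / len(scores)


# Made-up labels: per-class F1 is 2/3 for "a", 0.8 for "b" and 1.0 for "c".
y_true = ["a", "a", "b", "b", "c"]
y_pred = ["a", "b", "b", "b", "c"]
print(round(macro_f1(y_true, y_pred), 4))  # 0.8222
```

Because every class weighs the same, a classifier that does poorly on rare classes is penalized more than with accuracy.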

.. image:: https://rawgit.com/giacbrd/ShallowLearn/master/benchmark.svg
.. image:: https://cdn.rawgit.com/giacbrd/ShallowLearn/master/images/benchmark.svg
:alt: Text classifiers comparison
:align: center
:width: 888 px

Online learning
~~~~~~~~~~~~~~~

The script ``scripts/plot_out_of_core_classification.py`` computes a benchmark on some scikit-learn classifiers which are
able to learn incrementally, a batch of examples at a time.
These classifiers can learn online by using the scikit-learn method ``partial_fit(X, y)``.
The `original example <http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html>`_
describes the approach through feature hashing, which we set with parameter ``bucket``.

**The results are decent but there is room for improvement**.
We configure our classifier with ``iter=1, size=100, alpha=0.1, sample=0, min_count=0``, so as to keep the model fast and
small, and to avoid cutting off words from the few samples we have.
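
The combination of feature hashing and ``partial_fit`` can be illustrated with a toy learner.
The sketch below is not the ShallowLearn implementation: it swaps in a simple multi-class perceptron update for
the fastText training, uses ``zlib.crc32`` as an assumed hash function, and the class name ``HashedPerceptron``
is hypothetical.

```python
import zlib


class HashedPerceptron:
    """Toy online classifier: hashed token features plus a perceptron update."""

    def __init__(self, bucket=2 ** 20):
        self.bucket = bucket
        self.weights = {}  # label -> {bucket index -> weight}

    def _features(self, tokens):
        # Hashing trick: tokens map straight to slots, no vocabulary is stored.
        return [zlib.crc32(tok.encode("utf-8")) % self.bucket for tok in tokens]

    def _best_label(self, feats):
        if not self.weights:
            return None

        def score(label):
            w = self.weights[label]
            return sum(w.get(f, 0.0) for f in feats)

        # Sorting the labels makes tie-breaking deterministic.
        return max(sorted(self.weights), key=score)

    def partial_fit(self, X, y):
        """Update the model with one batch; unseen labels are added on the fly."""
        for tokens, label in zip(X, y):
            feats = self._features(tokens)
            guess = self._best_label(feats)
            self.weights.setdefault(label, {})
            if guess != label:
                # Perceptron rule: reward the true label, penalize the wrong guess.
                for f in feats:
                    self.weights[label][f] = self.weights[label].get(f, 0.0) + 1.0
                    if guess is not None:
                        self.weights[guess][f] = self.weights[guess].get(f, 0.0) - 1.0
        return self

    def predict(self, X):
        return [self._best_label(self._features(tokens)) for tokens in X]


clf = HashedPerceptron()
# Two small batches arriving one after the other.
clf.partial_fit([["good", "movie"], ["bad", "movie"]], ["pos", "neg"])
clf.partial_fit([["great", "movie"], ["awful", "movie"]], ["pos", "neg"])
predictions = clf.predict([["good", "great", "movie"], ["bad", "awful", "movie"]])
print(predictions)
```

The model never sees the whole dataset at once, and the words of the second batch need no vocabulary update
because they are hashed into the same fixed feature space.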

.. image:: https://cdn.rawgit.com/giacbrd/ShallowLearn/master/images/onlinelearning.svg
:alt: Online learning
:align: center
:width: 888 px

References
----------
.. [1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
File renamed without changes