Merge branch 'release/0.0.5'

giacbrd · Dec 30, 2016 · 34af386
2 parents a1133e6 + 65b5dae
Showing 18 changed files with 2,894 additions and 406 deletions.
diff --git a/.travis.yml b/.travis.yml
@@ -16,10 +16,12 @@ before_install:
   - ./miniconda.sh -b -p $HOME/miniconda
   - export PATH=/home/travis/miniconda/bin:$PATH
   - conda update --yes conda
+  - export PYTHONHASHSEED=1
 install:
   - conda install --yes python=$TRAVIS_PYTHON_VERSION numpy scipy
   - pip install cython
   - python setup.py install
 script:
   - ./scripts/test.sh
-  - python scripts/document_classification_20newsgroups.py --chi2_select 80
+  - python scripts/document_classification_20newsgroups.py --chi2_select 80
+  - python scripts/plot_out_of_core_classification.py
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -1,6 +1,17 @@
 Changelog
 =========
 
+`0.0.5 <https://github.com/giacbrd/ShallowLearn/releases/tag/0.0.5>`_ (2016-12-30)
+----------------------------------------------------------------------------------
+
+* Online learning and better pre-training in GensimFastText:
+    - Hashing trick for building the vocabulary, similar to the original fastText approach
+    - It is possible to pre-fit word embeddings from a dataset with word2vec
+    - True online learning with ``partial_fit``: the vocabulary is incrementally updated
+* New version of fastText.py: 0.8.2
+* New version of Gensim: 0.13.4
+* Fixed ``predict_proba`` output format
+
 `0.0.4 <https://github.com/giacbrd/ShallowLearn/releases/tag/0.0.4>`_ (2016-11-05)
 ----------------------------------------------------------------------------------
 

diff --git a/MANIFEST.in b/MANIFEST.in
@@ -6,4 +6,5 @@ include shallowlearn/voidptr.h
 include shallowlearn/word2vec_inner.c
 include shallowlearn/word2vec_inner.pyx
 include shallowlearn/word2vec_inner.pxd
-include scripts/document_classification_20newsgroups.py
+include scripts/document_classification_20newsgroups.py
+include scripts/plot_out_of_core_classification.py
diff --git a/README.rst b/README.rst
@@ -21,26 +21,35 @@ Install the latest version:
     pip install shallowlearn
 
 Import models from ``shallowlearn.models``, they implement the standard methods for supervised learning in scikit-learn,
-e.g., ``fit(X, y)``, ``predict(X)``, etc.
+e.g., ``fit(X, y)``, ``predict(X)``, ``predict_proba(X)``, etc.
 
 Data is raw text, each sample in the iterable ``X`` is a list of tokens (words of a document), 
-while each element in the iterable ``y`` (corresponding to an element in ``X``) can be a single label or a list in case of a multi-label training set. Obviously, ``y`` must be of the same size of ``X``.
+while each element in the iterable ``y`` (corresponding to an element in ``X``) can be a single label or a list in case
+of a multi-label training set. Obviously, ``y`` must be of the same size of ``X``.
 
 Models
 ------
 
 GensimFastText
 ~~~~~~~~~~~~~~
+**Choose this model if your goal is classification with fastText!** (it is going to be the most stable and feature-rich)
+
 A supervised learning model based on the fastText algorithm [1]_.
 The code is mostly taken and rewritten from `Gensim <https://radimrehurek.com/gensim>`_,
 it takes advantage of its optimizations (e.g. Cython) and support.
 
 It is possible to choose the Softmax loss function (default) or one of its two "approximations":
-Hierarchical Softmax and Negative Sampling. It is also possible to load pre-trained word vectors at initialization,
+Hierarchical Softmax and Negative Sampling. 
+
+The parameter ``bucket`` configures the feature hashing space, i.e., the *hashing trick* described in [1]_.
+Using the hashing trick together with ``partial_fit(X, y)`` yields a powerful *online* text classifier (see `Online learning`_).
+
+It is possible to load pre-trained word vectors at initialization,
 passing a Gensim ``Word2Vec`` or a ShallowLearn ``LabeledWord2Vec`` instance (the latter is retrievable from a
 ``GensimFastText`` model by the attribute ``classifier``).
+With method ``fit_embeddings(X)`` it is possible to pre-train word vectors, using the current parameter values of the model.
 
-Constructor argument names are a mix between the ones of Gensim and the ones of fastText (see this class docstring).
+Constructor argument names are a mix between the ones of Gensim and the ones of fastText (see this `class docstring <https://github.com/giacbrd/ShallowLearn/blob/develop/shallowlearn/models.py#L74>`_).
 
 .. code:: python
 
@@ -54,13 +63,13 @@ FastText
 ~~~~~~~~
 The supervised algorithm of fastText implemented in `fastText.py <https://github.com/salestock/fastText.py>`_ ,
 which exposes an interface on the original C++ code.
-The current advantages of this class over ``GensimFastText`` are the *subwords* ant the *n-gram features* implemented
+The current advantages of this class over ``GensimFastText`` are the *subwords* and the *n-gram features* implemented
 via the *hashing trick*.
 The constructor arguments are equivalent to the original `supervised model
 <https://github.com/salestock/fastText.py#supervised-model>`_, except for ``input_file``, ``output`` and
 ``label_prefix``.
 
-**WARNING**: The only way of loading datasets in fastText.py is through the filesystem (as of version 0.8.0),
+**WARNING**: The only way of loading datasets in fastText.py is through the filesystem (as of version 0.8.2),
 so data passed to ``fit(X, y)`` will be written in temporary files on disk.
 
 .. code:: python
@@ -81,14 +90,16 @@ DeepAveragingNetworks
 
 Exclusive Features
 ------------------
-Next cool features will be listed as Issues in Github
+Next cool features will be listed as Issues in Github, for now:
 
 Persistence
 ~~~~~~~~~~~
 Any model can be serialized and de-serialized with the two methods ``save`` and ``load``.
 They overload the `SaveLoad <https://radimrehurek.com/gensim/utils.html#gensim.utils.SaveLoad>`_ interface of Gensim,
 so it is possible to control the cost on disk usage of the models, instead of simply *pickling* the objects.
-``save`` can create multiple files with names prefixed by the name given to the serialized model.
+The original interface also allows compression of the serialized outputs.
+
+``save`` may create multiple files with names prefixed by the name given to the serialized model.
 
 .. code:: python
 
@@ -99,24 +110,47 @@ so it is possible to control the cost on disk usage of the models, instead of si
 
 Benchmarks
 ----------
+
+Text classification
+~~~~~~~~~~~~~~~~~~~
+
 The script ``scripts/document_classification_20newsgroups.py`` refers to this
 `scikit-learn example <http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html>`_
 in which text classifiers are compared on a reference dataset;
 we added our models to the comparison.
 **The current results, even if still preliminary, are comparable with other
 approaches, achieving the best performance in speed**.
 
-Results as of release `0.0.4 <https://github.com/giacbrd/ShallowLearn/releases/tag/0.0.4>`_,
+Results as of release `0.0.5 <https://github.com/giacbrd/ShallowLearn/releases/tag/0.0.5>`_,
 with *chi2_select* option set to 80%.
 The times take into account of *tf-idf* vectorization in the “classic” classifiers, and the I/O operations for the
 training of fastText.py.
 The evaluation measure is *macro F1*.
 
-.. image:: https://rawgit.com/giacbrd/ShallowLearn/master/benchmark.svg
+.. image:: https://cdn.rawgit.com/giacbrd/ShallowLearn/master/images/benchmark.svg
     :alt: Text classifiers comparison
     :align: center
     :width: 888 px
 
+Online learning
+~~~~~~~~~~~~~~~
+
+The script ``scripts/plot_out_of_core_classification.py`` computes a benchmark on some scikit-learn classifiers which are able to
+learn incrementally,
+a batch of examples at a time.
+These classifiers can learn online by using the scikit-learn method ``partial_fit(X, y)``.
+The `original example <http://scikit-learn.org/stable/auto_examples/applications/plot_out_of_core_classification.html>`_
+describes the approach through feature hashing, which we set with parameter ``bucket``.
+
+**The results are decent but there is room for improvement**.
+We configure our classifier with ``iter=1, size=100, alpha=0.1, sample=0, min_count=0``, so as to keep the model fast and
+small, and to avoid cutting off words from the few samples we have.
+
+.. image:: https://cdn.rawgit.com/giacbrd/ShallowLearn/master/images/onlinelearning.svg
+    :alt: Online learning
+    :align: center
+    :width: 888 px
+
 References
 ----------
 .. [1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification
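
The hashing trick mentioned in the README changes above can be illustrated with a minimal, stdlib-only sketch. It is not ShallowLearn's implementation: the helper names ``hashed_index`` and ``hash_document`` and the value of ``BUCKET`` are hypothetical, and ``zlib.crc32`` is an assumed stand-in for whatever hash function the library actually uses. Only the idea of a fixed-size hashing space, which the README configures with ``bucket``, comes from the source.

```python
import zlib


def hashed_index(token, bucket):
    """Map a token to one of `bucket` slots (hypothetical helper)."""
    # zlib.crc32 is an assumption made for this sketch: unlike the built-in
    # hash() on str, its result does not depend on PYTHONHASHSEED.
    return zlib.crc32(token.encode("utf-8")) % bucket


def hash_document(tokens, bucket):
    """Turn a list of tokens into a list of slot indices (hypothetical helper)."""
    return [hashed_index(token, bucket) for token in tokens]


BUCKET = 1000  # hypothetical size of the hashing space

# Two batches arriving at different times; the second contains unseen words.
first_batch = [["the", "cat", "sat"], ["a", "dog", "barked"]]
second_batch = [["an", "unseen", "word", "cat"]]

first_ids = [hash_document(doc, BUCKET) for doc in first_batch]
second_ids = [hash_document(doc, BUCKET) for doc in second_batch]
```

Because the index space is fixed in advance, a token that first appears in a later batch still maps to a valid slot, and a token seen earlier maps to the same slot as before, which suits training in batches. A stable hash is used here because the built-in ``hash()`` of strings varies between interpreter runs unless ``PYTHONHASHSEED`` is fixed.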
diff --git a/benchmark.svg → images/benchmark.svg b/benchmark.svg → images/benchmark.svg