
Commit
Merge branch 'release/0.0.3'
giacbrd committed Oct 27, 2016
2 parents 2af63c6 + e7ce7a1 commit dd46aa6
Showing 17 changed files with 3,723 additions and 1,600 deletions.
1 change: 0 additions & 1 deletion .gitignore
@@ -97,5 +97,4 @@ Thumbs.db

# Others
.idea
temp/
.pypirc
4 changes: 2 additions & 2 deletions .travis.yml
@@ -21,5 +21,5 @@ install:
- pip install cython
- python setup.py install
script:
- python setup.py test
#FIXME add a script like "python scripts/document_classification_20newsgroups.py" without plotting
- ./scripts/test.sh
- python scripts/document_classification_20newsgroups.py --chi2_select 80
10 changes: 10 additions & 0 deletions CHANGELOG.rst
@@ -1,6 +1,16 @@
Changelog
=========

`0.0.3 <https://github.com/giacbrd/ShallowLearn/releases/tag/0.0.3>`_ (2016-xx-xx)
----------------------------------------------------------------------------------

* FastText classifier based on version 0.8.0 of https://github.com/salestock/fastText.py
* GensimFastText now has:
- negative sampling
- softmax as alternative output function
- almost complete LabeledWord2Vec as subclass of Gensim's Word2Vec
* More tests

`0.0.2 <https://github.com/giacbrd/ShallowLearn/releases/tag/0.0.2>`_ (2016-10-14)
----------------------------------------------------------------------------------

3 changes: 2 additions & 1 deletion MANIFEST.in
@@ -5,4 +5,5 @@ include COPYING
include shallowlearn/voidptr.h
include shallowlearn/word2vec_inner.c
include shallowlearn/word2vec_inner.pyx
include shallowlearn/word2vec_inner.pxd
include shallowlearn/word2vec_inner.pxd
include scripts/document_classification_20newsgroups.py
67 changes: 51 additions & 16 deletions README.rst
@@ -2,7 +2,7 @@ ShallowLearn
============
A collection of supervised learning models based on shallow neural network approaches (e.g., word2vec and fastText)
with some additional exclusive features.
Written in Python and fully compatible with `Scikit-learn <http://scikit-learn.org>`_.
Written in Python and fully compatible with `scikit-learn <http://scikit-learn.org>`_.

.. image:: https://travis-ci.org/giacbrd/ShallowLearn.svg?branch=master
:target: https://travis-ci.org/giacbrd/ShallowLearn
@@ -18,44 +18,79 @@ Install the latest version:
pip install cython
pip install shallowlearn
Import models from ``shallowlearn.models``, they implement the standard methods for supervised learning in Scikit-learn,
Import models from ``shallowlearn.models``; they implement the standard methods for supervised learning in scikit-learn,
e.g., ``fit(X, y)``, ``predict(X)``, etc.

Data is raw text: each sample is a list of tokens (the words of a document), while each target value in ``y`` can be a
single label (or a list, in the case of a multi-label training set) associated with the corresponding sample.
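The expected input shape can be sketched in plain Python (a toy whitespace tokenizer for illustration only, not part of ShallowLearn):

```python
def to_samples(documents):
    """Lowercase and whitespace-tokenize raw documents (toy tokenizer)."""
    return [doc.lower().split() for doc in documents]

docs = ["I am tall", "You are fat"]
X = to_samples(docs)  # [['i', 'am', 'tall'], ['you', 'are', 'fat']]
y = ["yes", "no"]     # one label per sample
# A multi-label training set would use lists of labels instead,
# e.g. y = [["yes", "tall"], ["no"]]
```

Any tokenizer can be used, as long as each sample ends up as a list of tokens.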

Models
------
``shallowlearn.models.GensimFastText``
A supervised learning model based on the fastText algorithm [1]_.
The code is mostly taken and rewritten from `Gensim <https://radimrehurek.com/gensim>`_,
it takes advantage of its optimizations (e.g. Cython) and support.
GensimFastText
~~~~~~~~~~~~~~
A supervised learning model based on the fastText algorithm [1]_.
The code is mostly taken and rewritten from `Gensim <https://radimrehurek.com/gensim>`_;
it takes advantage of its optimizations (e.g., Cython) and support.

``shallowlearn.models.FastText``
**TODO**: The supervised algorithm of fastText implemented in https://github.com/salestock/fastText.py
It is possible to choose the Softmax loss function (default) or one of its two "approximations":
Hierarchical Softmax and Negative Sampling. It is also possible to load pre-trained word vectors at initialization,
by passing a Gensim ``Word2Vec`` or a ShallowLearn ``LabeledWord2Vec`` instance (the latter is retrievable from a
``GensimFastText`` model through its ``classifier`` attribute).

``shallowlearn.models.DeepInverseRegression``
**TODO**: Based on https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.score
Constructor argument names are a mix between the ones of Gensim and the ones of fastText (see this class docstring).

.. code:: python

    >>> from shallowlearn.models import GensimFastText
    >>> clf = GensimFastText(size=100, min_count=0, loss='hs', max_iter=3, random_state=66)
    >>> clf.fit([('i', 'am', 'tall'), ('you', 'are', 'fat')], ['yes', 'no'])
    >>> clf.predict([('tall', 'am', 'i')])
    ['yes']
FastText
~~~~~~~~
The supervised algorithm of fastText implemented in `fastText.py <https://github.com/salestock/fastText.py>`_,
which exposes an interface over the original C++ code.
The current advantages of this class over ``GensimFastText`` are the *subword* and *n-gram features* implemented
via the *hashing trick*.
The constructor arguments are equivalent to the original `supervised model
<https://github.com/salestock/fastText.py#supervised-model>`_, except for ``input_file``, ``output`` and
``label_prefix``.

**WARNING**: The only way of loading datasets in fastText.py is through the filesystem (as of version 0.8.0),
so data passed to ``fit(X, y)`` will be written to temporary files on disk.
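That round trip can be sketched with a hypothetical helper that serializes samples into fastText's plain-text training format, one sample per line (``__label__`` is fastText's default label prefix; the helper name and code are illustrative, not part of the library):

```python
def to_fasttext_lines(X, y, label_prefix="__label__"):
    """Serialize samples into fastText's on-disk training format:
    each line is '<prefix><label> [<prefix><label> ...] token token ...'."""
    lines = []
    for tokens, label in zip(X, y):
        labels = label if isinstance(label, list) else [label]
        prefix = " ".join(label_prefix + lab for lab in labels)
        lines.append(prefix + " " + " ".join(tokens))
    return lines

to_fasttext_lines([("i", "am", "tall")], ["yes"])
# -> ['__label__yes i am tall']
```

Writing those lines to a temporary file and pointing fastText at it is, presumably, what the model does internally at every ``fit`` call, which is why disk I/O shows up in the benchmark times below.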

.. code:: python

    >>> from shallowlearn.models import FastText
    >>> clf = FastText(dim=100, min_count=0, loss='hs', epoch=3, bucket=5, word_ngrams=2)
    >>> clf.fit([('i', 'am', 'tall'), ('you', 'are', 'fat')], ['yes', 'no'])
    >>> clf.predict([('tall', 'am', 'i')])
    ['yes']
DeepInverseRegression
~~~~~~~~~~~~~~~~~~~~~
*TODO*: Based on https://radimrehurek.com/gensim/models/word2vec.html#gensim.models.word2vec.Word2Vec.score

Exclusive Features
------------------
**TODO**
*TODO: future features will be listed as Issues*

Benchmarks
----------
The script ``scripts/document_classification_20newsgroups.py`` refers to this
`Scikit-learn example <http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html>`_
`scikit-learn example <http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html>`_
in which text classifiers are compared on a reference dataset;
we added our models to the comparison.
**The current results, although still preliminary, are comparable with other
approaches, while achieving the best speed**.

Results as of release `0.0.2 <https://github.com/giacbrd/ShallowLearn/releases/tag/0.0.2>`_,
Results as of release `0.0.3 <https://github.com/giacbrd/ShallowLearn/releases/tag/0.0.3>`_,
with *chi2_select* option set to 80%.
The times take into account of *tf-idf* vectorization in the “classic” classifiers;
the evaluation measure is *macro F1*.
The times take into account the *tf-idf* vectorization in the “classic” classifiers, and the I/O operations for the
training of fastText.py. The evaluation measure is *macro F1*.
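For reference, *macro F1* averages the per-class F1 scores with equal weight per class, so rare classes count as much as frequent ones. A minimal self-contained sketch of the measure (for illustration; the benchmark script presumably relies on scikit-learn's implementation):

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then take the unweighted mean."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but the true class was t
            fn[t] += 1
    f1s = []
    for c in set(y_true) | set(y_pred):
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

macro_f1(["yes", "no", "yes"], ["yes", "yes", "yes"])
# -> 0.4  ('yes' scores 0.8, 'no' scores 0.0)
```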

.. image:: https://cdn.rawgit.com/giacbrd/ShallowLearn/develop/benchmark.svg
.. image:: https://rawgit.com/giacbrd/ShallowLearn/develop/benchmark.svg
:alt: Text classifiers comparison
:align: center
:width: 888 px
