Perceptron Text Classification

Università degli Studi Firenze

Overview

Although the standard Perceptron gives good results for such a simple algorithm, it has some weaknesses. Suppose, for example, that we want to classify the XOR function. It is immediately evident that no hyperplane can separate the positive examples from the negative ones without error. Because the data are not linearly separable, the algorithm never converges: it keeps producing a different hyperplane at each update, and the final one depends arbitrarily on the moment at which training stops after a fixed number of iterations.

[Figure: XOR function]
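
The non-convergence on XOR can be checked with a minimal sketch. It is not the code from this repository: the function name `train_perceptron`, the toy datasets and the plain-list representation are assumptions made for illustration. The AND function is included only as a linearly separable contrast.

```python
# Truth tables with labels in {-1, +1}.
XOR_DATA = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]
AND_DATA = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]


def train_perceptron(data, max_iter):
    """Standard perceptron on 2-D inputs.

    Returns (weights, bias, mistakes made during the last epoch).
    """
    w = [0.0, 0.0]
    b = 0.0
    mistakes = 0
    for _ in range(max_iter):
        mistakes = 0
        for x, y in data:
            activation = w[0] * x[0] + w[1] * x[1] + b
            if y * activation <= 0:  # misclassified (or on the boundary)
                w = [w[0] + y * x[0], w[1] + y * x[1]]
                b += y
                mistakes += 1
        if mistakes == 0:  # a full clean pass: the data were separated
            break
    return w, b, mistakes


if __name__ == "__main__":
    for name, data in (("AND", AND_DATA), ("XOR", XOR_DATA)):
        for max_iter in (5, 10, 100):
            w, b, mistakes = train_perceptron(data, max_iter)
            print(name, max_iter, w, b, mistakes)
```

On XOR the last epoch always contains at least one mistake, so the returned hyperplane simply reflects where the loop was cut off.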

Suppose now that we train the Perceptron and obtain, after a few iterations, a satisfactory classifier that correctly predicts the next 5000 submitted data points. If the following datum is classified incorrectly, the hyperplane must be updated despite its previous accuracy. To limit these situations, whenever a hyperplane has to change, the number c of consecutive correct classifications it achieved is saved. In this way, during testing, the sign of an example can be determined through a weighted vote in which each stored hyperplane contributes in proportion to its c.
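
The weighted vote described above is presumably the standard voted-perceptron prediction rule. A reconstruction under that assumption is given here, where the symbols are chosen for illustration: w_i and b_i are the weights and bias of the i-th stored hyperplane, c_i is its count of consecutive correct classifications, and k is the number of stored hyperplanes.

```latex
\hat{y}(x) = \operatorname{sign}\left( \sum_{i=1}^{k} c_i \, \operatorname{sign}\left( w_i \cdot x + b_i \right) \right)
```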

The experiments revealed, as expected, a dependence on the order in which the data are presented as input. For the same problem, different seeds can therefore yield very different performance for the standard version, while the voted one remains stable. More details, summarized in the table below, can be found in the final report.

[Figure: results table]
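
The voted variant and the role of the seed can be sketched as follows. This is an illustration, not the repository's implementation: `train_voted`, `predict_voted`, the toy data and the use of plain Python lists (instead of NumPy) are all assumptions. The counter c is reset to 0 after each update so that it counts only correct classifications, following the description above.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sign(value):
    # Ties (value == 0) are resolved as -1 in this sketch.
    return 1 if value > 0 else -1


def train_voted(samples, labels, max_iter=10, seed=8):
    """Return a list of (weights, bias, c) triples, one per hyperplane."""
    order = list(range(len(samples)))
    random.Random(seed).shuffle(order)  # the seed fixes the presentation order
    w = [0.0] * len(samples[0])
    b = 0.0
    c = 0  # consecutive correct classifications of the current hyperplane
    hyperplanes = []
    for _ in range(max_iter):
        for i in order:
            x, y = samples[i], labels[i]
            if y * (dot(w, x) + b) <= 0:
                hyperplanes.append((w, b, c))  # store the retired hyperplane
                w = [wj + y * xj for wj, xj in zip(w, x)]
                b += y
                c = 0
            else:
                c += 1
    hyperplanes.append((w, b, c))  # the last hyperplane also votes
    return hyperplanes


def predict_voted(hyperplanes, x):
    """Weighted vote: each hyperplane contributes its sign, scaled by c."""
    vote = sum(c * sign(dot(w, x) + b) for w, b, c in hyperplanes)
    return sign(vote)


# Toy, linearly separable data (made up for this example).
SAMPLES = [(2, 1), (1, 2), (3, 3), (-1, -2), (-2, -1), (-3, -3)]
LABELS = [1, 1, 1, -1, -1, -1]

if __name__ == "__main__":
    for seed in (0, 8, 42):
        model = train_voted(SAMPLES, LABELS, max_iter=5, seed=seed)
        print(seed, len(model), [predict_voted(model, x) for x in SAMPLES])
```

Changing `seed` changes the order of the updates and hence the stored hyperplanes; the vote is what smooths out that variability.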

Prerequisites

  • Scikit-Learn to obtain the 20 Newsgroups dataset and the utilities that transform text into numeric input.
  • NumPy to perform vectorized operations.
  • Memory Profiler to keep track of memory usage.
  • Pretty Table for a nicely formatted confusion matrix.
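
A quick way to check that the prerequisites are installed is sketched below. The import names in `REQUIRED_MODULES` are the usual ones for these four packages; the helper `missing_modules` is hypothetical and not part of this repository.

```python
from importlib.util import find_spec

# Import names of the four prerequisites listed above.
REQUIRED_MODULES = ["sklearn", "numpy", "memory_profiler", "prettytable"]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All prerequisites are available.")
```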

Run

Experiments can be launched from the test.py file, which contains three category pairs as an example. In general, the categories can be chosen from the following list:

  • comp.os.ms-windows.misc
  • comp.sys.ibm.pc.hardware
  • comp.sys.mac.hardware
  • comp.windows.x
  • rec.autos
  • rec.motorcycles
  • rec.sport.baseball
  • rec.sport.hockey
  • sci.crypt
  • sci.electronics
  • sci.med
  • sci.space
  • misc.forsale
  • talk.politics.misc
  • talk.politics.guns
  • talk.politics.mideast
  • talk.religion.misc
  • alt.atheism
  • soc.religion.christian

In the two main functions, the max_iter and seed parameters can be changed: max_iter sets the number of passes over the training data, while seed controls the shuffling and therefore produces different scenarios.

perceptron.test_default(categories, max_iter=10, seed=8)
perceptron.test_voted(categories, max_iter=10, seed=8)

Although it is not recommended, additional elements of the original text such as headers, footers and quotes can be included by deleting the remove argument (the last one) from the calls in the util.py module.

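from sklearn.datasets import fetch_20newsgroups

# Training split: shuffled reproducibly through the seed, metadata stripped
train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True,
                           random_state=seed, remove=('headers', 'footers', 'quotes'))
# Test split: same categories, same metadata stripped
test = fetch_20newsgroups(subset='test', categories=categories,
                          remove=('headers', 'footers', 'quotes'))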

To obtain a plot of the memory usage over time, run:

mprof run test.py
mprof plot
