Skip to content

Yet another PHP implementation of the Rapid Automatic Keyword Extraction algorithm (RAKE).

License

Notifications You must be signed in to change notification settings

SpekArchive/rake-php-plus

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

68 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

rake-php-plus

Yet another PHP implementation of the Rapid Automatic Keyword Extraction algorithm (RAKE).

Latest Stable Version Total Downloads License

Why is this package useful?

Keywords describe the main topics expressed in a document/text. Keyword extraction in turn allows for the extraction of important words and phrases from text. This in turn can be used for building a list of tags or to build a keyword search index or grouping similar content by its topics and much more. This library provides an easy method for PHP developers to get a list of keywords and phrases from a string of text.

This project is based on another project called RAKE-PHP by Richard Filipčík, which is a translation from a Python implementation simply called RAKE.

As described in: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In M. W. Berry & J. Kogan (Eds.), Text Mining: Theory and Applications: John Wiley & Sons.

This particular package intends to include the following benefits over the original RAKE-PHP package:

  1. Add PSR-2 coding standards.
  2. Implement PSR-4 in order to be Composer installable.
  3. Add additional functionality such as method chaining.
  4. Add multiple ways to provide source stopwords.
  5. Full unit test coverage.
  6. Performance improvements.
  7. Improved documentation.

Currently Supported Languages

  • English US (en_US)
  • Spanish/español (es_AR)
  • French/le français (fr_FR)
  • Polish/język polski (pl_PL)
  • Russian/русский язык (ru_RU)
  • Brazilian Portuguese/português do Brasil (pt_BR)
  • European Portuguese/português europeu (pt_PT)
  • Sorani Kurdish/سۆرانی (ckb_IQ)
  • Arabic (United Arab Emirates)/لإمارات العربية المتحدة (ar_AE)
  • German (Germany)/Deutsch (Deutschland) (de_DE)

Version

v1.0.16

Special Thanks

Installation

With Composer

$ composer require donatello-za/rake-php-plus
{
    "require": {
        "donatello-za/rake-php-plus": "^1.0"
    }
}
<?php
require 'vendor/autoload.php';

use DonatelloZa\RakePlus\RakePlus;

Without Composer

<?php

require 'path/to/AbstractStopwordProvider.php';
require 'path/to/StopwordArray.php';
require 'path/to/StopwordsPatternFile.php';
require 'path/to/StopwordsPHP.php';
require 'path/to/RakePlus.php';

use DonatelloZa\RakePlus\RakePlus;

Example 1

Creates a new instance of RakePlus, extract the phrases and return the results. Assumes that the specified text is English (US).

use DonatelloZa\RakePlus\RakePlus;

$text = "Criteria of compatibility of a system of linear Diophantine equations, " .
    "strict inequations, and nonstrict inequations are considered. Upper bounds " .
    "for components of a minimal set of solutions and algorithms of construction " .
    "of minimal generating sets of solutions for all types of systems are given.";

$phrases = RakePlus::create($text)->get();

print_r($phrases);
Array
(
    [0] => criteria
    [1] => compatibility
    [2] => system
    [3] => linear diophantine equations
    [4] => strict inequations
    [5] => nonstrict inequations
    [6] => considered
    [7] => upper bounds
    [8] => components
    [9] => minimal set
    [10] => solutions
    [11] => algorithms
    [12] => construction
    [13] => minimal generating sets
    [14] => types
    [15] => systems
)

Example 2

Creates a new instance of RakePlus, extract the phrases in different different orders and also shows how to get the phrase scores.

use DonatelloZa\RakePlus\RakePlus;

$text = "Criteria of compatibility of a system of linear Diophantine equations, " .
    "strict inequations, and nonstrict inequations are considered. Upper bounds " .
    "for components of a minimal set of solutions and algorithms of construction " .
    "of minimal generating sets of solutions for all types of systems are given.";

// Note: en_US is the default language.
$rake = RakePlus::create($text, 'en_US');

// 'asc' is optional and is the default sort order
$phrases = $rake->sort('asc')->get();
print_r($phrases);
Array
(
    [0] => algorithms
    [1] => compatibility
    [2] => components
    [3] => considered
    [4] => construction
    [5] => criteria
    [6] => linear diophantine equations
    [7] => minimal generating sets
    [8] => minimal set
    [9] => nonstrict inequations
    [10] => solutions
    [11] => strict inequations
    [12] => system
    [13] => systems
    [14] => types
    [15] => upper bounds
)
// Sort in descending order
$phrases = $rake->sort('desc')->get();
print_r($phrases);
Array
(
    [0] => upper bounds
    [1] => types
    [2] => systems
    [3] => system
    [4] => strict inequations
    [5] => solutions
    [6] => nonstrict inequations
    [7] => minimal set
    [8] => minimal generating sets
    [9] => linear diophantine equations
    [10] => criteria
    [11] => construction
    [12] => considered
    [13] => components
    [14] => compatibility
    [15] => algorithms
)
// Sort the phrases by score and return the scores
$phrase_scores = $rake->sortByScore('desc')->scores();
print_r($phrase_scores);
Array
(
    [linear diophantine equations] => 9
    [minimal generating sets] => 8.5
    [minimal set] => 4.5
    [strict inequations] => 4
    [nonstrict inequations] => 4
    [upper bounds] => 4
    [criteria] => 1
    [compatibility] => 1
    [system] => 1
    [considered] => 1
    [components] => 1
    [solutions] => 1
    [algorithms] => 1
    [construction] => 1
    [types] => 1
    [systems] => 1
)
// Extract phrases from a new string on the same RakePlus instance. Using the
// same RakePlus instance is faster than creating a new instance as the
// language files do not have to be re-loaded and parsed.

$text = "A fast Fourier transform (FFT) algorithm computes...";
$phrases = $rake->extract($text)->sort()->get();
print_r($phrases);
Array
(
    [0] => algorithm computes
    [1] => fast fourier transform
    [2] => fft
)

Example 3

Creates a new instance of RakePlus and extract the unique keywords from the phrases.

use DonatelloZa\RakePlus\RakePlus;

$text = "Criteria of compatibility of a system of linear Diophantine equations, " .
    "strict inequations, and nonstrict inequations are considered. Upper bounds " .
    "for components of a minimal set of solutions and algorithms of construction " .
    "of minimal generating sets of solutions for all types of systems are given.";

$keywords = RakePlus::create($text)->keywords();
print_r($keywords);
Array
(
    [0] => criteria
    [1] => compatibility
    [2] => system
    [3] => linear
    [4] => diophantine
    [5] => equations
    [6] => strict
    [7] => inequations
    [8] => nonstrict
    [9] => considered
    [10] => upper
    [11] => bounds
    [12] => components
    [13] => minimal
    [14] => set
    [15] => solutions
    [16] => algorithms
    [17] => construction
    [18] => generating
    [19] => sets
    [20] => types
    [21] => systems
)

Example 4

Creates a new instance of RakePlus without using the static RakePlus::create method.

use DonatelloZa\RakePlus;

$text = "Criteria of compatibility of a system of linear Diophantine equations, " .
    "strict inequations, and nonstrict inequations are considered. Upper bounds " .
    "for components of a minimal set of solutions and algorithms of construction " .
    "of minimal generating sets of solutions for all types of systems are given.";

$rake = new RakePlus();
$phrases = $rake->extract()->get();

// Alternative method:
$phrases = (new RakePlus($text))->get();

Example 5

You can provide custom stopwords in four different ways:

use DonatelloZa\RakePlus\RakePlus;

// 1: The standard way (provide a language code)
//    RakePlus will first look for ./lang/en_US.pattern, if
//    not found, it will look for ./lang/en_US.php.
$rake = RakePlus::create($text, 'en_US');

// 2: Pass an array containing stopwords
$rake = RakePlus::create($text, ['a', 'able', 'about', 'above', ...]);

// 3: Pass the name of a PHP or pattern file,
//    see lang/en_US.php and lang/en_US.pattern for examples.
$rake = RakePlus::create($text, '/path/to/my/stopwords.pattern');

// 4: Create an instance of one of the stopword provider classes (or
//    create your own) and pass that to RakePlus:
$stopwords = StopwordArray::create(['a', 'able', 'about', 'above', ...]);
$rake = RakePlus::create($text, $stopwords);

Example 6

You can specify the minimum number of characters that a phrase\keyword must be and if less than the minimum it will be filtered out. The default is 0 (no minimum).

use DonatelloZa\RakePlus\RakePlus;

$text = '6462 Little Crest Suite, 413 Lake Carlietown, WA 12643';

// Without a minimum
$phrases = RakePlus::create($text, 'en_US', 0)->get();
print_r($phrases);
Array
(
    [0] => crest suite
    [1] => 413 lake carlietown
    [2] => wa 12643
)
// With a minimum
$phrases = RakePlus::create($text, 'en_US', 10)->get();

print_r($phrases);
Array
(
    [0] => crest suite
    [1] => 413 lake carlietown
)

Example 7

You can specify whether phrases\keywords that consists of a numeric number only should be filtered out or not. The default is to filter out numerics.

use DonatelloZa\RakePlus\RakePlus;

$text = '6462 Little Crest Suite, 413 Lake Carlietown, WA 12643';

// Filter out numerics
$phrases = RakePlus::create($text, 'en_US', 0, true)->get();
print_r($phrases);
(
    [0] => crest suite
    [1] => 413 lake carlietown
    [2] => wa 12643
)
// Do not filter out numerics
$phrases = RakePlus::create($text, 'en_US', 0, false)->get();

print_r($phrases);
Array
(
    [0] => 6462
    [1] => crest suite
    [2] => 413 lake carlietown
    [3] => wa 12643
)

How to add additional languages

Using the stopwords extractor tool

The library requires a list of "stopwords" for each language. Stopwords are common words used in a language such as "and", "are", "or", etc. An example list of such stopwords can be found here (en_US). You can also take a look at this list which have stopwords for 50 different languages in individual JSON files.

When working with a simple list such as in the first example, you can copy and paste the text into a text file and use the extractor tool to convert it into a format that this library can read efficiently. An example of such a stopwords file that have been copied from the hyperlink above have been included for your convenience (console/stopwords_en_US.txt)

Alternatively you can extract the stopwords from a JSON file of which an example have also been supplied, look under console/stopwords_en_US.json

Note: Simply replace en_US to whatever locale you wish to use in the examples below.

Important: Before using the extractor tool, make sure to use the following Linux command to check whether your locale is supported:

$ locale -a

If you do not see the locale you wish to use in the list you can install it as follows: (in this case we are installing the French locale):

$ sudo locale-gen fr_FR
$ sudo locale-gen fr_FR.utf8

To extract stopwords from a text file, run the following from the command line:

$ cd ./console
$ php extractor.php stopwords_en_US.txt --locale=en_US --output=php

To extract stopwords from a JSON file, run the following from the command line:

$ php extractor.php stopwords_en_US.json --locale=en_US --output=php

It will output the results to the terminal. You will notice that the results looks like PHP and in fact it is. You can write the results directly to a PHP file by piping it:

$ php extractor.php stopwords_en_US.txt --locale=en_US --output=php > en_US.php

Finally, copy the en_US.php file to the lang/ directory and then instantiate php-rake-plus like so:

$rake = RakePlus::create($text, 'en_US');

To improve the initial loading speed of the language file within RakePlus, you can also set the exporter to produce the results as a regular expression pattern using the --output argument:

$ php extractor.php stopwords_en_US.txt --locale=en_US --output=pattern > en_US.pattern

RakePHP will always look for a .pattern file first and if not found it will look for a .php file in the ./lang/ directory.

To run tests

./vendor/bin/phpunit tests

License

Released under MIT license (read LICENSE).

About

Yet another PHP implementation of the Rapid Automatic Keyword Extraction algorithm (RAKE).

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • PHP 100.0%