php-data-miner extracts data from structured data formats (such as PDF documents).
- Unix-like OS
- GNU Make
- PHP (>=7.4)
- NodeJS (>=8)
$ sudo apt-get install make gcc gfortran php-dev libopenblas-dev liblapacke-dev re2c build-essential
Annotate your model with @Model() and its properties with @Property() annotations:
use PhpDataMiner\Model\Annotation\Model;
use PhpDataMiner\Model\Annotation\Property;
/**
* @Model()
*/
class Invoice
{
/**
* @var string
* @Property()
*/
protected string $number;
}
$miner = $this->miner->create($entity, [
'storage' => new CustomStorage(),
'property_types' => [
new FloatProperty(),
new IntegerProperty(),
new DateProperty(),
new Property(),
]
]);
$pdfContents = shell_exec('pdftotext -layout invoice.pdf -');
$doc = $miner->normalize($pdfContents, [
'filters' => [
DateFilter::class,
ColonFilter::class,
Section::class,
WordTree::class,
]
]);
$entity = new Invoice();
You need to have pdftotext installed to read PDF contents as shown above.
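A minimal sketch of guarding the pdftotext call, so that a missing binary fails with a clear message instead of silently producing empty content. The helpers hasPdfToText(), buildPdfToTextCommand() and readPdfText() are hypothetical names used for illustration and are not part of php-data-miner. The sketch assumes a Unix-like shell; on Debian/Ubuntu the binary ships in the poppler-utils package.

```php
<?php

/**
 * Returns true if a pdftotext binary is found on the PATH.
 * Relies on the POSIX `command -v` shell builtin.
 */
function hasPdfToText(): bool
{
    $path = shell_exec('command -v pdftotext 2>/dev/null');

    return is_string($path) && trim($path) !== '';
}

/**
 * Builds the command line used to dump a PDF as layout-preserving text.
 * The file name is shell-escaped; the trailing "-" sends output to stdout.
 */
function buildPdfToTextCommand(string $pdfPath): string
{
    return 'pdftotext -layout ' . escapeshellarg($pdfPath) . ' -';
}

/**
 * Reads the text of a PDF, failing loudly when pdftotext is unavailable
 * or produces no output.
 */
function readPdfText(string $pdfPath): string
{
    if (!hasPdfToText()) {
        throw new RuntimeException('pdftotext not found; install poppler-utils.');
    }

    $output = shell_exec(buildPdfToTextCommand($pdfPath));
    if (!is_string($output)) {
        throw new RuntimeException('pdftotext produced no output for ' . $pdfPath);
    }

    return $output;
}
```

The returned string can then be passed to $miner->normalize() in place of the raw shell_exec() result.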
- filters (or transformers) transform and normalize the content
- WordTree filter is a special kind of tokenizer for nesting and grouping the contents (by rows, columns, sentences etc.)
It's recommended that you place your tokenizers last in the filters list.
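The ordering advice can be illustrated with a small standalone sketch: normalizing filters map a string to a string, while a tokenizer turns the string into a nested structure, so it has to run after the text has been cleaned up. The closures and runPipeline() below are hypothetical stand-ins written for this example; they do not reproduce the library's own ColonFilter or WordTree implementations.

```php
<?php

// Hypothetical normalizing filter: tidy spacing around colons,
// so "Number :   42" becomes "Number: 42".
$colonFilter = static function (string $text): string {
    return preg_replace('/[ \t]*:[ \t]*/', ': ', $text);
};

// Hypothetical normalizing filter: squeeze runs of spaces and tabs,
// leaving line breaks intact.
$whitespaceFilter = static function (string $text): string {
    return preg_replace('/[ \t]+/', ' ', $text);
};

// Hypothetical tokenizer: group the text into rows, each row a list of words.
$rowTokenizer = static function (string $text): array {
    $rows = [];
    foreach (preg_split('/\R/', trim($text)) as $line) {
        $rows[] = preg_split('/ +/', trim($line), -1, PREG_SPLIT_NO_EMPTY);
    }

    return $rows;
};

/**
 * Applies every string filter in order, then tokenizes the result.
 * The tokenizer runs last because its output is no longer a plain string.
 */
function runPipeline(string $text, array $filters, callable $tokenizer): array
{
    foreach ($filters as $filter) {
        $text = $filter($text);
    }

    return $tokenizer($text);
}

$tokens = runPipeline(
    "Invoice number :   42\nTotal :  10.50",
    [$colonFilter, $whitespaceFilter],
    $rowTokenizer
);
// $tokens is [['Invoice', 'number:', '42'], ['Total:', '10.50']]
```

If the tokenizer ran first, the string filters would receive an array instead of text, which is why the tokenizing step sits at the end of the list.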
Train your model with data you've already entered (supervised learning):
...
$trainedProperties = $miner->train($entity, $doc);
Apply predicted data to your model:
...
$predictedProperties = $miner->predict($entity, $doc);
Override the PhpDataMiner\Storage\Model\Model::createEntryDiscriminator()
method in your storage model to set the entry filter:
use PhpDataMiner\Storage\Model\Model;
use PhpDataMiner\Storage\Model\ModelInterface;
class InvoiceModel extends Model implements ModelInterface
{
public static function createEntryDiscriminator($invoice): DiscriminatorInterface
{
return new Discriminator([
$invoice->getClient() ? $invoice->getClient()->getId() : null,
$invoice->getId(),
]);
}
}
Version numbers follow semantic versioning.
- Natural language toolkit (NLTK) support
- Feature vectors for properties
$ make tests [test_name]