This page is a guide on how to use the API.
- 0. Introduction
- 1. Send Your Document: POST /document
- 2. Get the queue status: GET /queue/{id}
- 3. Get the results
- 4. Server Configuration Access
There are a few things to know first:
- The API is RESTful: it runs over HTTP and follows REST conventions.
- The API is asynchronous: documents are placed in a simple queue, and every job is managed by the API server.
Every endpoint starts with the prefix /api, optionally followed by a version number such as /v1.0. That means every request must be sent to one of the following:
- /api/v1.0: uses API version 1.0
- /api/v1: uses the latest 1.x version of the API
- /api: uses the latest API version
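The prefix rules above can be captured in a small helper. This is a minimal sketch in Python: the function name `api_base` is hypothetical, and the default host `http://localhost:3001` is simply the one used by the examples on this page.

```python
from typing import Optional


def api_base(host: str = "http://localhost:3001", version: Optional[str] = None) -> str:
    """Build the base URL for API requests (hypothetical helper, not part of the API).

    version=None  -> /api       (latest API version)
    version="1"   -> /api/v1    (latest 1.x version)
    version="1.0" -> /api/v1.0  (exactly version 1.0)
    """
    base = host.rstrip("/") + "/api"
    if version is not None:
        base += "/v" + version
    return base
```

Pinning a version (for example `api_base(version="1.0")`) is probably the safer choice for scripts that should keep working after a server upgrade.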
First, send a POST request to upload the document to Parsr. Along with it, you can send a configuration file that tells Parsr what kind of processing to perform on the file. If you don't provide one, the default configuration is used.
For details on the configuration file, please refer to the configuration file documentation. (Tip: you can also obtain the server's default configuration via the endpoint /api/v1/default-config. See Section 4.)
curl -X POST \
http://localhost:3001/api/v1/document \
-H 'Content-Type: multipart/form-data' \
-F 'file=@/path/to/file.pdf;type=application/pdf' \
-F 'config=@/path/to/config.json;type=application/json'
Windows tip: Use double quotes (") instead of single quotes ('):
curl -X POST \
http://localhost:3001/api/v1/document \
-H "Content-Type: multipart/form-data" \
-F "file=@/path/to/file.pdf;type=application/pdf" \
-F "config=@/path/to/config.json;type=application/json"
00cafe4463b9c12aac145b3ee8f00d
The document you sent has been accepted and is being processed. The body contains the unique queue ID. Keep it for later: you need it to check the queue status and to get the results.
This error means the file format you sent is not supported by the platform.
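For readers who prefer a script to curl, the same upload can be sketched with Python's standard library. The helper names `encode_multipart` and `send_document` are hypothetical. The field names `file` and `config`, the content types, and the plain-text queue ID in the response body are taken from the curl example above. The multipart encoding is written by hand because `urllib` does not provide one.

```python
import os
import urllib.request
import uuid


def encode_multipart(parts):
    """Encode (field, filename, content_type, data) tuples as multipart/form-data.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex
    chunks = []
    for field, filename, content_type, data in parts:
        chunks.append(f"--{boundary}\r\n".encode())
        chunks.append(
            f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'.encode()
        )
        chunks.append(f"Content-Type: {content_type}\r\n\r\n".encode())
        chunks.append(data)
        chunks.append(b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode())
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def send_document(pdf_path, config_path=None, base="http://localhost:3001/api/v1"):
    """POST a PDF (and optionally a config file); return the queue ID from the body."""
    with open(pdf_path, "rb") as f:
        parts = [("file", os.path.basename(pdf_path), "application/pdf", f.read())]
    if config_path is not None:
        with open(config_path, "rb") as f:
            parts.append(
                ("config", os.path.basename(config_path), "application/json", f.read())
            )
    body, content_type = encode_multipart(parts)
    request = urllib.request.Request(
        base + "/document",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    # urllib raises urllib.error.HTTPError when the server rejects the upload.
    with urllib.request.urlopen(request) as response:
        return response.read().decode().strip()
```

A call such as `send_document("/path/to/file.pdf", "/path/to/config.json")` would then return the queue ID to use in the following requests.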
This request returns the status of the queued document being processed. You need to pass it the queue ID that was returned by the previous request.
curl -X GET \
http://localhost:3001/api/v1/queue/00cafe4463b9c12aac145b3ee8f00d
{
"estimated-remaining-time": 30,
"progress-percentage": 10,
"start-date": "2018-12-31T12:34:56.789Z",
"status": "Detecting reading order..."
}
This status means the document is still being processed.
The estimated-remaining-time is expressed in seconds.
NB: estimated-remaining-time and progress-percentage are not working yet and are placeholders for future use.
{
"id": "00cafe4463b9c12aac145b3ee8f00d",
"json": "/api/v1/json/00cafe4463b9c12aac145b3ee8f00d",
"csv": "/api/v1/csv/00cafe4463b9c12aac145b3ee8f00d",
"text": "/api/v1/text/00cafe4463b9c12aac145b3ee8f00d",
"markdown": "/api/v1/markdown/00cafe4463b9c12aac145b3ee8f00d"
}
This status is sent when the processing is done. It returns links to the generated resources and the ID of the document for convenience.
This error means the queue ID doesn't refer to any known processing queue.
This error means that something went terribly wrong on the backend, probably an error coming from Parsr.
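Because the API is asynchronous, a client typically polls the queue until the job is finished. The following is a minimal sketch, not a reference client. It assumes a finished job can be recognised by the shape of the body (an `id` plus result links, and no `status` field), as in the two example responses above. The helper names, the polling interval and the retry limit are all hypothetical.

```python
import json
import time
import urllib.request


def is_done(body: dict) -> bool:
    """Assumption: a finished job returns an 'id' and links, and no 'status' field."""
    return "id" in body and "status" not in body


def fetch_queue(queue_id, base="http://localhost:3001/api/v1"):
    """GET /queue/{id} and parse the JSON body.

    urllib raises urllib.error.HTTPError for error statuses such as an unknown ID.
    """
    with urllib.request.urlopen(f"{base}/queue/{queue_id}") as response:
        return json.load(response)


def wait_for_results(queue_id, fetch=fetch_queue, interval=2.0, max_polls=150):
    """Poll until the job is done and return the body containing the result links."""
    for _ in range(max_polls):
        body = fetch(queue_id)
        if is_done(body):
            return body
        time.sleep(interval)
    raise TimeoutError(f"queue {queue_id} still processing after {max_polls} polls")
```

Passing the fetch function as a parameter keeps the loop easy to exercise without a running server. Since `estimated-remaining-time` is only a placeholder for now, the sketch uses a fixed interval instead.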
Results are available in different formats:
- JSON: GET /json/{id}
- Markdown: GET /markdown/{id}
- Raw text: GET /text/{id}
- CSV: GET /csv/{id}
These requests return the results of the processed document. You need to pass the queue ID that was returned by a previous request.
The queries for JSON, Markdown and raw text all work the same way. CSV is a bit different and is described in the next section.
curl -X GET \
http://localhost:3001/api/v1/json/00cafe4463b9c12aac145b3ee8f00d
{
"metadata": [/* ... */],
"fonts": [/* ... */],
"pages": [/* ... */],
}
For more information on the JSON format, please refer to the specific guide.
This error means that the result file doesn't exist. The config you sent in the first request may not have requested this output format.
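The links returned once processing is done are server-relative paths (for example `/api/v1/json/{id}`), so a client has to resolve them against the server address before fetching. A minimal sketch, assuming the default host used on this page and UTF-8 encoded responses. Both helper names are hypothetical.

```python
import urllib.request
from urllib.parse import urljoin


def resolve_link(link, host="http://localhost:3001"):
    """Turn a server-relative result link into an absolute URL."""
    return urljoin(host, link)


def fetch_result(link, host="http://localhost:3001"):
    """GET one result resource and return its body as text (assumes UTF-8).

    urllib raises urllib.error.HTTPError if the server answers with an error
    status, for example when the requested output was never generated.
    """
    with urllib.request.urlopen(resolve_link(link, host)) as response:
        return response.read().decode("utf-8")
```

For the JSON format, the returned text can be passed to `json.loads` to get at the `metadata`, `fonts` and `pages` entries.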
Since you can have multiple tables per page, you need to query them in two steps:
First, get the list of paths to all CSV files:
curl -X GET \
http://localhost:3001/api/v1/csv/00cafe4463b9c12aac145b3ee8f00d
[
"/api/v1/csv/00cafe4463b9c12aac145b3ee8f00d/1/1",
"/api/v1/csv/00cafe4463b9c12aac145b3ee8f00d/2/1",
"/api/v1/csv/00cafe4463b9c12aac145b3ee8f00d/2/2",
"/api/v1/csv/00cafe4463b9c12aac145b3ee8f00d/3/1"
]
This error means that the result file doesn't exist. The config you sent in the first request may not have requested CSV output.
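Each path in the list ends with the page number and the table number, so the list can be turned into an index of tables per page. A short sketch, assuming the path layout `/csv/{id}/{page}/{table}` shown in the example response. The function names are hypothetical.

```python
def parse_table_path(path):
    """Split '.../csv/{id}/{page}/{table}' into (id, page, table)."""
    doc_id, page, table = path.rstrip("/").split("/")[-3:]
    return doc_id, int(page), int(table)


def tables_by_page(paths):
    """Group table numbers by page number, keeping the order of the list."""
    pages = {}
    for path in paths:
        _, page, table = parse_table_path(path)
        pages.setdefault(page, []).append(table)
    return pages
```

Applied to the example list above, this reports one table on page 1, two on page 2 and one on page 3.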
Then, fetch the CSV files one by one using the following parameters:
- {id} is the ID of the document
- {page} is the page number
- {table} is the table number
curl -X GET \
http://localhost:3001/api/v1/csv/00cafe4463b9c12aac145b3ee8f00d/1/1
3x4 table;Empty column;Numbers
;;
Item A;;3.14
"Item B
on two lines";;1,234.56
This CSV output example contains multiline cells and an empty column.
This error means that the result file doesn't exist. Either the {page} and {table} parameters don't refer to an existing table, or the config you sent in the first request did not request CSV output.
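The example output uses a semicolon as the separator and quotes cells that span several lines, so splitting on line breaks is not enough. A minimal sketch of parsing such a table with Python's `csv` module. It assumes the semicolon delimiter seen in the example, and the function name is hypothetical.

```python
import csv
import io


def parse_table_csv(text):
    """Parse one CSV table into a list of rows, each a list of cell strings.

    newline="" lets the csv module handle line breaks inside quoted cells itself.
    """
    reader = csv.reader(io.StringIO(text, newline=""), delimiter=";")
    return list(reader)
```

Values such as `1,234.56` come back as plain strings. Converting them to numbers is left to the caller, since the thousands separator depends on the source document.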
You can download any of the available output formats:
- JSON: GET /json/{id}?download=1
- Markdown: GET /markdown/{id}?download=1
- Raw text: GET /text/{id}?download=1
- CSV: GET /csv/{id}?download=1
Here, {id} is the same queue ID used in Section 3 (Get the results).
For JSON and raw text, a json or txt file will start downloading.
For Markdown, if the document has any embedded assets such as images, a zip file will start downloading, containing the Markdown file and a folder with all required assets. If the document does not contain any images, a single md file will be downloaded.
For CSV, a zip file will be downloaded, containing one csv file per table in the document.
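When saving downloads from a script, the file type depends on the format and, for Markdown, on whether the document has embedded assets. The sketch below builds the download URL and picks an extension by inspecting the payload. It detects a zip archive by content rather than relying on response headers, which this page does not describe. Both function names are hypothetical.

```python
import io
import zipfile
from urllib.parse import urlencode


def download_url(fmt, queue_id, base="http://localhost:3001/api/v1"):
    """Build the download URL for one output format."""
    query = urlencode({"download": 1})
    return f"{base}/{fmt}/{queue_id}?{query}"


def saved_extension(fmt, payload):
    """Choose a file extension for downloaded bytes, following the rules above."""
    if fmt == "json":
        return "json"
    if fmt == "text":
        return "txt"
    if fmt == "csv":
        return "zip"
    if fmt == "markdown":
        # Markdown arrives as a zip only when the document has embedded assets.
        return "zip" if zipfile.is_zipfile(io.BytesIO(payload)) else "md"
    raise ValueError(f"unknown format: {fmt}")
```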
The API can also be queried to gain access to the following server assets:
The server's default configuration can be queried (at /api/v1/default-config) using:
curl -X GET \
http://localhost:3001/api/v1/default-config
The list of all usable modules can be queried from the server (at /api/v1/modules) using:
curl -X GET \
http://localhost:3001/api/v1/modules
A module's configuration file, which includes its name, description, and the default value and range of each parameter, can be queried (at /api/v1/module-config/<module_name>) using:
curl -X GET \
http://localhost:3001/api/v1/module-config/table-detection
... which will fetch the configuration file for the table-detection module.
If you run into trouble running Parsr, a good first step is to check that every required dependency is installed by going to:
http://localhost:3001/api/v1/check-installation
A table will be displayed, showing every required (and optional) dependency and its path on your system. If you find that your system is missing a dependency, refer to the installation guide to fix the problem.