(As it was one of my first projects)
Doc2Map is an algorithm for topic modeling and visualization. It can read any type of document file, but it cannot OCR them. It finds topics based on the core idea of Top2Vec and displays them hierarchically on a map similar to Google Maps:
- Live Demo 1 With Wikipedia Dataset
- Live Demo 2 With 20 News Groups
Or on a scatter plot with a manual zoom level:
- Live Demo 1 With Wikipedia Dataset
- Live Demo 2 With 20 News Groups
With Doc2Map, you can create beautiful, intuitive, and interactive visuals that summarise your document corpus as a map similar to Google Maps, with topics, clusters, and documents in place of the names of countries, states, and cities.
Thanks to Apache Tika, a tool able to detect and extract text from over a thousand different file types, Doc2Map can read virtually any kind of file.
Note: This is not OCR; Doc2Map can't extract text from pictures.
There are two ways to use Doc2Map:
- Launching the Python module directly
- Importing the Doc2Map library in your script
Your first option is to launch the module directly. Once launched, you will have to wait a little for the program to start, then you will be asked which folder you want to analyse:
Select the folder containing the documents you want to map.
For the next step, you will have to be patient. Doc2Map will analyse your documents, convert them into plain text, then organise them. Depending on the format, size, and number of documents, this may take a long time...
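Since the running time depends on the format, size, and number of documents, it can help to look at the folder before launching Doc2Map. The following is a minimal sketch using only the Python standard library. The helper name `summarize_folder` is hypothetical and not part of Doc2Map, and it assumes subfolders should be counted too, which may or may not match what Doc2Map actually reads.

```python
from collections import Counter
from pathlib import Path


def summarize_folder(folder):
    """Count files per extension and sum their sizes.

    Hypothetical helper for illustration; not part of Doc2Map.
    """
    counts = Counter()
    total_bytes = 0
    # Walk the folder recursively and look only at regular files.
    for path in Path(folder).rglob("*"):
        if path.is_file():
            # Extensions are lowercased; files without one go under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
            total_bytes += path.stat().st_size
    return counts, total_bytes
```

Calling `summarize_folder("path/to/your/documents")` returns a per-extension count and the total size in bytes, which gives a rough idea of how much work the conversion step has ahead of it.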
When finished, two web pages will automatically open in your browser to show you different maps of your documents.
The examples are loaded from newly created HTML files. You can easily find their location by looking at the address bar of your browser, where you will see something like file://Your/Path/To/Your/Visuals
These files can easily be exported to another machine, with a few caveats:
- If your visualization is based on local files, once exported, these files may no longer be accessible by interacting with the visualisation.
- However, there will be no problem if you and the people you share the visualisations with use a common shared drive (as is often the case in firms, in the form of a local network).
- For the visualisation DocMap.html, you will have to include the files DocMapdensity.svg and data.js.
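The export step above can be sketched as a small copy script. This is a minimal sketch, assuming DocMap.html, DocMapdensity.svg, and data.js all sit in the same folder (the one shown in the browser address bar). The helper names `folder_from_file_url` and `export_visualisation` are hypothetical and not part of Doc2Map.

```python
import shutil
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname

# Files this README names as required by DocMap.html.
REQUIRED_FILES = ["DocMap.html", "DocMapdensity.svg", "data.js"]


def folder_from_file_url(url):
    """Turn the file:// address shown in the browser into a local folder."""
    path = Path(url2pathname(urlparse(url).path))
    # If the address points at a file, keep only the folder that contains it.
    return path.parent if path.is_file() else path


def export_visualisation(source_folder, destination_folder):
    """Copy DocMap.html and its companion files (hypothetical helper)."""
    source = Path(source_folder)
    destination = Path(destination_folder)
    # Refuse to export an incomplete set, since the page needs all three files.
    missing = [name for name in REQUIRED_FILES if not (source / name).is_file()]
    if missing:
        raise FileNotFoundError(f"Missing in {source}: {', '.join(missing)}")
    destination.mkdir(parents=True, exist_ok=True)
    for name in REQUIRED_FILES:
        shutil.copy2(source / name, destination / name)
    return [destination / name for name in REQUIRED_FILES]
```

Copying these files does not copy the documents themselves, so links from the visualisation to local files will only keep working if the destination machine can reach the same paths, for example through a shared drive.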
If you want to use Doc2Map from Python, you first have to install it:
pip install Doc2Map
Then, you will have to import it:
from Doc2Map import Doc2Map
Doc2Map is mainly based on the Top2Vec principle, and relies on Plotly and Leaflet to create beautiful visuals.
If you want to know the complete story of Doc2Map and how it works, I invite you to read the Medium article about it: