To run the project, it is recommended to use `dotnet watch run --project MoogleServer`. Once the program is running, it will start preprocessing the files. When a query is made, the project calls the engine's query function in the `Moogle.cs` file.
This method calls the query function on the engine's internal vector model, which first builds the array of document vectors (using only the query words), then multiplies this array by the query vector, and sorts the files by the resulting scalars. After this, the operators are applied to the sorted files, which may change the ordering. Then we find a snippet for each of the best texts, highlight the query words in it, and return the search result.
The algorithmic part will now be explained.
The search engine uses a vector model to find the texts that most closely resemble the query. First we vectorize the texts and the queries using TF-IDF:

$ w_{t,d} = tf_{t,d} \cdot \log \frac{N}{df_t} $

Where $tf_{t,d}$ is the number of times the term $t$ appears in document $d$, $df_t$ is the number of documents that contain $t$, and $N$ is the total number of documents.
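As a minimal sketch of how these weights could be computed (the class and method names here are illustrative, not the project's actual code):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class TfIdfSketch
{
    // docs: each document as an array of lowercased tokens.
    // Returns one sparse vector (term -> weight) per document.
    public static Dictionary<string, double>[] Vectorize(List<string[]> docs)
    {
        int n = docs.Count;

        // df[t] = number of documents in which term t appears.
        var df = new Dictionary<string, int>();
        foreach (var doc in docs)
            foreach (var term in doc.Distinct())
                df[term] = df.GetValueOrDefault(term) + 1;

        var vectors = new Dictionary<string, double>[n];
        for (int i = 0; i < n; i++)
        {
            vectors[i] = docs[i]
                .GroupBy(t => t)
                .ToDictionary(
                    g => g.Key,                                        // term
                    g => g.Count() * Math.Log((double)n / df[g.Key])); // tf * idf
        }
        return vectors;
    }
}
```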
After that, for every query we can go through all the text vectors and see how close each one is to the query. To measure that, we can find the angle between the vectors with the cosine similarity formula:

$ \cos(\theta) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} $
I decided to use only the dot product of the vectors, because it is easier to compute and is similar to the cosine similarity (it is the same quantity without the normalization by the vector norms).
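A sketch of this scoring step, assuming the sparse vectors from the TF-IDF sketch above (the names `Dot` and `Rank` are illustrative):

```csharp
using System.Collections.Generic;
using System.Linq;

static class Scoring
{
    // Dot product of two sparse vectors (term -> weight).
    public static double Dot(Dictionary<string, double> doc,
                             Dictionary<string, double> query)
    {
        double sum = 0;
        foreach (var (term, weight) in query)
            if (doc.TryGetValue(term, out double w))
                sum += w * weight;
        return sum;
    }

    // Rank document indices by their score against the query vector.
    public static IEnumerable<int> Rank(Dictionary<string, double>[] docs,
                                        Dictionary<string, double> query) =>
        Enumerable.Range(0, docs.Length)
                  .OrderByDescending(i => Dot(docs[i], query));
}
```

Since the TF-IDF weights are non-negative, ranking by the raw dot product orders documents similarly to cosine similarity, just without normalizing by document length.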
Mistakes happen: sometimes we spell words wrong or put capital letters where they don't belong, so the engine needs to recognize that these words are the same.
To do that I used the Edit Distance algorithm. It gives you the number of operations you have to apply to one of the strings to convert it into the other. I realized that this is sometimes not the optimal approach: for example, the word *distra* is closer to *citra* than *distra* is to *dijkstra*, and I don't think it should be like that, so I created my own distance formula.
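For reference, the standard dynamic-programming edit distance (Levenshtein) looks like this; note this is the textbook version, not the modified formula described above:

```csharp
using System;

static class Distance
{
    // Classic Levenshtein distance via dynamic programming.
    public static int EditDistance(string a, string b)
    {
        var d = new int[a.Length + 1, b.Length + 1];
        for (int i = 0; i <= a.Length; i++) d[i, 0] = i; // delete everything
        for (int j = 0; j <= b.Length; j++) d[0, j] = j; // insert everything
        for (int i = 1; i <= a.Length; i++)
            for (int j = 1; j <= b.Length; j++)
            {
                int cost = a[i - 1] == b[j - 1] ? 0 : 1;
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1,   // deletion
                             d[i, j - 1] + 1),  // insertion
                    d[i - 1, j - 1] + cost);    // substitution
            }
        return d[a.Length, b.Length];
    }
}
```

Under this plain metric, a short misspelled word can end up as close to an unrelated word as to the intended one, which is the behavior the custom formula tries to correct.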
In most languages, words are made up of a lexeme and a morpheme. Words that have the same lexeme usually belong to the same grammatical family, and tend to have similar meanings, so they are also important for the search.
For example, in Spanish, if we have the word *gatico*, we can say that the words *gato* and *gatuno* are also important for the search, so we have to add them to the query.
One approach to this is to have a database with all the families of words, lexemes, or morphemes, but this didn't seem like an elegant solution to me.
I decided to use a Trie data structure to find the words that belong to the same family.
For example, assuming we want the words in the same family as *gatico*, we look for all the words that end in some node belonging to the subtree of the prefix *gat* (we remove the last characters of the word to keep only the lexeme).
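A minimal trie sketch for this lookup (the names `Trie`, `Insert`, and `WordsWithPrefix` are illustrative; the real implementation may differ):

```csharp
using System.Collections.Generic;

class TrieNode
{
    public Dictionary<char, TrieNode> Children = new();
    public bool IsWord;
}

class Trie
{
    private readonly TrieNode root = new();

    public void Insert(string word)
    {
        var node = root;
        foreach (char c in word)
        {
            if (!node.Children.TryGetValue(c, out var next))
                node.Children[c] = next = new TrieNode();
            node = next;
        }
        node.IsWord = true;
    }

    // All words that start with the given prefix, i.e. the candidate family.
    public IEnumerable<string> WordsWithPrefix(string prefix)
    {
        var node = root;
        foreach (char c in prefix)
            if (!node.Children.TryGetValue(c, out node))
                yield break; // no word in the corpus has this prefix
        foreach (var w in Collect(node, prefix))
            yield return w;
    }

    private static IEnumerable<string> Collect(TrieNode node, string acc)
    {
        if (node.IsWord) yield return acc;
        foreach (var (c, child) in node.Children)
            foreach (var w in Collect(child, acc + c))
                yield return w;
    }
}
```

With the corpus words inserted, `WordsWithPrefix("gat")` would return *gato*, *gatuno*, and *gatico* if they appear in the texts.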
The obvious approach is to have a database of all the synonyms and, for each word, look up its synonyms. I found an API that provides synonyms of words, so I decided to use it instead of a database, because a database can be too large or incomplete.
⚠️ When the computer does not have internet, searches can take a little longer than normal because the engine still tries to communicate with the API.
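A sketch of how such a call could look; the URL is a placeholder, not the actual API the project uses, and the short timeout is what keeps the offline case from stalling for too long:

```csharp
using System;
using System.Net.Http;
using System.Text.Json;
using System.Threading.Tasks;

static class Synonyms
{
    private static readonly HttpClient http =
        new HttpClient { Timeout = TimeSpan.FromSeconds(2) };

    public static async Task<string[]> GetSynonymsAsync(string word)
    {
        try
        {
            // Placeholder endpoint: substitute the real synonyms API here.
            string json = await http.GetStringAsync(
                $"https://api.example.com/synonyms/{Uri.EscapeDataString(word)}");
            return JsonSerializer.Deserialize<string[]>(json) ?? Array.Empty<string>();
        }
        catch (Exception)
        {
            // No internet (or a slow API): fall back to no synonyms.
            return Array.Empty<string>();
        }
    }
}
```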
I decided to use 4 operators:

- `in`: only shows texts in which the word appears.
- `notin`: does not show the texts in which the word appears.
- `imp`: increases the importance of the word.
- `near`: increases the importance of the document if the words appear close to each other.
So to calculate how important a text is we use the formula:
$ importance = dot \cdot in \cdot notin \cdot imp \cdot near $
Where $in$ takes the value 1 if the word appears in the text (and 0 otherwise), and $notin$ takes the value 0 if the word appears (and 1 otherwise).
In the case of $near$, the factor is inversely proportional to the minimum distance between the two words: the closer they appear, the larger it is.
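Put together, the scoring of a single text for one query word could look roughly like this (a hypothetical sketch of the formula above; the parameter names are mine):

```csharp
static class OperatorScoring
{
    // importance = dot * in * notin * imp * near, for a single query word.
    public static double Importance(
        double dot,        // TF-IDF dot product of text and query
        bool wordInText,   // does the word appear in this text?
        bool hasIn,        // was the "in" operator used on this word?
        bool hasNotIn,     // was the "notin" operator used on this word?
        double imp,        // importance multiplier from "imp" (1 if unused)
        double near)       // closeness factor from "near" (1 if unused)
    {
        double inFactor = (!hasIn || wordInText) ? 1 : 0;
        double notinFactor = (hasNotIn && wordInText) ? 0 : 1;
        return dot * inFactor * notinFactor * imp * near;
    }
}
```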
I tried to create a minimally pretty graphical user interface, and I also added the functionality that when you click on a file name, it opens a new tab with the full file.
The search has a complexity of
The edit distance has a complexity of $ O(n \cdot m) $, where $n$ and $m$ are the lengths of the two strings.
Finding a word in the trie is $ O(L) $, where $L$ is the length of the word.
A nice optimization I came up with was to use a k-dimensional tree. This is a data structure that lets us solve the k-nearest-neighbors problem in a time complexity of $ O(M \cdot \log N) $, where $N$ is the number of points stored in the tree and $M$ is the number of neighbors we retrieve.
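A compact sketch of the idea (nearest-neighbor only, with squared Euclidean distances; a real implementation would also collect the k best matches, and the names here are illustrative):

```csharp
using System.Collections.Generic;

class KdTree
{
    private class Node
    {
        public double[] Point;
        public Node Left, Right;
    }

    private readonly Node root;
    private readonly int k; // number of dimensions

    public KdTree(List<double[]> points) // assumes a non-empty list
    {
        k = points[0].Length;
        root = Build(points, 0);
    }

    private Node Build(List<double[]> pts, int depth)
    {
        if (pts.Count == 0) return null;
        int axis = depth % k;
        pts.Sort((a, b) => a[axis].CompareTo(b[axis])); // split at the median
        int mid = pts.Count / 2;
        return new Node
        {
            Point = pts[mid],
            Left = Build(pts.GetRange(0, mid), depth + 1),
            Right = Build(pts.GetRange(mid + 1, pts.Count - mid - 1), depth + 1),
        };
    }

    public double[] Nearest(double[] target)
    {
        double[] best = null;
        double bestDist = double.MaxValue;
        Search(root, target, 0, ref best, ref bestDist);
        return best;
    }

    private void Search(Node node, double[] target, int depth,
                        ref double[] best, ref double bestDist)
    {
        if (node == null) return;
        double d = Dist2(node.Point, target);
        if (d < bestDist) { bestDist = d; best = node.Point; }

        int axis = depth % k;
        double diff = target[axis] - node.Point[axis];
        var (near, far) = diff < 0 ? (node.Left, node.Right) : (node.Right, node.Left);
        Search(near, target, depth + 1, ref best, ref bestDist);
        // Only cross the splitting plane if a closer point could be on the other side.
        if (diff * diff < bestDist)
            Search(far, target, depth + 1, ref best, ref bestDist);
    }

    private static double Dist2(double[] a, double[] b)
    {
        double s = 0;
        for (int i = 0; i < a.Length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
        return s;
    }
}
```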
Another idea is to search for similar words using previously trained word embeddings, but in practice this can be a bit inefficient, since you would have to search for the k nearest neighbors again for each word.
Another interesting optimization is to run the dot product between the matrix and the vector on a GPU: since GPUs are specialized in performing operations in parallel, such as multiplying matrices, this operation should theoretically be faster on the GPU.
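The operation in question is just a matrix-vector product. As a CPU stand-in (real GPU offloading would go through TensorFlow or OpenCL, as described next), a parallel version could look like this:

```csharp
using System.Threading.Tasks;

static class MatMul
{
    // scores[i] = dot(docMatrix[i], query).
    // Parallel.For is a CPU stand-in for the GPU offloading discussed here.
    public static double[] MatVec(double[][] docMatrix, double[] query)
    {
        var scores = new double[docMatrix.Length];
        Parallel.For(0, docMatrix.Length, i =>
        {
            double s = 0;
            var row = docMatrix[i];
            for (int j = 0; j < row.Length; j++)
                s += row[j] * query[j];
            scores[i] = s;
        });
        return scores;
    }
}
```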
To test this I tried two ways to implement it. The first was to connect the vector model to Python code and run the matrix multiplication in TensorFlow with the GPU. I managed to program this, but when I tested it I realized it was much slower than the initial approach (probably because of the communication overhead between Python and C#).
Then I tried to use OpenCL to calculate the dot product on the GPU. I downloaded several OpenCL wrappers for C#, but none of them worked for me; still, according to my research, it should theoretically be possible to reduce the computation time with OpenCL.