
Question for finding string similarity #84

Open
Shellcat-Zero opened this issue Jan 17, 2019 · 1 comment

Comments

@Shellcat-Zero

Hi,

I was hoping to leverage NearPy to find similarities between strings, but it's not clear to me how to query the engine with a string vector (if that's possible). My use case is that I have ~30 million names to store in the engine, and I have around 1.5 million names to submit as queries to find a best match from the engine. I was going to use your Redis storage adapter so that all of the queries could be submitted asynchronously. Please let me know if that is not a good use case for NearPy.

Thanks.

@pixelogik
Owner

@Shellcat-Zero sorry for the long silence.

NearPy is very modular and allows users to customize the pipeline they are using.

It is, however, based on numerical vectors, so you would need to convert your strings to numerical vectors. I bet there are a couple of methods for this out there. The most straightforward way I can think of is to first lowercase the name and then map the string to an array of numbers based on the character values. Depending on which encoding you are using (UTF-8/UTF-16), this might result in values between 0 and 255, or much larger ones, for each character position.

Another aspect you would need to consider is the maximum name length, in characters, because this would determine the dimension of your vector space.

Let's consider this example, where you have these names to store

Pauline
Georgie
Peter
Sebastian

The maximum name length is 9 (Sebastian) so your vector space should be of (at least) dimension 9.

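You would then turn those names into numerical vectors of size 9 each (one number per character) and use the pipeline as usual. Shorter names such as Peter would need to be padded, for example with zeros, so that every vector has the same size.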
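The conversion described above could be sketched roughly like this. This is only an illustration under some assumptions: `name_to_vector` is a hypothetical helper (not part of NearPy), each character is mapped to its Unicode code point via `ord`, and shorter names are padded with zeros, which is one possible choice among several.

```python
def name_to_vector(name, dim, pad_value=0):
    """Hypothetical helper: map a name to a fixed-length list of numbers.

    Each character of the lowercased name becomes its Unicode code point.
    Names shorter than `dim` are padded with `pad_value` (an assumption;
    any consistent padding value would do).
    """
    codes = [ord(ch) for ch in name.lower()]
    if len(codes) > dim:
        raise ValueError("name is longer than the vector dimension")
    return codes + [pad_value] * (dim - len(codes))


names = ["Pauline", "Georgie", "Peter", "Sebastian"]

# The longest (lowercased) name determines the dimension of the vector space.
dim = max(len(name.lower()) for name in names)

# One fixed-length vector per name.
vectors = {name: name_to_vector(name, dim) for name in names}
```

Each resulting list would still need to be turned into a numerical array (for example a NumPy array) before being stored in the engine.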

However, it might be that NearPy is NOT the right framework for your project. There are so many really good Python frameworks out there for language and string processing; maybe some of them would be a better pick:

https://spacy.io/
https://radimrehurek.com/gensim/
http://www.nltk.org/

More "learning" focused, but might be useful as well:

https://scikit-learn.org/stable/

I hope I am not too late with my response. Good luck with your project!
