29 Sep 07:19

38f036d

v0.8

Highlights

Use keybert.KeyLLM to leverage LLMs for extracting keywords 🔥
- Use it either with or without candidate keywords generated through KeyBERT
- Efficient implementation by calculating embeddings and generating keywords for a subset of the documents
Multiple LLMs are integrated: OpenAI, Cohere, LangChain, 🤗 Transformers, and LiteLLM

1. Create Keywords with KeyLLM

A minimal method for keyword extraction with Large Language Models (LLM). There are a number of implementations that allow you to mix and match KeyBERT with KeyLLM. You could also choose to use KeyLLM without KeyBERT.

import openai
from keybert.llm import OpenAI
from keybert import KeyLLM

# Create your LLM
openai.api_key = "sk-..."
llm = OpenAI()

# Load it in KeyLLM
kw_model = KeyLLM(llm)

# Extract keywords
keywords = kw_model.extract_keywords(documents)

2. Efficient KeyLLM

If you have embeddings of your documents, you could use those to find documents that are most similar to one another. Those documents could then all receive the same keywords and only one of these documents will need to be passed to the LLM. This can make computation much faster as only a subset of documents will need to receive keywords.

import openai
from keybert.llm import OpenAI
from keybert import KeyLLM
from sentence_transformers import SentenceTransformer

# Extract embeddings
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(documents, convert_to_tensor=True)

# Create your LLM
openai.api_key = "sk-..."
llm = OpenAI()

# Load it in KeyLLM
kw_model = KeyLLM(llm)

# Extract keywords
keywords = kw_model.extract_keywords(documents, embeddings=embeddings, threshold=.75)
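The grouping idea described above can be sketched in plain Python. This is a simplified illustration, not KeyLLM's actual implementation: `group_by_similarity`, `stub_llm`, and the toy two-dimensional embeddings are hypothetical stand-ins, and a greedy pass over cosine similarities stands in for whatever clustering the library uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def group_by_similarity(embeddings, threshold):
    """Greedily put each document in the first group whose representative is
    at least `threshold` similar; otherwise start a new group."""
    representatives = []  # index of the first document of each group
    assignment = []       # group id for every document
    for i, emb in enumerate(embeddings):
        for group_id, rep in enumerate(representatives):
            if cosine(emb, embeddings[rep]) >= threshold:
                assignment.append(group_id)
                break
        else:
            representatives.append(i)
            assignment.append(len(representatives) - 1)
    return representatives, assignment


calls = []


def stub_llm(doc):
    """Hypothetical stand-in for an LLM call; records how often it is used."""
    calls.append(doc)
    return [word for word in doc.split() if len(word) > 4]


documents = [
    "the website mentions delivery times",
    "the site mentions delivery times",
    "machine learning uses labelled data",
]
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]  # toy vectors, not real embeddings

representatives, assignment = group_by_similarity(embeddings, threshold=0.75)

# Only one document per group is sent to the (stub) LLM ...
group_keywords = [stub_llm(documents[rep]) for rep in representatives]
# ... and every document in the group receives the same keywords.
keywords = [group_keywords[group_id] for group_id in assignment]
```

Here the first two documents end up in one group, so the stub is called twice for three documents.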

3. Efficient KeyLLM + KeyBERT

This is the best of both worlds. We use KeyBERT to generate a first pass of keywords and embeddings and give those to KeyLLM for a final pass. Again, the most similar documents will be clustered and they will all receive the same keywords. You can change this behavior with the threshold. A higher value will reduce the number of documents that are clustered and a lower value will increase the number of documents that are clustered.

import openai
from keybert.llm import OpenAI
from keybert import KeyBERT

# Create your LLM
openai.api_key = "sk-..."
llm = OpenAI()

# Load it in KeyBERT
kw_model = KeyBERT(llm=llm)

# Extract keywords
keywords = kw_model.extract_keywords(documents)
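To make the effect of the threshold concrete, here is a small sketch that counts how many documents have at least one sufficiently similar neighbour at different thresholds. The function `clustered_documents` and the toy vectors are hypothetical and only illustrate the direction of the effect; they are not taken from the library.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def clustered_documents(embeddings, threshold):
    """Indices of documents that have at least one other document with a
    cosine similarity of `threshold` or more."""
    clustered = set()
    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                clustered.update((i, j))
    return sorted(clustered)


# Toy vectors standing in for document embeddings
embeddings = [[1.0, 0.0], [0.9, 0.3], [0.6, 0.8], [0.0, 1.0]]

counts = {t: len(clustered_documents(embeddings, t)) for t in (0.75, 0.9, 0.95)}
print(counts)
```

With these vectors, raising the threshold from 0.75 to 0.9 to 0.95 shrinks the number of clustered documents from four to two to none.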

See the KeyBERT documentation for the use cases of KeyLLM and for the implemented Large Language Models.

Fixes

Enable Guided KeyBERT for seed keywords differing among docs by @shengbo-ma in #152

03 Nov 08:30

MaartenGr

v0.7.0

7b763ae

Highlights

Cleaned up documentation and added several visual representations of the algorithm (excluding MMR / MaxSum)
Added functions to extract and pass word- and document embeddings which should make fine-tuning much faster

from keybert import KeyBERT

kw_model = KeyBERT()

# Prepare embeddings
doc_embeddings, word_embeddings = kw_model.extract_embeddings(docs)

# Extract keywords without needing to re-calculate embeddings
keywords = kw_model.extract_keywords(docs, doc_embeddings=doc_embeddings, word_embeddings=word_embeddings)

Do note that the parameters passed to .extract_embeddings for creating the vectorizer should be exactly the same as those in .extract_keywords.
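A rough sketch of why the settings have to match, assuming the word embeddings are stored one row per vocabulary entry in the order the vectorizer produces them (an assumption about the internals, not something these notes state). The helper `ngram_vocabulary` is hypothetical and only mimics a whitespace-based n-gram vectorizer.

```python
def ngram_vocabulary(doc, ngram_range):
    """Sorted word n-grams in `doc` for every n in the inclusive range."""
    tokens = doc.lower().split()
    low, high = ngram_range
    vocab = set()
    for n in range(low, high + 1):
        for i in range(len(tokens) - n + 1):
            vocab.add(" ".join(tokens[i:i + n]))
    return sorted(vocab)


doc = "supervised learning maps inputs to outputs"

# Vocabulary used when the embeddings were prepared
embedded_vocab = ngram_vocabulary(doc, (1, 1))
# Vocabulary produced later with a different n-gram range
requested_vocab = ngram_vocabulary(doc, (1, 2))

# One embedding row per entry of `embedded_vocab` cannot line up with the
# longer `requested_vocab`, so the precomputed rows would be misaligned.
mismatch = len(embedded_vocab) != len(requested_vocab)
print(len(embedded_vocab), len(requested_vocab), mismatch)
```

Using the same n-gram range, stop words, and other vectorizer settings in both calls keeps the two vocabularies identical.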

Fixes

Redundant documentation was removed by @mabhay3420 in #123
Fixed Gensim backend not working after v4 migration (#71)
Fixed candidates not working (#122)

27 Jul 14:20

MaartenGr

v0.6.0

9dd7b59

Highlights

Major speedup of 2x to 5x when passing multiple documents at once (for MMR and MaxSum) compared to passing them one at a time
Same results whether passing a single document or multiple documents
MMR and MaxSum now work when passing a single document or multiple documents
Improved documentation
Added 🤗 Hugging Face Transformers

from keybert import KeyBERT
from transformers.pipelines import pipeline

hf_model = pipeline("feature-extraction", model="distilbert-base-cased")
kw_model = KeyBERT(model=hf_model)

Highlighting support for Chinese texts
- Now uses the CountVectorizer for creating the tokens
- This should also improve the highlighting for most applications and higher n-grams

NOTE: Although highlighting for Chinese texts is improved, since I am not familiar with the Chinese language there is a good chance it is not yet as optimized as for other languages. Any feedback with respect to this is highly appreciated!

Fixes

Fix typo in ReadMe by @priyanshul-govil in #117
Add missing optional dependencies (gensim, use, and spacy) by @yusuke1997 in #114

04 May 14:31

MaartenGr

v0.5.1

ce941df

Added a page about leveraging CountVectorizer and KeyphraseVectorizers
- Shoutout to @TimSchopf for creating and optimizing the package!
- See the KeyphraseVectorizers package
Fixed Max Sum Similarity returning incorrect similarities #92
- Thanks to @kunihik0 for the PR!
Fixed out of bounds condition in MMR
- Thanks to @artmatsak for the PR!
Started styling with Flake8 and Black (which was long overdue)
- Added pre-commit to make following through a bit easier with styling

28 Sep 13:30

MaartenGr

v0.5.0

6ab9af1

v0.5

Highlights:

Added Guided KeyBERT
- kw_model.extract_keywords(doc, seed_keywords=seed_keywords)
- Thanks to @zolekode for the inspiration!
Use the newest all-* models from SBERT
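As a rough intuition for Guided KeyBERT, one way seed keywords can steer the result is by pulling the document embedding towards the seed embedding before candidates are ranked. The sketch below assumes a weighted average for that step; the 3:1 weights, the function names, and the toy vectors are illustrative assumptions, not a statement of how the library does it.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def guide(doc_embedding, seed_embedding, doc_weight=3.0, seed_weight=1.0):
    """Weighted average of document and seed embeddings (weights are assumed)."""
    total = doc_weight + seed_weight
    return [
        (doc_weight * d + seed_weight * s) / total
        for d, s in zip(doc_embedding, seed_embedding)
    ]


def rank(query_embedding, candidates):
    """Candidate phrases sorted by similarity to the query, best first."""
    return sorted(
        candidates,
        key=lambda phrase: cosine(query_embedding, candidates[phrase]),
        reverse=True,
    )


# Toy vectors standing in for real embeddings
doc_embedding = [1.0, 0.0]
seed_embedding = [0.0, 1.0]
candidates = {"shipping": [0.9, -0.4], "refund policy": [0.8, 0.5]}

plain_ranking = rank(doc_embedding, candidates)
guided_ranking = rank(guide(doc_embedding, seed_embedding), candidates)
print(plain_ranking, guided_ranking)
```

In this toy setup the candidate closer to the seed direction moves to the top once the document embedding is guided.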

Miscellaneous:

Added instructions in the FAQ to extract keywords from Chinese documents

30 Jun 09:54

MaartenGr

v0.4.0

25dab3a

v0.4

Features

Use paraphrase-MiniLM-L6-v2 as the default (great results!)
Highlight the document with keywords:
- keywords = kw_model.extract_keywords(doc, highlight=True)

Miscellaneous

Update Flair dependencies
Added FAQ

10 May 09:28

MaartenGr

v0.3.0

eb6d086

v0.3

The two main features are candidate keywords and several backends to use instead of Flair and SentenceTransformers!

Highlights:

Use candidate words instead of extracting those from the documents (#25)
- KeyBERT().extract_keywords(doc, candidates)
Spacy, Gensim, USE, and Custom Backends were added (see the documentation)

Fixes:

Improved imports
Fix encoding error when locally installing KeyBERT (#30)

Miscellaneous:

Improved documentation (ReadMe & MKDocs)
Add the main tutorial as a shield
Typos (#31, #35)

09 Feb 10:41

MaartenGr

v0.2.0

2a982bd

Major Release v0.2

Features

Add similarity scores to the output
Add Flair as a possible back-end
Update documentation + improved testing

25 Jan 09:09

MaartenGr

v0.1.3

a767327

BibTeX

This release is meant as a way to create a DOI through Zenodo.

28 Oct 09:55

MaartenGr

v0.1.2

8fd836c

Max Sum Sim

Added Max Sum Similarity as an option to diversify your results.
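A minimal sketch of the Max Sum Similarity idea in plain Python: keep the candidates most similar to the document, then pick the combination of `top_n` candidates that are least similar to each other. The function name, parameters, and toy vectors here are assumptions for illustration, not the library's code.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def max_sum_selection(doc_embedding, candidates, top_n, nr_candidates):
    """Pick `top_n` phrases from the `nr_candidates` most document-similar
    candidates so that their summed pairwise similarity is lowest."""
    shortlist = sorted(
        candidates,
        key=lambda phrase: cosine(doc_embedding, candidates[phrase]),
        reverse=True,
    )[:nr_candidates]

    def pairwise_similarity(combo):
        return sum(cosine(candidates[a], candidates[b]) for a, b in combinations(combo, 2))

    return list(min(combinations(shortlist, top_n), key=pairwise_similarity))


# Toy vectors standing in for real embeddings
doc_embedding = [1.0, 0.0]
candidates = {
    "learning": [1.0, 0.1],
    "machine learning": [1.0, 0.15],
    "training data": [0.8, 0.6],
    "weather": [0.0, 1.0],
}

selected = max_sum_selection(doc_embedding, candidates, top_n=2, nr_candidates=3)
print(selected)
```

The two near-duplicate phrases are not returned together, which is the diversification the option aims for.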

Releases: MaartenGr/KeyBERT
