Improvements to the Anonymize scanner #2

Merged: 3 commits, Aug 12, 2023
3 changes: 1 addition & 2 deletions .github/workflows/test.yml
@@ -41,10 +41,9 @@ jobs:
 run: |
 python -m pip install --upgrade pip
 python -m pip install -U -r requirements.txt -r requirements-dev.txt
-python -m spacy download en_core_web_lg
 - name: Download spacy dictionary
 run: |
-python -m spacy download en_core_web_lg
+python -m spacy download en_core_web_trf
 - name: Test with pytest
 run: |
 python -m pytest --exitfirst --verbose --failed-first --cov=.
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -10,6 +10,16 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ### Added
 -

+### Fixed
+-
+
+### Changed
+- [Anonymize prompt scanner] Use the transformer-based spaCy model `en_core_web_trf` ([reference](https://microsoft.github.io/presidio/analyzer/nlp_engines/spacy_stanza/))
+- [Anonymize prompt scanner] Support Faker-generated values for applicable entities instead of placeholders (`use_faker` parameter)
+
+### Removed
+-
+
 ## [0.0.3] - 2023-08-10

 ### Added
4 changes: 2 additions & 2 deletions README.md
@@ -17,12 +17,12 @@ jailbreak attacks, LLM-Guard ensures that your interactions with LLMs remain saf

 ## Installation

-Begin your journey with LLM Guard by downloading the package and acquiring the `en_core_web_lg` spaCy model (essential
+Begin your journey with LLM Guard by downloading the package and acquiring the `en_core_web_trf` spaCy model (essential
 for the [Anonymize](./docs/input_scanners/anonymize.md) scanner):

 ```sh
 pip install llm-guard
-python -m spacy download en_core_web_lg
+python -m spacy download en_core_web_trf
 ```

 ## Getting Started
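Reviewer note, not part of this PR: since the install steps now depend on `en_core_web_trf`, a quick pre-flight check could confirm the model is present before the tests run. The sketch below assumes spaCy models are installed as importable Python packages under their model name; `missing_models` is a hypothetical helper, not something that exists in llm-guard.

```python
import importlib.util


def missing_models(names):
    """Return the model package names that cannot be found in this environment.

    Assumption: each spaCy model is installed as a top-level Python package
    whose name matches the model name (for example `en_core_web_trf`).
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for name in missing_models(["en_core_web_trf"]):
        # Print the same download command that the README uses.
        print(f"Missing spaCy model; run: python -m spacy download {name}")
```

Running it as a script prints one hint per missing model and nothing when every model is available.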
4 changes: 2 additions & 2 deletions docs/CONTRIBUTING.md
@@ -24,8 +24,8 @@ cd llm-guard
 pip install -r requirements.txt
 python setup.py install

-# download `en_core_web_lg` from SpaCy
-python -m spacy download en_core_web_lg
+# download the spaCy model
+python -m spacy download en_core_web_trf
 ```

 ## Testing
6 changes: 5 additions & 1 deletion docs/input_scanners/anonymize.md
@@ -18,6 +18,9 @@ When you use the Anonymize scanner, you can talk to LLMs without worrying about
 The scanner uses a tool called the [Presidio Analyzer](https://github.com/microsoft/presidio/) library. This tool, built
 with Python's spaCy, is really good at finding private info in text.

+It uses the transformer-based `en_core_web_trf` model, which has a more modern deep-learning architecture but is generally
+slower than the default `en_core_web_lg` model.
+
 On top of that, the Anonymize scanner can also understand special patterns to catch anything the Presidio Analyzer might
 miss.

@@ -54,5 +57,6 @@ Here's what those options do:
 - `hidden_names` are names we change to something like `[REDACTED_CUSTOM_1]`.
 - You can also choose specific types of info to hide using `entity_types`.
 - If you want, you can use your own patterns by giving the path in `regex_pattern_groups_path`.
+- `use_faker` replaces applicable entities with fake values instead of placeholders.

-- If you want to see the original info again, you can use the [Deanonymizer](../output_scanners/deanonymize.md).
+If you want to see the original info again, you can use the [Deanonymizer](../output_scanners/deanonymize.md).
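Reviewer note, not part of this PR: to illustrate what `use_faker` changes, here is a minimal stdlib-only sketch. It assumes a single e-mail regex in place of the Presidio Analyzer and a deterministic `fake_email` helper in place of Faker. Those two, along with the `anonymize` function and the `[REDACTED_EMAIL_ADDRESS_n]` placeholder name, are hypothetical and do not reflect the scanner's actual implementation.

```python
from __future__ import annotations

import re
from itertools import count

# Assumption: one simple e-mail regex stands in for the Presidio Analyzer.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def fake_email(index: int) -> str:
    """Hypothetical stand-in for a Faker provider; deterministic on purpose."""
    return f"user{index}@example.com"


def anonymize(prompt: str, use_faker: bool = False) -> tuple[str, dict[str, str]]:
    """Replace each distinct e-mail address in the prompt.

    Returns the rewritten prompt and a substitute -> original mapping, which is
    the kind of record a deanonymizing step would need to restore the text.
    """
    vault: dict[str, str] = {}  # substitute -> original
    seen: dict[str, str] = {}   # original -> substitute
    counter = count(1)

    def substitute(match: re.Match) -> str:
        original = match.group(0)
        if original not in seen:
            index = next(counter)
            if use_faker:
                replacement = fake_email(index)
            else:
                replacement = f"[REDACTED_EMAIL_ADDRESS_{index}]"
            seen[original] = replacement
            vault[replacement] = original
        # Repeated occurrences reuse the same substitute.
        return seen[original]

    return EMAIL_RE.sub(substitute, prompt), vault
```

With `use_faker=False` the prompt carries bracketed placeholders; with `use_faker=True` it carries realistic-looking values, so the text sent to the LLM still reads naturally while the mapping keeps the originals recoverable.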
3 changes: 3 additions & 0 deletions docs/output_scanners/sensitive.md
@@ -20,6 +20,9 @@ To combat this, it's vital to integrate data sanitization and adopt strict user
 It takes advantage of the [Presidio Analyzer Engine](https://github.com/microsoft/presidio/). Coupled
 with predefined internal patterns, the tool offers robust scanning capabilities.

+It uses the transformer-based `en_core_web_trf` model, which has a more modern deep-learning architecture but is generally
+slower than the default `en_core_web_lg` model.
+
 When running, the scanner inspects the model's output for specific entity types that may be considered sensitive. If no
 types are chosen, the tool defaults to scanning for all known entity types, offering comprehensive coverage.
