go-plagiarism is the main algorithm used by MediaWatch and is inspired by Efstathios Stamatatos's paper "Plagiarism detection using stopwords n-grams".
We rely only on a small list of stopwords for each language to calculate the plagiarism probability between two texts. Combined with n-grams, this lets us find not only verbatim plagiarism but also paraphrase and patchwork plagiarism. Take a look at the images below to better understand the process.
In the first step, we tokenize the strings and keep only the stopwords (red tokens) for each document, as SourceStopWords and TargetStopWords.
Next, we transform the stopwords of each document into n-grams, with a default of N = 8, and calculate the score from the two sets of n-grams.
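The two steps above can be sketched in a few lines of Go. This is a minimal illustration, not the library's implementation: the helper names (`stopwordsOnly`, `ngrams`, `score`), the tiny stopword set, and the choice to divide by the number of target n-grams are all assumptions made for the example. N is set to 3 so that the short sentences produce n-grams.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stopwordsOnly lowercases the text, splits it on non-letter characters,
// and keeps only the tokens found in the stopword set (hypothetical helper).
func stopwordsOnly(text string, stop map[string]bool) []string {
	tokens := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	var kept []string
	for _, t := range tokens {
		if stop[t] {
			kept = append(kept, t)
		}
	}
	return kept
}

// ngrams joins every run of n consecutive tokens into one string.
func ngrams(tokens []string, n int) []string {
	var out []string
	for i := 0; i+n <= len(tokens); i++ {
		out = append(out, strings.Join(tokens[i:i+n], " "))
	}
	return out
}

// score counts the target n-grams that also occur in the source and divides
// by the number of target n-grams. The denominator is an assumption.
func score(source, target []string) (similar, total int, ratio float64) {
	seen := make(map[string]bool, len(source))
	for _, g := range source {
		seen[g] = true
	}
	for _, g := range target {
		if seen[g] {
			similar++
		}
	}
	total = len(target)
	if total > 0 {
		ratio = float64(similar) / float64(total)
	}
	return similar, total, ratio
}

func main() {
	// Illustrative stopword set, not the library's English list.
	stop := map[string]bool{"the": true, "of": true, "to": true, "a": true, "in": true, "and": true}

	src := stopwordsOnly("The cat sat in the shade of a tree and went to sleep.", stop)
	tgt := stopwordsOnly("The dog lay in the shadow of a wall and went to rest.", stop)

	similar, total, ratio := score(ngrams(src, 3), ngrams(tgt, 3))
	fmt.Println(src) // [the in the of a and to]
	fmt.Printf("similar=%d total=%d score=%.2f\n", similar, total, ratio) // similar=5 total=5 score=1.00
}
```

The two sentences share almost no content words, yet their stopword sequences are identical, so every n-gram matches. This is why the approach can flag paraphrased text. The example output further down (72 similar out of 79 total n-grams, probability 0.91) is consistent with a similar/total ratio.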
In our case (cc MediaWatch) we use this algorithm to create relationships between similar articles and map the process, or the chain of misinformation. Because our scope is to track propaganda networks in the news ecosystem, the algorithm has only been tested in that context.
The Chain of Misinformation
Similarity Network
go get github.com/cvcio/go-plagiarism
To use the detector, provide either the source and target texts, when using DetectWithStrings, or a list of stopwords for each text, when using DetectWithStopWords. You can pass options to the detector to set the language, the n-gram size, or a custom stopword list. After running one of the detection methods, the detector exposes the final score (float64), the number of similar n-grams (int), and the total number of n-grams (int) as its Score, Similar, and Total fields. Although the algorithm is still experimental, you can see it in action, in real time, at app.mediawatch.io, where we continuously monitor Greek news outlets. Read the complete documentation at go-plagiarism.
package main

import (
	"fmt"

	"github.com/cvcio/go-plagiarism"
)
var source = `Plagiarism detection using stopwords n-grams. go-plagiarism is the main algorithm
that utilizes MediaWatch and is inspired by Efstathios Stamatatos paper.
We only rely on a list of stopwords to calculate
the plagiarism probability between two texts, in combination with n-gram
loops that let us find, not only plagiarism but also
paraphrase and patchwork plagiarism. In our case (cc MediaWatch) we
use this algorithm to create relationships between similar articles and
map the process, or the chain of misinformation. As our
scope is to track propaganda networks in the news ecosystem,
this algorithm only tested in such context.`
var target = `We only rely on a list of stopwords to calculate
the plagiarism probability between two texts, in combination with n-gram
loops that let us find, not only plagiarism but also
paraphrase and patchwork plagiarism. In our case (cc MediaWatch) we
use this algorithm to create relationships between similar articles and
map the process, or the chain of misinformation. As our
scope is to track propaganda networks in the news ecosystem,
this algorithm only tested in such context.`
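func main() {
	// Create a detector with the default options.
	detector, err := plagiarism.NewDetector()
	if err != nil {
		panic(err)
	}

	// Compare the two texts; the results are stored on the detector.
	if err := detector.DetectWithStrings(source, target); err != nil {
		panic(err)
	}

	fmt.Printf("Probability: %.2f, Similar n-grams %d, Total n-grams %d\n", detector.Score, detector.Similar, detector.Total)
}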
// > Probability: 0.91, Similar n-grams 72, Total n-grams 79
The detector can be initialized with options: SetN to set the n-gram size, SetLang to set the detector's language and assign the corresponding stopwords, and SetStopWords to assign a custom list of stopwords. Do not use SetLang together with SetStopWords, as one will override the other.
plagiarism.SetN(n int) Option // will set the desired n-gram size
plagiarism.SetLang(lang string) Option // will set the detector's language and assign the default stopwords
plagiarism.SetStopWords(stopWords []string) Option // will set a custom list of stopwords as the default
To use the detector with options, simply pass them during initialization.
// create a detector with an n-gram size of 12 and set the language to Greek
detector, err := plagiarism.NewDetector(plagiarism.SetN(12), plagiarism.SetLang("el"))
// create a detector with default n-gram size (8) and set a custom stopword list
detector, err := plagiarism.NewDetector(plagiarism.SetStopWords([]string{"ο", "του", "η", "της", "αλλά"}))
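The warning about combining SetLang and SetStopWords follows from how functional options usually work in Go: each option mutates the detector in turn, so the last one applied wins. The sketch below is a self-contained illustration of that pattern. The lowercase names (`detector`, `option`, `setN`, `setLang`, `setStopWords`, `newDetector`), the default language, and the shortened stopword lists are hypothetical and are not taken from the library's source.

```go
package main

import "fmt"

// detector is a hypothetical stand-in for the library's detector type.
type detector struct {
	n         int
	lang      string
	stopWords []string
}

// option is a function that mutates the detector during construction.
type option func(*detector)

// defaults holds illustrative, shortened stopword lists per language.
var defaults = map[string][]string{
	"en": {"the", "of", "and"},
	"el": {"ο", "του", "η"},
}

// setN sets the n-gram size.
func setN(n int) option {
	return func(d *detector) { d.n = n }
}

// setLang sets the language and replaces the stopwords with that language's list.
func setLang(lang string) option {
	return func(d *detector) {
		d.lang = lang
		d.stopWords = defaults[lang]
	}
}

// setStopWords replaces the stopwords with a custom list.
func setStopWords(words []string) option {
	return func(d *detector) { d.stopWords = words }
}

// newDetector starts from assumed defaults (N = 8, English) and applies
// the options in the order they were passed.
func newDetector(opts ...option) *detector {
	d := &detector{n: 8, lang: "en", stopWords: defaults["en"]}
	for _, opt := range opts {
		opt(d)
	}
	return d
}

func main() {
	a := newDetector(setN(12), setLang("el"))
	fmt.Println(a.n, a.lang, a.stopWords) // 12 el [ο του η]

	// The custom list is applied first and then overwritten by setLang.
	b := newDetector(setStopWords([]string{"x", "y"}), setLang("el"))
	fmt.Println(b.stopWords) // [ο του η]
}
```

In this sketch, whichever of the two stopword-related options comes last silently replaces the other, which is the kind of surprise the recommendation above is meant to avoid.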
You can find all supported languages in the stopwords.go file. Each supported language uses its ISO 639-1 code as the key (string) and the corresponding stopword list ([]string) as the value.
ISO 639-1 | Language | Tested | Tests |
---|---|---|---|
bg | Bulgarian | Partially Tested | 1 |
de | German | Tested (>10K Articles) | 1 |
el | Greek | Tested (>10M Articles) | 5 |
en | English | Tested (>1M Articles) | 2 |
fi | Finnish | Partially Tested | 1 |
fr | French | Partially Tested | 1 |
hr | Croatian | Partially Tested | 1 |
hu | Hungarian | Partially Tested | 1 |
it | Italian | Tested (>10K Articles) | 1 |
nl | Dutch, Flemish | Partially Tested | 1 |
no | Norwegian | Partially Tested | 1 |
pl | Polish | Partially Tested | 1 |
pt | Portuguese | Partially Tested | 1 |
ro | Romanian | Partially Tested | 1 |
ru | Russian | Tested (>10K Articles) | 1 |
tr | Turkish | Tested (>100K Articles) | 1 |
uk | Ukrainian | Partially Tested | 1 |
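As a rough illustration of the key/value layout described above, the following sketch looks up a stopword list by ISO 639-1 code. The variable and function names (`stopWords`, `forLang`), the shortened lists, and the error returned for an unknown code are assumptions made for the example; see stopwords.go for the actual data and behavior.

```go
package main

import "fmt"

// stopWords maps an ISO 639-1 code to a stopword list.
// The lists here are shortened and purely illustrative.
var stopWords = map[string][]string{
	"en": {"the", "of", "and", "to"},
	"el": {"ο", "του", "η", "της"},
}

// forLang returns the list for a language code, or an error when the
// code is not present in the map (hypothetical helper).
func forLang(code string) ([]string, error) {
	list, ok := stopWords[code]
	if !ok {
		return nil, fmt.Errorf("unsupported language %q", code)
	}
	return list, nil
}

func main() {
	for _, code := range []string{"el", "xx"} {
		list, err := forLang(code)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println(code, len(list), "stopwords")
	}
}
```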
- Include additional test cases for each language
- Include tests with various n-gram sizes
- Introduce a GetSimilar method to retrieve similar passages
go test -v
If you're new to contributing to open source on GitHub, this guide can help you get started. Please check out the contribution guide for more details on how issues and pull requests work. Before contributing, be sure to review the code of conduct.
This library is distributed under the MIT license found in the LICENSE file.