
Plagiarism detection using stopwords n-grams

go-plagiarism is the main algorithm used by MediaWatch and is inspired by Efstathios Stamatatos's paper, Plagiarism detection using stopwords n-grams.

We rely only on a small list of stopwords for each language to calculate the plagiarism probability between two texts. Combined with n-grams, this lets us find not only plagiarism but also paraphrase and patchwork plagiarism. The images below illustrate the process.

In the first step, we tokenize both texts and keep only the stopwords (red tokens) of each document, as SourceStopWords and TargetStopWords.

Plagiarism Detection Algorithm - Function Words - Tokens
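The first step can be sketched in a few lines of Go. This is a minimal illustration, not the library's implementation: the helper name keepStopWords, the letter-based tokenization rule and the tiny stopword set are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// keepStopWords is a hypothetical helper: it lowercases the text, splits it
// on every non-letter rune, and keeps only the tokens found in the stopword
// set, preserving their original order.
func keepStopWords(text string, stopWords map[string]struct{}) []string {
	tokens := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	kept := make([]string, 0, len(tokens))
	for _, tok := range tokens {
		if _, ok := stopWords[tok]; ok {
			kept = append(kept, tok)
		}
	}
	return kept
}

func main() {
	// Illustrative stopword set, not the library's English list.
	stopWords := map[string]struct{}{
		"we": {}, "only": {}, "on": {}, "a": {}, "of": {}, "to": {}, "the": {},
	}
	text := "We only rely on a list of stopwords to calculate the probability."
	fmt.Println(keepStopWords(text, stopWords))
	// prints: [we only on a of to the]
}
```

Only the order and identity of the stopwords survive, which is why the approach is insensitive to changes in content words.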

Next, we transform the stopwords of each document into n-grams (default N = 8) and calculate the score for each set of n-grams.

N-Grams
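The n-gram step can be sketched as a sliding window over each stopword sequence. The sample output in the Usage section (72 similar out of 79 total n-grams, score 0.91) is consistent with a score of similar / total. Exactly how the library counts the total is not stated here, so the sketch below assumes it is the number of source n-grams. The function names ngrams and score are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// ngrams slides a window of size n over the stopword sequence and joins each
// window into a single string key. It returns nil when the sequence is
// shorter than n.
func ngrams(stopWords []string, n int) []string {
	if n <= 0 || len(stopWords) < n {
		return nil
	}
	grams := make([]string, 0, len(stopWords)-n+1)
	for i := 0; i+n <= len(stopWords); i++ {
		grams = append(grams, strings.Join(stopWords[i:i+n], " "))
	}
	return grams
}

// score counts the source n-grams that also occur in the target and divides
// by the number of source n-grams. Normalizing by the source count is an
// assumption of this sketch, not a documented property of the library.
func score(source, target []string) (similar, total int, ratio float64) {
	seen := make(map[string]struct{}, len(target))
	for _, g := range target {
		seen[g] = struct{}{}
	}
	for _, g := range source {
		if _, ok := seen[g]; ok {
			similar++
		}
	}
	total = len(source)
	if total == 0 {
		return 0, 0, 0
	}
	return similar, total, float64(similar) / float64(total)
}

func main() {
	source := []string{"the", "of", "and", "to", "a", "in"}
	target := []string{"of", "and", "to", "a", "in", "the"}
	// N = 3 keeps the example short; the detector's default is 8.
	similar, total, ratio := score(ngrams(source, 3), ngrams(target, 3))
	fmt.Printf("%d/%d = %.2f\n", similar, total, ratio)
	// prints: 3/4 = 0.75
}
```

A larger N demands longer matching runs of stopwords, so it reduces accidental matches at the cost of missing short copied passages.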

In our case (cc MediaWatch) we use this algorithm to create relationships between similar articles and map the process, or the chain of misinformation. As our scope is to track propaganda networks in the news ecosystem, the algorithm has only been tested in that context.

The Chain of Misinformation

Similarity Network

Usage

go get github.com/cvcio/go-plagiarism

To use the detector, provide either the source and target texts (with DetectWithStrings) or a list of stopwords for each text (with DetectWithStopWords). You can pass options to the detector to set the language, the n-gram size or a custom stopword list. After one of the detection methods runs, the detector exposes the final score (Score, float64), the number of similar n-grams (Similar, int) and the total number of n-grams (Total, int).

Although the algorithm is still experimental, you can see it in action, in real time, at app.mediawatch.io, where we continuously monitor Greek news outlets. Read the complete documentation at go-plagiarism.

package main

import (
    "fmt"

    "github.com/cvcio/go-plagiarism"
)

var source = `Plagiarism detection using stopwords n-grams. go-plagiarism is the main algorithm 
that utilizes MediaWatch and is inspired by Efstathios Stamatatos paper. 
We only rely on a list of stopwords to calculate 
the plagiarism probability between two texts, in combination with n-gram 
loops that let us find, not only plagiarism but also 
paraphrase and patchwork plagiarism. In our case (cc MediaWatch) we 
use this algorithm to create relationships between similar articles and 
map the process, or the chain of misinformation. As our 
scope is to track propaganda networks in the news ecosystem, 
this algorithm only tested in such context.`

var target = `We only rely on a list of stopwords to calculate 
the plagiarism probability between two texts, in combination with n-gram 
loops that let us find, not only plagiarism but also 
paraphrase and patchwork plagiarism. In our case (cc MediaWatch) we 
use this algorithm to create relationships between similar articles and 
map the process, or the chain of misinformation. As our 
scope is to track propaganda networks in the news ecosystem, 
this algorithm only tested in such context.`

func main() {
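    // create a detector with the default options and check the error
    detector, err := plagiarism.NewDetector()
    if err != nil {
        panic(err)
    }

    // compare the two texts; the results are stored on the detector
    if err := detector.DetectWithStrings(source, target); err != nil {
        panic(err)
    }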

    fmt.Printf("Probability: %.2f, Similar n-grams %d, Total n-grams %d\n", detector.Score, detector.Similar, detector.Total)
}

// > Probability: 0.91, Similar n-grams 72, Total n-grams 79

Options

The detector can be initialized with options: SetN sets the n-gram size, SetLang sets the detector's language and assigns the corresponding stopwords, and SetStopWords assigns a custom list of stopwords. Do not use SetLang together with SetStopWords, as one will override the other.

plagiarism.SetN(n int) Option // will set the desired n-gram size
plagiarism.SetLang(lang string) Option // will set the detector's language and assign the default stopwords
plagiarism.SetStopWords(stopWords []string) Option // will set a custom list of stopwords as the default

To use the detector with options, simply pass them during initialization.

// create a detector with an n-gram size of 12 and the language set to Greek
detector, err := plagiarism.NewDetector(plagiarism.SetN(12), plagiarism.SetLang("el"))
// create a detector with the default n-gram size (8) and a custom stopword list
custom, err := plagiarism.NewDetector(plagiarism.SetStopWords([]string{"ο", "του", "η", "της", "αλλά"}))
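The warning about combining SetLang and SetStopWords follows from how functional options generally work: each option mutates the same settings in the order it is passed, so the last one wins. The sketch below illustrates the pattern with hypothetical names (config, option, withN, withLang, withStopWords, newConfig) and an assumed default language of "en". It is not the library's source.

```go
package main

import "fmt"

// config is a hypothetical stand-in for the detector's settings.
type config struct {
	n         int
	lang      string
	stopWords []string
}

// option mutates a config; options are applied in the order they are passed.
type option func(*config)

// defaults maps a language code to a tiny illustrative stopword list.
var defaults = map[string][]string{
	"en": {"the", "of", "and"},
	"el": {"ο", "του", "η"},
}

func withN(n int) option { return func(c *config) { c.n = n } }

// withLang sets the language and replaces the stopwords with its defaults.
func withLang(lang string) option {
	return func(c *config) {
		c.lang = lang
		c.stopWords = defaults[lang]
	}
}

// withStopWords replaces the stopwords with a custom list.
func withStopWords(words []string) option {
	return func(c *config) { c.stopWords = words }
}

// newConfig starts from defaults (N = 8, assumed language "en") and applies
// each option in turn.
func newConfig(opts ...option) *config {
	c := &config{n: 8, lang: "en", stopWords: defaults["en"]}
	for _, opt := range opts {
		opt(c)
	}
	return c
}

func main() {
	a := newConfig(withStopWords([]string{"αλλά"}), withLang("el"))
	fmt.Println(a.n, a.stopWords) // prints: 8 [ο του η]

	b := newConfig(withLang("el"), withStopWords([]string{"αλλά"}))
	fmt.Println(b.n, b.stopWords) // prints: 8 [αλλά]
}
```

In this sketch both options write to the same field, so whichever comes last silently discards the other, which is the behavior the warning guards against.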

Supported Languages

You can find all supported languages in the stopwords.go file. Each supported language uses its ISO 639-1 code as a key (string) and the corresponding stopwords list ([]string) as a value.

| ISO 639-1 | Language | Tested | Tests |
| --- | --- | --- | --- |
| bg | Bulgarian | Partially Tested | 1 |
| de | German | Tested (>10K Articles) | 1 |
| el | Greek | Tested (>10M Articles) | 5 |
| en | English | Tested (>1M Articles) | 2 |
| fi | Finnish | Partially Tested | 1 |
| fr | French | Partially Tested | 1 |
| hr | Croatian | Partially Tested | 1 |
| hu | Hungarian | Partially Tested | 1 |
| it | Italian | Tested (>10K Articles) | 1 |
| nl | Dutch, Flemish | Partially Tested | 1 |
| no | Norwegian | Partially Tested | 1 |
| pl | Polish | Partially Tested | 1 |
| pt | Portuguese | Partially Tested | 1 |
| ro | Romanian | Partially Tested | 1 |
| ru | Russian | Tested (>10K Articles) | 1 |
| tr | Turkish | Tested (>100K Articles) | 1 |
| uk | Ukrainian | Partially Tested | 1 |

TODO List

  • Include additional test cases for each language
  • Include tests with various n-gram sizes
  • Introduce a GetSimilar method to retrieve similar passages

Test Coverage

go test -v

Contributing

If you're new to contributing to Open Source on GitHub, this guide can help you get started. Please check out the contribution guide for more details on how issues and pull requests work. Before contributing, be sure to review the code of conduct.

License

This library is distributed under the MIT license found in the LICENSE file.