See our paper (accepted at ICPC 2022): https://arxiv.org/pdf/2204.01028.pdf
You can use the provided Docker image to avoid setting up the environment dependencies yourself.
Link: https://drive.google.com/file/d/17zsCf-5FnKbE1iPw6Ca4onW5ckQX69eQ/view?usp=sharing
MSCCD is in /root/MSCCD
Remember to update MSCCD to the latest version with `git pull`.
We have tested MSCCD on Ubuntu 18.04 LTS and macOS Monterey.
MSCCD mainly depends on the following environment:
- Python v3.6.9
- Java 11 (any version newer than Java 9 works; remember to set the version by editing modules/msccd_tokenizers/pom.xml when using a different one)
- Maven v3.8.5
- jinja2 (pip3)
- ujson (pip3)
We added some interfaces and methods to ANTLR 4.8 and packaged them as a .jar file for MSCCD. Please install the provided antlr-4.8-modified.jar into your local Maven repository:
```
mvn install:install-file -Dfile=./lib/antlr-4.8-modified.jar -DgroupId=org.nagoya_u.ertl.sa -DartifactId=antlr-v4.8-modified -Dversion=4.8 -Dpackaging=jar
```
First, edit ./parserConfig.json:
- parser: The path of the grammar folder, which contains the .g4 files and sometimes Java programs.
- grammarName: The grammar name defined in the .g4 file. It can also be found in pom.xml (for grammars from grammars-v4).
- startSymbol: The start rule of the grammar; it can be found in pom.xml or the .g4 file.
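For illustration, a parserConfig.json for a Java grammar taken from grammars-v4 might look like the following. The folder path and rule names below are assumptions for this example; check the grammar's own pom.xml and .g4 file for the real values.

```json
{
  "parser": "./grammars/java/",
  "grammarName": "Java",
  "startSymbol": "compilationUnit"
}
```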
Then, generate the tokenizer by:
```
python3 tokenizerGeneration.py
```
The tool is configured via config.json. Here are the items:
- inputProject: A list of paths; each path represents a project you want to analyze.
- keywordsList: The path of the keyword list file.
- languageExtensionName: A list of the extension names of the target language.
- minTokens: The minimum size of the token bag in clone detection.
- minTokensForBagGeneration: The minimum size of the token bag in tokenization. A smaller value provides a larger range of token-bag sizes in clone detection; a bigger one makes the tokenizer faster when you do not need small bags.
- detectionThreshold: The similarity threshold, a number in the range (0, 1). If the overlap similarity of a code pair is higher than the threshold, the pair is reported as a clone. A higher threshold increases precision but reduces recall, and vice versa.
- maxRound: The max granularity value to detect.
- tokenizer: The name of the generated tokenizer; it is the same as grammarName in parserConfig.json.
- threadNum_tokenizer: The number of threads used for tokenization.
- threadNum_detection: The number of threads used for clone detection.
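Putting these items together, a hypothetical config.json for detecting clones in a single Java project could look like this. Every path and number below is illustrative, not a shipped or recommended default.

```json
{
  "inputProject": ["/path/to/projectA"],
  "keywordsList": "./keywords/java.txt",
  "languageExtensionName": [".java"],
  "minTokens": 50,
  "minTokensForBagGeneration": 25,
  "detectionThreshold": 0.7,
  "maxRound": 5,
  "tokenizer": "Java",
  "threadNum_tokenizer": 8,
  "threadNum_detection": 8
}
```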
Users often need to run several detections on the same project, so MSCCD saves the necessary data in a task object to save time on subsequent executions.
In this part, the tool is executed by generating a new task from the configuration file.
1. Edit the config.json file, and check the grammar file, the keyword list file, and your input files.
2. Run `python3 controller.py` and wait for the result.
3. Check the output in tasks/task[taskId]/. For each execution, a folder named detection* holds the result files.
In this part, the tool is executed from a previously generated task. The detection granularity (required) and threshold (optional) can easily be changed from the command line.
Run it with `python3 controller.py [taskId] ([statementThreshold])`.
For example, `python3 controller.py 1` executes from tasks/task1; `python3 controller.py 2 0.9` executes from tasks/task2 and sets detectionThreshold to 0.9.
For each task, all the data is saved in the tasks/task* folder, including the configuration, the file list, and the token bags. Here is a description:
file | description |
---|---|
fileList.txt | Each line represents a source file, formatted as (projectId, filePath). The index of each file within each project is defined as fileId. |
tokenBags | Each line represents a token bag and uses '@@' to separate the data fields: projectId @@ fileId @@ bagId @@ granularity value @@ number of keywords @@ symbol number @@ token number @@ start line--end line in the original file @@ tokens (token text :: frequency) |
taskData.obj | Configurations |
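As a sketch of how a tokenBags line can be consumed in a post-processing script: the snippet below assumes '@@' as the field separator, '--' for the line range, and '::' between token text and frequency, as described above; the ',' between tokens and the exact spacing are assumptions, and the sample line is invented.

```python
# Parse one line of the tokenBags file (field layout as described in the
# table above; the ',' token separator is an assumption).
def parse_token_bag(line):
    fields = line.rstrip("\n").split("@@")
    (project_id, file_id, bag_id, granularity,
     n_keywords, n_symbols, n_tokens) = map(int, fields[:7])
    start, end = (int(x) for x in fields[7].split("--"))
    tokens = {}
    for pair in fields[8].split(","):
        text, freq = pair.rsplit("::", 1)
        tokens[text] = int(freq)
    return {
        "projectId": project_id, "fileId": file_id, "bagId": bag_id,
        "granularity": granularity, "keywords": n_keywords,
        "symbols": n_symbols, "tokenNum": n_tokens,
        "lines": (start, end), "tokens": tokens,
    }

# Invented sample line for illustration only.
sample = "0@@12@@3@@1@@4@@10@@25@@10--42@@if::2,while::1,x::5"
bag = parse_token_bag(sample)
print(bag["lines"], bag["tokens"])
```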
The results of each detection are saved in the tasks/task*/detection* folder.
file | description |
---|---|
pairs.file | Reported clone pairs, each in the form [[projectId,fileId,bagId],[projectId,fileId,bagId]] |
info.obj | Execution times... |
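A small sketch of loading pairs.file, assuming each line is a list literal of the form [[projectId,fileId,bagId],[projectId,fileId,bagId]] as described above (the exact on-disk encoding is an assumption based on that description):

```python
import ast
import os
import tempfile

# Read clone pairs, one [[p,f,b],[p,f,b]] literal per line (assumed format).
def load_pairs(path):
    pairs = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line:
                left, right = ast.literal_eval(line)
                pairs.append((tuple(left), tuple(right)))
    return pairs

# Demonstrate on an invented two-line pairs.file.
with tempfile.NamedTemporaryFile("w", suffix=".file", delete=False) as tmp:
    tmp.write("[[0, 1, 2], [0, 3, 4]]\n[[1, 0, 0], [1, 5, 2]]\n")
pairs = load_pairs(tmp.name)
os.unlink(tmp.name)
print(pairs)
```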
- scripts/blockPairOutput.py: generates an output file in CSV format: [file1Path,startLine,endLine,file2Path,startLine,endLine]
  - Usage: `python3 scripts/blockPairOutput.py taskId detectionId outputFile`
- scripts/filePairOutput.py: generates an output file in CSV format: [file1Path,file2Path]
  - Useful when MSCCD is executed as a file-level clone detector (i.e., when maxRound in config.json is set to 1 or 0).
  - Usage: `python3 scripts/filePairOutput.py taskId detectionId outputFile`
- Speed up
- Analysis scripts to make the detection results easier to read and use