# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [1.0.2] - 2021-06-20 (Marcel Jerzyk)

- Added the research paper "Evaluating the Potential of a Candidate for a Job Offer Based on his GitHub Profile" in `.pdf` format (here).
- Updated `README.md` with information about the `ReasearchPaper.pdf`.
- Removed the old `LANG.md` file from `./docs`.

## [1.0.1] - 2021-06-20 (Marcel Jerzyk)

- Added one more idea to the Further Research section in `conclusions.tex`.
- Added the Report #2 file (`report_2_for_group_m2.tex`) to the `archive` directory.
- Added missing paper images to the repository.

## [1.0.0] - 2021-06-20 (Marcel Jerzyk)

Final release.

- Applied some final formatting to the paper, such as page breaks where they make the article look better.
- Added self-defined formatting for code listings so they look better and properly highlight keywords, comments and strings.
- Updated `abstract.tex` with the results.
- Added the final sample size used in the algorithms (`creating_ml_model.tex`).
- Added an explanation of what the `R` script does to `data_collection.tex`.
- Removed an unused reference.
- Moved older versions of sections to the `paper/misc/archive` directory.

## [0.11.2] - 2021-06-17 (Marcel Jerzyk)

- For privacy, removed the GitHub links and usernames from the data files.

## [0.11.1] - 2021-06-17 (Marcel Jerzyk)

- Changed the names of the files to be less ambiguous.
- Rebased and updated the changelog entries.
- Removed the empty `reproduction` directory.
- Removed the token from `repo_data.r`.

## [0.11.0] - 2021-06-17 (Jakub Litkowski)

- Added a new `extract_data_from_git.r` script.
- Added a new `cleaned_data_with_ql.csv` file.

## [0.10.1] - 2021-06-17 (Marcel Jerzyk)

- Moved `model_script.r` from `./reproduction` into the `./src/gitprofiler/r_scripts/` directory.
- Moved & renamed `model_data_no_labels.csv` from `./reproduction` into the `./data/` directory.
- Rebased, updated the changelog entries, and tweaked & updated `./README.md` regarding reproduction.

## [0.10.0] - 2021-06-17 (Jakub Szałca)
- Added `model_script.R` for reproduction purposes.
- Added `modelDataNoLabels.csv` for reproduction purposes.

## [0.9.0] - 2021-06-17 (Marcel Jerzyk)

- `conclusions.tex` section written & translated.
- `discussion.tex` section written & translated.
- `results.tex` section written & translated.
- Conclusion written in the `abstract.tex` section.
- Added `url` formatting in the paper.
- Cropped the image in the systematic review section and adjusted its width.

## [0.8.1b] - 2021-06-15 (Marcel Jerzyk & Jakub Litkowski)

- `README.md` updated with reproduction instructions for groups `M2`, `M3` & `M4`.

## [0.8.1a] - 2021-06-13 (Marcel Jerzyk)

- Added `pandas`, `numpy`, `requests` and `gitpython` as required packages in `requirements.txt`.

## [0.8.1] - 2021-06-13 (Marcel Jerzyk)

- In `README.md`: added information about the paper directory.
- Moved `/literature/` under the `./data` directory.
- Fixed spelling in `CHANGELOG.md`.

## [0.8.0] - 2021-06-13 (Marcel Jerzyk)

Further work on the paper:

- Created the Creating Machine Learning Model section.
- First results are available in the Results section.
- The whole paper was re-checked with a grammar checker and adjusted accordingly.
- The whole paper got a "lifting" via improved formatting.
- Sections got labels so they can be cross-referenced for an improved reading experience.
- Improved the language and added content in the Data Processing section.
  - Among others, details regarding the manual pre-processing of the questionnaire.
- Added a Data section to `README.md`.

## [0.7.2] - 2021-06-13 (Marcel Jerzyk)

- Added the used literature as `.pdf` files in the `./data/literature/` directory.

## [0.7.1] - 2021-06-04 (Marcel Jerzyk)

- The Literature Review was retranslated into formal "paper" English.

## [0.7.0] - 2021-06-02 (Marcel Jerzyk)
- New script: `.../merge_jsons.py` was added to the project.
  - This script takes on input: `str: github_username`.
  - Output is stored (by default) in:
  - The purpose of this script is to merge all of the `.json` files that are generated via the `.../scan_repositories.py` script into one `.json` file that contains the merged information from all of them in a "gather" mode. This means that when information fetched for a given language-linter combo is also present in some other file, the results are summed up (see the sketch below). This allows for even easier usage of the Mega Linter information in the ML model, because the processing work is already done.
  - The script also takes care of the different kinds of `.json` files, as some of them are linter-related while others contain aggregated results.
  - The structure of the file is as follows:

    ```json
    {
      "<lang>": {
        "<linter>": { "errors": int, "files": int, "fixed": int },
        "total": {
          "clones": int,
          "duplicate_lines_num": int,
          "duplicate_tokens_num": int,
          "files": int,
          "lines": int,
          "tokens": int
        }
      }
    }
    ```
  - There is a special case for `<lang>`: the key `"Total:"` contains all the aggregated results.
  - The `"total"` subkey doesn't always have to exist; for example, it does not for the `<lang>` keys `cspell`, `xml`, `yaml`.
  - Example final output:

    ```json
    {
      "Total:": {
        "total": {
          "clones": 163,
          "duplicate_lines_num": 6061,
          "duplicate_tokens_num": 52202,
          "files": 428,
          "lines": 44590,
          "tokens": 473152
        }
      },
      "java": {
        "checkstyle": { "errors": 108, "files": 109, "fixed": 0 },
        "total": {
          "clones": 56,
          "duplicate_lines_num": 948,
          "duplicate_tokens_num": 10656,
          "files": 106,
          "lines": 11093,
          "tokens": 106282
        }
      },
      "python": {
        "black": { "errors": 59, "files": 62, "fixed": 0 },
        "flake8": { "errors": 2438, "files": 62, "fixed": 0 },
        "isort": { "errors": 37, "files": 62, "fixed": 0 },
        "pylint": { "errors": 2, "files": 62, "fixed": 0 },
        "total": {
          "clones": 40,
          "duplicate_lines_num": 798,
          "duplicate_tokens_num": 8403,
          "files": 58,
          "lines": 6111,
          "tokens": 57864
        }
      },
      "spell": {
        "cspell": { "errors": 9555, "files": 640, "fixed": 0 },
        "misspell": { "errors": 18, "files": 640, "fixed": 0 }
      }
    }
    ```
- Added the `paper` directory.
  - It contains the LaTeX paper, which got a complete overhaul:
    - Redundant text and formatting were removed.
    - Pointless comments were removed and the existing comments were standardized.
    - Moved sections of the paper to separate files.
    - Adjusted heading/title sections.
    - Resolved a few errors and warnings.
  - Created the `/img/` directory for images.
  - Created the `/misc/` folder for all the stuff that is not tightly related to the paper or its structure.
- The `.../scan_repositories.py` script now fetches the repositories of a given user automatically (there is no need to provide a separate repositories list in order to fetch them).
  - New input: `str: github_username`.

## [0.6.0] - 2021-05-25 (Marcel Jerzyk)

- New script: `.../scan_repositories.py` was added to the project.
  - This script takes on input: `str: github username`, `list[str]: github repositories names`.
  - It automates the repository cloning, Mega Linter linting and Mega Linter Scraper scraping routine (see the sketch below). It does exactly what the intuition might suggest:
    - first it starts `git clone` on every single repository name given on input,
    - then it uses Mega Linter to parse the repositories' contents and generate an output file,
    - and at the end it uses the previously made `.../scraper.py` to parse the file contents into a machine-readable `.json` format.
  - The output results are stored in `/data/repositories/<username>` _(the directory will be automatically created if it's not present yet)_.
  - The script will also try to clean the repositories directory of directories generated by earlier launches.
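  - A minimal sketch of that routine (not the actual script; the repository URL scheme, the Mega Linter Docker image name and mount point, and the scraper invocation are all assumptions):

    ```python
    import subprocess
    from pathlib import Path

    def scan_user(username: str, repositories: list[str]) -> None:
        """Clone each repository, lint it with Mega Linter, scrape the log."""
        out_dir = Path("data/repositories") / username
        out_dir.mkdir(parents=True, exist_ok=True)  # created if not present yet
        for repo in repositories:
            workdir = out_dir / repo
            # 1. Clone the repository (assumed URL scheme).
            subprocess.run(
                ["git", "clone", f"https://github.com/{username}/{repo}.git", str(workdir)],
                check=True,
            )
            # 2. Run Mega Linter over it, redirecting the log into a text file
            #    (the image name and the /tmp/lint mount point are assumptions).
            with open(out_dir / f"{repo}.txt", "w") as log:
                subprocess.run(
                    ["docker", "run", "--rm",
                     "-v", f"{workdir.resolve()}:/tmp/lint",
                     "nvuillam/mega-linter:v4"],
                    stdout=log,
                    check=True,
                )
            # 3. Parse the log into machine-readable `.json` with the scraper.
            subprocess.run(
                ["python", "scraper.py", str(out_dir / f"{repo}.txt")],
                check=True,
            )
    ```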

## [0.5.0] - 2021-05-11 (Marcel Jerzyk)

- New directory: `./docs`, containing various files regarding the technical side of the project, as well as images used in the markdown files.
- LaTeX document changes tracker: `LANGv1.md`. It has the previous versions of sections and subsections:
  - Systematic Review
  - Research Questions
  - Resources to Be Searched
  - Results Selection Process
  - Data Collection
  - Data Pre-processing
- Moved directory: `./img/readme/*` inside `./docs` (now: `./docs/img/readme/*`).

## [0.4.0] - 2021-05-11 (Marcel Jerzyk)
This changelog entry will be filled in a few days.

## [0.3.1] - 2021-05-08 (Marcel Jerzyk)

- Logged in this file the Version History for `0.3.1`.
- Logged in this file the Version History for `0.3.0`.
- Logged in this file the Version History for `0.2.1`.
- Logged in this file the Version History for `0.2.0`.
- Logged in this file the Version History for `0.1.0`.
- Logged in this file the Version History for `0.0.1`.
- Updated the version number in `README.md`.

## [0.3.0] - 2021-05-08 (Marcel Jerzyk)

- New script: `scrape.py`
  - The script takes on input an output file, which can be generated via Mega Linter by redirecting the standard output stream into a text file (`> output.txt`).
  - The script parses the log data and scrapes the duplicates table information into a `dictionary`:

    ```python
    {
        "language": str,
        "files": int,    # number of files detected in a given language by the linter
        "lines": int,    # number of lines detected in a given language
        "tokens": int,   # number of tokens ("chars") detected in a given language
        "clones": int,
        "duplicate_lines_num": int,
        "duplicate_lines_percent": float,
        "duplicate_tokens_num": int,
        "duplicate_tokens_percent": float,
    },
    ```
  - The script parses the log data and scrapes the summary table information into a `dictionary`:

    ```python
    {
        "language": str,
        "linter": str,
        "files": int or str,  # number of files detected in a given language by the linter
        "fixed": int,         # number of errors fixed automatically by the linter
        "errors": int,        # number of errors that could not be fixed by the linter
    },
    ```
  - All the available information is properly parsed and saved as an `output.json` file that contains a list of the previously mentioned dictionaries (a parsing sketch follows below).
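  - A minimal sketch of such scraping (a hypothetical illustration, not the actual `scrape.py` code; the assumed table row format `| language | linter | files | fixed | errors |` may differ from the real Mega Linter log):

    ```python
    import json
    import re

    # Assumed shape of one summary-table row in the redirected log output.
    ROW = re.compile(
        r"\|\s*(?P<language>\w+)\s*\|\s*(?P<linter>[\w-]+)\s*"
        r"\|\s*(?P<files>\d+)\s*\|\s*(?P<fixed>\d+)\s*\|\s*(?P<errors>\d+)\s*\|"
    )

    def scrape_summary(log_text: str) -> list:
        """Collect every summary-table row into the dictionary shape shown above."""
        return [
            {
                "language": m["language"],
                "linter": m["linter"],
                "files": int(m["files"]),
                "fixed": int(m["fixed"]),
                "errors": int(m["errors"]),
            }
            for m in ROW.finditer(log_text)
        ]

    with open("output.txt") as f:
        results = scrape_summary(f.read())
    with open("output.json", "w") as f:
        json.dump(results, f, indent=2)
    ```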
- New file: `CHANGELOG.md`
  - This file serves as a diary of the progress of the programming side of this project.
- Added a new `README.md` entry about the new script file. It contains information about the requirements needed in order to run the script, as well as the run process itself with the expected output data.

## [0.2.3] - 2021-05-04 (Marcel Jerzyk)

- `README.md`:
  - Tutorial on how to install the Docker environment.
  - Tutorial on how to run Mega Linter locally on one's own repository.

## [0.2.2] - 2021-04-26 (Jakub Litkowski)

- Fixed: `value is missing where true / false is required`
- Fixed: `arguments suggest a different number of lines: 1, 5, 0`

## [0.2.1] - 2021-04-26 (Marcel Jerzyk)

- `README.md`:
  - Added information & created `.gif` files for group M2 that should work as an exhaustive instruction on how to use R Studio.
    - Tutorial: How to generate one's own GitHub Token.
    - Instruction: Installing R Studio.
    - Navigating in R Studio.
  - Added a tutorial & created `.gif` files that contain exhaustive information on how to use Mega Linter through GitHub Actions.
    - Information about the CI file and what contents it should have (including a snippet for easy copy-paste).
    - Step-by-step actions to trigger the CI/CD pipeline on GitHub.com on one's own repository with a Mega Linter job.

## [0.2.0] - 2021-04-20 (Jakub Litkowski)

- GraphQL query created.
- Scraped information from the query into variables:
  - Bio:

    ```r
    repositoriesNames <- json$data$repositoryOwner$repositories$edges$node$name
    bio <- json$data$repositoryOwner$bio
    isHireable <- json$data$repositoryOwner$isHireable
    emptyRepos <- json$data$repositoryOwner$repositories$edges$node$isEmpty
    commitMsgE <- json$data$repositoryOwner$repositories$edges$node$defaultBranchRef$target$history$edges
    ```
  - Commit Messages:

    ```r
    commitMSGList <- list()
    commitDates <- list()
    for (d in commitMsgE) {
      for (i in 1:length(d$node$author$user$login)) {
        if (d$node$author$user$login[i] == "Luzkan") {
          commitDates <- c(commitDates, d$node$committedDate[i])
        }
      }
    }
    for (p in commitMsgE) {
      for (i in 1:length(p$node$author$user$login)) {
        if (p$node$author$user$login[i] == "Luzkan") {
          commitMSGList <- c(commitMSGList, p$node$message[i])
        }
      }
    }
    ```
  - Used Languages:

    ```r
    languages <- json$data$repositoryOwner$repositories$edges$node$languages$edges
    languagesList <- list()
    for (l in languages) {
      print(l)
      languagesList <- c(languagesList, l$node$name)
    }
    ```
- Calculating the time between commits:

  ```r
  time_between_commits <- list()
  for (idx in seq_along(commitDates)) {
    if (idx + 1 > length(commitDates)) {
      break
    }
    dateOne <- as.POSIXct(commitDates[[idx]], format = "%Y-%m-%dT%H:%M:%SZ")
    dateTwo <- as.POSIXct(commitDates[[idx + 1]], format = "%Y-%m-%dT%H:%M:%SZ")
    time_between_commits[[idx]] <- as.numeric(difftime(dateOne, dateTwo, units = "mins"))
  }
  average_time_between_commit <- mean(unlist(time_between_commits))
  ```

## [0.1.0] - 2021-03-16 (Marcel Jerzyk)

- Created `README.md` for the project; it contains various useful information, requirements and instructions in order to run the program.
- Created the initial file structure.
- `github_graphql.r` file:
  - Added the imports that are required for GraphQL query creation:

    ```r
    library("ghql")
    library("jsonlite")
    library("dplyr")
    ```
  - GraphQL Connection Object:

    ```r
    # GraphQL Connection Object (GitHub)
    connection <- GraphqlClient$new(
      url = "https://api.github.com/graphql",
      headers = list(Authorization = paste0("Bearer ", token))
    )
    ```
  - Informative example GraphQL query:

    ```r
    new_query$query('mydata', '{
      repositoryOwner(login: "Luzkan") {
        repositories(first: 5, orderBy: {field: PUSHED_AT, direction: DESC}, isFork: false) {
          edges {
            node {
              name
              stargazers {
                totalCount
              }
            }
          }
        }
      }
    }')
    ```
  - Execution, parsing & writing to `.json` output:

    ```r
    # Execute Query
    (result <- connection$exec(new_query$queries$mydata))

    # Parse to a more human-readable form
    jsonlite::fromJSON(result)

    # Writing to file
    write(result, "output.json")
    ```

## [0.0.1] - 2021-03-01 (Lech Madeyski)
Project was initialized.