diff --git a/README.md b/README.md index 86259a6..99d1414 100644 --- a/README.md +++ b/README.md @@ -1,30 +1,31 @@ + # AutoFL [![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0) [![DOI](https://zenodo.org/badge/644095707.svg)](https://zenodo.org/doi/10.5281/zenodo.10255367) -[![Docker](https://img.shields.io/badge/Docker-blue.svg)](https://img.shields.io/badge/Docker-blue) +[![Docker](https://img.shields.io/badge/Docker-blue.svg)](https://hub.docker.com/r/cezarsas/autofl/) + +Automatic source code file annotation using weak labeling. -Automatic source code file annotation using weak labelling. +## Overview + +AutoFL is a tool designed for automatic annotation of source code files through weak labeling techniques. It provides both an API and a web-based UI for easy analysis of projects across different languages. ## Setup -Clone the repository and the UI submodule [autofl-ui](https://github.com/SasCezar/autofl-ui) by running the following -command: +To set up the repository along with its UI submodule, clone it using: ```bash git clone --recursive git@github.com:SasCezar/AutoFL.git AutoFL ``` -### Optional Setup +### Optional Model Setup -To make use of certain feature like semantic based labelling functions, you need to download the model. -For example, for **w2v-so**, you can download the model from [here](https://github.com/vefstathiou/SO_word2vec), and -place it in the [data/models/w2v-so](data/models/w2v-so) folder, or a custom -path that you can use in the configs. +For advanced features like semantic-based labeling, download models as required. For example, to use **w2v-so**, download the model from [here](https://github.com/vefstathiou/SO_word2vec) and place it in the `data/models/w2v-so` folder. Alternatively, you can provide a custom path in the configuration files. 
## Usage -Run docker compose in the project folder (where the [docker-compose.yaml](docker-compose.yaml) is located) by executing: +To run the tool using Docker, navigate to the project directory (where the `docker-compose.yaml` file is located) and execute: ```shell docker compose up @@ -32,57 +33,45 @@ docker compose up ### API Endpoint -You can analyze the files of project by making a request to the endpoint: +To analyze the files of a project, make a POST request to the following endpoint: ```shell -curl -X POST -d '{"name": "", "remote": "", "languages": [""]}' localhost:8000/label/files -H "content-type: application/json" +curl -X POST -d '{"name": "", "remote": "", "languages": [""]}' localhost:8000/label/files -H "content-type: application/json" ``` -For example, to analyze the files -of [https://github.com/mickleness/pumpernickel](https://github.com/mickleness/pumpernickel), you can make the following -request: +For instance, to analyze the project at [https://github.com/mickleness/pumpernickel](https://github.com/mickleness/pumpernickel), use: ```shell -curl -X POST -d '{"name": "pumpernickel", "remote": "https://github.com/mickleness/pumpernickel", "languages": ["java"]}' localhost:8000/label/files -H "content-type: application/json" +curl -X POST -d '{"name": "pumpernickel", "remote": "https://github.com/mickleness/pumpernickel", "languages": ["java"]}' localhost:8000/label/files -H "content-type: application/json" ``` -### UI +### Web UI -The tool also offers a web UI that is available at the following page (when running locally): -[http://localhost:8501](http://localhost:8501) +AutoFL provides a web-based UI accessible locally at [http://localhost:8501](http://localhost:8501): ![UI](resources/ui-screenshots/landing-page.png) -For more details, check the [UI repo](https://github.com/SasCezar/autofl-ui). 
- -[//]: # (For more details, check the [UI repo](https://github.com/SasCezar/autofl-ui)) +For more details, check the [UI repository](https://github.com/SasCezar/autofl-ui). ## Configuration -AutoFL uses [Hydra](https://hydra.cc/) to manage the configuration. The configuration files are located in -the [config](config) folder. -The main configuration file is [main.yaml](./config/main.yaml), which contains the following options: +AutoFL uses [Hydra](https://hydra.cc/) to manage configurations. The configuration files can be found in the `config` folder. The main configuration file, `main.yaml`, allows you to customize various options: -- **local**: which environment to use, either local or docker. [Docker](./config/local/docker.yaml) is default. -- **taxonomy**: which taxonomy to use. Currently only [gitranking](./config/taxonomy/gitranking.yaml) is supported, but - custom taxonomies can be added. -- **annotator**: which annotators to use. Default is [simple](./config/annotator/simple.yaml), which allows good results - without extra dependencies on language models. -- **version_strategy**: which version strategy to use. Default is [latest](./config/version_strategy/latest.yaml), which - will only analyze the latest version of the project. -- **dataloader**: which dataloader to use. Default is [postgres](./config/dataloader/postgres.yaml) which allows the API - to fetch already analysed projects. -- **writer**: which writer to use. Default is [postgres](./config/writer/postgres.yaml) which allows the API to store - the results in a database. +- **local**: Choose between local or Docker environments. [Docker](config/environment/docker.yaml) is the default. +- **taxonomy**: Set the taxonomy for labeling. Currently supports [gitranking](./config/taxonomy/gitranking.yaml). You can add custom taxonomies. +- **annotator**: Specify the annotators to use. The default is [simple](./config/annotator/simple.yaml), offering good results without dependencies on language models. 
+- **version_strategy**: Select the versioning strategy. The default is [latest](./config/version_strategy/latest.yaml). +- **dataloader**: Choose the dataloader. The default is [postgres](./config/dataloader/postgres.yaml). +- **writer**: Set the writer for storing results. The default is [postgres](./config/writer/postgres.yaml). -Other configuration can be defined by creating a new file in the folder of the specific component. +Additional configurations can be added by creating new files in the corresponding component folders. ## Functionalities - Annotation (UI/API/Script) - - File - - Package - - Project + - File-Level + - Package-Level + - Project-Level - Batch Analysis (Script Only) - Temporal Analysis (**TODO**) - Classification (**TODO**) @@ -97,26 +86,23 @@ Other configuration can be defined by creating a new file in the folder of the s ## Development -The tool is composed of multiple components, their interaction is shown in the following diagram: +AutoFL is composed of multiple components, as shown in the architecture diagram below: ![Architecture](resources/architecture/architecture.png) -### Add New Languages +### Adding Support for New Languages -In order to support more languages, a new language specific parser is needed. -We can create one quickly by using [tree-sitter](https://tree-sitter.github.io/tree-sitter/), -and a custom parser. +To add support for additional languages, a language-specific parser is required. You can use [tree-sitter](https://tree-sitter.github.io/tree-sitter/) to develop a parser quickly. -#### Parser +#### Parser Details -The parser needs to be in the [parser/languages](./src/parser/languages) folder. -It has to extend the ```BaseParser``` class, which has the following interface. +The parser needs to be located in the `parser/languages` folder. It should extend the `BaseParser` class, which follows this structure: ```python class ParserBase(ABC): - """ - Abstract class for a programming language parser. 
- """ +""" +Abstract class for a programming language parser. +""" def __init__(self, library_path: Path | str): """ @@ -126,92 +112,73 @@ class ParserBase(ABC): ... ``` -And the language specific class has to contain the logic to parse the language to get the identifiers. -For example for Python, the class will look like this: +To implement the parsing logic, create a class that handles extracting identifiers. For Python, the parser might look like: ```python -class PythonParser(ParserBase, - lang=Extension.python.name): # The lang argument is used to register the parser in the ParserFactory class. +class PythonParser(ParserBase, lang=Extension.python.name): """ - Python specific parser. Uses a generic grammar for multiple versions of python. Uses tree_sitter to get the AST + Python-specific parser using a generic grammar for multiple versions. Utilizes tree-sitter for AST extraction. """ def __init__(self, library_path: Path | str): - super().__init__(library_path) - self.language: Language = Language(library_path, - Extension.python.name) # Creates the tree-sitter language for python - self.parser.set_language(self.language) # Sets tree-sitter parser to parse the language - - # Pattern used to match the identifiers, it depends on the Lanugage. Check tree-sitter - self.identifiers_pattern: str = """ - ((identifier) @identifier) - """ - - # Creates the query used to find the identifiers in the AST produced by tree-sitter - self.identifiers_query = self.language.query(self.identifiers_pattern) - - # Keyword that will be ignored, in this case, the language specific keywords as the query extracts them as well. - self.keywords = set(keyword.kwlist) # Use python's built in keyword list - self.keywords.update(['self', 'cls']) + ... ``` -A custom class that does not rely on [tree-sitter](https://github.com/tree-sitter/tree-sitter) can be also used, -however, there are more methods from ParserBase that need to be -changed. 
Check the implementation of [ParserBase](src/parser/parser.py). +A custom parser independent of tree-sitter can also be developed. For more details, refer to the implementation of [ParserBase](src/parser/parser.py). -## Know Issues +## Known Issues + +- **Dependency Installation**: The setup process may take significant time (~10 minutes), and dependency installations might fail due to timeouts. This appears to be a network-related issue, and retrying often resolves it. Future updates will aim to simplify dependencies. +- **~~Indefinite Analysis Loops~~**: ~~In some projects, the analysis may loop indefinitely. This issue is currently under investigation.~~ Seems solved in the latest version. Will monitor for further occurrences. + +## Docker Image Availability + +AutoFL is also available as a Docker image. You can pull the image from Docker Hub using: + +```shell +docker pull cezarsas/autofl +``` -- The installation of the dependencies requires quite some time (~10 minutes), and might fail due to timout. - Unfortunately, this issue is hard to reproduce, as it - seems to be related to the network connection. If you encounter this issue, please try again. Future versions will try - to fix this issue by - cleaning up the dependencies and reducing the number of dependencies. -- For some projects, the analysis might loop indefinitely. We are still investigating the cause of this issue. +Find more details and updates at the [Docker Hub page](https://hub.docker.com/r/cezarsas/autofl/). ## Disclaimer -The project is offered as is, it still in development, and it might not work as expected in some cases. -It has been developed and tested on Docker 24.0.7 and 25.0.0 for ```Ubuntu 22.04```. While minor testing has been done -on ```Windows``` and ```MacOS```, not all functionalities might work due to differences in Docker for these OSs (e.g. -Windows uses WSL 2). +This tool is in active development and may not function as expected in some cases. 
It has been tested primarily on Docker versions `24.0.7` and `25.0.0` for `Ubuntu 22.04`. Limited testing has been performed on `Windows` and `MacOS`, where functionality may vary. -In case of any problems, please open an issue, make a pull request, or contact me at ```c.a.sas@rug.nl```. +If you encounter any issues, please open an issue on GitHub, make a pull request, or contact me at `c.a.sas@rug.nl`. -## Cite +## Citation -If you use this work please cite us: +If you find this tool useful, please cite our work: ### Paper -```text +```bibtex @article{sas2024multigranular, - title = {Multi-granular Software Annotation using File-level Weak Labelling}, - author = {Cezar Sas and Andrea Capiluppi}, - journal = {Empirical Software Engineering}, - volume = {29}, - number = {1}, - pages = {12}, - year = {2024}, - url = {https://doi.org/10.1007/s10664-023-10423-7}, - doi = {10.1007/s10664-023-10423-7} +title = {Multi-granular Software Annotation using File-level Weak Labelling}, +author = {Cezar Sas and Andrea Capiluppi}, +journal = {Empirical Software Engineering}, +volume = {29}, +number = {1}, +pages = {12}, +year = {2024}, +url = {https://doi.org/10.1007/s10664-023-10423-7}, +doi = {10.1007/s10664-023-10423-7} } ``` -**Note**: The code used in the paper is available in -the [https://github.com/SasCezar/CodeGraphClassification](https://github.com/SasCezar/CodeGraphClassification) -repository. -However, this tool is more up to date, easier to use, more configurable, and also offers a UI. +**Note**: The code used in this paper is available at [CodeGraphClassification](https://github.com/SasCezar/CodeGraphClassification). However, AutoFL provides enhanced features, is more user-friendly, and includes a UI. 
### Tool -```text +```bibtex @software{sas2023autofl, author = {Sas, Cezar and Capiluppi, Andrea}, - month = dec, + month = oct, title = {{AutoFL}}, url = {https://github.com/SasCezar/AutoFL}, - version = {0.4.1}, - year = {2023}, + version = {0.5.0}, + year = {2024}, url = {https://doi.org/10.5281/zenodo.10255368}, doi = {10.5281/zenodo.10255368} } diff --git a/config/local/docker.yaml b/config/environment/docker.yaml similarity index 100% rename from config/local/docker.yaml rename to config/environment/docker.yaml diff --git a/config/local/local.yaml b/config/environment/local.yaml similarity index 100% rename from config/local/local.yaml rename to config/environment/local.yaml diff --git a/config/main.yaml b/config/main.yaml index 9103a43..6222e42 100644 --- a/config/main.yaml +++ b/config/main.yaml @@ -1,7 +1,7 @@ # @package _global_ defaults: - _self_ - - local: docker + - environment: docker - taxonomy: gitranking - annotator: default - version_strategy: latest diff --git a/config/runs.yaml b/config/runs.yaml index cd518e2..92e3bb2 100644 --- a/config/runs.yaml +++ b/config/runs.yaml @@ -1,7 +1,7 @@ # @package _global_ defaults: - _self_ - - local: local + - environment: docker - run: batch_annotation package_annotation: True diff --git a/config/test.yaml b/config/test.yaml index af6aae8..9ad573a 100644 --- a/config/test.yaml +++ b/config/test.yaml @@ -1,7 +1,7 @@ # @package _global_ defaults: - _self_ - - local: local + - environment: docker - taxonomy: small - annotator: default - version_strategy: latest diff --git a/data/grammars/languages.so b/data/grammars/languages.so deleted file mode 100755 index 50ff2de..0000000 Binary files a/data/grammars/languages.so and /dev/null differ diff --git a/docs/index.md b/docs/index.md index 442ccb8..16ac2c8 100644 --- a/docs/index.md +++ b/docs/index.md @@ -1,69 +1,77 @@ + # AutoFL + [![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0) 
[![DOI](https://zenodo.org/badge/644095707.svg)](https://zenodo.org/doi/10.5281/zenodo.10255367) -[![Docker](https://img.shields.io/badge/Docker-blue.svg)](https://img.shields.io/badge/Docker-blue) +[![Docker](https://img.shields.io/badge/Docker-blue.svg)](https://hub.docker.com/r/cezarsas/autofl/) + +Automatic source code file annotation using weak labeling. + +## Overview -Automatic source code file annotation using weak labelling. +AutoFL is a tool designed for automatic annotation of source code files through weak labeling techniques. It provides both an API and a web-based UI for easy analysis of projects across different languages. ## Setup -Clone the repository and the UI submodule [autofl-ui](https://github.com/SasCezar/autofl-ui) by running the following command: + +To set up the repository along with its UI submodule, clone it using: + ```bash git clone --recursive git@github.com:SasCezar/AutoFL.git AutoFL ``` -### Optional Setup -To make use of certain feature like semantic based labelling functions, you need to download the model. -For example, for **w2v-so**, you can download the model from [here](https://github.com/vefstathiou/SO_word2vec), and place it in the [data/models/w2v-so](data/models/w2v-so) folder, or a custom -path that you can use in the configs. +### Optional Model Setup + +For advanced features like semantic-based labeling, download models as required. For example, to use **w2v-so**, download the model from [here](https://github.com/vefstathiou/SO_word2vec) and place it in the `data/models/w2v-so` folder. Alternatively, you can provide a custom path in the configuration files. ## Usage -Run docker the docker compose file [docker-compose.yaml](docker-compose.yaml) by executing: +To run the tool using Docker, navigate to the project directory (where the `docker-compose.yaml` file is located) and execute: + ```shell docker compose up ``` -in the project folder. 
### API Endpoint -You can analyze the files of project by making a request to the endpoint: + +To analyze the files of a project, make a POST request to the following endpoint: + ```shell -curl -X POST -d '{"name": "", "remote": "", "languages": [""]}' localhost:8000/label/files -H "content-type: application/json" +curl -X POST -d '{"name": "", "remote": "", "languages": [""]}' localhost:8000/label/files -H "content-type: application/json" ``` -For example, to analyze the files of [https://github.com/mickleness/pumpernickel](https://github.com/mickleness/pumpernickel), you can make the following request: + +For instance, to analyze the project at [https://github.com/mickleness/pumpernickel](https://github.com/mickleness/pumpernickel), use: + ```shell -curl -X POST -d '{"name": "pumpernickel", "remote": "https://github.com/mickleness/pumpernickel", "languages": ["java"]}' localhost:8000/label/files -H "content-type: application/json" +curl -X POST -d '{"name": "pumpernickel", "remote": "https://github.com/mickleness/pumpernickel", "languages": ["java"]}' localhost:8000/label/files -H "content-type: application/json" ``` -### UI +### Web UI -The tool also offers a web UI that is available at the following page (when running locally): -[http://localhost:8501](http://localhost:8501) +AutoFL provides a web-based UI accessible locally at [http://localhost:8501](http://localhost:8501): -![UI](resources/ui-screenshots/landing-page.png) +![UI](/resources/ui-screenshots/landing-page.png) -For more details, check the [UI repo](https://github.com/SasCezar/autofl-ui). - -[//]: # (For more details, check the [UI repo](https://github.com/SasCezar/autofl-ui)) +For more details, check the [UI repository](https://github.com/SasCezar/autofl-ui). ## Configuration -AutoFL uses [Hydra](https://hydra.cc/) to manage the configuration. The configuration files are located in the [config](config) folder. 
-The main configuration file is [main.yaml](./config/main.yaml), which contains the following options: -- **local**: which environment to use, either local or docker. [Docker](./config/local/docker.yaml) is default. -- **taxonomy**: which taxonomy to use. Currently only [gitranking](./config/taxonomy/gitranking.yaml) is supported. -- **annotator**: which annotators to use. Default is [simple](./config/annotator/simple.yaml), which allows good results without extra dependencies on models. -- **version_strategy**: which version strategy to use. Default is [latest](./config/version_strategy/latest.yaml), which will only analyze the latest version of the project. -- **dataloader**: which dataloader to use. Default is [postgres](./config/dataloader/postgres.yaml) which allows the API to fetch already analysed projects. -- **writer**: which writer to use. Default is [postgres](./config/writer/postgres.yaml) which allows the API to store the results in a database. +AutoFL uses [Hydra](https://hydra.cc/) to manage configurations. The configuration files can be found in the `config` folder. The main configuration file, `main.yaml`, allows you to customize various options: + +- **local**: Choose between local or Docker environments. [Docker](config/environment/docker.yaml) is the default. +- **taxonomy**: Set the taxonomy for labeling. Currently supports [gitranking](./config/taxonomy/gitranking.yaml). You can add custom taxonomies. +- **annotator**: Specify the annotators to use. The default is [simple](./config/annotator/simple.yaml), offering good results without dependencies on language models. +- **version_strategy**: Select the versioning strategy. The default is [latest](./config/version_strategy/latest.yaml). +- **dataloader**: Choose the dataloader. The default is [postgres](./config/dataloader/postgres.yaml). +- **writer**: Set the writer for storing results. The default is [postgres](./config/writer/postgres.yaml). 
-Other configuration can be defined by creating a new file in the folder of the specific component. +Additional configurations can be added by creating new files in the corresponding component folders. ## Functionalities - Annotation (UI/API/Script) - - File - - Package - - Project + - File-Level + - Package-Level + - Project-Level - Batch Analysis (Script Only) - Temporal Analysis (**TODO**) - Classification (**TODO**) @@ -78,21 +86,23 @@ Other configuration can be defined by creating a new file in the folder of the s ## Development -### Add New Languages +AutoFL is composed of multiple components, as shown in the architecture diagram below: + +![Architecture](resources/architecture/architecture.png) + +### Adding Support for New Languages -In order to support more languages, a new language specific parser is needed. -We can create one quickly by using [tree-sitter](https://tree-sitter.github.io/tree-sitter/), -and a custom parser. +To add support for additional languages, a language-specific parser is required. You can use [tree-sitter](https://tree-sitter.github.io/tree-sitter/) to develop a parser quickly. -#### Parser -The parser needs to be in the [parser/languages](./src/parser/languages) folder. -It has to extend the ```BaseParser``` class, which has the following interface. +#### Parser Details + +The parser needs to be located in the `parser/languages` folder. It should extend the `BaseParser` class, which follows this structure: ```python class ParserBase(ABC): - """ - Abstract class for a programming language parser. - """ +""" +Abstract class for a programming language parser. +""" def __init__(self, library_path: Path | str): """ @@ -101,75 +111,74 @@ class ParserBase(ABC): """ ... ``` -And the language specific class has to contain the logic to parse the language to get the identifiers. -For example for Python, the class will look like this: + +To implement the parsing logic, create a class that handles extracting identifiers. 
For Python, the parser might look like: ```python -class PythonParser(ParserBase, lang=Extension.python.name): # The lang argument is used to register the parser in the ParserFactory class. +class PythonParser(ParserBase, lang=Extension.python.name): """ - Python specific parser. Uses a generic grammar for multiple versions of python. Uses tree_sitter to get the AST + Python-specific parser using a generic grammar for multiple versions. Utilizes tree-sitter for AST extraction. """ def __init__(self, library_path: Path | str): - super().__init__(library_path) - self.language: Language = Language(library_path, Extension.python.name) # Creates the tree-sitter language for python - self.parser.set_language(self.language) # Sets tree-sitter parser to parse the language - - # Pattern used to match the identifiers, it depends on the Lanugage. Check tree-sitter - self.identifiers_pattern: str = """ - ((identifier) @identifier) - """ - - # Creates the query used to find the identifiers in the AST produced by tree-sitter - self.identifiers_query = self.language.query(self.identifiers_pattern) - - # Keyword that will be ignored, in this case, the language specific keywords as the query extracts them as well. - self.keywords = set(keyword.kwlist) # Use python's built in keyword list - self.keywords.update(['self', 'cls']) + ... ``` -A custom class that does not rely on [tree-sitter](https://github.com/tree-sitter/tree-sitter) can be also used, however, there are more methods from ParserBase that need to be -changed. Check the implementation of [ParserBase](src/parser/parser.py). +A custom parser independent of tree-sitter can also be developed. For more details, refer to the implementation of [ParserBase](src/parser/parser.py). + +## Known Issues + +- **Dependency Installation**: The setup process may take significant time (~10 minutes), and dependency installations might fail due to timeouts. This appears to be a network-related issue, and retrying often resolves it. 
Future updates will aim to simplify dependencies. +- **~~Indefinite Analysis Loops~~**: ~~In some projects, the analysis may loop indefinitely. This issue is currently under investigation.~~ Seems solved in the latest version. Will monitor for further occurrences. + +## Docker Image Availability + +AutoFL is also available as a Docker image. You can pull the image from Docker Hub using: + +```shell +docker pull cezarsas/autofl +``` + +Find more details and updates at the [Docker Hub page](https://hub.docker.com/r/cezarsas/autofl/). ## Disclaimer -The project is still in development, and it might not work as expected in some cases. -It has been developed and tested on Docker 24.0.7 for ```Ubuntu 22.04```. While minor testing has been done on ```Windows``` and ```MacOS```, -not all functionalities might work due to differences in Docker for these OSs (e.g. Windows uses WSL 2). +This tool is in active development and may not function as expected in some cases. It has been tested primarily on Docker versions `24.0.7` and `25.0.0` for `Ubuntu 22.04`. Limited testing has been performed on `Windows` and `MacOS`, where functionality may vary. -In case of any problems, please open an issue, make a pull request, or contact me at ```c.a.sas@rug.nl```. +If you encounter any issues, please open an issue on GitHub, make a pull request, or contact me at `c.a.sas@rug.nl`. 
-## Cite +## Citation -If you use this work please cite us: +If you find this tool useful, please cite our work: ### Paper -```text + +```bibtex @article{sas2024multigranular, - title = {Multi-granular Software Annotation using File-level Weak Labelling}, - author = {Cezar Sas and Andrea Capiluppi}, - journal = {Empirical Software Engineering}, - volume = {29}, - number = {1}, - pages = {12}, - year = {2024}, - url = {https://doi.org/10.1007/s10664-023-10423-7}, - doi = {10.1007/s10664-023-10423-7} +title = {Multi-granular Software Annotation using File-level Weak Labelling}, +author = {Cezar Sas and Andrea Capiluppi}, +journal = {Empirical Software Engineering}, +volume = {29}, +number = {1}, +pages = {12}, +year = {2024}, +url = {https://doi.org/10.1007/s10664-023-10423-7}, +doi = {10.1007/s10664-023-10423-7} } ``` -**Note**: The code used in the paper is available in the [https://github.com/SasCezar/CodeGraphClassification](https://github.com/SasCezar/CodeGraphClassification) repository. -However, this tool is more up to date, is easier to use, configurable, and also offers a UI. +**Note**: The code used in this paper is available at [CodeGraphClassification](https://github.com/SasCezar/CodeGraphClassification). However, AutoFL provides enhanced features, is more user-friendly, and includes a UI. 
### Tool -```text + +```bibtex @software{sas2023autofl, author = {Sas, Cezar and Capiluppi, Andrea}, - month = dec, + month = oct, title = {{AutoFL}}, url = {https://github.com/SasCezar/AutoFL}, - version = {0.3.0}, - year = {2023}, + version = {0.5.0}, + year = {2024}, url = {https://doi.org/10.5281/zenodo.10255368}, doi = {10.5281/zenodo.10255368} } diff --git a/pyproject.toml b/pyproject.toml index a3ce6bd..8b16bc8 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -1,18 +1,18 @@ [tool.poetry] name = "autofl" -version = "0.4.1" +version = "0.5.0" description = "" authors = ["Cezar Sas "] readme = "README.md" [tool.poetry.dependencies] python = ">=3.10,<3.13" -fastapi = "^0.109.0" -gunicorn = "^21.2.0" -uvicorn = "^0.27.0" +fastapi = "^0.115.0" +gunicorn = "^23.0.0" +uvicorn = "^0.31.0" pandas = "^2.2.0" hydra-core = "^1.3.2" -setuptools = "^69.0.3" +setuptools = "^75.1.0" multiset = "^3.0.2" scikit-learn = "^1.4.0" pydantic = "^2.5.3" @@ -22,24 +22,30 @@ tqdm = "^4.66.1" # yake = { git = "git@github.com:LIAAD/yake.git" } python-rake = "^1.5.0" more-itertools = "^10.2.0" -tree-sitter = "^0.20.4" +tree-sitter = "^0.23.0" sqlalchemy = "^2.0.25" psycopg = {extras = ["binary"], version = "^3.1.17"} gensim = "^4.3.3" fasttext-wheel = "^0.9.2" transformers = "^4.37.1" -sentence-transformers = "^2.2.2" +sentence-transformers = "3.1.1" +tree-sitter-python = "^0.23.2" +tree-sitter-java = "^0.23.2" +tree-sitter-cpp = "^0.23.1" +tree-sitter-c-sharp = "^0.23.0" +tree-sitter-c = "^0.23.1" + [tool.poetry.group.dev.dependencies] notebook = "^6.5.4" jupyter = "^1.0.0" -mkdocs = "^1.5.3" -mkdocstrings = "^0.24.0" +mkdocs = "^1.6.1" +mkdocstrings = "^0.26.1" mkdocs-gen-files = "^0.5.0" -mkdocstrings-python = "^1.7.5" +mkdocstrings-python = "^1.11.1" mkdocs-literate-nav = "^0.6.1" -mkdocs-section-index = "^0.3.8" -mkdocs-material = "^9.4.14" +mkdocs-section-index = "^0.3.9" +mkdocs-material = "^9.5.39" [build-system] diff --git a/scripts/python/build_tree_sitter_library.py 
b/scripts/python/build_tree_sitter_library.py
deleted file mode 100644
index 97f29ba..0000000
--- a/scripts/python/build_tree_sitter_library.py
+++ /dev/null
@@ -1,22 +0,0 @@
-from os.path import join
-
-import hydra
-from omegaconf import DictConfig
-from tree_sitter import Language
-
-
-@hydra.main(config_path="../../config", config_name="main", version_base="1.3")
-def build_library(cfg: DictConfig):
-    languages = ['java', 'python', 'cpp', 'c-sharp', 'c']
-    repositories = [join(cfg.grammars_path, f'tree-sitter-{pl}') for pl in languages]
-    Language.build_library(
-        # Store the library in the `build` directory
-        cfg.languages_library,
-
-        # Include one or more languages
-        repositories
-    )
-
-
-if __name__ == '__main__':
-    build_library()
diff --git a/src/parser/languages/c.py b/src/parser/languages/c.py
index 2de7d60..752cc02 100644
--- a/src/parser/languages/c.py
+++ b/src/parser/languages/c.py
@@ -1,7 +1,7 @@
 from pathlib import Path
-from tree_sitter import Language
-
+from tree_sitter import Language, Parser
+import tree_sitter_c as tsc
 from parser.extensions import Extension
 from parser.parser import ParserBase
@@ -13,12 +13,12 @@ class CParser(ParserBase, lang=Extension.c.name):
     def __init__(self, library_path: Path | str):
         super().__init__(library_path)
-        self.language: Language = Language(library_path, Extension.c.name)
-        self.parser.set_language(self.language)
+        self.language: Language = Language(tsc.language())
+        self.parser: Parser = Parser(self.language)
         self.identifiers_pattern: str = """
         ((identifier) @identifier)
         ((type_identifier) @type)
         """
-        self.identifiers_query = self.language.query(self.identifiers_pattern)\
+        self.identifiers_query = self.language.query(self.identifiers_pattern)
         self.keywords = {'malloc'}
diff --git a/src/parser/languages/cpp.py b/src/parser/languages/cpp.py
index 03a85a8..c1e4522 100644
--- a/src/parser/languages/cpp.py
+++ b/src/parser/languages/cpp.py
@@ -1,7 +1,7 @@
 from pathlib import Path
-from tree_sitter import Language
-
+from tree_sitter import Language, Parser
+import tree_sitter_cpp as tscpp
 from parser.extensions import Extension
 from parser.parser import ParserBase
@@ -13,8 +13,8 @@ class CPPParser(ParserBase, lang=Extension.cpp.name):
     def __init__(self, library_path: Path | str):
         super().__init__(library_path)
-        self.language: Language = Language(library_path, Extension.cpp.name)
-        self.parser.set_language(self.language)
+        self.language: Language = Language(tscpp.language())
+        self.parser: Parser = Parser(self.language)
         # TODO Fix, doesn't work - It doesn't find the namespaced identifiers
         self.identifiers_pattern: str = """
         ((identifier) @identifier)
diff --git a/src/parser/languages/csharp.py b/src/parser/languages/csharp.py
index 139dd89..c26c66f 100644
--- a/src/parser/languages/csharp.py
+++ b/src/parser/languages/csharp.py
@@ -1,7 +1,7 @@
 from pathlib import Path
-from tree_sitter import Language
-
+from tree_sitter import Language, Parser
+import tree_sitter_c_sharp as tscs
 from parser.extensions import Extension
 from parser.parser import ParserBase
@@ -13,8 +13,8 @@ class CSharpParser(ParserBase, lang=Extension.c_sharp.name):
     def __init__(self, library_path: Path | str):
         super().__init__(library_path)
-        self.language: Language = Language(library_path, Extension.c_sharp.name)
-        self.parser.set_language(self.language)
+        self.language: Language = Language(tscs.language())
+        self.parser: Parser = Parser(self.language)
         self.identifiers_pattern: str = """
         ((identifier) @identifier)
         """
diff --git a/src/parser/languages/java.py b/src/parser/languages/java.py
index 3e6055e..97d1f61 100644
--- a/src/parser/languages/java.py
+++ b/src/parser/languages/java.py
@@ -1,7 +1,8 @@
 from pathlib import Path
+from typing import Dict, List
-from tree_sitter import Language, Tree
-
+from tree_sitter import Language, Tree, Parser, Node
+import tree_sitter_java as tsjava
 from parser.extensions import Extension
 from parser.parser import ParserBase
@@ -13,8 +14,8 @@ class JavaParser(ParserBase, lang=Extension.java.name):
     def __init__(self, library_path: Path | str):
         super().__init__(library_path)
-        self.language: Language = Language(library_path, Extension.java.name)
-        self.parser.set_language(self.language)
+        self.language: Language = Language(tsjava.language())
+        self.parser: Parser = Parser(self.language)
         self.identifiers_pattern: str = """
         ((identifier) @identifier)
         ((type_identifier) @type)
diff --git a/src/parser/languages/python.py b/src/parser/languages/python.py
index 3237af3..58fe63a 100644
--- a/src/parser/languages/python.py
+++ b/src/parser/languages/python.py
@@ -1,8 +1,8 @@
 import keyword
 from pathlib import Path
-from tree_sitter import Language
-
+from tree_sitter import Language, Parser
+import tree_sitter_python as tsp
 from parser.extensions import Extension
 from parser.parser import ParserBase
@@ -14,8 +14,8 @@ class PythonParser(ParserBase, lang=Extension.python.name):
     def __init__(self, library_path: Path | str):
         super().__init__(library_path)
-        self.language: Language = Language(library_path, Extension.python.name)
-        self.parser.set_language(self.language)
+        self.language: Language = Language(tsp.language())
+        self.parser: Parser = Parser(self.language)
         self.identifiers_pattern: str = """
         ((identifier) @identifier)
         """
diff --git a/src/parser/parser.py b/src/parser/parser.py
index b1c7a73..4980c00 100644
--- a/src/parser/parser.py
+++ b/src/parser/parser.py
@@ -39,11 +39,12 @@ def parse(self, file: File) -> Tuple[List[str], str]:
         return identifiers, package
     @staticmethod
-    def get_node_text(code: bytes, identifiers_nodes: List[Tuple[Node, str]]) -> List[str]:
+    def get_node_text(code: bytes, identifiers_nodes: dict[str, list[Node]]) -> List[str]:
         identifiers = []
-        for node, _ in identifiers_nodes:
-            token = code[node.start_byte:node.end_byte]
-            identifiers.append(token.decode())
+        for type in identifiers_nodes:
+            for node in identifiers_nodes[type]:
+                token = code[node.start_byte:node.end_byte]
+                identifiers.append(token.decode())
         return identifiers
diff --git a/test/test_parser/test_c_parser.py b/test/test_parser/test_c_parser.py
index 59f43b5..f1d0cf3 100644
--- a/test/test_parser/test_c_parser.py
+++ b/test/test_parser/test_c_parser.py
@@ -39,7 +39,8 @@ def setUp(self) -> None:
     def test_identifiers(self):
         identifiers, _ = self.parser.parse(self.file)
-        self.assertListEqual(identifiers, self.gt)
+        self.assertEqual(len(identifiers), len(self.gt))
+        self.assertSetEqual(set(identifiers), set(self.gt))
     @staticmethod
     def load_file(path):
diff --git a/test/test_parser/test_cpp_parser.py b/test/test_parser/test_cpp_parser.py
index 8e002ff..89f0512 100644
--- a/test/test_parser/test_cpp_parser.py
+++ b/test/test_parser/test_cpp_parser.py
@@ -23,10 +23,9 @@ def setUp(self) -> None:
     def test_identifiers(self): # TODO
         identifiers, _ = self.parser.parse(self.file)
-        print(identifiers)
-        print(len(identifiers))
-        print(self.gt)
-        self.assertListEqual(identifiers, self.gt)
+        self.assertEqual(len(identifiers), len(self.gt))
+        self.assertSetEqual(set(identifiers), set(self.gt))
+
     def test_packages(self): # TODO
diff --git a/test/test_parser/test_csharp_parser.py b/test/test_parser/test_csharp_parser.py
index b334b69..3f60d2d 100644
--- a/test/test_parser/test_csharp_parser.py
+++ b/test/test_parser/test_csharp_parser.py
@@ -23,10 +23,9 @@ def setUp(self) -> None:
     def test_identifiers(self): # TODO
         identifiers, _ = self.parser.parse(self.file)
-        print(identifiers)
-        print(len(identifiers))
-        print(self.gt)
-        self.assertListEqual(identifiers, self.gt)
+        self.assertEqual(len(identifiers), len(self.gt))
+        self.assertSetEqual(set(identifiers), set(self.gt))
+
     def test_packages(self): # TODO
diff --git a/test/test_parser/test_java_parser.py b/test/test_parser/test_java_parser.py
index 1d2d427..2675172 100644
--- a/test/test_parser/test_java_parser.py
+++ b/test/test_parser/test_java_parser.py
@@ -2,6 +2,7 @@
 from pathlib import Path
 from hydra import initialize, compose
+from
sympy import pprint from entity.file import File from parser.parser import ParserFactory, ParserBase @@ -25,7 +26,8 @@ def setUp(self) -> None: def test_identifiers(self): identifiers, _ = self.parser.parse(self.file) - self.assertListEqual(identifiers, self.gt) + self.assertEqual(len(identifiers), len(self.gt)) + self.assertSetEqual(set(identifiers), set(self.gt)) def test_packages(self): # TODO diff --git a/test/test_parser/test_python_parser.py b/test/test_parser/test_python_parser.py index 7e5f2bb..d93732a 100644 --- a/test/test_parser/test_python_parser.py +++ b/test/test_parser/test_python_parser.py @@ -29,7 +29,8 @@ def setUp(self) -> None: def test_identifiers(self): identifiers, _ = self.parser.parse(self.file) - self.assertListEqual(identifiers, self.gt) + self.assertEqual(len(identifiers), len(self.gt)) + self.assertSetEqual(set(identifiers), set(self.gt)) def test_packages(self): # TODO
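
The `get_node_text` change in `src/parser/parser.py` walks a mapping from capture name to nodes instead of a list of `(node, name)` pairs. Below is a minimal sketch of that flattening, assuming the mapping shape given in the new type hint. `FakeNode` is a hypothetical stand-in for `tree_sitter.Node` so the example runs without the library, and the source snippet, capture contents, and byte offsets are made up for illustration rather than produced by a real grammar.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeNode:
    """Hypothetical stand-in for tree_sitter.Node; only the byte offsets are modeled."""
    start_byte: int
    end_byte: int


def get_node_text(code: bytes, captures: dict[str, list[FakeNode]]) -> list[str]:
    """Flatten a capture-name -> nodes mapping into decoded source snippets."""
    identifiers = []
    for capture_name, nodes in captures.items():
        for node in nodes:
            identifiers.append(code[node.start_byte:node.end_byte].decode())
    return identifiers


code = b"Buffer buf = make(n);"

# Hand-written captures (an assumption, not real parser output),
# keyed by the capture names used in the query patterns of the diff.
captures = {
    "identifier": [FakeNode(7, 10), FakeNode(13, 17), FakeNode(18, 19)],
    "type": [FakeNode(0, 6)],
}

result = get_node_text(code, captures)
print(result)  # ['buf', 'make', 'n', 'Buffer']
```

In this sketch the output is grouped by capture name, so `Buffer` comes last even though it appears first in the source. An order-sensitive assertion can therefore fail even when the same identifiers are extracted, which is consistent with the tests moving to a length check plus `assertSetEqual`. Where the number of occurrences of each identifier also matters, `unittest`'s `assertCountEqual` compares elements and their counts regardless of order.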