Feature/issue 78 implement file shuffler in rust #134
FILE SHUFFLER IN RUST
Introduction
In this project, we implemented Option 1: a File Shuffler, using Rust as our chosen programming language. This implementation aims to address Issue #78. The primary goal of the File Shuffler is to randomize the training dataset files used for training the prediction model in Avatar.
Implementation Details
Project Setup
We used `cargo`, Rust's package manager and build tool, to set up our project. `cargo` simplifies the development process by managing dependencies, handling builds, and providing a structured project configuration. Its commands, such as `cargo build`, streamline compiling and testing, allowing us to focus on writing Rust code.
Command-Line Arguments
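A minimal sketch of the argument handling described in this section, assuming hand-rolled parsing over `std::env::args()` rather than whichever parsing approach the project actually uses. The `Interval` enum and the `parse_interval` and `parse_args` names are hypothetical and chosen only for illustration.

```rust
/// How often the shuffler re-runs; mirrors the documented options 0, 1 and 2.
#[derive(Debug, PartialEq, Clone, Copy)]
enum Interval {
    Never,
    EveryWeek,
    Every30Seconds,
}

/// Maps the raw `--interval` value to an `Interval`.
/// Anything other than "1" or "2" falls back to `Never`,
/// matching the documented default for unsupported values.
fn parse_interval(raw: &str) -> Interval {
    match raw.trim() {
        "1" => Interval::EveryWeek,
        "2" => Interval::Every30Seconds,
        _ => Interval::Never,
    }
}

/// Extracts the input directory and the interval from an argument list.
/// Returns `None` when the required input directory is missing.
fn parse_args<I: Iterator<Item = String>>(mut args: I) -> Option<(String, Interval)> {
    let mut dir = None;
    let mut interval = Interval::Never;
    while let Some(arg) = args.next() {
        if arg == "--interval" || arg == "-i" {
            // The value is the argument that follows the flag.
            interval = args
                .next()
                .map(|value| parse_interval(&value))
                .unwrap_or(Interval::Never);
        } else if dir.is_none() {
            // The first positional argument is taken as the input directory.
            dir = Some(arg);
        }
    }
    dir.map(|d| (d, interval))
}
```

In a `main` function this could be called as `parse_args(std::env::args().skip(1))`. When launching through cargo, program arguments go after a `--` separator, for example `cargo run -- ./data --interval 1` (the `./data` path is a placeholder).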
We chose to run the File Shuffler through command-line arguments because this is more efficient than an interactive mode, especially for automation (e.g., in CI/CD pipelines). To run it, use the `cargo run` command with the required input directory argument. Additionally, specify an interval with the `--interval` flag (or `-i` for short). The interval options are `0` for "Never", `1` for "Every Week", and `2` for "Every 30 Seconds". If an unsupported interval is entered, the program defaults to `0` (Never).
Directory Structure
Due to time constraints, we kept the code in a single file rather than a modular setup, acknowledging this as technical debt for future refactoring. We used the standard `cargo` project layout:
- `Cargo.toml`: Configuration file where dependencies are defined.
- `Cargo.lock`: Ensures consistency of dependency versions across builds.
- `src/`: Contains all implementation code.
  - `main.rs`: Primary implementation file for the File Shuffler logic.
Shuffling Logic
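As a rough sketch of the two steps this section describes (bottom-up flattening of a label directory, then drawing a unique random number per file), the Rust below uses only the standard library. The function names, the `<stem>_<number>` naming scheme, and the small linear congruential generator standing in for a real random source are all assumptions made for illustration, not the project's actual code.

```rust
use std::collections::HashSet;
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Lists every entry of `dir` up front, so later moves do not disturb iteration.
fn list(dir: &Path) -> io::Result<Vec<PathBuf>> {
    fs::read_dir(dir)?.map(|entry| entry.map(|e| e.path())).collect()
}

/// Builds "<stem>_<n>.<ext>". This naming scheme is an assumption.
fn numbered_name(src: &Path, n: u64) -> String {
    let stem = src
        .file_stem()
        .map(|s| s.to_string_lossy().into_owned())
        .unwrap_or_default();
    match src.extension() {
        Some(ext) => format!("{}_{}.{}", stem, n, ext.to_string_lossy()),
        None => format!("{}_{}", stem, n),
    }
}

/// Post-order flattening: each subdirectory of `dir` is flattened first,
/// then its files are moved up into `dir` and the emptied subdirectory is removed.
fn flatten_into(dir: &Path, counter: &mut u64) -> io::Result<()> {
    for child in list(dir)?.into_iter().filter(|p| p.is_dir()) {
        flatten_into(&child, counter)?; // children are handled before their parent
        for src in list(&child)? {
            *counter += 1;
            // fs::rename replaces an existing destination; a real tool should check first.
            fs::rename(&src, dir.join(numbered_name(&src, *counter)))?;
        }
        fs::remove_dir(&child)?; // the child holds nothing at this point
    }
    Ok(())
}

/// Draws numbers in 1..=n until each has been produced once,
/// rejecting repeats with a HashSet. The generator is a plain LCG used
/// as a stand-in for a proper random number source.
fn unique_random_numbers(n: u64, seed: u64) -> Vec<u64> {
    let mut state = seed;
    let mut used = HashSet::new();
    let mut numbers = Vec::new();
    while (numbers.len() as u64) < n {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        let candidate = (state >> 33) % n + 1; // always within 1..=n
        if used.insert(candidate) {
            numbers.push(candidate); // first time this number was drawn
        }
    }
    numbers
}
```

Calling `flatten_into` once per label directory under the root would leave each label directory holding only files, and the i-th number from `unique_random_numbers` could then serve as the new name of the i-th file.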
Before processing, the program verifies that the specified directory exists and is at least 2 levels deep. This depth check aligns with the requirement to start with a parent directory containing subdirectories that represent data labels (e.g., `backward`, `forward`, etc.).
The program models the directory structure as an n-ary tree, where the input directory is the root and its immediate child directories are the data labels. Using a recursive backtracking algorithm, we navigate to the deepest level of each branch. Upon reaching a leaf node (a subdirectory without further subdirectories), we move its files into the parent directory and then delete the empty leaf node. This approach employs a bottom-up (post-order) traversal, processing leaf nodes before their parent directories. To avoid name collisions, an incremented number is appended to each filename as the files are copied up.
For a structure with n layers, the recursion descends to a depth of n - 2, stopping at the directories directly under the root before renaming the files.
Renaming Files: We start by counting the files in the current directory and use a `HashSet` to track renamed files, ensuring no file names are duplicated. For each file, a random number from 1 to n (inclusive), where n is here the number of files counted, is generated and checked against the `HashSet` to prevent overwrites. The renamed files are copied into a temporary directory. Once renaming is complete, all files in the temporary directory are moved back into the original directory, and the temporary directory is deleted.
Conclusion
For more information on the internal logic, functions, and documentation, you can explore the auto-generated documentation. To do so, simply run:
This command opens detailed documentation for the implementation.
Demo video
https://drive.google.com/file/d/1oRh9sY2Q5wGD0NrA2Ngd9VrgOjqT3_DC/view?usp=drive_link