- Windows 11
- OpenJDK 20.0.2
- IntelliJ IDEA
This project is complete and fulfills all specified requirements. It currently works on the Windows file system.
Usage: compile the whole project and `cd` to the directory that contains `cf.class`, `xf.class`, and `pv.class`. All paths must be absolute.
```
// create an archived file
java cf <InputFile>(file or directory) [OutputFile](optional)
// preview an archived file
java pv <InputFile> [OutputFile](optional)
// decompress an archived file
java xf <InputFile> [OutputPath](optional)
```
- `cf` for compression, `pv` for preview, and `xf` for decompression: three classes that handle command-line input.
- Package `BitwiseStream` contains `BitInputStream` and `BitOutputStream`, which handle bitwise input during decompression and bitwise output during compression.
- Package `Huffman` contains the implementation of `Nodes`; `HuffmanTree` implements the Huffman algorithm, and `TreeCanonization` canonizes the Huffman tree.
- Package `FileProcess`:
  - `FileIO` contains the magic numbers for file recognition and basic helper functions.
  - `StopWatch` provides a timer.
  - `OverwriteProtector` protects against accidental overwrites.
  - `FrequencyTable` counts the frequency of each byte pattern.
  - `Decoder` and `Encoder` hold the translation table or tree; they are helpers for decompression and compression.
  - `Compression` and `Decompression` contain the main routines.
  - `PathTreePrint` is a helper for the file-structure preview.
The archive format adds compression headers and magic numbers to distinguish files from directories; this is discussed in detail in the next section. This section focuses solely on the compression and decompression processes.
First, the `cf` and `xf` commands parse the command-line parameters and then call the corresponding `Compression` and `Decompression` functions.
During compression, the input file is first read in full to build the `FrequencyTable`. `HuffmanTree` then applies the Huffman algorithm to determine the code length for each byte; the tree itself becomes useless after this step.
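A minimal sketch of these two steps, with illustrative names (the project's actual `FrequencyTable` and `HuffmanTree` interfaces may differ): count byte frequencies in one pass, then repeatedly merge the two lightest nodes and read each byte's depth off as its code length.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Comparator;
import java.util.PriorityQueue;

// Illustrative sketch only; names do not match the project's real classes.
class CodeLengthSketch {
    // First pass over the file: count how often each byte value occurs.
    static long[] countFrequencies(InputStream in) throws IOException {
        long[] freq = new long[256];
        int b;
        while ((b = in.read()) != -1) freq[b]++;
        return freq;
    }

    // Tree node: a leaf carries a byte value (symbol >= 0), an inner node has children.
    private record Node(long weight, int symbol, Node left, Node right) {}

    // Build the Huffman tree with a min-heap and return each byte's depth as
    // its code length; bytes that never occur keep length 0. The tree itself
    // is discarded afterwards -- only the lengths are needed.
    static int[] codeLengths(long[] freq) {
        PriorityQueue<Node> pq = new PriorityQueue<>(Comparator.comparingLong(Node::weight));
        for (int s = 0; s < 256; s++)
            if (freq[s] > 0) pq.add(new Node(freq[s], s, null, null));
        if (pq.size() == 1)                       // degenerate input: one distinct byte
            pq.add(new Node(0, -1, null, null));  // pad with a dummy leaf
        while (pq.size() > 1) {
            Node a = pq.poll(), b = pq.poll();    // merge the two lightest subtrees
            pq.add(new Node(a.weight() + b.weight(), -1, a, b));
        }
        int[] lengths = new int[256];
        if (!pq.isEmpty()) fillDepths(pq.poll(), 0, lengths);
        return lengths;
    }

    private static void fillDepths(Node n, int depth, int[] lengths) {
        if (n.left() == null) {                   // leaf (possibly the dummy)
            if (n.symbol() >= 0) lengths[n.symbol()] = Math.max(depth, 1);
        } else {
            fillDepths(n.left(), depth + 1, lengths);
            fillDepths(n.right(), depth + 1, lengths);
        }
    }
}
```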
After obtaining the code lengths, `TreeCanonization` produces the canonical code table. The advantage of this table is that only the code length of each byte has to be recorded in order to reconstruct it. The canonical table differs from the originally generated one, but the code length of every byte stays the same.
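The canonical assignment itself is short: process the symbols in order of (code length, byte value), hand out consecutive code values, and shift left whenever the length grows. A sketch with illustrative names, not the project's actual `TreeCanonization` API:

```java
// Illustrative sketch of canonical code assignment. Because symbols are
// processed in a fixed order of (length, value), the lengths alone are
// enough to rebuild every code on the decompression side.
static int[] canonicalCodes(int[] lengths) {
    Integer[] symbols = new Integer[256];
    for (int i = 0; i < 256; i++) symbols[i] = i;
    java.util.Arrays.sort(symbols, (a, b) -> lengths[a] != lengths[b]
            ? Integer.compare(lengths[a], lengths[b])
            : Integer.compare(a, b));
    int[] codes = new int[256];
    int code = 0, prevLen = 0;
    for (int s : symbols) {
        if (lengths[s] == 0) continue;        // byte value never occurs
        code <<= (lengths[s] - prevLen);      // widen the code to the new length
        codes[s] = code++;
        prevLen = lengths[s];
    }
    return codes;
}
```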
The file name and code-length table are written to the output uncompressed, and then the `Encoder` is called to compress the bytes of the file one by one. The output during this stage is produced bit by bit.
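Because Java's standard streams are byte-oriented, the bit-level output has to be buffered by hand. A minimal sketch of what such a bit writer can look like (the project's `BitOutputStream` may differ; in particular, the real format ends each file block with a special EOF symbol rather than relying on padding alone):

```java
import java.io.IOException;
import java.io.OutputStream;

// Sketch of a bit-level writer: bits are collected in a one-byte buffer and
// flushed every 8 bits; close() pads the last partial byte with zeros.
class BitOutputStreamSketch implements AutoCloseable {
    private final OutputStream out;
    private int buffer = 0;   // pending bits, most significant first
    private int count = 0;    // number of pending bits

    BitOutputStreamSketch(OutputStream out) { this.out = out; }

    void writeBit(int bit) throws IOException {
        buffer = (buffer << 1) | (bit & 1);
        if (++count == 8) { out.write(buffer); buffer = 0; count = 0; }
    }

    // Write the low 'len' bits of 'code', most significant bit first.
    void writeCode(int code, int len) throws IOException {
        for (int i = len - 1; i >= 0; i--) writeBit((code >>> i) & 1);
    }

    @Override public void close() throws IOException {
        while (count != 0) writeBit(0);   // pad to a byte boundary
        out.close();
    }
}
```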
During decompression, the original file name and code-length table are first read from the archive. The decoding tree is reconstructed from the code-length table and passed to the `Decoder`. The archive is then read one bit at a time, and the `Decoder` outputs the byte found at each leaf and writes it to the corresponding file.
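Conceptually, the decoder just walks the rebuilt tree one bit at a time and emits a byte at every leaf. A sketch under assumed interfaces (the node type and the bit source below are illustrative stand-ins for the project's classes):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.function.IntSupplier;

class DecoderSketch {
    // Minimal decode-tree node: a leaf carries a symbol, an inner node has children.
    static class Node {
        Node left, right;
        int symbol = -1;
        boolean isLeaf() { return left == null && right == null; }
    }

    // Walk the tree one bit at a time; emit the byte at each leaf and restart
    // at the root, until the special EOF symbol is decoded. 'readBit' stands
    // in for the project's BitInputStream.
    static void decode(IntSupplier readBit, Node root, OutputStream out,
                       int eofSymbol) throws IOException {
        Node node = root;
        while (true) {
            node = (readBit.getAsInt() == 0) ? node.left : node.right;
            if (node.isLeaf()) {
                if (node.symbol == eofSymbol) return;  // end of this file block
                out.write(node.symbol);
                node = root;
            }
        }
    }
}
```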
To support compressing directories, the path of each file must be recorded. So that empty folders are not silently dropped, every path is recorded individually. A byte before each name marks whether it is a file or a directory, and two magic-number bytes at the start of the archive identify files generated by this program, independent of the file extension.
The constants mentioned above are defined in `FileIO`.
OutputFile Structure
- header: MagicNumber(2 bytes)--fileStructureHeader(strings of file paths, each ending with STRING_END_SIGN)--HEADER_END_SIGN
- continuous FileBlocks or DirectoryBlocks, one per path in the source.
- FileBlocks: SINGLE_FILE_MAGIC--StringOfPath--STRING_END_SIGN--CanonicalCodeTable--encodedBits(ending with a special EOF)
- DirectoryBlocks: DIRECTORY_MAGIC--StringOfPath--STRING_END_SIGN

PS: Strings are stored in the file C-style rather than Java-style: String = chars + STRING_END_SIGN.
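For illustration, writing the header and the C-style strings might look like the sketch below; the magic values here are placeholders, since the real constants live in `FileIO`:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class HeaderSketch {
    // Placeholder values; the real constants are defined in FileIO.
    static final byte[] MAGIC_NUMBER = {(byte) 0xAB, (byte) 0xCD};
    static final byte STRING_END_SIGN = 0x00;
    static final byte HEADER_END_SIGN = 0x01;

    // A string is stored C-style: its bytes followed by STRING_END_SIGN.
    static void writeString(OutputStream out, String s) throws IOException {
        out.write(s.getBytes(StandardCharsets.UTF_8));
        out.write(STRING_END_SIGN);
    }

    // header = MagicNumber ++ one C-style string per path ++ HEADER_END_SIGN
    static void writeHeader(OutputStream out, Iterable<String> paths) throws IOException {
        out.write(MAGIC_NUMBER);
        for (String p : paths) writeString(out, p);
        out.write(HEADER_END_SIGN);
    }
}
```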
The command-line entry points are set up as `cf`, `pv`, and `xf`. For example, `java cf <InputFile>(file or directory) [OutputFile](optional)` invokes the compression feature.
The `StopWatch` class records elapsed time, and the return values of some functions accumulate the original files' total size.
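A `System.nanoTime`-based timer is enough for this; a minimal sketch (the project's actual `StopWatch` may differ):

```java
// Minimal timer sketch based on System.nanoTime (the project's actual
// StopWatch class may differ).
class StopWatchSketch {
    private final long start = System.nanoTime();

    // Seconds elapsed since this stopwatch was created.
    double elapsedSeconds() { return (System.nanoTime() - start) / 1e9; }
}
```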
Two magic-number bytes are written at the start of every file generated by compression and verified before processing. If the magic numbers are incorrect, the program prints a message and terminates.
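A sketch of that check, again with a placeholder magic value:

```java
import java.io.IOException;
import java.io.InputStream;

class MagicCheckSketch {
    static final byte[] MAGIC_NUMBER = {(byte) 0xAB, (byte) 0xCD}; // placeholder

    // Read the first two bytes of the archive; report and terminate on mismatch.
    static void verifyMagic(InputStream in) throws IOException {
        int b0 = in.read(), b1 = in.read();
        if (b0 != (MAGIC_NUMBER[0] & 0xFF) || b1 != (MAGIC_NUMBER[1] & 0xFF)) {
            System.err.println("Not an archive produced by this program.");
            System.exit(1);
        }
    }
}
```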
Before creating the output stream during compression or decompression, a file that already exists at the target path would be overwritten. The program therefore invokes `OverwriteProtector`, and the user can choose to proceed or abort.
Since many files may need to be overwritten during decompression, the user is also given an 'overwrite all' option; choosing it disables `OverwriteProtector` for the rest of the operation.
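A sketch of such a prompt (the project's real `OverwriteProtector` interface is not shown, so this is only illustrative):

```java
import java.io.File;
import java.util.Scanner;

// Illustrative prompt; the project's real OverwriteProtector may differ.
class OverwritePromptSketch {
    private boolean overwriteAll = false;
    private final Scanner scanner = new Scanner(System.in);

    // Returns true if writing to 'file' may proceed. Once the user answers
    // "all", no further questions are asked for this operation.
    boolean mayWrite(File file) {
        if (!file.exists() || overwriteAll) return true;
        System.out.print(file + " exists. Overwrite? [y]es / [n]o / [a]ll: ");
        String answer = scanner.nextLine().trim().toLowerCase();
        if (answer.startsWith("a")) { overwriteAll = true; return true; }
        return answer.startsWith("y");
    }
}
```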
To implement the preview function, the relative paths of all files and directories are added at the beginning of the archive, exclusively for preview purposes.
`pv` reads only the file names at the beginning of the archive and never touches the large compressed payload that follows, which makes it very fast.
`FileIO.PathTreePrint` uses a trie to merge the input path strings into a tree structure, so a tree-shaped file listing can be printed.
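A minimal sketch of that trie approach, with illustrative names: split each path on the separator, insert the components into nested sorted maps, and print recursively with indentation.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PathTreePrintSketch {
    // One trie node per path component; TreeMap keeps children sorted.
    static class Node {
        final Map<String, Node> children = new TreeMap<>();
    }

    // Split each path on '/' or '\' and insert its components into the trie.
    static Node buildTrie(Iterable<String> paths) {
        Node root = new Node();
        for (String path : paths) {
            Node node = root;
            for (String part : path.split("[/\\\\]"))
                node = node.children.computeIfAbsent(part, k -> new Node());
        }
        return root;
    }

    // Print the trie as an indented tree.
    static void print(Node node, String indent) {
        for (Map.Entry<String, Node> e : node.children.entrySet()) {
            System.out.println(indent + e.getKey());
            print(e.getValue(), indent + "    ");
        }
    }

    public static void main(String[] args) {
        print(buildTrie(List.of("a/b/c.txt", "a/b/d.txt", "a/e")), "");
    }
}
```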
testcase | original (KB) | compressed (KB) | ratio | compression (s) | decompression (s) |
---|---|---|---|---|---|
"D:\testcases\testcase01EmptyFile\empty.txt" | 0 | 0.282 | / | 0.017 | 0.007 |
"D:\testcases\testcase02NormalSingleFile\1.txt" | 1985 | 1111 | 55.97% | 0.251 | 0.079 |
"D:\testcases\testcase02NormalSingleFile\2.pdb" | 34.5 | 15.5 | 44.99% | 0.078 | 0.02 |
"D:\testcases\testcase02NormalSingleFile\4.gz" | 0.143 | 0.388 | 271.33% | 0.034 | 0.032 |
"D:\testcases\testcase03XLargeSingleFile\1.jpg" | 20748 | 20694 | 99.74% | 2.421 | 1.592 |
"D:\testcases\testcase03XLargeSingleFile\3.csv" | 643412 | 411715 | 63.99% | 27.203 | 15.88 |
"D:\testcases\testcase5NomalFolder\3\000007.jpg" | 114 | 114 | 100.11% | 0.077 | 0.017 |
"D:\testcases\testcase06SubFolders\2\3.txt" | 5034 | 3700 | 73.5% | 0.462 | 0.275 |
"D:\testcases\testcase07XlargeSubFolders\2\7023341.htm" | 246 | 153 | 62.22% | 0.115 | 0.047 |
"D:\testcases\testcase08Speed\1.csv" | 643412 | 411715 | 63.99% | 29.741 | 16.233 |
"D:\testcases\testcase09Ratio\1.csv" | 441771 | 277050 | 62.71% | 22.316 | 12.623 |
Tool | testcase | original (KB) | compressed (KB) | ratio | compression (s) | decompression (s) |
---|---|---|---|---|---|---|
Huff | "D:\testcases\testcase02NormalSingleFile\1.txt" | 1985 | 1111 | 55.97% | 0.251 | 0.079 |
7z | | | 571 | 28.7% | 1 | / |
WinRAR | | | 602 | 30.3% | / | / |
Huff | "D:\testcases\testcase02NormalSingleFile\2.pdb" | 34.5 | 15.5 | 44.99% | 0.078 | 0.02 |
7z | | | 6.8 | 19.7% | / | / |
WinRAR | | | 9 | 26.1% | / | / |
Huff | "D:\testcases\testcase02NormalSingleFile\4.gz" | 0.143 | 0.388 | 271.33% | 0.034 | 0.032 |
7z | | | 0.261 | 182.52% | / | / |
WinRAR | | | 0.211 | 147.55% | / | / |
Huff | "D:\testcases\testcase03XLargeSingleFile\3.csv" | 643412 | 411715 | 63.99% | 27.203 | 15.88 |
7z | | | 55629 | 8.6% | 31 | 2.5 |
WinRAR | | | 49695 | 7.7% | 10 | 3 |
Huff | "D:\testcases\testcase08Speed\1.csv" | 643412 | 411715 | 63.99% | 29.741 | 16.233 |
7z | | | 55629 | 8.6% | 31 | 2.5 |
WinRAR | | | 49685 | 7.72% | 10 | 3 |
Huff | "D:\testcases\testcase09Ratio\1.csv" | 441771 | 277050 | 62.71% | 22.316 | 12.623 |
7z | | | 66736 | 15.47% | 30 | 3 |
WinRAR | | | 73231 | 16.58% | 11 | 3 |
In terms of compression ratio, both 7z and WinRAR consistently outperform the Huffman compression program; WinRAR in particular exhibits the best compression ratios across different file types.
The compression ratios for small files are relatively high for all three software tools, possibly because the compression benefits for small files are inherently limited.
In terms of compression time for larger files, WinRAR demonstrates the best performance, followed by my program, while 7z exhibits the slowest performance. This is mainly attributed to the differences in compression algorithms.
In terms of decompression time, Huff is limited by its single-threaded design: decompression proceeds sequentially, which hurts speed on larger files. 7z and WinRAR leverage multi-threading during decompression, processing different parts of the archive simultaneously; this parallelism speeds up decompression, which is especially advantageous for larger files.
Generally, decompression is much faster than compression.
- Static Huffman Compression (Huff):
  - Uses static Huffman coding; in these tests its compression ratio typically falls between 45% and 65%.
  - Limited by its single-threaded design, which hinders performance on larger files.
- 7z (LZMA2 algorithm):
  - Employs the LZMA2 algorithm, a sophisticated compression method known for its adaptability and high compression ratios.
  - LZMA2 achieves compression ratios in the range of 8% to 20% here, surpassing static Huffman coding.
  - Features multi-threading support, optimizing performance on multicore processors.
- WinRAR:
  - Implements a proprietary compression algorithm, potentially combining several methods, including but not limited to RAR and ZIP compression.
  - Demonstrates superior compression ratios, especially noticeable on larger files.
  - Benefits from multi-threading, improving efficiency in both compression and decompression.
- Huff:
  - Runs on the Java Virtual Machine (JVM), which introduces some performance overhead.
  - The higher-level nature of Java may contribute to the observed performance differences.
- 7z and WinRAR:
  - Likely implemented in lower-level languages such as C or C++, giving more efficient execution.
  - The absence of a virtual machine lets these tools interact more closely with system resources.
- Huff:
  - Lacks multi-threading, which limits its performance on larger files.
- 7z and WinRAR:
  - Leverage multi-threading to make full use of the processor, especially advantageous for compressing and decompressing large files.
  - The ability to use multiple cores contributes to faster processing times.
- Huff:
  - Implemented in Java, potentially resulting in a performance gap compared with tools written in lower-level languages.
- 7z and WinRAR:
  - Developed by seasoned professionals, likely in C or C++; the efficiency of their algorithms and implementations contributes significantly to their superior performance.
- How to store the Huffman tree efficiently: I read an answer on Stack Overflow about canonical Huffman trees.
- How to manipulate bitwise data: at first I wanted to use the library from Algorithms, 4th Edition, but the TA did not allow it, so I wrote the bitwise streams myself.
- How to do a quick preview: I hadn't planned carefully for it, so I simply added the relative paths to the header.
- How to check that files are identical: I searched online and used WinMerge.
- How to inspect binary files: I used the Hex Editor in VS Code.