- Windows 11
- OpenJDK 20.0.2
- IntelliJ IDEA
This project is complete and fulfills all specified requirements. It currently works on the Windows file system.
Usage: compile the whole project and `cd` to the directory that contains `cf.class`, `xf.class`, and `pv.class`. All paths must be absolute.
```
// create an archived file
java cf <InputFile>(file or directory) [OutputFile](optional)
// preview an archived file
java pv <InputFile> [OutputFile](optional)
// decompress an archived file
java xf <InputFile> [OutputPath](optional)
```
- `cf` for compression, `pv` for preview, and `xf` for decompression: three classes that handle command-line input.
- Package `BitwiseStream` contains `BitInputStream` and `BitOutputStream`, which handle bitwise input during decompression and bitwise output during compression.
- Package `Huffman` contains the implementation of `Nodes`; `HuffmanTree` implements the Huffman algorithm, and `TreeCanonization` canonizes the Huffman tree.
- Package `FileProcess`:
  - `FileIO` contains the magic numbers for file recognition and basic helper functions.
  - `StopWatch` provides a timer.
  - `OverwriteProtector` protects against accidental overwrites.
  - `FrequencyTable` counts the frequency of each byte pattern.
  - `Decoder` and `Encoder` hold the translation table or tree; they are helpers for decompression and compression.
  - `Compression` and `Decompression` contain the main routines.
  - `PathTreePrint` is a helper for the file-structure preview.
The archive format adds compression headers and magic numbers to distinguish files from directories; this is discussed in detail in the next section. This section focuses solely on the compression and decompression processes.
First, the `cf` and `xf` commands parse the command-line parameters and then call the corresponding `Compression` and `Decompression` functions.
During compression, the input file is first read in full to build the `FrequencyTable`. `HuffmanTree` then applies the Huffman algorithm to determine the code length for each byte; the tree itself becomes useless after this step.
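A minimal sketch of these two steps, with illustrative names (the project's actual `FrequencyTable` and `HuffmanTree` interfaces may differ): count byte frequencies in one pass, then repeatedly merge the two lightest nodes and read each byte's depth off as its code length.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Comparator;
import java.util.PriorityQueue;

// Illustrative sketch only; names do not match the project's real classes.
class CodeLengthSketch {
    // First pass over the file: count how often each byte value occurs.
    static long[] countFrequencies(InputStream in) throws IOException {
        long[] freq = new long[256];
        int b;
        while ((b = in.read()) != -1) freq[b]++;
        return freq;
    }

    // Tree node: a leaf carries a byte value (symbol >= 0), an inner node has children.
    private record Node(long weight, int symbol, Node left, Node right) {}

    // Build the Huffman tree with a min-heap and return each byte's depth as
    // its code length; bytes that never occur keep length 0. The tree itself
    // is discarded afterwards -- only the lengths are needed.
    static int[] codeLengths(long[] freq) {
        PriorityQueue<Node> pq = new PriorityQueue<>(Comparator.comparingLong(Node::weight));
        for (int s = 0; s < 256; s++)
            if (freq[s] > 0) pq.add(new Node(freq[s], s, null, null));
        if (pq.size() == 1)                       // degenerate input: one distinct byte
            pq.add(new Node(0, -1, null, null));  // pad with a dummy leaf
        while (pq.size() > 1) {
            Node a = pq.poll(), b = pq.poll();    // merge the two lightest subtrees
            pq.add(new Node(a.weight() + b.weight(), -1, a, b));
        }
        int[] lengths = new int[256];
        if (!pq.isEmpty()) fillDepths(pq.poll(), 0, lengths);
        return lengths;
    }

    private static void fillDepths(Node n, int depth, int[] lengths) {
        if (n.left() == null) {                   // leaf (possibly the dummy)
            if (n.symbol() >= 0) lengths[n.symbol()] = Math.max(depth, 1);
        } else {
            fillDepths(n.left(), depth + 1, lengths);
            fillDepths(n.right(), depth + 1, lengths);
        }
    }
}
```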
After obtaining the code lengths, `TreeCanonization` produces the canonical code table. The advantage of this table is that only the code length of each byte has to be recorded in order to reconstruct it. The canonical table differs from the originally generated one, but the code length of every byte stays the same.
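The canonical assignment itself is short: process the symbols in order of (code length, byte value), hand out consecutive code values, and shift left whenever the length grows. A sketch with illustrative names, not the project's actual `TreeCanonization` API:

```java
// Illustrative sketch of canonical code assignment. Because symbols are
// processed in a fixed order of (length, value), the lengths alone are
// enough to rebuild every code on the decompression side.
static int[] canonicalCodes(int[] lengths) {
    Integer[] symbols = new Integer[256];
    for (int i = 0; i < 256; i++) symbols[i] = i;
    java.util.Arrays.sort(symbols, (a, b) -> lengths[a] != lengths[b]
            ? Integer.compare(lengths[a], lengths[b])
            : Integer.compare(a, b));
    int[] codes = new int[256];
    int code = 0, prevLen = 0;
    for (int s : symbols) {
        if (lengths[s] == 0) continue;        // byte value never occurs
        code <<= (lengths[s] - prevLen);      // widen the code to the new length
        codes[s] = code++;
        prevLen = lengths[s];
    }
    return codes;
}
```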
The file name and code-length table are written to the output uncompressed, and then the `Encoder` is called to compress the bytes of the file one by one. The output during this stage is produced bit by bit.
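Because Java's standard streams are byte-oriented, the bit-level output has to be buffered by hand. A minimal sketch of what such a bit writer can look like (the project's `BitOutputStream` may differ; in particular, the real format ends each file block with a special EOF symbol rather than relying on padding alone):

```java
import java.io.IOException;
import java.io.OutputStream;

// Sketch of a bit-level writer: bits are collected in a one-byte buffer and
// flushed every 8 bits; close() pads the last partial byte with zeros.
class BitOutputStreamSketch implements AutoCloseable {
    private final OutputStream out;
    private int buffer = 0;   // pending bits, most significant first
    private int count = 0;    // number of pending bits

    BitOutputStreamSketch(OutputStream out) { this.out = out; }

    void writeBit(int bit) throws IOException {
        buffer = (buffer << 1) | (bit & 1);
        if (++count == 8) { out.write(buffer); buffer = 0; count = 0; }
    }

    // Write the low 'len' bits of 'code', most significant bit first.
    void writeCode(int code, int len) throws IOException {
        for (int i = len - 1; i >= 0; i--) writeBit((code >>> i) & 1);
    }

    @Override public void close() throws IOException {
        while (count != 0) writeBit(0);   // pad to a byte boundary
        out.close();
    }
}
```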
During decompression, the original file name and code-length table are first read from the archive. The decoding tree is reconstructed from the code-length table and passed to the `Decoder`. The archive is then read one bit at a time, and the `Decoder` outputs the byte found at each leaf and writes it to the corresponding file.
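Conceptually, the decoder just walks the rebuilt tree one bit at a time and emits a byte at every leaf. A sketch under assumed interfaces (the node type and the bit source below are illustrative stand-ins for the project's classes):

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.function.IntSupplier;

class DecoderSketch {
    // Minimal decode-tree node: a leaf carries a symbol, an inner node has children.
    static class Node {
        Node left, right;
        int symbol = -1;
        boolean isLeaf() { return left == null && right == null; }
    }

    // Walk the tree one bit at a time; emit the byte at each leaf and restart
    // at the root, until the special EOF symbol is decoded. 'readBit' stands
    // in for the project's BitInputStream.
    static void decode(IntSupplier readBit, Node root, OutputStream out,
                       int eofSymbol) throws IOException {
        Node node = root;
        while (true) {
            node = (readBit.getAsInt() == 0) ? node.left : node.right;
            if (node.isLeaf()) {
                if (node.symbol == eofSymbol) return;  // end of this file block
                out.write(node.symbol);
                node = root;
            }
        }
    }
}
```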
To support compressing directories, the path of each file must be recorded. So that empty folders are not silently dropped, every path is recorded individually. A byte before each name marks whether it is a file or a directory, and two magic-number bytes at the start of the archive identify files generated by this program, independent of the file extension.
The constants mentioned above are defined in `FileIO`.
OutputFile Structure
- header: MagicNumber(2 bytes)--fileStructureHeader(strings of file paths, each ending with STRING_END_SIGN)--HEADER_END_SIGN
- continuous FileBlocks or DirectoryBlocks, one per path in the source.
- FileBlocks: SINGLE_FILE_MAGIC--StringOfPath--STRING_END_SIGN--CanonicalCodeTable--encodedBits(ending with a special EOF)
- DirectoryBlocks: DIRECTORY_MAGIC--StringOfPath--STRING_END_SIGN

PS: Strings are stored in the file C-style rather than Java-style: String = chars + STRING_END_SIGN.
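For illustration, writing the header and the C-style strings might look like the sketch below; the magic values here are placeholders, since the real constants live in `FileIO`:

```java
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

class HeaderSketch {
    // Placeholder values; the real constants are defined in FileIO.
    static final byte[] MAGIC_NUMBER = {(byte) 0xAB, (byte) 0xCD};
    static final byte STRING_END_SIGN = 0x00;
    static final byte HEADER_END_SIGN = 0x01;

    // A string is stored C-style: its bytes followed by STRING_END_SIGN.
    static void writeString(OutputStream out, String s) throws IOException {
        out.write(s.getBytes(StandardCharsets.UTF_8));
        out.write(STRING_END_SIGN);
    }

    // header = MagicNumber ++ one C-style string per path ++ HEADER_END_SIGN
    static void writeHeader(OutputStream out, Iterable<String> paths) throws IOException {
        out.write(MAGIC_NUMBER);
        for (String p : paths) writeString(out, p);
        out.write(HEADER_END_SIGN);
    }
}
```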
The command-line entry points are set up as `cf`, `pv`, and `xf`. For example, `java cf <InputFile>(file or directory) [OutputFile](optional)` invokes the compression feature.
The `StopWatch` class records elapsed time, and the return values of some functions accumulate the original files' total size.
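A `System.nanoTime`-based timer is enough for this; a minimal sketch (the project's actual `StopWatch` may differ):

```java
// Minimal timer sketch based on System.nanoTime (the project's actual
// StopWatch class may differ).
class StopWatchSketch {
    private final long start = System.nanoTime();

    // Seconds elapsed since this stopwatch was created.
    double elapsedSeconds() { return (System.nanoTime() - start) / 1e9; }
}
```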
Two magic-number bytes are written at the start of every file generated by compression and verified before processing. If the magic numbers are incorrect, the program prints a message and terminates.
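A sketch of that check, again with a placeholder magic value:

```java
import java.io.IOException;
import java.io.InputStream;

class MagicCheckSketch {
    static final byte[] MAGIC_NUMBER = {(byte) 0xAB, (byte) 0xCD}; // placeholder

    // Read the first two bytes of the archive; report and terminate on mismatch.
    static void verifyMagic(InputStream in) throws IOException {
        int b0 = in.read(), b1 = in.read();
        if (b0 != (MAGIC_NUMBER[0] & 0xFF) || b1 != (MAGIC_NUMBER[1] & 0xFF)) {
            System.err.println("Not an archive produced by this program.");
            System.exit(1);
        }
    }
}
```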
Before creating the output stream during compression or decompression, a file that already exists at the target path would be overwritten. The program therefore invokes `OverwriteProtector`, and the user can choose to proceed or abort.
Since many files may need to be overwritten during decompression, the user is also given an 'overwrite all' option; choosing it disables `OverwriteProtector` for the rest of the operation.
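A sketch of such a prompt (the project's real `OverwriteProtector` interface is not shown, so this is only illustrative):

```java
import java.io.File;
import java.util.Scanner;

// Illustrative prompt; the project's real OverwriteProtector may differ.
class OverwritePromptSketch {
    private boolean overwriteAll = false;
    private final Scanner scanner = new Scanner(System.in);

    // Returns true if writing to 'file' may proceed. Once the user answers
    // "all", no further questions are asked for this operation.
    boolean mayWrite(File file) {
        if (!file.exists() || overwriteAll) return true;
        System.out.print(file + " exists. Overwrite? [y]es / [n]o / [a]ll: ");
        String answer = scanner.nextLine().trim().toLowerCase();
        if (answer.startsWith("a")) { overwriteAll = true; return true; }
        return answer.startsWith("y");
    }
}
```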
To implement the preview function, the relative paths of all files and directories are added at the beginning of the archive, exclusively for preview purposes.
`pv` reads only the file names at the beginning of the archive and never touches the large compressed payload that follows, which makes it very fast.
`FileIO.PathTreePrint` uses a trie to merge the input path strings into a tree structure, so a tree-shaped file listing can be printed.
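A minimal sketch of that trie approach, with illustrative names: split each path on the separator, insert the components into nested sorted maps, and print recursively with indentation.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class PathTreePrintSketch {
    // One trie node per path component; TreeMap keeps children sorted.
    static class Node {
        final Map<String, Node> children = new TreeMap<>();
    }

    // Split each path on '/' or '\' and insert its components into the trie.
    static Node buildTrie(Iterable<String> paths) {
        Node root = new Node();
        for (String path : paths) {
            Node node = root;
            for (String part : path.split("[/\\\\]"))
                node = node.children.computeIfAbsent(part, k -> new Node());
        }
        return root;
    }

    // Print the trie as an indented tree.
    static void print(Node node, String indent) {
        for (Map.Entry<String, Node> e : node.children.entrySet()) {
            System.out.println(indent + e.getKey());
            print(e.getValue(), indent + "    ");
        }
    }

    public static void main(String[] args) {
        print(buildTrie(List.of("a/b/c.txt", "a/b/d.txt", "a/e")), "");
    }
}
```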
testcase | original (KB) | compressed (KB) | ratio | compression (s) | decompression (s) |
---|---|---|---|---|---|
"D:\testcases\testcase01EmptyFile\empty.txt" | 0 | 0.282 | / | 0.017 | 0.007 |
"D:\testcases\testcase02NormalSingleFile\1.txt" | 1985 | 1111 | 55.97% | 0.251 | 0.079 |
"D:\testcases\testcase02NormalSingleFile\2.pdb" | 34.5 | 15.5 | 44.99% | 0.078 | 0.02 |
"D:\testcases\testcase02NormalSingleFile\4.gz" | 0.143 | 0.388 | 271.33% | 0.034 | 0.032 |
"D:\testcases\testcase03XLargeSingleFile\1.jpg" | 20748 | 20694 | 99.74% | 2.421 | 1.592 |
"D:\testcases\testcase03XLargeSingleFile\3.csv" | 643412 | 411715 | 63.99% | 27.203 | 15.88 |
"D:\testcases\testcase5NomalFolder\3\000007.jpg" | 114 | 114 | 100.11% | 0.077 | 0.017 |
"D:\testcases\testcase06SubFolders\2\3.txt" | 5034 | 3700 | 73.5% | 0.462 | 0.275 |
"D:\testcases\testcase07XlargeSubFolders\2\7023341.htm" | 246 | 153 | 62.22% | 0.115 | 0.047 |
"D:\testcases\testcase08Speed\1.csv" | 643412 | 411715 | 63.99% | 29.741 | 16.233 |
"D:\testcases\testcase09Ratio\1.csv" | 441771 | 277050 | 62.71% | 22.316 | 12.623 |
Tool | testcase | original (KB) | compressed (KB) | ratio | compression (s) | decompression (s) |
---|---|---|---|---|---|---|
Huff | "D:\testcases\testcase02NormalSingleFile\1.txt" | 1985 | 1111 | 55.97% | 0.251 | 0.079 |
7z | | | 571 | 28.7% | 1 | / |
WinRAR | | | 602 | 30.3% | / | / |
Huff | "D:\testcases\testcase02NormalSingleFile\2.pdb" | 34.5 | 15.5 | 44.99% | 0.078 | 0.02 |
7z | | | 6.8 | 19.7% | / | / |
WinRAR | | | 9 | 26.1% | / | / |
Huff | "D:\testcases\testcase02NormalSingleFile\4.gz" | 0.143 | 0.388 | 271.33% | 0.034 | 0.032 |
7z | | | 0.261 | 182.52% | / | / |
WinRAR | | | 0.211 | 147.55% | / | / |
Huff | "D:\testcases\testcase03XLargeSingleFile\3.csv" | 643412 | 411715 | 63.99% | 27.203 | 15.88 |
7z | | | 55629 | 8.6% | 31 | 2.5 |
WinRAR | | | 49695 | 7.7% | 10 | 3 |
Huff | "D:\testcases\testcase08Speed\1.csv" | 643412 | 411715 | 63.99% | 29.741 | 16.233 |
7z | | | 55629 | 8.6% | 31 | 2.5 |
WinRAR | | | 49685 | 7.72% | 10 | 3 |
Huff | "D:\testcases\testcase09Ratio\1.csv" | 441771 | 277050 | 62.71% | 22.316 | 12.623 |
7z | | | 66736 | 15.47% | 30 | 3 |
WinRAR | | | 73231 | 16.58% | 11 | 3 |
In terms of compression ratio, both 7z and WinRAR consistently outperform the Huffman compression program; WinRAR in particular exhibits the best compression ratios across different file types.
The compression ratios for small files are relatively high for all three software tools, possibly because the compression benefits for small files are inherently limited.
In terms of compression time for larger files, WinRAR demonstrates the best performance, followed by my program, while 7z exhibits the slowest performance. This is mainly attributed to the differences in compression algorithms.
In terms of decompression time, Huff is limited by its single-threaded design: decompression proceeds sequentially, which hurts speed on larger files. 7z and WinRAR leverage multi-threading during decompression, processing different parts of the archive simultaneously; this parallelism speeds up decompression, which is especially advantageous for larger files.
Generally, decompression is much faster than compression.
- Static Huffman Compression (Huff):
  - Uses static Huffman coding; in these tests its compression ratio typically falls between 45% and 65%.
  - Limited by its single-threaded design, which hinders performance on larger files.
- 7z (LZMA2 algorithm):
  - Employs the LZMA2 algorithm, a sophisticated compression method known for its adaptability and high compression ratios.
  - LZMA2 achieves compression ratios in the range of 8% to 20% here, surpassing static Huffman coding.
  - Features multi-threading support, optimizing performance on multicore processors.
- WinRAR:
  - Implements a proprietary compression algorithm, potentially combining several methods, including but not limited to RAR and ZIP compression.
  - Demonstrates superior compression ratios, especially noticeable on larger files.
  - Benefits from multi-threading, improving efficiency in both compression and decompression.
- Huff:
  - Runs on the Java Virtual Machine (JVM), which introduces some performance overhead.
  - The higher-level nature of Java may contribute to the observed performance differences.
- 7z and WinRAR:
  - Likely implemented in lower-level languages such as C or C++, giving more efficient execution.
  - The absence of a virtual machine lets these tools interact more closely with system resources.
- Huff:
  - Lacks multi-threading, which limits its performance on larger files.
- 7z and WinRAR:
  - Leverage multi-threading to make full use of the processor, especially advantageous for compressing and decompressing large files.
  - The ability to use multiple cores contributes to faster processing times.
- Huff:
  - Implemented in Java, potentially resulting in a performance gap compared with tools written in lower-level languages.
- 7z and WinRAR:
  - Developed by seasoned professionals, likely in C or C++; the efficiency of their algorithms and implementations contributes significantly to their superior performance.
- How to store the Huffman tree efficiently: I read an answer on Stack Overflow about canonical Huffman trees.
- How to manipulate bitwise data: at first I wanted to use the library from Algorithms, 4th Edition, but the TA did not allow it, so I wrote the bitwise streams myself.
- How to do a quick preview: I hadn't planned carefully for it, so I simply added the relative paths to the header.
- How to check that files are identical: I searched online and used WinMerge.
- How to inspect binary files: I used the Hex Editor in VS Code.