Introduction

The following UDF can be used to compress a string into a binary via Huffman encoding. To keep the overhead for the huff table integrated in the binary as small as possible, the canonical form of the huffman table was implemented.

The UDF accesses many functionalities of the UDF "BinaryHelpers.au3". This is also part of the repository and offers functionalities for simplified handling of binary data in AutoIt.

How to use?

the following script:

#include "Huffman.au3"

; read out the autoitscript.com page as example string
Global $sString = BinaryToString(InetRead("https://autoitscript.com"))
ConsoleWrite("Original Size: " & BinaryLen(StringToBinary($sString)) & " Bytes" & @CRLF)

; encode the string with a canonical huffman encoding
Global $bEncoded = _huff_encodeString($sString)
ConsoleWrite("Encoded Size:  " & BinaryLen($bEncoded) & " Bytes" & @CRLF & _
	"Compress ratio: " & Round((1 - (BinaryLen($bEncoded) / BinaryLen(StringToBinary($sString)))) * 100.0, 1) & ' %' & @CRLF)

; decode the string out of the binary
Global $sStringDecoded = _huff_decodeBinary($bEncoded)

; check if original and decoded string are the same
ConsoleWrite("$sString == $sStringDecoded: " & ($sString == $sStringDecoded) & @CRLF)

produces:

Original Size: 157081 Bytes
Encoded Size:  102833 Bytes
Compress ratio: 34.5 %
$sString == $sStringDecoded: True

How to further increase the compression potential?

Huffman encoding is just a step to be able to compress data. In combination with further steps you can compress the data even more. For example, you can pre-compress the data at the text level and only then encode the data using Huffman.

The following extends the above example with a (simple) word repetition compression:

Huffman encoding + word repetition string compression

#include "Huffman.au3"

; read out the autoitscript.com page as example string
Global $sString = BinaryToString(InetRead("https://autoitscript.com"))
ConsoleWrite("Original Size: " & BinaryLen(StringToBinary($sString)) & " Bytes" & @CRLF)

; precompress at text level by word repetition:
Global $sStringCompressed = _strcmp_compressByWordRepetition($sString)

; encode the string with a canonical huffman encoding
Global $bEncoded = _huff_encodeString($sStringCompressed)
ConsoleWrite("Encoded Size:  " & BinaryLen($bEncoded) & " Bytes" & @CRLF & _
	"Compress ratio: " & Round((1 - (BinaryLen($bEncoded) / BinaryLen(StringToBinary($sString)))) * 100.0, 1) & ' %' & @CRLF)

; decode the string out of the binary
Global $sStringDecoded = _huff_decodeBinary($bEncoded)

; decompress at text level
$sStringDecoded = _strcmp_decompressByWordRepetition($sStringDecoded)

; check if original and decoded string are the same
ConsoleWrite("$sString == $sStringDecoded: " & ($sString == $sStringDecoded) & @CRLF)







; Tokenbasierte Kompression - ersetzt wiederholende Teile durch Nummern im Text
Func _strcmp_compressByWordRepetition($sString)
	Local 	$mTmp[3], $aTmp[2], $mPos, $iLen, _
			$mWords[], _
			$iPos = 1, _
			$sWord, $iWord = 0

	; Bereits vorhandene Präfixnummern escapen:
	$sString = StringRegExpReplace($sString, '(?s)\b(\d+)', Chr(27) & "$1")

	; Einzelne Wörter aus dem String extrahieren
	Do
		$aMatch = StringRegExp($sString, '(?s)\G.*?(\b[a-zA-ZäöüßÄÖÜ][\wäöüßÄÖÜ]{2,}\b)', 1, $iPos)
		IF @error Then ExitLoop
		$iPos = @extended
		$sWord = $aMatch[0]

		If MapExists($mWords, $sWord) Then
			$aTmp = $mWords[$sWord]
			$mTmp = $aTmp[1]
			MapAppend($mTmp, $iPos - StringLen($sWord))
			$aTmp[1] = $mTmp
		Else
			Local $mTmp[]
			Local $aTmp[2] = [$iWord, $mTmp]
		EndIf
		$mWords[$sWord] = $aTmp

		$iWord += 1
	Until 0

	; bereite die notwendigen Replaces auf
	Local $aReplaces[$iWord][3], $iIndRepl = 0
	For $sWord In MapKeys($mWords)
		$iLen = StringLen($sWord)
		$aTmp = $mWords[$sWord]
		$iWord = $aTmp[0]
		$mPos = $aTmp[1]

		If Ubound($mPos) < 1 Then ContinueLoop
		If $iLen <= Stringlen($iWord) Then ContinueLoop

		For $i In MapKeys($mPos)
			$iPos = $mPos[$i]

			$aReplaces[$iIndRepl][0] = $iPos
			$aReplaces[$iIndRepl][1] = $iLen
			$aReplaces[$iIndRepl][2] = $iWord

			$iIndRepl += 1

		Next
	Next
	Redim $aReplaces[$iIndRepl][3]
	_ArraySort($aReplaces)

	; Wörter durch Nummern im Text selbst ersetzen
	For $i = UBound($aReplaces) - 1 To 0 Step -1
		$sString = 	StringLeft($sString, $aReplaces[$i][0] - 1) & _
					$aReplaces[$i][2] & _
					StringTrimLeft($sString, $aReplaces[$i][0] + $aReplaces[$i][1] -1)
	Next

	Return $sString
EndFunc


; Tokenbasierte Dekompression - stellt einen mit _strcmp_compressByWordRepetition() komprimierten String wieder her
Func _strcmp_decompressByWordRepetition($sString)
	Local 	$aTmp[3], _
			$mWords[]
			$iPos = 1

	; Einzelne Wörter aus dem String extrahieren
	Do
		$aMatch = StringRegExp($sString, '(?s)\G.*?(\b[a-zA-ZäöüßÄÖÜ][\wäöüßÄÖÜ]{2,}\b|\b(?<!\x1B)\d+\b)', 1, $iPos)
		IF @error Then ExitLoop
		$iPos = @extended

		$aTmp[0] = $aMatch[0]
		$aTmp[1] = $iPos - StringLen($aMatch[0])
		$aTmp[2] = StringLen($aMatch[0])
		MapAppend($mWords, $aTmp)
	Until 0

	Local $aWords = _map_IntMap2Array($mWords)
	$aWords = _ArrayAinATo2d($aWords)

	; Nummern wieder durch Buchstaben ersetzen
	Local $iWord
	For $i = UBound($aWords) - 1  To 0 Step -1
		If Not StringRegExp($aWords[$i][0], "^\d+$", 0) Then ContinueLoop

		$iWord = Int($aWords[$i][0])
		If $iWord >= UBound($aWords) Then ContinueLoop
		;~ ConsoleWrite($iWord & @CRLF)
		$sString = 	StringLeft($sString, $aWords[$i][1] - 1) & _
					$aWords[$iWord][0] & _
					StringTrimLeft($sString, $aWords[$i][1] + $aWords[$i][2] -1)
	Next

	; Escapte Zahlen wieder unescapen:
	$sString = StringRegExpReplace($sString, '(?s)\x1B(\d+)', "$1")

	Return $sString
EndFunc


; #FUNCTION# ======================================================================================
; Name ..........: _map_IntMap2Array()
; Description ...: returns the values of a map only (useful for array-list like maps produced by MapAppend())
; Syntax ........: _map_IntMap2Array(ByRef $aMap)
; Parameters ....: ByRef $aMap - a map variable
; Return values .: 1D-array containing the values
; Author ........: aspirinjunkie
; Modified ......: 2022-07-13
; =================================================================================================
Func _map_IntMap2Array(ByRef $aMap)
	Local $aRet[UBound($aMap)]
	Local $iInd = 0
	For $vEl In $aMap
		$aRet[$iInd] = $vEl
		$iInd += 1
	Next
	Return $aRet
EndFunc   ;==>_map_IntMap2Array


; #FUNCTION# ======================================================================================
; Name ..........: _ArrayAinATo2d()
; Description ...: Convert a Arrays in Array into a 2D array
; Syntax ........: _ArrayAinATo2d(ByRef $A)
; Parameters ....: $A             - the arrays in array which should be converted
; Return values .: Success: a 2D Array build from the input array
;                  Failure: Null
;                     @error = 1: $A is'nt an 1D array
;                            = 2: $A is empty
;                            = 3: first element isn't a array
; Author ........: AspirinJunkie
; =================================================================================================
Func _ArrayAinATo2d(ByRef $A)
	If UBound($A, 0) <> 1 Then Return SetError(1, UBound($A, 0), Null)
	Local $N = UBound($A)
	If $N < 1 Then Return SetError(2, $N, Null)
	Local $u = UBound($A[0])
	If $u < 1 Then Return SetError(3, $u, Null)

	Local $a_Ret[$N][$u]

	For $i = 0 To $N - 1
		Local $t = $A[$i]
		If UBound($t) > $u Then ReDim $a_Ret[$N][UBound($t)]
		For $j = 0 To UBound($t) - 1
			$a_Ret[$i][$j] = $t[$j]
		Next
	Next
	Return SetExtended($N, $a_Ret)
EndFunc   ;==>_ArrayAinATo2d

which leads to the following result:

Original Size: 157081 Bytes
Encoded Size:  73333 Bytes
Compress ratio: 53.3 %
$sString == $sStringDecoded: True

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
BinaryHelpers.au3		BinaryHelpers.au3
Huffman.au3		Huffman.au3
LICENSE		LICENSE
README.md		README.md
example - text compression with word repetition.au3		example - text compression with word repetition.au3

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Introduction

How to use?

How to further increase the compression potential?

About

Releases

Packages

Languages

License

Sylvan86/autoit-huffman-udf

Folders and files

Latest commit

History

Repository files navigation

Introduction

How to use?

How to further increase the compression potential?

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages