Refactor #3
Open
gagoar wants to merge 18 commits into main from refactor

Commits (18):
- 1b27030 new changes (mlacosta)
- 6301801 refactoring (mlacosta)
- e4d5f42 add test (mlacosta)
- 3912a1d add files (mlacosta)
- 76f356f debugging
- 74d2341 some split tests + functions next to refactors (gagoar)
- f2093b5 more experimentation (gagoar)
- fd6139b create smaller dataset
- 52b3548 91% tested, working (gagoar)
- c372513 fixing workflows and lint (gagoar)
- b9893a9 remove old folder
- 842cdbc latest changes
- 7445b02 work on readme
- bfd7a83 update readme
- ef43abb update readme
- 1a0a275 Update README.md (mlacosta)
- bb9f1b9 Update README.md (mlacosta)
- 5bd1350 Update README.md (mlacosta)

@@ -0,0 +1,20 @@
on:
  push:
    # Sequence of patterns matched against refs/tags
    tags:
      - 'v*' # Push events to matching v*, i.e. v1.0, v20.15.10

jobs:
  npm:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-node@v2
        with:
          node-version: '12.x'
          registry-url: 'https://registry.npmjs.org'
      - run: npm install
      - run: npm run build
      - run: npm publish
        env:
          NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}

@@ -0,0 +1,50 @@
name: Validation

on: [pull_request]

jobs:
  lint:
    name: Linting
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@master
      - name: Use Node.js 12.x
        uses: actions/setup-node@v2
        with:
          node-version: 12.x
      - name: Install dependencies
        run: npm install
      - name: ESLint
        run: npm run lint
  test:
    name: Run unit tests
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@master
      - name: Use Node.js 12.x
        uses: actions/setup-node@v2
        with:
          node-version: 12.x
      - name: Install dependencies
        run: npm install
      - name: Jest
        run: npm run test --coverage
      - name: Send coverage to codecov
        uses: codecov/codecov-action@v1
        with:
          token: ${{ secrets.CODECOV_TOKEN }}
          flags: unittests

  build:
    name: Run build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@master
      - name: Use Node.js 12.x
        uses: actions/setup-node@v2
        with:
          node-version: 12.x
      - name: Install dependencies
        run: npm install
      - name: Build
        run: npm run build

@@ -1 +1,8 @@
dataset.js
node_modules
.vscode/
# ignore codecoverage output
coverage/
# ignore cli binary output
cli/
# ignore dist/ output
dist/

@@ -0,0 +1 @@
14.17.5

@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2020 Gago

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

@@ -1,33 +1,217 @@
# Huffman-URL-Compressor-for-Nodejs
## Author: Mariano L. Acosta
<p align="center">
<h3 align="center">Huffman URL Compressor</h3>

# Description
Convert any kind of String into a url-friendly parameter using Huffman Encoding.
<p align="center">
⚙ Convert any kind of String into a url-friendly parameter using Huffman Encoding ⚙
<br />
<a href="https://github.com/mlacosta/huffman-url-compressor#table-of-contents"><strong>Explore the docs »</strong></a>
<br />
<a href="https://github.com/mlacosta/huffman-url-compressor/issues">Report Bug</a>
·
<a href="https://github.com/mlacosta/huffman-url-compressor/issues">Request Feature</a>
</p>
</p>

# Installation
## Table of Contents

npm install --save huffman-url-compressor
- [Built With](#built-with)
- [Getting Started](#getting-started)
- [Motivation](#motivation)
- [Huffman Compression](#huffman-compression)
- [Examples](#examples)
- [Contributing](#contributing)
- [License](#license)

# Usage
<!-- CONTRIBUTING -->

## Parameters:
### Built With

- **Train (string)**: Training set used to create the encoder. This is where the algorithm gets the frequency for each char.
- **Test (string)**: String that you want to encode.
- [ncc](https://github.com/vercel/ncc/)
- [jest](https://github.com/facebook/jest)
- [ora](https://github.com/sindresorhus/ora)
- [commander](https://github.com/tj/commander.js/)
- [cosmiconfig](https://github.com/davidtheclark/cosmiconfig)

## Example:
## Getting Started

import {createEncoder, encodeConfig, decodeConfig} from 'huffman-url-compressor';
To install this dependency on your project:

//create encoder
`npm i huffman-url-compressor`

let Encoder = createEncoder(train);
## Motivation

//create a base64 encoded stream
This library was originally intended to be used as a URL-friendly encoder/decoder. The idea was to process a chunk of text, compress the data, and then embed it in a URL as a query parameter. Later on, you can retrieve the original piece of text using the same encoder.

let encodedParam = encodeConfig(test,Encoder)

//retrieve the original param
A typical application for this library is permalink creation and sharing. For instance, if you want to put a long text in a URL but you are constrained in length, this encoder will output a shorter base64 string that you can use instead. After that, you can reduce the length further using a URL-shortening service.

let decodParam= decodeConfig(encodedParam,Encoder)
## Huffman Compression

Huffman compression is a lossless data encoding technique that uses a greedy approach based on how often each character or symbol occurs. In theory, it can save between 20 and 90 percent of the space, depending on the data.

First, suppose we have a set of 6 letters and the number of occurrences (frequency) for each one:

| letter | frequency |
| ------ | --------- |
| a      | 45        |
| b      | 13        |
| c      | 12        |
| d      | 16        |
| e      | 9         |
| f      | 5         |

Since we have 6 symbols, a naive approach would be to use a fixed 3-bit encoding for each one of them:

| letter | bitstring |
| ------ | --------- |
| a      | 000       |
| b      | 001       |
| c      | 010       |
| d      | 011       |
| e      | 100       |
| f      | 101       |

For instance, if we want to encode the string 'bacab' using the table above:

```
'bacab' transforms into '001000010000001'
```

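The fixed-length step above can be sketched in a few lines of JavaScript. This is only an illustration of the table, not code from this library; the names `FIXED_CODES` and `encodeFixed` are made up for the example.

```javascript
// Fixed-length code table from the text: 6 symbols need 3 bits each.
// FIXED_CODES and encodeFixed are hypothetical names used only for this sketch.
const FIXED_CODES = { a: '000', b: '001', c: '010', d: '011', e: '100', f: '101' };

// Concatenate the code of every character; throw on symbols outside the table.
function encodeFixed(text, codes = FIXED_CODES) {
  return Array.from(text, (ch) => {
    const code = codes[ch];
    if (typeof code !== 'string') throw new Error(`no code for symbol "${ch}"`);
    return code;
  }).join('');
}

console.log(encodeFixed('bacab')); // 001000010000001
```

Every symbol costs 3 bits here regardless of how often it appears, which is the inefficiency that Huffman codes address.
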
Then, we can encode it further using a [base64](https://en.wikipedia.org/wiki/Base64) approach:

| binary (6-bits) | base64 (char) |
| --------------- | ------------- |
| 010000          | Q             |
| 001000          | I             |

```
'001000010000001' transforms into 'IQI='
```

In this case, the bitstring is zero-padded to a whole number of bytes, and the symbol `=` marks the base64 padding by convention.

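As a rough sketch of this step, the bitstring can be packed into bytes and handed to Node's built-in `Buffer`. The helper name `bitsToBase64` is an assumption for illustration, and so is the choice to zero-pad the last byte on the right; the library itself may pack bits differently.

```javascript
// Pack a string of '0'/'1' characters into bytes, zero-padding the last byte
// on the right, then base64-encode the bytes with Node's Buffer.
// bitsToBase64 is a hypothetical helper, not part of this library's API.
function bitsToBase64(bits) {
  const bytes = [];
  for (let i = 0; i < bits.length; i += 8) {
    bytes.push(parseInt(bits.slice(i, i + 8).padEnd(8, '0'), 2));
  }
  return Buffer.from(bytes).toString('base64');
}

console.log(bitsToBase64('001000010000001')); // IQI=
```

Note that the padding bits are indistinguishable from real data once packed, so a decoder also needs the original bit length (or some end marker) to stop at the right place.
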
Note that this way of encoding can be represented as a binary tree where each leaf holds a letter and its frequency. More generally, each parent node contains the sum of its children's frequencies and the combination of their symbols. Starting from the root, one can search for a symbol and output a `0` or a `1` depending on whether the search moves left or right, respectively.

![](https://i.imgur.com/QM2laV5.jpg)

However, a better approach would be to create [prefix](https://en.wikipedia.org/wiki/Prefix_code) codes based on each letter's frequency. In that way, we can generate a variable-length encoding that depends on the number of occurrences (the more frequent a letter, the shorter its representation). This results in a shorter bitstring on average. This particular way of operating is known as Huffman Compression.

First, we need to generate a node for each one of the letters. We can use a data structure like this:

```
{
  "symbol": "a",
  "frequency": 45
}
```

Next, we combine all the nodes in a tree-like structure using a greedy algorithm that repeatedly picks the two least frequent nodes and merges them. In our example, we start with:

```
{
  "symbol": "f",
  "frequency": 5
}

{
  "symbol": "e",
  "frequency": 9
}
```

and we create the node:

```
{
  "symbol": "fe",
  "frequency": 14
}
```

Finally, we remove the nodes 'e' and 'f' from our pool and replace them with the node 'fe'. Repeating this process until all the nodes are merged, we obtain a Huffman tree that serves as our encoder:

![](https://i.imgur.com/roKnNFS.jpg)

An efficient way to implement this is with a [min-heap](<https://en.wikipedia.org/wiki/Heap_(data_structure)>) data structure.

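The merging procedure described above could look roughly like the following. This is a minimal sketch under stated assumptions, not the library's implementation: `MinHeap`, `buildHuffmanTree` and `codeTable` are hypothetical names, the `left`/`right` fields are an addition to the node shape shown earlier, and the first node popped is arbitrarily placed on the left (bit `0`).

```javascript
// Minimal binary min-heap ordered by node.frequency (hypothetical helper).
class MinHeap {
  constructor() { this.items = []; }
  get size() { return this.items.length; }
  push(node) {
    const a = this.items;
    a.push(node);
    let i = a.length - 1;
    while (i > 0) { // sift up
      const parent = (i - 1) >> 1;
      if (a[parent].frequency <= a[i].frequency) break;
      [a[parent], a[i]] = [a[i], a[parent]];
      i = parent;
    }
  }
  pop() {
    const a = this.items;
    const top = a[0];
    const last = a.pop();
    if (a.length > 0) {
      a[0] = last;
      let i = 0;
      for (;;) { // sift down
        const l = 2 * i + 1;
        const r = l + 1;
        let m = i;
        if (l < a.length && a[l].frequency < a[m].frequency) m = l;
        if (r < a.length && a[r].frequency < a[m].frequency) m = r;
        if (m === i) break;
        [a[m], a[i]] = [a[i], a[m]];
        i = m;
      }
    }
    return top;
  }
}

// Repeatedly merge the two least frequent nodes until one root remains.
function buildHuffmanTree(frequencies) {
  const heap = new MinHeap();
  for (const [symbol, frequency] of Object.entries(frequencies)) {
    heap.push({ symbol, frequency });
  }
  while (heap.size > 1) {
    const left = heap.pop();
    const right = heap.pop();
    heap.push({
      symbol: left.symbol + right.symbol,
      frequency: left.frequency + right.frequency,
      left,
      right,
    });
  }
  return heap.pop();
}

// Walk the tree: going left appends '0', going right appends '1'.
function codeTable(node, prefix = '', table = {}) {
  if (!node.left) {
    table[node.symbol] = prefix || '0'; // single-symbol alphabet edge case
    return table;
  }
  codeTable(node.left, prefix + '0', table);
  codeTable(node.right, prefix + '1', table);
  return table;
}

const huffmanTree = buildHuffmanTree({ a: 45, b: 13, c: 12, d: 16, e: 9, f: 5 });
const huffmanCodes = codeTable(huffmanTree);
console.log(huffmanCodes.a, huffmanCodes.b, huffmanCodes.c); // 0 101 100
console.log(huffmanCodes.d, huffmanCodes.e, huffmanCodes.f); // 111 1101 1100
```

With a heap, each of the merges costs a logarithmic number of comparisons instead of a full scan of the pool. A different left/right convention would give different bit patterns with the same code lengths.
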
Back to our original example but now using the Huffman tree:

| letter | bitstring |
| ------ | --------- |
| a      | 0         |
| b      | 101       |
| c      | 100       |
| d      | 111       |
| e      | 1101      |
| f      | 1100      |

```
'bacab' transforms into '10101000101'
```

We save 26.67% of the space compared to the fixed 3-bit encoding.

`compression ratio (CR in %) = 11/15 x 100% = 73.33%`

`saved space = 100% - CR = 26.67%`

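To make the numbers above concrete, here is a small sketch that encodes and decodes 'bacab' with the variable-length table and recomputes the savings. It assumes `e` maps to `1101` and `f` to `1100` so that no two letters share a code; `HUFFMAN_CODES`, `encodeBits` and `decodeBits` are names invented for this example.

```javascript
// Variable-length prefix codes (assumed: e = 1101, f = 1100, so codes are unique).
const HUFFMAN_CODES = { a: '0', b: '101', c: '100', d: '111', e: '1101', f: '1100' };

// Concatenate the code of every character.
function encodeBits(text, codes) {
  return Array.from(text, (ch) => codes[ch]).join('');
}

// No code is a prefix of another, so bits can be consumed greedily:
// as soon as the accumulated bits match a code, emit its symbol.
function decodeBits(bits, codes) {
  const symbolByCode = new Map(Object.entries(codes).map(([s, c]) => [c, s]));
  let out = '';
  let current = '';
  for (const bit of bits) {
    current += bit;
    if (symbolByCode.has(current)) {
      out += symbolByCode.get(current);
      current = '';
    }
  }
  if (current !== '') throw new Error('trailing bits do not form a complete code');
  return out;
}

const huffmanBits = encodeBits('bacab', HUFFMAN_CODES);
const fixedBitLength = 3 * 'bacab'.length; // 15 bits with the 3-bit table
const ratio = huffmanBits.length / fixedBitLength; // 11 / 15

console.log(huffmanBits); // 10101000101
console.log(decodeBits(huffmanBits, HUFFMAN_CODES)); // bacab
console.log((ratio * 100).toFixed(2), ((1 - ratio) * 100).toFixed(2)); // 73.33 26.67
```
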
Later we can introduce base64 encoding:

```
'10101000101' transforms into 'qKA='
```

**Note:** there's a trade-off between the Huffman compression rate and the expansion generated by the base64 encoding that should be taken into consideration in each case.

For more theoretical background check: [Cormen, T. H., Leiserson, C. E., Rivest, R. L., & Stein, C. (2009). Introduction to algorithms. MIT press.](https://books.google.com.ar/books?hl=en&lr=&id=aefUBQAAQBAJ&oi=fnd&pg=PR5&dq=introductions+to+algorithms+cormen&ots=dO5uNAXSaZ&sig=IMmhA7_JXSWjGppyqv6UiAMfufI&redir_esc=y#v=onepage&q=introductions%20to%20algorithms%20cormen&f=false)

## Examples

### Fluent Bit configuration files

We start by gathering several Fluent Bit configuration files that serve as a training [dataset](https://gist.github.com/mlacosta/b85c4a1788f0210a353b2fcead281403); that is, we obtain the frequency of each symbol present in all the configurations.

Then, say you want to compress and embed the following configuration in a URL:

```
[INPUT]
    Name tail
    Tag tail.01
    Path /var/log/system.log

[FILTER]
    Name record_modifier
    Match *
    Record hostname ${HOSTNAME}

[OUTPUT]
    Name file
    Match *
    Path output.txt
```

The trained encoder will generate a base64 compressed version that you can use as a URL parameter.

```
vz9KnJmEW_yuoj6uIG3_Vxn1cQLYo8t_3nVW-LaueLCkwWxYysUNhSZa1-aPq5kd0It_ldRHvidL0AUJQIOSH2_6HVnW-oW_3xOl6HrUsqV1EbSl7QMOmpDdSjltrX4yTNTkzCLf5XUR45Axb_odWdb6hb_vOqt9K9by9W6idbW
```

In this case, we obtained a sequence that is 15% shorter (note that this is below the theoretical lower bound of 20% because of the expansion generated by the base64 encoding).

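The encoded string above contains `-` and `_`, which suggests a URL-safe base64 alphabet. How the library produces it is not shown here; the sketch below only illustrates one common way to convert standard base64 to that alphabet and attach it to a URL. The helper names, the host `example.com` and the parameter name `config` are all placeholders.

```javascript
// Convert standard base64 to the URL-safe alphabet and drop '=' padding.
// toBase64Url / fromBase64Url are hypothetical helpers for this sketch.
function toBase64Url(b64) {
  return b64.replace(/\+/g, '-').replace(/\//g, '_').replace(/=+$/, '');
}

// Reverse the mapping and restore the padding to a multiple of 4 characters.
function fromBase64Url(b64url) {
  const b64 = b64url.replace(/-/g, '+').replace(/_/g, '/');
  return b64.padEnd(Math.ceil(b64.length / 4) * 4, '=');
}

const standard = Buffer.from([0xfb, 0xff]).toString('base64'); // '+/8='
const urlSafe = toBase64Url(standard); // '-_8'

// Placeholder host and parameter name, for illustration only.
const shareUrl = new URL('https://example.com/share');
shareUrl.searchParams.set('config', urlSafe);

console.log(shareUrl.toString()); // https://example.com/share?config=-_8
console.log(fromBase64Url(shareUrl.searchParams.get('config'))); // +/8=
```

The manual replacement keeps the sketch independent of the Node.js version, since not every release supports a built-in URL-safe base64 encoding in `Buffer`.
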
To retrieve the original configuration, decode the encoded string with the same encoder.

## Contributing

Contributions are what make the open-source community such an amazing place to learn, inspire, and create. Any contributions you make are **greatly appreciated**.

1. Fork the Project
2. Create your Feature Branch (`git checkout -b feature/AmazingFeature`)
3. Commit your Changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the Branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

<!-- LICENSE -->

## License

Distributed under the MIT License. See `LICENSE` for more information.

this readme should be adjusted to your repo. @mlacosta