Not much quicker than awk one-liner with numeric keys #25
Can you please open a PR and integrate your results?
Also, I could imagine that awk internally uses integer keys for its lookup table; this may be why it is faster. Numeric keys are not a case that huniq optimizes for: it treats all input as strings.
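The awk idiom being benchmarked can be illustrated on a tiny hand-made input (my own example, not from the benchmark runs):

```shell
# !a[$0]++{print} prints a line only on first sight: a[$0] is 0
# (falsy) on the first lookup and is incremented afterwards.
printf '1\n2\n1\n3\n2\n' | awk '!a[$0]++{print}'
# prints 1, 2, 3 (first occurrence of each, input order preserved)
```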
By the way, thank you for bringing this up; we don't say thank you enough in this open-source world. It's always great to have benchmarks counterchecked :)
I tried to quickly reproduce some of these results. I did not understand the initial

```
Benchmark 1: pee "seq 1000000" "seq 1000000" "seq 1000000"
  Time (mean ± σ):      10.0 ms ±   0.6 ms    [User: 20.3 ms, System: 6.9 ms]
  Range (min … max):     8.7 ms …  11.8 ms    231 runs

Benchmark 1: pee "seq 1000000" "seq 1000000" "seq 1000000" | huniq
  Time (mean ± σ):     407.4 ms ±  10.3 ms    [User: 241.2 ms, System: 219.5 ms]
  Range (min … max):   385.2 ms … 424.4 ms    10 runs

Benchmark 2: pee "seq 1000000" "seq 1000000" "seq 1000000" | awk !a[$0]++{print}
  Time (mean ± σ):     281.9 ms ±   3.9 ms    [User: 307.5 ms, System: 18.8 ms]
  Range (min … max):   276.9 ms … 291.3 ms    10 runs

Benchmark 3: pee "seq 1000000" "seq 1000000" "seq 1000000" | runiq -
  Time (mean ± σ):     578.3 ms ±  14.5 ms    [User: 428.6 ms, System: 211.2 ms]
  Range (min … max):   547.7 ms … 595.4 ms    10 runs

Benchmark 4: pee "seq 1000000" "seq 1000000" "seq 1000000" | perl -e 'while(<>){if(!$s{$_}){print $_;$|=1;$s{$_}=1;}}'
  Time (mean ± σ):      1.332 s ±  0.031 s    [User: 1.168 s, System: 0.267 s]
  Range (min … max):    1.277 s …  1.365 s    10 runs

Benchmark 5: pee "seq 1000000" "seq 1000000" "seq 1000000" | sort | uniq
  Time (mean ± σ):      2.867 s ±  0.057 s    [User: 2.912 s, System: 0.073 s]
  Range (min … max):    2.760 s …  2.945 s    10 runs

Summary
  'pee "seq 1000000" "seq 1000000" "seq 1000000" | awk !a[$0]++{print}' ran
    1.45 ± 0.04 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | huniq'
    2.05 ± 0.06 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | runiq -'
    4.72 ± 0.13 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | perl -e 'while(<>){if(!$s{$_}){print $_;$|=1;$s{$_}=1;}}''
   10.17 ± 0.25 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | sort | uniq'
```

Takeaways:
I expect my build of huniq to have some compiler flags slipped in that may deteriorate its performance; I will have to investigate more. I used the following script to generate these results:

```bash
#!/usr/bin/env nix-shell
#!nix-shell -i bash -p moreutils hyperfine runiq perl

# command to generate test input data
test_data_cmd='pee "seq 1000000" "seq 1000000" "seq 1000000"'

# sort | uniq challenger commands
cmds=(
  'huniq'
  'awk !a[$0]++{print}'
  'runiq -'
  $'perl -e \'while(<>){if(!$s{$_}){print $_;$|=1;$s{$_}=1;}}\''
  'sort | uniq'
)

# confront each challenger with its input
prefixed_cmds=()
for cmd in "${cmds[@]}"
do
  prefixed_cmds+=("$test_data_cmd | $cmd")
done

# find out how long the generation of input requires
hyperfine --warmup 3 "$test_data_cmd"

# bench all the different cmds
hyperfine --warmup 1 "${prefixed_cmds[@]}"
```
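One thing the script does not check is that the challengers actually agree on their output. A minimal sanity check (my own addition, not part of the original script; `gen` is a hypothetical small stand-in for the pee-based generator):

```shell
# Compare the streaming awk dedup against sort | uniq on the same
# input. awk keeps first-seen order, so sort its output before
# comparing line sets.
gen() { seq 5; seq 5; }
a=$(gen | awk '!a[$0]++{print}' | sort)
b=$(gen | sort | uniq)
[ "$a" = "$b" ] && echo "outputs match"
# prints "outputs match"
```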
I've compared huniq's speed with an AWK one-liner, and it's not that much quicker. The tools I compared were:

huniq
awk
unique
quniq
runiq
perl

huniq is only 2 seconds quicker than awk when processing 20 million lines.
Is there any way to process the lines even quicker?
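Two generic levers worth trying on the awk side (my suggestions, not benchmarked anywhere in this thread): forcing the C locale, which replaces locale-aware string handling with byte-wise handling, and swapping gawk for mawk, which is typically faster on simple one-liners:

```shell
# Untested speedup ideas for the awk path:
#   LC_ALL=C  - byte-wise string handling instead of locale-aware
#   mawk      - often faster than gawk for one-liners (if installed)
seq 1000000 | LC_ALL=C awk '!a[$0]++{print}' | wc -l
# 1000000 lines survive: the input has no duplicates, so nothing
# is dropped and the dedup cost is pure bookkeeping overhead
```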