Not much quicker than awk one-liner with numeric keys #25

Open
kenorb opened this issue Jul 11, 2022 · 4 comments

@kenorb

kenorb commented Jul 11, 2022

I've compared huniq's speed with an awk one-liner, and huniq is not much quicker.

huniq

$ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | huniq | pv -l | wc -l
20.0M 0:00:30 [ 664k/s]
20007130

awk

$ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | awk '!a[$0]++{print}' | pv -l | wc -l
20.0M 0:00:32 [ 608k/s]
20008984

unique

$ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | unique | pv -l | wc -l
20.0M 0:00:38 [ 524k/s]
20006340

quniq

$ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | quniq -max-workers 10 | pv -l | wc -l
20.0M 0:00:45 [ 436k/s]
20013324

runiq

$ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | runiq - | pv -l | wc -l
20.0M 0:00:35 [ 571k/s]
20007072

perl

$ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | perl -e 'while(<>){if(!$s{$_}){print $_;$|=1;$s{$_}=1;}}' | pv -l | wc -l
20.0M 0:00:55 [ 363k/s]
20008104

It's only 2 seconds quicker than awk when processing 20 million lines.

Is there any way to process the lines even quicker?
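For context, every tool benchmarked above performs the same core operation: an order-preserving streaming dedup. A minimal sketch in Python (illustrative only, not a faster implementation; piping sys.stdin through it would mirror the CLI tools):

```python
def dedup(lines):
    """Yield each line the first time it is seen, preserving input order."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line

print(list(dedup(["1\n", "2\n", "1\n", "3\n", "2\n"])))  # → ['1\n', '2\n', '3\n']
```

All the contenders differ mainly in how this hash lookup is implemented, which is why their throughputs end up within the same order of magnitude.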

@kenorb kenorb changed the title Not much quicker than awk Not much quicker than awk one-liner Jul 11, 2022
@koraa
Owner

koraa commented Jul 12, 2022

Can you please open a PR and integrate your results in benchmark.sh?

@koraa
Owner

koraa commented Jul 12, 2022

Also, I could imagine that awk internally uses integer keys for the lookup table; this may be why it is faster here. Numeric keys are not a case that huniq optimizes for: it treats all input as strings.
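One way to probe this hypothesis locally is to dedup the same numeric lines once with raw string keys and once with parsed integer keys. This is a hedged sketch of the idea, not huniq's or awk's actual internals, and the timings it prints are machine-dependent:

```python
import time

def dedup_count(lines, key=lambda l: l):
    # count unique lines under a given key function
    seen = set()
    for line in lines:
        seen.add(key(line))
    return len(seen)

lines = [str(i) for i in range(200_000)] * 3  # duplicated numeric input

for name, key in [("string keys", lambda l: l), ("integer keys", int)]:
    t0 = time.perf_counter()
    n = dedup_count(lines, key)
    print(f"{name}: {n} unique in {time.perf_counter() - t0:.3f}s")
```

If the integer-keyed variant wins on numeric input, that supports the theory that awk's numeric handling of subscripts gives it an edge on this particular benchmark.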

@koraa koraa changed the title Not much quicker than awk one-liner Not much quicker than awk one-liner with numeric keys Jul 12, 2022
@koraa
Owner

koraa commented Jul 12, 2022

By the way, thank you for bringing this up; we don't say thank you enough in this open source world. It's always great to have benchmarks counterchecked :)

@wucke13

wucke13 commented Sep 30, 2022

I tried to quickly reproduce some of these results. I did not understand the initial echo (so I removed it), and I used hyperfine for the benchmarking, since it seems a bit nicer with regard to statistics. A first run on my laptop (admittedly plagued by thermal throttling) indicates quite different results from those posted above:

Benchmark 1: pee "seq 1000000" "seq 1000000" "seq 1000000"
  Time (mean ± σ):      10.0 ms ±   0.6 ms    [User: 20.3 ms, System: 6.9 ms]
  Range (min … max):     8.7 ms …  11.8 ms    231 runs

Benchmark 1: pee "seq 1000000" "seq 1000000" "seq 1000000" | huniq
  Time (mean ± σ):     407.4 ms ±  10.3 ms    [User: 241.2 ms, System: 219.5 ms]
  Range (min … max):   385.2 ms … 424.4 ms    10 runs

Benchmark 2: pee "seq 1000000" "seq 1000000" "seq 1000000" | awk !a[$0]++{print}
  Time (mean ± σ):     281.9 ms ±   3.9 ms    [User: 307.5 ms, System: 18.8 ms]
  Range (min … max):   276.9 ms … 291.3 ms    10 runs

Benchmark 3: pee "seq 1000000" "seq 1000000" "seq 1000000" | runiq -
  Time (mean ± σ):     578.3 ms ±  14.5 ms    [User: 428.6 ms, System: 211.2 ms]
  Range (min … max):   547.7 ms … 595.4 ms    10 runs

Benchmark 4: pee "seq 1000000" "seq 1000000" "seq 1000000" | perl -e 'while(<>){if(!$s{$_}){print $_;$|=1;$s{$_}=1;}}'
  Time (mean ± σ):      1.332 s ±  0.031 s    [User: 1.168 s, System: 0.267 s]
  Range (min … max):    1.277 s …  1.365 s    10 runs

Benchmark 5: pee "seq 1000000" "seq 1000000" "seq 1000000" | sort | uniq
  Time (mean ± σ):      2.867 s ±  0.057 s    [User: 2.912 s, System: 0.073 s]
  Range (min … max):    2.760 s …  2.945 s    10 runs

Summary
  'pee "seq 1000000" "seq 1000000" "seq 1000000" | awk !a[$0]++{print}' ran
    1.45 ± 0.04 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | huniq'
    2.05 ± 0.06 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | runiq -'
    4.72 ± 0.13 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | perl -e 'while(<>){if(!$s{$_}){print $_;$|=1;$s{$_}=1;}}''
   10.17 ± 0.25 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | sort | uniq'

Takeaways:

  • on my machine, about 10ms are required to generate the input data
  • awk is about 1.5 times faster than huniq
  • both runiq and perl are slower than huniq
  • sort | uniq is indeed about ten times slower than the best performing option

I suspect my build of huniq may have had some compiler flags slip in that deteriorate performance. I will have to investigate further.

I used the following script to generate these results:

#!/usr/bin/env nix-shell
#!nix-shell -i bash -p moreutils hyperfine runiq perl

# command to generate test input data
test_data_cmd='pee "seq 1000000" "seq 1000000" "seq 1000000"'

# sort | uniq challenger commands
cmds=(
'huniq'
'awk !a[$0]++{print}'
'runiq -'
$'perl -e \'while(<>){if(!$s{$_}){print $_;$|=1;$s{$_}=1;}}\''
'sort | uniq'
)

# confront each challenger with its input
prefixed_cmds=()
for cmd in "${cmds[@]}"; do
  prefixed_cmds+=("$test_data_cmd | $cmd")
done

# measure how long generating the input data takes
hyperfine --warmup 3 "$test_data_cmd"

# bench all the different cmds
hyperfine --warmup 1 "${prefixed_cmds[@]}"
