Not much quicker than awk one-liner with numeric keys #25

Open
kenorb opened this issue Jul 11, 2022 · 4 comments

@kenorb

kenorb commented Jul 11, 2022

I've compared huniq's speed with an awk one-liner, and huniq is not much quicker.

huniq

$ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | huniq | pv -l | wc -l
20.0M 0:00:30 [ 664k/s]
20007130

awk

$ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | awk '!a[$0]++{print}' | pv -l | wc -l
20.0M 0:00:32 [ 608k/s]
20008984

unique

$ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | unique | pv -l | wc -l
20.0M 0:00:38 [ 524k/s]
20006340

quniq

$ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | quniq -max-workers 10 | pv -l | wc -l
20.0M 0:00:45 [ 436k/s]
20013324

runiq

$ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | runiq - | pv -l | wc -l
20.0M 0:00:35 [ 571k/s]
20007072

perl

$ echo | pee "seq 20000000" "seq 20000000" "seq 20000000" | perl -e 'while(<>){if(!$s{$_}){print $_;$|=1;$s{$_}=1;}}' | pv -l | wc -l
20.0M 0:00:55 [ 363k/s]
20008104

It's only 2 seconds quicker than awk when processing 20 million lines.

Is there any way to process the lines even quicker?
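For context, every tool benchmarked above performs the same core operation: an order-preserving streaming dedup. A minimal sketch in Python (illustrative only, not a faster implementation; piping sys.stdin through it would mirror the CLI tools):

```python
def dedup(lines):
    """Yield each line the first time it is seen, preserving input order."""
    seen = set()
    for line in lines:
        if line not in seen:
            seen.add(line)
            yield line

print(list(dedup(["1\n", "2\n", "1\n", "3\n", "2\n"])))  # → ['1\n', '2\n', '3\n']
```

All the contenders differ mainly in how this hash lookup is implemented, which is why their throughputs end up within the same order of magnitude.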

@kenorb kenorb changed the title Not much quicker than awk Not much quicker than awk one-liner Jul 11, 2022
@koraa
Owner

koraa commented Jul 12, 2022

Can you please open a PR and integrate your results in benchmark.sh?

@koraa
Owner

koraa commented Jul 12, 2022

Also, I could imagine that awk internally uses integer keys for the lookup table; this may be why it is faster here. Numeric keys are not a case that huniq optimizes for: it treats all input as strings.
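One way to probe this hypothesis locally is to dedup the same numeric lines once with raw string keys and once with parsed integer keys. This is a hedged sketch of the idea, not huniq's or awk's actual internals, and the timings it prints are machine-dependent:

```python
import time

def dedup_count(lines, key=lambda l: l):
    # count unique lines under a given key function
    seen = set()
    for line in lines:
        seen.add(key(line))
    return len(seen)

lines = [str(i) for i in range(200_000)] * 3  # duplicated numeric input

for name, key in [("string keys", lambda l: l), ("integer keys", int)]:
    t0 = time.perf_counter()
    n = dedup_count(lines, key)
    print(f"{name}: {n} unique in {time.perf_counter() - t0:.3f}s")
```

If the integer-keyed variant wins on numeric input, that supports the theory that awk's numeric handling of subscripts gives it an edge on this particular benchmark.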

@koraa koraa changed the title Not much quicker than awk one-liner Not much quicker than awk one-liner with numeric keys Jul 12, 2022
@koraa
Owner

koraa commented Jul 12, 2022

By the way, thank you for bringing this up; we don't say thank you enough in this open source world. It's always great to have benchmarks counterchecked :)

@wucke13

wucke13 commented Sep 30, 2022

I tried to quickly reproduce some of these results. I did not understand the initial echo (so I removed it), and I used hyperfine for the benchmarking, since it seems a bit nicer with regard to statistics. A first run on my laptop (admittedly plagued by thermal throttling) indicates quite different results from those posted above:

Benchmark 1: pee "seq 1000000" "seq 1000000" "seq 1000000"
  Time (mean ± σ):      10.0 ms ±   0.6 ms    [User: 20.3 ms, System: 6.9 ms]
  Range (min … max):     8.7 ms …  11.8 ms    231 runs

Benchmark 1: pee "seq 1000000" "seq 1000000" "seq 1000000" | huniq
  Time (mean ± σ):     407.4 ms ±  10.3 ms    [User: 241.2 ms, System: 219.5 ms]
  Range (min … max):   385.2 ms … 424.4 ms    10 runs

Benchmark 2: pee "seq 1000000" "seq 1000000" "seq 1000000" | awk !a[$0]++{print}
  Time (mean ± σ):     281.9 ms ±   3.9 ms    [User: 307.5 ms, System: 18.8 ms]
  Range (min … max):   276.9 ms … 291.3 ms    10 runs

Benchmark 3: pee "seq 1000000" "seq 1000000" "seq 1000000" | runiq -
  Time (mean ± σ):     578.3 ms ±  14.5 ms    [User: 428.6 ms, System: 211.2 ms]
  Range (min … max):   547.7 ms … 595.4 ms    10 runs

Benchmark 4: pee "seq 1000000" "seq 1000000" "seq 1000000" | perl -e 'while(<>){if(!$s{$_}){print $_;$|=1;$s{$_}=1;}}'
  Time (mean ± σ):      1.332 s ±  0.031 s    [User: 1.168 s, System: 0.267 s]
  Range (min … max):    1.277 s …  1.365 s    10 runs

Benchmark 5: pee "seq 1000000" "seq 1000000" "seq 1000000" | sort | uniq
  Time (mean ± σ):      2.867 s ±  0.057 s    [User: 2.912 s, System: 0.073 s]
  Range (min … max):    2.760 s …  2.945 s    10 runs

Summary
  'pee "seq 1000000" "seq 1000000" "seq 1000000" | awk !a[$0]++{print}' ran
    1.45 ± 0.04 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | huniq'
    2.05 ± 0.06 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | runiq -'
    4.72 ± 0.13 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | perl -e 'while(<>){if(!$s{$_}){print $_;$|=1;$s{$_}=1;}}''
   10.17 ± 0.25 times faster than 'pee "seq 1000000" "seq 1000000" "seq 1000000" | sort | uniq'

Takeaways:

  • on my machine, about 10ms are required to generate the input data
  • awk is about 1.5 times faster than huniq
  • both runiq and perl are slower than huniq
  • sort | uniq is indeed about ten times slower than the best performing option

I suspect my build of huniq may have had some compiler flags slip in that deteriorate performance. I will have to investigate further.

I used the following script to generate these results:

#!/usr/bin/env nix-shell
#!nix-shell -i bash -p moreutils hyperfine runiq perl

# command to generate test input data
test_data_cmd='pee "seq 1000000" "seq 1000000" "seq 1000000"'

# sort | uniq challenger commands
cmds=(
'huniq'
'awk !a[$0]++{print}'
'runiq -'
$'perl -e \'while(<>){if(!$s{$_}){print $_;$|=1;$s{$_}=1;}}\''
'sort | uniq'
)

# confront each challenger with its input
prefixed_cmds=()
for cmd in "${cmds[@]}"; do
  prefixed_cmds+=("$test_data_cmd | $cmd")
done

# measure how long generating the input data takes
hyperfine --warmup 3 "$test_data_cmd"

# bench all the different cmds
hyperfine --warmup 1 "${prefixed_cmds[@]}"
