Add goawk to the benchmark #37
I'll take a look at both of these. I looked into goawk a while ago and didn't see any areas where it outperformed mawk, so I omitted it (it's also my view that awk scripts generally aren't I/O-bound if you're reading input from a fairly recent SSD, as is the case for all of the benchmark numbers here). As for an updated mawk, I'll see about downloading a recent copy from invisible-island and re-running the Linux benchmarks. |
I think it depends on the AWK script. In most of my AWK scripts there are not a lot of calculations (mostly some simple counters) and more string manipulations. In the following example, frawk is relatively slow:
Remarks:
|
Thanks so much for the benchmark and for providing your own measurements! I'll try my best to use this as a basis for further optimizations; I'm sure it'll be very helpful. Just to clarify: I am not shocked that there are benchmarks where other awks are faster than frawk. My initial inclination was to only include goawk if it outperformed mawk or gawk on a benchmark, as those two are more commonly used in the first place. What do you think? A couple more notes:
|
The sub code is only executed for 24 lines (part of the header). All other lines will hit the first if block. I had another sub command in there, but frawk does not support & for repeating the captured group, so I removed it for the benchmark. |
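For readers unfamiliar with the & feature mentioned above: in POSIX awk, an unescaped & in the replacement string of sub stands for the text that the regex matched. The sub command that was removed from the benchmark is not shown in this thread, so the following is only a hypothetical illustration using a made-up contig line.

```shell
#!/bin/sh
# Hypothetical input line; not taken from the real VCF header.
amp_demo() {
  printf '##contig=<ID=1\n' |
    awk '{ sub(/^##contig=<ID=/, "&chr"); print }'   # & expands to the matched prefix
}

amp_demo
```

With an awk that supports &, this appends chr directly after the matched prefix without repeating the prefix in the replacement string.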
Thanks! Feel free to open a new issue or append them to this one with new workloads. I think there's probably enough to chew on here for the time being in terms of optimization work, but I'm definitely interested in any cases where frawk is slower than alternatives, particularly for large inputs like this one. And thanks for the point on |
Take the first 5 million lines of the file and place them in shared memory (gzip decompression takes around 3 seconds, so the second benchmark showed all ties instead of a win for frawk when piping the output from zcat to the AWKs).
zcat 1000GENOMES-phase_3.vcf.gz | head -n 5000000 > /dev/shm/head.5000000.txt
Assigning a value to a column and rebuilding $0 seems slower in frawk (at least with this file). If frawk doesn't need to rebuild $0, it is faster.
At least simple regex matching (with at least one literal in it) seems very fast (as expected when using the regex crate).
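As a small illustration of the "rebuilding $0" behaviour discussed above: in standard awk, assigning to any field (even $1 = $1) causes $0 to be reassembled from the fields joined with OFS. The input line below is a made-up example, not data from the VCF file.

```shell
#!/bin/sh
# Without a field assignment, $0 is printed exactly as it was read.
no_rebuild() {
  printf 'a b  c\n' | awk 'BEGIN { OFS = "\t" } { print }'
}

# With $1 = $1, awk rebuilds $0 from the fields, joined by OFS (a tab here).
rebuild() {
  printf 'a b  c\n' | awk 'BEGIN { OFS = "\t" } { $1 = $1; print }'
}

no_rebuild
rebuild
```

The second function does strictly more work per line, which is the cost the reassign_column benchmarks in this thread are designed to isolate.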
|
With
|
It looks like gawk has an option to do the same as |
Okay, I think that's pretty interesting; thanks again for providing your benchmarks and measurements. That's a pretty strong signal that assigning into $1 is at least a decent portion of the total slowdown, though I suspect there's still something here that we're missing (perhaps string concatenation, which is something that jemalloc will help more with). I'll be able to dig into this more in the next couple of days. I've got a few ideas for optimizations, ranging from simple tweaks to multi-week endeavors. I'll post periodic updates on my progress here. I'll also make sure I can produce ~comparable benchmark results to the ones in this thread before getting too far in. |
It might be worth using hyperfine to benchmark frawk as a whole (and the other AWK implementations):
reassign_column1 () {
local AWK="${@}";
${AWK} -F '\t' 'BEGIN { OFS="\t"; } { count += 1; $1 = $1; out = $0 } END { print count, out }' /dev/shm/head.5000000.txt
}
export -f reassign_column1
# It is better to specify more runs.
hyperfine --min-runs 3 -L AWK "${FRAWK}","${MAWK}","${GAWK_locale}","${GAWK}","${GOAWK}" 'reassign_column1 {AWK}'
Hyperfine output:
Criterion could be used to benchmark functions: |
I spent some time working on optimizing frawk for this case today. My efforts are in the I'll note that while Let me know if you think I've missed something here with the benchmark workloads or setup. References to
BEGIN { FS= "\t"; OFS="\t"; }
{
if ($1 !~ /^#/) {
if ($1 == "MT") {
$1 = "chrM";
} else {
$1 = "chr" $1;
}
} else if ( $1 ~ /^##contig=<ID=/ ) {
sub(/^##contig=<ID=/, "##contig=<ID=chr");
sub(/^##contig=<ID=chrMT/, "##contig=<ID=chrM");
} else if ( $0 == "##INFO=<ID=MAF,Number=1,Type=Float,Description=\"Minor Allele Frequency\">") {
print "##INFO=<ID=AF,Number=1,Type=Float,Description=\"Allele Frequency\">";
}
print $0;
}
Optimizing the benchmark
One thing I noticed for the benchmark is that it could be optimized substantially (for frawk, but also for gawk and mawk). When you assign into
In our case,
# Note: the n=$1 doesn't seem to make much of a difference vs. referencing $1
# directly, this is just the program I ran.
BEGIN { FS= "\t"; OFS="\t"; }
{
n=$1
if (n !~ /^#/) {
if (n == "MT") {
nn = "chrM";
} else {
nn = "chr" n;
}
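# Skip the original first field and the tab after it. Using length(n) rather
# than length(nn) keeps the offset right for the "MT" -> "chrM" case as well.
print nn, substr($0, length(n) + 2);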
next;
} else if ( n ~ /^##contig=<ID=/ ) {
sub(/^##contig=<ID=/, "##contig=<ID=chr");
sub(/^##contig=<ID=chrMT/, "##contig=<ID=chrM");
} else if ( $0 == "##INFO=<ID=MAF,Number=1,Type=Float,Description=\"Minor Allele Frequency\">") {
print "##INFO=<ID=AF,Number=1,Type=Float,Description=\"Allele Frequency\">";
}
print $0;
}
This script should produce the exact same output as the initial one (and it does seem to), but it executes noticeably faster for
The effects of gzip compression
I see some utilities <100% CPU, suggesting to me that the total runtime is bound by
We can see that for the first benchmark,
Running the benchmark with zstd compression
To benchmark the CPU efficiency of the different awks, I re-compressed the data with zstd compression. Zstd at level 7 produces a smaller file than gzip on this data-set while decompressing ~3-4x faster.
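One way to check whether a pipeline is limited by decompression is to time the decompressor on its own and compare that with the full pipeline: no awk can finish sooner than the decompressor feeding it. The sketch below uses a small generated file as a stand-in (an assumption, so that it runs anywhere); the real input in this thread is 1000GENOMES-phase_3.vcf.gz.

```shell
#!/bin/sh
# Hypothetical stand-in input: 100000 numbered lines, gzip-compressed.
tmpdir="$(mktemp -d)"
seq 1 100000 | gzip > "$tmpdir/sample.gz"

# Baseline: decompression only, counting lines so the output is consumed.
decompress_only() {
  gzip -dc "$tmpdir/sample.gz" | wc -l | tr -d ' '
}

# Same input with a trivial awk program attached to the pipe.
decompress_and_awk() {
  gzip -dc "$tmpdir/sample.gz" | awk 'END { print NR }'
}

decompress_only
decompress_and_awk
```

Wrapping each call in the shell's time facility gives the two numbers to compare. If they are close, the awk program is waiting on the decompressor rather than the other way around.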
I understand if you don't have control over which compression codec is being used to generate the inputs here (e.g. you get the data from someone else, or it is generated by a utility that you cannot modify), but I don't think that awk optimizations will get you performance beyond
My read on this is that
Scripts for running benchmarks
I only include a single iteration as they take several minutes to run, but the times here are pretty stable. I'll use multiple iterations (or a more sophisticated approach like hyperfine) before updating any benchmark docs.
gzip
Output listed above omits the redundant
for bench in bench.awk bench.3.awk; do
set -x
zcat ./1000GENOMES-phase_3.vcf.gz | time ./goawk -f$bench > /dev/null
zcat ./1000GENOMES-phase_3.vcf.gz | time ./mawk -f$bench > /dev/null
zcat ./1000GENOMES-phase_3.vcf.gz | time ./gawk -b -f$bench > /dev/null
zcat ./1000GENOMES-phase_3.vcf.gz | time ./frawk -bcranelift -f$bench > /dev/null
zcat ./1000GENOMES-phase_3.vcf.gz | time ./frawk -bllvm -f$bench > /dev/null
set +x
done
zstd
Output listed above omits the redundant
#!/bin/bash
for bench in bench.awk bench.3.awk; do
set -x
zstdcat ./1000GENOMES-phase_3.vcf.zst | time ~/go/bin/goawk -f$bench > /dev/null
zstdcat ./1000GENOMES-phase_3.vcf.zst | time ./mawk -f$bench > /dev/null
zstdcat ./1000GENOMES-phase_3.vcf.zst | time ./gawk -b -f$bench > /dev/null
zstdcat ./1000GENOMES-phase_3.vcf.zst | time ./frawk -bcranelift -f$bench > /dev/null
zstdcat ./1000GENOMES-phase_3.vcf.zst | time ./frawk -bllvm -f$bench > /dev/null
set +x
done |
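Before timing two variants of a script against each other, it can help to confirm on a tiny input that they really print the same thing. The sketch below compares a field-assignment variant with a substr-based variant in the spirit of the two scripts above. The three-row sample is hypothetical, the header handling is reduced to a pass-through, and the offset is computed from the original first field (length(n) + 2) as an assumption of this sketch.

```shell
#!/bin/sh
# Hypothetical sample: a header line, an "MT" row, and a numeric chromosome row.
sample="$(mktemp)"
printf '#CHROM\tPOS\tID\nMT\t10\trs1\n1\t20\trs2\n' > "$sample"

# Variant 1: assign into $1, which makes awk rebuild $0.
assign_variant() {
  awk 'BEGIN { FS = "\t"; OFS = "\t" }
       $1 !~ /^#/ { $1 = ($1 == "MT") ? "chrM" : "chr" $1 }
       { print }' "$sample"
}

# Variant 2: leave $0 untouched; print the new name plus the rest of the line.
# length(n) + 2 skips the original first field and the tab that follows it.
substr_variant() {
  awk 'BEGIN { FS = "\t"; OFS = "\t" }
       $1 !~ /^#/ { n = $1; nn = (n == "MT") ? "chrM" : "chr" n
                    print nn, substr($0, length(n) + 2); next }
       { print }' "$sample"
}

if [ "$(assign_variant)" = "$(substr_variant)" ]; then
  echo "outputs match"
else
  echo "outputs differ"
fi
```

On the real data the same comparison could be done by writing both outputs to files and running cmp on them, which avoids holding the output in shell variables.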
I added a few more changes and now frawk does a bit better on the first benchmark. I've included gzip and zstd measurements.
Times for gzip
Where you can see that frawk is now I/O-bound, and is using noticeably less than 100% CPU (despite frawk naturally using >100% CPU on output-heavy scripts, as writes happen in a separate thread).
zstd
For the compute-bound configuration, frawk is now a bit faster than mawk.
|
I used your I also recompiled mawk, to make sure it was with It would be nice to have the jemalloc comparisons for The benchmarking should be done without
Programs and settings:
# Compile jemalloc:
git clone https://github.com/jemalloc/jemalloc jemalloc-git
cd jemalloc-git
./autogen.sh --prefix=/software/jemalloc/jemalloc_install EXTRA_CFLAGS='-march=native' EXTRA_CXXFLAGS='-march=native'
make
make install
# frawk: cargo +nightly build --release --no-default-features --features use_jemalloc,allow_avx2,unstable
FRAWK='frawk'
MAWK='/software/mawk/mawk-1.3.4-20200120/mawk'
MAWK_jemalloc='/software/jemalloc/jemalloc_install/bin/jemalloc.sh /software/mawk/mawk-1.3.4-20200120/mawk'
GAWK='/software/gawk/gawk-5.1.0/gawk -b'
GAWK_jemalloc='/software/jemalloc/jemalloc_install/bin/jemalloc.sh /software/gawk/gawk-5.1.0/gawk -b'
GAWK_locale='/software/gawk/gawk-5.1.0/gawk'
GAWK_locale_jemalloc='/software/jemalloc/jemalloc_install/bin/jemalloc.sh /software/gawk/gawk-5.1.0/gawk'
GOAWK='/software/goawk/goawk'
zcat 1000GENOMES-phase_3.vcf.gz | head -n 5000000 > /dev/shm/head.5000000.txt
reassign_column1 () {
local AWK="${@}";
${AWK} -F '\t' 'BEGIN { OFS="\t"; } { count += 1; $1 = $1; out = $0 } END { print count, out }' /dev/shm/head.5000000.txt
}
export -f reassign_column1
reassign_column8 () {
local AWK="${@}";
${AWK} -F '\t' 'BEGIN { OFS="\t"; } { count += 1; $8 = $8; out = $0 } END { print count, out }' /dev/shm/head.5000000.txt
}
export -f reassign_column8
reassign_column1and8_double () {
local AWK="${@}";
${AWK} -F '\t' 'BEGIN { OFS="\t"; } { count += 1; $1 = $1 $1; $8 = $8 $8; out = $0 } END { print count, out }' /dev/shm/head.5000000.txt
}
export -f reassign_column1and8_double
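Since each of these functions takes several seconds on the 5,000,000-line file, it may be worth trying the same pattern on a tiny input first to confirm that a given AWK accepts the program and prints what is expected. The sketch below is a cut-down copy of reassign_column1; the three-line sample file and the function name are hypothetical and only exist so that the sketch runs anywhere.

```shell
#!/bin/sh
# Hypothetical three-line, tab-separated sample instead of /dev/shm/head.5000000.txt.
sample="$(mktemp)"
for i in 1 2 3; do printf '%s\tx\ty\n' "$i"; done > "$sample"

reassign_column1_small () {
  # The AWK command may contain flags (e.g. "gawk -b"), so it is expanded unquoted.
  AWK="$*"
  ${AWK} -F '\t' 'BEGIN { OFS="\t"; } { count += 1; $1 = $1; out = $0 } END { print count, out }' "$sample"
}

# Prints the line count followed by the last (rebuilt) line.
reassign_column1_small awk
```

Once the output looks right, the full-size functions above can be handed to hyperfine as shown earlier in the thread.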
|
Just a quick update. The changes in this branch exposed a preexisting bug in frawk that I spent a good amount of time on tracking down. I believe it is now fixed, and I should be able to have this branch merged soon. |
Add goawk to the benchmark: https://github.com/benhoyt/goawk
Also, for mawk you should use 1.3.4 on Linux. Version 1.3.3 is very old and is missing a lot of bug fixes. The latest Debian release finally updated to mawk 1.3.4:
https://invisible-island.net/mawk/
It would also be good to use the -march flag so that more advanced CPU features can be used, for a fairer comparison with frawk: