Releases: dathere/qsv
0.131.0
Highlights
- Refactored frequency to make it smarter and faster.
frequency's core algorithm essentially compiles an in-memory hashmap to determine the frequency of each unique value for each column. It does this using multi-threaded, multi-I/O techniques to make it blazing fast.
However, for columns with ALL unique values (e.g. ID columns), this takes a comparatively long time and consumes a lot of memory as it essentially compiles a hashmap of the ENTIRE column, with a hashmap entry for each column value with a count of 1.
Now, with the new --stats-mode option (enabled by default), frequency can compile the dataset in a more intelligent way by looking up a column's cardinality in the stats cache.
If the cardinality of a column is equal to the CSV's rowcount (indicating a column with ALL unique values), it short-circuits frequency calculations for that column - dramatically reducing the time and memory requirements for the ID column as it eliminates the need to maintain a hashmap for it.
Practically speaking, this makes frequency able to handle "real-world" datasets of any size.
To ensure frequency is as fast as possible, be sure to index and compute stats for your datasets beforehand.
- Setting the stage for Datapusher+ v1 and...
The "itches we've been scratching" the past few months have been informed by our work at several clients towards the release of Datapusher+ 1.0 and qsv pro 1.0 (more info below) - both targeted for release this month.
DP+ is our third-gen, high-speed data ingestion/registration tool for CKAN that uses qsv as its data wrangling/analysis engine. It will enable us to reinvent the way data is ingested into CKAN - with exponentially faster data ingestion, metadata inferencing, data validation, computed metadata fields, and more!
We're particularly excited about how qsv will allow us to compute and infer high-quality metadata for datasets (with a focus on inferring optional recommended DCAT-US v3 metadata fields) in "near real-time", while dataset publishers are still entering metadata. This will be a game-changer for CKAN administrators and data publishers!
- ...qsv pro 1.0
qsv pro is datHere's enterprise-grade data wrangling/curation workbench that’s planned for v1.0 release this month.
Building the core functionality of qsv pro's Workflow feature is one of the primary reasons for a v1.0 release.
We feel qsv pro may be a game-changer for data wranglers and data curators who need to work with spreadsheets and large datasets to view statistical data and metadata while also performing complex data wrangling operations in a user-friendly way without having to write code.
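The frequency short-circuit described in the first highlight above can be sketched in a few lines of Python. This is only an illustration of the idea, not qsv's Rust implementation: the `column_frequencies` function and the `stats_cache` dictionary are hypothetical stand-ins for the real frequency logic and the real stats cache.

```python
from collections import Counter


def column_frequencies(rows, column, cardinality=None):
    """Return value counts for `column`, or None when the column is all-unique.

    `cardinality` stands in for a value looked up in a precomputed stats
    cache (hypothetical here). When it equals the row count, every value
    occurs exactly once, so no hashmap needs to be built.
    """
    rowcount = len(rows)
    if cardinality is not None and rowcount > 0 and cardinality == rowcount:
        # Short-circuit: skip building a hashmap with one entry per row.
        return None
    return Counter(row[column] for row in rows)


rows = [
    {"id": "1", "state": "NY"},
    {"id": "2", "state": "NJ"},
    {"id": "3", "state": "NY"},
]
# Assumed shape of cached cardinalities: column name -> number of unique values.
stats_cache = {"id": 3, "state": 2}

id_freq = column_frequencies(rows, "id", stats_cache["id"])
state_freq = column_frequencies(rows, "state", stats_cache["state"])
print(id_freq)
print(state_freq.most_common())
```

In this sketch the ID column is never hashed at all, which is where the time and memory savings for all-unique columns would come from.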
Added
- docs: added Shell Completion section 556a2ff
- docs: add 🪄 emoji in legend to indicate "automagical" commands 2753c90
- Add building deb package (WIP) by @tino097 in #2029
- Added GitHub workflow to test debian package (WIP) by @tino097 in #2032
- tests: added false positive to _typos.toml configuration d576af2
- added more benchmarks
- added more tests
Changed
- fetch & fetchpost: remove expired diskcache entries on startup 9b6ab5d
- frequency: smarter frequency compilation with new --stats-mode option #2030
- json: refactored for maintainability & performance 62e9216 and 4e44b18
- improved self-update messages 5c874e0 and 0aa0b13
- contrib(completions): frequency updates & remove bashly/fish by @rzmk in #2031
- Debian package update by @tino097 in #2017
- publish: optimized enabled CPU features when building release binaries in all GitHub Actions "publishing" workflows
- publish: ensure latest Python patch release is used when building qsvpy binary variants 2ab03a0 and ec6f486
- tests: also enabled CPU features in CI tests
- docs: wordsmith qsv "elevator pitch" cc47fe6
- docs: point to https://100.dathere.com in Whirlwind tour fc49aef
- deps: bump polars to latest upstream post py-1.41.1 release at the time of this release
- build(deps): bump bytes from 1.6.1 to 1.7.0 by @dependabot in #2018
- build(deps): bump bytes from 1.7.0 to 1.7.1 by @dependabot in #2021
- build(deps): bump flate2 from 1.0.30 to 1.0.31 by @dependabot in #2027
- build(deps): bump indexmap from 2.2.6 to 2.3.0 by @dependabot in #2020
- build(deps): bump jaq-parse from 1.0.2 to 1.0.3 by @dependabot in #2016
- build(deps): bump redis from 0.26.0 to 0.26.1 by @dependabot in #2023
- build(deps): bump regex from 1.10.5 to 1.10.6 by @dependabot in #2025
- build(deps): bump serde_json from 1.0.121 to 1.0.122 by @dependabot in #2022
- build(deps): bump sysinfo from 0.30.13 to 0.31.0 by @dependabot in #2019
- build(deps): bump sysinfo from 0.31.0 to 0.31.2 by @dependabot in #2024
- build(deps): bump tempfile from 3.11.0 to 3.12.0 by @dependabot in #2033
- build(deps): bump serde from 1.0.204 to 1.0.205 by @dependabot in #2036
- apply select clippy suggestions
- updated several indirect dependencies
- made various usage text improvements
- bumped MSRV to 1.80.1
Fixed
- sqlp & joinp: fixed .ssv.sz output auto-compression support 5397f6c & d86ba63
- docs: fix link by @uncenter in #2026
- tests: correct misnamed test 8ae6000
- tests: fix flaky reverse property tests d86ba63
Removed
- docs: "Quicksilver" is the name of the logo horse, not how you pronounce "qsv" e4551ae
New Contributors
Full Changelog: 0.130.0...0.131.0
0.130.0
Following the 0.129.0 release - the largest release to date, 0.130.0 continues to polish qsv as a data-wrangling engine, packing new features, fixes, and improvements, previewing upcoming features in qsv pro 1.0. Here are a few highlights:
Highlights
- Added .ssv (semicolon separated values) automatic support. Semicolon separated values are now automatically detected and supported by qsv. Though not as common as CSV, SSV is used in some regions and industries, so qsv now supports it.
- Added cargo deb compatibility. In preparation for the release of DataPusher+ 1.0, we're now making it easier to upgrade qsvdp so CKAN administrators can install and upgrade it easily using apt-get install qsvdp or apt-get upgrade qsvdp.
DP+ is our next-gen, high-speed data ingestion tool for CKAN that uses qsv as its analysis engine. It's not only a robust, fast, validating data pump that guarantees high quality data, it also does extended analysis to infer and automatically derive high-quality metadata - what we call "automagical metadata".
- Upgraded to the latest Polars upstream at the py-polars-1.3.0 tag. Polars tops the TPC-H Benchmark and is several orders of magnitude faster than traditional dataframe libraries (cough - 🐼 pandas). qsv proudly rides the 🐻‍❄️ Polars bear to get subsecond response times even with very large datasets!
- qsv v0.130.0 shell completions files are available for download here. With shell completions, pressing tab in a compatible shell provides suggestions for various qsv commands, subcommands, and options that you can choose from. Supported shells include bash, zsh, powershell, fish, nushell, fig, and elvish. View tips on how to install completions for the bash shell here.
Added
- apply: add base62 encode/decode operations #2013
- headers: add --just-count option #2004
- json: add --select option #1990
- searchset: add --not-one flag by @rzmk in #1994
- Added .ssv (semicolon separated values) automatic support #1987
- Added cargo deb compatibility by @tino097 in #1991
- contrib(completions): add --just-count for headers by @rzmk in #2006
- contrib(completions): add --select for json by @rzmk in #1992
- added several benchmarks
- added more tests
Changed
- diff: allow selection of --key and --sort-columns by name, not just by index #2010
- fetch & fetchpost: replace deprecated Redis execute command 75cbe2b
- stats: more intelligent --infer-len option c6a0e64
- validate: return delimiter detected upon successful CSV validation #1977
- bump polars to latest upstream at py-polars-1.3.0 tag #2009
- deps: bump csvs_convert from 0.8.12 to 0.8.13 d1d0800
- build(deps): bump cached from 0.52.0 to 0.53.0 by @dependabot in #1983
- build(deps): bump cached from 0.53.0 to 0.53.1 by @dependabot in #1986
- build(deps): bump postgres from 0.19.7 to 0.19.8 by @dependabot in #1985
- build(deps): bump pyo3 from 0.22.1 to 0.22.2 by @dependabot in #1979
- build(deps): bump redis from 0.25.4 to 0.26.0 by @dependabot in #1995
- build(deps): bump serde_json from 1.0.120 to 1.0.121 by @dependabot in #2011
- build(deps): bump simple-expand-tilde from 0.1.7 to 0.4.0 by @dependabot in #1984
- build(deps): bump tokio from 1.38.0 to 1.38.1 by @dependabot in #1973
- build(deps): bump tokio from 1.38.1 to 1.39.1 by @dependabot in #1988
- build(deps): bump xxhash-rust from 0.8.11 to 0.8.12 by @dependabot in #1997
- apply select clippy suggestions
- updated several indirect dependencies
- made various usage text improvements
- pin Rust nightly to 2024-07-26
Fixed
- diff: clarify --key usage examples, resolves #1998 by @rzmk in #2001
- json: refactored so it didn't need to use threads to spawn qsv select to order the columns. Had to do this as sometimes intermediate output was sent to stdout before the final output was ready 0f25def
- py: replace row with col in usage text by @allen-chin in #2008
- reverse: fix indexed bug #2007
- validate: properly auto-detect tab delimiter when file extension is TSV or TAB #1975
- fix panic when process_input helper fn receives unexpected input from stdin 152fec4
Removed
New Contributors
- @tino097 made their first contribution in #1991
- @allen-chin made their first contribution in #2008
Full Changelog: 0.129.1...0.130.0
To stay updated with datHere's latest news and updates (including qsv pro, datHere's CKAN DMS, and analyze.dathere.com), subscribe to the newsletter here: dathere.com/newsletter
0.129.1
This is a small patch release to fix some publishing issues, update tab completion, and to fix minor CI errors.
See 0.129.0 release notes to get the details on qsv's biggest release to date!
Changed
- clipboard: add error handling based on clipboard::Error by @rzmk in #1970
- contrib(completions): add all commands (except applydp & generate) by @rzmk in #1971
- Temporarily suppressed some CI tests that were flaky on GH macOS Apple Silicon action runners. They previously worked fine on self-hosted macOS Apple Silicon action runners that are temporarily unavailable.
Full Changelog: 0.129.0...0.129.1
0.129.0
This release is the biggest one ever!
Packed with new features, improvements, and previews of upcoming qsv pro features, here are a few highlights:
📌 Highlights (click each dropdown for more info)
Meet @rzmk - qsv pro's software engineer now also co-maintains qsv!
@rzmk has contributed to projects in the qsv ecosystem including qsv's describegpt, prompt, json, and clipboard commands; qsv's tab completion support; qsv.dathere.com including its online configurator and benchmarks page; 100.dathere.com with its qsv lessons and exercises; and qsv pro the spreadsheet data wrangling desktop app (along with its promo site). @rzmk now also co-maintains qsv!
With @rzmk now also co-maintaining qsv, our data-wrangling portfolio's roadmap may get more intriguing as @rzmk's work on qsv pro, 100.dathere.com, and other initiatives can result in contributions to qsv as we've seen in this release. Perhaps some aims may be put towards AI; "automagical" metadata inferencing; DCAT 3; and expanded recipe support with the accelerated evolution of qsv pro as an enterprise-grade Data-Wrangling/Data Curation Workbench.
Polars v0.41.3 - numerous sqlp and joinp improvements
- sqlp: expanded SQL support
  - Natural Join support
  - DuckDB-like COLUMNS SQL function to select columns that match a pattern
  - ORDER BY ALL support
  - Support POSTGRESQL ^@ ("starts with"), ~~, ~~*, !~~, !~~* ("like", "ilike") string-matching operators
  - Support for SQL SELECT * ILIKE wildcard syntax
  - Support SQL temporal functions STRFTIME and STRPTIME
- sqlp: added --streaming option
New command qsv prompt - Use a file dialog for qsv file input and output
Be more interactive with qsv by using a file dialog to select a file for input and output.
Here are a few key highlights:
- Start with qsv prompt when piping commands to provide a file as input from an open file dialog and pipe it into another command, for example: qsv prompt | qsv stats.
- End with qsv prompt -f when piping commands to save the output to a file you choose with a save file dialog.
There are other options too, so feel free to explore more with qsv prompt --help.
This will allow you to create qsv pipelines that are more "user-friendly" and distribute them to non-technical users. It's not as flexible as qsv pro's full-blown GUI, but it's a start!
New command qsv json - Convert JSON data to CSV and optionally provide a jq-like filter
The new json command allows you to convert non-nested JSON data to CSV. If your data is not in the expected format, try using the --jaq option to provide a jq-like filter. See qsv json --help for more information and examples.
Here are a few key highlights:
- Specify the path to a JSON file to attempt conversion to CSV with qsv json <filepath>.
- Attempt conversion of JSON to CSV data from stdin, for example: qsv slice <filepath.csv> --json | qsv json.
- Write the output to a file with the --output <filepath> (or -o for short) option.
- Use the --jaq <filter> option to try converting nested or complex JSON data into the intended format before parsing to CSV.
You may learn more by running qsv json --help.
Along with the jsonl command, we now have more options to convert JSON to CSV with qsv!
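To make "non-nested JSON" concrete, here is a rough Python sketch of the kind of conversion described above: an array of flat objects becomes a CSV with one column per key. It uses only the standard library and is not how qsv json is implemented; the `flat_json_to_csv` helper and the sample records are made up for illustration, and the jq-like filtering step is left out entirely.

```python
import csv
import io
import json


def flat_json_to_csv(json_text):
    """Convert a JSON array of flat (non-nested) objects into CSV text."""
    records = json.loads(json_text)
    if isinstance(records, dict):
        # Treat a single object as a one-row dataset.
        records = [records]

    # Build the header from keys in first-seen order.
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)  # keys missing from a record are left empty
    return out.getvalue()


# Hypothetical sample input.
sample = '[{"id": 1, "name": "Ana"}, {"id": 2, "name": "Ben"}]'
csv_text = flat_json_to_csv(sample)
print(csv_text)
```

Nested objects or arrays would not fit this one-column-per-key shape, which is the situation where a filter that first flattens the data becomes useful.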
New command qsv clipboard - Provide input from your clipboard and save output to your clipboard
Provide your clipboard content using qsv clipboard and save output to your clipboard by piping into qsv clipboard --save (or -s for short).
100.dathere.com - Try out lessons and exercises with qsv from your browser!
You may run qsv commands from your browser without having to install it locally at 100.dathere.com.
[Demo images: within the lesson (in-page) using Thebe; in a Jupyter Lab environment]
Thanks to Jupyter Book, datHere has released a website available at 100.dathere.com where you may explore lessons and exercises with qsv by running them within the web page, in a Jupyter Lab environment, or locally after following the provided installation instructions. There are multiple exercises planned, but feel free to try out the first few available lessons/exercises by visiting 100.dathere.com and star the source code's repository here.
New multi-shell completions draft (bash, zsh, powershell, fish, nushell, fig, elvish)
There's a draft of more qsv shell completion support including 7 different shells! The plan is to add the rest of the commands in this implementation since we can use one codebase to generate the 7 shell completion script files. Feel free to try out the various shell completions in the examples folder from contrib/completions to verify if the examples work (as of today's release date only qsv count and qsv clipboard may be available) and also contribute to adding the rest of the completions if you know a bit of Rust.
The existing Bash shell completions for v0.129.0 and fish shell completions draft are available for now as the multi-shell completions draft is being developed.
[Demo images: Bash completions demo; Fish completions demo]
With shell completions enabled, you may identify qsv commands more easily when pressing the tab key on your keyboard in certain positions using the relevant Bash or fish shell from your terminal. You may follow the instructions from 100.dathere.com here to learn how to install the Bash completions and under the Usage section here for fish shell completions. Note that the fish shell completions are incomplete and both of the implementations may be replaced by the multi-shell completions implementation once complete.
qsvpro.dathere.com - Preview: Download spreadsheets from a compatible CKAN instance into the qsv pro Workflow
This is a preview of a feature, meaning it is planned for an upcoming release but may change by the time it is released.
In addition to importing local spreadsheet files and uploading to a CKAN instance, this new feature allows users to select a locally registered CKAN instance where they have the create_dataset permission to download a spreadsheet file from their CKAN instance and load the new local spreadsheet file into the Workflow. qsv pro's Workflow would therefore have both upload and download capability to and from a compatible CKAN instance.
qsvpro.dathere.com - Preview: Attempt SQL query generation from natural language with a compatible LLM API instance
This is a preview of a feature, meaning it is planned for an upcoming release but may change by the time it is released.
Also note that this video is sped up as you may see by...
0.128.0
[0.128.0] - 2024-05-25
❤️ csv,conf,v8 Edition 🎉
🏇🏽 ¡Ándale! ¡Ándale! ¡Arriba! ¡Arriba! 💨
Yii-hah! We're Mexico bound as we head to csv,conf,v8 to present and share qsv with fellow data-makers and wranglers from all over!
And we've packed a lot into this release for the occasion:
- search got a lot of love as it now powers qsv pro's new search feature to get near-instant search results even on large datasets.
- stats - the ❤️ of qsv, now has several cache fine-tuning options with --cache-threshold. It now also computes max_precision for floats and is_ascii for strings. It also has a new --round 9999 sentinel value to suppress rounding of statistics.
- schema & tojsonl are now faster thanks to stats --cache-threshold autoindex & cache creation/deletion logic.
- We upgraded Polars to 0.40.0 to unlock additional capabilities in the count, joinp & sqlp commands.
- count now has an additional blazing fast counting mode using Polars' read_csv() table function.
- frequency gets some micro-optimizations for even faster frequency analysis.
- luau is now bundled with luau 0.625 from 0.622. We also upgraded the bundled LuaDate library from 2.2.0 to 2.2.1. All of this, while making it ~10% faster!
Overall, qsv manages to keep its performance edge despite the addition of new capabilities and features. We'll give a whirlwind tour of qsv and these updates in our talk at csv,conf,v8.
We'll also preview what we've been calling the People's APPI - our "Answering People/Policymaker Interface" in qsv pro.
This is a new way to interact with qsv that's more conversational and less command-line-y using a natural language interface. It's a way to make qsv more accessible to more people, especially those who are not comfortable with the command line.
We're excited to share all these qsv innovations with the csv,conf,v8 community and the wider world! Nos vemos en Puebla!
¡Ándele! ¡Ándele! ¡Epa! ¡Epa! ¡Epa!
Added
- count: additional Polars-powered counting mode using read_csv() SQL table function 05c5809
- input: add --quote-style option df3c8f1
- joinp: add --coalesce option 8d142e5
- search: add --preview-match option #1785
- search: add --json output option #1790
- search: add "match-only" --flag option mode #1799
- search: add --not-one flag for not using exit code 1 when no match by @rzmk in #1810
- sqlp: add --decimal-comma option #1832
- stats: add --cache-threshold option #1795
- stats: add --cache-threshold autoindex creation/deletion logic #1809
- stats: add additional mode to --cache-threshold 63fdc55
- stats: now computes max_precision for floats #1815
- stats: add --round 9999 sentinel value support to suppress rounding #1818
- stats: add is_ascii column #1824
- added new benchmarks for search command 58d73c3
Changed
- count: document three count modes 3d5a333
- describegpt: update --max-tokens type for LLMs with larger context sizes by @rzmk #1841
- excel: use simpler range::headers() to get headers 069acbf
- frequency: ensure --other-sorted works with --other-text 7430ad7
- frequency: microoptimize hot loop d9c01e1, 7c9f925 and
- luau: improve usage text cb6b4d9
- luau: we now bundle luau 0.625 from 0.622 4060975
- luau: update vendored LuaDate library from 2.2.0 to 2.2.1 #1840
- schema: adjust to reflect stats --cache-threshold option 92fed86
- slice: move json output helpers to util 1f44b48
- tojsonl: refactor boolcheck helper 74d5f5a
- docs: cross-reference split & partition commands #1828
- contrib(bashly): update completions.bash for qsv v0.127.0 by @rzmk in #1776
- contrib(bashly): update completions.bash for qsv v0.128.0 by @rzmk in #1838
- deps: upgrade to polars 0.40.0 #1831
- build(deps): bump actix-web from 4.5.1 to 4.6.0 by @dependabot in #1825
- build(deps): bump anyhow from 1.0.82 to 1.0.83 by @dependabot in #1798
- build(deps): bump anyhow from 1.0.83 to 1.0.85 by @dependabot in #1823
- build(deps): bump anyhow from 1.0.85 to 1.0.86 by @dependabot in #1826
- build(deps): bump cached from 0.50.0 to 0.51.0 by @dependabot in #1789
- build(deps): bump cached from 0.51.0 to 0.51.1 by @dependabot in #1793
- build(deps): bump cached from 0.51.1 to 0.51.2 by @dependabot in #1802
- build(deps): bump cached from 0.51.2 to 0.51.3 by @dependabot in #1805
- build(deps): bump crossbeam-channel from 0.5.12 to 0.5.13 by @dependabot in #1827
- build(deps): bump csvs_convert from 0.8.9 to 0.8.10 by @dependabot in #1808
- build(deps): bump data-encoding from 2.5.0 to 2.6.0 by @dependabot in #1780
- build(deps): bump file-format from 0.24.0 to 0.25.0 by @dependabot in #1807
- build(deps): bump flate2 from 1.0.28 to 1.0.29 by @dependabot in #1778
- build(deps): bump flate2 from 1.0.29 to 1.0.30 by @dependabot in #1784
- build(deps): bump hashbrown from 0.14.3 to 0.14.5 by @dependabot in #1781
- build(deps): bump itertools from 0.12.1 to 0.13.0 by @dependabot in #1822
- deps: bump forked jsonschema from 0.17.1 to 0.18.0 f02620f
- build(deps): bump mimalloc from 0.1.41 to 0.1.42 by @dependabot in #1829
- build(deps): bump mlua from 0.9.7 to 0.9.8 by @dependabot in #1821
- build(deps): bump qsv-stats from 0.16.0 to 0.17.1 by @dependabot in #1813
- build(deps): bump qsv-stats from 0.17.1 to 0.17.2 by @dependabot in #1814
- build(deps): bump qsv-stats from 0.17.2 to 0.18.0 by @dependabot in #1816
- build(deps): bump ryu from 1.0.17 to 1.0.18 by @dependabot in #1801
- build(deps): bump semver from 1.0.22 to 1.0.23 by @dependabot in #1800
- build(deps): bump serde from 1.0.198 to 1.0.199 by @dependabot in #1777
- build(deps): bump serde from 1.0.199 to 1.0.200 by @dependabot in #1787
- build(deps): bump serde from 1.0.200 to 1.0.201 by @dependabot in #1804
- build(deps): bump serde from 1.0.201 to 1.0.202 by @dependabot in #1817
- build(deps): bump serde_json from 1.0.116 to 1.0.117 by @dependabot in #1806
- build(deps): bump serial_test from 3.1.0 to 3.1.1 by @dependabot in #1779
- build(deps): bump simple-expand-tilde from 0.1.5 to 0.1.6 by @dependabot in #1811
- build(deps): bump sysinfo from 0.30.11 to 0.30.12 by @dependabot in https://github.com/jq...
0.127.0
📊 Enhanced Frequency Analysis 📊
This is a quick release adding several frequency enhancements for more detailed frequency analysis. The frequency command now includes a percentage column, calculates other values, and supports limiting unique counts and negative limits.
These options provide additional context for Datapusher+, qsv-pro and describegpt so their metadata inferences are more accurate and comprehensive.
Previously, for a 775-row CSV file containing one column named state with entries for all 50 states, frequency only showed [1]:
qsv frequency freq_state_example.csv | qsv table
field  value  count
state  NY     100
state  NJ     70
state  CA     60
state  MA     55
state  FL     45
state  TX     43
state  NM     40
state  AZ     39
state  NV     38
state  MI     35
Now, there's a new percentage column and other values calculation, both of which have configurable options:
qsv frequency freq_state_example.csv | qsv table
field  value       count  percentage
state  NY          100    12.90323
state  NJ          70     9.03226
state  CA          60     7.74194
state  MA          55     7.09677
state  FL          45     5.80645
state  TX          43     5.54839
state  NM          40     5.16129
state  AZ          39     5.03226
state  NV          38     4.90323
state  MI          35     4.51613
state  Other (40)  250    32.25806
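The arithmetic behind the percentage column and the Other row can be reproduced with a short Python sketch. It is not qsv's implementation: `frequency_table` is a hypothetical helper, and because the release notes only list the top 10 states, the counts for the remaining 40 states (placeholder names S00 to S39) are invented so that they sum to the 250 rows shown.

```python
from collections import Counter


def frequency_table(values, limit=10):
    """Return (value, count, percentage) rows for the top `limit` values,
    plus an "Other (N)" row that aggregates the remaining N unique values."""
    total = len(values)
    counts = Counter(values)
    top = counts.most_common(limit)
    rows = [(value, count, f"{100 * count / total:.5f}") for value, count in top]

    remaining = len(counts) - len(top)
    if remaining > 0:
        other_count = total - sum(count for _, count in top)
        rows.append(
            (f"Other ({remaining})", other_count, f"{100 * other_count / total:.5f}")
        )
    return rows


# Top 10 counts taken from the example output above.
top10 = {"NY": 100, "NJ": 70, "CA": 60, "MA": 55, "FL": 45,
         "TX": 43, "NM": 40, "AZ": 39, "NV": 38, "MI": 35}
# Hypothetical split of the remaining 250 rows across 40 other states.
others = {f"S{i:02d}": (7 if i < 10 else 6) for i in range(40)}

values = [state for state, n in {**top10, **others}.items() for _ in range(n)]
table = frequency_table(values)
for row in table:
    print(*row)
```

Each percentage is simply count / 775 * 100 rounded to five decimal places, and the Other row is the total minus the sum of the top 10 counts (775 - 525 = 250).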
This release is also out of cycle to address a big performance regression in the excel command caused by unnecessary formula info retrieval for the --error-format option introduced in 0.126.0. This has been fixed, and the excel command is now back to its speedy self.
Added
- frequency: added percentage column; other values calculation, implementing #1774 #1775
- benchmarks: added new frequency and excel benchmarks b83ad3a
Changed
- contrib(bashly): update completions.bash for qsv v0.126.0 by @rzmk in #1771
- build(deps): bump mimalloc from 0.1.39 to 0.1.41 by @dependabot in #1772
- build(deps): bump qsv-stats from 0.14.0 to 0.15.0 by @dependabot in #1773
- updated several indirect dependencies
- applied select clippy recommendations
Fixed
- excel: fixed performance regression because qsv was unnecessarily getting formula info (an expensive operation) for --error-format option even when not required 772af34
- renamed 0.126.0 sqlp_vs_duckdb benchmark results so they're next to each other for easy direct comparison. 7bcd59e.
Per the benchmarks, sqlp is 2.87 times faster than duckdb v0.10.2 for a simple aggregation (0.066 secs vs 0.19 secs), and 1.42 times faster for an "expensive" aggregation (0.143 secs vs 0.203 secs).
Full Changelog: 0.126.0...0.127.0
[1] With its default --limit setting of 10, frequency only shows the top 10 unique values in the column, sorted by occurrence.
0.126.0
🤖 Expanded Metadata Inferencing 🤖
describegpt headlines this release, with its new ability to support other local Large Language Models (LLMs) using popular tools that serve them through APIs such as Ollama and Jan. This broadens the tool's utility in diverse AI environments. Beyond OpenAI, qsv can now use other popular LLMs like Llama 3, Mistral, and Gemma. It also unlocks expanded metadata inferencing capabilities in qsv pro.
Several commands got additional options: cat with --no-headers support in the rowskey subcommand; excel with new options like --error-format and short --metadata mode; and foreach with a --dry-run option. frequency also got new options, including --unq-limit for limiting unique counts, support for negative limits, and a --lmt-threshold option for compiling comprehensive frequencies below a threshold. slice now supports negative indices and new JSON output options, providing more flexibility in data slicing.
This is all rounded out with sqlp improvements, including support for single-line comments in SQL scripts and a special SKIP_INPUT value to skip input preprocessing when using table functions directly in Polars SQL (e.g. read_csv() and read_parquet()) - all while increasing performance thanks to the Polars engine being upgraded to 0.39.2.
New Features
- cat: Added --no-headers support to the rowskey subcommand.
- describegpt: Added compatibility for other local Large Language Models (LLMs) served by tools such as Ollama and Jan, broadening the tool's utility in diverse AI environments.
- excel: Introduced new options in the excel command: --error-format for better error handling and a short --metadata JSON mode.
- foreach: Added a --dry-run option, allowing users to preview the results of scripts without executing them.
- frequency: New options added such as --unq-limit for limiting unique counts; support for negative limits to only show frequencies >= abs(negative limit); and a --lmt-threshold option to allow the compilation of comprehensive frequencies below the threshold - all providing more detailed control over frequency analysis.
- slice: Support for negative indices to slice from the end and new JSON output options.
- sqlp: sqlp now supports single-line comments and includes a special SKIP_INPUT value for more efficient data loading. The Polars engine has also been upgraded to 0.39.2, providing enhanced performance and stability.
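The negative-limit behavior listed for frequency above (only show frequencies >= abs(negative limit)) can be pictured with a small Python sketch. The `apply_limit` helper is hypothetical and covers only the limit semantics stated in these notes; treating a limit of 0 as "no limit" is an assumption, and --unq-limit and --lmt-threshold are not modeled.

```python
def apply_limit(freqs, limit):
    """Filter a frequency list of (value, count) pairs sorted by count, descending.

    - positive limit: keep the top `limit` entries
    - negative limit: keep entries whose count is >= abs(limit)
    - zero: keep everything (assumption, not stated in the release notes)
    """
    if limit > 0:
        return freqs[:limit]
    if limit < 0:
        threshold = abs(limit)
        return [(value, count) for value, count in freqs if count >= threshold]
    return list(freqs)


freqs = [("NY", 100), ("NJ", 70), ("CA", 60), ("MI", 35)]

top_two = apply_limit(freqs, 2)
at_least_60 = apply_limit(freqs, -60)
print(top_two)
print(at_least_60)
```

A positive limit caps how many rows are returned, while a negative limit acts as a minimum-count threshold, so the number of rows returned depends on the data.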
Changes and Optimizations
- Performance Enhancements: Microoptimizations in datefmt and validate commands, and increased default length for --infer-len in sqlp for improved performance.
- Dependency Updates: Numerous updates including bumping Luau, jql-runner, pyo3, and other dependencies to enhance stability and security.
- Benchmarks Added: New performance benchmarks for sqlp vs duckdb added to ensure there are no performance regressions between releases. Right now, sqlp is faster than duckdb in most cases (thanks to Polars - see the latest TPC-H benchmarks), but we want to make sure that we keep it that way.
Security and Robustness
- Security Fixes: Updated rustls to fix a specific CVE, and other minor fixes to enhance the security and robustness of network and data processing features.
- Bug Fixes: Various bug fixes including improvements in error formatting in excel and robustness in fetch and fetchpost commands.
Added
- cat: add --no-headers support to rowskey subcommand #1762
- describegpt: add compatibility for other (local) LLMs (Ollama, Jan, etc.) by @rzmk in #1761
- excel: add --error-format option #1721
- excel: add --metadata short JSON mode #1738
- foreach: add --dry-run option #1740
- frequency: add --unq-limit option #1763
- frequency: add support for negative --limit values #1765
- frequency: add --lmt-threshold option #1766
- slice: add support for negative --index option values #1726
- slice: implement --json output option #1729
- sqlp: added support for single-line comments in SQL scripts bb52bce
- sqlp: added SKIP_INPUT special value to short-circuit input processing if the user wants to load input files directly using table functions (e.g. read_csv(), read_parquet(), etc.) fe850ad
- validate: add --valid-output option #1730
- contrib: add sample Bashly completions implementation by @rzmk in #1731
- benchmarks: added sqlp vs duckdb benchmarks.
Changed
- datefmt: microoptimize formatting 0ee27e7
- joinp: adapt to breaking change in Polars 0.39 for lazyframe sort c625ca9
- sqlp: change --infer-len option default from 250 to 1000 for increased performance da1d215
- validate: microoptimize to_json_instance() c2e4a1c
- bump Luau from 0.616 to 0.622 9216ec3
- build(deps): bump jql-runner from 7.1.6 to 7.1.7 by @dependabot in #1711
- build(deps): bump pyo3 from 0.21.0 to 0.21.1 by @dependabot in #1712
- build(deps): bump pyo3 from 0.21.1 to 0.21.2 by @dependabot in #1750
- build(deps): bump strsim from 0.11.0 to 0.11.1 by @dependabot in #1715
- build(deps): bump sysinfo from 0.30.7 to 0.30.8 by @dependabot in #1716
- build(deps): bump sysinfo from 0.30.8 to 0.30.9 by @dependabot in #1732
- build(deps): bump sysinfo from 0.30.9 to 0.30.10 by @dependabot in #1735
- build(deps): bump sysinfo from 0.30.10 to 0.30.11 by @dependabot in #1755
- build(deps): bump redis from 0.25.2 to 0.25.3 by @dependabot in #1720
- build(deps): bump mlua from 0.9.6 to 0.9.7 by @dependabot in #1724
- build(deps): bump reqwest from 0.12.2 to 0.12.3 by @dependabot in #1725
- build(deps): bump reqwest from 0.12.3 to 0.12.4 by @dependabot in #1759
- build(deps): bump anyhow from 1.0.81 to 1.0.82 by @dependabot in #1733
- build(deps): bump robinraju/release-downloader from 1.9 to 1.10 by @dependabot in #1734
- build(deps): bump chrono from 0.4.37 to 0.4.38 by @dependabot in #1744
- bump polars from 0.38 to 0.39 #1745
- build(deps): bump polars from 0.39.0 to 0.39.1 by @dependabot in #1746
- build(deps): bump polars from 0.39.1 to 0.39.2 by @dependabot in #1752
- build(deps): bump qsv-dateparser from 0.12.0 to 0.12.1 by @dependabot in #1747
- build(deps): bump serde_json from 1.0.115 to 1.0.116 by @dependabot in #1749
- build(deps): bump serde from 1.0.197 to 1.0.198 by @dependabot in #1751
- build(deps): bump rustls from 0.22.3 to 0.22.4 by @dependabot in #1758
- build(deps): bump simple-expand-tilde from 0.1.4 to 0.1.5 by @dependabot in #1767
- build(deps): bump serial_test from 3.0.0 to 3.1.0 by @dependabot in #1768
- build(deps): bump actions/setup-python from 5.0.0 to 5.1.0 by @dependabot in #1769
- applied select clippy recommendations
- updated several indirect dependencies
- added several benchmarks for new/changed commands
- pin Rust nightly to 2024-04-15 - the same nightly that Polars 0.39 is pinned to
- bumped MSRV to 1.77.2
Fixed
- Make init_logger more robust #1717
- count: empty CSVs count as zero also for polars. Fixes #1741 #1742
- excel: fix #1682 by adding --error-format option #1689
- fetch & fetchpost: more robust JSON response validation ebc7287
- slice: use write! macro to get rid of GH Advanced Security lint c739097
- sqlp: fixed docopt defaults that were not being parsed correctly fe850ad
- deps: bump h2 from 0.4.3 to 0.4.4 ...
0.125.0
In this release, we focused on the 🏎️ need for even more speed 🏎️.
This was done primarily by tweaking several supporting qsv crates. qsv-docopt now parses command-line arguments slightly faster. qsv-stats, the crate behind commands like stats, schema, tojsonl, and frequency, has been further optimized for speed. qsv-dateparser has been updated to support new timezone handling options in datefmt. qsv-sniffer also got a speed boost.
Per the benchmark suite, stats is 25% faster (1.563 secs vs 2.067 secs) when computing the 13 "streaming" stats and 14% faster when computing --everything (17 columns of additional stats - 3.149 secs vs 3.656 secs) for the 1M row, 41 column, 520 MB sample of NYC's 311 data.
The count command has been refactored to utilize Polars' SQLContext, which leverages LazyFrame evaluation to automagically count even very large files in just a few seconds. Previously, count was already using Polars, but it mistakenly fell back to a slower counting mode. Now, it consistently delivers fast performance, even without an index. On the same benchmark suite, it takes 0.052 secs vs 0.503 secs - almost 10x faster!
As count is not just a top-level command, but also a helper used by several qsv commands, this gives the entire suite a nice performance boost.
Continuing on the performance front, the excel command now has a new short --metadata mode, allowing users to get a "shorter" version of the metadata report that only lists the workbook's top-level metadata (sheet index, sheet name, sheet type, visibility) instead of the full metadata report (which also has info like num rows, column metadata, etc.). On the benchmark suite, the short metadata report takes all of 0.005 secs vs 11.237 secs for the 1M row xlsx version of the same NYC 311 data - more than 3 orders of magnitude faster! (It may actually be faster, since 0.005 secs is at the limits of what hyperfine can measure.)
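The speedups quoted above for stats, count and excel follow from the benchmark timings. A minimal sketch of the arithmetic is shown below; the helper names `speedup` and `percent_faster` are illustrative and not part of qsv, and "percent faster" is assumed here to mean the reduction in elapsed time relative to the old timing.

```rust
/// Ratio of the old elapsed time to the new one (10.0 means "10x faster").
fn speedup(old_secs: f64, new_secs: f64) -> f64 {
    old_secs / new_secs
}

/// Reduction in elapsed time, as a percentage of the old timing.
/// ASSUMPTION: this is how the "N% faster" figures were derived.
fn percent_faster(old_secs: f64, new_secs: f64) -> f64 {
    (old_secs - new_secs) / old_secs * 100.0
}

fn main() {
    // stats, 13 "streaming" stats: 2.067 secs -> 1.563 secs
    println!("stats: {:.1}% less time", percent_faster(2.067, 1.563));
    // stats --everything: 3.656 secs -> 3.149 secs
    println!("stats --everything: {:.1}% less time", percent_faster(3.656, 3.149));
    // count: 0.503 secs -> 0.052 secs
    println!("count: {:.2}x faster", speedup(0.503, 0.052));
    // excel short metadata: 11.237 secs -> 0.005 secs
    let excel = speedup(11.237, 0.005);
    println!("excel: {:.0}x faster, about 10^{:.1}", excel, excel.log10());
}
```

Under that assumption the timings work out to roughly 24% and 14% less time for stats, about 9.7x for count, and a little over 2,000x for the short excel metadata report, in line with the rounded figures in the text.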
The datefmt command also got some major enhancements with new timezone handling and timestamp parsing options, though at the cost of a small 15% performance penalty.
Lastly, we are excited to announce that qsv will be featured at the CSV,Conf,V8 conference in Puebla, Mexico on May 28-29. I'll be presenting a talk titled "qsv: A Blazing Fast CSV Data-Wrangling Toolkit". Hope to see you there!
Added
- excel: added short mode to --metadata option #1699
- datefmt: added ts-resolution option to specify resolution to use when parsing unix timestamps #1704
- datefmt: added timezone handling options #1706 #1707 #1642
Changed
- count: refactored to use Polars SQLContext 43a236f
- stats: refactored stats_path helper function 174c30e
- apply, applydp, datefmt, excel, geocode, py, validate: use std::mem::take to avoid clone 1fd187f 8402d3a 8496157
- excel: optimized workbook opening operation 67f662e
- build(deps): bump flexi_logger from 0.27.4 to 0.28.0 by @dependabot in #1673
- build(deps): bump polars from 0.38.2 to 0.38.3 by @dependabot in #1674
- build(deps): bump uuid from 1.7.0 to 1.8.0 by @dependabot in #1675
- build(deps): bump hashbrown from 0.14.3 to 0.14.4 by @dependabot in #1680
- build(deps): bump reqwest from 0.11.26 to 0.11.27 by @dependabot in #1679
- build(deps): bump bytes from 1.5.0 to 1.6.0 by @dependabot in #1685
- build(deps): bump regex from 1.10.3 to 1.10.4 by @dependabot in #1686
- build(deps): bump indexmap from 2.2.5 to 2.2.6 by @dependabot in #1687
- build(deps): bump rayon from 1.9.0 to 1.10.0 by @dependabot in #1688
- build(deps): bump qsv_docopt from 1.6.0 to 1.7.0 by @dependabot in #1691
- build(deps): bump reqwest from 0.12.1 to 0.12.2 by @dependabot in #1693
- build(deps): bump serde_json from 1.0.114 to 1.0.115 by @dependabot in #1694
- build(deps): bump itoa from 1.0.10 to 1.0.11 by @dependabot in #1695
- build(deps): bump actions/setup-python from 5.0.0 to 5.1.0 by @dependabot in #1700
- build(deps): bump rust_decimal from 1.34.3 to 1.35.0 by @dependabot in #1701
- build(deps): bump chrono from 0.4.35 to 0.4.37 by @dependabot in #1702
- build(deps): bump tokio from 1.36.0 to 1.37.0 by @dependabot in #1703
- build(deps): bump qsv-sniffer from 0.10.2 to 0.10.3 by @dependabot in #1708
- build(deps): bump titlecase from 2.2.1 to 3.0.0 by @dependabot in #1709
- build(deps): bump qsv-stats from 0.13.0 to 0.14.0 by @dependabot in #1710
- applied select clippy recommendations
- updated several indirect dependencies
- added several benchmarks for new/changed commands
- bumped MSRV to 1.77.1
- use #[cfg(debug_assertions)] conditional compilation to avoid compiling debug code in release mode
- use patched forks of jsonschema, cached, self_update and localzone crates to avoid old dependencies which were causing dependency bloat
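The std::mem::take change listed above for apply, applydp, datefmt and other commands replaces a clone with a move. The sketch below illustrates the general idea on a made-up `RowBuffer` type; it is not taken from the qsv source, and the struct and function names are hypothetical.

```rust
use std::mem;

/// Hypothetical per-row scratch buffer, standing in for a String that a
/// command fills in for each record and then hands off.
struct RowBuffer {
    cell: String,
}

/// Hands the cell off by cloning: allocates a second String and copies the
/// bytes, then empties the original.
fn flush_with_clone(buf: &mut RowBuffer, out: &mut Vec<String>) {
    out.push(buf.cell.clone());
    buf.cell.clear();
}

/// Same observable result with std::mem::take: the String is moved out and
/// an empty default String (which does not allocate) is left in its place.
fn flush_with_take(buf: &mut RowBuffer, out: &mut Vec<String>) {
    out.push(mem::take(&mut buf.cell));
}

fn main() {
    let mut out = Vec::new();
    let mut buf = RowBuffer { cell: String::from("2024-03-20") };

    flush_with_take(&mut buf, &mut out);
    buf.cell.push_str("2024-03-21");
    flush_with_clone(&mut buf, &mut out);

    println!("{:?}", out);
}
```

One trade-off to keep in mind: the clone-and-clear version keeps the buffer's capacity for reuse, while the take version avoids the copy but leaves a fresh, zero-capacity String behind.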
Fixed
- count: fixed polars_count_input helper, as it was always falling back to "slow" counting mode 3484c89
Full Changelog: 0.124.1...0.125.0
0.124.1
Datapusher+ "Speed of Insight" Release! 🚀🚀🚀
This release is all about speed, speed, speed! We've made qsv even faster by leveraging Polars' multithreaded, mem-mapped CSV reader to get near-instant row counts of large CSV files, and near instant SQL queries and aggregations with Datapusher+ - automagically inferring metadata and giving you quick insights into your data in seconds!
We're demoing our qsv-powered Datapusher+ at the March 2024 installment of CKAN Monthly Live on March 20, 2024, 13:00-14:00 UTC. Join us!
Beyond pushing data reliably at speed into your CKAN Datastore (it pushes real good! 😉), DP+ does some extended analysis, processing and enrichment of the data so it can be readily Used.
Both fetch and fetchpost commands now also have a --disk-cache option and are fully synched - forming the foundation for high-speed data enrichment from Web Services - including datHere's forthcoming, fully-integrated Data Enrichment Service.
🏇🏽 Hi-ho Quicksilver, away! 🏇🏽
Added
- count: automatically use Polars multithreaded, mem-mapped CSV reader when polars feature is enabled to get near-instant row counts of large CSV files even without an index #1656
- qsvdp: added polars support to Datapusher+-optimized binary variant, so we can do near instant SQL queries and aggregations during DP+ processing #1664
- fetchpost: added --disk-cache options and synced usage options with fetch #1671
- extended .infile-list to skip empty and commented lines, and to validate file paths 20a45c8 and 2650930
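The .infile-list behavior in the last item (skip empty and commented lines, then validate the paths) can be sketched roughly as follows. This is not qsv's implementation: the function names are hypothetical, and the sketch assumes a comment is a line whose first non-whitespace character is '#'.

```rust
use std::path::PathBuf;

/// Parses the text of an infile-list: one path per line, ignoring blank
/// lines and comment lines.
/// ASSUMPTION: a comment line starts with '#' after trimming whitespace.
fn parse_infile_list(contents: &str) -> Vec<PathBuf> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .map(PathBuf::from)
        .collect()
}

/// Splits the parsed paths into (existing, missing) so the caller can
/// report entries that do not point at a real file or directory.
fn validate_paths(paths: Vec<PathBuf>) -> (Vec<PathBuf>, Vec<PathBuf>) {
    paths.into_iter().partition(|p| p.exists())
}

fn main() {
    let list = "# input files\ndata/a.csv\n\n  data/b.csv  \n# data/old.csv\n";
    let (found, missing) = validate_paths(parse_infile_list(list));
    println!("found: {found:?}, missing: {missing:?}");
}
```

Returning the missing paths instead of failing on the first one is a design choice of this sketch; it lets a caller list every bad entry in a single error message.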
Changed
- sqlp: automatically disable read_csv() fast path optimization when a custom delimiter is specified #1648
- refactored util::count_rows() helper to also use polars if available 1e09e17 and 8d321fe
- publish: updated Windows MSI publish GH Action workflow to use Wix 3.14 from 3.11 75894ef
- deps: bump polars from 0.38.1 to 0.38.2 5faf90e
- deps: update Luau from 0.614 to 0.616 eb197fe and 52331da
- build(deps): bump sysinfo from 0.30.6 to 0.30.7 by @dependabot in #1650
- build(deps): bump chrono from 0.4.34 to 0.4.35 by @dependabot in #1651
- build(deps): bump strum from 0.26.1 to 0.26.2 by @dependabot in #1658
- build(deps): bump qsv-stats from 0.12.0 to 0.13.0 by @dependabot in #1663
- build(deps): bump anyhow from 1.0.80 to 1.0.81 by @dependabot in #1662
- build(deps): bump reqwest from 0.11.25 to 0.11.26 by @dependabot in #1667
- applied select clippy recommendations
- updated several indirect dependencies
- added several benchmarks for new/changed commands
Fixed
- dedup: fixed #1665 dedup not handling numeric values properly by adding a --numeric option #1666
- joinp: reenable join validation tests now that Polars 0.38.2 join validation is working again 5faf90e and fcfc75b
- count: broken in unreleased 0.124.0. Polars-powered count requires a "clean" CSV file as it infers the schema based on the first 1000 rows of a CSV. This will sometimes result in an invalid "error" (e.g. it infers a column is a number column, when it's not). 0.124.1 fixes this by adding a fallback to the "regular" CSV reader if a Polars error occurs a2c0869
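The count fix above is an instance of a "try the fast path, fall back on error" pattern. The sketch below imitates it with the standard library only: `fast_count` is a hypothetical stand-in for a schema-inferring reader (it is not Polars), and the CSV handling is deliberately simplified (no quoted fields or embedded newlines).

```rust
/// True if the first comma-separated field of the row parses as a number.
fn first_field_is_number(row: &str) -> bool {
    row.split(',').next().unwrap_or("").trim().parse::<f64>().is_ok()
}

/// Hypothetical fast path: infers from the first `sample` data rows whether
/// column 0 is numeric, then errors out if a later row breaks that guess.
fn fast_count(csv: &str, sample: usize) -> Result<usize, String> {
    let rows: Vec<&str> = csv.lines().skip(1).collect(); // skip the header
    let inferred_numeric = rows.iter().take(sample).all(|r| first_field_is_number(r));
    if inferred_numeric {
        for (i, row) in rows.iter().enumerate() {
            if !first_field_is_number(row) {
                return Err(format!("data row {}: expected a number", i + 1));
            }
        }
    }
    Ok(rows.len())
}

/// Slower but tolerant path: counts data rows without interpreting them.
fn regular_count(csv: &str) -> usize {
    csv.lines().skip(1).count()
}

/// Try the fast path first; on any error, fall back to the regular reader.
fn count_rows(csv: &str, sample: usize) -> usize {
    fast_count(csv, sample).unwrap_or_else(|_| regular_count(csv))
}

fn main() {
    // The first two rows look numeric, so the fast path guesses "number"
    // and then trips over "N/A" on the third row.
    let csv = "id,name\n1,a\n2,b\nN/A,c\n";
    println!("fast path alone: {:?}", fast_count(csv, 2));
    println!("with fallback: {}", count_rows(csv, 2));
}
```

The fallback costs a second pass over the data only when the fast path fails, so clean files keep the fast timing.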
Removed
- gender_guesser 0.2.0 has been released. Remove patch.crates-io entry 97873a5
Full Changelog: 0.123.0...0.124.1
0.123.0
OPEN DATA DAY 2024 Release! 🎉🎉🎉
In celebration of Open Data Day, we're releasing qsv 0.123.0 - the biggest release ever with 330+ commits! qsv 0.123.0 continues to focus on performance, stability and reliability as we continue setting the stage for qsv's big brother - qsv pro.
We've been baking qsv pro for a while now, and it's almost ready for release. qsv pro is a cross-platform Desktop Data Wrangling tool marrying an Excel-like UI with the power of qsv, backed by a cloud-based data cleaning, enrichment and enhancement service. It is easy to use for casual Excel users and Data Publishers, yet powerful enough for data scientists and data engineers.
Stay tuned!
Highlights:
- sqlp now has automatic read_csv() fast path optimization, often making optimized queries run dramatically faster - e.g. what took 6.09 seconds for a non-trivial SQL aggregation on an 18 column, 657mb CSV with 7.43 million rows now takes just 0.14 seconds with the optimization - 🚀 43.5x FASTER 🚀! [1]
# with fast path optimization turned off
/usr/bin/time qsv sqlp taxi.csv --no-optimizations "select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID"
VendorID,total_amount
1,52377417.52985942
2,89959869.13054822
4,600584.610000027
(3, 2)
6.09 real 6.82 user 0.16 sys
# with fast path optimization, fully exploiting Polars' multithreaded, mem-mapped CSV reader!
/usr/bin/time qsv sqlp taxi.csv "select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID"
VendorID,total_amount
1,52377417.52985942
2,89959869.13054822
4,600584.610000027
(3, 2)
0.14 real 1.09 user 0.09 sys
# in contrast, csvq takes 72.46 seconds - 517.57x slower
/usr/bin/time csvq "select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID"
+----------+---------------------+
| VendorID | SUM(total_amount) |
+----------+---------------------+
| 1 | 52377417.529256366 |
| 2 | 89959869.1264675 |
| 4 | 600584.6099999828 |
+----------+---------------------+
72.46 real 65.15 user 75.17 sys
"Traditional" SQL engines
qsv and csvq both operate on "bare" CSVs. For comparison, let's contrast qsv's performance against "traditional" SQL engines that require setup and import (aka ETL). Not counting setup and import time (which alone takes several minutes), we get:
sqlite 3.43.2 takes 2.910 seconds - 20.79x slower
sqlite> .timer on
sqlite> select VendorID,sum(total_amount) from taxi group by VendorID order by VendorID;
1,52377417.53
2,89959869.13
4,600584.61
Run Time: real 2.910 user 2.569494 sys 0.272972
PostgreSQL 15.6 using PgAdmin 4 v6.12 takes 18.527 seconds - 132.34x slower
even with an index, qsv sqlp is still 5.96x faster
- sqlp now supports JSONL output format and adds compression support for Avro and Arrow output formats.
- fetch now has a --disk-cache option, so you can cache web service responses to disk, complete with cache control and expiry handling!
- jsonl is now multithreaded with additional --batch and --job options.
- split now has three modes: split by record count, split by number of chunks and split by file size.
- datefmt is a new top-level command for date formatting. We extracted it from apply to make it easier to use, and to set the stage for expanded date and timezone handling.
- enum now has a --start option.
- excel now has a --keep-zero-time option and now has improved datetime/duration parsing/handling with upgrade of calamine from 0.23 to 0.24.
- tojsonl now has --trim and --no-boolean options and eliminated false positive boolean inferences.
Added
- apply: add gender_guess operation #1569
- datefmt: new top-level command for date formatting. #1638
- enum: add --start option #1631
- excel: added --keep-zero-time option; improved datetime/duration parsing/handling with upgrade of calamine from 0.23 to 0.24 #1595
- fetch: add --disk-cache option #1621
- jsonl: major performance refactor! Now multithreaded with addl --batch and --job options #1553
- sniff: added addl mimetype/file formats detected by bumping file-format from 0.23 to 0.24 #1589
- split: add <outdir> error handling and add usage text examples #1585
- split: added --chunks option #1587
- split: add --kb-size option #1613
- sqlp: added JSONL output format and compression support for AVRO and Arrow output formats in #1635
- tojsonl: add --trim option #1554
- Add QSV_DOTENV_PATH env var #1562
- Add license scan report and status by @fossabot in #1550
- Added several benchmarks for new/changed commands
Changed
- luau: bumped Luau from 0.606 to 0.614
- freq: major performance refactor - 1a3a4b4
- split: migrate to rayon from threadpool #1555
- split: refactored to actually create chunks <= desired --kb-size, obviating need for hacky --sep-factor option #1615
- tojsonl: improved true/false boolean inferencing false positive handling #1641
- tojsonl: fine-tune boolean inferencing #1643
- schema: use parallel sort when sorting enums for fields 523c60a
- Use array for rustflags to avoid conflicts with user flags by @clarfonthey in #1548
- Make it easier and more consistent to package for distros by @alerque in #1549
- Replace simple_home_dir with simple_expand_tilde crate #1578
- build(deps): bump rayon from 1.8.0 to 1.8.1 by @dependabot in #1547
- build(deps): bump rayon from 1.8.1 to 1.9.0 by @dependabot in #1623
- build(deps): bump uuid from 1.6.1 to 1.7.0 by @dependabot in #1551
- build(deps): bump jql-runner from 7.1.2 to 7.1.3 by @dependabot in #1552
- build(deps): bump jql-runner from 7.1.3 to 7.1.5 by @dependabot in #1602
- build(deps): bump jql-runner from 7.1.5 to 7.1.6 by @dependabot in #1637
- build(deps): bump flexi_logger from 0.27.3 to 0.27.4 by @dependabot in #1556
- build(deps): bump regex from 1.10.2 to 1.10.3 by @dependabot in #1557
- build(deps): bump cached from 0.47.0 to 0.48.0 by @dependabot in #1558
- build(deps): bump cached from 0.48.0 to 0.48.1 by @dependabot in #1560
- build(deps): bump cached from 0.48.1 to 0.49.2 by @dependabot in #1618
- build(deps): bump chrono from 0.4.31 to 0.4.32 by @dependabot in #1559
- build(deps): bump chrono from 0.4.32 to 0.4.33 by @dependabot in #1566
- build(deps): bump mlua from 0.9.4 to 0.9.5 by @dependabot in #1565
- build(deps): bump mlua from 0.9.5 to 0.9.6 by @dependabot in #1632
- build(deps): bump serde from 1.0.195 to 1.0.196 by @dependabot in #1568
- build(deps): bump serde from 1.0.196 to 1.0.197 by @dependabot in #1612
- build(deps): bump serde_json from 1.0.111 to 1.0.112 by @dependabot in #1567
- build(deps): bump serde_json from 1.0.112 to 1.0.113 by @dependabot in #1576
- build(deps): bump serde_json from 1.0.113 to 1.0.114 by @dependabot in #1610
- bump Polars from 0.36 to 0.37 #1570
- build(deps): bump polars from 0.37.0 to 0.38.0 by @dependabot in #1629
- build(deps): bump polars from 0.38.0 to 0.38.1 by @dependabot in #1634
- build(deps): bump strum from 0.25.0 to 0.26.1 by @dependabot in #1572
- build(deps): bump indexmap from 2.1.0 to 2.2.1 by @dependabot in https://g...
[1] Measurements taken on an Apple Mac Mini 2023 model with an M2 Pro chip with 12 CPU cores & 32GB of RAM, running macOS Sonoma 14.4