Releases: dathere/qsv

2.0.0

06 Jan 12:54

qsv v2.0.0 is here! 🎉

It took 193 releases to get to v1.0.0, and we're already at v2.0.0 a month later!?!

Yes! We wanted a running start for 2025, and qsv 2.0.0 marks qsv's biggest release yet!

  • It fully enables the "Data Resource Upload First (DRUF)" workflow, allowing Datapusher+ to infer "automagical metadata" from the data itself. It exposes two Domain Specific Language (DSL) options - Luau and MiniJinja - to enable powerful data transformation and validation capabilities. This allows data stewards to upload data first, then use qsv's DSL capabilities inside DP+ to automatically generate rich metadata - including data dictionaries, field descriptions, data quality rules, and data validation schemas. This "automagical metadata" approach dramatically reduces the friction in compiling high-quality, high-resolution metadata (using the DCAT-US 3.0 specification as a reference) that would otherwise be a manual, laborious, and error-prone process.
    Under the hood, the fetchpost, template, stats, validate and luau commands now have the necessary scaffolding to fully support this workflow inside Datapusher+ and ckanext-scheming.
  • It adds a new "smart" pivotp command, powered by Polars, to enable fast pivot operations on large datasets. It's "smart" as it uses the stats cache to automatically suggest an aggregation based on a column's data type and summary statistics. You can now pivot your data in seconds by simply specifying the columns to pivot on while blowing past Excel's PivotTable limitations.
  • stats now computes geometric mean and harmonic mean and adds string length stats, all while getting a performance boost.
  • join and joinp got a lot of love in this release, with several new options:
    • joinp: non-equi join support! 🎉💯🥳
      See "Lightning Fast and Space Efficient Inequality Joins" paper and this Polars non-equi join tracking issue.
    • join & joinp: --right-anti and --right-semi joins
    • joinp: --ignore-leading-zeros option for join keys
    • joinp: --maintain-order option to maintain the order of either the left or right dataset in the output
    • joinp: expanded --cache-schema options to make joinp smarter/faster by leveraging the stats cache
    • join: --keys-output option to write successfully joined keys to a separate output file.
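
For readers new to the --right-semi and --right-anti join types listed above, here is a minimal plain-Python sketch of what they return. It is not qsv's implementation; the function names and sample rows are hypothetical. A right-semi join keeps the rows of the right dataset whose key has a match in the left dataset, and a right-anti join keeps the rows of the right dataset that have no match.

```python
def right_semi_join(left, right, key):
    """Rows of `right` whose key also appears in `left`."""
    left_keys = {row[key] for row in left}
    return [row for row in right if row[key] in left_keys]


def right_anti_join(left, right, key):
    """Rows of `right` whose key does not appear in `left`."""
    left_keys = {row[key] for row in left}
    return [row for row in right if row[key] not in left_keys]


# Hypothetical sample data
left = [{"id": "1", "city": "Boston"}, {"id": "2", "city": "Austin"}]
right = [{"id": "2", "pop": "975000"}, {"id": "3", "pop": "680000"}]

semi = right_semi_join(left, right, "id")
anti = right_anti_join(left, right, "id")
print(semi)  # [{'id': '2', 'pop': '975000'}]
print(anti)  # [{'id': '3', 'pop': '680000'}]
```

In both cases the output has the same columns as the right dataset; only the set of rows differs.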

This release lays the groundwork for the outliers "smart" command to quickly identify outliers using stats/frequency info.

It also sets the stage for an initial implementation of our "Data Concierge" that leverages all the high-quality, high-res metadata we automagically compile with DRUF to enable Metadata Gardening Agents to proactively link seemingly unrelated data and glean insights as it constantly grooms the Data Catalog - effectively making it a FAIR Data Factory.


Added

  • fetchpost: add --globals-json option #2357
  • fixlengths: add --remove-empty option; refactored for performance. Fulfills #2391. #2411
  • join: add --keys-output option. Fulfills #2407. #2408
  • join: add --right-anti and --right-semi options. Fulfills #2379. #2380
  • joinp: add non-equi join support! 🎉💯🥳 #2409
  • joinp: add --ignore-leading-zeros option. Fulfills #2398. #2400
  • joinp: add --maintain-order option #2338
  • joinp: add --right-anti and --right-semi options. Fulfills #2377. #2378
  • luau: additional helper functions. Fulfills #1782. #2362
  • luau: add qsv_writejson helper #2375
  • pivotp: new Polars-powered command. Fulfills #799. #2364
  • pivotp: "smart" pivotp. #2367
  • stats: add geometric mean and harmonic mean. Fulfills #2227. #2342
  • stats: add string length stats to set stage for upcoming outliers "smart" command to quickly identify outliers using stats/frequency info #2390
  • template: add --globals-json option #2356
  • tojsonl: add --quiet option. Fulfills #2335. #2336
  • validate: add --validate-schema option to check if the JSON Schema itself is valid #2393
  • contrib(completions): add joinp --ignore-case and slice --invert by @rzmk in #2322
  • contrib(completions): add --quiet to tojsonl by @rzmk in #2337
  • ci: add qsv_glibc_2.31-headless to action by @rzmk in #2330
  • Add license to MSI installer by @rzmk in #2321
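
As a quick reminder of what the two new stats means measure, here is a small Python sketch using the standard library. It only illustrates the definitions; the sample values are hypothetical and this is not how qsv computes them internally.

```python
import math
import statistics

values = [2.0, 8.0]  # hypothetical sample column of positive numbers

# Geometric mean: the n-th root of the product of n positive values.
geometric = statistics.geometric_mean(values)

# Harmonic mean: n divided by the sum of the reciprocals.
harmonic = statistics.harmonic_mean(values)

# The same quantities written out from their definitions.
geometric_manual = math.exp(sum(math.log(v) for v in values) / len(values))
harmonic_manual = len(values) / sum(1.0 / v for v in values)

print(geometric, harmonic)
```

For this sample the geometric mean is about 4.0 and the harmonic mean about 3.2, both below the arithmetic mean of 5.0.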

Changed

  • lens: optimized csvlens library usage, dropping clap dependency #2403
  • pivotp: an even smarter pivotp #2368
  • stats: performance boost 51349ba
  • Update deb package by @tino097 in #2226
  • ci: attempt using files-folder instead of files by @rzmk in #2320
  • Setting QSV_FREEMEMORY_HEADROOM_PCT to 0 disables memory availability check #2353
  • build(deps): bump actix-governor from 0.7.0 to 0.8.0 by @dependabot in #2351
  • build(deps): bump bytemuck from 1.20.0 to 1.21.0 by @dependabot in #2361
  • build(deps): bump chrono from 0.4.38 to 0.4.39 by @dependabot in #2345
  • build(deps): bump crossbeam-channel from 0.5.13 to 0.5.14 by @dependabot in #2354
  • build(deps): bump flexi_logger from 0.29.6 to 0.29.7 by @dependabot in #2348
  • build(deps): bump governor from 0.7.0 to 0.8.0 by @dependabot in #2347
  • build(deps): bump itertools from 0.13.0 to 0.14.0 by @dependabot in #2413
  • build(deps): bump jsonschema from 0.26.1 to 0.26.2 by @dependabot in #2355
  • build(deps): bump jsonschema from 0.26.2 to 0.27.0 by @dependabot in #2371
  • build(deps): bump jsonschema from 0.27.1 to 0.28.0 by @dependabot in #2389
  • build(deps): bump jsonschema from 0.28.0 to 0.28.1 by @dependabot in #2396
  • bump polars from 0.44.2 to 0.45 #2340
  • build(deps): bump polars from 0.45.0 to 0.45.1 by @dependabot in #2344
  • bump pyo3 from 0.22 to 0.23 now that Polars supports it #2352
  • build(deps): bump redis from 0.27.5 to 0.27.6 by @dependabot in #2331
  • build(deps): bump reqwest from 0.12.9 to 0.12.11 by @dependabot in #2385
  • build(deps): bump reqwest from 0.12.11 to 0.12.12 by @dependabot in #2395
  • build(deps): bump rfd from 0.15.1 to 0.15.2 by @dependabot in #2404
  • build(deps): bump serde from 1.0.215 to 1.0.216 by @dependabot in #2349
  • build(deps): bump serde from 1.0.216 to 1.0.217 by @dependabot in #2384
  • build(deps): bump serde_json from 1.0.133 to 1.0.134 by @dependabot in #2365
  • build(deps): bump sysinfo from 0.32.1 to 0.33.0 by @dependabot in #2334
  • build(deps): bump sysinfo from 0.33.0 to 0.33.1 by @dependabot in #2383
  • deps: bump tabwriter to 1.4.1 bbcbeba
  • build(deps): bump tokio from 1.41.1 to 1.42.0 by @dependabot in #2333
  • build(deps): bump xxhash-rust from 0.8.12 to 0.8.13 by @dependabot in #2359
  • build(deps): bump xxhash-rust from 0.8.13 to 0.8.14 by @dependabot in #2372
  • build(deps): bump xxhash-rust from 0.8.14 to 0.8.15 by @dependabot in #2392
  • apply several clippy suggestions
  • bumped numerous indirect dependencies to latest versions
  • bumped Rust nightly from 2024-11-28 to 2024-12-19 (same version used by Polars)

Fixed

  • joinp: refactor --cache-schema option. Resolves #2369. #2370
  • extsort underflow in CSV mode. Resolves #2391. #2412
  • instantiate logger properly 9c0c1a7
  • fix util::get_stats_records() to no longer infer boolean in StatsMode::PolarsSchema. Resolves #2369. https://github.com/da...

1.0.0

02 Dec 13:27

qsv v1.0.0 is here! 🎉

After over 3 years of development, nearly 200 releases, and 11,000+ commits, qsv has finally reached v1.0.0!

What started as a hobby project to learn Rust during COVID has evolved into a powerful data wrangling tool used in multiple datHere products, open source projects, and even in several mission-critical production environments!

To mark this major milestone, this larger than usual release includes major performance improvements, new features, and various optimizations!


Added

  • joinp: add --ignore-case option #2287
  • py: add ability to load python expression from file #2295
  • replace: add --not-one flag (resolves #2305) by @rzmk in #2307
  • slice: add --invert option #2298
  • stats: add dataset-level stats #2297
  • sqlp: auto-decompression of gzip, zstd & zlib compressed csv files with read_csv table function (implements suggestion from @wardi in #2301) #2315
  • template: add lookup support #2313
  • added ui feature to make it easier to make a headless build of qsv #2289
  • added better panic handling #2304
  • added new benchmark for template command cd7e480
  • added 📚 lookup support legend b46de73

Changed

  • move qsv from personal Github repo to datHere GitHub org #2317
  • template: parallelized template rendering for significant speedups #2273
  • simplify input format check #2309
  • bump embedded luau from 0.650 to 0.653 986a1d3
  • deps: Switch back to simple-home-dir from simple-expand-tilde #2319
  • deps: Add minijinja contrib #2276
  • deps: bump pyo3 down to 0.21.2 because polars-mem-engine is not compatible with pyo3 0.23.x yet 7f9fc8a
  • build(deps): bump base62 from 2.0.2 to 2.0.3 by @dependabot in #2281
  • build(deps): bump bytemuck from 1.19.0 to 1.20.0 by @dependabot in #2299
  • build(deps): bump bytes from 1.8.0 to 1.9.0 by @dependabot in #2314
  • build(deps): bump file-format from 0.25.0 to 0.26.0 by @dependabot in #2277
  • build(deps): bump hashbrown from 0.15.1 to 0.15.2 by @dependabot in #2310
  • build(deps): bump itoa from 1.0.11 to 1.0.12 by @dependabot in #2300
  • build(deps): bump itoa from 1.0.12 to 1.0.13 by @dependabot in #2302
  • build(deps): bump itoa from 1.0.13 to 1.0.14 by @dependabot in #2311
  • build(deps): bump mlua from 0.10.0 to 0.10.1 by @dependabot in #2280
  • build(deps): bump mlua from 0.10.1 to 0.10.2 by @dependabot in #2316
  • build(deps): bump serial_test from 3.1.1 to 3.2.0 by @dependabot in #2279
  • build(deps): bump minijinja from 2.4.0 to 2.5.0 by @dependabot in #2284
  • build(deps): bump minijinja-contrib from 2.3.1 to 2.5.0 by @dependabot in #2283
  • build(deps): bump rfd from 0.15.0 to 0.15.1 by @dependabot in #2291
  • build(deps): bump sanitize-filename from 0.5.0 to 0.6.0 by @dependabot in #2275
  • build(deps): bump serde from 1.0.214 to 1.0.215 by @dependabot in #2286
  • build(deps): bump serde_json from 1.0.132 to 1.0.133 by @dependabot in #2292
  • build(deps): bump tempfile from 3.13.0 to 3.14.0 by @dependabot in #2278
  • build(deps): bump tokio from 1.41.0 to 1.41.1 by @dependabot in #2274
  • build(deps): bump url from 2.5.3 to 2.5.4 by @dependabot in #2306
  • applied several clippy suggestions
  • bumped numerous indirect dependencies to latest versions
  • bumped MSRV to latest Rust stable (1.83.0)
  • bumped Rust nightly from 2024-11-01 to 2024-11-28, the same version used by Polars

Fixed

  • fix get_stats_records() helper to handle input files with embedded spaces (fixes #2294) #2296
  • added better panic handling (fixes #2301) #2304
  • implement simple format check for input files (fixes #2301) #2308

Removed

  • removed simple-expand-tilde dependency in favor of simple-home-dir #2318
  • removed patched fork of indicatif now that 0.17.9 is released, fixing GH unmaintained advisory for instant 33fa54a
  • removed clipboard command from qsvlite binary variant 9c663d8

Full Changelog: 0.138.0...1.0.0

0.138.0

06 Nov 03:23
6dd67c1

Highlights:

  • ⭐ New template command for rendering templates with CSV data.
    Generate complex documents from CSVs (Form letters, HTML, JSON, XML files, etc.) with the powerful MiniJinja template engine (Example template).

  • ⭐ New lookup module for fetching reference data from remote and local files.
    In addition to the typical http/https schemes for remote files, qsv adds two additional schemes - CKAN:// and datHere://, fetching lookup data from a CKAN site or datHere maintained reference data respectively. The lookup module has simple file-based caching as well to minimize repeated fetching of typically static reference data (default cache age: 600 seconds).
    The lookup module is now being used by the luau (for its qsv_register_lookup helper) and validate (for its dynamicEnum custom JSON Schema keyword) commands. More commands will take advantage of this module over time (e.g. apply, geocode, template, sqlp, etc.) to do extended lookups (e.g. lookup Census information given spatiotemporal data - like demographic info of a Census tract).

  • ✨ Enhanced fetchpost with MiniJinja templating for payload construction.
    Previously, fetchpost was limited to posting url-encoded HTML Form data with content type application/x-www-form-urlencoded. Now with the new --payload-tpl and --content-type options, users can post request bodies rendered with MiniJinja and specify other content types (typically application/json, text/plain, multipart/form-data) as well.

  • ✨ Improved Polars integration with automatic schema detection
    The joinp and sqlp commands now use qsv's stats cache to automatically determine column data types, rather than having Polars scan a sample of rows. This provides two key benefits:

    1. Faster execution by skipping Polars' schema inference step
    2. GUARANTEED data type inferencing since the stats cache analyzes the entire dataset, not just a sample
  • 🏃 fast-float2 crate for faster float parsing
    Casting string/bytes to float is now much faster (2 to 8x faster than Rust's standard library) with fast-float2.

  • 💪 Major dependency updates including Polars 0.44.2, Luau 0.650, mlua 0.10.0 and jsonschema 0.26.1
    These core crates underpin qsv's advanced commands. Using the latest versions of these crates allows qsv to stay true to its goal of being the fastest and most comprehensive data-wrangling toolkit.
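
The simple file-based caching described for the lookup module above can be pictured roughly as follows. This is a hedged Python sketch of the general idea only: the function names, the cache file layout and the stand-in fetch function are all hypothetical, and only the 600-second default cache age comes from the notes.

```python
import os
import tempfile
import time

DEFAULT_CACHE_AGE_SECS = 600  # default cache age mentioned in the notes


def cached_lookup(cache_path, fetch, max_age=DEFAULT_CACHE_AGE_SECS):
    """Return the cached text if it is fresh enough, else fetch and store it."""
    if os.path.exists(cache_path):
        age = time.time() - os.path.getmtime(cache_path)
        if age <= max_age:
            with open(cache_path, encoding="utf-8") as f:
                return f.read()
    data = fetch()
    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(data)
    return data


calls = []


def fake_fetch():
    """Hypothetical stand-in for downloading a remote reference file."""
    calls.append(1)
    return "code,name\nUS,United States\n"


cache_file = os.path.join(tempfile.mkdtemp(), "lookup.csv")
first = cached_lookup(cache_file, fake_fetch)
second = cached_lookup(cache_file, fake_fetch)  # served from the cache file
```

The second call never reaches the fetch function, which is the point of caching reference data that rarely changes.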


Added

  • added lookup module - enabling fetching and caching of reference data from remote and local files #2262
  • fetchpost: add --payload-tpl <file> and --content-type options to construct payload using MiniJinja with the appropriate content-type #2268 5921498
  • joinp: derive polars schema from stats cache 86fe22e
  • sqlp: derive polars schema from stats cache #2256
  • template: new command to render MiniJinja templates with CSV data #2267
  • validate: add dynamicEnum lookup support #2265
  • contrib(completions): add template command and update fetchpost by @rzmk in #2269
  • add fast-float2 dependency for faster bytes to float conversion 7590e4e 3ca30aa
  • added more benchmarks for new/updated commands f8a1d4f cd7e480

Changed

  • luau: adapt to mlua 0.10 API changes 268cb45
  • luau: refactored stage management 31ef58a
  • luau: now uses the lookup module 2f4be34
  • stats: minor perf refactoring 6cdd6ea
  • build(deps): bump actions/setup-python from 5.2.0 to 5.3.0 by @dependabot in #2243
  • build(deps): bump azure/trusted-signing-action from 0.4.0 to 0.5.0 by @dependabot in #2239
  • build(deps): bump bytes from 1.7.2 to 1.8.0 by @dependabot in #2231
  • build(deps): bump cached from 0.53.1 to 0.54.0 by @dependabot in #2272
  • build(deps): bump flexi_logger from 0.29.3 to 0.29.4 by @dependabot in #2229
  • build(deps): bump flexi_logger from 0.29.4 to 0.29.5 by @dependabot in #2261
  • build(deps): bump flexi_logger from 0.29.5 to 0.29.6 by @dependabot in #2266
  • build(deps): bump hashbrown from 0.15.0 to 0.15.1 by @dependabot in #2270
  • build(deps): bump jsonschema from 0.24.0 to 0.24.1 by @dependabot in #2234
  • build(deps): bump jsonschema from 0.24.1 to 0.24.2 by @dependabot in #2238
  • build(deps): bump jsonschema from 0.24.2 to 0.24.3 by @dependabot in #2240
  • build(deps): bump jsonschema from 0.25.0 to 0.25.1 by @dependabot in #2244
  • build(deps): bump jsonschema from 0.26.0 to 0.26.1 by @dependabot in #2260
  • build(deps): bump regex from 1.11.0 to 1.11.1 by @dependabot in #2242
  • build(deps): bump reqwest from 0.12.8 to 0.12.9 by @dependabot in #2258
  • build(deps): bump serde from 1.0.210 to 1.0.211 by @dependabot in #2232
  • build(deps): bump serde from 1.0.211 to 1.0.213 by @dependabot in #2236
  • build(deps): bump serde from 1.0.213 to 1.0.214 by @dependabot in #2259
  • build(deps): bump simd-json from 0.14.1 to 0.14.2 by @dependabot in #2235
  • build(deps): bump tokio from 1.40.0 to 1.41.0 by @dependabot in #2237
  • deps: updated our fork of the csv crate with more perf optimizations eae7d76
  • deps: use calamine upstream with unreleased fixes 4cc7f37
  • deps: use our csvlens fork until PR removing unneeded arboard features is merged bb32322
  • deps: bump jsonschema from 0.25 to 0.26 #2251
  • deps: bump embedded Luau from 0.640 to 0.650 8c54b87 aca30b0
  • deps: bump mlua from 0.9 to 0.10 #2249
  • deps: bump Polars from 0.43.1 at py-1.11.0 tag to latest 0.44.2 upstream #2255 0e40a44
  • apply select clippy lint suggestions
  • updated indirect dependencies
  • aligned Rust nightly to Polars nightly - 2024-10-28 - 245bcb5

Fixed

Removed

  • removed need to set RAYON_NUM_THREADS env var and just call the Rayon API directly aa6ef89
  • removed unneeded create_dir_all_threadsafe helper now that std::create_dir_all is threadsafe d0af83b

Full Changelog: 0.137.0...0.138.0

0.137.0

21 Oct 03:57
75dbaba

Highlights:

  • extdedup & extsort now support two modes - LINE mode and CSV mode. Previously, both commands only sorted on a line-by-line basis (LINE mode).
    With the addition of CSV mode, you can now deduplicate or sort CSV files on a column-by-column basis, with the powerful --select option to specify which columns to deduplicate or sort on.
    This is especially useful for large CSV files with many columns, where you only want to deduplicate or sort on a subset of columns. And since both commands use disk-backed algorithms (an on-disk hash table for extdedup, and an external merge sort for extsort) - they can handle files larger than memory.
  • sqlp now has a --cache-schema option that caches the inferred schema of the input CSV file, which can significantly speed up subsequent queries on the same file, as the initial schema inferencing step is skipped.
  • fetch and fetchpost have been updated to use the jaq crate instead of the jql crate. This change was made to improve performance and to make the commands consistent with the json command which also uses jaq. Furthermore, jaq is a clone of jq - a widely used JSON parsing tool, so it should be more familiar to users.
  • stats is a tad faster as we keep squeezing more performance from this central command.
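
To make the difference between LINE mode and CSV mode concrete, here is a minimal Python sketch of deduplicating on a subset of columns. It is an illustration under stated assumptions, not qsv's code: the function name and sample data are hypothetical, and it uses an in-memory set where the notes describe an on-disk hash table.

```python
import csv
import io


def dedup_on_columns(csv_text, columns):
    """Keep the first row seen for each distinct combination of `columns`."""
    reader = csv.DictReader(io.StringIO(csv_text))
    seen = set()  # in-memory stand-in for a disk-backed hash table
    kept = []
    for row in reader:
        key = tuple(row[c] for c in columns)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept


# Hypothetical sample: rows 1 and 2 differ as whole lines,
# but are duplicates when only name and city are compared.
sample = "id,name,city\n1,Ann,Boston\n2,Ann,Boston\n3,Bob,Austin\n"
rows = dedup_on_columns(sample, ["name", "city"])
print([r["id"] for r in rows])  # ['1', '3']
```

A line-by-line comparison would keep all three rows here, since the id column makes every line unique.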

Added

  • extdedup: now supports two modes - LINE mode and CSV mode #2208
  • extsort: now also has two modes - CSV mode and LINE mode #2210
  • sqlp: add --cache-schema option #2224
  • added sqlp --cache-schema benchmarks

Changed

  • apply & applydp: use smallvec for operations vector & other minor performance optimizations #2219 & bc837ae
  • apply & applydp: specify min_length for parallel iterators 7d6ce5e
  • fetch & fetchpost: replace jql with jaq #2222
  • stats: performance optimizations f205809 e26c27f 4579c1b
  • validate: specify min_length for parallel iterators a5b8185
  • deps: updated polars to 0.43.1 at the py-1.10.0 tag.
  • build(deps): bump calamine from 0.26.0 to 0.26.1 by @dependabot in #2204
  • build(deps): bump csvs_convert from 0.8.14 to 0.9.0 by @dependabot in #2215
  • build(deps): bump flexi_logger from 0.29.2 to 0.29.3 by @dependabot in #2209
  • build(deps): bump jsonschema from 0.23.0 to 0.24.0 by @dependabot in #2223
  • build(deps): bump pyo3 from 0.22.3 to 0.22.4 by @dependabot in #2207
  • build(deps): bump pyo3 from 0.22.4 to 0.22.5 by @dependabot in #2212
  • build(deps): bump redis from 0.27.3 to 0.27.4 by @dependabot in #2202
  • build(deps): bump redis from 0.27.4 to 0.27.5 by @dependabot in #2217
  • build(deps): bump serde_json from 1.0.129 to 1.0.130 by @dependabot in #2218
  • build(deps): bump serde_json from 1.0.131 to 1.0.132 by @dependabot in #2220
  • build(deps): bump uuid from 1.10.0 to 1.11.0 by @dependabot in #2213
  • apply select clippy lints
  • bumped indirect dependencies
  • bumped MSRV to 1.82

Fixed:

  • fix performance regression in batched commands by refactoring optimal_batch_size to require indexed CSV files #2206

Removed:

  • fetch & fetchpost: removed jql options; replaced with jaq #2222

Full Changelog: 0.136.0...0.137.0

0.136.0

08 Oct 19:41
82b7611

🎉 qsv pro is now available in the Microsoft Store! 🎉

It's Data Wrangling Democratized on the Desktop, featuring:

  • 📊 Familiar Spreadsheet Interface
    tap the power of qsv to query, analyze, enrich, scrub and transform huge Excel files and multi-gigabyte CSV files in seconds, without having to deal with the command-line.
  • CKAN desktop client
    designed to make data publishing easier for portal operators and data stewards using the CKAN platform.
  • 📥 Flow
    allows you to build custom node-based flows and data pipelines using a visual interface.
  • 🔧 Toolbox
    features an ever-expanding library of reusable scripts for common data-wrangling use cases.
  • ⭐ and more!
    Natural Language Interface (RAG), Polars SQL query support, an API, Python/Luau support, automatic Data Dictionaries, DCAT 3 metadata profile inferencing, along with a retinue of other cloud-based services (e.g. customizable street-level geocoding, data feeds, reference data lookups, geo-ip lookups, cloud storage support, .qsv file format, etc.) that will be unveiled in future versions.

Like qsv, we're iterating rapidly with qsv pro, so your feedback is essential. Give it a try!

Get it from https://qsvpro.dathere.com or the Microsoft Store.

Other highlights:

  • excel: new --table option for XLSX files; new --header-row option; expanded --range option, adding support for Named Ranges and absolute ranges (e.g. Sheet2!$A$1:$J$10); and expanded metadata export now including Named Ranges and Tables (for XLSX files)
  • Improved performance for several commands (apply, datefmt, tojsonl and validate) through automatic batch size optimization
  • validate: dynamicEnum custom JSON Schema keyword in validate command (renamed from dynenum) and enhanced email validation
  • schema: automatic JSON Schema const inferencing for columns with just one value
  • Significant dependency updates, including latest upstream versions of Polars, jsonschema, and serde_json with unreleased performance upgrades, new features and fixes
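
The const inferencing mentioned above boils down to a simple rule: if a column only ever holds one value, the schema can pin it with the JSON Schema const keyword. A minimal Python sketch of that rule follows; the function name and sample values are hypothetical, and qsv's actual schema command does considerably more than this.

```python
def infer_const(values):
    """Return a JSON Schema fragment with `const` when a column has one distinct value."""
    distinct = set(values)
    if len(distinct) == 1:
        return {"const": next(iter(distinct))}
    return {}  # more than one value (or none): no const can be inferred


fragment_single = infer_const(["USD", "USD", "USD"])
fragment_mixed = infer_const(["USD", "EUR"])
print(fragment_single)  # {'const': 'USD'}
print(fragment_mixed)   # {}
```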

NOTE: You can see qsv & qsv pro in action in our "The Problem with Data Portals" webinar Wed, Oct 23, 2024. 1-2pm EDT


Added

  • 🎉 qsv pro is now in the Microsoft Store!!! 🎉
  • apply, datefmt, tojsonl, validate: added logic to automatically determine optimal batch size for better parallelization #2178
  • enum: added --new-column support for all enum modes, not just --increment #2173
  • excel: new --table option for XLSX files #2194
  • excel: new --header-row option 458f79a
  • excel: expanded range and metadata options #2195
  • schema: added JSON Schema automatic const inferencing #2180
  • Add signing step to qsv MSI installer GitHub Action by @rzmk in #2182
  • contrib(completions): add --table option to qsv excel by @rzmk in #2197
  • completions: add --header-row option to qsv excel e8794d5
  • added new apply operations sentiment benchmark b745e64
  • docs: added indexing section to PERFORMANCE.md 804145a

Changed

Fixed

  • schema: fix enum so it only adds a list when the number of unique values > --enum-threshold #2180
  • Upload artifact fix for Debian package publishing by @tino097 in #2168
  • fixed typos configuration 627de89
  • fixed various GitHub Actions publishing workflow issues

Full Changelog: 0.135.0...0.136.0

0.135.0

24 Sep 12:46

Highlights

JSON Schema validation just got a whole lot more powerful with the introduction of qsv's custom dynenum keyword!
With dynenum, you can now dynamically lookup valid enum values from a CSV (on the filesystem or on a URL), allowing for more flexible and responsive data validation.

Unlike the standard enum keyword, dynenum does not require hardcoding valid values at schema definition time, and can be used to validate data against a changing set of valid values.
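
The idea can be sketched in a few lines of Python: load the allowed values from a CSV when validation runs, then check each value against that set. This is only an illustration; the function names, the sample CSV and the choice of the first column as the source of valid values are assumptions of the sketch, not a description of how qsv implements the keyword.

```python
import csv
import io


def load_valid_values(csv_text, column=0):
    """Read the set of allowed values from one column of a CSV, skipping the header."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row
    return {row[column] for row in reader if row}


def validate_dynamic_enum(value, valid_values):
    """True when `value` is in the set loaded at validation time."""
    return value in valid_values


# Hypothetical reference CSV; in practice this would be read from a file or URL.
reference_csv = "fruit\napple\nbanana\n"
allowed = load_valid_values(reference_csv)

print(validate_dynamic_enum("apple", allowed))   # True
print(validate_dynamic_enum("cherry", allowed))  # False
```

Because the set is rebuilt from the CSV on each run, updating the reference file changes what passes validation without touching the schema.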

For an example, see #1872 (reply in thread).

In an upcoming qsv pro release, we're planning on making dynenum even more powerful by allowing you to easily specify high-value reference data (e.g. US Census data, World Bank data, data.gov, etc.) that is maintained at data.dathere.com and other CKAN instances.

This release also adds the custom currency JSON Schema format, which enables currency validation according to the ISO 4217 standard.

The Polars engine was also upgraded to 0.43.1 at the py-1.81.1 tag - making for various under-the-hood improvements for the sqlp, joinp and count commands, as we set the stage for more Polars-powered features in future releases.


Added

  • foreach: enabled foreach command on Windows prebuilt binaries def9c8f
  • lens: added support for QSV_SNIFF_DELIMITER env var and snappy auto-decompression 8340e89
  • sample: add --max-size option e845a3c
  • validate: added dynenum custom JSON Schema keyword for dynamic validation lookups #2166
  • tests: add tests for https://100.dathere.com/lessons/2 by @rzmk in #2141
  • added stats_sorted and frequency_sorted benchmarks
  • added validate_dynenum benchmarks

Changed

  • json: add error for empty key and update usage text by @rzmk in #2167
  • prompt: gate prompt command behind prompt feature #2163
  • validate: expanded currency JSON Schema custom format to support ISO 4217 currency codes and alternate formats 5202508
  • validate: migrate to new jsonschema crate api 5d65054
  • Update ubuntu version for deb package by @tino097 in #2126
  • contrib(completions): update completions for qsv v0.134.0 and fix subcommand options by @rzmk in #2135
  • contrib(completions): add --max-size completion for sample by @rzmk in #2142
  • deps: bump to polars 0.43.1 at py-1.81.1 #2130
  • deps: switch back to calamine upstream instead of our fork 677458f
  • build(deps): bump actix-governor from 0.5.0 to 0.6.0 by @dependabot in #2146
  • build(deps): bump anyhow from 1.0.87 to 1.0.88 by @dependabot in #2132
  • build(deps): bump arboard from 3.4.0 to 3.4.1 by @dependabot in #2137
  • build(deps): bump bytes from 1.7.1 to 1.7.2 by @dependabot in #2148
  • build(deps): bump geosuggest-core from 0.6.3 to 0.6.4 by @dependabot in #2153
  • build(deps): bump geosuggest-utils from 0.6.3 to 0.6.4 by @dependabot in #2154
  • build(deps): bump jql-runner from 7.1.13 to 7.2.0 by @dependabot in #2165
  • build(deps): bump jsonschema from 0.18.1 to 0.18.2 by @dependabot in #2127
  • build(deps): bump jsonschema from 0.18.2 to 0.18.3 by @dependabot in #2134
  • build(deps): bump jsonschema from 0.18.3 to 0.19.1 by @dependabot in #2144
  • build(deps): bump jsonschema from 0.19.1 to 0.20.0 by @dependabot in #2152
  • build(deps): bump pyo3 from 0.22.2 to 0.22.3 by @dependabot in #2143
  • build(deps): bump rfd from 0.14.1 to 0.15.0 by @dependabot in #2151
  • build(deps): bump simple-expand-tilde from 0.4.0 to 0.4.2 by @dependabot in #2129
  • build(deps): bump qsv_currency from 0.6.0 to 0.7.0 by @dependabot in #2159
  • build(deps): bump qsv_docopt from 1.7.0 to 1.8.0 by @dependabot in #2136
  • build(deps): bump redis from 0.26.1 to 0.27.0 by @dependabot in #2133
  • build(deps): bump simdutf8 from 0.1.4 to 0.1.5 by @dependabot in #2164
  • bump indirect dependencies
  • apply select clippy lint suggestions
  • several usage text/documentation improvements
  • bump MSRV to 1.81.0

Fixed

Removed

  • removed prompt command from qsvlite #2163
  • publish: remove lens feature from i686 targets as it does not compile 959ca76
  • deps: remove anyhow dependency #2150

Full Changelog: 0.134.0...0.135.0

0.134.0

10 Sep 12:11

qsv pro v1 is here! 🎉

If you've been using qsv for a while, even if you're a command-line ninja, you'll find a lot of new capabilities in qsv pro that can make your data wrangling experience even better!

Apart from making qsv easier to use, qsv pro has a multitude of features including: view interactive data tables; browse stats/frequency/metadata; run recipes and tools (scripts); run Polars SQL queries; use Natural Language queries (using Retrieval Augmented Generation (RAG) techniques); regular expression search; export to multiple file formats; download/upload from/to compatible CKAN instances; design custom node-based flows and data pipelines; interact with a local API from external programs including the qsv pro command; run various qsv commands in a graphical user interface; and the list goes on!

And that's just the beginning, there's more to come! You just have to try it!

Download qsv pro v1 now at qsvpro.dathere.com.

Other highlights include:

  • pro: new command to allow qsv to interact with the qsv pro API to tap into qsv pro exclusive features.
  • lens: new command to interactively view CSVs using the csvlens crate.
  • The ludicrously fast diff command is now easier to use with its --drop-equal-fields option. @janriemer continues to work on his csv-diff crate, and there are more diff UX improvements coming soon!
  • stats adds sum_length and avg_length "streaming" statistics in addition to the existing min_length and max_length metrics. These are especially useful for datasets with a lot of "free text" columns.
  • stats also got "smarter" and "faster" by dog-fooding its own statistics to make it run faster!
    It's a little complicated, but the way stats works is that it first compiles the "streaming" statistics on the fly as it loads the data, multiplexed across several threads, and the more expensive advanced statistics are "lazily" computed at the end.
    Since we now compile "sort order" in a streaming manner, we use this info when deriving cardinality at the end to see if we can skip sorting - an otherwise necessary step to get cardinality, which is done by "scanning" all the sorted values of a column. Every time two neighboring values differ in a sorted column, it increments the cardinality count.
    Apart from this "sort order" optimization, we also improved the "cardinality scan" algorithm - halving its memory footprint and making it faster still for larger datasets by parallelizing the computation. This, in turn, makes the frequency command faster and more memory efficient.
    It's performance tweaks like these that, despite the addition of six metrics (is_ascii, sort_order, sum_length, avg_length, sem - standard error of the mean & cv - coefficient of variation) in recent releases, let stats still compile 35 statistics and do GUARANTEED data type inferences on a million-row, 41-column, 520 MB sample of NYC's 311 data in 1.327 seconds (753,580 records per second)! [1]
  • we now also use our own fork of the csv crate, featuring SIMD-accelerated UTF-8 validation and other minor perf tweaks, making the entire qsv suite faster still!
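
The "cardinality scan" and the "sort order" shortcut described above can be sketched in Python as follows. This is a simplified, sequential illustration with hypothetical function names. It assumes ascending order only and checks sortedness in a separate pass, whereas the notes describe sort order being compiled while streaming and a parallel algorithm for larger datasets.

```python
def is_sorted(values):
    """True when values are already in non-decreasing order."""
    return all(a <= b for a, b in zip(values, values[1:]))


def cardinality(values):
    """Count distinct values by comparing neighbors in sorted order."""
    if not values:
        return 0
    # Skip the sort when the column is already known to be sorted.
    ordered = values if is_sorted(values) else sorted(values)
    count = 1  # the first value is always distinct
    for prev, curr in zip(ordered, ordered[1:]):
        if prev != curr:
            count += 1  # neighbors differ: a new distinct value starts here
    return count


print(cardinality(["b", "a", "b", "c"]))  # 3
print(cardinality(["a", "a", "b"]))       # 2, no sort needed
```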

Added

  • pro: add qsv pro command to interact with qsv pro API by @rzmk in #2039
  • lens: new command to interactively view CSVs using the csvlens crate #2117
  • apply: add crc32 operation #2121
  • count: add --delimiter option #2120
  • diff: add flag --drop-equal-fields by @janriemer in #2114
  • stats: add sum_length and avg_length columns #2113
  • stats: smarter cardinality computation - added new parallel algorithm for large datasets (10,000+ rows) and updated sequential algorithm for smaller datasets 4e63fec

Changed

  • count: added comment to justify magic number 5241e39
  • stats: use simdjson for faster JSONL parsing; micro-optimize compute hot loop 0e8b734
  • stats: standardized OVERFLOW and UNDERFLOW messages 38c6128
  • sort: renamed symbol to eliminate devskim lint false positive warning 12db739
  • enable lens feature in GH workflows #2122
  • deps: bump polars 0.42.0 to latest upstream at time of release 3c17ed1
  • deps: use our own optimized fork of csv crate, with simdutf8 validation and other minor perf tweaks e4bcd71
  • build(deps): bump serde from 1.0.209 to 1.0.210 by @dependabot in #2111
  • build(deps): bump serde_json from 1.0.127 to 1.0.128 by @dependabot in #2106
  • build(deps): bump qsv-stats from 0.19.0 to 0.22.0 #2107 #2112 cb1eb60
  • apply select clippy lint suggestions
  • updated several indirect dependencies
  • made various doc and usage text improvements

Fixed

  • schema: Print an error if the qsv stats invocation fails by @abrauchli in #2110

New Contributors

Full Changelog: 0.133.1...0.134.0

  1. see stats_everything_index benchmark

0.133.1

03 Sep 19:04

Highlights

[Image: qsv-polars-0 133 0-relnotes] [1]
This release doubles down on Polars' capabilities, as we now, as a matter of policy, track the latest Polars upstream. If you think qsv has a torrid release schedule, you should see Polars. They're constantly fixing bugs, adding new features and optimizations!
To keep up, we've added Polars revision info to the --version output, and the --envlist option now includes Polars relevant env vars. We've also added support for the POLARS_BACKTRACE_IN_ERR env var to control whether Polars backtraces are included in error messages.
We also removed the to parquet subcommand as it's redundant with the Polars-powered sqlp's ability to create parquet files. This removes the HUGE duckdb dependency, which should make compile times markedly shorter and binaries smaller.

Other highlights include:

  • New edit command that allows you to edit CSV files.
  • The count command's --width option now includes record width stats beyond max length (avg, median, min, variance, stddev & MAD).
  • The fixlengths command now has --quote and --escape options.
  • The stats command adds a sort_order streaming statistic.

NOTE: 0.133.0 was skipped because of a dev dependency conflict with the csvs_convert crate, preventing us from publishing 0.133.0 to crates.io. This has been resolved in 0.133.1.


Added

  • count: expanded --width options, adding record width stats beyond max length (avg, median, min, variance, stddev & MAD). Also added --json output when using --width #2099
  • edit: add qsv edit command by @rzmk in #2074
  • fixlengths: added --quote and --escape options #2104
  • stats: add sort_order streaming statistic #2101
  • polars: add polars revision info to --version output e60e44f
  • polars: added Polars relevant env vars to --envlist option 0ad68fe
  • polars: add & document POLARS_BACKTRACE_IN_ERR env var f9cc559
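A streaming sort_order statistic can be derived in a single pass by comparing each value only with its predecessor. The Python sketch below shows the idea; the function name and the labels "Ascending", "Descending" and "Unsorted" are assumptions for illustration and may not match qsv's actual output.

```python
def sort_order(values):
    """Classify a stream as 'Ascending', 'Descending' or 'Unsorted' in one pass."""
    ascending = True
    descending = True
    previous = None
    first = True
    for current in values:
        if not first:
            if current < previous:
                ascending = False  # saw a decrease
            if current > previous:
                descending = False  # saw an increase
            if not ascending and not descending:
                return "Unsorted"  # both orders ruled out: stop early
        previous = current
        first = False
    # Empty, single-value and all-equal streams are reported as ascending here.
    return "Ascending" if ascending else "Descending"


print(sort_order([1, 2, 2, 5]))  # Ascending
print(sort_order([9, 7, 7, 1]))  # Descending
print(sort_order([1, 3, 2]))     # Unsorted
```

Only the previous value and two booleans are kept, so memory use is constant regardless of column size.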

Changed

  • Optimize polars optflags #2089
  • deps: bump polars 0.42.0 to latest upstream at time of release 3b7af51
  • bump polars to latest upstream, removing smartstring #2091
  • build(deps): bump actions/setup-python from 5.1.1 to 5.2.0 by @dependabot in #2094
  • build(deps): bump flate2 from 1.0.32 to 1.0.33 by @dependabot in #2085
  • build(deps): bump flexi_logger from 0.28.5 to 0.29.0 by @dependabot in #2086
  • build(deps): bump indexmap from 2.4.0 to 2.5.0 by @dependabot in #2096
  • build(deps): bump jsonschema from 0.18.0 to 0.18.1 by @dependabot in #2084
  • build(deps): bump serde from 1.0.208 to 1.0.209 by @dependabot in #2082
  • build(deps): bump serde_json from 1.0.125 to 1.0.127 by @dependabot in #2079
  • build(deps): bump sysinfo from 0.31.2 to 0.31.3 by @dependabot in #2077
  • build(deps): bump qsv-stats from 0.18.0 to 0.19.0 by @dependabot in #2100
  • build(deps): bump tokio from 1.39.3 to 1.40.0 by @dependabot in #2095
  • apply select clippy lint suggestions
  • updated several indirect dependencies
  • made various doc and usage text improvements
  • pin Rust nightly to 2024-08-26 from 2024-07-26, aligning with Polars pinned nightly

Fixed

  • Ensure portable binaries are "added" to the publish zip archive, instead of replacing all the binaries with just the portable version. Fixes #2083. 34ad206

Removed

  • removed to parquet subcommand as it's redundant with sqlp's ability to create parquet files. This also removes the HUGE duckdb dependency, which should make compile times markedly shorter and binaries much smaller #2088
  • removed smartstring dependency now that Polars has its own compact inlined string type 47f047e
  • removed to parquet benchmark

Full Changelog: 0.132.0...0.133.1

  1. ChatGPT prompt: Using the logos for the Polars project and the qsv project as a baseline, can you create a version with the cowboy riding a polar bear instead?

0.132.0

21 Aug 10:34

Highlights

With this release, we finally finish the stats caching refactor started in 0.131.0, replacing the binary encoded stats cache with a simpler JSONL cache. The stats cache stores the necessary statistical metadata to make several key commands smarter & faster. Per the benchmarks:

  • frequency is 6x faster (frequency_index_stats_mode_auto).
    Not only is it faster, it also no longer needs to compile a hashmap for columns with ALL unique values (e.g. ID columns). In practice, this lets it handle "real-world" datasets of any size, unless every column has ALL unique values, in which case the entire CSV will have to fit into memory.
  • tojsonl is 2.67x faster (tojsonl_index)
  • schema is two orders of magnitude (100x) faster!!! (schema_index)
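The frequency speedup comes from consulting cached statistics before doing any counting. The Python sketch below illustrates the idea only: the `frequency` function, its `row_count` and `cardinality` parameters, and the `<ALL_UNIQUE>` placeholder are hypothetical stand-ins for whatever qsv reads from its stats cache and emits.

```python
from collections import Counter


def frequency(values, row_count=None, cardinality=None):
    """Build a frequency table, skipping the hashmap for all-unique columns."""
    if cardinality is not None and cardinality == row_count:
        # Cached stats say every value is distinct, so each has a count of 1.
        # Return a single placeholder row without reading the values at all.
        return [("<ALL_UNIQUE>", row_count)]
    # Otherwise, count occurrences the usual way.
    return Counter(values).most_common()


ids = ["1001", "1002", "1003"]
print(frequency(ids, row_count=3, cardinality=3))
print(frequency(["NY", "NJ", "NY"], row_count=3, cardinality=2))
```

For an ID column with millions of rows, the first branch avoids allocating a hashmap entry per row, which is where the memory saving comes from.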

The stats cache also provides the foundation for even more "smart" features and commands in the future. It also has the side-benefit of adding a way to produce stats in JSONL format that can be used for other purposes beyond qsv.

The search, searchset, and replace commands now also have a --literal option that allows you to search for and replace strings with regex special/reserved characters. This makes it easier to search for and replace strings that contain otherwise reserved regex characters without having to escape them (especially useful with URL columns that often contain characters like ?, :, -, ., etc.).
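The difference between regex and literal matching is easy to see with a URL. The Python sketch below uses `re.escape` to mimic what a literal mode does. The `search` helper and its `literal` parameter are hypothetical and only illustrate the concept, not qsv's implementation.

```python
import re


def search(rows, pattern, literal=False):
    """Return rows matching the pattern, optionally treating it as plain text."""
    if literal:
        # Escape regex metacharacters so the pattern is matched verbatim.
        pattern = re.escape(pattern)
    regex = re.compile(pattern)
    return [row for row in rows if regex.search(row)]


urls = ["https://example.com/a?b=1", "https://example.com/aXb=1"]
print(search(urls, "a?b=1"))                # both rows: "?" makes "a" optional
print(search(urls, "a?b=1", literal=True))  # only the first row
```

Without escaping, the `?` is read as a quantifier, so the pattern matches any row containing `b=1`. That is rarely what is intended when searching a URL column.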


Added

  • search, searchset & replace: add --literal option #2060 & 7196053
  • slice: added usage text examples 04afaa3
  • publish: added workflow to build "portable" binaries with CPU features disabled
  • contrib(completions): add --literal for search and searchset by @rzmk in #2061
  • contrib(completions): add --literal completion to replace by @rzmk in #2062
  • add more polars metadata in --version info #2073
  • docs: added more info to SECURITY.md 609d4df
  • docs: expanded Goals/Non-Goals 54998e3
  • docs: added Installation "Option 0" quick start bf5bf82
  • added search --literal benchmark

Changed

  • stats, schema, frequency & tojsonl: stats caching refactor, replacing binary encoded stats cache with a simpler JSONL cache #2055
  • rename stats --stats-json option to stats --stats-jsonl #2063
  • changed "broken pipe" error to a warning 7353275
  • docs: update multithreading and caching sections of PERFORMANCE.md 5e6bc45
  • deps: switch to our qsv-optimized fork of csv crate 3fc1e82
  • deps: bump polars from 0.41.3 to 0.42.0 #2051
  • build(deps): bump actix-web from 4.8.0 to 4.9.0 by @dependabot in #2041
  • build(deps): bump flate2 from 1.0.31 to 1.0.32 by @dependabot in #2071
  • build(deps): bump indexmap from 2.3.0 to 2.4.0 by @dependabot in #2049
  • build(deps): bump reqwest from 0.12.6 to 0.12.7 by @dependabot in #2070
  • build(deps): bump rust_decimal from 1.35.0 to 1.36.0 by @dependabot in #2068
  • build(deps): bump serde from 1.0.205 to 1.0.206 by @dependabot in #2043
  • build(deps): bump serde from 1.0.206 to 1.0.207 by @dependabot in #2047
  • build(deps): bump serde from 1.0.207 to 1.0.208 by @dependabot in #2054
  • build(deps): bump serde_json from 1.0.122 to 1.0.124 by @dependabot in #2045
  • build(deps): bump serde_json from 1.0.124 to 1.0.125 by @dependabot in #2052
  • apply select clippy lint suggestions
  • updated several indirect dependencies
  • made various usage text improvements

Fixed

  • stats: fix --output delimiter inferencing based on file extension #2065
  • make process_input helper handle stdin better #2058
  • docs: fix completions for --stats-jsonl and qsv pro installation text update by @rzmk in #2072
  • docs: added Note about why luau feature is disabled in musl binaries - ffa2bc5 & 27d0f8e

Removed

  • Removed bincode dependency now that we're using JSONL stats cache #2055 babd92b

Full Changelog: 0.131.1...0.132.0

0.131.1

09 Aug 14:44

Changed

  • deps: bump polars to latest upstream post py-1.41.1 release at the time of this release
  • build(deps): bump filetime from 0.2.23 to 0.2.24 by @dependabot in #2038

Fixed

  • frequency: change --stats-mode default to none from auto.
    This is because of a big performance regression when using --stats-mode auto on datasets with columns with ALL unique values.
    See #2040 for more info.

Full Changelog: 0.131.0...0.131.1