📊 Enhanced Frequency Analysis 📊

This is a quick release adding several frequency enhancements for more detailed frequency analysis. The frequency command now includes a percentage column, calculates other values, and supports limiting unique counts and negative limits.
These options provide additional context for Datapusher+, qsv-pro and describegpt so their metadata inferences are more accurate and comprehensive.

Previously, for a 775-row CSV file containing one column named state with entries for all 50 states, frequency only showed¹:

qsv frequency freq_state_example.csv | qsv table
field  value  count
state  NY     100
state  NJ     70
state  CA     60
state  MA     55
state  FL     45
state  TX     43
state  NM     40
state  AZ     39
state  NV     38
state  MI     35

Now the output adds a percentage column and an "Other" row that aggregates the values beyond the limit; both have configurable options:

qsv frequency freq_state_example.csv | qsv table
field  value       count  percentage
state  NY          100    12.90323
state  NJ          70     9.03226
state  CA          60     7.74194
state  MA          55     7.09677
state  FL          45     5.80645
state  TX          43     5.54839
state  NM          40     5.16129
state  AZ          39     5.03226
state  NV          38     4.90323
state  MI          35     4.51613
state  Other (40)  250    32.25806
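
The arithmetic behind the new columns can be sketched in a few lines of Python. This is only an illustration of the calculation shown in the table above, not qsv's actual implementation (qsv is written in Rust). The function name `frequency_table`, its parameters, the alphabetical tie-breaking and the filler values `S01`..`S40` are all assumptions made for the example.

```python
from collections import Counter


def frequency_table(values, limit=10, other_label="Other"):
    """Hypothetical helper: top `limit` values with counts and percentages,
    plus one aggregated row for everything beyond the limit."""
    counts = Counter(values)
    total = sum(counts.values())
    # Sort by count descending; ties broken alphabetically (an assumption).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    top, rest = ranked[:limit], ranked[limit:]
    rows = [(value, count, round(100 * count / total, 5)) for value, count in top]
    if rest:
        other_count = sum(count for _, count in rest)
        # The label carries the number of unique values that were folded in.
        rows.append((f"{other_label} ({len(rest)})", other_count,
                     round(100 * other_count / total, 5)))
    return rows


# Rebuild a 775-row column matching the example: the ten counts shown above,
# plus 40 made-up filler values (S01..S40) whose counts sum to 250.
top_counts = {"NY": 100, "NJ": 70, "CA": 60, "MA": 55, "FL": 45,
              "TX": 43, "NM": 40, "AZ": 39, "NV": 38, "MI": 35}
filler_counts = {f"S{i:02d}": (7 if i <= 10 else 6) for i in range(1, 41)}

values = []
for name, count in {**top_counts, **filler_counts}.items():
    values.extend([name] * count)

rows = frequency_table(values)
print(rows[0])   # ('NY', 100, 12.90323)
print(rows[-1])  # ('Other (40)', 250, 32.25806)
```

Each percentage is the count divided by the total row count (775 here), so the ten listed rows plus the Other row add up to 100%, allowing for rounding.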

This release is also out of cycle to address a big performance regression in the excel command caused by unnecessary formula info retrieval for the --error-format option introduced in 0.126.0. This has been fixed, and the excel command is now back to its speedy self.

Added

frequency: added percentage column and other values calculation, implementing #1774 and #1775
benchmarks: added new frequency and excel benchmarks b83ad3a

Changed

contrib(bashly): update completions.bash for qsv v0.126.0 by @rzmk in #1771
build(deps): bump mimalloc from 0.1.39 to 0.1.41 by @dependabot in #1772
build(deps): bump qsv-stats from 0.14.0 to 0.15.0 by @dependabot in #1773
updated several indirect dependencies
applied select clippy recommendations

Fixed

excel: fixed performance regression because qsv was unnecessarily getting formula info (an expensive operation) for --error-format option even when not required 772af34
renamed 0.126.0 sqlp_vs_duckdb benchmark results so they're next to each other for easy direct comparison. 7bcd59e.
Per the benchmarks, sqlp is 2.87 times faster than duckdb v0.10.2 for a simple aggregation (0.066 secs vs 0.19 secs), and 1.42 times faster for an "expensive" aggregation (0.143 secs vs 0.203 secs).

Full Changelog: 0.126.0...0.127.0

¹ With its default --limit setting of 10, frequency only shows the top 10 unique values in the column, sorted by occurrence.
