- Add function `bind_rows_dt`, so as to facilitate row binding of data.frames with the same names but different data types.
- Revise the `col_max` and `col_min` functions according to the issue raised in #26.
- Import data.table (>= 1.15.0) to ensure that `%notin%` can be used.
- Depend on R (>= 4.0.0).
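For readers unfamiliar with the operator, `%notin%` is the negation of `%in%`. A minimal sketch, assuming data.table (>= 1.15.0) is installed; the vectors are made up for illustration:

```r
library(data.table)  # exports %notin% from 1.15.0 on, per the note above

x <- c("a", "d")             # illustrative values
lookup <- c("a", "b", "c")   # illustrative lookup set

# TRUE where an element of x is absent from lookup
absent <- x %notin% lookup
print(absent)

# Base R spelling of the same test, for comparison
same_as_base <- identical(absent, !(x %in% lookup))
print(same_as_base)
```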
- Add `round0` function to ensure that rounded output keeps its trailing zeros.
- Remove `%notin%` from tidyfst and export it directly from data.table.
- Add `import_fst_chunked` to process fst files by chunks.
- Merge the recent pull request (see #25).
- Change the .onAttach message to help users access the citation info.
- Update function `dummy_dt`, referring to `fastdummies::dummy_col`, to make it faster.
- Add `maxth` and `minth` to get the nth highest/lowest value of a vector.
- Use `bibentry` for citation info.
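The idea behind "nth highest/lowest value" can be sketched in base R. The helper names `nth_highest` and `nth_lowest` below are hypothetical and are not the `maxth`/`minth` implementation; how ties and NAs are treated here is an assumption of this sketch:

```r
# Hypothetical helpers: sort, then pick the nth position.
# sort() drops NAs by default, and tied values each occupy their own position.
nth_highest <- function(x, n) sort(x, decreasing = TRUE)[n]
nth_lowest  <- function(x, n) sort(x)[n]

v <- c(5, 1, 9, 3, 7)
second_highest <- nth_highest(v, 2)
second_lowest  <- nth_lowest(v, 2)
print(c(second_highest, second_lowest))
```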
1. Fix error noted by CRAN.
2. Remove the `%notin%` function, as data.table will provide it later.
1. A request has been suggested and implemented, see #21.
2. Expired URLs have been removed from README.md.
- The previous `select_dt` could not handle a special case: when selecting multiple columns (say, more than 8), it tended to throw an error. This bug is now fixed.
- `slice_max_dt` could not handle the date type when using the minus symbol ("-"); this has been fixed in this version.
- Make `filter_dt` more robust by using `eval.parent` to evaluate it.
- Remove `top_n_dt`, `top_prop_dt` and `top_dt`. These functions are considered deprecated.
- Fix bugs in `slice_max_dt` and `slice_min_dt`; they could not perform group filtering by proportion correctly in the previous version.
It seems some issues are urgent (#19), so I have to make the revision immediately. Apologies for the inconvenience.
- Fix bug to make `pst` work.
- Use `function` instead of `\` to avoid a cross-platform consistency bug, as stated in #18.
- Export `setnames` from data.table.
- Export `data.table::setDT` and `data.table::%chin%` for usage in `tidyfst`.
- Introduce functions `pkg_load` and `pkg_unload`, which work like `p_load` and `p_unload` in package `pacman`.
- Make `dummy_dt` robust when there are NAs in the column. Refer to #15.
- Add the `%notin%` function.
- Add new name `pst` for function `sys_time_print` for convenience.
- Make a fix in `complete_dt`, making it more robust.
- Solve the issue mentioned in #13.
- Update `sys_time_print` function to make time printing more user-friendly.
- Fix a bug in `sql_join_dt`, so that anti join and semi join work.
- Update `pairwise_count` function to be faster when possible.
- Update ORCID number.
1. Set `options("datatable.print.trunc.cols" = TRUE)`, so that printing works like tibbles in dplyr.
2. Make functions in tidyfst usable inside other functions. For details, see https://stackoverflow.com/questions/69098157/how-to-past-parameters-in-r-functions-using-substitute-and-eval-to-make-data. Some functions have replaced the previous `eval` with `eval.parent`.
3. Export `%like%` from data.table.
4. Add function `sql_join_dt` to implement case-insensitive joining for data.frames.
5. Add functions `percent` and `add_prop` to calculate percentages conveniently.
6. Add function `pairwise_count_dt` to count pairs of items within a group.
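Why `eval.parent` matters when a function built on non-standard evaluation is called from inside another function can be sketched in base R. `filter_sketch` and `keep_above` are hypothetical names used only for illustration; this is not the tidyfst source:

```r
# Hypothetical helper: rebuild the call from the caller's own expressions,
# then evaluate it in the caller's frame rather than in this function's frame.
filter_sketch <- function(df, cond) {
  cl <- substitute(subset(df, cond))  # becomes subset(tbl, mpg > threshold) below
  eval.parent(cl)
}

# A user-defined wrapper that passes its own arguments through.
keep_above <- function(tbl, threshold) {
  filter_sketch(tbl, mpg > threshold)
}

res <- keep_above(mtcars, 25)
print(nrow(res))
```

Evaluating `cl` with a plain `eval()` inside `filter_sketch` would look for `tbl` and `threshold` in the wrong frame, since both exist only in the wrapper's frame.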
Date:20210908
- Add "fromLast" parameter to `distinct_dt`.
- Add new functions named `col_max` and `col_min` to get the max/min column name.
- Upgrade `dummy_dt` to be faster.
Date:20200901
- Do not truncate the columns by default.
- Add `print_options` to control global printing of data.table.
- Add citation in the package, linking to the JOSS paper (https://doi.org/10.21105/joss.02388).
- Add `rec_num` and `rec_char` functions for variable recoding.
- Get a cheat sheet for tidyfst.
- Export `between` from data.table.
- Support summarisation of multiple functions on multiple columns in `summarise_vars`.
Date:20200801
- Add `rename_with_dt`, like dplyr's `rename_with`.
- Update `slice_dt` to support `.N`.
- Update vignette "english_tutorial" to remove the outdated code.
- Improve `count_dt` by using `select_dt` inside.
- Correct error in example of `impute_dt` for user-defined functions.
- Export `rleid` and `rleidv` from data.table.
- Add ".name" parameter to `nest_dt` and `squeeze_dt`.
- Debug `slice_max_dt` and `slice_min_dt`.
- Give the slice* family a "by" parameter to slice by group.
- Debug `select_dt`.
- Update the vignette of English tutorial.
- Update `filter_dt` so that it no longer supports comma as "&".
- Use testthat package to implement unit tests for tidyfst.
- Give sample functions a "by" parameter to sample by group.
- Correct errors in the English tutorial.
- Import data.table v1.13.0 and use its new features.
Date:20200528
- Update `separate_dt` to accept `NA` in parameter "into".
- Add a new collection of `slice*` functions to match dplyr 1.0.0.
- Simplify the joining functions.
- Debug `complete_dt` to suppress unnecessary warnings in special cases.
- Debug `nest_dt` to use full join to unnest multiple columns.
- Debug the joining functions to make them robust for non-data.table data frames.
Date:20200502
- Update Chinese tutorial.
- Add `impute_dt` to impute missing values using mean, mode and median.
- Improve `t_dt` to be faster.
- Add set operations including `union_dt`, etc. These can be used on non-data.table data.frames, which is convenient.
- Update "Example 2" vignette.
Date: 20200410
0. Reason for update: The update of `as_dt` is very important (see point 5), because it is used everywhere in tidyfst. This update might be minor inside the function, but it can improve performance considerably, especially for extremely large data sets (this means that in versions before 0.9.5 [<= 0.9.4], operations on large data frames could be quite slow because copies were made at every step).
1. Improve `distinct_dt` to receive variables more flexibly.
2. Add `summary_fst` to get info of the fst table.
3. Upgrade "mcols" in `nest_dt` to accept inputs more flexibly by using `select_dt`.
4. Debug `anti_join` and `semi_join` to become more efficient and robust.
5. Update `as_dt` and many functions, making them faster by reducing data copying when possible, while still sticking to the principle of never modifying by reference. Copies are suppressed when possible, but are still made when necessary (using `as.data.table`).
6. Improve `separate_dt` and `unite_dt`.
7. Improve `replace_dt`.
8. For every `summarise_` and `mutate_`, give a "by" parameter.
9. Add `summarise_when`.
Date: 20200402
0. Reason for update: The earlier introduction of modification by reference violates the principles of the package, so it is removed. Modification by reference might be good, so I built another package named 'tidyft' to realize it.
- Add `mat_df` and `df_mat` to convert between named matrix and tidy data.frame, using base R only.
- Add `rn_col` and `col_rn`.
- Add "by" parameter for `summarise_vars` and `mutate_vars`.
- Make `filter_fst` more robust.
- Update the vignette of `fst`.
- Add a new set of join functions with another syntax.
- Improve `select_fst` with `select_dt`.
- Remove facilities of modification by reference in tidyfst, including the `set*` family and the "inplace" parameter in `group_by_dt`.
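For context on what "modification by reference" means, here is a minimal sketch using plain data.table (not tidyfst functions). The table and column names are made up for illustration:

```r
library(data.table)

dt <- data.table(x = 1:3)

# Plain assignment does not copy: both names point to the same table,
# so := through the alias also changes dt.
alias <- dt
alias[, y := x * 2L]
changed_original <- "y" %in% names(dt)

# copy() gives an independent table, so the original stays untouched.
safe <- copy(dt)
safe[, z := 0L]
original_untouched <- !("z" %in% names(dt))

print(c(changed_original, original_untouched))
```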
Date: 20200324
0. Reason for update: rmarkdown has poor support for Chinese, which makes the vignette name messy on the CRAN page (see the vignette part of https://CRAN.R-project.org/package=tidyfst). Therefore, it has to be changed to an English name. Also, with many new adjustments coming in, there are some substantial changes to make tidyfst safer (more robust), faster, simpler and richer in features.
- Improve `group_by_dt` to make it more flexible. Now it can receive what `select_dt` receives.
- Improve `select_fst`; it can now select a single column by number.
- Improve `fill_na_dt` to make it faster with `setnafill`, `shift` and `fcoalesce`.
- Change the parameter `data` to `.data`. This API change applies to all functions, and to some other parameters too (they now start with a dot).
- Remove `drop_all_na_cols` and `drop_all_na_rows`; use `delete_na_cols` and `delete_na_rows` instead to remove columns or rows whose NAs exceed a threshold in proportion or number.
- Rewrite `rename_dt` to be safer.
- Improve `relocate_dt` to make it faster, by moving names rather than the data.frame itself; the data is only moved at the final step.
- Remove `mutate_ref`. Design a new `set_` family to modify by reference. For details, see `?set_in_dt`.
- Add `as_fst` to save a data.frame as "fst" in a tempfile and parse it back as fst_table.
- Improve `longer_dt` and `wider_dt` by using `select_mix` to select unchanged columns. Also, change the parameter API to make it more concise. Now it should be easier to use. The reshape vignette (Example 3) is updated too.
- Make `separate_dt` more robust by receiving non-character as column. This means you can use `df %>% separate_dt(x, c("A", "B"))` now. See examples in `?separate_dt`.
- Give a "by" parameter to `mutate_dt` and `transmute_dt` to mutate by group.
- Fix a bug in `select_dt`.
- Remove the `all-at-if` collection; use `mutate_vars` and `summarise_vars` instead.
- Add `replace_dt` to replace any value(s) in data.table.
- Add an English tutorial and test many basic and complicated examples.
- Debug `wider_dt` and add a new functionality to take `list` as the aggregation function and unchop automatically.
- Improve `mutate_vars` with raw data.table code, which is faster.
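The data.table helpers named above for filling NAs can be tried directly. This is a minimal sketch of those building blocks only, not of how `fill_na_dt` combines them; `nafill` is used here as the copying counterpart of `setnafill`, and the vector is made up for illustration:

```r
library(data.table)

x <- c(1, NA, NA, 4)

down <- nafill(x, type = "locf")    # last observation carried forward
up   <- nafill(x, type = "nocb")    # next observation carried backward
dflt <- fcoalesce(x, rep(9, 4))     # first non-NA value across the inputs
lag1 <- shift(x, 1)                 # lag by one position, padded with NA

print(list(down = down, up = up, dflt = dflt, lag1 = lag1))
```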
Date: 20200315
0. Reason for update: Check every function in `data.table`, `dplyr` and `tidyr`, optimize and add functionalities when possible, and keep up with the updates of `dplyr` (the upcoming v1.0.0). There are so many substantial updates that I think a version upgrade should be proposed. This package is heading toward a stable stage (if no fatal bugs appear in the coming weeks), and the next minor updates will only come after the major updates of data.table (waiting for the release of v1.12.9) and any new bugs reported by users.
- Get a better understanding of non-standard evaluation, and update functions that could be optimized. The updated functions include: `mutate_dt`, `transmute_dt`, `arrange_dt`, `distinct_dt`, `slice_dt`, `top_n_dt`, `top_frac_dt`, `mutate_when`. These functions should therefore be faster than before.
- Add `nth` to extract an element of a vector by position, useful when we want a single element from the bottom.
- The API of `longer_dt` has been changed to be more powerful, and the examples in `wider_dt` have been updated. Update the `Example 3: Reshape` vignette.
- Rewrite the nest part: `nest_by` and `unnest_col` are deprecated; switch to `nest_dt` and `unnest_dt` for new APIs and features.
- Design `squeeze_dt` and add `chop_dt`/`unchop_dt` for new usage of nesting.
- Export `frollapply` from data.table; this is a powerful function for aggregation on a sliding window.
- Enhance `select_dt` once more: `select_if_dt` is no longer exported, and its functionality is merged directly into `select_dt`. Also, we can now use `-` or `!` to select the negative columns for regular expressions.
- Optimize `top_n` using `frank` (faster with less memory).
- Add `sys_time_print` to get the running time more intuitively.
- Add `uncount_dt`, which works just like `tidyr::uncount`.
- Add `rowwise_dt`, which can carry out analysis like `dplyr::rowwise`.
- Add `relocate_dt` to rearrange columns in data.table.
- Add `top_dt` and `sample_dt` for convenience.
- Add `mutate_vars` to complement `all_dt`/`if_dt`/`at_dt`.
- Add `set_dt` and `mutate_ref` for fast operation by reference of data.table.
- Add "fun" parameter to `wider_dt` for multiple aggregation.
- Debug `separate_dt`.
- Add a Chinese vignette for folks in China (titled "tidyfst包实例分析").
- Shorten the description file to be more specific.
- Add `group_by_dt` and `group_exe_dt` to perform more convenient and efficient group operations.
- Add `select_mix` for super selection of columns.
- Fix typos in description.
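Two of the data.table functions mentioned in this list, `frollapply` and `frank`, can be illustrated on a small vector. This is a sketch with made-up data; the top-n selection shown is one possible use of `frank`, not necessarily how `top_n` is written in tidyfst:

```r
library(data.table)

x <- c(2, 4, 6, 8, 10)

# Sliding-window aggregation: mean over a window of 3 values.
# The first two positions have no complete window and are filled with NA.
win_mean <- frollapply(x, 3, mean)

# Rank-based top-n: rank the negated values, keep ranks 1 and 2.
top2 <- x[frank(-x, ties.method = "min") <= 2]

print(win_mean)
print(top2)
```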
Date: 20200305
0. Reason for update: I've been using `tidyfst` in my daily work by adding `_dt` to many past and current tasks. Through this experience, I debugged some important functions (they ran well on simple tasks, but not on complicated ones) and added more functions. There are so many new features that I think an update is necessary for users to get a better toolkit earlier. If the updates are too frequent, please accept my apology.
- Optimize `group_dt`. First, it is faster than before because I use `[][]` instead of `%>%` (using `%>%` for `.SD` is slow). Second, I design an alternative to use `.SD` directly in `group_dt`, which might improve the efficiency further.
- Debug `filter_dt`.
- Add `fill_na_dt` to fill NAs in data.table. Debug all functions that deal with missing values. Examples are refreshed.
- Debug `mutate_when`.
- Add `complete_dt` to complete a data.frame like `tidyr::complete`.
- Add `dummy_dt` to get dummy variables from columns.
- Add `t_dt` to transpose data frame efficiently.
- Add two functions, `as_dt` and `in_dt`, to create a shortcut to data.table facilities. Add a vignette as a tutorial on this feature.
- Add `unite_dt` and `separate_dt` for simple usage.
- Debug `mutate_dt`.
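The `[][]` style mentioned for `group_dt` is data.table's native chaining, where each bracket operates on the result of the previous one. A minimal sketch with plain data.table on `mtcars` (illustrative only, not the `group_dt` internals):

```r
library(data.table)

dt <- as.data.table(mtcars)

# First bracket: grouped summary. Second bracket: order the result.
res <- dt[, .(avg = mean(mpg)), by = .(vs, am)][order(vs, am)]
print(res)
```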
Date: 20200227
0. Reason for urgent update: The use of `show_tibble` violates the principles of programming. I hope this idea will not spread in the vignettes. See changes in 4.
1. Improve `select_dt` to let it accept `a:c`-like inputs. Add example `iris %>% select_dt(Sepal.Length:Petal.Length)`. Moreover, `select_dt` now supports deleting columns with the `-` symbol.
2. Improve `group_dt` to let the "by" parameter also accept a list of variables, which means we could now use `mtcars %>% group_dt(by =list(vs,am),summarise_dt(avg = mean(mpg)))`.
3. Fix a few typos in description and vignettes.
4. Show the class of variables by default, using `options("datatable.print.class" = TRUE)`, and remove the inappropriate use of `show_tibble`. For details, see tidyverse/tibble#716.
5. Add `select_if_dt` function. Moreover, support negative conditional selection in `if_dt`.
6. Delete the vignette entitled "Example 5: Tibble", as this feature is not used any more.
7. Add vignette "Example 5:Fst" for better introduction of the feature.
8. Update vignette "Example 1:Basic usage".
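The effect of the printing option mentioned above can be checked with plain data.table. A minimal sketch that sets the option temporarily and restores the previous value afterwards; the example table is made up:

```r
library(data.table)

# Remember the old setting so it can be restored afterwards.
old <- options(datatable.print.class = TRUE)
class_shown <- isTRUE(getOption("datatable.print.class"))

dt <- as.data.table(head(iris, 3))
print(dt)  # column classes are printed under the column names

options(old)
```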
Date:20200224
- Change all `print` and `cat` functions to `message`.
- Use `tempdir()` to write file and read it back in the example of `parse_fst`.
- Fix the bug in `count_dt` and `add_count_dt` and add examples to the functions.
- Add `show_tibble` function; the package can now use the printing form of tibble to get better information about the data.table. This is not used by default, but might be preferred by tidyverse users.
- Remove all the unnecessary `\donttest` and use `\dontrun` when files have to be written to a directory, only to show an example of how to use it (refer to the `utils::write.table` documentation). This should make the best example for real usage.
- Add URL to Description file.
- More vignettes added.
- Major updates: (1) Change package name to `tidyfst` (according to the suggestions from CRAN); (2) Do not use `maditr` code any more (change the description), based on `stringr` and `data.table` only; (3) Support the `fst` package with tidy syntax; (4) Add 4 vignettes.
- Support 'fst' package in various ways (see functions ending with "_fst").
- Test the functions and get three vignettes for comparison.
- Fully support group computing with the `group_dt` function.
- Correct various typos in the documentation.
- Rewrite `nest_by` and `unnest_col`. Did not use the "_dt" name because they are different from the `tidyverse` API. They might be even more efficient and simpler to use.
- Add "negate" parameter to `select_dt` function.
- Add `all_dt`, `at_dt` and `if_dt` functions for flexible mutate and summarise.
Fix some bugs and add a vignette.
Rewrite all functions and use only `data.table` and `stringr` as imported packages.
Have changed the license to MIT.
This time, `tidydt` is lightweight, efficient and powerful. It is totally different from the previous version in many ways.
The previous version would be archived in https://github.com/hope-data-science/tidydt0.
Some issue seems to happen, check hope-data-science/tidydt0#1. Hope to get an official answer from CRAN. Done in the mailing list, keep moving. [20200129]
- Use new API for `rename_dt`, more like the `rename` in `dplyr`.
- Change some API names, e.g. `topn_dt` to `top_n_dt`.
- Add functions to deal with missing values (`replace_na_dt`, `drop_na_dt`).
- Change the `on_attach.R` file to change the hints.
- Add `pull_dt`, which I use a lot and so may many others.
- Add `mutate_when` for another advanced `case_when` utility.
- Fix according to CRAN suggestions.