-
Notifications
You must be signed in to change notification settings - Fork 985
Convenience features of fread
For example, from this answer.
fread("grep -v TRIAL sum_data.txt")`
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17
1: 2 0.1 0 -0.0047 -0.0168 -0.9938 -0.0087 -0.0105 -0.9709 0.0035 0.0079 -0.9754 0.0081 0.0023 0.9997 -1.35324e-10 0.0278754
2: 2 0.2 0 -0.0121 0.0002 -0.9898 -0.0364 -0.0027 -0.9925 -0.0242 -0.0050 -0.9929 0.0029 -0.0023 0.9998 -1.33521e-10 0.0425567
3: 2 0.3 0 0.0193 -0.0068 -0.9884 0.0040 0.0139 -0.9782 -0.0158 0.0150 -0.9814 0.0054 -0.0008 0.9997 -1.34103e-10 0.0255356
4: 2 0.4 0 -0.0157 0.0183 -0.9879 -0.0315 -0.0311 -0.9908 -0.0314 -0.0160 -0.9929 0.0040 0.0010 0.9998 -1.34819e-10 0.0257300
5: 2 0.5 0 -0.0402 0.0300 -0.9832 -0.0093 0.0269 -0.9781 -0.0326 0.0247 -0.9802 0.0044 -0.0010 0.9997 -1.31515e-10 0.0440350
--- `
124247: 250 49.5 0 -0.0040 0.0141 0.9802 -0.0152 0.0203 -0.9877 -0.0015 0.0123 -0.9901 0.0069 0.0003 0.9997 -1.30220e-10 0.0213215
124248: 250 49.6 0 -0.0006 0.0284 0.9819 0.0021 0.0248 -0.9920 0.0264 0.0408 -0.9919 0.0028 -0.0028 0.9997 -1.30295e-10 0.0284142
124249: 250 49.7 0 0.0378 0.0305 0.9779 -0.0261 0.0232 -0.9897 -0.0236 0.0137 -0.9928 0.0102 -0.0023 0.9997 -1.29890e-10 0.0410760
124250: 250 49.8 0 0.0569 -0.0203 0.9800 -0.0028 -0.0009 -0.9906 -0.0139 -0.0169 -0.9918 0.0039 -0.0017 0.9997 -1.31555e-10 0.0513482
124251: 250 49.9 0 0.0234 -0.0358 0.9840 -0.0340 0.0114 -0.9873 -0.0255 0.0134 -0.9888 0.0006 0.0009 0.9997 -1.30862e-10 0.0334976
The
-v
makesgrep
return all lines except lines containing the string TRIAL. Given the number of high quality engineers that have looked at the command tool grep over the years, it is most likely that it is as fast as you can get, as well as being correct, convenient, well documented online, easy to learn and search for solutions for specific tasks. If you need to do more complicated string filters (e.g. strings at the beginning or the end of the lines, etc) thengrep
syntax is very powerful. Learning its syntax is a transferable skill to other languages and environments.
The manual page ?fread
contains :
input
..., a shell command that preprocesses the file (e.g. fread("grep blah filename")), ...
On Windows we highly recommend Cygwin (run one .exe to install) which includes the command line tools such as grep
. On Linux and Mac they are included in the operating system.
Given a tab, comma, pipe, space, colon or semi-colon delimited file, you don't need to find or remember the function name or argument name to use. fread
is a single command that will look at its input and observe the separator for you. All you as the user has to do is pass the file name to the function. The rest is done for you. Of course, if you do wish to override the separator, or use a very unusual one, you can pass it to the sep=
parameter.
The manual page ?fread
contains :
sep
The separator between columns. Defaults to the first character in the set [,\t |;:] that exists on line autostart outside quoted ("") regions, and separates the rows above autostart into a consistent number of fields too.
Some files start off with blanks or zeros in some columns and only become fully populated later in the file. For this reason fread
doesn't just look at the top of the file, but it looks at the middle and the end too. It is possible to jump directly to the middle and the end of the file efficiently because fread
uses the operating system's mmap
which does not need to read from the beginning of the file to get to a particular point.
Currently fread
reads the top 5, middle 5 and bottom 5 rows. We intend to increase this to 1000 and also increase the number of points from 3 to 10. The increase from 15 to 10,000 rows used for type detection is not expected to add much overhead because even 10,000 rows is not very large. If it becomes clear that all columns contain character strings then the detection can return early.
The manual page ?fread
contains :
The first 5 rows, middle 5 rows and last 5 rows are then read to determine column types. The lowest type for each column is chosen from the ordered list integer, integer64, double, character. This enables fread to allocate exactly the right number of rows, with columns of the right type, up front once. The file may of course still contain data of a different type in rows other than first, middle and last 5. In that case, the column types are bumped mid read and the data read on previous rows is coerced. Setting verbose=TRUE reports the line and field number of each mid read type bump, and how long this type bumping took (if any).