MLUtils.batchsize
— Function
batchsize(data::BatchView) -> Int
Return the fixed size of each batch in data.
Examples
using MLUtils
diff --git a/dev/api/index.html b/dev/api/index.html
index d2db8c4..6f5bb2a 100644
--- a/dev/api/index.html
+++ b/dev/api/index.html
@@ -8,15 +8,15 @@
julia> batch([(a=[1,2], b=[3,4])
(a=[5,6], b=[7,8])])
-(a = [1 5; 2 6], b = [3 7; 4 8])
MLUtils.batchsize
— Function
batchsize(data::BatchView) -> Int
Return the fixed size of each batch in data.
Examples
using MLUtils
+(a = [1 5; 2 6], b = [3 7; 4 8])
MLUtils.batchsize
— Function
batchsize(data::BatchView) -> Int
Return the fixed size of each batch in data.
Examples
using MLUtils
X, Y = MLUtils.load_iris()
A = BatchView(X, batchsize=30)
-@assert batchsize(A) == 30
MLUtils.batchseq
— Function
batchseq(seqs, val = 0)
Take a list of N sequences, and turn them into a single sequence where each item is a batch of N. Short sequences will be padded by val.
Examples
julia> batchseq([[1, 2, 3], [4, 5]], 0)
+@assert batchsize(A) == 30
MLUtils.batchseq
— Function
batchseq(seqs, val = 0)
Take a list of N sequences, and turn them into a single sequence where each item is a batch of N. Short sequences will be padded by val.
Examples
julia> batchseq([[1, 2, 3], [4, 5]], 0)
3-element Vector{Vector{Int64}}:
[1, 4]
[2, 5]
- [3, 0]
MLUtils.BatchView
— Type
BatchView(data, batchsize; partial=true, collate=nothing)
+ [3, 0]
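The padding-and-regrouping step described for batchseq can be sketched in Base Julia. The helper name pad_and_batch is hypothetical and not part of MLUtils; this illustrates the documented behaviour and is not the library's implementation.

```julia
# Hypothetical helper (not an MLUtils function): pad every sequence to the
# longest length with `val`, then regroup so that item t holds the t-th
# element of each of the N sequences.
function pad_and_batch(seqs, val = 0)
    maxlen = maximum(length, seqs)
    padded = [vcat(s, fill(val, maxlen - length(s))) for s in seqs]
    return [[p[t] for p in padded] for t in 1:maxlen]
end

result = pad_and_batch([[1, 2, 3], [4, 5]], 0)
# result == [[1, 4], [2, 5], [3, 0]]
```

Each inner vector has N = 2 entries, one per input sequence, which matches the REPL output shown above.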
MLUtils.BatchView
— Type
BatchView(data, batchsize; partial=true, collate=nothing)
BatchView(data; batchsize=1, partial=true, collate=nothing)
Create a view of the given data that represents it as a vector of batches. Each batch will contain an equal number of observations. The batch size can be specified using the parameter batchsize. If the size of the dataset is not divisible by the specified batchsize, the remaining observations will be ignored if partial=false. If partial=true instead, the last batch can be slightly smaller.
Note that any data access is delayed until getindex is called.
If used as an iterator, the object will iterate over the dataset once, effectively denoting an epoch.
For BatchView to work on some data structure, the type of the given variable data must implement the data container interface. See ObsView for more info.
Arguments
data: The object describing the dataset. Can be of any type as long as it implements getobs and numobs (see Details for more information).
batchsize: The batch size of each batch. It is the number of observations that each batch must contain (except possibly for the last one).
partial: If partial=false and the number of observations is not divisible by the batch size, then the last mini-batch is dropped.
collate: Batching behavior. If nothing (default), a batch is getobs(data, indices). If false, each batch is [getobs(data, i) for i in indices]. When true, applies batch to the vector of observations in a batch, recursively collating arrays in the last dimensions. See batch for more information and examples.
Examples
using MLUtils
X, Y = MLUtils.load_iris()
@@ -53,7 +53,7 @@
for (x, y) in BatchView(shuffleobs((X, Y)), batchsize=20)
@assert typeof(x) <: SubArray{Float64,2}
@assert typeof(y) <: SubArray{String,1}
-end
MLUtils.chunk
— Function
chunk(x, n; [dims])
+end
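How partial changes the number of batches can be sketched in Base Julia. The helper names nbatches and batch_ranges are hypothetical and not part of MLUtils; the sketch mirrors the documented behaviour and makes no claim about how BatchView is implemented.

```julia
# Hypothetical helpers (not MLUtils functions).
# With partial=true the trailing, smaller batch is kept (round up);
# with partial=false it is dropped (round down).
nbatches(n, batchsize; partial = true) =
    partial ? cld(n, batchsize) : fld(n, batchsize)

# Observation index range covered by each batch.
function batch_ranges(n, batchsize; partial = true)
    k = nbatches(n, batchsize; partial = partial)
    return [((i - 1) * batchsize + 1):min(i * batchsize, n) for i in 1:k]
end

kept    = batch_ranges(150, 20)                   # 8 ranges, last one is 141:150
dropped = batch_ranges(150, 20; partial = false)  # 7 ranges, last one is 121:140
```

With 150 observations and a batch size of 20, keeping the partial batch yields a final batch of 10 observations; dropping it leaves observations 141 to 150 unused.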
MLUtils.chunk
— Function
chunk(x, n; [dims])
chunk(x; [size, dims])
Split x into n parts or alternatively, if size is an integer, into equal chunks of size size. The parts contain the same number of elements except possibly for the last one, which can be smaller.
In case size is a collection of integers instead, the elements of x are split into chunks of the given sizes.
If x is an array, dims can be used to specify along which dimension to split (defaults to the last dimension).
Examples
julia> chunk(1:10, 3)
3-element Vector{UnitRange{Int64}}:
1:4
@@ -103,7 +103,7 @@
julia> chunk(1:6; size = [2, 4])
2-element Vector{UnitRange{Int64}}:
1:2
- 3:6
chunk(x, partition_idxs; [npartitions, dims])
Partition the array x along the dimension dims according to the indexes in partition_idxs.
partition_idxs must be sorted and contain only positive integers between 1 and the number of partitions.
If the number of partitions npartitions is not provided, it is inferred from partition_idxs.
If dims is not provided, it defaults to the last dimension.
See also unbatch.
Examples
julia> x = reshape([1:10;], 2, 5)
+ 3:6
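The "n parts, last one possibly smaller" rule for non-array inputs can be sketched with Base Julia's Iterators.partition. The name chunk_sketch is hypothetical, and treating the part size as cld(length(x), n) is an assumption chosen to reproduce the chunk(1:10, 3) output shown above; it is not taken from the MLUtils source.

```julia
# Hypothetical helper (not an MLUtils function): split `x` into at most `n`
# consecutive parts of equal size, the last part possibly being shorter.
chunk_sketch(x, n) = collect(Iterators.partition(x, cld(length(x), n)))

parts = chunk_sketch(1:10, 3)
# parts == [1:4, 5:8, 9:10]
```

Because the part size is rounded up, some inputs produce fewer than n parts under this sketch (for example 9 elements with n = 4 give three parts of 3).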
chunk(x, partition_idxs; [npartitions, dims])
Partition the array x along the dimension dims according to the indexes in partition_idxs.
partition_idxs must be sorted and contain only positive integers between 1 and the number of partitions.
If the number of partitions npartitions is not provided, it is inferred from partition_idxs.
If dims is not provided, it defaults to the last dimension.
See also unbatch.
Examples
julia> x = reshape([1:10;], 2, 5)
2×5 Matrix{Int64}:
1 3 5 7 9
2 4 6 8 10
@@ -112,7 +112,7 @@
3-element Vector{SubArray{Int64, 2, Matrix{Int64}, Tuple{Base.Slice{Base.OneTo{Int64}}, UnitRange{Int64}}, true}}:
[1; 2;;]
[3 5; 4 6]
- [7 9; 8 10]
MLUtils.DataLoader
— Type
DataLoader(data; [batchsize, buffer, collate, parallel, partial, rng, shuffle])
An object that iterates over mini-batches of data, each mini-batch containing batchsize observations (except possibly the last one).
Takes as input a single data array, a tuple (or a named tuple) of arrays, or in general any data object that implements the numobs and getobs methods.
The last dimension in each array is the observation dimension, i.e. the one divided into mini-batches.
The original data is preserved in the data field of the DataLoader.
Arguments
data: The data to be iterated over. The data type has to be supported by numobs and getobs.
batchsize: If less than 0, iterates over individual observations. Otherwise, each iteration (except possibly the last) yields a mini-batch containing batchsize observations. Default 1.
buffer: If buffer=true and supported by the type of data, a buffer will be allocated and reused for memory efficiency. You can also pass a preallocated object to buffer. Default false.
collate: Batching behavior. If nothing (default), a batch is getobs(data, indices). If false, each batch is [getobs(data, i) for i in indices]. When true, applies batch to the vector of observations in a batch, recursively collating arrays in the last dimensions. See batch for more information and examples.
parallel: Whether to load data in parallel using worker threads. This can speed up data loading by up to a factor of the number of available threads. Requires starting Julia with multiple threads. Check Threads.nthreads() to see the number of available threads. Passing parallel = true breaks ordering guarantees. Default false.
partial: This argument is used only when batchsize > 0. If partial=false and the number of observations is not divisible by the batchsize, then the last mini-batch is dropped. Default true.
rng: A random number generator. Default Random.GLOBAL_RNG.
shuffle: Whether to shuffle the observations before iterating. Unlike wrapping the data container with shuffleobs(data), shuffle=true ensures that the observations are shuffled anew every time you start iterating over eachobs. Default false.
Examples
julia> Xtrain = rand(10, 100);
+ [7 9; 8 10]
MLUtils.DataLoader
— Type
DataLoader(data; [batchsize, buffer, collate, parallel, partial, rng, shuffle])
An object that iterates over mini-batches of data, each mini-batch containing batchsize observations (except possibly the last one).
Takes as input a single data array, a tuple (or a named tuple) of arrays, or in general any data object that implements the numobs and getobs methods.
The last dimension in each array is the observation dimension, i.e. the one divided into mini-batches.
The original data is preserved in the data field of the DataLoader.
Arguments
data: The data to be iterated over. The data type has to be supported by numobs and getobs.
batchsize: If less than 0, iterates over individual observations. Otherwise, each iteration (except possibly the last) yields a mini-batch containing batchsize observations. Default 1.
buffer: If buffer=true and supported by the type of data, a buffer will be allocated and reused for memory efficiency. You can also pass a preallocated object to buffer. Default false.
collate: Batching behavior. If nothing (default), a batch is getobs(data, indices). If false, each batch is [getobs(data, i) for i in indices]. When true, applies batch to the vector of observations in a batch, recursively collating arrays in the last dimensions. See batch for more information and examples.
parallel: Whether to load data in parallel using worker threads. This can speed up data loading by up to a factor of the number of available threads. Requires starting Julia with multiple threads. Check Threads.nthreads() to see the number of available threads. Passing parallel = true breaks ordering guarantees. Default false.
partial: This argument is used only when batchsize > 0. If partial=false and the number of observations is not divisible by the batchsize, then the last mini-batch is dropped. Default true.
rng: A random number generator. Default Random.GLOBAL_RNG.
shuffle: Whether to shuffle the observations before iterating. Unlike wrapping the data container with shuffleobs(data), shuffle=true ensures that the observations are shuffled anew every time you start iterating over eachobs. Default false.
Examples
julia> Xtrain = rand(10, 100);
julia> array_loader = DataLoader(Xtrain, batchsize=2);
@@ -152,7 +152,7 @@
julia> foreach(println∘summary, DataLoader(rand(Int8, 10, 64), batchsize=30)) # partial=false would omit last
10×30 Matrix{Int8}
10×30 Matrix{Int8}
-10×4 Matrix{Int8}
MLUtils.eachobs
— Function
eachobs(data; kws...)
Return an iterator over data.
Supports the same arguments as DataLoader. The batchsize default is -1 here while it is 1 for DataLoader.
Examples
X = rand(4,100)
+10×4 Matrix{Int8}
MLUtils.eachobs
— Function
eachobs(data; kws...)
Return an iterator over data.
Supports the same arguments as DataLoader. The batchsize default is -1 here while it is 1 for DataLoader.
Examples
X = rand(4,100)
for x in eachobs(X)
# loop entered 100 times
@@ -170,7 +170,7 @@
# support for tuples, named tuples, dicts
for (x, y) in eachobs((X, Y))
# ...
-end
MLUtils.fill_like
— Function
fill_like(x, val, [element_type=eltype(x)], [dims=size(x)])
Create an array with the given element type and size, based upon the given source array x. All elements of the new array will be set to val. The third and fourth arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.
See also zeros_like and ones_like.
Examples
julia> x = rand(Float32, 2)
+end
MLUtils.fill_like
— Function
fill_like(x, val, [element_type=eltype(x)], [dims=size(x)])
Create an array with the given element type and size, based upon the given source array x. All elements of the new array will be set to val. The third and fourth arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.
See also zeros_like and ones_like.
Examples
julia> x = rand(Float32, 2)
2-element Vector{Float32}:
0.16087806
0.89916044
@@ -191,11 +191,11 @@
julia> fill_like(x, 1.7, Float64)
2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
1.7 1.7
- 1.7 1.7
MLUtils.filterobs
— Function
filterobs(f, data)
Return a subset of data container data including all indices i for which f(getobs(data, i)) === true.
data = 1:10
+ 1.7 1.7
MLUtils.filterobs
— Function
filterobs(f, data)
Return a subset of data container data including all indices i for which f(getobs(data, i)) === true.
data = 1:10
numobs(data) == 10
fdata = filterobs(>(5), data)
-numobs(fdata) == 5
MLUtils.flatten
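— Function
flatten(x::AbstractArray)
Reshape arbitrarily-shaped input into a matrix-shaped output, preserving the size of the last dimension.
See also unsqueeze.
Examples
julia> rand(3,4,5) |> flatten |> size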
-(12, 5)
MLUtils.getobs
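— Function
getobs(data, [idx])
Return the observations corresponding to the observation index idx. Note that idx can be any type as long as data has defined getobs for that type. If idx is not provided, then materialize all observations in data.
If data does not have getobs defined, then in the case of Tables.table(data) == true returns the row(s) in position idx, otherwise returns data[idx].
Authors of custom data containers should implement Base.getindex for their type instead of getobs. getobs should only be implemented for types where there is a difference between getobs and Base.getindex (such as multi-dimensional arrays).
The returned observation(s) should be in the form intended to be passed as-is to some learning algorithm. There is no strict interface requirement on how this "actual data" must look like. Every author behind some custom data container can make this decision themselves. The output should be consistent when idx is a scalar vs vector.
getobs supports by default nested combinations of array, tuple, named tuples, and dictionaries.
Examples
# named tuples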
+numobs(fdata) == 5
MLUtils.flatten
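— Function
flatten(x::AbstractArray)
Reshape arbitrarily-shaped input into a matrix-shaped output, preserving the size of the last dimension.
See also unsqueeze.
Examples
julia> rand(3,4,5) |> flatten |> size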
+(12, 5)
MLUtils.getobs
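— Function
getobs(data, [idx])
Return the observations corresponding to the observation index idx. Note that idx can be any type as long as data has defined getobs for that type. If idx is not provided, then materialize all observations in data.
If data does not have getobs defined, then in the case of Tables.table(data) == true returns the row(s) in position idx, otherwise returns data[idx].
Authors of custom data containers should implement Base.getindex for their type instead of getobs. getobs should only be implemented for types where there is a difference between getobs and Base.getindex (such as multi-dimensional arrays).
The returned observation(s) should be in the form intended to be passed as-is to some learning algorithm. There is no strict interface requirement on how this "actual data" must look like. Every author behind some custom data container can make this decision themselves. The output should be consistent when idx is a scalar vs vector.
getobs supports by default nested combinations of array, tuple, named tuples, and dictionaries.
Examples
# named tuples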
x = (a = [1, 2, 3], b = rand(6, 3))
getobs(x, 2) == (a = 2, b = x.b[:, 2])
@@ -206,20 +206,20 @@
x = Dict(:a => [1, 2, 3], :b => rand(6, 3))
getobs(x, 2) == Dict(:a => 2, :b => x[:b][:, 2])
-getobs(x, [1, 3]) == Dict(:a => [1, 3], :b => x[:b][:, [1, 3]])
MLUtils.getobs!
— Function
getobs!(buffer, data, idx)
Inplace version of getobs(data, idx). If this method is defined for the type of data, then buffer should be used to store the result, instead of allocating a dedicated object.
Implementing this function is optional. In the case no such method is provided for the type of data, then buffer will be ignored and the result of getobs returned. This could be because the type of data may not lend itself to the concept of copy!. Thus, supporting a custom getobs! is optional and not required.
MLUtils.joinobs
— Function
joinobs(datas...)
Concatenate data containers datas.
data1, data2 = 1:10, 11:20
+getobs(x, [1, 3]) == Dict(:a => [1, 3], :b => x[:b][:, [1, 3]])
MLUtils.getobs!
— Function
getobs!(buffer, data, idx)
Inplace version of getobs(data, idx). If this method is defined for the type of data, then buffer should be used to store the result, instead of allocating a dedicated object.
Implementing this function is optional. In the case no such method is provided for the type of data, then buffer will be ignored and the result of getobs returned. This could be because the type of data may not lend itself to the concept of copy!. Thus, supporting a custom getobs! is optional and not required.
MLUtils.joinobs
— Function
joinobs(datas...)
Concatenate data containers datas.
data1, data2 = 1:10, 11:20
jdata = joinobs(data1, data2)
-getobs(jdata, 15) == 15
MLUtils.group_counts
— Function
group_counts(x)
Count the number of times that each element of x appears.
See also group_indices.
Examples
julia> group_counts(['a', 'b', 'b'])
+getobs(jdata, 15) == 15
MLUtils.group_counts
— Function
group_counts(x)
Count the number of times that each element of x appears.
See also group_indices.
Examples
julia> group_counts(['a', 'b', 'b'])
Dict{Char, Int64} with 2 entries:
'a' => 1
- 'b' => 2
MLUtils.group_indices
— Function
group_indices(x) -> Dict
Computes the indices of elements in the vector x for each distinct value contained. This information is useful for resampling strategies, such as stratified sampling.
See also group_counts.
Examples
julia> x = [:yes, :no, :maybe, :yes];
+ 'b' => 2
MLUtils.group_indices
— Function
group_indices(x) -> Dict
Computes the indices of elements in the vector x for each distinct value contained. This information is useful for resampling strategies, such as stratified sampling.
See also group_counts.
Examples
julia> x = [:yes, :no, :maybe, :yes];
julia> group_indices(x)
Dict{Symbol, Vector{Int64}} with 3 entries:
:yes => [1, 4]
:maybe => [3]
- :no => [2]
MLUtils.groupobs
— Function
groupobs(f, data)
Split data container data into different data containers, grouping observations by f(obs).
data = -10:10
+ :no => [2]
MLUtils.groupobs
— Function
groupobs(f, data)
Split data container data into different data containers, grouping observations by f(obs).
data = -10:10
datas = groupobs(>(0), data)
-length(datas) == 2
MLUtils.kfolds
— Function
kfolds(n::Integer, k = 5) -> Tuple
Compute the train/validation assignments for k repartitions of n observations, and return them in the form of two vectors. The first vector contains the index-vectors for the training subsets, and the second vector the index-vectors for the validation subsets respectively. A general rule of thumb is to use either k = 5 or k = 10. The following code snippet generates the indices assignments for k = 5:
julia> train_idx, val_idx = kfolds(10, 5);
Each observation is assigned to the validation subset once (and only once). Thus, a union over all validation index-vectors reproduces the full range 1:n. Note that there is no random assignment of observations to subsets, which means that adjacent observations are likely to be part of the same validation subset.
julia> train_idx
+length(datas) == 2
MLUtils.kfolds
— Function
kfolds(n::Integer, k = 5) -> Tuple
Compute the train/validation assignments for k repartitions of n observations, and return them in the form of two vectors. The first vector contains the index-vectors for the training subsets, and the second vector the index-vectors for the validation subsets respectively. A general rule of thumb is to use either k = 5 or k = 10. The following code snippet generates the indices assignments for k = 5:
julia> train_idx, val_idx = kfolds(10, 5);
Each observation is assigned to the validation subset once (and only once). Thus, a union over all validation index-vectors reproduces the full range 1:n. Note that there is no random assignment of observations to subsets, which means that adjacent observations are likely to be part of the same validation subset.
julia> train_idx
5-element Array{Array{Int64,1},1}:
[3,4,5,6,7,8,9,10]
[1,2,5,6,7,8,9,10]
@@ -233,14 +233,14 @@
3:4
5:6
7:8
- 9:10
kfolds(data, [k = 5])
Repartition a data container k times using a k folds strategy and return the sequence of folds as a lazy iterator. Only data subsets are created, which means that no actual data is copied until getobs is invoked.
Conceptually, a k-folds repartitioning strategy divides the given data into k roughly equal-sized parts. Each part will serve as validation set once, while the remaining parts are used for training. This results in k different partitions of data.
In the case that the size of the dataset is not divisible by the specified k, the remaining observations will be evenly distributed among the parts.
for (x_train, x_val) in kfolds(X, k=10)
+ 9:10
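The static index assignment described above (contiguous validation ranges, each observation validated exactly once) can be sketched in Base Julia. The function name kfold_indices is hypothetical, and giving the surplus observations to the first folds is an assumption; the sketch reproduces the kfolds(10, 5) output shown above but is not presented as the MLUtils implementation.

```julia
# Hypothetical helper (not an MLUtils function).
function kfold_indices(n::Integer, k::Integer = 5)
    2 <= k <= n || throw(ArgumentError("need 2 <= k <= n"))
    sizes = fill(fld(n, k), k)
    sizes[1:(n % k)] .+= 1                  # spread the remainder over the first folds
    offsets = cumsum(sizes) .- sizes .+ 1   # first index of each validation range
    val_idx = [o:(o + s - 1) for (o, s) in zip(offsets, sizes)]
    train_idx = [setdiff(1:n, v) for v in val_idx]
    return train_idx, val_idx
end

train_idx, val_idx = kfold_indices(10, 5)
# val_idx      == [1:2, 3:4, 5:6, 7:8, 9:10]
# train_idx[1] == [3, 4, 5, 6, 7, 8, 9, 10]
```

Since the ranges are contiguous and unshuffled, neighbouring observations always share a validation fold, which is the caveat the text above points out.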
kfolds(data, [k = 5])
Repartition a data container k times using a k folds strategy and return the sequence of folds as a lazy iterator. Only data subsets are created, which means that no actual data is copied until getobs is invoked.
Conceptually, a k-folds repartitioning strategy divides the given data into k roughly equal-sized parts. Each part will serve as validation set once, while the remaining parts are used for training. This results in k different partitions of data.
In the case that the size of the dataset is not divisible by the specified k, the remaining observations will be evenly distributed among the parts.
for (x_train, x_val) in kfolds(X, k=10)
# code called 10 times
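# numobs(x_val) may differ up to ±1 over iterations
end
Multiple variables are supported (e.g. for labeled data)
for ((x_train, y_train), val) in kfolds((X, Y), k=10)
# ...
end
By default the folds are created using static splits. Use shuffleobs to randomly assign observations to the folds.
for (x_train, x_val) in kfolds(shuffleobs(X), k = 10)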
# ...
-end
See leavepout for a related function.
MLUtils.leavepout
— Function
leavepout(n::Integer, [size = 1]) -> Tuple
Compute the train/validation assignments for k ≈ n/size repartitions of n observations, and return them in the form of two vectors. The first vector contains the index-vectors for the training subsets, and the second vector the index-vectors for the validation subsets respectively. Each validation subset will have either size or size+1 observations assigned to it. The following code snippet generates the index-vectors for size = 2.
julia> train_idx, val_idx = leavepout(10, 2);
Each observation is assigned to the validation subset once (and only once). Thus, a union over all validation index-vectors reproduces the full range 1:n. Note that there is no random assignment of observations to subsets, which means that adjacent observations are likely to be part of the same validation subset.
julia> train_idx
+end
See leavepout for a related function.
MLUtils.leavepout
— Function
leavepout(n::Integer, [size = 1]) -> Tuple
Compute the train/validation assignments for k ≈ n/size repartitions of n observations, and return them in the form of two vectors. The first vector contains the index-vectors for the training subsets, and the second vector the index-vectors for the validation subsets respectively. Each validation subset will have either size or size+1 observations assigned to it. The following code snippet generates the index-vectors for size = 2.
julia> train_idx, val_idx = leavepout(10, 2);
Each observation is assigned to the validation subset once (and only once). Thus, a union over all validation index-vectors reproduces the full range 1:n. Note that there is no random assignment of observations to subsets, which means that adjacent observations are likely to be part of the same validation subset.
julia> train_idx
5-element Array{Array{Int64,1},1}:
[3,4,5,6,7,8,9,10]
[1,2,5,6,7,8,9,10]
@@ -254,11 +254,11 @@
3:4
5:6
7:8
- 9:10
leavepout(data, p = 1)
Repartition a data container using a k-fold strategy, where k is chosen in such a way that each validation subset of the resulting folds contains roughly p observations. Defaults to p = 1, which is also known as "leave-one-out" partitioning.
The resulting sequence of folds is returned as a lazy iterator. Only data subsets are created. That means no actual data is copied until getobs is invoked.
for (train, val) in leavepout(X, p=2)
+ 9:10
leavepout(data, p = 1)
Repartition a data container using a k-fold strategy, where k is chosen in such a way that each validation subset of the resulting folds contains roughly p observations. Defaults to p = 1, which is also known as "leave-one-out" partitioning.
The resulting sequence of folds is returned as a lazy iterator. Only data subsets are created. That means no actual data is copied until getobs is invoked.
for (train, val) in leavepout(X, p=2)
# if numobs(X) is divisible by 2,
# then numobs(val) will be 2 for each iteration,
# otherwise it may be 3 for the first few iterations.
-end
See kfolds for a related function.
MLUtils.mapobs
— Function
mapobs(f, data; batched=:auto)
Lazily map f over the observations in a data container data. Returns a new data container mdata that can be indexed and has a length. Indexing triggers the transformation f.
The batched keyword argument controls the behavior of mdata[idx] and mdata[idxs] where idx is an integer and idxs is a vector of integers:
batched=:auto (default). Let f handle the two cases. Calls f(getobs(data, idx)) and f(getobs(data, idxs)).
batched=:never. The function f is always called on a single observation. Calls f(getobs(data, idx)) and [f(getobs(data, idx)) for idx in idxs].
batched=:always. The function f is always called on a batch of observations. Calls getobs(f(getobs(data, [idx])), 1) and f(getobs(data, idxs)).
Examples
julia> data = (a=[1,2,3], b=[1,2,3]);
+end
See kfolds for a related function.
MLUtils.mapobs
— Function
mapobs(f, data; batched=:auto)
Lazily map f over the observations in a data container data. Returns a new data container mdata that can be indexed and has a length. Indexing triggers the transformation f.
The batched keyword argument controls the behavior of mdata[idx] and mdata[idxs] where idx is an integer and idxs is a vector of integers:
batched=:auto (default). Let f handle the two cases. Calls f(getobs(data, idx)) and f(getobs(data, idxs)).
batched=:never. The function f is always called on a single observation. Calls f(getobs(data, idx)) and [f(getobs(data, idx)) for idx in idxs].
batched=:always. The function f is always called on a batch of observations. Calls getobs(f(getobs(data, [idx])), 1) and f(getobs(data, idxs)).
Examples
julia> data = (a=[1,2,3], b=[1,2,3]);
julia> mdata = mapobs(data) do x
(c = x.a .+ x.b, d = x.a .- x.b)
@@ -269,10 +269,10 @@
(c = 2, d = 0)
julia> mdata[1:2]
-(c = [2, 4], d = [0, 0])
mapobs(fs, data)
Lazily map each function in tuple fs over the observations in data container data. Returns a tuple of transformed data containers.
mapobs(namedfs::NamedTuple, data)
Map a NamedTuple of functions over data, turning it into a data container of NamedTuples. Field syntax can be used to select a column of the resulting data container.
data = 1:10
+(c = [2, 4], d = [0, 0])
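The idea that indexing, rather than construction, triggers the transformation can be sketched with a small wrapper in Base Julia. The type LazyMap and the call counter are hypothetical illustrations, not MLUtils types; the sketch corresponds only to the batched=:auto case, where f itself must handle both a single observation and a batch.

```julia
# Hypothetical type (not part of MLUtils): stores `f` and `data`,
# and applies `f` only when an index is requested.
struct LazyMap{F,D}
    f::F
    data::D
end
Base.length(m::LazyMap) = length(m.data)
Base.getindex(m::LazyMap, idx) = m.f(m.data[idx])

calls = Ref(0)                                   # counts how often f actually runs
double(x) = (calls[] += 1; x .* 2)               # works for a scalar or a vector

m = LazyMap(double, [1, 2, 3])
calls_after_construction = calls[]               # 0: nothing computed yet
single = m[2]                                    # f applied to one observation
batch  = m[1:2]                                  # f applied to a batch at once
```

After the two lookups the counter is 2, one call per indexing operation regardless of how many observations were requested.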
mapobs(fs, data)
Lazily map each function in tuple fs over the observations in data container data. Returns a tuple of transformed data containers.
mapobs(namedfs::NamedTuple, data)
Map a NamedTuple of functions over data, turning it into a data container of NamedTuples. Field syntax can be used to select a column of the resulting data container.
data = 1:10
nameddata = mapobs((x = sqrt, y = log), data)
getobs(nameddata, 10) == (x = sqrt(10), y = log(10))
-getobs(nameddata.x, 10) == sqrt(10)
MLUtils.numobs
— Function
numobs(data)
Return the total number of observations contained in data.
If data does not have numobs defined, then in the case of Tables.table(data) == true returns the number of rows, otherwise returns length(data).
Authors of custom data containers should implement Base.length for their type instead of numobs. numobs should only be implemented for types where there is a difference between numobs and Base.length (such as multi-dimensional arrays).
getobs supports by default nested combinations of array, tuple, named tuples, and dictionaries.
See also getobs.
Examples
+getobs(nameddata.x, 10) == sqrt(10)
MLUtils.numobs
— Function
numobs(data)
Return the total number of observations contained in data.
If data does not have numobs defined, then in the case of Tables.table(data) == true returns the number of rows, otherwise returns length(data).
Authors of custom data containers should implement Base.length for their type instead of numobs. numobs should only be implemented for types where there is a difference between numobs and Base.length (such as multi-dimensional arrays).
getobs supports by default nested combinations of array, tuple, named tuples, and dictionaries.
See also getobs.
Examples
# named tuples
x = (a = [1, 2, 3], b = rand(6, 3))
numobs(x) == 3
@@ -291,7 +291,7 @@
[3] numobs(data::NamedTuple{(:a, :b), Tuple{Vector{Int64}, Matrix{Float64}}})
@ MLUtils ~/.julia/dev/MLUtils/src/observation.jl:177
[4] top-level scope
- @ REPL[35]:1
MLUtils.normalise
— Function
normalise(x; dims=ndims(x), ϵ=1e-5)
Normalise the array x to mean 0 and standard deviation 1 across the dimension(s) given by dims. Per default, dims is the last dimension. ϵ is a small additive factor added to the denominator for numerical stability.
MLUtils.obsview
— Function
obsview(data, [indices])
Returns a lazy view of the observations in data that correspond to the given indices. No data will be copied except for the indices. It is similar to constructing an ObsView, but returns a SubArray if the type of data is Array or SubArray. Furthermore, this function may be extended for custom types of data that also want to provide their own subset-type.
In case data is a tuple, the constructor will be mapped over its elements. That means that the constructor returns a tuple of ObsView instead of an ObsView of tuples.
If instead you want to get the subset of observations corresponding to the given indices in their native type, use getobs.
See ObsView for more information.
MLUtils.ObsView
— Type
ObsView(data, [indices])
Used to represent a subset of some data of arbitrary type by storing which observation-indices the subset spans. Furthermore, subsequent subsettings are accumulated without needing to access actual data.
The main purpose for the existence of ObsView is to delay data access and movement until an actual batch of data (or single observation) is needed for some computation. This is particularly useful when the data is not located in memory, but on the hard drive or some remote location. In such a scenario one wants to load the required data only when needed.
Any data access is delayed until getindex is called, and even getindex returns the result of obsview which in general avoids data movement until getobs is called. If used as an iterator, the view will iterate over the dataset once, effectively denoting an epoch. Each iteration will return a lazy subset to the current observation.
Arguments
data: The object describing the dataset. Can be of any type as long as it implements getobs and numobs (see Details for more information).
indices: Optional. The index or indices of the observation(s) in data that the subset should represent. Can be of type Int or some subtype of AbstractVector.
Methods
getindex: Returns the observation(s) of the given index/indices. No data is copied aside from the required indices.
numobs: Returns the total number of observations in the subset.
getobs: Returns the underlying data that the ObsView represents at the given relative indices. Note that these indices are in "subset space", and in general will not directly correspond to the same indices in the underlying data set.
Details
For ObsView to work on some data structure, the desired type MyType must implement the following interface:
getobs(data::MyType, idx): Should return the observation(s) indexed by idx. In what form is up to the user. Note that idx can be of type Int or AbstractVector.
numobs(data::MyType): Should return the total number of observations in data.
The following methods can also be provided and are optional:
getobs(data::MyType): By default this function is the identity function. If that is not the behaviour that you want for your type, you need to provide this method as well.
obsview(data::MyType, idx): If your custom type has its own kind of subset type, you can return it here. An example for such a case are SubArray for representing a subset of some AbstractArray.
getobs!(buffer, data::MyType, [idx]): Inplace version of getobs(data, idx). If this method is provided for MyType, then eachobs can preallocate a buffer that is then reused every iteration. Note: buffer should be equivalent to the return value of getobs(::MyType, ...), since this is how buffer is preallocated by default.
Examples
X, Y = MLUtils.load_iris()
+ @ REPL[35]:1
MLUtils.normalise
— Functionnormalise(x; dims=ndims(x), ϵ=1e-5)
Normalise x
to mean 0 and standard deviation 1 across the dimension(s) given by dims
. By default, dims
is the last dimension. ϵ
is a small additive factor added to the denominator for numerical stability.
MLUtils.obsview
— Functionobsview(data, [indices])
Return a lazy view of the observations in data
that correspond to the given indices
. No data will be copied except for the indices. It is similar to constructing an ObsView
, but returns a SubArray
if the type of data
is Array
or SubArray
. Furthermore, this function may be extended for custom types of data
that also want to provide their own subset type.
If data
is a tuple, the constructor will be mapped over its elements. That means that the constructor returns a tuple of ObsView
instead of an ObsView
of tuples.
To obtain the observations at the given indices
in their native type, use getobs
.
See ObsView
for more information.
MLUtils.ObsView
— TypeObsView(data, [indices])
Describes a subset of data
of arbitrary type by storing which observation-indices the subset spans. Furthermore, subsequent subsettings are accumulated without needing to access actual data.
The purpose of ObsView
is to delay data access and movement until an actual batch of data (or single observation) is needed for some computation. This is particularly useful when the data is not located in memory, but on the hard drive or some remote location. In such a scenario one wants to load the required data only when needed.
Any data access is delayed until getindex
is called, and even getindex
returns the result of obsview
which in general avoids data movement until getobs
is called. If used as an iterator, the view will iterate over the dataset once, effectively denoting an epoch. Each iteration will return a lazy subset to the current observation.
Arguments
data
: The object describing the dataset. Can be of any type as long as it implements getobs
and numobs
(see Details for more information).
indices
: Optional. The index or indices of the observation(s) in data
that the subset should represent. Can be of type Int
or some subtype of AbstractVector
.
Methods
getindex
: Returns the observation(s) of the given index/indices. No data is copied aside from the required indices.
numobs
: Returns the total number of observations in the subset.
getobs
: Returns the underlying data that the ObsView
represents at the given relative indices. Note that these indices are in "subset space", and in general will not directly correspond to the same indices in the underlying data set.
Details
For ObsView
to work on some data structure, the desired type MyType
must implement the following interface:
getobs(data::MyType, idx)
: Should return the observation(s) indexed by idx
. In what form is up to the user. Note that idx
can be of type Int
or AbstractVector
.
numobs(data::MyType)
: Should return the total number of observations in data
getobs(data::MyType)
: By default this function is the identity function. If that is not the behaviour that you want for your type, you need to provide this method as well.
obsview(data::MyType, idx)
: If your custom type has its own kind of subset type, you can return it here. An example of such a case is SubArray
for representing a subset of some AbstractArray
.
getobs!(buffer, data::MyType, [idx])
: In-place version of getobs(data, idx)
. If this method is provided for MyType
, then eachobs
can preallocate a buffer that is then reused every iteration. Note: buffer
should be equivalent to the return value of getobs(::MyType, ...)
, since this is how buffer
is preallocated by default.
X, Y = MLUtils.load_iris()
# The iris set has 150 observations and 4 features
@assert size(X) == (4,150)
@@ -333,7 +333,7 @@
end
# Indexing: take first 10 observations
-x, y = ObsView((X, Y))[1:10]
MLUtils.ones_like
— Functionones_like(x, [element_type=eltype(x)], [dims=size(x)])
Create an array with the given element type and size, based upon the given source array x
. All elements of the new array will be set to 1. The second and third arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.
See also zeros_like
and fill_like
.
Examples
julia> x = rand(Float32, 2)
+x, y = ObsView((X, Y))[1:10]
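The custom-container interface described for ObsView (a type only needs `numobs` and `getobs`) can be illustrated with a small sketch. The `Squares` type below is hypothetical and exists only for this example; it assumes MLUtils is installed and that extending `MLUtils.numobs` and `MLUtils.getobs` is the intended way to opt in.

```julia
using MLUtils

# Hypothetical container: observation i is the square of i, computed on demand.
struct Squares
    n::Int
end

# The two required methods of the interface.
MLUtils.numobs(d::Squares) = d.n
MLUtils.getobs(d::Squares, i::Int) = i^2
MLUtils.getobs(d::Squares, idx::AbstractVector) = [i^2 for i in idx]

subset = ObsView(Squares(10), 3:5)  # stores the indices only; nothing is computed yet
n = numobs(subset)                  # number of observations in the subset
vals = getobs(subset)               # materializes the selected observations
```

Because the view keeps only the indices, no observation is computed until `getobs` is called.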
MLUtils.ones_like
— Functionones_like(x, [element_type=eltype(x)], [dims=size(x)])
Create an array with the given element type and size, based upon the given source array x
. All elements of the new array will be set to 1. The second and third arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.
See also zeros_like
and fill_like
.
Examples
julia> x = rand(Float32, 2)
2-element Vector{Float32}:
0.8621633
0.5158395
@@ -354,7 +354,7 @@
julia> ones_like(x, Float64)
2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
1.0 1.0
- 1.0 1.0
MLUtils.oversample
— Functionoversample(data, classes; fraction=1, shuffle=true)
+ 1.0 1.0
MLUtils.oversample
— Functionoversample(data, classes; fraction=1, shuffle=true)
oversample(data::Tuple; fraction=1, shuffle=true)
Generate a re-balanced version of data
by repeatedly sampling existing observations in such a way that every class will have at least fraction
times the number of observations of the largest class in classes
. This way, all classes will have a minimum number of observations in the resulting data set relative to what the largest class has in the given (original) data
.
By default (fraction = 1
) the resulting dataset will be near perfectly balanced. On the other hand, with fraction = 0.5
every class in the resulting data will have at least 50% as many observations as the largest class.
The classes
input is an array with the same length as numobs(data)
. shuffle
determines if the resulting data will be shuffled after its creation; if it is not shuffled then all the repeated samples will be together at the end, sorted by class. Defaults to true
.
# 6 observations with 3 features each
X = rand(3, 6)
# 2 classes, severely imbalanced
@@ -392,7 +392,7 @@
5 │ 0.376304 0.100022 a
6 │ 0.427064 0.0648339 a
7 │ 0.427064 0.0648339 a
- 8 │ 0.457043 0.490688 b
See ObsView
for more information on data subsets. See also undersample
.
MLUtils.randobs
— Functionrandobs(data, [n])
Pick a random observation, or a batch of n
random observations from data
. For this function to work, the type of data
must implement numobs
and getobs
.
MLUtils.rpad_constant
— Functionrpad_constant(v::AbstractArray, n::Union{Integer, Tuple}, val = 0; dims=:)
Return v padded with val
along the dimensions dims
up to a maximum length in each direction specified by n
.
julia> rpad_constant([1, 2], 4, -1) # padding with -1 up to size 4
+ 8 │ 0.457043 0.490688 b
See ObsView
for more information on data subsets. See also undersample
.
MLUtils.randobs
— Functionrandobs(data, [n])
Pick a random observation, or a batch of n
random observations from data
. For this function to work, the type of data
must implement numobs
and getobs
.
MLUtils.rpad_constant
— Functionrpad_constant(v::AbstractArray, n::Union{Integer, Tuple}, val = 0; dims=:)
Return v padded with val
along the dimensions dims
up to a maximum length in each direction specified by n
.
julia> rpad_constant([1, 2], 4, -1) # padding with -1 up to size 4
4-element Vector{Int64}:
1
2
@@ -417,17 +417,17 @@
1 2
3 4
0 0
- 0 0
MLUtils.shuffleobs
— Functionshuffleobs([rng], data)
Return a subset of data
that spans all observations, but has the order of the observations shuffled.
The observations in data
itself are not copied. Instead only the indices are shuffled. This function calls obsview
to accomplish that, which means that the return value is likely of a different type than data
.
# For Arrays the subset will be of type SubArray
+ 0 0
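For the one-dimensional case, the right-padding described for rpad_constant can be sketched in plain Julia. The helper `rpad_vec` is hypothetical and is not the library's implementation; in particular, returning a copy when the vector is already long enough is an assumption of this sketch.

```julia
# Hypothetical stand-alone sketch of right-padding a vector up to length n.
function rpad_vec(v::AbstractVector, n::Integer, val=0)
    # Nothing to pad when v already has at least n elements (assumed behaviour).
    n <= length(v) && return copy(v)
    padding = fill(convert(eltype(v), val), n - length(v))
    return vcat(v, padding)
end

padded = rpad_vec([1, 2], 4, -1)
```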
MLUtils.shuffleobs
— Functionshuffleobs([rng], data)
Return a subset of data
that spans all observations, but has the order of the observations shuffled.
The observations in data
itself are not copied. Instead only the indices are shuffled. This function calls obsview
to accomplish that, which means that the return value is likely of a different type than data
.
# For Arrays the subset will be of type SubArray
@assert typeof(shuffleobs(rand(4,10))) <: SubArray
# Iterate through all observations in random order
for x in eachobs(shuffleobs(X))
...
-end
The optional argument rng
allows one to specify the random number generator used for shuffling. This is useful when reproducible results are desired. By default, uses the global RNG. See Random
in Julia's standard library for more info.
For this function to work, the type of data
must implement numobs
and getobs
. See ObsView
for more information.
MLUtils.splitobs
— Functionsplitobs(n::Int; at) -> Tuple
Compute the indices for two or more disjoint subsets of the range 1:n
with splits given by at
.
julia> splitobs(100, at=0.7)
+end
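The idea that only the indices are shuffled, while the data stays in place, can be sketched with Base and the Random standard library alone. This is an illustration of the principle for a plain matrix, not the code shuffleobs runs; the seed value is arbitrary.

```julia
using Random

X = collect(reshape(1:12, 3, 4))    # 3 features, 4 observations (columns)
rng = MersenneTwister(1)            # seeded generator for reproducible shuffling
perm = shuffle(rng, 1:size(X, 2))   # shuffle the observation indices only
Xs = view(X, :, perm)               # lazy view over X; no observation is copied
```

`Xs` is a `SubArray` whose parent is still `X`, which matches the note that the return type differs from the input type.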
The optional argument rng
allows one to specify the random number generator used for shuffling. This is useful when reproducible results are desired. By default, uses the global RNG. See Random
in Julia's standard library for more info.
For this function to work, the type of data
must implement numobs
and getobs
. See ObsView
for more information.
MLUtils.splitobs
— Functionsplitobs(n::Int; at) -> Tuple
Compute the indices for two or more disjoint subsets of the range 1:n
with splits given by at
.
julia> splitobs(100, at=0.7)
(1:70, 71:100)
julia> splitobs(100, at=(0.1, 0.4))
-(1:10, 11:50, 51:100)
splitobs(data; at, shuffle=false) -> Tuple
Split the data
into two or more subsets. When at
is a number (between 0 and 1) this specifies the proportion in the first subset. When at
is a tuple, each entry specifies the proportion in a subset, with the last having 1-sum(at)
. In all there are length(at)+1
subsets returned.
If shuffle=true
, randomly permute the observations before splitting.
Supports any data type implementing the numobs
and getobs
interfaces – including arrays, tuples & NamedTuples of arrays.
julia> splitobs(permutedims(1:100); at=0.7) # simple 70%-30% split, of a matrix
+(1:10, 11:50, 51:100)
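One way to see how proportions in `at` become index ranges is the sketch below. The helper `split_ranges` is hypothetical and uses simple cumulative rounding; the library's own rounding rules may differ in edge cases, so treat it as an illustration of the two examples above only.

```julia
# Hypothetical helper: turn proportions `at` into consecutive index ranges over 1:n.
function split_ranges(n::Int, at)
    props = at isa Real ? [at] : collect(at)
    stops = round.(Int, cumsum(props) .* n)  # last index of each leading subset
    starts = [1; stops .+ 1]                 # each subset starts after the previous stop
    stops = [stops; n]                       # the final subset takes the remainder
    return Tuple(starts[i]:stops[i] for i in eachindex(starts))
end

two_way = split_ranges(100, 0.7)
three_way = split_ranges(100, (0.1, 0.4))
```

The last range always covers whatever is left, which corresponds to the `1-sum(at)` share.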
splitobs(data; at, shuffle=false) -> Tuple
Split the data
into two or more subsets. When at
is a number (between 0 and 1) this specifies the proportion in the first subset. When at
is a tuple, each entry specifies the proportion in a subset, with the last having 1-sum(at)
. In all there are length(at)+1
subsets returned.
If shuffle=true
, randomly permute the observations before splitting.
Supports any data type implementing the numobs
and getobs
interfaces – including arrays, tuples & NamedTuples of arrays.
julia> splitobs(permutedims(1:100); at=0.7) # simple 70%-30% split, of a matrix
([1 2 … 69 70], [71 72 … 99 100])
julia> data = (x=ones(2,10), n=1:10) # a NamedTuple, consistent last dimension
@@ -439,13 +439,13 @@
julia> train, test = splitobs((permutedims(1.0:100.0), 101:200), at=0.7, shuffle=true); # split a Tuple
julia> vec(test[1]) .+ 100 == test[2]
-true
Missing docstring for stack
. Check Documenter's build log for details.
MLUtils.unbatch
— Function
MLUtils.undersample
— Functionundersample(data, classes; shuffle=true)
Generate a class-balanced version of data
by subsampling its observations in such a way that the resulting number of observations will be the same number for every class. This way, all classes will have as many observations in the resulting data set as the smallest class has in the given (original) data
.
The convenience parameter shuffle
determines if the resulting data will be shuffled after its creation; if it is not shuffled then all the observations will be in their original order. Defaults to true
.
The output will contain both the resampled data and classes.
# 6 observations with 3 features each
+ [7, 8]
MLUtils.undersample
— Functionundersample(data, classes; shuffle=true)
Generate a class-balanced version of data
by subsampling its observations in such a way that the resulting number of observations will be the same number for every class. This way, all classes will have as many observations in the resulting data set as the smallest class has in the given (original) data
.
The convenience parameter shuffle
determines if the resulting data will be shuffled after its creation; if it is not shuffled then all the observations will be in their original order. Defaults to true
.
The output will contain both the resampled data and classes.
# 6 observations with 3 features each
X = rand(3, 6)
# 2 classes, severely imbalanced
Y = ["a", "b", "b", "b", "b", "a"]
@@ -478,7 +478,7 @@
1 │ 0.427064 0.0648339 a
2 │ 0.376304 0.100022 a
3 │ 0.467095 0.185437 b
- 4 │ 0.457043 0.490688 b
See ObsView
for more information on data subsets. See also oversample
.
MLUtils.unsqueeze
— Functionunsqueeze(x; dims)
Return x
reshaped into an array with one more dimension than x
, where dims
indicates in which dimension x
is extended. dims
can be an integer between 1 and ndims(x)+1
.
Examples
julia> unsqueeze([1 2; 3 4], dims=2)
+ 4 │ 0.457043 0.490688 b
See ObsView
for more information on data subsets. See also oversample
.
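The balancing idea behind undersample can be sketched as an index selection. The helper `undersample_indices` below is hypothetical: it deterministically keeps the first observations of each class, whereas the library may choose which observations to keep differently (for example at random), so only the resulting class counts should be read as representative.

```julia
# Hypothetical sketch: choose indices so that every class keeps as many
# observations as the smallest class, preserving the original order.
function undersample_indices(classes)
    groups = Dict{eltype(classes), Vector{Int}}()
    for (i, c) in enumerate(classes)
        push!(get!(groups, c, Int[]), i)     # collect the indices of each class
    end
    nmin = minimum(length, values(groups))   # size of the smallest class
    kept = reduce(vcat, [idx[1:nmin] for idx in values(groups)])
    return sort!(kept)                       # restore the original ordering
end

Y = ["a", "b", "b", "b", "b", "a"]
idx = undersample_indices(Y)
```

Indexing both the data and `Y` with `idx` then yields two observations per class for this input.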
MLUtils.unsqueeze
— Functionunsqueeze(x; dims)
Return x
reshaped into an array with one more dimension than x
, where dims
indicates in which dimension x
is extended. dims
can be an integer between 1 and ndims(x)+1
.
Examples
julia> unsqueeze([1 2; 3 4], dims=2)
2×1×2 Array{Int64, 3}:
[:, :, 1] =
1
@@ -497,13 +497,13 @@
julia> unsqueeze(xs, dims=1)
1×3 Matrix{Vector{Int64}}:
- [1, 2] [3, 4] [5, 6]
unsqueeze(; dims)
Returns a function which, acting on an array, inserts a dimension of size 1 at dims
.
Examples
julia> rand(21, 22, 23) |> unsqueeze(dims=2) |> size
-(21, 1, 22, 23)
MLUtils.unstack
— Function
unsqueeze(; dims)
Returns a function which, acting on an array, inserts a dimension of size 1 at dims
.
Examples
julia> rand(21, 22, 23) |> unsqueeze(dims=2) |> size
+(21, 1, 22, 23)
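Inserting a dimension of size 1 amounts to a reshape. The sketch below uses only Base; `unsqueeze_sketch` is a hypothetical name and not the library function, though it reproduces the sizes shown in the examples above.

```julia
# Hypothetical sketch: insert a singleton dimension at position `dims` via reshape.
function unsqueeze_sketch(x::AbstractArray; dims::Int)
    sz = size(x)
    # Sizes before `dims`, then the new singleton, then the remaining sizes.
    return reshape(x, sz[1:dims-1]..., 1, sz[dims:end]...)
end

a = unsqueeze_sketch([1 2; 3 4]; dims=2)
b = unsqueeze_sketch(rand(21, 22, 23); dims=2)
```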
MLUtils.unstack
— Function
MLUtils.zeros_like
— Functionzeros_like(x, [element_type=eltype(x)], [dims=size(x)])
Create an array with the given element type and size, based upon the given source array x
. All elements of the new array will be set to 0. The second and third arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.
See also ones_like
and fill_like
.
Examples
julia> x = rand(Float32, 2)
+ [7, 8]
MLUtils.zeros_like
— Functionzeros_like(x, [element_type=eltype(x)], [dims=size(x)])
Create an array with the given element type and size, based upon the given source array x
. All elements of the new array will be set to 0. The second and third arguments are both optional, defaulting to the given array's eltype and size. The dimensions may be specified as an integer or as a tuple argument.
See also ones_like
and fill_like
.
Examples
julia> x = rand(Float32, 2)
2-element Vector{Float32}:
0.4005432
0.36934233
@@ -524,4 +524,4 @@
julia> zeros_like(x, Float64)
2×2 CuArray{Float64, 2, CUDA.Mem.DeviceBuffer}:
0.0 0.0
- 0.0 0.0
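The optional element type and size arguments of zeros_like and ones_like can be mimicked with Base's `similar` and `fill!`. The `*_sketch` functions below are hypothetical stand-ins written for this illustration and are not claimed to be the library's code.

```julia
# Hypothetical stand-ins built on Base's `similar` and `fill!`.
fill_like_sketch(x, val, T::Type=eltype(x), dims=size(x)) = fill!(similar(x, T, dims), val)
zeros_like_sketch(x, args...) = fill_like_sketch(x, 0, args...)
ones_like_sketch(x, args...) = fill_like_sketch(x, 1, args...)

x = rand(Float32, 2)
z = zeros_like_sketch(x)                   # same eltype and size as x
o = ones_like_sketch(x, Float64)           # different eltype, same size
m = zeros_like_sketch(x, Float64, (2, 2))  # different eltype and size
```

Because `similar` dispatches on the array type of `x`, the same pattern also yields arrays of the source's container type.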
MLUtils.Datasets.load_iris
— Functionload_iris() -> X, y, names
Loads the first 150 observations from the Iris flower data set introduced by Ronald Fisher (1936). The 4 by 150 matrix X
contains the numeric measurements, in which each individual column denotes an observation. The vector y
contains the class labels as strings. The vector names
contains the names of the features (i.e. rows of X
).
[1] Fisher, Ronald A. "The use of multiple measurements in taxonomic problems." Annals of eugenics 7.2 (1936): 179-188.
MLUtils.Datasets.make_sin
— Functionmake_sin(n, start, stop; noise = 0.3, f_rand = randn) -> x, y
Generates n
noisy, equally spaced samples of a sine wave from start
to stop
by adding noise .* f_rand(length(x))
to the result of sin(x)
.
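A minimal sketch of the construction described for make_sin, assuming the grid is built with `range` and the noise is added elementwise. `make_sin_sketch` is a hypothetical name; the keyword defaults are copied from the signature above.

```julia
# Hypothetical sketch: n equally spaced points with noisy sine responses.
function make_sin_sketch(n, start, stop; noise=0.3, f_rand=randn)
    x = collect(range(start, stop; length=n))  # equally spaced inputs
    y = sin.(x) .+ noise .* f_rand(length(x))  # sine plus scaled random noise
    return x, y
end

x, y = make_sin_sketch(5, 0, π; noise=0)       # noise=0 gives the clean sine values
```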
MLUtils.Datasets.make_spiral
— Functionmake_spiral(n, a, theta, b; noise = 0.01, f_rand = randn) -> x, y
Generates n
noisy responses for a spiral with two labels. Uses the radius, angle and scaling arguments to space the points in 2D space and adds noise .* f_rand(n)
to the response.
MLUtils.Datasets.make_poly
— Functionmake_poly(coef, x; noise = 0.01, f_rand = randn) -> x, y
Generates a noisy response for a polynomial of degree length(coef) - 1
using the vector x
as input and adding noise .* f_rand(length(x))
to the result. The vector coef
contains the coefficients for the terms of the polynomial. The first element of coef
denotes the coefficient for the term with the highest degree, while the last element of coef
denotes the intercept.
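The coefficient ordering (highest degree first, intercept last) can be made concrete with Base's `evalpoly`, which expects the opposite order, hence the `reverse`. `make_poly_sketch` is a hypothetical name for this illustration, with keyword defaults taken from the signature above.

```julia
# Hypothetical sketch: evaluate a polynomial whose coefficients are given
# highest degree first, then add scaled random noise.
function make_poly_sketch(coef, x; noise=0.01, f_rand=randn)
    # evalpoly wants the lowest-degree coefficient first, so reverse coef.
    y = [evalpoly(xi, reverse(coef)) for xi in x]
    return x, y .+ noise .* f_rand(length(x))
end

# coef = [2, 0, 1] encodes 2x^2 + 0x + 1; noise=0 gives the exact values.
xs, ys = make_poly_sketch([2.0, 0.0, 1.0], [0.0, 1.0, 2.0]; noise=0)
```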
This document was generated with Documenter.jl version 0.27.12 on Sunday 31 December 2023. Using Julia version 1.7.3.