CPU Kernel Tests #1439
base: repo-refactor
Conversation
Reviewed 1 of 23 files at r1, 21 of 29 files at r2, 12 of 12 files at r3, all commit messages.
Reviewable status: all files reviewed, 18 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)
lib/kernels/include/kernels/cast_kernels_cpu.h
line 15 at r3 (raw file):
GenericTensorAccessorW const &output, DataType input_type, DataType output_type);
Prefer function names over extra namespaces; the namespaces in kernels are more of a legacy holdover than something that should be used more
Suggestion:
void cpu_forward_kernel(GenericTensorAccessorR const &input,
GenericTensorAccessorW const &output,
DataType input_type,
DataType output_type);
lib/kernels/src/local_cpu_allocator.cc
line 6 at r3 (raw file):
namespace FlexFlow { void *LocalCPUAllocator::allocate(size_t requested_memory_size) { void *ptr = calloc(1, requested_memory_size);
Just use malloc
Code quote:
calloc(
lib/kernels/src/local_cpu_allocator.cc
line 28 at r3 (raw file):
LocalCPUAllocator::~LocalCPUAllocator() { for (auto ptr : ptrs) {
Suggestion:
for (void *ptr : this->ptrs) {
lib/kernels/src/local_cuda_allocator.cc
line 8 at r3 (raw file):
void *ptr; checkCUDA(cudaMalloc(&ptr, requested_memory_size)); checkCUDA(cudaMemset(ptr, 0, requested_memory_size));
I don't think code should assume that the allocated memory has been zero'd, unless there's some reason I'm not thinking of?
lib/kernels/src/cpu/reverse_kernels.cc
line 14 at r3 (raw file):
coord_t in_blk_size) { coord_t total_elements = num_out_blks * reverse_dim_size * in_blk_size; for (coord_t i = 0; i < total_elements; ++i) {
This is really hard to read right now, make more readable (maybe pull out some helper functions or something?)
lib/kernels/src/cuda/ops/reverse_kernels.cu
line 41 at r3 (raw file):
// } /* I mentioned this earlier, but I still think the reverse_forward_kernel code
See my message in slack
lib/kernels/test/src/test_batch_norm_kernel.cc
line 91 at r3 (raw file):
std::vector<float> host_bias_grad_data = load_accessor_data<DataType::FLOAT>( read_only_accessor_from_write_accessor(bias_grad_accessor));
Why is read_only_accessor_from_write_accessor necessary here? Aren't GenericTensorWs assumed to be RW, @reyna-abhyankar? If so, add an overload to load_accessor_data
lib/kernels/test/src/test_combine_kernel.cc
line 64 at r3 (raw file):
GenericTensorAccessorR input_accessor_gpu = read_only_accessor_from_write_accessor( create_random_filled_accessor_w(input_shape, gpu_allocator));
Why not also define a create_random_filled_accessor_r?
Code quote:
create_random_filled_accessor_w
lib/kernels/test/src/test_utils.h
line 11 at r3 (raw file):
#include <random> enum class GpuDirection {
If we're going to have GenericTensorAccessors allocated on different devices it would probably be good to have a field on GenericTensorAccessor that tracks that, so it can at least be checked at runtime if you try to access it incorrectly, etc.?
This would also allow you to infer a lot of locations (CPU vs GPU) rather than having to pass additional arguments to the transfer functions
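For illustration only, a minimal sketch of the kind of device tag and runtime guard being suggested here; the enum name and member layout are assumptions, not the existing accessor API:
#include <stdexcept>

enum class DeviceType { CPU, GPU };

struct DeviceTaggedAccessorSketch {
  void *ptr = nullptr;
  DeviceType device_type = DeviceType::GPU; // hypothetical field recording where ptr lives

  // Host code may only dereference CPU-resident data; anything else fails loudly at runtime.
  float *get_float_ptr_checked() const {
    if (device_type != DeviceType::CPU) {
      throw std::runtime_error("tensor data is not resident on the CPU");
    }
    return static_cast<float *>(ptr);
  }
};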
lib/kernels/test/src/test_utils.h
line 12 at r3 (raw file):
enum class GpuDirection { HostToDevice = 0,
Why are concrete values needed?
lib/kernels/test/src/test_utils.h
line 65 at r3 (raw file):
template <typename IDT, typename ODT, typename F> GenericTensorAccessorW create_transformed_accessor_w(TensorShape const &shape,
It seems like this is just being used to create a random tensor, seems like that might be a better behavior?
lib/kernels/test/src/test_utils.h
line 86 at r3 (raw file):
} template <DataType DT>
Where possible, probably best to add a dynamically-dispatched function as well as one that takes the argument by template. This isn't possible for some functions (such as those that return a type based on the template type), but for some of these in this function it seems possible.
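A rough sketch of the pattern being asked for: keep the statically-typed template entry point, and add a thin wrapper that takes the DataType as an ordinary runtime argument. The names below are illustrative stand-ins, not the project's DataTypeDispatch helpers:
#include <cstddef>
#include <cstdint>
#include <stdexcept>

enum class DataType { FLOAT, DOUBLE, INT32 };

// Statically-typed version: for callers that know the element type at compile time.
template <typename T>
void fill_buffer(void *ptr, size_t num_elements, double value) {
  T *typed = static_cast<T *>(ptr);
  for (size_t i = 0; i < num_elements; i++) {
    typed[i] = static_cast<T>(value);
  }
}

// Dynamically-dispatched version: same behavior, DataType passed as a normal argument.
void fill_buffer(DataType dt, void *ptr, size_t num_elements, double value) {
  switch (dt) {
    case DataType::FLOAT:  fill_buffer<float>(ptr, num_elements, value); break;
    case DataType::DOUBLE: fill_buffer<double>(ptr, num_elements, value); break;
    case DataType::INT32:  fill_buffer<int32_t>(ptr, num_elements, value); break;
    default: throw std::runtime_error("unhandled DataType");
  }
}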
lib/kernels/test/src/test_utils.h
line 88 at r3 (raw file):
template <DataType DT> GenericTensorAccessorW copy_tensor_between_memories(GenericTensorAccessorR accessor,
The behavior of this method seems rather confusing; I'd stick to explicitly stating the direction you'd like to transfer (either from Host to Device, or reversed).
lib/kernels/test/src/test_utils.h
line 89 at r3 (raw file):
GenericTensorAccessorW copy_tensor_between_memories(GenericTensorAccessorR accessor, TensorShape const &shape,
GenericTensorAccessor already has a shape, no need to pass an additional shape separately
lib/kernels/test/src/test_utils.h
line 104 at r3 (raw file):
} template <DataType DT>
Pass as a normal argument, not a template argument
lib/kernels/test/src/test_utils.h
line 116 at r3 (raw file):
template <DataType DT> std::vector<real_type<DT>> load_accessor_data(GenericTensorAccessorR accessor, bool on_host = false) {
Change to an enum and make passing required/copy
Code quote:
bool on_host = false
lib/kernels/test/src/test_utils.cc
line 8 at r3 (raw file):
GenericTensorAccessorW accessor = allocator.allocate_tensor(shape); size_t volume = accessor.shape.num_elements(); std::vector<float> host_data(volume);
Can't TensorShape have different datatypes? If you're going to only do this for float you should at the very least have a check
lib/kernels/test/src/test_utils.cc
line 17 at r3 (raw file):
} transfer_memory(static_cast<float *>(accessor.ptr),
Would it be better to have a function that transfers a GenericTensorAccessor, rather than having to do casting to transfer each time? That also saves you from having to pass volume, etc.
Reviewable status: 2 of 50 files reviewed, 18 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)
lib/kernels/include/kernels/cast_kernels_cpu.h
line 15 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Prefer function names over extra namespaces; the namespaces in kernels are more of a legacy holdover than something that should be used more
Done.
lib/kernels/src/local_cpu_allocator.cc
line 6 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Just use malloc
Done.
lib/kernels/src/local_cpu_allocator.cc
line 28 at r3 (raw file):
LocalCPUAllocator::~LocalCPUAllocator() { for (auto ptr : ptrs) {
Done.
lib/kernels/src/local_cuda_allocator.cc
line 8 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I don't think code should assume that the allocated memory has been zero'd, unless there's some reason I'm not thinking of?
Ended up adding an allocate_and_zero function to the Allocator class, not sure if this is too big of a change though?
lib/kernels/src/cpu/reverse_kernels.cc
line 14 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
This is really hard to read right now, make more readable (maybe pull out some helper functions or something?)
Is this "too" broken down?
lib/kernels/src/cuda/ops/reverse_kernels.cu
line 41 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
See my message in slack
Currently talking to Xinhao
lib/kernels/test/src/test_batch_norm_kernel.cc
line 91 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is read_only_accessor_from_write_accessor necessary here? Aren't GenericTensorWs assumed to be RW, @reyna-abhyankar? If so, add an overload to load_accessor_data
Done.
lib/kernels/test/src/test_combine_kernel.cc
line 64 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not also define a create_random_filled_accessor_r?
Done.
lib/kernels/test/src/test_utils.h
line 11 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
If we're going to have GenericTensorAccessors allocated on different devices it would probably be good to have a field on GenericTensorAccessor that tracks that, so it can at least be checked at runtime if you try to access it incorrectly, etc.? This would also allow you to infer a lot of locations (CPU vs GPU) rather than having to pass additional arguments to the transfer functions
Ideally, this would use the same enum class as in allocation.h, but for some reason, whenever I define that class in accessor.h and try to include it as part of the class I get a bunch of compile errors? I think it has to do with FF_VISITABLE_STRUCT_NONSTANDARD_CONSTRUCTION? Is there some rule I'm not aware of about what can and can't be passed in?
I also defined an initializer list, since a lot of places in the code base reference GenericTensorAccessors and initialize them using a three-parameter constructor (i.e. {datatype, shape, pointer}), so I needed to add a way to default the tensor location. Is it safe to say that previous references are on GPU, or do I need to sift through this manually?
In addition, allocators also now store additional state on whether they allocate things on CPU or GPU, so we can initialize the values for GenericTensorAccessors.
lib/kernels/test/src/test_utils.h
line 12 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why are concrete values needed?
Done.
lib/kernels/test/src/test_utils.h
line 65 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
It seems like this is just being used to create a random tensor, seems like that might be a better behavior?
This is mainly for debugging purposes, so you can create, e.g., an iota-filled accessor, etc.
lib/kernels/test/src/test_utils.h
line 88 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
The behavior of this method seems rather confusing; I'd stick to explicitly stating the direction you'd like to transfer (either from Host to Device, or reversed).
Done.
lib/kernels/test/src/test_utils.h
line 89 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
GenericTensorAccessor already has a shape, no need to pass an additional shape separately
Done.
lib/kernels/test/src/test_utils.h
line 104 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Pass as a normal argument, not a template argument
Done.
lib/kernels/test/src/test_utils.h
line 116 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Change to an enum and make passing required/copy
Just removed entirely
lib/kernels/test/src/test_utils.cc
line 8 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Can't TensorShape have different datatypes? If you're going to only do this for float you should at the very least have a check
Done.
lib/kernels/test/src/test_utils.cc
line 17 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Would it be better to have a function that transfers a GenericTensorAccessor, rather than having to do casting to transfer each time? That also saves you from having to pass volume, etc.
Done.
Reviewed 48 of 48 files at r4, all commit messages.
Reviewable status: all files reviewed, 35 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)
lib/kernels/include/kernels/accessor.h
line 31 at r4 (raw file):
double *get_double_ptr() const; half *get_half_ptr() const;
Code snippet:
GenericTensorAccessorW() = delete;
lib/kernels/include/kernels/accessor.h
line 33 at r4 (raw file):
GenericTensorAccessorW(DataType dt, ArrayShape sh,
Suggestion:
ArrayShape const &sh,
lib/kernels/include/kernels/accessor.h
line 34 at r4 (raw file):
GenericTensorAccessorW(DataType dt, ArrayShape sh, req<void *> p,
Suggestion:
void *p,
lib/kernels/include/kernels/accessor.h
line 35 at r4 (raw file):
ArrayShape sh, req<void *> p, bool on_dev = true)
Suggestion:
GenericTensorAccessorW(DataType data_type,
ArrayShape shape,
req<void *> ptr,
bool on_dev = true)
lib/kernels/include/kernels/accessor.h
line 36 at r4 (raw file):
req<void *> p, bool on_dev = true) : data_type(dt), shape(sh), ptr(p), on_device(on_dev) {}
Move to .cc file
lib/kernels/include/kernels/accessor.h
line 41 at r4 (raw file):
DataType data_type; ArrayShape shape; req<void *> ptr;
Suggestion:
void * ptr;
lib/kernels/include/kernels/accessor.h
line 42 at r4 (raw file):
ArrayShape shape; req<void *> ptr; bool on_device;
Suggestion:
DeviceType device_type;
lib/kernels/include/kernels/accessor.h
line 43 at r4 (raw file):
req<void *> ptr; bool on_device; };
Suggestion:
bool operator==(GenericTensorAccessorW const &) const;
bool operator!=(GenericTensorAccessorW const &) const;
public:
DataType data_type;
ArrayShape shape;
req<void *> ptr;
bool on_device;
private:
std::tuple<decltype(data_type) const &,
decltype(shape) const &,
decltype(ptr) const &,
decltype(on_device) const &>
tie() const;
};
// in .cc file
std::tuple<DataType const &,
           ArrayShape const &,
           req<void *> const &,
           bool const &>
GenericTensorAccessorW::tie() const {
return std::tie(this->data_type, this->shape, this->ptr, this->on_device);
}
bool GenericTensorAccessorW::operator==(GenericTensorAccessorW const &other) const {
return this->tie() == other.tie();
}
bool GenericTensorAccessorW::operator!=(GenericTensorAccessorW const &other) const {
return this->tie() != other.tie();
}
lib/kernels/include/kernels/accessor.h
line 45 at r4 (raw file):
}; FF_VISITABLE_STRUCT_NONSTANDARD_CONSTRUCTION( GenericTensorAccessorW, data_type, shape, ptr, on_device);
visitable is deprecated
lib/kernels/include/kernels/accessor.h
line 64 at r4 (raw file):
double const *get_double_ptr() const; half const *get_half_ptr() const;
Code snippet:
GenericTensorAccessorR() = delete;
lib/kernels/include/kernels/accessor.h
line 65 at r4 (raw file):
half const *get_half_ptr() const; GenericTensorAccessorR(DataType dt,
Move to .cc file
lib/kernels/include/kernels/accessor.h
line 77 at r4 (raw file):
bool on_device; }; FF_VISITABLE_STRUCT_NONSTANDARD_CONSTRUCTION(
Make same tie changes as above to add ==, !=, hash back in (they're currently added by visitable)
lib/kernels/include/kernels/allocation.h
line 14 at r4 (raw file):
struct IAllocator { virtual void *allocate(size_t) = 0; virtual void *allocate_and_zero(size_t) = 0;
Delete, if you want to zero the allocation you can zero it with some other function, don't make it part of the allocator interface
lib/kernels/include/kernels/allocation.h
line 16 at r4 (raw file):
virtual void *allocate_and_zero(size_t) = 0; virtual void deallocate(void *) = 0;
Code snippet:
virtual DeviceType get_allocation_device_type() = 0;
lib/kernels/include/kernels/allocation.h
line 25 at r4 (raw file):
GenericTensorAccessorW allocate_tensor(TensorShape const &tensor_shape); GenericTensorAccessorW allocate_tensor_and_zero(TensorShape const &tensor_shape);
Delete
lib/kernels/include/kernels/allocation.h
line 28 at r4 (raw file):
void *allocate(size_t mem_size); void *allocate_and_zero(size_t mem_size);
Delete
lib/kernels/include/kernels/allocation.h
line 40 at r4 (raw file):
Allocator(std::shared_ptr<IAllocator> ptr) : i_allocator(ptr){}; AllocLocation alloc_location;
Delete
lib/kernels/include/kernels/local_cpu_allocator.h
line 13 at r4 (raw file):
void *allocate(size_t) override; void *allocate_and_zero(size_t) override;
Delete
lib/kernels/include/kernels/local_cuda_allocator.h
line 13 at r4 (raw file):
void *allocate(size_t) override; void *allocate_and_zero(size_t) override;
Delete
lib/kernels/src/array_shape.cc
line 63 at r4 (raw file):
} TensorShape get_tensor_shape(ArrayShape const &shape, DataType DT) {
Use the implementation from @reyna-abhyankar's branch: https://github.com/reyna-abhyankar/FlexFlow/blob/e1a8a01e9e4f782805aa62c163bf50799f0638b4/lib/kernels/src/array_shape.cc#L75-L78
lib/kernels/src/local_cpu_allocator.cc
line 5 at r4 (raw file):
namespace FlexFlow { void *LocalCPUAllocator::allocate(size_t requested_memory_size) {
Use the implementations from @reyna-abhyankar's branch: https://github.com/reyna-abhyankar/FlexFlow/blob/e1a8a01e9e4f782805aa62c163bf50799f0638b4/lib/local-execution/src/local_cpu_allocator.cc
lib/kernels/src/cpu/combine_kernels.cc
line 4 at r4 (raw file):
#include "kernels/datatype_dispatch.h" namespace FlexFlow {
FYI we're now on C++17, so you can instead write namespace FlexFlow::Kernels::Combine {
lib/kernels/src/cpu/reverse_kernels.cc
line 14 at r3 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Is this "too" broken down?
The comments are helpful, but I think the big thing that would help is the access-by-coordinate function for GenericTensorAccessor
lib/kernels/src/cuda/ops/reverse_kernels.cu
line 41 at r3 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Currently talking to Xinhao
Can you change your status on this comment then to "Working": see https://docs.reviewable.io/discussions.html#dispositions-and-resolution
lib/kernels/test/src/test_batch_norm_kernel.cc
line 36 at r4 (raw file):
GenericTensorAccessorW input_accessor = create_random_filled_accessor_w<DataType::FLOAT>(input_shape,
Why is the extra template parameter suddenly necessary?
lib/kernels/test/src/test_cast_kernel.cc
line 90 at r4 (raw file):
DataType::FLOAT, DataType::INT32); std::cout << "Before GPU load" << std::endl;
Delete
lib/kernels/test/src/test_cast_kernel.cc
line 103 at r4 (raw file):
DataType::FLOAT, DataType::INT32); std::cout << "Before CPU load" << std::endl;
Delete
lib/kernels/test/src/test_combine_kernel.cc
line 111 at r4 (raw file):
output_grad_accessor_gpu, cpu_allocator); GenericTensorAccessorW input_grad_accessor_cpu = cpu_allocator.allocate_tensor_and_zero(input_shape);
Or even add a helper function create_zero_filled_accessor_w that does these two operations
Suggestion:
GenericTensorAccessorW input_grad_accessor_cpu =
cpu_allocator.allocate_tensor(input_shape);
fill_with_zeros(input_grad_accessor_cpu);
lib/kernels/test/src/test_reverse_kernels.cc
line 147 at r4 (raw file):
// Run CPU Cast Backward Kernel GenericTensorAccessorW output_grad_accessor_cpu = copy_tensor_between_memories<DataType::FLOAT>(
Add an explicit destination (semantics should either be copy to CPU or copy to GPU, not dependent on the current placement)
Code quote:
copy_tensor_between_memories<DataType::FLOAT>(
lib/kernels/test/src/test_utils.h
line 11 at r3 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Ideally, this would use the same enum class as in allocation.h, but for some reason, whenever I define that class in accessor.h and try to include it as part of the class I get a bunch of compile errors? I think it has to do with FF_VISITABLE_STRUCT_NONSTANDARD_CONSTRUCTION? Is there some rule I'm not aware of about what can and can't be passed in?
I also defined an initializer list, since a lot of places in the code base reference GenericTensorAccessors and initialize them using a three-parameter constructor (i.e. {datatype, shape, pointer}), so I needed to add a way to default the tensor location. Is it safe to say that previous references are on GPU, or do I need to sift through this manually?
In addition, allocators also now store additional state on whether they allocate things on CPU or GPU, so we can initialize the values for GenericTensorAccessors.
There shouldn't be that much code explicitly creating GenericTensorAccessors I think? Can you point me to some examples?
lib/kernels/test/src/test_utils.h
line 65 at r3 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
This is mainly for debugging purposes, so can create like an iota filled accessor or etc.
Either specialize to a function that creates a random tensor, or pass the current tensor index to transform -- the current abstraction is a bit awkwardly intermediate
lib/kernels/test/src/test_utils.h
line 88 at r3 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Done.
Still seeing the old behavior? The destination placement should be stated either as part of the function name or as an argument--if it's an argument, rename the function to copy_tensor_to_memory
lib/kernels/test/src/test_utils.h
line 33 at r4 (raw file):
} template <DataType DT>
Why not just get the datatype from the shape?
lib/kernels/test/src/test_utils.cc
line 11 at r4 (raw file):
} TensorShape make_tensor_shape_from_legion_dims(FFOrdered<size_t> dims,
This doesn't seem correct, shouldn't this be taking a LegionOrdered?
Suggestion:
legion_dims(FFOrdered<size_t> const &dims,
lib/local-execution/include/local-execution/tracked_allocator.h
line 15 at r4 (raw file):
void *allocate(size_t) override; void *allocate_and_zero(size_t) override;
Delete
Reviewable status: 8 of 83 files reviewed, 35 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)
lib/kernels/include/kernels/accessor.h
line 31 at r4 (raw file):
double *get_double_ptr() const; half *get_half_ptr() const;
Done.
lib/kernels/include/kernels/accessor.h
line 33 at r4 (raw file):
GenericTensorAccessorW(DataType dt, ArrayShape sh,
Done.
lib/kernels/include/kernels/accessor.h
line 34 at r4 (raw file):
GenericTensorAccessorW(DataType dt, ArrayShape sh, req<void *> p,
Done.
lib/kernels/include/kernels/accessor.h
line 35 at r4 (raw file):
ArrayShape sh, req<void *> p, bool on_dev = true)
Done.
lib/kernels/include/kernels/accessor.h
line 36 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Move to .cc file
Done.
lib/kernels/include/kernels/accessor.h
line 41 at r4 (raw file):
DataType data_type; ArrayShape shape; req<void *> ptr;
Done.
lib/kernels/include/kernels/accessor.h
line 42 at r4 (raw file):
ArrayShape shape; req<void *> ptr; bool on_device;
Done.
lib/kernels/include/kernels/accessor.h
line 43 at r4 (raw file):
req<void *> ptr; bool on_device; };
Done.
lib/kernels/include/kernels/accessor.h
line 45 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
visitable is deprecated
Done.
lib/kernels/include/kernels/accessor.h
line 64 at r4 (raw file):
double const *get_double_ptr() const; half const *get_half_ptr() const;
Done.
lib/kernels/include/kernels/accessor.h
line 65 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Move to .cc file
Done.
lib/kernels/include/kernels/accessor.h
line 77 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Make same tie changes as above to add ==, !=, hash back in (they're currently added by visitable)
Done.
lib/kernels/include/kernels/allocation.h
line 14 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete, if you want to zero the allocation you can zero it with some other function, don't make it part of the allocator interface
Done.
lib/kernels/include/kernels/allocation.h
line 16 at r4 (raw file):
virtual void *allocate_and_zero(size_t) = 0; virtual void deallocate(void *) = 0;
Done.
lib/kernels/include/kernels/allocation.h
line 25 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete
Done.
lib/kernels/include/kernels/allocation.h
line 28 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete
Done.
lib/kernels/include/kernels/allocation.h
line 40 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete
Done.
lib/kernels/include/kernels/local_cpu_allocator.h
line 13 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete
Done.
lib/kernels/include/kernels/local_cuda_allocator.h
line 13 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete
Done.
lib/kernels/src/array_shape.cc
line 63 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Use the implementation from @reyna-abhyankar's branch: https://github.com/reyna-abhyankar/FlexFlow/blob/e1a8a01e9e4f782805aa62c163bf50799f0638b4/lib/kernels/src/array_shape.cc#L75-L78
Done.
lib/kernels/src/local_cpu_allocator.cc
line 5 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Use the implementations from @reyna-abhyankar's branch: https://github.com/reyna-abhyankar/FlexFlow/blob/e1a8a01e9e4f782805aa62c163bf50799f0638b4/lib/local-execution/src/local_cpu_allocator.cc
Done.
lib/kernels/src/cpu/combine_kernels.cc
line 4 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
FYI we're now on C++17, so you can instead write
namespace FlexFlow::Kernels::Combine {
Done
lib/kernels/src/cpu/reverse_kernels.cc
line 14 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
The comments are helpful, but I think the big thing that would help is the access-by-coordinate function for GenericTensorAccessor
Done.
lib/kernels/test/src/test_batch_norm_kernel.cc
line 36 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is the extra template parameter suddenly necessary?
I wanted create_random_filled_accessor to just return a random accessor filled with decimal values. Initially, I had it where it would always be a float, but realized from doing the cast_kernel operator that we could sometimes want it to be a double as well rather than float
lib/kernels/test/src/test_cast_kernel.cc
line 90 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete
Done.
lib/kernels/test/src/test_cast_kernel.cc
line 103 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete
Done.
lib/kernels/test/src/test_combine_kernel.cc
line 111 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Or even add a helper function create_zero_filled_accessor_w that does these two operations
Done.
lib/kernels/test/src/test_reverse_kernels.cc
line 147 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Add an explicit destination (semantics should either be copy to CPU or copy to GPU, not dependent on the current placement)
Done.
lib/kernels/test/src/test_utils.h
line 11 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
There shouldn't be that much code explicitly creating GenericTensorAccessors I think? Can you point me to some examples?
I guess you're right, not sure what I was seeing earlier... I could probably sift through all initialization of GenericTensorAccessor's and specify their DeviceType, but maybe our current behavior is fine with the initializer is fine anyways?
lib/kernels/test/src/test_utils.h
line 65 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Either specialize to a function that creates a random tensor, or pass the current tensor index to transform -- the current abstraction is a bit awkwardly intermediate
Just deleted entirely
lib/kernels/test/src/test_utils.h
line 86 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Where possible, probably best to add a dynamically-dispatched function as well as one that takes the argument by template. This isn't possible for some functions (such as those that return a type based on the template type), but for some of these in this function it seems possible.
Done.
lib/kernels/test/src/test_utils.h
line 88 at r3 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Still seeing the old behavior? The destination placement should be stated either as part of the function name or as an argument--if it's an argument, rename the function to copy_tensor_to_memory
Made more explicit
lib/kernels/test/src/test_utils.h
line 33 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not just get the datatype from the shape?
Done.
lib/kernels/test/src/test_utils.cc
line 11 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
This doesn't seem correct, shouldn't this be taking a LegionOrdered?
Done.
lib/local-execution/include/local-execution/tracked_allocator.h
line 15 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Delete
Done.
Reviewed 72 of 75 files at r5, 3 of 3 files at r6, all commit messages.
Reviewable status: all files reviewed, 33 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)
lib/kernels/CMakeLists.txt
line 33 at r6 (raw file):
nccl utils pcg
Why?
lib/kernels/include/kernels/accessor.h
line 15 at r6 (raw file):
namespace FlexFlow { class GenericTensorAccessorW {
@reyna-abhyankar Is there a reason we have both GenericTensorAccessorW and GenericTensorAccessorR over just having one GenericTensorAccessor and passing it as either const & or mutable &?
lib/kernels/include/kernels/accessor.h
line 45 at r6 (raw file):
template <DataType DT, typename... Indices> real_type_t<DT> &at(Indices... indices) { if (this->device_type != DeviceType::CPU) {
Any reason for this over the simpler at(std::vector<int> const &)?
Code quote:
template <DataType DT, typename... Indices>
real_type_t<DT> &at(Indices... indices) {
if (this->device_type != DeviceType::CPU) {
lib/kernels/include/kernels/reverse_kernels_cpu.h
line 5 at r6 (raw file):
#include "accessor.h" #include "device.h"
Prefer full include paths
Suggestion:
#include "kernels/accessor.h"
#include "kernels/device.h"
lib/kernels/src/accessor.cc
line 30 at r6 (raw file):
} size_t offset = 0;
@Marsella8 Do you think we could reasonably add some generic coord -> index & index space concept in utils and then move this code and part of #1458 over to use it? Feels like this logic is getting unnecessarily reimplemented
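For context, the shared helper being floated here is essentially the usual flattening of a coordinate into a linear offset; a minimal, codebase-agnostic sketch of the coord -> index computation:
#include <cassert>
#include <cstddef>
#include <vector>

// Flatten a multidimensional coordinate into a linear offset, given the extent
// of each dimension; dimension 0 is treated as the fastest-varying one here.
size_t coord_to_flat_index(std::vector<size_t> const &coord,
                           std::vector<size_t> const &dims) {
  assert(coord.size() == dims.size());
  size_t offset = 0;
  size_t multiplier = 1;
  for (size_t i = 0; i < coord.size(); i++) {
    assert(coord[i] < dims[i]);
    offset += coord[i] * multiplier;
    multiplier *= dims[i];
  }
  return offset;
}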
lib/kernels/src/accessor.cc
line 47 at r6 (raw file):
offset += cur_idx * multiplier; multiplier *= this->shape[legion_dim_t(i)];
Prefer at for bounds checking (ArrayShape may have bounds-checking for both, but other containers don't so it's a good habit)
Suggestion:
multiplier *= this->shape.at(legion_dim_t(i));
lib/kernels/src/allocation.cc
line 22 at r6 (raw file):
void *ptr = this->allocate(get_size_in_bytes(tensor_shape)); return { tensor_shape.data_type, tensor_shape, ptr, get_allocation_device_type()};
Prefer explicit this->
Suggestion:
tensor_shape.data_type, tensor_shape, ptr, this->get_allocation_device_type()};
lib/kernels/src/cpu/replicate_kernels.cc
line 7 at r6 (raw file):
template <typename T> void cpu_replicate_backward_kernel(T *input,
Move over to use the new GenericTensorAccessor access-by-multidimensional-index support?
lib/kernels/src/cpu/reverse_kernels.cc
line 14 at r3 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Done.
Yeah this is much clearer, thanks (also it would be good to rename blk to block as it's a bit nonobvious that blk means block and not bulk)
lib/kernels/src/cpu/reverse_kernels.cc
line 13 at r6 (raw file):
GenericTensorAccessorW &output, coord_t num_out_blks, coord_t reverse_dim_size,
Can't we just get this from the shape information contained in input?
Code quote:
coord_t reverse_dim_size,
lib/kernels/src/cpu/reverse_kernels.cc
line 49 at r6 (raw file):
DataTypeDispatch1<CPUReverseForwardKernel>{}(input_accessor.data_type, input_accessor, std::ref(output_accessor),
Why is std::ref necessary?
Code quote:
std::ref(output_accessor),
lib/kernels/test/CMakeLists.txt
line 17 at r6 (raw file):
cudart cublas pcg
Why?
lib/kernels/test/src/test_batch_norm_kernel.cc
line 36 at r4 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
I wanted create_random_filled_accessor to just return a random accessor filled with decimal values. Initially, I had it where it would always be a float, but realized from doing the cast_kernel operator that we could sometimes want it to be a double as well rather than float
Sure, but the DataType is already present in the input_shape, so probably better to just use that than force the user to pass the DataType twice
lib/kernels/test/src/test_cast_kernel.cc
line 87 at r6 (raw file):
std::vector<int32_t> result_data_gpu = load_accessor_data<DataType::INT32>(output_accessor_gpu);
Why not just get back a GenericTensorAccessorR with its data on CPU? Throwing away all the information by dropping down to a std::vector feels unnecessary
Code quote:
std::vector<int32_t> result_data_gpu =
load_accessor_data<DataType::INT32>(output_accessor_gpu);
lib/kernels/test/src/test_cast_kernel.cc
line 91 at r6 (raw file):
// Run CPU Forward Kernel GenericTensorAccessorW input_accessor_cpu = create_random_filled_accessor_w<DataType::FLOAT>(input_shape,
If the data is random, why would I expect input_accessor_cpu to have the same data as input_accessor_gpu? Would it be cleaner to generate a GenericTensorAccessor with random data on CPU, then create a copy of that on GPU, and then run the functions?
lib/kernels/test/src/test_concat_kernel.cc
line 9 at r6 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("Test concat kernel forward and backward") { size_t num_inputs = 2;
Why these changes?
lib/kernels/test/src/test_replicate_kernel.cc
line 78 at r6 (raw file):
GenericTensorAccessorW output_accessor_gpu = gpu_allocator.allocate_tensor(output_shape); fill_with_zeros(output_accessor_gpu);
Wouldn't filling with random data here give you more assurance the function is correct?
lib/kernels/test/src/test_reverse_kernels.cc
line 150 at r6 (raw file):
GenericTensorAccessorW input_grad_accessor_cpu = cpu_allocator.allocate_tensor(input_shape); fill_with_zeros(input_grad_accessor_cpu);
I'd recommend creating a helper function for creating zero-filled tensors in test_utils to avoid any "use before initialized" issues
Code quote:
cpu_allocator.allocate_tensor(input_shape);
fill_with_zeros(input_grad_accessor_cpu);
lib/kernels/test/src/test_utils.h
line 11 at r3 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
I guess you're right, not sure what I was seeing earlier... I could probably sift through all initialization of GenericTensorAccessor's and specify their DeviceType, but maybe our current behavior is fine with the initializer is fine anyways?
Not sure what you mean by "but maybe our current behavior is fine with the initializer is fine anyways". Can you clarify?
lib/kernels/test/src/test_utils.h
line 33 at r4 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Done.
Then why is the template parameter still here?
lib/kernels/test/src/test_utils.h
line 34 at r6 (raw file):
template <typename DT> void transfer_memory(GenericTensorAccessorW dst_accessor,
Suggestion:
void transfer_memory(GenericTensorAccessorW &dst_accessor,
lib/kernels/test/src/test_utils.h
line 35 at r6 (raw file):
template <typename DT> void transfer_memory(GenericTensorAccessorW dst_accessor, const DT *src,
Why is one parameter a TensorAccessor and the other is a raw array? Feels inconsistent
Code quote:
void transfer_memory(GenericTensorAccessorW dst_accessor,
const DT *src,
lib/kernels/test/src/test_utils.h
line 37 at r6 (raw file):
const DT *src, DeviceType src_device_type) { size_t bytes = dst_accessor.shape.get_volume() * sizeof(DT);
Suggestion:
size_t num_bytes = dst_accessor.shape.get_volume() * sizeof(DT);
lib/kernels/test/src/test_utils.h
line 78 at r6 (raw file):
template <DataType DT> GenericTensorAccessorR create_random_filled_accessor_r(TensorShape const &shape,
Get the DataType from the TensorShape
lib/kernels/test/src/test_utils.h
line 88 at r6 (raw file):
template <typename T> GenericTensorAccessorW create_filled_accessor_w(TensorShape const &shape,
Get the DataType from the TensorShape
lib/kernels/test/src/test_utils.h
line 124 at r6 (raw file):
template <DataType DT> std::vector<real_type_t<DT>> load_accessor_data(GenericTensorAccessorW accessor) {
Suggestion:
load_accessor_data(GenericTensorAccessorW const &accessor) {
lib/kernels/test/src/test_utils.h
line 149 at r6 (raw file):
template <typename T> bool vectors_are_approx_equal(T lhs, T rhs) {
Now that we have support for GenericTensorAccessors on CPU, using vectors no longer seems necessary I think
Code quote:
template <typename T>
bool contains_non_zero(std::vector<T> &data) {
return !all_of(data, [](T const &val) { return val == 0; });
}
template <typename T>
bool vectors_are_approx_equal(T lhs, T rhs) {
lib/kernels/test/src/test_utils.cc
line 5 at r6 (raw file):
namespace FlexFlow { bool device_on_cpu(DeviceType device_type) {
Is this function really necessary? Doesn't seem like you're saving much typing or much complexity
lib/kernels/test/src/test_utils.cc
line 44 at r6 (raw file):
return DataTypeDispatch1<CopyTensorAccessorW>{}( src_accessor.data_type, src_accessor, std::ref(allocator)); }
The ability to copy tensors feels like a useful primitive by itself, might be better moved into kernels itself?
Code quote:
template <DataType DT>
struct CopyTensorAccessorW {
GenericTensorAccessorW operator()(GenericTensorAccessorW const &src_accessor,
Allocator &allocator) {
TensorShape shape =
get_tensor_shape(src_accessor.shape, src_accessor.data_type);
GenericTensorAccessorW copied_tensor = allocator.allocate_tensor(shape);
transfer_memory(
copied_tensor, src_accessor.get<DT>(), src_accessor.device_type);
return copied_tensor;
}
};
GenericTensorAccessorW
copy_tensor_accessor_w(GenericTensorAccessorW const &src_accessor,
Allocator &allocator) {
return DataTypeDispatch1<CopyTensorAccessorW>{}(
src_accessor.data_type, src_accessor, std::ref(allocator));
}
lib/local-execution/src/local_task_argument_accessor.cc
line 31 at r6 (raw file):
tensor_backing.shape, tensor_backing.ptr, this->allocator.get_allocation_device_type()};
At this point probably better to just add a function in accessor.h for converting from a GenericTensorAccessorW to a GenericTensorAccessorR
Code quote:
GenericTensorAccessorR readonly_tensor_backing = {
tensor_backing.data_type,
tensor_backing.shape,
tensor_backing.ptr,
this->allocator.get_allocation_device_type()};
lib/local-execution/src/local_task_argument_accessor.cc
line 31 at r6 (raw file):
tensor_backing.shape, tensor_backing.ptr, this->allocator.get_allocation_device_type()};
Why not just use the device type from tensor_backing?
lib/local-execution/src/local_task_argument_accessor.cc
line 49 at r6 (raw file):
for (GenericTensorAccessorW const &tensor_backing : variadic_tensor_backing) { readonly_variadic_tensor_backing.push_back(
At this point probably better to just add a function in accessor.h for converting from a GenericTensorAccessorW to a GenericTensorAccessorR
lib/local-execution/src/local_task_argument_accessor.cc
line 53 at r6 (raw file):
tensor_backing.shape, tensor_backing.ptr, this->allocator.get_allocation_device_type()});
Just use the device type from tensor_backing?
Reviewable status: 51 of 85 files reviewed, 33 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)
lib/kernels/CMakeLists.txt
line 33 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why?
I'm currently using the DeviceType enum from pcg/device_type.dtg.h
lib/kernels/include/kernels/accessor.h
line 45 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Any reason for this over the simpler at(std::vector<int> const &)?
Mainly such that you can just use .at(1,2,3) instead of .at({1,2,3})... no need for unneeded initializer list
lib/kernels/include/kernels/reverse_kernels_cpu.h
line 5 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Prefer full include paths
Done.
lib/kernels/src/accessor.cc
line 47 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Prefer at for bounds checking (ArrayShape may have bounds-checking for both, but other containers don't so it's a good habit)
Done.
lib/kernels/src/allocation.cc
line 22 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Prefer explicit this->
Done.
lib/kernels/src/cpu/replicate_kernels.cc
line 7 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Move over to use the new GenericTensorAccessor access-by-multidimensional-index support?
Done.
lib/kernels/src/cpu/reverse_kernels.cc
line 13 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Can't we just get this from the shape information contained in input?
Done.
lib/kernels/src/cpu/reverse_kernels.cc
line 49 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is std::ref necessary?
I run into lvalue, rvalue issues with output_accessor without utilizing std::ref. For some reason DataTypeDispatch1 I think is forwarding output accessor as an rvalue which is causing an error? I think it's related to output_accessor being non const, which it has to be in order for the .at method to return a non const pointer which we can assign values to. std::ref seems to fix this issue though.
If possible, I think we can also just modify how forwarding works in DataTypeDispatch1 to fix this issue as well though?
lib/kernels/test/CMakeLists.txt
line 17 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why?
I'm currently using the DeviceType enum from pcg/device_type.dtg.h
lib/kernels/test/src/test_batch_norm_kernel.cc
line 36 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Sure, but the DataType is already present in the input_shape, so probably better to just use that than force the user to pass the DataType twice
Done.
lib/kernels/test/src/test_cast_kernel.cc
line 87 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not just get back a GenericTensorAccessorR with its data on CPU? Throwing away all the information by dropping down to a std::vector feels unnecessary
Done.
lib/kernels/test/src/test_cast_kernel.cc
line 91 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
If the data is random, why would I expect input_accessor_cpu to have the same data as input_accessor_gpu? Would it be cleaner to generate a GenericTensorAccessor with random data on CPU, then create a copy of that on GPU, and then run the functions?
Didn't catch this as the test was going from a float between 0 and 1 to an int, so it was basically just comparing whether a tensor of all zeros was equal to another tensor of all zeros
lib/kernels/test/src/test_concat_kernel.cc
line 9 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why these changes?
Was printing outputs and was easier to visually compare
lib/kernels/test/src/test_replicate_kernel.cc
line 78 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Wouldn't filling with random data here give you more assurance the function is correct?
The problem I was running into is that replicate_kernels.cu just directly adds to the value inside the output accessor. That led to some test runs where one of the tensors was allocated with randomized values inside it, and the test failed because the output tensors were not both initialized with the same initial state
lib/kernels/test/src/test_reverse_kernels.cc
line 150 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
I'd recommend creating a helper function for creating zero-filled tensors in test_utils to avoid any "use before initialized" issues
Done.
lib/kernels/test/src/test_utils.h
line 33 at r4 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Then why is the template parameter still here?
Done.
lib/kernels/test/src/test_utils.h
line 34 at r6 (raw file):
template <typename DT> void transfer_memory(GenericTensorAccessorW dst_accessor,
Done.
lib/kernels/test/src/test_utils.h
line 35 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why is one parameter a TensorAccessor and the other is a raw array? Feels inconsistent
It's mainly because sometimes I want to transfer memory from a vector to an accessor => my current logic for random-filled accessors depends on this...
I think it becomes potentially messier to create things like these random-filled accessors without this construct, as we'd then have to increment based on the raw ptr
lib/kernels/test/src/test_utils.h
line 37 at r6 (raw file):
const DT *src, DeviceType src_device_type) { size_t bytes = dst_accessor.shape.get_volume() * sizeof(DT);
Done.
lib/kernels/test/src/test_utils.h
line 78 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Get the DataType from the TensorShape
Done.
lib/kernels/test/src/test_utils.h
line 88 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Get the DataType from the TensorShape
I think we need to use a template here, since we also have to take the value that we're going to fill the accessor with as an input, and we don't know what type that should be?
lib/kernels/test/src/test_utils.h
line 124 at r6 (raw file):
template <DataType DT> std::vector<real_type_t<DT>> load_accessor_data(GenericTensorAccessorW accessor) {
Done.
lib/kernels/test/src/test_utils.h
line 149 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Now that we have support for GenericTensorAccessors on CPU, using vectors no longer seems necessary I think
Done.
lib/kernels/test/src/test_utils.cc
line 5 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Is this function really necessary? Doesn't seem like you're saving much typing or much complexity
Done.
lib/kernels/test/src/test_utils.cc
line 44 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
The ability to copy tensors feels like a useful primitive by itself, might be better moved into kernels itself?
Done.
lib/local-execution/src/local_task_argument_accessor.cc
line 31 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
At this point probably better to just add a function in accessor.h for converting from a GenericTensorAccessorW to a GenericTensorAccessorR
Done.
lib/local-execution/src/local_task_argument_accessor.cc
line 31 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Why not just use the device type from tensor_backing?
Done.
lib/local-execution/src/local_task_argument_accessor.cc
line 49 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
At this point probably better to just add a function in accessor.h for converting from a GenericTensorAccessorW to a GenericTensorAccessorR
Done.
lib/local-execution/src/local_task_argument_accessor.cc
line 53 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
Just use the device type from tensor_backing?
Done.
Reviewed 33 of 34 files at r7, 1 of 1 files at r8, 7 of 7 files at r9, all commit messages.
Dismissed @lockshaw from a discussion.
Reviewable status: all files reviewed, 27 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)
lib/kernels/include/kernels/accessor.h
line 15 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
@reyna-abhyankar Is there a reason we have both GenericTensorAccessorW and GenericTensorAccessorR over just having one GenericTensorAccessor and passing it as either const & or mutable &?
At least in the meantime it seems like it would make sense to have GenericTensorAccessorW non-explicitly coerce to GenericTensorAccessorR, I think, to avoid all of the overloading?
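A minimal sketch of the non-explicit coercion being suggested, using stand-in types rather than the real accessors; the idea is just a converting constructor that is deliberately not marked explicit:
struct AccessorWSketch {
  void *ptr = nullptr;
};

struct AccessorRSketch {
  void const *ptr = nullptr;

  AccessorRSketch() = default;
  // Intentionally not explicit, so a writable accessor can be passed wherever
  // a read-only one is expected without a manual conversion call.
  AccessorRSketch(AccessorWSketch const &w) : ptr(w.ptr) {}
};

inline void consume(AccessorRSketch const &) {}

inline void example(AccessorWSketch const &w) {
  consume(w); // implicit W -> R coercion through the converting constructor
}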
lib/kernels/include/kernels/accessor.h
line 45 at r6 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
Mainly such that you can just use .at(1,2,3) instead of .at({1,2,3})... no need for unneeded initializer list
Let's do the std::vector version unless there's some additional advantage you haven't mentioned yet. Yes the template syntax is cute and all, but it's substantially more complicated feature-wise and all it seems to do is remove the need to type two characters (also it doesn't allow for passing in coordinates of runtime-determined dimension)
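For concreteness, a stand-in showing roughly what the vector-based signature could look like; the real accessor would derive its extents from ArrayShape, and the element type is fixed to float purely for the sketch:
#include <cassert>
#include <cstddef>
#include <vector>

struct CpuAccessorSketch {
  std::vector<size_t> dims; // extent of each dimension
  float *data = nullptr;

  // Runtime-sized coordinate instead of a variadic template parameter pack,
  // so the coordinate's rank can be decided at runtime.
  float &at(std::vector<size_t> const &indices) {
    assert(indices.size() == dims.size());
    size_t offset = 0;
    size_t multiplier = 1;
    for (size_t i = 0; i < dims.size(); i++) {
      assert(indices[i] < dims[i]);
      offset += indices[i] * multiplier;
      multiplier *= dims[i];
    }
    return data[offset];
  }
};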
lib/kernels/include/kernels/accessor.h
line 14 at r9 (raw file):
namespace FlexFlow { struct Allocator;
Better to just #include the header that contains Allocator
lib/kernels/include/kernels/accessor.h
line 264 at r9 (raw file):
GenericTensorAccessorR const &src_accessor); void transfer_data_between_accessors(
Makes parameter order clear, and makes the operation a bit clearer ("transfer" sounds more like a "move" operation)
Suggestion:
void copy_accessor_data_to_l_from_r(
lib/kernels/include/kernels/replicate_kernels.h
line 15 at r9 (raw file):
void backward_kernel(ffStream_t stream, GenericTensorAccessorR const &output, GenericTensorAccessorW const &input,
Why reorder input and output?
lib/kernels/include/kernels/replicate_kernels_cpu.h
line 13 at r9 (raw file):
void cpu_backward_kernel(GenericTensorAccessorR const &output, GenericTensorAccessorW &input,
Why reorder input and output?
lib/kernels/src/accessor.cc
line 30 at r6 (raw file):
Previously, lockshaw (Colin Unger) wrote…
@Marsella8 Do you think we could reasonably add some generic coord -> index & index space concept in utils and then move this code and part of #1458 over to use it? Feels like this logic is getting unnecessarily reimplemented
Tracked in #1528
lib/kernels/src/accessor.cc
line 32 at r9 (raw file):
num_bytes, cudaMemcpyDeviceToDevice)); }
Suggestion:
} else {
assert (src_device_type == DeviceType::GPU);
assert (dst_device_type == DeviceType::GPU);
checkCUDA(cudaMemcpy(dst_accessor.ptr,
src_accessor.ptr,
num_bytes,
cudaMemcpyDeviceToDevice));
}
lib/kernels/src/managed_ff_stream.cc
line 14 at r9 (raw file):
ManagedFFStream &ManagedFFStream::operator=(ManagedFFStream &&other) noexcept { if (this != &other) {
Why this change?
lib/kernels/src/managed_per_device_ff_handle.cc
line 22 at r9 (raw file):
ManagedPerDeviceFFHandle &ManagedPerDeviceFFHandle::operator=( ManagedPerDeviceFFHandle &&other) noexcept { if (this != &other) {
Why change this from swap?
lib/kernels/src/managed_per_device_ff_handle.cc
line 40 at r9 (raw file):
checkCUDA(cudaFree(this->handle->workSpace)); delete this->handle; this->handle = nullptr;
Why explicitly assign the handle to nullptr?
lib/kernels/src/cpu/reverse_kernels.cc
line 49 at r6 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
I run into lvalue, rvalue issues with output_accessor without utilizing std::ref. For some reason DataTypeDispatch1 I think is forwarding output accessor as an rvalue which is causing an error? I think it's related to output_accessor being non const, which it has to be in order for the .at method to return a non const pointer which we can assign values to. std::ref seems to fix this issue though.
If possible, I think we can also just modify how forwarding works in DataTypeDispatch1 to fix this issue as well though?
I'm tentatively guessing that might be due to DataTypeDispatch1 and DataTypeDispatch2 taking their arguments as Args ... rather than as Args &&... -- can you try making that change and seeing if it fixes the problem?
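A small self-contained illustration of the difference in question: a dispatcher that takes its pack by value operates on copies, so mutation of an output argument is lost unless the caller wraps it in std::ref, whereas a forwarding-reference pack preserves lvalues. The actual DataTypeDispatch1/2 behavior may differ in its details:
#include <functional>
#include <utility>

// Pack taken by value: args are copies, so writer below mutates a local copy.
template <typename F, typename... Args>
void dispatch_by_value(F f, Args... args) {
  f(args...);
}

// Forwarding-reference pack: lvalues stay lvalue references end to end.
template <typename F, typename... Args>
void dispatch_forwarding(F f, Args &&...args) {
  f(std::forward<Args>(args)...);
}

int main() {
  int out = 0;
  auto writer = [](int &x) { x = 42; };

  dispatch_by_value(writer, out);           // out is still 0: only the copy was written
  dispatch_by_value(writer, std::ref(out)); // workaround used in the PR: out becomes 42
  dispatch_forwarding(writer, out);         // no wrapper needed: out stays 42
  return 0;
}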
lib/kernels/src/cuda/ops/linear_kernels.cu
line 138 at r9 (raw file):
in_dim, &alpha, (void *)weight_ptr,
Avoid c-style casts
Suggestion:
static_cast<void *>(weight_ptr),
lib/kernels/test/src/test_cast_kernel.cc
line 83 at r9 (raw file):
output_accessor_cpu, DataType::FLOAT, DataType::DOUBLE);
Any reason not to just get these DataTypes from the accessors' shape fields?
Code quote:
DataType::FLOAT,
DataType::DOUBLE);
lib/kernels/test/src/test_cast_kernel.cc
line 85 at r9 (raw file):
DataType::DOUBLE); CHECK(w_accessors_are_equal<DataType::DOUBLE>(output_accessor_gpu,
Why is a template parameter needed here? Isn't the type information already present in the shape field of the accessors?
Code quote:
<DataType::DOUBLE>(o
lib/kernels/test/src/test_managed_ff_stream.cc
line 10 at r9 (raw file):
ManagedFFStream base_stream{}; SUBCASE("Test ManagedFFStream Move Constructor") {
Add checks for if it's the same object on both sides of the assignment.
lib/kernels/test/src/test_managed_per_device_ff_handle.cc
line 8 at r9 (raw file):
TEST_SUITE(FF_TEST_SUITE) { TEST_CASE("Test Managed Per Device FF Handle") { ManagedPerDeviceFFHandle base_handle{};
Can we force the user to pass arguments here instead? I'm not a huge fan of default constructors where there's no obvious single correct value that should be set. If you want to add a function that creates one with some default arguments and use that, that's fine, but I'd rather that be an explicit function call to make it clear that it's kinda an arbitrary choice supplied somewhere rather than intrinsic to the object semantics
Code quote:
ManagedPerDeviceFFHandle base_handle{};
lib/kernels/test/src/test_managed_per_device_ff_handle.cc
line 16 at r9 (raw file):
SUBCASE("Test ManagedPerDeviceFFHandle Move Constructor") { PerDeviceFFHandle const *base_handle_ptr = &base_handle.raw_handle();
Add checks for if it's the same object on both sides of the assignment.
lib/kernels/test/src/test_reverse_kernels.cc
line 60 at r9 (raw file):
TEST_CASE("Check Reverse Forward and Backward Kernels against CPU Kernels") { std::size_t num_out_blks = 1;
Might be nice to have this be >1 as having it be 1 could be hiding bugs (as iteration order doesn't matter if your loop has bound 1)
Code quote:
std::size_t num_out_blks = 1;
lib/kernels/test/src/test_utils.h
line 35 at r6 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
It's mainly because sometimes I want to transfer memory from a vector to an accessor => my current logic for random-filled accessors depends on this...
I think it becomes potentially messier to create things like these random-filled accessors without this construct, as we'd then have to increment based on the raw ptr
Seems like this has been removed since this comment, so I guess it's fine now?
lib/kernels/test/src/test_utils.h
line 88 at r6 (raw file):
Previously, oOTigger (Dylan Lim) wrote…
I think we need to use a template here, since we also have to take the value that we're going to fill the accessor with as an input, and we don't know what type that should be?
You could technically get around this by passing in a DataTypeValue, which is a variant of the real_type_ts of all of the DataTypes, but I'm fine with the current implementation too. I'd say the big thing is that it should be checked against the DataType in the shape and an error raised if the chosen T and the TensorShape's DataType don't match
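For reference, a sketch of the DataTypeValue alternative mentioned above, with the runtime check against the shape's DataType; the variant membership and enum values here are assumptions, not the project's real definitions:
#include <cstdint>
#include <stdexcept>
#include <variant>

enum class DataType { FLOAT, DOUBLE, INT32 };

// Variant over the element types corresponding to each DataType.
using DataTypeValue = std::variant<float, double, int32_t>;

inline DataType data_type_of(DataTypeValue const &v) {
  if (std::holds_alternative<float>(v)) { return DataType::FLOAT; }
  if (std::holds_alternative<double>(v)) { return DataType::DOUBLE; }
  return DataType::INT32;
}

// A fill routine can then take the value as a plain argument and verify it
// matches the TensorShape's declared DataType before writing anything.
inline void check_fill_value(DataType shape_data_type, DataTypeValue const &value) {
  if (data_type_of(value) != shape_data_type) {
    throw std::runtime_error("fill value type does not match the TensorShape's DataType");
  }
}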
lib/kernels/test/src/test_utils.h
line 101 at r9 (raw file):
if (accessor_a.data_type != accessor_b.data_type) { return false; }
Suggestion:
if (accessor_a.shape != accessor_b.shape) {
throw mk_runtime_error(fmt::format("w_accessors_are_equal expected accessors to have the same shape, but received: {} != {}", accessor_a.shape, accessor_b.shape));
}
lib/kernels/test/src/test_utils.h
line 115 at r9 (raw file):
for (size_t i = 0; i < accessor_a.shape.num_elements(); i++) { if (a_data_ptr[i] != b_data_ptr[i]) { print_accessor(cpu_accessor_a);
printing by default seems like a bad idea; this should probably be handled by the assertion check, not the actual boolean condition itself.
lib/kernels/test/src/test_utils.cc
line 126 at r9 (raw file):
GenericTensorAccessorR create_cpu_compatible_accessor_r(GenericTensorAccessorR const &accessor,
Suggestion:
copy_to_accessor_to_cpu_if_necessary
lib/kernels/test/src/test_utils.cc
line 158 at r9 (raw file):
}; void print_accessor(GenericTensorAccessorR const &accessor) {
Suggestion:
void print_tensor_accessor_contents(
lib/kernels/test/src/test_utils.cc
line 158 at r9 (raw file):
}; void print_accessor(GenericTensorAccessorR const &accessor) {
Might be nice to have this return std::string or take in a std::ostream & so the user can choose where it goes rather than just forcing it to std::cout
Code quote:
void
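A sketch of the stream-based form being suggested, with the contents reduced to a flat float buffer for illustration; a std::string version can be layered on top via an ostringstream:
#include <cstddef>
#include <ostream>
#include <sstream>
#include <string>

// Write to a caller-supplied stream instead of hard-coding std::cout.
inline void print_tensor_contents_sketch(std::ostream &out, float const *data, size_t n) {
  for (size_t i = 0; i < n; i++) {
    out << data[i] << (i + 1 < n ? ' ' : '\n');
  }
}

// String form built on top of the stream form.
inline std::string tensor_contents_to_string_sketch(float const *data, size_t n) {
  std::ostringstream oss;
  print_tensor_contents_sketch(oss, data, n);
  return oss.str();
}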
lib/local-execution/src/ops/linear.cc
line 152 at r9 (raw file):
per_device_state, input.get_float_ptr(), (float *)input_grad.get_float_ptr(),
No cast should be necessary as you're already calling get_float_ptr?
Tests for:
Changes to kernels/test_utils, GenericTensorAccessors, and the addition of local_cpu_allocator