
CPU Kernel Tests #1439

Open · wants to merge 20 commits into base: repo-refactor

Conversation

oOTigger

@oOTigger oOTigger commented Jul 11, 2024

Tests for:

  • cast_kernels
  • reverse_kernels
  • combine_kernels
  • replicate_kernels
  • managed_ff_stream
  • managed_per_device_handle

Changes to kernels/test_utils, GenericTensorAccessors, and the addition of local_cpu_allocator



@oOTigger oOTigger added repo-refactor Topics related to the repo and search refactors rr-kernels Repo refactor kernels library labels Jul 11, 2024
@oOTigger oOTigger self-assigned this Jul 11, 2024

codecov bot commented Jul 11, 2024

Codecov Report

Attention: Patch coverage is 10.46512% with 231 lines in your changes missing coverage. Please review.

Project coverage is 77.59%. Comparing base (1d5140d) to head (7106dec).

Files with missing lines Patch % Lines
lib/kernels/src/accessor.cc 19.14% 76 Missing ⚠️
lib/kernels/src/cpu/cast_kernels.cc 0.00% 21 Missing ⚠️
lib/kernels/src/cpu/reverse_kernels.cc 0.00% 20 Missing ⚠️
lib/kernels/src/managed_per_device_ff_handle.cc 0.00% 19 Missing ⚠️
lib/kernels/include/kernels/accessor.h 0.00% 18 Missing ⚠️
lib/kernels/src/cpu/replicate_kernels.cc 0.00% 18 Missing ⚠️
lib/kernels/src/cpu/combine_kernels.cc 0.00% 17 Missing ⚠️
lib/kernels/src/local_cpu_allocator.cc 0.00% 15 Missing ⚠️
lib/kernels/src/managed_ff_stream.cc 0.00% 10 Missing ⚠️
lib/local-execution/src/ops/linear.cc 0.00% 7 Missing ⚠️
... and 3 more
Additional details and impacted files
@@                Coverage Diff                @@
##           repo-refactor    #1439      +/-   ##
=================================================
- Coverage          78.16%   77.59%   -0.57%     
=================================================
  Files                860      866       +6     
  Lines              27994    28219     +225     
  Branches             770      794      +24     
=================================================
+ Hits               21881    21897      +16     
- Misses              6113     6322     +209     
Flag Coverage Δ
unittests 77.59% <10.46%> (-0.57%) ⬇️

Flags with carried forward coverage won't be shown.

Files with missing lines Coverage Δ
lib/kernels/include/kernels/allocation.h 83.33% <ø> (ø)
lib/kernels/include/kernels/attention_kernels.h 0.00% <ø> (ø)
lib/kernels/include/kernels/datatype_dispatch.h 0.00% <ø> (ø)
lib/kernels/include/kernels/local_cuda_allocator.h 0.00% <ø> (ø)
lib/kernels/include/kernels/transpose_kernels.h 0.00% <ø> (ø)
lib/kernels/src/allocation.cc 72.72% <100.00%> (+10.22%) ⬆️
lib/kernels/src/array_shape.cc 20.83% <ø> (ø)
...tion/include/local-execution/local_cpu_allocator.h 100.00% <ø> (ø)
...cution/include/local-execution/tracked_allocator.h 0.00% <ø> (ø)
lib/local-execution/src/local_cpu_allocator.cc 57.14% <100.00%> (+7.14%) ⬆️
... and 15 more

... and 1 file with indirect coverage changes

@lockshaw lockshaw marked this pull request as ready for review July 14, 2024 06:12
Collaborator

@lockshaw lockshaw left a comment


Reviewed 1 of 23 files at r1, 21 of 29 files at r2, 12 of 12 files at r3, all commit messages.
Reviewable status: all files reviewed, 18 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)


lib/kernels/include/kernels/cast_kernels_cpu.h line 15 at r3 (raw file):

                    GenericTensorAccessorW const &output,
                    DataType input_type,
                    DataType output_type);

Prefer function names over extra namespaces; the namespaces in kernels are more of a legacy holdover than something that should be used more

Suggestion:

void cpu_forward_kernel(GenericTensorAccessorR const &input,
                    GenericTensorAccessorW const &output,
                    DataType input_type,
                    DataType output_type);

lib/kernels/src/local_cpu_allocator.cc line 6 at r3 (raw file):

namespace FlexFlow {
void *LocalCPUAllocator::allocate(size_t requested_memory_size) {
  void *ptr = calloc(1, requested_memory_size);

Just use malloc

Code quote:

calloc(

lib/kernels/src/local_cpu_allocator.cc line 28 at r3 (raw file):

LocalCPUAllocator::~LocalCPUAllocator() {
  for (auto ptr : ptrs) {

Suggestion:

  for (void *ptr : this->ptrs) {

lib/kernels/src/local_cuda_allocator.cc line 8 at r3 (raw file):

  void *ptr;
  checkCUDA(cudaMalloc(&ptr, requested_memory_size));
  checkCUDA(cudaMemset(ptr, 0, requested_memory_size));

I don't think code should assume that the allocated memory has been zero'd, unless there's some reason I'm not thinking of?


lib/kernels/src/cpu/reverse_kernels.cc line 14 at r3 (raw file):

                            coord_t in_blk_size) {
  coord_t total_elements = num_out_blks * reverse_dim_size * in_blk_size;
  for (coord_t i = 0; i < total_elements; ++i) {

This is really hard to read right now, make more readable (maybe pull out some helper functions or something?)


lib/kernels/src/cuda/ops/reverse_kernels.cu line 41 at r3 (raw file):

// }

/* I mentioned this earlier, but I still think the reverse_forward_kernel code

See my message in slack


lib/kernels/test/src/test_batch_norm_kernel.cc line 91 at r3 (raw file):

      std::vector<float> host_bias_grad_data =
          load_accessor_data<DataType::FLOAT>(
              read_only_accessor_from_write_accessor(bias_grad_accessor));

Why is read_only_accessor_from_write_accessor necessary here? Aren't GenericTensorWs assumed to be RW @reyna-abhyankar ? If so, add an overload to load_accessor_data
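
A minimal sketch of what that overload might look like, assuming the existing R-accessor version and the read_only_accessor_from_write_accessor helper referenced in this thread (the real diff may differ):

// Hedged sketch: forward the W overload to the existing R version.
template <DataType DT>
std::vector<real_type<DT>>
    load_accessor_data(GenericTensorAccessorW const &accessor) {
  return load_accessor_data<DT>(
      read_only_accessor_from_write_accessor(accessor));
}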


lib/kernels/test/src/test_combine_kernel.cc line 64 at r3 (raw file):

      GenericTensorAccessorR input_accessor_gpu =
          read_only_accessor_from_write_accessor(
              create_random_filled_accessor_w(input_shape, gpu_allocator));

Why not also define a create_random_filled_accessor_r?

Code quote:

create_random_filled_accessor_w

lib/kernels/test/src/test_utils.h line 11 at r3 (raw file):

#include <random>

enum class GpuDirection {

If we're going to have GenericTensorAccessors allocated on different devices it would probably be good to have a field on GenericTensorAccessor that tracks that, so it can at least be checked at runtime if you try to access it incorrectly, etc.?

This would also allow you to infer a lot of locations (CPU vs GPU) rather than having to pass additional arguments to the transfer functions


lib/kernels/test/src/test_utils.h line 12 at r3 (raw file):

enum class GpuDirection {
  HostToDevice = 0,

Why are concrete values needed?


lib/kernels/test/src/test_utils.h line 65 at r3 (raw file):

template <typename IDT, typename ODT, typename F>
GenericTensorAccessorW create_transformed_accessor_w(TensorShape const &shape,

It seems like this is just being used to create a random tensor, seems like that might be a better behavior?


lib/kernels/test/src/test_utils.h line 86 at r3 (raw file):

}

template <DataType DT>

Where possible, probably best to add a dynamically-dispatched function as well as one that takes the argument by template. This isn't possible for some functions (such as those that return a type based on the template type), but for some of these in this function it seems possible.
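
As a rough illustration of that pattern (a hedged sketch only; fill_with_value and FillWithValue are hypothetical names, and the DataTypeDispatch1 usage mirrors how it appears elsewhere in this PR):

// Template-parameter version: DataType fixed at compile time.
template <DataType DT>
struct FillWithValue {
  void operator()(GenericTensorAccessorW const &accessor, double value) {
    using T = real_type<DT>;
    T *data = accessor.get<DT>();
    for (size_t i = 0; i < accessor.shape.num_elements(); i++) {
      data[i] = static_cast<T>(value);
    }
  }
};

// Dynamically-dispatched version: DataType chosen at runtime.
void fill_with_value(GenericTensorAccessorW const &accessor, double value) {
  DataTypeDispatch1<FillWithValue>{}(accessor.data_type, accessor, value);
}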


lib/kernels/test/src/test_utils.h line 88 at r3 (raw file):

template <DataType DT>
GenericTensorAccessorW
    copy_tensor_between_memories(GenericTensorAccessorR accessor,

The behavior of this method seems rather confusing; I'd stick to explicitly stating the direction you'd like to transfer (either from Host to Device, or reversed).


lib/kernels/test/src/test_utils.h line 89 at r3 (raw file):

GenericTensorAccessorW
    copy_tensor_between_memories(GenericTensorAccessorR accessor,
                                 TensorShape const &shape,

GenericTensorAccessor already has a shape, no need to pass an additional shape separately


lib/kernels/test/src/test_utils.h line 104 at r3 (raw file):

}

template <DataType DT>

Pass as a normal argument, not a template argument


lib/kernels/test/src/test_utils.h line 116 at r3 (raw file):

template <DataType DT>
std::vector<real_type<DT>> load_accessor_data(GenericTensorAccessorR accessor,
                                              bool on_host = false) {

Change to an enum and make passing required/copy

Code quote:

bool on_host = false

lib/kernels/test/src/test_utils.cc line 8 at r3 (raw file):

  GenericTensorAccessorW accessor = allocator.allocate_tensor(shape);
  size_t volume = accessor.shape.num_elements();
  std::vector<float> host_data(volume);

Can't TensorShape have different datatypes? If you're going to only do this for float you should at the very least have a check


lib/kernels/test/src/test_utils.cc line 17 at r3 (raw file):

  }

  transfer_memory(static_cast<float *>(accessor.ptr),

Would it be better to have a function that transfers a GenericTensorAccessor, rather than having to do casting to transfer each time? That also saves you from having to pass volume, etc.

Marsella8 and others added 2 commits July 31, 2024 04:52
…lexflow#1429)

* initial commit for machine view adjacent modules

* Formatting

* Tests for new machine_view.cc functions

* formatting

* Minor Test correction

* formatting

* PR fixes

* PR Fixes

---------

Co-authored-by: Pietro Max Marsella <marsella@stanford.edu>
Author

@oOTigger oOTigger left a comment


Reviewable status: 2 of 50 files reviewed, 18 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)


lib/kernels/include/kernels/cast_kernels_cpu.h line 15 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer function names over extra namespaces; the namespaces in kernels are more of a legacy holdover than something that should be used more

Done.


lib/kernels/src/local_cpu_allocator.cc line 6 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Just use malloc

Done.


lib/kernels/src/local_cpu_allocator.cc line 28 at r3 (raw file):

LocalCPUAllocator::~LocalCPUAllocator() {
  for (auto ptr : ptrs) {

Done.


lib/kernels/src/local_cuda_allocator.cc line 8 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I don't think code should assume that the allocated memory has been zero'd, unless there's some reason I'm not thinking of?

Ended up adding an allocate_and_zero function to the Allocator class, not sure if this is too big of a change though?


lib/kernels/src/cpu/reverse_kernels.cc line 14 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

This is really hard to read right now, make more readable (maybe pull out some helper functions or something?)

Is this "too" broken down?


lib/kernels/src/cuda/ops/reverse_kernels.cu line 41 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

See my message in slack

Currently talking to Xinhao


lib/kernels/test/src/test_batch_norm_kernel.cc line 91 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why is read_only_accessor_from_write_accessor necessary here? Aren't GenericTensorWs assumed to be RW @reyna-abhyankar ? If so, add an overload to load_accessor_data

Done.


lib/kernels/test/src/test_combine_kernel.cc line 64 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why not also define a create_random_filled_accessor_r?

Done.


lib/kernels/test/src/test_utils.h line 11 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

If we're going to have GenericTensorAccessors allocated on different devices it would probably be good to have a field on GenericTensorAccessor that tracks that, so it can at least be checked at runtime if you try to access it incorrectly, etc.?

This would also allow you to infer a lot of locations (CPU vs GPU) rather than having to pass additional arguments to the transfer functions

Ideally, this would use the same enum class as in allocation.h, but for some reason, whenever I define that class in accessor.h and try to include it as part of the class I get a bunch of compile errors? I think it has to do with FF_VISITABLE_STRUCT_NONSTANDARD_CONSTRUCTION? Is there some rule I'm not aware of about what can and can't be passed in?

I also defined an initializer list since a lot of places in the code base reference GenericTensorAccessors and initialize them using a three-parameter constructor (i.e. {datatype, shape, pointer}), so I needed to add a way to default the tensor location. Is it safe to say that previous references are on GPU, or do I need to sift through this manually?

In addition, allocators now also store state on whether they allocate things on CPU or GPU, so we can initialize the values for GenericTensorAccessors.


lib/kernels/test/src/test_utils.h line 12 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why are concrete values needed?

Done.


lib/kernels/test/src/test_utils.h line 65 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

It seems like this is just being used to create a random tensor, seems like that might be a better behavior?

This is mainly for debugging purposes, so can create like an iota filled accessor or etc.


lib/kernels/test/src/test_utils.h line 88 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

The behavior of this method seems rather confusing; I'd stick to explicitly stating the direction you'd like to transfer (either from Host to Device, or reversed).

Done.


lib/kernels/test/src/test_utils.h line 89 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

GenericTensorAccessor already has a shape, no need to pass an additional shape separately

Done.


lib/kernels/test/src/test_utils.h line 104 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Pass as a normal argument, not a template argument

Done.


lib/kernels/test/src/test_utils.h line 116 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Change to an enum and make passing required/copy

Just removed entirely


lib/kernels/test/src/test_utils.cc line 8 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Can't TensorShape have different datatypes? If you're going to only do this for float you should at the very least have a check

Done.


lib/kernels/test/src/test_utils.cc line 17 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Would it be better to have a function that transfers a GenericTensorAccessor, rather than having to do casting to transfer each time? That also saves you from having to pass volume, etc.

Done.

Collaborator

@lockshaw lockshaw left a comment


Reviewed 48 of 48 files at r4, all commit messages.
Reviewable status: all files reviewed, 35 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)


lib/kernels/include/kernels/accessor.h line 31 at r4 (raw file):

  double *get_double_ptr() const;
  half *get_half_ptr() const;

Code snippet:

GenericTensorAccessorW() = delete;

lib/kernels/include/kernels/accessor.h line 33 at r4 (raw file):

  GenericTensorAccessorW(DataType dt,
                         ArrayShape sh,

Suggestion:

                         ArrayShape const &sh,

lib/kernels/include/kernels/accessor.h line 34 at r4 (raw file):

  GenericTensorAccessorW(DataType dt,
                         ArrayShape sh,
                         req<void *> p,

Suggestion:

                         void *p,

lib/kernels/include/kernels/accessor.h line 35 at r4 (raw file):

                         ArrayShape sh,
                         req<void *> p,
                         bool on_dev = true)

Suggestion:

  GenericTensorAccessorW(DataType data_type,
                         ArrayShape shape,
                         req<void *> ptr,
                         bool on_dev = true)

lib/kernels/include/kernels/accessor.h line 36 at r4 (raw file):

                         req<void *> p,
                         bool on_dev = true)
      : data_type(dt), shape(sh), ptr(p), on_device(on_dev) {}

Move to .cc file


lib/kernels/include/kernels/accessor.h line 41 at r4 (raw file):

  DataType data_type;
  ArrayShape shape;
  req<void *> ptr;

Suggestion:

  void * ptr;

lib/kernels/include/kernels/accessor.h line 42 at r4 (raw file):

  ArrayShape shape;
  req<void *> ptr;
  bool on_device;

Suggestion:

  DeviceType device_type;

lib/kernels/include/kernels/accessor.h line 43 at r4 (raw file):

  req<void *> ptr;
  bool on_device;
};

Suggestion:

  bool operator==(GenericTensorAccessorW const &) const;
  bool operator!=(GenericTensorAccessorW const &) const;

public:
  DataType data_type;
  ArrayShape shape;
  req<void *> ptr;
  bool on_device;
private:
  std::tuple<decltype(data_type) const &,
             decltype(shape) const &,
             decltype(ptr) const &,
             decltype(on_device) const &>
    tie() const;
};

// in .cc file
std::tuple<DataType const &,
           ArrayShape const &,
           void *,
           DeviceType const &>
    GenericTensorAccessorW::tie() const {
  return std::tie(this->data_type, this->shape, this->ptr, this->on_device);     
}

bool GenericTensorAccessorW::operator==(GenericTensorAccessorW const &other) const {
  return this->tie() == other.tie();
}

bool GenericTensorAccessorW::operator!=(GenericTensorAccessorW const &other) const {
  return this->tie() != other.tie(); 
}

lib/kernels/include/kernels/accessor.h line 45 at r4 (raw file):

};
FF_VISITABLE_STRUCT_NONSTANDARD_CONSTRUCTION(
    GenericTensorAccessorW, data_type, shape, ptr, on_device);

visitable is deprecated


lib/kernels/include/kernels/accessor.h line 64 at r4 (raw file):

  double const *get_double_ptr() const;
  half const *get_half_ptr() const;

Code snippet:

GenericTensorAccessorR() = delete;

lib/kernels/include/kernels/accessor.h line 65 at r4 (raw file):

  half const *get_half_ptr() const;

  GenericTensorAccessorR(DataType dt,

Move to .cc file


lib/kernels/include/kernels/accessor.h line 77 at r4 (raw file):

  bool on_device;
};
FF_VISITABLE_STRUCT_NONSTANDARD_CONSTRUCTION(

Make same tie changes as above to add ==, !=, hash back in (they're currently added by visitable)


lib/kernels/include/kernels/allocation.h line 14 at r4 (raw file):

struct IAllocator {
  virtual void *allocate(size_t) = 0;
  virtual void *allocate_and_zero(size_t) = 0;

Delete, if you want to zero the allocation you can zero it with some other function, don't make it part of the allocator interface


lib/kernels/include/kernels/allocation.h line 16 at r4 (raw file):

  virtual void *allocate_and_zero(size_t) = 0;
  virtual void deallocate(void *) = 0;

Code snippet:

virtual DeviceType get_allocation_device_type() = 0;

lib/kernels/include/kernels/allocation.h line 25 at r4 (raw file):

  GenericTensorAccessorW allocate_tensor(TensorShape const &tensor_shape);
  GenericTensorAccessorW
      allocate_tensor_and_zero(TensorShape const &tensor_shape);

Delete


lib/kernels/include/kernels/allocation.h line 28 at r4 (raw file):

  void *allocate(size_t mem_size);
  void *allocate_and_zero(size_t mem_size);

Delete


lib/kernels/include/kernels/allocation.h line 40 at r4 (raw file):

  Allocator(std::shared_ptr<IAllocator> ptr) : i_allocator(ptr){};

  AllocLocation alloc_location;

Delete


lib/kernels/include/kernels/local_cpu_allocator.h line 13 at r4 (raw file):

  void *allocate(size_t) override;
  void *allocate_and_zero(size_t) override;

Delete


lib/kernels/include/kernels/local_cuda_allocator.h line 13 at r4 (raw file):

  void *allocate(size_t) override;
  void *allocate_and_zero(size_t) override;

Delete


lib/kernels/src/array_shape.cc line 63 at r4 (raw file):

}

TensorShape get_tensor_shape(ArrayShape const &shape, DataType DT) {

Use the implementation from @reyna-abhyankar's branch: https://github.com/reyna-abhyankar/FlexFlow/blob/e1a8a01e9e4f782805aa62c163bf50799f0638b4/lib/kernels/src/array_shape.cc#L75-L78


lib/kernels/src/local_cpu_allocator.cc line 5 at r4 (raw file):

namespace FlexFlow {
void *LocalCPUAllocator::allocate(size_t requested_memory_size) {

Use the implementations from @reyna-abhyankar's branch: https://github.com/reyna-abhyankar/FlexFlow/blob/e1a8a01e9e4f782805aa62c163bf50799f0638b4/lib/local-execution/src/local_cpu_allocator.cc


lib/kernels/src/cpu/combine_kernels.cc line 4 at r4 (raw file):

#include "kernels/datatype_dispatch.h"

namespace FlexFlow {

FYI we're now on C++17, so you can instead write namespace FlexFlow::Kernels::Combine {


lib/kernels/src/cpu/reverse_kernels.cc line 14 at r3 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Is this "too" broken down?

The comments are helpful, but I think the big thing that would help is the access-by-coordinate function for GenericTensorAccessor


lib/kernels/src/cuda/ops/reverse_kernels.cu line 41 at r3 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Currently talking to Xinhao

Can you change your status on this comment then to "Working": see https://docs.reviewable.io/discussions.html#dispositions-and-resolution


lib/kernels/test/src/test_batch_norm_kernel.cc line 36 at r4 (raw file):

    GenericTensorAccessorW input_accessor =
        create_random_filled_accessor_w<DataType::FLOAT>(input_shape,

Why is the extra template parameter suddenly necessary?


lib/kernels/test/src/test_cast_kernel.cc line 90 at r4 (raw file):

          DataType::FLOAT,
          DataType::INT32);
      std::cout << "Before GPU load" << std::endl;

Delete


lib/kernels/test/src/test_cast_kernel.cc line 103 at r4 (raw file):

          DataType::FLOAT,
          DataType::INT32);
      std::cout << "Before CPU load" << std::endl;

Delete


lib/kernels/test/src/test_combine_kernel.cc line 111 at r4 (raw file):

              output_grad_accessor_gpu, cpu_allocator);
      GenericTensorAccessorW input_grad_accessor_cpu =
          cpu_allocator.allocate_tensor_and_zero(input_shape);

Or even add a helper function create_zero_filled_accessor_w that does these two operations

Suggestion:

      GenericTensorAccessorW input_grad_accessor_cpu =
          cpu_allocator.allocate_tensor(input_shape);
      fill_with_zeros(input_grad_accessor_cpu);
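
A hedged sketch of that helper, assuming fill_with_zeros and Allocator::allocate_tensor behave as used above (names from this thread, not necessarily the final diff):

// Allocate a tensor and immediately zero it, so callers can't forget the fill.
GenericTensorAccessorW create_zero_filled_accessor_w(TensorShape const &shape,
                                                     Allocator &allocator) {
  GenericTensorAccessorW accessor = allocator.allocate_tensor(shape);
  fill_with_zeros(accessor);
  return accessor;
}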

lib/kernels/test/src/test_reverse_kernels.cc line 147 at r4 (raw file):

      // Run CPU Cast Backward Kernel
      GenericTensorAccessorW output_grad_accessor_cpu =
          copy_tensor_between_memories<DataType::FLOAT>(

Add an explicit destination (semantics should either be copy to CPU or copy to GPU, not dependent on the current placement)

Code quote:

copy_tensor_between_memories<DataType::FLOAT>(

lib/kernels/test/src/test_utils.h line 11 at r3 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Ideally, this would use the same enum class as in allocation.h, but for some reason, whenever I define that class in accessor.h and try to include it as part of the class I get a bunch of compile errors? I think it has to do with FF_VISITABLE_STRUCT_NONSTANDARD_CONSTRUCTION? Is there some rule I'm not aware of about what can and can't be passed in?

I also defined an initializer list since a lot of places in the code base reference GenericTensorAccessors and initialize them using a three-parameter constructor (i.e. {datatype, shape, pointer}), so I needed to add a way to default the tensor location. Is it safe to say that previous references are on GPU, or do I need to sift through this manually?

In addition, allocators now also store state on whether they allocate things on CPU or GPU, so we can initialize the values for GenericTensorAccessors.

There shouldn't be that much code explicitly creating GenericTensorAccessors I think? Can you point me to some examples?


lib/kernels/test/src/test_utils.h line 65 at r3 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

This is mainly for debugging purposes, so can create like an iota filled accessor or etc.

Either specialize to a function that creates a random tensor, or pass the current tensor index to transform--the current abstraction is a bit awkwardly intermediate


lib/kernels/test/src/test_utils.h line 88 at r3 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Done.

Still seeing the old behavior? The destination placement should be stated either as part of the function name or as an argument--if it's an argument, rename the function to copy_tensor_to_memory


lib/kernels/test/src/test_utils.h line 33 at r4 (raw file):

}

template <DataType DT>

Why not just get the datatype from the shape?


lib/kernels/test/src/test_utils.cc line 11 at r4 (raw file):

}

TensorShape make_tensor_shape_from_legion_dims(FFOrdered<size_t> dims,

This doesn't seem correct, shouldn't this be taking a LegionOrdered?

Suggestion:

legion_dims(FFOrdered<size_t> const &dims,

lib/local-execution/include/local-execution/tracked_allocator.h line 15 at r4 (raw file):

  void *allocate(size_t) override;
  void *allocate_and_zero(size_t) override;

Delete

@lockshaw lockshaw assigned rupanshusoi and unassigned oOTigger Sep 26, 2024
Author

@oOTigger oOTigger left a comment


Reviewable status: 8 of 83 files reviewed, 35 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)


lib/kernels/include/kernels/accessor.h line 31 at r4 (raw file):

  double *get_double_ptr() const;
  half *get_half_ptr() const;

Done.


lib/kernels/include/kernels/accessor.h line 33 at r4 (raw file):

  GenericTensorAccessorW(DataType dt,
                         ArrayShape sh,

Done.


lib/kernels/include/kernels/accessor.h line 34 at r4 (raw file):

  GenericTensorAccessorW(DataType dt,
                         ArrayShape sh,
                         req<void *> p,

Done.


lib/kernels/include/kernels/accessor.h line 35 at r4 (raw file):

                         ArrayShape sh,
                         req<void *> p,
                         bool on_dev = true)

Done.


lib/kernels/include/kernels/accessor.h line 36 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Move to .cc file

Done.


lib/kernels/include/kernels/accessor.h line 41 at r4 (raw file):

  DataType data_type;
  ArrayShape shape;
  req<void *> ptr;

Done.


lib/kernels/include/kernels/accessor.h line 42 at r4 (raw file):

  ArrayShape shape;
  req<void *> ptr;
  bool on_device;

Done.


lib/kernels/include/kernels/accessor.h line 43 at r4 (raw file):

  req<void *> ptr;
  bool on_device;
};

Done.


lib/kernels/include/kernels/accessor.h line 45 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

visitable is deprecated

Done.


lib/kernels/include/kernels/accessor.h line 64 at r4 (raw file):

  double const *get_double_ptr() const;
  half const *get_half_ptr() const;

Done.


lib/kernels/include/kernels/accessor.h line 65 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Move to .cc file

Done.


lib/kernels/include/kernels/accessor.h line 77 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Make same tie changes as above to add ==, !=, hash back in (they're currently added by visitable)

Done.


lib/kernels/include/kernels/allocation.h line 14 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Delete, if you want to zero the allocation you can zero it with some other function, don't make it part of the allocator interface

Done.


lib/kernels/include/kernels/allocation.h line 16 at r4 (raw file):

  virtual void *allocate_and_zero(size_t) = 0;
  virtual void deallocate(void *) = 0;

Done.


lib/kernels/include/kernels/allocation.h line 25 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Delete

Done.


lib/kernels/include/kernels/allocation.h line 28 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Delete

Done.


lib/kernels/include/kernels/allocation.h line 40 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Delete

Done.


lib/kernels/include/kernels/local_cpu_allocator.h line 13 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Delete

Done.


lib/kernels/include/kernels/local_cuda_allocator.h line 13 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Delete

Done.


lib/kernels/src/array_shape.cc line 63 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Use the implementation from @reyna-abhyankar's branch: https://github.com/reyna-abhyankar/FlexFlow/blob/e1a8a01e9e4f782805aa62c163bf50799f0638b4/lib/kernels/src/array_shape.cc#L75-L78

Done.


lib/kernels/src/local_cpu_allocator.cc line 5 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Use the implementations from @reyna-abhyankar's branch: https://github.com/reyna-abhyankar/FlexFlow/blob/e1a8a01e9e4f782805aa62c163bf50799f0638b4/lib/local-execution/src/local_cpu_allocator.cc

Done.


lib/kernels/src/cpu/combine_kernels.cc line 4 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

FYI we're now on C++17, so you can instead write namespace FlexFlow::Kernels::Combine {

Done


lib/kernels/src/cpu/reverse_kernels.cc line 14 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

The comments are helpful, but I think the big thing that would help is the access-by-coordinate function for GenericTensorAccessor

Done.


lib/kernels/test/src/test_batch_norm_kernel.cc line 36 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why is the extra template parameter suddenly necessary?

I wanted create_random_filled_accessor to just return a random accessor filled with decimal values. Initially it always produced floats, but while working on the cast_kernel operator I realized we could sometimes want it to produce doubles rather than floats.


lib/kernels/test/src/test_cast_kernel.cc line 90 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Delete

Done.


lib/kernels/test/src/test_cast_kernel.cc line 103 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Delete

Done.


lib/kernels/test/src/test_combine_kernel.cc line 111 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Or even add a helper function create_zero_filled_accessor_w that does these two operations

Done.


lib/kernels/test/src/test_reverse_kernels.cc line 147 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Add an explicit destination (semantics should either be copy to CPU or copy to GPU, not dependent on the current placement)

Done.


lib/kernels/test/src/test_utils.h line 11 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

There shouldn't be that much code explicitly creating GenericTensorAccessors I think? Can you point me to some examples?

I guess you're right, not sure what I was seeing earlier... I could probably sift through all initialization of GenericTensorAccessor's and specify their DeviceType, but maybe our current behavior is fine with the initializer is fine anyways?


lib/kernels/test/src/test_utils.h line 65 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Either specialize to a function that creates a random tensor, or pass the current tensor index to transform--the current abstraction is a bit awkwardly intermediate

Just deleted entirely


lib/kernels/test/src/test_utils.h line 86 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Where possible, probably best to add a dynamically-dispatched function as well as one that takes the argument by template. This isn't possible for some functions (such as those that return a type based on the template type), but for some of these in this function it seems possible.

Done.


lib/kernels/test/src/test_utils.h line 88 at r3 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Still seeing the old behavior? The destination placement should be stated either as part of the function name or as an argument--if it's an argument, rename the function to copy_tensor_to_memory

Made more explicit


lib/kernels/test/src/test_utils.h line 33 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why not just get the datatype from the shape?

Done.


lib/kernels/test/src/test_utils.cc line 11 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

This doesn't seem correct, shouldn't this be taking a LegionOrdered?

Done.


lib/local-execution/include/local-execution/tracked_allocator.h line 15 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Delete

Done.

Collaborator

@lockshaw lockshaw left a comment


Reviewed 72 of 75 files at r5, 3 of 3 files at r6, all commit messages.
Reviewable status: all files reviewed, 33 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)


lib/kernels/CMakeLists.txt line 33 at r6 (raw file):

  nccl
  utils
  pcg

Why?


lib/kernels/include/kernels/accessor.h line 15 at r6 (raw file):

namespace FlexFlow {

class GenericTensorAccessorW {

@reyna-abhyankar Is there a reason we have both GenericTensorAccessorW and GenericTensorAccessorR over just having one GenericTensorAccessor and passing it as either const & or mutable &?


lib/kernels/include/kernels/accessor.h line 45 at r6 (raw file):

  template <DataType DT, typename... Indices>
  real_type_t<DT> &at(Indices... indices) {
    if (this->device_type != DeviceType::CPU) {

Any reason for this over the simpler at(std::vector<int> const &)?

Code quote:

  template <DataType DT, typename... Indices>
  real_type_t<DT> &at(Indices... indices) {
    if (this->device_type != DeviceType::CPU) {

lib/kernels/include/kernels/reverse_kernels_cpu.h line 5 at r6 (raw file):

#include "accessor.h"
#include "device.h"

Prefer full include paths

Suggestion:

#include "kernels/accessor.h"
#include "kernels/device.h"

lib/kernels/src/accessor.cc line 30 at r6 (raw file):

  }

  size_t offset = 0;

@Marsella8 Do you think we could reasonably add some generic coord -> index & index space concept in utils and then move this code and part of #1458 over to use it? Feels like this logic is getting unnecessarily reimplemented


lib/kernels/src/accessor.cc line 47 at r6 (raw file):

    offset += cur_idx * multiplier;
    multiplier *= this->shape[legion_dim_t(i)];

Prefer at for bounds checking (ArrayShape may have bounds-checking for both, but other containers don't so it's a good habit)

Suggestion:

    multiplier *= this->shape.at(legion_dim_t(i));

lib/kernels/src/allocation.cc line 22 at r6 (raw file):

  void *ptr = this->allocate(get_size_in_bytes(tensor_shape));
  return {
      tensor_shape.data_type, tensor_shape, ptr, get_allocation_device_type()};

Prefer explicit this->

Suggestion:

     tensor_shape.data_type, tensor_shape, ptr, this->get_allocation_device_type()};

lib/kernels/src/cpu/replicate_kernels.cc line 7 at r6 (raw file):

template <typename T>
void cpu_replicate_backward_kernel(T *input,

Move over to use the new GenericTensorAccessor access-by-multidimensional-index support?


lib/kernels/src/cpu/reverse_kernels.cc line 14 at r3 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Done.

Yeah this is much clearer, thanks (also it would be good to rename blk to block as it's a bit nonobvious that blk means block and not bulk)


lib/kernels/src/cpu/reverse_kernels.cc line 13 at r6 (raw file):

                  GenericTensorAccessorW &output,
                  coord_t num_out_blks,
                  coord_t reverse_dim_size,

Can't we just get this from the shape information contained in input?

Code quote:

                 coord_t reverse_dim_size,

lib/kernels/src/cpu/reverse_kernels.cc line 49 at r6 (raw file):

  DataTypeDispatch1<CPUReverseForwardKernel>{}(input_accessor.data_type,
                                               input_accessor,
                                               std::ref(output_accessor),

Why is std::ref necessary?

Code quote:

std::ref(output_accessor),

lib/kernels/test/CMakeLists.txt line 17 at r6 (raw file):

    cudart
    cublas
    pcg

Why?


lib/kernels/test/src/test_batch_norm_kernel.cc line 36 at r4 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

I wanted create_random_filled_accessor to just return a random accessor filled with decimal values. Initially it always produced floats, but while working on the cast_kernel operator I realized we could sometimes want it to produce doubles rather than floats.

Sure, but the DataType is already present in the input_shape, so probably better to just use that than force the user to pass the DataType twice


lib/kernels/test/src/test_cast_kernel.cc line 87 at r6 (raw file):

      std::vector<int32_t> result_data_gpu =
          load_accessor_data<DataType::INT32>(output_accessor_gpu);

Why not just get back a GenericTensorAccessorR with its data on CPU? Throwing away all the information by dropping down to a std::vector feels unnecessary

Code quote:

      std::vector<int32_t> result_data_gpu =
          load_accessor_data<DataType::INT32>(output_accessor_gpu);

lib/kernels/test/src/test_cast_kernel.cc line 91 at r6 (raw file):

      // Run CPU Forward Kernel
      GenericTensorAccessorW input_accessor_cpu =
          create_random_filled_accessor_w<DataType::FLOAT>(input_shape,

If the data is random, why would I expect input_accessor_cpu to have the same data as input_accessor_gpu? Would it be cleaner to generate a GenericTensorAccessor with random data on CPU, then create a copy of that on GPU, and then run the functions?
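
For example, the setup might look roughly like this (a sketch only; create_random_filled_accessor_w and copy_tensor_accessor_w are the helper names used in this thread, and the final signatures may differ):

// Generate the random input once on CPU, then mirror it onto the GPU so both
// the CPU and GPU kernels run on identical data.
GenericTensorAccessorW input_accessor_cpu =
    create_random_filled_accessor_w(input_shape, cpu_allocator);
GenericTensorAccessorW input_accessor_gpu =
    copy_tensor_accessor_w(input_accessor_cpu, gpu_allocator);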


lib/kernels/test/src/test_concat_kernel.cc line 9 at r6 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("Test concat kernel forward and backward") {
    size_t num_inputs = 2;

Why these changes?


lib/kernels/test/src/test_replicate_kernel.cc line 78 at r6 (raw file):

      GenericTensorAccessorW output_accessor_gpu =
          gpu_allocator.allocate_tensor(output_shape);
      fill_with_zeros(output_accessor_gpu);

Wouldn't filling with random data here give you more assurance the function is correct?


lib/kernels/test/src/test_reverse_kernels.cc line 150 at r6 (raw file):

      GenericTensorAccessorW input_grad_accessor_cpu =
          cpu_allocator.allocate_tensor(input_shape);
      fill_with_zeros(input_grad_accessor_cpu);

I'd recommend creating a helper function for creating zero-filled tensors in test_utils to avoid any "use before initialized" issues

Code quote:

          cpu_allocator.allocate_tensor(input_shape);
      fill_with_zeros(input_grad_accessor_cpu);

lib/kernels/test/src/test_utils.h line 11 at r3 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

I guess you're right, not sure what I was seeing earlier... I could probably sift through all initialization of GenericTensorAccessor's and specify their DeviceType, but maybe our current behavior is fine with the initializer is fine anyways?

Not sure what you mean by "but maybe our current behavior is fine with the initializer is fine anyways". Can you clarify?


lib/kernels/test/src/test_utils.h line 33 at r4 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Done.

Then why is the template parameter still here?


lib/kernels/test/src/test_utils.h line 34 at r6 (raw file):

template <typename DT>
void transfer_memory(GenericTensorAccessorW dst_accessor,

Suggestion:

void transfer_memory(GenericTensorAccessorW &dst_accessor,

lib/kernels/test/src/test_utils.h line 35 at r6 (raw file):

template <typename DT>
void transfer_memory(GenericTensorAccessorW dst_accessor,
                     const DT *src,

Why is one parameter a TensorAccessor and the other is a raw array? Feels inconsistent

Code quote:

void transfer_memory(GenericTensorAccessorW dst_accessor,
                     const DT *src,

lib/kernels/test/src/test_utils.h line 37 at r6 (raw file):

                     const DT *src,
                     DeviceType src_device_type) {
  size_t bytes = dst_accessor.shape.get_volume() * sizeof(DT);

Suggestion:

  size_t num_bytes = dst_accessor.shape.get_volume() * sizeof(DT);

lib/kernels/test/src/test_utils.h line 78 at r6 (raw file):

template <DataType DT>
GenericTensorAccessorR create_random_filled_accessor_r(TensorShape const &shape,

Get the DataType from the TensorShape


lib/kernels/test/src/test_utils.h line 88 at r6 (raw file):

template <typename T>
GenericTensorAccessorW create_filled_accessor_w(TensorShape const &shape,

Get the DataType from the TensorShape


lib/kernels/test/src/test_utils.h line 124 at r6 (raw file):

template <DataType DT>
std::vector<real_type_t<DT>>
    load_accessor_data(GenericTensorAccessorW accessor) {

Suggestion:

    load_accessor_data(GenericTensorAccessorW const &accessor) {

lib/kernels/test/src/test_utils.h line 149 at r6 (raw file):

template <typename T>
bool vectors_are_approx_equal(T lhs, T rhs) {

Now that we have support for GenericTensorAccessors on CPU, using vectors no longer seems necessary I think

Code quote:

template <typename T>
bool contains_non_zero(std::vector<T> &data) {
  return !all_of(data, [](T const &val) { return val == 0; });
}

template <typename T>
bool vectors_are_approx_equal(T lhs, T rhs) {

lib/kernels/test/src/test_utils.cc line 5 at r6 (raw file):

namespace FlexFlow {

bool device_on_cpu(DeviceType device_type) {

Is this function really necessary? Doesn't seem like you're saving much typing or much complexity


lib/kernels/test/src/test_utils.cc line 44 at r6 (raw file):

  return DataTypeDispatch1<CopyTensorAccessorW>{}(
      src_accessor.data_type, src_accessor, std::ref(allocator));
}

The ability to copy tensors feels like a useful primitive by itself, might be better moved into kernels itself?

Code quote:

template <DataType DT>
struct CopyTensorAccessorW {
  GenericTensorAccessorW operator()(GenericTensorAccessorW const &src_accessor,
                                    Allocator &allocator) {
    TensorShape shape =
        get_tensor_shape(src_accessor.shape, src_accessor.data_type);
    GenericTensorAccessorW copied_tensor = allocator.allocate_tensor(shape);

    transfer_memory(
        copied_tensor, src_accessor.get<DT>(), src_accessor.device_type);

    return copied_tensor;
  }
};

GenericTensorAccessorW
    copy_tensor_accessor_w(GenericTensorAccessorW const &src_accessor,
                           Allocator &allocator) {
  return DataTypeDispatch1<CopyTensorAccessorW>{}(
      src_accessor.data_type, src_accessor, std::ref(allocator));
}

lib/local-execution/src/local_task_argument_accessor.cc line 31 at r6 (raw file):

        tensor_backing.shape,
        tensor_backing.ptr,
        this->allocator.get_allocation_device_type()};

At this point probably better to just add a function in accessor.h for converting from a GenericTensorAccessorW to a GenericTensorAccessorR

Code quote:

    GenericTensorAccessorR readonly_tensor_backing = {
        tensor_backing.data_type,
        tensor_backing.shape,
        tensor_backing.ptr,
        this->allocator.get_allocation_device_type()};

lib/local-execution/src/local_task_argument_accessor.cc line 31 at r6 (raw file):

        tensor_backing.shape,
        tensor_backing.ptr,
        this->allocator.get_allocation_device_type()};

Why not just use the device type from tensor_backing?


lib/local-execution/src/local_task_argument_accessor.cc line 49 at r6 (raw file):

    for (GenericTensorAccessorW const &tensor_backing :
         variadic_tensor_backing) {
      readonly_variadic_tensor_backing.push_back(

At this point probably better to just add a function in accessor.h for converting from a GenericTensorAccessorW to a GenericTensorAccessorR


lib/local-execution/src/local_task_argument_accessor.cc line 53 at r6 (raw file):

           tensor_backing.shape,
           tensor_backing.ptr,
           this->allocator.get_allocation_device_type()});

Just use the device type from tensor_backing?

Author

@oOTigger oOTigger left a comment


Reviewable status: 51 of 85 files reviewed, 33 unresolved discussions (waiting on @lockshaw and @reyna-abhyankar)


lib/kernels/CMakeLists.txt line 33 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why?

I'm currently using the DeviceType enum from pcg/device_type.dtg.h


lib/kernels/include/kernels/accessor.h line 45 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Any reason for this over the simpler at(std::vector<int> const &)?

Mainly such that you can just use .at(1,2,3) instead of .at({1,2,3})... no need for an unneeded initializer list


lib/kernels/include/kernels/reverse_kernels_cpu.h line 5 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer full include paths

Done.


lib/kernels/src/accessor.cc line 47 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer at for bounds checking (ArrayShape may have bounds-checking for both, but other containers don't so it's a good habit)

Done.


lib/kernels/src/allocation.cc line 22 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Prefer explicit this->

Done.


lib/kernels/src/cpu/replicate_kernels.cc line 7 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Move over to use the new GenericTensorAccessor access-by-multidimensional-index support?

Done.


lib/kernels/src/cpu/reverse_kernels.cc line 13 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Can't we just get this from the shape information contained in input?

Done.


lib/kernels/src/cpu/reverse_kernels.cc line 49 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why is std::ref necessary?

I run into lvalue, rvalue issues with output_accessor without utilizing std::ref. For some reason DataTypeDispatch1 I think is forwarding output accessor as an rvalue which is causing an error? I think it's related to output_accessor being non const, which it has to be in order for the .at method to return a non const pointer which we can assign values to. std::ref seems to fix this issue though.

If possible, I think we can also just modify how forwarding works in DataTypeDispatch1 to fix this issue as well though?


lib/kernels/test/CMakeLists.txt line 17 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why?

I'm currently using the DeviceType enum from pcg/device_type.dtg.h


lib/kernels/test/src/test_batch_norm_kernel.cc line 36 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Sure, but the DataType is already present in the input_shape, so probably better to just use that than force the user to pass the DataType twice

Done.


lib/kernels/test/src/test_cast_kernel.cc line 87 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why not just get back a GenericTensorAccessorR with its data on CPU? Throwing away all the information by dropping down to a std::vector feels unnecessary

Done.


lib/kernels/test/src/test_cast_kernel.cc line 91 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

If the data is random, why would I expect input_accessor_cpu to have the same data as input_accessor_gpu? Would it be cleaner to generate a GenericTensorAccessor with random data on CPU, then create a copy of that on GPU, and then run the functions?

Didn't catch this since the test was casting floats between 0 and 1 to int, so it was basically just comparing a tensor of all zeros against another tensor of all zeros


lib/kernels/test/src/test_concat_kernel.cc line 9 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why these changes?

Was printing outputs and was easier to visually compare


lib/kernels/test/src/test_replicate_kernel.cc line 78 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Wouldn't filling with random data here give you more assurance the function is correct?

The problem I was running into is that replicate_kernels.cu adds directly to the values already in the output accessor, so in some test runs one of the output tensors was allocated with randomized values in it and the test failed because the two output tensors did not start from the same initial state


lib/kernels/test/src/test_reverse_kernels.cc line 150 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

I'd recommend creating a helper function for creating zero-filled tensors in test_utils to avoid any "use before initialized" issues

Done.


lib/kernels/test/src/test_utils.h line 33 at r4 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Then why is the template parameter still here?

Done.


lib/kernels/test/src/test_utils.h line 34 at r6 (raw file):

template <typename DT>
void transfer_memory(GenericTensorAccessorW dst_accessor,

Done.


lib/kernels/test/src/test_utils.h line 35 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why is one parameter a TensorAccessor and the other is a raw array? Feels inconsistent

It's mainly as sometimes I'm wanting to transfer memory from a vector to an accessor => my current logic for random filled accessors is dependent on this...

I think it becomes potentially more messy creating things like these random filled accessors without this construct as then we'd increment based on ptr


lib/kernels/test/src/test_utils.h line 37 at r6 (raw file):

                     const DT *src,
                     DeviceType src_device_type) {
  size_t bytes = dst_accessor.shape.get_volume() * sizeof(DT);

Done.


lib/kernels/test/src/test_utils.h line 78 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Get the DataType from the TensorShape

Done.


lib/kernels/test/src/test_utils.h line 88 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Get the DataType from the TensorShape

I think we need to use a template here, since we also have to take the value that we're going to fill the accessor with as an input, and we don't otherwise know its type?


lib/kernels/test/src/test_utils.h line 124 at r6 (raw file):

template <DataType DT>
std::vector<real_type_t<DT>>
    load_accessor_data(GenericTensorAccessorW accessor) {

Done.


lib/kernels/test/src/test_utils.h line 149 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Now that we have support for GenericTensorAccessors on CPU, using vectors no longer seems necessary I think

Done.


lib/kernels/test/src/test_utils.cc line 5 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Is this function really necessary? Doesn't seem like you're saving much typing or much complexity

Done.


lib/kernels/test/src/test_utils.cc line 44 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

The ability to copy tensors feels like a useful primitive by itself, might be better moved into kernels itself?

Done.


lib/local-execution/src/local_task_argument_accessor.cc line 31 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

At this point probably better to just add a function in accessor.h for converting from a GenericTensorAccessorW to a GenericTensorAccessorR

Done.


lib/local-execution/src/local_task_argument_accessor.cc line 31 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Why not just use the device type from tensor_backing?

Done.


lib/local-execution/src/local_task_argument_accessor.cc line 49 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

At this point probably better to just add a function in accessor.h for converting from a GenericTensorAccessorW to a GenericTensorAccessorR

Done.


lib/local-execution/src/local_task_argument_accessor.cc line 53 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

Just use the device type from tensor_backing?

Done.

Collaborator

@lockshaw lockshaw left a comment


Reviewed 33 of 34 files at r7, 1 of 1 files at r8, 7 of 7 files at r9, all commit messages.
Dismissed @lockshaw from a discussion.
Reviewable status: all files reviewed, 27 unresolved discussions (waiting on @oOTigger and @reyna-abhyankar)


lib/kernels/include/kernels/accessor.h line 15 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

@reyna-abhyankar Is there a reason we have both GenericTensorAccessorW and GenericTensorAccessorR over just having one GenericTensorAccessor and passing it as either const & or mutable &?

At least in the meantime it seems like it would make sense to have GenericTensorAccessorW coerce non-explicitly to GenericTensorAccessorR, to avoid all of the overloading?
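
One hedged way to express that coercion (a sketch only; field names follow the accessor definitions quoted earlier in this review):

// Non-explicit constructor on GenericTensorAccessorR so a W accessor can be
// passed anywhere an R accessor is expected, without manual conversion calls.
struct GenericTensorAccessorR {
  // ... existing members ...

  GenericTensorAccessorR(GenericTensorAccessorW const &writable)
      : data_type(writable.data_type),
        shape(writable.shape),
        ptr(writable.ptr),
        device_type(writable.device_type) {}
};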


lib/kernels/include/kernels/accessor.h line 45 at r6 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

Mainly such that you can just use .at(1,2,3) instead of .at({1,2,3})... no need for an unneeded initializer list

Let's do the std::vector version unless there's some additional advantage you haven't mentioned yet. Yes the template syntax is cute and all, but it's substantially more complicated feature-wise and all it seems to do is remove the need to type two characters (also it doesn't allow for passing in coordinates of runtime-determined dimension)
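
A hedged sketch of the std::vector-based version (the offset math mirrors the stride loop quoted earlier in this review; names like real_type_t, mk_runtime_error, and device_type are taken from this thread):

template <DataType DT>
real_type_t<DT> &GenericTensorAccessorW::at(std::vector<int> const &indices) {
  if (this->device_type != DeviceType::CPU) {
    throw mk_runtime_error("at() requires a CPU-resident accessor");
  }

  // Flatten the multidimensional index into a linear offset using the shape's
  // per-dimension sizes as strides.
  size_t offset = 0;
  size_t multiplier = 1;
  for (size_t i = 0; i < indices.size(); i++) {
    offset += indices[i] * multiplier;
    multiplier *= this->shape.at(legion_dim_t(i));
  }
  return this->get<DT>()[offset];
}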


lib/kernels/include/kernels/accessor.h line 14 at r9 (raw file):

namespace FlexFlow {

struct Allocator;

Better to #include in the header that contains Allocator


lib/kernels/include/kernels/accessor.h line 264 at r9 (raw file):

    GenericTensorAccessorR const &src_accessor);

void transfer_data_between_accessors(

Makes parameter order clear, and makes the operation a bit clearer ("transfer" sounds more like a "move" operation)

Suggestion:

void copy_accessor_data_to_l_from_r(

lib/kernels/include/kernels/replicate_kernels.h line 15 at r9 (raw file):

void backward_kernel(ffStream_t stream,
                     GenericTensorAccessorR const &output,
                     GenericTensorAccessorW const &input,

Why reorder input and output?


lib/kernels/include/kernels/replicate_kernels_cpu.h line 13 at r9 (raw file):

void cpu_backward_kernel(GenericTensorAccessorR const &output,
                         GenericTensorAccessorW &input,

Why reorder input and output?


lib/kernels/src/accessor.cc line 30 at r6 (raw file):

Previously, lockshaw (Colin Unger) wrote…

@Marsella8 Do you think we could reasonably add some generic coord -> index & index apsce concept in utils and then move this code and part of the 1458 over to use it? Feels like this logic is getting unnecessarily reimplemented

Tracked in #1528


lib/kernels/src/accessor.cc line 32 at r9 (raw file):

                         num_bytes,
                         cudaMemcpyDeviceToDevice));
  }

Suggestion:

  } else {
    assert (src_device_type == DeviceType::GPU);
    assert (dst_device_type == DeviceType::GPU);
    checkCUDA(cudaMemcpy(dst_accessor.ptr,
                         src_accessor.ptr,
                         num_bytes,
                         cudaMemcpyDeviceToDevice));
  }

lib/kernels/src/managed_ff_stream.cc line 14 at r9 (raw file):

ManagedFFStream &ManagedFFStream::operator=(ManagedFFStream &&other) noexcept {
  if (this != &other) {

Why this change?


lib/kernels/src/managed_per_device_ff_handle.cc line 22 at r9 (raw file):

ManagedPerDeviceFFHandle &ManagedPerDeviceFFHandle::operator=(
    ManagedPerDeviceFFHandle &&other) noexcept {
  if (this != &other) {

Why change this from swap?


lib/kernels/src/managed_per_device_ff_handle.cc line 40 at r9 (raw file):

    checkCUDA(cudaFree(this->handle->workSpace));
    delete this->handle;
    this->handle = nullptr;

Why explicitly assign the handle to nullptr?


lib/kernels/src/cpu/reverse_kernels.cc line 49 at r6 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

I run into lvalue, rvalue issues with output_accessor without utilizing std::ref. For some reason DataTypeDispatch1 I think is forwarding output accessor as an rvalue which is causing an error? I think it's related to output_accessor being non const, which it has to be in order for the .at method to return a non const pointer which we can assign values to. std::ref seems to fix this issue though.

If possible, I think we can also just modify how forwarding works in DataTypeDispatch1 to fix this issue as well though?

I'm tentatively guessing that might be due to DataTypeDispatch1 and DataTypeDispatch2 taking their arguments as Args ... rather than as Args &&...--can you try making that change and seeing if it fixes the problem?
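
Roughly the change being suggested (an illustrative sketch only; the real DataTypeDispatch1 structure may differ, and only two DataType cases are shown):

// Take trailing arguments as forwarding references so non-const lvalues (like
// a mutable output accessor) are passed through by reference instead of being
// treated as rvalues.
template <template <DataType> class F, typename... Args>
void dispatch1(DataType data_type, Args &&...args) {
  switch (data_type) {
    case DataType::FLOAT:
      F<DataType::FLOAT>{}(std::forward<Args>(args)...);
      break;
    case DataType::DOUBLE:
      F<DataType::DOUBLE>{}(std::forward<Args>(args)...);
      break;
    // ... remaining DataType cases ...
    default:
      throw mk_runtime_error("Unhandled DataType in dispatch1");
  }
}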


lib/kernels/src/cuda/ops/linear_kernels.cu line 138 at r9 (raw file):

                           in_dim,
                           &alpha,
                           (void *)weight_ptr,

Avoid c-style casts

Suggestion:

static_cast<void *>(weight_ptr),

lib/kernels/test/src/test_cast_kernel.cc line 83 at r9 (raw file):

                                        output_accessor_cpu,
                                        DataType::FLOAT,
                                        DataType::DOUBLE);

Any reason not to just get these DataTypes from the accessors' shape fields?

Code quote:

                                        DataType::FLOAT,
                                        DataType::DOUBLE);

lib/kernels/test/src/test_cast_kernel.cc line 85 at r9 (raw file):

                                        DataType::DOUBLE);

      CHECK(w_accessors_are_equal<DataType::DOUBLE>(output_accessor_gpu,

Why is a template parameter needed here? Isn't the type information already present in the shape field of the accessors?

Code quote:

<DataType::DOUBLE>(o

lib/kernels/test/src/test_managed_ff_stream.cc line 10 at r9 (raw file):

    ManagedFFStream base_stream{};

    SUBCASE("Test ManagedFFStream Move Constructor") {

Add checks for if it's the same object on both sides of the assignment.


lib/kernels/test/src/test_managed_per_device_ff_handle.cc line 8 at r9 (raw file):

TEST_SUITE(FF_TEST_SUITE) {
  TEST_CASE("Test Managed Per Device FF Handle") {
    ManagedPerDeviceFFHandle base_handle{};

Can we force the user to pass arguments here instead? I'm not a huge fan of default constructors where there's no obvious single correct value that should be set. If you want to add a function that creates one with some default arguments and use that, that's fine, but I'd rather that be an explicit function call to make it clear that it's kind of an arbitrary choice supplied somewhere rather than intrinsic to the object semantics

Code quote:

    ManagedPerDeviceFFHandle base_handle{};
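
A hedged sketch of that idea (purely illustrative; the parameter names and values below are placeholders, not ManagedPerDeviceFFHandle's actual constructor arguments):

// Hypothetical factory making the "default" configuration an explicit,
// clearly arbitrary choice rather than part of the object's semantics.
ManagedPerDeviceFFHandle make_default_test_handle() {
  size_t workspace_size = 1024 * 1024;          // placeholder value
  bool allow_tensor_op_math_conversion = true;  // placeholder value
  return ManagedPerDeviceFFHandle{workspace_size,
                                  allow_tensor_op_math_conversion};
}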

lib/kernels/test/src/test_managed_per_device_ff_handle.cc line 16 at r9 (raw file):

    SUBCASE("Test ManagedPerDeviceFFHandle Move Constructor") {
      PerDeviceFFHandle const *base_handle_ptr = &base_handle.raw_handle();

Add checks for if it's the same object on both sides of the assignment.


lib/kernels/test/src/test_reverse_kernels.cc line 60 at r9 (raw file):

  TEST_CASE("Check Reverse Forward and Backward Kernels against CPU Kernels") {
    std::size_t num_out_blks = 1;

Might be nice to have this be >1 as having it be 1 could be hiding bugs (as iteration order doesn't matter if your loop has bound 1)

Code quote:

    std::size_t num_out_blks = 1;

lib/kernels/test/src/test_utils.h line 35 at r6 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

It's mainly as sometimes I'm wanting to transfer memory from a vector to an accessor => my current logic for random filled accessors is dependent on this...

I think it becomes potentially more messy creating things like these random filled accessors without this construct as then we'd increment based on ptr

Seems like this has been removed since this comment, so I guess it's fine now?


lib/kernels/test/src/test_utils.h line 88 at r6 (raw file):

Previously, oOTigger (Dylan Lim) wrote…

I think we need to use a template here, since we also have to take the value that we're going to fill the accessor with as an input, and we don't otherwise know its type?

You could technically get around this by passing in a DataTypeValue which is a variant of the real_type_ts of all of the DataTypes, but I'm fine with the current implementation too. I'd say the big thing is that it should be checked against the DataType in the shape, and an error raised if the chosen T and the TensorShape's DataType don't match
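
A hedged sketch of that check (type_to_data_type<T> is an assumed mapping from C++ type to DataType, not necessarily the project's real name, and the direct pointer write assumes a CPU allocator):

template <typename T>
GenericTensorAccessorW create_filled_accessor_w(TensorShape const &shape,
                                                Allocator &allocator,
                                                T val) {
  // Reject fill values whose C++ type doesn't match the shape's DataType.
  if (shape.data_type != type_to_data_type<T>::value) {
    throw mk_runtime_error(
        "create_filled_accessor_w: fill value type does not match the "
        "TensorShape's DataType");
  }

  GenericTensorAccessorW accessor = allocator.allocate_tensor(shape);
  T *data = static_cast<T *>(accessor.ptr);
  for (size_t i = 0; i < accessor.shape.num_elements(); i++) {
    data[i] = val;
  }
  return accessor;
}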


lib/kernels/test/src/test_utils.h line 101 at r9 (raw file):

  if (accessor_a.data_type != accessor_b.data_type) {
    return false;
  }

Suggestion:

  if (accessor_a.shape != accessor_b.shape) {
    throw mk_runtime_error(fmt::format("w_accessors_are_equal expected accessors to have the same shape, but received: {} != {}", accessor_a.shape, accessor_b.shape));  
  }

lib/kernels/test/src/test_utils.h line 115 at r9 (raw file):

  for (size_t i = 0; i < accessor_a.shape.num_elements(); i++) {
    if (a_data_ptr[i] != b_data_ptr[i]) {
      print_accessor(cpu_accessor_a);

Printing by default seems like a bad idea; this should probably be handled by the assertion check, not the actual boolean condition itself.


lib/kernels/test/src/test_utils.cc line 126 at r9 (raw file):

GenericTensorAccessorR
    create_cpu_compatible_accessor_r(GenericTensorAccessorR const &accessor,

Suggestion:

copy_to_accessor_to_cpu_if_necessary

lib/kernels/test/src/test_utils.cc line 158 at r9 (raw file):

};

void print_accessor(GenericTensorAccessorR const &accessor) {

Suggestion:

void print_tensor_accessor_contents(

lib/kernels/test/src/test_utils.cc line 158 at r9 (raw file):

};

void print_accessor(GenericTensorAccessorR const &accessor) {

Might be nice to have this return std::string or take in a std::ostream & so the user can choose where it goes rather than just forcing it to std::cout

Code quote:

void 
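
A hedged sketch of that interface (function names follow the rename suggested above; the string-returning wrapper is an added convenience):

#include <ostream>
#include <sstream>

// Caller chooses the destination stream instead of being forced to std::cout.
void print_tensor_accessor_contents(GenericTensorAccessorR const &accessor,
                                    std::ostream &out);

// Convenience wrapper that captures the output into a std::string.
std::string format_tensor_accessor_contents(
    GenericTensorAccessorR const &accessor) {
  std::ostringstream oss;
  print_tensor_accessor_contents(accessor, oss);
  return oss.str();
}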

lib/local-execution/src/ops/linear.cc line 152 at r9 (raw file):

                 per_device_state,
                 input.get_float_ptr(),
                 (float *)input_grad.get_float_ptr(),

No cast should be necessary as you're already calling get_float_ptr?
