Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectors #9

Merged
merged 76 commits into from
Nov 11, 2024
Merged

Vectors #9

merged 76 commits into from
Nov 11, 2024

Conversation

claytongentry
Copy link
Member

@claytongentry claytongentry commented Nov 4, 2024

Vectors represent the "nouns" of your manifold universe. They are the entities on whose behalf you can roll up metrics, like a Page or a Campaign.

To add a vector to your umbrella, run manifolds vectors add page.

This will generate a config file in your vectors/ directory that you can fill out like so:

attributes:
  published_at: TIMESTAMP
  title: STRING

You can express the vectors you'd like to intersect in your manifold.yml like so:

vectors:
  # List the vectors you want to include in this manifold
  # Example:
  - page

Then when you run manifolds generate, your dimensions schema is generated from the vectors you specified.

@claytongentry claytongentry changed the title Entities Vectors Nov 4, 2024
@claytongentry claytongentry marked this pull request as ready for review November 4, 2024 17:44
.rubocop.yml Outdated
Comment on lines 6 to 17
RSpec/ExampleLength:
Max: 18

RSpec/MultipleExpectations:
Enabled: false

RSpec/MultipleMemoizedHelpers:
Enabled: false

RSpec/NestedGroups:
Max: 4

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are fairly loose settings. MultipleExpectations and MultipleMemoizedHelpers in particular are very helpful in keeping specs clean. Multiple assertions in a test is regarded as a smell in TDD generally, and too many memoized helpers is a pretty clear sign that you need to refactor your code. Consider removing these settings from here then run rubocop --auto-gen-config to generate a .rubocop_todo.yml. This gives you a set of goals you can work toward over time.

spec/manifolds/cli_spec.rb Outdated Show resolved Hide resolved
File.write("#{File.dirname(__FILE__)}/../../lib/manifolds/templates/config_template.yml", "vectors:\nmetrics:")
File.write("#{File.dirname(__FILE__)}/../../lib/manifolds/templates/vector_template.yml", "attributes:")
end
end

describe "#init" do
subject(:cli) { described_class.new(logger: null_logger) }

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting the action being tested in a before blocks makes tests harder to read because you need to jump back and forth in the file to figure out what's going on.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tylr may disagree with this one. In other non-Ruby projects I've always put the execution logic in the test itself rather than any setup clause. I just thought this was rspec convention.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Execution should be with test. Setup is for, well, setting up.

I've seen different practices over the years, but my understanding that RSpec maintainers recommend keeping the execution logic inside the example. In fact I believe they also recommend keeping as much setup in the example as is practical. Personally I feel that is not often very practical.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like to opt for the one liners, but agree with John here. The allow calls themselves are part of a behavior.

I could maybe see it being ok in a before within a describe "#method" block where all calls to the method need to be mocked. Or even extracting a method if a ton of specs need that setup.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. In that case, I think we all actually agree.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I hadn't even looked at the stubs. I think I will bite my tongue for now :D

Comment on lines 32 to 33
expect(Dir.exist?("./#{project_name}/projects")).to be true
expect(Dir.exist?("./#{project_name}/vectors")).to be true

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Either split this into two tests (my preference), or wrap them in an aggregate_failures block.

Also, expect(Pathname("some/path")).to be_directory reads a lot better.

it "creates a 'projects' directory for the project" do
  cli.init(project_name)
  expect(Pathname(".#{project_name}/projects")).to be_directory
end

it "creates a 'vectors' directory for the project" do
  cli.init(project_name)
  expect(Pathname(".#{project_name}/vectors")).to be_directory
end
it "creates 'projects' and 'vectors' directories for the project" do
  cli.init(project_name)
  aggregate_failures do
    expect(Pathname(".#{project_name}/projects")).to be_directory
    expect(Pathname(".#{project_name}/vectors")).to be_directory
  end
end

Copy link
Member

@tylr tylr Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @johncarney. I had a similar thought and recently started exploring better test matchers to implement a similar change.

This is an interesting topic to dive into...

On a related note, we should consider adopting single-expectation tests that exclude any initialization logic (see: https://www.betterspecs.org/#single). In my experience, this approach leads to better specifications for several nuanced reasons.

To transform these tests into true specs that read like specifications, I would structure them like this:

context "when a project is initialized" do
  let(:project_name) { Faker.snake_case_name_or_something }
  let(:base_path) { File.join(Dir.pwd, project_name) }

  before { cli.init(project_name) }

  it { expect(Pathname(File.join(base_path, 'projects'))).to be_directory }
  it { expect(Pathname(File.join(base_path, 'vectors'))).to be_directory }
end

This refactoring reduces the lines of code and provides a clear entry point for specifying behaviors without the need for extensive context/state setup. But there's more to consider...

expect(Pathname(File.join(base_path, 'projects'))).to be_directory

Isn’t Pathname(File.join(base_path, 'projects')) an example of application domain details creeping into our tests? I would argue that it is. Ideally, development should be driven by specs that don’t embed specific implementation details.

Here's an updated approach to consider:

describe CLI do
  subject(:cli) { CLI.new(...) }

  describe '#init' do
    context "with a name and directory" do
      let(:name) { Faker.snake_case_name_or_something }
      let(:directory) { File.join(Dir.pwd, name) }

      before { subject.init(name, directory:) }

      it { expect(Pathname(subject.vectors_directory)).to be_directory }
      it { expect(Pathname(subject.projects_directory)).to be_directory }
    end
  end
end

I'll admit that defining what makes a good spec is near impossible. However, a useful litmus test is whether the test code reads like natural language and whether the test output is concise and hierarchically clear.

Consider the current test output:

> rspec -f d
Manifolds::CLI
  #new
    is expected to be directory
  #create
    when within an umbrella project
      creates a tables directory within the project
      creates a routines directory within the project
      creates a manifold.yml file
      writes the manifold.yml file with dimensions
      writes the manifold.yml file with metrics
    when outside an umbrella project
      does not allow adding projects and logs an error
  #generate
    calls generate_dimensions_schema on bq service with correct project name

Manifolds::Services::BigQueryService
  #generate_dimensions_schema
    checks if the configuration file exists
    writes the dimensions schema to a file
    logs success message
    when the configuration file does not exist
      logs an error message

Finished in 0.01196 seconds (files took 0.08927 seconds to load)
12 examples, 0 failures

We can identify redundancies, unnatural language, missing context, etc in the output. By following my suggested organization, the output might look more like this:

Manifolds::CLI
  #init
    creates a project directory
  #create
    within a project
      creates a tables directory
      creates a routines directory

Ok, I digress.

spec/manifolds/cli_spec.rb Outdated Show resolved Hide resolved
spec/manifolds/cli_spec.rb Outdated Show resolved Hide resolved
spec/manifolds/cli_spec.rb Outdated Show resolved Hide resolved
Copy link
Member

@tylr tylr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok...I went way too deep on test thread but a couple other quick bits...

Requires

We're doing requires wrong. We should have a manifold.rb file in lib/ root that does all the necessary loading so that the files living under it's namespace don't need to maintain requires. They will eventually be duplicative and impossible to manage. Let's hit this in a subsequent PR but catch it before it goes off the...rails :)

Relative paths

Similarly, paths. Relative paths are bad and I don't believe you can guarantee ./ means the same thing from all execution contexts.

I think there are clever ways in ruby to add the lib/ root to the execution environments load path. So you can do stuff like require 'lib/manifolds/cli' anywhere within the project. If a google or AI prompt yields an obvious answer great. Otherwise, let's not add any more relative paths in this PR and start using the convention I normally see.

current_directory = Dir.pwd # same as pwd from any terminal
path = File.join(current_directory, 'my', 'path', 'to_a.file') # I believe this is a best practice because it takes care of escape sequences and delimiters. 

Ok thanks!

Comment on lines 32 to 33
expect(Dir.exist?("./#{project_name}/projects")).to be true
expect(Dir.exist?("./#{project_name}/vectors")).to be true
Copy link
Member

@tylr tylr Nov 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, @johncarney. I had a similar thought and recently started exploring better test matchers to implement a similar change.

This is an interesting topic to dive into...

On a related note, we should consider adopting single-expectation tests that exclude any initialization logic (see: https://www.betterspecs.org/#single). In my experience, this approach leads to better specifications for several nuanced reasons.

To transform these tests into true specs that read like specifications, I would structure them like this:

context "when a project is initialized" do
  let(:project_name) { Faker.snake_case_name_or_something }
  let(:base_path) { File.join(Dir.pwd, project_name) }

  before { cli.init(project_name) }

  it { expect(Pathname(File.join(base_path, 'projects'))).to be_directory }
  it { expect(Pathname(File.join(base_path, 'vectors'))).to be_directory }
end

This refactoring reduces the lines of code and provides a clear entry point for specifying behaviors without the need for extensive context/state setup. But there's more to consider...

expect(Pathname(File.join(base_path, 'projects'))).to be_directory

Isn’t Pathname(File.join(base_path, 'projects')) an example of application domain details creeping into our tests? I would argue that it is. Ideally, development should be driven by specs that don’t embed specific implementation details.

Here's an updated approach to consider:

describe CLI do
  subject(:cli) { CLI.new(...) }

  describe '#init' do
    context "with a name and directory" do
      let(:name) { Faker.snake_case_name_or_something }
      let(:directory) { File.join(Dir.pwd, name) }

      before { subject.init(name, directory:) }

      it { expect(Pathname(subject.vectors_directory)).to be_directory }
      it { expect(Pathname(subject.projects_directory)).to be_directory }
    end
  end
end

I'll admit that defining what makes a good spec is near impossible. However, a useful litmus test is whether the test code reads like natural language and whether the test output is concise and hierarchically clear.

Consider the current test output:

> rspec -f d
Manifolds::CLI
  #new
    is expected to be directory
  #create
    when within an umbrella project
      creates a tables directory within the project
      creates a routines directory within the project
      creates a manifold.yml file
      writes the manifold.yml file with dimensions
      writes the manifold.yml file with metrics
    when outside an umbrella project
      does not allow adding projects and logs an error
  #generate
    calls generate_dimensions_schema on bq service with correct project name

Manifolds::Services::BigQueryService
  #generate_dimensions_schema
    checks if the configuration file exists
    writes the dimensions schema to a file
    logs success message
    when the configuration file does not exist
      logs an error message

Finished in 0.01196 seconds (files took 0.08927 seconds to load)
12 examples, 0 failures

We can identify redundancies, unnatural language, missing context, etc in the output. By following my suggested organization, the output might look more like this:

Manifolds::CLI
  #init
    creates a project directory
  #create
    within a project
      creates a tables directory
      creates a routines directory

Ok, I digress.

Copy link
Member Author

@claytongentry claytongentry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the extensive reviews @johncarney and @tylr .

Have made changes:

  • Enforced a couple of those rubocop rules I'd allowed to lax.
  • Single assertion for most (all?) tests.
  • Removed all relative path usage in favor of Dir.pwd and File.join.

@tylr I don't quite the grasp the difference in your suggestion vs the test output:

➜  manifolds git:(claytongentry/entities) rspec -f d

Randomized with seed 7975

Manifolds::CLI
  vectors
    add
      when outside an umbrella project
        indicates that the command must be run within a project
      when adding a vector within an umbrella project
        creates a vector configuration file with 'attributes'
  #add
    when adding a project within an umbrella project
      creates a 'routines' directory
      adds metrics to the project's manifold configuration
      adds vectors to the project's manifold configuration
      creates a 'tables' directory
    when outside an umbrella project
      indicates that the command must be run within a project
  #init
    when initializing a new project
      creates a 'projects' directory
      creates a 'vectors' directory

Manifolds::Services::VectorService
  #load_vector_schema
    loads and transforms vector schema
    when vector file doesn't exist
      logs error
      returns nil

Manifolds::Services::BigQueryService
  #generate_dimensions_schema
    when the project configuration exists
      includes the expected schema structure
      generates a dimensions schema file
    when the project configuration is missing
      indicates the configuration is missing

Finished in 0.05442 seconds (files took 0.28224 seconds to load)
15 examples, 0 failures

It seems to read like English to me, as you're suggesting. Let me know if I'm missing something.

File.write("#{File.dirname(__FILE__)}/../../lib/manifolds/templates/config_template.yml", "vectors:\nmetrics:")
File.write("#{File.dirname(__FILE__)}/../../lib/manifolds/templates/vector_template.yml", "attributes:")
end
end

describe "#init" do
subject(:cli) { described_class.new(logger: null_logger) }
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tylr may disagree with this one. In other non-Ruby projects I've always put the execution logic in the test itself rather than any setup clause. I just thought this was rspec convention.

Copy link
Member

@tylr tylr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One small rubyism suggestion. Up to you to implement John / my suggestion on the mocks in before blocks. I could see that changing a few ways as test suite grows.

I think this passes my QA overall and fine merging.

Comment on lines 17 to 24
fields = []

# Load vector schemas
config["vectors"]&.each do |vector|
@logger.info("Loading vector schema for '#{vector}'.")
vector_schema = @vector_service.load_vector_schema(vector)
fields << vector_schema if vector_schema
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fields = []
# Load vector schemas
config["vectors"]&.each do |vector|
@logger.info("Loading vector schema for '#{vector}'.")
vector_schema = @vector_service.load_vector_schema(vector)
fields << vector_schema if vector_schema
end
fields = config["vectors"].reduce([]) do |list, vector|
@logger.info("Loading vector schema for '#{vector}'.")
[*@vector_service.load_vector_schema(vector), *list]
end

Copy link

@johncarney johncarney Nov 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💯 agree with Tyler. Map/reduce is your friend. Only thing is in the original, there's a safe access operator on config["vectors"]. The simplest way to tweak Tyler's suggestion so it handles nil is to wrap it in Array(). Array(nil) gives you an empty array:

fields = Array(config["vector"]).reduce do |list, vector|
  # ...
end

Another thing I noticed is that you could use filter_map here.

fields = Array(config["vectors"]).filter_map do |vector|
  @logger.info("Loading vector schema for '#{vector}'."))
  @vector_service.load_vector_schema(vector)
end

filter_map was introduced in Ruby 2.7 and to this day I still keep forgetting about it :D

As a side note, I dislike accessing instance variables directly. It tends to impede refactoring. Except in rare circumstances I always use attr_reader (never attr_writer or attr_accessor though—those should be removed from the language).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry I missed this — I appreciate the note. I like filter_map, will snag it in a followup.

File.write("#{File.dirname(__FILE__)}/../../lib/manifolds/templates/config_template.yml", "vectors:\nmetrics:")
File.write("#{File.dirname(__FILE__)}/../../lib/manifolds/templates/vector_template.yml", "attributes:")
end
end

describe "#init" do
subject(:cli) { described_class.new(logger: null_logger) }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like to opt for the one liners, but agree with John here. The allow calls themselves are part of a behavior.

I could maybe see it being ok in a before within a describe "#method" block where all calls to the method need to be mocked. Or even extracting a method if a ton of specs need that setup.

@claytongentry claytongentry merged commit b257a5d into master Nov 11, 2024
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants