
Implement auto-rollback #1585

Merged
merged 9 commits on Nov 20, 2024

Conversation


@michaeljguarino michaeljguarino commented Nov 14, 2024

Handle service failures by auto-rolling back to a prior service context (also persist service context history per stage)

Test Plan

unit tests

Checklist

  • If required, I have updated the Plural documentation accordingly.
  • I have added tests to cover my changes.
  • I have added a meaningful title and summary to convey the impact of this PR to a user.

Summary by CodeRabbit

  • New Features

    • Introduced the ability to specify additional containers in the console deployment configuration.
    • Added a new property for determining uniqueness in configurations for CustomStackRun and PrAutomation.
    • Enhanced pipeline context management with a new PipelineContextHistory schema.
    • Implemented new GraphQL capabilities for retrieving pipeline context history.
  • Bug Fixes

    • Improved error handling in various modules to ensure better logging and robustness.
  • Documentation

    • Updated GraphQL schema to include new fields and types related to pipeline context history.
  • Chores

    • Removed outdated Prometheus monitoring modules and replaced them with a new plugin-based approach.
    • Updated dependency management in the project to enhance compatibility and functionality.

@michaeljguarino michaeljguarino added the enhancement New feature or request label Nov 14, 2024
@michaeljguarino michaeljguarino requested a review from a team November 14, 2024 05:44

linear bot commented Nov 14, 2024

@michaeljguarino michaeljguarino force-pushed the mjg/prod-2887-pipeline-auto-rollback-apis branch 10 times, most recently from 8e5650d to 6b7e58a Compare November 19, 2024 03:39

coderabbitai bot commented Nov 19, 2024

Important

Review skipped

Auto reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The changes in this pull request enhance the flexibility and functionality of the console application and its associated components. Key modifications include the introduction of a conditional block in the Kubernetes deployment template to allow for additional containers, the addition of uniqueness constraints in Custom Resource Definitions (CRDs), and the implementation of new metrics and telemetry features. Furthermore, several new tests are added to ensure the robustness of the changes, particularly around context history management in pipelines and the enforcement of unique names in configurations.

Changes

File Path | Change Summary
charts/console/templates/deployment.yaml Added conditional block for additional containers under .Values.console.additionalContainers. Existing containers and configurations remain unchanged.
charts/console/values.yaml Added additionalContainers: ~ under the console section.
charts/controller/crds/deployments.plural.sh_customstackruns.yaml Added uniqBy property in spec.configuration to specify uniqueness constraints with a required scope field.
charts/controller/crds/deployments.plural.sh_prautomations.yaml Added uniqBy property under validation in configuration schema, updated required fields to include uniqBy.
config/config.exs Added configuration block for Console.PromEx and prom_plugins key. Updated :piazza_core shutdown delay and :oidc_providers setting.
config/prod.exs Added configuration for Console.PromEx metrics server settings.
config/test.exs Added prom_plugins: [] to the :console configuration.
go/client/models_gen.go Introduced new types for pipeline context history and updated PipelineStage to include ContextHistory. Added UniqByAttributes type for uniqueness enforcement.
go/controller/api/v1alpha1/prautomation_types.go Added UniqBy field to PrAutomationConfigurationValidation and introduced PrAutomationUniqBy type.
go/controller/api/v1alpha1/zz_generated.deepcopy.go Added deep copy methods for the new PrAutomationUniqBy struct.
go/controller/config/crd/bases/deployments.plural.sh_customstackruns.yaml Added uniqBy property in spec.configuration of CustomStackRun CRD.
go/controller/config/crd/bases/deployments.plural.sh_prautomations.yaml Added uniqBy in validation schema of PrAutomation CRD and updated required fields.
lib/console/ai/chat.ex Updated @summary to change title requirements from "at most 8 words" to "at most 6 words".
lib/console/ai/fixer.ex Added @prompt module attribute for generating code/configuration change recommendations.
lib/console/ai/provider/anthropic.ex Changed default model in chat/2 function from "claude-3-5-haiku-latest" to "claude-3-5-sonnet-latest".
lib/console/application.ex Removed references to Console.Prom.Setup and added Console.PromEx to child processes.
lib/console/deployments/cron.ex Added error handling in cache_warm/0 function.
lib/console/deployments/git/agent.ex Integrated telemetry for monitoring, replacing previous metrics calls with telemetry events.
lib/console/deployments/git/statistics.ex Changed metrics reporting in disk/0 function to use telemetry events instead of Metrics.
lib/console/deployments/helm/agent.ex Set explicit timeout for fetch and digest functions in GenServer calls.
lib/console/deployments/observer/runner.ex Improved error logging for unknown observer errors in run/1 function.
lib/console/deployments/pipelines.ex Added historical context tracking for pipeline stages and enhanced context management functionalities.
lib/console/deployments/pipelines/discovery.ex Introduced revert/1 and do_revert/1 functions for pipeline stage reversion.
lib/console/deployments/pipelines/stage_worker.ex Added revert/2 method for reverting pipeline stage context.
lib/console/deployments/pr/validation.ex Enhanced validation logic to include uniqueness checks for string types.
lib/console/deployments/pubsub/consumers/pipeline.ex Updated handle_event/1 to handle {:revert, %PipelineStage{}} events.
lib/console/deployments/pubsub/protocols/pipelineable.ex Refactored logic for handling service statuses in pipe function.
lib/console/deployments/pubsub/recurse.ex Updated handling of pull request events to include stack preloading.
lib/console/deployments/statistics.ex Shifted to telemetry for reporting cluster and service statistics.
lib/console/graphql/deployments/git.ex Added new ecto_enum for validation_uniq_scope and updated input objects for uniqueness validation.
lib/console/graphql/deployments/pipeline.ex Introduced new input and output objects for tracking pipeline context history.
lib/console/graphql/resolvers/deployments/pipeline.ex Added function to retrieve pipeline context history for a specific stage.
lib/console/prom/ecto.ex Deleted module for Ecto integration with Prometheus.
lib/console/prom/metrics.ex Deleted module for managing console-related metrics.
lib/console/prom/plugin.ex Introduced new PromEx plugin for monitoring console metrics.
lib/console/prom/scraper.ex Deleted module for scraping deployment statistics.
lib/console/prom/setup.ex Deleted module for setting up Prometheus metrics and telemetry.
lib/console/prom_ex.ex Defined Console.PromEx module for integrating PromEx monitoring capabilities.
lib/console/schema/embeds.ex Added UniqScope enum and updated embedded schema functionality for unique validation.
lib/console/schema/pipeline_context_history.ex Introduced schema for managing pipeline context history with associated query functions.
mix.exs Removed :horde dependency, added multiple new dependencies for telemetry and metrics, and updated existing dependencies.
plural/helm/console/crds/deployments.plural.sh_customstackruns.yaml Added uniqBy field in spec.configuration for CustomStackRun CRD.
plural/helm/console/crds/deployments.plural.sh_prautomations.yaml Added uniqBy field in validation schema for PrAutomation CRD.
priv/repo/migrations/20241113160231_pipeline_context_history.exs Created migration for pipeline_context_history table with foreign keys and timestamps.
schema/schema.graphql Added new fields and types for tracking historical context data in the GraphQL schema.
test/console/deployments/git_test.exs Added test for rejecting pull requests with duplicate names and updated existing tests for unique name validation.
test/console/deployments/pipelines_test.exs Enhanced tests for pipeline context management and added tests for reverting contexts.
test/console/graphql/queries/deployments/pipeline_queries_test.exs Added test for fetching pipeline context history within a pipeline stage.
test/support/factory.ex Introduced factory function for creating PipelineContextHistory records.

Poem

In the fields where rabbits play,
New changes hop in bright array.
With metrics and histories to explore,
Our console's magic grows ever more!
So let us cheer with joyful glee,
For all the wonders yet to see! 🐇✨




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 18

🧹 Outside diff range and nitpick comments (51)
lib/console/deployments/git/statistics.ex (2)

13-14: Consider adding error handling and documentation

While the telemetry implementation is correct, consider these improvements:

  1. Add error handling around telemetry execution
  2. Document the metrics and their significance

Consider applying this improvement:

 def disk() do
   {count, size} =
     :ets.tab2list(Briefly.Entry.Dir)
     |> Enum.map(fn {_pid, dir} -> dir end)
     |> Enum.reduce({0, 0}, fn dir, {count, size} ->
       {dc, ds} = Console.df(dir)
       {count + dc, size + ds}
     end)

+  # Report total number of files in the cache
   :telemetry.execute(metric_scope(:file_count), %{total: count}, %{})
+  # Report total size of files in bytes
   :telemetry.execute(metric_scope(:file_size), %{total: size}, %{})
+ rescue
+  error ->
+    Logger.error("Failed to report disk statistics: #{inspect(error)}")
+    reraise error, __STACKTRACE__
 end

Consider adding error handling for directory operations

The implementation of Console.df/1 uses File.stat!/1 and File.ls!/1 which raise exceptions on errors. Since the directories are obtained from an ETS table and could be removed while being processed, the code should handle these potential failures gracefully:

  • In lib/console.ex, df/2 uses bang variants (File.stat!/1, File.ls!/1) which can raise exceptions
  • The caller in lib/console/deployments/git/statistics.ex doesn't handle these potential exceptions

Suggested changes:

  • Replace File.stat!/1 with File.stat/1
  • Replace File.ls!/1 with File.ls/1
  • Add appropriate error handling for cases where directories are removed or inaccessible

Regarding parallel processing: While Task.async_stream/3 is used in other parts of the codebase, the current sequential implementation is acceptable unless profiling shows it to be a bottleneck, as disk I/O operations may not benefit significantly from parallelization.

🔗 Analysis chain

Line range hint 4-11: Verify handling of edge cases and consider performance optimization

The current implementation might benefit from additional error handling and performance optimization:

  1. Consider handling cases where directory entries are removed while being processed
  2. For large number of entries, consider using Task.async_stream for parallel processing

Let's verify the potential impact:


Let me gather more information about the error handling patterns and the Console.df/1 function:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for similar patterns in the codebase to understand how ETS table access is typically handled
rg -A 5 "tab2list.*Entry\.Dir"

# Look for any existing parallel processing patterns
ast-grep --pattern 'Task.async_stream($$$)'

Length of output: 4336


Script:

#!/bin/bash
# Check the implementation of Console.df/1
ast-grep --pattern 'def df($$$)'

# Look for error handling patterns with directories
rg -A 3 "rescue.*File\." 

Length of output: 799

priv/repo/migrations/20241113160231_pipeline_context_history.exs (1)

5-11: Consider adding additional indexes for common query patterns.

Based on the PR objectives mentioning "maintaining history of service contexts for each stage", it's likely that queries will need to efficiently retrieve history for a specific context.

Consider adding an index on context_id as well:

 create table(:pipeline_context_history, primary_key: false) do
   add :id,         :uuid, primary_key: true
   add :stage_id,   references(:pipeline_stages, type: :uuid, on_delete: :delete_all)
   add :context_id, references(:pipeline_contexts, type: :uuid, on_delete: :delete_all)

   timestamps()
 end

 create index(:pipeline_context_history, [:stage_id])
+create index(:pipeline_context_history, [:context_id])
lib/console/deployments/statistics.ex (2)

15-16: Consider optimizing telemetry execution and adding debug logging

The implementation is correct, but could be improved for maintainability and debugging.

Consider applying these improvements:

+ @empty_metadata %{}  # Define at module level
  defp cluster_stats() do
    Cluster.stats()
    |> Console.Repo.one()
    |> case do
      %{unhealthy: h, count: c} ->
-       :telemetry.execute(metric_scope(:cluster_count), %{total: c}, %{})
-       :telemetry.execute(metric_scope(:unhealthy_cluster_count), %{total: h}, %{})
+       require Logger
+       Logger.debug("Recording cluster stats: total=#{c} unhealthy=#{h}")
+       :telemetry.execute(metric_scope(:cluster_count), %{total: c}, @empty_metadata)
+       :telemetry.execute(metric_scope(:unhealthy_cluster_count), %{total: h}, @empty_metadata)

Line range hint 1-30: Overall implementation aligns well with auto-rollback objectives

The transition to telemetry-based metrics provides a solid foundation for tracking service and cluster health, which is essential for the auto-rollback feature. The implementation is clean and consistent, following a clear pattern for both cluster and service statistics.

Consider these architectural recommendations:

  1. Document the metric naming convention and purpose in the module documentation
  2. Consider adding metric labels/tags for better categorization (e.g., by environment, region)
  3. Add alerting rules based on these metrics to trigger auto-rollback decisions
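As a rough sketch of point 2, a tagged metric in a Telemetry.Metrics/PromEx-style plugin could look like the following (metric, event, and tag names here are illustrative, not the plugin's actual definitions):

```elixir
import Telemetry.Metrics

# Tag values are read from the metadata map passed to :telemetry.execute/3,
# so the emitting side would need to include %{environment: ..., region: ...}.
last_value("console.cluster.count.total",
  event_name: [:console, :cluster, :count],
  measurement: :total,
  tags: [:environment, :region]
)
```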
lib/console/deployments/pubsub/consumers/pipeline.ex (1)

24-24: Consider adding documentation for the revert functionality.

While the implementation is correct, adding documentation for the revert behavior would improve maintainability. This is particularly important as it's part of a new auto-rollback feature.

Would you like me to help draft documentation for:

  1. The revert event handling behavior
  2. Example usage scenarios
  3. Integration with the auto-rollback feature
lib/console/prom_ex.ex (1)

30-45: Consider documenting dashboard configurations

While the code provides a good structure for adding custom dashboards, it would be beneficial to:

  1. Document the metrics available in each dashboard
  2. Add a comment explaining the expected format of custom dashboard JSON files

Add documentation above the dashboards function:

+  @doc """
+  Defines the dashboards to be imported into Grafana.
+  
+  Each dashboard is defined as a tuple of {:otp_app, "path_to_json"} where:
+  - :otp_app: The OTP application containing the dashboard JSON
+  - path_to_json: Path to the dashboard JSON file in the priv directory
+  
+  Built-in dashboards provide the following metrics:
+  - application.json: Application-specific metrics (memory, processes)
+  - beam.json: BEAM VM metrics (garbage collection, memory usage)
+  """
   @impl true
   def dashboards do
lib/console/schema/pipeline_context_history.ex (1)

20-34: Improve function name clarity and query efficiency

The last_context function name is misleading as it actually retrieves the previous context. Consider renaming it to previous_context for better clarity. Additionally, the query could be simplified.

Consider this simpler alternative:

-def last_context(query \\ __MODULE__, stage_id, context_id) do
-  ctx = from(pch in __MODULE__,
-    where: pch.stage_id == ^stage_id and pch.context_id == ^context_id,
-    group_by: pch.stage_id,
-    select: %{id: pch.stage_id, max: max(pch.inserted_at)}
-  )
-
-  from(pch in query,
-    join: lasts in subquery(ctx),
-      on: pch.stage_id == lasts.id,
-    where: pch.inserted_at < lasts.max and pch.context_id != ^context_id,
-    limit: 1,
-    order_by: [desc: :inserted_at]
-  )
-end
+def previous_context(query \\ __MODULE__, stage_id, context_id) do
+  from(pch in query,
+    where: pch.stage_id == ^stage_id and
+           pch.context_id != ^context_id,
+    order_by: [desc: :inserted_at],
+    limit: 1
+  )
+end
lib/console/deployments/pipelines/discovery.ex (1)

27-27: Add documentation for the revert operation.

While the implementation is correct, consider adding documentation to explain:

  • The purpose of the revert operation
  • When it's triggered
  • The expected behavior and side effects

Add documentation above the function:

+  @doc """
+  Reverts a pipeline stage to its previous context.
+  Part of the auto-rollback feature that handles service failures by reverting to a prior service context.
+
+  ## Parameters
+    - stage: A PipelineStage struct containing the stage to revert
+  """
   def do_revert(%PipelineStage{id: id} = stage), do: StageWorker.revert(worker_shard(id), stage)
lib/console/ai/fixer.ex (2)

12-16: Consider enhancing the prompt structure for better AI responses.

The prompt could be more specific about response format expectations and contextual requirements. Consider structuring it to:

  1. Request a specific format for the response (e.g., JSON or markdown)
  2. Specify whether git commands should be included
  3. Define how multi-file changes should be presented
  4. Include context about the auto-rollback feature when relevant

Here's a suggested enhancement:

  @prompt """
- Please provide the most straightforward code or configuration change available based on the information I've already provided above to fix this issue.
-
- Be sure to explicitly state the Git repository and full file names that are needed to change, alongside the complete content of the files that need to be modified.
+ Please provide the fix in the following format:
+ 1. Repository: <repository-url>
+ 2. Files to modify:
+    - Path: <full-path-to-file>
+    - Changes:
+      ```
+      <complete-file-content-with-changes>
+      ```
+ 3. Implementation notes:
+    - Describe any dependencies or side effects
+    - Mention if changes affect the auto-rollback behavior
+    - List any required follow-up actions
  """

Line range hint 12-47: Consider enhancing AI fix recommendations for rollback scenarios.

Given that this PR implements auto-rollback functionality, consider extending the module to better support rollback-specific scenarios:

  1. Add specific prompt instructions for rollback-related fixes
  2. Include validation for rollback compatibility in the fix recommendations
  3. Consider adding a new fix type specifically for rollback issues

Would you like assistance in implementing these enhancements?
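As a hedged sketch of point 1, a rollback-aware prompt fragment could look roughly like this (module and attribute names are hypothetical, not the existing fixer API):

```elixir
defmodule Console.AI.Fixer.RollbackPrompt do
  @moduledoc "Illustrative-only prompt fragment for rollback-related fixes."

  @rollback_prompt """
  If the failure was introduced by a recent service context change, state explicitly whether
  reverting to the prior pipeline context (auto-rollback) is the safer remediation, and list
  any configuration that must be restored alongside the revert.
  """

  # Appended to the base prompt when the failing resource sits behind a pipeline stage.
  def rollback_prompt, do: @rollback_prompt
end
```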

lib/console/deployments/pubsub/protocols/pipelineable.ex (2)

20-29: LGTM! Consider enhancing the logging.

The guard clause and modularization improve the code structure. The function now properly handles both healthy and failed service states, which is essential for the auto-rollback feature.

Consider adding more context to the log messages:

-    Logger.info "Kicking any pipelines associated with #{svc.id}"
+    Logger.info "Kicking pipelines for service #{svc.id} with status #{s}"
-      Logger.info "Service #{svc.id} has no recent updates"
+      Logger.info "Skipping service #{svc.id} (status: #{s}) - no recent updates"

35-47: Consider configuration and query optimization improvements.

The helper functions are well-structured, but there are opportunities for enhancement:

  1. The 2-minute threshold could be configurable
  2. The preload query could be optimized for larger datasets

Consider these improvements:

# Add to your config.exs:
+config :console, :service_update_threshold_minutes, 2

# Then modify the recent? function:
 defp recent?(%Service{updated_at: nil}), do: true
 defp recent?(%Service{updated_at: uat}) do
+  threshold = Application.get_env(:console, :service_update_threshold_minutes, 2)
   Timex.now()
-  |> Timex.shift(minutes: -2)
+  |> Timex.shift(minutes: -threshold)
   |> Timex.before?(uat)
 end

# Optimize the stages query:
 defp stages(%Service{} = svc) do
   case Console.Repo.preload(svc, [
+    stage_services: {from(ss in StageService, select: [:id, :stage_id]),
+                     stage: from(s in Stage, select: [:id, :name])}
   ]) do
     %{stage_services: [_ | _] = ss} -> Enum.map(ss, & &1.stage)
     _ -> []
   end
 end
lib/console/application.ex (1)

45-47: Consider documenting the metrics migration.

Since this represents a significant change in the monitoring setup, consider:

  1. Adding a migration guide for operators
  2. Documenting any changes needed in monitoring dashboards
  3. Listing any breaking changes in metric names or formats

Would you like me to help create a migration guide or documentation for the metrics changes?

lib/console/deployments/pipelines/stage_worker.ex (2)

43-54: Fix error message in revert operation.

The error message seems to be copied from the context handler and should be updated to reflect the revert operation.

Apply this change:

-        Pipelines.add_stage_error(stage, "revert", "Failed to apply stage context with error: #{format_error(err)}")
+        Pipelines.add_stage_error(stage, "revert", "Failed to revert stage context with error: #{format_error(err)}")

Otherwise, the implementation looks good with proper error handling and logging.


25-26: Consider separate timeout values for revert operations.

The revert operation currently uses the same 60-second timeout as context operations. Depending on the complexity and scope of revert operations in your system, you might want to consider:

  1. Using a different timeout value for reverts if they typically take more/less time than context applications
  2. Making the timeout configurable via application config

Also applies to: 43-54
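A minimal sketch of option 2, assuming an application-env key (the key name and module below are illustrative, not existing code):

```elixir
defmodule Console.Deployments.Pipelines.StageWorkerTimeouts do
  @moduledoc "Illustrative-only helper for a configurable revert timeout."

  @default_timeout :timer.seconds(60)

  # e.g. `config :console, :stage_revert_timeout, :timer.minutes(2)` in config.exs
  def revert_timeout,
    do: Application.get_env(:console, :stage_revert_timeout, @default_timeout)
end
```

The worker's revert/2 could then pass revert_timeout/0 as the GenServer.call timeout instead of sharing the hard-coded 60-second value with context application.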

lib/console/ai/provider/anthropic.ex (2)

Line range hint 1-7: Update module documentation

The @moduledoc should be updated to include information about:

  • The default model being used
  • How to override the model via configuration
  • Any specific limitations or requirements
 @moduledoc """
 Implements our basic llm behaviour against the Anthropic api
+
+ Uses claude-3-5-sonnet-latest by default, which can be overridden
+ through configuration. The model supports up to 8000 tokens per request.
 """

Line range hint 11-13: Consider configuration improvements for better maintainability

Several values are hardcoded that might benefit from being configurable:

  • API version ("2023-06-01")
  • Timeout settings (currently :infinity)
  • Max tokens (8000)

Consider moving these to a config module or environment variables for easier maintenance and environment-specific tuning. The infinite timeout setting in particular could benefit from a reasonable default with the ability to override.
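A minimal sketch of such a configuration block, assuming an illustrative :anthropic app-env key (the values mirror the hard-coded defaults noted above, except the timeout, which is shown with a finite example value instead of :infinity):

```elixir
# config/config.exs
config :console, :anthropic,
  api_version: "2023-06-01",
  default_model: "claude-3-5-sonnet-latest",
  max_tokens: 8000,
  # a finite request timeout; override per environment as needed
  receive_timeout: :timer.minutes(2)
```

The provider could then read these via Application.get_env(:console, :anthropic) rather than hard-coding them.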

lib/console/deployments/helm/agent.ex (1)

16-18: Consider enhancing error handling for auto-rollback reliability.

Given this module's role in the auto-rollback feature, consider these architectural improvements:

  1. Make timeouts configurable via application config
  2. Add telemetry events for timeout occurrences
  3. Implement circuit breaker pattern for failing repositories
  4. Add specific error types for timeout vs. other failures

Example configuration approach:

# config/config.exs
config :console, Console.Deployments.Helm.Agent,
  fetch_timeout: {:system, "HELM_FETCH_TIMEOUT", "30000"},
  digest_timeout: {:system, "HELM_DIGEST_TIMEOUT", "30000"}

This would allow for environment-specific tuning and easier testing.

lib/console/schema/embeds.ex (2)

141-143: LGTM: Well-structured embedded schema for uniqueness validation

The uniq_by embedded schema is well-designed and properly integrated into the Validation schema. The use of on_replace: :update is appropriate for managing updates to existing records.

Consider documenting the behavior of each scope level in the module documentation, as this will be crucial for users implementing uniqueness constraints in their configurations.


158-163: Consider a more idiomatic approach to the uniq_changeset function

While the implementation is functionally correct, it could be more idiomatic Elixir.

Consider this slightly more idiomatic version:

  defp uniq_changeset(model, attrs) do
-   cast(model, attrs, ~w(scope)a)
-   |> validate_required(~w(scope)a)
+   model
+   |> cast(attrs, ~w(scope)a)
+   |> validate_required(~w(scope)a)
  end

This maintains consistency with other changeset functions in the module and follows the pipe-operator pattern more clearly.

charts/console/values.yaml (1)

24-25: Consider adding documentation for additionalContainers.

Since this feature enables auto-rollback functionality, it would be helpful to document:

  • The purpose and use cases for additionalContainers
  • The expected container configuration format
  • Example configurations for the auto-rollback use case

Add a comment above the configuration:

 console:
+  # -- Additional containers to be added to the console pod.
+  # This can be used to implement features like auto-rollback using sidecar containers.
+  # @default -- null
   additionalContainers: ~
charts/console/templates/deployment.yaml (1)

51-53: Document the new additionalContainers feature.

Consider adding documentation that covers:

  • Purpose and use cases for additional containers
  • Expected schema and validation rules
  • Examples of common configurations
  • Security best practices and considerations

Would you like me to help create documentation for this feature?

lib/console/deployments/pubsub/recurse.ex (1)

103-110: LGTM! Consider adding error logging.

The implementation is well-structured using with for proper error handling and efficient data loading. The sequence of operations (preload -> discover -> poll) is logical and maintains data consistency.

Consider adding error logging for failed cases:

 def process(%{item: %PullRequest{stack_id: id} = pr}) when is_binary(id) do
-  with %PullRequest{stack: %Stack{} = stack} = pr <- Console.Repo.preload(pr, [stack: :repository]),
-       _ <- Discovery.kick(stack.repository),
-    do: Stacks.poll(pr)
+  with %PullRequest{stack: %Stack{} = stack} = pr <- Console.Repo.preload(pr, [stack: :repository]),
+       _ <- Discovery.kick(stack.repository) do
+    Stacks.poll(pr)
+  else
+    error ->
+      Logger.error("Failed to process pull request: #{inspect(error)}")
+      error
+  end
 end
lib/console/deployments/git/agent.ex (2)

64-65: Consider enhancing the telemetry event data.

While the basic telemetry implementation is correct, consider these improvements:

  1. Rename the metric from "count" to "started" or "initialized" for better clarity
  2. Add timestamp to the metadata for better tracking
  3. Add error handling around the telemetry execution

Here's a suggested implementation:

-    :telemetry.execute(metric_scope(:git_agent), %{count: 1}, %{url: repo.url})
+    try do
+      :telemetry.execute(
+        metric_scope(:git_agent),
+        %{started: 1},
+        %{
+          url: repo.url,
+          timestamp: DateTime.utc_now()
+        }
+      )
+    rescue
+      _ -> Logger.warn("Failed to emit git agent init telemetry")
+    end

Add telemetry documentation and standardize event data

The telemetry implementation needs improvement:

  1. The module lacks documentation about its telemetry events despite emitting them in init/1 and terminate/2
  2. The same event name :git_agent is used for different lifecycle events without distinguishing between them
  3. The event data structure is identical in both cases (%{count: 1}) which doesn't reflect the actual semantics of init vs terminate

Suggested changes:

  • Add telemetry documentation to @moduledoc
  • Use distinct event names: :git_agent_started and :git_agent_terminated
  • Include event-specific metadata (e.g., reason for termination)
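A self-contained sketch of the distinct lifecycle events suggested above (event names follow this suggestion; the real module's metric_scope/1 prefix may differ):

```elixir
defmodule Console.Deployments.Git.AgentTelemetry do
  @moduledoc "Illustrative-only helpers emitting distinct git agent lifecycle events."

  def started(url),
    do: :telemetry.execute([:console, :git_agent, :started], %{count: 1}, %{url: url})

  def terminated(url, reason),
    do: :telemetry.execute([:console, :git_agent, :terminated], %{count: 1}, %{
      url: url,
      reason: inspect(reason)
    })
end
```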
🔗 Analysis chain

Line range hint 1-1: Document and standardize telemetry events.

To improve the telemetry implementation:

  1. Add module documentation describing the telemetry events and their data structure
  2. Ensure consistent event naming and metadata across the module

Let's verify the metric scope usage across the codebase:


Let me gather more information about the telemetry events in this module:


Let me check the actual module documentation content and look for any other telemetry-related code in the module:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for consistent metric scope usage

# Find all telemetry executions with git_agent scope
rg -A 2 "metric_scope\(:git_agent\)" --type elixir

# Find all metric scope definitions
ast-grep --pattern 'def metric_scope($name) when $name == :git_agent do
  $$$
end'

Length of output: 1130


Script:

#!/bin/bash
# Find all telemetry executions in the agent module
rg -B 2 -A 2 ":telemetry.execute" lib/console/deployments/git/agent.ex

# Find any module documentation or telemetry-related docs
rg -B 2 "@moduledoc|@doc" lib/console/deployments/git/agent.ex

# Find all telemetry event handlers for git_agent
rg -B 2 "attach.*git_agent" --type elixir

Length of output: 586


Script:

#!/bin/bash
# Get the full module documentation
cat lib/console/deployments/git/agent.ex

# Find any other telemetry-related code in the module
rg -B 2 -A 2 "telemetry|metric" lib/console/deployments/git/agent.ex

# Check for any telemetry configuration or setup
rg -B 2 -A 2 "telemetry.*config|attach_telemetry" --type elixir

Length of output: 7982

test/console/graphql/queries/deployments/pipeline_queries_test.exs (1)

95-118: LGTM! Consider enhancing test coverage with edge cases.

The test effectively validates the basic functionality of fetching pipeline context history within a pipeline stage. The implementation follows the established patterns and includes proper RBAC testing.

Consider adding the following test cases to improve coverage:

  1. Empty context history
  2. Multiple stages with history
  3. Pagination edge cases (e.g., when there are more than 5 history entries)

Example enhancement:

test "it handles empty context history in pipeline stage" do
  user = insert(:user)
  pipe = insert(:pipeline, read_bindings: [%{user_id: user.id}])
  stage = insert(:pipeline_stage, pipeline: pipe)

  {:ok, %{data: %{"pipeline" => found}}} = run_query("""
    query Pipe($id: ID!) {
      pipeline(id: $id) {
        id
        stages {
          contextHistory(first: 5) {
            edges { node { id } }
          }
        }
      }
    }
  """, %{"id" => pipe.id}, %{current_user: user})

  [%{"contextHistory" => found}] = found["stages"]
  assert Enum.empty?(from_connection(found))
end
lib/console/deployments/cron.ex (2)

69-78: Good addition of error handling, consider enhancing recovery capabilities.

The addition of error handling for cluster cache warming operations is a good improvement that prevents a single cluster's failure from affecting others. This aligns well with the system's resilience needs, especially important for auto-rollback functionality.

Consider these enhancements:

  1. Track failed warmups to potentially retry them later
  2. Add metrics/telemetry for monitoring warm cache failures

Here's a suggested enhancement:

 try do
   Clusters.warm(:cluster_metrics, cluster)
   Clusters.warm(:nodes, cluster)
   Clusters.warm(:node_metrics, cluster)
   Clusters.warm(:api_discovery, cluster)
+  :telemetry.execute([:console, :cache, :warm, :success], %{}, %{cluster: cluster.handle})
 rescue
   e ->
     Logger.error "hit error trying to warm node caches for cluster=#{cluster.handle}"
     Logger.error(Exception.format(:error, e, __STACKTRACE__))
+    :telemetry.execute([:console, :cache, :warm, :error], %{}, %{
+      cluster: cluster.handle,
+      error: Exception.message(e)
+    })
 end

Line range hint 1-1: Consider adding context history cleanup.

Since this PR implements auto-rollback with context history tracking, consider adding a cron job to clean up old service contexts beyond a certain age to prevent unbounded growth of historical data.

Here's a suggested implementation:

def prune_context_history() do
  Logger.info "pruning old service context history"
  # Assuming you have a PipelineContextHistory schema
  PipelineContextHistory.expired()
  |> Repo.delete_all()
end
lib/console/ai/chat.ex (1)

Line range hint 1-391: Consider splitting chat persistence and AI interaction concerns.

The module currently handles both chat persistence (thread/message management) and AI interaction (completions/summaries). Consider splitting these into separate modules for better separation of concerns:

  • Console.AI.ChatPersistence - For thread and message management
  • Console.AI.ChatCompletion - For AI interactions and completions

This would:

  • Improve maintainability and testability
  • Make the code more modular
  • Allow for easier swapping of AI providers or persistence layers
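A boundaries-only sketch of that split (module names are hypothetical and the bodies intentionally empty):

```elixir
defmodule Console.AI.ChatPersistence do
  @moduledoc "Thread and message management (create, list, delete); no AI provider calls."
end

defmodule Console.AI.ChatCompletion do
  @moduledoc "AI completions and summaries; reads and writes chat state via ChatPersistence."
end
```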
charts/controller/crds/deployments.plural.sh_customstackruns.yaml (2)

130-131: Improve property description for clarity.

The description "How to determine uniquenss for this field" contains a typo and could be more descriptive.

Consider updating to:

-          description: How to determine uniquenss for this field
+          description: Specifies the scope level at which this configuration field's value must be unique

133-136: Document the scope enum values.

The scope field's enum values (PROJECT and CLUSTER) would benefit from documentation explaining their implications.

Consider adding descriptions for each enum value:

               enum:
               - PROJECT
               - CLUSTER
+              description: |
+                Defines the scope of uniqueness:
+                * PROJECT - Value must be unique within the project
+                * CLUSTER - Value must be unique across the entire cluster
               type: string
go/controller/config/crd/bases/deployments.plural.sh_customstackruns.yaml (1)

129-139: Consider updating documentation for the new uniqueness feature.

The uniqBy property introduces project and cluster-level uniqueness validation. This should be documented in:

  1. User documentation explaining how to use this feature
  2. API documentation detailing the validation behavior at different scopes

Would you like me to help create documentation for this new feature?

plural/helm/console/crds/deployments.plural.sh_customstackruns.yaml (1)

129-139: Consider documenting the uniqueness enforcement strategy.

To ensure smooth adoption of this feature, consider:

  1. Documenting how uniqueness is enforced at different scopes (PROJECT vs CLUSTER)
  2. Providing migration guidelines for existing resources
  3. Adding examples in the CRD documentation showing typical use cases
go/controller/api/v1alpha1/prautomation_types.go (1)

465-469: Consider future scope requirements

The Scope field's enum values are currently limited to PROJECT and CLUSTER. Consider if other scopes might be needed in the future (e.g., NAMESPACE, ORGANIZATION).

test/support/factory.ex (1)

755-760: Consider grouping with related pipeline factories.

For better code organization, consider moving this factory function near other pipeline-related factories (around line 450 where pipeline_factory, pipeline_stage_factory, and pipeline_context_factory are defined).

lib/console/graphql/deployments/pipeline.ex (2)

340-347: Add more detailed field descriptions.

The object type is well-structured, but could benefit from more detailed field descriptions to improve schema documentation.

Consider adding descriptions to the fields like this:

 @desc "A record of a prior pipeline context attached to a stage"
 object :pipeline_context_history do
-  field :id,      non_null(:id)
-  field :stage,   :pipeline_stage, resolve: dataloader(Deployments)
-  field :context, :pipeline_context, resolve: dataloader(Deployments)
+  field :id,      non_null(:id), description: "Unique identifier for the context history record"
+  field :stage,   :pipeline_stage, 
+    description: "The pipeline stage this context history belongs to",
+    resolve: dataloader(Deployments)
+  field :context, :pipeline_context,
+    description: "The historical pipeline context that was applied to this stage",
+    resolve: dataloader(Deployments)

   timestamps()
 end

Line range hint 185-352: Consider implementing context history retention policy.

The implementation provides a solid foundation for the auto-rollback feature. To ensure optimal performance and resource utilization, consider implementing a retention policy for context history records. This could involve:

  1. Setting a maximum number of historical contexts per stage
  2. Implementing automatic cleanup of old context history records
  3. Adding TTL (Time To Live) for context history records

This would help manage database growth and maintain system performance over time.
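As a hedged sketch of option 2, a periodic cleanup could look roughly like this, assuming the Console.Schema.PipelineContextHistory schema added in this PR and an illustrative 30-day retention window:

```elixir
defmodule Console.Deployments.PipelineContextHistoryPruner do
  @moduledoc "Illustrative-only retention sweep for pipeline context history records."

  import Ecto.Query
  alias Console.Schema.PipelineContextHistory

  def prune(days \\ 30) do
    cutoff = Timex.shift(Timex.now(), days: -days)

    PipelineContextHistory
    |> where([pch], pch.inserted_at < ^cutoff)
    |> Console.Repo.delete_all()
  end
end
```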

test/console/deployments/git_test.exs (2)

355-378: LGTM! Consider strengthening the error message assertion.

The test case effectively validates the name duplication check. However, the error message assertion could be more specific.

Consider using a more precise assertion:

-      assert err =~ "there is already a project with name"
+      assert err == "there is already a project with name 'bogus'"

366-369: Consider adding test cases for different validation scenarios.

The current test covers the duplicate name case, but consider adding tests for:

  1. Successful case with non-duplicate names
  2. Different uniqueness scopes (e.g., CLUSTER scope)

Example test case structure:

test "it allows creation with non-duplicate names" do
  user = insert(:user)
  conn = insert(:scm_connection, token: "some-pat")
  insert(:project, name: "existing")
  pra = insert(:pr_automation,
    # ... setup code ...
    configuration: [
      %{name: "second", type: :string, validation: %{uniq_by: %{scope: :project}}}
    ]
  )

  {:ok, pr} = Git.create_pull_request(%{
    "second" => "new-unique-name"
  }, pra.id, "pr-test", user)

  assert pr.url == "https://github.com/pr/url"
end
test/console/deployments/pipelines_test.exs (1)

269-270: Consider enhancing context history verification

While the existence check is valid, consider verifying additional properties of the context history record, such as:

  • The correct context is stored
  • The timestamp is recent
  • The association with the stage is correct
-      assert PipelineContextHistory.for_stage(stage.id) |> Console.Repo.exists?()
+      history = PipelineContextHistory.for_stage(stage.id) |> Console.Repo.one()
+      assert history
+      assert history.context_id == ctx.id
+      assert history.stage_id == stage.id
+      assert Timex.diff(Timex.now(), history.inserted_at, :minutes) < 1
charts/controller/crds/deployments.plural.sh_prautomations.yaml (1)

199-209: Consider documenting uniqueness constraint behavior.

While the implementation is solid, it would be beneficial to document:

  • How uniqueness is enforced at different scopes
  • What happens when uniqueness constraints are violated
  • Examples of typical use cases for each scope

Consider adding these details to the description field:

-          description: How to determine uniquenss for this field
+          description: |
+            How to determine uniqueness for this field.
+            - PROJECT scope: Ensures uniqueness within a project
+            - CLUSTER scope: Ensures uniqueness across the entire cluster
+            
+            When a uniqueness constraint is violated, the validation will fail with a detailed error message.
+            
+            Example:
+            ```yaml
+            validation:
+              uniqBy:
+                scope: PROJECT  # Ensures unique values within each project
+            ```
plural/helm/console/crds/deployments.plural.sh_prautomations.yaml (1)

199-209: Consider documenting migration steps for existing resources.

The addition of the required uniqBy field in the validation schema might affect existing PrAutomation resources. Consider:

  1. Adding documentation about migrating existing resources
  2. Implementing a migration strategy or validation webhook to handle existing resources
lib/console/graphql/deployments/git.ex (1)

593-595: Minor: Inconsistent spacing in field definitions.

The spacing between fields in the dependency_management_service object is inconsistent.

Apply this diff to maintain consistent spacing:

-    field :id,         non_null(:id)
-    field :connection, :scm_connection, resolve: dataloader(Deployments)
-    field :service,    :service_deployment, resolve: dataloader(Deployments)
+    field :id,         non_null(:id)
+    field :connection, :scm_connection,      resolve: dataloader(Deployments)
+    field :service,    :service_deployment,  resolve: dataloader(Deployments)
lib/console/deployments/pr/validation.ex (1)

60-61: Handle unexpected scope values in scope_query/1 function

Currently, scope_query/1 does not handle unexpected scope values, which could lead to runtime errors if scope is neither :project nor :cluster.

Consider adding a catch-all clause to handle unexpected scopes and provide a helpful error message:

defp scope_query(:project), do: Project
defp scope_query(:cluster), do: Cluster
defp scope_query(other), do: raise ArgumentError, "Invalid scope: #{inspect(other)}"

This ensures that any invalid scope is caught early, improving the robustness of your validation logic.

lib/console/deployments/pipelines.ex (4)

178-182: Avoid variable shadowing of ctx_id to improve code clarity

In the anonymous function for the :hist operation within apply_pipeline_context/1, the variable ctx_id is shadowed inside the case statement. This can lead to confusion and potential bugs.

Consider renaming the inner ctx_id variable to avoid shadowing:

        |> case do
-         %PipelineContextHistory{context_id: ctx_id} ->
+         %PipelineContextHistory{context_id: previous_ctx_id} ->
            start_transaction()
-           |> add_operation(:bind, fn _ -> create_context_binding(ctx_id, stage) end)
+           |> add_operation(:bind, fn _ -> create_context_binding(previous_ctx_id, stage) end)
            |> add_operation(:apply, fn %{bind: stage} -> apply_pipeline_context(stage) end)
            |> execute(extract: :apply, timeout: 60_000)
          _ ->
            {:error, "no prior context"}
        end

191-203: Ensure consistent variable usage to prevent confusion

Similarly, in the revert_pipeline_context/1 function, the ctx_id variable is shadowed within the case statement. Using distinct variable names enhances readability and maintainability.

Update the code to avoid variable shadowing:

        |> case do
-         %PipelineContextHistory{context_id: ctx_id} ->
+         %PipelineContextHistory{context_id: previous_ctx_id} ->
            start_transaction()
-           |> add_operation(:bind, fn _ -> create_context_binding(ctx_id, stage) end)
+           |> add_operation(:bind, fn _ -> create_context_binding(previous_ctx_id, stage) end)
            |> add_operation(:apply, fn %{bind: stage} -> apply_pipeline_context(stage) end)
            |> execute(extract: :apply, timeout: 60_000)
          _ ->
            {:error, "no prior context"}
        end

190-211: Update function specifications to accurately reflect return types

The revert_pipeline_context/1 function can return {:error, String.t()}, but the current @spec only indicates Console.error. To improve code clarity and maintain accurate type annotations, update the @spec to include all possible return types.

Update the function specification as follows:

- @spec revert_pipeline_context(PipelineStage.t) :: {:ok, map} | Console.error
+ @spec revert_pipeline_context(PipelineStage.t | binary | any) :: {:ok, map} | {:error, String.t()}

210-210: Enhance error message for clarity

The error message "you cannot revert this stage" may not provide enough context. Consider providing a more descriptive error message to help users understand the issue.

For example:

- def revert_pipeline_context(_), do: {:error, "you cannot revert this stage"}
+ def revert_pipeline_context(_), do: {:error, "Cannot revert: invalid pipeline stage or no applied context to revert."}
go/client/models_gen.go (2)

1376-1377: Consider renaming 'UniqBy' to 'UniqueBy' for clarity

Replacing 'UniqBy' with 'UniqueBy' would enhance readability and align with standard naming conventions.

Apply this diff:

 type ConfigurationValidationAttributes struct {
   // regex a string value should match
   Regex *string `json:"regex,omitempty"`
   // whether the string is json encoded
   JSON *bool `json:"json,omitempty"`
   // configuration for name uniqueness
-  UniqBy *UniqByAttributes `json:"uniqBy,omitempty"`
+  UniqueBy *UniqueByAttributes `json:"uniqueBy,omitempty"`
 }

Ensure that you also rename the UniqByAttributes type to UniqueByAttributes and update all references accordingly.


5549-5549: Improve comment clarity by rephrasing

Consider rephrasing the comment for better grammar and clarity.

Apply this diff:

-  // the scope this name is uniq w/in
+  // the scope within which this name is unique
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 592fcfb and 6b7e58a.

⛔ Files ignored due to path filters (2)
  • assets/src/generated/graphql.ts is excluded by !**/generated/**
  • mix.lock is excluded by !**/*.lock
📒 Files selected for processing (49)
  • charts/console/templates/deployment.yaml (1 hunks)
  • charts/console/values.yaml (1 hunks)
  • charts/controller/crds/deployments.plural.sh_customstackruns.yaml (1 hunks)
  • charts/controller/crds/deployments.plural.sh_prautomations.yaml (1 hunks)
  • config/config.exs (2 hunks)
  • config/prod.exs (1 hunks)
  • config/test.exs (1 hunks)
  • go/client/models_gen.go (5 hunks)
  • go/controller/api/v1alpha1/prautomation_types.go (2 hunks)
  • go/controller/api/v1alpha1/zz_generated.deepcopy.go (2 hunks)
  • go/controller/config/crd/bases/deployments.plural.sh_customstackruns.yaml (1 hunks)
  • go/controller/config/crd/bases/deployments.plural.sh_prautomations.yaml (1 hunks)
  • lib/console/ai/chat.ex (1 hunks)
  • lib/console/ai/fixer.ex (2 hunks)
  • lib/console/ai/provider/anthropic.ex (1 hunks)
  • lib/console/application.ex (1 hunks)
  • lib/console/deployments/cron.ex (1 hunks)
  • lib/console/deployments/git/agent.ex (3 hunks)
  • lib/console/deployments/git/statistics.ex (2 hunks)
  • lib/console/deployments/helm/agent.ex (1 hunks)
  • lib/console/deployments/observer/runner.ex (1 hunks)
  • lib/console/deployments/pipelines.ex (2 hunks)
  • lib/console/deployments/pipelines/discovery.ex (1 hunks)
  • lib/console/deployments/pipelines/stage_worker.ex (2 hunks)
  • lib/console/deployments/pr/validation.ex (2 hunks)
  • lib/console/deployments/pubsub/consumers/pipeline.ex (2 hunks)
  • lib/console/deployments/pubsub/protocols/pipelineable.ex (1 hunks)
  • lib/console/deployments/pubsub/recurse.ex (1 hunks)
  • lib/console/deployments/statistics.ex (3 hunks)
  • lib/console/graphql/deployments/git.ex (4 hunks)
  • lib/console/graphql/deployments/pipeline.ex (2 hunks)
  • lib/console/graphql/resolvers/deployments/pipeline.ex (2 hunks)
  • lib/console/prom/ecto.ex (0 hunks)
  • lib/console/prom/metrics.ex (0 hunks)
  • lib/console/prom/plugin.ex (1 hunks)
  • lib/console/prom/scraper.ex (0 hunks)
  • lib/console/prom/setup.ex (0 hunks)
  • lib/console/prom_ex.ex (1 hunks)
  • lib/console/schema/embeds.ex (3 hunks)
  • lib/console/schema/pipeline_context_history.ex (1 hunks)
  • mix.exs (3 hunks)
  • plural/helm/console/crds/deployments.plural.sh_customstackruns.yaml (1 hunks)
  • plural/helm/console/crds/deployments.plural.sh_prautomations.yaml (1 hunks)
  • priv/repo/migrations/20241113160231_pipeline_context_history.exs (1 hunks)
  • schema/schema.graphql (6 hunks)
  • test/console/deployments/git_test.exs (1 hunks)
  • test/console/deployments/pipelines_test.exs (2 hunks)
  • test/console/graphql/queries/deployments/pipeline_queries_test.exs (1 hunks)
  • test/support/factory.ex (1 hunks)
💤 Files with no reviewable changes (4)
  • lib/console/prom/ecto.ex
  • lib/console/prom/metrics.ex
  • lib/console/prom/scraper.ex
  • lib/console/prom/setup.ex
🔇 Additional comments (63)
lib/console/deployments/git/statistics.ex (1)

2-2: LGTM: Clean import statement

The import is properly scoped to only the required function, following Elixir best practices.

priv/repo/migrations/20241113160231_pipeline_context_history.exs (4)

1-3: LGTM! Module setup follows best practices.

The migration module is properly defined and follows Elixir naming conventions.


13-13: LGTM! Index creation is appropriate.

The index on stage_id will help optimize queries for retrieving history records for a specific stage.


4-14: LGTM! Migration is automatically reversible.

The use of change/0 with standard table and index creation commands ensures that Ecto can automatically handle both forward and rollback migrations.


5-11: Consider the implications of cascade deletes.

The on_delete: :delete_all cascading behavior on both foreign keys means that history records will be automatically deleted when either the stage or context is deleted. While this prevents orphaned records, it might lead to loss of valuable historical data.

Consider if a soft delete strategy might be more appropriate for maintaining audit trails.

✅ Verification successful

The cascade delete strategy is consistent with existing patterns

The on_delete: :delete_all cascade behavior is consistent with the existing history table cluster_restore_history which also uses cascade deletes for its foreign key relationships. This suggests a deliberate architectural decision to treat history tables as tightly coupled with their parent records, where historical entries are not meant to outlive their referenced entities.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if there are other tables with similar historical data that might have different deletion strategies
rg -l "create table.*history" priv/repo/migrations/

Length of output: 173


Script:

#!/bin/bash
# Check the deletion strategy in other history tables
rg -A 10 "create table.*history" priv/repo/migrations/20240127024051_backup_restore.exs

Length of output: 621

lib/console/deployments/statistics.ex (2)

2-2: Verify metric scope definitions in Console.Prom.Plugin

The metric scopes :cluster_count, :unhealthy_cluster_count, :service_count, and :failed_service_count should be properly defined in the Plugin module.

✅ Verification successful

Metric scopes are properly defined and used correctly

The verification shows that:

  • All required metric scopes (:cluster_count, :unhealthy_cluster_count, :service_count, :failed_service_count) are properly defined in Console.Prom.Plugin
  • Each scope follows a consistent naming convention and hierarchy
  • The scopes are correctly used in the statistics module for telemetry execution
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify metric scope definitions in Console.Prom.Plugin

# Search for metric scope definitions
ast-grep --pattern 'def metric_scope($name) when $name in [:cluster_count, :unhealthy_cluster_count, :service_count, :failed_service_count] do
  $$$
end'

# Backup: Search for any metric_scope function definition
rg -A 5 'def metric_scope'

Length of output: 383


Script:

#!/bin/bash
# Let's try a different approach to find metric_scope definitions and usages

# Search for the Plugin module definition and any metric_scope function
rg -U "defmodule.*Console\.Prom\.Plugin|def metric_scope" -A 10

# Search for actual usage of these metric scopes in the statistics module
rg "metric_scope\(:(cluster_count|unhealthy_cluster_count|service_count|failed_service_count)\)" lib/console/deployments/statistics.ex

# Look for telemetry event handlers that might be using these scopes
rg -U "attach_telemetry|handle_event.*(:service_count|:failed_service_count|:cluster_count|:unhealthy_cluster_count)"

Length of output: 1962


26-27: Verify metric collection and consider optimizations

The service metrics implementation follows good practices but should be verified in the monitoring system.

  1. Apply the same optimizations as suggested for cluster_stats:
  defp service_stats() do
    Service.stats()
    |> Console.Repo.one()
    |> case do
      %{unhealthy: h, count: c} ->
-       :telemetry.execute(metric_scope(:service_count), %{total: c}, %{})
-       :telemetry.execute(metric_scope(:failed_service_count), %{total: h}, %{})
+       require Logger
+       Logger.debug("Recording service stats: total=#{c} failed=#{h}")
+       :telemetry.execute(metric_scope(:service_count), %{total: c}, @empty_metadata)
+       :telemetry.execute(metric_scope(:failed_service_count), %{total: h}, @empty_metadata)
  2. Verify metric collection:
✅ Verification successful

Telemetry metrics are properly configured and collected

The verification shows that:

  1. The metrics are properly integrated with the PromEx telemetry system (lib/console/prom_ex.ex)
  2. The metric scopes are correctly defined in the Prom Plugin (lib/console/prom/plugin.ex)
  3. The telemetry events are properly emitted with the correct metric scopes and metadata
  4. The suggested optimization to use @empty_metadata is unnecessary as the current implementation using %{} is equivalent

The metrics will be collected and available in the monitoring system as they are properly integrated into the application's telemetry pipeline.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify telemetry handlers are properly attached for these metrics

# Search for telemetry handler attachments
rg -A 5 'attach_telemetry|handle_event.*(:service_count|:failed_service_count|:cluster_count|:unhealthy_cluster_count)'

Length of output: 383


Script:

#!/bin/bash
# Let's try a different approach to find telemetry-related code

# Search for telemetry configurations and handlers
rg -l "telemetry"

# Search for metric_scope function definition
ast-grep --pattern 'def metric_scope($_) do
  $$$
end'

Length of output: 379


Script:

#!/bin/bash
# Search for telemetry.execute calls
rg ":telemetry.execute" -A 2 -B 2

Length of output: 2094


Script:

#!/bin/bash
# Look for telemetry configurations in config files
fd "config.exs|runtime.exs" --exec cat {}

Length of output: 7771


Script:

#!/bin/bash
# Let's check for PromEx configuration and telemetry handlers
rg -A 10 "PromEx|telemetry_metrics|handle_event" lib/console/

# Check mix.exs for telemetry dependencies
cat mix.exs

Length of output: 15317


Script:

#!/bin/bash
# Look for metric_scope function implementation in statistics files
cat lib/console/deployments/statistics.ex
cat lib/console/deployments/git/statistics.ex

Length of output: 1430

lib/console/deployments/pubsub/consumers/pipeline.ex (1)

15-15: LGTM! Verify event handling integration.

The new pattern match for revert events follows the existing pattern matching structure and maintains consistent flow through the act/1 function.
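
For orientation, the routing in question amounts to roughly the following (the tuple shape is an assumption based on the findings below, not the literal source):

# forward a failed stage's revert event to the discovery layer
def act({:revert, %PipelineStage{} = stage}), do: Discovery.revert(stage)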

Let's verify the integration with the broader auto-rollback feature:

✅ Verification successful

Revert event handling is properly integrated across the codebase

The pattern match for revert events is well-integrated with the broader auto-rollback feature:

  • Pipeline consumer correctly forwards revert events to Discovery.revert/1
  • StageWorker implements revert handling with proper logging and context management
  • Protocol defines revert event generation for failed stages
  • Pipelines module handles context history for reverts
  • Test coverage exists for rollback functionality
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for related revert event handling code
# Expected: Find related event handling and tests for the revert functionality

# Check for related revert handling code
echo "Checking for related revert handling code..."
rg -l "revert.*PipelineStage|handle.*revert" --type elixir

# Check for related tests
echo "Checking for related tests..."
rg -l "test.*revert.*pipeline|test.*rollback" --type elixir test/

Length of output: 611


Script:

#!/bin/bash
# Let's examine the implementation details of revert handling across these files

# Check the revert implementation in the pipeline consumer
echo "=== Pipeline Consumer Implementation ==="
rg "revert.*PipelineStage|handle.*revert" lib/console/deployments/pubsub/consumers/pipeline.ex -A 5

# Check how revert is handled in the Pipelines module
echo -e "\n=== Pipelines Module Implementation ==="
rg "revert.*PipelineStage|handle.*revert" lib/console/deployments/pipelines.ex -A 5

# Check the protocol implementation
echo -e "\n=== Protocol Implementation ==="
rg "revert.*PipelineStage|handle.*revert" lib/console/deployments/pubsub/protocols/pipelineable.ex -A 5

# Check stage worker implementation
echo -e "\n=== Stage Worker Implementation ==="
rg "revert.*PipelineStage|handle.*revert" lib/console/deployments/pipelines/stage_worker.ex -A 5

# Check test coverage
echo -e "\n=== Test Coverage ==="
rg "test.*revert.*pipeline|test.*rollback" test/console/graphql/mutations/deployments/services_mutations_test.exs -A 5

Length of output: 2598

lib/console/prom_ex.ex (3)

1-5: LGTM: Module setup follows best practices

The module definition and PromEx setup are properly configured.


22-27: Verify Grafana datasource configuration

The dashboard is configured to use "plural_console" as the datasource ID. Please ensure this matches your Grafana setup.

Let's check for any datasource configuration in the codebase:

✅ Verification successful

Let me try searching for more specific patterns related to Grafana configuration and datasources:


Based on the search results, I can now provide a definitive response about the datasource configuration:

The datasource ID "plural_console" is correctly configured

The codebase shows this is the expected datasource ID used throughout the application:

  1. It's set in lib/console/prom_ex.ex which is the main PromEx configuration
  2. The application uses this ID consistently with its Prometheus and Grafana integrations
  3. The configuration is properly wired up through the deployment templates and environment variables
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for Grafana datasource configurations
echo "Checking for datasource configurations:"
rg "datasource" --type yaml --type json

Length of output: 1109


Script:

#!/bin/bash
# Search for Grafana and datasource related configurations more broadly
echo "Checking for Grafana configurations:"
rg -i "grafana|prometheus|datasource_id|plural_console" --type yaml --type elixir -A 3

# Also check for any config files specifically
echo -e "\nChecking config files:"
fd "config" --type f --exec rg -i "grafana|prometheus|datasource"

Length of output: 62509


7-19: Verify if additional monitoring plugins are needed

Several monitoring plugins are commented out that might be beneficial:

  • Phoenix monitoring (line 12) - Given this is a web application, consider enabling this
  • PhoenixLiveView (line 15) - If LiveView is used in the application
  • Oban (line 14) - If background job processing is implemented

Let's verify which features are actually used in the application:

✅ Verification successful

Enable Phoenix monitoring plugin and keep other plugins commented

Based on the verification results:

  • Phoenix is actively used in the application (found in lib/console_web/endpoint.ex and lib/console_web/console_web.ex), so the Phoenix monitoring plugin should be enabled
  • LiveView, Oban, and Broadway are referenced only in the prom_ex.ex file itself (likely just in the commented configuration), and there's no evidence of their actual usage in the codebase
  • The dependencies in mix.exs confirm Phoenix usage but don't show LiveView, Oban, or Broadway as dependencies

Recommendation:

  • Uncomment the Phoenix plugin line: {Plugins.Phoenix, router: ConsoleWeb.Router, endpoint: ConsoleWeb.Endpoint}
  • Keep other plugins (LiveView, Oban, Broadway) commented as they don't appear to be used in the application
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for Phoenix router usage
echo "Checking for Phoenix router usage:"
rg "use Phoenix.Router" -l

# Check for LiveView usage
echo "Checking for LiveView usage:"
rg "use Phoenix.LiveView" -l

# Check for Oban usage
echo "Checking for Oban job definitions:"
rg "use Oban.Worker" -l

Length of output: 1109


Script:

#!/bin/bash
# Let's try a more comprehensive check for Phoenix, LiveView, and Oban usage

# Check for Phoenix router and endpoint
echo "Checking for Phoenix router and endpoint:"
rg "Phoenix.Router|Phoenix.Endpoint" --type elixir -l

# Check for LiveView related files and imports
echo -e "\nChecking for LiveView usage:"
rg "Phoenix.LiveView|live_view|live_component" --type elixir -l

# Check for Oban configuration and usage
echo -e "\nChecking for Oban configuration and usage:"
rg "Oban|oban" --type elixir -l

# Check for Broadway usage
echo -e "\nChecking for Broadway usage:"
rg "Broadway|broadway" --type elixir -l

# Check mix.exs for these dependencies
echo -e "\nChecking dependencies in mix.exs:"
cat mix.exs

Length of output: 6885

lib/console/deployments/observer/runner.ex (1)

18-18: LGTM: Enhanced error logging

The improved error logging using inspect/1 will provide better visibility into unknown errors.

lib/console/schema/pipeline_context_history.ex (1)

1-4: Add module documentation and verify auto-rollback integration

Please add module documentation explaining:

  1. The purpose of this module in relation to the auto-rollback feature
  2. How it tracks and manages pipeline context history
  3. Example usage in the auto-rollback scenario

Let's verify the auto-rollback integration:

#!/bin/bash
# Description: Check how this module is used in the auto-rollback feature
# Look for references to PipelineContextHistory in the codebase
rg -l "PipelineContextHistory" --type elixir

# Look for auto-rollback related functions
rg "def.*rollback" --type elixir

lib/console/deployments/pipelines/discovery.ex (1)

18-20: LGTM! Verify RPC behavior across nodes.

The implementation follows the module's established patterns for distributed operations and correctly handles both local and remote execution scenarios.
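
Concretely, the delegation being verified has roughly this shape (do_revert/1 and shard/1 are inferred from the verification output below, not copied from the file):

def revert(%PipelineStage{id: id} = stage),
  do: maybe_rpc(id, __MODULE__, :do_revert, [stage])

# executed locally or on the node owning the stage's shard
def do_revert(%PipelineStage{} = stage),
  do: StageWorker.revert(shard(stage), stage)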

Let's verify the RPC behavior across nodes:

✅ Verification successful

Let me gather more information about the maybe_rpc pattern to ensure complete verification.


RPC implementation verified and consistent

The revert/1 function follows the established pattern used across the codebase for distributed operations:

  • It uses the same maybe_rpc implementation found in other discovery modules
  • The function signature and RPC delegation matches other pipeline stage operations (stage/1, context/1)
  • Error handling is consistent with other RPC calls in the codebase
  • The implementation correctly handles both local execution (via apply/3) and remote calls (via rpc.call/4)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for consistent RPC handling patterns across the codebase

# Look for other RPC calls to ensure consistent error handling
rg -A 2 ':rpc.call' --type elixir

# Check for similar stage operation patterns
ast-grep --pattern 'def $func(%PipelineStage{id: id} = stage),
    do: maybe_rpc(id, __MODULE__, $_, [stage])'

Length of output: 1312


Script:

#!/bin/bash
# Check for maybe_rpc implementations and usage patterns
rg -A 5 'maybe_rpc' --type elixir

# Look for similar stage operation functions in the pipeline module
rg -A 2 'def \w+\(%PipelineStage{' lib/console/deployments/pipelines/discovery.ex

Length of output: 4890

lib/console/deployments/pubsub/protocols/pipelineable.ex (1)

31-33: Verify edge cases in the auto-rollback logic.

The implementation looks good but let's ensure it handles all edge cases properly:

  1. Multiple consecutive failures
  2. Partial stage failures
  3. Race conditions between health checks

Let's verify the implementation:

✅ Verification successful

Edge cases are properly handled in the auto-rollback implementation

After reviewing the complete implementation, the edge cases are well-handled:

  1. Multiple consecutive failures are handled through the recent?/1 function that prevents duplicate processing within a 2-minute window
  2. Race conditions are mitigated by:
    • The recent?/1 time-window check
    • Atomic status transitions (using pattern matching on specific statuses)
  3. Partial failures are handled safely:
    • Empty stage lists return :ok
    • All stages in a failed service are mapped to revert actions consistently
    • The implementation is stateless and idempotent

The implementation is robust and includes appropriate safeguards for the mentioned edge cases.
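
As a rough illustration of that debounce (field names and timestamp type are assumptions; the real recent?/1 may differ):

# treat a stage as already handled if it changed within the last two minutes
defp recent?(%PipelineStage{updated_at: ts}),
  do: DateTime.diff(DateTime.utc_now(), ts, :second) < 120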

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential edge cases in the auto-rollback logic

# Look for tests covering multiple failures
rg -A 5 "test.*multiple.*failures?" test/
# Look for race condition handling
ast-grep --pattern 'test "handles race conditions"'
# Look for partial failure handling
rg -A 5 "partial.*failure" test/

Length of output: 132


Script:

#!/bin/bash
# Let's try a broader search for test cases related to the handle_status function

# Search for test files related to deployments and pubsub
fd "test.*ex$" test/console/deployments/pubsub/

# Look for test cases containing 'status' or 'health'
rg -B 2 -A 5 "test.*status|test.*health" test/console/deployments/pubsub/

# Look for any test files that might test the pipelineable protocol
fd "pipelineable_test.exs?" test/

# If found, examine its contents
fd "pipelineable_test.exs?" test/ -x cat {}

Length of output: 948


Script:

#!/bin/bash
# Let's look for the implementation and related code to understand the context better

# Find the pipelineable protocol implementation
fd "pipelineable.ex$" -x cat {}

# Look for any pipeline-related error handling or failure scenarios
rg -B 2 -A 5 "failed|failure|error" lib/console/deployments/pubsub/

# Search for any documentation or specs related to the protocol
rg "@spec|@doc" lib/console/deployments/pubsub/protocols/pipelineable.ex

# Look for any modules that implement this protocol
ast-grep --pattern 'defimpl.*Pipelineable'

Length of output: 3651

lib/console/graphql/resolvers/deployments/pipeline.ex (1)

4-9: LGTM! Clean and well-organized alias structure

The addition of PipelineContextHistory to the schema aliases is consistent with the module's purpose and follows Elixir conventions.

lib/console/deployments/pipelines/stage_worker.ex (1)

25-26: LGTM! Function signature and implementation are consistent.

The new revert/2 function follows the established pattern and includes proper typing with the same timeout value as other similar operations.

Let's verify the usage of this new function across the codebase:

✅ Verification successful

Function implementation and usage are properly verified

The revert/2 function in StageWorker is correctly used in the Discovery module, following the same pattern as other similar operations (dispatch, context). The implementation:

  • Maintains consistent error handling through maybe_rpc
  • Uses proper type specifications with PipelineStage struct
  • Follows the established pattern of shard-based worker distribution
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for calls to StageWorker.revert to ensure proper usage
rg "StageWorker\.revert" --type elixir

Length of output: 184


Script:

#!/bin/bash
# Let's check the implementation of do_revert in Discovery module to verify proper usage
rg "do_revert" -A 5 -B 5 --type elixir lib/console/deployments/pipelines/discovery.ex

Length of output: 830

lib/console/ai/provider/anthropic.ex (1)

55-55: Verify Claude model change implications

The change from claude-3-5-haiku-latest to claude-3-5-sonnet-latest as the default model is significant. The Sonnet model offers enhanced capabilities but comes with different pricing and performance characteristics.

Please ensure:

  1. The application's budget and performance requirements align with the Sonnet model
  2. Error handling adequately covers potential timeout scenarios (Sonnet may have longer response times)
  3. Token limits are appropriate (currently set to 8000 tokens)

Consider adding a configuration value for the default model to make it environment-specific.
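
One way to do that, sketched with illustrative config keys (not the project's actual shape):

# config/runtime.exs
config :console, :anthropic_model, System.get_env("ANTHROPIC_MODEL", "claude-3-5-sonnet-latest")

# lib/console/ai/provider/anthropic.ex
defp default_model(), do: Application.get_env(:console, :anthropic_model, "claude-3-5-sonnet-latest")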

config/config.exs (2)

63-63: LGTM: Prometheus plugin configuration added.

The addition of Console.Prom.Plugin to the prom_plugins configuration is well-placed and aligns with the telemetry enhancement goals.


133-138: Consider enhancing the PromEx configuration.

While the basic PromEx configuration is in place, it might benefit from additional settings:

  • The metrics_server is disabled, which means metrics won't be exposed via HTTP
  • No port configuration for metrics endpoint
  • No Grafana configuration for metrics visualization

Let's check if there are any other PromEx configurations in the codebase:

Consider adding:

  1. Metrics server configuration with a designated port
  2. Grafana configuration if metrics visualization is needed
  3. Specific metrics groups to collect/drop based on your monitoring needs
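
A hedged sketch of what an expanded configuration could look like (module name, port, and option values are assumptions):

config :console, Console.PromEx,
  disabled: false,
  drop_metrics_groups: [],
  grafana: :disabled,
  metrics_server: [
    port: 9568,
    path: "/metrics",
    protocol: :http
  ]
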
lib/console/deployments/helm/agent.ex (1)

16-16: Consider documenting the timeout rationale and handling.

The addition of a 30-second timeout for Helm operations is a good improvement over the default 5-second timeout, especially for large charts or slow network conditions. However, there are a few considerations:

  1. The timeout value should be documented to explain why 30 seconds was chosen
  2. The caller should be aware of potential GenServer.call timeout errors
  3. This timeout should be consistent with the auto-rollback feature's needs

Consider adding module documentation:

@doc """
Fetches a Helm chart with the specified version.

## Timeout
Uses a 30-second timeout to accommodate large charts and network latency.
Raises GenServer.call timeout error if the operation exceeds this limit.

## Examples
    {:ok, file, digest} = Agent.fetch(pid, "mychart", "1.0.0")
"""

Let's verify if this timeout is sufficient for typical chart sizes:

Also applies to: 18-18

✅ Verification successful

Based on the search results, I can now provide a final response:

30-second timeout for Helm operations is appropriate and consistent with the codebase

The 30-second timeout for fetch/3 is well-justified based on the evidence:

  1. The largest Helm charts in the repository are around 363KB (kubernetes-dashboard), which can be safely downloaded within 30 seconds even on slower connections
  2. Other similar operations in the codebase use comparable timeouts:
    • Git operations use the same 30-second timeout (@timeout 30_000 in lib/console/deployments/git/agent.ex)
    • Helm repository operations use 60-second timeout (found in go/controller/config/samples/service_deployment_helm.yaml)

The timeout value is appropriate for the auto-rollback feature's needs and consistent with similar operations in the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check typical chart sizes to validate the 30s timeout
# Look for .tgz files which are typically helm charts
fd -e tgz -x ls -lh {} \; 2>/dev/null

# Look for potential timeout issues in logs
rg -i "timeout|timed? out" .

Length of output: 47957

config/test.exs (1)

44-45: LGTM! Configuration changes look appropriate for test environment.

The addition of prom_plugins: [] alongside grafana_dns is well-structured and follows the expected pattern for test configurations where certain features should be disabled.

lib/console/schema/embeds.ex (2)

103-104: LGTM: Clear and well-defined enum for uniqueness scoping

The UniqScope enum appropriately defines the two levels (project and cluster) at which uniqueness can be enforced, aligning well with the CRD requirements.
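
For readers outside the Elixir side, the enum is essentially the following (sketched with EctoEnum's defenum; the exact macro usage in embeds.ex may differ):

import EctoEnum

defenum UniqScope, project: 0, cluster: 1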


Line range hint 103-163: Verify integration with related components

These schema changes implement uniqueness validation, which is part of the auto-rollback feature. Let's verify the integration points.

✅ Verification successful

Integration is properly implemented and tested

The changes are well integrated:

  1. The UniqScope enum and validation are properly implemented in lib/console/schema/embeds.ex
  2. The validation is tested in test/console/deployments/git_test.exs with test cases covering:
    • Project-level uniqueness validation
    • Error handling for duplicate names
  3. The validation is enforced through the changeset validation pipeline with validate_required(~w(scope)a)

The implementation is complete and properly tested, with no issues found in the integration.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for related schema usage and test coverage

# Check for usage of UniqScope in tests
echo "Checking for test coverage of UniqScope..."
rg -l "UniqScope" test/

# Check for related configuration schemas
echo "Checking for related configuration schemas..."
fd -e ex -e exs | rg -l "Console\.Schema\.Configuration"

# Check for GraphQL type definitions
echo "Checking for GraphQL type definitions..."
fd -e graphql | xargs rg -l "UniqScope|uniqBy"

Length of output: 440


Script:

#!/bin/bash
# Let's check the GraphQL schema definition and related resolvers

# Check GraphQL schema definition
echo "Checking GraphQL schema definition..."
cat schema/schema.graphql

# Look for resolver implementations
echo -e "\nChecking resolver implementations..."
rg -l "UniqScope|uniq_by" lib/console/

# Look for any test files that might test validation
echo -e "\nChecking validation tests..."
rg -l "validation.*scope" test/

Length of output: 170635


Script:

#!/bin/bash
# Let's check the validation implementation and tests

# Check validation implementation
echo "Checking validation implementation..."
cat test/console/deployments/git_test.exs

# Check for any other validation usage
echo -e "\nChecking for other validation usage..."
rg -A 5 "validate_required.*scope" lib/

# Check for any GraphQL mutations that use this validation
echo -e "\nChecking GraphQL mutations..."
rg -l "Condition.*changeset" lib/

Length of output: 24044

config/prod.exs (1)

94-97: Verify Prometheus and Loki service configuration

The configuration assumes Prometheus and Loki services are available at specific Kubernetes DNS names in the 'monitoring' namespace. Let's verify these assumptions.

charts/console/values.yaml (1)

24-25: LGTM! Verify template usage.

The addition of additionalContainers configuration follows good Helm practices and maintains backward compatibility.

Let's verify how this value is used in the deployment template:

✅ Verification successful

Implementation verified and correctly used in deployment template

The additionalContainers value is properly implemented in the deployment template:

  • Correctly placed in the containers section of the pod spec
  • Uses proper conditional check with if statement
  • Implements correct YAML indentation using nindent 6
  • Maintains backward compatibility when the value is null
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if the deployment template properly handles additionalContainers
# Expected: Should find conditional blocks in deployment.yaml that use .Values.console.additionalContainers

rg -l 'additionalContainers' charts/console/templates/

Length of output: 95


Script:

#!/bin/bash
# Let's examine how additionalContainers is used in the deployment template
rg -A 5 -B 5 'additionalContainers' charts/console/templates/deployment.yaml

Length of output: 509

lib/console/deployments/pubsub/recurse.ex (1)

108-109: Verify repository discovery timing impact.

The Discovery.kick operation is synchronous and might impact the response time of the event handler. Let's verify if there are any long-running repository discoveries.

✅ Verification successful

The previous search revealed multiple usages of Discovery.kick, but we need to understand its implementation to verify if it's synchronous and potentially impacting response times. Let's analyze the implementation and any async patterns used.


Let's try to locate and analyze the Discovery module implementation with different search patterns to understand its behavior.


Based on the search results, I can now provide a conclusive response about the repository discovery timing impact:

Repository discovery is asynchronous and debounced, no timing impact concerns.

The implementation in lib/console/deployments/git/discovery.ex shows that:

  1. The kick operation is debounced (15-second cooldown between pulls)
  2. It uses an Agent-based asynchronous process model
  3. The actual repository update happens in a separate process via Agent.kick(pid)

The operation is designed to be non-blocking and won't impact the response time of the event handler. The code follows good practices for handling potentially long-running Git operations.
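
The debounce pattern being described boils down to something like the following (a generic sketch, not the module's actual state shape):

@cooldown_seconds 15

def handle_call(:kick, _from, %{last_pull: last} = state) do
  if last && DateTime.diff(DateTime.utc_now(), last, :second) < @cooldown_seconds do
    # inside the cooldown window, skip the pull
    {:reply, :ok, state}
  else
    # perform the pull out of band so callers are not blocked
    send(self(), :pull)
    {:reply, :ok, %{state | last_pull: DateTime.utc_now()}}
  end
end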

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for Discovery.kick usage and timing logs
rg -A 5 "Discovery.*kick" --type elixir

# Search for any existing timing or performance metrics
rg -A 5 "timing|duration.*repository" --type elixir

Length of output: 2327


Script:

#!/bin/bash
# Let's check the Discovery module implementation to understand its behavior
ast-grep --pattern 'def kick($_) {
  $$$
}'

# Also search for any async/task related implementations
rg -A 10 "Task\.|async" --type elixir lib/console/deployments/git/

Length of output: 114


Script:

#!/bin/bash
# Let's try a different pattern to find the Discovery module implementation
fd "discovery.ex" --type f

# And search for the kick function with ripgrep as ast-grep pattern might not match
rg -A 10 "def kick" --type elixir

# Look for any GenServer or async patterns in the Discovery module
rg -A 10 "use Gen|handle_cast" --type elixir

Length of output: 27466

mix.exs (4)

137-137: Verify telemetry upgrade compatibility.

The upgrade of telemetry from 0.4 to 1.0 is a major version bump that might include breaking changes.

#!/bin/bash
# Search for telemetry event handlers and attach calls
rg ":telemetry.attach" --type elixir

# Search for telemetry event executions
rg ":telemetry.execute" --type elixir

91-94: Verify integration of new monitoring dependencies with auto-rollback feature.

The addition of monitoring dependencies (prom_ex, telemetry_poller, cowboy_telemetry, telemetry_registry) suggests improved observability. This aligns well with the auto-rollback feature, as it will help track rollback events and their success rates.

Consider adding the following metrics for the auto-rollback feature:

  • Rollback attempt count
  • Rollback success/failure ratio
  • Time taken for rollbacks
  • Service recovery time
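
The first of these could be emitted with a single telemetry event plus a PromEx counter — event and metric names below are assumptions, not existing code:

# at the point where a revert is dispatched
:telemetry.execute([:console, :pipeline, :rollback], %{count: 1}, %{stage_id: stage.id})

# in the PromEx plugin
counter([:console, :pipeline, :rollback, :count],
  event_name: [:console, :pipeline, :rollback],
  description: "Number of auto-rollbacks triggered."
)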

103-104: Verify HTTP server upgrade impact.

The upgrade of plug_cowboy (2.0 -> 2.7) and addition of cowboy (2.12) are significant version jumps.

#!/bin/bash
# Check for any custom Cowboy-specific code that might need updates
rg "Plug.Cowboy" --type elixir

# Look for any cowboy-specific configurations
rg "cowboy" config/ --type elixir

Line range hint 1-167: Critical: Verify removal of Horde dependency.

The removal of the horde dependency (a distributed supervisor library) while implementing auto-rollback functionality needs careful consideration.

#!/bin/bash
# Search for any remaining Horde usage that needs to be migrated
rg "Horde" --type elixir

# Look for distributed supervisor implementations
rg "use DynamicSupervisor" --type elixir

Please confirm:

  1. How is distributed supervision being handled after removing Horde?
  2. What's the new strategy for maintaining service state across nodes during rollbacks?
  3. Has the distributed aspect of the auto-rollback feature been considered in the new implementation?
lib/console/deployments/git/agent.ex (1)

11-11: LGTM!

The import is correctly scoped to only the needed function, following Elixir best practices.

lib/console/ai/chat.ex (1)

25-26: LGTM! Verify impact on existing chat titles.

The changes to make chat titles more concise (max 6 words, single-phrase) look good. However, we should verify that existing chat titles won't be affected by this change.

Let's check for any existing chat titles that might exceed the new 6-word limit:

✅ Verification successful

Changes to chat title length are safe to implement

The codebase scan shows that existing chat titles are well within the new 6-word limit:

  • Default chat title: "Your primary chat with Plural AI" (6 words)
  • Factory test title: "a chat thread" (3 words)
  • Chatbot button title: "Further questions about an insight from Plural AI" (8 words, but this is a description field, not a chat title)

Only the chatbot button description exceeds 6 words, but it's a different field (description) and not affected by the @summary attribute's constraints.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for existing chat titles that might exceed the new 6-word limit
# Note: This helps identify if we need a migration for existing data

# Search for chat thread summaries in the codebase
rg -g '!*.{ex,exs}' -l 'summary:'

# Search for potential test cases that might need updating
rg -g '*.{ex,exs}' 'summary:.*\b(\w+\s+){7,}\w+'

Length of output: 480


Script:

#!/bin/bash
# Let's check the actual content of these files to see if they contain chat titles
# that might be affected by the 6-word limit

# Check GraphQL schema definition
echo "=== GraphQL Schema ==="
cat schema/schema.graphql | rg -A 2 'summary:'

# Check React components
echo -e "\n=== ChatbotButton.tsx ==="
cat assets/src/components/ai/chatbot/ChatbotButton.tsx | rg -A 2 'summary:'

echo -e "\n=== AITableActions.tsx ==="
cat assets/src/components/ai/AITableActions.tsx | rg -A 2 'summary:'

# Check test files for chat-related tests
echo -e "\n=== Chat-related tests ==="
rg -g '*.{ex,exs}' 'chat.*summary|summary.*chat'

Length of output: 1330

charts/controller/crds/deployments.plural.sh_customstackruns.yaml (1)

129-139: Verify the impact of new uniqueness constraints.

The addition of the uniqBy field introduces uniqueness validation for configuration items. While the schema change looks correct, we should ensure existing CustomStackRun resources won't be affected.

Let's check for any existing CustomStackRun resources that might need updates:

✅ Verification successful

Let me analyze the existing CustomStackRun resources to better understand the impact. I'll search for any usage of configuration items that might be affected by the new uniqueness constraints.


Let me check one more aspect to understand how the configuration items are structured in the CRD and if there are any existing uniqueness constraints.


Based on the analysis of the codebase, I can now provide a final response about the uniqueness constraints addition:

The uniqBy field addition is safe to proceed with

The new uniqueness constraint schema is a forward-compatible addition that won't affect existing CustomStackRun resources because:

  1. The field is being added to the configuration items schema which already exists and is well-defined
  2. No existing CustomStackRun examples in the codebase use any uniqueness-related fields
  3. The scope enum (PROJECT/CLUSTER) aligns with the existing resource model

The schema change provides a clear validation mechanism for configuration items without breaking existing functionality.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for existing CustomStackRun resources that might be affected by the new schema

# Search for CustomStackRun resources in the codebase
rg -l "kind:\s*CustomStackRun" --type yaml

# Look for configuration field usage in tests or examples
rg -A 5 "configuration:" --type yaml

Length of output: 23333


Script:

#!/bin/bash
# Look for configuration items in CustomStackRun examples and test files
rg -A 10 "items:" --type yaml go/controller/config/samples/deployments_v1alpha1_customstackrun.yaml go/controller/config/samples/pr_automation.yaml

# Check if there are any existing configuration items with similar fields
rg -B 5 -A 10 "type: \w+" --type yaml go/controller/config/samples/

Length of output: 11998


Script:

#!/bin/bash
# Check the configuration items schema in the CRD
rg -B 5 -A 15 "configuration:" charts/controller/crds/deployments.plural.sh_customstackruns.yaml

# Look for any existing uniqueness-related fields
rg -l "unique|uniq" --type yaml charts/controller/crds/

Length of output: 2043

go/controller/config/crd/bases/deployments.plural.sh_customstackruns.yaml (1)

129-139: LGTM! Well-structured schema addition for uniqueness validation.

The new uniqBy property is well-defined with proper validation constraints and clear documentation.

Let's verify if any existing CustomStackRun resources might be affected by this change:

✅ Verification successful

The new uniqBy validation property is safe to add and won't affect existing configurations

The analysis of the codebase shows that:

  1. The sample CustomStackRun (go/controller/config/samples/deployments_v1alpha1_customstackrun.yaml) doesn't use any validation configurations, so existing samples won't be affected
  2. The validation property is being added as an optional field within the configuration items schema, as evidenced by the CRD structure
  3. The scope enum values (PROJECT, CLUSTER) are well-defined and properly constrained
  4. No existing CustomStackRun resources are using validation configurations that could be impacted by this change

The schema addition is backward compatible and follows Kubernetes CRD best practices by:

  • Making the new field optional
  • Providing clear documentation
  • Using proper validation constraints
  • Following the existing schema structure
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for existing CustomStackRun resources with configuration validation
# that might be affected by the new uniqBy property

# Search for CustomStackRun resources with configuration validation
kubectl get customstackruns -A -o json | jq '.items[] | select(.spec.configuration != null) | {name: .metadata.name, namespace: .metadata.namespace, config: .spec.configuration}'

Length of output: 255


Script:

#!/bin/bash
# Let's analyze the codebase for existing usage patterns and tests related to CustomStackRun configurations

# Search for test files and examples containing CustomStackRun configurations
rg -l "CustomStackRun" --type yaml

# Look for any test cases or examples specifically related to configuration validation
rg -A 5 "configuration:" --type yaml

# Check for any documentation files mentioning CustomStackRun
fd -e md -e txt | xargs rg -l "CustomStackRun"

Length of output: 23396


Script:

#!/bin/bash
# Let's examine the sample CustomStackRun and API documentation to understand the existing configuration patterns

# Check the sample CustomStackRun for any configuration validation
cat go/controller/config/samples/deployments_v1alpha1_customstackrun.yaml

# Look at the API documentation
cat go/controller/docs/api.md

# Check for any existing validation patterns in the CRD
rg -A 10 "validation:" go/controller/config/crd/bases/deployments.plural.sh_customstackruns.yaml

Length of output: 93540

plural/helm/console/crds/deployments.plural.sh_customstackruns.yaml (1)

129-139: LGTM! Well-structured uniqueness constraint implementation.

The addition of the uniqBy field with project/cluster-level scoping provides a robust way to enforce configuration uniqueness. The schema is well-defined with:

  • Clear description
  • Required scope field with proper enum validation
  • Appropriate placement within the validation structure

Let's verify if there are any existing CustomStackRun resources that might be affected by this new validation:

✅ Verification successful

The uniqBy validation field is consistently defined across CRD instances

After examining the sample CustomStackRun and all CRD definitions in the codebase:

  • The sample CustomStackRun doesn't use any configuration items yet, so it won't be affected
  • The uniqBy validation schema is consistently defined in all CRD instances (console, controller, and charts)
  • The validation is optional within the configuration items, so existing resources will continue to work
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for existing CustomStackRun resources that might need updates

# Search for CustomStackRun resources in the codebase
rg -l "kind:\s*CustomStackRun" --type yaml

# Look for any configuration validation patterns that might conflict
ast-grep --pattern 'validation:
  $$$
  uniqBy:
    $$$'

Length of output: 381


Script:

#!/bin/bash
# Let's examine the content of the found CustomStackRun resources to check for any existing configuration patterns

# Check the sample CustomStackRun
echo "=== Sample CustomStackRun ==="
cat go/controller/config/samples/deployments_v1alpha1_customstackrun.yaml

# Check other CRD definitions for any differences
echo -e "\n=== Other CRD definitions ==="
cat go/controller/config/crd/bases/deployments.plural.sh_customstackruns.yaml
cat charts/controller/crds/deployments.plural.sh_customstackruns.yaml

Length of output: 22547

go/controller/api/v1alpha1/prautomation_types.go (2)

459-463: LGTM: Well-structured field addition for uniqueness validation

The new UniqBy field is properly defined with appropriate optionality and type safety.


471-477: LGTM: Proper nil handling and consistent implementation

The Attributes() method follows the established pattern and correctly handles nil cases.

test/support/factory.ex (1)

755-760: LGTM! Factory function follows established patterns.

The new pipeline_context_history_factory is well-structured and consistent with the codebase's factory patterns. It properly builds the required associations for pipeline stage and context, which aligns with the PR's objective of maintaining service context history.
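
In ExMachina terms that amounts to roughly the following (a sketch; the actual factory may set more fields):

def pipeline_context_history_factory do
  %Console.Schema.PipelineContextHistory{
    stage: build(:pipeline_stage),
    context: build(:pipeline_context)
  }
end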

lib/console/graphql/deployments/pipeline.ex (2)

352-352: LGTM! Connection type properly declared.

The connection node type declaration follows the established pattern and enables proper pagination for pipeline context history queries.


185-188: LGTM! Verify resolver implementation.

The connection field is well-structured and follows GraphQL best practices for pagination. The placement in pipeline_stage makes sense for tracking stage-specific context history.
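
Given the helpers named in the verification below (for_stage/1, ordered/1, paginate/2), the resolver presumably reduces to something close to this sketch:

def pipeline_context_history(%{id: stage_id}, args, _resolution) do
  PipelineContextHistory.for_stage(stage_id)
  |> PipelineContextHistory.ordered()
  |> paginate(args)
end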

Let's verify the resolver implementation:

✅ Verification successful

Resolver implementation verified and properly structured

The resolver implementation in lib/console/graphql/resolvers/deployments/pipeline.ex is correctly implemented:

  • Uses proper pagination with paginate/2
  • Fetches records using PipelineContextHistory.for_stage/1 and ordered/1
  • Follows schema relationships (stage and context associations)
  • Maintains consistent naming and structure
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if the resolver function is properly implemented
# Expected: Find the resolver function definition

# Search for the resolver function
rg -A 5 "pipeline_context_history.*fn.*args.*_" "lib/console/graphql/resolvers"

Length of output: 79


Script:

#!/bin/bash
# Let's try different search patterns and locations to find the resolver

# Search for pipeline_context_history function in all Elixir files
rg -t elixir "def pipeline_context_history" 

# Search for any mentions of pipeline_context_history
rg -t elixir "pipeline_context_history"

# Look specifically in the deployments resolver module if it exists
fd -e ex -t f "deployments.ex" lib/console/graphql/resolvers/

Length of output: 1701


Script:

#!/bin/bash
# Let's check the implementation of the resolver we found
rg -A 10 "def pipeline_context_history" lib/console/graphql/resolvers/deployments/pipeline.ex

# Also check the schema definition to understand the context
rg -A 10 'schema "pipeline_context_history"' lib/console/schema/pipeline_context_history.ex

Length of output: 714

test/console/deployments/git_test.exs (1)

Line range hint 1-1: Missing test coverage for auto-rollback feature.

The PR objectives mention implementing an auto-rollback feature, but there are no tests covering this functionality. Consider adding test cases for:

  1. Service failure scenarios
  2. Auto-rollback triggers
  3. Service context history management
test/console/deployments/pipelines_test.exs (1)

5-5: LGTM: Alias properly added

The PipelineContextHistory alias is well-placed and follows the module's naming conventions.

charts/controller/crds/deployments.plural.sh_prautomations.yaml (2)

199-209: LGTM! Well-structured uniqueness validation implementation.

The new uniqBy property is well-defined with clear scope options and proper validation schema. The implementation follows Kubernetes CRD best practices with:

  • Clear description
  • Proper enum values for scope (PROJECT, CLUSTER)
  • Required field validation

199-209: Verify uniqueness constraint implementation.

Let's verify the implementation of uniqueness constraints in the codebase.

✅ Verification successful

Uniqueness constraint implementation is properly defined and integrated

The uniqueness constraint implementation is properly defined in the codebase:

  • The CRD correctly defines the uniqBy field with proper validation in charts/controller/crds/deployments.plural.sh_prautomations.yaml
  • The implementation is backed by:
    • Type definitions in go/controller/api/v1alpha1/prautomation_types.go with proper Kubernetes validation annotations
    • Generated client models in go/client/models_gen.go with complete enum definitions for ValidationUniqScope
    • Proper type conversion methods between API types and client models via the Attributes() functions
    • Deep copy implementations for proper Kubernetes type handling

The scope-based uniqueness validation (PROJECT/CLUSTER) is consistently implemented across the API definition, CRD, and Go types, making it a robust implementation.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for uniqueness validation implementation and tests

# Check for validation logic implementation
echo "Checking for validation implementation..."
rg -l "uniqBy.*scope" --type go

# Check for test coverage
echo "Checking for test coverage..."
rg -l "TestUniqBy|TestUniqueness" --type go

# Check for validation error handling
echo "Checking for error handling..."
rg "ValidationError.*uniq" --type go

Length of output: 351


Script:

#!/bin/bash
# Let's try a broader search to understand the uniqueness validation implementation

# Search for any mentions of uniqueness in Go files
echo "Searching for uniqueness-related code..."
rg -i "unique|uniq" --type go -A 5

# Look for CRD validation code
echo "Searching for CRD validation code..."
rg "webhook.*validate" --type go -A 5

# Check for any validation logic in the controller
echo "Searching for validation in controller..."
fd -e go -x grep -l "Validate" {}

# Look for the specific struct definition
echo "Looking for struct definition..."
ast-grep --pattern 'type $_ struct {
  $$$
  UniqBy $_ `json:"uniqBy"`
  $$$
}'

Length of output: 12265

go/controller/config/crd/bases/deployments.plural.sh_prautomations.yaml (1)

199-209: LGTM! Well-structured uniqueness constraint implementation.

The addition of the uniqBy property with PROJECT/CLUSTER scope is well-implemented with proper OpenAPI v3 schema validation and clear documentation. The implementation follows Kubernetes CRD best practices.

Let's verify the consistency of this change across related CRDs:

✅ Verification successful

The uniqBy property implementation is consistent across CRDs

The verification shows that both PrAutomations and CustomStackRuns CRDs implement the uniqBy property with identical schema validation:

  • Same property structure and validation rules
  • Consistent enum values (PROJECT/CLUSTER) for scope
  • Identical required fields and descriptions

This consistent implementation across related CRDs indicates good schema design and maintainability.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for consistent implementation of uniqBy property across CRDs
# Expected: Similar uniqBy implementation in related CRDs

# Search for other CRDs with similar validation structure
fd -e yaml -e yml . | xargs rg -l "validation.*scope.*PROJECT.*CLUSTER"

# Search for Go types defining this schema
ast-grep --pattern 'type $_ struct {
  $$$
  UniqBy $_ `json:"uniqBy"`
  $$$
}'

Length of output: 155


Script:

#!/bin/bash
# Let's try a different approach to find CRD files and uniqBy implementations

# Find all CRD files in the repository
echo "=== CRD Files ==="
fd ".*_crd.*\.ya?ml$" .

# Search for uniqBy in yaml files
echo -e "\n=== UniqBy Usage in YAML ==="
rg -l "uniqBy:" .

# Search for scope with PROJECT/CLUSTER in yaml files
echo -e "\n=== Scope Usage ==="
rg "scope:.*\n.*- PROJECT\n.*- CLUSTER" .

# Look for Go type definitions that might contain UniqBy
echo -e "\n=== Go Type Definitions ==="
fd -e go . | xargs rg "type.*struct.*UniqBy"

Length of output: 1016


Script:

#!/bin/bash
# Let's examine the implementation of uniqBy across the found CRD files

echo "=== PrAutomations CRD Content ==="
rg -A 10 "uniqBy:" ./go/controller/config/crd/bases/deployments.plural.sh_prautomations.yaml

echo -e "\n=== CustomStackRuns CRD Content ==="
rg -A 10 "uniqBy:" ./go/controller/config/crd/bases/deployments.plural.sh_customstackruns.yaml

# Let's also check the Go types that define these CRDs
echo -e "\n=== Go Types with UniqBy ==="
rg -A 5 "UniqBy.*\`json:\"uniqBy\"" .

Length of output: 1369

plural/helm/console/crds/deployments.plural.sh_prautomations.yaml (1)

199-209: LGTM! Well-structured uniqueness constraint implementation.

The new uniqBy field is properly implemented with:

  • Clear description
  • Well-defined scope enum (PROJECT/CLUSTER)
  • Proper schema validation with required fields

Let's verify the consistency of this uniqueness constraint implementation across other CRDs:

✅ Verification successful

Uniqueness constraint implementation is consistent across CRD files

The uniqBy field implementation is identical in all three CRD files (charts/controller, go/controller, and plural/helm/console), with:

  • Same schema structure
  • Same scope enum values (PROJECT/CLUSTER)
  • Same required fields
  • Same description
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for consistent implementation of uniqBy field across CRDs
# Expected: Similar uniqBy field structure in related CRDs

# Search for other CRDs with uniqBy field
rg -U "uniqBy.*scope.*PROJECT.*CLUSTER" "plural/helm/console/crds/*.yaml"

# Search for potential inconsistencies in scope enum values
rg -U "scope.*enum" "plural/helm/console/crds/*.yaml"

Length of output: 275


Script:

#!/bin/bash
# Let's first find the correct path for CRDs
fd ".*_prautomations.yaml$"

# Then search for all CRD files
fd ".*.yaml$" | grep -i "crd"

# Once we find the files, we can check their content

Length of output: 6678


Script:

#!/bin/bash
# Now that we found the correct paths, let's check for uniqBy implementations
rg -U "uniqBy.*scope.*PROJECT.*CLUSTER" "charts/controller/crds/*.yaml" "go/controller/config/crd/bases/*.yaml" "plural/helm/console/crds/*.yaml"

# Also check for any scope enums to verify consistency
rg -U "scope:.*enum" "charts/controller/crds/*.yaml" "go/controller/config/crd/bases/*.yaml" "plural/helm/console/crds/*.yaml"

Length of output: 716


Script:

#!/bin/bash
# Let's try a different approach by checking the content of prautomations CRD files directly
cat charts/controller/crds/deployments.plural.sh_prautomations.yaml
cat go/controller/config/crd/bases/deployments.plural.sh_prautomations.yaml
cat plural/helm/console/crds/deployments.plural.sh_prautomations.yaml

Length of output: 92070

lib/console/graphql/deployments/git.ex (3)

15-30: LGTM! Consistent formatting of ecto_enum declarations.

The alignment of ecto_enum declarations improves code readability.


Line range hint 580-589: LGTM! Clean implementation of webhook URL and name resolution.

The implementation properly uses lazy resolution for webhook URLs and names, with clear documentation.


186-193: LGTM! Well-structured validation fields with clear documentation.

The new validation fields provide comprehensive configuration validation capabilities. Let's verify the usage of these validation fields across the codebase.

✅ Verification successful

Validation fields are properly implemented and used across the codebase

The verification shows that:

  • The validation fields are correctly implemented in lib/console/deployments/pr/validation.ex with proper handling for:
    • JSON validation with error messages
    • Regex pattern matching with error feedback
    • Uniqueness validation with scoping
  • Test coverage exists in test/console/deployments/git_test.exs demonstrating usage of these validations
  • The fields are also referenced in GraphQL queries (lib/console/plural/queries.ex)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check usage of validation fields in configurations
# Test: Search for usage of validation fields in configuration contexts
rg -A 5 'validation.*\b(regex|json|uniq_by)\b' --type elixir

Length of output: 4105

go/controller/api/v1alpha1/zz_generated.deepcopy.go (2)

3597-3601: LGTM: Correct deepcopy implementation for UniqBy field

The deepcopy implementation for the UniqBy field in PrAutomationConfigurationValidation follows the standard pattern for pointer fields and is correctly generated.


3934-3948: LGTM: Correct deepcopy implementation for PrAutomationUniqBy type

The DeepCopyInto and DeepCopy methods for the new PrAutomationUniqBy type are correctly implemented following the standard deepcopy pattern for simple types.

lib/console/deployments/pr/validation.ex (1)

2-3: Addition of necessary aliases

The aliases for Console.Repo, Project, and Cluster are appropriately added for use in the new validation logic.

lib/console/deployments/pipelines.ex (1)

21-21: Alias addition for PipelineContextHistory is appropriate

Adding PipelineContextHistory to the list of aliases ensures that the module can be referenced succinctly throughout the code.

schema/schema.graphql (7)

3216-3217: Addition of contextHistory field to PipelineStage is appropriate

The new field contextHistory allows retrieval of historical pipeline contexts associated with a stage, enhancing tracking capabilities. The use of pagination parameters (after, first, before, last) and returning a connection type aligns with GraphQL best practices.


3431-3439: Definition of PipelineContextHistory type is correct

The PipelineContextHistory type includes necessary fields—id, stage, context, insertedAt, and updatedAt—which are appropriate for capturing historical context data of pipeline stages.


3455-3459: PipelineContextHistoryConnection conforms to Relay specifications

The connection type PipelineContextHistoryConnection includes pageInfo and edges, correctly facilitating pagination as per the GraphQL Relay pattern.


8224-8228: Definition of PipelineContextHistoryEdge is properly structured

The PipelineContextHistoryEdge type includes the necessary node and cursor fields, conforming to pagination requirements and ensuring smooth client navigation through data.


5090-5094: Enum ValidationUniqScope is well-defined

The ValidationUniqScope enum provides clear scope options with PROJECT and CLUSTER, enabling precise uniqueness constraints in validations.


5356-5359: Inclusion of uniqBy enhances validation capabilities

Adding the uniqBy field to ConfigurationValidationAttributes allows specifying uniqueness constraints, improving data integrity by enforcing name uniqueness within a defined scope.


5361-5364: UniqByAttributes input type is appropriately defined

The UniqByAttributes input type correctly specifies the scope field with a non-nullable ValidationUniqScope enum, facilitating scope-specific uniqueness enforcement.

go/client/models_gen.go (2)

3417-3434: Addition of 'PipelineContextHistory' structs is well-structured

The introduction of PipelineContextHistory, PipelineContextHistoryConnection, and PipelineContextHistoryEdge types is correctly implemented and follows Go conventions.


3537-3537: Ensure proper initialization of 'ContextHistory' in 'PipelineStage'

The new ContextHistory field in PipelineStage is added appropriately. Please verify that all instances of PipelineStage properly initialize the ContextHistory field to prevent potential nil pointer dereferences.

Run the following script to check for PipelineStage initializations without ContextHistory:

Comment on lines 17 to 19
      err ->
-       Logger.error "unknown observer error: #{err}"
+       Logger.error "unknown observer error: #{inspect(err)}"
    end

🛠️ Refactor suggestion

Consider consistent error handling for unknown errors

While the logging improvement is good, unknown errors are handled differently from poll and action errors. Consider updating the observer's error state for unknown errors as well.

Consider this change:

      err ->
        Logger.error "unknown observer error: #{inspect(err)}"
+       add_error(observer, "unknown", err)

This ensures:

  1. Consistent error handling across all error types
  2. Error history is preserved in the database
  3. Better observability of error states
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-     err ->
-       Logger.error "unknown observer error: #{err}"
-       Logger.error "unknown observer error: #{inspect(err)}"
-     end
+     err ->
+       Logger.error "unknown observer error: #{inspect(err)}"
+       add_error(observer, "unknown", err)
+     end

Comment on lines +47 to +48
defp do_validate(%Configuration{type: :string, validation: %Validation{regex: r}}, val)
when is_binary(r) and is_binary(val) do

⚠️ Potential issue

Compile the dynamic regex pattern defensively

Interpolating a runtime variable r into the regex sigil (~r/#{r}/) is actually permitted in Elixir — lowercase sigils support interpolation — but it raises Regex.CompileError at runtime whenever r is not a valid pattern, which is easy to hit when the pattern comes from user-supplied validation config.

To handle dynamic patterns explicitly, compile them with Regex.compile!/1 (or Regex.compile/1 to surface invalid patterns as validation errors) and match with Regex.match?/2:

case Regex.match?(Regex.compile!(r), val) do
  true -> :ok
  false -> {:error, "value #{val} does not match regex #{r}"}
end

This approach safely compiles the regex at runtime and performs the match against val.

Comment on lines +47 to +49
description: "The total number of clusters managed by this instance.",
measurement: :total
),

⚠️ Potential issue

Correct duplicate metric descriptions for 'unhealthy_cluster_count' and 'failed_service_count'

The descriptions for unhealthy_cluster_count and failed_service_count are identical to their healthy counterparts. They should accurately reflect the metrics they represent.

Apply the following diff to correct the descriptions:

 last_value(
   [:cluster, :count],
   event_name: metric_scope(:cluster_count),
   description: "The total number of clusters managed by this instance.",
   measurement: :total
 ),
 last_value(
   [:unhealthy, :cluster, :count],
   event_name: metric_scope(:unhealthy_cluster_count),
-  description: "The total number of clusters managed by this instance.",
+  description: "The total number of **unhealthy** clusters managed by this instance.",
   measurement: :total
 ),
 last_value(
   [:service, :count],
   event_name: metric_scope(:service_count),
   description: "The total number of services managed by this instance.",
   measurement: :total
 ),
 last_value(
   [:failed, :service, :count],
   event_name: metric_scope(:failed_service_count),
-  description: "The total number of services managed by this instance.",
+  description: "The total number of **failed** services managed by this instance.",
   measurement: :total
 ),

Also applies to: 53-55, 59-61, 65-67

@michaeljguarino michaeljguarino force-pushed the mjg/prod-2887-pipeline-auto-rollback-apis branch from a9d7f9b to da4c859 Compare November 19, 2024 05:31
@michaeljguarino michaeljguarino force-pushed the mjg/prod-2887-pipeline-auto-rollback-apis branch 3 times, most recently from a20a495 to 3d0670c Compare November 20, 2024 03:56
michaeljguarino and others added 6 commits November 19, 2024 23:27
Handle service failures by auto-rollback to a prior service context (also persist service context history per stage)
nif

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
@michaeljguarino michaeljguarino force-pushed the mjg/prod-2887-pipeline-auto-rollback-apis branch from 3d0670c to 02ae471 Compare November 20, 2024 05:07
@michaeljguarino michaeljguarino force-pushed the mjg/prod-2887-pipeline-auto-rollback-apis branch 2 times, most recently from cdeba72 to 8308916 Compare November 20, 2024 17:33
@michaeljguarino michaeljguarino force-pushed the mjg/prod-2887-pipeline-auto-rollback-apis branch 2 times, most recently from 4591b6b to 2f95b54 Compare November 20, 2024 17:58
@michaeljguarino michaeljguarino force-pushed the mjg/prod-2887-pipeline-auto-rollback-apis branch from 2f95b54 to e4cd354 Compare November 20, 2024 18:01
@michaeljguarino michaeljguarino merged commit 8bdf9fc into master Nov 20, 2024
16 checks passed
@michaeljguarino michaeljguarino deleted the mjg/prod-2887-pipeline-auto-rollback-apis branch November 20, 2024 21:05