Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(package): Enable replica set for the MongoDB results cache and configure it when starting the package. #632

Merged
merged 11 commits into from
Jan 20, 2025

Conversation

junhaoliao
Copy link
Member

@junhaoliao junhaoliao commented Dec 11, 2024

Description

  1. Enable replica set for the MongoDB results cache.
  2. Configure replica set in the indices building script.
  3. Change the results cache's Docker network mode to host and specify --bind-ip and --port when starting the mongo image.

Validation performed

Search query performance compassion

With & Without the replica set enabled, performed a search query in the WebUI with a screen recording of 30fps, and extract below times:

  1. T0 (relative 0 second): the query was started.
  2. T1: the first batch of results were available and displayed in the browser.
  3. T2: the total number of results are available (either from the results-metadata or summed from the aggregated time buckets) and displayed in the browser.
  4. T3: the job is marked as done and displayed in the browser.
  T0 T1 T2 T3
Trial 1 (with replica-set) 0 0.7 1.266667 2.1
Trial 2 (with replica-set) 0 0.633333 1.433333 2.166667
Trial 3 (with replica-set) 0 0.433333 1.4 2.1
Trial 4 (w/o replica-set) 0 0.466667 1.433333 2.1
Trial 5 (w/o replica-set) 0 0.5 1.433333 2.066667
Trial 6 (w/o replica-set) 0 0.6 1.433333 2.2

image

The speed performance does not seem to significantly differ with and without replica-set enabled.

Result cache host switching

Previous attempts to enable replica set failed because there were host reachability issues observed. Below validations ensure the clp-config.yml/results_cache host and port are configurable without affecting the result cache's avaialbility.

 build/clp-package/lib/python3/site-packages/clp_py_utils/create-results-cache-indices.py

  1. Built the clp-package.

  2. clp-package/sbin/start-clp.sh with the default config and observed below logs:

    2024-12-25T22:45:58.892 INFO [start_clp] Creating results_cache indices...
    2024-12-26 03:45:59,548 [DEBUG] Replica set initialization requested for localhost:27017
    2024-12-26 03:45:59,551 [DEBUG] Replica set has not been previously initialized.
    2024-12-26 03:45:59,551 [DEBUG] Initializing replica set at localhost:27017
    2024-12-26 03:45:59,693 [DEBUG] Replica set initialized successfully.
    2024-12-25T22:46:00.517 INFO [start_clp] Created results_cache indices.
    
  3. Changed the result_cache's host and port config in clp-package/etc/clp-config.yml to be localhost:27018. e.g.,

    results_cache:
      host: "localhost"
      port: 27018
      db_name: "clp-query-results"
      stream_collection_name: "stream-files"

    clp-package/sbin/start-clp.sh and observed below logs:

    2024-12-25T22:50:29.309 INFO [start_clp] Creating results_cache indices...
    2024-12-26 03:50:29,761 [DEBUG] Replica set initialization requested for localhost:27018
    2024-12-26 03:50:30,266 [DEBUG] Replica set is already initialized but reports invalid config.
    2024-12-26 03:50:30,266 [DEBUG] Initializing replica set at localhost:27018
    2024-12-26 03:50:30,269 [DEBUG] Replica set initialized successfully.
    2024-12-25T22:50:31.042 INFO [start_clp] Created results_cache indices.
    
  4. Used ip addr to find another address that belongs to the clp-package host (i got 172.22.78.194). Changed the result_cache's host and port config in clp-package/etc/clp-config.yml to be 172.22.78.194:27018. clp-package/sbin/start-clp.sh and observed below logs:

    2024-12-25T22:54:43.400 INFO [start_clp] Creating results_cache indices...
    2024-12-26 03:54:43,919 [DEBUG] Replica set initialization requested for 172.22.78.194:27018
    2024-12-26 03:54:44,422 [DEBUG] Replica set is already initialized but reports invalid config.
    2024-12-26 03:54:44,422 [DEBUG] Initializing replica set at 172.22.78.194:27018
    2024-12-26 03:54:44,427 [DEBUG] Replica set initialized successfully.
    2024-12-25T22:54:45.239 INFO [start_clp] Created results_cache indices.
    
  5. Changed the result_cache's host and port config in clp-package/etc/clp-config.yml to be localhost:27017. clp-package/sbin/start-clp.sh and observed below logs:

    2024-12-25T22:57:01.151 INFO [start_clp] Creating results_cache indices...
    2024-12-26 03:57:01,619 [DEBUG] Replica set initialization requested for localhost:27017
    2024-12-26 03:57:02,122 [DEBUG] Replica set is already initialized but reports invalid config.
    2024-12-26 03:57:02,122 [DEBUG] Initializing replica set at localhost:27017
    2024-12-26 03:57:02,125 [DEBUG] Replica set initialized successfully.
    2024-12-25T22:57:02.894 INFO [start_clp] Created results_cache indices.
    

Summary by CodeRabbit

  • Configuration Changes

    • Updated MongoDB configuration to support replica sets
    • Removed explicit IP binding settings
    • Added replica set name configuration
  • MongoDB Management

    • Introduced new functions for managing MongoDB replica set initialization and status checking
  • Deployment Updates

    • Modified container startup scripts for results cache and reducer
    • Updated network configuration settings for containers

Copy link
Contributor

coderabbitai bot commented Dec 11, 2024

Warning

Rate limit exceeded

@junhaoliao has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 17 minutes and 17 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 69ea477 and 7dc8c55.

📒 Files selected for processing (2)
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py (3 hunks)
  • components/clp-py-utils/clp_py_utils/initialize-results-cache.py (1 hunks)

Walkthrough

The pull request introduces modifications to support MongoDB replica set initialization in a Python utility script and updates the MongoDB configuration file. The changes enable automatic replica set configuration during the index creation process for a stream collection. The script now checks the replica set status and initializes it if not already configured, ensuring proper MongoDB replication setup.

Changes

File Change Summary
components/clp-py-utils/clp_py_utils/create-results-cache-indices.py Added functions check_replica_set_status(), init_replica_set_for_oplog(), and init_replica_set_for_oplog_if_needed() for MongoDB replica set management. Updated main() to call init_replica_set_for_oplog_if_needed() after establishing a connection. Imported OperationFailure and urlparse.
components/package-template/src/etc/mongo/mongod.conf Removed net.bindIp parameter and added replication.replSetName set to "rs0" to enable replica set support.
components/clp-package-utils/clp_package_utils/scripts/start_clp.py Modified start_results_cache and start_reducer functions to directly append network settings (--bind_ip and --port) in the MongoDB container start command. Added --network option for the reducer container.

Sequence Diagram

sequenceDiagram
    participant Script as Create Results Cache Indices Script
    participant MongoDB as MongoDB Client
    
    Script->>MongoDB: Establish Connection
    Script->>MongoDB: Check Replica Set Status
    alt Replica Set Not Initialized
        Script->>MongoDB: Initialize Replica Set
        MongoDB-->>Script: Replica Set Configured
    else Replica Set Already Initialized
        Script-->>MongoDB: Continue Processing
    end
    Script->>MongoDB: Create Indexes
Loading

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@junhaoliao
Copy link
Member Author

junhaoliao commented Dec 26, 2024

An issue is found when in clp-config.yml I configure a non-27017 port on host localhost for the replica set:

pymongo.errors.OperationFailure: No host described in new configuration with {version: 36921, term: -1} for replica set rs0 maps to this node, full error: {'ok': 0.0, 'errmsg': 'No host described in new configuration with {version: 36921, term: -1} for replica set rs0 maps to this node', 'code': 74, 'codeName': 'NodeNotFound'}

However, such issue is only observed on localhost hosts. If the host is not localhost / 127.0.0.1 (e.g., my WSL's virtual eth0 interface gets assigned 172.22.78.194), with any port the replica set initialization would succeed and there is no issue connecting to the clp-configured mongo address.

By looking into the community Mongo server source code, it was found they have two ways to check whether a node is up before the server configures itself as a replica node: isSelfFastPath and isSelfSlowPath.

  • isSelfFastPath is purely logic based as it tries to check identity of the server's address with a given config's host and port. It first checks if the port matches then it checks for host.
  • isSelfSlowPath handles the case that the fast path cannot find a match of the server's address in any of the configs. It tries to make a connection with the given config host and port and issue a command to check for identity.

Why localhost: any non-27017 port fails

Since our mongo server instance runs within a Docker container with only the docker's localhost's port 27017 mapped to the Docker host, if we request replica set to be configured on localhost:27017, the fast path check should succeed; if we request for localhost: any non-27017 port, the fast path wouldn't find a match because the server is started with 27017, and the slow path wouldn't find a match because in the Docker container's own network, the Docker host's port (configured by clp-config and used in the replica set config) is not reachable.

Why any non-localhost host: any port works (e.g., 172.22.78.194:27018)

The fast path would fail but the slow path would succeed as Docker knows non-localhost is outside of its network and Docker routes the connection to the Docker host.

Proposed solution

  1. Create a docker network and put all services (docker containers) that use Mongo into the same network.
  2. docker run the result cache with --network "host" so that the mongo Docker container sees / exposes everything on the Docker host.

Considering existing CLP utilities, Option 2. seems the quickest way. @haiqi96 may comment on the impacts on CLP cloud.

@junhaoliao
Copy link
Member Author

You may also see from the {version: 36921, term: -1} in the error message that they may have some uninitialized memory issue, lol

@junhaoliao junhaoliao changed the title WIP - Enable MongoDB replica set for the results cache and initialize it when starting the package. feat(package): Enable MongoDB replica set for the results cache and initialize it when starting the package. Dec 26, 2024
@junhaoliao junhaoliao changed the title feat(package): Enable MongoDB replica set for the results cache and initialize it when starting the package. feat(package): Enable MongoDB replica set for the results cache and configure it when starting the package. Dec 26, 2024
@junhaoliao junhaoliao marked this pull request as ready for review December 26, 2024 03:59
@junhaoliao junhaoliao requested a review from haiqi96 December 26, 2024 03:59
@junhaoliao junhaoliao changed the title feat(package): Enable MongoDB replica set for the results cache and configure it when starting the package. feat(package): Enable replica set for the MongoDB results cache and configure it when starting the package. Dec 26, 2024
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
components/clp-py-utils/clp_py_utils/create-results-cache-indices.py (2)

19-33: Consider parametrising the replica set host and port.

Although this initialisation flow works well for the default port (27017) on localhost, it may not be sufficient if the user wants to set up a replica set on a non-standard port or host. Passing in the host and port as parameters will improve flexibility and will align with the PR’s objective of making replica set configuration more adaptable.


47-49: Avoid creating two separate MongoClient connections.

Re-using the same MongoClient after invoking initialize_replica_set can be more efficient and simpler. If you do require a new client connection for index creation, consider adding a short comment explaining why a fresh connection is needed, so it is clear to maintainers.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6dd8fc1 and fb4516e.

📒 Files selected for processing (2)
  • components/clp-py-utils/clp_py_utils/create-results-cache-indices.py (3 hunks)
  • components/package-template/src/etc/mongo/mongod.conf (1 hunks)
🔇 Additional comments (2)
components/clp-py-utils/clp_py_utils/create-results-cache-indices.py (1)

6-6: Good use of the OperationFailure import.

This import from pymongo.errors is essential for handling replica set status checks and ensures that any non-initialised replica set is properly caught and configured.

components/package-template/src/etc/mongo/mongod.conf (1)

3-4: Replica set configuration approved.

Enabling the replica set by specifying replSetName is a straightforward and effective strategy. If you later find a need to run multiple replica sets on the same machine, you may consider making rs0 a configurable parameter in the environment or in a generated config file.

@haiqi96
Copy link
Contributor

haiqi96 commented Jan 9, 2025

Quick question: for Result cache host switching validation, I assumed you also ensured the search in webui is still working with different IPs and ports?

@haiqi96
Copy link
Contributor

haiqi96 commented Jan 9, 2025

Discussed with Junhao offline and clarified the two points:

  1. The current replica set only adds one member, so it doesn't actually provide redundancy.
  2. The purpose of replica set is not support redundancy, but to enable oplog that supports watch(), which replaces polling to check updates in the collection.

Imo, we should make these two points clear in the code. We have agreed on renaming some replica related functions and adding docstrings so developer who only reads code can also have the context above.

@@ -491,6 +491,7 @@ def start_results_cache(instance_id: str, clp_config: CLPConfig, conf_dir: pathl
cmd = [
"docker", "run",
"-d",
"--network", "host",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, do you know exactly why we need --network host? is there any IP other than results_cache.port that will be accessed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using --network host with MongoDB in Docker ensures proper replica set configuration because Docker's network containerization interferes with MongoDB's isSelf checks when using non-localhost-&-27017 ports. By exposing the host network to the MongoDB container, we can avoid this issue.

More details of the investigation can be found at #632 (comment) Let me know if I can elaborate on any part

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shall we also consider renaming this script as initialize-results-cache.py to better reflect its updated purpose?

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
components/clp-py-utils/clp_py_utils/create-results-cache-indices.py (3)

1-1: Consider renaming the script to better reflect its purpose.

As suggested in the previous review, renaming this script to initialize-results-cache.py would better represent its expanded functionality, which now includes both index creation and replica set initialization.


20-48: Well-structured replica set status checking with comprehensive error handling.

The function properly handles different MongoDB error codes and provides appropriate logging for each case:

  • Code 94: Not yet initialized
  • Code 93: Invalid replica set configuration
  • Other codes: Re-raised for proper error propagation

Consider adding docstring documentation:

 def check_replica_set_status(client, netloc):
+    """
+    Check the status of MongoDB replica set.
+    
+    Args:
+        client: MongoDB client instance
+        netloc: Network location string (host:port)
+    
+    Returns:
+        tuple: (is_already_initialized, should_configure_replica_set)
+        
+    Raises:
+        OperationFailure: For unexpected MongoDB operation errors
+    """

51-69: Robust replica set configuration implementation.

The function handles both initial configuration and reconfiguration scenarios appropriately. The use of force=True for reconfiguration is a good choice to avoid version management complexity.

Consider adding a warning log before force reconfiguration:

     if is_already_initialized:
+        logger.warning("Force reconfiguring existing replica set")
         # Use `force=True` so that we do not have to increment the version
         client.admin.command("replSetReconfig", config, force=True)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fb4516e and 66e6b93.

📒 Files selected for processing (3)
  • components/clp-package-utils/clp_package_utils/scripts/start_clp.py (2 hunks)
  • components/clp-py-utils/clp_py_utils/create-results-cache-indices.py (3 hunks)
  • components/package-template/src/etc/mongo/mongod.conf (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: lint-check (macos-latest)
🔇 Additional comments (4)
components/clp-package-utils/clp_package_utils/scripts/start_clp.py (2)

496-496: Network configuration change aligns with MongoDB replica set requirements.

The addition of --network host is necessary for proper replica set initialization in Docker, as explained in the previous discussion. This allows MongoDB to perform its isSelf checks correctly.


508-511: Direct port and IP binding replaces Docker port mapping.

The change from Docker port mapping to direct MongoDB configuration using --bind_ip and --port is more suitable for replica set configuration. This approach:

  1. Avoids network translation issues that could interfere with replica set communication
  2. Provides clearer network configuration that matches the replica set configuration
components/package-template/src/etc/mongo/mongod.conf (1)

1-2: Replica set configuration is properly defined.

The configuration correctly sets up the replica set name as "rs0", which matches the configuration used in the Python initialization code.

Let's verify that this replica set name is consistently used across the codebase:

✅ Verification successful

Replica set configuration is consistent across the codebase

The replica set name "rs0" is correctly configured in mongod.conf and properly referenced in the initialization script. The configuration follows MongoDB best practices with explicit host specifications.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for "rs0" references to ensure consistency
rg "rs0" --type yaml --type python

Length of output: 65


Script:

#!/bin/bash
# Search for "rs0" across all files
rg "rs0"

# Find MongoDB-related Python files
fd -e py -e yaml -e conf -e config | rg -i "mongo"

Length of output: 279


Script:

#!/bin/bash
# Check the context of rs0 usage in the indices script
rg -A 5 -B 5 "rs0" components/clp-py-utils/clp_py_utils/create-results-cache-indices.py

Length of output: 462

components/clp-py-utils/clp_py_utils/create-results-cache-indices.py (1)

97-98: Ensure replica set is configured before creating indices.

The code correctly uses directConnection=True for the initial connection to avoid replica set connection attempts before configuration. However, there's a potential race condition between the two client connections.

Consider adding a retry mechanism for the second connection:

         with MongoClient(results_cache_uri, directConnection=True) as results_cache_client:
             configure_replica_set_if_needed(results_cache_client, results_cache_uri)
+            
+        # Wait for replica set to become available
+        max_retries = 30
+        retry_interval = 1
+        for attempt in range(max_retries):
+            try:
+                with MongoClient(results_cache_uri, serverSelectionTimeoutMS=1000) as client:
+                    client.admin.command('ping')
+                    break
+            except Exception as e:
+                if attempt == max_retries - 1:
+                    raise
+                time.sleep(retry_interval)
✅ Verification successful

Add retry mechanism for MongoDB replica set initialization

The review comment is correct. After configuring a replica set, MongoDB needs time to elect a primary and stabilize. The suggested retry mechanism will help handle the transition period properly and prevent connection failures.

🏁 Scripts executed

The following scripts were executed for the analysis:

Error: No shell scripts found


Script:

#!/bin/bash
# Search for configure_replica_set_if_needed implementation
ast-grep --pattern 'def configure_replica_set_if_needed($$$):'

# Look for MongoDB connection patterns and retry mechanisms
rg -A 10 "MongoClient\(" --type py

# Search for retry-related patterns
rg "retry|attempt|max_retries" --type py

Length of output: 6544


Script:

#!/bin/bash
# Search for configure_replica_set_if_needed implementation with content
rg -A 10 "configure_replica_set_if_needed" --type py

Length of output: 2583

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (3)
components/clp-py-utils/clp_py_utils/create-results-cache-indices.py (3)

20-64: Consider adding more detailed logging for replica set status changes.

The logging could be more consistent. Consider adding INFO level logs for important state changes, while keeping DEBUG level for detailed operations.

-            logger.debug("Replica set is already initialized at %s", existing_netloc)
+            logger.info("Replica set is already initialized at %s", existing_netloc)

94-115: Add validation for MongoDB URI components.

The URI validation could be more thorough to prevent issues with invalid configurations.

     parsed_uri = urlparse(uri)
     netloc = parsed_uri.netloc
-    if 0 == len(netloc):
+    if not netloc:
         raise ValueError("Invalid URI: %s", uri)
+    
+    # Validate URI components
+    if not parsed_uri.hostname:
+        raise ValueError("Missing hostname in URI: %s" % uri)
+    if not parsed_uri.port:
+        raise ValueError("Missing port in URI: %s" % uri)

129-131: Consider consolidating MongoDB connections and improving error handling.

The code uses two separate client connections: one for replica set initialization and another for index creation. This could be confusing and potentially inefficient.

-        with MongoClient(results_cache_uri, directConnection=True) as results_cache_client:
-            init_replica_set_for_oplog_if_needed(results_cache_client, results_cache_uri)
-
-        with MongoClient(results_cache_uri) as results_cache_client:
+        # Initialize with direct connection for replica set setup
+        results_cache_client = MongoClient(results_cache_uri, directConnection=True)
+        try:
+            init_replica_set_for_oplog_if_needed(results_cache_client, results_cache_uri)
+            # Reconnect to use replica set
+            results_cache_client = MongoClient(results_cache_uri)
             stream_collection = results_cache_client.get_default_database()[stream_collection_name]
+        except OperationFailure as e:
+            logger.error("Failed to initialize replica set: %s", str(e))
+            raise
+        finally:
+            results_cache_client.close()
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 66e6b93 and 69ea477.

📒 Files selected for processing (1)
  • components/clp-py-utils/clp_py_utils/create-results-cache-indices.py (3 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: lint-check (ubuntu-latest)
  • GitHub Check: lint-check (macos-latest)
🔇 Additional comments (3)
components/clp-py-utils/clp_py_utils/create-results-cache-indices.py (3)

4-7: LGTM! The new imports support the replica set functionality.

The added imports are appropriate for URI parsing and MongoDB error handling.


1-1: Consider renaming the script to better reflect its purpose.

As suggested in previous reviews, the script now handles both index creation and replica set initialization. Consider renaming it to initialize-results-cache.py to better reflect its expanded purpose.


66-92: 🛠️ Refactor suggestion

Add safety checks and clarify single-node nature in function name.

  1. The function name doesn't indicate it's specifically for single-node setup
  2. Using force=True in replSetReconfig could be dangerous without additional checks
-def init_replica_set_for_oplog(client: MongoClient, netloc: str, force: bool):
+def init_single_node_replica_set(client: MongoClient, netloc: str, force: bool):
     """
-    Initializes or reconfigures a single-node MongoDB replica set for enabling the oplog.
+    Initializes or reconfigures a single-node MongoDB replica set.
+    
+    WARNING: This function is specifically designed for single-node replica sets.
+    Using force=True in reconfiguration can be dangerous in multi-node setups.

     :param client:
     :param netloc: The network location of the MongoDB instance to configure
     as a replica set member.
     :param force: Whether this is a reconfiguration operation.
     """
+    if not netloc:
+        raise ValueError("netloc cannot be empty")

Likely invalid or redundant comment.

@junhaoliao junhaoliao requested a review from haiqi96 January 11, 2025 18:29
@junhaoliao
Copy link
Member Author

@haiqi96 I added docstrings for the newly added functions in the init script, and renamed the methods & parameters & the init script name, which hopefully has clarified the purpose of those methods. Please help taking a look again.

Copy link
Contributor

@haiqi96 haiqi96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left a few comments on the docstrings but in general the change looks reasonable to me.

def check_replica_set_status(client: MongoClient, netloc: str) -> tuple[bool, bool]:
"""
Checks the current replica set status of the MongoDB server and determines whether it needs to
be configured or reconfigured.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto.

Co-authored-by: haiqi96 <14502009+haiqi96@users.noreply.github.com>
@junhaoliao junhaoliao requested a review from haiqi96 January 19, 2025 21:06
Copy link
Contributor

@haiqi96 haiqi96 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, assuming it is tested and verified.

@junhaoliao junhaoliao merged commit 5857bcf into y-scope:main Jan 20, 2025
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants