Jdk17 venice ghci #7
…inkedin#512) PushJobZstdConfig calculates the number of bytes to be collected per input file based on the configurable sample size (default 200MB). If that ends up being less than the size of the first value in a file, no samples get collected from that file, potentially causing zstd dictionary creation to be skipped. Adding an extra check to collect at least 1 sample per input file (until reaching the max sample size).
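For illustration only, a minimal sketch of this kind of guard, with assumed class and field names rather than the actual PushJobZstdConfig internals:

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch (not the actual PushJobZstdConfig code): collect samples for
 * zstd dictionary training, guaranteeing at least one sample per input file even
 * when the per-file byte budget is smaller than the first value in that file.
 */
public class DictSampleCollector {
  private final long perFileByteBudget;   // total sample size divided by number of input files
  private final long maxTotalSampleBytes; // configurable, e.g. 200 MB by default
  private long totalCollectedBytes = 0;
  private final List<byte[]> samples = new ArrayList<>();

  public DictSampleCollector(long perFileByteBudget, long maxTotalSampleBytes) {
    this.perFileByteBudget = perFileByteBudget;
    this.maxTotalSampleBytes = maxTotalSampleBytes;
  }

  /**
   * @param bytesCollectedForFile bytes already sampled from the current input file
   * @param samplesFromFile       number of samples already taken from the current file
   * @return true if the value was accepted as a sample
   */
  public boolean maybeAddSample(byte[] value, long bytesCollectedForFile, int samplesFromFile) {
    if (totalCollectedBytes + value.length > maxTotalSampleBytes) {
      return false; // global sample budget exhausted
    }
    boolean withinPerFileBudget = bytesCollectedForFile + value.length <= perFileByteBudget;
    // The extra check: even if the per-file budget is too small for this value,
    // still take one sample from the file so it contributes to the dictionary.
    if (withinPerFileBudget || samplesFromFile == 0) {
      samples.add(value);
      totalCollectedBytes += value.length;
      return true;
    }
    return false;
  }

  public List<byte[]> getSamples() {
    return samples;
  }
}
```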
…edin#508) In the VPJ, we call the controller 4 times to get the StoreResponse, which is wasteful. These configurations are unlikely to change during a push job and should be cached to avoid the duplicate calls. This PR caches the response so we can reduce the duplicate calls by 2. We will need to do more refactoring to make the VPJ class simple and easy to understand.
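A minimal sketch of the caching idea, using a hypothetical generic wrapper rather than the actual VPJ code:

```java
import java.util.function.Supplier;

/**
 * Sketch of memoizing a controller response for the lifetime of a push job.
 * The real change caches the StoreResponse inside the VPJ driver itself.
 */
public class CachedResponseSketch<T> {
  private final Supplier<T> fetcher; // e.g. a call to the controller client
  private T cached;

  public CachedResponseSketch(Supplier<T> fetcher) {
    this.fetcher = fetcher;
  }

  /** Fetch once per push job; subsequent callers reuse the cached response. */
  public synchronized T get() {
    if (cached == null) {
      cached = fetcher.get();
    }
    return cached;
  }
}
```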
* [server] cleanup removed stores and unused versions during server bootstrap If server RocksDB directories contain resources that are bad and cannot be opened by RocksDB during restart, they are not loaded into the storage engine and none of the background resource cleaners kick in for those directories. This leads to massive disk space waste. With this PR, if the RocksDB open fails, the unused resources are cleaned up.
linkedin#514) Currently, when creating a D2 client via D2ClientFactory, the code only waits for 5 seconds. This commit increases the timeout to 60 seconds.
…existing Pair usage in controller (linkedin#515) This PR uses the existing POJO class ExecutionStatusWithDetails to replace Pair<ExecutionStatus, Optional> and Pair<ExecutionStatus, String> usage in the controller push monitoring logic to improve readability.
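For illustration, a result POJO of roughly this shape (field and method names assumed, not copied from the actual ExecutionStatusWithDetails class) reads much better at call sites than a Pair:

```java
/**
 * Sketch of a small result POJO in the spirit of ExecutionStatusWithDetails.
 * The enum values and field names here are assumptions for illustration only.
 */
public class StatusWithDetailsSketch {
  public enum Status { STARTED, COMPLETED, ERROR }

  private final Status status;
  private final String details; // may be null when there is nothing to report

  public StatusWithDetailsSketch(Status status, String details) {
    this.status = status;
    this.details = details;
  }

  public Status getStatus() {
    return status;
  }

  public String getDetails() {
    return details;
  }

  @Override
  public String toString() {
    return "Status=" + status + (details == null ? "" : ", details=" + details);
  }
}
```

Compared to Pair<ExecutionStatus, String>, the accessor names document themselves, and Optional wrapping becomes unnecessary when null is an accepted "no details" value.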
…etionUnderway (linkedin#518) The commit removes the `isTopicDeletionUnderway` API from the `PubSubAdminAdapter` class. The current default implementation: `ApacheKafkaAdminAdapter::isTopicDeletionUnderway` always returns `false`, making it useless. Removing this unnecessary API to improve the clarity and maintainability of the codebase.
…din#496) * [Controller] Emergency source region config should be scoped per cluster to help with testing and can also be useful with complex deployments
* [vpj] minor logging improvements
…lix manager (linkedin#521) Current behavior: After helix manager initialization, child controllers find the system schema cluster's leader controller url from the helix manager, build a controller client using that url, and register schemas via the controller client. Problem: During deployment, when leader controller A stops, helix assigns controller B as leader. When A's controller service starts again, helix wants to hand leadership back to A, so it transitions B from leader to standby. Before the transition completes, B is still the leader and all of A's requests are sent to B. But since B has cleared its in-memory store map, it throws a no-store exception for A's requests, and A fails to start. Fix: Before helix manager initialization, child controllers register system schemas via a controller client built using the url or d2 configs. Since the controller instance is still disabled in helix, helix won't hand leadership back to A and won't transition B from leader to standby, so B will handle A's requests successfully. Also, in case leadership transitions from B to another controller C, increase the number of retries to cover the leader->standby period.
1. Fix a logging bug during ingestion isolation handover: when the SIT partition is shut down in time, the old logic still logs an incorrect message saying it did not stop in time, at ERROR level. This causes in-house deployment tooling to report ERROR logging issues. This PR fixes the logic to only log when the shutdown actually times out, and reduces the log to WARN level. 2. Remove StoreVersionState.toString() logging in handover, as this is the underlying SpecificRecord's toString and will print out the binary data we store inside the SVS. This PR simply removes the logging since we are not interested in the content of the SVS.
linkedin#520) * [admin-tool] Add support for '--cluster' parameter in data recovery tools This change adds support for taking '--cluster' (cluster name) as an optional parameter to several data recovery tools. Because of this change, the '--stores' parameter is now optional, and when both are present, '--stores' overrides '--cluster'.
…check (linkedin#517) * [server][da-vinci] Standardize ingestion stats on the ready-to-serve lag measure The old measure is out of date with what we actually use for ready to serve. So this will remove false alarms and give more transparency in our metrics.
…e components in the test framework (linkedin#523) Sometimes, to construct pubsub clients, we need to pass additional details about the pubsub brokers. This PR lets us pass these details to the pubsub clients in the integration testing environment.
…edin#516) * [controller][test] Fix a race condition between kill job and version swap Today there's a race between kill job and push completion or version swap. This is because we only write the kill message and do nothing else, relying on the servers to receive the kill, report ERROR to mark the version killed, and prevent it from becoming current/active. However, this is not always the case, especially when the push strategy only requires n-1 replicas to be completed in every partition. This means that if a kill is sent when a push is near completion or almost completed, the version can still become ONLINE and the current version after the kill was sent and recognized by some SNs. This is an issue because we then have a current version that also has a kill message in the participant store, so no re-balance or re-bootstrap can succeed for this store version. The current mitigation is to re-push or roll back to the previous version when this race happens. A prerequisite rb (linkedin#269) that introduces the new VersionStatus.KILLED has been deployed in production, which allows us to continue with the following fix. This fix includes: 1. On kill, we take a store lock (since the current version swap will also acquire this lock) when checking whether the version to be killed is still in a non-complete state, and update the version status on ZK to KILLED if it is. 2. On version swap, we check whether the version to which we are swapping has already been killed by checking the VersionStatus. 3. An integration test with mocks to verify the correctness of the above behavior.
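A simplified sketch of the locking scheme from points 1 and 2 above; the class, method, and status names are illustrative, and the real logic lives in the controller with ZK-backed version metadata:

```java
import java.util.concurrent.locks.ReentrantLock;

/** Illustrative only: both kill and version swap serialize on the same store lock. */
public class KillVersionSwapSketch {
  public enum VersionStatus { STARTED, ONLINE, KILLED }

  private final ReentrantLock storeLock = new ReentrantLock();
  private VersionStatus versionStatus = VersionStatus.STARTED;
  private int currentVersion = 0;

  /** On kill: mark the version KILLED only if it has not completed yet. */
  public void killOngoingPush() {
    storeLock.lock();
    try {
      if (versionStatus != VersionStatus.ONLINE) {
        versionStatus = VersionStatus.KILLED; // persisted to ZK in the real code
      }
      // The kill message is still written to the participant store afterwards.
    } finally {
      storeLock.unlock();
    }
  }

  /** On version swap: refuse to promote a version that was killed. */
  public boolean swapToVersion(int candidateVersion) {
    storeLock.lock();
    try {
      if (versionStatus == VersionStatus.KILLED) {
        return false; // a killed version must never become the current version
      }
      versionStatus = VersionStatus.ONLINE;
      currentVersion = candidateVersion;
      return true;
    } finally {
      storeLock.unlock();
    }
  }

  public int getCurrentVersion() {
    return currentVersion;
  }
}
```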
* [davinci][ii] Remove zk usage from II for davinci stores User libraries should not access zk. Additional work could be done to remove zk completely from the sub-project, but given the way things are currently architected that could be quite a lot of work, and we need to make some incremental progress here. * fix test * add localhost util EOM * Tweak test. This isn't great though, the intent is to use two processes. * Revert "Tweak test. This isn't great though, the intent is to use two processes." This reverts commit 6aa73d8. * Final fix for testing * Remove commented code * sanity check * More sanity check * Fix port assignment in integration test framework so the cluster initializer does not advertise a non-connected port. * some clean up --------- Co-authored-by: Zac Policzer <zpolicze@zpolicze-mn1.linkedin.biz>
…n#525) * Ignore the slow-node logic during retry when a 5XX is encountered, so the retry can still request nodes from the slow node set
The DIV state could grow unboundedly, since it tracks some metadata for each producer and never clears it out. This is problematic for use cases where a hybrid store is RT-only, meaning it doesn't rev up store-versions, and as the servers are bounced over time, the leader replicas generate new GUIDs which linger in the DIV state forever. In extreme cases, this leads to OOM errors. This change mitigates the unbounded growth by introducing a max age, beyond which DIV metadata can be evicted. This is controlled by a new config, which is disabled by default: div.producer.state.max.age.ms (default: -1)

The clearing logic is inserted in the updateOffsetRecordForPartition function of KafkaDataIntegrityValidator, which is itself called prior to each sync (both as part of ingestion when a threshold is reached, and during graceful shutdown). It is also called in the setPartitionState function of the same class, which is called when subscribing to a partition, including when the server starts up. The age of a producer is determined by the most recent timestamp within a partition (computed by OffsetRecord::getMaxMessageTimeInMs) minus the most recent timestamp of that producer (which is the Segment's lastRecordProducerTimestamp). These timestamps can be subject to drift, but in practice the drift probably isn't going to be large enough to be a concern. Importantly, because the math is entirely within a single partition, one partition can be lagging (e.g. as would be the case in a re-bootstrap scenario) while another may be caught up, and each partition will independently clear its own state without contaminating the others.

This property of isolating the partitions' expired state clearing is achieved by inverting the KafkaDataIntegrityValidator's internal data structure. Previously, it contained a Map<GUID, ProducerTracker>, with each ProducerTracker containing one Segment per tracked partition. Now, it instead contains a SparseConcurrentList of PartitionTracker (which is essentially the same as ProducerTracker, but renamed), each of which contains a Map<GUID, Segment>. This is a more natural structure that aligns with other parts of the code as well. It also eliminates 2 map lookups from the DIV hot path (called 1/msg in followers, and 2/msg in leaders).

Note that currently the state clearing happens only in the "main" KafkaDataIntegrityValidator, which is used by drainers and which persists its state to disk. There is another KafkaDataIntegrityValidator used for leader replicas only, which tracks the RT topic's DIV state, but that one is transient and not persisted to disk. As such, it is not expected to grow as large, and bouncing effectively clears it (which is not the case for the one that gets persisted). Later on, we could trigger state cleaning for this other validator as well.

Miscellaneous changes: Removed the RedundantExceptionFilter::getRedundantExceptionFilter with parameters, since that is broken by design... this and the non-parameterized function both returned a singleton object, so the actual parameters of the singleton were driven by the timing of which code path got called first, which makes no sense. Moved ProducerTracker (renamed to PartitionTracker) and its unit test from venice-common to dvc, since it is only used in the latter. StoreIngestionTaskTest::testDataValidationCheckPointing no longer has invocationCount = 3, as this is a pretty long test, and it doesn't seem useful.
Introduced a ReentrantLockWithRef, a new utility class which extends ReentrantLock and also carries a ref to another object. This is used in PartitionTracker, where previously we would need to query two maps, one to get the lock, and the other to get the protected object. Since we were always operating on the same key of both maps, we could instead group the elements into this shared container, hosted in a single map. Also removed several unnecessary objects from Segment. A server heap dump was found containing >3M instances of this class, so making it more efficient should be beneficial (i.e. even when DIV state clearing is not enabled yet). Checksum::getInstance does not return an Optional anymore. Reduced string concatenation in several places.
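A sketch of what a lock-plus-payload container along these lines could look like; the actual ReentrantLockWithRef in Venice may differ in details:

```java
import java.util.concurrent.locks.ReentrantLock;

/**
 * Sketch of a lock that also carries a reference to the object it protects, so a
 * single map value holds both, instead of maintaining two parallel maps.
 */
public class LockWithRef<T> extends ReentrantLock {
  private volatile T ref;

  public LockWithRef() {
  }

  public LockWithRef(T ref) {
    this.ref = ref;
  }

  public T getRef() {
    return ref;
  }

  public void setRef(T ref) {
    this.ref = ref;
  }

  public boolean isRefPresent() {
    return ref != null;
  }
}
```

With this shape, a single Map<GUID, LockWithRef<Segment>> replaces the pair of maps, so the lock and the Segment it guards are always fetched together.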
…t server start time (linkedin#531) When a server starts, if system schema initialization is enabled, it will call controller to check and register new meta system store schemas. Therefore, servers can be deployed before controllers when meta system store schema is upgraded. Controller url (for open source users who do not want to onboard d2) or d2 configs must be provided to enable system schema initialization at start time.
…nice components (linkedin#538) This commit adds support for passing non-mergeable broker details to the Venice components. This means that if there is a key that is present in the broker details for multiple brokers, only the value for the last broker will be kept. Previously, all values for the same key would be concatenated together.
…linkedin#535) This PR aims to clean up deprecated value and RMD chunks to avoid storage engine chunking leaks. When we process a partial update request, we also look up the old value and RMD manifests and produce DELETE messages for the old value and RMD chunks to remove them. Note: As discussed offline, this is a short-term solution to clean up disk space; it will not clean up Kafka version topic garbage chunks, as even after log compaction there will be one DELETE message left per deprecated chunk key. We think this is acceptable at this stage since a DELETE message is small, and we can work on a complete long-term solution later. Based on the current implementation, we cannot fully delete the RMD of a key; we can only insert an empty RMD into it. (If this is not acceptable, we will need to add new implementations to execute a real RocksDB delete on the RMD Column Family.) We refactor the chunking get method to store the retrieved value's manifest if it is chunked. That means for AASIT, we will always delete old RMD chunks (as the RMD is always fetched for every operation) and we will make a best effort to delete the old value's RMD (in theory, we will always do it for UPDATE). For LFSIT, we will always delete value chunks for UPDATE operations.
…ory` for integration test and venice components. (linkedin#493) This change supports plugging in PubSubClientFactory from the server config and controller config. This enables configuring a specific combination of PubSubClients: for example, the admin factory and producer factory may be for one type of PubSub client, while the consumer factory is of another type. This change will also benefit Ingestion Isolation and the DaVinci client with a specific PubSub client. --------- Co-authored-by: Hao Xu <xhao@xhao-mn2.linkedin.biz>
…n#534) Read-compute functionality was available to select users and required support from service operators to ensure safety and stability for all users. This change makes the read-compute interface available to all users by leveraging batch-get and local computation by default. This allows service operators to enable/disable remote computations without disrupting user applications. The order of operations is as follows: Upon start, the client sends a compute request and sets a new HTTP header VENICE_CLIENT_COMPUTE to indicate that client-side computation is supported. The router checks the store config and, when remote computation is off, fails the request if client-side computation is not available; otherwise, the router converts the compute request into a batch-get. The client checks the HTTP headers to determine what kind of response it got and, if necessary, performs the computations locally. Subsequent computation requests are converted to batch-get on the client side to avoid extra work and sending extra data over the network. The batch-get response always carries a header indicating the latest availability of remote computation, so that the client can switch back and forth if needed (e.g., when the service operator changes the store config). Summary of changes: * Moved common computation logic from the server and da-vinci-client to ComputeUtils to reduce code duplication. * Merged com.linkedin.venice.utils.ComputeUtils and com.linkedin.venice.compute.ComputeOperationUtils into com.linkedin.venice.compute.ComputeUtils. * Moved AbstractAvroStoreClient::StoreClientStreamingCallback out to reduce client complexity and maintain independence between transport and client logic. The new decoder layer is responsible for deserializing records from a stream of bytes. * Added a storage_engine_read_compute_efficiency server metric to help monitor data compression due to read compute. * Added a multiget_fallback metric to both client and router to monitor client-computation activity.
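A rough sketch of the header-driven fallback decision described above; the header names and value conventions here are assumptions for illustration, not the exact Venice constants:

```java
import java.util.Map;

/** Illustrative only: how a client could advertise and detect the compute fallback. */
public final class ClientComputeFallbackSketch {
  // Assumed header names, for illustration only.
  public static final String CLIENT_COMPUTE_HEADER = "X-VENICE-CLIENT-COMPUTE";
  public static final String REMOTE_COMPUTE_ENABLED_HEADER = "X-VENICE-REMOTE-COMPUTE-ENABLED";

  private ClientComputeFallbackSketch() {
  }

  /**
   * The client advertises that it can compute locally; the router may then answer a
   * compute request with a batch-get response instead of failing it.
   */
  public static void advertiseClientCompute(Map<String, String> requestHeaders) {
    requestHeaders.put(CLIENT_COMPUTE_HEADER, "true");
  }

  /** Did the router serve a batch-get fallback instead of a remote compute result? */
  public static boolean isBatchGetFallback(Map<String, String> responseHeaders) {
    return "false".equalsIgnoreCase(responseHeaders.get(REMOTE_COMPUTE_ENABLED_HEADER));
  }

  /**
   * Once a response reports that remote compute is disabled, keep converting compute
   * requests to batch-get locally until a later response reports it is enabled again.
   */
  public static boolean shouldSendRemoteCompute(boolean lastResponseReportedRemoteComputeEnabled) {
    return lastResponseReportedRemoteComputeEnabled;
  }
}
```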
…#533) RocksDB offers a set of multi-get APIs besides the single-get API, and it claims that they should improve performance by parallelizing the IO operations with Linux kernel 5.1 or later. We also wanted to benchmark these APIs for the PT format since, in theory, they should reduce the JNI cost due to fewer JNI calls. Based on the benchmarking, the multi-get API performs worse than the single-get API (which is unexpected), and we would like to keep these benchmarks and code so that we can evaluate the BT format in the future. PT benchmark result (subset):
Benchmark (BATCH_SIZE) Mode Cnt Score Error Units
RocksDBLookupApiBenchmark.measureMultiGetAPI 1 avgt 5 18195.987 ± 1484.887 ms/op
RocksDBLookupApiBenchmark.measureMultiGetAPI 2 avgt 5 16774.263 ± 212.736 ms/op
RocksDBLookupApiBenchmark.measureMultiGetAPI 10 avgt 5 18206.964 ± 3331.436 ms/op
RocksDBLookupApiBenchmark.measureMultiGetAPIWithByteBuffer 1 avgt 5 27395.865 ± 688.723 ms/op
RocksDBLookupApiBenchmark.measureMultiGetAPIWithByteBuffer 2 avgt 5 23750.618 ± 4607.772 ms/op
RocksDBLookupApiBenchmark.measureMultiGetAPIWithByteBuffer 10 avgt 5 15060.939 ± 941.485 ms/op
RocksDBLookupApiBenchmark.measureSingleGetAPI 1 avgt 5 8542.114 ± 378.653 ms/op
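For reference, a minimal standalone comparison of the two lookup styles using the RocksJava API; this is not the Venice JMH benchmark, just an illustration of "one get() call per key" versus "one multiGetAsList() call per batch":

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class RocksDbLookupSketch {
  public static void main(String[] args) throws RocksDBException {
    RocksDB.loadLibrary();
    String dbPath = args.length > 0 ? args[0] : "/tmp/rocksdb-lookup-sketch"; // assumed path
    try (Options options = new Options().setCreateIfMissing(true);
        RocksDB db = RocksDB.open(options, dbPath)) {
      List<byte[]> keys = new ArrayList<>();
      for (int i = 0; i < 1000; i++) {
        byte[] key = ("key-" + i).getBytes(StandardCharsets.UTF_8);
        db.put(key, ("value-" + i).getBytes(StandardCharsets.UTF_8));
        keys.add(key);
      }

      // Single-get API: one JNI call per key.
      long start = System.nanoTime();
      for (byte[] key: keys) {
        db.get(key);
      }
      long singleGetNanos = System.nanoTime() - start;

      // Multi-get API: one JNI call for the whole batch.
      start = System.nanoTime();
      List<byte[]> values = db.multiGetAsList(keys);
      long multiGetNanos = System.nanoTime() - start;

      System.out.println("single-get total ns: " + singleGetNanos);
      System.out.println("multi-get total ns: " + multiGetNanos + " (" + values.size() + " values)");
    }
  }
}
```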
… operations (linkedin#537) * [controller] Update superset schema ID after all value schema related operations The existing implementation updates the superset schema id as soon as we add a new value or superset schema. This is problematic because in the parent controller we perform schema addition in the following order: 1. Add new value schema (superset schema id is also updated). 2. Check if A/A is enabled and update the RMD schema. 3. Check if WC is enabled and update the derived schema. This execution order allows a race to happen where the superset schema id is updated but the corresponding RMD schema and derived schema are not available yet, resulting in failures in downstream consumers like the Venice storage node during ingestion. The fix is to perform the superset schema id update as the last step and separate it into its own update-store admin message.
This was flagged when reviewing 6c8b672 and left as a followup. The historical context is that there was a temporary phase of the A/A migration where RT producers would be replicated to all regions and so servers would consume duplicate events concurrently, even some written by the same producer GUID. This led to race conditions in the DIV code, and so this locking was introduced to guard against it. The locking was retained as part of the aforementioned commit, but is not actually necessary anymore, so we're cleaning it up. This is expected to provide a small boost to ingestion performance as well as to the memory efficiency of DIV, especially in cases where there are lots of producer GUIDs involved. Miscellaneous: - Tweaked the VeniceSystemFactoryTest::testSerializationCastParams data provider so that it returns parameters with a deterministic toString output, which is necessary for the test-retry plugin to work well.
…child controller client (linkedin#539) To resolve the KME upgrade deployment order, one action is to make child controllers and servers register the KME from the code into the local system store at startup time. However, KME is only backward compatible, and adding a schema that is not fully compatible via the controller client is not supported. This commit supports adding a value schema with an id and compat type via the child controller client.
…inkedin#542) * [vpj] Reuse KafkaConsumer when building dict from an existing topic Previously, `KafkaInputDictTrainer` would create a new KafkaConsumer per partition, which results in a ton of KafkaConsumer-related logs printed in the VPJ log and makes VPJ log analysis less efficient. This code change reuses the same consumer when extracting data messages from different partitions to reduce the total amount of logs produced by KafkaInputDictTrainer. * Addressed comment
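A sketch of the consumer-reuse pattern with the plain Kafka consumer API (not the actual KafkaInputDictTrainer code):

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

/** One consumer is created up front and re-assigned to each partition in turn. */
public class SharedConsumerSketch {
  public static void sampleAllPartitions(String bootstrapServers, String topic, int partitionCount) {
    Properties props = new Properties();
    props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
    props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, ByteArrayDeserializer.class.getName());
    try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
      for (int partition = 0; partition < partitionCount; partition++) {
        TopicPartition tp = new TopicPartition(topic, partition);
        consumer.assign(Collections.singletonList(tp)); // re-assign the same consumer
        consumer.seekToBeginning(Collections.singletonList(tp));
        ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
        // ... feed 'records' into the dictionary trainer's sample collector ...
      }
    }
  }
}
```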
* [controller] Cleanup Helix disabled partition map During an ingestion error in the leader, we add a disabled partition map config in helix. But even if those resources are dropped later, the disabled map is not cleared, which might later cause the znode to blow up. This PR cleans up the disabled partition map (by calling enable, which will clear the node) during the dropping of resources. Co-authored-by: Sourav Maji <tester@linkedin.com>
…kedin#558) Prior to this commit, the ComputeRequestWrapper class was used across client, router and server, but these components all had different needs and the end result was a bit awkward, as it led to the class needing to be constructed in a way that was not fully initialized, only to have the rest of the init happen later by mutating its internal state. This seemed error-prone. Furthermore, the class attempted to serve as a way to evolve the read compute wire protocol, and although we did evolve read compute, the way we evolved it was already compatible anyway, and if we had evolved it in less compatible ways, it's not clear that the original structure of this class would have been good enough to support such evolution. In this commit, we are cleaning all of that up and simplifying many aspects of the code in the process. Notably: - ComputeRequestWrapper is now only used in the client, and it is fully initialized at construction-time, with all internal state final. - The read compute protocol version passed in a header is still checked when parsing an incoming request, but then it's not carried into deeper code paths, where it was essentially ignored and useless. - Created a SchemaAndToString container class to replace some usages of Pair<Schema, String>. - Added a ComputeRequest specific record, which is the same as ComputeRequestV[1-4] except that the items inside the list of operations are strongly-typed, rather than objects. We do still need to keep (at least one of) the other schemas around since those are the ones the clients use on the wire, but we can cheaply convert between those and the new one, and so the server code need not do as much casting as before. - In the router, the VeniceComputePath does not hang on to a ComputeRequestWrapper anymore, which it didn't need to. It merely skips over the bytes represented by the compute request, with as little allocation as possible. It also hangs on to the raw String headers it needs rather than convert them to int and then back to strings moments later. Made all state final.
Introduced pub-sub client exception classes for uniform exception handling within PubSubAdminAdapter implementations. By adopting these exceptions, TopicManager can now base its retry logic on pub-sub client exceptions rather than system-specific exceptions. This will improve error handling and reliability. The commit also includes the following improvements: - Comprehensive Javadoc documentation outlining the behavior of APIs in PubSubAdminAdapter and PubSubProducerAdapter - Addition of extensive unit tests for the ApacheKafkaAdminAdapter implementation - Structural changes and code cleanups
… clients (linkedin#570) Currently, if new clients would like to add support for different POJO formats in their interfaces, they need to override the "initSerializer" function to configure the serializers to serialize the POJO to Avro. The base function also configures the serializers for "compute" and "batchget" requests. So, new clients would have to specify serializers for both these explicitly. This commit reduces the requirements for new clients to only specify the key serializers and they get "batchget" support by default.
…allback when store is not A/A or partial update enabled (linkedin#571) Resolves excessive logging found in the internal certification flow. This newly added log line was being logged for stores that do not have A/A or partial update enabled. This PR fixes the issue by only logging it when the TR cache is used.
…t construction (linkedin#575) Currently, we extract Kafka client properties from VeniceProperties based on prefixes. However, this approach sometimes leads to passing irrelevant properties to clients, as these properties may contain settings intended for other client types. When irrelevant properties are passed to Kafka clients, they log a message which ends up polluting the logs and can cause confusion. Fixed this issue by adding an additional property check to ensure validity for the specific client being constructed.
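A sketch of the kind of validity filter described above, using Kafka's own config name sets; this is an illustration, not the Venice implementation:

```java
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.producer.ProducerConfig;

/**
 * After stripping the Venice-specific prefix, keep only keys the target Kafka client
 * actually recognizes, so the client does not warn about unknown configs.
 */
public final class KafkaPropertyFilterSketch {
  private KafkaPropertyFilterSketch() {
  }

  public static Properties extractValidProducerProperties(Properties prefixStripped) {
    Properties result = new Properties();
    for (String name: prefixStripped.stringPropertyNames()) {
      if (ProducerConfig.configNames().contains(name)) {
        result.setProperty(name, prefixStripped.getProperty(name));
      }
    }
    return result;
  }

  public static Properties extractValidConsumerProperties(Properties prefixStripped) {
    Properties result = new Properties();
    for (String name: prefixStripped.stringPropertyNames()) {
      if (ConsumerConfig.configNames().contains(name)) {
        result.setProperty(name, prefixStripped.getProperty(name));
      }
    }
    return result;
  }
}
```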
…inkedin#573) To start a router on a specific port, passing `LISTENER_PORT` configuration via `extraProperties` could accidentally cause it to be utilized by the VeniceServer which could lead to port conflicts. To resolve this, introduced a separate config for starting a router on a designated port. This fix ensures that the port assignment remains distinct and does not interfere with other components. Additional improvements: - Removed unused classes related to Zk client. - Removed a log statement that was polluting the logs.
For stores without a pushstatus system store, the controller logs get flooded with full exception stacktraces. Also, in the server, the RMD SST file writer is empty most of the time, so do not print any warn log for it. Co-authored-by: Sourav Maji <tester@linkedin.com>
…iter configs (linkedin#576) This commit allows the users of online producer to specify arbitrary configs for the underlying VeniceWriter
* [controller][test] Fix issues in data recovery API This rb targets several issues in the data recovery API: 1. Today, during data recovery, only when the version number from the source equals the current version in the dest colo do we bump the store's version to LargestUsedVersionNumber+1 in the parent and use the increased version for data recovery. So, if the version number from the source is less than the current version in the dest, we simply use it in the dest. As a result, data recovery cannot complete successfully, because the backup version cleanup thread can kick in, consider it an expired backup version, and delete it. As a fix, we should extend the current condition and increase the store version for such cases as well. On the other hand, if the version number from the source colo is larger than the current version in the dest colo, we can use it for the dest colo without increasing it. This is fine because it is larger than the current version, and resources for this version will be deleted during prepareDataRecovery. So, even though the version number was used (created and then deleted) in the dest colo before, it can still be reused. 2. Users can sometimes put the same colo name for src and dest, which is considered an erroneous operation. This rb adds a guardrail to protect against such cases and fail early. Integration tests are also improved to cover the above two issues and verify the correctness of this fix.
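For illustration, the version selection rule from point 1 can be summarized in a small sketch with assumed method and variable names:

```java
/** Illustrative only: not the controller's actual API. */
public final class DataRecoveryVersionSketch {
  private DataRecoveryVersionSketch() {
  }

  /**
   * @param sourceVersion      version number being recovered from the source colo
   * @param destCurrentVersion current version in the destination colo
   * @param largestUsedVersion largest version number ever used for this store
   * @return the version number to use for data recovery in the destination colo
   */
  public static int chooseRecoveryVersion(int sourceVersion, int destCurrentVersion, int largestUsedVersion) {
    if (sourceVersion <= destCurrentVersion) {
      // Reusing a version <= the current one risks being treated as an expired
      // backup and deleted, so bump to a brand new version number instead.
      return largestUsedVersion + 1;
    }
    // A source version larger than the destination's current version can be reused:
    // its leftover resources are deleted during prepareDataRecovery anyway.
    return sourceVersion;
  }
}
```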
…adata update bug and add schema evolution support (linkedin#560) * [venice-client][venice-admin-tool] Fixed a bug where we only update FC metadata on a new store version 1. The problematic behavior was that a metadata update would only be applied in FC if the fetched metadata belongs to a new store version. This behavior is unexpected because the routing info can change for the same store version (e.g. replicas rebalanced). The same behavior was found for HAR info, and it is also fixed to be updated every time. We use an atomic reference to an object that bundles the map and list to avoid inconsistent access when updating HAR info. 2. Added a new Admin Tool command to get the request-based metadata payload from a given server. 3. Added full schema evolution support for MetadataResponseRecord. Backwards compatibility is ensured by adding a new MetadataResponseRecordCompatibilityTest. Forward compatibility is supported using RouterBackedSchemaReader and the schema system store. This means we need to have a few routers around in the cluster to perform new schema discovery when we evolve the MetadataResponseRecord. 4. Fixed a bug in the server where we used the version number instead of the entire resource name (store + version number) to get the routing table for request-based metadata.
…din#582) This change reverts a bunch of changes that were made in PR linkedin#472. There was a deadlock in production when users used the getKeySchema method on the client while the initial client warm-up had failed and an async warm-up was executing.
Many classes were located in venice-client-common, yet were only used within DVC. This commit moves these classes to DVC. Ideally, they should eventually end up in server, but since DVC itself is tangled with server code, that is difficult to refactor in one shot. At least for now venice-client-common will be less bloated, and we'll reduce the likelihood of future tangling to continue. Miscellaneous: - Renamed several ID to Id in variables and method names, at the request of Spotbugs. - Removed the --stacktrace argument from the CI, as it makes it harder to scroll to the root cause of a build failure, and typically does not contain any useful information (but rather, only Gradle internal details). - Also enabled MergeGenericRecordTest::testUpdate which was previously disabled. Had to fix it up a bit so it could run. - Introduced a deserializer cache in VeniceRmdTTLFilter to get rid of some code which was instantiating MapOrderPreservingDeserializer instances on the hot path. Map order preservation is only necessary within the server when we will deserialize and then reserialize the payloads, but that was not the case in the TTL filter. Module-level test branch coverage changes: - venice-client-common: 37.76% to 44.33% - da-vinci-client: 48.69% to 51.7%
…er test (linkedin#581) * [da-vinci] Added one more config to simplify the DaVinci memory limiter test Some DaVinci customers want to validate the DaVinci memory limiter feature in a canary instance without affecting prod, and currently that is not possible since the DaVinci memory limiter applies globally. To simplify the test, this code change introduces a new config: ingestion.memory.limit.store.list (default: empty). When it is specified, only the configured stores will enforce the memory limiter; if it is not specified (the default), all the stores in the same DaVinci instance will enforce the memory limiter. This way, the customer can specify a test store to exercise the DaVinci memory limiter feature, and when the bootstrapping of the test store fails because of memory exhaustion, only the push job for the test store will fail. * fixed test failures * Fixed test failures * addressed comment
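A hypothetical illustration of how a canary DaVinci instance might scope the memory limiter to a single test store; only the ingestion.memory.limit.store.list key comes from this commit, the other property name and values are assumptions:

```java
import java.util.Properties;

public class MemoryLimiterCanaryConfigSketch {
  public static Properties buildCanaryConfig() {
    Properties props = new Properties();
    // Enforce the memory limiter only for the named test store; all other stores
    // on this DaVinci instance keep ingesting without the limit.
    props.setProperty("ingestion.memory.limit.store.list", "my_test_store"); // store name is made up
    props.setProperty("ingestion.memory.limit", String.valueOf(4L * 1024 * 1024 * 1024)); // assumed key
    return props;
  }
}
```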
During a target colo push for stores with a DaVinci client, or during data recovery, Venice uses the data recovery client endpoints to enable pushing to the rest of the colos, but a DaVinci client running in follower mode does not have the configs needed for data recovery, which leads to DaVinci host ingestion failure. This PR skips the data recovery code path in DaVinci clients. --------- Co-authored-by: Sourav Maji <tester@linkedin.com>
…inkedin#532) * FC batch get uses the instanceHealthMonitor to check for healthy instances before sending out the POST request, but doesn't update the health monitor when an instance ends up being healthy/unhealthy: this change reuses the methods which single get uses, and they are counted per route request, not per key. * FC (both single/batch gets) doesn't increment metrics when the instance is not reachable: currently in such scenarios, a future is completed based on routingLeakedRequestCleanupThresholdMS to update the health data, but it doesn't complete the value future and the stats future. Made changes such that we end up completing these futures, and added tests. * When both the original and retry requests fail and the stats future is completed, only the unhealthy count and latency metric are incremented, providing no insight into the reasons (e.g. noAvailableReplica) or into whether retries happened for the request. Modified that flow to increment a few more metrics for such failure cases. * cleaned up metric verification for some tests
* [vpj] Send pushjob details during post-target pushjob The push job details are generated and sent through the controller for various offline tracking of push job health. During the post-canary colo job, it failed to send out push job details to the controller, so the push job detail store would show that the other colos did not finish the push job. This PR fixes that issue. --------- Co-authored-by: Sourav Maji <tester@linkedin.com>
Replaced several lookups by field name with lookups by field position in write-compute code paths. There are probably more such opportunities left; this is just one pass over the cases where the safety of the change is more obvious. Also moved CollectionTimestampBuilder to tests, since it was meant from the start to be a test-only class.
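An illustration of the name-versus-position lookup trade-off using plain Avro GenericRecord APIs (not the Venice write-compute code):

```java
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;

public final class FieldPositionLookupSketch {
  private FieldPositionLookupSketch() {
  }

  /** Lookup by name: hashes and compares the field name on every call. */
  public static Object getByName(GenericRecord record, String fieldName) {
    return record.get(fieldName);
  }

  /**
   * Lookup by position: resolve the field position once (e.g. outside a per-record
   * loop) and then index directly into the record.
   */
  public static Object getByPosition(GenericRecord record, int fieldPosition) {
    return record.get(fieldPosition);
  }

  /** Resolving the position up front is safe as long as the records share the same schema. */
  public static int resolvePosition(Schema schema, String fieldName) {
    Schema.Field field = schema.getField(fieldName);
    if (field == null) {
      throw new IllegalArgumentException("Field not found: " + fieldName);
    }
    return field.pos();
  }
}
```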
…er tool (linkedin#590) The CLI tools to read ("QueryTool") and write ("ProducerTool") data to Venice do not handle complex types currently. This commit adds the support to read and write data where the store has complex data types in the schema.
… transition (linkedin#578) Previously, in O/B/O mode, there was an explicit delay in the OFFLINE to DROPPED transition; we missed this delay when building L/F mode. This PR adds the delay back, and the delay only applies to current-version resources. It also adds a stat to track the number of threads blocked on the OFFLINE to DROPPED transition.
We see some hosts contained in the /CONFIG/PARTICIPANT path but not in /INSTANCE of a cluster. If we query the list of instances using ZKHelixAdmin::getInstancesInCluster and call getDisabledPartitionsMap for such a host, it throws a HelixException. This PR skips such exceptions so that cleaning of the helix resources can continue. It also bumps the Helix dependency to 1.1.0, which resolves the exception log in the zk node deletion code path. --------- Co-authored-by: Sourav Maji <tester@linkedin.com>
Previously, the admin tool constructed ApacheKafkaConsumerAdapter without relying on PubSubConsumerFactory#create(), so the bootstrap servers url property could not be correctly translated into a kafka property. In this change, the factory is used to create this adapter class to leverage the config translation functionality and to unify the admin tool's factory usage and naming. Co-authored-by: Hao Xu <xhao@xhao-mn2.linkedin.biz>
…ock dependency incompatible issues. (linkedin#595) A previous refactoring removed this constructor, which caused a downstream dependency issue. To unblock this, we add the constructor back and mark it deprecated, as we want downstream users to use the other constructor to pass a specific PubSubClientFactory. Co-authored-by: Hao Xu <xhao@xhao-mn2.linkedin.biz>
…inkedin#565) - Adapts Netty Channel pipeline and corresponding handler classes to be compatible with gRPC requests - Can collect server side statistics and enforce read quota for gRPC requests - Introduce mutual TLS support on gRPC enabled servers and fast client, when an SSLFactory is present - Update unit tests and integration tests
* [server][admin-tool] Improve Server Ingestion State Dump This code change adds a new command to dump ingestion state, also Venice Server will skip the ACL check for admin request endpoint since there is no PII data. In the meantime, this code change will skip compression dict in the response since it is not readable and string-format the `Fixed` type. * Addressed unit test coverage issue * Fixed a comment * Fixed unit test coverage failure
…icDumper (linkedin#599) Recently, we've added support to emit DELETE messages to clean up leaked chunks (in PR linkedin#535) but we didn't handle this in the topic dumper tool