Huizhi Lu edited this page Aug 20, 2020 · 2 revisions

Introduction

Apache Zookeeper is an open-source, highly reliable distributed coordination service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.

Helix uses Zookeeper for cluster metadata storage, live instance detection, as a communication channel, for distributed locking, and more. Helix applications rely heavily on Zookeeper to function well. As data systems keep growing in scale, Zookeeper runs into scalability challenges. For example, increased ZK write QPS causes high mastership handoff latency and ZK session expiry, and Helix applications suffer availability issues because of the high propagation latency caused by ZK write QPS.

To address these challenges, Helix designs and implements the idea of distributing an application's metadata, previously stored in a single Zookeeper realm, over multiple Zookeeper realms within one Zookeeper namespace. The load of Zookeeper requests can then be spread from one Zookeeper realm across multiple realms, which reduces stress on each realm and helps Helix applications scale.

Helix uses ZkClient to talk to Zookeeper, and one ZkClient instance can connect to only one Zookeeper realm. With the multi-realm Zookeeper implementation, and because some Helix Java APIs perform only CRUD operations, we would like a ZkClient that can access multiple Zookeeper realms from a single ZkClient instance.

This wiki outlines the design and implementation of such a ZkClient: FederatedZkClient. FederatedZkClient is a new version of ZkClient that maintains multiple Zookeeper connections to different Zookeeper realms and routes read/write Zookeeper requests to the appropriate Zookeeper realm based on the sharding key of the Zookeeper path.

Glossary

  • Zookeeper namespace: a managed Zookeeper service for an application or a tenant group. A Zookeeper namespace has one or more Zookeeper realms.
  • Zookeeper realm: a Zookeeper ensemble consisting of multiple Zookeeper instances that together provide the coordination service.
  • ZK path sharding key: a string key used by the metadata store directory service to look up which ZK realm a given ZK path corresponds to. We could put different Helix clusters into different ZK realms, with each cluster name serving as its ZK sharding key. For example, given the ZK path “/clusterName/instances/currentstates”, the sharding key is “/clusterName”.
  • Metadata store directory service: a Helix REST service used to look up mappings of ZK path -> ZK realm. It populates routing information from ZK and provides REST endpoints that applications can query to translate a ZK path to its corresponding ZK realm.
  • Helix data access Java API: public interfaces for Helix applications to access ZK, for example HelixDataAccessor, HelixPropertyAccessor, HelixAdmin, etc.
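
To make the sharding key definition concrete, here is a minimal sketch that derives the key from a ZK path. It assumes, as in the example above, that the sharding key is always the first path segment (the cluster name). `ShardingKeys` and `fromPath` are hypothetical names used for illustration and are not part of the Helix API.

```java
/** Hypothetical helper for illustration; not part of the Helix API. */
final class ShardingKeys {
    private ShardingKeys() {}

    /**
     * Returns the first path segment including its leading slash.
     * Assumption: the sharding key is exactly the first segment (the cluster name).
     */
    static String fromPath(String zkPath) {
        if (zkPath == null || zkPath.length() < 2 || zkPath.charAt(0) != '/') {
            throw new IllegalArgumentException("Not an absolute ZK path: " + zkPath);
        }
        int secondSlash = zkPath.indexOf('/', 1);
        if (secondSlash == 1) {
            // A path such as "//foo" has an empty first segment.
            throw new IllegalArgumentException("Empty sharding key in path: " + zkPath);
        }
        // No second slash means the whole path is the sharding key.
        return secondSlash < 0 ? zkPath : zkPath.substring(0, secondSlash);
    }
}
```

With this sketch, `ShardingKeys.fromPath("/clusterName/instances/currentstates")` yields `/clusterName`, matching the glossary example.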

Design

Helix applications create Helix Java API instances to access Zookeeper. Inside a Helix Java API instance in multi-realm mode, a FederatedZkClient is created to talk to Zookeeper. The diagram below illustrates the flow.

image

Let's zoom in on FederatedZkClient and see how it is implemented. Inside one FederatedZkClient instance, there are multiple ZK connections that talk to different ZK realms. Because implementing a ZkClient from scratch would be complex, FederatedZkClient is built by wrapping existing raw ZkClient instances. Each raw ZkClient connects to one ZK realm. FederatedZkClient reads routing data from the metadata store directory service or from Zookeeper and keeps it as a cache inside the FederatedZkClient object.

image

FederatedZkClient aims to support CRUD ZK operations and data change callback notifications over multiple Zookeeper realms. It does not support session management operations such as creating ephemeral nodes, getting the session ID, or handling session state change notifications.
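
The wrapping and routing idea can be sketched as follows. This is an illustrative model only, not the actual FederatedZkClient code: `RealmClient` is a hypothetical in-memory stand-in for a raw ZkClient, the realm addresses are made up, and the sharding key is assumed to be the first path segment.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a raw ZkClient bound to one realm; keeps data in memory. */
final class RealmClient {
    final String realmAddress;
    private final Map<String, String> nodes = new HashMap<>();

    RealmClient(String realmAddress) {
        this.realmAddress = realmAddress;
    }

    void write(String path, String data) { nodes.put(path, data); }

    String read(String path) { return nodes.get(path); }
}

/** Sketch of a federated client: one wrapped client per realm, selected by sharding key. */
final class FederatedClientSketch {
    private final Map<String, String> routing;                         // sharding key -> realm address
    private final Map<String, RealmClient> clients = new HashMap<>();  // realm address -> client

    FederatedClientSketch(Map<String, String> routingData) {
        // Cached copy of the routing data; it stays fixed after construction.
        this.routing = new HashMap<>(routingData);
    }

    void write(String path, String data) { clientFor(path).write(path, data); }

    String read(String path) { return clientFor(path).read(path); }

    /** Session-bound operations are rejected, mirroring the limitation described above. */
    void createEphemeral(String path, String data) {
        throw new UnsupportedOperationException("Session operations are not supported");
    }

    /** Resolves the realm for a path. Assumption: the sharding key is the first segment. */
    String realmOf(String path) {
        int secondSlash = path.indexOf('/', 1);
        String key = secondSlash < 0 ? path : path.substring(0, secondSlash);
        String realm = routing.get(key);
        if (realm == null) {
            throw new IllegalArgumentException("No realm for sharding key " + key);
        }
        return realm;
    }

    private RealmClient clientFor(String path) {
        // Create the per-realm client on first use and reuse it afterwards.
        return clients.computeIfAbsent(realmOf(path), RealmClient::new);
    }
}
```

Creating the per-realm client lazily is a design choice of this sketch; an implementation could equally open all connections up front.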

Which ZkClient to use

For various reasons, Helix offers several ZkClient options: raw ZkClient, DedicatedZkClient, SharedZkClient, and FederatedZkClient. Which one should you use for your purpose? The table below compares these ZkClients by usage pattern.

| Type of Operation | Raw ZkClient | DedicatedZkClient | SharedZkClient | FederatedZkClient |
| --- | --- | --- | --- | --- |
| CRUD persistent ZNode | Yes | Yes | Yes | Yes |
| Data/children change callback | Yes | Yes | Yes | Yes |
| Session operations: ephemeral node creation, getting session id, session state change notification | Yes | Yes | No | No |
| Access multiple ZK realms with one instance | No | No | No | Yes |
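
The comparison can also be read as a lookup: given what an application needs, which client types qualify? The sketch below encodes only the last two rows of the table, since all four types support CRUD and change callbacks. `ZkClientType` and `candidates` are hypothetical names for illustration, not Helix classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical encoding of the capability table; not a Helix type. */
enum ZkClientType {
    RAW(true, false),
    DEDICATED(true, false),
    SHARED(false, false),
    FEDERATED(false, true);

    final boolean supportsSessionOps;
    final boolean supportsMultiRealm;

    ZkClientType(boolean supportsSessionOps, boolean supportsMultiRealm) {
        this.supportsSessionOps = supportsSessionOps;
        this.supportsMultiRealm = supportsMultiRealm;
    }

    /** Returns every client type that satisfies the stated needs, in declaration order. */
    static List<ZkClientType> candidates(boolean needSessionOps, boolean needMultiRealm) {
        List<ZkClientType> result = new ArrayList<>();
        for (ZkClientType type : values()) {
            boolean sessionOk = !needSessionOps || type.supportsSessionOps;
            boolean realmOk = !needMultiRealm || type.supportsMultiRealm;
            if (sessionOk && realmOk) {
                result.add(type);
            }
        }
        return result;
    }
}
```

Note that asking for both session operations and multi-realm access returns an empty list: per the table, no single client type offers both.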

Integrated in Helix Java API

Helix Java APIs are the public interfaces and classes that Helix provides so that Helix applications can access their application clusters. Helix Java APIs use ZkClient underneath to talk to Zookeeper. An instance of an existing API can talk to only one ZK realm. With the multi-ZK implementation, existing Helix Java APIs will be enhanced with the new ZkClient API so that Helix applications can run in a multi-ZK environment.

FederatedZkClient is integrated into those Helix Java APIs that need to access multiple ZK realms from one instance, for example HelixAdmin, HelixDataAccessor, BaseDataAccessor, ZKUtil, etc. When creating an API instance, we can set multi-realm mode to enable the instance to access multiple ZK realms. With multi-realm mode on, the API instance can perform only CRUD operations and callbacks; it cannot create ephemeral nodes.
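
The effect of the mode switch can be modeled with a small self-contained sketch. Everything here is hypothetical and for illustration only: `RealmModeSketch`, `DataAccessorSketch`, and its builder are not the Helix classes, and the sketch tracks node paths in memory instead of talking to Zookeeper. It assumes single-realm mode is the default.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical mode flag for illustration; not the Helix enum. */
enum RealmModeSketch { SINGLE_REALM, MULTI_REALM }

/** Hypothetical API instance whose allowed operations depend on the realm mode. */
final class DataAccessorSketch {
    private final RealmModeSketch mode;
    private final Set<String> persistentNodes = new HashSet<>();
    private final Set<String> ephemeralNodes = new HashSet<>();

    private DataAccessorSketch(RealmModeSketch mode) {
        this.mode = mode;
    }

    static Builder builder() {
        return new Builder();
    }

    /** CRUD-style operation: allowed in both modes. Returns false if the node exists. */
    boolean createPersistent(String path) {
        return persistentNodes.add(path);
    }

    /** Session-bound operation: rejected when multi-realm mode is on. */
    boolean createEphemeral(String path) {
        if (mode == RealmModeSketch.MULTI_REALM) {
            throw new UnsupportedOperationException(
                "Ephemeral node creation is not available in multi-realm mode");
        }
        return ephemeralNodes.add(path);
    }

    static final class Builder {
        // Assumption of this sketch: single-realm mode unless stated otherwise.
        private RealmModeSketch mode = RealmModeSketch.SINGLE_REALM;

        Builder realmMode(RealmModeSketch mode) {
            this.mode = mode;
            return this;
        }

        DataAccessorSketch build() {
            return new DataAccessorSketch(mode);
        }
    }
}
```

For example, `DataAccessorSketch.builder().realmMode(RealmModeSketch.MULTI_REALM).build()` returns an instance that accepts `createPersistent` but throws on `createEphemeral`.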

Future Work

  • Support routing different paths under one sharding key to multiple ZK realms. Currently, one ZK path sharding key maps to exactly one ZK realm. In use cases where a cluster (sharding key) stores a large amount of metadata in ZK, that metadata may need to be split across multiple ZK realms instead of only one. FederatedZkClient would then need to route the different paths of a sharding key to the appropriate ZK realm.
  • Dynamically fetch routing data on a cache miss or a routing data change. Currently, routing data in FederatedZkClient is static: once the FederatedZkClient object is constructed, its routing data cannot change. We need a more flexible strategy so that FederatedZkClient handles routing data changes more reliably.
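
One possible shape for the dynamic routing item is a cache that falls back to a lookup on a miss and can be invalidated when routing data changes. This is a sketch of the idea under stated assumptions, not a planned design: `RefreshableRoutingCache` is a hypothetical name, and the `directoryLookup` function stands in for a call to the metadata store directory service.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Hypothetical routing cache that fetches on a miss; not part of Helix. */
final class RefreshableRoutingCache {
    private final Map<String, String> cache = new ConcurrentHashMap<>();
    // Assumption: maps a sharding key to a realm address, or returns null if unknown.
    private final Function<String, String> directoryLookup;

    RefreshableRoutingCache(Function<String, String> directoryLookup) {
        this.directoryLookup = directoryLookup;
    }

    /** Returns the cached realm, querying the directory only when the key is not cached. */
    String realmFor(String shardingKey) {
        String realm = cache.get(shardingKey);
        if (realm == null) {
            realm = directoryLookup.apply(shardingKey);
            if (realm == null) {
                throw new IllegalStateException("No realm found for " + shardingKey);
            }
            cache.put(shardingKey, realm);
        }
        return realm;
    }

    /** Drops a cached entry so that the next lookup fetches fresh routing data. */
    void invalidate(String shardingKey) {
        cache.remove(shardingKey);
    }
}
```

Two threads that miss on the same key at the same time may both query the directory in this sketch; that is harmless here but worth tightening if lookups are expensive.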