DDFS Evolution
Here are the main features that are still needed for DDFS:
-
The unit of DDFS monitoring should be extended to the disk/volume level. Currently, monitoring is at the host level, which means that if one of the disks of a host goes down, the whole DDFS node goes down. This is done for safety (see #250), but it is inconvenient when nodes can have many disk volumes. When making this change, the replication strategy should stay the same: replicas should still be put on different nodes, which allows us to function even when up to (K-1) nodes each have possibly multiple malfunctioning disks. However, GC/RR needs to be performed at the disk/volume level, so that all functioning disks can participate.
-
DDFS needs a good rebalancing mechanism. GC/RR can be extended to perform rebalancing using logic similar to that used for node removal. Node removal required the discounting of existing replicas (on ddfs-blacklisted nodes), so that new replicas could be created (on non-ddfs-blacklisted nodes). Similarly, existing replicas on overloaded/full disks could be discounted so that new replicas are created on lightly-loaded/empty disks. The main difference is that rebalancing would need to actually delete the replicas on overloaded disks after replication is completed; node removal does not currently delete replicas on ddfs-blacklisted nodes.
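The discount-then-delete idea for rebalancing can be illustrated with a minimal Python sketch. It is not taken from the DDFS code base: the `Replica` type, the `plan_rebalance` function, the `overloaded_disks` set, and the replication factor `k` are all hypothetical names chosen for illustration, and the real GC/RR logic lives in the Erlang master.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    node: str  # host that stores the replica (hypothetical field)
    disk: str  # disk/volume on that host (hypothetical field)


def plan_rebalance(replicas, overloaded_disks, k):
    """Return (number_to_create, replicas_to_delete) for one blob.

    replicas: list of Replica currently holding the blob.
    overloaded_disks: set of (node, disk) pairs considered too full.
    k: desired replication factor (assumed name).
    """
    # Discount replicas on overloaded disks, analogous to how node removal
    # discounts replicas on ddfs-blacklisted nodes.
    counted = [r for r in replicas if (r.node, r.disk) not in overloaded_disks]
    discounted = [r for r in replicas if (r.node, r.disk) in overloaded_disks]

    # New replicas are needed until k counted replicas exist.
    to_create = max(0, k - len(counted))

    # Unlike node removal, discounted replicas are deleted, but only once
    # replication has completed (nothing is left to create).
    to_delete = discounted if to_create == 0 else []
    return to_create, to_delete
```

In this sketch a blob is first re-replicated onto other disks, and the copy on the overloaded disk becomes eligible for deletion only on a later pass, once `k` counted replicas exist. Choosing which lightly-loaded disks receive the new replicas is left out.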
We also need to find a fix for the following correctness issue: in rare cases, it is possible for the master to return a creation-failure message for a tag-creation operation, even though a new tag was actually created by one of the cluster nodes. This is rare, since the master reports such a failure when tag creation fails on all cluster nodes, either through explicit failure acks or through transient failures (among which is the crucial case where a node or network link goes down after creating the tag but before the final ack reaches the master).
To allow scaling of DDFS to very large clusters (several hundred nodes or more), we would need to introduce and use rack-level locality (to reduce switch traffic), and techniques like consistent hashing of blobs/tags to nodes (to avoid latencies in DDFS operations that require interrogation of all DDFS nodes). This might also involve the use of explicit communication between master and nodes instead of using Distributed Erlang.
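As a rough illustration of consistent hashing of blob/tag names to nodes, here is a minimal Python sketch of a hash ring with virtual nodes. It is an assumption-laden example, not a proposal for the actual DDFS design: the `HashRing` class, its `nodes_for` method, the `vnodes` parameter, and the use of MD5 for ring positions are all hypothetical choices.

```python
import bisect
import hashlib


def _hash(key):
    # Stable 64-bit ring position derived from MD5 (illustrative choice).
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=64):
        # Each node gets several virtual points to even out the distribution.
        self._points = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._points]

    def nodes_for(self, name, k):
        """Return up to k distinct nodes responsible for a blob/tag name."""
        if not self._points:
            return []
        # Walk clockwise from the name's position, collecting distinct nodes.
        start = bisect.bisect(self._keys, _hash(name))
        chosen = []
        for offset in range(len(self._points)):
            node = self._points[(start + offset) % len(self._points)][1]
            if node not in chosen:
                chosen.append(node)
                if len(chosen) == k:
                    break
        return chosen
```

With such a ring, a lookup for a tag or blob would contact only the `k` nodes returned by `nodes_for` rather than interrogating every node, and removing a node that is not among those `k` leaves the placement for that name unchanged. Rack-level locality is not modeled here.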