Evaluation
Disclaimer: This page contains our thoughts about the evaluation part of the pluto project. We are excited to discuss promising directions for pluto, so feel free to leave a comment or open an issue to get in touch. Also note that this reflects initial ideas and challenges; elements might change or be dropped completely as the semester progresses. It is highly likely that we will not have the time to evaluate all of the following aspects.
- The evaluation is supposed to convey the advantages of weakly-consistent replication models for responsive and resilient geo-replicated services of planetary scale.
- It shall supplement the (anticipated, see Verification) correctness verification in terms of real-world measures.
- Therefore, huge IMAP workloads and long-running test scenarios are needed to come closer to realistic use cases.
We need to evaluate (and possibly benchmark) pluto against a production-grade IMAP service. For our first evaluation in terms of response times we chose Dovecot. It is the most widely used IMAP server in the world, and thus I would recommend sticking with Dovecot as the real-world standard. As far as I know, Dovecot has no native feature comparable to pluto's weakly-consistent replication among more than two nodes. Therefore, we have to set up Dovecot in a way that allows for global synchronization. My initial proposal is to set up the following three systems:
- pluto, in its highest optimized version
- Dovecot A, using dsync between server pairs, configured to pair-wise sync all master/master replicas
- Dovecot B, using some mechanism or underlying system to achieve replication (GlusterFS? Dovecot Object Store? Replicated database?)
All of the setups will be deployed as Kubernetes applications, and Google Compute Engine will be our public cloud of choice. Each setup is supposed to span the "whole world", so three or more data centers shall be used for replication (e.g. Belgium, Taiwan, Oregon).
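As a very first sanity check of such a geo-distributed deployment, a tiny probe could record baseline connect latencies from the test driver to each regional endpoint before any IMAP measurements are interpreted. The following is only a sketch; the hostnames are placeholders for whatever addresses the deployed setups end up with.

```go
// rttprobe.go: a minimal sketch (not part of the planned tooling) that records
// TCP connect latency from the test driver to each regional endpoint, to
// establish a network baseline before interpreting any evaluation numbers.
// The hostnames below are placeholders and must be replaced with the real
// service addresses of the deployed setups.
package main

import (
	"fmt"
	"net"
	"time"
)

func main() {
	endpoints := map[string]string{
		"europe-west1 (Belgium)": "imap.europe.example.org:993", // placeholder
		"asia-east1 (Taiwan)":    "imap.asia.example.org:993",   // placeholder
		"us-west1 (Oregon)":      "imap.us.example.org:993",     // placeholder
	}

	for region, addr := range endpoints {
		start := time.Now()
		conn, err := net.DialTimeout("tcp", addr, 5*time.Second)
		if err != nil {
			fmt.Printf("%-24s unreachable: %v\n", region, err)
			continue
		}
		conn.Close()
		fmt.Printf("%-24s connect RTT: %v\n", region, time.Since(start))
	}
}
```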
- Reliability & Resilience: The main reason we chose CRDTs for data replication in the first place was that they allow consensus-free progress of state while ensuring global state convergence as soon as every message has been applied at every node. Up until now, we have not evaluated this main feature of pluto. That is supposed to change with an evaluation in the direction of reliability and resilience. We therefore want to obtain hard-number insights into pluto's fault tolerance (its resilience in the face of CAP trade-offs). The global distribution of each setup's components should serve as a starting point. Eventually, though, we want to use a fault injection framework for Kubernetes deployments to create partitions (crash nodes and take down network links) and to introduce varying degrees of latency on randomly chosen links. All setups will be monitored as to how they cope with these challenges.
Multiple metrics might be of interest, possibly at quite low levels:
- [SHOULD NOT HAPPEN] Data or operation loss - IMAP commands and synchronization instructions do not reach all intended nodes, messages and content do not get persisted to storage, etc.
- [SHOULD NOT HAPPEN] Manual intervention required - synchronization is supposed to be conflict-free and thus should happen automatically. If a node crashes due to replication efforts or human intervention is needed for other reasons, this will affect the setup's rating negatively.
- [SHOULD BE LOW] Time bounds on synchronization - across a large number of synchronization procedures, start and end times are recorded. Can we draw conclusions from these time windows, e.g. infer average and worst-case bounds?
- [SHOULD BE LOW] Downtime or unavailability during a failure. How fast can we (re)start a process or virtual machine, create new links, etc.?
- Performance: In the first evaluation, we concentrated on response time performance. An update on these numbers is probably due, because code and deployment will have changed significantly. Single-user tests might be performed as part of an IMAP benchmark; otherwise I would recommend focusing on concurrency (multi-user) performance tests (a load-generator sketch follows this list).
- Costs: Can we estimate the deployment costs for all evaluated setups? How does more powerful or redundant hardware influence the performance and reliability?
- Scalability?: If we succeed in finding and implementing a model to scale stateful applications at runtime, we might also show its (dis)advantages.
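As a starting point for the concurrency tests mentioned under Performance, the sketch below drives a configurable number of simulated users against one setup over raw IMAP/TLS and reports average and worst-case command latency, which also gives a first feeling for the time bounds asked for above. Server address, account names, and the command mix are placeholders, and certificate verification is disabled only because this targets our own test deployments.

```go
// loadgen.go: a minimal sketch of a concurrent multi-user IMAP load generator.
// It speaks raw IMAP over TLS (no client library) and records the latency of
// each tagged command so that average and worst-case response times can be
// compared across setups. Host, credentials, and command mix are placeholders.
package main

import (
	"bufio"
	"crypto/tls"
	"fmt"
	"strings"
	"sync"
	"time"
)

const (
	server   = "imap.example.org:993" // placeholder address of the setup under test
	users    = 50                     // concurrent simulated users
	requests = 100                    // commands per user
)

// runUser logs in one simulated user, issues a series of commands, and
// returns the observed latency of every tagged command.
func runUser(id int) ([]time.Duration, error) {
	conn, err := tls.Dial("tcp", server, &tls.Config{InsecureSkipVerify: true}) // test deployments only
	if err != nil {
		return nil, err
	}
	defer conn.Close()
	r := bufio.NewReader(conn)
	if _, err := r.ReadString('\n'); err != nil { // consume server greeting
		return nil, err
	}

	// exec sends one tagged command and waits for its tagged completion line.
	tagN := 0
	exec := func(cmd string) (time.Duration, error) {
		tagN++
		tag := fmt.Sprintf("a%03d", tagN)
		start := time.Now()
		if _, err := fmt.Fprintf(conn, "%s %s\r\n", tag, cmd); err != nil {
			return 0, err
		}
		for {
			line, err := r.ReadString('\n')
			if err != nil {
				return 0, err
			}
			if strings.HasPrefix(line, tag+" ") { // tagged completion reached
				return time.Since(start), nil
			}
		}
	}

	if _, err := exec(fmt.Sprintf("LOGIN user%d password%d", id, id)); err != nil { // placeholder accounts
		return nil, err
	}
	var lat []time.Duration
	for i := 0; i < requests; i++ {
		d, err := exec("SELECT INBOX") // placeholder command mix; extend with APPEND, STORE, ...
		if err != nil {
			return nil, err
		}
		lat = append(lat, d)
	}
	return lat, nil
}

func main() {
	var wg sync.WaitGroup
	results := make(chan []time.Duration, users)
	for u := 0; u < users; u++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if lat, err := runUser(id); err == nil {
				results <- lat
			} else {
				fmt.Println("user", id, "failed:", err)
			}
		}(u)
	}
	wg.Wait()
	close(results)

	var sum, worst time.Duration
	n := 0
	for lat := range results {
		for _, d := range lat {
			sum += d
			n++
			if d > worst {
				worst = d
			}
		}
	}
	if n > 0 {
		fmt.Printf("commands: %d  avg: %v  worst: %v\n", n, sum/time.Duration(n), worst)
	}
}
```

Since the workload is supposed to focus on write commands, the command mix should eventually be replaced by a replay of the realistic IMAP dataset mentioned below rather than repeated SELECTs.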
- Most important for a successful and meaningful evaluation is capturing the right metrics, at the right level, in a fair manner, and with as little influence on the measured systems as possible. This is hard and deserves more thought.
- Which are the most expressive and useful metrics to collect?
- How can we collect them fairly in software we don't know?
- How can we export these metrics fast and efficiently?
- What conclusions are we even able to draw from them?
- How can we represent these conclusions most usefully?
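One possible answer to the export question is to let the measurement harness expose everything through the Prometheus client library (github.com/prometheus/client_golang), which a Prometheus server scraping the Kubernetes deployments could then collect uniformly across all setups. The sketch below only illustrates the idea; metric names and labels are placeholders, not decisions.

```go
// metrics.go: a minimal sketch of exporting latency samples in a uniform way
// across all three setups via the Prometheus Go client library.
package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// syncDuration tracks how long a synchronization round (or an IMAP command)
// took, labelled by setup and operation so the setups stay comparable.
var syncDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "eval_operation_duration_seconds", // placeholder metric name
		Help:    "Duration of replicated operations per setup.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"setup", "operation"},
)

func init() {
	prometheus.MustRegister(syncDuration)
}

// observe is what the measurement harness would call around each operation.
func observe(setup, operation string, start time.Time) {
	syncDuration.WithLabelValues(setup, operation).Observe(time.Since(start).Seconds())
}

func main() {
	// Example: record a (dummy) observation; in the real harness this would
	// wrap every IMAP command or synchronization round.
	observe("pluto", "append", time.Now())

	// Expose the collected samples for scraping, e.g. by a Prometheus server
	// running alongside the Kubernetes deployments.
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":9100", nil)
}
```

A histogram per setup and operation would keep the three deployments directly comparable and would let us derive averages as well as tail behaviour from the same data; whether this is fair towards software we don't control (Dovecot) still needs thought.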
- Finding the most fitting and competitively optimized Dovecot setup
- A realistic IMAP workload dataset, possibly focused on write commands
- A fault injection framework usable in Kubernetes deployments (needed: node crash, link failure, latency injection); a minimal pod-kill sketch follows this list
- Pumba?
- Chaos Monkey?
- ChaosKube
- ...?
- A metrics capture and export tool / collector
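To make the fault injection requirement more concrete: even without one of the frameworks above, a chaoskube-style random pod killer is only a few lines of client-go. The sketch below merely illustrates the node-crash case and assumes it runs inside the cluster with sufficient RBAC permissions; the namespace and label selector are placeholders, and link failures or latency injection would need something like netem and are not covered here.

```go
// podkill.go: a minimal sketch of chaoskube-style fault injection. It
// periodically deletes a random pod that opted in via a label, assuming the
// program runs in-cluster with permission to list and delete pods.
// Namespace "evaluation" and label "chaos=allowed" are placeholders.
package main

import (
	"context"
	"fmt"
	"math/rand"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	for {
		// Only pods explicitly labelled for chaos are candidates.
		pods, err := clientset.CoreV1().Pods("evaluation").List(ctx,
			metav1.ListOptions{LabelSelector: "chaos=allowed"})
		if err != nil || len(pods.Items) == 0 {
			time.Sleep(time.Minute)
			continue
		}
		victim := pods.Items[rand.Intn(len(pods.Items))]
		err = clientset.CoreV1().Pods(victim.Namespace).Delete(ctx, victim.Name, metav1.DeleteOptions{})
		fmt.Println("killed pod", victim.Name, "error:", err)
		time.Sleep(10 * time.Minute) // placeholder injection interval
	}
}
```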