A paper list of federated learning, focused on system design. It currently covers distributed computing frameworks and communication & computation efficiency for federated learning (cross-silo & cross-device).
Chinese blog: Neth-Lab, which includes study notes, tutorials, and development documents.
Last update: Apr 12th, 2022.
For some papers, we provide accompanying Chinese blog posts.
- 2.1 Survey
- 2.2 Optimization in algorithm perspective
- 2.3 Optimization in framework perspective
- 2.4 Optimization in communication perspective
- 2.5 Optimization for Memory
- 2.6 Optimization for Homomorphic Encryption
- Understand the types of federated learning. Sep 2020
  - A brief introduction to the terminology and classification of federated learning
- Federated Machine Learning: Concept and Applications. Qiang Yang et al. 2019. ACM TIST
  - Chinese blog: Overview of Federated Learning. Section 1
- Practical Secure Aggregation for Privacy-Preserving Machine Learning. 2017. CCS
  - Secure aggregation protocol for horizontally partitioned data; see the sketch below.
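A minimal sketch of the pairwise-masking idea at the core of this protocol: each pair of clients derives the same random vector, which one adds and the other subtracts, so all masks cancel in the server-side sum. The real protocol agrees on seeds via Diffie-Hellman key exchange and adds secret-sharing-based dropout recovery, both omitted here; `seed_fn` is a hypothetical stand-in.

```python
import numpy as np

def pairwise_mask(client_id, all_ids, dim, seed_fn):
    """Mask that a client adds to its update before uploading."""
    mask = np.zeros(dim)
    for other in all_ids:
        if other == client_id:
            continue
        # Both clients of a pair seed the same PRNG, so they generate the
        # same vector; the lower-id client adds it, the other subtracts it.
        rng = np.random.default_rng(seed_fn(min(client_id, other),
                                            max(client_id, other)))
        r = rng.standard_normal(dim)
        mask += r if client_id < other else -r
    return mask

ids, dim = [0, 1, 2], 4
seed_fn = lambda i, j: 1000 * i + j                     # stand-in for a DH-agreed seed
updates = {i: np.full(dim, float(i + 1)) for i in ids}  # toy local updates
masked = {i: updates[i] + pairwise_mask(i, ids, dim, seed_fn) for i in ids}

# The server sees only masked updates, yet the masks cancel in the sum.
assert np.allclose(sum(masked.values()), sum(updates.values()))
```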
- Private federated learning on vertically partitioned data via entity resolution and additively homomorphic encryption. 2017. arXiv
  - Vertical logistic regression algorithm.
  - Chinese blog: Machine Learning & Federated Learning. Section 5
- SecureBoost: A Lossless Federated Learning Framework. 2021. IEEE Intelligent Systems
  - Vertical secure boosting algorithm.
  - Chinese blog: Machine Learning & Federated Learning. Section 6
- A Comprehensive Survey of Privacy-preserving Federated Learning: A Taxonomy, Review, and Future Directions. 2021. ACM Computing Surveys
- A Quantitative Survey of Communication Optimizations in Distributed Deep Learning. 2021. IEEE Network
- A Survey on Federated Learning Systems: Vision, Hype and Reality for Data Privacy and Protection. 2021. TKDE
  - Covers system challenges for federated learning.
  - Chinese blog: Survey of System Design for Distributed ML & FL. Section 2.2
- System Optimization in Synchronous Federated Training: A Survey. 2021. arXiv
  - Focuses on time-to-accuracy optimization for FL systems.
  - Chinese blog: Survey of System Design for Distributed ML & FL. Section 2.3
- A Survey on Distributed Machine Learning. 2020. ACM Computing Surveys
  - Covers system challenges for distributed machine learning.
  - Chinese blog: Survey of System Design for Distributed ML & FL. Section 2.1
- Communication-Efficient Distributed Deep Learning: A Comprehensive Survey. 2020. arXiv
- SAFELearn: Secure Aggregation for private FEderated Learning. 2021. S&P workshop
- Secure Bilevel Asynchronous Vertical Federated Learning with Backward Updating. 2021. AAAI
- VF2Boost: Very Fast Vertical Federated Gradient Boosting for Cross-Enterprise Learning. 2021. SIGMOD
  - System optimization for SecureBoost that overlaps encryption, cross-network communication, and histogram construction (a toy sketch of the overlap follows this item). It also describes several practical implementation tricks.
  - Chinese blog: Summary of VF2Boost
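The overlap can be pictured with a thread pool: encryption of later gradient batches proceeds while earlier batches are transferred and aggregated into histograms. This is only a toy illustration of the pipelining idea; the stage functions below are placeholders, not VF2Boost's API.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def encrypt(batch):            # stand-in for Paillier encryption (the slow stage)
    time.sleep(0.05)
    return batch

def transfer(batch):           # stand-in for cross-network communication
    time.sleep(0.02)
    return batch

def add_to_hist(hist, batch):  # stand-in for histogram construction
    for g in batch:
        hist[g % 4] = hist.get(g % 4, 0) + 1

batches = [list(range(i, i + 8)) for i in range(0, 64, 8)]
hist = {}
with ThreadPoolExecutor(max_workers=2) as pool:
    # Encryption of later batches runs in the pool while the main thread
    # transfers and aggregates earlier ones, so the three stages overlap.
    pending = [pool.submit(encrypt, b) for b in batches]
    for fut in pending:
        add_to_hist(hist, transfer(fut.result()))
print(hist)
```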
- FedML: A Research Library and Benchmark for Federated Machine Learning. 2020. arXiv
  - A library and system architecture for FL.
- FDML: A Collaborative Machine Learning Framework for Distributed Features. 2019. KDD
This section collects papers on both distributed frameworks for general computation (e.g., DNN training and batch processing) and distributed systems for FL.
- Sphinx: Enabling Privacy-Preserving Online Learning over the Cloud. 2022. S&P
- Cerebro: A Platform for Multi-Party Cryptographic Collaborative Learning. 2021. USENIX Security
- Citadel: Protecting Data Privacy and Model Confidentiality for Collaborative Learning. 2021. SoCC
- SecureBoost+: A High Performance Gradient Boosting Tree Framework for Large Scale Vertical Federated Learning. 2021. arXiv
- FedAT: A Communication-Efficient Federated Learning Method with Asynchronous Tiers under Non-IID Data. 2020. arXiv
- Throughput-Optimal Topology Design for Cross-Silo Federated Learning. 2020. arXiv
- Towards Federated Learning at Scale: System Design. 2019. MLSys
  - A framework for scaling horizontal FL.
  - Chinese blog: Survey of Distributed Framework in Federated Learning. Section 3
- Oort: Efficient Federated Learning via Guided Participant Selection. 2021. OSDI
- TiFL: A Tier-based Federated Learning System. 2020. HPDC
Since there are currently only a few research papers on distributed frameworks for FL, we also provide related work on general machine learning frameworks for reference.
- Gradient Compression Supercharged High-Performance Data Parallel DNN Training. 2021. SOSP
  - A novel GPU-based method that pipelines compression operations with computation for compression-enabled distributed DNN training; see the sketch below.
  - Chinese blog: Summary of HiPress
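A sketch of the overlap idea under simplified assumptions: while one layer's gradient is being compressed (on a separate CUDA stream in the real GPU-based system, a worker thread here), backpropagation proceeds to the next layer. The function bodies are illustrative placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
import time

def backward_layer(k):        # stand-in for computing layer k's gradient
    time.sleep(0.02)
    return f"grad_{k}"

def compress(grad):           # stand-in for top-k / quantization compression
    time.sleep(0.02)
    return f"{grad}_compressed"

layers = list(range(5, 0, -1))   # backprop walks layers in reverse order
compressed = []
with ThreadPoolExecutor(max_workers=1) as compressor:
    pending = None
    for k in layers:
        grad = backward_layer(k)                   # compute current layer
        if pending is not None:
            compressed.append(pending.result())    # collect previous layer
        pending = compressor.submit(compress, grad)  # overlaps with next layer
    compressed.append(pending.result())
print(compressed)
```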
- DAPPLE: A Pipelined Data Parallel Approach for Training Large Models. 2021. PPoPP
- PET: Optimizing Tensor Programs with Partially Equivalent Transformations and Automated Corrections. 2021. OSDI
- P3: Distributed Deep Graph Learning at Scale. 2021. OSDI
- PipeDream: Generalized Pipeline Parallelism for DNN Training. 2019. SOSP
- Ray: A Distributed Framework for Emerging AI Applications. 2018. OSDI
  - Actor-based framework and parallelism methods for reinforcement learning; a minimal actor example follows this item.
  - Chinese blog: Summary of Ray
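For reference, a minimal example of Ray's actor model: a stateful worker addressed by a handle, with asynchronous method calls that return futures.

```python
import ray

ray.init()

@ray.remote
class ParameterHolder:
    """A toy stateful actor holding a single scalar 'model'."""
    def __init__(self):
        self.weights = 0.0

    def apply_gradient(self, g):
        self.weights -= 0.1 * g
        return self.weights

holder = ParameterHolder.remote()
# .remote() calls return futures immediately; ray.get blocks for results.
futures = [holder.apply_gradient.remote(g) for g in [1.0, 2.0, 3.0]]
print(ray.get(futures))   # approximately [-0.1, -0.3, -0.6]
ray.shutdown()
```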
- TensorFlow: A System for Large-Scale Machine Learning. 2016. OSDI
  - Chinese blog: Summary of TensorFlow
- Spark SQL: Relational Data Processing in Spark. 2015. SIGMOD
- Scaling Distributed Machine Learning with the Parameter Server. 2014. OSDI
  - Chinese blog: Summary of Parameter Server
- Large Scale Distributed Deep Networks. 2012. NeurIPS
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. 2012. OSDI
  - Chinese blog: Summary of Apache Spark
- MapReduce: Simplified Data Processing on Large Clusters. 2004. OSDI
  - Chinese blog: Summary of MapReduce
- CrystalPerf: Learning to Characterize the Performance of Dataflow Computation through Code Analysis. 2021. ATC
- Scaling Large Production Clusters with Partitioned Synchronization. 2021. ATC
  - A distributed resource-scheduler architecture. It uses partitioned synchronization to reduce both contention for high-quality resources and the staleness of local states, the two main sources of high scheduling latency; a toy illustration follows this item.
  - Chinese blog: Survey of Framework-based Optimization for Federated Learning. Section 4
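A heavily simplified illustration of the idea: each scheduler synchronizes only one partition of the cluster state per round (staggered across schedulers) and schedules onto the partition it just synced, where its view is freshest. All names and numbers here are made up for the example.

```python
NUM_PARTITIONS, NUM_SCHEDULERS = 4, 2
free_slots = {p: 10 for p in range(NUM_PARTITIONS)}                    # ground truth
views = [{p: 10 for p in range(NUM_PARTITIONS)} for _ in range(NUM_SCHEDULERS)]

for round_id in range(8):
    for s in range(NUM_SCHEDULERS):
        # Sync a single partition instead of the whole cluster state;
        # the offset staggers schedulers so they rarely touch the same one.
        p = (round_id + s) % NUM_PARTITIONS
        views[s][p] = free_slots[p]
        # Schedule onto the freshest partition to cut conflicts and staleness.
        if free_slots[p] > 0:
            free_slots[p] -= 1
            views[s][p] -= 1

print("remaining slots:", free_slots)
```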
- Shard Manager: A Generic Shard Management Framework for Geo-distributed Applications. 2021. SOSP
- Advanced Synchronization Techniques for Task-based Runtime Systems. 2021. PPoPP
- Ownership: A Distributed Futures System for Fine-Grained Tasks. 2021. NSDI
- FLASHE: Additively Symmetric Homomorphic Encryption for Cross-Silo Federated Learning. 2021. arXiv
- Efficient Batch Homomorphic Encryption for Vertically Federated XGBoost. 2021. arXiv
- Cheetah: Optimizing and Accelerating Homomorphic Encryption for Private Inference. 2021. HPCA
- Communication-Efficient Federated Learning with Adaptive Parameter Freezing. 2021. ICDCS
- RC-SSFL: Towards Robust and Communication-efficient Semi-supervised Federated Learning System. 2020. arXiv
- BatchCrypt: Efficient Homomorphic Encryption for Cross-Silo Federated Learning. 2020. ATC
  - Quantizes gradients and batches many of them into a single ciphertext, reducing both communication and computation costs; see the sketch below.
  - Chinese blog: Summary of BatchCrypt
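A toy version of the quantize-then-batch step: gradients are quantized to fixed-width signed integers and packed into one large integer, so a single expensive homomorphic encryption would cover many values at once. The real scheme uses a custom encoding with padding bits so that ciphertext additions don't overflow across lanes; the encryption itself is omitted here.

```python
import numpy as np

BITS = 16
SCALE = 1 << (BITS - 3)   # leave headroom so a few additions per lane fit

def quantize(grads, clip=1.0):
    q = np.clip(grads, -clip, clip) / clip * SCALE
    return q.astype(np.int64)

def pack(qs):
    # Two's-complement encode each lane, then concatenate the lanes.
    out = 0
    for q in qs:
        out = (out << BITS) | (int(q) & ((1 << BITS) - 1))
    return out   # this single integer is what would be encrypted

def unpack(packed, n):
    qs = []
    for _ in range(n):
        lane = packed & ((1 << BITS) - 1)
        qs.append(lane - (1 << BITS) if lane >= (1 << (BITS - 1)) else lane)
        packed >>= BITS
    return np.array(qs[::-1], dtype=np.int64)

g = np.array([0.5, -0.25, 0.125])
assert np.allclose(unpack(pack(quantize(g)), len(g)) / SCALE, g)
```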
- FetchSGD: Communication-Efficient Federated Learning with Sketching. 2020. ICML
  - Compresses gradients with sketching so that only a single round of communication between server and clients is needed; see the sketch below.
  - Chinese blog: Summary of Sketching. Section 3
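A minimal Count Sketch, the compression primitive FetchSGD builds on: each client would send only the small sketch, and because sketches are linear the server can sum them before recovering heavy coordinates. The momentum and error accumulation that the paper performs on the server-side sketch are omitted.

```python
import numpy as np

ROWS, COLS, DIM = 5, 20, 1000
rng = np.random.default_rng(0)
bucket = rng.integers(0, COLS, size=(ROWS, DIM))   # hash: coordinate -> column
sign = rng.choice([-1, 1], size=(ROWS, DIM))       # sign hash

def sketch(g):
    """Compress a length-DIM gradient into a ROWS x COLS table."""
    S = np.zeros((ROWS, COLS))
    for r in range(ROWS):
        np.add.at(S[r], bucket[r], sign[r] * g)
    return S

def estimate(S, i):
    # Median across rows gives a collision-robust estimate of coordinate i.
    return np.median(sign[:, i] * S[np.arange(ROWS), bucket[:, i]])

g = np.zeros(DIM); g[7] = 5.0; g[42] = -3.0        # a sparse toy "gradient"
S = sketch(g) + sketch(g)                          # linearity: sum of sketches
print(estimate(S, 7), estimate(S, 42))             # ~10.0 and ~-6.0
```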
- CMFL: Mitigating Communication Overhead for Federated Learning. 2019. ICDCS
  - Reduces communication costs by cutting the number of transfers between edge devices and the central server. Similar to Gaia, it measures the relevance between local and global updates to decide whether a local update is worth uploading; see the sketch below.
  - Chinese blog: Summary of CMFL
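A toy version of the relevance check, assuming relevance is measured as the fraction of parameters whose update signs agree with the previous global update, with made-up numbers and threshold:

```python
import numpy as np

def relevance(local_update, global_update):
    """Fraction of coordinates whose signs agree with the global update."""
    return np.mean(np.sign(local_update) == np.sign(global_update))

rng = np.random.default_rng(1)
global_up = rng.standard_normal(100)
local_up = global_up + 0.5 * rng.standard_normal(100)   # mostly aligned
THRESHOLD = 0.7

if relevance(local_up, global_up) >= THRESHOLD:
    print("upload local update")    # relevant: transfer to the server
else:
    print("skip this round")        # irrelevant: save the communication
```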
This section introduces research on traditional machine learning that is related to federated learning.
- An INT8 quantization model used for tiny on-device learning; a generic INT8 round-trip example follows this item.
  - Chinese blog: Survey of Communication-based Optimization for Federated Learning. Section 4
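A generic symmetric INT8 round trip, to illustrate the kind of low-precision representation involved; this is not the paper's exact scheme.

```python
import numpy as np

def quantize_int8(x):
    # Symmetric quantization: map [-max|x|, +max|x|] onto [-127, 127].
    scale = float(np.abs(x).max()) / 127.0 or 1.0   # avoid /0 for all-zero input
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale

x = np.array([0.8, -1.2, 0.05, 0.0], dtype=np.float32)
q, s = quantize_int8(x)
print(q)                       # int8 values, 4x smaller than float32
print(dequantize_int8(q, s))   # close to x, with small rounding error
```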
- Hoplite: Efficient and Fault-Tolerant Collective Communication for Task-Based Distributed Systems. 2021. SIGCOMM
  - Brings collective communication to task-based distributed frameworks (e.g., Ray, Dask, Hydro).
  - Chinese blog: Summary of Hoplite
- waveSZ: A Hardware-Algorithm Co-Design of Efficient Lossy Compression for Scientific Data. 2020. PPoPP
- Communication-Efficient Distributed SGD with Sketching. 2019. NeurIPS
  - Uses sketching to find the top-k gradient elements so that workers only transfer the top-k updates, reducing communication cost; see the sketch below.
  - Chinese blog: Summary of Sketching. Section 2
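The transfer step in its simplest form: send only the k largest-magnitude coordinates as (index, value) pairs instead of the dense vector. Here the top-k selection is done exactly for clarity; the paper's contribution is finding these heavy coordinates via sketches without shipping full gradients.

```python
import numpy as np

def topk_sparsify(g, k):
    """Return the k largest-magnitude coordinates as (indices, values)."""
    idx = np.argpartition(np.abs(g), -k)[-k:]
    return idx, g[idx]          # only these k pairs are transferred

rng = np.random.default_rng(2)
g = rng.standard_normal(10_000)
idx, vals = topk_sparsify(g, k=100)
print(f"sent {len(idx)} of {g.size} values "
      f"({100 * len(idx) / g.size:.1f}% of the coordinates)")
```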
- A Generic Communication Scheduler for Distributed DNN Training Acceleration. 2019. SOSP
- SketchML: Accelerating Distributed Machine Learning with Data Sketches. 2018. SIGMOD
- Gradient Sparsification for Communication-Efficient Distributed Optimization. 2018. NeurIPS
- Horovod: Fast and Easy Distributed Deep Learning in TensorFlow. 2018. arXiv
  - Chinese blog: Summary of Hoplite. Section 3
- QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding. 2017. NeurIPS
- Gaia: Geo-Distributed Machine Learning Approaching LAN Speeds. 2017. NSDI
  - Uses a significance function to measure the importance of each update; updates below a threshold are not transferred, mitigating WAN bandwidth overhead (see the sketch below). Also introduces a new synchronization model, ASP (Approximate Synchronous Parallel), which is proven to preserve the convergence guarantees.
  - Chinese blog: Summary of Gaia
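A toy version of the significance filter, assuming significance is an update's magnitude relative to the current parameter value, with made-up numbers and threshold:

```python
import numpy as np

THRESHOLD = 0.01   # a 1% relative change counts as significant

def significant(update, weights, eps=1e-8):
    return np.abs(update) / (np.abs(weights) + eps) > THRESHOLD

weights = np.array([1.0, 10.0, 0.5])
pending = np.array([0.005, 0.5, 0.001])    # accumulated local updates
mask = significant(pending, weights)
print("send over WAN:", pending[mask])     # only 0.5 (5% of 10.0) qualifies
# Insignificant updates keep accumulating locally: pending[~mask]
```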
- GAIA: A System for Interactive Analysis on Distributed Graphs Using a High-Level Language. 2021. NSDI
  - A memory-management system for interactive graph computation at the distributed-infrastructure layer.
  - Chinese blog: Survey of Framework-based Optimization for Federated Learning. Section 2
- A Novel Memory-Efficient Deep Learning Training Framework via Error-Bounded Lossy Compression. 2021. PPoPP
- Zico: Efficient GPU Memory Sharing for Concurrent DNN Training. 2021. ATC
- Are Dynamic Memory Managers on GPUs Slow? A Survey and Benchmarks. 2021. PPoPP
- Sentinel: Efficient Tensor Migration and Allocation on Heterogeneous Memory Systems for Deep Learning. 2021. HPCA
- F1: A Fast and Programmable Accelerator for Fully Homomorphic Encryption. 2021. MICRO
- EVA: An Encrypted Vector Arithmetic Language and Compiler for Efficient Homomorphic Computation. 2020. PLDI
- CHET: An Optimizing Compiler for Fully-Homomorphic Neural-Network Inferencing. 2019. PLDI
- FATE: An industrial-grade framework for FL, from WeBank. Chinese blog: Architecture of FATE
- PyTorch Implementation: An FL implementation based on PyTorch, from shaoxiongji.
- Microsoft/SEAL: An easy-to-use and powerful homomorphic encryption library.
- Microsoft/EVA: A compiler for the SEAL homomorphic encryption library.