Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
doing copy edits in content folder then will move up
this is just so I can see what I am doing
1 parent
5336f66
commit fcbb347
Showing
11 changed files
with
987 additions
and
0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
+++ | ||
title="Primers" | ||
date="28 Dec 2022 12:22:11 BST" | ||
+++ | ||
|
||
## Reading, Building, and Working | ||
|
||
Our programme has three strands of work in every sprint. We give our fellows plenty of documentation, books, and papers to read and discuss during office hours. Mentors have contributed these primers to support the study strand. | ||
|
||
There are also links to related projects to explore the ideas presented here. |
137 changes: 137 additions & 0 deletions
137
...tent/primers/distributed-software-systems-architecture/1-reliable-rpcs/index.md
Large diffs are not rendered by default.
Binary file added
BIN
+35.3 KB
website/content/primers/distributed-software-systems-architecture/1-reliable-rpcs/rpcs.webp
Binary file not shown.
294 changes: 294 additions & 0 deletions
294
website/content/primers/distributed-software-systems-architecture/2-state/index.md
Large diffs are not rendered by default.
Binary file added
BIN
+53.5 KB
...s/distributed-software-systems-architecture/2-state/leader-follower-diagram.png
112 changes: 112 additions & 0 deletions
112
...distributed-software-systems-architecture/3-scaling-stateless-services/index.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,112 @@ | ||
+++ | ||
title="3. Scaling Stateless Services" | ||
+++ | ||
|
||
# 3 | ||
|
||
## Scaling Stateless Services | ||
|
||
### Microservices or monoliths | ||
|
||
A monolith is a single large program that encapsulates all the logic for your application. A microservice architecture, on the other hand, splits application functionality across a number of smaller programs, which are composed together to form your application. Both have advantages and drawbacks. | ||
|
||
Read [Microservices versus Monoliths](https://www.atlassian.com/microservices/microservices-architecture/microservices-vs-monolith) for a discussion of microservices and monoliths. Optionally, watch [a comedy about the extremes of microservice-based architectures](https://www.youtube.com/watch?v=y8OnoxKotPQ). | |
|
||
The middle ground between a single monolith and many tiny microservices is to break the application into a more moderate number of services, each of which has high cohesion (or relatedness). Some call this approach [‘macroservices’](https://www.geeksforgeeks.org/software-engineering-coupling-and-cohesion/). | |
|
||
### Horizontal Scaling and Load Balancing | ||
|
||
We have seen how stateless and stateful services can be scaled horizontally – i.e. run across many separate machines to provide a scalable service that can serve effectively unlimited load – with the correct architecture. | ||
|
||
Load balancers are an essential component of horizontal scaling. For stateless services, the load balancers are general-purpose proxies (like Nginx, Envoy Proxy, or HAProxy). For scaling datastores, proxies are typically specialised, like mcrouter or the Vitess vtgates. | ||
|
||
### How load balancers work | ||
|
||
It is worth understanding a little of how load balancers work. Load balancers today are typically based on software, although hardware load balancer appliances are still available. The biggest split is between Layer 4 load balancers, which focus on balancing load at the level of TCP/IP packets, and Layer 7 load balancers, which understand HTTP. | ||
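To make the Layer 4 versus Layer 7 split concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular proxy is implemented: the backend addresses, pool names, and the functions `pick_l4` and `pick_l7` are all hypothetical. The Layer 4 function sees only connection metadata, while the Layer 7 function inspects the HTTP request path.

```python
import hashlib

# Hypothetical backend addresses and pool names, for illustration only.
BACKENDS = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
ROUTES = {"/api/": "api-pool", "/static/": "static-pool"}


def pick_l4(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> str:
    """Layer 4: choose a backend using only the connection's addresses and ports."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # The same connection always hashes to the same backend.
    return BACKENDS[int.from_bytes(digest[:8], "big") % len(BACKENDS)]


def pick_l7(path: str) -> str:
    """Layer 7: choose a pool by looking inside the HTTP request."""
    for prefix, pool in ROUTES.items():
        if path.startswith(prefix):
            return pool
    return "default-pool"
```

The Layer 4 balancer cannot tell an API call from an image download, because that information only exists inside the HTTP request that the Layer 7 balancer parses.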
|
||
Read [Introduction to modern network load balancing and proxying by Matt Klein](https://blog.envoyproxy.io/introduction-to-modern-network-load-balancing-and-proxying-a57f6ff80236). | ||
|
||
- Name 5 functions of load balancers | ||
- Why might you use a Layer 7 load balancer instead of a Layer 4? | |
- When might you use a Layer 4 load balancer? | |
- Give a reason to use a sidecar proxy architecture rather than a middle proxy | ||
- Why would you use Direct Server Return? | ||
- What is a Service Mesh? What are the advantages and disadvantages of a service mesh? | ||
- What is a VIP? What is Anycast? | ||
|
||
#### Round Robin and other load balancing algorithms | |
|
||
Many load balancers use Round Robin to allocate connections or requests to backend servers. This means that they assign the first connection or request to the first backend, the second to the second, and so on until they loop back around to the first backend. Round Robin is simple to understand, needs little state or feedback to manage, and leaves few ways for anything to go badly wrong. | |
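Round Robin as described above can be sketched in a few lines of Python. The class name `RoundRobin` and the backend names are hypothetical; a real load balancer would also handle health checks and backends joining or leaving.

```python
from itertools import cycle


class RoundRobin:
    """Hand out backends in a fixed rotation."""

    def __init__(self, backends):
        if not backends:
            raise ValueError("need at least one backend")
        # cycle() loops back to the first backend after the last one.
        self._rotation = cycle(list(backends))

    def pick(self):
        return next(self._rotation)


lb = RoundRobin(["a", "b", "c"])
picks = [lb.pick() for _ in range(5)]
```

The only state the balancer keeps is its position in the rotation, which is why so little can go wrong with it.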
|
||
Read about other [load balancing algorithms](https://www.cloudflare.com/en-gb/learning/performance/types-of-load-balancing-algorithms/). | |
|
||
Now consider the weighted response time load balancing algorithm, which sends the most requests to the server that responds fastest. What can go wrong here? | |
|
||
If one server is misconfigured and happens to be serving errors or empty responses very rapidly, then load balancers configured to use a weighted response time algorithm will send more traffic to the faulty server. | |
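The failure mode above can be shown with a small calculation. This sketch assumes each server's share of traffic is proportional to the inverse of its average response time; real load balancers use their own weighting schemes, and the function name `traffic_shares` and the timings are made up for illustration.

```python
def traffic_shares(avg_response_ms):
    """Weight each server by 1 / response time, then normalise to shares."""
    weights = {name: 1.0 / ms for name, ms in avg_response_ms.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


# Three healthy servers that respond equally fast share traffic evenly.
healthy = traffic_shares({"a": 100.0, "b": 100.0, "c": 100.0})

# Server "c" is broken and returns errors in 5 ms instead of doing real work.
broken = traffic_shares({"a": 100.0, "b": 100.0, "c": 5.0})
```

In the second case the broken server attracts more than 90% of the traffic, because response time alone says nothing about whether the responses are correct.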
|
||
#### DNS | ||
|
||
There is one more approach to load balancing that is worth knowing about: [DNS Load Balancing](https://www.cloudflare.com/en-gb/learning/performance/what-is-dns-load-balancing/). DNS-based load balancing is often used to route users to a specific set of servers that is closest to their location, in order to minimize latency (page load time). | ||
|
||
### Performance: Edge Serving and CDNs | ||
|
||
Your users may be anywhere on Earth, but quite often, your serving infrastructure (web applications, datastores, and so on) is all in one region. This means that users in other continents may find your application slow. A network round-trip between Sydney in Australia and most locations in the US or Europe takes around 200 milliseconds. 200ms is not overly long, but the problem is that serving a user request may take several round trips. | ||
|
||
#### Round trips | ||
|
||
First, the user may need to look up the DNS name for your site (this may be cached nearby). Next, they need to open a TCP connection to one of your servers, which requires a network round trip. Finally, the user must perform an [SSL handshake](https://zoompf.com/blog/2014/12/optimizing-tls-handshake/) with your server, which also requires one or two network round trips, depending on the configuration of client and server (a full handshake takes two round trips with TLS 1.2 but only one if both client and server support TLS 1.3). | |
|
||
All of this takes place before any data may be returned to the user, and, unless there is already an open TCP connection between the user and the website, involves an absolute minimum of _three network round trips_ before the first byte of data can be received by the client. | ||
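As a rough worked example of the cost, the sketch below multiplies the round-trip time by the number of round trips. It uses the three-round-trip minimum and the 200 millisecond Sydney figure from the text; the function name `setup_latency_ms` is hypothetical, and the model ignores server processing time and packet loss.

```python
def setup_latency_ms(rtt_ms: float, round_trips: int = 3) -> float:
    """Time spent on connection setup before the first byte can arrive."""
    return rtt_ms * round_trips


# Three round trips at 200 ms each, as for a user far from the servers.
far_away = setup_latency_ms(200)
```

Here `far_away` is 600 milliseconds, all of it spent before the server has sent any content.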
|
||
#### SSL Termination at the Edge | ||
|
||
SSL termination _at the edge_ is the solution to this issue. This involves running some form of proxy much nearer to the user, which can perform the SSL handshake with the user. If the user need only make network round trips of 10 milliseconds to a local Point of Presence (PoP), as opposed to 200 milliseconds to serving infrastructure in a different continent, then the three round trips of a standard TCP connection initiation and SSL handshake will take only around 30 milliseconds, as opposed to 600 milliseconds. | |
|
||
Of course, the edge proxies must still relay requests to your serving infrastructure, which remains 200 milliseconds of network latency away. However, the edge proxies will have persistent encrypted network connections to your servers: there is no per-request overhead for connection setup or handshakes. | ||
|
||
##### Content Delivery Networks | ||
|
||
Termination at the edge is a service often performed by [Content Delivery Networks](https://en.wikipedia.org/wiki/Content_delivery_network) (CDNs). CDNs can also be used to cache static assets such as CSS or images close to your users, in order to reduce site load time as well as reducing load on your origin servers. You can also run your own compute infrastructure close to your users. However, this is an area of computing where being big is an advantage: it is hard to beat the number of edge locations that large providers like Cloudflare, Fastly, and AWS operate. | ||
|
||
[Edge Regions in AWS](https://www.lastweekinaws.com/blog/what-is-an-edge-location-in-aws-a-simple-explanation/) is worth reading to get an idea of the scale of Amazon’s edge presence. | ||
|
||
### QUIC | ||
|
||
It is worth being aware of [QUIC](https://en.wikipedia.org/wiki/QUIC), an emerging network protocol that is designed to be faster for modern Internet applications than TCP. While it is by no means ubiquitous yet, it is certainly an area that the largest Internet companies are investing in. HTTP/3, the next major version of the HTTP protocol, uses QUIC. | |
|
||
Read about [HTTP over QUIC](https://blog.cloudflare.com/http3-the-past-present-and-future/) in the context of the development of the HTTP protocol. | ||
|
||
### Autoscaling | ||
|
||
Aside from load balancing, the other major component of successful horizontal scaling is autoscaling. Autoscaling means to scale the number of instances of your service up and down according to the load that the service is experiencing. This can be more cost-effective than sizing your service for expected peak loads. | ||
|
||
On AWS, for example, you can create an Auto Scaling group (ASG), which acts as a container for your running EC2 instances. ASGs can be [configured](https://docs.aws.amazon.com/autoscaling/ec2/userguide/scale-your-group.html) to scale the number of instances up or down based on a schedule, based on [predicted load](https://docs.aws.amazon.com/autoscaling/ec2/userguide/ec2-auto-scaling-predictive-scaling.html) (using the past two weeks of history), or based on [current metrics](https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-scale-based-on-demand.html). In Kubernetes, the [Horizontal Pod Autoscaler](https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/) (HPA) plays a similar role to ASG scaling policies. | |
|
||
#### CPU utilisation | ||
|
||
CPU utilisation is very commonly used as a scaling signal. It is a good general proxy for how 'busy' a particular instance is. It’s a generic metric that your platform understands, as opposed to an application-specific metric. | ||
|
||
The CPU utilisation target should not be set too high, because your service needs headroom. It will take some time to scale up when load increases. You might also lose a subset of your infrastructure (think of a bad code push to a subset of your instances, or a failure of an AWS Availability Zone), which leaves the remaining instances to absorb the extra load. | |
|
||
Your service probably cannot serve reliably at close to 100% CPU utilisation. 40% utilisation is a common target. | ||
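A short sketch shows how a utilisation target turns into an instance count, and why headroom matters. The first function follows the formula given in the Kubernetes HPA documentation (desired replicas = ceil(current replicas × current metric / target metric)) but ignores the tolerances and stabilisation behaviour that real autoscalers apply. The function names and the example numbers are hypothetical.

```python
import math


def desired_replicas(current_replicas: int, current_cpu_pct: int, target_cpu_pct: int) -> int:
    """Replica count that would bring average CPU back to the target."""
    return math.ceil(current_replicas * current_cpu_pct / target_cpu_pct)


def utilisation_after_loss(replicas: int, lost: int, cpu_pct: float) -> float:
    """Average CPU on the survivors if `lost` replicas vanish and load is unchanged."""
    remaining = replicas - lost
    if remaining <= 0:
        raise ValueError("no replicas left to serve traffic")
    return cpu_pct * replicas / remaining


# 10 replicas running at 60% against a 40% target need to grow to 15.
scale_to = desired_replicas(10, 60, 40)

# Losing one of three zones: a 40% target leaves room, a 70% target does not.
with_headroom = utilisation_after_loss(3, 1, 40)
without_headroom = utilisation_after_loss(3, 1, 70)
```

With a 40% target, the surviving two thirds of the fleet run at 60% while replacements start. With a 70% target, the same failure asks them for 105%, which they cannot deliver.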
|
||
### Autoscaling and Long-Running Connections | ||
|
||
One case where autoscaling does not help you to manage load is when your application is based on long-running connections, such as websockets or gRPC streaming. Managing these at scale can be challenging. Read the article [Load balancing and scaling long-lived connections in Kubernetes](https://learnk8s.io/kubernetes-long-lived-connections). | ||
|
||
- Why doesn’t autoscaling work to redistribute load in systems with long-lived connections? | ||
- How can we make these kinds of systems robust? | ||
|
||
## Project work for this section | ||
|
||
- **[Multiple Servers](projects/multiple-servers)** | ||
|
||
### Stretch | ||
|
||
- [Consul and Chaos Engineering](https://learn.hashicorp.com/tutorials/consul/introduction-chaos-engineering?in=consul/resiliency): This HashiCorp tutorial will give you hands-on experience of how health checks can be used to manage failure. | |
- [Implement Circuit Breaking in Consul Service Mesh with Envoy](https://learn.hashicorp.com/tutorials/consul/service-mesh-circuit-breaking?in=consul/resiliency) : Demonstrate circuit breaking. | ||
- [Load Balancing Envoy](https://learn.hashicorp.com/tutorials/consul/load-balancing-envoy?in=consul/resiliency) : See different kinds of load-balancing algorithms in use. | ||
- Do the LearnDevOps tutorial, which demonstrates autoscaling with minikube. | |
  You will need to install minikube on your computer if you don’t have it. It will give you hands-on experience in configuring autoscaling, plus some exposure to Kubernetes configuration. | |
  - You may need to run this first: `minikube addons enable metrics-server` | |
  - [Kubernetes - Deploy App into Minikube Cluster using Deployment controller, Service, and Horizontal Pod Autoscaler](https://learndevops.novalagung.com/kubernetes-minikube-deployment-service-horizontal-autoscale.html) |
45 changes: 45 additions & 0 deletions
45
...ibuted-software-systems-architecture/4-asynchronous-work-and-pipelines/index.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,45 @@ | ||
+++ | ||
title="4. Asynchronous Work and Pipelines" | ||
+++ | ||
|
||
# 4 | ||
|
||
## Asynchronous Work and Pipelines {#asynchronous-work-and-pipelines} | ||
|
||
Not all work that we want to do with computers involves serving a request in near-real-time and responding to a user. Sometimes we need to do _asynchronous_ tasks like: | ||
|
||
- Periodic work, such as a nightly data export, or computing monthly reports | ||
- Work scheduled for later, such as scheduling reminders to users | ||
- Long-running work, such as scheduling a build or a set of tests | ||
- Running a continuous statistics computation based on incoming data | ||
|
||
Batch processes may not be that large and may just run as a scheduled [cron](https://en.wikipedia.org/wiki/Cron) job. | ||
|
||
### MapReduce {#mapreduce} | ||
|
||
However, not all computations fit on one machine. One way of running large batch computations across a fleet of computers is the MapReduce paradigm. Read this article which [describes how MapReduce works](https://medium.com/edureka/mapreduce-tutorial-3d9535ddbe7c). | ||
|
||
- How does MapReduce help us to scale big computations? | ||
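The shape of a MapReduce job can be sketched on a single machine. This is a toy word count in plain Python, not a distributed implementation: the function names are hypothetical, and in a real cluster the map and reduce phases would run on many machines with the shuffle moving data between them.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Shuffle: group all values by key so each key can be reduced independently."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values for each key into one result."""
    return {key: sum(values) for key, values in groups.items()}


documents = ["the cat sat", "the dog sat", "the end"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(mapped))
```

Each call to `map_phase` needs only its own document, and each key is reduced without reference to any other key.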
|
||
Read this book chapter about [Data Processing Pipelines](https://sre.google/sre-book/data-processing-pipelines/). | ||
|
||
- Give two reasons why data processing pipelines can be fragile | ||
|
||
Optionally, you can follow this short tutorial to implement a distributed word count application, and run it locally on a Glow cluster. You will get hands-on experience with MapReduce. | ||
|
||
- [https://blog.gopheracademy.com/advent-2015/glow-map-reduce-for-golang/](https://blog.gopheracademy.com/advent-2015/glow-map-reduce-for-golang/) | ||
|
||
### Queues {#queues} | ||
|
||
Queues are a frequently-seen component of large software systems that involve potentially heavyweight or long-running requests. A queue can act as a form of buffer, smoothing out spikes of load so that the system can deal with work when it has the resources to do so. Read about the [Queue-Based Load-Leveling Pattern](https://learn.microsoft.com/en-us/azure/architecture/patterns/queue-based-load-leveling). | ||
|
||
- How can results of tasks be communicated back to users in a queue-based system? | ||
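The buffering effect of a queue can be seen in a small simulation. This sketch assumes a worker that can finish a fixed number of tasks per tick; the function `simulate` and the arrival pattern are invented for illustration.

```python
from collections import deque


def simulate(arrivals_per_tick, capacity_per_tick):
    """Return (tasks processed, queue depth) recorded at the end of each tick."""
    queue = deque()
    processed, depth = [], []
    next_task_id = 0
    for arrivals in arrivals_per_tick:
        # New work is enqueued immediately, however large the spike.
        for _ in range(arrivals):
            queue.append(next_task_id)
            next_task_id += 1
        # The worker drains the queue at its own steady pace.
        done = 0
        while queue and done < capacity_per_tick:
            queue.popleft()
            done += 1
        processed.append(done)
        depth.append(len(queue))
    return processed, depth


# A spike of 10 tasks arrives at once; the worker handles 3 per tick.
processed, depth = simulate([10, 0, 0, 0, 2], capacity_per_tick=3)
```

The worker never handles more than 3 tasks in a tick, and the spike is absorbed as queue depth that drains over the following ticks. The cost is that tasks wait longer before they are processed.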
|
||
Kafka is a commonly-used open-source distributed queue. Read [Apache Kafka in a Nutshell](https://medium.com/swlh/apache-kafka-in-a-nutshell-5782b01d9ffb). | ||
|
||
- What are the components of the Kafka architecture? | ||
- How are topics different from partitions? | ||
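One idea worth having in mind while reading about Kafka is how a keyed record ends up in a partition. The sketch below models a topic as a set of append-only lists and picks a partition by hashing the record key. CRC32 is only a stand-in here, since real Kafka clients use their own partitioner and hash function, and the names `partition_for` and `topic` are hypothetical.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition; the same key always gets the same one."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# A toy topic with three partitions, each an ordered, append-only log.
NUM_PARTITIONS = 3
topic = {p: [] for p in range(NUM_PARTITIONS)}

events = [("user-1", "login"), ("user-2", "login"), ("user-1", "logout")]
for key, value in events:
    topic[partition_for(key, NUM_PARTITIONS)].append((key, value))

# All events for one key sit in one partition, in the order they were sent.
user_1_log = [v for k, v in topic[partition_for("user-1", NUM_PARTITIONS)] if k == "user-1"]
```

Ordering is only preserved within a partition, so choosing the key decides which records keep their relative order.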
|
||
### Project work for this section {#project-work-for-this-section} | ||
|
||
- [https://github.com/CodeYourFuture/immersive-go-course/tree/main/kafka-cron](https://github.com/CodeYourFuture/immersive-go-course/tree/main/kafka-cron) |
63 changes: 63 additions & 0 deletions
63
...e-systems-architecture/5-distributed-locking-and-distributed-consensus/index.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,63 @@ | ||
+++ | ||
title="5. Distributed Locking and Distributed Consensus" | ||
+++ | ||
|
||
# 5 | ||
|
||
## Distributed Locking and Distributed Consensus | ||
|
||
> In a program, sometimes we need to lock a resource. Think about a physical device, like a printer: we only want one program to print at a time. Locking applies to lots of other kinds of resources too, often when we need to update multiple pieces of data in consistent ways in a multi-threaded context. | ||
### How to do distributed locking | ||
|
||
We need to be able to do locking in distributed systems as well. Read Martin Kleppmann’s article [How to do distributed locking](https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html). | ||
|
||
- What are the two main reasons to do distributed locking? | |
|
||
It turns out that, in computer science terms, distributed locking is theoretically equivalent to reliably electing a single leader (as in database replication, for example). It is also logically the same as determining the exact order in which a sequence of events on different machines in a distributed system occurs. All of these are useful. | |
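The locking article linked above describes fencing tokens as a way to keep a lock safe when a client pauses and its lease expires. The sketch below is a simplified single-process model of that idea, not a real lock service: `LockService` and `Storage` are hypothetical classes, and lease expiry is only implied by a second client acquiring the lock.

```python
class LockService:
    """Hands out a strictly increasing token with every lock grant."""

    def __init__(self):
        self._last_token = 0

    def acquire(self) -> int:
        self._last_token += 1
        return self._last_token


class Storage:
    """Accepts a write only if its token is not older than one already seen."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token: int, value: str) -> bool:
        if token < self.highest_token:
            return False  # stale lock holder: reject the write
        self.highest_token = token
        self.value = value
        return True


locks = LockService()
storage = Storage()

token_a = locks.acquire()   # client A gets the lock, then stalls
token_b = locks.acquire()   # A's lease expires; client B gets the lock
b_ok = storage.write(token_b, "written by B")
a_ok = storage.write(token_a, "written by A")  # A wakes up and tries to write
```

Client A still believes it holds the lock, but the storage layer rejects its write because a higher token has already been seen. The tokens impose an order on events across machines, which is the ordering problem described above.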
|
||
### Raft | |
|
||
An algorithm that you should know about is the Raft distributed consensus protocol: read [https://raft.github.io/raft.pdf](https://raft.github.io/raft.pdf). | |
|
||
- Under what circumstances is a Raft cluster available? | |
- What does the leader do? | ||
- Is there always a leader? | ||
- How is a leader elected? | ||
- What happens if one replica falls behind the leader, i.e. the leader has committed transactions that the replica doesn’t have? | ||
- What happens if a server is replaced? | ||
|
||
Read about the operational characteristics of distributed consensus algorithms in [Managing Critical State](https://sre.google/sre-book/managing-critical-state/). | |
|
||
- What are the scaling limitations of distributed consensus algorithms? | ||
- How can we scale read-heavy workloads? | ||
|
||
### Project work for this section {#project-work-for-this-section} | ||
|
||
See [raft-otel](https://github.com/CodeYourFuture/immersive-go-course/tree/main/raft-otel) project. | ||
This project is an opportunity to explore the Raft distributed consensus protocol through the medium of distributed tracing. | |
|
||
## Further Optional Reading {#further-optional-reading} | ||
|
||
Discuss any of these pieces at office hours with Laura. | ||
|
||
[Dan Luu’s list of postmortems](https://github.com/danluu/post-mortems) | ||
: includes a lot of interesting stories of real-world distributed systems failure | ||
|
||
[Jeff Hodges' Notes on Distributed Systems for Young Bloods](https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/) | ||
: is a very practical take on distributed systems topics. | ||
|
||
[Alvaro Videla's blog post and talks about learning distributed systems](https://alvaro-videla.com/2015/12/learning-about-distributed-systems.html) | ||
: are accessible and well organised. | ||
|
||
[Marc Brooker’s blog](https://brooker.co.za/blog/) | ||
: is full of interesting pieces, which are very approachable. | ||
|
||
[The SRE Book](https://sre.google/sre-book/table-of-contents/) | ||
: is available in full online and is worth reading - it addresses many aspects of operating distributed software systems. | ||
|
||
Aphyr (Kyle Kingsbury) | ||
: provides detailed [notes for a distributed systems class he teaches](https://github.com/aphyr/distsys-class). | |
|
||
[A Distributed Systems Reading List](https://dancres.github.io/Pages/) | ||
: will keep you reading excellent distributed systems papers for many months. |
File renamed without changes.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
+++ | ||
title="Troubleshooting Primer" | ||
author="Laura Nolan and Radha Kumari" | ||
date="28 Dec 2022 12:22:11 BST" | ||
+++ | ||
|
||
# Troubleshooting Primer | ||
|
||
## About this document {#about-this-document} | ||
|
||
This document is a crash course in troubleshooting. It is aimed at people with some knowledge of computing and programming, but without significant professional experience in operating distributed software systems. This document does not aim to prepare readers to be on call or to act as incident responders. It aims primarily to describe the skills needed to make progress in day-to-day software operations work, which often involves a lot of troubleshooting in development and test environments. | |
|
||
## Learning objectives: | ||
|
||
- Explain troubleshooting and how it differs from debugging | ||
- Name common troubleshooting methods | ||
- Work through some example scenarios | |
- Use common tools to troubleshoot |