The material presented here is gathered from different sources: Stanford University and Carnegie Mellon University courses, various books, and Udemy, DesignGuru, Alex Xu, and O'Reilly courses, among others.
- Azure Solution Architecture
- TOGAF Enterprise Architecture
- Solution and System Design examples
- Advanced topics regarding different aspects of solution architecture
- Azure: Solution Architecture in Practice
a. Azure: Logging & Monitoring
b. Azure: Service Principal, Managed Identity, and Service Connection
c. Azure: Data Materials
d. Azure: Design Relational Data Storage
e. Azure: Active Geo-replication
f. Azure: Networking
g. Azure: High Availability and Traffic Manager
h. Azure: Architecture examples by Microsoft
i. Azure Migration Strategy by Microsoft
j. Azure Additional Links to Different Materials
k. Azure: Articles covering everything in Azure. Third-party repo I cloned
l. API Management and VNet + APIM with Gateway for On-Premise
- Azure B2B, B2C, Entitlement, Guest Users. App roles & delegated permissions
a. Azure: B2B & B2C, Azure AD DS Domain Services
b. Azure Entitlement
c. Azure Delegate Permissions (User to Azure App)
d. Azure Assign App roles to another App. API Permissions
- Azure: High Availability. Disaster Recovery for Azure Functions. Strategies
a. Azure Service Bus: Performance Improvements Best Practices
b. Azure App Service Disaster Recovery. Active-Active, Active-Passive, Active-Cold
- Azure: Skylines Academy materials
- Data & Data Storage Materials
a. Design for Data Storage
b. Azure SQL. Data Retention
c. Azure SQL DB Data Retention for more than 35 days. Tip
- Traffic Manager
- Azure Data Factory
- Azure Proximity Placement Groups. Configuring Low-latency VMs in one DataCenter
- Azure App Configuration. Feature Flags and Configurations
- ToGAF. ADM. Basics
- ToGAF. Preliminary Phase
- ToGAF. A. Architecture Vision
- ToGAF. B. Business Architecture
- ToGAF. C. Information Systems Architecture
- ToGAF. D. Technology Architecture
- ToGAF. E. Opportunities and Solutions
- ToGAF. F. Migration Planning
- ToGAF. G. Governance. Table of Content
- Good Principles of Governance
- Compliant. Non-Compliant. Consistent. Conformant. Fully-Conformant. Irrelevant.
- ToGAF. H. Architecture Change Management
- ADM. Guidelines and Techniques. Table of Content
- Architecture Governance Techniques. Table of Content
- Architecture: How Much of Architecture is needed
a. Architecture lifecycle
b. Architecture patterns
c. Architecture style vs pattern
d. Architecture Influence Cycle
- Architecture: Jeffrey Richter's Course materials
- Architecture: Attribute-Driven Design
- Architecture: Quality Attributes and Tactics to achieve them
a. Functional vs Non-Functional. Test Process
b. Tactics to Achieve Quality Attributes
- Architecture: Architecture Style
a. Module Styles
b. Component & Connectors Styles
c. Allocation Styles
- Architecture Views. Documenting Software Architecture. Properties to document in your Architecture Document
- Documenting Architecture. General Structure. General Principles. Mapping Requirements
a. Documenting: Combining Views. Hybrid View
b. Documenting: Interfaces, Behavior, Context
c. View Packets. Alternatives
d. Example: Architecture Decision Records (ADR\MADR). How to Document your architecture decisions and their consequences
- Views and Beyond. Alternatives: DoDAF, ISO42010 \ IEEE1471-2000
- Reviewing Architecture Documentation
- Architecture Evaluation
- Data & Data Storage Materials
a. Data Replication. Leader-Follower & Quorum patterns in SQL
b. Blob Storage vs File Share vs Managed Disks
- Cache. Read-Through, Cache-Aside
a. Cache Consistency Models
b. Cache Challenges
c. Cache Replacement Policies
d. Cache Performance Metrics
- Governance and Compliance materials
- Kafka & Messaging patterns
a. Kafka. Basics. Consumer Group. Compression & Batching. Load Balancing
b. Messaging Patterns suitable for Kafka and for other services. Q&A
- Dapr. The Distributed Application Runtime
- CAP Theorem. PACELC
- Consistent Hashing for Data Replication and Data Partitioning
- Architecture in Practice. System Design Interview. Q&A. Main Problems
a. System Design Interview In practice
b. URL Shortener
c. Pastebin
d. Web Crawler
- CAP Theorem. PACELC
- Consistent Hashing for Data Replication and Data Partitioning
Table of Contents
- System Design Complete Guide by Karan Pratap Singh
- Designing Data-Intensive Applications, M.Kleppmann
- System Design Interview, Alex Xu
QAW (Quality Attribute Workshop) => ADD (Attribute-Driven Design) => V&B (Views and Beyond) => ATAM (Architecture Tradeoff Analysis Method)
- Pattern consists of 3 parts - Problem, Context, and Solution.
- Patterns tell more about the context in which the solution appeared. A style, on the other hand, is about the solution: it describes well what exactly was selected, without a detailed explanation of "why" (Leonard Bass).
- A style is a higher level of abstraction; a style can show elements and their relations.
- A pattern shows the exact way to achieve something. Details: https://www.geeksforgeeks.org/difference-between-architectural-style-architectural-patterns-and-design-patterns/
- DDD, Hexagonal, Onion, Clean, CQRS: https://herbertograca.com/2017/11/16/explicit-architecture-01-ddd-hexagonal-onion-clean-cqrs-how-i-put-it-all-together/
- Anemic model (Fowler): https://martinfowler.com/bliki/AnemicDomainModel.html
- Anemic model vs Full Domain Model (Fowler): https://martinfowler.com/bliki/AnemicDomainModel.html
?aaS Cloud course from Jeffrey Richter
Jeffry Richter Presentation, Why cloud apps, Embracing Failures, Orchestrators, Virtualization.pptx
Jeffrey Richter Presentation, Regions and Microservices.pptx
Jeffrey Richter, Scaling, 12-Factor, Containers.pptx
Jeffrey Richter, Docker, Hyper-V containers, Containerd runtime, CI and CD.pptx
Jeffrey Richter, API Versioning, Troubleshooting, Steps towards Microservices(what need to take into account).pptx
Jeffrey Richter, Idempotency, Retry Policy, Exactly Once Strategy.pptx
twelve factor application (12 factor explained)
Forward & Reverse Proxies
When people talk about a proxy, they usually mean a forward proxy.
Forward proxy - a kind of proxy used when you have several services that reach one resource on the internet. You may create a service which adds some information (headers) to such requests and\or transforms them somehow before forwarding (changes the destination address).
Good forward proxy example: Fiddler;
Good reverse proxy examples: Nginx, IIS, API Gateway, Load Balancer, etc.
You may need them for:
- Manage calls coming to your services (Being a facade of your services)
- Check whether incoming calls that carry basic authentication are allowed to interact with the service
- Load Balancing (at OSI layer 4 for TCP and UDP traffic and at OSI layer 7 for HTTP\HTTPS traffic)
- SSL termination - a request coming to the reverse proxy (Nginx, for example) may be HTTPS, but the forwarded request will be HTTP
- Caching mechanisms
- Throttling - to control the input, e.g. the number of requests per second
- Billing - to count the requests and help produce a bill for them
- DDoS mitigation
- Retry policy. The proxy may automatically retry against another service instance behind it if a certain one is unreachable (see the sketch below)
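Two of the responsibilities above, load balancing and retries, can be illustrated with a minimal Python sketch. The backend addresses and the round-robin strategy are purely illustrative assumptions, not taken from any of the course material:

```python
# Minimal illustration only: round-robin selection over backend instances with
# a simple retry policy. Backend URLs are invented for the example.
import itertools
import urllib.error
import urllib.request

BACKENDS = ["http://10.0.0.1:8080", "http://10.0.0.2:8080"]  # hypothetical instances
_ring = itertools.cycle(BACKENDS)

def forward(path: str, attempts: int = 3) -> bytes:
    """Forward a request to the next backend, retrying another instance if one is unreachable."""
    last_error: Exception | None = None
    for _ in range(attempts):
        backend = next(_ring)                                  # round-robin selection
        try:
            with urllib.request.urlopen(backend + path, timeout=2) as resp:
                return resp.read()                             # success: return the upstream body
        except (urllib.error.URLError, TimeoutError) as exc:
            last_error = exc                                   # unreachable: try the next instance
    raise RuntimeError(f"all backends failed: {last_error}")
```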
Reverse proxy example. RP-I is Reverse proxy and it plays a load balancer role here.
NOTE: It's impossible to keep endpoints in sync as service instances come\go. Client code must be robust against this
Another example of reverse proxy with load balancer + healthcheck role
Quality attributes. How to describe them better. Concrete examples.
Instead of just saying that performance is important to you, you need to create scenarios.
Each scenario should have 6 parts: the source of the stimulus, the stimulus itself (what's happening), the artifact affected, the environment (in which circumstances), the response (how to react), and the response measure.
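As an illustration only (not from the course), here is one way the six parts of a scenario could be captured as structured data; the concrete values are an invented performance example:

```python
# Illustrative sketch: the six parts of a quality attribute scenario as data.
# All concrete values below are made up for the example.
from dataclasses import dataclass

@dataclass
class QualityAttributeScenario:
    source: str            # where the stimulus comes from
    stimulus: str          # what happens
    artifact: str          # which part of the system is affected
    environment: str       # under which circumstances
    response: str          # how the system should react
    response_measure: str  # how the reaction is measured

performance_scenario = QualityAttributeScenario(
    source="End users",
    stimulus="1,000 search requests per second arrive",
    artifact="Search API",
    environment="Normal operation",
    response="All requests are processed",
    response_measure="95th percentile latency under 300 ms",
)
```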
Tactics: Availability. 3 types: fault detection, fault recovery, fault prevention
- as example: for Availability you need to care about
- Passive redundancy - copy your data\state to extra instances and make a hot replacement when needed
- Condition Monitoring - detect when a failure occurs in your instance and react to this event - restart the instance, isolate it, etc.
- Voting is used in airplanes, where several processes make the same calculations in parallel and the majority wins.
- Usually a system element such as a Monitor looks at the output of each process. If one of the processes shows wrong results, it is deemed to be faulty.
Tactics: Performance
One way to increase performance is to carefully manage the demand for resources. This can be done by reducing the number of events processed or by limiting the rate at which the system responds to events. In addition, a number of techniques can be applied to ensure that the resources that you do have are applied judiciously
- Manage work requests. One way to reduce work is to reduce the number of requests coming into the system to do work. Ways to do that include the following:
- Manage event arrival. A common way to manage event arrivals from an external system is to put in place a service level agreement (SLA) that specifies the maximum event arrival rate that you are willing to support.
- Manage sampling rate. In cases where the system cannot maintain adequate response levels, you can reduce the sampling frequency of the stimuli—for example, the rate at which data is received from a sensor or the number of video frames per second that you process. Of course, the price paid here is the fidelity of the video stream or the information you gather from the sensor data. Nevertheless, this is a viable strategy if the result is “good enough.”
- Limit event response. When discrete events arrive at the system (or component) too rapidly to be processed, then the events must be queued until they can be processed, or they are simply discarded. You may choose to process events only up to a set maximum rate, thereby ensuring predictable processing for the events that are actually processed.
- Reduce indirection. The use of intermediaries (so important for modifiability, as we saw in Chapter 8) increases the computational overhead in processing an event stream, so removing them improves latency. This is a classic modifiability/performance tradeoff. Separation of concerns—another linchpin of modifiability—can also increase the processing overhead necessary to service an event if it leads to an event being serviced by a chain of components rather than a single component.
- Co-locate communicating resources. Context switching and intercomponent communication costs add up, especially when the components are on different nodes on a network. One strategy for reducing computational overhead is to co-locate resources.
- Periodic cleaning. A special case of reducing computational overhead is to perform a periodic cleanup of resources that have become inefficient. For example, hash tables and virtual memory maps may require recalculation and reinitialization.
- Increase efficiency of resource usage. Improving the efficiency of algorithms used in critical areas can decrease latency and improve throughput and resource consumption.
Even if the demand for resources is not controllable, the management of these resources can be. Sometimes one resource can be traded for another.
- Increase resources. Faster processors, additional processors, additional memory
- Introduce concurrency. If requests can be processed in parallel, the blocked time can be reduced
- Maintain multiple copies of computations. This tactic reduces the contention that would occur if all requests for service were allocated to a single instance.
- Maintain multiple copies of data. Two common examples of maintaining multiple copies of data are data replication and caching.
- Bound queue sizes. This tactic controls the maximum number of queued arrivals and consequently the resources used to process the arrivals.
- Schedule resources. Whenever contention for a resource occurs, the resource must be scheduled. Processors are scheduled, buffers are scheduled, and networks are scheduled
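As a rough sketch (names and sizes are invented, not from the book), "introduce concurrency", "bound queue sizes", and "limit event response" from the list above could look like this in code:

```python
# Illustrative only: a bounded work queue drained by several worker threads.
import queue
import threading

work_queue: "queue.Queue[str]" = queue.Queue(maxsize=100)   # bound queue sizes

def handle(event: str) -> None:
    print("processing", event)                              # hypothetical event handler

def worker() -> None:
    while True:
        event = work_queue.get()                            # blocks until an event is available
        handle(event)
        work_queue.task_done()

# Introduce concurrency: several workers drain the same bounded queue.
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

def submit(event: str) -> bool:
    """Limit event response: reject the event if the queue is already full."""
    try:
        work_queue.put_nowait(event)
        return True
    except queue.Full:
        return False                                        # caller can shed load or log the drop
```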
Service Mesh
The service mesh pattern is used in microservice architectures. The main feature of the mesh is a sidecar—a kind of proxy that accompanies each microservice, and which provides broadly useful capabilities to address application-independent concerns such as interservice communications, monitoring, and security. A sidecar executes alongside each microservice and handles all interservice communication and coordination.
Load Balancer
A load balancer is a kind of intermediary that handles messages originating from some set of clients and determines which instance of a service should respond to those messages. The key to this pattern is that the load balancer serves as a single point of contact for incoming messages—for example, a single IP address.
Throttling
The throttling pattern is a packaging of the manage work requests tactic. It is used to limit access to some important resource or service. In this pattern, there is typically an intermediary—a throttler—that monitors (requests to) the service and determines whether an incoming request can be serviced.
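A common way to implement such a throttler is a token bucket; the sketch below is illustrative (the rate and capacity values are assumptions), not the book's reference implementation:

```python
# Illustrative token-bucket throttler: tokens refill at a fixed rate and each
# admitted request consumes one token.
import time

class TokenBucket:
    def __init__(self, rate_per_sec: float, capacity: float) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False                      # request should be rejected or queued

throttler = TokenBucket(rate_per_sec=10, capacity=20)   # assumed limits
if not throttler.allow():
    print("429 Too Many Requests")
```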
Map-Reduce
The map-reduce pattern efficiently performs a distributed and parallel sort of a large data set and provides a simple means for the programmer to specify the analysis to be done. Unlike our other patterns for performance, which are independent of any application, the map-reduce pattern is specifically designed to bring high performance to a specific kind of recurring problem: sort and analyze a large data set. This problem is experienced by any organization dealing with massive data.
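A toy, single-process word-count example can make the map / shuffle / reduce phases concrete (real frameworks distribute these phases across many machines; the data here is invented):

```python
# Toy map-reduce: map emits (key, value) pairs, shuffle groups them by key,
# reduce aggregates each group.
from collections import defaultdict

documents = ["to be or not to be", "to see or not to see"]

# Map phase: emit (word, 1) for every word.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key.
groups: dict[str, list[int]] = defaultdict(list)
for word, count in mapped:
    groups[word].append(count)

# Reduce phase: aggregate each group.
word_counts = {word: sum(counts) for word, counts in groups.items()}
print(word_counts)   # {'to': 4, 'be': 2, 'or': 2, 'not': 2, 'see': 2}
```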
Tactics: Deployability. Patterns
Rather than deploying to the entire user base, scaled rollouts deploy a new version of a service gradually, to controlled subsets of the user population, often with no explicit notification to those users. (The remainder of the user base continues to use the previous version of the service.) By gradually releasing, the effects of new deployments can be monitored and measured and, if necessary, rolled back. This tactic minimizes the potential negative impact of deploying a flawed service. It requires an architectural mechanism (not part of the service being deployed) to route a request from a user to either the new or old service, depending on that user’s identity.
If it is discovered that a deployment has defects or does not meet user expectations, then it can be “rolled back” to its prior state. Since deployments may involve multiple coordinated updates of multiple services and their data, the rollback mechanism must be able to keep track of all of these, or must be able to reverse the consequences of any update made by a deployment, ideally in a fully automated fashion.
Deployments are often complex and require many steps to be carried out and orchestrated precisely. For this reason, deployment is often scripted. These deployment scripts should be treated like code—documented, reviewed, tested, and version controlled. A scripting engine executes the deployment script automatically, saving time and minimizing opportunities for human error.
This tactic accommodates simultaneous deployment and execution of multiple versions of system services. Multiple requests from a client could be directed to either version in any sequence. Having multiple versions of the same service in operation, however, may introduce version incompatibilities. In such cases, the interactions between services need to be mediated so that version incompatibilities are proactively avoided. This tactic is a resource management strategy, obviating the need to completely replicate the resources so as to separately deploy the old and new versions.
This tactic packages an element together with its dependencies so that they get deployed together and so that the versions of the dependencies are consistent as the element moves from development into production. The dependencies may include libraries, OS versions, and utility containers (e.g., sidecar, service mesh), which we will discuss in Chapter 9. Three means of packaging dependencies are using containers, pods, or virtual machines; these are discussed in more detail in Chapter 16.
Even when your code is fully tested, you might encounter issues after deploying new features. For that reason, it is convenient to be able to integrate a “kill switch” (or feature toggle) for new features. The kill switch automatically disables a feature in your system at runtime, without forcing you to initiate a new deployment. This provides the ability to control deployed features without the cost and risk of actually redeploying services.
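A minimal sketch of such a feature toggle is shown below; in practice the flag values would come from a runtime configuration store such as Azure App Configuration (mentioned earlier in these notes), while the names and hard-coded values here are purely illustrative:

```python
# Illustrative kill-switch / feature-toggle check; flag values are hard-coded
# here but would normally be read from a configuration service at runtime.
FEATURE_FLAGS = {"new_checkout_flow": True}

def is_enabled(flag: str) -> bool:
    return FEATURE_FLAGS.get(flag, False)

def new_checkout() -> str:
    return "new flow"       # hypothetical new code path

def legacy_checkout() -> str:
    return "legacy flow"    # fallback when the switch is off

def checkout() -> str:
    if is_enabled("new_checkout_flow"):
        return new_checkout()
    return legacy_checkout()
```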
### Code organization point of view
Module Styles: Decomposition, Uses, Generalization, Layered, Data Model, Aspects styles
Module styles are closer to code. Used for:
- code construction,
- analysis (impact of changes, planning, or for budgeting concerns).
- education (onboarding new team-members).
- Build vs Buy. You can use Decomposition in order to understand what is better for you - build on your own or buy this product from 3rd party vendor.
Module Styles:
- Decomposition Style
- Uses Style
- Generalization Style
- Layered Style
- Data Model Style
- Aspects Style
Module Styles: Decomposition Style. Decomposition Refinement
Module Styles: Uses Style
- Useful for planning incremental development when you have several dependencies and need to know when all of them will be available to you;
- Useful in debugging and testing, because you can stub and mock your dependencies, or you want to isolate where an issue comes from;
- Useful to validate dependencies and avoid circular dependencies, which prevent incremental deployment and delivery;
- Useful for tracing changes when you want to guarantee that other dependencies will not suffer
Module Styles: Generalization Style
Module Styles: Layered Style
- A concentric diagram may not be equivalent to a stack diagram, because there is ambiguity: it's not clear whether B1, B2, and B3 can use each other (especially B1 and B3, because they touch each other).
Module Styles: Aspects Style (also known as Multi-dimensional separation of concerns)
- A style to depict relations between classes or sets of classes (their aspect modules)
- Could be used to understand the strategy for error handling, when a set of modules wants to use one protocol (handling policy) for handling errors if they occur.
- Could be useful for application decomposition
- Could be useful to understand the scalability of the solution
- A colored bar represents the aspect in a module
- Can be used only when the code already exists
- To understand scalability, the best strategy is to create a UML Class diagram
- For very large apps you need to use a non-graphical representation, because with dozens of classes and their aspects the UML diagram becomes unreadable
Module Styles: Data Model Style. ERD & UML Notation
Component and Connector Styles: Pipe and Filter, Client-Server, Service-Oriented (SOA), Publish-Subscribe, Shared-Data styles
Used for:
- Performance, Security, Availability (Runtime attributes) analysis
- Education
- Construction. C&C Style could describe behavior that elements must demonstrate when they work together
- Could help to describe how very specific part of the system works
Components and Connectors style:
- Pipe-and-filter Style
- Client-Server Style
- Service-Oriented Style
- Publish-Subscribe Style
- Shared-Data Style
- Repository Style
- Others
- DataFlow, Call-Return, Event-Based, Repository are Sub-families of C&C Styles
**Rules of C&C Style:**
- In a C&C style, a component can be a runtime unit of interaction or a data store. A component has ports to the outside world;
- A connector connects components. A connector has roles to which components' ports attach;
- Ports and roles are just special interfaces of components and connectors, respectively;
- The only relation between elements is attachment; it describes the attachment between components and connectors;
- In architecture, a connector is not just a procedure call; it can be a very sophisticated computation;
- A C&C diagram can carry quality attributes that help with analysis;
- Different C&C styles can be useful for different quality attributes
C&C Style: Pipe and Filter. Pipes and Filters in UML. Yahoo! Pipes
- Good for cases when data is transformed serially
- A series of filters or a series of pipes directly one after another is prohibited by the style
- Good for functional composition and data analysis (what the output could be, knowing the function and the input)
C&C Style: Client-Server
C&C Style: Service-Oriented (SOA). Web Services
- Useful for analyzing properties that can be associated with a service or with a service client;
- Services may be non-discoverable and not dynamically bound
- ESB - Enterprise Service Bus, a special component that takes over the routing of messages
Use cases:
- services made in different languages, for different platforms, or by different teams\organizations
- services which have different styles
- helps with the integration of external components
- for repackaging legacy systems: you can rehabilitate pieces of a legacy system one by one
C&C Style: Publish-Subscribe
- Pure event-based style
- Useful if you don't know all of your subscribers or their number.
- Useful for sending information to unknown recipients (see the sketch below)
- The event distributor can be depicted as a bus or a component
- Need to understand which components can listen to which events. Some events may not be public
- Components may or may not listen to their own events (sometimes the answer is yes, sometimes no)
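To make the style concrete, here is a minimal in-process publish-subscribe sketch (topic names and handlers are invented); the key point is that the publisher does not know who, or how many, the subscribers are:

```python
# Illustrative in-process event bus: publishers and subscribers only share a
# topic name, never direct references to each other.
from collections import defaultdict
from typing import Callable

_subscribers: dict[str, list[Callable[[dict], None]]] = defaultdict(list)

def subscribe(topic: str, handler: Callable[[dict], None]) -> None:
    _subscribers[topic].append(handler)

def publish(topic: str, event: dict) -> None:
    for handler in _subscribers[topic]:        # delivered to every interested party
        handler(event)

subscribe("order-created", lambda e: print("billing saw", e))
subscribe("order-created", lambda e: print("shipping saw", e))
publish("order-created", {"order_id": 42})     # the publisher is unaware of the two listeners
```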
C&C Style: Shared-Data
C&C Style: Combination of styles
C&C Style: Crosscutting issues in different C&C Styles. Grouping in Tiers. Tiers vs Layers. Multi-tier Notation
- Layers are used more in the module view; tiers in the C&C view.
- They can be mapped one-to-one, but it's also possible to map several layers to one tier.
- Tiers are used to check the cohesion of components in the C&C view.
- Tiers can also be transformed into packages in UML notation.
- Cross-process communication issues (when different processes share resources).
- Useful to declare connections between tiers (a tier is only allowed to communicate with these exact tiers and not with others)
- Tiers can be pass-through, but this should be explicitly declared
- Grouping components into tiers, for example by execution tier
- The Client tier can also describe what clients it has, thin or fat: thin clients are generally embedded within a web browser; fat clients must be installed on the client's machine.
- Tiers can be depicted as packages in UML notation
- We can declare how tiers communicate
Allocation Styles: Deployment, Install, Work Assignment styles
Allocation Style:
- Deployment Style
- Install Style
- Work Assignment Style
Allocation Styles: Deployment
Allocation Styles: Install
Allocation Styles: Work Assignment. "Who will do the job". Specializations: Platform-style, Competence-center, Open-Source
Architecture Views. Documenting Software Architecture. Properties to document in your Architecture Document
General. Select Properties to document. Examples in terms of Quality Attribute
Example:
- Performance attribute: you need to document best and worst response time properties, or the maximum number of events that an element can service per time unit (per second or per minute).
- Security: perhaps you need to document the level of encryption and the authorization rules for different elements and relations.
General. Apply Views. Information on Views. Structural vs Behavioral. Trace-Oriented vs Comprehensive Language
View Type: (C&C) Component And Connector View
Briefly: how your code maps onto resources, where it executes, and what major parameters it has:
- Processor cores vs processes
- Sockets
- REST APIs (dependency relationships between information carriers)
- Where your code is executed: Client Machine vs Server, PC or Mobile.
**Attachments:**
- Output from one port to another input port
**Quality of service information**
- Number of requests per hour
- Latency
View Type: Allocation View: Deployment, Implementation views. Their Refinements
Briefly: about the requirements for where and how the application runs
- Memory requirements
- Processor requirements
- Execution in processes and threads
The Allocation View consists of:
- Deployment view (how and where you deploy your app)
- Implementation View
- On the right we reveal that our connector is an event dispatcher
- It might help to understand how teams need to change their interfaces to interact with the dispatcher appropriately
- New questions for the architect may appear, and you potentially need to pay attention to how exactly the event bus will work and what limitations you now have
- The diagram on the left could be interesting to people who are not so interested in technical details; the one on the right, to people who are more concerned about design
Notations for Architecture View. Model View. Uses View. Generalization View
What View You need to Document. General pieces of advice
- Check who your stakeholders are
- What diagrams each of them needs to understand and sell your product
- Consolidate views (if their number is too high)
- The rationale for your design decisions
- Functional and non-functional attributes and constraints of your system
- A legend on all diagrams
Mapping requirements to design Decisions. Microsoft template
What Views to choose. 3 step principle: Build Stakeholders table, Combine Views (rule of thumb), Prioritize documentation. Part 1
- PS: It's also important to include at least one module diagram and one C&C diagram in your documentation.
- Build a stakeholders table
- Don't accept all their wishes; document the views that help them in their work, not the "just nice to have" ones
- Instead of 12 diagrams we selected 8 and decided to make a combined view for 2 of them, which means we need 7 diagrams
- This could still make our documentation redundant, so we need to prioritize the views and combine or simplify them if they are not super important
Combine Views. Hybrid Views. Overlay Part 2
- We had 2 small diagrams and decided to merge them together
- Such overlaying might reveal a component which is not well defined and help us understand how we are going to build it
Documenting: Behavior. Dynamic properties vs static properties. time to response (TTR), Throughput
- Behavior documentation is needed to declare the dynamic properties of the built system: time to response, throughput, etc.
- It supports analysis of the system as it executes.
- Answers "in what order do components interact".
- Depicts transitions from state to state
- What the system's status is under certain circumstances
- How the system starts up
- Can help guarantee that the system works as intended under a variety of conditions
Documenting: Behavior. When and why. Trace-oriented vs Comprehensive language
Documenting: Behavior. Trace-oriented Language & Comprehensive Language. BPMN Notation. Diagrams: Collaboration, Sequence, Activity
- A trace-oriented language answers the question "what happens when a particular stimulus arrives or in a specific state"
- It does not help us capture all possible behaviors unless you are collecting them
- A comprehensive language shows the complete behavior of the system
- Usually state machine or statechart diagrams
Documenting: Context Diagrams. Their Notations. System, environment, Relations. in C&C and Layered view.
Documenting: Decisions. Capturing Complex Architecture Decisions. 12 steps-parts document
- If a decision took 5 minutes, it's probably not worth documenting.
- If it took 5 days - then yes.
- Document future steps, especially if you have concerns about the decisions you have made. It's a good starting point for the next conversation with your manager.
Documenting: View Packets
- View packets are a way to split complex diagrams into parts.
- Each view packet has a parent, children, and\or siblings.
- If you have view packets, the overall document will look a bit different.
- View packets can also be very useful in ADD, because inside a view packet you can address specific quality attribute questions.
- Create a view and assign it particular responsibilities
- Make one huge, unwieldy diagram but use a tool to present it, with zoom-in, zoom-out, and fly-through abilities
- A series of diagrams
Views and Beyond. Template for the Beyond view and rationale
ISO\IEC 42010 (also known as IEEE 1471-2000). ISO42010 vs "Views and Beyond". Alternative to "Views and Beyond"
![image](https://user-images.githubusercontent.com/4239376/224509958-7a6c3bf4-936d-496d-9ad0-ca90d831ae67.png)
ISO 42010 vs Views and Beyond. Differences and Similarities
DoDAF. Alternatives to "Views and Beyond"
Documentation in Agile. Alternatives to "Views and Beyond"
- ISO42010 works well in an Agile environment, so it's better to go straight with this format
- Views and Beyond works well in Agile as well, and it can be made compliant with ISO42010 in an Agile environment
6 steps of Documentation Review Process
Architecture Evaluation. Approaches and Techniques. Evaluation Output
ATAM. Risk identification method
The point of ATAM is only to find risks, not to mitigate them. You can do that by eliciting the right questions to architects, senior designers, and key developers. So it is a risk identification method, not a risk resolution method; we do not provide precise analysis.
- After the architecture is created, but before much code is in place;
- To check an existing system's architecture and evaluate it;
- To decide whether we will build this system or buy it from a 3rd-party vendor;
Phase 0: gather a small group of architects and evaluators and discuss what you are going to evaluate, what you have, etc.
Utility Tree: each scenario gets two L/M/H ratings - the 1st for how important it is in terms of business (High vs. Low importance) and the 2nd for how risky it is (High vs. Low risk).
(H,H) scenarios are our main business scenarios and the drivers we must focus on.
Non-risks may become risks if the situation changes.
https://www.linkedin.com/pulse/architecture-10-rules-thumb-matthew-golzari/
https://medium.com/@i.gorton/six-rules-of-thumb-for-scaling-software-architectures-a831960414f9
Solution Architecture Principles in Practice.pdf
Solution Architecture Principles in Practice,_Student_Workbook_2020.pdf
SEI Software Architecture Professional Exam: https://www.sei.cmu.edu/education-outreach/courses/course.cfm?courseCode=V19
Service-based Architecture Professional Cert: https://www.sei.cmu.edu/education-outreach/credentials/credential.cfm?customel_datapageid_14047=15189
PowerShell materials coming from Skylines Academy
SLA, SLO, SLI
Difference: https://www.atlassian.com/incident-management/kpis/sla-vs-slo-vs-sli
SLI explained in detail: http://cs.brown.edu/courses/csci2952-f/slides/Class9.pdf
SLI of the platform:
- Critical Replica Threshold = CRT
- Available replicas = min(total available pods, CRT)
- Replica availability = (available replicas / CRT) * 100
- Critical Replica Availability = mean(replica availability of each service)

AZ+301 SKYLINES ACADEMY Slides_Student_Version.pdf
SECTION-1 Workload requirements
SECTION-2 Identity and Security
SECTION-3 Data Platform Solutions
SECTION-4 Business Continuity
SECTION-5 Deployment, Migration, Integration
SECTION-6 Infrastructure Strategy
Insights and everything related to that.
- All information related to logging and monitoring is grouped here: Logging and Monitoring Information
- Azure AD, Azure AD Connect, Azure AD Connect sync, Azure AD B2B, Azure AD B2C, Azure AD Conditional Policies
- Managed Identities, User-defined, System-defined
- Azure Key Vault, when to use
- SAS Token, when to use
All information related to Authorization and Authentication is here: Authentication, Authorization, Azure AD and features
Consider how Azure policy is different from role-based access control (RBAC).
It’s important not to confuse Azure Policy and Azure RBAC. Azure RBAC and Azure Policy should be used together to achieve full scope control.
- You use Azure Policy to ensure that the resource state is compliant with your organization's business rules. Compliance doesn't depend on who made the change or who has permission to make changes. Azure Policy will evaluate the state of a resource, and act to ensure the resource stays compliant.
- You use Azure RBAC to focus on user actions at different scopes. Azure RBAC manages who has access to Azure resources, what they can do with those resources, and what areas they can access. If actions need to be controlled, then use Azure RBAC. If an individual has access to complete an action, but the result is a non-compliant resource, Azure Policy still blocks the action.
Area | Azure Policy | Role-based Access Control |
---|---|---|
Description | Ensure resources are compliant with a set of rules. | Authorization system to provide fine-grained access controls. |
Focus | Focused on the properties of resources. | Focused on what resources the users can access. |
Implementation | Specify a set of rules. | Assign roles and scopes. |
Default access | By default, rules are set to allow. | By default, all access is denied. |
Trust Center, Compliance Manager, Data Protection, Azure Security and Compliance, Blueprints
- In-depth access to FedRAMP, ISO, SOC audit reports, data protection white papers, and various assessment reports
- Centralized resources around security, compliance and privacy
- Manage compliance from a central location
- Proactive risk assessment
- Insights and recommended actions
- Prepare compliance reports for audit
- Trust documents, GDPR, Compliance guides, Pen test and Security Assessment tests
Azure Security and Compliance, Blueprints
- Industry-specific overview and guidance
- Customer responsibilities matrix
- Reference architectures with threat models
Cache. Cache Replacement Policies. Performance Metrics
Lab. Microsoft Docs. Azure Blueprints. Additional Materials
Lab:
https://docs.microsoft.com/en-us/learn/paths/design-identity-governance-monitor-solutions/
Microsoft docs:
https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/govern/guides/
https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/govern/guides/standard/
With Azure Blueprints, the relationship between the blueprint definition (what should be deployed) and the blueprint assignment (what was deployed) is preserved.
In other words, Azure creates a record that associates a resource with the blueprint that defines it. This connection helps you track and audit your deployments. Azure Blueprints orchestrates the deployment of various resource templates and other artifacts.
How are Azure Blueprints different from Azure Policy?
- A policy is a default-allow and explicit-deny system focused on resource properties during deployment and for already existing resources. It supports cloud governance by validating that resources within a subscription adhere to requirements and standards.
- A policy can be included as one of many artifacts in a blueprint definition. Including a policy in a blueprint enables the creation of the right pattern or design during assignment of the blueprint. The policy inclusion makes sure that only approved or expected changes can be made to the environment to protect ongoing compliance to the intent of the blueprint.
Build a cloud governance strategy on Azure
Describe core Azure architectural components
Microsoft Cloud Adoption Framework for Azure
Intro to Azure blueprints
Azure SQL Server vs Azure SQL managed instance. Difference
* Azure SQL Server vs Azure SQL Managed Instance, difference: [CHECK THIS LINK](https://medium.com/awesome-azure/azure-difference-between-azure-sql-database-and-azure-sql-managed-instance-sql-mi-2e61e4485a65)
AKS. Persistent Volumes - Types of replication
In Case of AKS:
Infrastructure-based asynchronous replication
- Your apps might require persistent storage. In Kubernetes, you can use persistent volumes to persist data storage. These persistent volumes are mounted to a node VM and then exposed to the pods. Typically, you provide a common storage point where apps write their data. This data is then replicated across regions and accessed locally, as displayed in the following graphic.
Application-based asynchronous replication
- Kubernetes currently provides no native implementation for application-based asynchronous replication. However, because containers and Kubernetes are loosely coupled, you should be able to use any traditional app or language approach to replicate storage.
Consider Azure Backup or Velero
As with any app, it's important you back up the data related to your AKS clusters and their apps. When your apps consume and store data which is persisted on disks or in files, you should schedule frequent backups or take regular snapshots of that data. You can use several tools for these backup operations, including:
- Azure Disks: Azure Disks can use built-in snapshot technologies. However, your apps might need to flush writes-to-disk before the snapshot operation.
- Velero: Velero can back up persistent volumes along with additional cluster resources and configurations.
Business Critical Tier. General Purpose
- Business Critical Tier: The next service tier to consider is Business Critical, which can generally achieve the highest performance and availability of all Azure SQL service tiers (General Purpose, Hyperscale, Business Critical). Business Critical is meant for mission-critical applications that need low latency and minimal downtime.
Patterns used in Databases
- Hyperscale: The Hyperscale service tier is currently available for Azure SQL Database, and not Azure SQL Managed Instance. This service tier has a unique architecture because it uses a tiered layer of caches and page servers to expand the ability to quickly access database pages without having to access the data file directly.
Active geo-replication is available for:
- Azure SQL Database: You can configure active geo-replication for any database in any elastic database pool.
You can use active geo-replication to:
- Create a readable secondary replica in a different region.
- Fail over to a secondary database if your primary database fails or needs to be taken offline.
Materials are taken from this site: https://rajanieshkaushikk.com/2023/04/08/azure-blob-storage-vs-file-storage-vs-disk-storage-which-is-right-for-you/#:~:text=Azure%20File%20storage%20is%20not,low%20latency%20and%20high%20IOPS.
Traffic manager. General Info
Failover scenarios:
- Manually, by using Azure DNS: this failover solution uses the standard DNS mechanism to fail over to your backup site. This option works best when used in conjunction with the cold standby or the pilot light approaches.
- Automatically, by using Traffic Manager: with more complex architectures and multiple sets of resources capable of performing the same function, you can configure Azure Traffic Manager (based on DNS). Traffic Manager checks the health of your resources and routes the traffic from the non-healthy resource to the healthy resource automatically.
Approach | Description |
---|---|
Active/Passive with cold standby | Your VMs (and other appliances) that are running in the standby region aren't active until needed. However, your production environment is replicated to a different region. This approach is cost-effective but takes longer to undertake a complete failover. |
Active/Passive with pilot light | You establish the standby environment with a minimal configuration; it has only the necessary services running to support a minimal and critical set of apps. In its default form, this approach can only execute minimal functionality. However, it can scale up and spawn more services, as needed, to take more of the production load during a failover. |
Active/Passive with warm standby | Your standby region is pre-warmed and is ready to take the base load. Auto scaling is on, and all the instances are up and running. This approach isn't scaled to take the full production load but is functional, and all services are up and running. |
Kafka Basics. Record. Topics. Consumers. Consumer Groups. Load Balancing. Compression and Batching
Consumers are the applications that subscribe to (read and process) data from Kafka topics. Consumers subscribe **to one or more topics** and consume published messages by pulling data from the brokers.
A record is a message or an event that gets stored in Kafka. Essentially, it is the data that travels from producer to consumer through Kafka. A record contains a key, a value, a timestamp, and optional metadata headers.
Consumer Groups. A consumer group can have multiple consumers that subscribe to the same topic, allowing the system to process messages in parallel.
Consumer groups are a way to manage multiple consumers of a messaging system that work together to process messages from one or more topics.
Each consumer group ensures that all messages in the topic are processed, and each message is processed by only one consumer within the group.
This approach allows for parallel processing and load balancing among consumers.
For example, in Apache Kafka, a consumer group can have multiple consumers that subscribe to the same topic, allowing the system to process messages in parallel and distribute the workload evenly.
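As a minimal sketch of the idea (assuming the kafka-python client; the topic, broker, and group names are invented), running several copies of this consumer with the same group_id makes Kafka spread the topic's partitions across them, so each message is handled by only one member of the group:

```python
# Illustrative consumer-group member using kafka-python (assumed dependency).
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "orders",                            # topic (example name)
    bootstrap_servers="localhost:9092",  # example broker address
    group_id="order-processors",         # all members of this group share the partitions
    auto_offset_reset="earliest",
)

for record in consumer:
    # Each record is delivered to exactly one consumer within the group.
    print(record.partition, record.offset, record.value)
```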
Kafka uses an approach referred to as "sticky partitioning" to provide load balancing functionality.
This algorithm locks each message to a specific partition, and then distributes new messages to the next available partition in a round-robin fashion. This ensures that load is spread evenly across partitions and that partitions remain balanced.
Message batching is the process of combining multiple messages into a single batch before processing or transmitting them.
This approach can improve throughput and reduce the overhead of processing individual messages.
Compression, on the other hand, reduces the size of the messages, leading to less network bandwidth usage and faster transmission.
For example, Apache Kafka supports both batching and compression:
Producers can batch messages together, and the system can compress these batches using various compression algorithms like Snappy or Gzip, reducing the amount of data transmitted and improving overall performance.
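A producer-side sketch of batching and compression (again assuming kafka-python; the sizes and codec are illustrative): messages are accumulated up to batch_size bytes or linger_ms milliseconds, and each batch is compressed before being sent:

```python
# Illustrative producer configuration with batching and gzip compression.
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # example broker address
    compression_type="gzip",             # or "snappy"/"lz4" if those codecs are installed
    batch_size=32 * 1024,                # maximum bytes per batch
    linger_ms=20,                        # wait up to 20 ms to fill a batch
)

for i in range(1000):
    producer.send("orders", value=f"event-{i}".encode())
producer.flush()                         # make sure buffered batches are actually sent
```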
Kafka Cluster. ZooKeeper
Kafka is deployed as a cluster of one or more servers, where each server is responsible for running one Kafka broker.
ZooKeeper is a distributed key-value store and is used for coordination and storing configurations. It is highly optimized for reads.
Kafka uses ZooKeeper to coordinate between Kafka brokers; ZooKeeper maintains metadata information about the Kafka cluster.
Kafka vs RabbitMQ vs ActiveMQ vs Azure ServiceBus vs AWS SQS
Kafka is ideal for streaming data in an at-least-once manner and provides powerful features such as transmission of data over partitions, replication, and high-availability across multiple data centers.
It is optimized for large-scale data streaming and has built-in support for high throughput, low latency and scalability.
Azure Service Bus is used for messaging and not for streaming, and generally provides higher throughput and lower latency for scenarios where at-most-once messaging is required.
It supports mobile devices, and provides integration with other Azure services such as Azure Storage and Service Fabric.
Kafka is ideal for streaming data in an at-least-once manner and provides powerful features such as transmission of data over partitions, replication, and high-availability across multiple data centers.
It can be used to ingest data from multiple sources to multiple destinations and is optimized for large-scale streaming.
AWS SQS is not suitable for streaming and is used for message queuing scenarios where at-most-once delivery is required.
It supports mobile devices, and provides integration with other AWS services.
Messaging Patterns with Kafka. Point to Point. Pub-Sub. Request-Response. Fan-Out/Fan-In (Scatter-Gather). Dead Letter Queue
- Competing consumers
- Guaranteed delivery
- Content-based routing
- Routing slip
- Correlation identifier
- Routing by header
- Receiver-initiated workflow
- Routing using selectors
- Sagas
Which messaging pattern fits better for data stream processing?
The best messaging pattern for data stream processing is Publish/Subscribe. This pattern is typically used for passing data between applications, decoupling producers and consumers, and ensuring that messages are distributed to all interested parties in the system.
CAP Theorem
- Consistency ( C ): All nodes see the same data at the same time. This means users can read or write from/to any node in the system and will receive the same data. It is equivalent to having a single up-to-date copy of the data.
- Availability ( A ): Availability means every request received by a non-failing node in the system must result in a response. Even when severe network failures occur, every request must terminate. In simple terms, availability refers to a system's ability to remain accessible even if one or more nodes in the system go down.
- Partition tolerance ( P ): A partition is a communication break (or a network failure) between any two nodes in the system, i.e., both nodes are up but cannot communicate with each other.
A partition-tolerant system continues to operate even if there are partitions in the system. Such a system can sustain any network failure that does not result in the failure of the entire network.
Data is sufficiently replicated across combinations of nodes and networks to keep the system up through intermittent outages.
PACELC Theorem. General Info: ACID vs BASE. PACELC Theorem Examples
- We cannot avoid partition in a distributed system, therefore, according to the CAP theorem, a distributed system should choose between consistency or availability.
- ACID (Atomicity, Consistency, Isolation, Durability) databases, such as RDBMSs like MySQL, Oracle, and Microsoft SQL Server, chose consistency (refuse response if it cannot check with peers), while BASE (Basically Available, Soft-state, Eventually consistent) databases, such as NoSQL databases like MongoDB, Cassandra, and Redis, chose availability (respond with local data without ensuring it is the latest with its peers).
- Dynamo and Cassandra are PA/EL systems: They choose availability over consistency when a partition occurs; otherwise, they choose lower latency.
- BigTable and HBase are PC/EC systems: They will always choose consistency, giving up availability and lower latency.
- MongoDB can be considered PA/EC (default configuration): MongoDB works in a primary/secondaries configuration. In the default configuration, all writes and reads are performed on the primary.
As all replication is done asynchronously (from primary to secondaries), when there is a network partition in which primary is lost or becomes isolated on the minority side, there is a chance of losing data that is unreplicated to secondaries, hence there is a loss of consistency during partitions.
Therefore it can be concluded that in the case of a network partition, MongoDB chooses availability, but otherwise guarantees consistency. Alternately, when MongoDB is configured to write on majority replicas and read from the primary, it could be categorized as PC/EC.
Q: Where Consistent Hashing is used for Data Partitioning?
A: Amazon's Dynamo and Apache Cassandra use Consistent Hashing to distribute and replicate data across nodes
Q: In what other scenarios may we use Consistent Hashing for data servers?
A: In the following scenarios:
Any system working with a set of storage (or database) servers and needs to scale up or down based on the usage, e.g., the system could need more storage during Christmas because of high traffic.
Any distributed system that needs dynamic adjustment of its cache usage by adding or removing cache servers based on the traffic load.
Any system that wants to replicate its data shards to achieve high availability.
Data Partitioning. Data Replication. Naive approach
Data partitioning: It is the process of distributing data across a set of servers. It improves the scalability and performance of the system.
Data replication: It is the process of making multiple copies of data and storing them on different servers. It improves the availability and durability of the data across the system.
Data partition and replication strategies lie at the core of any distributed system. A carefully designed scheme for partitioning and replicating the data enhances the performance, availability, and reliability of the system and also defines how efficiently the system will be scaled and managed.
- How do we know on which node a particular piece of data will be stored?
- When we add or remove nodes, how do we know what data will be moved from existing nodes to the new nodes? Additionally, how can we minimize data movement when nodes join or leave?
PROS:
- Easy to create and understand
CONS:
- Hard to add or delete nodes
Consistent Hashing for Data Partitioning. Algorithm: MD5
Distributed systems can use Consistent Hashing to distribute data across nodes. Consistent Hashing maps data to physical nodes and ensures that only a small set of keys move when servers are added or removed.
Consistent Hashing stores the data managed by a distributed system in a ring. Each node in the ring is assigned a range of data.
Whenever the system needs to read or write data, the first step it performs is to apply the MD5 hashing algorithm to the key. The output of this hashing algorithm determines within which range the data lies and hence, on which node the data will be stored.
Thus, the hash generated from the key tells us the node where the data will be stored.
PROS:
- When a node is added or deleted, only a limited amount of data is affected
- When a node is deleted, the next node becomes responsible for all operations of the removed node
CONS:
- Each node in basic Consistent Hashing represents a real server, so the load distribution is not great
- Works well only in homogeneous systems. If you have different servers, you can't balance them well
- High chance of a hotspot issue (when one server is used more often than others)
Consistent Hashing. Virtual Nodes
- The load spreads more evenly across the physical nodes of the cluster by dividing the hash ranges into smaller subranges; this speeds up the rebalancing process after adding or removing nodes
- When a new node is added, it receives many Vnodes from the existing nodes to maintain a balanced cluster
- Many nodes participate in the rebuild process when a node needs to be rebuilt
- It's easier to maintain the data cluster if it consists of different machines (heterogeneous servers): more powerful machines may have more Vnodes than others (see the sketch below)
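The sketch below is a minimal illustration of a consistent-hash ring with virtual nodes, following the description above (MD5 hashing, each physical server owning many positions on the ring); the node names and vnode count are invented:

```python
# Illustrative consistent-hash ring with virtual nodes.
import bisect
import hashlib

class ConsistentHashRing:
    def __init__(self, nodes: list[str], vnodes: int = 100) -> None:
        self._ring: list[tuple[int, str]] = []
        for node in nodes:
            for i in range(vnodes):                       # each server gets many ring positions
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()
        self._hashes = [h for h, _ in self._ring]

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key: str) -> str:
        """Walk clockwise from the key's hash to the first vnode on the ring."""
        idx = bisect.bisect(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[idx][1]

ring = ConsistentHashRing(["server-a", "server-b", "server-c"])
print(ring.get_node("user:42"))   # stays stable unless the node set changes
```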
Step by Step guide by Design Gurus. 3 Steps: Clarify Requirements, Expectations & Estimations, System interface definition
- Ask questions about the problems you are trying to solve; design questions in an interview are open-ended
a) There is not ONE correct answer. You must clarify ambiguities
b) You need to know which aspects you must focus on
Question examples:
- Will users of our service be able to post tweets and follow other people?
- Should we also design to create and display the user's timeline?
- Will tweets contain photos and videos?
- Are we focusing on the backend only, or are we developing the front-end too?
- Will users be able to search tweets?
- Do we need to display hot trending topics?
- Will there be any push notification for new (or important) tweets?
- Define what APIs are expected from the system. This will establish the exact contract expected from the system and ensure we haven't gotten any requirements wrong. Some examples of APIs for our Twitter-like service will be:
postTweet(user_id, tweet_data, tweet_location, user_location, timestamp, …)
generateTimeline(user_id, current_time, user_location, …)
markTweetFavorite(user_id, tweet_id, timestamp, …)
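For illustration only, the same contract could be written down as typed function stubs (the field names, types, and return values are assumptions, not part of the guide):

```python
# Hypothetical typed stubs for the example APIs above.
from datetime import datetime

def post_tweet(user_id: int, tweet_data: str, tweet_location: str | None,
               user_location: str | None, timestamp: datetime) -> int:
    """Stores the tweet and returns the new tweet_id."""
    ...

def generate_timeline(user_id: int, current_time: datetime,
                      user_location: str | None = None) -> list[dict]:
    """Returns the tweets to show on the user's timeline."""
    ...

def mark_tweet_favorite(user_id: int, tweet_id: int, timestamp: datetime) -> None:
    """Records that user_id marked tweet_id as a favorite."""
    ...
```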
Step by Step guide by Design Gurus. Next 4 Steps: Define data model, Define Database Type, High-level design
- Defining the data model in the early part of the interview will clarify how data will flow between different system components. Later, it will guide data partitioning and management.
- The candidate should identify various system entities, how they will interact with each other, and different aspects of data management like storage, transportation, encryption, etc. Here are some entities for our Twitter-like service:
- User: UserID, Name, Email, DoB, CreationDate, LastLogin, etc.
- Tweet: TweetID, Content, TweetLocation, NumberOfLikes, TimeStamp, etc.
- UserFollow: UserID1, UserID2
- FavoriteTweets: UserID, TweetID, TimeStamp
Which database system should we use? Will NoSQL like Cassandra best fit our needs, or should we use a SQL-like solution? What kind of block storage should we use to store photos and videos?
Questions here:
1) Do we need to be ACID-compliant?
2) Do we need to support Strong Data Consistency and Transactions?
3) What data structure do we have?
4) Amount of Read\Write operations?
In the case of Twitter, the answer is probably NoSQL; we can sacrifice Strong Consistency for Availability (Low Latency).
In terms of type, we need to estimate the balance between read and write operations. Potentially, we read more often. So it's a NO for Cassandra, and it's probably better to go with MongoDB.
PS: If we are choosing among native NoSQL solutions, for DynamoDB there is only one good DB pattern, named "One Big Table". Lots of patterns eventually point you to going with One Big Table: https://www.alexdebrie.com/posts/dynamodb-single-table/
- Draw a block diagram with 5-6 boxes representing the core components of our system. We should identify enough components that are needed to solve the actual problem from end to end.
- For Twitter, at a high level, we will need multiple application servers to serve all the read/write requests, with load balancers in front of them for traffic distribution.
If we're assuming that we will have a lot more read traffic (compared to write), we can decide to have separate servers to handle these scenarios. On the back-end, we need an efficient database that can store all the tweets and support a large number of reads. We will also need a distributed file storage system for storing photos and videos.