Site updated: 2024-01-03 10:19:37
yulewei committed Jan 3, 2024
1 parent a8f7219 commit bd11869
Showing 4 changed files with 43 additions and 38 deletions.
28 changes: 15 additions & 13 deletions 2023/12/website-scalability-reliability-resilience/index.html
@@ -38,7 +38,7 @@
<meta property="og:image" content="https://static.nullwy.me/stability-google-service-failures-distribution.png">
<meta property="og:image" content="https://static.nullwy.me/stability-response-alibaba-1-5-10.png">
<meta property="article:published_time" content="2023-12-25T02:36:00.000Z">
<meta property="article:modified_time" content="2024-01-03T01:35:08.560Z">
<meta property="article:modified_time" content="2024-01-03T02:18:23.150Z">
<meta property="article:author" content="nullwy">
<meta property="article:tag" content="架构">
<meta property="article:tag" content="可靠性">
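@@ -445,7 +445,7 @@ <h1 id="可靠性和韧性设计">Reliability and Resilience Design</h1>
<li><strong>Time redundancy</strong>: achieves redundancy by performing the same operation more than once (retrying), for example running a program several times or transmitting multiple copies of the data.</li>
</ul>
<p>Hardware redundancy and software redundancy are together called structural redundancy. In contrast to time redundancy, hardware, software, and information redundancy are together called space redundancy. Hardware redundancy is common, whereas software redundancy is comparatively rare.</p>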
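<p>As a rough illustration of the time redundancy described above, the sketch below retries a failing operation a bounded number of times with a doubling delay. The names retry and make_flaky, the default attempt count, and the backoff policy are assumptions made for this example and do not come from the article.</p>

```python
import time


def retry(operation, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call operation() up to `attempts` times, doubling the wait between tries.

    Hypothetical helper for illustration: the last failure is re-raised
    once the attempt budget is used up.
    """
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # budget exhausted: surface the original error
            sleep(delay)
            delay *= 2  # exponential backoff between attempts


def make_flaky(failures):
    """Return a callable that fails `failures` times, then returns "ok"."""
    state = {"left": failures}

    def call():
        if state["left"] > 0:
            state["left"] -= 1
            raise ConnectionError("transient failure")
        return "ok"

    return call


if __name__ == "__main__":
    print(retry(make_flaky(2), attempts=3, base_delay=0.01))
```

<p>Retrying only helps with transient faults and is only safe when the operation is idempotent; a real client would usually also add jitter and restrict which exceptions are retried.</p>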
<p><strong>Typical reliability and resilience strategies for specific kinds of failure</strong></p>
<p><strong>Typical reliability and resilience strategies for specific kinds of failure</strong><sup class="footnote-ref"><a href="#fn20" id="fnref20">[20]</a></sup><sup class="footnote-ref"><a href="#fn5" id="fnref5:1">[5:1]</a></sup><sup class="footnote-ref"><a href="#fn7" id="fnref7:1">[7:1]</a></sup></p>
<ul>
<li><strong>Hardware and network failures</strong>
<ul>
@@ -479,7 +479,7 @@ <h1 id="可靠性和韧性设计">Reliability and Resilience Design</h1>
</ul>
</li>
</ul>
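<p>Cloud hardware infrastructure, such as AWS, Azure, and Alibaba Cloud, is divided by degree of physical isolation into <strong>availability zones</strong> (<a target="_blank" rel="noopener" href="https://en.wikipedia.org/wiki/Availability_zone">Availability Zone</a>, AZ) and <strong>regions</strong> (Region). A <strong>region</strong> is the geographic area where data centers are located, usually defined by the city that hosts them. On Alibaba Cloud<sup class="footnote-ref"><a href="#fn20" id="fnref20">[20]</a></sup>, for example, the China North 1 (Qingdao) region means the data centers are in Qingdao. An <strong>availability zone</strong> is an independent physical partition within a region; each availability zone contains one or more data centers with independent power, cooling, and networking. On Alibaba Cloud, for example, the China North 1 (Qingdao) region has two availability zones, Qingdao Zone B and Qingdao Zone C. Within a region, availability zones are connected to one another over the internal network. Faults are isolated between availability zones: if one zone fails, the other zones keep running normally.</p>
<p>Cloud hardware infrastructure, such as AWS, Azure, and Alibaba Cloud, is divided by degree of physical isolation into <strong>availability zones</strong> (<a target="_blank" rel="noopener" href="https://en.wikipedia.org/wiki/Availability_zone">Availability Zone</a>, AZ) and <strong>regions</strong> (Region). A <strong>region</strong> is the geographic area where data centers are located, usually defined by the city that hosts them. On Alibaba Cloud<sup class="footnote-ref"><a href="#fn21" id="fnref21">[21]</a></sup>, for example, the China North 1 (Qingdao) region means the data centers are in Qingdao. An <strong>availability zone</strong> is an independent physical partition within a region; each availability zone contains one or more data centers with independent power, cooling, and networking. On Alibaba Cloud, for example, the China North 1 (Qingdao) region has two availability zones, Qingdao Zone B and Qingdao Zone C. Within a region, availability zones are connected to one another over the internal network. Faults are isolated between availability zones: if one zone fails, the other zones keep running normally.</p>
<p>By scope of impact, failures fall into three levels: component, availability zone, and region. The fault-tolerance measures for each level are:</p>
<ul>
<li>Component-level failure: make components redundant to avoid single points of failure</li>
@@ -492,10 +492,10 @@ <h1 id="可靠性和韧性设计">Reliability and Resilience Design</h1>
<li>Detect problems quickly and accurately when they occur</li>
<li>Roll back changes safely and quickly when a problem appears</li>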
</ul>
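<p>Alibaba sums up these three change-management best practices in an easy-to-remember rule, the “<strong>three axes of change</strong>”: every change must support gradual rollout, monitoring, and rollback<sup class="footnote-ref"><a href="#fn21" id="fnref21">[21]</a></sup><sup class="footnote-ref"><a href="#fn22" id="fnref22">[22]</a></sup>. Another change-management best practice that improves system stability is to enable a <strong>change freeze</strong> (<a target="_blank" rel="noopener" href="https://en.wikipedia.org/wiki/Freeze_%28software_engineering%29">change freeze</a>) during important events such as high-stakes campaigns. During the freeze, all changes to the production environment are prohibited except special emergency releases.</p>
<p>For <strong>incident response</strong>, the core goal in improving system stability is to <strong>shorten the time to recovery (MTTR)</strong>. Alibaba's stability practice quantifies this goal as the “1-5-10 fast recovery” target: 1 minute to detect the fault and start the response, 5 minutes to locate it, and 10 minutes to recover<sup class="footnote-ref"><a href="#fn22" id="fnref22:1">[22:1]</a></sup><sup class="footnote-ref"><a href="#fn23" id="fnref23">[23]</a></sup>. Alibaba's 1-5-10 capability map is shown in the figure below<sup class="footnote-ref"><a href="#fn23" id="fnref23:1">[23:1]</a></sup></p>
<p>Alibaba sums up these three change-management best practices in an easy-to-remember rule, the “<strong>three axes of change</strong>”: every change must support gradual rollout, monitoring, and rollback<sup class="footnote-ref"><a href="#fn22" id="fnref22">[22]</a></sup><sup class="footnote-ref"><a href="#fn23" id="fnref23">[23]</a></sup>. Another change-management best practice that improves system stability is to enable a <strong>change freeze</strong> (<a target="_blank" rel="noopener" href="https://en.wikipedia.org/wiki/Freeze_%28software_engineering%29">change freeze</a>) during important events such as high-stakes campaigns. During the freeze, all changes to the production environment are prohibited except special emergency releases.</p>
<p>For <strong>incident response</strong>, the core goal in improving system stability is to <strong>shorten the time to recovery (MTTR)</strong>. Alibaba's stability practice quantifies this goal as the “1-5-10 fast recovery” target: 1 minute to detect the fault and start the response, 5 minutes to locate it, and 10 minutes to recover<sup class="footnote-ref"><a href="#fn22" id="fnref22:1">[22:1]</a></sup><sup class="footnote-ref"><a href="#fn23" id="fnref23:1">[23:1]</a></sup>. Alibaba's 1-5-10 capability map is shown in the figure below<sup class="footnote-ref"><a href="#fn24" id="fnref24">[24]</a></sup></p>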
<img width="750" alt="阿里“1-5-10 故障快恢”能力图谱" title="阿里“1-5-10 故障快恢”能力图谱" src="https://static.nullwy.me/stability-response-alibaba-1-5-10.png">
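<p>Similarly, Hellobike's incident response target is 5-5-10: 5 minutes to respond, 5 minutes to locate, and 10 minutes to recover<sup class="footnote-ref"><a href="#fn24" id="fnref24">[24]</a></sup></p>
<p>Similarly, Hellobike's incident response target is 5-5-10: 5 minutes to respond, 5 minutes to locate, and 10 minutes to recover<sup class="footnote-ref"><a href="#fn25" id="fnref25">[25]</a></sup></p>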
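<p>One way to make targets such as 1-5-10 or 5-5-10 checkable is to compare an incident's timeline against them. The sketch below is only an illustration: the Incident record, the function names, and above all the reading of each number as a separate per-phase budget in minutes are assumptions, since the sources cited here do not say whether the figures are per-phase or cumulative.</p>

```python
from dataclasses import dataclass

PHASES = ("detect", "locate", "recover")


@dataclass(frozen=True)
class Incident:
    """Hypothetical incident timeline, in minutes on a shared clock."""
    started: float
    detected: float
    located: float
    recovered: float


def phase_durations(incident):
    """Minutes spent in each phase of the response."""
    return {
        "detect": incident.detected - incident.started,
        "locate": incident.located - incident.detected,
        "recover": incident.recovered - incident.located,
    }


def missed_targets(incident, targets=(1, 5, 10)):
    """List the phases that ran over budget.

    Assumption: targets are per-phase budgets (detect, locate, recover),
    not cumulative deadlines measured from the start of the fault.
    """
    budget = dict(zip(PHASES, targets))
    spent = phase_durations(incident)
    return [phase for phase in PHASES if spent[phase] > budget[phase]]


if __name__ == "__main__":
    incident = Incident(started=0, detected=3, located=6, recovered=20)
    print(missed_targets(incident))               # checked against 1-5-10
    print(missed_targets(incident, (5, 5, 10)))   # checked against 5-5-10
```

<p>Under a cumulative reading, the same check would compare each timestamp with the running sum of the budgets instead.</p>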
<h1 id="参考资料">References</h1>
<hr class="footnotes-sep">
<section class="footnotes">
@@ -508,11 +508,11 @@ <h1 id="参考资料">References</h1>
</li>
<li id="fn4" class="footnote-item"><p>Alibaba Cloud Well-Architected Framework: Operational Excellence pillar: Incident management: Defining and recording incident severity levels <a target="_blank" rel="noopener" href="https://help.aliyun.com/document_detail/2536143.html">https://help.aliyun.com/document_detail/2536143.html</a> <a href="#fnref4" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn5" class="footnote-item"><p>2022-06 China Academy of Information and Communications Technology (CAICT): Guide to Building Distributed System Stability (2022) <a target="_blank" rel="noopener" href="http://www.caict.ac.cn/kxyj/qwfb/ztbg/202206/t20220620_404604.htm">http://www.caict.ac.cn/kxyj/qwfb/ztbg/202206/t20220620_404604.htm</a> <a target="_blank" rel="noopener" href="https://mp.weixin.qq.com/s/OkG3_pjtaQcB-cOupCNe-w">https://mp.weixin.qq.com/s/OkG3_pjtaQcB-cOupCNe-w</a> <a href="#fnref5" class="footnote-backref">↩︎</a></p>
<li id="fn5" class="footnote-item"><p>2022-06 China Academy of Information and Communications Technology (CAICT): Guide to Building Distributed System Stability (2022) <a target="_blank" rel="noopener" href="http://www.caict.ac.cn/kxyj/qwfb/ztbg/202206/t20220620_404604.htm">http://www.caict.ac.cn/kxyj/qwfb/ztbg/202206/t20220620_404604.htm</a> <a target="_blank" rel="noopener" href="https://mp.weixin.qq.com/s/OkG3_pjtaQcB-cOupCNe-w">https://mp.weixin.qq.com/s/OkG3_pjtaQcB-cOupCNe-w</a> <a href="#fnref5" class="footnote-backref">↩︎</a> <a href="#fnref5:1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn6" class="footnote-item"><p>AWS Well-Architected Framework: Concepts: Resiliency <a target="_blank" rel="noopener" href="https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-23/wat.concept.resiliency.en.html">https://wa.aws.amazon.com/wellarchitected/2020-07-02T19-33-23/wat.concept.resiliency.en.html</a> <a href="#fnref6" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn7" class="footnote-item"><p>2022-01 Microsoft: Resilience in Azure whitepaper <a target="_blank" rel="noopener" href="https://www.modb.pro/doc/109965">https://www.modb.pro/doc/109965</a> <a target="_blank" rel="noopener" href="https://web.archive.org/web/0/https://azure.microsoft.com/en-us/resources/resilience-in-azure-whitepaper/">https://web.archive.org/web/0/https://azure.microsoft.com/en-us/resources/resilience-in-azure-whitepaper/</a> <a href="#fnref7" class="footnote-backref">↩︎</a></p>
<li id="fn7" class="footnote-item"><p>2022-01 Microsoft: Resilience in Azure whitepaper <a target="_blank" rel="noopener" href="https://www.modb.pro/doc/109965">https://www.modb.pro/doc/109965</a> <a target="_blank" rel="noopener" href="https://web.archive.org/web/0/https://azure.microsoft.com/en-us/resources/resilience-in-azure-whitepaper/">https://web.archive.org/web/0/https://azure.microsoft.com/en-us/resources/resilience-in-azure-whitepaper/</a> <a href="#fnref7" class="footnote-backref">↩︎</a> <a href="#fnref7:1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn8" class="footnote-item"><p>Google, Building Secure and Reliable Systems, 2021, <a target="_blank" rel="noopener" href="https://book.douban.com/subject/35585206/">Douban</a>: Chapter 8, Design for Resilience <a href="#fnref8" class="footnote-backref">↩︎</a></p>
</li>
@@ -538,15 +538,17 @@ <h1 id="参考资料">References</h1>
</li>
<li id="fn19" class="footnote-item"><p>The Datacenter as a Computer, Barroso, Hölzle, Ranganathan, 3rd edition, 2018, <a target="_blank" rel="noopener" href="https://book.douban.com/subject/34950732/">Douban</a>: Chapter 7, Dealing with Failures and Repairs <a href="#fnref19" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn20" class="footnote-item"><p>Alibaba Cloud: Regions and zones <a target="_blank" rel="noopener" href="https://help.aliyun.com/document_detail/40654.html">https://help.aliyun.com/document_detail/40654.html</a> <a href="#fnref20" class="footnote-backref">↩︎</a></p>
<li id="fn20" class="footnote-item"><p>The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems, <a target="_blank" rel="noopener" href="https://en.wikipedia.org/wiki/Tom_Limoncelli">Tom Limoncelli</a>, 2014, <a target="_blank" rel="noopener" href="https://book.douban.com/subject/26865122/">Douban</a>: Chapter 6, Design Patterns for Resiliency <a href="#fnref20" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn21" class="footnote-item"><p>2020-03 Chen Xin (Alibaba): A Brief Look at Alibaba's DevOps Culture <a target="_blank" rel="noopener" href="https://mp.weixin.qq.com/s/h-F8dopr23pgvSoXjWfE8A">https://mp.weixin.qq.com/s/h-F8dopr23pgvSoXjWfE8A</a> <a href="#fnref21" class="footnote-backref">↩︎</a></p>
<li id="fn21" class="footnote-item"><p>Alibaba Cloud: Regions and zones <a target="_blank" rel="noopener" href="https://help.aliyun.com/document_detail/40654.html">https://help.aliyun.com/document_detail/40654.html</a> <a href="#fnref21" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn22" class="footnote-item"><p>Alibaba Cloud Well-Architected Framework: Stability pillar: Stability design <a target="_blank" rel="noopener" href="https://help.aliyun.com/document_detail/2573820.html">https://help.aliyun.com/document_detail/2573820.html</a> <a href="#fnref22" class="footnote-backref">↩︎</a> <a href="#fnref22:1" class="footnote-backref">↩︎</a></p>
<li id="fn22" class="footnote-item"><p>2020-03 Chen Xin (Alibaba): A Brief Look at Alibaba's DevOps Culture <a target="_blank" rel="noopener" href="https://mp.weixin.qq.com/s/h-F8dopr23pgvSoXjWfE8A">https://mp.weixin.qq.com/s/h-F8dopr23pgvSoXjWfE8A</a> <a href="#fnref22" class="footnote-backref">↩︎</a> <a href="#fnref22:1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn23" class="footnote-item"><p>2021-05 暴晓亚若厉 (Alibaba): Introduction to Alibaba GOC Stability Assurance (slides, 26p) <a target="_blank" rel="noopener" href="https://www.modb.pro/doc/31443">https://www.modb.pro/doc/31443</a> <a href="#fnref23" class="footnote-backref">↩︎</a> <a href="#fnref23:1" class="footnote-backref">↩︎</a></p>
<li id="fn23" class="footnote-item"><p>Alibaba Cloud Well-Architected Framework: Stability pillar: Stability design <a target="_blank" rel="noopener" href="https://help.aliyun.com/document_detail/2573820.html">https://help.aliyun.com/document_detail/2573820.html</a> <a href="#fnref23" class="footnote-backref">↩︎</a> <a href="#fnref23:1" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn24" class="footnote-item"><p>2022-04 Hellobike Tech: Stability Engineering Series, Part 1: Outline &amp; Methodology <a target="_blank" rel="noopener" href="https://segmentfault.com/a/1190000041671012">https://segmentfault.com/a/1190000041671012</a> <a href="#fnref24" class="footnote-backref">↩︎</a></p>
<li id="fn24" class="footnote-item"><p>2021-05 暴晓亚若厉 (Alibaba): Introduction to Alibaba GOC Stability Assurance (slides, 26p) <a target="_blank" rel="noopener" href="https://www.modb.pro/doc/31443">https://www.modb.pro/doc/31443</a> <a href="#fnref24" class="footnote-backref">↩︎</a></p>
</li>
<li id="fn25" class="footnote-item"><p>2022-04 Hellobike Tech: Stability Engineering Series, Part 1: Outline &amp; Methodology <a target="_blank" rel="noopener" href="https://segmentfault.com/a/1190000041671012">https://segmentfault.com/a/1190000041671012</a> <a href="#fnref25" class="footnote-backref">↩︎</a></p>
</li>
</ol>
</section>
6 changes: 3 additions & 3 deletions atom.xml

Large diffs are not rendered by default.
