当前位置: 首页 > news >正文

Distributed systems theory for the distributed systems engineer

Gwen Shapira, SA superstar and now full-time engineer at Cloudera, asked a question on Twitter that got me thinking.

My response of old might have been “well, here’s the FLP paper, and here’s the Paxos paper, and here’s the Byzantine generals paper…”, and I’d have prescribed a laundry list of primary source material which would have taken at least six months to get through if you rushed. But I’ve come to thinking that recommending a ton of theoretical papers is often precisely the wrong way to go about learning distributed systems theory (unless you are in a PhD program). Papers are usually deep, usually complex, and require both serious study, and usually significant experience to glean their important contributions and to place them in context. What good is requiring that level of expertise of engineers?

And yet, unfortunately, there’s a paucity of good ‘bridge’ material that summarises, distills and contextualises the important results and ideas in distributed systems theory; particularly material that does so without condescending. Considering that gap lead me to another interesting question:

What distributed systems theory should a distributed systems engineer know?

A little theory is, in this case, not such a dangerous thing. So I tried to come up with a list of what I consider the basic concepts that are applicable to my every-day job as a distributed systems engineer; what I consider ‘table stakes’ for distributed systems engineers competent enough to design a new system. Let me know what you think I missed!

 

First steps

These four readings do a pretty good job of explaining what about building distributed systems is challenging. Collectively they outline a set of abstract but technical difficulties that the distributed systems engineer has to overcome, and set the stage for the more detailed investigation in later sections

Distributed Systems for Fun and Profit is a short book which tries to cover some of the basic issues in distributed systems including the role of time and different strategies for replication.

Notes on distributed systems for young bloods – not theory, but a good practical counterbalance to keep the rest of your reading grounded.

A Note on Distributed Systems – a classic paper on why you can’t just pretend all remote interactions are like local objects.

The fallacies of distributed computing – 8 fallacies of distributed computing that set the stage for the kinds of things system designers forget.

Failure and Time

Many difficulties that the distributed systems engineer faces can be blamed on two underlying causes:

  1. processes may fail
  2. there is no good way to tell that they have done so

There is a very deep relationship between what, if anything, processes share about their knowledge oftime, what failure scenarios are possible to detect, and what algorithms and primitives may be correctly implemented. Most of the time, we assume that two different nodes have absolutely no shared knowledge of what time it is, or how quickly time passes.

You should know:

* The (partial) hierarchy of failure modes: crash stop -> omission -> Byzantine. You should understand that what is possible at the top of the hierarchy must be possible at lower levels, and what is impossible at lower levels must be impossible at higher levels.

* How you decide whether an event happened before another event in the absence of any shared clock. This means Lamport clocks and their generalisation to Vector clocks, but also see the Dynamo paper.

* How big an impact the possibility of even a single failure can actually have on our ability to implement correct distributed systems (see my notes on the FLP result below).

* Different models of time: synchronous, partially synchronous and asynchronous (links coming, when I find a good reference).

The basic tension of fault tolerance

A system that tolerates some faults without degrading must be able to act as though those faults had not occurred. This means usually that parts of the system must do work redundantly, but doing more work than is absolutely necessary typically carries a cost both in performance and resource consumption. This is the basic tension of adding fault tolerance to a system.

You should know:

* The quorum technique for ensuring single-copy serialisability. See Skeen’s original paper, but perhaps better is Wikipedia’s entry.

* About 2-phase-commit, 3-phase-commit and Paxos, and why they have different fault-tolerance properties.

* How eventual consistency, and other techniques, seek to avoid this tension at the cost of weaker guarantees about system behaviour. The Dynamo paper is a great place to start, but also Pat Helland’s classic Life Beyond Transactions is a must-read.

Basic primitives

There are few agreed-upon basic building blocks in distributed systems, but more are beginning to emerge. You should know what the following problems are, and where to find a solution for them:

* Leader election (e.g. the Bully algorithm)

* Consistent snapshotting (e.g. this classic paper from Chandy and Lamport)

* Consensus (see the blog posts on 2PC and Paxos above)

* Distributed state machine replication (Wikipedia is ok, Lampson’s paper is canonical but dry).

Fundamental Results

Some facts just need to be internalised. There are more than this, naturally, but here’s a flavour:

  • You can’t implement consistent storage and respond to all requests if you might drop messages between processes. This is the CAP theorem.
  • Consensus is impossible to implement in such a way that it both a) is always correct and b) always terminates if even one machine might fail in an asynchronous system with crash-* stop failures (the FLP result). The first slides – before the proof gets going – of my Papers We Love SF talk do a reasonable job of explaining the result, I hope. Suggestion: there’s no real need to understand the proof.
  • Consensus is impossible to solve in fewer than 2 rounds of messages in general

Real systems

The most important exercise to repeat is to read descriptions of new, real systems, and to critique their design decisions. Do this over and over again. Some suggestions:

Google:

GFS, Spanner, F1, Chubby, BigTable, MillWheel, Omega, Dapper. Paxos Made Live, The Tail At Scale.

Not Google:

Dryad, Cassandra, Ceph, RAMCloud, HyperDex, PNUTS

 

from: http://the-paper-trail.org/blog/distributed-systems-theory-for-the-distributed-systems-engineer/

转载于:https://www.cnblogs.com/cobbliu/p/5642499.html

相关文章:

  • python 学习 异常处理
  • c# 定时器
  • 图论(二分图最大权独立点集):COGS 2051. 王者之剑
  • flow.ci + Github + Slack 一步步搭建 Python 自动化持续集成
  • pct xcode7
  • 高并发性能调试经验分享
  • java中String类、StringBuilder类和StringBuffer类详解
  • 如何确认软件测试结束的呢?
  • SpringMVC(一)
  • 【转】web移动端一些常用知识
  • CSS Hack是什么意思
  • c++中while(cinstr)和ctrl z的相关问题探讨
  • 机器学习学习笔记1
  • $.ajax()方法详解
  • 【C语言入门教程】7.4 共用体
  • 《Javascript数据结构和算法》笔记-「字典和散列表」
  • android 一些 utils
  • Android组件 - 收藏集 - 掘金
  • Essential Studio for ASP.NET Web Forms 2017 v2,新增自定义树形网格工具栏
  • gitlab-ci配置详解(一)
  • Java教程_软件开发基础
  • leetcode378. Kth Smallest Element in a Sorted Matrix
  • maven工程打包jar以及java jar命令的classpath使用
  • MySQL-事务管理(基础)
  • node-sass 安装卡在 node scripts/install.js 解决办法
  • PAT A1120
  • Sass Day-01
  • 从零开始学习部署
  • 关于 Linux 进程的 UID、EUID、GID 和 EGID
  • 关于字符编码你应该知道的事情
  • 基于web的全景—— Pannellum小试
  • 区块链分支循环
  • 吐槽Javascript系列二:数组中的splice和slice方法
  • 硬币翻转问题,区间操作
  • puppet连载22:define用法
  • 策略 : 一文教你成为人工智能(AI)领域专家
  • 我们雇佣了一只大猴子...
  • ​ArcGIS Pro 如何批量删除字段
  • ​渐进式Web应用PWA的未来
  • ​力扣解法汇总946-验证栈序列
  • #stm32驱动外设模块总结w5500模块
  • (145)光线追踪距离场柔和阴影
  • (2)(2.4) TerraRanger Tower/Tower EVO(360度)
  • (done) NLP “bag-of-words“ 方法 (带有二元分类和多元分类两个例子)词袋模型、BoW
  • (html5)在移动端input输入搜索项后 输入法下面为什么不想百度那样出现前往? 而我的出现的是换行...
  • (vue)页面文件上传获取:action地址
  • (阿里云万网)-域名注册购买实名流程
  • (二)基于wpr_simulation 的Ros机器人运动控制,gazebo仿真
  • (二十四)Flask之flask-session组件
  • (附源码)node.js知识分享网站 毕业设计 202038
  • (免费领源码)Java#ssm#MySQL 创意商城03663-计算机毕业设计项目选题推荐
  • (区间dp) (经典例题) 石子合并
  • (三分钟了解debug)SLAM研究方向-Debug总结
  • (转)重识new
  • *p++,*(p++),*++p,(*p)++区别?