(Original translation)
Redundancy and Backup Model - Engineering
In engineering, redundancy is the duplication of critical components or functions of a system with the intention of increasing the reliability of the system, usually in the form of a backup or fail-safe.
In many safety-critical systems, such as fly-by-wire and hydraulic systems in aircraft, some parts of the control system may be triplicated, which is formally termed triple modular redundancy (TMR). An error in one component may then be out-voted by the other two. In a triply redundant system, the system has three subcomponents, all three of which must fail before the system fails. Since each one rarely fails, and the subcomponents are expected to fail independently, the probability of all three failing is calculated to be extremely small; it is often outweighed by other risk factors, such as human error. Redundancy may also be known by the terms "majority voting systems" or "voting logic".
Forms of redundancy
There are four major forms of redundancy:
· Hardware redundancy, such as DMR and TMR
· Information redundancy, such as error detection and correction methods
· Time redundancy, including transient fault detection methods such as alternate logic
· Software redundancy, such as N-version programming
A modified form of software redundancy, applied to hardware, may be:
· Distinct functional redundancy, such as both mechanical and hydraulic braking in a car. Applied to software, this means code that is written independently and is distinctly different but produces the same results for the same inputs.
DMR: A machine that is dual modular redundant has duplicated elements that work in parallel to provide one form of redundancy. A typical example is a complex computer system with duplicated nodes, so that should one node fail, another is ready to carry on its work. For instance, the Submarine Command System (SMCS) used on submarines of the Royal Navy employs duplicated central computing nodes, interconnected by a duplicated LAN.
A lockstep fault-tolerant machine uses replicated elements operating in parallel. At any time, all the replications of each element should be in the same state. The same inputs are provided to each replication, and the same outputs are expected. The outputs of the replications are compared using a voting circuit. A machine with two replications of each element is termed dual modular redundant (DMR). The voting circuit can then only detect a mismatch, and recovery relies on other methods. Examples include the 1ESS switch. A machine with three replications of each element is termed triple modular redundant (TMR). The voting circuit can determine which replication is in error when a two-to-one vote is observed. In this case, the voting circuit can output the correct result and discard the erroneous version. After this, the internal state of the erroneous replication is assumed to be different from that of the other two, and the voting circuit can switch to a DMR mode. This model can be applied to any larger number of replications.
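The difference between a DMR and a TMR voter can be sketched in a few lines of Python. This is only an illustration in software of what the text describes as a circuit; the function name `vote` and its return convention are assumptions made for the example.

```python
from collections import Counter

def vote(outputs):
    """Compare replica outputs and return (value, faulty_indices).

    value is None when no strict majority exists: the mismatch is
    detected but cannot be corrected (the two-replica DMR case).
    """
    counts = Counter(outputs)
    value, n = counts.most_common(1)[0]
    if n * 2 <= len(outputs):
        # No strict majority: every replica is suspect.
        return None, list(range(len(outputs)))
    # Replicas that disagree with the majority are flagged as faulty.
    faulty = [i for i, out in enumerate(outputs) if out != value]
    return value, faulty

print(vote([7, 9]))     # DMR: mismatch detected, no correction possible
print(vote([7, 9, 7]))  # TMR: replica 1 is out-voted and identified
```

After a two-to-one vote, a caller could drop the flagged replica and keep calling `vote` with the remaining two outputs, which corresponds to the switch to DMR mode mentioned above.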
TMR: In computing, triple modular redundancy, sometimes called triple-mode redundancy[1] (TMR), is a fault-tolerant form of N-modular redundancy, in which three systems perform a process and the result is processed by a voting system to produce a single output. If any one of the three systems fails, the other two systems can correct and mask the fault. If the voter fails, then the complete system will fail. However, in a good TMR system the voter is much more reliable than the other TMR components. Alternatively, if there is another stage of TMR logic following the current one (for example, in systems such as the Saturn Launch Vehicle Digital Computer), then three voters are used, one for each copy of the next stage of logic.
The TMR concept can be applied to many forms of redundancy, such as software redundancy in the form of N-version programming.
Some ECC memory uses triple modular redundancy hardware (rather than the more common Hamming code), because triple modular redundancy hardware is faster than Hamming error correction hardware.[2]
Space satellite systems often use TMR,[3][4][5] although satellite RAM usually uses Hamming error correction.[6]
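One way to picture TMR applied to stored data is a bitwise majority over three copies of a word. The sketch below is an assumption about how such a read could look in software; real ECC hardware is not described in the text, and the name `tmr_read` is hypothetical.

```python
def tmr_read(a: int, b: int, c: int) -> int:
    """Return the bitwise majority of three stored copies of a word.

    Each output bit is 1 when at least two of the three copies have
    that bit set, so a flipped bit in any single copy is masked.
    """
    return (a & b) | (a & c) | (b & c)

stored = 0b1011
# Two copies are damaged, but in different bit positions.
print(bin(tmr_read(0b1010, stored, 0b0011)))
```

The majority is a handful of AND and OR operations per bit, with no syndrome decoding step, which is consistent with the speed argument made above.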
To utilize triple modular redundancy, a ship must have at least three chronometers. With only two, a disagreement reveals that one is wrong but not which one. At one time, the cost of three sufficiently accurate chronometers was more than the cost of a smaller merchant vessel.[7] Some vessels carried more than three chronometers; for example, HMS Beagle carried 22 chronometers.[8]
Some communication systems use N-modular redundancy as a simple form of forward error correction. For example, 5-modular redundancy communication systems (such as FlexRay) use the majority of 5 samples; if any 2 of the 5 results are erroneous, the other 3 results can correct and mask the fault.
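The majority-of-5 idea can be sketched as a repetition code. This is a generic illustration, not FlexRay's actual sampling logic; the function names and the bit-list representation are assumptions made for the example.

```python
def encode(bits, n=5):
    """Repeat every bit n times (a simple repetition code)."""
    return [b for b in bits for _ in range(n)]

def decode(samples, n=5):
    """Recover each bit as the majority of its n samples."""
    return [
        int(sum(samples[i:i + n]) * 2 > n)
        for i in range(0, len(samples), n)
    ]

sent = encode([1, 0])
received = list(sent)
for i in (0, 3, 6, 9):        # flip two samples in each group of five
    received[i] ^= 1
print(decode(received))        # the original bits are recovered
```

With five samples per bit, up to two corrupted samples in a group are out-voted; a third corrupted sample would flip the decoded bit.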
N-version programming (NVP), also known as multiversion programming, is a method or process in software engineering where multiple functionally equivalent programs are independently generated from the same initial specifications.[1] The concept of N-version programming was introduced in 1977 by Liming Chen and Algirdas Avizienis with the central conjecture that the "independence of programming efforts will greatly reduce the probability of identical software faults occurring in two or more versions of the program".[1][2] The aim of NVP is to improve the reliability of software operation by building in fault tolerance or redundancy.[1]
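A toy illustration of N-version programming: three independently structured implementations of the same specification (integer square root) are run and their results voted on. The three versions and the `n_version` driver are hypothetical; the third version is deliberately fragile for large inputs because it rounds through floating point.

```python
import math
from collections import Counter

def isqrt_v1(n):
    """Version 1: standard-library integer square root."""
    return math.isqrt(n)

def isqrt_v2(n):
    """Version 2: binary search for the largest r with r*r <= n."""
    lo, hi = 0, n
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if mid * mid <= n:
            lo = mid
        else:
            hi = mid - 1
    return lo

def isqrt_v3(n):
    """Version 3: float-based; loses precision for very large n."""
    return int(math.sqrt(n))

def n_version(n, versions=(isqrt_v1, isqrt_v2, isqrt_v3)):
    """Run every version and return the strict-majority result."""
    results = [v(n) for v in versions]
    value, count = Counter(results).most_common(1)[0]
    if count * 2 <= len(results):
        raise RuntimeError("no majority among versions")
    return value

n = 10**18 - 1
print(isqrt_v3(n), n_version(n))  # the float-based fault is out-voted
```

The scheme only helps when the versions fail on different inputs, which is exactly the independence conjecture quoted above.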
Function of redundancy
The two functions of redundancy are passive redundancy and active redundancy. Both use extra capacity to prevent performance decline from exceeding specification limits without human intervention.
Passive redundancy uses excess capacity to reduce the impact of component failures. One common form of passive redundancy is the extra strength of cabling and struts used in bridges. This extra strength allows some structural components to fail without the bridge collapsing. The extra strength used in the design is called the margin of safety.
Eyes and ears provide working examples of passive redundancy. Vision loss in one eye does not cause blindness, but depth perception is impaired. Hearing loss in one ear does not cause deafness, but directionality is impaired. Performance decline is commonly associated with passive redundancy when a limited number of failures occur.
Active redundancy eliminates performance decline by monitoring the performance of individual devices, and this monitoring is used in voting logic. The voting logic is linked to switching that automatically reconfigures components. Error detection and correction and the Global Positioning System (GPS) are two examples of active redundancy.
Electrical power distribution provides an example of active redundancy. Several power lines connect each generation facility with customers. Each power line includes monitors that detect overload, and each also includes circuit breakers. The combination of power lines provides excess capacity. Circuit breakers disconnect a power line when monitors detect an overload, and power is redistributed across the remaining lines.
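The trip-and-redistribute behavior can be sketched as a small loop. This is a deliberately simplified model: it assumes the load is shared equally among the lines still connected, which real power networks do not do, and the function name and units are hypothetical.

```python
def redistribute(load, capacities):
    """Return {line_index: load} after overloaded lines have tripped.

    Simplifying assumption: the total load is shared equally among
    closed lines. Any line whose share exceeds its capacity trips,
    and the sharing is recomputed over the remaining lines.
    """
    closed = list(range(len(capacities)))
    while closed:
        share = load / len(closed)
        tripped = [i for i in closed if share > capacities[i]]
        if not tripped:
            return {i: share for i in closed}
        closed = [i for i in closed if i not in tripped]
    return {}  # every breaker is open: blackout

print(redistribute(90, [50, 50, 20]))   # the weak line trips, two carry on
print(redistribute(120, [50, 50, 20]))  # excess capacity is exhausted
```

The second call shows the limit of the scheme: once the combined excess capacity is used up, each trip overloads the remaining lines in turn.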
Voting logic
Voting logic uses performance monitoring to determine how to reconfigure individual components so that operation continues without violating the specification limits of the overall system. Voting logic often involves computers, but systems composed of items other than computers may also be reconfigured using voting logic. Circuit breakers are an example of a form of non-computer voting logic.
Electrical power systems use power scheduling to reconfigure active redundancy. Computing systems adjust the production output of each generating facility when other generating facilities are suddenly lost. This prevents blackout conditions during major events such as an earthquake.
The simplest voting logic in computing systems involves two components: primary and alternate. They both run similar software, but the output from the alternate remains inactive during normal operation. The primary monitors itself and periodically sends an activity message to the alternate as long as everything is OK. When the primary detects a fault, all outputs from the primary stop, including the activity message.
When the activity message ceases, the alternate activates its output and takes over from the primary after a brief delay. Errors in voting logic can cause both to have all outputs active at the same time, can cause both to have all outputs inactive at the same time, or can cause outputs to flutter on and off.
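The primary/alternate handover can be sketched as a tick-based simulation. The class name, the `timeout` parameter, and the use of integer ticks are assumptions made for the example; a real system would use timers and a communication link.

```python
class Alternate:
    """Standby unit that activates its output when heartbeats stop."""

    def __init__(self, timeout):
        self.timeout = timeout        # ticks of silence tolerated (hypothetical)
        self.last_heartbeat = 0
        self.active = False

    def on_heartbeat(self, now):
        """Record an activity message from the primary."""
        self.last_heartbeat = now

    def tick(self, now):
        """Take over once the silence exceeds the timeout."""
        if not self.active and now - self.last_heartbeat > self.timeout:
            self.active = True
        return self.active

def simulate(fault_at, timeout=2, ticks=8):
    """Return the tick at which the alternate takes over, or None."""
    alt = Alternate(timeout)
    takeover = None
    for now in range(1, ticks + 1):
        if now < fault_at:            # healthy primary sends its message
            alt.on_heartbeat(now)
        if alt.tick(now) and takeover is None:
            takeover = now
    return takeover

print(simulate(fault_at=4))    # takeover happens a few ticks after the fault
print(simulate(fault_at=100))  # primary stays healthy: no takeover
```

The `timeout` is the brief delay mentioned above. Setting it too short risks both units being active at once after a single late message; setting it too long leaves both outputs inactive for longer.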
A more reliable form of voting logic involves an odd number of devices, three or more. All perform identical functions, and the outputs are compared by the voting logic. The voting logic establishes a majority when there is a disagreement, and the majority acts to deactivate the output from any device that disagrees. A single fault will not interrupt normal operation. This technique is used with avionics systems, such as those responsible for operation of the space shuttle.
Calculating the probability of system failure
Each duplicate component added to the system decreases the probability of system failure according to the formula:
$$p = \prod_{i=1}^{n} p_i$$
where:
$n$ - number of components
$p_i$ - probability of component $i$ failing
$p$ - the probability of all components failing (system failure)
This formula assumes independence of failure events. That means that the probability of a component B failing given that a component A has already failed is the same as that of B failing when A has not failed. There are situations where this is unreasonable, such as using two power supplies connected to the same socket, where a failure of the socket takes down both supplies.
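The effect of the independence assumption can be seen numerically. The sketch below compares two independent power supplies with two supplies that share one socket; the shared-socket model and the probabilities are illustrative assumptions, not figures from the text.

```python
def p_fail_independent(p_a, p_b):
    """Both supplies fail, assuming the failures are independent."""
    return p_a * p_b

def p_fail_shared_socket(p_socket, p_a, p_b):
    """System fails if the shared socket fails, or if the socket
    works and both supplies fail independently."""
    return p_socket + (1 - p_socket) * p_a * p_b

print(p_fail_independent(0.01, 0.01))
print(p_fail_shared_socket(0.01, 0.01, 0.01))
```

With these numbers the shared socket dominates: the failure probability is roughly that of the socket alone, so the second supply adds almost nothing.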
It also assumes that only one component is needed to keep the system running. If $m$ components are needed for the system to survive, out of $n$, the probability of system failure is[citation needed]
$$p = \sum_{j=n-m+1}^{n} \binom{n}{j} q^{j} (1-q)^{n-j},$$
assuming all components have equal probability of failure $q$. For $m = 1$ this reduces to $p = q^{n}$, the probability that all components fail. This model is probably unrealistic in that it assumes that components are not replaced in time when they fail.
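The case where several components must survive can be computed as a binomial tail. In this sketch the symbols `n` (components), `m` (components needed), and `q` (failure probability of each component) are assumed names, and failures are taken to be independent with no repair, as in the model above.

```python
from math import comb

def p_system_failure(n, m, q):
    """Probability that fewer than m of n components survive.

    Assumes independent failures with the same probability q for
    every component; the system fails when more than n - m fail.
    """
    return sum(
        comb(n, j) * q**j * (1 - q)**(n - j)
        for j in range(n - m + 1, n + 1)
    )

print(p_system_failure(3, 1, 0.1))  # any one survivor is enough
print(p_system_failure(3, 2, 0.1))  # a two-of-three majority is required
```

Requiring a majority rather than a single survivor raises the failure probability noticeably, which is worth keeping in mind when the redundancy is used for voting rather than simple standby.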