VALIDATION OF A SURVIVABLE PUBLISH-SUBSCRIBE SYSTEM

Abstract: We describe, with respect to high-level survivability requirements, the validation of a survivable publish-subscribe system that is under development. We use a top-down approach that methodically breaks the task of validation into manageable tasks, and for each task, applies techniques best suited to its accomplishment. These efforts can be largely independent and use a variety of validation techniques, and the results, which complement and supplement each other, are seamlessly integrated to provide a convincing assurance argument. We also demonstrate the use of model-based validation techniques, as a part of the overall validation procedure, to guide the system’s design by exploring different configurations and evaluating trade-offs.


INTRODUCTION
The emergence of large distributed information systems to support nation-critical needs (e.g., electrical power distribution, telecommunications, military command and control, and health care) has spurred research into new system protection strategies that can produce a system whose critical function will survive in spite of hostile attacks, complex failures, or accidents [1]. These new strategies have drawn upon traditional approaches such as prevention technologies (e.g., cryptography, authentication, and firewalls), detection technologies (e.g., network sensors and host-based sensors), fault-tolerance technologies (e.g., agreement protocols), and the like, and combined them with newer technologies that enable dynamic defensive responses such as changing security policies or isolating suspected components. Successful strategies interleave these building blocks into a solution that remains cost-effective and manageable even when scaled to very large systems.
The validation of survivable systems, like their construction, must use an integrated approach. In addition to well-known security requirements such as confidentiality and integrity, a survivable system may also need to satisfy probabilistic constraints on its behavior. Such constraints impose different modeling and reasoning strategies. This is particularly true of survivable systems that make use of intrusion tolerance technologies [2,3,4], since by definition, intrusion tolerance is a probabilistically quantified property of the system. Moreover, since impairments may manifest themselves in implementation details, we must rely on focused testing to convince ourselves that the implementation is faithful to the modeled design. A successful validation approach must collect this disparate evidence into a single, seamlessly integrated assurance argument that is convincing to the accreditor.
Most traditional approaches to validation of security have been process-based, usually in the form of guidelines [5]. Goal-based evaluations, when attempted, have usually either been based on formal methods [6] and aimed to prove that certain security properties hold given a specified set of assumptions, or been informal, using teams of experts (often called "red teams," e.g., [7]) to try to compromise a system. Quantitative methods, in particular probabilistic modeling, have been receiving increasing attention as a mechanism to validate security. Any probabilistic model intended to validate a secure system would have to represent, among other things, the attacker's behavior. Since some of the vulnerabilities in an information system will be unknown at the time the system is designed (and modeled), the prediction of when and how an attacker may successfully intrude the system is a difficult, but critically important problem. Early work on probabilistic validation of secure systems was done by Littlewood et al. [8]. Their work was exploratory in nature and identified "effort" made by an attacker as an appropriate measure of the security of the system. Jonsson and Olovsson [9] attempted to build a quantitative model of attacker behavior using data from several experiments they conducted over a two-year period. They postulated that the process representing an attacker may be broken into multiple phases, each of which has an exponential time distribution. Attempts have been made to build models that take into account behavior of the system as well as the attacker, and the uncertainties therein. Madan et al. [10] have used a semi-Markov model to evaluate the security properties of the SITAR architecture, an intrusion-tolerant system. Their model does not explicitly represent the attacker or the vulnerabilities that may lead to intrusions, but represents the state of the system in terms of high-level events that may lead to failures. Sheyner et al. [11] have tried to build attack trees automatically using formal methods, and then analyze those trees using Bayesian networks. Ortalo et al. [12] have proposed modeling of known system vulnerabilities using "privilege graphs," followed by a combination of the privilege graphs with simple assumptions about attacker behavior to obtain "attack-state graphs." The latter can be analyzed using Markov techniques to obtain probabilistic measures of security. Singh et al. [13] have used probabilistic modeling to validate an intrusion-tolerant system, emphasizing the effects of intrusions on the system behavior and the ability of the intrusion-tolerant mechanisms to handle those effects, while using very simple assumptions about the discovery and exploitation of vulnerabilities by the attackers to achieve those intrusions. Gupta et al. [14] have used a similar approach to evaluate the security and performance of several intrusion-tolerant server architectures. In most of the above efforts, the designers chose modeling assumptions and model parameters without incorporating the justifications for those choices into a larger validation framework. Moreover, the representation of the attacker's behavior is highly simplified. The chief value of existing research in this area is that it demonstrated the applicability of these approaches as evaluation techniques that can be used as components of an integrated assurance argument.
International Scientific Journal of Computing, ISSN 1727-6209, www.tanet.edu.te.ua/computing, computing@tanet.edu.te.ua
The safety-critical systems community has been using "safety cases" [15] as a means to express arguments about the guaranteed safety of systems such as nuclear power plants. While safety cases allow disparate kinds of evidence to be incorporated into an overall argument, they primarily serve as a visual aid, and do not give formal guidelines as to how an argument/requirement is decomposed, especially if it is quantitative in nature. The models used for generating actual evidence in argument trees are usually simplistic; there is a lack of traceability between the actual design and the models; and quantitative evidence (typically generated by probabilistic models) is generally in the form of a leaf, with no argument subtree enumerating and justifying the assumptions made in the models. Furthermore, an approach based on an integrated argument has not been applied to validation of security-related properties.
We have recently developed and applied a validation method for survivable systems that includes the use of logical arguments, stochastic simulations, and experiment-based testing with actual system components (e.g., red teaming [7]). The need for such diversity is due, in part, to the basic nature of a survivability requirement, which can generally be decomposed into sub-requirements that differ with regard to the types of techniques that can be applied. We applied the method to a proposed intrusion-tolerant design for a publish-subscribe system (hereafter referred to as IT Pub-Sub), both to demonstrate that the proposed design satisfies imposed survivability requirements and to help us consider different protection trade-offs as we developed the final design. The resulting assurance argument incorporates each piece of assurance evidence as a supporting argument of some higher-level claim.
A major component of our validation approach was probabilistic modeling. The probabilistic models used an innovative attacker model. The attacker model has a sophisticated and detailed representation of various kinds of effects of intrusions on the behavior of system components (such as a variety of failure modes). It includes a representation of the process of discovery of vulnerabilities (both in the operating system(s) and in the specific applications being used by the system) and their subsequent exploitation, and considers an aggressive spread of attacks through the system by taking into account the connectivity of the components of the system, at both the infrastructure and the logical levels. We believe that this attacker model is applicable to a wide range of secure and intrusion-tolerant systems. Moreover, we demonstrate the use of probabilistic modeling, as embedded in our integrated validation procedure (IVP), to compare different design configurations, allowing the designers of the system to make choices that maximize the survivability of the system before it is actually implemented.
The remainder of the paper is organized as follows. Section 2 provides an overview of the publish-subscribe system studied. Section 3 provides the outline of our validation procedure. Section 4 provides the details and results of the validation procedure. Finally, we conclude in Section 5 with a summary of lessons learned and prospects for further evolution of our work.

OVERVIEW OF THE IT PUB-SUB DESIGN
The IT Pub-Sub system consists of multiple clients communicating with each other through a central core, as shown in Fig. 1. The core consists of three zones (or layers) of components: the crumple zone (outermost layer), the operations zone, and the executive zone (innermost layer). The network connectivity is constrained, through the use of network-interface-level hardware firewalls and configurable network switches, to limit direct network connectivity from the outside network to the inner zones. Communication between machines in each zone is accomplished via proprietary communication protocols.
The primary component in the crumple zone is the access proxy (AP). It functions as a bridge between the outer (public) network containing the clients and the inner core network for the remaining zones. It translates between multiple publicly supported communication interfaces (such as RMI and CORBA) and the single internal core communications protocol. Clients connect to the access proxy to publish, subscribe, and query IO. Attacks on clients generate alerts that are forwarded to the core via the AP, and commands to the client security components from the core pass back to the client via the AP.
Inside the crumple zone is the operations zone. It contains the components that perform the two main functions of the core: processing IO objects, and monitoring and maintaining the security of the core and the clients connected to it. IO processing is performed by three operations zone components: the PSQ server (PSQ), the downstream controller (DC), and the guardian (Gu). The PSQ server receives IO objects sent to the core via the AP in the crumple zone. The DC verifies the signatures on messages sent from clients to ensure data integrity. The guardian uses scenario-specific and domain-specific knowledge to identify corruption within the contents of the IO.
Security monitoring and maintenance are performed by local components distributed across all of the hosts in the client, the crumple zone, and the operations zone, together with a centralized correlator (Co) in the operations zone. The components local to individual hosts are sensors, actuators, and local controllers (LC). Sensors are dedicated to intrusion detection, actuators are mechanisms that carry out actions when commanded, and an LC is the control agent responsible for local survivability management.
The correlator receives alerts from multiple intrusion detection sensors within the client, crumple, and operations zones, filters out redundant and false alerts, and forwards serious messages to the system manager. The correlator interprets the alerts it receives in the context of the global state of the system, allowing it to better identify redundant alerts and false alarms.
The executive zone contains the system manager (SM). It serves as the master controller of the IT Pub-Sub. It monitors intrusion alerts received from the correlator and generates commands for the appropriate response to counter the intrusion. Human operators monitor the IT Pub-Sub via displays generated by the SM, and can manually initiate specific responses from the SM.

Fig. 1 - The IT Pub-Sub Architecture
The core is redundant, consisting of four quadrants (or quads) that run different operating systems. Each quadrant contains copies of all crumple, operations, and executive zone components discussed previously. Agreement protocols run among the SM and PSQ servers to ensure a common, collective view of the system state by human operators viewing it through the SM displays, and by clients interacting with the core via the access proxies in the crumple zone. The baseline configuration of the system used in our case study assigned the same operating system to all the hosts within a single quadrant of the core.

Fig. 2 - Publish Data Flow
Among the various data flows in the system, two are central to the validation effort described in the sections that follow. The first, which is depicted in Fig. 2, is the data flow among the system components during the publish operation. The publish operation begins when a client creates an IO to be published. The IO is signed using the client's private session key and sent to the access proxy in one of the quadrants through the "publish" protocol. The access proxy receives the IO through the isolation switch (IS) and sends it to the DC to verify whether the client is in a valid session. After successful verification, the AP sends the IO to the PSQ server in its quadrant. The PSQ server forwards the IO to the other PSQ servers in the other quadrants. Each PSQ server then stores the IO in its repository (Rep), sends to the client an acknowledgment of the receipt of the publication, and sends the IO to the guardian. The guardian performs domain-specific tests on the IO. If it finds an error in the IO, it sends an alert to the correlator. The correlator determines, based on the threat level of the alert and the alert state of the system, whether the alert is likely to represent an attack. If it is, the alert is forwarded to the SM. The SMs collectively decide whether they should tell the PSQ servers to recall the IO. If there are no recalls from the SM, the IO is accepted, and subscribing clients are notified that a new IO is available. The subscribing clients then query the system for the available IO.
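The publish data flow described above can be sketched as a small decision procedure. The Python sketch below is illustrative only: the function name `publish`, the `PublishOutcome` type, the boolean inputs, and the handling of a failed session check (the request is simply rejected) are assumptions made for this example and are not part of the IT Pub-Sub design.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PublishOutcome:
    accepted: bool
    trace: List[str]  # ordered list of the steps taken


def publish(session_valid: bool,
            guardian_finds_error: bool,
            correlator_flags_attack: bool,
            sms_decide_recall: bool) -> PublishOutcome:
    """Walk one IO through a simplified version of the publish data flow."""
    trace = ["client signs IO",
             "AP receives IO via IS",
             "DC checks client session"]
    if not session_valid:
        # Assumption: the text does not say what happens on failure.
        trace.append("rejected: invalid session")
        return PublishOutcome(False, trace)

    trace += ["AP sends IO to PSQ",
              "PSQ forwards IO to peer PSQs",
              "PSQs store IO in Rep",
              "PSQs acknowledge client",
              "guardian tests IO"]
    if guardian_finds_error:
        trace.append("guardian alerts correlator")
        if correlator_flags_attack:
            trace.append("correlator forwards alert to SM")
            if sms_decide_recall:
                trace.append("SMs recall IO")
                return PublishOutcome(False, trace)

    trace.append("IO accepted; subscribers notified")
    return PublishOutcome(True, trace)
```

Calling `publish(True, True, True, True)` follows the recall path, while `publish(True, True, False, True)` ends in acceptance because the correlator never forwards the alert to the SM.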

Fig. 3 - Alert/Response Data Flow of a Client
The second data flow of interest is the alert/response data flow associated with the intrusion detection and tolerance capabilities of the system. Alerts are generated on the clients and on components in the core, and the data flow is similar for both cases. The data flow for client alerts is presented in Fig. 3 and described here. When a sensor in a client host detects an anomalous condition/attack, it generates an alert and sends it to the LC on the same host. The LC can either make a local response, such as restarting a process or replacing a corrupted file, or forward the alert to the core for a coordinated adaptive response by the correlator. The correlator then decides (using correlation with other alerts) if this alert is critical, determines the compromised client(s), and informs the SM in its quadrant. The SMs propagate the reports within the SM group and reach a consensus to take action. Each SM then generates a command for the desired response on the client generating the alert and sends it to the client's LC via the DC and AP. The LC commands the appropriate local actuator on the client to take action. If the SM group determines that a more drastic response is appropriate, the client can be quarantined. Each SM notifies the policy server (PS) for the hardware-based firewall system and isolates the compromised client.

VALIDATION APPROACH
Before we delve into the details of the validation, we clarify the nature of high-level survivability requirements and formulate a theoretical basis for their decomposition into sub-requirements. We then provide an outline of the procedure we employed to validate the IT Pub-Sub with respect to particular high-level requirements.

Survivability Requirements
As defined in [1], survivability is the capability of a system to fulfill its mission, in a timely manner, in the presence of attacks, failures, or accidents. A system is survivable if it has the above-stated capability according to a specified set of survivability requirements. The latter are statements that collectively imply what is meant by the system's "capability to fulfill its mission in an adverse operational environment." (In what follows, it is assumed that "in a timely manner" is accounted for as part of "mission fulfillment.") A survivable system's operational environment includes both mission-specific interactions on the part of intended users (its use environment) and adverse interactions due to attacks and faults (the attack/fault environment).
The term "capability" is particularly important in defining survivability requirements. In some instances, it suffices to identify capability with "certainty," i.e., fulfillment occurs with probability 1. However, capability often needs to be otherwise quantified, and that often involves probabilities. To illustrate this, let us suppose X is some service that is essential to fulfilling the system's mission. Then a corresponding survivability requirement for the system, call it ExR for "Example Requirement," could be the following.
ExR: Whenever requested, X is successfully delivered with a probability of at least $p_X$ ($0 < p_X < 1.0$),
where "successfully delivered" needs to be further elaborated in terms relating to both the system and the mission objectives. For example, suppose that X involves the transfer of a data object from one system user to another. Then, successful delivery (provision) of this service, among other things, can depend on the end-to-end data flow required to realize X, timeliness of the data flow, and integrity of the received data object, along with other security properties such as authentication (the sender is an authorized user) and confidentiality (the data object is not disclosed to an unauthorized recipient).
Generally, a system requirement can be viewed as a predicate R(s), where the variable s refers to a system along with relevant aspects of its operational environment. When applied to a specific system S, the proposition R = R(S) is then a requirement for S, where R is satisfied by S (alternatively, S is valid with respect to R) if the truth value of R is true. When a requirement R for a particular system is being stated, the system is usually understood from context, and hence not referred to explicitly in the statement of R. (See the definition of ExR above, in which we omitted saying "by the system" after "successfully delivered.") This practice will be followed throughout the paper.
With a slight abuse of terminology, we regard a requirement as being quantitative if deciding whether R is satisfied by the system entails the evaluation of at least one quantitative measure. In the case of high-level survivability requirements, such quantification is likely probabilistic, as illustrated in the example given above. This is due mainly to uncertainties in the system's operational environment (i.e., use environment and attack/fault environment) together with defense mechanisms that may behave probabilistically. Evaluation of the corresponding measure(s) can be based on a model of the system and its operational environment, experimentation with an actual system (operating in an actual or simulated environment), or some combination of the two. In addition to measure evaluation, other techniques need to be brought into play to determine whether a quantitative requirement is satisfied. If a requirement is non-quantitative, then its satisfaction can generally be decided without direct invocation of evaluation results for any quantitative measure. (Indirectly, such results can provide evidence that the requirement is indeed non-quantitative.)
Generally, a requirement is specified as a constraint (typically a bound or equality) on the probabilities of one or more events. This includes non-quantitative requirements, which can be viewed as events with probability 1. Moreover, such probabilities may be conditional, meaning that they are conditioned on the satisfaction of one or more preconditions. For example, the validation effort might initially focus on the survivability of the system, assuming it has been successfully bootstrapped. That would make it possible to deal with the bootstrapping process separately, if needed, perhaps later on in the validation procedure. In the case of ExR, we may define:
$E_{ExR}$ = X is delivered successfully upon request.
$C_{ExR}$ = The system has been successfully bootstrapped, which includes starting the application that provides the service X.
Each event (including preconditions) may be stated as a conjunction of simpler events, both for clarity and ease of subsequent requirement decomposition. Since events are technically sets in the context of probability theory, throughout the paper, "conjunction between events" describes the corresponding set intersection.
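To make the role of the precondition concrete, the following Python sketch checks a requirement of the form P[E | C] >= p_X against a set of observations. It assumes that ExR is read as a bound on the conditional probability of the event E given the precondition C; the function names and the observation data are hypothetical and serve only to illustrate the idea.

```python
from typing import List, Tuple


def conditional_frequency(outcomes: List[Tuple[bool, bool]]) -> float:
    """Empirical estimate of P[E | C] from (c_holds, e_holds) pairs."""
    # Keep only the observations in which the precondition C held.
    conditioned = [e for c, e in outcomes if c]
    if not conditioned:
        raise ValueError("precondition C never held; P[E | C] is undefined")
    return sum(conditioned) / len(conditioned)


def satisfies_requirement(outcomes: List[Tuple[bool, bool]], p_x: float) -> bool:
    """Check the (assumed) form of ExR: P[E | C] >= p_x."""
    return conditional_frequency(outcomes) >= p_x


# Hypothetical data: 100 requests after a successful bootstrap (97 delivered),
# plus 10 requests made while the precondition did not hold.
observations = ([(True, True)] * 97
                + [(True, False)] * 3
                + [(False, False)] * 10)
```

Observations in which the precondition fails are excluded from the estimate, which mirrors the idea of treating the bootstrapping process separately from the service itself.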

Requirement Decomposition
If feasible, it is helpful to logically decompose a requirement into (more specific) sub-requirements, such that if all of the sub-requirements are satisfied by the system, then the original requirement is satisfied. The primary purpose of the decomposition is to obtain sub-requirements that can be proved using the tools available to the validators for logical argumentation and probabilistic modeling. Our formalism is somewhat similar to the formal composition and refinement framework developed by Abadi and Lamport [16,17] and Shankar [18]. However, their logical formulations do not have the probabilistic underpinnings needed to deal with quantitative survivability requirements, which our formalism aims to provide.
More precisely, a logical decomposition (LD) of a requirement R is a set of two or more requirements (the sub-requirements of R) such that their (logical) conjunction implies R. Formally, let $\{R_1, R_2, \ldots, R_m\}$ denote the set of sub-requirements of requirement R ($m \geq 2$) and let $\wedge$ denote conjunction (the logic operator AND). Then,
$(R_1 \wedge R_2 \wedge \ldots \wedge R_m) \Rightarrow R$,
i.e., the conjunction of the sub-requirements (logically) implies R. Accordingly, in order to validate the system with respect to R, it suffices to validate it with respect to each of the sub-requirements. We rule out trivial conjunctions such as $R \wedge R$ and $R \wedge Taut$, where Taut is a tautology (is always true). We also rule out degenerate LDs for which the conjunction $(R_1 \wedge R_2 \wedge \ldots \wedge R_m)$ is a contradiction (is never true), e.g., the sub-requirements are inconsistent, since in that case (by the definition of logical implication), requirement R would be trivially satisfied.
A requirement is decomposable if it admits to an LD. An LD of R is typically determined by finding some initial LD of R and then iteratively determining LDs of decomposable subrequirements. Note that conditions of the LD definition are preserved by such iterations by the transitivity of ⇒.
A requirement R is atomic if it is not logically decomposed. This applies to the original requirement (if it is not decomposed) or to any subrequirement that is not further decomposed. It is important to note that the LD process is guided by the need to obtain manageable sub-requirements, so while a requirement may be atomic because it is not decomposable (does not admit to an LD), in practice, a requirement is often atomic because a beneficial LD is not evident or because the requirement is basic enough to be dealt with effectively without further decomposition. If R is decomposed in the iterative manner described above, the decomposition may be visualized as a tree, with R as the root node and the atomic subrequirements as the leaf nodes. It follows (due to the preservation of the LD condition) that R is satisfied if all of its atomic sub-requirements (leaf nodes for the tree rooted at R) are satisfied.
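The decomposition tree can be represented directly as a data structure. The Python sketch below is a minimal illustration under our own naming (the `Requirement` class and its methods are hypothetical): leaves are atomic sub-requirements carrying a validation verdict, and the root is reported as validated when every leaf is satisfied.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Requirement:
    name: str
    # Validation verdict; only meaningful for atomic (leaf) requirements.
    satisfied: Optional[bool] = None
    children: List["Requirement"] = field(default_factory=list)

    def is_atomic(self) -> bool:
        """A requirement is atomic if it has not been decomposed."""
        return not self.children

    def atomic_leaves(self) -> List["Requirement"]:
        """Collect the atomic sub-requirements in left-to-right order."""
        if self.is_atomic():
            return [self]
        leaves: List["Requirement"] = []
        for child in self.children:
            leaves.extend(child.atomic_leaves())
        return leaves

    def validated(self) -> bool:
        """True if every atomic leaf is satisfied (a sufficient condition)."""
        return all(leaf.satisfied is True for leaf in self.atomic_leaves())


root = Requirement("R", children=[
    Requirement("R1", satisfied=True),
    Requirement("R2", children=[
        Requirement("R21", satisfied=True),
        Requirement("R22", satisfied=False),
    ]),
])
```

Because an LD only guarantees that the conjunction of the sub-requirements implies R, a `False` result from `validated()` means that R has not been shown to hold by this decomposition; it does not by itself show that R is violated.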
Generally, the sub-requirements of an LD of R can be either quantitative or non-quantitative (see Section 3.1). If R is quantitative, then at least one of its sub-requirements must also be quantitative. The reason is that logical decomposition cannot eliminate the quantitative aspect(s) of R. Moreover, it is possible for an LD of a quantitative requirement to contain more than one quantitative sub-requirement.

Fig. 4 - A Flowchart Depicting the IVP
Consider the following example. Suppose we have a requirement R, defined as
$R: P[E_1 \wedge E_2] \geq p$,
where $E_1$ and $E_2$ are two events, such that neither of them is expected to occur with probability 1. Suppose further that these events are not statistically independent. Nevertheless, R can still be logically decomposed into two sub-requirements
$R_1: P[E_1] \geq p_1$ and $R_2: P[E_2 | E_1] \geq p / p_1$,
where it follows from elementary probability theory that $R_1 \wedge R_2 \Rightarrow R$. The value of $p_1$ can be specified by using a probabilistic model of the relevant portion of the system to evaluate $P[E_1]$ and letting $p_1 = P[E_1]$, thereby satisfying $R_1$. Another model can then be used to evaluate $P[E_2 | E_1]$ so as to determine whether $R_2$ is satisfied. Note that (due to the use of implication), it is sufficient to do this for one value of $p_1$, and that the independence of the events involved is not a prerequisite for decomposition.
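The implication in this example can be checked with the product rule for conditional probabilities. The derivation below assumes that the requirement and its sub-requirements have the forms $R: P[E_1 \wedge E_2] \geq p$, $R_1: P[E_1] \geq p_1$, and $R_2: P[E_2 | E_1] \geq p / p_1$, with $0 < p_1 \leq 1$; the symbol p and the numeric values in the comment are illustrative assumptions.

```latex
\begin{align*}
P[E_1 \wedge E_2] &= P[E_1] \, P[E_2 \mid E_1]
    && \text{(product rule; no independence needed)} \\
                  &\geq p_1 \cdot \frac{p}{p_1}
    && \text{(by } R_1 \text{ and } R_2 \text{)} \\
                  &= p .
\end{align*}
% Illustrative numbers (assumed): with p = 0.90 and p_1 = P[E_1] = 0.95,
% R_2 requires P[E_2 | E_1] >= 0.90 / 0.95, which is approximately 0.947.
```

Note that $R_2$ can only be met when $p \leq p_1$, since a conditional probability cannot exceed 1.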

Outline of the Validation Procedure
Although it is sometimes possible to obtain a validation result by applying a single validation technique (e.g, logical argumentation, if the requirement is not quantitative), we find that validation with respect to high-level quantitative survivability requirements calls for an integrated application of several techniques. Indeed, we believe that the means of accomplishing this is a distinguishing feature of the effort reported herein. We used the following integrated validation procedure (IVP) to validate the IT Pub-Sub with respect to a quantitative survivability requirement, say R. The quantified aspects of R are assumed to be probabilistic (e.g., probabilities, moments of random variables, and so forth). The IVP is summarized in Fig. 4, and the steps are: 1. Formulate a precise statement of R, including any assumed preconditions regarding the system and/or its operational environment. The purpose of this step is to make the goals of the validation exercise absolutely clear right at the onset, and also to specify exactly the scope of the validation by explicitly enumerating the preconditions. We chose propositional logic in combination with a simple probabilistic formulation as the specification formalism, since it easily supports subsequent decomposition as described in Section 3.2. 2. If R admits to logical decomposition, decompose it iteratively using the method described in Section 3.2, thereby determining its corresponding atomic sub-requirements. Otherwise, R is the only atomic requirement. This step is intended to break R into manageable sub-requirements that can be addressed by the tools and techniques used for probabilistic modeling or logical argumentation. Each decomposition of a requirement into subrequirements is accompanied by a proof of the validity of the decomposition. The proofs would typically use propositional logic, and might use some probabilistic reasoning if any of the involved sub-requirements are quantitative. 
The following steps are applied to each atomic If R a is quantitative, proceed as follows; otherwise, jump to Step 8. In a natural language (or some more formal language suited to the task), describe properties of the system and its operational environment that can guide modelbased evaluation of the probabilistic measure(s) associated with R a . First, the system components (and communications among them) relevant to R a are identified. This is followed by a description of the following. Note that all the descriptions are in terms of the components and communications identified, i.e., they are specified at the same level of abstraction. a) Information flows: i) Service-related data flows: this is a block diagram, with the components identified above as the blocks, representing the precise sequence, and the nature, of communications between the components during normal operation. ii) Attack-caused intrusion detection and recovery flows: this is a block diagram representing the precise sequence, and nature, of communications involved in passing alerts (upon detection of intrusions) and subsequent recovery commands. iii) Fault-caused error detection and recovery flows: similar to (ii) above, except that the alerts are raised upon detection of errors. b) Use scenario(s): this is closely related to information flows. Here, the operational setting of the system is described, and might include details such as frequencies of various events. A description of the quantitative measure(s) required to evaluate R a is also provided. c) Attack and fault effects: This is a description of the possible effects attacks and faults have on the behavior of the system components.
The emphasis is on enumerating the effects that can be detrimental to the system's ability to satisfy R a . It usually includes a representation of the attacker behavior, in particular the dynamics of the spread of intrusions deeper into the system using already intruded components as launching pads. These descriptions can be based on information from threat and vulnerability assessment and from whiteboarding. Step 4 also includes identification of the input parameters required by the model, and choice of reasonable estimates for their values.
Steps 3 and 4 help the validators make sure they clearly understand the system design or implementation being validated, and document that understanding. Since the probabilistic model constructed in Step 6 may not be easily understood by persons lacking a background in probability theory and stochastic modeling, the descriptions prepared in these steps can be reviewed by the designers and implementors of the system to confirm that they match their own view of the system. Similarly, the descriptions help convince the accreditors that the models actually represent the system being validated. The exercise also greatly eases the job of building the probabilistic model.
5. Verify the modeling assumptions of Step 4 and, where possible, justify the model parameter values chosen. Since we were validating a system design (as opposed to an implementation), several of the parameter values chosen were based on informal justification rather than formal proofs or experimentation. The focus in such a case is more on exploration of the design/parameter space and identification of the subspace that ensures compliance with the survivability requirements, thus leading to a more survivable system when the chosen design is actually implemented. The assumptions are also checked for consistency with any preconditions in R a . Furthermore, if a node in the requirement decomposition tree used independence of the underlying events to justify the decomposition, the sets of assumptions in the subtrees rooted at the children of that node are checked against each other for violations of the independence assumption.
6. Based on the descriptions obtained in Step 4, construct a probabilistic model of the system and its operational environment that can support evaluation of the probability measure(s) associated with R a . Several modeling formalisms may be used. We have used Stochastic Activity Networks (SANs) [19], a generalization of stochastic Petri nets, as the modeling formalism. The models were built using the Möbius tool, which can either solve them analytically by converting them into equivalent continuous-time Markov chains, or simulate them by executing multiple behavioral trajectories until the measures being evaluated are determined within desired bounds of accuracy.
7. Based on the model made in Step 6, evaluate the probability measure(s) associated with R a . If the values obtained are within bounds prescribed by R a , then R a is satisfied by the system (the system is valid with respect to R a ).
8. If R a is not quantitative, prove that it is satisfied using logical argumentation.
Note that Steps 4-5 will usually be iterated. For example, an inability to verify some assumption in Step 5 may lead to alternative assumption details in Step 4 (even though realities of the design and its operational environment remain unchanged).
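The simulation option described in Step 6 and the comparison against the bound in Step 7 can be illustrated with a generic Monte Carlo sketch. This is not the Möbius interface: `run_trajectory`, `estimate_probability`, and `is_satisfied` are hypothetical names, and the stopping rule (a normal-approximation confidence interval) is an assumption chosen for simplicity.

```python
import math
import random

def estimate_probability(run_trajectory, half_width=0.01, z=1.96,
                         min_runs=100, max_runs=1_000_000, seed=0):
    """Estimate P[success] by repeated simulation.

    run_trajectory(rng) -> bool is a hypothetical stand-in for one simulated
    behavioral trajectory of the model. Sampling stops once the
    normal-approximation confidence interval is no wider than +/- half_width,
    or when max_runs is reached. min_runs guards against stopping on a
    degenerate early estimate (for example, all successes so far).
    """
    rng = random.Random(seed)
    successes = 0
    for n in range(1, max_runs + 1):
        successes += bool(run_trajectory(rng))
        if n >= min_runs:
            p_hat = successes / n
            margin = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
            if margin <= half_width:
                return p_hat, margin, n
    p_hat = successes / max_runs
    margin = z * math.sqrt(p_hat * (1.0 - p_hat) / max_runs)
    return p_hat, margin, max_runs

def is_satisfied(p_hat, margin, required):
    """Conservative check: the lower confidence bound must meet the requirement."""
    return p_hat - margin >= required
```

Comparing the lower confidence bound, rather than the point estimate, against the required value is one conservative way to read "within bounds prescribed by R a"; other acceptance rules are possible.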

VALIDATION DETAILS
As mentioned in Section 2, the essential services provided by the core to clients are publish, subscribe, and query. Accordingly, IT Pub-Sub survivability with respect to these services is a dominant concern of the validation process. We now describe the application of the validation procedure outlined above as it was used to validate the system against the survivability requirement for the publish service.
Step 1: Requirement Specification
The "capability to process a publish request successfully" was chosen as the survivability requirement for the publish service. The first step in the validation procedure is to formulate a precise statement of the requirement. To this end, the terms "capability" and "successfully process a publish request" need to be defined, together with any preconditions regarding the system and its operational environment.
Let $C_{PUB}$ be the conjunction of the events representing the preconditions. In our case, $C_{PUB}$ is the conjunction of the following two events; the first refers to the system and the second to the use environment.
$C^{1}_{PUB}$ = the publishing client is successfully registered with the IT Pub-Sub core (authentication).
$C^{2}_{PUB}$ = the publishing client's mission application always passes adequate and accurate information to the client.
These reflect the assumptions about the system's initial state and invariants during operation, which are taken as axioms in the subsequent analysis.
Let $E_{PUB}$ be the desired event, i.e., the successful processing of a request to publish. It is the conjunction of the following events.
$E^{1}_{PUB}$ = the data flow of the publish operation is correct.
$E^{2}_{PUB}$ = the time required for the publish operation does not exceed a specified duration $t_{max}$ (timeliness).
$E^{3}_{PUB}$ = the published IO that becomes available to subscribers has the same essential content as that assembled by the publishing client (integrity).
$E^{4}_{PUB}$ = the published IO is available only to the other clients via subscribe or query requests (confidentiality).
Let $P[E_{PUB} \mid C_{PUB}]$ be the probability of event $E_{PUB}$, given that the preconditions hold, i.e., the quantification of "capability." Let the required capability be denoted by $p_{PUB}$ ($0 < p_{PUB} < 1$), the lower bound on $P[E_{PUB} \mid C_{PUB}]$, where its specified value is typically a high probability that depends on the nature of the system use scenario.
The survivability requirement for the IT Pub-Sub publish service, denoted by PUB, can then be stated as
PUB: $P[E_{PUB} \mid C_{PUB}] \ge p_{PUB}$.
PUB is therefore a quantitative requirement (in the sense described in Section 3.1), since deciding whether it is satisfied by the IT Pub-Sub entails evaluation of the quantitative measure $P[E_{PUB} \mid C_{PUB}]$. Because of uncertainties in attacks, attack effects (intrusions), intrusion effects, and the operation of various IT Pub-Sub intrusion tolerance mechanisms, validation is trivial in the case $p_{PUB} = 1$, since we know that PUB cannot be satisfied. At the other extreme, if the value $p_{PUB} = 0$ were allowed and so specified, then PUB would likewise be non-quantitative. Again, validation would be trivial, since, in this case, PUB would always be satisfied (the requirement itself is trivial).

Step 2: Logical Decomposition
PUB can be initially decomposed into the following two sub-requirements. To establish that {PUB 1 , PUB 2 } is an LD of PUB, we need to show that the defining conditions of an LD are satisfied, i.e., (PUB 1 ∧ PUB 2 ) ⇒ PUB.
Suppose the hypothesis holds, i.e., both PUB 1 and PUB 2 are true. Generally, if A and B are two events such that P[A] = 1 then

$$P[A \land B] = P[A] + P[B] - P[A \lor B] = P[B] = P[A]\,P[B],$$
where $P[A \lor B] = 1$, since the probability of an "or" event is at least the probability of either disjunct, and can be no greater than 1. From the above identity, it follows that any event having probability 1 is (statistically) independent of an arbitrary event.
In particular, by PUB 2 , by PUB 1,1 , we conclude from the above identity that PUB 1 is true.
Each sub-requirement available at this stage is specific enough to be handled by a validation technique such as probabilistic modeling or logical argumentation. Hence, we stop the decomposition, and refer to PUB 1,1 , PUB 1,2 , PUB 1,3 , and PUB 2 as the atomic sub-requirements of PUB, where the first is quantitative and the other three are non-quantitative.

Step 3: High-Level System Description
For the remaining steps of the validation, we will focus on the quantitative atomic requirement PUB 1,1 . With respect to PUB 1,1 , this step seeks high-level descriptions of the IT Pub-Sub data flow for the publish operation, the associated alert and response data flows, and the attack/intrusion environment.

Data Flows
Both the publish data flow and the attack/alert data flow are represented as block diagrams depicting the relevant components, the sequence of operations, and the nature of the communication between the components. The diagrams (Fig. 2 and Fig. 3) and the associated descriptions have already been provided as a part of Section 2 and are not repeated here.

Attack Model
The attack model makes several important distinctions concerning where attacks occur (location of both the source and target of an attack) and how resulting intrusions affect both system and attacker behavior. In particular, the model accounts for the fact that once a vulnerability has been discovered in a target, the attack can quickly propagate to other instances of that target, provided that they are accessible (via network connectivity) from the attack source. If an intruded target (e.g., a host) is compromised, then it is possible for the host to serve as a source of further attacks.
Terminology
In order to describe the attack model more precisely, we make the following distinctions. An entity of the system is one of the following:
• A host: A computing resource with an operating system and network interface cards.
• A component: A process that realizes an IT Pub-Sub function, e.g., an access proxy. Note that several components typically reside in a single host (for example, the survivability delegate, the sensors, the actuator, and the local controller reside in the client).
An intrusion occurs if the attack finds a vulnerability in its target and thereby alters the target's behavior (the effect or symptom of the intrusion); otherwise, it is prevented. In other words, an intrusion occurs if an attack has some effect on the target. The effect can range from something very benign (e.g., when the intrusion is "masked" or "blocked") to the compromise of the target such that it can be used as a platform for launching further attacks. An intrusion is prevented if the attack can find no vulnerabilities in its target, thereby obviating any effect. In particular, if an attack is unable to access its target, then the attack cannot find a vulnerability, even if one exists.
An intrusion is tolerated (possibly in the presence of other tolerated intrusions) if its effect does not lead to unsuccessful processing of a publish request; otherwise, it causes a failure. A new vulnerability, found at time t, is a vulnerability that is present in at least one component of the architecture, and that was not known by the attacker until time t during the mission. When discovered, such a vulnerability can be used any number of times (until the end of the mission) against the vulnerable components that the attacker can reach.
A successful attack is repeated if that attack (same source, same type) is made on a similar target having the same vulnerability, in which case it succeeds very quickly.
A successful attack is propagated if the intruded target is compromised such that the target becomes a source of further attacks. If this source can access a similar target with the same vulnerability, the original attack can be repeated (see above). The source can also launch a new attack that attempts to find a new vulnerability in another target.
The simulation model considers attacks that are "successful" in the sense that an intrusion occurs and, moreover, is neither masked nor blocked. However, if such an intrusion is tolerated (the third case noted above), the attack does not succeed in the more usual sense of causing a failure.
Attack Propagation
Two basic assumptions underlie the construction of the attack model. First, we assumed that the attacker would discover new vulnerabilities slowly. Define MTTD to be the mean time to discovery of a new vulnerability. Second, it was assumed that the attacker would exploit newly discovered vulnerabilities quickly. Once an entity is intruded following the discovery of a vulnerability, the attack can be repeated (see the preceding terminology). Define MTTE to be the mean time between successive exploitations of a known vulnerability. The typical value of MTTE in our analysis was 5 minutes.
It is important to note, however, that repeated attacks require targets with the same vulnerability, typically entities that are instances of the original target. Accordingly, design diversity can be used to reduce the possibility of repeated attacks. For example, a successful OS-level attack from the outside, compromising a client running under OS1, can propagate to hosts connected to the client (namely, the access proxy in the core). If the access proxy's OS is also OS1, the attack can be repeated, with success (intrusion) coming quickly. On the other hand, if the access proxy is running under a different operating system, then the OS diversity will likely preclude a repeated attack. It is possible that a given vulnerability exists in more than one OS. We account for that possibility using a probability of a common-mode vulnerability. If a vulnerability is determined to be common-mode, it will exist in all OSes used by the IT Pub-Sub.
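The two timing assumptions can be made concrete with a small sampling sketch. It draws discovery and exploitation delays from exponential distributions, which matches the distributional assumption stated later in Step 4. The function name, the flat `n_targets` simplification of reachability, and any value passed for `p_common_mode` are illustrative assumptions, not part of the paper's model.

```python
import random

def sample_attack_timeline(mttd_min, mtte_min, mission_min, n_targets,
                           p_common_mode, rng):
    """Sample vulnerability discoveries and the repeated exploits that follow.

    All times are in minutes. Delays between discoveries have mean mttd_min;
    delays between successive exploitations of a known vulnerability have
    mean mtte_min. n_targets is the number of reachable entities sharing the
    vulnerability (an illustrative simplification of the reachability rules).
    Each discovery is flagged common-mode with probability p_common_mode.
    """
    events = []
    t = rng.expovariate(1.0 / mttd_min)
    while t < mission_min:
        common_mode = rng.random() < p_common_mode
        events.append((t, "discovery", common_mode))
        t_exploit = t
        for _ in range(n_targets):
            t_exploit += rng.expovariate(1.0 / mtte_min)
            if t_exploit >= mission_min:
                break  # the mission ends before this exploit completes
            events.append((t_exploit, "exploit", common_mode))
        t += rng.expovariate(1.0 / mttd_min)
    return sorted(events)
```

With MTTE much smaller than MTTD (for instance 5 minutes against several hundred), the sampled timelines show short bursts of exploits following each rare discovery, which is the behavior the two assumptions are meant to capture.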
Types of Attacks
Three different types of attacks were represented in the attack model.
• Infrastructure-level attacks exploit a vulnerability found either in the operating system running on a given host (for instance, a flaw in the TCP/IP stack), or in a service running on that host, not related to the IT Pub-Sub (for example, a flaw in sshd). The attack's source and target must be directly connected and communicate with each other, typically with standard network protocols. If a vulnerability of that type is discovered, it can be limited to one of the four operating systems, or affect all operating systems used in the system.
• Data-level attacks exploit a vulnerability in an IT Pub-Sub application on the targeted component. The attack's source and target are applications on different hosts. The attack uses the content of the application data to intrude into the target. For example, the client sends a corrupted IO to a PSQ server, resulting in a crash or a corruption of that host.
• Attacks across process domains allow the attacker to intrude into a different process domain of the same host by exploiting low-level vulnerabilities in the operating system. The source and targets are process domains running on the same machine.
Successful infrastructure-level attacks are attacks in depth; the intruder quickly progresses deeply into one quadrant, because only one vulnerability needs to be discovered to compromise the common operating system in the entire quadrant. However, unless the vulnerability found corresponds to a common mode failure, the attacker cannot intrude into any other quadrant, because the other quadrants are based on hosts with different operating systems. Data-level attacks provide attacks in breadth; if a vulnerability is found in the access proxy application, then the publishing client can exploit it directly on the four access proxies.
Those attacks are consequently much more dangerous. Great effort should be put into reducing the number of data vulnerabilities, and preventing attackers from exploiting them.
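The contrast between attacks in depth and attacks in breadth can be sketched as reachability over a connectivity graph, where a repeated attack only lands on entities that share the exploited vulnerability. The topology below (one client, four quadrants with one AP and one PSQ server each) is a toy assumption for illustration and is much smaller than the modeled system; `spread`, `links`, and `os_of` are hypothetical names.

```python
from collections import deque

def spread(links, start, vulnerable):
    """Return the entities compromised by repeating one attack from `start`.

    links maps an entity to the entities it can reach over the network;
    vulnerable(entity) says whether the entity has the exploited
    vulnerability. Only compromised entities become new attack sources.
    """
    compromised = {start}
    queue = deque([start])
    while queue:
        src = queue.popleft()
        for dst in links.get(src, ()):
            if dst not in compromised and vulnerable(dst):
                compromised.add(dst)
                queue.append(dst)
    return compromised

# Toy topology (assumed): quadrant i runs OSi; the client runs OS1.
os_of = {"client": "OS1"}
links = {"client": []}
for i in range(1, 5):
    ap, psq = f"AP{i}", f"PSQ{i}"
    os_of[ap] = os_of[psq] = f"OS{i}"
    links["client"].append(ap)
    links[ap] = [psq]

# Infrastructure-level vulnerability in OS1: deep into one quadrant only.
in_depth = spread(links, "client", lambda e: os_of[e] == "OS1")

# Data-level vulnerability in the AP application: across all quadrants.
in_breadth = spread(links, "client", lambda e: e.startswith("AP"))
```

In this toy setting the infrastructure-level attack reaches the AP and PSQ server of quadrant 1 and stops at the OS boundary, while the data-level attack reaches all four access proxies regardless of their operating systems.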
The first attack on the IT Pub-Sub must be an infrastructure-level attack, since both of the other types require control of a machine within the system. For example, attacks across process domains assume a compromised process domain for the source of the attack. Data-level attacks require control of a component that is capable of generating IT Pub-Sub protocol packets, signed with signatures from host-specific private keys.
Fig. 5 provides an example of attack propagation. At time t 0 = 85 minutes, an infrastructure-level vulnerability (ILV) is found on the main process domain of OS1. After a short time, the attacker exploits that vulnerability on the publishing client (time t 1 ). From the client, he launches the same attack on the AP of quadrant 1 at time t 2 . He continues in the same way until he has compromised all the components of quadrant 1 (since they are all running the same OS). At time t 8 = 230 minutes, a data-level vulnerability (DLV) on the PSQ server is found. The attacker uses that vulnerability to compromise the three remaining PSQ servers; he can attack either from the publishing client (times t 9 and t 11 ), or from the PSQ server in the first quadrant (time t 10 ), since he has control of all those entities.

Fig. 5 -Example of Attack Propagation
Intrusion Effects
When an attack of any of the three types described above succeeds, the resulting intrusion can have one of the following effects.
• Crash of the intruded entity: For the operating systems that handle process domains (secure OSes), a crash of a given process domain will not lead to the crash of other process domains. On a traditional OS (without process domains), a crash of a given process will crash the whole machine (and thus all the processes running on it).
• Compromise of the intruded entity: The intruded entity can corrupt or drop data packets, or can be a source of further attacks. When a process domain becomes compromised on a secure operating system, the other process domains are unaffected. However, for traditional operating systems, compromise of one process also compromises the other processes on the machine.
• Denial of service: The intruded entity can reduce processing speeds and increase latencies.

Attack Responses
The model includes three mechanisms for responding to a detected intrusion that is not blocked:
• Type 1: Rapid response. The first and fastest mechanism is a local and rapid response, based on an automatic decision made by the LC. Examples of rapid response include file restoration, termination of illegal processes, and restarting of critical processes. On average, this operation takes one minute.
• Type 2: Secure reboot. This operation is carried out in two phases: automatically shutting down the host and then manually restarting it. The shutdown is performed by the SM through DC and LC. Restarting has to be done manually by the operator sitting at the console. The reason is that the client application will also need to be restarted, which will require the user to perform some action, such as providing a password or swiping a smart card. Moreover, the shutdown will give the client operator the opportunity to introduce diversity, perhaps by removing some services or changing the firewall policies.
• Type 3: Permanent isolation. The decision to isolate is taken by the SM through the PS. The decision is made by the SM in the core; therefore, this operation requires more time, on average 7 minutes.

Step 4: Detailed Descriptions
This step elaborates the Step 3 descriptions to an extent that permits relatively straightforward construction of corresponding components of the probabilistic model. It also explicitly enumerates all the assumptions made by the probabilistic model; these are then verified and, if found to be correct, justified using logical arguments (Step 5). For the purpose of illustration, we will limit our attention to the AP component, along with certain details concerning the use and attack models.
If the AP is not compromised, it forwards designated traffic from the Quadrant Isolation Switch to core quadrant components and vice versa. The AP, like several other components in IT Pub-Sub, has process domains, which are similar to sandboxed virtual machines running on the same host. The AP handles different types of traffic, and each type, including IOs sent from publishing clients to the core, is handled by a different process domain. If an IO is sent by a publishing client to the AP (the client sends the IO to a randomly chosen quadrant), the token accompanying the IO is sent to the quadrant's DC to assist in the determination of whether the publishing client is in a session. If the DC's response is positive, the IO is forwarded to the PSQ server.
In addition, the AP forwards sensor alerts from client-side sensors to the correlator, heartbeats from client components to the DC, acknowledgments from PSQ servers to clients, notifications from PSQ servers to subscribing clients, commands from the DC to clients' LCs, and commands from ADF policy servers to clients' ADF NICs. The following assumptions are identified in the model for the AP component:
AP1: Only well-formed traffic is forwarded by a correct AP.
AP2: An AP can change the traffic through it, but cannot re-sign the content.
AP3: An AP cannot access the contents of an IO if application-level end-to-end encryption is being used.
AP4: If the AP is compromised, it can launch infrastructure-level attacks (ILA) against the following components: client, PSQ, DC, correlator, and PS.
AP5: If the AP is compromised, it can launch data-level attacks (DLA) against the following components: PSQ, guardian, DC, and the publish-subscribe middleware components, as well as the IDS components on the clients, correlator, and SM.
The time-to-discovery (TTD) and time-to-exploit (TTE) are assumed to be exponentially distributed. Accordingly, the values assigned to the MTTD and MTTE parameters of the attack model (see the description of the attack model above for their definitions) will be such that MTTE ≪ MTTD. This reflects the fact that the TTD of a new vulnerability is typically much longer than its subsequent TTE.
Once an intrusion has occurred, it has one of the three possible effects described in Step 3 (crash, compromise, or DoS), which we assume will occur with probabilities (parameters of the attack model) p cr , p co , and p dos , respectively. When a process domain on an AP is compromised, data that flows through it can be altered as follows. If any alert and command traffic passes through, it is dropped (blocked). If the process domain handles IOs, it will corrupt an IO if it has access to the client's key (since the process domain can then change the signature of the IO according to the corrupted content); otherwise, it will drop the IO. It can also act as a source for further attacks (both ILA and DLA) on other entities. Compromised sensors are not able to detect intrusions.
When an alert is generated by the IDS, the recovery options are as follows. On the first alert, the affected process domain is restarted. On the next three, the entire AP is rebooted. On the fifth alert, the AP is quarantined (using the ADF NICs).
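The effect selection and the alert escalation rule for the AP described above can be sketched as two small functions. The function names and action labels are hypothetical. The sketch assumes that the three effect probabilities sum to 1 and that any alert after the fifth keeps the AP quarantined; neither detail is spelled out in the text.

```python
import random

def sample_effect(p_cr, p_co, p_dos, rng):
    """Pick the effect of an intrusion: crash, compromise, or DoS.

    Assumes the three probabilities sum to 1 (an intrusion always has
    exactly one of the three effects).
    """
    if abs(p_cr + p_co + p_dos - 1.0) > 1e-9:
        raise ValueError("effect probabilities must sum to 1")
    effects = ["crash", "compromise", "dos"]
    return rng.choices(effects, weights=[p_cr, p_co, p_dos])[0]

def response_for_alert(alert_count):
    """Map the running alert count for one AP to a recovery action.

    1st alert: restart the affected process domain.
    2nd to 4th alerts: reboot the entire AP.
    5th alert: quarantine the AP (later alerts are assumed to leave it
    quarantined).
    """
    if alert_count < 1:
        raise ValueError("alert_count starts at 1")
    if alert_count == 1:
        return "restart_process_domain"
    if alert_count <= 4:
        return "reboot_ap"
    return "quarantine_ap"
```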

Step 5: Justification of the Assumptions
The goal in this step is to justify the model assumptions made in Step 4. We used logical argumentation in this step. Other techniques, such as formal methods and experimentation, were also utilized for some of the assumptions. In the complete description of the model in Step 4, a total of 31 assumptions were justified using logical argument and experimental results. These arguments use very low-level system details, and, in the interest of space, we do not list them here. Interested readers can refer to [20] for details.

Step 6: Construction of the Probabilistic Model
Based on detailed descriptions of the type illustrated in Step 4, a probabilistic model was constructed using Möbius [21] that supports evaluation of the probability measure associated with PUB 1,1 . We used stochastic activity networks (SANs) [19] as the formalism. Behaviorally, a model constructed with SANs represents (measure-relevant) behavior of the IT Pub-Sub platform in the presence of publish demands and random attack-caused intrusions. Structurally, the atomic SAN submodels represent various components of the design and its use/attack environment; descriptions of them were documented during Step 4. The overall model is constructed by using replicate and join operations to compose the atomic submodels.

Fig. 6 -Composed Model of the IT Pub-Sub
Graphically, a composed model can be viewed as a tree, in which the atomic SAN submodels correspond to leaf vertices of the tree, and joins or replicates correspond to internal vertices. A Join vertex combines two or more different SAN submodels, each of which can itself be a composed model (represented by a subtree). A Rep (or Replicate) node generates multiple copies of its submodel (again, each submodel can itself be a composed model). Different submodels in a composed SAN interact through shared state variables.
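The tree view of a composed model can be mirrored in a small data structure. The sketch below is not the Möbius representation; the class names and the three-component quadrant are illustrative assumptions that borrow only the Join/Rep vocabulary from the text. It counts how many atomic submodel instances a composed model expands to.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Atomic:
    """Leaf vertex: one atomic SAN submodel."""
    name: str

@dataclass
class Join:
    """Internal vertex combining two or more different submodels."""
    name: str
    children: List["Node"] = field(default_factory=list)

@dataclass
class Rep:
    """Internal vertex generating `copies` instances of one submodel."""
    name: str
    child: "Node"
    copies: int

Node = Union[Atomic, Join, Rep]

def count_atomic_instances(node):
    """Number of atomic submodel instances in the expanded tree."""
    if isinstance(node, Atomic):
        return 1
    if isinstance(node, Join):
        return sum(count_atomic_instances(c) for c in node.children)
    if isinstance(node, Rep):
        return node.copies * count_atomic_instances(node.child)
    raise TypeError(f"unknown node type: {type(node).__name__}")

# Reduced, illustrative quadrant with three atomic submodels, replicated 4x.
quad = Join("Quad", [Atomic("AP"), Atomic("PSQ"), Atomic("DC")])
core = Rep("Core", quad, copies=4)
total = count_atomic_instances(core)
```

Because a Rep multiplies everything beneath it, the expanded model grows quickly with nesting depth, which is one reason the composition is specified as a tree rather than written out flat.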
As depicted in Fig. 6, the system is viewed (for the purpose of validating the requirement PUB 1,1 ) as two clients (Join PubClient and SubClient) communicating with the core (Rep Core) through the network (submodel Path). The clients have four process domains, represented by four submodels under the Join (IT Pub-Sub publish functionality, the sensors, the actuator, and the local controller). The fifth submodel under each of these Joins is the attack model. The two remaining submodels (measures and attack_discovery) assist model-based formulation of the measures evaluated.
The core is a replication of four quadrants (submodel Quad). As shown in Fig. 7, each quadrant is a Join of several submodels (Access Proxy, PSQ, DC, Guardian, Correlator, PS, and SM), representing all of the core components. As presented in Fig. 1, the access proxy, PSQ server, downstream controller, and guardian have IDS components; therefore, their respective Joins have IDS submodels (for instance, AP_Se, AP_Ac, and AP_LC).
For each node represented in the composed models of Fig. 6 and Fig. 7, an atomic SAN model is built. Atomic SANs encode the state variables representing the components they are modeling, and provide transitions with specified delay distributions that can change the state. They provide the ability to include complex enabling functions for the transitions, and complex completion functions to manipulate the state upon completion of transitions. Each atomic SAN basically implements the description for the corresponding component in Step 4. A detailed description of all the atomic SANs constructed is provided in [22].

Fig. 7 -Composed Model of the IT Pub-Sub Core Quadrant
The point to note is that the entire model construction is driven by the measure we intend to evaluate.

Step 7: Evaluation for Validation Against Requirement
Based on the model of Step 6, this step evaluates the probability measure associated with the atomic quantitative requirement PUB 1,1 . If p 1,1 > p PUB (the required lower bound), then PUB 1,1 is satisfied by the system (the system is valid with respect to PUB 1,1 ).
The IT Pub-Sub system is envisaged to be used for short (of the order of a day or two) durations. Letting d M denote the duration (in hours) of a use scenario, p 1,1 is formulated as the fraction of publication requests during d M that are processed successfully, i.e., the fraction for which the event In particular, if (1) new vulnerabilities are discovered at an average rate of no more than once per day (MTTD > 1440 minutes), which is a fairly aggressive assumption, (2) the value of p PUB used is 0.95, and (3) the mission duration is 12 hours, then the IT Pub-Sub is valid with respect to PUB 1,1 .
In addition to measures associated directly with quantitative survivability requirements, the model constructed in Step 6 is sufficiently detailed to support evaluation of a variety of other survivability-related measures. For example, to understand how the number of successful attacks (those that cause intrusions) relates to a value of p 1,1 , the measure n = the total number of intrusions during d M can likewise be evaluated as a function of MTTD. Comparing Fig. 8(a) with Fig. 8(b), we see how the probability p 1,1 of successfully processing a publish request during d M varies inversely with the total number n of successful intrusions experienced during a mission of the same duration.

Step 8: Satisfaction of Non-quantitative Atomic Requirements
Validation with respect to the non-quantitative atomic requirements PUB 1,2 , PUB 1,3 , and PUB 2 is accomplished using logical argumentation. Given that validity has been established with regard to the quantitative atomic requirement PUB 1,1 in Step 7 above, the system is then valid with respect to the original requirement PUB. The argumentation used is similar to that used for validation of assumptions, and in the interest of space we do not provide the details here. Details are presented in [20].

Other Evaluation Results: Exploring Design Choices
In addition to validating the system against the high-level survivability requirements, we also used the probabilistic model developed as a part of the IVP application to study design trade-offs and explore the behavior of the system in different operating configurations and attack environments.
Such studies are possible because each component of the system model contains a set of input parameters that define aspects of the component's functionality, such as the rate of data publishes or the type of operating system used on a host. These are in addition to parameters that define the characteristics of the attackers on the system, such as the time between vulnerability discoveries (MTTD) and the time to exploit known vulnerabilities (MTTE).
We present results for two of the studies performed: effect of the degree of operating system diversity among the four quadrants in the core and effect of the autonomous distributed firewall (ADF NIC) policies.

Effect of Diversity
In this experiment, we studied two different OS configurations in the following way:
• 4 OS: the access proxy, downstream controller, PSQ servers, and system manager in a given quadrant run the same operating system. Each quadrant uses a different OS. The policy server is available in only one operating system. Therefore, there is no possible variation for this component. This case corresponds to the default configuration of the system.
• 1 OS: all the components in the core (apart from the policy servers) run the same operating system.
Also, to make a fair comparison between the two options, all OSes are secure. The graph presented in Fig. 9 is based on an assumption of a less aggressive attacker than we considered in the baseline case; here, the attacker can only execute infrastructure attacks and attacks across process domains. The figure illustrates the difference between the two options mentioned above. It shows p 1,1 versus the vulnerability discovery rate. Diversity significantly increases the performance of the design: for MTTD = 200 minutes, p 1,1 is about 0.97 for the 4-OS case, versus 0.90 for 1 OS, i.e., a 70% reduction in unavailability (from 0.10 to 0.03). The gap between the two curves is noticeable at all rates.
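The improvement figure quoted above is a relative reduction in unavailability, where unavailability is taken as 1 - p 1,1 . A minimal sketch of that arithmetic follows; the function name is hypothetical.

```python
def unavailability_reduction(p_baseline, p_improved):
    """Relative reduction in unavailability (1 - p) from baseline to improved.

    Returns a fraction, e.g. 0.7 for a 70% reduction.
    """
    u_base = 1.0 - p_baseline
    u_new = 1.0 - p_improved
    if u_base <= 0.0:
        raise ValueError("baseline is already fully available")
    return (u_base - u_new) / u_base

# 1-OS (0.90) versus 4-OS (0.97) at MTTD = 200 minutes.
diversity_gain = unavailability_reduction(0.90, 0.97)
```

For the values above the result is approximately 0.70. Measuring the change in unavailability rather than in p 1,1 itself makes differences near 1 visible: the same move from 0.90 to 0.97 is less than an 8% relative gain in p 1,1 .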
A similar experiment was done for all types of attack, including the data-level attacks. In that case, the two curves were closer. Data-level attacks are the most dangerous type of attack, as they can take out the same component in every quad. For example, a compromised client could launch a data-level attack against the four PSQ servers, which could result in the crash (or compromise) of all four. If that happened, no further PSQ requests would be handled, and the core would be considered down. For the results presented here, we assumed that the data-level vulnerabilities could be considerably reduced not only by the effort put into the implementation of, for example, the PSQ, but also by the semantic checks done on the access proxy for any incoming traffic to the core.

ADF NIC Policies
ADF NICs are local firewalls on each component, administered by the policy servers in the core. This experiment compares three ADF NIC policies, assuming that only infrastructure-level attacks and attacks across process domains are allowed. The first policy is to allow all communications between any two processes of any two components. The second is a per-component policy, allowing only certain components to communicate with each other (for instance, the AP can talk to the PSQ server in its quad, but the client cannot communicate directly with any PSQ server). Finally, the third one is a per-process-domain policy, restricting communications between specific processes. Fig. 10 presents the results reflecting the three configurations. The graphs reveal that the per-process-domain policy is by far the best of all three: when MTTD = 100 minutes, p 1,1 = 84.4% for the no-restriction policy, versus 90.0% for the per-component policy, and 98.5% for the per-process-domain one, which corresponds to roughly a 90% reduction in unavailability (from 15.6% down to 1.5%). The per-process-domain restriction can also be interpreted as having a ruleset that describes which ports from which machines can communicate with which ports of the other machines. It is a very successful way to increase the survivability of the system, and therefore should be implemented. However, it comes with a price, as it limits the developers by forcing them to allocate fixed port numbers, and also might limit the usability of the machines for purposes other than the PSQ functionalities.
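The three policies differ only in the granularity of the allow-list, which a short sketch can make concrete. This is not the ADF NIC rule format: the policy contents and the process-domain names (`publish`, `io_forwarding`, `alerts`) are hypothetical, and only the AP-to-PSQ and client-to-PSQ examples come from the text.

```python
def is_allowed(policy, src, dst):
    """Check a communication attempt against an allow-list policy.

    policy is None for the no-restriction policy; otherwise it is a set of
    permitted (src, dst) pairs. Endpoints are component names for the
    per-component policy and (component, process_domain) pairs for the
    per-process-domain policy.
    """
    if policy is None:
        return True
    return (src, dst) in policy

NO_RESTRICTION = None

# Per-component: the AP may talk to the PSQ server, the client may not.
PER_COMPONENT = {
    ("client", "AP"), ("AP", "client"),
    ("AP", "PSQ"), ("PSQ", "AP"),
}

# Per-process-domain: only specific domain-to-domain paths are open.
PER_PROCESS_DOMAIN = {
    (("client", "publish"), ("AP", "io_forwarding")),
    (("AP", "io_forwarding"), ("PSQ", "publish")),
}
```

Under the per-process-domain policy, a compromised domain can reach only the few peers its own traffic type requires, so an intrusion has far fewer onward paths than under the coarser policies. The cost noted above follows directly: every legitimate path has to be enumerated in advance.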

CONCLUSION
The complexity of emerging survivable systems, especially those that make use of multiple security approaches and technologies, calls for an integrated approach to survivability validation. We have presented the validation of a survivable publish-subscribe system using a top-down approach that begins with a precise formulation of a specific survivability requirement, and then systematically decomposes the problem into manageable tasks. As a part of the procedure, stochastic models of the system and of the attacker were presented. We conducted model-based experiments that evaluated the survivability of the system when stressed by various types of attacks by measuring the probability of success for the transactions between the clients and the core. The results show that if the average time between discoveries of new vulnerabilities is longer than one day, more than 95% of the publishes are processed correctly. The system model was used to study design trade-offs, one of which was that OS diversity in the design significantly improved the performance. Another design trade-off became apparent when we compared three ADF NIC policies: a per-process-domain policy leads to the highest availability, but constrains developers. We described a sophisticated attacker behavior model that has wide applicability when using probabilistic modeling techniques for evaluating large networked information systems.
The integrated validation procedure (IVP) used in this work provided a collaborative environment in which a team of individuals with varied areas of expertise was able to work efficiently to produce an assurance argument that was convincing to the accreditors of the above effort. The IVP gave the designers of the system several insights into the relative merits of different protection trade-offs by comparing various algorithms, features, or infrastructures. The IVP also brought out hidden assumptions and residual requirements that forced the designers to consider issues that they might otherwise have overlooked. We hope the outlined procedure can be used by other security researchers and practitioners to validate similar survivable systems with respect to high-level quantitative survivability requirements. The work described in this paper is an instance of an evolving validation methodology. In the future, we plan to use the IVP to validate other intrusion-tolerant systems to ascertain and further refine its generic applicability, using an even wider array of evaluation techniques as the building blocks of the assurance argument. We also intend to explore avenues for further automation, such as in the decomposition of requirements, identification of appropriate assumptions, and presentation of the completed argument.