CS 540 Lecture Notes, Fall 1996
More generally, propositions can include the equality predicate with random variables and the possible values they can have. For example, we might have a random variable Color with possible values red, green, blue, and other. Then P(Color=red) indicates the likelihood that the color of a given object is red. Similarly, for Boolean random variables we can ask P(A=True), which is abbreviated to P(A), and P(A=False), which is abbreviated to P(~A).
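As a concrete illustration, a distribution for the Color variable can be represented as a table mapping each possible value to its probability. A minimal Python sketch (the particular probability values below are invented for illustration, not taken from the notes):

    # A discrete distribution for the random variable Color.
    # The probability values here are made up for illustration.
    P_Color = {"red": 0.25, "green": 0.40, "blue": 0.30, "other": 0.05}

    assert abs(sum(P_Color.values()) - 1.0) < 1e-9  # probabilities sum to 1

    print(P_Color["red"])   # P(Color=red) = 0.25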
In particular, conditional probabilities are important for reasoning because they formalize the process of accumulating evidence and updating probabilities based on new evidence. Some of the most important rules related to conditional probability are:
P(A | B ^ C) = (P(A)P(B ^ C | A))/P(B ^ C) = P(A) * [P(B|A)/P(B)] * [P(C | A ^ B)/P(C|B)]

Again, this shows how the conditional probability of A is updated given B and C. The problem is that it may be hard in general to obtain or compute P(C | A ^ B). This difficulty is circumvented if we know that the evidence B and C are conditionally independent given A, since then P(C | A ^ B) = P(C | A) and we get:

P(A | B ^ C) = (P(A)P(B ^ C | A))/P(B ^ C) = (P(A)P(B|A)P(C|A))/(P(B)P(C|B)) = P(A) * [P(B|A)/P(B)] * [P(C|A)/P(C|B)]
Furthermore, if B and C are also independent of each other, then P(C|B) = P(C) and:
P(A | B ^ C) = P(A) * [P(B|A)/P(B)] * [P(C|A)/P(C)]
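To make the updating rule concrete, here is a minimal Python sketch of evidence accumulation under these independence assumptions; the function name and the probability values are invented for illustration, not taken from the notes.

    def update(prior, likelihood, evidence_prob):
        # One step of evidence accumulation: P(A|e) = P(A) * P(e|A) / P(e).
        # Chaining these steps is valid only if each new piece of evidence
        # is independent of the earlier ones, both on its own and given A.
        return prior * likelihood / evidence_prob

    p = 0.1                    # prior P(A)
    p = update(p, 0.8, 0.4)    # evidence B: P(B|A) = 0.8, P(B) = 0.4
    p = update(p, 0.6, 0.3)    # evidence C: P(C|A) = 0.6, P(C) = 0.3
    print(p)                   # P(A | B ^ C) = 0.1 * 2 * 2 = 0.4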
The doctor wants to determine the likelihood that the patient has a PickledLiver. Based on no other information, she knows that the prior probability P(PickledLiver) = 2^-17. This represents the doctor's initial belief in the diagnosis. However, after examination she determines that the patient has jaundice. She knows that P(Jaundice) = 2^-10 and P(Jaundice | PickledLiver) = 2^-3, so she computes the updated probability that the patient has a PickledLiver as:
P(PickledLiver | Jaundice) = P(P)P(J|P)/P(J) = (2^-17 * 2^-3)/2^-10 = 2^-10
So, based on this new evidence, the doctor increases her belief in this diagnosis from 2^-17 to 2^-10. Next, she determines that the patient's eyes are bloodshot, so now we need to add this new piece of evidence and update the probability of PickledLiver given Jaundice and Bloodshot. Say P(Bloodshot) = 2^-6 and P(Bloodshot | PickledLiver) = 2^-1. Then, treating the two symptoms as independent pieces of evidence, she computes the new conditional probability:
P(PickledLiver | Jaundice ^ Bloodshot) = (P(P)P(J|P)P(B|P))/(P(J)P(B)) = 2^-10 * [2^-1 / 2^-6] = 2^-5

So, after taking both symptoms into account, the doctor's belief that the patient has a PickledLiver is 2^-5.
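The two update steps are easy to check numerically; a quick Python sketch using the numbers above:

    # The doctor's incremental diagnosis, with the probabilities from the text.
    p = 2**-17                  # prior P(PickledLiver)
    p = p * 2**-3 / 2**-10      # evidence Jaundice: P(J|P) = 2^-3, P(J) = 2^-10
    print(p == 2**-10)          # True
    p = p * 2**-1 / 2**-6       # evidence Bloodshot: P(B|P) = 2^-1, P(B) = 2^-6
    print(p == 2**-5)           # True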
For example, in the burglar alarm example in the textbook, Alarm causes both John to call and Mary to call, but these two events are conditionally independent given Alarm. So the net will contain an arc from Alarm to JohnCalls and an arc from Alarm to MaryCalls, but no arc between JohnCalls and MaryCalls.
P(X1=x1, ..., Xn=xn) = P(x1 | Parents(X1)) * ... * P(xn | Parents(Xn))
Hence, we don't need the full joint probability distribution, only conditionals relative to the parent variables.
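As an illustration of how this factorization might be evaluated mechanically, here is a small Python sketch; the net representation, the function, and the Rain/WetGrass example values are all invented for illustration, not part of the notes.

    # A Belief Net represented as: variable -> (parent list, CPT), where the
    # CPT maps each tuple of parent values to P(variable = True | parents).
    def joint_probability(net, assignment):
        # P(X1=x1, ..., Xn=xn) = product over i of P(xi | Parents(Xi))
        prob = 1.0
        for var, (parents, cpt) in net.items():
            p_true = cpt[tuple(assignment[p] for p in parents)]
            prob *= p_true if assignment[var] else 1.0 - p_true
        return prob

    # Hypothetical two-node net: Rain -> WetGrass.
    net = {
        "Rain":     ([], {(): 0.2}),                            # P(Rain) = 0.2
        "WetGrass": (["Rain"], {(True,): 0.9, (False,): 0.1}),  # P(WetGrass | Rain)
    }
    # P(Rain ^ ~WetGrass) = 0.2 * (1 - 0.9) = 0.02
    print(joint_probability(net, {"Rain": True, "WetGrass": False}))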
Consider a domain consisting of 5 Boolean random variables:
T: The lecture started at 11:05
L: The lecturer arrived late
V: The lecture is on computer vision
C: The lecturer is Chuck
S: It is sunny
In this domain it makes sense to make the following additional assumptions: whether the lecturer arrived late depends only on whether it is sunny and on whether the lecturer is Chuck; whether the lecture started at 11:05 depends only on whether the lecturer arrived late; whether the lecture is on computer vision depends only on whether the lecturer is Chuck; and S and C are independent of each other.
Based on the above, the Belief Net that represents all of these relationships has arcs from S to L, from C to L, from L to T, and from C to V, with S and C as root nodes.
Now add the following quantitative information to the net: attached to each node is a conditional probability table (CPT) giving the probability of each of its values for every combination of values of its parents; for a root node, this is just the prior probability of the variable. Doing this for the above example, we get a Belief Net whose entries include P(S) = .3, P(C) = .6, P(L | ~C ^ S) = .1, P(T | L) = .3, and P(V | ~C) = .6, the values used in the calculation below.
Notice that in this example a total of 10 probabilities are specified and stored in the net: 1 each for the root nodes S and C, 4 for L (one for each combination of values of its two parents), and 2 each for T and V. The full joint probability distribution would instead require a table containing 2^5 = 32 probabilities. The reduction is due to the conditional independence of many variables.
Two variables that are not directly connected by an arc can still affect each other. For example, S and T are not independent, even though T does not directly depend on S: S influences L, which in turn influences T.
Given a belief net, we can easily read off the conditional independence relations that are represented. Specifically, each node is conditionally independent of all of its nonsuccessors given its parents. E.g., in the above example T is conditionally independent of S, C, and V given L. So, P(T | S,L,C,V) = P(T | L).
Using this factorization, any entry in the full joint probability distribution can be computed from the information stored in the net by repeatedly applying the Product Rule and then simplifying using the conditional independence relations encoded in the net. For example:
Goal: Compute P(S ^ ~C ^ L ^ ~V ^ T)
P(T ^ ~V ^ L ^ ~C ^ S)
  = P(T | ~V ^ L ^ ~C ^ S) * P(~V ^ L ^ ~C ^ S)            by the Product Rule
  = P(T|L) * P(~V ^ L ^ ~C ^ S)                             by cond. indep.
  = P(T|L) * P(~V | L ^ ~C ^ S) * P(L ^ ~C ^ S)             by the Product Rule
  = P(T|L) * P(~V|~C) * P(L ^ ~C ^ S)                       by cond. indep.
  = P(T|L) * P(~V|~C) * P(L | ~C ^ S) * P(~C ^ S)           by the Product Rule
  = P(T|L) * P(~V|~C) * P(L | ~C ^ S) * P(~C | S) * P(S)    by the Product Rule
  = P(T|L) * P(~V|~C) * P(L | ~C ^ S) * P(~C) * P(S)        by the independence of S and C
  = (.3)(1 - .6)(.1)(1 - .6)(.3)
  = .00144
where all of the numeric values are available directly in the Belief Net (since P(~A|B) = 1 - P(A|B)).
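The same evaluation can be written out directly in Python; only the five CPT entries that the calculation actually uses appear in the notes, so only those are encoded here.

    # CPT entries from the example net (the entries for other parent-value
    # combinations are not needed for this particular query).
    p_S = 0.3             # P(S)
    p_C = 0.6             # P(C)
    p_L_notC_S = 0.1      # P(L | ~C ^ S)
    p_T_L = 0.3           # P(T | L)
    p_V_notC = 0.6        # P(V | ~C)

    # P(T ^ ~V ^ L ^ ~C ^ S) = P(T|L) P(~V|~C) P(L|~C ^ S) P(~C) P(S)
    joint = p_T_L * (1 - p_V_notC) * p_L_notC_S * (1 - p_C) * p_S
    print(joint)          # .00144 (up to floating-point rounding)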
Last modified December 11, 1996
Copyright © 1996 by Charles R. Dyer. All rights reserved.