Modeling User Preferences via Theory Refinement

*Decision Systems and Artificial Intelligence Lab
Dept. of EE&CS
University of Wisconsin-Milwaukee
{bgeisler, vu}@cs.uwm.edu

Peter Haddawy†*
†Computer Science & Information Management
School of Advanced Technologies
Asian Institute of Technology
Bangkok, Thailand
haddawy@cs.ait.ac.th
Abstract

The task of preference elicitation is characterized by the conflicting goals of eliciting a model as accurately as possible and asking as few questions of the user as possible. To reduce the complexity of elicitation, traditional approaches involve making independence assumptions and then performing elicitation within the constraints of those assumptions. Inaccurate assumptions can result in inaccurate elicitation. Ideally, we would like to make assumptions that approximately apply to a large segment of the population and correct for inaccuracies in those assumptions when we encounter an individual to whom they do not apply. We present an approach to encoding soft assumptions using the KBANN technique. We show how to encode assumptions concerning preferential independence and monotonicity as Horn-clauses and how to embed these in a KBANN network. We empirically compare the KBANN network with an unbiased ANN in terms of learning rate and accuracy for preference structures consistent and inconsistent with the domain theory. Finally, we demonstrate how we can learn a fine-grained preference structure from simple binary classification data.
Keywords: User modeling, Product brokering, Neural networks, Decision theory
1. Introduction
Intelligent user interfaces
increasingly require accurate representation of user interests and
preferences. Applications range from
adaptable software and operating systems to Internet-based product brokering. But accurately eliciting a preference model can be highly time-consuming. Thus we
are faced with the conflicting goals of eliciting a preference model as
accurately as possible and asking as few questions of the user as possible.
Representation and
elicitation of preferences have long been studied in Decision Theory. In
Multi-Attribute Utility Theory (Keeney & Raiffa 1976), a set of items is typically described in terms of a set of attributes, with an item being a complete assignment of values to all the attributes. In domains containing no uncertainty, user preferences over a set of items can be represented by a value function that assigns a real number to each item, resulting in a total ordering over the items. Given a set of items described in terms of a set of attributes {X1, …, Xn}, a function v, which associates a real number v(x) with each point x ∈ X, is said to be a value function representing the decision maker's preferences provided that:
(x1′, x2′, …, xn′) ~ (x1″, x2″, …, xn″) ⇔ v(x1′, x2′, …, xn′) = v(x1″, x2″, …, xn″), and

(x1′, x2′, …, xn′) ≺ (x1″, x2″, …, xn″) ⇔ v(x1′, x2′, …, xn′) < v(x1″, x2″, …, xn″),

where x′ ~ x″ means that the decision maker is indifferent between the two outcomes x′ and x″, and x′ ≺ x″ means that the decision maker prefers outcome x″ to outcome x′. When the number of
items is large, eliciting a high-dimensional value function can be impractical,
so practitioners attempt to identify assumptions that permit a value function
to be represented in terms of a combination of lower-dimensional sub-value
functions. The most common assumption is preferential independence. Given a set of attributes {X1, …, Xn}, a subset of attributes Y is preferentially independent of its complement set Z if and only if the conditional preferences in the y space given a fixed value z′ for the attribute set Z do not depend on z′. The attributes X1, …, Xn are mutually preferentially independent if every subset Y is preferentially independent of its complementary set of attributes. Given a set of attributes {X1, …, Xn}, n ≥ 3, an additive value function v(x1, …, xn) = λ1v1(x1) + … + λnvn(xn) exists if and only if the attributes are mutually preferentially independent.
Elicitation of an additive value function typically proceeds by determining the sub-value functions vi and then the scaling constants λi. The scaling constants can be interpreted as representing the tradeoffs between levels of the different attributes, e.g., cost vs. time. In the case of continuous attributes, a user will often have monotonic preferences over the values of the attributes, making the sub-value functions easy to define. But determining the tradeoff coefficients can require significant introspection.
As a running example, we will
use the domain of airline flight selection.
Suppose we have a user who wishes to find a flight from Milwaukee to San Francisco. We might describe flights by three attributes: time (T), cost (C), and whether or not there is a layover involved (L). Time and cost are continuous, while layover is binary. We can assume that the user prefers shorter flights, cheaper flights, and flights without layovers. Thus we need only assess the tradeoff coefficients that indicate how much the decision maker is willing to trade off the level of one attribute for a more desirable level of another. A resulting value function might be v(t,c,l) = 100t + c + 50l.
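For concreteness, the following sketch (Python; an illustration under the assumptions of this example, not a fragment of any system described here) derives a pairwise preference from this value function. Since the user prefers shorter, cheaper, layover-free flights, lower scores are treated as more preferred.

```python
def value(t, c, l):
    """Additive value function over time (hours), cost, and layover (0/1).
    In this running example, lower scores correspond to preferred flights."""
    return 100 * t + c + 50 * l

def prefers(flight1, flight2):
    """True if flight1 is preferred to flight2 under the value function."""
    return value(*flight1) < value(*flight2)

# Hypothetical flights: 4 hours, $300, nonstop vs. 6 hours, $250, one layover.
print(prefers((4, 300, 0), (6, 250, 1)))  # True: 700 < 900
```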
In the traditional approach to preference elicitation, as we have outlined it, the decision analyst (or agent) makes a set of assumptions concerning a decision maker's preferences and then performs the elicitation within the constraints of those assumptions. If the assumptions are
incorrect, the elicited model will be inaccurate. While such assumptions are clearly useful, it would be helpful to
be able to detect and correct errors in our assumptions. In particular, when designing product
brokering systems, we may wish to make assumptions that at least approximately
apply to a large segment of the user population and then correct for
inaccuracies in those assumptions when we encounter an individual to whom they
do not apply. The problem of starting
with an approximately correct domain theory and then correcting and refining it
in the light of data has been studied in the area of theory refinement. In this paper we examine the use of one
technique for theory refinement called Knowledge-Based Artificial Neural
Networks (KBANN) (Shavlik & Towell 1989).
We show how assumptions concerning independence and monotonicity of
preferences can be encoded as a KBANN domain theory and how examples of user
preferences can then be used to refine the theory. We empirically compare
elicitation speed and accuracy of a KBANN network containing this domain theory
with a neural network containing no domain theory. We use Ha and Haddawy’s similarity measure on preference
structures (Ha & Haddawy 1998) to define a measure of the degree to which a
preference structure violates the domain theory. We use this to evaluate how well the KBANN network performs as
the preference structure from which the training examples are drawn
increasingly violates the domain theory.
Our results show rather robust and somewhat surprising behavior. Finally, we demonstrate how we can learn a
fine-grained preference structure from simple binary classification data.
2. KBANN
The KBANN technique (Shavlik
& Towell 1989) permits a domain theory expressed as an acyclic set of
propositional Horn-clauses to be encoded in a back-propagation neural
network. The propositions become nodes
and the logical relations between them become links. By training the network on data an incomplete domain theory can
be supplemented and an inaccurate domain theory can be corrected. Given an approximately correct domain
theory, KBANN has been shown to require fewer examples to correctly classify a test set than an unbiased network (Towell et al. 1990). When using an unbiased network, the weights are initialized
randomly, so that the network starts out at a randomly chosen point in function
space. In contrast, initializing the
network with an approximately correct domain theory starts it at a point that
is close to the target function.
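To make the clause-to-network mapping concrete, the sketch below shows the standard KBANN initialization for a single Horn-clause: each antecedent link gets a weight of magnitude ω (conventionally 4), negated antecedents get −ω, and the bias is set so that a sigmoid unit fires only when all antecedents hold. The code itself is our illustration; only the weight-setting scheme comes from the KBANN literature.

```python
import numpy as np

OMEGA = 4.0  # conventional KBANN link weight for clause antecedents

def clause_unit(antecedent_signs):
    """Initialize weights and bias for one Horn-clause a1 ∧ ... ∧ ak → h.
    antecedent_signs: +1 for a positive antecedent, -1 for a negated one."""
    signs = np.asarray(antecedent_signs, dtype=float)
    weights = OMEGA * signs
    p = np.sum(signs > 0)          # number of unnegated antecedents
    bias = -(p - 0.5) * OMEGA      # unit fires only if all antecedents are true
    return weights, bias

def activation(inputs, weights, bias):
    return 1.0 / (1.0 + np.exp(-(np.dot(weights, inputs) + bias)))

w, b = clause_unit([+1, +1])                    # clause: a ∧ b → h
print(activation(np.array([1.0, 1.0]), w, b))   # ~0.88: clause satisfied
print(activation(np.array([1.0, 0.0]), w, b))   # ~0.12: clause not satisfied
```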
3. Representing Preferences

In deciding how to represent user preferences, several alternatives present themselves: discrete numeric
ratings, a simple binary classification into interesting/uninteresting, a value
function, or pairwise preferences among items. We will evaluate these from the standpoint of representational
adequacy, ease of gathering training examples, and ease of expressing and encoding
assumptions concerning preference. Numeric
ratings are the predominant representation used in building recommender
systems. Note that the simple binary
classification is a degenerate case of a numeric rating scheme. Use of numeric ratings requires training
instances to consist of individual items with an associated rating. But notice that numeric ratings have the
disadvantage of fixed granularity.
Either we choose a coarse granularity and cannot make fine distinctions
in preference or we choose a fine granularity and are faced with deciding
exactly what rating to assign to an item.
For example, a rating scheme with only five values can easily have the top rating filled with items among which we have strong distinctions of preference. On the other hand, if we use, say, 100 rating values then we may very well end up assigning ratings that have little meaning to us, e.g., a 47 vs. a 48.
The direct use of a value function to represent preference has this
latter problem of deciding exactly what value to assign to an item as
well.
Training examples consisting
of items with numeric scores would require a neural network architecture in
which we have the attributes for one instance as the input nodes and a single
output node that provides the score.
This would then dictate that the assumptions encoded in the form of a
domain theory say something about the relation of the score to the item
attributes. Notice how difficult it
would be to come up with principled statements of that form.
The alternative is to represent a user’s
preferences in terms of pairwise preferences among items. Training instances consist of pairs of items
and an indication of whether the first is preferred to the second. So using pairwise preferences would require
the network to have input nodes for the attributes of two items and a binary output node indicating preference or no preference. If two items are input to the network in one
order and then the other and the output indicates no preference for both cases,
then the user is considered to be indifferent between the two items. Notice that with this representation the
domain theory can simply be written to state under what conditions the user
would prefer one item to another.
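A sketch of the resulting query convention (the network interface net.predict is hypothetical): the pair is scored in both orders, and indifference is read off when neither order yields a preference.

```python
def pairwise_relation(net, item_a, item_b):
    """Query a preference network on a pair of items in both orders.
    net.predict (hypothetical) maps the concatenated attribute vectors of
    two items to 1 (first item preferred) or 0 (no preference)."""
    a_over_b = net.predict(item_a + item_b) == 1
    b_over_a = net.predict(item_b + item_a) == 1
    if a_over_b and not b_over_a:
        return "a preferred to b"
    if b_over_a and not a_over_b:
        return "b preferred to a"
    return "indifferent"  # no preference in either order
```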
We can conveniently encode
assumptions concerning preferential independence and monotonicity of preference
through statements of dominance.
Consider our flight domain. It is plausible that the user prefers cheaper flights to more expensive flights, shorter flights to longer flights, and flights without layovers to those with layovers. These assumptions can be encoded
with the following set of Horn-clauses:
C1 < C2 ∧ T1 ≤ T2 ∧ L1 ≤ L2 → F1 ≻ F2
C1 ≤ C2 ∧ T1 < T2 ∧ L1 ≤ L2 → F1 ≻ F2
C1 ≤ C2 ∧ T1 ≤ T2 ∧ L1 < L2 → F1 ≻ F2,

where Ci and Ti are the cost and time of flight i, respectively, and Li is 0 if flight i has no layover and 1 otherwise. F1 ≻ F2 indicates that flight 1 is preferred to
flight 2. Notice that each of these
sentences indicates that the attribute over which it is expressing
directionality of preference is also preferentially independent of the other
attributes.
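Taken together, the three clauses amount to a strict-dominance test, which can be stated compactly (a minimal sketch; lower values are better for all three attributes):

```python
def dominates(f1, f2):
    """Domain theory: flight f1 = (c1, t1, l1) is preferred to f2 = (c2, t2, l2)
    when f1 is at least as good on every attribute and strictly better on one."""
    c1, t1, l1 = f1
    c2, t2, l2 = f2
    return c1 <= c2 and t1 <= t2 and l1 <= l2 and f1 != f2
```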
To accommodate this domain theory we include an extra hidden layer that contains the logic for our comparison operators. This layer is coded for the following operators: <, >, ≤, ≥, =, ≠. The layer is
located between the input layer and the initial hidden layer. The tails of the Horn-clauses in the domain
theory are represented in the hidden layer and the heads are represented by the
output node. Figure 1 shows the general
network architecture. While every node at
each layer is connected to every node at the next with low randomly chosen
weights, the Horn-clauses are encoded by setting some weights to high
values. The figure shows the links that
encode the first Horn-clause in the domain theory above. While it would be possible to compute the
comparisons outside the network and include them as separate inputs,
representing them as nodes in the network allows for the possibility that they
can be modified during training. For example, it might be the case that the user only has a preference between two different costs when they differ by at least some minimal amount. Training could modify the weights to capture this.
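One plausible realization of this architecture (our own sketch in Python/NumPy, not the authors' exact parameterization) seeds a "C1 < C2" comparison unit with opposite-signed weights on the two cost inputs, then wires the first Horn-clause through the usual KBANN AND/OR initialization; attribute values are assumed pre-scaled to comparable ranges.

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
OMEGA = 4.0

# Inputs: [c1, t1, l1, c2, t2, l2], the attributes of the two flights.
# Comparison layer: the "c1 < c2" unit gets -s on c1 and +s on c2, so it
# activates when c2 - c1 > 0; training can adjust s, e.g. to learn a
# minimal cost difference below which the user is indifferent.
s = 1.0
compare_w = np.array([
    [-s, 0, 0, +s, 0, 0],   # unit for "c1 < c2"
    [0, -s, 0, 0, +s, 0],   # unit for "t1 <= t2"
    [0, 0, -s, 0, 0, +s],   # unit for "l1 <= l2"
])
# A small positive bias leans the <= units toward accepting ties.
compare_b = np.array([0.0, 0.5, 0.5])

# Clause layer: AND of the three comparison units (the first Horn-clause).
clause_w = np.full(3, OMEGA)
clause_b = -(3 - 0.5) * OMEGA

def forward(x):
    cmp_act = sigmoid(compare_w @ x + compare_b)
    clause_act = sigmoid(clause_w @ cmp_act + clause_b)
    # Output unit: OR over clause units (one shown); head F1 ≻ F2.
    return sigmoid(OMEGA * clause_act - 0.5 * OMEGA)

x = np.array([2.0, 3.0, 0.0, 5.0, 6.0, 1.0])  # flight 1 dominates flight 2
print(forward(x))  # above 0.5: network leans toward F1 ≻ F2
```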
4. Experiments

We ran a series of experiments to evaluate how well the network biased by the assumptions performed against one with no bias, i.e., random initial weights. We created two networks: one with no domain theory (ANN), and one with the domain theory above (dominance). We represented the preferences of a user with an additive value function over the three attributes T, C, L. We tested the networks by randomly
generating examples of pairwise preferences from three different value
functions: v(t,c,l) = 100t + 1c +
50l, v(t,c,l) = 50t + 1c + 100l, and v(t,c,l) = 1t + 100c + 50l. Each value function experiment was run 10
times, with early stopping used on each run.
The test size was kept constant at 50 items. After each of the 10 runs for each training-set size, we computed a mean for each value function. Similar results were observed with all three
value functions. Figure 2 shows the
mean of the experiments with the three different value functions. The network trained with the dominance
relation theory consistently outperforms the unbiased network. It achieves approximately 84% accuracy with
only 30 examples. The unbiased network
requires between 50 and 75 examples to reach 84% accuracy. This is a significant difference since the
examples are generated by directly querying a user.
One of the
main advantages of the KBANN technique is its robustness, specifically, its
ability to correct for inaccuracies in the domain theory. To evaluate how well this works in our
domain, we examine the performance of the networks for a number of value
functions that violate the independence assumptions to various degrees. It is
expected that the more a value function violates the domain theory, the worse
the performance. The best we can hope for is that this decrease does not occur in
a precipitous manner.
The first
issue we have to address in this experiment is to define a measure of domain
theory violation. Since we are dealing with a domain theory composed of independence assumptions, we will call this measure the degree of violating the independences, or DOVI. Given a domain theory D
and a value function v, what
would be a good measure of v
violating D? Note that we can view D as the set of value functions that satisfy D. If we have a measure of
distance d between two value
functions, then we can define a violation measure as the distance between v and the member of D that is closest to v.
(This approach is analogous to the definition of the distance from a point to a set of points in Euclidean geometry.) To define the distance between
two value functions (which represent two preference orders), we use the probabilistic distance, proposed by Ha
and Haddawy (1998). According to this measure, the distance between two
preference orders is determined by the probability that they disagree in their
relative rankings of two randomly chosen outcomes. We picked this measure because it is both analytically and computationally tractable for a wide range of representations of preference orders.
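A Monte Carlo estimate of this probabilistic distance is straightforward; the sketch below (our illustration) treats each preference order as a scoring function and ignores ties for simplicity.

```python
import random

def prob_distance(order1, order2, outcomes, n_samples=2000, seed=0):
    """Estimate the probability that two preference orders disagree on the
    relative ranking of a randomly chosen pair of outcomes. Each order is a
    function from an outcome to a real score (higher = more preferred);
    outcomes is a list of items to sample from."""
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(n_samples):
        a, b = rng.sample(outcomes, 2)
        if (order1(a) > order1(b)) != (order2(a) > order2(b)):
            disagreements += 1
    return disagreements / n_samples
```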
The second
issue we have to address is to generate preference orders of varying degrees of
domain theory violation. To this end, we start with some value functions
consistent with the domain theory, and subsequently randomly permute the orders
of k randomly picked outcomes, with
different values of k. It is clear
that when k is small, the violation
is close to zero, and as we increase k,
the perturbed value function is likely to increasingly violate the domain
theory.
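A minimal sketch of this perturbation step (illustrative):

```python
import random

def perturb_ranking(ranking, k, seed=None):
    """Randomly permute the positions of k randomly picked outcomes in a
    ranking (a list ordered from most to least preferred)."""
    rng = random.Random(seed)
    ranking = list(ranking)
    positions = rng.sample(range(len(ranking)), k)
    picked = [ranking[i] for i in positions]
    rng.shuffle(picked)
    for i, outcome in zip(positions, picked):
        ranking[i] = outcome
    return ranking
```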
The last
issue, which is mainly technical, is the computation of the minimum distance.
Since there are infinitely
many value functions that satisfy the domain theory D, we can only approximate the violation measure by sampling from D. We achieve this by sampling additive
value functions with randomly generated coefficients. Our experiments show that this method of sampling converges to a stable estimate after about 2000 iterations.
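Putting the pieces together, the violation measure can be approximated as sketched below; the coefficient range and the helper prob_distance (from the earlier sketch) are our assumptions. Scores are negated weighted sums so that lower time, cost, and layover are preferred, keeping each sampled value function consistent with the domain theory.

```python
import random

def estimate_dovi(target_order, outcomes, n_value_fns=2000, seed=0):
    """Approximate the distance from target_order to the nearest additive
    value function consistent with the domain theory, by sampling additive
    value functions with random nonnegative coefficients. Outcomes are
    (t, c, l) tuples; target_order maps an outcome to a score."""
    rng = random.Random(seed)
    best = 1.0
    for _ in range(n_value_fns):
        wt, wc, wl = (rng.uniform(0.0, 100.0) for _ in range(3))
        order = lambda o, wt=wt, wc=wc, wl=wl: -(wt * o[0] + wc * o[1] + wl * o[2])
        best = min(best, prob_distance(order, target_order, outcomes))
    return best
```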
Figure 4 shows the results of our robustness analysis. Using 2000 iterations, we
identify several preference orders with
DOVI varying from 0.04 to 0.5. Notice that the performance of the biased
network remains essentially the same (85% and 92% accuracy for KBANN-30 and
KBANN-50, respectively) even for a preference order whose DOVI is 0.19. Only
when the DOVI increases to 0.36 does the accuracy drop to a (still-respectable)
76% and 78% respectively. When the DOVI
goes up to 0.5 (which is about as high as it can go, due to the fact that value
functions consistent with the domain theory intuitively “cover” the space of
preference orders very well), the accuracy of the biased networks becomes
roughly the same as that of the unbiased one. The performance of the unbiased network is dominated in all cases and, not surprisingly, stays almost constant over different DOVIs.
The best domain theory
requires about 30 examples to reach 84% accuracy, which requires the user to
provide pairwise comparisons among 60 flights. This is a large number of flights for a user
to evaluate. One way to easily obtain a
large number of examples of pairwise preferences is to have the user indicate
preference between classes of items rather than just individual items. If we have n items in each class, we then have n² pairwise preferences implicitly represented. But it is not clear whether examples generated
in this way will have as much information content as those generated randomly
since each item will appear in n
examples. So we empirically evaluated
this approach on one value function.
We simulated
a user’s binary categorization of flights into “like” and “dislike”
groups. We scored flights according to the value function (1t + 100c + 50l) and selected the top one-third of our flights as the "like" set and the bottom one-third as the "dislike"
set. We then used these sets to
generate examples of pairwise preferences.
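The example-generation step itself is simple (sketch): every "like" flight is paired as preferred over every "dislike" flight.

```python
def pairs_from_classes(liked, disliked):
    """Generate pairwise-preference training examples from a binary
    categorization: each pair (a, b) means a is preferred to b."""
    return [(a, b) for a in liked for b in disliked]

# 7 liked and 7 disliked flights yield 7 * 7 = 49 training examples.
```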
The results
are shown in Figure 3. Both the biased
and unbiased networks are able to achieve significantly higher accuracy using
fewer flights than when using randomly generated examples. Using only 14 flights, the KBANN network achieves
an accuracy of 89%, while the unbiased ANN achieves an accuracy of 72%. Fourteen flights yield 49 training examples. With that number of randomly generated examples, the KBANN network achieved an accuracy of 94%. So generating the training examples from the binary categorization yields an accuracy about 5% worse, but the number of flights required is cut from 96 to 14, a reduction of 85%.
We next evaluated how the technique
of generating examples from a binary categorization would perform as a function
of the degree to which the preference structures from which the examples are
drawn violate the domain theory. We
generated training examples using the preference structures from the
experiments shown in Figure 4 and plotted the number of flights required to
reach 90% accuracy. The results are
shown in Figure 5. With a preference
structure that conforms to the domain theory, 17 flights are required. With a preference structure that violates the
domain theory to degree 0.5, 28 flights are required. The number of flights required increases smoothly as a function
of the degree of violation of the domain theory.
5. Related Work

The Wisconsin Adaptive Web Assistant (WAWA) (Shavlik et al. 1999) is a system that finds and recommends web pages to a user. It uses KBANN to
determine if a user would like to visit a given web page. The attribute set is a localized
bag-of-words in which words are grouped according to which part of the page
they occur in: the title, the URL, the window, etc.
WAWA can learn from bias and advice given to the system as well as from
example classifications. The ScorePage neural network outputs a rating measure. This rating is then used to determine the top n pages, which are then presented to the user. Note, however, that pages are not compared within the neural net.
In WAWA instances are
classified as ‘+’ or ‘-‘ representing
like and dislike, respectively. The domain theory is specified in the form of
rules providing advice. One such rule
is
WHEN consecutiveWordsInHypertext(intelligent user interface)
STRONGLY SUGGEST FOLLOWING HYPERLINK
This advice leads to a highly weighted edge between a hidden node and the input node for the value "intelligent user interface" within the WordsInHypertext attribute.
Other uses of KBANN have involved categorizing examples. For these applications, the domain theory is typically a set of rules that indicate under what conditions an example falls into a particular class. Examples of such applications include splice-junction determination in DNA sequences (Noordewier et al. 1990), determining whether a DNA sequence is a promoter (Towell et al. 1990), and action selection for PID water-tank controllers (Scott et al. 1992).
The Decision-Theoretic
Interactive Video Advisor (DIVA) (Nguyen & Haddawy 1999) uses pairwise
preferences to represent user preferences in the domain of movie
recommendation. Pairwise preferences
are obtained by having the user classify movies into like, ok, and dislike
categories. Every movie in a higher
category is considered preferred to every movie in a lower category. DIVA is a collaborative filtering system
that maintains a database of user preference structures. The system makes recommendations by computing
similarity between preference structures.
The approach in the current paper combines the representational ideas from DIVA with the KBANN theory-refinement learning technique.
6. Conclusions and Future Work

In this paper we have
presented a technique for encoding assumptions concerning preference in a
neural network. We experimented with
one encoding and showed that in our experiments it clearly outperformed an
unbiased ANN. Much work remains to be
done. The examples of pairwise preferences were generated from a very simple
form of preference structure. We need
to determine how well this technique works with more complex preference
structures. One of the strong points of
this approach is the ability to correct for inaccuracies in the domain
theory. Our preliminary experiments on
this are encouraging but we need to perform more exhaustive
experimentation. We intend to explore
this in the domain of movies, since preferences over attributes in that domain
are typically not independent.
The approach taken in this
paper assumed that we have one domain theory that applies to a large segment of
the population. But this may not be
true in all domains. Furthermore, we
may have some individuals whose preferences highly conflict with the
preferences of the majority. In such
domains, we may wish to work with multiple KBANN networks that encode a variety
of domain theories. A weight could be
assigned to each network, depending on the frequency of occurrence of that
preference pattern in the population.
Acknowledgements

We thank Jude Shavlik and
David Page for helpful discussions.
References

Ha, V. & Haddawy, P. (1998). Toward case-based preference elicitation: Similarity measures on preference structures. Proc. UAI-98, pp. 193-201.

Keeney, R. L. & Raiffa, H. (1976). Decisions with Multiple Objectives: Preferences and Value Tradeoffs. Wiley, New York.

Mitchell, T. M. (1997). Machine Learning. McGraw-Hill, New York, pp. 340-344.

Nguyen, H. & Haddawy, P. (1999). The decision-theoretic interactive video advisor. Proc. UAI-99, pp. 494-501.

Scott, G. M., Shavlik, J. W. & Ray, W. H. (1992). Refining PID Controllers using Neural Networks. Advances in Neural Information Processing Systems, pp. 555-562, Denver, CO. Morgan Kaufmann.

Shavlik, J. & Towell, G. (1989). An approach to combining explanation-based and neural learning algorithms. Connection Science, 1(3): 233-255.

Shavlik, J., Calcari, S., Eliassi-Rad, T. & Solock, J. (1999). An Instructable, Adaptive Interface for Discovering and Monitoring Information on the World-Wide Web. Proceedings of the 1999 International Conference on Intelligent User Interfaces, Redondo Beach, CA.

Towell, G. G., Shavlik, J. W. & Noordewier, M. O. (1990). Refinement of Approximate Domain Theories by Knowledge-Based Neural Networks. Proceedings of the Eighth National Conference on Artificial Intelligence, pp. 861-866, Boston, MA.