===========================================================================
HotOS XIII Review #8A
Updated Monday 28 Feb 2011 9:47:47pm PST
---------------------------------------------------------------------------
Paper #8: Maya : Zero Effort Personalized Web Experience for the Entire Internet User-base
---------------------------------------------------------------------------

Overall merit: 1. Reject - This paper has no place in the workshop
Reviewer expertise: 3. Knowledgeable

===== Paper summary =====

The paper argues that web personalization is an unsolved problem. It proposes a peer-to-peer mechanism to (supposedly) anonymously aggregate users' traces, suggests using machine learning algorithms to cluster preferences into "taste models", and then uses these taste models to make recommendations (while maintaining privacy).

The paper bit off more than it could chew, simultaneously suggesting it could solve web personalization, realize it in a fully decentralized fashion, and provide such functionality while maintaining full anonymity and privacy. Improvements in *any* of these three areas would be beneficial, yet in trying to argue for all three, the paper gives only cursory consideration to each. In doing so, it doesn't seem to understand all of the problems in each area, and thus doesn't offer any new insight or compelling technical directions that would overcome prior limitations.

Regarding the claim that web personalization is unsolved and limited to sites requiring explicit user input (such as StumbleUpon or Reddit): search engines and online advertisers who target ads would disagree. Google and other search engines offer individually personalized search results (see http://press.princeton.edu/titles/9378.html for a book by one of the researchers behind Google's personalization algorithms), while personalized behavioral algorithms have certainly been commercially deployed by online advertisers.
Perhaps a better direction would have been to start by understanding some of the concrete approaches taken by these algorithms, and then to figure out how they could be implemented in a more privacy-friendly fashion.

k-Anonymity is a fairly weak metric; there has been a lot of recent work on deanonymizing search and social network datasets (see, e.g., work by Arvind Narayanan or Jon Kleinberg), as well as new work on better privacy metrics for datasets (e.g., differential privacy).

===========================================================================
HotOS XIII Review #8B
Updated Tuesday 1 Mar 2011 9:31:29am PST
---------------------------------------------------------------------------
Paper #8: Maya : Zero Effort Personalized Web Experience for the Entire Internet User-base
---------------------------------------------------------------------------

Overall merit: 1. Reject - This paper has no place in the workshop
Reviewer expertise: 2. Some familiarity

===== Paper summary =====

This paper tackles the problem of web personalisation: allowing users to easily receive relevant pages (and advertisements) based on their preferences and browsing history. It claims that the major obstacle to more effective personalisation is user privacy, and that if a privacy-preserving means of collecting information on how every user browses the web were available, the problem would be solved. To this end, it proposes Maya: a peer-to-peer system which would collect the full browsing history of every user on the internet, aggregate it, and then use machine learning to provide each user with a highly personalised view of the web.

This paper is clearly tackling a big problem, and is likely to provoke discussion. However, the premises of the paper are dubious and poorly justified, and the design of the system is presented at a high level without any analysis of important factors such as scalability or cost.

===== Comments for author =====

I'm skeptical that attempting to collect and process the web usage traces of every user on the internet is either feasible or a good idea, and dubious that the system proposed here would work at any scale or preserve users' privacy.

A good starting point would be to find a definition for privacy in the context of this system, and then identify which parts of the system must be trusted to preserve it. Without this it's difficult to argue directly that the system won't work, but statements such as the following are not reassuring:

* "Client Programs ... can do so under an assumed pseudonym" -- as you admit, this gives no privacy.
* "The Transmitter is trusted not to cause intentional risk to the privacy of the user because it is part of the common software package" -- but this is largely irrelevant: it transmits every website I visit, so that sounds like a pretty big risk by design.
* "The browsing trace is broken down into RSets ... in order to reduce the risk to the users privacy". What is the level of risk that a user can tolerate, and how do you quantify it?

The system relies on machine learning to achieve many of its goals, but doesn't give me any confidence that this will work. For example, the Filter nodes "detect lies by malicious Transmitters and discard them" using keywords and "sophisticated machine learning techniques".

The system has a lot of components which will need to store and process large amounts of data, and run computationally intensive machine learning algorithms: routers, aggregators, filters, honeypots. I think you need to do some quantitative analysis involving expected data rates and numbers of nodes, to see if the system will scale.

Minor comments / nits:

"The common internet user in the US ... visits about 80 non-unique web pages a day" (S1) -- Should this be unique pages?

Who are the "entire Internet Serb's"? (S1)

What does it mean to "punish maliciousness" during Routing?
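
The quantitative analysis requested above could start from a back-of-envelope estimate along the following lines. This is a rough sketch, not a model of Maya itself: only the figure of 80 page visits per user per day comes from the paper (S1); the user count, record size, and router count are hypothetical placeholders, and the function names are invented for illustration.

```python
# Back-of-envelope load estimate for a trace-collection system.
# Only PAGES_PER_USER_PER_DAY is taken from the paper (S1); every other
# number below is a hypothetical placeholder chosen for illustration.

PAGES_PER_USER_PER_DAY = 80
SECONDS_PER_DAY = 24 * 60 * 60


def aggregate_load(num_users, bytes_per_record,
                   pages_per_day=PAGES_PER_USER_PER_DAY):
    """Return the system-wide record and byte volume for one day."""
    records_per_day = num_users * pages_per_day
    return {
        "records_per_day": records_per_day,
        "records_per_second": records_per_day / SECONDS_PER_DAY,
        "bytes_per_day": records_per_day * bytes_per_record,
    }


def per_node_rate(records_per_second, num_nodes):
    """Average records per second each node sees if load is spread evenly."""
    return records_per_second / num_nodes


# Hypothetical inputs: one million users, 200-byte records, 1000 routers.
load = aggregate_load(num_users=1_000_000, bytes_per_record=200)
router_rate = per_node_rate(load["records_per_second"], num_nodes=1000)

print(load)
print(router_rate)
```

Varying the hypothetical inputs across a plausible range would show how quickly storage and per-node processing requirements grow, which is the kind of evidence a scalability argument needs.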

===========================================================================
HotOS XIII Review #8C
Updated Friday 4 Mar 2011 4:45:43pm PST
---------------------------------------------------------------------------
Paper #8: Maya : Zero Effort Personalized Web Experience for the Entire Internet User-base
---------------------------------------------------------------------------

Overall merit: 2. Weak reject - Probably a reject, but could be convinced otherwise
Reviewer expertise: 2. Some familiarity

===== Paper summary =====

The paper proposes a P2P system that collects anonymized web-usage information from Internet users and, based on that, outputs personalization suggestions for each user, e.g., which web sites the user may be interested in.

Pros: Interesting idea, can generate discussion.

Cons:
- The paper does not provide any interesting discussion/analysis beyond stating this high-level idea.
- The description of the proposed system is too high level; even though the term "malicious" is frequently used, the adversary model is unclear.

===== Comments for author =====

The paper would benefit from a clearer description of the proposed system, e.g., what kind of adversarial behavior it can handle and how, what kind of anonymization and aggregation techniques it could rely on, how trust scores could be maintained... I am not asking for a complete, detailed description, but it is impossible to assess the viability of the proposed system without any sense of how it would work. A rough estimate of the load expected on the various components would also help.

Things that were unclear:

Is there a need for transmitters to coordinate their communication with the routers? I.e., could it be that if one transmitter alone sends information to a malicious router, then the router can look at the information, combine it with information from websites that cooperate with the malicious router, and guess the identity of the transmitter?

In Section 3.1.1, paragraph starting with "Malicious Transmitters": What happens if a malicious transmitter tries to artificially associate a particular soccer-related website with a group of popular soccer-related websites? For instance, the malicious transmitter may launch a Sybil attack, and make it look as if multiple users who visit the popular soccer sites also visit the particular website. In this case, keyword-based filtering will not work, because the site championed by the malicious transmitter *is* related to the other websites theme-wise.
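
The concern about keyword-based filtering can be illustrated with a small sketch. The paper does not specify how its Filter nodes compare keywords, so the Jaccard-overlap rule, the threshold, the keyword sets, and the function names below are all assumptions made purely for illustration.

```python
# Toy keyword filter: accept a claimed site-to-cluster association only if
# the site's keywords overlap enough with the cluster's keywords.
# The Jaccard rule, threshold, and keyword sets are hypothetical; the paper
# does not describe its filter at this level of detail.


def jaccard(a, b):
    """Jaccard similarity between two keyword collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def keyword_filter_accepts(site_keywords, cluster_keywords, threshold=0.3):
    """Return True if the filter would let the association through."""
    return jaccard(site_keywords, cluster_keywords) >= threshold


# Keywords describing the group of popular soccer sites.
cluster = {"soccer", "football", "league", "goal", "match"}

# Site promoted by Sybil identities: genuinely soccer-themed.
promoted_site = {"soccer", "league", "match", "tickets"}

# Off-topic site, the case a keyword filter is designed to catch.
unrelated_site = {"casino", "poker", "bonus"}

sybil_accepts = keyword_filter_accepts(promoted_site, cluster)
unrelated_accepts = keyword_filter_accepts(unrelated_site, cluster)

print(sybil_accepts, unrelated_accepts)
```

Under these assumptions the off-topic site is rejected, but the Sybil-promoted site passes because its theme really does match the cluster. Catching it would require evidence about who reported the visits, not about what the site contains.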