NSF EAGER Project: Discovering Emerging Events in Social Media
This material is based upon work supported by the National Science Foundation
under Grant No. IIS-1143807. Any opinions, findings, and conclusions or recommendations
expressed in this material are those of the author and do not necessarily reflect
the views of the National Science Foundation.
Award number: IIS-1143807
Award amount: $149,998
Duration: Sept 1, 2011 - Aug 31, 2013 (extended to Aug 31, 2014)
Principal investigator: AnHai Doan
Project Abstract, Goals, Research Challenges, and Broader Impacts
Social media (e.g., Twitter, Facebook, YouTube, blogs) has now become
ubiquitous. It plays an increasingly critical role in many domains,
including commerce, disaster management, science, and national
security. In these domains, applications often have to integrate
social media data to detect emerging events. Examples of such
events include a planned protest in a city square, a discovered defect
of a newly released product, an earthquake that just happened in a
remote area, and an emerging algae bloom in a lake. Despite the
obvious importance of detecting such events, today few solutions for
event detection have been proposed, and these solutions often do not
work well because they do not take into account the unique
characteristics of social media.
This project addresses these limitations and
develops a solution that effectively integrates social media to
detect emerging events. The solution will focus on the Twittersphere,
and will address the following three key challenges:
- How to exploit characteristics unique to social media to improve
the accuracy of detecting events,
- How to design the solutions such that they scale to high-speed
streams of social media (such as 1500 tweets / second), and
- How to leverage crowdsourcing to find truly interesting events
and extract attributes of these events.
The project will be among the first to explore in depth how to
integrate social media to detect emerging events, taking into account
social media characteristics. As such, it is a high-risk/high-payoff
project that can open the door to novel research directions, and help
accelerate research into social media integration, an increasingly
critical problem that impacts many areas of society. If
successful, the project can also help build practical event discovery
tools that can make immediate impacts. Finally, the project will help
train one Ph.D. student for two years, and help build and release a
set of infrastructure tools and testbeds that can help accelerate
subsequent research into social media integration, for both the PI's
group and other research groups in social media.
People
Major Activities and Resulting Research Results
- For the first challenge (accurately detecting emerging events in
the Twittersphere), we have developed a solution. It has been released
as a technical report, and will be submitted to a conference in August
2015.
- For the second challenge (scaling up event detection), we have
developed two solutions.
- The first solution focuses on scaling up event detection over
a single machine. It is described in the above-mentioned technical
report, and will be submitted to a conference in August 2015.
- The second solution focuses on scaling up event detection over a cluster
of machines. It has been published in VLDB-12.
- We have also developed a solution to extract named entities from
tweets accurately and fast, and a solution to build a global knowledge
base that can be used in extracting named entities from tweets. These
solutions can be used to enhance the accuracy of event detection from
Twitter. They have been published in VLDB-13 and SIGMOD-13.
- We have written a paper describing the overall solution
architecture to social media analytics at Kosmix, a startup in Silicon
Valley. Event detection in Twitter is a major component of this
solution architecture. The paper was invited to a journal and
published in 2013.
- The project has partly funded the co-authoring of Principles of
Data Integration, the leading textbook on data integration,
published by Morgan Kaufmann in 2012.
- We have collaborated extensively with researchers at WalmartLabs in
processing social media for e-commerce. Some of the results of this collaboration
have been described in the publications and in a patent obtained in 2012.
Dissemination of the Project Information
Publications in Conferences and Journals
- We have published 3 papers in top-ranked data management
conferences about the project, and 1 invited journal paper.
- We have 1 more paper, released as a technical report, which
will be submitted to a conference in August 2015.
Textbook, Workshops, and Classes
- Materials from the project have been included in a textbook on
data integration, published in 2012.
- Materials from the project were discussed at two workshops. The
first was the NSF Workshop on Social Networks and Mobility in the
Cloud, Feb 2012, where the PI gave an invited talk (see also the PI's
position paper at the workshop). The second was a workshop on social
media analysis organized by DHS and FEMA in Washington, DC, Jan 2013,
where the PI also gave an invited talk.
- Materials from the project have been taught extensively in CS 784,
the most advanced data management class at UW-Madison.
Invited Talks at Universities and Organizations
From 2012 to 2014, the PI also gave many talks on the project at
various universities and organizations. These include the University
of California, Irvine; the University of California, San Diego; Stanford University;
the University of Texas, Austin; and the New England Database Society, among others.
Data, System Artifacts, and Patents
- Twitter data cannot be publicly released due to a restriction in
the usage agreement.
- The source code for our solution to detect emerging events
accurately and fast can be found here.
- A patent was obtained for the solution described in the VLDB-12
paper, listed under
Systems and Methods for Event Stream Processing.
Publications
- Event Extraction in the Twittersphere, A. Ardalan, Q. Wan, N. Garera, A. Doan, J. Patel,
UW-Madison Technical Report, 2014.
- Building, Maintaining, and Using Knowledge Bases: A Report from
the Trenches, O. Deshpande, D. Lamba, M. Tourn, S. Das,
S. Subramaniam, A. Rajaraman, V. Harinarayan, A. Doan. SIGMOD-13.
- Entity Extraction, Linking, Classification, and Tagging for
Social Media: A Wikipedia-Based Approach, A. Gattani, D. Lamba,
N. Garera, M. Tiwari, X. Chai, S. Das, S. Subramaniam, A. Rajaraman,
V. Harinarayan, and A. Doan. VLDB-13.
- Social Media Analytics: the Kosmix Story, with many
authors. IEEE Data Engineering Bulletin, Sept 2013. Invited paper.
- Muppet: MapReduce-Style Processing of Fast Data, W. Lam,
L. Liu, S. Prasad, A. Rajaraman, Z. Vacheri, A. Doan. VLDB-12.
- Principles of Data Integration, A. Doan, A. Halevy,
and Z. Ives, first edition, Morgan Kaufmann, 2012.
- Analyzing and Integrating Social Media, A. Doan,
NSF Workshop on Social Networks and Mobility in the Cloud, 2012,
position paper.
Patents
Software and Data
- Source code for Muppet is available on GitHub under the name "Mupd8".
This is the system that can process a tweet stream in a highly scalable fashion on a cluster of machines, as described
in the above VLDB-12 paper.
- Source code for our solution to detect events is available here. This
solution is described in the technical report mentioned above.
- Tweets used in our experiments were obtained from WalmartLabs and cannot be shared under the usage agreement. Interested researchers can, however, contact Gnip to explore obtaining a part of the Twitter firehose.
Collaboration and Outreach
We collaborated extensively with @WalmartLabs on the topic of
event detection and monitoring. Some of these activities are
described in the above published papers and in a patent obtained in
2012. These activities ended in mid-2014 (at the end of this
grant).
We also collaborated with Dhavan V. Shah, his students, and
several other researchers in the School of Journalism
and Mass Communication to explore detecting political events in the
Twittersphere. Our main outreach activities were to advise them on how
to set up their infrastructure to detect such events, and to share
lessons learned from our project on how to detect events accurately.
These activities ended in December 2013.
Last updated July 13, 2015