Scientific Mission: Most of the world's information is unstructured: text, images and sensory data. Accessing it in a meaningful way and extracting relevant content is inherently difficult because relevant concepts may be expressed in a wide variety of ways. The MIAS mission is to develop the theories, algorithms, and tools that let analysts and researchers access a variety of data formats, integrate unstructured data with existing resources, and transform raw data into useful and understandable information.
Educational Mission: Develop diverse human resources to enhance scientific research in MIAS areas. This is done through the Data Science Summer Institute, an intensive 8-week summer school at the University of Illinois at Urbana-Champaign that covers foundations, advanced-topic tutorials, and hands-on research projects.
Focused Data Retrieval and Data Integration
This research aims at enabling effective access to structured information sources on the Internet. Over the past few years, the Web has deepened dramatically. A significant and increasing amount of information is hidden away on the "deep" Web, behind the query interfaces of searchable databases. There are numerous such autonomous and heterogeneous sources, each with a different schema and native query constraints. Because current crawlers cannot effectively query databases, such data is invisible to traditional search engines, and thus remains largely hidden from users.
Participants: Kevin Chang
The explosive growth of information demands powerful text mining tools to help us digest information and discover hidden knowledge in text. Text analysis is often associated with various kinds of context, such as time, location, and sources. Given any text data with context information, we often would like to extract the subtopics or themes from text and analyze their variations over context, e.g., to reveal spatiotemporal variations of a subtopic like "government response" in blog articles about Hurricane Katrina. In this project, we are developing general probabilistic models and new algorithms for discovering and analyzing various contextual patterns from text, which we refer to as contextual text mining. The proposed models have broad applications in multiple domains: they help in understanding topic evolution, the spatiotemporal impact of events, and public opinion, and in detecting topic-related social communities in arbitrary text collections. The extracted topical patterns can reveal hidden associations and latent knowledge in text, and provide evidence for decision-makers to use in making policy decisions.
Participants: ChengXiang Zhai
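As a rough illustration of analyzing a theme's variation over context, the sketch below measures how much of the text in each context (here, a time period) is made up of theme words. It is a simple counting baseline, not the probabilistic models developed in the project; the function name, the toy documents, and the "week" labels are all hypothetical.

```python
from collections import Counter

def theme_share_by_context(docs, theme_terms):
    """Return, for each context value, the fraction of tokens that are theme terms."""
    theme = {t.lower() for t in theme_terms}
    hits = Counter()    # theme-term occurrences per context
    totals = Counter()  # all tokens per context
    for text, context in docs:
        tokens = text.lower().split()
        totals[context] += len(tokens)
        hits[context] += sum(1 for tok in tokens if tok in theme)
    return {c: hits[c] / totals[c] for c in totals if totals[c] > 0}

# Hypothetical toy data: (text, context) pairs, where the context is a week label.
docs = [
    ("government response was slow", "week1"),
    ("levee repairs continue", "week1"),
    ("federal government response criticized again", "week2"),
]
shares = theme_share_by_context(docs, ["government", "response"])
```

A real contextual text mining model would learn the themes from the data rather than take a fixed word list, but the output has the same shape: one theme strength per context value, which can then be compared across time or location.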
Semantic Data Enrichment
Handling the overwhelming array of different data formats, understanding data layout, inferring metadata for a variety of text sources and images, inferring semantic markup, and constructing and augmenting knowledge bases. Specific topics include: adaptation of entity extraction; text and images; building topic structure based on images and their captions; linking stories that use similar images; and building visual knowledge.
Words that appear next to images are correlated. We use this information to train correlated word predictors, and discover scene types.
Participants: David Forsyth
Entity and Relationship Discovery
Entity Uncertainty and Semantic Integration: Developing a robust and efficient semantic integration solution to the problem of entity uncertainty (mention matching). Entity and Relationship Generation: Group matched mentions into entities; develop methods to extract relevant entity attributes. Temporal Aspects and Entity and Relationship Maintenance: Maintain the extracted entity-relationship graph over time, as the underlying raw data changes (e.g., part of the data is deleted, modified, or new data is added).
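The mention matching and grouping steps can be sketched in a few lines. The example below is a minimal illustration, not the project's method: it assumes that two mentions match when their word sets overlap strongly (Jaccard similarity), and it merges matched mentions into entities with a union-find structure. The function names, the 0.5 threshold, and the sample mentions are hypothetical.

```python
import re
from itertools import combinations

def tokens(mention):
    """Lowercased alphabetic tokens of a mention, ignoring punctuation and order."""
    return frozenset(re.findall(r"[a-z]+", mention.lower()))

def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def group_mentions(mentions, threshold=0.5):
    """Group mentions whose token sets are similar enough (assumed matching rule)."""
    parent = list(range(len(mentions)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    toks = [tokens(m) for m in mentions]
    for i, j in combinations(range(len(mentions)), 2):
        if jaccard(toks[i], toks[j]) >= threshold:
            parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i, m in enumerate(mentions):
        groups.setdefault(find(i), []).append(m)
    return list(groups.values())

# Hypothetical mentions of two researchers, written in different styles.
mentions = ["Jiawei Han", "Han, Jiawei", "Dan Roth", "Roth, Dan", "jiawei han"]
entities = group_mentions(mentions)
```

Token overlap is deliberately crude: it would not match "J. Han" to "Jiawei Han", which is the kind of uncertainty a robust solution has to handle.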
The Web is teeming with communities such as those of law enforcement agencies, database researchers, and bioinformaticians. Community members often want to aggregate community data, and then query, monitor, and discover interesting information and events. For example, database researchers might be interested in questions such as, "is there any interesting connection between researchers X and Y?" As Web communities proliferate, developing effective ways to support their information needs at the community level is becoming increasingly important. The Cimple Project aims to develop a software platform that can be rapidly deployed and customized to manage data-rich online communities. To drive and validate Cimple, we are building DBLife, a prototype system that manages information for the database research community. This system is viewable online at http://dblife.cs.wisc.edu
Participants: AnHai Doan
Knowledge Discovery and Hypotheses Generation
Exploiting rich semantic structure generated by identifying entities and relationships among them to promote knowledge discovery and to generate hypotheses that emerge from "surprising" correlations or structural events. Mining hidden links in massive information networks: Mining multiple, heterogeneous semantic links to discover crucial hidden relationships that may have a strong impact on strategically important tasks. Streaming Data: Developing effective methods for clustering and classifying high-dimensional and evolving data streams.
Entities like people, topics, and events are connected via multiple, heterogeneous, hidden links. We are developing appropriate data structures that encode this information by (a) annotating multiple data sources and grouping mentions that represent real-world entities, and (b) identifying relations between entities, thus providing an analytical framework for knowledge discovery and the generation of hypotheses. This effort reveals hidden correlations or structural events, and mining such hidden networks will have a strong impact on identifying hidden themes as well as on the analysis of (anti)social networks and other sub-communities.
Participants: Jiawei Han
The fundamental operation used to access unstructured information is search. Traditional keyword-based search has been popularized by commercial search engines like Google and Yahoo!, where retrieval is heavily dependent on keywords and their frequency in target documents. However, such techniques often fail to capture the information needs of users, and so fail to retrieve essential information even when it is available. The goal of this project is to develop natural language processing capabilities and associated search protocols that improve retrieval. Specifically, we would like to support the search for relations and actions, and to support search via entailment.
Participants: Quang Xuan Do, V.G. Vinod Vydiswaran, Dan Roth
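To make the limitation of keyword-and-frequency retrieval concrete, the sketch below ranks documents with a plain TF-IDF score. It is a generic baseline under stated assumptions, not how any particular search engine works and not this project's approach; the function name, the smoothed IDF formula, and the sample documents are hypothetical.

```python
import math
from collections import Counter

def rank_by_tfidf(query, documents):
    """Rank documents by summed TF-IDF of the query terms; returns (score, index) pairs."""
    tokenized = [doc.lower().split() for doc in documents]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    scores = []
    for idx, toks in enumerate(tokenized):
        tf = Counter(toks)
        score = sum(
            tf[term] * math.log((1 + n) / (1 + df[term]))  # smoothed IDF (assumed variant)
            for term in query.lower().split()
            if term in tf
        )
        scores.append((score, idx))
    # Highest score first; ties broken by document order.
    return sorted(scores, key=lambda pair: (-pair[0], pair[1]))

# Hypothetical documents: the first two describe the same event in different words.
documents = [
    "the company bought the startup",
    "the startup was acquired by the company",
    "weather report for the weekend",
]
ranking = rank_by_tfidf("company bought startup", documents)
```

The second document states the same relation as the first, yet it scores lower only because it says "acquired" rather than "bought". Recognizing that one sentence entails the other is what search via entailment is meant to address.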
Mathematical and Computational Foundations
Improving fundamental understanding in Machine Learning, Natural Language Processing, Database Theory, Combinatorial Optimization, Algorithmic Data Mining, and Probabilistic Modeling.
Most of the world's information comes in the form of unstructured natural language text. Accessing that text in a meaningful way (beyond the keyword level) and extracting relevant content is inherently difficult, in large part because relevant concepts may be expressed in a wide variety of ways. The goal of our work is to make relevant concepts in natural-language text easily accessible to analysts or higher-order automated systems by developing a suite of efficient, precise, and domain-adaptable tools that implement a range of information extraction, synthesis, and display applications. You can try many of our tools at our web site, http://l2r.cs.uiuc.edu/~cogcomp/demos.php.
Participants: Dan Roth