VLDB 2012 Ice Breaker v0.1

It may surprise you, but when meeting other people it helps to act interested in their recent publications and research. I created this enhanced version of the VLDB program listing (authors have links to DBLP) to track recent research for my blog. You may find it useful at the VLDB conference.

I will be revising this document to resolve names (where DBLP suggests more than one match) and to correct errors. Comments and suggestions are welcome! patrick@durusau.net

I have generally followed the format of the VLDB program, which means there are duplicate entries for demonstrations and papers, for instance. I left the duplicates in, reasoning that being able to quickly find content in the expected location outweighed other considerations.


Table Of Contents

Keynotes

Keynote Talk 1: Data Management on the Spatial Web
Christian S. Jensen (Aarhus University, Denmark)

Abstract. Due in part to the increasing mobile use of the web and the proliferation of geo-positioning, the web is fast acquiring a significant spatial aspect. Content and users are being augmented with locations that are used increasingly by location-based services. Studies suggest that each week, several billion web queries are issued that have local intent and target spatial web objects. These are points of interest with a web presence, and they thus have locations as well as textual descriptions. This development has given prominence to spatial web data management, an area ripe with new and exciting opportunities and challenges. The research community has embarked on inventing and supporting new query functionality for the spatial web. Different kinds of spatial web queries return objects that are near a location argument and are relevant to a text argument. To support such queries, it is important to be able to rank objects according to their relevance to a query. And it is important to be able to process the queries with low latency. The talk offers an overview of key aspects of the spatial web. Based on recent results obtained by the speaker and his colleagues, the talk explores new query functionality enabled by the setting. Further, the talk offers insight into the data management techniques capable of supporting such functionality.
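To make the ranking idea in this abstract concrete, here is a minimal sketch of a spatial keyword query that blends proximity to a location argument with relevance to a text argument. It is an illustration only and is not taken from the talk: the function names, the linear weighting with `alpha`, the `max_dist` cutoff, and the term-overlap measure are all assumptions chosen for brevity.

```python
import math


def score(obj, query_loc, query_terms, alpha=0.5, max_dist=10.0):
    """Blend spatial proximity and text relevance into one score in [0, 1].

    Assumed scoring: a linear mix weighted by alpha. Proximity falls off
    linearly to 0 at max_dist; relevance is the fraction of query terms
    found in the object's description.
    """
    dist = math.dist(obj["loc"], query_loc)  # Euclidean distance
    proximity = max(0.0, 1.0 - dist / max_dist)
    words = set(obj["text"].lower().split())
    terms = {t.lower() for t in query_terms}
    relevance = len(terms & words) / len(terms) if terms else 0.0
    return alpha * proximity + (1 - alpha) * relevance


def top_k(objects, query_loc, query_terms, k=2, **kwargs):
    """Return the k highest-scoring objects by scanning every object."""
    ranked = sorted(
        objects,
        key=lambda o: score(o, query_loc, query_terms, **kwargs),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical points of interest with a location and a textual description.
places = [
    {"name": "A", "loc": (0.0, 0.0), "text": "italian pizza restaurant"},
    {"name": "B", "loc": (1.0, 0.0), "text": "coffee shop"},
    {"name": "C", "loc": (8.0, 0.0), "text": "pizza delivery"},
]

for place in top_k(places, (0.0, 0.0), ["pizza"], k=2):
    print(place["name"])
```

This sketch scans every object, which is fine for a handful of places but would not meet the low-latency requirement the abstract mentions. At scale, one would expect an index that prunes on both location and text rather than a full scan.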

Keynote Talk 2: Data Analytics Opportunities in a Smarter Planet
Brenda Dietrich (IBM T J Watson Research Center, USA)

Abstract. New applications of computing are being enabled by instrumentation of physical entities, aggregation of data, and the analysis of the data. The resulting integration of information and control permits efficient and effective management of complex man-made systems. Examples include transportation systems, buildings, electrical grids, health care systems, governments, and supply chains. Achieving this vision requires extensive data integration and analysis, over diverse, rapidly changing, and often uncertain data. There are many challenges, requiring both new data management techniques as well as new mathematics, forcing new collaborations as the basis of the new "Data Science". Needs and opportunities will be discussed in the context of specific pilots and projects.

Keynote Talk 3: Challenges in Economic Massive Content Storage and Management (MCSAM) in the Era of Self-Organizing, Self-Expanding and Self-Linking Data Clusters
Kenan Şahin (TIAX, USA)

Abstract. Rapid spread of social networks, global on-line shopping, post 9/11 security oriented linking of data bases and foremost the global adoption of smart phones/devices, among other phenomena, are transforming data clusters into dynamic and almost uncontrollable entities that have their own local intelligence, clients and objectives. The scale and rapidity of change is such that large scale innovations in content storage and management are urgently needed if the diseconomies of scale and complexity are to be mitigated. The field needs to reinvent itself. Istanbul, a city that has reinvented itself many times, is an excellent venue to engage in such a discussion and for me to offer suggestions and proposals that derive from personal experiences that span academia, start ups, R&D firms and Bell Labs as well as my early years spent in Istanbul.


10 Year Best Paper Award


VLDB Awards: Approximate Frequency Counts over Data Streams
Gurmeet Singh Manku (Google Inc., USA)
Rajeev Motwani

Abstract. Research in data stream algorithms has blossomed since the late 90s. The talk will trace the history of the Approximate Frequency Counts paper, how it was conceptualized and how it influenced data stream research. The talk will also touch upon a recent development: analysis of personal data streams for improving our quality of life.

Tutorials



Tutorial Session 1: Efficient Big Data Processing in Hadoop MapReduce
Jens Dittrich (Universität Saarland, Germany)
Jorge-Arnulfo Quiané-Ruiz (Universität Saarland, Germany)


Abstract. This tutorial is motivated by the clear need of many organizations, companies, and researchers to deal with big data volumes efficiently. Examples include web analytics applications, scientific applications, and social networks. A popular data processing engine for big data is Hadoop MapReduce. Early versions of Hadoop MapReduce suffered from severe performance problems. Today, this is becoming history. There are many techniques that can be used with Hadoop MapReduce jobs to boost performance by orders of magnitude. In this tutorial we teach such techniques. First, we will briefly familiarize the audience with Hadoop MapReduce and motivate its use for big data processing. Then, we will focus on different data management techniques, going from job optimization to physical data organization like data layouts and indexes. Throughout this tutorial, we will highlight the similarities and differences between Hadoop MapReduce and Parallel DBMS. Furthermore, we will point out unresolved research problems and open issues.



Tutorial Session 2: MapReduce Algorithms for Big Data Analysis
Kyuseok Shim (Seoul National University, Korea)

Abstract. There is a growing trend of applications that should handle big data. However, analyzing big data is a very challenging problem today. For such applications, the MapReduce framework has recently attracted a lot of attention. Google's MapReduce or its open-source equivalent Hadoop is a powerful tool for building such applications. In this tutorial, we will introduce the MapReduce framework based on Hadoop, discuss how to design efficient MapReduce algorithms and present the state-of-the-art in MapReduce algorithms for data mining, machine learning and similarity joins. The intended audience of this tutorial is professionals who plan to design and develop MapReduce algorithms and researchers who should be aware of the state-of-the-art in MapReduce algorithms available today for big data analysis.



Tutorial Session 3: Entity Resolution: Theory, Practice & Open Challenges
Lise Getoor (University of Maryland, USA)
Ashwin Machanavajjhala (Duke University, USA)


Abstract. This tutorial brings together perspectives on ER from a variety of fields, including databases, machine learning, natural language processing and information retrieval, to provide, in one setting, a survey of a large body of work. We discuss both the practical aspects and theoretical underpinnings of ER. We describe existing solutions, current challenges, and open research problems.



Tutorial Session 4: I/O Characteristics of NoSQL Databases
Jiri Schindler (NetApp Inc., USA)

Abstract. The advent of the so-called NoSQL databases has brought about a new model of using storage systems. While traditional relational database systems took advantage of features offered by centrally-managed, enterprise-class storage arrays, the new generation of database systems with weaker data consistency models is content with using and managing locally attached individual storage devices and providing data reliability and availability through high-level software features and protocols. This work aims to review the architecture of several existing NoSQL DBs with an emphasis on how they organize and access data in the shared-nothing locally-attached storage model. It shows how these systems operate under typical workloads (new inserts and point and range queries) and what access characteristics they exhibit to storage systems. Finally, it examines how several recently developed key/value stores, schema-free document storage systems, and extensible column stores organize data on local filesystems on top of directly-attached disks and what system features they must (re)implement in order to provide the expected data reliability.



Tutorial Session 5: Mining Knowledge from Interconnected Data: A Heterogeneous Information Network Analysis Approach
Yizhou Sun (University of Illinois at Urbana-Champaign, USA)
Jiawei Han (University of Illinois at Urbana-Champaign, USA)
Xifeng Yan (University of California, Santa Barbara, USA)
Philip S. Yu (University of Illinois at Chicago, USA)

Abstract. Most objects and data in the real world are interconnected, forming complex, heterogeneous but often semi-structured information networks. However, most people consider a database merely as a data repository that supports data storage and retrieval rather than one or a set of heterogeneous information networks that contain rich, inter-related, multi-typed data and information. Most network science researchers only study homogeneous networks, without distinguishing the different types of objects and links in the networks. In this tutorial, we view database and other interconnected data as heterogeneous information networks, and study how to leverage the rich semantic meaning of types of objects and links in the networks. We systematically introduce the technologies that can effectively and efficiently mine useful knowledge from such information networks.



Tutorial Session 6: Understanding and Managing Cascades on Large Graphs
B. Aditya Prakash (Carnegie Mellon University, USA)
Christos Faloutsos (Carnegie Mellon University, USA)


Abstract. How do contagions spread in population networks? Which group should we market to, for maximizing product penetration? Will a given YouTube video go viral? Who are the best people to vaccinate? What happens when two products compete? The objective of this tutorial is to provide an intuitive and concise overview of the most important theoretical results and algorithms to help us understand and manipulate such propagation-style processes on large networks. The tutorial contains three parts: (a) Theoretical results on the behavior of fundamental models; (b) Scalable Algorithms for changing the behavior of these processes, e.g., for immunization, marketing, etc.; and (c) Empirical Studies of diffusion on blogs and on-line websites like Twitter. The problems we focus on are central in surprisingly diverse areas: from computer science and engineering, epidemiology and public health, product marketing to information dissemination. Our emphasis is on intuition behind each topic, and guidelines for the practitioner.



Tutorial Session 7: Interoperability in eHealth Systems (Invited Tutorial)
Asuman Dogac (SRDC Ltd., Turkey)

Abstract. Interoperability in eHealth systems is important for delivering quality healthcare and reducing healthcare costs. Some of the important use cases include coordinating the care of chronic patients by enabling the co-operation of many different eHealth systems such as Electronic Health Record Systems (EHRs), Personal Health Record Systems (PHRs) and wireless medical sensor devices; enabling secondary use of EHRs for clinical research; being able to share life long EHRs among different healthcare providers. Although achieving eHealth interoperability is quite a challenge both because there are competing standards and clinical information itself is very complex, there have been a number of successful industry initiatives such as Integrating the Healthcare Enterprise (IHE) Profiles, as well as large scale deployments such as the National Health Information System of Turkey and the epSOS initiative for sharing Electronic Health Records and ePrescriptions in Europe. This article briefly describes the subjects discussed in the VLDB 2012 tutorial to provide an overview of the issues in eHealth interoperability describing the key technologies and standards, identifying important use cases and the associated research challenges and also describing some of the large scale deployments. The aim is to foster further interest in this area.


Tutorial Session 8: Secure and Privacy-Preserving Data Services in the Cloud: A Data Centric View
Divyakant Agrawal (University of California at Santa Barbara, USA)
Amr El Abbadi (University of California at Santa Barbara, USA)
Shiyuan Wang (University of California at Santa Barbara, USA)


Abstract. Cloud computing has become a successful paradigm for data computing and storage. Increasing concerns about data security and privacy in the cloud, however, have emerged. Ensuring security and privacy for data management and query processing in the cloud is critical for better and broader uses of the cloud. This tutorial covers some common cloud security and privacy threats and the relevant research, while focusing on the works that protect data confidentiality and query access privacy for sensitive data being stored and queried in the cloud. We provide a comprehensive study of state-of-the-art schemes and techniques for protecting data confidentiality and access privacy, which make different tradeoffs in the multidimensional space of security, privacy, functionality and performance.



Tutorial Session 9: Graph Synopses, Sketches, and Streams: A Survey
Sudipto Guha (University of Pennsylvania, USA)
Andrew McGregor (University of Massachusetts Amherst, USA)


Abstract. Massive graphs arise in any application where there is data about both basic entities and the relationships between these entities, e.g., web-pages and hyperlinks; neurons and synapses; papers and citations; IP addresses and network flows; people and their friendships. Graphs have also become the de facto standard for representing many types of highly structured data. However, the sheer size of many of these graphs renders classical algorithms inapplicable when it comes to analyzing such graphs. In addition, these existing algorithms are typically ill-suited to processing distributed or stream data. Various platforms have been developed for processing large data sets. At the same time, there is the need to develop new algorithmic ideas and paradigms. In the case of graph processing, a lot of recent work has focused on understanding the important algorithmic issues. A central aspect of this is the question of how to construct and leverage small-space synopses in graph processing. The goal of this tutorial is to survey recent work on this question and highlight interesting directions for future research.


Panels



Panel Session 1: Challenges and Opportunities with Big Data
Moderators: Alexandros Labrinidis (University of Pittsburgh, USA)
H. V. Jagadish (University of Michigan, USA)

Panelists: Susan Davidson (University of Pennsylvania)
Johannes Gehrke (Cornell University)
Nick Koudas (University of Toronto)
Raghu Ramakrishnan (Microsoft)


Abstract. The promise of data-driven decision-making is now being recognized broadly, and there is growing enthusiasm for the notion of "Big Data," including the recent announcement from the White House about new funding initiatives across different agencies that target research for Big Data. While the promise of Big Data is real -- for example, it is estimated that Google alone contributed 54 billion dollars to the US economy in 2009 -- there is no clear consensus on what Big Data is. In fact, there have been many controversial statements about Big Data, such as "Size is the only thing that matters." In this panel we will try to explore the controversies and debunk the myths surrounding Big Data.



Panel Session 2: Social Networks and Mobility in the Cloud
Moderators: Amr El Abbadi (University of California, Santa Barbara, USA)
Mohamed F. Mokbel (University of Minnesota, USA)

Panelists: Gustavo Alonso (ETH Zurich)
Mike Carey (University of California, Irvine)
Mohamed Mokbel (University of Minnesota)
Srinivas Narayanan (Facebook)
Gerhard Weikum (Max-Planck-Institut für Informatik)


Abstract. Social networks, mobility and the cloud represent special and unique opportunities for synergy among several existing and emerging communities that are now often evolving in isolated silos. All three areas hold much promise for the future of computing, and represent significant challenges for large scale data management. As these three areas evolve, their direct influence on significant decisions on each other becomes evident and critical. In particular, the cloud has evolved as a new infrastructure paradigm that is a favorite for most computing applications, especially with its attractive pay-as-you-go model. This makes it especially attractive for large scale, elastic novel applications. Social networks have exploded in the last few years with diverse and novel large scale needs, connecting diverse communities and demanding ever increasing resources, thus exploiting the cloud to the maximum degree. Mobility has been an important and significant aspect of many computing applications for a while now. However, the advent of the cloud and social network applications raises many important challenges when confronted with mobility due to its highly dynamic nature. The potential for cross fertilization among these three areas of important research will drive much of the research in academia, and the products in industry. However, each of these areas also has its own particular demands and needs, which are often at odds with each other. Hence, there is typically much tension between the needs of the applications, i.e., social network and mobility needs, and those of the underlying cloud infrastructure. This often causes controversy when discussing these seemingly diverse topics together. This panel will bring together a set of renowned researchers who will explore and discuss the synergy and tensions among critical and often intertwined research and application issues that arise in the context of social networks and mobility in a cloud infrastructure setting.


Demonstration Program Details


Demonstration Session 1: MapReduce, Big Data Systems, and Crowdsourcing

Demonstration Session 2: Query Pricing, Processing, and Optimization

Demonstration Session 3: Information Retrieval, Web, and Mobility

PhD Workshop Program Details


PhD Workshop Session 1: Data Semantics and Data Mining
Session Chair: Ioana Manolescu, INRIA

PhD Workshop Session 2: Database Systems
Session Chair: Murat Kantarcioglu, Univ. of Texas at Dallas

PhD Workshop Session 3: Query Processing
Session Chair: Jeffrey Xu Yu, Chinese University of Hong Kong

Abstracts



Research Session 1: Spatial Queries
Session Chair: Chen Li

Research Session 2: MapReduce I

Research Session 3: Caching and Prefetching

Research Session 4: Automation

Research Session 5: Web and IR I

Research Session 6: Dense Graphs Discovery

Research Session 7: Query Processing I

Research Session 8: Crowd Sourcing

Research Session 9: Cloud Databases

Research Session 10: Graphs Statistics and Summaries

Research Session 11: Concurrency

Research Session 12: Spatio-Temporal Queries

Research Session 13: MapReduce II

Research Session 14: Storage

Research Session 15: Privacy I

Research Session 16: Analytics

Research Session 17: Information Networks

Research Session 18: Distributed Databases

Research Session 19: Privacy II

Research Session 20: Modern Hardware

Research Session 21: Shortest Paths and Reachability

Research Session 22: Query Processing II

Research Session 23: Similarity Search and Ranking I

Research Session 24: String Processing

Research Session 25: Data Integration

Research Session 26: Fundamentals and Theory

Research Session 27: Streams

Research Session 28: Indexing

Research Session 29: Probabilistic Databases

Research Session 30: Social Networks

Research Session 31: Trees, Hierarchies and Taxonomies

Research Session 32: Similarity Search and Ranking II

Research Session 33: Web and IR II

Research Session 34: Graphs Similarity Search

Research Session 35: Web Databases

Research Session 36: Data Flow Processing

Research Session 37: Sequence Processing


Experiments and Analysis Session 1: Mining Cleaning and Matching

Experiments and Analysis Session 2: Large Data Management


Industrial Session 1: Database Engine

Industrial Session 2: Potpourri

Industrial Session 3: Big Data I

Industrial Session 4: Big Data II


Demonstration Session 1: MapReduce, Big Data Systems, and Crowdsourcing

Demonstration Session 2: Query Pricing, Processing, and Optimization

Demonstration Session 3: Information Retrieval, Web, and Mobility


PhD Workshop Session 1: Data Semantics and Data Mining

PhD Workshop Session 2: Database Systems

PhD Workshop Session 3: Query Processing



SAP Lunch Talk: An in-memory columnar processing based database platform for Enterprise Applications
Vishal Sikka (Member of Executive Board, Head of Technology & Innovation, SAP, Germany)
http://www.sap.com/corporate-en/our-company/sap-boards/executive-board/Vishal-Sikka.epx


Abstract. Specialized data management systems have recently emerged because increasingly commercially available systems have been making compromises to comprehensively address the diverse data needs of a business. Compounded with the evolution of how data is created and used, the specialization of database management systems created a critical need for a comprehensive, real-time data processing platform for the modern enterprise. SAP HANA, the core database engine of SAP's in-memory platform, seeks to enable both transactional and analytical workloads on the same data representation. It also supports both structured and unstructured data analysis, and optimizes execution of application-specific function logic, all within a highly scalable in-memory execution environment. It succeeds in this mission because it leverages the advantages of a fully in-memory columnar store with highly optimized internal data structures and parallel processing algorithms. In this talk, Dr. Sikka will outline the key principles of this next-generation data platform, and some ways in which enterprise application architecture can be rethought in light of this work.