One of the crucial requirements before consuming datasets for any application is
to understand the dataset at hand and its metadata. The process of metadata
discovery is known as data profiling. Profiling activities range from ad hoc
approaches, such as eyeballing random subsets of the data or formulating
aggregation queries, to systematic inference of structural information and
statistics of a dataset using dedicated profiling tools. In this tutorial, we
highlight the importance of data profiling as part of any data-related use case,
and survey the area by classifying data profiling tasks and reviewing state-of-the-art data profiling systems and techniques. In particular, we discuss hard problems in data profiling, such as algorithms for dependency discovery and profiling algorithms for dynamic data and streams.
We conclude with directions for future research in the area of data profiling.
Data Profiling
Ziawasch Abedjan, Lukasz Golab and Felix Naumann.
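To make the profiling tasks above concrete, here is a minimal, illustrative sketch (not from the tutorial itself) of three basic profiling primitives: per-column statistics, uniqueness checking for candidate keys, and a simple functional-dependency test. The table and column names are made up for the example.

```python
# Hypothetical data profiling sketch: column statistics, candidate-key
# detection, and a functional-dependency (FD) check over rows given as dicts.

def column_stats(rows, col):
    """Null count, distinct count, and min/max for one column."""
    values = [r[col] for r in rows if r[col] is not None]
    return {
        "nulls": sum(1 for r in rows if r[col] is None),
        "distinct": len(set(values)),
        "min": min(values),
        "max": max(values),
    }

def is_unique(rows, col):
    """A column is a candidate key if all its non-null values are distinct."""
    values = [r[col] for r in rows if r[col] is not None]
    return len(values) == len(set(values))

def holds_fd(rows, lhs, rhs):
    """Check whether the functional dependency lhs -> rhs holds in the data."""
    seen = {}
    for r in rows:
        if seen.setdefault(r[lhs], r[rhs]) != r[rhs]:
            return False
    return True

rows = [
    {"id": 1, "zip": "10115", "city": "Berlin"},
    {"id": 2, "zip": "10115", "city": "Berlin"},
    {"id": 3, "zip": "75001", "city": "Paris"},
]
print(is_unique(rows, "id"))          # True: 'id' is a candidate key
print(holds_fd(rows, "zip", "city"))  # True: zip -> city holds
```

Real profiling systems search the exponential space of column combinations for such dependencies; this sketch only checks one candidate at a time.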
Microblog data, e.g., tweets, reviews, news comments, and social media comments,
has gained considerable attention in recent years due to its popularity and
rich content. Nowadays, microblog applications span a wide spectrum of
interests, including detecting and analyzing events, user analysis for
geo-targeted ads and political elections, and critical applications like
discovering health issues and rescue services. Consequently, major research
efforts have been devoted to analyzing and managing microblog data to support different applications. This 1.5-hour tutorial gives an overview of microblog data analysis, management, and systems. The tutorial comprehensively reviews research efforts that analyze microblog content to build new functionality and use cases on top of it. In addition, the tutorial reviews existing research that proposes core data management components to support microblog queries at scale. Finally, the tutorial reviews system-level issues and ongoing work on supporting microblog data in emerging big data systems. Throughout its different parts, the tutorial highlights the challenges and
opportunities in microblog data research.
Microblogs Data Management and Analysis
Amr Magdy and Mohamed Mokbel.
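As a toy illustration of the kind of core component such systems build at vastly larger scale, the sketch below (an assumption, not taken from the tutorial) shows an in-memory inverted index over timestamped microblog posts that answers keyword queries restricted to a recent time window.

```python
# Illustrative sketch: a tiny temporal inverted index for microblog posts.
# Real microblog systems maintain such indexes over high-rate streams with
# eviction and distribution; this version only shows the core idea.
from collections import defaultdict

class MicroblogIndex:
    def __init__(self):
        self.postings = defaultdict(list)  # keyword -> [(timestamp, post_id)]

    def add(self, post_id, timestamp, text):
        for word in set(text.lower().split()):
            self.postings[word].append((timestamp, post_id))

    def search(self, keyword, since):
        """Ids of posts containing keyword with timestamp >= since."""
        return [pid for ts, pid in self.postings[keyword.lower()] if ts >= since]

idx = MicroblogIndex()
idx.add(1, 100, "Flooding reported downtown")
idx.add(2, 200, "Concert downtown tonight")
idx.add(3, 300, "Downtown roads reopened")
print(idx.search("downtown", since=150))  # [2, 3]
```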
In this tutorial, we present recent work in the database community on handling Big Spatial Data. This topic has attracted intense interest due to the recent explosion in the amount of spatial data generated by smartphones, satellites, and medical devices, among other sources. The tutorial goes beyond the use of existing systems as-is (e.g., Hadoop, Spark, or Impala), and digs deep into the core components of big data systems (e.g., indexing and query processing) to describe how they are designed to handle big spatial data. During this 90-minute tutorial, we review the state-of-the-art work in the area of Big Spatial Data, classifying existing research efforts according to implementation approach, underlying architecture, and system components. In addition, we provide case studies of full-fledged systems and applications that handle Big Spatial Data, which allows the audience to better
comprehend the whole tutorial.
The Era of Big Spatial Data
Ahmed Eldawy and Mohamed Mokbel.
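For readers unfamiliar with the indexing components such systems redesign for scale, here is a deliberately minimal sketch of a uniform-grid spatial index with a rectangular range query. The class name and cell size are illustrative assumptions; production systems use far richer structures (R-trees, quad-trees, distributed partitioning).

```python
# Illustrative uniform-grid spatial index: points are hashed into fixed-size
# cells, and a range query only visits the cells overlapping the query box.
from collections import defaultdict

class GridIndex:
    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)  # (cx, cy) -> [(x, y, payload)]

    def _cell(self, x, y):
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, x, y, payload):
        self.cells[self._cell(x, y)].append((x, y, payload))

    def range_query(self, x1, y1, x2, y2):
        """Payloads of points inside the axis-aligned rectangle."""
        cx1, cy1 = self._cell(x1, y1)
        cx2, cy2 = self._cell(x2, y2)
        hits = []
        for cx in range(cx1, cx2 + 1):
            for cy in range(cy1, cy2 + 1):
                for x, y, payload in self.cells[(cx, cy)]:
                    if x1 <= x <= x2 and y1 <= y <= y2:
                        hits.append(payload)
        return hits

g = GridIndex(cell_size=10)
g.insert(3, 4, "a")
g.insert(12, 14, "b")
g.insert(55, 60, "c")
print(sorted(g.range_query(0, 0, 20, 20)))  # ['a', 'b']
```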
Entity Resolution constitutes one of the cornerstone tasks for the integration of overlapping information sources. Due to its quadratic complexity, a large amount of research has focused on improving its efficiency so that it scales to Web Data collections, which are inherently voluminous and highly heterogeneous.
The most common approach for this purpose is blocking, which clusters similar
entities into blocks so that the pair-wise comparisons are restricted to the
entities contained within each block. In this tutorial, we take a close look
at blocking-based Entity Resolution, starting from the early blocking methods
that were crafted for database integration. We highlight the challenges posed by contemporary heterogeneous, noisy, voluminous Web Data and explain why they render these schema-based techniques inapplicable. We continue with the presentation of blocking methods that have been developed for large-scale and heterogeneous information and are suitable for Web Data collections. We also explain how their efficiency can be further improved by meta-blocking and parallelization techniques. We conclude with a hands-on session that demonstrates the relative performance of several state-of-the-art techniques. The participants of the tutorial will put into practice all the topics discussed in the theory part, and will get familiar with a reference toolbox, which implements the most prominent techniques in the area and can be readily used to tackle
Entity Resolution problems.
Blocking for Large-Scale Entity Resolution: Challenges, Algorithms, and Practical Examples
George Papadakis and Themis Palpanas.
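The core idea above can be sketched in a few lines. The following is a miniature, schema-agnostic token blocking example (illustrative only, not the tutorial's reference toolbox): each entity joins one block per token of its description, and candidate pairs are drawn only from entities that share a block, cutting down the quadratic comparison space.

```python
# Token blocking in miniature: blocks are keyed by tokens, and pair-wise
# comparisons are restricted to entities co-occurring in some block.
from collections import defaultdict
from itertools import combinations

def token_blocking(entities):
    """entities: {entity_id: description}. Returns token -> set of ids."""
    blocks = defaultdict(set)
    for eid, text in entities.items():
        for token in text.lower().split():
            blocks[token].add(eid)
    return blocks

def candidate_pairs(blocks):
    """Distinct entity pairs that co-occur in at least one block."""
    pairs = set()
    for ids in blocks.values():
        for a, b in combinations(sorted(ids), 2):
            pairs.add((a, b))
    return pairs

entities = {
    1: "apple iphone 13",
    2: "iphone 13 smartphone",
    3: "samsung galaxy s21",
}
pairs = candidate_pairs(token_blocking(entities))
print(sorted(pairs))  # [(1, 2)] -- 3 possible pairs reduced to 1
```

Meta-blocking, mentioned above, would go one step further and prune or weight these candidate pairs using block co-occurrence statistics.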
The key objective of this tutorial is to provide a broad yet in-depth survey of
the emerging field of co-designing software, hardware, and systems components for
accelerating enterprise data management workloads. The overall goal of this
tutorial is two-fold. First, we provide a concise system-level characterization of
different types of data management technologies, namely relational and NoSQL databases and data stream management systems, from the perspective of analytical workloads. Using this characterization, we discuss opportunities for accelerating key data management workloads using software and hardware approaches. Second, we dive deeper into the hardware acceleration opportunities offered by Graphics Processing Units (GPUs) and Field-Programmable Gate Arrays (FPGAs) for the query execution pipeline. Furthermore, we explore other hardware acceleration mechanisms, such as single-instruction multiple-data (SIMD) instructions that enable short-vector data parallelism.
Accelerating Database Workloads by Software-Hardware-System Co-design
Rajesh R. Bordawekar and Mohammad Sadoghi.
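As a loose illustration of the short-vector idea (an analogy in pure Python, not the tutorial's material): a columnar layout lets a query engine evaluate a predicate in one tight pass over contiguous values of a single column, which is exactly the access pattern SIMD lanes accelerate by processing several values per instruction.

```python
# Column-at-a-time predicate evaluation over contiguous arrays -- the data
# layout that SIMD-style short-vector execution exploits. Column names and
# values are made up for illustration.
from array import array

prices = array("d", [9.99, 25.0, 7.5, 60.0, 14.0])  # contiguous column
quantities = array("i", [3, 1, 10, 2, 5])

# One pass over the price column produces a selection mask (vectorizable),
# then revenue is aggregated only over the selected rows.
mask = [p < 20.0 for p in prices]
revenue = sum(p * q for p, q, m in zip(prices, quantities, mask) if m)
print(round(revenue, 2))  # 174.97
```

A real engine would fuse these steps and map the mask computation to SIMD intrinsics or GPU kernels rather than a Python loop.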
The unprecedented scale at which data is consumed and generated today has
created a large demand for scalable data management and given rise to non-relational,
distributed "NoSQL" database systems. Two central problems
triggered this process: 1) vast amounts of user-generated content in modern
applications and the resulting request loads and data volumes, and 2) the desire of
the developer community to employ problem-specific data models for storage and querying. To address these needs, various data stores have been developed by both industry and research, arguing that the era of one-size-fits-all database systems is over. The heterogeneity and sheer number of these systems - now commonly referred to as NoSQL data stores - make it increasingly difficult to select the most appropriate system for a given application. Therefore, these systems are frequently combined in polyglot persistence architectures to leverage each system in its respective sweet spot. This tutorial gives an in-depth survey of the most relevant NoSQL databases to provide a comparative classification and highlight open challenges. To this end, we analyze the approach of each system to derive its scalability, availability, consistency, data modeling, and querying characteristics. We present how each system's design is governed by a central set of trade-offs over irreconcilable system properties. We then cover recent research results in distributed data management to illustrate that some shortcomings of NoSQL systems can already be solved in practice, whereas other NoSQL data management problems pose interesting and unsolved research challenges.
Scalable Data Management: NoSQL Data Stores in Research and Practice
Felix Gessert and Norbert Ritter.
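One concrete mechanism behind the scalability characteristics discussed above is consistent hashing, the partitioning scheme many NoSQL stores use so that adding or removing a node remaps only a small fraction of the keys. The sketch below is a generic illustration, not any particular system's implementation; node names and the virtual-node count are assumptions.

```python
# Minimal consistent hash ring: each node owns many virtual positions on a
# hash ring, and a key is stored on the first node clockwise from its hash.
import bisect
import hashlib

def _hash(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, nodes, vnodes=100):
        # Virtual nodes smooth out load imbalance across physical nodes.
        self.ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self.keys = [h for h, _ in self.ring]

    def node_for(self, key):
        idx = bisect.bisect(self.keys, _hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node-a", "node-b", "node-c"])
print(ring.node_for("user:42"))  # one of node-a / node-b / node-c
```

The same lookup run on any replica returns the same owner, which is what makes the scheme usable without central coordination.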
A large part of modern life is lived indoors such as in homes, offices, shopping
malls, universities, libraries and airports. However, almost all of the existing
location-based services (LBS) have been designed only for outdoor space. This is
mainly because the global positioning system (GPS) and other positioning
technologies cannot accurately identify the locations in indoor venues. Some
recent initiatives have started to cross this technical barrier, promising huge
future opportunities for research organizations, government agencies, technology
giants, and enterprising start-ups---to exploit the potential of indoor LBS.
Consequently, indoor data management has gained significant research attention in
the past few years and the research interest is expected to surge in the upcoming
years. This will result in a broad range of indoor applications including emergency services, public services, in-store advertising, shopping, tracking, guided tours, and much more. In this tutorial, we first highlight the importance of indoor data management and the unique challenges that need to be addressed. Subsequently, we provide an overview of the existing research in indoor data management, covering modeling, cleansing, indexing, querying, and other relevant topics. Finally, we discuss future directions in this important and growing research area, covering spatial-textual search, integrating outdoor and indoor spaces, handling uncertain indoor data, as well as mining and analytics for indoor data.
Indoor Data Management
Hua Lu and Muhammad Aamir Cheema.
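A common modeling approach in this area (one of several the tutorial surveys) represents indoor space as a connectivity graph of rooms linked by doors, so that indoor "distance" becomes a shortest path over the graph rather than straight-line distance. The sketch below is a toy illustration with made-up room names.

```python
# Indoor space as a room-connectivity graph: nodes are rooms, edges are
# doors. BFS gives the fewest door crossings between two rooms.
from collections import deque

doors = {
    "lobby": ["corridor"],
    "corridor": ["lobby", "office-101", "office-102"],
    "office-101": ["corridor"],
    "office-102": ["corridor"],
}

def hops(graph, src, dst):
    """Fewest door crossings between two rooms (BFS), or None if unreachable."""
    frontier, seen = deque([(src, 0)]), {src}
    while frontier:
        room, d = frontier.popleft()
        if room == dst:
            return d
        for nxt in graph[room]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return None

print(hops(doors, "lobby", "office-102"))  # 2
```

Note how GPS-style Euclidean distance would be misleading here: two adjacent offices can be far apart when the only path between them runs through the corridor.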
The evolution of the Web from a technology platform to a social ecosystem has
resulted in unprecedented data volumes being continuously generated, exchanged,
and consumed. User-generated content on the Web is massive, highly dynamic, and
characterized by a combination of factual data and opinion data. False information,
rumors, and fake content can easily spread across multiple sources, making it hard to distinguish between what is true and what is not. Truth discovery (also known as fact-checking) has recently gained considerable interest in the Data Science communities. This tutorial will attempt to cover all the facets of the complex topic of truth-finding in Big Data. It will provide a broad overview with new insights, highlighting the progress that has been made in information extraction, data and knowledge fusion, as well as the modeling of misinformation dynamics in complex (social) networks. We will review in detail current models, algorithms, and techniques proposed by various research communities whose contributions converge towards the same goal of estimating the veracity of data at Web scale.
Scaling Up Truth Discovery: From Probabilistic Inference to Misinformation Dynamics
Laure Berti-Équille and Javier Borge-Holthoefer.
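To give a flavor of the probabilistic inference side of truth discovery, here is a toy fixpoint in the spirit of Sums/TruthFinder-style algorithms (a generic sketch, not a specific method from the tutorial): source trust and claim belief reinforce each other iteratively, so a claim backed by trusted sources gains belief, and sources backing believed claims gain trust. The data is fabricated for illustration.

```python
# Iterative truth discovery sketch: alternate between scoring claims by the
# trust of their sources and scoring sources by the belief of their claims,
# normalizing each round, until the scores stabilize.

claims = {  # (object, value) -> sources asserting it
    ("capital-AU", "Canberra"): ["s1", "s2"],
    ("capital-AU", "Sydney"): ["s3"],
}
sources = sorted({s for ss in claims.values() for s in ss})

trust = {s: 1.0 for s in sources}       # uniform prior trust
for _ in range(20):                     # fixed iteration budget
    belief = {c: sum(trust[s] for s in ss) for c, ss in claims.items()}
    norm = max(belief.values())
    belief = {c: b / norm for c, b in belief.items()}
    trust = {s: sum(b for c, b in belief.items() if s in claims[c])
             for s in sources}
    tnorm = max(trust.values())
    trust = {s: t / tnorm for s, t in trust.items()}

best = max(belief, key=belief.get)
print(best)  # ('capital-AU', 'Canberra')
```

Real truth-discovery models add source-copying detection, claim priors, and confidence calibration on top of this basic mutual-reinforcement loop.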