P331 - BTW2023 - Datenbanksysteme für Business, Technologie und Web
Permanent URI for this collection: https://dl.gi.de/handle/20.500.12116/40312
Conference Paper
Accelerating Large Table Scan using Processing-In-Memory Technology
(Gesellschaft für Informatik e.V., 2023) Baumstark, Alexander; Jibril, Muhammad Attahir; Sattler, Kai-Uwe; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Today’s systems are capable of storing large amounts of data in main memory. In-memory DBMSs can benefit particularly from this development. However, the processing of the data from the main memory necessarily has to run via the CPU. This creates a bottleneck, which affects the possible performance of the DBMS. The Processing-In-Memory (PIM) technology is a paradigm to overcome this problem, which was not available in commercial systems for a long time. However, with the availability of UPMEM, a commercial system is finally available that provides PIM technology in hardware. In this work, the main focus was on the optimization of the table scan, a fundamental and memory-bound operation. Here a possible approach is shown, which can be used to optimize this operation by using PIM. This method was then tested for parallelism and execution time in benchmarks with different table sizes and compared to the usual table scan. The result is a table scan that outperforms the scan on the usual CPU significantly.

Conference Paper
Adaptive Architectures for Robust Data Management Systems
(Gesellschaft für Informatik e.V., 2023) Bang, Tiemo; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Form follows function is a well-known expression by the architect Sullivan asserting that the architecture of a building should follow its function. 'Adaptive Architectures for Robust Data Management Systems' is a dissertation asserting that DBMS architectures should follow changing workload and hardware to robustly achieve high DBMS performance. The dissertation first evaluates how workload and hardware affect the performance of DBMSs with static architectures.
This evaluation concludes that static DBMS architectures degrade DBMS performance under changing workload and hardware, and hence the DBMS architecture has to become adaptive. Subsequently, adaptation concepts for the architecture of single-server and multi-server DBMSs are proposed. These concepts focus on fine-grained adaptation of DBMS architectures and are realized through asynchronous programming models. These programming models decouple the implementation of DBMS components from fine-grained architectural optimization. Thereby, optimizers can derive novel architectures better fitting individual DBMS components, leading to high and robust DBMS performance under changing conditions.

Conference Paper
Approach to Synthetic Data Generation for Imbalanced Multi-class Problems with Heterogeneous Groups
(Gesellschaft für Informatik e.V., 2023) Treder-Tschechlov, Dennis; Reimann, Peter; Schwarz, Holger; Mitschang, Bernhard; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
To benchmark novel classification algorithms, these algorithms should be evaluated on data with characteristics that also appear in real-world use cases. Important data characteristics that often lead to challenges for classification approaches are multi-class imbalance and heterogeneous groups. Real-world data that comprise these characteristics are usually not publicly available, e.g., because they constitute sensitive patient information or due to privacy concerns. Further, the manifestations of the characteristics cannot be controlled specifically on real-world data. A more rigorous approach is to synthetically generate data such that different manifestations of the characteristics can be controlled. However, existing data generators are not able to generate data that feature both data characteristics, i.e., multi-class imbalance and heterogeneous groups.
In this paper, we propose an approach that fills this gap, as it allows synthetically generating data that exhibit both characteristics. In particular, we make use of a taxonomy model that organizes real-world entities in domain-specific heterogeneous groups to generate data reflecting the characteristics of these groups. In addition, we incorporate probability distributions to reflect the imbalances of multiple classes and groups from real-world use cases. Our approach is applicable in different domains, as taxonomies are the simplest form of knowledge models and thus are available in many domains. The evaluation shows that our approach can generate data that feature the data characteristics multi-class imbalance and heterogeneous groups and that it allows controlling different manifestations of these characteristics.

Conference Paper
Automated Statement Extraction from Press Briefings
(Gesellschaft für Informatik e.V., 2023) Keller, Jüri; Bittkowski, Meik; Schaer, Philipp; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Scientific press briefings are a valuable information source. They consist of alternating expert speeches, questions from the audience and their answers. Therefore, they can contribute to scientific and fact-based media coverage. Even though press briefings are highly informative, extracting statements relevant to individual journalistic tasks is challenging and time-consuming. To support this task, an automated statement extraction system is proposed. Claims are used as the main feature to identify statements in press briefing transcripts. The statement extraction task is formulated as a four-step procedure. First, the press briefings are split into sentences and passages, then claim sentences are identified with a single-label multi-class sequence classification.
Subsequently, topics are detected, and the sentences are filtered to improve the coherence and assess the length of the statements. The results indicate that claim detection can be used to identify statements in press briefings. While many statements can be extracted automatically with this system, they are not always as coherent as needed to be understood without context and may need further review by knowledgeable persons.

Conference Paper
Benchmarking the Second Generation of Intel SGX for Machine Learning Workloads
(Gesellschaft für Informatik e.V., 2023) Lutsch, Adrian; Singh, Gagandeep; Mundt, Martin; Mogk, Ragnar; Binnig, Carsten; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
For domains with high data privacy and protection demands, such as health care and finance, outsourcing machine learning tasks often requires additional security measures. Trusted Execution Environments like Intel SGX are a powerful tool to achieve this additional security. Until recently, Intel SGX incurred high performance costs, mainly because it was severely limited in terms of available memory and CPUs. With the second generation of SGX, Intel alleviates these problems. Therefore, we revisit previous use cases for ML secured by SGX and show initial results of a performance study for ML workloads on SGXv2.

Conference Paper
Better Safe than Sorry: Visualizing, Predicting, and Successfully Guiding Courses of Study
(Gesellschaft für Informatik e.V., 2023) Kerth, Alexander; Schuhknecht, Felix; Pensel, Lukas; Henneberg, Justus; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Successfully going through a course of study is a lengthy and challenging task. To obtain a degree, many obstacles must be overcome and the right decisions must be made at the right point in time, often overwhelming students.
To reduce the number of dropouts, the goal of study advisors is to reach out to endangered students in time and to provide them with help and guidance. To support the work of study advisors, who typically have to monitor a large number of students simultaneously, we present in this demonstration an easy-to-use graphical tool that (a) allows the advisor to visualize all relevant information of study data in a responsive graph in order to get an overview of the current study situation. In addition to visualization, our tool provides (b) a forecasting functionality based on pre-trained models and (c) a warning feature to identify endangered students early on. In the on-site demonstration, the audience will be able to step into the role of a study advisor and use our tool and all of its features to identify and guide struggling students within anonymized real-world study data.

Conference Paper
BTW 2023 - Complete proceedings
(Gesellschaft für Informatik e.V., 2023) Köhnen, Christoph; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried

Conference Paper
CLOCQ: A Toolkit for Fast and Easy Access to Knowledge Bases
(Gesellschaft für Informatik e.V., 2023) Christmann, Philipp; Roy, Rishiraj Saha; Weikum, Gerhard; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Curated knowledge bases (KBs) store vast amounts of factual world knowledge, and are therefore ubiquitous in many information retrieval (IR) and natural language processing (NLP) applications like question answering, named entity disambiguation, or knowledge exploration. Despite that, accessing information from complete knowledge bases is often a daunting task. Researchers and practitioners typically have crisp use cases in mind, for which standard querying interfaces can be overly complex and inefficient. We aim to bridge this gap, and release a public toolkit that provides functionalities for common KB access use cases, and make it available via a public API.
Experiments show efficiency improvements over existing KB interfaces for various important functionalities.

Conference Paper
Communication-Optimal Parallel Reservoir Sampling
(Gesellschaft für Informatik e.V., 2023) Winter, Christian; Sichert, Moritz; Birler, Altan; Neumann, Thomas; Kemper, Alfons; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
When evaluating complex analytical queries on high-velocity data streams, many systems cannot run those queries on all elements of a stream. Sampling is a widely used method to reduce the system load by replacing the input with a representative yet manageable subset. For unbounded data, reservoir sampling generates a fixed-size uniform sample independent of the input cardinality. However, the collection of reservoir samples itself can already be a bottleneck for high-velocity data. In this paper, we introduce a technique that allows fully parallelizing reservoir sampling for many-core architectures. Our approach relies on the efficient combination of thread-local samples taken over chunks of the input without necessitating communication during the sampling phase and with minimal communication when merging. We show how our efficient merge guarantees uniform random samples while allowing data to be distributed over worker threads arbitrarily.
Our analysis of this approach within the Umbra database system demonstrates linear scaling along the available threads and the ability to sustain high-velocity workloads.

Conference Paper
A Core Ontology to Support Agricultural Data Interoperability
(Gesellschaft für Informatik e.V., 2023) Abdelmageed, Aly; Hatem, Shahenda; ael, Tasneem; Medhat, Walaa; König-Ries, Birgitta; Ellakwa, Susan; Elkafrawy, Passent; Algergawy, Alsayed; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
The amount and variety of raw data generated in the agriculture sector from numerous sources, including soil sensors and local weather stations, are proliferating. However, these raw data in themselves are meaningless and isolated and, therefore, may offer little value to the farmer. Data usefulness is determined by its context and meaning and by how it is interoperable with data from other sources. Semantic web technology can provide context and meaning to data and its aggregation by providing standard data interchange formats and description languages. In this paper, we introduce the design and overall description of a core ontology that facilitates the process of data interoperability in the agricultural domain.

Conference Paper
Data Extraction for Associative Classification using Mined Rules in Pediatric Intensive Care Data
(Gesellschaft für Informatik e.V., 2023) Das, Pronaya Prosun; Mast, Marcel; Wiese, Lena; Jack, Thomas; Wulf, Antje; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Based on the characteristics of health and medical informatics, data mining techniques that were designed to tackle healthcare problems are faced with new challenges. One such challenge is to prepare medical data for pattern mining or machine learning.
In this paper, we present a feature engineering technique for the Associative Classification of the Systemic Inflammatory Response Syndrome (SIRS) in severely ailing children by mining Associative Rules. SIRS is characterized as the body's excessive defense response due to malevolent stressors such as trauma, acute inflammation, infection, malignancy, and surgery. It can have an impact on the clinical outcome and elevate vulnerability for organ dysfunctions. We aim to extract the features from given datasets using a specific extraction process and, after the transformation, those features are used to mine rules using Association Rule Mining. Those rules are used to perform Associative Classification and are evaluated against the result generated by the SIRS criteria defined by experienced clinicians. The mined rules provide better control over sensitivity and specificity than the SIRS criteria.

Conference Paper
Detection of Generated Text Reviews by Leveraging Methods from Authorship Attribution: Predictive Performance vs. Resourcefulness
(Gesellschaft für Informatik e.V., 2023) Moosleitner, Manfred; Specht, Günther; Zangerle, Eva; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Textual reviews are an integral part of online shopping and a source of information for potential customers. However, a prerequisite is that the reviews are authentic. To this end, pre-trained large language models have been shown to generate convincing text reviews at scale. Therefore, a critical task is the automatic detection of reviews not composed by a human, in a generated review classification task. State-of-the-art approaches to detect generated texts use pre-trained large language models, which exhibit hefty hardware requirements to run and fine-tune the model. Related work has shown that texts generated by language models often show differences in writing style and choice of words compared to texts written by humans.
These two properties, which are unique per author, should be usable to identify whether a text is generated by these algorithms. In this paper, we investigate the performance of features prominently used in authorship attribution tasks, using robust classifiers with substantially lower computational resources required. We show that features and methods from authorship attribution can be successfully applied for the task of detecting generated text reviews, leveraging the consistent writing style exhibited by large language models like GPT2. We argue that our approach achieves performance similar to state-of-the-art approaches while providing shorter training times and lower hardware requirements, necessary for, e.g., detection on the fly.

Conference Paper
Developing OERs for Teaching Database Systems
(Gesellschaft für Informatik e.V., 2023) Rakow, Thomas C.; Kless, André; Hasler, Charlotte; Knolle, Harm; Faeskorn-Woyke, Heide; Saatz, Inga Marina; Lambert, Jens; Focken, Mareike; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
In the project EILD.nrw, Open Educational Resources (OER) have been developed for teaching databases. Lecturers can use the tools and courses in a variety of learning scenarios. Students of computer science and application subjects can learn the complete life cycle of databases. For this purpose, quizzes, interactive tools, instructional videos, and courses for learning management systems are developed and published under a Creative Commons license. We give an overview of the developed OERs according to subject, description, teaching form, and format. We then describe how licensing, sustainability, accessibility, contextualization, content description, and technical adaptability are implemented.
The feedback of students in ongoing classes is evaluated.

Conference Paper
Discovering Multi-Dimensional Subsequence Queries from Traces -- From Theory to Practice
(Gesellschaft für Informatik e.V., 2023) Kleest-Meißner, Sarah; Sattler, Rebecca; Schmid, Markus L.; Schweikardt, Nicole; Weidlich, Matthias; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Subsequence-queries with wildcards and gap-size constraints (swg-queries, for short) are an expressive model for sequence data, in which queries are described by patterns over an alphabet of variables and types, along with a global window size and a number of gap-size constraints. They are evaluated over a trace, i.e., a sequence of types, by replacing variables by single types, while satisfying the window and the gap-size constraints. Kleest-Meißner et al. (Proc. ICDT 2022) formalised the task of discovering an swg-query that best describes a given sample consisting of a finite number of traces, and developed a discovery algorithm solving this task. However, in practical application scenarios, traces are often multi-dimensional, i.e., a trace corresponds to a sequence of tuples of types, which renders the existing technique inapplicable. In this paper, we lift the notion of swg-queries to such a multi-dimensional setting, thereby enlarging the applicability of the query model and the techniques for query discovery. We introduce a mapping between one-dimensional and multi-dimensional sequence data, such that a multi-dimensional trace matches a multi-dimensional query if and only if the corresponding one-dimensional trace matches the corresponding one-dimensional query. We complement our formal results with a description of our prototypical implementation of query discovery for multi-dimensional sequence data.
Results from evaluation experiments with real-world data indicate the feasibility of our approach.

Conference Paper
DNAContainer: An object-based storage architecture on DNA
(Gesellschaft für Informatik e.V., 2023) El-Shaikh, Alex; Seeger, Bernhard; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
The digital data volumes produced worldwide per year are ever-increasing. Estimates show that by 2025, we will have reached 175 zettabytes of globally created digital data. Despite today's advancements in storage devices, current database management systems cannot cope with these amounts of data. More than recent improvements in storage technologies are needed to meet the ever-accelerating growth of generated data. This problem is further exacerbated when considering that current storage technologies such as HDD and tape require replacement every few years. To combat this deficiency, deoxyribonucleic acid (DNA) offers a novel durable (millennia scale), extremely dense, and energy-efficient storage medium. However, current DNA systems lack support for random access and more expressive query support beyond key-value lookups. In this paper, we present DNAContainer, a novel storage architecture on DNA that spans an ample virtual address space on objects, enabling random access to DNA at a large scale while adhering to required biochemical constraints.
The interface of DNAContainer also facilitates the implementation of common external data structures such as arrays, lists, and trees that store data in blocks of fixed size.

Conference Paper
DPQL: The Data Profiling Query Language
(Gesellschaft für Informatik e.V., 2023) Seeger, Marcian; Schmidl, Sebastian; Vielhauer, Alexander; Papenbrock, Thorsten; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Data profiling describes the activity of extracting implicit metadata, such as schema descriptions, data types, and various kinds of data dependencies, from a given data set. The considerable number of research papers about novel metadata types and ever-faster data profiling algorithms emphasizes the importance of data profiling in practice. Unfortunately, though, the current state of data profiling research fails to address practical application needs: Typical data profiling algorithms (i.e., challenging to operate structures) discover all (i.e., too many) minimal (i.e., the wrong) data dependencies within minutes to hours (i.e., too long). Consequently, if we look at the practical success of our research, we find that data profiling targets data cleaning, but most cleaning systems still use only hand-picked dependencies; data profiling targets query optimization, but hardly any query optimizer uses modern discovery algorithms for dependency extraction; data profiling targets data integration, but the application of automatically discovered dependencies for matching purposes is yet to be shown, and the list goes on. We aim to solve the profiling-and-application-disconnect with a novel data profiling engine that integrates modern profiling techniques for various types of data dependencies and provides the applications with a versatile, intuitive, and declarative Data Profiling Query Language (DPQL).
The DPQL enables applications to specify precisely what dependencies are needed, which not only refines the results and makes the data profiling process more accessible but also enables much faster and (in terms of dependency types and selections) holistic profiling runs. We expect that integrating modern data profiling techniques and the post-processing of their results under a single application endpoint will result in a series of significant algorithmic advances, new pruning concepts, and a profiling engine with innovative components for workload auto-configuration, query optimization, and parallelization. With this paper, we present the first version of the DPQL syntax and introduce a fundamentally new line of research in data profiling.

Conference Paper
Duplicate Table Discovery with Xash
(Gesellschaft für Informatik e.V., 2023) Koch, Maximilian; Esmailoghli, Mahdi; Auer, Sören; Abedjan, Ziawasch; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Data lakes are typically lightly curated and as such prone to data quality problems and inconsistencies. In particular, duplicate tables are common in most repositories. The goal of duplicate table detection is to identify those tables that display the same data. Comparing tables is generally quite expensive as the order of rows and columns might differ for otherwise identical tables. In this paper, we explore the application of Xash, a hash function previously proposed for the discovery of multi-column join candidates, for the use case of duplicate table detection. With Xash, it is possible to generate a so-called super key, which serves like a Bloom filter and instantly identifies the existence of particular cell values. We show that using Xash it is possible to speed up the duplicate table detection process significantly.
In comparison to other hash functions, such as SimHash and other competitors, Xash results in fewer false positive candidates.

Conference Paper
The Easiest Way of Turning your Relational Database into a Blockchain --- and the Cost of Doing So
(Gesellschaft für Informatik e.V., 2023) Schuhknecht, Felix; Jörz, Simon; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Blockchain systems essentially consist of two levels: The network level has the responsibility of distributing an ordered stream of transactions to all nodes of the network in exactly the same way, even in the presence of a certain amount of malicious parties (byzantine fault tolerance). On the node level, each node then receives this ordered stream of transactions and executes it within some sort of transaction processing system, typically to alter some kind of state. This clear separation into two levels as well as drastically different application requirements have led to the materialization of the network level in the form of so-called blockchain frameworks. While providing all the blockchain features" Blockchain, Relational Databases, Distributed Query Processing, Tendermint"

Conference Paper
Efficient handling of recursive relationships in ORM frameworks using Entity Framework Core as an example
(Gesellschaft für Informatik e.V., 2023) Killisch, Benjamin Uwe; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
ORM frameworks are a popular method to bridge the differences between object-oriented programming and relational data management. At the same time, recursive relationships are present in many schemas to represent tree-like or net-like structures. This paper discusses how to efficiently build and execute queries for data with recursive relationships in ORM frameworks. Five possible solutions are conceived and then implemented in Entity Framework Core (EF Core), while making sure that they can be used like regular LINQ queries.
Next, the solutions are tested with different SQL dialects. The results of these tests are then analyzed across a variety of test parameters. This analysis shows that queries with recursive common table expressions and queries using key loading are the most efficient. Queries with an auxiliary property, vertical unrolling or horizontal unrolling are either too slow or only usable under particular circumstances. The analysis also shows that the performance of the solutions is always dependent on the circumstances, especially the SQL dialect.

Conference Paper
Enabling Integrated Data Analysis Pipelines on Heterogeneous Hardware through Holistic Extensibility
(Gesellschaft für Informatik e.V., 2023) Damme, Patrick; Boehm, Matthias; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
This submission is an extended abstract.