P331 - BTW 2023 - Datenbanksysteme für Business, Technologie und Web
Permanent URI for this collection: https://dl.gi.de/handle/20.500.12116/40312
Conference Paper
Using SQL/MED to Query Heterogeneous Data Sources with Alexa Voice Commands (Gesellschaft für Informatik e.V., 2023)
Schildgen, Johannes; Heinz, Florian; Olijnyk, Andreas; Lindenau, Arvid; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Typical Alexa skills and other add-ons for voice assistants have to be custom-developed for their one specific use case. This paper presents an approach that maps arbitrary data sources (databases, APIs, services) to the relational model using SQL/MED and transforms voice-based queries into SQL. The key challenges for such a universal skill are to correctly map the natural-language question to a SQL query on the correct source table in the federated database and to convert the result set back into a compact, easily understandable answer.

Conference Paper
MLProvCodeGen: A Tool for Provenance Data Input and Capture of Customizable Machine Learning Scripts (Gesellschaft für Informatik e.V., 2023)
Mustafa, Tarek Al; König-Ries, Birgitta; Samuel, Sheeba; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Over the last decade, machine learning (ML) has dramatically changed the application of and research in computer science. It has become increasingly complicated to assure the transparency and reproducibility of advanced ML systems, from raw data to deployment. In this paper, we describe an approach that supplies users with an interface to specify a variety of parameters that together provide complete provenance information, and that automatically generates executable ML code from this information. We introduce MLProvCodeGen (Machine Learning Provenance Code Generator), a JupyterLab extension that generates custom code for ML experiments from user-defined metadata. ML workflows can be generated with different data settings, model parameters, methods, and training parameters, and results can be reproduced in Jupyter Notebooks.
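The core idea behind MLProvCodeGen, rendering user-defined experiment metadata into executable code, can be loosely sketched as follows; the metadata fields and the template are invented for illustration and are not the tool's actual schema:

```python
# Minimal sketch of metadata-driven code generation: a user-supplied
# metadata dict is rendered into executable Python source. All field
# names here are hypothetical, not MLProvCodeGen's real interface.
from string import Template

SCRIPT_TEMPLATE = Template(
    "# auto-generated from experiment metadata\n"
    "RANDOM_SEED = $seed\n"
    "TEST_SPLIT = $test_split\n"
    "MODEL = '$model'\n"
)

def generate_script(metadata: dict) -> str:
    """Render experiment metadata into Python source code."""
    return SCRIPT_TEMPLATE.substitute(
        seed=metadata["seed"],
        test_split=metadata["test_split"],
        model=metadata["model"],
    )

script = generate_script({"seed": 42, "test_split": 0.2, "model": "resnet18"})
```

Because the metadata fully determines the generated script, the same dict always reproduces the same experiment code, which is the provenance property the tool is after.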
We evaluated our approach with two ML applications, image and multiclass classification, and conducted a user evaluation.

Conference Paper
Reliable Rules for Relation Extraction in a Multimodal Setting (Gesellschaft für Informatik e.V., 2023)
Engelmann, Björn; Schaer, Philipp; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
We present an approach to extracting relations from multimodal documents using only a small amount of training data. Furthermore, we derive explanations in the form of extraction rules from the underlying model to ensure the reliability of the extraction. Finally, we evaluate how reliable (high model fidelity) the extracted rules are and which type of classifier is suitable in terms of F1 score and explainability. Our code and data are available at https://osf.io/dn9hm/?view_only=7e65fd1d4aae44e1802bb5ddd3465e08.

Conference Paper
JPTest - Grading Data Science Exercises in Jupyter Made Short, Fast and Scalable (Gesellschaft für Informatik e.V., 2023)
Tröbs, Eric; Hagedorn, Stefan; Sattler, Kai-Uwe; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Jupyter Notebook is not only a popular tool for publishing data science results, but can also be used for the interactive explanation of teaching content as well as for supervised work on exercises. In order to give students feedback on their solutions, it is necessary to check and evaluate the submitted work. To exploit the possibilities of remote learning and to reduce the work needed to evaluate submissions, we present a flexible and efficient framework. It enables automated checking of notebooks for completeness and syntactic correctness as well as fine-grained evaluation of submitted tasks.
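The two automated checks named above, completeness and syntactic correctness of a notebook, can be illustrated with a minimal sketch that parses the notebook's JSON; the tag-based completeness criterion is an invented stand-in, not JPTest's actual API:

```python
# Sketch: check a Jupyter notebook (JSON) for completeness (expected
# task tags present) and syntactic correctness (code cells compile).
# The tag convention is hypothetical; JPTest's real interface differs.
import json

def check_notebook(nb_json: str, required_tags: set) -> dict:
    nb = json.loads(nb_json)
    found_tags = set()
    syntax_ok = True
    for cell in nb.get("cells", []):
        found_tags.update(cell.get("metadata", {}).get("tags", []))
        if cell.get("cell_type") == "code":
            try:
                compile("".join(cell.get("source", [])), "<cell>", "exec")
            except SyntaxError:
                syntax_ok = False
    return {"complete": required_tags <= found_tags, "syntax_ok": syntax_ok}

demo = json.dumps({"cells": [
    {"cell_type": "code", "metadata": {"tags": ["task1"]}, "source": ["x = 1\n"]},
]})
report = check_notebook(demo, {"task1"})
```

Running each submission through such a checker in an isolated subprocess per notebook is one natural way to obtain the parallelization and isolation the abstract mentions.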
The framework comes with a high level of parallelization, isolation, and a short and efficient API.

Conference Paper
Tuning Cassandra through Machine Learning (Gesellschaft für Informatik e.V., 2023)
Eppinger, Florian; Störl, Uta; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
NoSQL databases have become an important component of many big data and real-time web applications. Their distributed nature and scalability make them an ideal data storage repository for a variety of use cases. While NoSQL databases are delivered with a default "off-the-shelf" configuration, they offer configuration settings to adjust a database's behavior and performance to a specific use case and environment. The abundance and often imperceptible interdependencies of configuration settings make it difficult to optimize and performance-tune a NoSQL system. There is no one-size-fits-all configuration, and therefore the workload, the physical design, and the available resources need to be taken into account when optimizing the configuration of a NoSQL database. This work explores machine learning as a means to automatically tune a NoSQL database for optimal performance. Using the Random Forest and Gradient Boosting Decision Tree machine learning algorithms, multiple models were fitted with a training dataset that incorporates properties of the NoSQL physical configuration (replication and sharding). The best models were then employed as surrogate models to optimize the database management system's configuration settings for throughput and latency using various black-box optimization algorithms. Using an Apache Cassandra database, multiple experiments were carried out to demonstrate the feasibility of this approach, even across varying physical configurations.
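The surrogate-plus-black-box-optimization loop described above can be sketched in a few lines: a cheap model predicts throughput for a candidate configuration, and an optimizer searches the configuration space against that prediction instead of benchmarking the real database. The cost function, parameter names, and search strategy below are invented placeholders, not the paper's fitted models:

```python
# Toy surrogate-based tuning: a stand-in prediction function plays the
# role of a fitted Random Forest / Gradient Boosting model, and plain
# random search plays the role of the black-box optimizer.
import random

def surrogate_throughput(cfg: dict) -> float:
    # Hand-written stand-in for a trained regression model (made up).
    return (1000
            - abs(cfg["concurrent_writes"] - 48) * 5
            - abs(cfg["memtable_mb"] - 2048) / 10)

def random_search(n_trials: int, seed: int = 0) -> dict:
    """Search the config space for the best predicted throughput."""
    rng = random.Random(seed)
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_trials):
        cfg = {
            "concurrent_writes": rng.randrange(8, 129),
            "memtable_mb": rng.randrange(256, 4097),
        }
        score = surrogate_throughput(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    return best_cfg

best = random_search(500)
```

The appeal of the surrogate is that each candidate evaluation is a model prediction in microseconds rather than a full Cassandra benchmark run, so the optimizer can afford many trials.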
The tuned database management system (DBMS) configurations yielded throughput improvements of up to 4%, read latency reductions of up to 43%, and write latency reductions of up to 39% compared to the default configuration settings.

Conference Paper
Improving GPU Matrix Multiplication by Leveraging Bit Level Granularity and Compression (Gesellschaft für Informatik e.V., 2023)
Fett, Johannes; Schwarz, Christian; Kober, Urs; Habich, Dirk; Lehner, Wolfgang; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
In this paper, we introduce BEAM, a novel approach to performing GPU-based matrix multiplication on compressed elements. BEAM allows flexible handling of bit sizes for both input and output elements. First evaluations show promising speedups compared to an uncompressed state-of-the-art matrix multiplication algorithm provided by NVIDIA.

Conference Paper
SportsTables: A new Corpus for Semantic Type Detection (Gesellschaft für Informatik e.V., 2023)
Langenecker, Sven; Sturm, Christoph; Schalles, Christian; Binnig, Carsten; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Table corpora such as VizNet or TURL, which contain annotated semantic types per column, are important for building machine learning models for the task of automatic semantic type detection. However, there is a huge discrepancy between the corpora used for training and testing, since real-world data lakes contain a large fraction of numerical data that is not present in existing corpora. Hence, in this paper, we introduce a new corpus that contains a much higher proportion of numerical columns than existing corpora. To reflect the distribution in real-world data lakes, our corpus SportsTables has on average approx. 86% numerical columns, posing new challenges to existing semantic type detection models, which have mainly targeted non-numerical columns so far.
To demonstrate this effect, we show the results of a first study using a state-of-the-art approach for semantic type detection on our new corpus and demonstrate significant performance differences in predicting semantic types for textual and numerical data.

Conference Paper
Workload Prediction for IoT Data Management Systems (Gesellschaft für Informatik e.V., 2023)
Burrell, David; Chatziliadis, Xenofon; Zacharatou, Eleni Tzirita; Zeuch, Steffen; Markl, Volker; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
The Internet of Things (IoT) is an emerging technology that allows numerous devices, potentially spread over a large geographical area, to collect and collectively process data from high-speed data streams. To that end, specialized IoT data management systems (IoTDMSs) have emerged. One challenge in those systems is the collection of different metrics from devices in a central location for analysis. This analysis allows IoTDMSs to maintain an overview of the workload on different devices and to optimize their processing.
However, as an IoT network comprises many heterogeneous devices with low computation resources and limited bandwidth, collecting and sending workload metrics can cause increased latency in data processing tasks across the network. In this ongoing work, we present an approach to avoid unnecessary transmission of workload metrics by predicting CPU, memory, and network usage using machine learning (ML). Specifically, we demonstrate the performance of two ML models, linear regression and a Long Short-Term Memory (LSTM) neural network, and show the features that we explored to train these models. This work is part of ongoing research to develop a monitoring tool for our new IoTDMS named NebulaStream.

Conference Paper
Seamless Integration of Parquet Files into Data Processing (Gesellschaft für Informatik e.V., 2023)
Rey, Alice; Freitag, Michael; Neumann, Thomas; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Relational database systems are still the most powerful tool for data analysis. However, the steps necessary to bring existing data into the database make them unattractive for data exploration, especially when the data is stored in data lakes where users often use Parquet files, a binary column-oriented file format. This paper presents a fast Parquet framework that tackles these problems without costly ETL steps. We incrementally collect information during query execution. We create statistics that enhance future queries. In addition, we split the file into chunks for which we store the data ranges. We call these synopses.
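The synopsis idea, storing the value range of each chunk so that future queries can skip chunks whose range cannot match the predicate, can be sketched in a few lines of plain Python (function names are illustrative only):

```python
# Sketch of chunk synopses: record (min, max) per chunk of a column,
# then use those ranges to skip chunks that cannot satisfy a range
# predicate. Names are invented for illustration.
def build_synopses(values, chunk_size):
    """Record the (min, max) value range of each chunk of a column."""
    return [
        (min(chunk), max(chunk))
        for chunk in (values[i:i + chunk_size]
                      for i in range(0, len(values), chunk_size))
    ]

def chunks_to_scan(synopses, lo, hi):
    """Indices of chunks whose range overlaps the predicate [lo, hi]."""
    return [i for i, (cmin, cmax) in enumerate(synopses)
            if cmax >= lo and cmin <= hi]

col = [1, 3, 5, 20, 22, 25, 90, 95, 99]
syn = build_synopses(col, 3)        # [(1, 5), (20, 25), (90, 99)]
scan = chunks_to_scan(syn, 21, 24)  # only the middle chunk qualifies
```

Because the ranges are collected incrementally during the first scan, the first query pays only the small cost of recording min/max per chunk, while later queries with selective predicates can skip whole chunks.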
They allow us to skip entire sections in future queries. We show that these techniques only add minor overhead to the first query and benefit future requests. Our evaluation demonstrates that our implementation can achieve results comparable to database relations and that we can outperform existing systems by up to an order of magnitude.

Conference Paper
RAPP: A Responsible Academic Performance Prediction Tool for Decision-Making in Educational Institutes (Gesellschaft für Informatik e.V., 2023)
Duong, Manh Khoi; Dunkelau, Jannik; Cordova, José Andrés; Conrad, Stefan; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Due to the increasing importance of educational data mining for the early intervention of at-risk students and the growth of performance data collected in educational institutes, it becomes natural to employ machine learning models to predict students' performance based on prior data. Although machine learning pipelines are often similar, developing one for a specific target prediction of academic success can become a daunting task. In this work, we present a graphical user interface that implements a customisable machine learning pipeline allowing the training and evaluation of machine learning models for different definitions of academic success, e.g., collected credits, average grade, number of passed exams, etc. The evaluation is exported in PDF format after training finishes.
As this tool serves as a decision support system for socially responsible AI systems, fairness notions were included in the evaluation to detect potential discrimination in the data and prediction space.

Conference Paper
Explainable Data Matching: Selecting Representative Pairs with Active Learning Pair-Selection Strategies (Gesellschaft für Informatik e.V., 2023)
Laskowski, Lukas; Sold, Florian; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
In both research and enterprise, dirty data poses numerous challenges. Many data cleaning pipelines include a data deduplication step that detects and removes entries within a given dataset that refer to the same real-world entity. Throughout the development of such deduplication techniques, data scientists have to make sense of the large result sets that their matching solutions generate in order to quickly identify changes in behavior or to discover opportunities for improvement. We propose an approach that selects a small subset of pairs from the result set of a data matching solution which is representative of the matching solution's overall behavior. To evaluate our approach, we show that a matching solution trained on pairs selected according to our strategy outperforms one trained on a randomly selected subset of pairs.

Conference Paper
A Provenance Management Framework for Knowledge Graph Generation in a Web Portal (Gesellschaft für Informatik e.V., 2023)
Kleinsteuber, Erik; Babalou, Samira; König-Ries, Birgitta; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Knowledge Graphs (KGs) are the semantic backbone for a wide variety of applications in different domains. In recent years, different web portals providing relevant functionalities for managing KGs have been proposed. An important functionality of such portals is provenance data management for the KG generation process. Capturing, storing, and accessing provenance data efficiently are complex problems.
Solutions to these problems vary widely depending on many factors, such as the computational environment, computational methods, desired provenance granularity, and much more. In this paper, we present one possible solution: a new framework to capture coarse-grained workflow provenance of KGs during their creation in a web portal. We capture the necessary information about the KG generation process, and we store and retrieve the provenance data using standard functionality of relational databases. Our captured workflow can be rerun over the same or different input source data. With this, the framework can support four different applications of provenance data: (i) reproduce the KG, (ii) create a new KG with an existing workflow, (iii) undo the executed tools and adapt the provenance data accordingly, and (iv) retrieve the provenance data of a KG.

Conference Paper
GTPC: Towards a Hybrid OLTP-OLAP Graph Benchmark (Gesellschaft für Informatik e.V., 2023)
Jibril, Muhammad Attahir; Baumstark, Alexander; Sattler, Kai-Uwe; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Graph databases are gaining increasing relevance not only for pure analytics but also for full transactional support. Business requirements are evolving to demand analytical insights on fresh transactional data, thereby triggering the emergence of graph systems for hybrid transactional-analytical graph processing (HTAP). In this paper, we present our ongoing work on GTPC, a hybrid graph benchmark targeting such systems, based on the TPC-C and TPC-H benchmarks.

Conference Paper
The Evolution of LeanStore (Gesellschaft für Informatik e.V., 2023)
Alhomssi, Adnan; Haubenschild, Michael; Leis, Viktor; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
LeanStore is a high-performance storage engine optimized for many-core processors and NVMe SSDs. This paper provides the first full system overview of all LeanStore components, several of which have not yet been described.
We also discuss crucial implementation details and the evolution of the overall system towards a design that is both simple and efficient.

Conference Paper
Predictive Maintenance for the Optical Synchronization System of the European XFEL: A Systematic Literature Survey (Gesellschaft für Informatik e.V., 2023)
Grünhagen, Arne; Tropmann-Frick, Marina; Eichler, Annika; Fey, Görschwin; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
The optical synchronization system of the European X-ray Free Electron Laser is a networked cyber-physical system producing a large amount of data. To maximize the availability of the optical synchronization system, we are developing a predictive maintenance module that can evaluate and predict the condition of the system. In this paper, we report on state-of-the-art predictive maintenance methods by systematically reviewing publications in this field. Guided by three research questions addressing the type of cyber-physical systems, feature extraction methods, and data-analytical approaches to evaluate the current health status or to predict future system behavior, we identified 144 high-quality publications contributing to research in this area. Our finding is that neural networks in particular are used for many predictive maintenance tasks. This review serves as a starting point for a detailed and systematic evaluation of the different methods applied to the optical synchronization system.

Conference Paper
Meduse: Interactive and Visual Exploration of Ionospheric Data (Gesellschaft für Informatik e.V., 2023)
Reibert, Joshua; Osterthun, Arne; Paradies, Marcus; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Spatio-temporal models of ionospheric data are important for atmospheric research and the evaluation of their impact on satellite communications.
However, researchers lack tools to visually and interactively analyze these rapidly growing multi-dimensional datasets, which cannot be loaded into main memory in their entirety. Existing tools for large-scale multi-dimensional scientific data visualization and exploration rely on slow, file-based data management support and simplistic client-server interaction that fetches all data to the client side for rendering. In this paper, we present our data management and interactive data exploration and visualization system MEDUSE. We demonstrate the initial implementation of the interactive data exploration and visualization component, which enables domain scientists to visualize and interactively explore multi-dimensional ionospheric data. Use-case-specific visualizations additionally allow the analysis of such data along satellite trajectories to accommodate domain-specific analyses of the impact on data collected by satellites, such as those for global navigation satellite systems and earth observation.

Conference Paper
Evolution of Degree Metrics in Large Temporal Graphs (Gesellschaft für Informatik e.V., 2023)
Rost, Christopher; Gomez, Kevin; Christen, Peter; Rahm, Erhard; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Graph metrics, such as the simple but popular vertex degree and others based on it, are well defined for static graphs. However, adapting static metrics to temporal graphs is still part of current research. In this paper, we propose a set of temporal extensions of four degree-dependent metrics, as well as aggregations like the minimum, maximum, and average degree of (i) a vertex over a time interval and (ii) a graph at a specific point in time. We show why using the static degree can lead to wrong assumptions about the relevance of a vertex in a temporal graph and highlight the need to include time as a dimension in the metric.
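The distinction between a static degree and a time-dependent one can be made concrete with a small sketch: if each edge carries a validity interval, a vertex's degree becomes a function of time rather than a single number. The interval-edge representation below is a common convention assumed for illustration, not the paper's data model:

```python
# Illustrative time-dependent vertex degree: an edge (u, v, t_start,
# t_end) contributes to the degree of u and v at every time point t
# with t_start <= t < t_end.
def degree_at(edges, vertex, t):
    """Degree of `vertex` at time point t."""
    return sum(1 for u, v, ts, te in edges
               if vertex in (u, v) and ts <= t < te)

edges = [
    ("a", "b", 0, 10),
    ("a", "c", 5, 15),
    ("b", "c", 12, 20),
]
# Degree evolution of vertex "a" at three time points:
deg_a = [degree_at(edges, "a", t) for t in (0, 7, 12)]
```

Here the static degree of "a" (two incident edges) never holds at a single instant outside [5, 10): its time-dependent degree is 1, then 2, then 1, which is exactly why a single static number can misrepresent a vertex's relevance.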
We propose a baseline algorithm to calculate the degree evolution of all vertices in a temporal graph and show its implementation in a distributed in-memory dataflow system. Using real-world and synthetic datasets containing up to 462 million vertices and 1.7 billion edges, we show the scalability of our algorithm on a distributed cluster, achieving a speedup of around 12 on 16 machines.

Conference Paper
Approach to Synthetic Data Generation for Imbalanced Multi-class Problems with Heterogeneous Groups (Gesellschaft für Informatik e.V., 2023)
Treder-Tschechlov, Dennis; Reimann, Peter; Schwarz, Holger; Mitschang, Bernhard; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
To benchmark novel classification algorithms, these algorithms should be evaluated on data with characteristics that also appear in real-world use cases. Important data characteristics that often pose challenges for classification approaches are multi-class imbalance and heterogeneous groups. Real-world data comprising these characteristics are usually not publicly available, e.g., because they constitute sensitive patient information or due to privacy concerns. Further, the manifestations of the characteristics cannot be specifically controlled on real-world data. A more rigorous approach is to generate data synthetically such that different manifestations of the characteristics can be controlled. However, existing data generators are not able to generate data that feature both of these characteristics, i.e., multi-class imbalance and heterogeneous groups. In this paper, we propose an approach that fills this gap, as it allows us to synthetically generate data that exhibit both characteristics. In particular, we make use of a taxonomy model that organizes real-world entities in domain-specific heterogeneous groups to generate data reflecting the characteristics of these groups.
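The controllable-imbalance aspect can be illustrated with a minimal sketch: drawing class labels and taxonomy-group memberships from explicit, user-chosen probability distributions makes the degree of imbalance a tunable input. The specific classes, groups, and weights below are invented for illustration:

```python
# Sketch of controllable class and group imbalance: labels are sampled
# from explicit weight distributions, so changing the weights changes
# the manifestation of the imbalance. All values here are invented.
import random

def generate_labels(n, class_weights, group_weights, seed=0):
    """Sample n (class, group) label pairs from the given distributions."""
    rng = random.Random(seed)
    classes = rng.choices(list(class_weights),
                          weights=list(class_weights.values()), k=n)
    groups = rng.choices(list(group_weights),
                         weights=list(group_weights.values()), k=n)
    return list(zip(classes, groups))

data = generate_labels(
    1000,
    class_weights={"A": 0.7, "B": 0.2, "C": 0.1},    # multi-class imbalance
    group_weights={"g1": 0.5, "g2": 0.3, "g3": 0.2},  # heterogeneous groups
)
counts = {c: sum(1 for cls, _ in data if cls == c) for c in ("A", "B", "C")}
```

A full generator would additionally condition the feature distributions on the group (e.g., different means per taxonomy group), which is where the taxonomy model described above comes in.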
In addition, we incorporate probability distributions to reflect the imbalances of multiple classes and groups from real-world use cases. Our approach is applicable in different domains, as taxonomies are the simplest form of knowledge models and are thus available in many domains. The evaluation shows that our approach can generate data featuring the data characteristics multi-class imbalance and heterogeneous groups, and that it allows different manifestations of these characteristics to be controlled.

Conference Paper
Semantic Watermarks for Detecting Cheating in Online Database Exams (Gesellschaft für Informatik e.V., 2023)
Brass, Stefan; Hinneburg, Alexander; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Due to the COVID-19 pandemic, we were forced to conduct the exam for a database course as an online exam. An essential part of the exam was to write non-trivial SQL queries for given tasks. In order to ensure that cheating carried a certain risk, we used several techniques to detect cases of plagiarism. One technique was to use a kind of "watermark" in variants of the exercises that are randomly assigned to the students. Each variant is marked by small variations that need to be included in submitted solutions. Those markers might go through undetected when a student decides to copy a solution from someone else. In this case, the student would reveal knowledge of a "secret" that they could not know without forbidden communication with another student. This can be used as proof of plagiarism instead of just a subjective feeling about the likelihood of similar solutions arising without communication. We also used a log of the SQL queries that were tried during the exam. Finally, we evaluated similarity-based techniques for SQL plagiarism detection.

Conference Paper
Inter-Query Parallelism on Heterogeneous Multi-Core CPUs (Gesellschaft für Informatik e.V., 2023)
Schuhknecht, Felix Martin; Islam, Tamjidul; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang;
Vossen, Gottfried
Traditional multi-core CPU architectures integrate a set of homogeneous cores, where all cores are of exactly the same type. Last year, with the release of Intel's 12th-generation Core consumer processors, this setup changed drastically: apart from so-called performance cores, which provide a high clock frequency, hyper-threading, and large caches, the architecture also integrates so-called efficient cores, which are less performant but more energy-efficient. Obviously, such a performance-heterogeneous architecture complicates task-to-resource scheduling and should be actively taken into account by the application that schedules the tasks. In this experience report, we discuss our first steps with this new architecture in the context of parallel query processing. We focus on inter-query parallelism, where whole transactions/queries are the unit of scheduling, and investigate which type of core best fits which type of workload. To do so, we first perform a set of micro-benchmarks on the cores to analyze their different performance characteristics. Based on that, we propose two scheduling strategies that actively schedule tasks to different core types, depending on their characteristics. Our initial findings suggest that awareness of heterogeneous CPU architectures must indeed be actively incorporated into the task scheduler within a DBMS to efficiently utilize this new type of hardware.
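The basic shape of such a core-type-aware scheduler can be sketched as follows: each incoming query is routed to the performance-core or efficiency-core pool based on an estimated cost. The threshold, core IDs, and cost model are placeholders for illustration, not the two strategies proposed in the paper:

```python
# Toy core-type-aware inter-query scheduler: heavy queries go to the
# performance-core pool, light ones to the efficiency-core pool, with
# round-robin assignment inside each pool. All numbers are invented.
from collections import deque

P_CORES = deque(range(0, 8))    # hypothetical performance-core IDs
E_CORES = deque(range(8, 16))   # hypothetical efficiency-core IDs

def assign_core(estimated_cost_ms: float, threshold_ms: float = 10.0) -> tuple:
    """Pick a (pool name, core id) for a query based on estimated cost."""
    pool_name, pool = (("P", P_CORES) if estimated_cost_ms >= threshold_ms
                       else ("E", E_CORES))
    core = pool[0]
    pool.rotate(-1)  # round-robin within the chosen pool
    return pool_name, core

a = assign_core(250.0)  # long analytical query -> performance core
b = assign_core(0.5)    # short point lookup    -> efficiency core
```

In a real DBMS the assignment would additionally pin the worker thread to the chosen core (e.g., via CPU affinity) and the cost estimate would come from the optimizer, but the routing decision itself has exactly this structure.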