P331 - BTW2023- Datenbanksysteme für Business, Technologie und Web
Permanent URI for this collection: https://dl.gi.de/handle/20.500.12116/40312
6 results
Search Results
Conference Paper
MLProvCodeGen: A Tool for Provenance Data Input and Capture of Customizable Machine Learning Scripts
(Gesellschaft für Informatik e.V., 2023) Mustafa, Tarek Al; König-Ries, Birgitta; Samuel, Sheeba; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Over the last decade, machine learning (ML) has dramatically changed the application of and research in computer science. It becomes increasingly complicated to assure the transparency and reproducibility of advanced ML systems from raw data to deployment. In this paper, we describe an approach to supply users with an interface to specify a variety of parameters that together provide complete provenance information and automatically generate executable ML code from this information. We introduce MLProvCodeGen (Machine Learning Provenance Code Generator), a JupyterLab extension to generate custom code for ML experiments from user-defined metadata. ML workflows can be generated with different data settings, model parameters, methods, and training parameters, and their results can be reproduced in Jupyter Notebooks. We evaluated our approach with two ML applications, image and multiclass classification, and conducted a user evaluation.

Conference Paper
Towards a User-Empowering Architecture for Trustability Analytics
(Gesellschaft für Informatik e.V., 2023) Bruchhaus, Sebastian; Reis, Thoralf; Bornschlegl, Marco Xaver; Störl, Uta; Hemmje, Matthias; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Machine learning (ML) thrives on big data like huge data sets and streams from IoT devices. Those technologies are becoming increasingly commonplace in our day-to-day existence. Learning autonomous intelligent actors (AIAs) already impact our lives in the form of, e.g., chatbots, medical expert systems, and facial recognition systems. Doubts concerning ethical, legal, and social implications of such AIAs are consequently becoming increasingly compelling.
Our society now finds itself confronted with decisive questions: Should we trust AI? Is it fair, transparent, and respectful of privacy? An individual psychological threshold for cooperation with AIAs has been postulated. In Shaefer’s words: “No trust, no use”. On the other hand, ignorance of an AIA’s weak points and idiosyncrasies can lead to overreliance. This paper proposes a prototypical microservice architecture for trustability analytics. Its architecture shall introduce self-awareness concerning trustability into the AI2VIS4BigData reference model for big data analysis and visualization by borrowing the concept of a “looking-glass self” from psychology.

Conference Paper
Benchmarking the Second Generation of Intel SGX for Machine Learning Workloads
(Gesellschaft für Informatik e.V., 2023) Lutsch, Adrian; Singh, Gagandeep; Mundt, Martin; Mogk, Ragnar; Binnig, Carsten; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
For domains with high data privacy and protection demands, such as health care and finance, outsourcing machine learning tasks often requires additional security measures. Trusted Execution Environments like Intel SGX are a powerful tool to achieve this additional security. Until recently, Intel SGX incurred high performance costs, mainly because it was severely limited in terms of available memory and CPUs. With the second generation of SGX, Intel alleviates these problems. Therefore, we revisit previous use cases for ML secured by SGX and show initial results of a performance study for ML workloads on SGXv2.

Conference Paper
Tuning Cassandra through Machine Learning
(Gesellschaft für Informatik e.V., 2023) Eppinger, Florian; Störl, Uta; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
NoSQL databases have become an important component of many big data and real-time web applications. Their distributed nature and scalability make them an ideal data storage repository for a variety of use cases.
While NoSQL databases are delivered with a default “off-the-shelf” configuration, they offer configuration settings to adjust a database’s behavior and performance to a specific use case and environment. The abundance and often imperceptible interdependencies of configuration settings make it difficult to optimize and performance-tune a NoSQL system. There is no one-size-fits-all configuration, and therefore the workload, the physical design, and the available resources need to be taken into account when optimizing the configuration of a NoSQL database. This work explores Machine Learning as a means to automatically tune a NoSQL database for optimal performance. Using Random Forest and Gradient Boosting Decision Tree Machine Learning algorithms, multiple Machine Learning models were fitted with a training dataset that incorporates properties of the NoSQL physical configuration (replication and sharding). The best models were then employed as surrogate models to optimize the Database Management System’s configuration settings for throughput and latency using various Black-box Optimization algorithms. Using an Apache Cassandra database, multiple experiments were carried out to demonstrate the feasibility of this approach, even across varying physical configurations. The tuned Database Management System (DBMS) configurations yielded throughput improvements of up to 4%, read latency reductions of up to 43%, and write latency reductions of up to 39% when compared to the default configuration settings.

Conference Paper
Better Safe than Sorry: Visualizing, Predicting, and Successfully Guiding Courses of Study
(Gesellschaft für Informatik e.V., 2023) Kerth, Alexander; Schuhknecht, Felix; Pensel, Lukas; Henneberg, Justus; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
Successfully going through a course of study is a lengthy and challenging task.
To obtain a degree, many obstacles must be overcome and the right decisions must be made at the right point in time, often overwhelming students. To reduce the number of dropouts, the goal of study advisors is to reach out to endangered students in time and to provide them help and guidance. To support the work of study advisors, who typically have to monitor a large number of students simultaneously, we present in this demonstration an easy-to-use graphical tool that (a) allows the advisor to visualize all relevant information of study data in a responsive graph in order to get an overview of the current study situation. In addition to visualization, our tool provides (b) a forecasting functionality based on pre-trained models and (c) a warning feature to identify endangered students early on. In the on-site demonstration, the audience will be able to step into the role of a study advisor and use our tool and all of its features to identify and guide struggling students within anonymized real-world study data.

Conference Paper
Approach to Synthetic Data Generation for Imbalanced Multi-class Problems with Heterogeneous Groups
(Gesellschaft für Informatik e.V., 2023) Treder-Tschechlov, Dennis; Reimann, Peter; Schwarz, Holger; Mitschang, Bernhard; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfried
To benchmark novel classification algorithms, these algorithms should be evaluated on data with characteristics that also appear in real-world use cases. Important data characteristics that often lead to challenges for classification approaches are multi-class imbalance and heterogeneous groups. Real-world data that comprise these characteristics are usually not publicly available, e.g., because they constitute sensitive patient information or due to privacy concerns. Further, the manifestations of the characteristics cannot be controlled specifically on real-world data.
A more rigorous approach is to synthetically generate data such that different manifestations of the characteristics can be controlled. However, existing data generators are not able to generate data that feature both data characteristics, i.e., multi-class imbalance and heterogeneous groups. In this paper, we propose an approach that fills this gap, as it allows data that exhibit both characteristics to be generated synthetically. In particular, we make use of a taxonomy model that organizes real-world entities in domain-specific heterogeneous groups to generate data reflecting the characteristics of these groups. In addition, we incorporate probability distributions to reflect the imbalances of multiple classes and groups from real-world use cases. Our approach is applicable in different domains, as taxonomies are the simplest form of knowledge models and thus are available in many domains. The evaluation shows that our approach can generate data that feature the data characteristics multi-class imbalance and heterogeneous groups and that it allows different manifestations of these characteristics to be controlled.
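
As a rough illustration of the idea in the last abstract (not the authors' generator), the following minimal Python sketch draws samples from a two-level taxonomy in which both the groups and the classes inside each group follow skewed probability distributions. The taxonomy, the weights, the feature centres, and the names `TAXONOMY`, `GROUP_WEIGHTS`, `CLASS_WEIGHTS`, and `generate` are all hypothetical and chosen only for demonstration.

```python
import random

# Hypothetical two-level taxonomy: each group owns its own set of classes.
TAXONOMY = {
    "group_a": ["a1", "a2", "a3"],
    "group_b": ["b1", "b2"],
}

# Assumed imbalance settings: how often each group occurs, and how often
# each class occurs inside its group (same order as in TAXONOMY).
GROUP_WEIGHTS = {"group_a": 0.8, "group_b": 0.2}
CLASS_WEIGHTS = {
    "group_a": [0.7, 0.2, 0.1],
    "group_b": [0.9, 0.1],
}


def generate(n, seed=0):
    """Return n synthetic rows with a group, a class label, and two features."""
    rng = random.Random(seed)
    groups = list(TAXONOMY)

    # Classes of one group are centred close to each other, while the
    # groups themselves are far apart, which makes the groups heterogeneous.
    centres = {}
    for gi, group in enumerate(groups):
        for ci, label in enumerate(TAXONOMY[group]):
            centres[label] = (10.0 * gi + ci, -10.0 * gi + ci)

    rows = []
    for _ in range(n):
        # First pick a group, then a class within that group.
        group = rng.choices(groups, weights=[GROUP_WEIGHTS[g] for g in groups])[0]
        label = rng.choices(TAXONOMY[group], weights=CLASS_WEIGHTS[group])[0]
        cx, cy = centres[label]
        rows.append({
            "group": group,
            "label": label,
            "x": rng.gauss(cx, 1.0),
            "y": rng.gauss(cy, 1.0),
        })
    return rows
```

Changing the entries of `GROUP_WEIGHTS` and `CLASS_WEIGHTS` changes how strongly the groups and classes are imbalanced, which is the kind of control over data characteristics that the abstract describes; the paper's actual taxonomy model and distributions may differ substantially from this sketch.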