P331 - BTW2023- Datenbanksysteme für Business, Technologie und Web
Permanent URI for this collectionhttps://dl.gi.de/handle/20.500.12116/40312
Authors with most Documents
Browse
3 results
Search Results
Conference Paper WannaDB: Ad-hoc SQL Queries over Text Collections(Gesellschaft für Informatik e.V., 2023) Hättasch, Benjamin; Bodensohn, Jan-Micha; Vogel, Liane; Urban, Matthias; Binnig, Carsten; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, Gottfriedn this paper, we propose a new system called WannaDB that allows users to interactively perform structured explorations of text collections in an ad-hoc manner. Extracting structured data from text is a classical problem where a plenitude of approaches and even industry-scale systems already exists. However, these approaches lack in the ability to support the ad-hoc exploration of texts using structured queries. The main idea of WannaDB is to include user interaction to support ad-hoc SQL queries over text collections using a new two-phased approach. First, a superset of information nuggets from the texts is extracted using existing extractors such as named entity recognizers. Then, the extractions are interactively matched to a structured table definition as requested by the user based on embeddings. In our evaluation, we show that WannaDB is thus able to extract structured data from a broad range of (real-world) text collections in high quality without the need to design extraction pipelines upfront.Conference Paper SportsTables: A new Corpus for Semantic Type Detection(Gesellschaft für Informatik e.V., 2023) Langenecker, Sven; Sturm, Christoph; Schalles, Christian; Binnig, Carsten; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, GottfriedTable corpora such as VizNet or TURL which contain annotated semantic types per column are important to build machine learning models for the task of automatic semantic type detection. However, there is a huge discrepancy between corpora that are used for training and testing since real-world data lakes contain a huge fraction of numerical data which are not present in existing corpora. Hence, in this paper, we introduce a new corpus that contains a much higher proportion of numerical columns than existing corpora. To reflect the distribution in real-world data lakes, our corpus SportsTables has on average approx. 86% numerical columns, posing new challenges to existing semantic type detection models which have mainly targeted non-numerical columns so far. To demonstrate this effect, we show the results of a first study using a state-of-the-art approach for semantic type detection on our new corpus and demonstrate significant performance differences in predicting semantic types for textual and numerical data.Conference Paper Benchmarking the Second Generation of Intel SGX for Machine Learning Workloads(Gesellschaft für Informatik e.V., 2023) Lutsch, Adrian; Singh, Gagandeep; Mundt, Martin; Mogk, Ragnar; Binnig, Carsten; König-Ries, Birgitta; Scherzinger, Stefanie; Lehner, Wolfgang; Vossen, GottfriedFor domains with high data privacy and protection demands, such as health care and finance, outsourcing machine learning tasks often requires additional security measures. Trusted Execution Environments like Intel SGX are a powerful tool to achieve this additional security. Until recently, Intel SGX incurred high performance costs, mainly because it was severely limited in terms of available memory and CPUs. With the second generation of SGX, Intel alleviates these problems. Therefore, we revisit previous use cases for ML secured by SGX and show initial results of a performance study for ML workloads on SGXv2.
Load citations