KDIR 2023 Abstracts


Full Papers
Paper Nr: 41
Title:

Closeness Centrality Detection in Homogeneous Multilayer Networks

Authors:

Hamza R. Pavel, Anamitra Roy, Abhishek Santra and Sharma Chakravarthy

Abstract: Centrality measures for simple graphs are well-defined and several main-memory algorithms exist for each. Simple graphs have been shown to be inadequate for modeling complex data sets with multiple types of entities and relationships. Although multilayer networks (or MLNs) have been shown to be better suited, there are very few algorithms for centrality measure computation directly on MLNs. Typically, they are converted (aggregated or projected) to simple graphs using Boolean AND or OR operators to compute various centrality measures, which is not only inefficient but incurs a loss of structure and semantics. In this paper, algorithms are proposed that compute closeness centrality on an MLN directly using a novel decoupling-based approach. Individual results of layers (or simple graphs) of an MLN are used and a composition function is developed to compute the closeness centrality of nodes for the MLN. The challenge is to do this efficiently while preserving the accuracy of results with respect to the ground truth. However, since these algorithms use only layer information and do not have complete information of the MLN, computing a global measure such as closeness centrality is a challenge. Hence, these algorithms rely on heuristics derived from intuition. The advantage is that this approach lends itself to parallelism and is more efficient than the traditional approach. Two heuristics, termed CC1 and CC2, have been presented for composition and their accuracy and efficiency have been empirically validated on a large number of synthetic and real-world-like graphs with diverse characteristics. CC1 is prone to generate false negatives whereas CC2 reduces them, is more efficient, and improves accuracy.
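
A minimal sketch of the decoupling idea, assuming networkx and a naive averaging composition (the helper names are illustrative; the paper's CC1 and CC2 heuristics are not reproduced here):

```python
# Hypothetical sketch: closeness centrality per layer, then a simple composition.
import networkx as nx

def layer_closeness(layers):
    """Compute closeness centrality independently on each layer."""
    return [nx.closeness_centrality(g) for g in layers]

def compose_top_k(per_layer, k=5):
    """Naive composition: average a node's closeness over the layers it appears in
    and return the k highest-ranked nodes."""
    combined = {}
    for scores in per_layer:
        for node, c in scores.items():
            combined.setdefault(node, []).append(c)
    averaged = {n: sum(v) / len(v) for n, v in combined.items()}
    return sorted(averaged, key=averaged.get, reverse=True)[:k]

# Two layers over the same node set, i.e. a homogeneous two-layer MLN.
g1 = nx.erdos_renyi_graph(50, 0.10, seed=1)
g2 = nx.erdos_renyi_graph(50, 0.15, seed=2)
print(compose_top_k(layer_closeness([g1, g2]), k=5))
```

Because each layer's centralities are computed independently, the per-layer step parallelises trivially; only the composition step needs the combined results.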

Paper Nr: 42
Title:

Subject Classification of Software Repository

Authors:

Abdelhalim H. Dahou and Brigitte Mathiak

Abstract: Software categorization involves organizing software into groups based on their behavior or domain. Traditionally, categorization has been crucial for software maintenance, aiding programmers in locating programs, identifying features, and finding similar ones within extensive code repositories. Manual categorization is expensive, tedious, and labor-intensive, leading to the growing importance of automatic categorization approaches. However, existing datasets primarily focus on technical categorization for the most common programming languages, leaving a gap in other areas. This paper addresses the research problem of classifying software repositories that contain R code. The objective is to develop a classification model capable of accurately and efficiently categorizing these repositories into predefined classes with less data. The contribution of this research is twofold. Firstly, we propose a model that enables the categorization of software repositories focusing on R programming, even with a limited amount of training data. Secondly, we conduct a comprehensive empirical evaluation to assess the impact of repository features and data augmentation on automatic repository categorization. This research endeavors to advance the field of software categorization and facilitate better utilization of software repositories in the context of research across diverse domains.

Paper Nr: 53
Title:

UNCOVER: Identifying AI Generated News Articles by Linguistic Analysis and Visualization

Authors:

Lucas Liebe, Jannis Baum, Tilman Schütze, Tim Cech, Willy Scheibel and Jürgen Döllner

Abstract: Text synthesis tools are becoming increasingly popular and better at mimicking human language. In trust-sensitive decisions, such as plagiarism and fraud detection, identifying AI-generated texts poses particular difficulties: decisions need to be explainable to ensure trust and accountability. To support users in identifying AI-generated texts, we propose the tool UNCOVER. The tool analyses texts through three explainable linguistic approaches: stylometric writing style analysis, topic modeling, and entity recognition. The result of the tool is a prediction and a visualization of the analysis. We evaluate the tool on news articles by means of prediction accuracy and an expert study with 13 participants. The final prediction is based on the classification of the stylometric and evolving topic analyses. It achieved an accuracy of 70.4% and a weighted F1-score of 85.6%. The participants preferred to base their assessment on the prediction and the topic graph. In contrast, they found the entity recognition to be an ineffective indicator. Moreover, five participants highlighted the explainable aspects of UNCOVER, and overall the participants achieved 69% accuracy. Eight participants expressed interest in continuing to use UNCOVER for identifying AI-generated texts.

Paper Nr: 58
Title:

Barycentre Averaging for the Move-Split-Merge Time Series Distance Measure

Authors:

Christopher Holder, David Guijo-Rubio and Anthony Bagnall

Abstract: Distance functions play a core role in many time series machine learning algorithms for tasks such as clustering, classification and regression. Time series often require bespoke distance functions because small offsets in time can lead to large distances between series that are conceptually similar. Elastic distances compensate for misalignment by creating a path through a cost matrix by warping and/or editing time series. Time series are most commonly clustered with partitional algorithms such as k-means and k-medoids using elastic distance measures such as Dynamic Time Warping (DTW). The distance is used to assign cases to the closest cluster representative. k-means requires the averaging of time series to find these representative centroids. If DTW is used to assign membership, but the arithmetic mean is used to find centroids, k-means performance degrades significantly. An averaging technique specific to DTW, called DTW Barycentre Averaging (DBA), overcomes the averaging problem; however, it can only be used with DTW. As such, alternative distance functions such as Move-Split-Merge (MSM) are forced to use the arithmetic mean to compute new centroids and suffer degraded performance similar to that of k-means-DTW without DBA. To address this, we propose an averaging method for the MSM distance, MSM Barycentre Averaging (MBA), and show that when used to find centroids it significantly improves MSM-based k-means and is better than commonly used alternatives.
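
For reference, a sketch of the k-means loop with an elastic distance and barycentre averaging, using tslearn's DTW and DBA as stand-ins (the proposed MBA for MSM would replace these two calls; the function below is illustrative only):

```python
import numpy as np
from tslearn.barycenters import dtw_barycenter_averaging
from tslearn.metrics import dtw

def elastic_kmeans(series, k=2, iters=5, seed=0):
    rng = np.random.default_rng(seed)
    centroids = [series[i] for i in rng.choice(len(series), size=k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest centroid under the elastic distance.
        labels = [int(np.argmin([dtw(s, c) for c in centroids])) for s in series]
        # Update step: barycentre averaging instead of the arithmetic mean.
        new_centroids = []
        for j in range(k):
            members = [s for s, lab in zip(series, labels) if lab == j]
            new_centroids.append(dtw_barycenter_averaging(members) if members else centroids[j])
        centroids = new_centroids
    return labels, centroids

X = np.random.default_rng(1).standard_normal((10, 60))
labels, _ = elastic_kmeans(list(X), k=2)
print(labels)
```

The key point mirrored here is that the averaging routine must match the distance used for assignment; swapping in the arithmetic mean for the update step is what degrades elastic k-means.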

Paper Nr: 73
Title:

Visual Counterfactual Explanations Using Semantic Part Locations

Authors:

Florence Böttger, Tim Cech, Willy Scheibel and Jürgen Döllner

Abstract: As machine learning models are becoming more widespread and see use in high-stake decisions, the explainability of these decisions is getting more relevant. One approach to explainability is counterfactual explanations, which are defined as changes to a data point such that it appears as a different class. Their close connection to the original dataset aids their explainability. However, existing methods of creating counterfactual explanations often rely on other machine learning models, which adds an additional layer of opacity to the explanations. We propose additions to an established pipeline for creating visual counterfactual explanations by using an inherently explainable algorithm that does not rely on external models. Using annotated semantic part locations, we replace parts of the counterfactual creation process. We evaluate the approach on the CUB-200-2011 dataset. Our approach outperforms the previous results: we improve (1) the average number of edits by 0.1 edits, (2) the keypoint accuracy of editing within any semantic parts of the image by an average of at least 7 percentage points, and (3) the keypoint accuracy of editing the same semantic parts by at least 17 percentage points.

Paper Nr: 92
Title:

Enhancing Explainable Matrix Factorization with Tags for Multi-Style Explanations

Authors:

Olurotimi Seton, Pegah S. Haghighi, Mohammed Alshammari and Olfa Nasraoui

Abstract: Black-box AI models tend to be more accurate but less transparent and scrutable than white-box models. This poses a limitation for recommender systems that rely on black-box models, such as Matrix Factorization (MF). Explainable Matrix Factorization (EMF) models are “explainable” extensions of Matrix Factorization, a state-of-the-art technique widely used due to its flexibility in learning from sparse data and its accuracy. EMF can incorporate explanations derived, by design, from user or item neighborhood graphs, among others, into the model training process, thereby making their recommendations explainable. So far, an EMF model can learn to produce only one explanation style, and this in turn limits the number of recommendations with computable explanation scores. In this paper, we propose a framework for EMFs with multiple styles of explanation, based on ratings and tags, by incorporating EMF algorithms that use scores derived from tag-centric graphs to connect rating neighborhood-based EMF techniques to tag-based explanations. We used precalculated explainability scores that have been previously validated in user studies that evaluated user satisfaction with each style individually. Our evaluation experiments show that our proposed methods provide accurate recommendations while providing multiple explanation styles, without sacrificing the accuracy of the recommendations.

Paper Nr: 103
Title:

Impact of Thresholds of Univariate Filters for Predicting Species Distribution

Authors:

Yousra Cherif, Ali Idri and Omar El Alaoui

Abstract: Researchers rely on species distribution models (SDMs) to establish a correlation between species occurrence records and environmental data. These models offer insights into the ecological and evolutionary aspects of the subject. Feature selection (FS) aims to choose useful interlinked features or remove those that are unnecessary and redundant, reduce model costs and storage needs, and make the induced model easier to understand. Therefore, to predict the distribution of three bird species, this study compares five filter-based univariate feature selection methods to select relevant features for classification tasks using five thresholds, as well as four classifiers: Support Vector Machine (SVM), Light Gradient-Boosting Machine (LGBM), Decision Tree (DT), and Random Forest (RF). The empirical evaluations involve several techniques, such as the 5-fold cross-validation method, the Scott Knott (SK) test, and Borda Count. In addition, we used three performance criteria (accuracy, Kappa and F1-score). Experiments showed that the 40% and 50% thresholds were the best choice for the classifiers, with RF outperforming LGBM, DT and SVM. Finally, the best combination for each classifier is as follows: the RF and LGBM classifiers using mutual information with a 40% threshold, DT using ReliefF with a 50% threshold, and SVM using the ANOVA F-value with a 40% threshold.
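
A minimal sketch of one configuration from this comparison, assuming scikit-learn and synthetic data in place of the species-occurrence records (a mutual-information filter at the 40% threshold feeding a Random Forest):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectPercentile, mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Placeholder data standing in for occurrence records with environmental features.
X, y = make_classification(n_samples=500, n_features=30, n_informative=8, random_state=0)

pipe = make_pipeline(
    SelectPercentile(mutual_info_classif, percentile=40),   # univariate filter, 40% threshold
    RandomForestClassifier(n_estimators=200, random_state=0),
)
scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")  # 5-fold CV as in the study
print(scores.mean())
```

Swapping the score function (e.g. to an ANOVA F-value) and the percentile reproduces the other filter/threshold combinations compared in the paper.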

Paper Nr: 131
Title:

Evaluation of Information Retrieval Models and Query Performance Predictors for Amharic Adhoc Task

Authors:

Tilahun Yeshambel, Josiane Mothe and Yaregal Assabie

Abstract: Query performance prediction (QPP) is the task of evaluating the quality of the retrieval results of a query within the context of a retrieval model. Although several research activities have been carried out on QPP for many languages, query performance predictors have not yet been studied for the Amharic adhoc information retrieval (IR) task. In this paper, we present the effect of various IR models on Amharic queries, and analyze the features computed for QPP methods from both Indri and Terrier indexes based on the Amharic Adhoc Information Retrieval Test Collection (2AIRTC). We conducted various experiments to assess the quality of Amharic queries and the performance of IR models on 2AIRTC, which is a TREC-like test collection. The correlation degree between predictors is used to measure the dependence between various query performance predictors, or between a predictor and a retrieval score. Our findings show that the Jelinek-Mercer model outperformed the BM25 and Dirichlet models. The findings also indicate that the correlation matrices between the query-IDF predictors and the evaluation measures show very low Pearson correlation coefficient values.

Paper Nr: 134
Title:

Evaluating the Use of Interpretable Quantized Convolutional Neural Networks for Resource-Constrained Deployment

Authors:

Harry Rogers, Beatriz De La Iglesia and Tahmina Zebin

Abstract: The deployment of Neural Networks on resource-constrained devices for object classification and detection has led to the adoption of network compression methods, such as Quantization. However, the interpretation and comparison of Quantized Neural Networks with their Non-Quantized counterparts remains inadequately explored. To bridge this gap, we propose a novel Quantization Aware eXplainable Artificial Intelligence (XAI) pipeline to effectively compare Quantized and Non-Quantized Convolutional Neural Networks (CNNs). Our pipeline leverages Class Activation Maps (CAMs) to identify differences in activation patterns between Quantized and Non-Quantized models. Through the application of Root Mean Squared Error, a subset from the top 5% scoring Quantized and Non-Quantized CAMs is generated, highlighting regions of dissimilarity for further analysis. We conduct a comprehensive comparison of activations from both Quantized and Non-Quantized CNNs, using Entropy, Standard Deviation, Sparsity metrics, and activation histograms. The ImageNet dataset is utilized for network evaluation, with CAM effectiveness assessed through Deletion, Insertion, and Weakly Supervised Object Localization (WSOL). Our findings demonstrate that Quantized CNNs exhibit higher performance in WSOL and show promising potential for real-time deployment on resource-constrained devices.
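
One way to read the RMSE-based selection step, sketched with random arrays standing in for the CAMs produced by the quantized and non-quantized networks (the 5% cutoff interpretation is an assumption):

```python
import numpy as np

def cam_rmse(cam_a, cam_b):
    """Root mean squared error between two class activation maps."""
    return float(np.sqrt(np.mean((cam_a - cam_b) ** 2)))

rng = np.random.default_rng(0)
cams_fp32 = rng.random((100, 7, 7))   # CAMs from the non-quantized CNN (placeholder)
cams_int8 = rng.random((100, 7, 7))   # CAMs from the quantized CNN (placeholder)

errors = np.array([cam_rmse(a, b) for a, b in zip(cams_fp32, cams_int8)])
top = np.argsort(errors)[-int(0.05 * len(errors)):]   # keep the 5% most dissimilar pairs
print(top, errors[top])
```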

Paper Nr: 136
Title:

Content Significance Distribution of Sub-Text Blocks in Articles and Its Application to Article-Organization Assessment

Authors:

You Zhou and Jie Wang

Abstract: We explore how to capture the significance of a sub-text block in an article and how it may be used for text mining tasks. A sub-text block is a sub-sequence of sentences in the article. We formulate the notion of content significance distribution (CSD) of sub-text blocks, referred to as CSD of the first kind and denoted by CSD-1. In particular, we leverage Hugging Face’s SentenceTransformer to generate contextual sentence embeddings, and use MoverScore over text embeddings to measure how similar a sub-text block is to the entire text. To overcome the exponential blowup in the number of sub-text blocks, we present an approximation algorithm and show that the approximated CSD-1 is almost identical to the exact CSD-1. Under this approximation, we show that the average and median CSD-1’s for news, scholarly research, argument, and narrative articles share the same pattern. We also show that under a certain linear transformation, the complement of the cumulative distribution function of the beta distribution with certain values of α and β resembles a CSD-1 curve. We then use CSD-1’s to extract linguistic features to train an SVC classifier for assessing how well an article is organized. Through experiments, we show that this method achieves high accuracy for assessing student essays. Moreover, we study the CSD of sentence locations, referred to as CSD of the second kind and denoted by CSD-2, and show that average CSD-2’s for different types of articles possess distinctive patterns, which either conform to common perceptions of article structures or provide rectifications of them with minor deviations.
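
A rough sketch of the sampling-based approximation, using SentenceTransformer embeddings with cosine similarity as a stand-in for MoverScore (the model name, helper names and sampling scheme are illustrative assumptions, not the paper's algorithm):

```python
import random
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")   # example embedding model

def block_significance(sentences, block_len, n_samples=20, seed=0):
    """Average similarity of randomly sampled sub-text blocks of a given length
    to the whole article, instead of enumerating all blocks."""
    rng = random.Random(seed)
    doc_vec = model.encode([" ".join(sentences)])
    scores = []
    for _ in range(n_samples):
        start = rng.randrange(0, len(sentences) - block_len + 1)
        block = " ".join(sentences[start:start + block_len])
        scores.append(float(cosine_similarity(model.encode([block]), doc_vec)[0, 0]))
    return sum(scores) / len(scores)

sents = ["Sentence one.", "Sentence two.", "Sentence three.", "Sentence four.", "Sentence five."]
print([round(block_significance(sents, k, n_samples=5), 3) for k in range(1, len(sents) + 1)])
```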

Paper Nr: 149
Title:

Pill Metrics Learning with Multihead Attention

Authors:

Richárd Rádli, Zsolt Vörösházi and László Czúni

Abstract: In object recognition, especially when new classes can easily appear during application, few-shot learning has great importance. Metrics learning is an important elementary technique for few-shot object recognition, which can be applied successfully to pill recognition. To exploit different object features, we use multi-stream metrics learning networks for pill recognition in our article. We investigate the usage of multihead attention layers at different parts of the network. The performance is analyzed on two datasets, with superior results to a state-of-the-art multi-stream pill recognition network.

Paper Nr: 161
Title:

Interdependencies and Cascading Effects of Disasters on Critical Infrastructures: An Analysis of Base Station Communication Networks

Authors:

Eva K. Lee and William Z. Wang

Abstract: There are sixteen critical infrastructure (CI) sectors whose assets, systems, and networks, whether physical or virtual, are considered so vital to the United States that their incapacitation or destruction would have a debilitating effect on military readiness, economic security, public health, or safety. The communications sector is unique as a critical infrastructure sector due to its central role in facilitating the flow of information, enabling communication, and supporting all other CIs as well as other components of the economy and society. Within the communications sector, the cellular base station (cell tower) network serves as its foundational backbone. During a crisis, if towers in the network stop functioning or are damaged, the service load of associated users/businesses will have to be transferred to other towers, potentially causing congestion and cascading effects of overload service outages and vulnerabilities. In this paper, we investigate cellular base station network vulnerability by uncovering the most critical nodes in the network whose collapse would trigger extreme cascading effects. We model the cellular base station network via a linear-threshold influence network, with the objective of maximizing the spread of influence. A two-stage approach is proposed to determine the set of critical nodes. The first stage clusters the nodes geographically to form a set of sub-networks. The second stage simulates congestion propagation by solving an influence maximization problem on each sub-network via a greedy Monte Carlo simulation and a heuristic Simpath algorithm. We also identify the cascading nodes that could run into failure if critical nodes fail. The results offer policymakers insight into allocating resources for maximum protection and resiliency against natural disasters or attacks by terrorists or foreign adversaries. We extend the model to the weighted LT influence network (WLT-IN) and prove that the associated influence function is monotone and submodular. We also demonstrate an adaptable usage of WLT-IN for airport risk assessment and biological intelligence of COVID-19 disease spread and its scope of impact on air transportation, the economy, and population health.
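
A simplified sketch of the second-stage greedy seed selection under a linear-threshold model, with a synthetic random graph in place of the cell-tower network (node thresholds, edge weights and graph size are placeholders; the Simpath heuristic is not shown):

```python
import random
import networkx as nx

def lt_spread(g, seeds, seed=0):
    """One linear-threshold cascade: a node activates once the summed weight
    of its active in-neighbours reaches its (randomly drawn) threshold."""
    rng = random.Random(seed)
    thresholds = {v: rng.random() for v in g}
    active = set(seeds)
    changed = True
    while changed:
        changed = False
        for v in g:
            if v in active:
                continue
            w = sum(g[u][v].get("weight", 1.0) for u in g.predecessors(v) if u in active)
            if w >= thresholds[v]:
                active.add(v)
                changed = True
    return len(active)

def greedy_seeds(g, k, n_sim=10):
    """Greedy Monte Carlo: repeatedly add the node with the largest marginal spread."""
    seeds = []
    for _ in range(k):
        best = max((v for v in g if v not in seeds),
                   key=lambda v: sum(lt_spread(g, seeds + [v], s) for s in range(n_sim)))
        seeds.append(best)
    return seeds

g = nx.gnp_random_graph(40, 0.1, directed=True, seed=1)
for u, v in g.edges:
    g[u][v]["weight"] = 1.0 / max(g.in_degree(v), 1)   # incoming weights sum to 1
print(greedy_seeds(g, k=3))
```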

Paper Nr: 187
Title:

On the Value of Combiners in Heterogeneous Ensemble Effort Estimation

Authors:

Mohamed Hosni

Abstract: Effectively managing a software project to deliver a high-quality product primarily depends on accurately estimating the effort required throughout the software development lifecycle. Various effort estimation methods have been proposed in the literature, including machine learning (ML) techniques. Previous attempts have aimed to provide accurate estimates of software development effort estimation (SDEE) using individual estimation techniques. However, the literature on SDEE suggests that there is no commonly superior estimation technique applicable to all software project contexts. Consequently, the idea of using an ensemble approach emerged. An ensemble combines multiple estimators using a specific combination rule. This approach has been investigated extensively in the past decade, with overall results indicating that it can yield better performance compared to other estimation approaches. However, not all aspects of ensemble methods have been thoroughly explored in the literature, particularly the combination rule used to generate the ensemble’s output. Therefore, this paper aims to shed light on this approach by investigating both types of combiners: three linear and four non-linear. The ensemble learners employed in this study were K-Nearest Neighbors, Decision Trees, Support Vector Regression, and Artificial Neural Networks. The grid search technique was employed to tune the hyperparameters for both the learners and the non-linear combiners. Six datasets were utilized for the empirical analysis. The overall results were satisfactory, as they indicated that the ensemble and single techniques exhibited similar predictive properties, and the ensemble with a non-linear rule demonstrated better performance.
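
A small sketch contrasting a linear combiner (the arithmetic mean of the base estimates) with a non-linear, learned combiner, using the four learners named above; the SDEE datasets and the grid search are replaced by synthetic data, so this only illustrates the two kinds of combination rule:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import StackingRegressor, VotingRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=10, noise=10.0, random_state=0)

base = [
    ("knn", KNeighborsRegressor()),
    ("dt", DecisionTreeRegressor(random_state=0)),
    ("svr", SVR()),
    ("ann", MLPRegressor(max_iter=2000, random_state=0)),
]
linear_combiner = VotingRegressor(base)                               # mean of estimates
nonlinear_combiner = StackingRegressor(base, final_estimator=SVR())   # learned combination

for name, model in [("mean", linear_combiner), ("stacking", nonlinear_combiner)]:
    print(name, cross_val_score(model, X, y, cv=5, scoring="neg_mean_absolute_error").mean())
```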

Short Papers
Paper Nr: 39
Title:

A Flexible Approach for Retrieving Geometrically Similar Finite Element Models Using Point Cloud Autoencoders

Authors:

Sonja Schlenz, Simon Mößner, Carl H. Ek and Fabian Duddeck

Abstract: For the development of complex products like vehicle components, knowledge about previous solutions is a key factor. Complete solutions or parts thereof can often be reused if a similar previous model can be identified. To gain independence from the individual experience of single engineers about previous models and a tedious search process, identifying and retrieving the most similar models from large databases offers great potential. Accordingly, this paper introduces a method to achieve this kind of shape retrieval based on engineering data. 3D geometries are represented as point clouds and reduced to one single vector with an autoencoder to identify similarities in the latent space. The method can be used in a flexible way to identify global or local similarities as well as to emphasize different parts of the structure in the similarity search. The method is evaluated on an industrial dataset containing real-world engineering data.

Paper Nr: 40
Title:

Fine-Tuning and Aligning Question Answering Models for Complex Information Extraction Tasks

Authors:

Matthias Engelbach, Dennis Klau, Felix Scheerer, Jens Drawehn and Maximilien Kintz

Abstract: The emergence of Large Language Models (LLMs) has boosted performance and possibilities in various NLP tasks. While the usage of generative AI models like ChatGPT opens up new opportunities for several business use cases, their current tendency to hallucinate fake content strongly limits their applicability to document analysis, such as information retrieval from documents. In contrast, extractive language models like question answering (QA) or passage retrieval models guarantee that query results are found within the boundaries of an according context document, which makes them candidates for more reliable information extraction in productive environments of companies. In this work we propose an approach that uses and integrates extractive QA models into a document analysis solution for improved feature extraction from German business documents such as insurance reports or medical leaflets. We further show that fine-tuning existing German QA models boosts performance for tailored extraction tasks of complex linguistic features like damage cause explanations or descriptions of medication appearance, even when using only a small set of annotated data. Finally, we discuss the relevance of scoring metrics for evaluating information extraction tasks and deduce a combined metric from Levenshtein distance, F1-Score, Exact Match and ROUGE-L to mimic the assessment criteria of human experts.
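
For illustration, a minimal extractive QA call with the Hugging Face transformers pipeline; the checkpoint named below is just an example of a public German QA model, not the fine-tuned model from the paper, and the claim text is invented:

```python
from transformers import pipeline

# Example German extractive QA checkpoint; the answer span is constrained to the context.
qa = pipeline("question-answering", model="deepset/gelectra-base-germanquad")

context = ("Der Schaden am Fahrzeug entstand durch Hagel waehrend eines "
           "Unwetters am 12. Juni auf dem Firmenparkplatz.")
result = qa(question="Was war die Schadensursache?", context=context)
print(result["answer"], result["score"])
```

Because the answer is always a span of the supplied context, this kind of model cannot hallucinate content outside the document, which is the reliability argument made above.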

Paper Nr: 44
Title:

Recommendation System for Product Test Failures Using BERT

Authors:

Xiaolong Sun, Henrik Holm, Sina Molavipour, Fitsum G. Gebre, Yash Pawar, Kamiar Radnosrati and Serveh Shalmashi

Abstract: Historical failure records can provide insights to investigate if a similar situation occurred during the troubleshooting process in software. However, in the era of information explosion, massive amounts of data make it unrealistic to rely solely on manual inspection of root causes, not to mention mapping similar records. With the ongoing development and breakthroughs of Natural Language Processing (NLP), we propose an end-to-end recommendation system that can instantly generate a list of similar records given a new raw failure record. The system consists of three stages: 1) general and tailored pre-processing of raw failure records; 2) information retrieval; 3) information re-ranking. In the process of model selection, we undertake a thorough exploration of both frequency-based models and language models. To mitigate issues stemming from imbalances in the available labeled data, we propose an updated Recall@K metric that utilizes an adaptive K. We also develop a multi-stage training pipeline to deal with limited labeled data and investigate how different strategies affect performance. Our comprehensive experiments demonstrate that our two-stage BERT model, fine-tuned on extra domain data, achieves the best score over the baseline models.
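
One plausible reading of the adaptive-K idea, sketched as a metric whose cutoff follows the number of labelled relevant records per query (an illustrative formulation, not necessarily the paper's exact definition; names are hypothetical):

```python
def recall_at_adaptive_k(ranked_ids, relevant_ids, k_cap=10):
    """Recall@K where K adapts to the number of known relevant records,
    so queries with few labelled duplicates are not unfairly penalised."""
    k = min(max(len(relevant_ids), 1), k_cap)            # adaptive cutoff
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_ids)
    return hits / len(relevant_ids) if relevant_ids else 0.0

ranked = ["r7", "r3", "r9", "r1", "r4"]   # system ranking for a new failure record
relevant = {"r3", "r1"}                   # labelled similar records
print(recall_at_adaptive_k(ranked, relevant))   # K adapts to 2 here -> 0.5
```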

Paper Nr: 46
Title:

An Explainable Knowledge Graph-Based News Recommendation System

Authors:

Zühal Kurt, Thomas Köllmer and Patrick Aichroth

Abstract: The paper outlines an explainable knowledge graph-based recommendation system that aims to provide personalized news recommendations and tries to explain why an item is recommended to a particular user. The system leverages a knowledge graph (KG) that models the relationships between items and users’ preferences, as well as external knowledge sources such as item features and user profiles. The main objectives of this study are to train a recommendation model that can predict whether a user will click on a news article or not, and then to obtain explainable recommendations for the same purpose. This is achieved in three steps: Firstly, the KG of the MIND dataset is generated based on the users’ history, their click information, and the category and subcategory of the news. Secondly, path reasoning approaches are utilized to obtain explainable paths for recommended news items. Thirdly, the proposed KG-based model is evaluated using the MIND news datasets. Experiments have been conducted using the MIND-demo and MIND-small datasets, which are open-source English news datasets for public research. Experimental results indicate that the proposed approach performs better in terms of recommendation explainability, making it a promising basis for developing transparent and interpretable recommendation systems.

Paper Nr: 61
Title:

Enterprise Search: Learning to Rank with Click-Through Data as a Surrogate for Human Relevance Judgements

Authors:

Colin Daly and Lucy Hederman

Abstract: Learning to Rank (LTR) has traditionally made use of relevance judgements (i.e. human annotations) to create training data for ranking models. But, gathering feedback in the form of relevance judgements is expensive, time-consuming and may be subject to annotator bias. Much research has been carried out by commercial web search providers into harnessing click-through data and using it as a surrogate for relevance judgements. Its use in Enterprise Search (ES), however, has not been explored. If click-through data relevance feedback correlates with that of the human relevance judgements, we could dispense with small relevance judgement training data and rely entirely on abundant quantities of click-through data. We performed a correlation analysis and compared the ranking performance of a ‘real world’ ES service of a large organisation using both relevance judgements and click-through data. We introduce and publish the ENTRP-SRCH dataset specifically for ES. We calculated a correlation coefficient of r = 0.704 (p<0.01). Additionally, the nDCG@3 ranking performance using relevance judgements is just 1.6% higher than when click-through data is used. Subsequently, we discuss ES implementation trade-offs between relevance judgements and implicit feedback and highlight potential preferences and biases of both end-users and expert annotators.
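
A small sketch of the two quantities compared in the study, the Pearson correlation between judged and click-derived relevance labels and nDCG@3, computed here on toy values:

```python
import numpy as np
from scipy.stats import pearsonr

def ndcg_at_k(relevances, k=3):
    """nDCG@k with the standard exponential gain and log2 discount."""
    rel = np.asarray(relevances, dtype=float)[:k]
    dcg = np.sum((2 ** rel - 1) / np.log2(np.arange(2, rel.size + 2)))
    ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
    idcg = np.sum((2 ** ideal - 1) / np.log2(np.arange(2, ideal.size + 2)))
    return dcg / idcg if idcg > 0 else 0.0

judged = [3, 2, 0, 1, 2, 3, 0, 1]   # toy human relevance judgements
clicks = [2, 2, 0, 1, 1, 3, 0, 1]   # toy click-through-derived labels
r, p = pearsonr(judged, clicks)
print(f"r = {r:.3f}, p = {p:.3g}, nDCG@3 = {ndcg_at_k(judged):.3f}")
```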

Paper Nr: 62
Title:

A Personalized Book Recommender System for Adults Based on Deep Learning and Filtering

Authors:

Yiu-Kai Ng

Abstract: Reading improves the reader’s vocabulary and knowledge of the world. It can open minds to different ideas which may challenge the reader to view things in a different light. Reading books benefits both the physical and mental health of the reader, and those benefits can last a lifetime. They begin in early childhood and continue through the senior years. A good book should make the reader curious to learn more, and excited to share with others. For some readers, their reluctance to read is due to competing interests such as sports. For others, it is because reading is difficult and they associate it with frustration and strain. A lack of imagination can turn reading into a rather boring activity. In order to encourage adults to read, we propose an elegant book recommender for adults based on deep learning and filtering approaches that can infer the content and the quality of books without utilizing the actual content, which is often unavailable due to copyright constraints. Our book recommender filters books for adult readers simply based on user ratings, which are widely available on social media, for making recommendations. Experimental results have verified the effectiveness of our proposed book recommender system.

Paper Nr: 68
Title:

Mapping Cost-Sensitive Learning for Imbalanced Medical Data: Research Trends and Applications

Authors:

Imane Araf, Ali Idri and Ikram Chairi

Abstract: Incorporating Machine Learning (ML) in medicine has opened up new avenues for leveraging complex medical data to enhance patient outcomes and advance the field. However, the imbalanced nature of medical data poses a significant challenge, resulting in biased ML models that perform poorly on the minority class of interest. To address this issue, researchers have proposed various approaches, among which Cost-Sensitive Learning (CSL) stands out as a promising technique to improve the accuracy of ML models. To the best of our knowledge, this paper presents the first systematic mapping study on CSL for imbalanced medical data. To comprehensively investigate the scope of existing literature, papers published from January 2010 to December 2022 and sourced from five major digital libraries were thoroughly explored. A total of 173 papers were selected and analyzed according to three classification criteria: publication years, channels and sources; medical disciplines; and CSL approaches. This study provides a valuable resource for researchers seeking to explore the current state of research and advance the application of CSL for imbalanced data in medicine.

Paper Nr: 70
Title:

Which Word Embeddings for Modeling Web Search Queries? Application to the Study of Search Strategies

Authors:

Claire Ibarboure, Ludovic Tanguy and Franck Amadieu

Abstract: In order to represent the global strategies deployed by a user during an information retrieval session on the Web, we compare different pretrained vector models capable of representing the queries submitted to a search engine. More precisely, we use static (type-level) and contextual (token-level, such as provided by transformers) word embeddings on an experimental French dataset in an exploratory approach. We measure to what extent the vectors are aligned with the main topics on the one hand, and with the semantic similarity between two consecutive queries (reformulations) on the other. Even though contextual models manage to differ from the static model, it is with a small margin and a strong dependence on the parameters of the vector extraction. We propose a detailed analysis of the impact of these parameters (e.g. combination and choice of layers). In this way, we observe the importance of these parameters on the representation of queries. We illustrate the use of models with a representation of a search session as a trajectory in a semantic space.

Paper Nr: 72
Title:

Deep Reinforcement Agent for Efficient Instant Search

Authors:

Ravneet S. Arora, Sreejith Menon, Ayush Jain and Nehil Jain

Abstract: Instant Search is a paradigm where a search system retrieves answers on the fly while typing. The naïve implementation of an Instant Search system would hit the search back-end for results each time a user types a key, imposing a very high load on the underlying search system. In this paper, we propose to address the load issue by identifying tokens that are semantically more salient toward retrieving relevant documents and utilizing this knowledge to trigger an instant search selectively. We train a reinforcement agent that interacts directly with the search engine and learns to predict the word’s importance in relation to the search engine. Our proposed method treats the search system as a black box and is more universally applicable to diverse architectures. To further support our work, a novel evaluation framework is presented to study the trade-off between the number of triggered searches and the system’s performance. We utilize the framework to evaluate and compare the proposed reinforcement method with other baselines. Experimental results demonstrate the efficacy of the proposed method in achieving a superior trade-off.

Paper Nr: 74
Title:

KRAKEN: A Novel Semantic-Based Approach for Keyphrases Extraction

Authors:

Simone D’Amico, Lorenzo Malandri, Fabio Mercorio and Mario Mezzanzanica

Abstract: A research area of NLP known as keyphrase extraction aims to identify words and expressions in a text that comprehensively represent the content of the text itself. In this study, we introduce a new approach called KRAKEN (Keyphrase extRAction maKing use of EmbeddiNgs). Our method takes advantage of widely used NLP techniques to extract keyphrases from a text in an unsupervised manner, and we compare the results with well-known benchmark datasets from the literature. The main contribution of this work is developing a novel approach for keyphrase extraction. Both natural language text preprocessing techniques and distributional semantics techniques, such as word embeddings, are used to obtain a vector representation of the texts that maintains their semantic meaning. Through KRAKEN, we propose and design a new method that exploits word embeddings for identifying keyphrases, considering the relationship among words in the text. To evaluate KRAKEN, we employ benchmark datasets and compare our approach with state-of-the-art methods. Another contribution of this work is the introduction of a metric to rank the identified keyphrases, considering the relatedness of both the words within the phrases and all the extracted phrases from the same text.
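
For context, a minimal embedding-based keyphrase baseline in the same spirit: candidate n-grams ranked by cosine similarity to the document embedding (KRAKEN's own candidate generation and ranking metric are not reproduced; the model name and helpers are arbitrary choices):

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("all-MiniLM-L6-v2")   # example embedding model

def candidate_ngrams(text, n_max=3):
    """Naive candidate generation: all 1- to n_max-grams of the lowercased text."""
    tokens = text.lower().split()
    grams = (" ".join(tokens[i:i + n]) for n in range(1, n_max + 1)
             for i in range(len(tokens) - n + 1))
    return sorted(set(grams))

def extract_keyphrases(text, top_k=5):
    candidates = candidate_ngrams(text)
    doc_vec = model.encode([text])
    cand_vecs = model.encode(candidates)
    scores = cosine_similarity(cand_vecs, doc_vec).ravel()
    return sorted(zip(candidates, scores), key=lambda x: -x[1])[:top_k]

text = "Unsupervised keyphrase extraction ranks phrases by semantic similarity to the document"
print(extract_keyphrases(text))
```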

Paper Nr: 84
Title:

Approaches for Enhancing Preference Balance in Neighbor-Based Group Recommender Systems

Authors:

Le H. Nam

Abstract: The increasing trend of group activities has led to changes in recommender systems, shifting from recommending to individual users to recommending to groups of users. A group recommender system consists of two primary stages: aggregating the profiles of all group members to create a virtual user and providing recommendations to this virtual user. This paper focuses on the stage of recommending to the virtual user. Specifically, our proposed approach aims to recommend to the virtual user in a way that achieves a harmonious balance among the diverse preferences of group members by combining the profiles of group members with that of the virtual user. Additionally, we integrate textual comments observed from users to further enhance the accuracy of group recommendations. Experiments conducted on three popular datasets from Amazon have demonstrated the effectiveness of the proposed approach in terms of the F1-score.

Paper Nr: 96
Title:

Enhancing Diabetic Retinopathy Detection Using CNNs with Dimensionality Reduction Techniques and K-Nearest Neighbors Ensembles

Authors:

Chaymaa Lahmar and Ali Idri

Abstract: Diabetic Retinopathy (DR) is the most frequent cause of blindness and visual impairment among working-age adults in the world. Machine learning (ML) and deep learning (DL) techniques are playing an important role in the early detection of DR. This paper proposes a new homogeneous ensemble approach constructed using a set of hybrid architectures, as base learners, and two combination rules (hard and weighted voting) for referable DR detection using fundus images over the Kaggle DR, APTOS and Messidor-2 datasets. The hybrid architectures are created using seven deep feature extractors (DenseNet201, InceptionResNetV2, MobileNetV2, InceptionV3, VGG16, VGG19, and ResNet50), six dimensionality reduction techniques (Principal component analysis, Select from model feature selection, Recursive feature elimination with cross-validation, Factor analysis, Chi-Square test, and Low variance filter), and k-nearest neighbors algorithm (KNN) for classification. The results showed the importance of the proposed approach considering that it outperformed its base learners, and achieved an accuracy value of 92.47% for the Kaggle DR dataset, 89.59% for the APTOS dataset, and 82.03% for the Messidor-2 dataset. The experimental results demonstrated that the proposed approach is impactful for the detection of referable DR, and thus represents a promising tool to assist ophthalmologists in the diagnosis of DR.

Paper Nr: 98
Title:

Attentional Sentiment and Confidence Aware Neural Recommender Model

Authors:

Lamia Berkani, Lina Ighilaza and Fella Dib

Abstract: Two of the major problems of recommendation systems are rating data sparseness and information overload. To address these issues, some studies leverage review information to construct an accurate user/item latent factor. We propose in this article a neural hybrid recommender model based on attentional hybrid sentiment analysis, using BERT word embeddings and deep learning models. An attention mechanism is used to capture the most relevant information. As reviews may contain misleading information (“fake good reviews”/“fake bad reviews”), a confidence matrix has been used to measure the relationship between rating outliers and misleading reviews. Then, the sentiment analysis module with fake review detection is used to update the user-item rating matrix. Finally, a hybrid recommendation is processed by combining generalized matrix factorization (GMF) and the multilayer perceptron (MLP). The results of experiments on two datasets from the Amazon database show that our approach significantly outperforms state-of-the-art baselines and related work.

Paper Nr: 102
Title:

Automated Classification of Building Objects Using Machine Learning

Authors:

Nadeem Iftikhar, Peter N. Gade, Kasper M. Nielsen and Jesper Mellergaard

Abstract: In the construction sector, digital technologies are being employed to enable architects, engineers and builders in the creation of digital building models. Although these technologies come equipped with inherent classification systems, they also bring forth certain obstacles. Frequently, these systems categorize building elements at levels that exceed their necessary specificity. To illustrate, these classification systems might allocate values at a broader granularity, such as “exterior wall” rather than at a more precise level, like “exterior glass wall with no columns”. As a result, the manual classification of building elements at a granular level becomes essential. Nonetheless, manual classification frequently results in inaccuracies and erroneous semantic details, while also consuming a significant amount of time. Precise and prompt classification of building objects holds significant importance for activities like cost planning, construction cost management and overall procurement processes. To address this, the current paper suggests an automated classification approach for building objects, focusing on specific types, through the utilization of machine learning. The effectiveness of the proposed system is showcased using real-world data from a prominent architectural firm based in Scandinavia.

Paper Nr: 105
Title:

Make Deep Networks Shallow Again

Authors:

Bernhard Bermeitinger, Tomas Hrycej and Siegfried Handschuh

Abstract: Deep neural networks have a good success record and are thus viewed as the best architecture choice for complex applications. Their main shortcoming has been, for a long time, the vanishing gradient which prevented the numerical optimization algorithms from acceptable convergence. An important special case of network architecture, frequently used in computer vision applications, consists of using a stack of layers of the same dimension. For this architecture, a breakthrough has been achieved by the concept of residual connections—an identity mapping parallel to a conventional layer. This concept substantially alleviates the vanishing gradient problem and is thus widely used. The focus of this paper is to show the possibility of substituting the deep stack of residual layers with a shallow architecture with comparable expressive power and similarly good convergence properties. A stack of residual layers can be expressed as an expansion of terms similar to the Taylor expansion. This expansion suggests the possibility of truncating the higher-order terms and receiving an architecture consisting of a single broad layer composed of all initially stacked layers in parallel. In other words, a sequential deep architecture is substituted by a parallel shallow one. Prompted by this theory, we investigated the performance capabilities of the parallel architecture in comparison to the sequential one. The computer vision datasets MNIST and CIFAR10 were used to train both architectures for a total of 6,912 combinations of varying numbers of convolutional layers, numbers of filters, kernel sizes, and other meta parameters. Our findings demonstrate a surprising equivalence between the deep (sequential) and shallow (parallel) architectures. Both layouts produced similar results in terms of training and validation set loss. This discovery implies that a wide, shallow architecture can potentially replace a deep network without sacrificing performance. Such substitution has the potential to simplify network architectures, improve optimization efficiency, and accelerate the training process.
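
A rough rendering of the expansion argument in our own notation (not taken verbatim from the paper): writing each residual layer as the identity plus a block $F_i$, dropping the nested dependence of later blocks on earlier ones leaves all blocks acting on the input in parallel.

```latex
% Two-block case: the approximation drops the higher-order (nested) term.
\begin{align}
(\mathrm{Id} + F_2)\circ(\mathrm{Id} + F_1)(x)
  &= x + F_1(x) + F_2\bigl(x + F_1(x)\bigr) \\
  &\approx x + F_1(x) + F_2(x),
\end{align}
% and, applied repeatedly to a stack of L residual layers,
\begin{equation}
(\mathrm{Id} + F_L)\circ\cdots\circ(\mathrm{Id} + F_1)(x) \;\approx\; x + \sum_{i=1}^{L} F_i(x).
\end{equation}
```

The right-hand side is exactly one broad layer whose branches are the original blocks evaluated in parallel, which is the shallow substitute investigated in the paper.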

Paper Nr: 112
Title:

DrBerry: Detection of Diseases in Blueberry Bush Leaves

Authors:

Cristopher Morales, Edgar Cavero and Willy Ugarte

Abstract: The following research presents a mobile application that can recognize the following pests and diseases usually found on blueberry leaves: oidium, heliothis and alternaria. These afflictions affect the growth of the bush and thus reduce its yield. Additionally, an open dataset will be made available for future investigations. YOLOv5, a convolutional neural network, is used for the development of the model; data collection was performed at Fundo San Roberto, Huaral, Peru; and data augmentation techniques were used to increase the amount of workable data. Thanks to this, the following results were obtained: an accuracy of 84% and a recall of 91%. We predict that the model could be improved to recognize other pests and diseases given the right amount of data.

Paper Nr: 115
Title:

Easy Scaling: The Most Critical Consideration for Choosing Analytical Database Management Systems in the Cloud Era

Authors:

Jie Liu and Genyuan Du

Abstract: Analytical database management systems offer significant advantages for organizations practicing data-driven decision-making. ADBMSs rely on massively parallel processing for performance improvement, increased availability and other computation-related resources, and improved scalability and stability. In this position paper, we argue that (1) Gustafson-Barsis’ Law aligns well with use cases suitable for Cloud-based ADBMSs, but neither Amdahl’s law nor Gustafson’s law is sufficient in guiding us to answer the question "how many processors should we use to gain better performance economically", and (2) an ADBMS’s capability of utilizing parallel processing does not translate directly into easy scaling, especially scaling horizontally by adding more instances or nodes to distribute the workload at will; so, when costs are somewhat controllable, allowing easy scaling should be by far the most critical consideration for choosing an ADBMS.
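
For reference, the two scaling laws the argument contrasts, with a small worked example (the serial fraction $s$ and processor count $p$ below are arbitrary values chosen for illustration):

```latex
% Amdahl fixes the problem size; Gustafson lets the workload grow with p.
\begin{align}
S_{\text{Amdahl}}(p)    &= \frac{1}{s + \frac{1 - s}{p}}, \\
S_{\text{Gustafson}}(p) &= s + p\,(1 - s).
\end{align}
% Example with s = 0.1: Amdahl caps the speedup at 1/s = 10 no matter how many
% processors are added, while Gustafson's scaled speedup at p = 32 is
% 0.1 + 32 * 0.9 = 28.9 -- neither law, however, says how many processors are
% economical for a given cloud workload.
```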

Paper Nr: 122
Title:

Enhancing Gesture Recognition for Sign Language Interpretation in Challenging Environment Conditions: A Deep Learning Approach

Authors:

Domenico Amalfitano, Vincenzo D’Angelo, Antonio M. Rinaldi, Cristiano Russo and Cristian Tommasino

Abstract: Gesture recognition systems have gained popularity as an effective means of communication, leveraging the simplicity and effectiveness of gestures. With the absence of a universal sign language due to regional variations and limited dissemination in schools and media, there is a need for real-time translation systems to bridge the communication gap. The proposed system aims to translate, in real time, American Sign Language (ASL), the predominant sign language used by deaf communities in North America, West Africa, and Southeast Asia. The system utilizes the SSD MobileNet FPN architecture, known for its real-time performance on low-power devices, and leverages transfer learning techniques for efficient training. Data augmentation and preprocessing procedures are applied to improve the quality of the training data. The system’s detection capability is enhanced by adapting color space conversions, such as RGB to YCbCr and HSV, to improve segmentation under varying lighting conditions. Experimental results demonstrate the system’s accessibility and non-invasiveness, achieving high accuracy in recognizing ASL signs.
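
A small sketch of the colour-space preprocessing step with OpenCV, using a synthetic frame and indicative skin-tone thresholds (the exact thresholds and segmentation logic used by the system are not given in the abstract and are assumptions here):

```python
import cv2
import numpy as np

# Synthetic RGB frame standing in for a camera capture.
frame_rgb = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)

# OpenCV's flag for YCbCr is COLOR_RGB2YCrCb (channel order Y, Cr, Cb).
frame_ycrcb = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2YCrCb)
frame_hsv = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2HSV)

# Example skin-tone mask on the Cr/Cb channels; threshold values are indicative only.
mask = cv2.inRange(frame_ycrcb, (0, 133, 77), (255, 173, 127))
print(frame_ycrcb.shape, frame_hsv.shape, int(mask.sum()))
```

Working in YCbCr or HSV separates luminance from chrominance, which is why the segmentation becomes less sensitive to lighting changes than thresholding raw RGB values.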

Paper Nr: 138
Title:

An Explorative Guide on How to Detect Forged Car Insurance Claims with Language Models

Authors:

Quentin Telnoff, Emanuela Boros, Mickael Coustaty, Fabrice Crohas, Antoine Doucet and Frédéric L. Bars

Abstract: Detecting forgeries in insurance car claims is a complex task that requires detecting fraudulent or overstated claims related to property damage or personal injuries after a car accident. Building predictive models for detecting them raises several issues (e.g. imbalance, concept drift) that cannot only depend on the frequency or timing of the reported incidents. The difficulty of tackling this type of task is further intensified by the static tabular data generally used in this domain, while submitted insurance claims largely consist of textual data. We, thus, propose an explorative guide for detecting forged car insurance claims with language models. Specifically, we investigate two transformer-based frameworks: supervised (where the model is trained to differentiate between forged and non-forged cases) and self-supervised (where the model captures the standard attributes of non-forged claims). For handling static tabular data and unstructured text fields, we inspect various forms of data row modelling (table serialization techniques), different losses, and two language models (one general and one domain-specific). Our work highlights the challenges and limitations of existing frameworks.

Paper Nr: 139
Title:

Classification of Questionnaires with Open-Ended Questions

Authors:

Miraç Tuğcu, Tolga Çekiç, Begüm Ç. Erdinç, Seher C. Akay and Onur Deniz

Abstract: Questionnaires with open-ended questions are used across industries to collect insights from respondents. The answers to these questions may lead to labelling errors because of the complexity of the questions. However, to handle this noise in the data, manual labour might not be feasible due to low-resource scenarios. Here, we propose an end-to-end solution that handles questionnaire-style data as a text classification problem. In order to mitigate labelling errors, we use a data-centric approach to group inconsistent examples from a banking customer questionnaire dataset in Turkish. For the model architecture, a BiLSTM is preferred to capture long-term dependencies between the contextualized word embeddings of BERT. We achieved significant results on the binary questionnaire classification task. We obtained results of up to 81.9% recall and a 79.8% F1 score with the clustering method used to clean the dataset, and we present results on how it impacts overall model performance on both the original and clean versions of the data.

Paper Nr: 153
Title:

Machine Learning Models for Prostate Cancer Identification

Authors:

Elias Dritsas, Maria Trigka and Phivos Mylonas

Abstract: In the present research paper, we focused on prostate cancer identification with machine learning (ML) techniques and models. Specifically, we approached the specific disease as a 2-class classification problem by categorizing patients based on tumour type as benign or malignant. We applied the synthetic minority over-sampling technique (SMOTE) in our ML models in order to reveal the model with the best predictive ability for our purpose. After the experimental evaluation, the Rotation Forest (RotF) model outperformed the others, achieving an accuracy, precision, recall, and F1-score of 86.3%, and an AUC equal to 92.4% after SMOTE with 10-fold cross-validation.
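
A sketch of the evaluation setup, SMOTE inside a 10-fold cross-validation pipeline; Rotation Forest is not available in scikit-learn, so a Random Forest stands in for the classifier and the data are synthetic:

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import make_pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Imbalanced placeholder data standing in for the benign/malignant records.
X, y = make_classification(n_samples=600, weights=[0.85, 0.15], random_state=0)

# SMOTE inside the pipeline is applied only to the training folds,
# so synthetic samples never leak into the validation folds.
pipe = make_pipeline(SMOTE(random_state=0), RandomForestClassifier(random_state=0))
print(cross_val_score(pipe, X, y, cv=10, scoring="f1").mean())
```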

Paper Nr: 165
Title:

Mexico City’s Urban Trees Reforestation Based on Characteristics of Plantation Sites

Authors:

Héctor J. Vázquez and Mihaela Juganaru

Abstract: Urban tree reforestation grounded in the characteristics of plantation sites is necessary for tree maintenance and health care. Decisions concerning when, where and which tree species to plant have important consequences for tree survival and resilience. Through the application of Multiple Correspondence Analysis and clustering of qualitative criteria, it was possible to establish nine clusters based on the qualitative modalities of plantation sites and thus to associate them with urban tree species. The use of indexes related to the percentage of modalities with respect to the sample, and to the specificity and homogeneity of clusters, proved to be useful criteria to describe plantation sites. We study the case of urban trees in Mexico City.

Paper Nr: 169
Title:

Knowledge Discovery for Risk Assessment in Economic and Food Safety

Authors:

Maria C. Silva, Brigida M. Faria and Luis P. Reis

Abstract: Foodborne diseases continue to spread widely in the 21st century. In Portugal, the Economic and Food Safety Authority (ASAE) has the goal of monitoring and preventing non-compliance with regulatory legislation on food safety, regulating the conduct of economic activities in the food and non-food sectors, as well as assessing and communicating risks in the food chain. This work proposes and evaluates a global risk indicator considering three risk factors provided by ASAE (non-compliance rate, product or service risk and consumption volume). It also compares the performance in risk prediction of four classification models, Decision Tree, Naïve Bayes, k-Nearest Neighbor and Artificial Neural Network, before and after feature selection and hyperparameter tuning. The principal findings revealed that service providers, food and beverage, and retail were the activity sectors present in the dataset with the highest global risk associated with them. It was also observed that the Decision Tree classifier presented the best results. It was further verified that data balancing using the SMOTE method led to a performance increase of about 90% with the Decision Tree and k-Nearest Neighbor models. The use of machine learning can be helpful in risk assessment related to food safety and public health. It was possible to conclude that the areas with the highest global risk are the ones that are most frequented by the population and require more attention. Thus, relying on risk assessment using machine learning can have a positive influence on the prevention of economic crime related to food safety as well as on public health.

Paper Nr: 180
Title:

Methodology for the Analysis of Agricultural Data in the Mexican Context: Study Case of Marigold

Authors:

Cristal G. Durán and Mihaela Juganaru

Abstract: Agricultural production data for multiple crops are available as open data; however, to discover information in the data it is necessary to consider methodologies, methods and tools that can guide the research work to specifically explore agricultural data. This article proposes an adaptation of the CRISP-DM and OSEMN methodologies to the agricultural context, which helps to study any crop, and applies the proposed methodology to the agricultural production of an endemic Mexican product, the marigold flower (Tagetes erecta).

Paper Nr: 191
Title:

DAF: Data Acquisition Framework to Support Information Extraction from Scientific Publications

Authors:

Muhammad A. Suryani, Steffen Hahne, Christian Beth, Klaus Wallmann and Matthias Renz

Abstract: Researchers encapsulate their findings in publications, generally available as PDFs, which are designed primarily for platform-independent viewing and printing and do not support editing or automatic data extraction. These documents are a rich source of information in any domain, but the information in these publications is presented in text, tables and figures. However, manual extraction of information from these components would be beyond tedious and necessitates an automatic approach. Therefore, an automatic extraction approach could provide valuable data to the research community while also helping to manage the increasing number of publications. Previously, many approaches focused on extracting individual components from scientific publications, i.e. metadata, text or tables, but failed to target these data components collectively. This paper proposes a Data Acquisition Framework (DAF), the most comprehensive framework to our knowledge. The DAF extracts enhanced metadata, segmented text, and the captions and content of tables and figures. Through rigorous evaluation on two distinct datasets from the Marine Science and Chemical domains, we showcase the superior performance of the DAF compared to the baseline PDFDataExtractor. We also provide an illustrative example to underscore DAF’s adaptability in the realm of research data management.

Paper Nr: 192
Title:

Challenges in Implementing a University-Based Innovation Search Engine

Authors:

Arman Arzani, Marcus Handte and Pedro J. Marrón

Abstract: In universities, technology transfer plays an important role in the joint development and dissemination of knowledge as a product that benefits society through innovation. In order to facilitate knowledge transfer, many universities hire innovation coaches that employ a scouting process to identify faculty members and students who possess the requisite knowledge, expertise, and potential to establish startups. Since there is no systematic approach to measure the innovation potential of university members based on their academic activities, the scouting process is typically subjective and relies heavily on the experience of the innovation coaches. In this paper, we motivate the need for INSE (INnovation Search Engine) to support innovation coaches during their search for innovation potential at a university. After discussing the information needs of the scouting process, we outline a basic system architecture to support it, and we identify a number of research challenges. Our aim is to motivate vigorous research in this area by illustrating the need for novel, data-driven approaches towards effective innovation scouting and successful knowledge transfer.

Paper Nr: 23
Title:

Identifying Similar Top-K Household Electricity Consumption Patterns

Authors:

Nadeem Iftikhar, Akos Madarasz and Finn E. Nordbjerg

Abstract: Gaining insight into household electricity consumption patterns is crucial within the energy sector, particularly for tasks such as forecasting periods of heightened demand. The consumption patterns can furnish insights into advancements in energy efficiency, exemplify energy conservation and demonstrate structural transformations to specific clusters of households. This paper introduces different practical approaches for identifying similar households through their consumption patterns. Initially different data sets are merged, followed by aggregating data to a higher granularity for short-term or long-term forecasts. Subsequently, unsupervised nearest neighbors learning algorithms are employed to find similar patterns. These proposed approaches are valuable for utility companies in offering tailored energy-saving recommendations, predicting demand, engaging consumers based on consumption patterns, visualizing energy use, and more. Furthermore, these approaches can serve to generate authentic synthetic data sets with minimal initial data. To validate the accuracy of these approaches, a real data set spanning eight years and encompassing 100 homes has been employed.
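
A minimal sketch of the similarity step, assuming hourly consumption profiles and scikit-learn's unsupervised NearestNeighbors (the aggregation granularity, number of neighbours and distance metric are illustrative choices, not the paper's configuration):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Placeholder profiles: 100 households x 24 mean hourly consumption values,
# i.e. readings already aggregated to a coarser granularity.
rng = np.random.default_rng(0)
profiles = rng.random((100, 24))

nn = NearestNeighbors(n_neighbors=5, metric="euclidean").fit(profiles)
distances, indices = nn.kneighbors(profiles[:1])   # households most similar to household 0
print(indices[0], np.round(distances[0], 3))
```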
Download
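
For illustration, a minimal sketch of how similar consumption patterns might be retrieved with an unsupervised nearest-neighbors search, assuming the merged readings are aggregated to daily totals; the column names, aggregation level and toy data are assumptions, not the authors' exact pipeline.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical input: one row per (household, timestamp) with an hourly kWh reading.
readings = pd.DataFrame({
    "household": np.repeat([f"H{i}" for i in range(5)], 48),
    "timestamp": pd.date_range("2023-01-01", periods=48, freq="h").tolist() * 5,
    "kwh": np.random.default_rng(0).gamma(2.0, 0.4, size=240),
})

# Aggregate to a coarser granularity (here: daily totals), then pivot so each
# household becomes one feature vector describing its consumption pattern.
daily = (readings
         .assign(day=readings["timestamp"].dt.date)
         .groupby(["household", "day"])["kwh"].sum()
         .unstack("day"))

# Unsupervised nearest-neighbors search over the household profiles.
knn = NearestNeighbors(n_neighbors=3, metric="euclidean").fit(daily.values)
dist, idx = knn.kneighbors(daily.loc[["H0"]].values)

# Top-K households with consumption patterns most similar to H0 (H0 itself is rank 0).
print(daily.index[idx[0]], dist[0])
```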

Paper Nr: 25
Title:

Detecting Greenwashing in the Environmental, Social, and Governance Domains Using Natural Language Processing

Authors:

Yue Zhao, Leon Kroher, Maximilian Engler and Klemens Schnattinger

Abstract: Greenwashing, where companies misleadingly project environmental, social, and governance (ESG) virtues, challenges stakeholders. This study examined the link between internal ESG sentiments and public opinion on social media across 12 pharmaceutical firms from 2012 to 2022. Using natural language processing (NLP), we analyzed internal documents and social media. Our findings showed no significant correlation between internal and external sentiment scores, suggesting potential greenwashing if there’s inconsistency in sentiment. This inconsistency can be a red flag for stakeholders like investors and regulators. In response, we propose an NLP-based Q&A system that generates context-specific questions about a company’s ESG performance, offering a potential solution to detect greenwashing. Future research should extend to other industries and additional data sources like financial disclosures.
Download
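
For illustration, a minimal sketch of the kind of internal-versus-external sentiment correlation check the study describes, assuming yearly sentiment scores per firm have already been computed; the values and variable names are purely illustrative.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical yearly sentiment scores for one firm (2012-2022), e.g. in [-1, 1]:
# one series from internal ESG documents, one from public social-media posts.
internal = np.array([0.31, 0.35, 0.40, 0.38, 0.45, 0.50, 0.52, 0.55, 0.60, 0.62, 0.65])
external = np.array([0.10, 0.05, 0.20, 0.15, 0.08, 0.12, 0.11, 0.09, 0.14, 0.10, 0.13])

# A weak or non-significant correlation between the two series is the kind of
# inconsistency the study flags as a potential greenwashing signal.
r, p = pearsonr(internal, external)
rho, p_rank = spearmanr(internal, external)
print(f"Pearson r={r:.2f} (p={p:.3f}), Spearman rho={rho:.2f} (p={p_rank:.3f})")
```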

Paper Nr: 27
Title:

Speech Detection of Real-Time MRI Vocal Tract Data

Authors:

Jasmin Menges, Johannes Walter, Jasmin Bächle and Klemens Schnattinger

Abstract: This paper investigates the potential of Deep Learning in the area of speech production. The purpose is to study whether algorithms are able to classify spoken content based only on images of the oral region. With the real-time MRI data of Lim et al., more detailed insights into speech production in the vocal tract could be obtained. In this project, the data were used to recognize spoken letters from tongue movements with a vector-based image detection approach. In addition, randomization was applied to generate more data. The pixel vectors of a video clip during which a certain letter was spoken could then be passed into a Deep Learning model. For this purpose, the neural networks LSTM and 3D-CNN were used. It was shown that letters can be classified with an accuracy of 93% using a 3D-CNN model.
Download
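
A minimal Keras sketch of a 3D-CNN letter classifier of the kind the abstract reports, assuming fixed-length clips of the oral region have already been extracted; the clip shape, number of classes and architecture are assumptions, not the authors' exact model.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative clip shape: 16 frames of 64x64 grayscale MRI images, 26 letter classes.
n_frames, height, width, n_classes = 16, 64, 64, 26

model = keras.Sequential([
    layers.Input(shape=(n_frames, height, width, 1)),
    layers.Conv3D(16, kernel_size=3, activation="relu"),
    layers.MaxPooling3D(pool_size=2),
    layers.Conv3D(32, kernel_size=3, activation="relu"),
    layers.MaxPooling3D(pool_size=2),
    layers.Flatten(),
    layers.Dense(64, activation="relu"),
    layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Dummy clips standing in for the randomized/augmented data described in the abstract.
x = np.random.rand(8, n_frames, height, width, 1).astype("float32")
y = np.random.randint(0, n_classes, size=8)
model.fit(x, y, epochs=1, batch_size=4, verbose=0)
```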

Paper Nr: 48
Title:

A Comparative Study on Main Content Extraction Algorithms for Right to Left Languages

Authors:

Houriye Esfahanian, Abdolreza Nazemi and Andreas Geyer-Schulz

Abstract: With the daily increase of information published on the Web, extracting a web page’s main content has become an important issue. Since 2010, in addition to English, content in right-to-left languages such as Arabic or Persian has also been increasing. In this paper, we compare three well-known main content extraction algorithms published in the last decade, Boilerpipe, DANAg, and Web-AM, to find the best algorithm with respect to evaluation measures and performance. The ArticleExtractor algorithm of the Boilerpipe approach was the most accurate, with the highest average F1 score of 0.951. In contrast, the DANAg algorithm delivered the best performance, processing more than 21 megabytes per second. Depending on whether accuracy or throughput matters more for a main content extraction project, either the Boilerpipe or the DANAg algorithm can be used.
Download

Paper Nr: 54
Title:

LAxplore: An NLP-Based Tool for Distilling Learning Analytics and Learning Design Instruments out of Scientific Publications

Authors:

Atezaz Ahmad, Jan Schneider, Daniel Schiffner, Esad Islamovic and Hendrik Drachsler

Abstract: Each year, the amount of research publications is increasing. Staying on top of the state of the art is a pressing issue. The field of Learning Analytics (LA) is no exception, with the rise of digital education systems that are now used broadly from K-12 up to higher education. Keeping track of the advances in LA is challenging. This is especially the case for newcomers to the field, as well as for the increasing number of LA units that advise their teachers and scholars on applying evidence-based research outcomes in their lectures. To keep an overview of the rapidly growing research findings on LA, we developed LAxplore, a tool that uses NLP to extract relevant information from the LA literature. In this article, we present the evaluation of LAxplore. Results from the evaluation show that LAxplore can significantly support researchers in extracting information from relevant LA publications, as it reduces the time spent searching for and retrieving the knowledge by a factor of six. However, the accurate extraction of relevant information from LA literature cannot yet be fully automated and some manual work is still required.
Download

Paper Nr: 64
Title:

Enhancing Healthcare in Emergency Department Through Patient and External Conditions Profiling: A Cluster Analysis

Authors:

Mariana Carvalho and Ana Borges

Abstract: Improving healthcare delivery in emergency departments (EDs) is of paramount importance to ensure efficient and effective patient care. This study aims to enhance healthcare in the ED by employing cluster analysis techniques to profile patients and external conditions. Through a comprehensive analysis of patient data and factors associated with the ED environment, we seek to identify patterns, optimize resource allocation, and tailor interventions for improved outcomes. Identifying distinct patient profiles and understanding the impact of external factors sheds light on the complex dynamics of the ED. Additionally, it enables healthcare professionals to better understand patient populations, anticipate healthcare needs, and tailor treatment plans accordingly. Therefore, in this paper, we apply a clustering technique to obtain three clusters with different characteristics, both at the patient level and at the level of external factors, each associated with different emergency room inflows.
Download
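
The abstract does not name the clustering algorithm, so the sketch below assumes k-means with three clusters over mixed patient-level and external-condition features; the feature names and toy records are illustrative only.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical ED visit records: patient-level features plus external conditions.
visits = pd.DataFrame({
    "age": [23, 67, 45, 81, 34, 56, 72, 29],
    "triage_level": [3, 1, 2, 1, 4, 2, 1, 3],
    "wait_minutes": [40, 10, 25, 5, 90, 30, 8, 55],
    "outside_temp_c": [18, 2, 10, -1, 25, 12, 3, 20],
    "is_weekend": [0, 1, 0, 1, 1, 0, 0, 1],
})

# Standardize, then partition visits into three profiles as in the study.
X = StandardScaler().fit_transform(visits)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Cluster-wise means give a first description of each profile.
print(visits.assign(cluster=labels).groupby("cluster").mean())
```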

Paper Nr: 81
Title:

Collaborative Emotion Annotation: Assessing the Intersection of Human and AI Performance with GPT Models

Authors:

Hande Aka Uymaz and Senem Kumova Metin

Abstract: In this study, we explore emotion detection in text, a complex yet vital aspect of human communication. Our focus is on the formation of an annotated dataset, a task that often presents difficulties due to factors such as reliability, time, and consistency. We propose an alternative approach by employing artificial intelligence (AI) models as potential annotators, or as augmentations to human annotators. Specifically, we utilize ChatGPT, an AI language model developed by OpenAI. We use its latest versions, GPT-3.5 and GPT-4, to label a Turkish dataset of 8,290 terms according to Plutchik’s emotion categories, alongside three human annotators. We conduct experiments to assess the AI’s annotation capabilities both independently and in conjunction with human annotators. We measure inter-rater agreement using Cohen’s kappa, Fleiss’ kappa, and percent agreement across varying emotion categorizations (eight-class, four-class, and binary). In particular, when we filtered out the terms on which the AI models were indecisive, including the AI models in the annotation process increased inter-annotator agreement. Our findings suggest that the integration of AI models in the emotion annotation process holds the potential to enhance efficiency, reduce the time needed for lexicon development and thereby advance the field of emotion/sentiment analysis.
Download
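
A minimal sketch of the agreement metrics named in the abstract (Cohen's kappa, Fleiss' kappa, percent agreement), assuming each term carries one label per annotator; the toy labels are illustrative.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Toy annotations: rows = terms, columns = annotators (e.g. 3 humans + one GPT model),
# values = emotion category indices (8-, 4- or 2-class schemes work the same way).
labels = np.array([
    [0, 0, 0, 0],
    [1, 1, 2, 1],
    [3, 3, 3, 3],
    [2, 1, 2, 2],
    [0, 0, 1, 0],
])

# Pairwise Cohen's kappa and percent agreement between annotator 0 and annotator 3.
kappa = cohen_kappa_score(labels[:, 0], labels[:, 3])
percent = np.mean(labels[:, 0] == labels[:, 3])

# Fleiss' kappa over all annotators: first convert raw labels to per-term counts.
counts, _ = aggregate_raters(labels)
fleiss = fleiss_kappa(counts, method="fleiss")

print(f"Cohen kappa={kappa:.2f}, percent agreement={percent:.2f}, Fleiss kappa={fleiss:.2f}")
```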

Paper Nr: 110
Title:

A Long-Term Funds Predictor Based on Deep Learning

Authors:

Shuiyi Kuang and Yan Zhang

Abstract: Numerous neural network models have been created to predict the rise or fall of stocks since deep learning gained popularity, and many of them have performed quite well. However, since the share market is heavily influenced by policy changes and unexpected news, it is challenging for investors to use such short-term predictions as a guide. In this paper, a suitable long-term predictor for the funds market is proposed and tested using different kinds of neural network models, including the Long Short-Term Memory (LSTM) model with different numbers of layers, the Gated Recurrent Units (GRU) model with different numbers of layers, and the combination model of LSTM and GRU. These models were evaluated on two funds datasets augmented with various stock market technical indicators. Since a fund is a long-term investment, we attempted to predict the range of change over the next 20 trading days. The experimental results demonstrate that the single GRU model performed best, reaching an accuracy of 92.14% in correctly predicting the direction of rise or fall, while the accuracy of predicting the specific range of change reached 85.35%.
Download
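
A minimal Keras sketch of a single-layer GRU predictor of the kind the abstract reports working best, framed here as classifying the range of change over the next 20 trading days; the window length, feature count and number of buckets are assumptions, not the authors' exact configuration.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative setup: 60-day windows of 8 features (price plus technical indicators),
# predicting one of several buckets for the change over the next 20 trading days.
window, n_features, n_buckets = 60, 8, 5

model = keras.Sequential([
    layers.Input(shape=(window, n_features)),
    layers.GRU(64),
    layers.Dense(32, activation="relu"),
    layers.Dense(n_buckets, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# Dummy training data standing in for the fund datasets described in the abstract.
x = np.random.rand(32, window, n_features).astype("float32")
y = np.random.randint(0, n_buckets, size=32)
model.fit(x, y, epochs=1, batch_size=8, verbose=0)
```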

Paper Nr: 113
Title:

Unified New Techniques for NP-Hard Budgeted Problems with Applications in Team Collaboration, Pattern Recognition, Document Summarization, Community Detection and Imaging

Authors:

Dorit S. Hochbaum

Abstract: This paper introduces new techniques for NP-hard problems formulated as monotone integer programs (IPM) with a budget constraint (“budgeted IPM”). Problems of this type have diverse applications, including maximizing team collaboration, the maximum diversity problem, facility dispersion, threat detection, minimizing conductance, clustering, and pattern recognition. We present a unified framework for effective algorithms for budgeted IPM problems based on the Lagrangian relaxation of the budget constraint. It is shown that all optimal solutions for all values of the Lagrange multiplier are generated very efficiently, and that the piecewise linear concave envelope (convex, for minimization problems) of these solutions has breakpoints that are optimal solutions for the respective budgets. This is used to derive high-quality upper and lower bounds for budgets that do not correspond to breakpoints. We show that for all these problems, the weight “perturbation” concept, which was successful in enhancing the number and distribution of breakpoints for the maximum diversity problem, is applicable. Furthermore, the insights derived from this efficient frontier of solutions lead to the result that all the respective ratio problems have a solution at the “first” breakpoint, which generalizes the concept of the maximum density subgraph.
Download
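
For readers unfamiliar with the construction, a generic statement of the relaxation described in the abstract; the notation is an assumption, since the abstract does not spell out the formulation.

```latex
% Budgeted monotone IP and its Lagrangian relaxation (generic form, notation assumed).
\begin{align*}
  \text{(budgeted IPM)}\quad & \max_{x \in \{0,1\}^n}\; f(x)
      \quad \text{s.t.}\quad c^{\top}x \le B,\\
  \text{(relaxation)}\quad   & L(\lambda) \;=\; \max_{x \in \{0,1\}^n}\; f(x) - \lambda\,(c^{\top}x - B),
      \qquad \lambda \ge 0.
\end{align*}
% Solving the relaxation for all values of \lambda yields a finite set of solutions whose
% (budget, objective) pairs are the breakpoints of a piecewise linear concave envelope;
% each breakpoint is optimal for its own budget, and neighboring breakpoints bound the
% optimum for intermediate budgets.
```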

Paper Nr: 117
Title:

Comparing Ensemble and Single Classifiers Using KNN Imputation for Incomplete Heart Disease Datasets

Authors:

Ismail Moatadid, Ibtissam Abnane and Ali Idri

Abstract: Heart disease remains a significant global health challenge, necessitating accurate and reliable classification techniques for early detection and diagnosis. Missing data is a pervasive issue in medical datasets and can severely impact the performance of classification models, which makes choosing a suitable classifier for incomplete data particularly important. In this work, we present a comparative analysis of three ensemble techniques (i.e. Random Forest (RF), Extreme Gradient Boosting (XGB), and Bagging) and three single techniques (i.e. K-nearest neighbor (KNN), Multilayer Perceptron (MLP), and Support Vector Machine (SVM)) applied to four heart disease medical datasets (i.e. Hungarian, Cleveland, Statlog and HeartDisease). The main objective of this study is to compare the performance of ensemble and single classifiers in handling incomplete heart disease datasets using KNN imputation and to identify an effective approach for heart disease classification. We found that, among the single techniques, MLP outperformed SVM and KNN across datasets. Moreover, the ensemble techniques consistently outperformed the single techniques across multiple metrics and datasets, achieving higher accuracy, precision, recall, F1 score, and AUC values. Therefore, for heart disease classification using KNN imputation, the ensemble techniques, particularly RF, Bagging, and XGB, proved to be the most effective models.
Download
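
A minimal sketch of the evaluation setup described in the abstract, pairing KNN imputation with one ensemble classifier (Random Forest) and one single classifier (MLP); the synthetic data, missingness rate and hyper-parameters are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline

# Toy stand-in for an incomplete heart-disease dataset: inject missing values at random.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

# Each model is preceded by KNN imputation so it sees a completed feature matrix.
for name, clf in [("RF (ensemble)", RandomForestClassifier(random_state=0)),
                  ("MLP (single)", MLPClassifier(max_iter=1000, random_state=0))]:
    pipe = make_pipeline(KNNImputer(n_neighbors=5), clf)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```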

Paper Nr: 121
Title:

Machine Learning in Customer-Centric Web Design: The Website of a Portuguese Higher Education Institution

Authors:

Vitor M. Pinto, Fernando P. Belfo, Isabel Pedrosa and Lorenzo Valgimigli

Abstract: Prospective students interact with the brand of higher education institutions (HEI) via several channels throughout their journey of choosing a course in which to enroll. The institutional website is among these channels, and the way it is designed may influence how engaged these visitors are. Web analytics tools allow collecting large amounts of user behavior data, which can generate insights that help to improve an HEI’s website and the students’ incentives to apply for a course. Data mining techniques are presented as a way to generate such insights, with an applied case study of a Portuguese HEI. The CRISP-DM method was used to derive suggestions for improving user engagement. Google Tag Manager, Analytics, BigQuery and RapidMiner were used to collect, store, transform, visualize and model the data using the machine learning algorithms Naïve Bayes, Generalized Linear Model, Logistic Regression, Fast Large Margin and Decision Tree. The main results showed that the course pages attract a high volume of users, but their engagement is low; that the general undergraduate course page is more successful at bringing in users who view course content; and that the master’s and other course pages attract engaged users who also view undergraduate content.
Download

Paper Nr: 154
Title:

Advancing Flotation Process Optimization Through Real-Time Machine Vision Monitoring: A Convolutional Neural Network Approach

Authors:

Ahmed Bendaouia, El H. Abdelwahed, Sara Qassimi, Abdelmalek Boussetta, Intissar Benzakour, Oumkeltoum Amar, François Bourzeix, Khalil Jabbahi and Oussama Hasidi

Abstract: The mining industry’s continuous pursuit of sustainable practices and enhanced operational efficiency has led to increasing interest in leveraging innovative technologies for process monitoring and optimization. This study focuses on the implementation of Convolutional Neural Networks (CNN) for real-time monitoring of differential flotation circuits in the mining sector. Froth flotation, a widely used technique for mineral separation, necessitates precise control and monitoring to achieve maximum recovery of valuable minerals and separate them from gangue. The research delves into the significance of froth surface visual properties and their correlation with flotation froth quality. By capitalizing on CNNs’ ability to identify valid, hidden, novel, potentially useful and meaningful information in image data, this study showcases how they surpass traditional techniques for flotation monitoring. The paper provides an in-depth exploration of the dataset collected from various stages of the zinc flotation banks, labeled with elemental grade values of zinc (Zn), iron (Fe), copper (Cu), and lead (Pb). Using CNNs in a regression setting allows for real-time monitoring of mineral concentrate grades, enabling precise assessment of flotation performance. The successful application of CNNs in the zinc flotation circuit opens up new possibilities for improved process control and optimization in mineral processing. By continuously monitoring froth characteristics, engineers and operators can make informed decisions, leading to enhanced mineral recovery and reduced waste.
Download
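
A minimal Keras sketch of a CNN with a regression head of the kind described in the abstract, mapping froth images to the four elemental grades (Zn, Fe, Cu, Pb); the image size and architecture are assumptions, not the authors' exact network.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative setup: 128x128 RGB froth images, four regression targets (Zn, Fe, Cu, Pb grades).
model = keras.Sequential([
    layers.Input(shape=(128, 128, 3)),
    layers.Conv2D(16, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu"),
    layers.MaxPooling2D(),
    layers.GlobalAveragePooling2D(),
    layers.Dense(64, activation="relu"),
    layers.Dense(4),  # linear output: one predicted grade per element
])
model.compile(optimizer="adam", loss="mse", metrics=["mae"])

# Dummy batch standing in for labelled images from the flotation banks.
x = np.random.rand(8, 128, 128, 3).astype("float32")
y = np.random.rand(8, 4).astype("float32")
model.fit(x, y, epochs=1, batch_size=4, verbose=0)
```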

Paper Nr: 188
Title:

Encoding Techniques for Handling Categorical Data in Machine Learning-Based Software Development Effort Estimation

Authors:

Mohamed Hosni

Abstract: Planning, controlling, and monitoring a software project primarily rely on the estimates of the software development effort. These estimates are usually conducted during the early stages of the software life cycle. At this phase, the available information about the software product is categorical in nature, and only a few numerical data points are available. Therefore, building an accurate effort estimator begins with determining how to process the categorical data that characterizes the software project. This paper aims to shed light on the ways in which categorical data can be treated in software development effort estimation (SDEE) datasets through encoding techniques. Four encoders were used in this study, including one-hot encoder, label encoder, count encoder, and target encoder. Four well-known machine learning (ML) estimators and a homogeneous ensemble were utilized. The empirical analysis was conducted using four datasets. The datasets generated by means of the one-hot encoder appeared to be suitable for the ML estimators, as they resulted in more accurate estimation. The ensemble, which combined four variants of the same technique trained using different datasets generated by means of encoder techniques, demonstrated an equal or better performance compared to the single ML estimation technique. The overall results are promising and pave the way for a new approach to handling categorical data in SDEE datasets.
Download
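
A minimal sketch of the four encoders named in the abstract applied to a single categorical column; the toy data are illustrative, and the count and target encoders are hand-rolled here for clarity (libraries such as category_encoders provide equivalent transformers).

```python
import pandas as pd

# Toy SDEE-style data: a categorical project attribute and a numeric effort target.
df = pd.DataFrame({
    "language": ["java", "cobol", "java", "c", "cobol", "java"],
    "effort":   [1200,   3400,    900,   700, 2800,    1500],
})

# One-hot encoding: one binary column per category.
one_hot = pd.get_dummies(df["language"], prefix="lang")

# Label encoding: an arbitrary integer id per category.
label = df["language"].astype("category").cat.codes

# Count encoding: replace each category by its frequency in the dataset.
count = df["language"].map(df["language"].value_counts())

# Target encoding: replace each category by the mean effort observed for it
# (in practice this should be fitted on training folds only to avoid leakage).
target = df["language"].map(df.groupby("language")["effort"].mean())

print(pd.concat([df, one_hot,
                 label.rename("label"),
                 count.rename("count"),
                 target.rename("target")], axis=1))
```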