KDIR 2022 Abstracts

Full Papers

Paper Nr:	6
Title:	Two-dimensional Motif Extraction from Images: A Study using an Electrocardiogram
Authors:	Hanadi Aldosari, Frans Coenen, Gregory Y. H. Lip and Yalin Zheng
Abstract:	A mechanism using the concept of 2D motifs to classify Electrocardiogram (ECG) data is presented. The motivation is that existing techniques typically first transform ECG data into a 1D signal (waveform) format and then extract a small number of features from this format for classification purposes. The transformation into the waveform format introduces an approximation of the data, and the consequent feature selection means that only a small part of the coarsened signal is utilised. The proposed approach works directly with the image format: no transformation takes place, and features (motifs) are selected by considering the entire ECG image. It is argued that this produces a better classification than that which can be achieved using the waveform format. The proposed 2D motif extraction approach is fully described and evaluated. Good results are returned: a best accuracy of 85%, compared with a best accuracy of 70% using a comparable 1D waveform approach. An analysis is also presented with respect to the augmentation of 2D motifs with 2D discords.

Paper Nr:	8
Title:	Fast Algorithms for the Capacitated Vehicle Routing Problem using Machine Learning Selection of Algorithm’s Parameters
Authors:	Roberto Asín-Achá, Olivier Goldschmidt, Dorit S. Hochbaum and Isaías I. Huerta
Abstract:	We present machine learning algorithms for automatically determining an algorithm’s parameters for solving the Capacitated Vehicle Routing Problem (CVRP) with unit demands. This is demonstrated here for the “sweep algorithm”, which assigns customers to a truck, in a wedge area of a circle of parametrically selected radius around the depot, with demand up to its capacity. We compare the performance of several machine learning algorithms for the purpose of predicting the threshold radius parameter for which the sweep algorithm delivers the best (lowest-value) solution. For the selected algorithm, KNN, which is used as an oracle for the automatic selection of the parameter, it is shown that the automatically configured sweep algorithm delivers better solutions than the “best” single parameter value algorithm. Furthermore, for the real-world instances in the new benchmark introduced here, the sweep algorithm has better running times and better quality of solutions compared to those of current leading algorithms. Another contribution here is the introduction of the new CVRP real-world data benchmark based on about a million customer locations in the Los Angeles area and about a million customer locations in the New York City area. This new benchmark includes a total of 46000 problem instances.
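As a rough illustration of the sweep idea described in the abstract above, the following Python sketch orders unit-demand customers by polar angle around the depot and fills each truck up to its capacity. It is a minimal sketch of the classic sweep heuristic, not the authors' implementation: the function name, the sample coordinates, and the omission of the radius parameter (whose exact role the abstract does not spell out) are assumptions made for illustration.

```python
import math


def sweep_routes(depot, customers, capacity):
    """Group unit-demand customers into routes by sweeping around the depot.

    Hypothetical helper: customers are sorted by polar angle relative to the
    depot, then cut into consecutive groups of at most `capacity` customers.
    """
    def angle(point):
        # Polar angle of the customer as seen from the depot, in (-pi, pi].
        return math.atan2(point[1] - depot[1], point[0] - depot[0])

    ordered = sorted(customers, key=angle)
    # With unit demands, a truck is full after `capacity` customers.
    return [ordered[i:i + capacity] for i in range(0, len(ordered), capacity)]


# Illustrative data only (not taken from the benchmark in the abstract).
depot = (0.0, 0.0)
customers = [(1, 0), (0, 1), (-1, 0), (0, -1), (1, 1)]
routes = sweep_routes(depot, customers, capacity=2)
```

Each inner list in `routes` is the set of customers served by one truck; a full solver would still need to order the stops within each route.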

Paper Nr:	10
Title:	End-to-End Multi-channel Neural Networks for Predicting Influenza A Virus Hosts and Antigenic Types
Authors:	Yanhua Xu and Dominik Wojtczak
Abstract:	Influenza occurs every season and occasionally causes pandemics. Despite its low mortality rate, influenza is a major public health concern, as it can be complicated by severe diseases like pneumonia. An accurate and low-cost method to predict the origin host and subtype of influenza viruses could help reduce virus transmission and benefit resource-poor areas. In this work, we propose multi-channel neural networks to predict antigenic types and hosts of influenza A viruses with hemagglutinin and neuraminidase protein sequences. An integrated data set containing complete protein sequences was used to produce a pre-trained model, and two other data sets were used for testing the model’s performance. One test set contained complete protein sequences, and the other contained incomplete protein sequences. The results suggest that multi-channel neural networks are applicable and promising for predicting influenza A virus hosts and antigenic subtypes with complete and partial protein sequences.

Paper Nr:	12
Title:	Degree Centrality Algorithms for Homogeneous Multilayer Networks
Authors:	Hamza Reza Pavel, Abhishek Santra and Sharma Chakravarthy
Abstract:	Centrality measures for simple graphs/networks are well-defined and each has numerous main-memory algorithms. However, for modeling complex data sets with multiple types of entities and relationships, simple graphs are not ideal. MultiLayer Networks (or MLNs) have been proposed for modeling them and have been shown to be better suited in many ways. Since there are no algorithms for computing centrality measures directly on MLNs, existing strategies reduce (aggregate or collapse) MLN layers to simple networks using Boolean AND or OR operators. This approach negates the benefits of MLN modeling as these computations tend to be expensive and furthermore results in loss of structure and semantics. In this paper, we propose heuristic-based algorithms for computing centrality measures (specifically, degree centrality) on MLNs directly (i.e., without reducing them to simple graphs) using a newly-proposed decoupling-based approach which is efficient as well as structure and semantics preserving. We propose multiple heuristics to calculate the degree centrality using the network decoupling-based approach and compare accuracy and precision with Boolean OR aggregated Homogeneous MLNs (HoMLNs) for ground truth. The network decoupling approach can take advantage of parallelism and is more efficient compared to aggregation-based approaches. Extensive experimental analysis is performed on large synthetic and real-world data sets of varying graph characteristics to validate the accuracy, precision, and efficiency of our proposed algorithms.
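To make the aggregation baseline mentioned in the abstract above concrete, the following Python sketch collapses the layers of a small homogeneous multilayer network with a Boolean OR and counts node degrees on the result. It only illustrates the ground-truth computation, assuming undirected layers given as edge lists; the function names and the toy layers are hypothetical, and the authors' decoupling-based heuristics are not reproduced here.

```python
def or_aggregate(layers):
    """Boolean OR aggregation: an edge is kept if it appears in any layer."""
    edges = set()
    for layer in layers:
        for u, v in layer:
            if u != v:  # ignore self-loops in this sketch
                # frozenset makes (u, v) and (v, u) the same undirected edge
                edges.add(frozenset((u, v)))
    return edges


def degree(edges):
    """Degree of each node in an undirected simple graph."""
    deg = {}
    for edge in edges:
        for node in edge:
            deg[node] = deg.get(node, 0) + 1
    return deg


# Two toy layers over the same node set (illustrative data only).
layer1 = [("a", "b"), ("b", "c")]
layer2 = [("b", "a"), ("a", "c"), ("c", "d")]
deg = degree(or_aggregate([layer1, layer2]))
```

Here the edge between `a` and `b` appears in both layers but is counted once after aggregation, which is exactly the information a layer-wise approach has to account for when it combines per-layer degrees.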

Paper Nr:	16
Title:	CAP-DSDN: Node Co-association Prediction in Communities in Dynamic Sparse Directed Networks and a Case Study of Migration Flow
Authors:	Jaya Sreevalsan-Nair and Astha Jakher
Abstract:	Predicting the community structure in the time series, or snapshots, of a real-world graph in the future, is a pertinent challenge. This is motivated by the study of migration flow networks. The dataset is characterized by edge sparsity due to the inconsistent availability of data. Thus, we generalize the problem to predicting community structure in a dynamic sparse directed network (DSDN). We introduce a novel application of co-association which is a pairwise relationship between the nodes belonging to the same community. We thus propose a three-step algorithm, CAP-DSDN, for co-association prediction (CAP) in such a network. Given the absence of benchmark data or ground truth, we use an ensemble of community detection (CD) algorithms and evaluation metrics widely used for directed networks. We then define a metric based on entropy rate as a threshold to filter the network for determining a significant and data-complete subnetwork. We propose the use of autoregressive models for predicting the co-association relationship given in its matrix format. We demonstrate the effectiveness of our proposed method in a case study of international refugee migration during 2000–18. Our results show that our method works effectively for migration flow networks for short-term prediction and when the data is complete across all snapshots.

Paper Nr:	26
Title:	Long Form Question Answering Dataset Creation for Business Use Cases using Noise-Added Siamese-BERT
Authors:	Tolga Çekiç, Yusufcan Manav, Batu Helvacıoğlu, Enes Burak Dündar, Onur Deniz and Gülşen Eryiğit
Abstract:	In business cases, there is an increasing need for automated long form question answering (LFQA) systems from business documents; however, data for training such systems is not easily obtainable. Developing such data sets requires a costly human annotation stage where <<question-answer-related document passage>> triplets should be created. In this paper, we present a method to rapidly develop an LFQA dataset from existing logs of help-desk data without the need for a manual human annotation stage. This method first creates a SiameseBert encoder to relate recorded answers with business documents’ passages. For this purpose, the SiameseBert encoder is trained over a synthetically created dataset imitating paraphrased document passages using a noise model. The encoder is then used to create the necessary triplets for LFQA from business documents. We train a Dense Passage Retrieval (DPR) system using a bi-encoder architecture for the retrieval stage and a cross-encoder for re-ranking the retrieved document passages. The results show that the proposed method is successful at rapidly developing LFQA systems for business use cases, yielding an 85% recall of the correct answer at the top 1 of the returned results.

Paper Nr:	31
Title:	An Effective Two-stage Noise Training Methodology for Classification of Breast Ultrasound Images
Authors:	Yiming Bian and Arun K. Somani
Abstract:	Breast cancer is one of the most common and deadly diseases. An early diagnosis is critical and in-time treatment can help prevent the further spread of cancer. Breast ultrasound images are widely used for diagnosis, but the diagnosis heavily depends on the radiologist’s expertise and experience. Therefore, computer-aided diagnosis (CAD) systems are developed to provide an effective, objective, and reliable understanding of medical images for radiologists and diagnosticians. With the help of modern convolutional neural networks (CNNs), the accuracy and efficiency of CAD systems are greatly improved. CNN-based methods rely on training with a large amount of high-quality data to extract the key features and achieve a good performance. However, such noise-free medical data in high volume are not easily accessible. To address the data limitation, we propose a novel two-stage noise training methodology that effectively improves the performance of breast ultrasound image classification with speckle noise. The proposed mix-noise-trained model in Stage II trains on a mix of noisy images at multiple different intensity levels. Our experiments demonstrate that all tested CNN models obtain resilience to speckle noise and achieve excellent performance gain if the mix proportion is selected appropriately. We believe this study will benefit more people with a faster and more reliable diagnosis.

Paper Nr:	33
Title:	A Multi-stage Multi-group Classification Model: Applications to Knowledge Discovery for Evidence-based Patient-centered Care
Authors:	Eva K. Lee and Brent Egan
Abstract:	We present a multi-stage, multi-group classification framework that incorporates discriminant analysis via mixed integer programming (DAMIP) with an exact combinatorial branch-and-bound (BB) algorithm and a fast particle swarm optimization (PSO) for feature selection for classification. By utilizing a reserved judgment region, DAMIP allows the classifier to delay making decisions on ‘difficult-to-classify’ observations and develop new classification rules in a later stage. Such a design works well for mixed (poorly separated) data that are difficult to classify without committing a high percentage of misclassification errors. We also establish variant DAMIP models that enable problem-specific fine tuning to establish proper misclassification limits and reserved judgement levels that facilitate efficient management of imbalanced groups. This ensures that minority groups with relatively few entities are treated equally as the majority groups. We apply the framework to two real-life medical problems: (a) multi-site treatment outcome prediction for best practice discovery in cardiovascular disease, and (b) early disease diagnosis in predicting subjects into normal cognition, mild cognitive impairment, and Alzheimer’s disease groups using neuropsychological tests and blood plasma biomarkers. Both problems involve poorly separated data and imbalanced groups in which traditional classifiers yield low prediction accuracy. The multi-stage BB-PSO/DAMIP manages the poorly separable imbalanced data well and returns interpretable predictive results with over 80% blind prediction accuracy. Mathematically, DAMIP is NP-complete with its classifier proven to be universally strongly consistent. Hence, DAMIP has desirable solution characteristics for machine learning purposes. 
Computationally, DAMIP is the first multi-group, multi-stage classification model that simultaneously includes a reserved judgment capability and the ability to constrain misclassification rates within a single model. The formulation includes constraints that transform the features from their original space to the group space, serving as a dimension reduction mechanism.

Paper Nr:	37
Title:	Assessing the Impact of Deep End-to-End Architectures in Ensemble Learning for Histopathological Breast Cancer Classification
Authors:	Hasnae Zerouaoui, Ali Idri and Omar El Alaoui
Abstract:	One of the most significant public health issues in the world and a major factor in women’s mortality is breast cancer (BC). Early diagnosis and detection can significantly improve the likelihood of survival. Therefore, this study suggests a deep end-to-end heterogeneous ensemble approach using deep learning (DL) models for breast histological image classification, tested on the BreakHis dataset. The proposed approach showed a significant increase in performance compared to its base learners. Seven DL architectures (VGG16, VGG19, ResNet50, Inception_V3, Inception_ResNet_V2, Xception, and MobileNet) were trained using 5-fold cross-validation. Thereafter, deep end-to-end heterogeneous ensembles of two up to seven base learners were constructed based on accuracy using majority and weighted voting. Results showed the effectiveness of deep end-to-end ensemble learning techniques for classifying breast cancer images as malignant or benign. The ensembles designed with the weighted voting method exceeded the others, with accuracy values reaching 93.8%, 93.4%, 93.3%, and 91.8% across the BreakHis dataset’s four magnification factors: 40X, 100X, 200X, and 400X, respectively.
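The two combination rules named in the abstract above, majority voting and weighted voting, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names are hypothetical, the weights are made-up per-model accuracies, and using accuracy as the vote weight is an assumption suggested by, but not confirmed in, the abstract.

```python
from collections import Counter, defaultdict


def majority_vote(predictions):
    """Return the label predicted by the largest number of base learners."""
    return Counter(predictions).most_common(1)[0][0]


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight across base learners."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# One image, three hypothetical base learners.
preds = ["benign", "malignant", "benign"]
weights = [0.45, 0.95, 0.45]  # made-up accuracies used as vote weights

majority_label = majority_vote(preds)
weighted_label = weighted_vote(preds, weights)
```

In this toy case the two rules disagree: two weak learners outvote the strong one under majority voting, while weighted voting lets the single more accurate learner prevail.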

Paper Nr:	40
Title:	Safe Screening for Logistic Regression with ℓ0–ℓ2 Regularization
Authors:	Anna Deza and Alper Atamtürk
Abstract:	In logistic regression, it is often desirable to utilize regularization to promote sparse solutions, particularly for problems with a large number of features compared to available labels. In this paper, we present screening rules that safely remove features from logistic regression with ℓ0 − ℓ2 regularization before solving the problem. The proposed safe screening rules are based on lower bounds from the Fenchel dual of strong conic relaxations of the logistic regression problem. Numerical experiments with real and synthetic data suggest that a high percentage of the features can be effectively and safely removed a priori, leading to substantial speed-up in the computations.

Paper Nr:	44
Title:	Generalization of Probabilistic Latent Semantic Analysis to k-partite Graphs
Authors:	Yohann Salomon and Pietro Pinoli
Abstract:	Many kinds of data can be easily modelled as bipartite or k-partite graphs. Among the many computational analyses that can be run on such graphs, link prediction, i.e., the inference of novel links between nodes, is one of the most valuable and has many applications on real-world data. While many methods exist for this task on bipartite graphs, only a few algorithms are able to perform link prediction on k-partite graphs. Probabilistic Latent Semantic Analysis (PLSA) is an algorithm based on latent variables, named topics, designed to perform matrix factorisation. As such, it is straightforward to apply PLSA to the task of link prediction on bipartite graphs, simply by decomposing the association matrix. In this work we extend PLSA to k-partite graphs; in particular, we design an algorithm able to perform link prediction on k-partite graphs by exploiting the information in all the layers of the target graph. Our experiments confirm the capability of the proposed method to effectively perform link prediction on k-partite graphs.

Paper Nr:	45
Title:	Tag-Set-Sequence Learning for Generating Question-answer Pairs
Authors:	Cheng Zhang and Jie Wang
Abstract:	Transformer-based QG models can generate question-answer pairs (QAPs) of high quality, but may also generate silly questions for certain texts. We present a new method called tag-set sequence learning to tackle this problem, where a tag-set sequence is a sequence of tag sets that captures the syntactic and semantic information of the underlying sentence, and a tag set consists of one or more language feature tags, including, for example, semantic-role-labeling, part-of-speech, named-entity-recognition, and sentiment-indication tags. We construct a system called TSS-Learner to learn tag-set sequences from given declarative sentences and the corresponding interrogative sentences, and derive answers to the latter. We train a TSS-Learner model for the English language using a small training dataset and show that it can indeed generate adequate QAPs for certain texts on which transformer-based models do poorly. Human evaluation of the QAPs generated by TSS-Learner over SAT practice reading tests is encouraging.

Paper Nr:	49
Title:	Search Reliability Comparison of Two Text-based Search Algorithms in an Online Literature Database for Integrative Medicine: A Technical Report on a 32-bit to 64-bit Migration
Authors:	Sebastian Unger, Christa K. Raak and Thomas Ostermann
Abstract:	Although there is a steady increase of scientific publications in integrative medicine, it is still difficult to get a valid overview of published evidence. The openly accessible bibliographical database CAMbase 3.0 (available at https://cambase.de), hosted by Witten/Herdecke University, is one such established database in this field. In 2020, CAMbase 2.0 was migrated to a newer 64-bit operating system, resulting in a variety of issues. A promising solution for keeping and accessing the data of CAMbase 2.0 was to replace the business logic with the open-source platform Solr, which uses a score ranking algorithm instead of a semantic-syntactic interpretation of search queries as in CAMbase 2.0. As a result, the before-after analysis with t-tests showed mainly no significant differences in the equality of the queried titles after applying SBERT, nor in the number of search hits (t = 1.43, df = 35, p = 0.17), but significant differences in query times (t = 4.2, df = 35, p < 0.01). Since search hits remained stable while speed increased, the approach with Solr is more efficient, making this technical report a possible blueprint for similar bibliography-based database projects.

Short Papers

Paper Nr:	3
Title:	Automata-based Explainable Representation for a Complex System of Multivariate Times Series
Authors:	Ikram Chraibi Kaadoud, Lina Fahed, Tian Tian, Yannis Haralambous and Philippe Lenca
Abstract:	Complex systems represented by multivariate time series are ubiquitous in many applications, especially in industry. Understanding a complex system, its states and their evolution over time is a challenging task. This is due to the permanent change of contextual events internal and external to the system. We are interested in representing the evolution of a complex system in an intelligible and explainable way based on knowledge extraction. We propose XR-CSB (eXplainable Representation of Complex System Behavior) based on three steps: (i) a time series vertical clustering to detect system states, (ii) an explainable visual representation using unfolded finite-state automata and (iii) an explainable pre-modeling based on an enrichment via exploratory metrics. Four representations adapted to the expertise level of domain experts for acceptability issues are proposed. Experiments show that XR-CSB is scalable. Qualitative evaluation by experts of different expertise levels shows that XR-CSB meets their expectations in terms of explainability, intelligibility and acceptability.

Paper Nr:	5
Title:	Towards View-invariant Vehicle Speed Detection from Driving Simulator Images
Authors:	Antonio Hernández Martínez, David Fernández Llorca and Iván García Daza
Abstract:	The use of cameras for vehicle speed measurement is much more cost effective compared to other technologies such as inductive loops, radar or laser. However, accurate speed measurement remains a challenge due to the inherent limitations of cameras to provide accurate range estimates. In addition, classical vision-based methods are very sensitive to extrinsic calibration between the camera and the road. In this context, the use of data-driven approaches appears as an interesting alternative. However, data collection requires a complex and costly setup to record videos under real traffic conditions from the camera synchronized with a high-precision speed sensor to generate the ground truth speed values. It has recently been demonstrated (Martinez et al., 2021) that the use of driving simulators (e.g., CARLA) can serve as a robust alternative for generating large synthetic datasets to enable the application of deep learning techniques for vehicle speed estimation for a single camera. In this paper, we study the same problem using multiple cameras in different virtual locations and with different extrinsic parameters. We address the question of whether complex 3D-CNN architectures are capable of implicitly learning view-invariant speeds using a single model, or whether view-specific models are more appropriate. The results are very promising as they show that a single model with data from multiple views reports even better accuracy than camera-specific models, paving the way towards a view-invariant vehicle speed measurement system.

Paper Nr:	9
Title:	An Improved Support Vector Model with Recursive Feature Elimination for Crime Prediction
Authors:	Sphamandla I. May, Omowunmi E. Isafiade and Olasupo O. Ajayi
Abstract:	The Support Vector Machine (SVM) model has proven relevant in several applications, including crime analysis and prediction. This work utilized the SVM model and developed a predictive model for crime occurrence types. The SVM model was then enhanced using a feature selection mechanism, and the enhanced model was compared to the classical SVM. To evaluate the classical and enhanced models, two distinct datasets, one from Chicago and the other from Los Angeles, were used in the experiments. We used the Recursive Feature Elimination (RFE) model to enhance SVM’s performance and reduce its complexity, and observed an average performance increase of 15% on the City of Chicago dataset and 20% on the Los Angeles dataset. Thus, the incorporation of appropriate feature selection techniques enhances the predictive power of classification algorithms.

Paper Nr:	11
Title:	Gutenbrain: An Architecture for Equipment Technical Attributes Extraction from Piping & Instrumentation Diagrams
Authors:	Marco Vicente, João Guarda and Fernando Batista
Abstract:	Piping and Instrumentation Diagrams (P&IDs) are detailed representations of engineering schematics with piping, instrumentation and other related equipment and their physical process flow. They are critical in engineering projects to convey the physical sequence of systems, allowing engineers to understand the process flow, safety and regulatory requirements, and operational details. P&IDs may be provided in several formats, including scanned paper, CAD files, PDF, and images, but these documents are frequently searched manually to identify all the equipment and their inter-connectivity. Furthermore, engineers must search the related technical specifications in separate technical documents, as P&IDs usually don’t include technical specifications. This paper presents Gutenbrain, an architecture to extract equipment technical attributes from piping & instrumentation diagrams and technical documentation, which relies on textual information only. It first extracts equipment from P&IDs, using meta-data to understand the equipment type, and text coordinates to detect the equipment even when it is represented in multiple lines of text. After detecting the equipment and storing it in a database, it allows retrieving and inferring technical attributes from the related technical documentation using two question answering models based on BERT-like contextual embeddings, depending on the equipment type meta-data. One question answering model works with free questions over continuous text, while the other uses tabular data. This ensemble approach allows us to extract technical attributes from documents where information is unstructured and scattered. The equipment extraction stage achieves about 97.2% precision and 71.2% recall. The stored information can later be accessed using Elasticsearch, allowing engineers to save thousands of hours in maintenance engineering tasks.

Paper Nr:	17
Title:	TerrorMine: Automatically Identifying the Group behind a Terrorist Attack
Authors:	Alan Falzon and Joel Azzopardi
Abstract:	Terrorism is a problem that provokes fear and causes death internationally. The Global Terrorism Database (GTD) contains a large number of terrorist attack records which can be used for data mining to help counter or mitigate future terror attacks. TerrorMine employs AI techniques to identify perpetrators responsible for terrorist attacks. Moreover, the effect of clustering beforehand is investigated, while also attempting to identify new (unknown) terrorist organisations, and predicting future activity of terror groups. Several experiments are performed. The Random Forest model obtains the highest Weighted F1-score when identifying responsible perpetrators. Furthermore, upon clustering the data using Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBScan) before classification, training time is reduced by more than 50%. Various techniques are used for the unsupervised identification of whether a terrorist attack was carried out by an unknown terrorist group. Nearest Neighbours gives the highest Macro F1-score when cross-validated. When forecasting the future impact of the different terrorist groups, Prophet achieved an F1-score higher than that of Autoregressive Integrated Moving Average (ARIMA).

Paper Nr:	18
Title:	A Novel Approach towards Gap Filling of High-Frequency Radar Time-series Data
Authors:	Anne-Marie Camilleri, Joel Azzopardi and Adam Gauci
Abstract:	The real-time monitoring of the coastal and marine environment is vital for various reasons, including oil spill detection and maritime security, amongst others. Systems such as High Frequency Radar (HFR) networks are able to record sea surface currents in real-time. Unfortunately, such systems can suffer from malfunctions caused by extreme weather conditions or frequency interference, thus leading to a degradation in the monitoring system coverage. This results in sporadic gaps within the observation datasets. To counter this problem, the use of deep learning techniques has been investigated to perform gap-filling of the HFR data. Additional features such as remotely sensed wind data were also considered to try to enhance the prediction accuracy of these models. Furthermore, look-back values between 3 and 24 hours were investigated to uncover the minimal amount of historical data required to make accurate predictions. Finally, drift in the data was also analysed, determining how often these model architectures might require re-training to keep them valid for predicting future data.

Paper Nr:	21
Title:	PatternRank: Leveraging Pretrained Language Models and Part of Speech for Unsupervised Keyphrase Extraction
Authors:	Tim Schopf, Simon Klimek and Florian Matthes
Abstract:	Keyphrase extraction is the process of automatically selecting a small set of the most relevant phrases from a given text. Supervised keyphrase extraction approaches need large amounts of labeled training data and perform poorly outside the domain of the training data (Bennani-Smires et al., 2018). In this paper, we present PatternRank, which leverages pretrained language models and part-of-speech for unsupervised keyphrase extraction from single documents. Our experiments show PatternRank achieves higher precision, recall and F1-scores than previous state-of-the-art approaches. In addition, we present the KeyphraseVectorizers package, which allows easy modification of part-of-speech patterns for candidate keyphrase selection, and hence adaptation of our approach to any domain.

Paper Nr:	24
Title:	Studying Interaction Patterns for Knowledge Graph Exploration
Authors:	Loris Grether and Hans Friedrich Witschel
Abstract:	The flexible data models of knowledge graphs (KGs) are powerful tools for handling large and dynamic data sets and are increasingly used for the tasks of data processing and storage. Although a KG may contain rich data and powerful connections, it is up to the searchers to explore these graphs and make sense of them. The objective of this research paper is to investigate if and how KG exploration can be improved from a user’s point of view, to enhance the discovery of information. A qualitative user study should deliver insights on how different users interact with a KG, at what points they struggle, and which potential discoveries they miss. Recognizing and understanding the intentions of the users is necessary to create solutions that support them best in their particular situation. Based on the findings, new features and improvements are suggested, developed and added to a prototypical KG exploration application, and finally tested with regard to their impact on user exploration and acceptance. Based on the collected data, we could identify the guidance mechanisms that improve KG exploration the most.

Paper Nr:	32
Title:	On the Impossibility to Assure a Finite Software Concepts’ Catalog
Authors:	Iaakov Exman
Abstract:	In recent times it has been recognized that Concepts play a central role within Software. This has been expressed by Fred Brooks’ idea that “Conceptual Integrity is the most important consideration for software system design”. However, concepts, as human natural language words with meaning assigned by the Concepts’ relationships, evolve under the continual dynamics of concepts discovery. These language dynamics have consequences that cannot be ignored. This paper illustrates concepts discovery within design patterns, up to very large-scale systems, highlighting intrinsic shortcomings of Concepts’ semantics as a solid basis for Software Conceptual Integrity. Paradoxically, these shortcomings are the consequence of the very creative process of Concepts Discovery from existing knowledge. Finally, one arrives at the paper’s main result: the absolute freedom of choice of Software Concepts, typical of natural languages, implies the impossibility of assuring a finite Software Concepts catalog. One finds oneself in an unending pursuit of additional concepts to achieve some kind of Integrity or completeness. Even deliberately finite catalogs cannot be definitive. But there is no reason for despair. Finite Software Concepts’ catalogs, despite not being definitive, are still very useful.

Paper Nr:	34
Title:	EnuwaJGX: Machine Learning Gene Prediction Software Application Model - An Innovative Method to Precision Medicine and Predictive Analysis of Visualising Mutated Genes Associated to Neurological Phenotype of Diseases
Authors:	Daniel F. O. Onah
Abstract:	This research investigates an aspect of precision medicine related to genes and their association with diseases. Precision medicine is a growing area in medical science research. By definition, precision medicine is an approach that allows the selection of treatments that are most likely to help patients based on the genetic understanding of their diseases. This approach proposes the customization of a medical model for healthcare, treatment, and medical decision making about genetic diseases, and the development of models that are tailored to the individual patient. There are readily available datasets provided by Genomics England related to diseases and the genes that cause these diseases. This research presents a predictive technique that scores the possibility of a mutated gene causing a neurological phenotype. There are over a thousand genes associated with 26 subtypes of neurological diseases as defined by Genomics England, capturing genetic variation, gene structure and coexpression network. The gene prediction was performed with search algorithms and methods that sequentially looped through the database for a true match. A linear search algorithm was applied along with an index search method to perform the prediction matching of gene(s) that are associated with the disease(s). The prediction algorithm was formulated based on a mathematical/probabilistic concept that was used to design the model for processing the dataset ready for gene prediction. It became apparent that over half a million (> 500,000) genes were predicted in this study that were associated with the neurological phenotype of the diseases in this research work.
Download
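
The linear search and index search mentioned in the abstract of Paper Nr: 34 can be illustrated with a minimal sketch. It is not the author's implementation: the record layout, the gene and subtype names, and the functions linear_search, build_index and index_search are all hypothetical, chosen only to show how a sequential scan and a prebuilt index return the same matches.

```python
from collections import defaultdict

# Hypothetical (gene, disease subtype) records for illustration only;
# these are not taken from the Genomics England data.
RECORDS = [
    ("GENE_A", "subtype 1"),
    ("GENE_B", "subtype 2"),
    ("GENE_A", "subtype 3"),
    ("GENE_C", "subtype 1"),
]


def linear_search(records, gene):
    """Scan every record in order and collect the subtypes whose gene matches exactly."""
    return [disease for g, disease in records if g == gene]


def build_index(records):
    """Group subtypes by gene in one pass so later lookups avoid a full scan."""
    index = defaultdict(list)
    for g, disease in records:
        index[g].append(disease)
    return index


def index_search(index, gene):
    """Look the gene up in the prebuilt index; unknown genes give an empty list."""
    return list(index.get(gene, []))


if __name__ == "__main__":
    index = build_index(RECORDS)
    print(linear_search(RECORDS, "GENE_A"))
    print(index_search(index, "GENE_A"))
```

The linear scan touches every record on each query, whereas the index pays that cost once and then answers each query with a single dictionary lookup.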

Paper Nr:	35
Title:	Monitoring Mood in a Stream of Self-reflections
Authors:	Eduard Hoenkamp and Andrew Gibson
Abstract:	Burnout and job stress are tragic events that unfortunately occur in many professions. In the teaching profession, however, they affect not just the individual, but also several concomitant parties: students, school, and parents. This has led to the widespread problem of teacher attrition, where the challenge has become not so much to attract teachers, but to retain them. The present research is based on the reflective writing of early career teachers (ECTs). These ECTs volunteered to write short weekly reflections during a period of about half a year. Spotting potential wellbeing problems in these series of reflections, however, calls for careful reading and studying of such large amounts of text that manual processing became impracticable. Hence, we developed an algorithm which transforms such a stream of reflections into a 3-D visualization of mood changes, in which times of stress and potential for burnout can be detected more easily. This in turn makes it possible to notice points of concern when there is still time to intervene.
Download

Paper Nr:	38
Title:	Training Neural Networks in Single vs. Double Precision
Authors:	Tomas Hrycej, Bernhard Bermeitinger and Siegfried Handschuh
Abstract:	The commitment to single-precision floating-point arithmetic is widespread in the deep learning community. To evaluate whether this commitment is justified, the influence of computing precision (single and double precision) on the optimization performance of the Conjugate Gradient (CG) method (a second-order optimization algorithm) and Root Mean Square Propagation (RMSprop) (a first-order algorithm) has been investigated. Neural networks with one to five fully connected hidden layers, moderate or strong nonlinearity, and up to 4 million network parameters have been optimized for Mean Square Error (MSE). The training tasks have been set up so that their MSE minimum was known to be zero. Computing experiments have disclosed that single precision can keep up (with superlinear convergence) with double precision as long as line search finds an improvement. First-order methods such as RMSprop do not benefit from double precision. However, for moderately nonlinear tasks, CG is clearly superior. For strongly nonlinear tasks, both algorithm classes find only solutions fairly poor in terms of mean square error as related to the output variance. CG with double floating-point precision is superior whenever the solutions have the potential to be useful for the application goal.
Download
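
As a general illustration of how the computing precision discussed in the abstract of Paper Nr: 38 can matter, the sketch below adds a very small update to a weight many times, once with the result rounded to single precision after every addition and once in plain double precision. It does not reproduce the authors' experiments; the helper names to_float32 and apply_updates and the chosen numbers are assumptions made purely for illustration.

```python
import struct


def to_float32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def apply_updates(weight, step, n_steps, single_precision):
    """Add `step` to `weight` n_steps times, rounding to float32 after each addition if requested."""
    for _ in range(n_steps):
        weight = weight + step
        if single_precision:
            weight = to_float32(weight)
    return weight


# Near 1.0, neighbouring float32 values are about 1.2e-7 apart, so a step of
# 1e-8 is rounded away on every iteration and the weight never moves.
w32 = apply_updates(1.0, 1e-8, 1000, single_precision=True)
# Double precision resolves the same step, so the 1000 updates accumulate.
w64 = apply_updates(1.0, 1e-8, 1000, single_precision=False)

if __name__ == "__main__":
    print(w32, w64)
```

In this toy setting the single-precision weight stays at its starting value while the double-precision weight advances by roughly 1e-5, which shows how sufficiently small improvements can become invisible at lower precision.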

Paper Nr:	39
Title:	Number of Attention Heads vs. Number of Transformer-encoders in Computer Vision
Authors:	Tomas Hrycej, Bernhard Bermeitinger and Siegfried Handschuh
Abstract:	Determining an appropriate number of attention heads on the one hand and of transformer-encoders on the other is an important choice for Computer Vision (CV) tasks using the Transformer architecture. Computing experiments confirmed the expectation that the total number of parameters has to satisfy the condition of overdetermination (i.e., number of constraints significantly exceeding the number of parameters). Then, good generalization performance can be expected. This sets the boundaries within which the number of heads and the number of transformers can be chosen. If the role of context in images to be classified can be assumed to be small, it is favorable to use multiple transformers with a low number of heads (such as one or two). In classifying objects whose class may heavily depend on the context within the image (i.e., the meaning of a patch being dependent on other patches), the number of heads is as important as that of transformers.
Download

Paper Nr:	41
Title:	Whole-slide Classification of H&E-stained Cervix Uteri Tissue using Deep Neural Networks
Authors:	Ferdaous Idlahcen, Pierjos Francis Colere Mboukou, Hasnae Zerouaoui and Ali Idri
Abstract:	Cervical cancer (CxCa) is heavily skewed toward low- and middle-income countries (LMICs). Without prompt action, the burden is anticipated to worsen by 50% from 2020 to 2040, with nearly 90% of deaths occurring in sub-Saharan Africa (SSA). Yet, uterine cervix neoplasms are readily avoidable due to a protracted latent cancer period. As it stands, deep learning (DL) is a potent solution for enhancing the early detection of cervical cancer. This work assesses and compares the performance of seven end-to-end learning architectures to automatically recognize cervical lesions and carcinoma histotypes in hematoxylin and eosin (H&E)-stained pathology images. Pre-trained VGG16, VGG19, InceptionV3, ResNet50, MobileNetV2, InceptionResNetV2, and DenseNet201 were the deep convolutional neural networks (dCNNs) implemented throughout the present empirical analysis. Experiments are conducted on two datasets: (i) Mendeley liquid-based cytology (LBC) and (ii) The Cancer Genome Atlas (TCGA) Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma diagnostic slides. All tests were validated under a 5-fold cross-validation, with four key metrics, Scott-Knott (SK), and Borda count schemes. Both pathology datasets appear to favor InceptionV3 and DenseNet201. Yet, while VGG16 is a weak-performing approach for liquid-based cytology, it shows promise in histopathology, yielding 99.33% accuracy, 98.85% precision, 99.83% recall, and 99.34% F-measure.
Download

Paper Nr:	48
Title:	Emotional Interpretation of Opera Seria: Impact of Specifics of Drama Structure (Position Paper)
Authors:	Pablo Gervás and Álvaro Torrente
Abstract:	The application of artificial intelligence techniques to help musicologists analyse and classify operatic arias in terms of the sentiment they might be expressing constitutes a novel task that may benefit from the application of sentiment analysis techniques. However, because the analysis of text in this instance aims to provide information to support the analysis of the associated music, the conventions of how narrative is structured in traditional opera need to be taken into account to ensure that the relevant spans of text are considered. The present position paper argues for a treatment of operatic libretti as semi-structured data, to take advantage of annotations on speaker identity and recitative vs. aria distinctions so that the most relevant sentiment for the music of the arias can be mined from the texts. This would constitute a new task that applies artificial intelligence specifically to the needs of musicology.
Download

Paper Nr:	50
Title:	The Role of Data in Crisis Management Models in the Health Care Context
Authors:	Hannele Väyrynen, Annamaija Paunu and Nina Helander
Abstract:	Successful crisis management consists of different factors, varying actors and operation environments. The health care system is one of the most critical sectors in societies and must operate also in a crisis situation. In the middle of a crisis, digitalization and access to data can have an important role as an enabler. In this paper, the role of data in crisis management models in the health care context is studied. The theoretical frame is derived from the crisis management literature review. The study identifies the role of data in seven critical elements of crisis management models that need consideration during a crisis: data has a supporting, enabling, and critical role in technology, strategy, government, adaptation mechanisms, scenarios, security of the supply chain, and co-operation in crisis management. As a result of the study, different aspects of data in promoting successful crisis management are proposed.
Download

Paper Nr:	51
Title:	Multiple-choice Question Generation for the Chinese Language
Authors:	Yicheng Sun, Hejia Chen and Jie Wang
Abstract:	We present a method to generate multiple-choice questions (MCQs) from Chinese texts for factual, eventual, and causal answer keys. We first identify answer keys of these types using NLP tools and regular expressions. We then transform declarative sentences into interrogative sentences, and generate three distractors using geographic and aliased entity knowledge bases, Synonyms, HowNet, and word embeddings. We show that our method can generate adequate questions on three of the four reported cases on which the SOTA model has failed. Moreover, on a dataset of 100 articles randomly selected from a Chinese Wikipedia data dump, our method generates a total of 3,126 MCQs. Three well-educated native Chinese speakers evaluate these MCQs and confirm that 76% of MCQs, 85% of question-answer pairs, and 91% of questions are adequate and 96.5% of MCQs are acceptable.
Download

Paper Nr:	2
Title:	Predicting Visible Terms from Image Captions using Concreteness and Distributional Semantics
Authors:	Jean Charbonnier and Christian Wartena
Abstract:	Image captions in scientific papers are usually complementary to the images. Consequently, the captions contain many terms that do not refer to concepts visible in the image. We conjecture that it is possible to distinguish between these two types of terms in an image caption by analysing the text only. To examine this, we evaluated different features. The dataset we used to compute tf.idf values, word embeddings and concreteness values contains over 700,000 scientific papers with over 4.6 million images. The evaluation was done with a manually annotated subset of 329 images. Additionally, we trained a support vector machine to predict whether a term is likely visible or not. We show that concreteness of terms is a very important feature to identify terms in captions and context that refer to concepts visible in images.
Download

Paper Nr:	4
Title:	The Twitter-Lex Sentiment Analysis System
Authors:	Sergiu Limboi and Laura Dioşan
Abstract:	Twitter Sentiment Analysis is demanding due to the freestyle way people express their opinions and feelings. Using only the preprocessed text from a dataset does not bring enough value to the process. Therefore, there is a need to define and mine different and complex features to detect hidden information from a tweet. The proposed Twitter-Lex Sentiment Analysis system combines lexicon features with Twitter-specific ones to improve the classification performance. Several features are considered for the Sentiment Analysis process: only textual input from a tweet, hashtags, and some flavors that combine them with the feature defined based on the result produced by a lexicon. The VADER lexicon is used to determine the sentiment of a tweet. This output is appended to the four perspectives we defined, considering the features offered by Twitter. The experimental results reveal that our system, which focuses on the role of features in a classification process, outperforms the baseline approach (use of original tweets) and provides good value to new directions and improvements.
Download
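
The idea in the abstract of Paper Nr: 4, appending a lexicon-derived sentiment to Twitter-specific features, can be sketched roughly as follows. This is not the Twitter-Lex system: the tiny TOY_LEXICON stands in for the VADER lexicon, and the functions lexicon_label, hashtags and build_features are hypothetical names introduced only for this example.

```python
import re

# Toy word scores for illustration only; these are NOT the VADER lexicon.
TOY_LEXICON = {"good": 1.0, "great": 2.0, "bad": -1.0, "awful": -2.0}


def lexicon_label(text):
    """Sum the toy scores of the words in the text and map the total to a label."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(TOY_LEXICON.get(token, 0.0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


def hashtags(text):
    """Return the hashtag words in the tweet, without the leading '#'."""
    return re.findall(r"#(\w+)", text)


def build_features(tweet):
    """Combine the raw text, its hashtags, and the appended lexicon sentiment."""
    return {
        "text": tweet,
        "hashtags": hashtags(tweet),
        "lexicon_sentiment": lexicon_label(tweet),
    }


if __name__ == "__main__":
    print(build_features("Great day at the beach #sunny"))
```

A downstream classifier could then be trained on any subset of these fields, which is one way to compare a text-only baseline against lexicon-augmented variants.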

Paper Nr:	13
Title:	Towards Explainability in Modern Educational Data Mining: A Survey
Authors:	Basile Tousside, Yashwanth Dama and Jörg Frochte
Abstract:	Data mining has become an integral part of many educational systems, where it provides the ability to explore hidden relationships in educational data as well as predict students’ academic achievements. However, the proposed techniques to achieve these goals, referred to as educational data mining (EDM) techniques, are mostly not explainable. This means that the system is black-boxed and offers no insight into its decision making process. In this paper, we propose to delve into explainability in the EDM landscape. We analyze the current state-of-the-art methods in EDM, empirically scrutinize their strengths and weaknesses regarding explainability, and make suggestions on how to make them more explainable and more trustworthy. Furthermore, we propose metrics able to efficiently evaluate explainable systems integrated in EDM approaches, thereby quantifying the degree of explainability and trustworthiness of these approaches.
Download

Paper Nr:	19
Title:	Learning to Estimate Crowd Size by Applying Convolutional Neural Network to Aerial Imaging Analysis
Authors:	Wing-Fat Cheng, Man-Ching Yuen and Yuk-Chun So
Abstract:	Using image and video to conduct crowd analysis in public places is an effective tool to establish situational awareness. Currently, crowd counts reported by different organizations differ greatly. Many research works have investigated utilizing image recognition technology to provide a fair estimation of the crowd count. In this paper, we propose a convolutional neural network model on aerial image analysis to learn to estimate crowd size. To find out the requirements of an efficient and reliable crowd size estimation system, we also investigate current approaches in crowd size estimation, such as regression, CNN and by-detection with image recognition technology. Our work allows event organizers to get a fair description of the crowd behaviors. The main contribution of this paper is the application of CNN for solving the problem of crowd size estimation.
Download

Paper Nr:	22
Title:	EGAN: Generatives Adversarial Networks for Text Generation with Sentiments
Authors:	Andres Pautrat-Lertora, Renzo Perez-Lozano and Willy Ugarte
Abstract:	In recent years, communication with computers has made enormous steps, as shown by the robot Sophia, which surprised many people with its human interactions. Behind this kind of robot, there is a machine learning model for text generation to interact with others, but in terms of text generation with sentiments not many investigations have been done. A model like GAN has the potential to become an excellent option to attack this new problem because its discriminator and generator compete to search for the optimal solution. In this paper, a GAN model is presented that can generate text with different emotions based on a dataset compiled from tweets labeled with emotions and then deployed in an NAO robot to speak the text in short phrases using voice commands. The model is evaluated with different methods popular in text generation like BLEU and, additionally, a human experiment is done to assess the quality and sentiment accuracy.
Download

Paper Nr:	25
Title:	Interpretable Disease Name Estimation based on Learned Models using Semantic Representation Learning of Medical Terms
Authors:	Ikuo Keshi, Ryota Daimon and Atsushi Hayashi
Abstract:	This paper describes a method for constructing a learned model for estimating disease names using semantic representation learning for medical terms and an interpretable disease-name estimation method based on the model. Experiments were conducted using old and new electronic medical records from Toyama University Hospital, where the data distribution of disease names differs significantly. The F1-score of the disease name estimation was improved by about 10 points compared with the conventional method using a general word semantic vector dictionary with a faster linear SVM. In terms of the interpretability of the estimation, it was confirmed that 70% of the disease names could provide higher-level concepts as the basis for disease name estimation. As a result of the experiments, we confirmed that both interpretability and accuracy for disease name estimation are possible to some extent.
Download

Paper Nr:	36
Title:	Adapting Transformers for Detecting Emergency Events on Social Media
Authors:	Emanuela Boros, Gaël Lejeune, Mickaël Coustaty and Antoine Doucet
Abstract:	Detecting emergency events on social media could facilitate disaster monitoring by categorizing and prioritizing tweets in catastrophic situations to assist emergency service operators. However, the high noise levels in tweets, combined with the limited publicly available datasets have rendered the task difficult. In this paper, we propose an enhanced multitask Transformer-based model that highlights the importance of entities, event descriptions, and hashtags in tweets. This approach includes a Transformer encoder with several layers over the sequential token representation provided by a pre-trained language model that acts as a task adapter for detecting emergency events in noisy data. We conduct an evaluation on the Text REtrieval Conference (TREC) 2021 Incident Streams (IS) track dataset, and we conclude that our proposed approach brought considerable improvements to emergency social media classification.
Download

Paper Nr:	52
Title:	Predicting Reputation Score of Users in Stack-overflow with Alternate Data
Authors:	Sahil Yerawar, Sagar Jinde, P. K. Srijith, Maunendra Sankar Desarkar, K. M. Annervaz and Shubhashis Sengupta
Abstract:	Community question and answering (CQA) sites such as Stack Overflow are used by many users around the world to obtain answers to technical questions. Here, the reliability of a user is determined using metrics such as reputation score. It is important for the CQA sites to assess the reputation score of the new users joining the site. Accurate estimation of reputation scores of these cold start users can help in tasks like question routing, expert recommendation and ranking. However, lack of activity information makes it quite difficult to assess the reputation score for new users. We propose an approach which makes use of alternate data associated with the users to predict the reputation score of the new users. We show that the alternate data obtained using users’ personal websites could improve the reputation score prediction performance. We develop deep learning models based on feature distillation, such as the student-teacher models, to improve the reputation score prediction of new users from the alternate data. We demonstrate the effectiveness of the proposed approaches on the publicly available Stack Overflow data and publicly available alternate data.
Download