KDIR 2021 Abstracts

Full Papers

Paper Nr:	11
Title:	A Case Study and Qualitative Analysis of Simple Cross-lingual Opinion Mining
Authors:	Gerhard Hagerer, Wing Sheung Leung, Qiaoxi Liu, Hannah Danner and Georg Groh
Abstract:	User-generated content from social media is produced in many languages, making it technically challenging to compare the discussed themes from one domain across different cultures and regions. It is relevant for domains in a globalized world, such as market research, where people from two nations and markets might have different requirements for a product. We propose a simple, modern, and effective method for building a single topic model with sentiment analysis capable of covering multiple languages simultanteously, based on a pre-trained state-of-the-art deep neural network for natural language understanding. To demonstrate its feasibility, we apply the model to newspaper articles and user comments of a specific domain, i.e., organic food products and related consumption behavior. The themes match across languages. Additionally, we obtain an high proportion of stable and domain-relevant topics, a meaningful relation between topics and their respective textual contents, and an interpretable representation for social media documents. Marketing can potentially benefit from our method, since it provides an easy-to-use means of addressing specific customer interests from different market regions around the globe. For reproducibility, we provide the code, data, and results of our study.
Download

Paper Nr:	16
Title:	Event Log Abstraction in Client-Server Applications
Authors:	M. Amin Yazdi, Pejman Farhadi Ghalatia and Benedikt Heinrichs
Abstract:	Process mining provides various techniques in response to the increasing demand for understanding the execution of the underlying processes of software systems. The discovery and conformance checking techniques allow for the analysis of event data and verify compliances. However, in real-life scenarios, the event data recorded by software systems often contain numerous activities resulting in unstructured process models that are not usable by domain experts. Hence, event log abstraction is an essential preprocessing step to deliver a desired abstracted model that is human-readable and enables process analysis. This paper provides an overview of the literature and proposes a novel approach for transforming fine-granular event logs generated from client-server applications to a higher level of abstraction suitable for domain experts for further analysis. Moreover, we demonstrate the validity of the suggested method with the help of two case studies.
Download

Paper Nr:	25
Title:	Improved Session-based Recommender System by Context Awareness in e-Commerce Domain
Authors:	Ramazan Esmeli, Mohamed Bader-El-Den, Hassana Abdullahi and David Henderson
Abstract:	Over the past two decades, there has been a rapid increase in the number of online sales. This has resulted in an increase in the data collected about the users’ behaviour which has aided the development of several novel Recommender System (RS) methods. One of the main problem in RS is the lack of ”explicit rating”; many customers do not rate the items they buy or view. However, the user behaviour (e.g. session duration, number of items, item duration view, etc.) during an online session could give an indication about what the user preferences ”implicit rating”. In this paper, we present a method to derive numerical implicit ratings from user browsing behaviour. Also, we examine the impact of using the derived numerical implicit ratings as context factors on some of the main RS methods, i.e. Factorisation Recommender and Item-Item Collaborative Filtering models. We evaluate the performance of the proposed framework on two large e-commerce datasets. The computational experiments show that in the absence of user explicit rating, the use of the user behaviour data to generate numerical implicit ratings could significantly improve the several RS methods.
Download

Paper Nr:	32
Title:	FormalStyler: GPT based Model for Formal Style Transfer based on Formality and Meaning Preservation
Authors:	Mariano de Rivero, Cristhiam Tirado and Willy Ugarte
Abstract:	Style transfer is a natural language processing generation task, it consists of substituting one given writing style for another one. In this work, we seek to perform informal-to-formal style transfers in the English language. This process is shown in our web interface where the user input a informal message by text or voice. This project’s target audience are students and professionals in the need to improve the quality of their work by formalizing their texts. A style transfer is considered successful when the original semantic meaning of the message is preserved after the independent style has been replaced. This task is hindered by the scarcity of training and evaluation datasets alongside the lack of metrics. To accomplish this task we opted to utilize OpenAI’s GPT-2 Transformer-based pre-trained model. To adapt the GPT-2 to our research, we fine-tuned the model with a parallel corpus containing informal text entries paired with the equivalent formal ones. We evaluate the fine-tuned model results with two specific metrics, formality and meaning preservation. To further fine-tune the model we integrate a human-based feedback system where the user selects the best formal sentence out of the ones generated by the model. The resulting evaluations of our solution exhibit similar to improved scores in formality and meaning preservation to state-of-the-art approaches.
Download

Paper Nr:	34
Title:	Towards a Rule-based Visualization Recommendation System
Authors:	Arnab Chakrabarti, Farhad Ahmad and Christoph Quix
Abstract:	Data visualization plays an important role in the analysis of data and the identification of insights and characteristics within the dataset. However, visualizing datasets, especially high dimensional ones, is a very difficult and time-consuming process that requires a great deal of manual effort. The automation of data visualization is done in the form of Visualization Recommendation Systems by detecting factors such as data characteristics and user intended tasks in order to recommend useful visualizations. In this paper, we propose a Visualization Recommendation System, built on a knowledge-based rule engine, that takes minimal user input, extracts important data characteristics and supports a large number of visualization techniques depending on both the data characteristics and the intended tasks of the user. Through our proposed model we show the efficacy of such recommendations for users without any domain expertise. Lastly, we evaluate our system with real-world use case scenarios to prove the effectiveness and the feasibility of our approach.
Download

Paper Nr:	35
Title:	Product Embedding for Large-Scale Disaggregated Sales Data
Authors:	Yinxing Li and Nobuhiko Terui
Abstract:	This paper recommends a system that incorporates the marketing environment and customer heterogeneity. We employ and extend Item2Vec and Item2Vec approaches to high-dimensional store data. Our study not only aims to propose a model with better forecasting precision but also to reveal how customer demographics affect customer behaviour. Our empirical results show that marketing environment and customer heterogeneity increase forecasting precision and those demographics have a significant influence on customer behaviour through the hierarchical model.
Download

Short Papers

Paper Nr:	4
Title:	Link Prediction for Wikipedia Articles based on Temporal Article Embedding
Authors:	Jiaji Ma and Mizuho Iwaihara
Abstract:	Wikipedia articles contain a vast number of hyperlinks (internal links) connecting subjects to other Wikipedia articles. It is useful to predict future links for newly created articles. Suggesting new links from/to existing articles can reduce editors’ burdens, by prompting editors about necessary or missing links in their updates. In this paper, we discuss link prediction on linked and versioned articles. We propose new graph embeddings utilizing temporal random walk, which is biased by timestamp difference and semantic difference between linked and versioned articles. We generate article sequences by concatenating the article titles and category names on each random walk path. A pretrained language model is further trained to learn contextualized embeddings of article sequences. We design our link prediction experiments by predicting future links between new nodes and existing nodes. For evaluation, we compare our model’s prediction results with three random walk-based graph embedding models DeepWalk, Node2vec, and CTDNE, through ROC AUC score, PRC AUC score, Precision@k, Recall@k, and F1@k as evaluation metrics. Our experimental results show that our proposed TLPRB outperforms these models in all the evaluation metrics.
Download

Paper Nr:	5
Title:	Natural Language-based User Guidance for Knowledge Graph Exploration: A User Study
Authors:	Hans Friedrich Witschel, Kaspar Riesen and Loris Grether
Abstract:	Large knowledge graphs hold the promise of helping knowledge workers in their tasks by answering simple and complex questions in specialised domains. However, searching and exploring knowledge graphs in current practice still requires knowledge of certain query languages such as SPARQL or Cypher, which many untrained end users do not possess. Approaches for more user-friendly exploration have been proposed and range from natural language querying over visual cues up to query-by-example mechanisms, often enhanced with recommendation mechanisms offering guidance. We observe, however, a lack of user studies indicating which of these approaches lead to a better user experience and optimal exploration outcomes. In this work, we make a step towards closing this gap by conducting a qualitative user study with a system that relies on formulating queries in natural language and providing answers in the form of subgraph visualisations. Our system is able to offer guidance via query recommendations based on a current context. The user study evaluates the impact of this guidance in terms of both efficiency and effectiveness (recall) of user sessions. We find that both aspects are improved, especially since query recommendations provide inspiration, leading to a larger number of insights discovered in roughly the same time.
Download

Paper Nr:	8
Title:	Event-based Pathology Data Prioritisation: A Study using Multi-variate Time Series Classification
Authors:	Jing Qi, Girvan Burnside, Paul Charnley and Frans Coenen
Abstract:	A particular challenge for any hospital is the large amount of pathology data that doctors are routinely presented with. Pathology result analysis is routine in hospital environments. Some form of machine learning for pathology result prioritisation is therefore desirable. Patients typically have a history of pathology results, and each pathology result may have several dimensions, hence time series analysis for prioritisation suggests itself. However, because of the resource required, labelled prioritisation training data is typically not readily available. Hence traditional supervised learning and/or ranking is not a realistic solution and some alternative solution is required. The idea presented in this paper is to use the outcome event, what happened to a patient, as a proxy for a ground truth prioritisation data set. This idea is explored using two approaches: kNN time series classification and Long Short-Term Memory deep learning.
Download

Paper Nr:	13
Title:	Who? What? Event Tracking Needs Event Understanding
Authors:	Nicholas Mamo, Joel Azzopardi and Colin Layfield
Abstract:	Humans first learn, then think and finally perform a task. Machines neither learn nor think, but we still expect them to perform tasks as well as humans. In this position paper, we address the lack of understanding in Topic Detection and Tracking (TDT), an area that builds timelines of events, but which hardly understands events at all. Without understanding events, TDT has progressed slowly as the community struggles to solve the challenges of modern data sources, like Twitter. We explore understanding from different perspectives: what it means for machines to understand events, why TDT needs understanding, and how algorithms can generate knowledge automatically. To generate understanding, we settle on a structured definition of events based on the four Ws: the Who, What, Where and When. Of the four Ws, we focus especially on the Who and the What, aligning them with other research areas that can help TDT generate event knowledge automatically. In time, understanding can lead to machines that not only track events better, but also model and mine them.
Download

Paper Nr:	14
Title:	Analysis of a German Legal Citation Network
Authors:	Tobias Milz, Michael Granitzer and Jelena Mitrović
Abstract:	The paper introduces the creation and analysis of a German legal citation network. The network consists of over 200.000 German court cases from all levels of appeal and jurisdiction and more than 50.000 laws. References to court decisions and laws are extracted from within the decision text of the court cases and added as links to the network. We apply network-based analysis techniques to support common legal information retrieval tasks such as identification of important court decisions and laws and case similarity searches. Furthermore, we demonstrate that the German case citation network displays scale-free behaviour, similar to that of the U.S. and Austrian Supreme Courts as shown by previous research.
Download

Paper Nr:	17
Title:	Conversation Extraction from Event Logs
Authors:	Sébastien Salva, Laurent Provot and Jarod Sue
Abstract:	Event logs are more and more considered for helping IT personnel understand system behaviour or performance. One way to get knowledge from event logs is by extracting conversations (a.k.a. sessions) through the recovering of event correlations. This paper proposes a highly parallel algorithm to retrieve conversations from event logs, without having any knowledge about the used correlation mechanisms. To make the event log exploration effective and efficient, we devised an algorithm that covers an event log and builds the possible conversation sets w.r.t. the data found within the events. To limit the conversation set exploration and quicker recover good candidates, the algorithm is guided by an heuristic based upon the evaluation of invariants and conversation quality attributes. This heuristic also offers flexibility to users, as the quality and invariants can be adapted to the system context. We report experimental results obtained from 6 case studies and show that our algorithm has the capability of recovering the expected conversation sets in reasonable time delays.
Download

Paper Nr:	18
Title:	From Payment Services Directive 2 (PSD2) to Credit Scoring: A Case Study on an Italian Banking Institution
Authors:	Roberto Saia, Alessandro Giuliani, Livio Pompianu and Salvatore Carta
Abstract:	The Payments Systems Directive 2 (PSD2), recently issued by the European Union, allows the banks to share their customer data if they authorize the operation. On the one hand, this opportunity offers interesting perspectives to the financial operators, allowing them to evaluate the customers reliability (Credit Scoring) even in the absence of the canonical information typically used (e.g., age, current job, total incomes, or previous loans). On the other hand, the state-of-the-art approaches and strategies still train their Credit Scoring models using the canonical information. This scenario is further worsened by the scarcity of proper datasets needed for research purposes and the class imbalance between the reliable and unreliable cases, which biases the reliability of the classification models trained using this information. The proposed work is aimed at experimentally investigating the possibility of defining a Credit Scoring model based on the bank transactions of a customer, instead of using the canonical information, comparing the performance of the two models (canonical and transaction-based), and proposing an approach to improve the performance of the transactions-based model. The obtained results show the feasibility of a Credit Scoring model based only on banking transactions, and the possibility of improving its performance by introducing simple meta-features.
Download

Paper Nr:	19
Title:	Automatic General Metadata Extraction and Mapping in an HDF5 Use-case
Authors:	Benedikt Heinrichs, Nils Preuß, Marius Politze, Matthias S. Müller and Peter F. Pelz
Abstract:	Extracting interoperable metadata from data entities is not an easy task. A method for this would need to extract non-interoperable metadata values first and then map the extracted metadata to some sensible representation. In the case of HDF5 files, metadata annotation is already an option, making it an easy target for extracting these non-interoperable metadata values. This paper describes a use-case, that utilizes this property to automatically annotate their data. However, the issue arises, that these metadata values are not reusable, due to their missing interoperability, and validatable since they do not follow any defined metadata schema. Therefore, this paper provides a solution for mapping the defined metadata values to interoperable metadata by extracting them first using a general metadata extraction pipeline and then proposing a method for mapping them. This method can receive a number of application profiles and creates interoperable metadata based on the best fit. The method is validated against the introduced use-case and shows promising results for other research domains as well.
Download

Paper Nr:	21
Title:	Exhaustive Solution for Mining Frequent Conceptual Links in Large Networks using a Binary Compressed Representation
Authors:	Hadjer Djahnit and Malika Bessedik
Abstract:	In the domain of social network analysis, the frequent pattern mining task gives large opportunities for knowledge discovery. One of the most recent variations of the pattern definition applied to social networks is the frequent conceptual links (FCL). A conceptual link represents a set of links connecting groups of nodes such as nodes of each group share common attributes. When the number of these links exceeds a predefined threshold, it is referred to as a frequent conceptual link and it aims to describe the network in term of the most connected type of nodes while exploiting structural and semantic information of the network. Since the inception of this technique, a number of improvements were achieved in the search process in order to optimise its performances. In this paper, we propose a new algorithm for extracting frequent conceptual links from large networks. By adopting a new compressed structure for the network, the proposed approach reaches up to 90% of gain in the execution time.
Download

Paper Nr:	23
Title:	Multidimensional Demographic Profiles for Fair Paper Recommendation
Authors:	Reem Alsaffar and Susan Gauch
Abstract:	Despite double-blind peer review, bias affects which papers are selected for inclusion in conferences and journals. To address this, we present fair algorithms that explicitly incorporate author diversity in paper recommendation using multidimensional author profiles that include five demographic features, i.e., gender, ethnicity, career stage, university rank and geolocation. The Overall Diversity method ranks papers based on an overall diversity score whereas the Multifaceted Diversity method selects papers that fill the highest-priority demographic feature first. We evaluate these algorithms with Boolean and continuous-valued features by recommending papers for SIGCHI 2017 from a pool of SIGCHI 2017, DIS 2017 and IUI 2017 papers and compare the resulting set of papers with the papers accepted by the conference. Both methods increase diversity with small decreases in utility using profiles with either Boolean or continuous feature values. Our best method, Multifaceted Diversity, recommends a set of papers that match demographic parity, selecting authors who are 42.50% more diverse with a 2.45% gain in utility. This approach could be applied during conference papers, journal papers, or grant proposal selection or other tasks within academia.
Download

Paper Nr:	26
Title:	Extracting Event-related Information from a Corpus Regarding Soil Industrial Pollution
Authors:	Chuanming Dong, Philippe Gambette and Catherine Dominguès
Abstract:	We study the extraction and reorganization of event-related information in texts regarding industrial pollution. The object is to build a memory of polluted sites that gathers the information about industrial events from various databases and corpora. An industrial event is described through several features as the event trigger, the industrial activity, the institution, the pollutant, etc. In order to efficiently collect information from a large corpus, it is necessary to automatize the information extraction process. To this end, we manually annotated a part of a corpus about soil industrial pollution, then we used it to train information extraction models with deep learning methods. The models we trained achieve 0.76 F-score on event feature extraction. We intend to improve the models and then use them on other text resources to enrich the polluted sites memory with extracted information about industrial events.
Download

Paper Nr:	28
Title:	Transformer-based Language Models for Semantic Search and Mobile Applications Retrieval
Authors:	João Coelho, António Neto, Miguel Tavares, Carlos Coutinho, João Oliveira, Ricardo Ribeiro and Fernando Batista
Abstract:	Search engines are being extensively used by Mobile App Stores, where millions of users world-wide use them every day. However, some stores still resort to simple lexical-based search engines, despite the recent advances in Machine Learning, Information Retrieval, and Natural Language Processing, which allow for richer semantic strategies. This work proposes an approach for semantic search of mobile applications that relies on transformer-based language models, fine-tuned with the existing textual information about known mobile applications. Our approach relies solely on the application name and on the unstructured textual information contained in its description. A dataset of about 500 thousand mobile apps was extended in the scope of this work with a test set, and all the available textual data was used to fine-tune our neural language models. We have evaluated our models using a public dataset that includes information about 43 thousand applications, and 56 manually annotated non-exact queries. The results show that our model surpasses the performance of all the other retrieval strategies reported in the literature. Tests with users have confirmed the performance of our semantic search approach, when compared with an existing deployed solution.
Download

Paper Nr:	33
Title:	Toward Formal Data Set Verification for Building Effective Machine Learning Models
Authors:	Jorge López, Maxime Labonne and Claude Poletti
Abstract:	In order to properly train a machine learning model, data must be properly collected. To guarantee a proper data collection, verifying that the collected data set holds certain properties is a possible solution. For example, guaranteeing that that data set contains samples across the whole input space, or that the data set is balanced w.r.t. different classes. We present a formal approach for verifying a set of arbitrarily stated properties over a data set. The proposed approach relies on the transformation of the data set into a first order logic formula, which can be later verified w.r.t. the different properties also stated in the same logic. A prototype tool, which uses the z3 solver, has been developed; the prototype can take as an input a set of properties stated in a formal language and formally verify a given data set w.r.t. to the given set of properties. Preliminary experimental results show the feasibility and performance of the proposed approach, and furthermore the flexibility for expressing properties of interest.
Download

Paper Nr:	38
Title:	Artist Recommendation based on Association Rule Mining and Community Detection
Authors:	Okan Çiftçi, Samet Tenekeci and Ceren Ülgentürk
Abstract:	Recent advances in the web have greatly increased the accessibility of music streaming platforms and the amount of consumable audio content. This has made automated recommendation systems a necessity for listeners and streaming platforms alike. Therefore, a wide variety of predictive models have been designed to identify related artists and music collections. In this paper, we proposed a graph-based approach that utilizes association rules extracted from Spotify playlists. We constructed several artist networks and identified related artist clusters using Louvain and Label Propagation community detection algorithms. We analyzed internal and external cluster agreements based on different validation criteria. As a result, we achieved up to 99.38% internal and 90.53% external agreements between our models and Spotify’s related artist lists. These results show that integrating association rule mining concepts with graph databases can be a novel and effective way to design an artist recommendation system.
Download

Paper Nr:	39
Title:	Explainability and Continuous Learning with Capsule Networks
Authors:	Janis Mohr, Basil Tousside, Marco Schmidt and Jörg Frochte
Abstract:	Capsule networks are an emerging technique for image recognition and classification tasks with innovative approaches inspired by the human visual cortex. State of the art is that capsule networks achieve good accuracy for future image recognition tasks and are a promising approach for hierarchical data sets. In this work, it is shown that capsule networks can generate image descriptions representing detected objects in images. This visualisation in combination with reconstructed images delivers strong and easily understandable explainability regarding the decision-making process of capsule networks and leading towards trustworthy AI. Furthermore it is shown that capsule networks can be used for continuous learning utilising already learned basic geometric shapes to learn more complex objects. As shown by our experiments, our approach allows for distinct explainability making it possible to use capsule networks where explainability is required.
Download

Paper Nr:	40
Title:	Towards Tracking Provenance from Machine Learning Notebooks
Authors:	Dominik Kerzel, Sheeba Samuel and Birgitta König-Ries
Abstract:	Machine learning (ML) pipelines are constructed to automate every step of ML tasks, transforming raw data into engineered features, which are then used for training models. Even though ML pipelines provide benefits in terms of flexibility, extensibility, and scalability, there are many challenges when it comes to their reproducibility and data dependencies. Therefore, it is crucial to track and manage metadata and provenance of ML pipelines, including code, model, and data. The provenance information can be used by data scientists in developing and deploying ML models. It improves understanding complex ML pipelines and facilitates analyzing, debugging, and reproducing ML experiments. In this paper, we discuss ML use cases, challenges, and design goals of an ML provenance management tool to automatically expose the metadata. We introduce MLProvLab, a JupyterLab extension, to automatically identify the relationships between data and models in ML scripts. The tool is designed to help data scientists and ML practitioners track, capture, compare, and visualize the provenance of machine learning notebooks.
Download

Paper Nr:	42
Title:	Improving Legal Information Retrieval: Metadata Extraction and Segmentation of German Court Rulings
Authors:	Ingo Glaser, Sebastian Moser and Florian Matthes
Abstract:	Legal research is a vital part of the work of lawyers. The increasing complexity of legal cases has led to a desire for fast and accurate legal information retrieval, leveraging semantic information. However, two main problems occur on that path. First, the share of published judgments is only marginal. Second, it lacks state-of-the-art NLP approaches to extract semantic information. The latter, in turn, can be attributed to the issue of data scarcity. One big issue in the publication process of court rulings is the lack of automatization. Yet, the digitalization of court rulings, specifically transforming the textual representation from the court into a machine-readable format, is mainly done manually. To address this issue, we propose an automated pipeline to segment court rulings and extract metadata. We integrate that pipeline into a prototypical web application and use it for a qualitative evaluation. The results show that the extraction of metadata and the classification of paragraphs into the respective verdict segments perform well and can be utilized within the existing processes at legal publishers.
Download

Paper Nr:	44
Title:	Retinal Blood Vessel Segmentation using Convolutional Neural Networks
Authors:	Arun Kumar Yadav, Arti Jain, Jorge Luis Morato Lara and Divakar Yadav
Abstract:	Human beings often become victims to numerous diseases. Among these, diabetes stands out for its impairment of quality of life and even potential mortality. The diabetes needs to be properly taken care of, otherwise failure to detect its presence within proper time duration leads to a loss of life. According to the World Health Organization, the worldwide number of diabetic patients were 463 million during 2019 and is expected to cross 700 million by the 2045i. In the past, a lot of research has been carried out for retinal blood vessel segmentation for identification of Diabetic Retinopathy using various machine learning and deep learning models. In this research work, Convolutional Neural Network (CNN) and CLAHE are applied to tackle the problem of retinal blood vessel segmentation. Experimental evaluation shows that the proposed method outperforms with 0.9806 accuracy, quite competitive with respect to the state-of-art.
Download

Paper Nr:	3
Title:	Fine-grained Topic Detection and Tracking on Twitter
Authors:	Nicholas Mamo, Joel Azzopardi and Colin Layfield
Abstract:	With its large volume of data and free access to information, Twitter revolutionised Topic Detection and Tracking (TDT). Thanks to Twitter, TDT could build timelines of real-world events in real-time. However, over the years TDT struggled to adapt to Twitter’s noise. While TDT’s solutions stifled noise, they also kept the area from building granular timelines of events, and today, TDT still relies on large datasets from popular events. In this paper, we detail Event TimeLine Detection (ELD) as a solution: a real-time system that combines TDT’s two broad approaches, document-pivot and feature-pivot methods. In ELD, an on-line document-pivot technique clusters a stream of tweets, and a novel feature-pivot algorithm filters clusters and identifies topical keywords. This mixture allows ELD to overcome the technical limitations of traditional TDT algorithms to build fine-grained timelines of both popular and unpopular events. Nevertheless, our results emphasize the importance of robust topic tracking and the ability to filter subjective content.
Download

Paper Nr:	6
Title:	Determining Policy Communication Effectiveness: A Lexical Link Analysis Approach
Authors:	Claire Dyer, Brian Wood, Ying Zhao and Douglas J. MacKinnon
Abstract:	Military policies are often promulgated from the echelon II and III level, but it is often difficult to ascertain whether they are interpreted and implemented as intended. There is often a communication gap. This article seeks to determine if data mining tools such as lexical link analysis can measure that gap in quantitative terms. It starts by examining how lexical link analysis can determine how policies are communicated through the various echelons, assessing whether lexical link analysis can be used to determine if policies are interpreted and redistributed as intended, and exploring what this data tells us about policy communication and the implications for policy execution. The author uses lexical link analysis to reveal if there is a degree of policy mismatch at lower echelon levels and makes assessments about this mismatch based on established communication theory. This paper validated that Naval aircrew’s understanding of policies is of vital importance and higher policy tends to be interpreted and reissued with greater specificity as it moves down the chain of command as exemplified between Echelons IV and V.
Download

Paper Nr:	7
Title:	Human-error-potential Estimation based on Wearable Biometric Sensors
Authors:	Hiroki Ohashi and Hiroto Nagayoshi
Abstract:	This study tackles on a new problem of estimating human-error potential on a shop floor on the basis of wearable sensors. Unlike existing studies that utilize biometric sensing technology to estimate people’s internal state such as fatigue and mental stress, we attempt to estimate the human-error potential in a situation where a target person does not stay calm, which is much more difficult as sensor noise significantly increases. We propose a novel formulation, in which the human-error-potential estimation problem is reduced to a classification problem, and introduce a new method that can be used for solving the classification problem even with noisy sensing data. The key ideas are to model the process of calculating biometric indices probabilistically so that the prior knowledge on the biometric indices can be integrated, and to utilize the features that represent the movement of target persons in combination with biometric features. The experimental analysis showed that our method effectively estimates the human-error potential.
Download

Paper Nr:	12
Title:	Relevance of Similarity Measures Usage for Paraphrase Detection
Authors:	Tedo Vrbanec and Ana Meštrović
Abstract:	The article describes the experiments and their results using two Deep Learning (DL) models and four measures of similarity/distance, determining the similarity of documents from the three publicly available corpora of paraphrased documents. As DL models, Word2Vec was used in two variants and FastText in one. The article explains the existence of a multitude of hyperparameters and defines their values, selection of effective ways of text processing, the use of some non-standard parameters in Natural Language Processing (NLP), the characteristics of the corpora used, the results of the pairs (DL model, similarity measure) processing corpora, and tries to determine combinations of conditions under which use of exactly certain pairs yields the best results (presented in the article), measured by standard evaluation measures Accuracy, Precision, Recall and primarily F-measure.
Download

Paper Nr:	22
Title:	MEDIS: Analysis Methodology for Data with Multiple Complexities
Authors:	Raluca Portase, Ramona Tolas and Rodica Potolea
Abstract:	Hidden and unexpected value can be found in the vast amounts of data generated by IoT devices and industrial sensors. Extracting this knowledge can help on more complex tasks such as predictive maintenance or remaining useful time prediction. Manually inspecting the data is a slow, expensive, and highly subjective task that made automated solutions very popular. However, finding the value inside Big Data is a difficult task with many complexities. We present a general preprocessing methodology (MEDIS- MEthdology for preprocessing Data with multiple complexitIeS) consisting of a set of techniques and approaches which address such complexities.
Download

Paper Nr:	24
Title:	From Implicit Preferences to Ratings: Video Games Recommendation based on Collaborative Filtering
Authors:	Rosária Bunga, Fernando Batista and Ricardo Ribeiro
Abstract:	This work studies and compares the performance of collaborative filtering algorithms, with the intent of proposing a videogame-oriented recommendation system. This system uses information from the video game platform Steam, which contains information about the game usage, corresponding to the implicit feedback that was later transformed into explicit feedback. These algorithms were implemented using the Surprise library, that allows to create and evaluate recommender systems that deal with explicit data. The algorithms are evaluated and compared with each other using metrics such as RSME, MAE, Precision@k, Recall@k and F1@k. We have concluded that computationally low demanding approaches can still obtain suitable results.
Download

Paper Nr:	29
Title:	Social Media Analytics: An Overview of Applications and Approaches
Authors:	Zeinab Khanjarinezhadjooneghani and Nasseh Tabrizi
Abstract:	Users' activities on social media generate useful and valuable information about the general public that can be used in policymaking and the decision-making processes. In this paper we review social media analysis approaches and its' application areas by sampling and reviewing studies from IEEE Xplore, Science Direct, and SpringerLink. The primary focus of this paper is on the role of social media data in extracting, tracking, and evaluating general public activities, public opinions, and public behaviour. We look at several application areas, including disaster, urban planning, public health, politics, business, and marketing. The frequently used approaches in these areas are topic analysis, sentiment analysis, social network analysis, spatial and temporal analysis. Moreover, this study provides insight for those who wish to learn about social media's role as a data source for research related to our real-world issues and events.
Download

Paper Nr:	31
Title:	Decomposing Training Data to Improve Network Intrusion Detection Performance
Authors:	Roberto Saia, Alessandro Sebastian Podda, Gianni Fenu and Riccardo Balia
Abstract:	Anyone working in the field of network intrusion detection has been able to observe how it involves an ever- increasing number of techniques and strategies aimed to overcome the issues that affect the state-of-the-art solutions. Data unbalance and heterogeneity are only some representative examples of them, and each misclassification made in this context could have enormous repercussions in different crucial areas such as, for instance, financial, privacy, and public reputation. This happens because the current scenario is characterized by a huge number of public and private network-based services. The idea behind the proposed work is decomposing the canonical classification process into several sub-processes, where the final classification depends on all the sub-processes results, plus the canonical one. The proposed Training Data Decomposition (TDD) strategy is applied on the training datasets, where it applies a decomposition into regions, according to a defined number of events and features. The reason that leads this process is related to the observation that the same network event could be evaluated in a different manner, when it is evaluated in different time periods and/or when it involves different features. According to this observation, the proposed approach adopts different classification models, each of them trained in a different data region characterized by different time periods and features, classifying the event both on the basis of all model results, and on the basis of the canonical strategy that involves all data.
Download