Abstract: |
BACKGROUND:
Dataset search has become an important task due to the increasing value of data. Each "dataset", for example a government statistics dataset or a scientific dataset, typically includes a title, a description, and a data table, and is thus characterized by multi-field, multi-modal, and structured content. The dataset search task is similar to traditional document search, but it must handle this more complex data.
Traditional search methods can be categorized into full-text search methods such as BM25 and neural search approaches. Recently, neural search, with its cross-encoder and dual-encoder architectures, has achieved better results and become more popular. Because of its better scalability, the current standard architecture is the dual-encoder, which uses large transformer models to encode the "query" and the "search item" as embedding vectors and then computes their dot product to measure relevance.
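The dual-encoder scoring described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `encode` function is a toy stand-in for a transformer encoder, and all names are hypothetical.

```python
# Minimal sketch of dual-encoder relevance scoring (illustrative only;
# `encode` is a toy stand-in for a real transformer encoder).
import numpy as np

def encode(text, dim=4):
    # Toy deterministic embedding derived from a hash, so the example
    # is self-contained; a real system would run a transformer here.
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.standard_normal(dim)
    return v / np.linalg.norm(v)

def relevance(query, item):
    # Dual-encoder: encode query and item independently, then score
    # relevance with a dot product of the two embedding vectors.
    return float(encode(query) @ encode(item))
```

Because the two sides are encoded independently, item embeddings can be precomputed and indexed, which is the scalability advantage mentioned above.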
However, the dual-encoder architecture has several limitations in the dataset search setting. First, it cannot directly handle complex data such as multi-field data. Second, it usually requires heavy training and fine-tuning, whereas training data for dataset search is usually scarce. In addition, the accuracy of the dual-encoder generally needs to be improved.
PROPOSED METHOD:
To address the limitations of the dual-encoder, we propose the contextual link prediction (CLP) architecture, which places a relational mapping module on top of the encoding module. In this architecture, the encoding module can be reused across datasets to minimize training and fine-tuning requirements. The relational mapping module can be trained and fine-tuned more efficiently and can be enhanced with richer mappings to improve accuracy. Most importantly, the relational mapping module enables the handling of complex data such as multi-field data.
KEY TECHNIQUES:
The relational mapping module is based on the lightweight mapping operations used in knowledge graph embedding methods. It maps the "query embedding" to the relevant "dataset embedding". It can also be applied to map each "field embedding" to the "dataset embedding", which enables the handling of multi-field data.
In particular, we propose the multi-field relational-fusion method to compose the "dataset embedding" from the "field embeddings". In this method, each field embedding is mapped by a field-specific relational mapping, and the mapped embeddings are summed to obtain the dataset embedding. This method is simple yet expressive, in the sense that it preserves the information of all field embeddings: it is equivalent to concatenating all field embeddings and applying one large relational mapping.
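The fusion step and the concatenation equivalence can be sketched numerically. This is a hedged illustration under assumed shapes: linear matrix mappings and random placeholder embeddings stand in for the paper's actual relational mappings and encoder outputs.

```python
# Illustrative sketch of multi-field relational fusion (matrices and
# embeddings are random placeholders; shapes are assumptions).
import numpy as np

rng = np.random.default_rng(0)
d = 4                                    # embedding dimension
fields = ["title", "description", "table"]

# One field-specific relational mapping per field.
W = {f: rng.standard_normal((d, d)) for f in fields}
# Field embeddings, e.g. produced by a shared text encoder.
e = {f: rng.standard_normal(d) for f in fields}

# Fusion: map each field embedding with its relational mapping, then sum.
dataset_emb = sum(W[f] @ e[f] for f in fields)

# Equivalent view: concatenate the field embeddings and apply one large
# block mapping [W_title | W_description | W_table].
W_big = np.hstack([W[f] for f in fields])        # shape (d, 3d)
e_cat = np.concatenate([e[f] for f in fields])   # shape (3d,)
assert np.allclose(dataset_emb, W_big @ e_cat)
```

The final assertion checks the stated equivalence: summing per-field mapped embeddings is exactly a block-matrix product against the concatenated field embeddings, so no field information is structurally discarded.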
TRAINING:
To train this model, we treat the search task as a link prediction task and use a knowledge graph embedding training objective. The encoding module and the relational mapping module are trained together end-to-end. Our key insight is that the information retrieval problem can be treated as a graph modeling problem.
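One common knowledge-graph-embedding-style objective is a softmax cross-entropy over a positive item and sampled negatives; the sketch below assumes that form and a linear relational mapping. All function names and shapes are illustrative, not taken from the paper.

```python
# Hedged sketch of a link-prediction training objective with negative
# sampling (assumed form; names and shapes are illustrative).
import numpy as np

def score(q, M, d):
    # The relational mapping M sends the query embedding toward the
    # dataset embedding; relevance is the dot product after mapping.
    return (M @ q) @ d

def link_prediction_loss(q, M, pos, negs):
    # Cross-entropy over {positive dataset} plus sampled negatives.
    logits = np.array([score(q, M, pos)] + [score(q, M, n) for n in negs])
    logits -= logits.max()                  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return -np.log(probs[0])                # positive is at index 0
```

In end-to-end training, the gradient of this loss would flow through both the relational mapping `M` and the encoder producing `q`, `pos`, and the negatives.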
PRELIMINARY RESULT:
An early version of this model was evaluated in the NTCIR-15 dataset search competition. The data contains US and Japanese government statistics datasets with over one million "dataset" items. The test set contains 96 queries for the US datasets and 96 queries for the Japanese datasets. This model achieved promising results, with the best performance on the average metric, outperforming several popular and strong baselines such as BM25 full-text search and a BERT cross-encoder model. |