IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING

Data Synthesis Reinvented: Preserving Missing Patterns for Enhanced Analysis
Wang X, Asif H, Gupta S and Vaidya J
Synthetic data is being widely used as a replacement or enhancement for real data in fields as diverse as healthcare, telecommunications, and finance. Unlike real data, which represents actual people and objects, synthetic data is generated from an estimated distribution that retains key statistical properties of the real data. This makes synthetic data attractive for sharing while addressing privacy, confidentiality, and autonomy concerns. Real data often contains missing values that hold important information about individual, system, or organizational behavior. Standard synthetic data generation methods eliminate missing values as part of their pre-processing steps and thus completely ignore this valuable source of information. Instead, we propose methods to generate synthetic data that preserve both the observable and missing data distributions, thereby retaining the valuable information encoded in the missing patterns of the real data. Our approach handles various missing data scenarios and can easily integrate with existing data generation methods. Extensive empirical evaluations on diverse datasets demonstrate the effectiveness of our approach as well as the value of preserving the missing data distribution in synthetic data.
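The abstract does not spell out the mechanism, so here is a minimal, hypothetical Python sketch of the general idea only: model the missingness mask jointly with the values by appending indicator columns before fitting any off-the-shelf generator, then re-apply the generated mask. All function names are ours, and the row bootstrap stands in for a real synthesizer.

```python
import numpy as np

def augment_with_mask(X):
    """Append binary missingness indicators so a generator can learn them."""
    mask = np.isnan(X).astype(float)        # 1 where a value is missing
    filled = np.where(np.isnan(X), 0.0, X)  # placeholder values for the generator
    return np.hstack([filled, mask])

def reapply_mask(S, d):
    """Split generated rows into values + mask and re-insert NaNs."""
    values, mask = S[:, :d].copy(), S[:, d:] > 0.5
    values[mask] = np.nan
    return values

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
X[rng.random(X.shape) < 0.2] = np.nan       # inject 20% missingness

A = augment_with_mask(X)
# Toy "generator": bootstrap rows of A (stand-in for any trained synthesizer,
# e.g., a GAN or copula model fitted on the augmented matrix).
S = A[rng.integers(0, len(A), size=len(A))]
X_syn = reapply_mask(S, X.shape[1])
print(np.isnan(X).mean(), np.isnan(X_syn).mean())  # missing rates should match
```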
Cafe: Improved Federated Data Imputation by Leveraging Missing Data Heterogeneity
Min S, Asif H, Wang X and Vaidya J
Federated learning (FL), a decentralized machine learning approach, offers great performance while alleviating autonomy and confidentiality concerns. Despite FL's popularity, how to deal with missing values in a federated manner is not well understood. In this work, we initiate a study of federated imputation of missing values, particularly in complex scenarios, where missing data heterogeneity exists and the state-of-the-art (SOTA) approaches for federated imputation suffer from significant loss in imputation quality. We propose Cafe, a personalized FL approach for missing data imputation. Cafe is inspired by the observation that heterogeneity can induce differences in observable and missing data distributions across clients, and that these differences can be leveraged to improve the imputation quality. Cafe computes personalized weights that are automatically calibrated for the level of heterogeneity, which can remain unknown, to develop personalized imputation models for each client. An extensive empirical evaluation over a variety of settings demonstrates that Cafe matches the performance of SOTA baselines in homogeneous settings while significantly outperforming the baselines in heterogeneous settings.
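Cafe's actual weight calibration is not given in the abstract; the toy sketch below (our own construction, with hypothetical names such as personalized_imputer) illustrates one way personalized weights could decay with the distance between clients' observed statistics.

```python
import numpy as np

def local_stats(X):
    """Each client's shareable summary: per-feature means over observed values."""
    return np.nanmean(X, axis=0)

def personalized_imputer(stats, i, temperature=1.0):
    """Imputation values for client i: similarity-weighted average of all
    clients' statistics, so homogeneous clients pool and outliers stay apart."""
    d = np.array([np.linalg.norm(stats[i] - s) for s in stats])
    w = np.exp(-d / temperature)
    w /= w.sum()
    return sum(wi * s for wi, s in zip(w, stats))

rng = np.random.default_rng(1)
# Two similar clients and one heterogeneous client (shifted mean).
clients = [rng.normal(loc=mu, size=(200, 4)) for mu in (0.0, 0.1, 3.0)]
for X in clients:
    X[rng.random(X.shape) < 0.3] = np.nan   # 30% missing on every client

stats = [local_stats(X) for X in clients]
fill = personalized_imputer(stats, i=0)
X0_imputed = np.where(np.isnan(clients[0]), fill, clients[0])
```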
Hierarchical Active Learning with Label Proportions on Data Regions
Luo Z, Gao Q, He Y, Wang H, Hauskrecht M and Li T
Learning classification models from real-world data often requires substantial human effort devoted to instance annotation. As the instance-based annotating process can be very time-consuming and costly, we propose a novel active learning framework that builds classification models from human-annotated data regions. A region is defined by a set of conjunctive patterns that are formed by value ranges over the input features. A region label is a human assessment of the class proportion in the data population covered by the region. By leveraging learning-with-label-proportions algorithms, regions and their class proportions can be used to train instance-based classification models. However, the key challenge is that in practice, very few regions are defined in advance. Therefore, to identify regions important for model learning, we design a hierarchical active learning (HAL) framework, which actively builds a hierarchy of regions. Similar to the decision-tree learning process, our approach progressively divides the input data space into smaller sub-regions, solicits labels for the new regions, and retrains the base classification model with all the leaf regions. We further develop an ensemble (forest) solution, which builds multiple shallower hierarchies that have more informative, diverse, and simpler regions. We evaluate our HAL framework on numerous classification datasets as well as in a real user study on the survival analysis of colorectal cancer patients. The results demonstrate that region-based active learning methods can learn high-quality classifiers from very few labeled regions. Hence, our framework is shown to be very effective in reducing the human annotation effort needed for building classification models.
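As a rough illustration of region-based learning with label proportions (not the paper's HAL algorithm), the sketch below splits an axis-aligned region decision-tree style and propagates human-supplied class proportions to covered instances as soft labels; all names and the median split rule are our own choices.

```python
import numpy as np

def inside(box, X):
    """Boolean mask of instances covered by an axis-aligned box."""
    return np.all([(X[:, f] >= lo) & (X[:, f] <= hi)
                   for f, (lo, hi) in enumerate(box)], axis=0)

def split_region(box, X, feature):
    """Split one box at the median of a feature, decision-tree style."""
    t = float(np.median(X[inside(box, X), feature]))
    left, right = list(box), list(box)
    lo, hi = box[feature]
    left[feature], right[feature] = (lo, t), (t, hi)
    return left, right

def soft_labels(boxes, proportions, X):
    """Instances inherit the class proportion of the region covering them."""
    y = np.zeros(len(X))
    for box, p in zip(boxes, proportions):
        y[inside(box, X)] = p
    return y

rng = np.random.default_rng(2)
X = rng.uniform(size=(500, 2))
root = [(0.0, 1.0), (0.0, 1.0)]
left, right = split_region(root, X, feature=0)
# A human would supply these two region proportions after inspecting them.
y = soft_labels([left, right], proportions=[0.2, 0.8], X=X)
# (X, y) can now train any probabilistic base classifier.
```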
A Neural Database for Answering Aggregate Queries on Incomplete Relational Data
Zeighami S, Seshadri R and Shahabi C
Real-world datasets are often incomplete due to data collection cost, privacy considerations or as a side effect of data integration/preparation. We focus on answering aggregate queries on such datasets, where data incompleteness causes the answers to be inaccurate. To address this problem, assuming typical relational data, existing work generates synthetic data to complete the database, a challenging task, especially in the presence of bias in observed data. Instead, we propose a paradigm shift by learning to directly estimate query answers, circumventing the difficult data generation step. Our approach, dubbed NeuroComplete, learns to answer queries in three steps. First, NeuroComplete generates a set of queries for which accurate answers can be computed given the incomplete dataset. Next, it embeds queries in a feature space, through which each query is effectively represented with the portion of the database that contributes to the query answer. Finally, it trains a neural network in a supervised learning fashion: both query features (input) and correct answers (labels) are known. The learned model generates accurate answers to new queries at test time, exploiting the generalizability of the learned model in the embedding space. Extensive experimental results on real datasets show error reductions of up to 4 times for AVG queries and 10 times for COUNT queries compared with the state of the art.
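A minimal sketch of the learn-to-answer recipe the abstract outlines, under our own simplifying assumptions: a single numeric table, range predicates as the query features, and a scikit-learn regressor; NeuroComplete's actual query embedding is richer than this.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(3)
age = rng.uniform(20, 80, size=5000)
salary = 1000 * age + rng.normal(0, 5000, size=5000)
observed = rng.random(5000) < 0.6            # the salary column is incomplete

def avg_query(lo, hi):
    """AVG(salary) over a range predicate, computed on observed rows only."""
    sel = (age >= lo) & (age <= hi) & observed
    return salary[sel].mean() if sel.any() else np.nan

# Step 1-2: generate training queries and featurize them by their predicate.
Q = rng.uniform(20, 80, size=(2000, 2)); Q.sort(axis=1)
y = np.array([avg_query(lo, hi) for lo, hi in Q])
keep = ~np.isnan(y)                          # drop empty-selection queries

# Step 3: supervised training on (query features, answers).
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500)
model.fit(Q[keep], y[keep])
print(model.predict([[30.0, 50.0]]))         # estimated AVG on an unseen query
```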
Identifying Anomalies while Preserving Privacy
Asif H, Vaidya J and Papakonstantinou PA
Identifying anomalies in data is vital in many domains, including medicine, finance, and national security. However, privacy concerns pose a significant roadblock to carrying out such an analysis. Since existing privacy definitions do not allow good accuracy when doing outlier analysis, the notion of sensitive privacy has been recently proposed to deal with this problem. Sensitive privacy makes it possible to analyze data for anomalies with practically meaningful accuracy while providing a strong guarantee similar to differential privacy, which is the prevalent privacy standard today. In this work, we relate sensitive privacy to other important notions of data privacy so that one can port the technical developments and private mechanism constructions from these related concepts to sensitive privacy. Sensitive privacy critically depends on the underlying anomaly model. We develop a novel n-step lookahead mechanism to efficiently answer arbitrary outlier queries, which provably guarantees sensitive privacy if we restrict our attention to a common class of anomaly models. We also provide general constructions to give sensitively private mechanisms for identifying anomalies and show the conditions under which the constructions would be optimal.
Pushing ML Predictions Into DBMSs
Paganelli M, Sottovia P, Park K, Interlandi M and Guerra F
In the past decade, many approaches have been suggested to execute ML workloads on a DBMS. However, most of them have looked at in-DBMS ML from a training perspective, whereas ML inference has been largely overlooked. We think that this is an important gap to fill for two main reasons: (1) in the near future, every application will be infused with some sort of ML capability; (2) behind every web page, application, and enterprise there is a DBMS, whereby in-DBMS inference is an appealing solution for efficiency (e.g., less data movement), performance (e.g., cross-optimizations between relational operators and ML), and governance. In this article, we study whether DBMSs are a good fit for prediction serving. We introduce a technique for translating trained ML pipelines containing both featurizers (e.g., one-hot encoding) and models (e.g., linear and tree-based models) into SQL queries, and we compare in-DBMS performance against popular ML frameworks such as Sklearn and ml.net. Our experiments show that, when pushed inside a DBMS, trained ML pipelines can have performance comparable to ML frameworks in several scenarios, while they perform quite poorly on text featurization and over (even simple) neural networks.
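As a toy illustration of the translation idea (not the paper's compiler), the snippet below turns a tiny fitted pipeline, one-hot encoding plus a linear model, into a single SQL scoring query; the table and column names are hypothetical.

```python
# Fitted weights of a hypothetical pipeline: one categorical feature that was
# one-hot encoded, one numeric feature, and an intercept.
coef = {("color", "red"): 0.8, ("color", "blue"): -0.3}
numeric = {"age": 0.05}
intercept = -1.2

# One-hot columns become CASE expressions; numeric columns stay as-is.
terms = [f"{w:+g} * CASE WHEN {col} = '{val}' THEN 1 ELSE 0 END"
         for (col, val), w in coef.items()]
terms += [f"{w:+g} * {col}" for col, w in numeric.items()]
sql = f"SELECT id, {intercept:+g} {' '.join(terms)} AS score FROM customers;"
print(sql)
# SELECT id, -1.2 +0.8 * CASE WHEN color = 'red' THEN 1 ELSE 0 END
#            -0.3 * CASE WHEN color = 'blue' THEN 1 ELSE 0 END
#            +0.05 * age AS score FROM customers;
```

The DBMS can then run the scoring query next to the data, which is exactly the "less data movement" argument made above.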
Weakly Supervised Concept Map Generation through Task-Guided Graph Translation
Lu J, Dong X and Yang C
Recent years have witnessed the rapid development of concept map generation techniques due to their advantages in providing well-structured summarization of knowledge from free texts. Traditional unsupervised methods do not generate task-oriented concept maps, whereas deep generative models require large amounts of training data. In this work, we present GT-D2G (Graph Translation-based Document To Graph), an automatic concept map generation framework that leverages generalized NLP pipelines to derive semantic-rich initial graphs, and translates them into more concise structures under the weak supervision of downstream task labels. The concept maps generated by GT-D2G can provide interpretable summarization of structured knowledge for the input texts, which is demonstrated through human evaluation and case studies on three real-world corpora. Further experiments on the downstream task of document classification show that GT-D2G beats other concept map generation methods. Moreover, we specifically validate the labeling efficiency of GT-D2G in the label-efficient learning setting and the flexibility of generated graph sizes in controlled hyper-parameter studies.
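GT-D2G's graph translation network is not reproduced here; the toy below only illustrates the kind of semantic-rich initial graph an NLP pipeline might derive, using simple token co-occurrence within a window as a stand-in for the real pipeline.

```python
from collections import Counter

doc = "concept maps summarize knowledge concept maps support learning"
tokens = doc.split()
window = 3                       # co-occurrence window (arbitrary choice)

edges = Counter()
for i in range(len(tokens)):
    for j in range(i + 1, min(i + window, len(tokens))):
        edges[tuple(sorted((tokens[i], tokens[j])))] += 1

# Keep repeated co-occurrences as weighted edges of the draft concept graph.
concept_graph = {e: w for e, w in edges.items() if w >= 2}
print(concept_graph)             # {('concept', 'maps'): 2}
```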
M2: Mixed Models With Preferences, Popularities and Transitions for Next-Basket Recommendation
Peng B, Ren Z, Parthasarathy S and Ning X
Next-basket recommendation considers the problem of recommending a set of items into the next basket that users will purchase as a whole. In this paper, we develop a novel mixed model with preferences, popularities and transitions (M2) for next-basket recommendation. This method models three important factors in the next-basket generation process: 1) users' general preferences, 2) items' global popularities and 3) transition patterns among items. Unlike existing recurrent neural network-based approaches, M2 does not use complicated networks to model the transitions among items, or generate embeddings for users. Instead, it has a simple encoder-decoder based approach (ed-Trans) to better model the transition patterns among items. We compared M2 with different combinations of the factors against 5 state-of-the-art next-basket recommendation methods on 4 public benchmark datasets in recommending the first, second and third next basket. Our experimental results demonstrate that M2 significantly outperforms the state-of-the-art methods on all the datasets in all the tasks, with an improvement of up to 22.1%. In addition, our ablation study demonstrates that ed-Trans is more effective than recurrent neural networks in terms of recommendation performance. We also provide a thorough discussion of various experimental protocols and evaluation metrics for next-basket recommendation evaluation.
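A schematic of the three-factor mixture as described in the abstract; the weights and inputs below are made up, and the real M2 learns these jointly together with its ed-Trans encoder-decoder.

```python
import numpy as np

def mixed_scores(pref_u, pop, trans, prev_basket, alpha, beta, gamma):
    """pref_u: (n_items,) user's general preferences; pop: (n_items,) global
    popularities; trans: (n_items, n_items) item-to-item transition scores."""
    transition = trans[prev_basket].mean(axis=0)  # aggregate transitions from
    return alpha * pref_u + beta * pop + gamma * transition  # the last basket

rng = np.random.default_rng(4)
n = 50
scores = mixed_scores(rng.random(n), rng.random(n), rng.random((n, n)),
                      prev_basket=[3, 17, 25], alpha=0.5, beta=0.2, gamma=0.3)
print(np.argsort(-scores)[:5])   # top-5 items recommended into the next basket
```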
Domain-specific Topic Model for Knowledge Discovery in Computational and Data-Intensive Scientific Communities
Zhang Y, Calyam P, Joshi T, Nair S and Xu D
Shortened time to knowledge discovery and adapting prior domain knowledge is a challenge for computational and data-intensive communities such as bioinformatics and neuroscience. The challenge for a domain scientist lies in obtaining guidance by querying massive amounts of information from a diverse text corpus comprising a wide-ranging set of topics when investigating new methods, developing new tools, or integrating datasets. In this paper, we propose a novel "domain-specific topic model" (DSTM) to discover latent knowledge patterns about relationships among research topics, tools and datasets from exemplary scientific domains. Our DSTM is a generative model that extends the Latent Dirichlet Allocation (LDA) model and uses the Markov chain Monte Carlo (MCMC) algorithm to infer latent patterns within a specific domain in an unsupervised manner. We apply our DSTM to large collections of data from the bioinformatics and neuroscience domains that include more than 25,000 papers from the last ten years, featuring hundreds of tools and datasets that are commonly used in relevant studies. Evaluation experiments based on generalization and information retrieval metrics show that our model has better performance than the state-of-the-art baseline models for discovering highly-specific latent topics within a domain. Lastly, we demonstrate applications that benefit from our DSTM to discover intra-domain, cross-domain and trend knowledge patterns.
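DSTM itself is not specified in the abstract; as a pointer to the underlying machinery, here is a minimal collapsed Gibbs sampler for vanilla LDA, the model DSTM extends (DSTM additionally models tools and datasets as observed entities).

```python
import numpy as np

def lda_gibbs(docs, V, K=2, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA. docs: lists of word ids in [0, V)."""
    rng = np.random.default_rng(seed)
    z = [rng.integers(0, K, size=len(d)) for d in docs]   # topic assignments
    ndk = np.zeros((len(docs), K)); nkv = np.zeros((K, V)); nk = np.zeros(K)
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]; ndk[d, k] += 1; nkv[k, w] += 1; nk[k] += 1
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                  # remove the current assignment
                ndk[d, k] -= 1; nkv[k, w] -= 1; nk[k] -= 1
                # full conditional: P(z = k | rest)
                p = (ndk[d] + alpha) * (nkv[:, w] + beta) / (nk + V * beta)
                k = rng.choice(K, p=p / p.sum())
                z[d][i] = k; ndk[d, k] += 1; nkv[k, w] += 1; nk[k] += 1
    return nkv                               # normalize rows for word dists

docs = [[0, 0, 1, 1], [0, 1, 0], [2, 3, 2], [3, 2, 3, 3]]  # toy word ids
print(lda_gibbs(docs, V=4))                  # two topics split words {0,1}/{2,3}
```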
A Generalized Framework for Preserving Both Privacy and Utility in Data Outsourcing
Xie S, Mohammady M, Wang H, Wang L, Vaidya J and Hong Y
Property preserving encryption techniques have significantly advanced the utility of encrypted data in various data outsourcing settings (e.g., the cloud). However, while preserving certain properties (e.g., the prefixes or order of the data) in the encrypted data, such encryption schemes are typically limited to specific data types (e.g., prefix-preserved IP addresses) or applications (e.g., range queries over order-preserved data), and are highly vulnerable to emerging inference attacks, which may greatly limit their applications in practice. In this paper, to the best of our knowledge, we make the first attempt to generalize prefix-preserving encryption via encoding such that it is not only applicable to more general data types (e.g., geo-locations, market basket data, DNA sequences, numerical data and timestamps) but also secure against inference attacks. Furthermore, we present a generalized framework that generates multiple data views in which one view fully preserves the utility for data analysis, and its accurate analysis result can be obliviously retrieved. Given any specified privacy leakage bound, the computation and communication overheads are minimized to effectively defend against different inference attacks. We empirically evaluate the performance of our outsourcing framework against two common inference attacks on two different real datasets: a check-in location dataset and a network traffic dataset, respectively. The experimental results demonstrate that our proposed framework preserves both privacy (with bounded leakage and indistinguishability of data views) and utility (with 100% analysis accuracy).
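A toy prefix-preserving encoding for intuition (our own construction; neither the paper's scheme nor Crypto-PAn): flipping each bit with a PRF keyed on the preceding prefix guarantees that two inputs sharing a k-bit prefix map to outputs sharing exactly a k-bit prefix.

```python
import hmac, hashlib

KEY = b"demo-key-only"     # toy key; a real deployment would manage this properly

def prf_bit(prefix: str) -> int:
    """Pseudorandom bit derived from the bits seen so far."""
    return hmac.new(KEY, prefix.encode(), hashlib.sha256).digest()[0] & 1

def pp_encode(bits: str) -> str:
    """Encode bit i as input_bit XOR PRF(prefix of length i)."""
    return "".join(str(int(b) ^ prf_bit(bits[:i])) for i, b in enumerate(bits))

a, b = "11010010", "11011111"    # share a 4-bit prefix, diverge at bit 5
ea, eb = pp_encode(a), pp_encode(b)
assert ea[:4] == eb[:4] and ea[4] != eb[4]   # shared prefix length preserved
print(ea, eb)
```

Because identical prefixes feed the PRF identical inputs, the encodings agree exactly as long as the plaintexts do, which is the property the paper generalizes to richer data types.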
MOLER: Incorporate Molecule-Level Reward to Enhance Deep Generative Model for Molecule Optimization
Fu T, Xiao C, Glass LM and Sun J
The goal of molecular optimization is to generate molecules similar to a target molecule but with better chemical properties. Deep generative models have shown great success in molecule optimization. However, due to the iterative local generation process of deep generative models, the resulting molecules can significantly deviate from the input in molecular similarity and size, leading to poor chemical properties. The key issue here is that the existing deep generative models restrict their attention to substructure-level generation without considering the entire molecule as a whole. To address this challenge, we propose Molecule-Level Reward functions (MOLER) to encourage (1) the input and the generated molecule to be similar, and to ensure (2) the generated molecule has a similar size to the input. The proposed method can be combined with various deep generative models. A policy gradient technique is introduced to optimize reward-based objectives with small computational overhead. Empirical studies show that MOLER achieves up to 20.2% relative improvement in success rate over the best baseline method on several properties, including QED, DRD2 and LogP.
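A sketch of molecule-level reward shaping plus a REINFORCE-style surrogate loss, under our own assumptions (a Tanimoto-like similarity in [0, 1] and a size penalty weight lam); the exact MOLER objective may differ.

```python
import numpy as np

def molecule_level_reward(sim, size_in, size_out, lam=0.1):
    """Reward similarity to the input molecule, penalize size deviation.
    sim: similarity in [0, 1]; sizes: e.g., atom counts."""
    return sim - lam * abs(size_in - size_out)

def reinforce_loss(log_probs, rewards):
    """Policy-gradient surrogate: maximize reward-weighted log-likelihood.
    Subtracting the mean reward is a simple variance-reducing baseline."""
    rewards = np.asarray(rewards) - np.mean(rewards)
    return -np.mean(rewards * np.asarray(log_probs))

# Two sampled generations: one similar and size-matched, one that drifted.
r = [molecule_level_reward(0.7, 24, 26), molecule_level_reward(0.4, 24, 40)]
print(reinforce_loss(log_probs=[-3.2, -2.8], rewards=r))
```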
Heterogeneous Network Representation Learning: A Unified Framework with Survey and Benchmark
Yang C, Xiao Y, Zhang Y, Sun Y and Han J
Since real-world objects and their interactions are often multi-modal and multi-typed, heterogeneous networks have been widely used as a more powerful, realistic, and generic superclass of traditional homogeneous networks (graphs). Meanwhile, representation learning (a.k.a. embedding) has recently been intensively studied and shown effective for various network mining and analytical tasks. In this work, we aim to provide a unified framework to deeply summarize and evaluate existing research on heterogeneous network embedding (HNE), which includes but goes beyond a normal survey. Since there has already been a broad body of HNE algorithms, as the first contribution of this work, we provide a generic paradigm for the systematic categorization and analysis of the merits of various existing HNE algorithms. Moreover, existing HNE algorithms, though mostly claimed generic, are often evaluated on different datasets. Though understandable given the application-driven nature of HNE, such indirect comparisons largely hinder the proper attribution of improved task performance towards effective data preprocessing and novel technical design, especially considering the various ways possible to construct a heterogeneous network from real-world application data. Therefore, as the second contribution, we create four benchmark datasets with various properties regarding scale, structure, attribute/label availability, and so on, from different sources, towards handy and fair evaluations of HNE algorithms. As the third contribution, we carefully refactor and amend the implementations and create friendly interfaces for 13 popular HNE algorithms, and provide all-around comparisons among them over multiple tasks and experimental settings. By putting all existing HNE algorithms under a unified framework, we aim to provide a universal reference and guideline for the understanding and development of HNE algorithms. Meanwhile, by open-sourcing all data and code, we envision serving the community with a ready-to-use benchmark platform to test and compare the performance of existing and future HNE algorithms (https://github.com/yangji9181/HNE).
HAM: Hybrid Associations Models for Sequential Recommendation
Peng B, Ren Z, Parthasarathy S and Ning X
Sequential recommendation aims to identify and recommend the next few items for a user that the user is most likely to purchase/review, given the user's purchase/rating trajectories. It becomes an effective tool to help users select favorite items from a variety of options. In this manuscript, we developed hybrid associations models (HAM) to generate sequential recommendations using three factors: 1) users' long-term preferences, 2) sequential, high-order and low-order association patterns in the users' most recent purchases/ratings, and 3) synergies among those items. HAM uses simplistic pooling to represent a set of items in the associations, and element-wise product to represent item synergies of arbitrary orders. We compared HAM models with the most recent, state-of-the-art methods on six public benchmark datasets in three different experimental settings. Our experimental results demonstrate that HAM models significantly outperform the state of the art in all the experimental settings, with an improvement of as much as 46.6%. In addition, our run-time performance comparison in testing demonstrates that HAM models are much more efficient than the state-of-the-art methods, achieving speedups of as much as 139.7-fold.
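The two representation choices named in the abstract, simplistic pooling and element-wise products, are easy to sketch; the embeddings below are random stand-ins for learned ones, and the combination rule is our simplification.

```python
import numpy as np

def pool(E):
    """Simplistic pooling: mean of the item embeddings in an association."""
    return E.mean(axis=0)                 # E: (k, d) embeddings of k items

def synergy(E):
    """Element-wise product across the set captures arbitrary-order synergies."""
    return np.prod(E, axis=0)

rng = np.random.default_rng(5)
recent_items = rng.normal(size=(3, 8))    # user's three most recent items
user_long_term = rng.normal(size=8)       # long-term preference embedding

# A hybrid user representation; candidate items would be scored against it,
# e.g., by dot product with their embeddings.
rep = user_long_term + pool(recent_items) + synergy(recent_items)
```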
VERTICOX: Vertically Distributed Cox Proportional Hazards Model Using the Alternating Direction Method of Multipliers
Dai W, Jiang X, Bonomi L, Li Y, Xiong H and Ohno-Machado L
The Cox proportional hazards model is a popular semi-parametric model for survival analysis. In this paper, we aim at developing a federated algorithm for the Cox proportional hazards model over vertically partitioned data (i.e., data from the same patient are stored at different institutions). We propose a novel algorithm, namely VERTICOX, to obtain the global model parameters in a distributed fashion based on the Alternating Direction Method of Multipliers (ADMM) framework. The proposed model computes intermediary statistics and exchanges them to calculate the global model without collecting individual patient-level data. We demonstrate that our algorithm achieves equivalent accuracy for the estimation of model parameters and statistics to that of its centralized realization. The proposed algorithm converges linearly under the ADMM framework. Its computational complexity and communication costs are polynomially and linearly associated with the number of subjects, respectively. Experimental results show that VERTICOX can achieve accurate model parameter estimation to support federated survival analysis over vertically distributed data by saving bandwidth and avoiding exchange of information about individual patients. The source code for VERTICOX is available at: https://github.com/daiwenrui/VERTICOX.
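The full ADMM algorithm is beyond a snippet, but the key vertical-partition fact the abstract relies on can be shown directly: the Cox linear predictor is additive across institutions, so only per-patient partial sums (intermediary statistics) need to be exchanged, never the patient-level features themselves. This is a hedged illustration of that decomposition, not VERTICOX itself.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 100
site_A = rng.normal(size=(n, 3))     # institution A holds 3 features per patient
site_B = rng.normal(size=(n, 2))     # institution B holds 2 other features
beta_A, beta_B = rng.normal(size=3), rng.normal(size=2)  # current local params

# Each site computes its intermediary statistic locally ...
partial_A = site_A @ beta_A
partial_B = site_B @ beta_B

# ... and only these n-vectors are exchanged. Their sum is the global Cox
# linear predictor x_i . beta, which feeds the partial likelihood.
linear_predictor = partial_A + partial_B
risk = np.exp(linear_predictor)
```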
CHEER: Rich Model Helps Poor Model via Knowledge Infusion
Xiao C, Hoang TN, Hong S, Ma T and Sun J
There is a growing interest in applying deep learning (DL) to healthcare, driven by the availability of data with multiple feature channels in data-rich environments (e.g., intensive care units). However, in many other practical situations, we can only access data with much fewer feature channels in data-poor environments (e.g., at home), which often results in predictive models with poor performance. How can we boost the performance of models learned from such a poor environment by leveraging knowledge extracted from existing models trained using rich data in a related environment? To address this question, we develop a knowledge infusion framework named CHEER that can succinctly summarize such rich models into transferable representations, which can be incorporated into the poor model to improve its performance. The infused model is analyzed theoretically and evaluated empirically on several datasets. Our empirical results showed that CHEER outperformed baselines by 5.60% to 46.80% in terms of the macro-F1 score on multiple physiological datasets.
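The abstract does not fix the infusion mechanism; the sketch below casts it as generic knowledge distillation, which is our simplification: a teacher trained on rich channels supplies soft targets that regularize the poor-channel student.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def infusion_loss(student_logits, teacher_logits, y, alpha=0.5):
    """Blend hard-label cross-entropy with a teacher-matching term."""
    p_t = softmax(teacher_logits)
    log_p_s = np.log(softmax(student_logits) + 1e-12)
    ce = -np.mean(log_p_s[np.arange(len(y)), y])      # hard labels
    kd = -np.mean(np.sum(p_t * log_p_s, axis=1))      # teacher soft targets
    return alpha * ce + (1 - alpha) * kd

rng = np.random.default_rng(9)
student = rng.normal(size=(4, 3))   # logits from the poor-channel model
teacher = rng.normal(size=(4, 3))   # logits from the rich-channel model
print(infusion_loss(student, teacher, y=np.array([0, 1, 2, 1])))
```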
HyperMinHash: MinHash in LogLog space
Yu YW and Weber GM
In this extended abstract, we describe and analyze a lossy compression of MinHash from buckets of size O(log n) to buckets of size O(log log n) by encoding using floating-point notation. This new compressed sketch, which we call HyperMinHash, as we build off a HyperLogLog scaffold, can be used as a drop-in replacement of MinHash. Unlike comparable Jaccard index fingerprinting algorithms in sub-logarithmic space (such as b-bit MinHash), HyperMinHash retains MinHash's features of streaming updates, unions, and cardinality estimation. For an additive approximation error ε on a Jaccard index t, given a random oracle, HyperMinHash needs O(ε^{-2}(log log n + log(1/(tε)))) space. HyperMinHash allows estimating Jaccard indices of 0.01 for set cardinalities on the order of 10^19 with relative error of around 10% using 2MiB of memory; MinHash can only estimate Jaccard indices for cardinalities of 10^10 with the same memory consumption.
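A much-simplified toy of the floating-point encoding (the parameters Q and R below are our choices, and the paper's estimator is more careful about collision bias): each bucket keeps a HyperLogLog-style leading-zero count plus a few low-order hash bits, instead of the full O(log n)-bit minimum hash.

```python
import hashlib

Q, R = 6, 4                      # exponent bits and mantissa bits per bucket

def hash64(x, seed):
    h = hashlib.blake2b(f"{seed}:{x}".encode(), digest_size=8).digest()
    return int.from_bytes(h, "big")

def hyperminhash_bucket(items, seed):
    """Compressed fingerprint of the minimum hash under a coarsened order:
    more leading zeros == smaller hash; low bits break ties approximately."""
    best = None
    for x in items:
        h = hash64(x, seed)
        lz = min(64 - h.bit_length(), 2 ** Q - 1)   # leading zeros ~ exponent
        cand = (-lz, h & ((1 << R) - 1))            # (exponent, mantissa bits)
        best = cand if best is None or cand < best else best
    return best

def jaccard_estimate(A, B, k=256):
    """Fraction of matching buckets estimates the Jaccard index (biased
    slightly upward by fingerprint collisions)."""
    return sum(hyperminhash_bucket(A, s) == hyperminhash_bucket(B, s)
               for s in range(k)) / k

A, B = set(range(1000)), set(range(500, 1500))
print(jaccard_estimate(A, B))    # true Jaccard = 1/3
```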
Quantifying Differential Privacy in Continuous Data Release Under Temporal Correlations
Cao Y, Yoshikawa M, Xiao Y and Xiong L
Differential Privacy (DP) has received increasing attention as a rigorous privacy framework. Many existing studies employ traditional DP mechanisms (e.g., the Laplace mechanism) as primitives to continuously release private data for protecting privacy at each time point (i.e., event-level privacy), which assume that the data at different time points are independent, or that adversaries do not have knowledge of the correlation between data. However, continuously generated data tend to be temporally correlated, and such correlations can be acquired by adversaries. In this paper, we investigate the potential privacy loss of a traditional DP mechanism under temporal correlations. First, we analyze the privacy leakage of a DP mechanism under temporal correlation that can be modeled using a Markov chain. Our analysis reveals that the event-level privacy loss of a DP mechanism may increase over time. We call this unexpected privacy loss temporal privacy leakage (TPL). Although TPL may increase over time, we find that its supremum may exist in some cases. Second, we design efficient algorithms for calculating TPL. Third, we propose data releasing mechanisms that convert any existing DP mechanism into one that protects against TPL. Experiments confirm that our approach is efficient and effective.
Secure and Efficient Skyline Queries on Encrypted Data
Liu J, Yang J, Xiong L and Pei J
Outsourcing data and computation to a cloud server provides a cost-effective way to support large-scale data storage and query processing. However, due to security and privacy concerns, sensitive data (e.g., medical records) need to be protected from the cloud server and other unauthorized users. One approach is to outsource encrypted data to the cloud server and have the cloud server perform query processing on the encrypted data only. It remains a challenging task to support various queries over encrypted data in a secure and efficient way such that the cloud server does not gain any knowledge about the data, query, and query result. In this paper, we study the problem of secure skyline queries over encrypted data. The skyline query is particularly important for multi-criteria decision making but also presents significant challenges due to its complex computations. We propose a fully secure skyline query protocol on data encrypted using semantically-secure encryption. As a key subroutine, we present a new secure dominance protocol, which can also be used as a building block for other queries. Furthermore, we demonstrate two optimizations, data partitioning and lazy merging, to further reduce the computation load. Finally, we provide both serial and parallelized implementations and empirically study the protocols in terms of efficiency and scalability under different parameter settings, verifying the feasibility of our proposed solutions.
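For reference, here is the plaintext logic the protocol must evaluate obliviously, assuming smaller is better in every dimension; the paper's contribution is computing exactly this dominance test over semantically secure ciphertexts, so the server learns nothing about data, query, or result.

```python
def dominates(p, q):
    """p dominates q if p <= q in every dimension and p < q in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Points not dominated by any other point."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

pts = [(1, 9), (3, 3), (2, 8), (5, 1), (4, 4)]
print(skyline(pts))   # [(1, 9), (3, 3), (2, 8), (5, 1)]; (4, 4) is dominated by (3, 3)
```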
Real-Time Change Point Detection with Application to Smart Home Time Series Data
Aminikhanghahi S, Wang T and Cook DJ
Change Point Detection (CPD) is the problem of discovering time points at which the behavior of a time series changes abruptly. In this paper, we present a novel real-time nonparametric change point detection algorithm called SEP, which uses Separation distance as a divergence measure to detect change points in high-dimensional time series. Through experiments on artificial and real-world datasets, we demonstrate the usefulness of the proposed method in comparison with existing methods.
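SEP's nonparametric estimator is not reproduced here; this toy merely computes a histogram-based separation distance, max_x (1 - p(x)/q(x)), between two adjacent sliding windows and flags spikes. Window size, bin count, and threshold are arbitrary choices of ours.

```python
import numpy as np

def separation(p, q):
    """Separation distance between two discrete distributions."""
    return float(np.max(1.0 - p / q))

def detect(x, w=50, bins=10, thresh=0.9):
    edges = np.histogram_bin_edges(x, bins=bins)
    hits = []
    for t in range(w, len(x) - w):
        cp, _ = np.histogram(x[t - w:t], bins=edges)   # past window
        cq, _ = np.histogram(x[t:t + w], bins=edges)   # future window
        p = (cp + 1.0) / (w + bins)   # Laplace smoothing avoids zero ratios
        q = (cq + 1.0) / (w + bins)
        if separation(p, q) > thresh:
            hits.append(t)
    return hits

rng = np.random.default_rng(7)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 1, 300)])
print(detect(x)[:3])   # indices near the true change point at t = 300
```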
Automated Phrase Mining from Massive Text Corpora
Shang J, Liu J, Jiang M, Ren X, Voss CR and Han J
As one of the fundamental tasks in text analysis, phrase mining aims at extracting quality phrases from a text corpus and has various downstream applications including information extraction/retrieval, taxonomy construction, and topic modeling. Most existing methods rely on complex, trained linguistic analyzers, and thus likely have unsatisfactory performance on text corpora of new domains and genres without extra but expensive adaption. None of the state-of-the-art models, even data-driven models, is fully automated, because they require human experts for designing rules or labeling phrases. In this paper, we propose a novel framework for automated phrase mining, AutoPhrase, which supports any language as long as a general knowledge base (e.g., Wikipedia) in that language is available, while benefiting from, but not requiring, a POS tagger. Compared to the state-of-the-art methods, AutoPhrase has shown significant improvements in both effectiveness and efficiency on five real-world datasets across different domains and languages. Besides, AutoPhrase can be extended to model single-word quality phrases.
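A toy of the distant-supervision idea only (AutoPhrase's POS-guided segmentation and robust classifier are considerably more involved): frequent n-gram candidates are scored against a positive pool standing in for knowledge-base titles.

```python
from collections import Counter

kb_titles = {"support vector machine", "topic model"}   # stand-in for Wikipedia
docs = ["we train a support vector machine on the corpus",
        "a topic model summarizes the corpus",
        "support vector machine and topic model baselines"]

def ngrams(tokens, n):
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Candidate mining: frequency-filtered bigrams and trigrams.
counts = Counter(g for d in docs for n in (2, 3) for g in ngrams(d.split(), n))
candidates = {g: c for g, c in counts.items() if c >= 2}

# Distant labels: candidates in the positive pool get high quality scores;
# a real system would train a classifier instead of assigning constants.
quality = {g: (1.0 if g in kb_titles else 0.1) for g in candidates}
print(sorted(quality, key=lambda g: (-quality[g], -counts[g]))[:3])
```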
Differentially Private Distributed Online Learning
Li C, Zhou P, Xiong L, Wang Q and Wang T
In the big data era, the generation of data presents some new characteristics, including wide distribution, high velocity, high dimensionality, and privacy concern. To address these challenges for big data analytics, we develop a privacy-preserving distributed online learning framework on the data collected from distributed data sources. Specifically, each node (i.e., data source) has the capacity of learning a model from its local dataset, and exchanges intermediate parameters with a random part of their own neighboring (logically connected) nodes. Hence, the topology of the communications in our distributed computing framework is unfixed in practice. As online learning always performs on the sensitive data, we introduce the notion of differential privacy (DP) into our distributed online learning algorithm (DOLA) to protect the data privacy during the learning, which prevents an adversary from inferring any significant sensitive information. Our model is of general value for big data analytics in the distributed setting, because it can provide rigorous and scalable privacy proof and have much less computational complexity when compared to classic schemes, e.g., secure multiparty computation (SMC). To tackle high-dimensional incoming data entries, we study a sparse version of the DOLA with novel DP techniques to save the computing resources and improve the utility. Furthermore, we present two modified private DOLAs to meet the need of practical applications. One is to convert the DOLA to distributed stochastic optimization in an offline setting, the other is to use the mini-batches approach to reduce the amount of the perturbation noise and improve the utility. We conduct experiments on real datasets in a configured distributed platform. Numerical experiment results validate the feasibility of our private DOLAs.