Catalogue Search | MBRL
55 result(s) for "Mathioudakis, Michael"
Certifiable Unlearning Pipelines for Logistic Regression: An Experimental Study
2022
Machine unlearning is the task of updating machine learning (ML) models after a subset of the training data they were trained on is deleted. Methods for the task are desired to combine effectiveness and efficiency (i.e., they should effectively “unlearn” deleted data, but in a way that does not require excessive computational effort (e.g., a full retraining) for a small amount of deletions). Such a combination is typically achieved by tolerating some amount of approximation in the unlearning. In addition, laws and regulations in the spirit of “the right to be forgotten” have given rise to requirements for certifiability (i.e., the ability to demonstrate that the deleted data has indeed been unlearned by the ML model). In this paper, we present an experimental study of the three state-of-the-art approximate unlearning methods for logistic regression and demonstrate the trade-offs between efficiency, effectiveness and certifiability offered by each method. In implementing this study, we extend some of the existing works and describe a common unlearning pipeline to compare and evaluate the unlearning methods on six real-world datasets and a variety of settings. We provide insights into the effect of the quantity and distribution of the deleted data on ML models and the performance of each unlearning method in different settings. We also propose a practical online strategy to determine when the accumulated error from approximate unlearning is large enough to warrant a full retraining of the ML model.
Journal Article
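The trade-off this abstract describes can be made concrete for logistic regression. The sketch below illustrates one common approximate-unlearning idea, a single Newton step on the retained data, with the residual gradient norm as a measure of how far the updated model is from a full retrain. The function names, the toy trainer, and all parameters are ours for illustration, not the paper's pipeline:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad(w, X, y, lam):
    # gradient of the L2-regularized logistic loss (labels in {0, 1})
    return X.T @ (sigmoid(X @ w) - y) / len(y) + lam * w

def hessian(w, X, lam):
    p = sigmoid(X @ w)
    return (X.T * (p * (1 - p))) @ X / len(X) + lam * np.eye(X.shape[1])

def train(X, y, lam=0.1, steps=200, lr=1.0):
    # plain gradient descent; stands in for whatever trainer produced the model
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w -= lr * grad(w, X, y, lam)
    return w

def newton_unlearn(w, X_keep, y_keep, lam=0.1):
    # one Newton step on the retained data approximates full retraining;
    # the remaining gradient norm is one way to monitor accumulated error
    H = hessian(w, X_keep, lam)
    return w - np.linalg.solve(H, grad(w, X_keep, y_keep, lam))
```

In the spirit of the online strategy the abstract mentions, one would track this residual error across successive deletions and trigger a full retraining once it exceeds a tolerance.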
Fair Max–Min Diversity Maximization in Streaming and Sliding-Window Models
2023
Diversity maximization is a fundamental problem with broad applications in data summarization, web search, and recommender systems. Given a set X of n elements, the problem asks for a subset S of k≪n elements with maximum diversity, as quantified by the dissimilarities among the elements in S. In this paper, we study diversity maximization with fairness constraints in streaming and sliding-window models. Specifically, we focus on the max–min diversity maximization problem, which selects a subset S that maximizes the minimum distance (dissimilarity) between any pair of distinct elements within it. Assuming that the set X is partitioned into m disjoint groups by a specific sensitive attribute, e.g., sex or race, ensuring fairness requires that the selected subset S contains ki elements from each group i∈[m]. Although diversity maximization has been extensively studied, existing algorithms for fair max–min diversity maximization are inefficient for data streams. To address the problem, we first design efficient approximation algorithms for this problem in the (insert-only) streaming model, where data arrive one element at a time, and a solution should be computed based on the elements observed in one pass. Furthermore, we propose approximation algorithms for this problem in the sliding-window model, where only the latest w elements in the stream are considered for computation to capture the recency of the data. Experimental results on real-world and synthetic datasets show that our algorithms provide solutions of comparable quality to the state-of-the-art offline algorithms while running several orders of magnitude faster in the streaming and sliding-window settings.
Journal Article
Scalably Using Node Attributes and Graph Structure for Node Classification
by Merchant, Arpit; Mahadevan, Ananth; Mathioudakis, Michael
in Algorithms; Classification; Euclidean space
2022
The task of node classification concerns a network where nodes are associated with labels, but labels are known only for some of the nodes. The task consists of inferring the unknown labels given the known node labels, the structure of the network, and other known node attributes. Common node classification approaches are based on the assumption that adjacent nodes have similar attributes and, therefore, that a node’s label can be predicted from the labels of its neighbors. While such an assumption is often valid (e.g., for political affiliation in social networks), it may not hold in some cases. In fact, nodes that share the same label may be adjacent but differ in their attributes, or may not be adjacent but have similar attributes. In this work, we present JANE (Jointly using Attributes and Node Embeddings), a novel and principled approach to node classification that flexibly adapts to a range of settings wherein unknown labels may be predicted from known labels of adjacent nodes in the network, other node attributes, or both. Our experiments on synthetic data highlight the limitations of benchmark algorithms and the versatility of JANE. Further, our experiments on seven real datasets of sizes ranging from 2.5K to 1.5M nodes and edge homophily ranging from 0.86 to 0.29 show that JANE scales well to large networks while also demonstrating an up to 20% improvement in accuracy compared to strong baseline algorithms.
Journal Article
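The idea of jointly using attributes and structure can be illustrated with a toy sketch: concatenate node attributes with a spectral embedding of the adjacency matrix and classify from both. The nearest-class-centroid rule below is a deliberately simple stand-in, not the model used by JANE:

```python
import numpy as np

def spectral_embedding(A, d):
    # d leading eigenvectors (by magnitude) of the symmetric adjacency matrix
    vals, vecs = np.linalg.eigh(A)
    return vecs[:, np.argsort(-np.abs(vals))[:d]]

def fit_predict(A, X_attr, labels, train_idx, d=2):
    # Concatenate attributes with structural coordinates, then classify
    # each node by its nearest class centroid among the labeled nodes.
    Z = np.hstack([X_attr, spectral_embedding(A, d)])
    classes = np.unique(labels[train_idx])
    centroids = np.array([Z[train_idx][labels[train_idx] == c].mean(axis=0)
                          for c in classes])
    dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=2)
    return classes[np.argmin(dists, axis=1)]
```

When attributes and structure agree, either signal suffices; the point of combining them, as in the abstract, is robustness when one of the two is uninformative.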
Integrated Hydrologic Analysis of Wastewater Contaminant Flows from On-Site Sewage Disposal Systems to Groundwater, Streams, and the Ocean Waters of Kane'ohe Bay, O'ahu, Hawai'i, USA
by Glenn, Craig R; Dores, Daniel E; Whittier, Robert B
in Composition; Computer simulation; Computer-generated environments
2024
On the Hawaiian Island of O'ahu, nearly 1,500 on-site sewage disposal systems (OSDS) exist within the Kane'ohe Bay drainage basin, releasing an estimated 3,800 cubic meters (one million gallons) of untreated wastewater into the groundwater each day, threatening stream and coastal water quality. The study area, Kahalu'u, Hawai'i, is characterized by the highest density (units per area) of OSDS in the Kane'ohe Bay drainage basin. This study evaluates hydrologic flow paths from wastewater point sources to groundwater and surface waters by utilizing a combination of unmanned aerial vehicle thermal infrared (UAV-TIR) imaging, stream gauging and seepage runs, and numerical groundwater models (MODFLOW and MT3DMS). Eight coastal groundwater seep locations were identified with UAV-TIR, with all seeps occurring through coastal valley fill sediments. Geochemical analysis of seeps revealed significantly elevated concentrations of all major nutrients compared to surrounding ocean waters. Groundwater nitrogen transport was modeled with MT3DMS and compared to measured concentrations. Most OSDS are located within the valley fill, and MODFLOW results suggest that the valley fill is more hydraulically conductive than the surrounding dike-intruded basalt, thus controlling groundwater and contaminant dispersion to streams and the ocean. Modeling also reveals that groundwater flow from leeward of the Ko'olau ridgeline and from adjacent watersheds to the southeast may be significant hydrologic inputs to the study area.
Journal Article
Graph Summarization via Node Grouping: A Spectral Algorithm
by Merchant, Arpit; Wang, Yanhao; Mathioudakis, Michael
in Algorithms; Cluster analysis; Clustering
2022
Graph summarization via node grouping is a popular method to build concise graph representations by grouping nodes from the original graph into supernodes and encoding edges into superedges such that the loss of adjacency information is minimized. Such summaries have immense applications in large-scale graph analytics due to their small size and high query processing efficiency. In this paper, we reformulate the loss minimization problem for summarization into an equivalent integer maximization problem. By initially allowing relaxed (fractional) solutions for integer maximization, we analytically expose the underlying connections to the spectral properties of the adjacency matrix. Consequently, we design an algorithm called SpecSumm that consists of two phases. In the first phase, motivated by spectral graph theory, we apply k-means clustering on the k largest (in magnitude) eigenvectors of the adjacency matrix to assign nodes to supernodes. In the second phase, we propose a greedy heuristic that updates the initial assignment to further improve summary quality. Finally, via extensive experiments on 11 datasets, we show that SpecSumm efficiently produces high-quality summaries compared to state-of-the-art summarization algorithms and scales to graphs with millions of nodes.
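The first phase described above can be sketched directly: cluster nodes on the k largest (in magnitude) eigenvectors of the adjacency matrix. The version below uses a hand-rolled Lloyd's k-means with a deterministic farthest-first initialization; SpecSumm additionally runs the greedy refinement phase, which is omitted here:

```python
import numpy as np

def spectral_supernodes(A, k, iters=50):
    # Spectral coordinates: k largest-magnitude eigenvectors of the
    # symmetric adjacency matrix.
    vals, vecs = np.linalg.eigh(A)
    Z = vecs[:, np.argsort(-np.abs(vals))[:k]]

    # Deterministic farthest-first initialization of the k centers.
    centers = [Z[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(Z - c, axis=1) for c in centers], axis=0)
        centers.append(Z[int(d.argmax())])
    centers = np.array(centers)

    # Plain Lloyd iterations assign each node to a supernode.
    for _ in range(iters):
        assign = np.linalg.norm(Z[:, None] - centers[None], axis=2).argmin(axis=1)
        for c in range(k):
            if (assign == c).any():
                centers[c] = Z[assign == c].mean(axis=0)
    return assign
```

On a graph made of two disjoint triangles, this recovers the two triangles as the two supernodes, which matches the intuition that spectrally similar nodes should be merged.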
Streaming Algorithms for Diversity Maximization with Fairness Constraints
by Wang, Yanhao; Mathioudakis, Michael; Fabbri, Francesco
in Algorithms; Approximation; Data search
2022
Diversity maximization is a fundamental problem with wide applications in data summarization, web search, and recommender systems. Given a set X of n elements, it asks to select a subset S of k≪n elements with maximum diversity, as quantified by the dissimilarities among the elements in S. In this paper, we focus on the diversity maximization problem with fairness constraints in the streaming setting. Specifically, we consider the max-min diversity objective, which selects a subset S that maximizes the minimum distance (dissimilarity) between any pair of distinct elements within it. Assuming that the set X is partitioned into m disjoint groups by some sensitive attribute, e.g., sex or race, ensuring fairness requires that the selected subset S contains k_i elements from each group i∈[1,m]. A streaming algorithm should process X sequentially in one pass and return a subset with maximum diversity while guaranteeing the fairness constraint. Although diversity maximization has been extensively studied, the only known algorithms that can work with the max-min diversity objective and fairness constraints are very inefficient for data streams. Since diversity maximization is NP-hard in general, we propose two approximation algorithms for fair diversity maximization in data streams, the first of which is (1−ε)/4-approximate and specific for m=2, where ε∈(0,1), and the second of which achieves a (1−ε)/(3m+2)-approximation for an arbitrary m. Experimental results on real-world and synthetic datasets show that both algorithms provide solutions of comparable quality to the state-of-the-art algorithms while running several orders of magnitude faster in the streaming setting.
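A minimal illustration of the threshold idea underlying such streaming algorithms: keep an arriving element only if it is far from everything kept so far and its group's quota is not yet full. This is a simplified one-pass heuristic, not either of the paper's algorithms, which additionally search over the distance threshold and repair infeasible groups:

```python
def fair_stream_heuristic(stream, quotas, tau, dist):
    # stream yields (element, group) pairs; quotas maps group -> k_i.
    # An element is kept if its group still has quota and it is at
    # least tau away from every element kept so far.
    kept = []
    counts = {g: 0 for g in quotas}
    for x, g in stream:
        if counts[g] < quotas[g] and all(dist(x, y) >= tau for y, _ in kept):
            kept.append((x, g))
            counts[g] += 1
    return kept
```

With a well-chosen tau, the kept elements are pairwise tau-separated and respect the per-group quotas, which is exactly the structure the approximation analysis builds on.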
Max-Min Diversification with Fairness Constraints: Exact and Approximation Algorithms
2023
Diversity maximization aims to select a diverse and representative subset of items from a large dataset. It is a fundamental optimization task that finds applications in data summarization, feature selection, web search, recommender systems, and elsewhere. However, in a setting where data items are associated with different groups according to sensitive attributes like sex or race, it is possible that algorithmic solutions for this task, if left unchecked, will under- or over-represent some of the groups. Therefore, we are motivated to address the problem of max-min diversification with fairness constraints, aiming to select k items to maximize the minimum distance between any pair of selected items while ensuring that the number of items selected from each group falls within predefined lower and upper bounds. In this work, we propose an exact algorithm based on integer linear programming that is suitable for small datasets as well as a (1−ε)/5-approximation algorithm for any ε∈(0,1) that scales to large datasets. Extensive experiments on real-world datasets demonstrate the superior performance of our proposed algorithms over existing ones.
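For very small inputs, the exact optimum targeted by the ILP can be checked by exhaustive search. The brute-force sketch below is only a stand-in for the paper's integer-programming formulation, shown to make the objective and the lower/upper group bounds concrete:

```python
from itertools import combinations

def exact_fair_max_min(points, groups, k, lower, upper, dist):
    # Enumerate all k-subsets, keep those whose per-group counts fall
    # within [lower[g], upper[g]], and return one maximizing the
    # minimum pairwise distance. Exponential in k: small inputs only.
    best, best_div = None, -1.0
    for S in combinations(range(len(points)), k):
        counts = {}
        for i in S:
            counts[groups[i]] = counts.get(groups[i], 0) + 1
        if any(not (lower[g] <= counts.get(g, 0) <= upper[g]) for g in lower):
            continue
        div = min(dist(points[i], points[j]) for i, j in combinations(S, 2))
        if div > best_div:
            best, best_div = S, div
    return best, best_div
```

The ILP achieves the same optimum without enumeration by encoding the group bounds as linear constraints over selection variables.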
Fair and Representative Subset Selection from Data Streams
by Wang, Yanhao; Mathioudakis, Michael; Fabbri, Francesco
in Algorithms; Approximation; Data mining
2021
We study the problem of extracting a small subset of representative items from a large data stream. In many data mining and machine learning applications such as social network analysis and recommender systems, this problem can be formulated as maximizing a monotone submodular function subject to a cardinality constraint k. In this work, we consider the setting where data items in the stream belong to one of several disjoint groups and investigate the optimization problem with an additional fairness constraint that limits selection to a given number of items from each group. We then propose efficient algorithms for the fairness-aware variant of the streaming submodular maximization problem. In particular, we first give a (1/2−ε)-approximation algorithm that requires O((1/ε) log(k/ε)) passes over the stream for any constant ε>0. Moreover, we give a single-pass streaming algorithm that has the same approximation ratio of (1/2−ε) when unlimited buffer sizes and post-processing time are permitted, and discuss how to adapt it to more practical settings where the buffer sizes are bounded. Finally, we demonstrate the efficiency and effectiveness of our proposed algorithms on two real-world applications, namely maximum coverage on large graphs and personalized recommendation.
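The fairness constraint here limits how many items may come from each group. An offline greedy sketch for the maximum-coverage instance makes the constraint concrete; the paper's contribution is streaming algorithms with approximation guarantees, so this plain greedy is only illustrative:

```python
def fair_greedy_coverage(sets, groups, quotas):
    # Greedy maximum coverage under per-group quotas: repeatedly pick
    # the set with the largest marginal coverage among groups whose
    # quota is not yet full.
    covered, chosen = set(), []
    counts = {g: 0 for g in quotas}
    while True:
        best, gain = None, 0
        for i, (s, g) in enumerate(zip(sets, groups)):
            if i in chosen or counts[g] >= quotas[g]:
                continue
            new = len(s - covered)
            if new > gain:
                best, gain = i, new
        if best is None:
            break
        chosen.append(best)
        covered |= sets[best]
        counts[groups[best]] += 1
    return chosen, covered
```

Coverage is monotone submodular, so greedy marginal-gain selection is the natural baseline that the streaming algorithms in the paper approximate in one or few passes.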
Certifiable Machine Unlearning for Linear Models
2021
Machine unlearning is the task of updating machine learning (ML) models after a subset of the training data they were trained on is deleted. Methods for the task are desired to combine effectiveness and efficiency, i.e., they should effectively \"unlearn\" deleted data, but in a way that does not require excessive computation effort (e.g., a full retraining) for a small amount of deletions. Such a combination is typically achieved by tolerating some amount of approximation in the unlearning. In addition, laws and regulations in the spirit of \"the right to be forgotten\" have given rise to requirements for certifiability, i.e., the ability to demonstrate that the deleted data has indeed been unlearned by the ML model. In this paper, we present an experimental study of the three state-of-the-art approximate unlearning methods for linear models and demonstrate the trade-offs between efficiency, effectiveness and certifiability offered by each method. In implementing the study, we extend some of the existing works and describe a common ML pipeline to compare and evaluate the unlearning methods on six real-world datasets and a variety of settings. We provide insights into the effect of the quantity and distribution of the deleted data on ML models and the performance of each unlearning method in different settings. We also propose a practical online strategy to determine when the accumulated error from approximate unlearning is large enough to warrant a full retrain of the ML model.