arXiv: Doubly Non-Central Beta Matrix Factorization for Stable Dimensionality Reduction of Bounded Support Matrix Data

A new paper by Anjali Albert (Nagulpally) and Aaron Schein.

Abstract: We consider the problem of developing interpretable and computationally efficient matrix decomposition methods for matrices whose entries have bounded support. Such matrices are found in large-scale DNA methylation studies and many other settings. Our approach decomposes the data matrix into a Tucker representation wherein the number of columns in the constituent factor matrices is not constrained. We derive a computationally efficient sampling algorithm to solve for the Tucker decomposition. We evaluate the performance of our method using three criteria: predictability, computability, and stability. Empirical results show that our method has similar performance as other state-of-the-art approaches in terms of held-out prediction and computational complexity, but has significantly better performance in terms of stability to changes in hyper-parameters. The improved stability results in higher confidence in the results in applications where the constituent factors are used to generate and test scientific hypotheses such as DNA methylation analysis of cancer samples.

Head over to arXiv to read more: https://arxiv.org/abs/2410.18425

arXiv: Maximum a Posteriori Inference for Factor Graphs via Benders’ Decomposition

A new paper by Harsh Dubey and Ji Ah Lee is now on arXiv.

Abstract: Many Bayesian statistical inference problems come down to computing a maximum a-posteriori (MAP) assignment of latent variables. Yet, standard methods for estimating the MAP assignment do not have a finite time guarantee that the algorithm has converged to a fixed point. Previous research has found that MAP inference can be represented in dual form as a linear programming problem with a non-polynomial number of constraints. A Lagrangian relaxation of the dual yields a statistical inference algorithm as a linear programming problem. However, the decision as to which constraints to remove in the relaxation is often heuristic. We present a method for maximum a-posteriori inference in general Bayesian factor models that sequentially adds constraints to the fully relaxed dual problem using Benders’ decomposition. Our method enables the incorporation of expressive integer and logical constraints in clustering problems such as must-link, cannot-link, and a minimum number of whole samples allocated to each cluster. Using this approach, we derive MAP estimation algorithms for the Bayesian Gaussian mixture model and latent Dirichlet allocation. Empirical results show that our method produces a higher optimal posterior value compared to Gibbs sampling and variational Bayes methods for standard data sets and provides certificate of convergence.

Head over to https://arxiv.org/abs/2410.19131 to read more.

Paper: Identification of significant gene expression changes in multiple perturbation experiments using knockoffs

Our paper on knockoffs for response identification has been published in Briefings in Bioinformatics. The full article is available at https://academic.oup.com/bib/article/24/2/bbad084/7073968. This is work by Tingting Zhao and Harsh Dubey. Tingting is a former postdoc with the UMass TRIPODS project and now an Assistant Professor at Bryant University.

Summary: Large-scale multiple perturbation experiments have the potential to reveal a more detailed understanding of the molecular pathways that respond to genetic and environmental changes. A key question in these studies is which gene expression changes are important for the response to the perturbation. This problem is challenging because (i) the functional form of the nonlinear relationship between gene expression and the perturbation is unknown and (ii) identification of the most important genes is a high-dimensional variable selection problem. To deal with these challenges, we present here a method based on the model-X knockoffs framework and Deep Neural Networks to identify significant gene expression changes in multiple perturbation experiments. This approach makes no assumptions on the functional form of the dependence between the responses and the perturbations and it enjoys finite sample false discovery rate control for the selected set of important gene expression responses. We apply this approach to the Library of Integrated Network-Based Cellular Signature data sets which is a National Institutes of Health Common Fund program that catalogs how human cells globally respond to chemical, genetic and disease perturbations. We identified important genes whose expression is directly modulated in response to perturbation with anthracycline, vorinostat, trichostatin-a, geldanamycin and sirolimus. We compare the set of important genes that respond to these small molecules to identify co-responsive pathways. Identification of which genes respond to specific perturbation stressors can provide better understanding of the underlying mechanisms of disease and advance the identification of new drug targets.