--- title: "Outlier-Robust Longitudinal Imputation with smriti" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Outlier-Robust Imputation with smriti} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, echo = FALSE} knitr::opts_chunk$set(collapse = TRUE, comment = "#>") ``` ## The Outlier Problem in Longitudinal Data Real-world longitudinal data — clinical measurements, sensor readings, behavioural logs — routinely contains outliers. A single faulty blood pressure reading, a corrupted accelerometer spike, or a miscoded survey response can poison the sample covariance matrix and cascade through the entire imputation pipeline. Most imputation methods are vulnerable: - **mice** (CART/method = "cart"): regression trees split on individual points; a single extreme value can dominate a leaf node. - **missForest** / **missRanger**: random forests are more robust than linear models, but the bootstrapped trees still propagate outlier influence through the response variable. - **Amelia**: assumes multivariate normality; a single 5-sigma outlier inflates the EM-estimated covariance. - **FIML** (lavaan): maximum likelihood under normality; outliers inflate residual variances. smriti's `robust = TRUE` mode takes a different approach: it constructs the *target covariance manifold* from rank-based (Spearman) correlations and median absolute deviation (MAD) scale estimates, then projects the result to the nearest positive-semidefinite matrix (Higham 1988). This target is structurally immune to outliers before the Lagrangian routing even begins. ## Demonstration: 5% Contamination We simulate a 4-wave longitudinal study (N = 200) where 5% of subjects have their measurements shifted by +5 SD — a plausible sensor-artefact scenario. ```{r setup, message=FALSE} library(smriti) set.seed(20250601) n <- 200; t_points <- 4 # ── Generate clean data with a linear growth process ────────────────────── generate_data <- function(n, add_outliers = FALSE) { latent_intercept <- rnorm(n, 6, 1) latent_slope <- rnorm(n, 2, 1) data_mat <- matrix(0, n, t_points) for (j in seq_len(t_points)) { data_mat[, j] <- latent_intercept + (j - 1) * latent_slope + rnorm(n, 0, 1) } if (add_outliers) { idx <- sample(seq_len(n), floor(0.05 * n)) data_mat[idx, ] <- data_mat[idx, ] + 5.0 # +5 SD shift } colnames(data_mat) <- paste0("T", seq_len(t_points)) as.data.frame(data_mat) } df_clean <- generate_data(n, add_outliers = FALSE) df_outlier <- generate_data(n, add_outliers = TRUE) # ── Induce 15% MAR missingness (same pattern for both) ──────────────────── set.seed(42) apply_mar <- function(df) { df_miss <- df for (t in 1:(t_points - 1)) { idx <- which(!is.na(df_miss[, t])) x_prev <- scale(df_miss[idx, t]) p_miss <- 1 / (1 + exp(-(x_prev - qnorm(1 - 0.15)))) drop_idx <- idx[rbinom(length(idx), 1, p_miss) == 1] df_miss[drop_idx, t + 1] <- NA } df_miss } df_clean_miss <- apply_mar(df_clean) df_outlier_miss <- apply_mar(df_outlier) cat("Clean data missingness: ", sum(is.na(df_clean_miss)), "cells\n") cat("Outlier data missingness:", sum(is.na(df_outlier_miss)), "cells\n") ``` ### Clean Data: All Methods Perform Similarly ```{r clean-comparison, eval=FALSE} # On clean Normal data, default and robust modes produce similar results. imp_clean_default <- smriti_impute(df_clean_miss, time_cols = 1:4, robust = FALSE) imp_clean_robust <- smriti_impute(df_clean_miss, time_cols = 1:4, robust = TRUE) ``` ### Contaminated Data: The Robust Advantage ```{r outlier-comparison, eval=FALSE} # On outlier-contaminated data, the robust mode preserves the true structure. imp_outlier_default <- smriti_impute(df_outlier_miss, time_cols = 1:4, robust = FALSE) imp_outlier_robust <- smriti_impute(df_outlier_miss, time_cols = 1:4, robust = TRUE) # Compare recovered covariance against the true (clean) population matrix. true_cov <- cov(df_clean[, 1:4]) # no missingness, no outliers cat("Default mode Frobenius distance from truth:", sqrt(sum((cov(imp_outlier_default[, 1:4]) - true_cov)^2)), "\n") cat("Robust mode Frobenius distance from truth:", sqrt(sum((cov(imp_outlier_robust[, 1:4]) - true_cov)^2)), "\n") ``` In our production benchmarks (500 replications across 90 conditions), the robust mode consistently improves covariance recovery by 1–3 percentage points over missForest under outlier-contaminated MAR data. The Spearman/MAD target matrix isolates the structural signal from the contamination before the Lagrangian routing step, preventing outlier-induced variance inflation. ## When to Use robust = TRUE | Scenario | Recommendation | |----------------------------------------|-------------------------| | Clean, approximately Normal data | `robust = FALSE` (Pearson) | | Known sensor artefacts or data errors | `robust = TRUE` | | Heavy-tailed distributions (not skewed)| `robust = TRUE` | | Severely skewed (e.g. lognormal) | `smriti_fiml()` or `custom_target` | The robust mode is not a cure for skew. For lognormal or otherwise asymmetric distributions, use `smriti_fiml()` to derive the target covariance from a properly specified latent growth model, or supply your own `custom_target` matrix. ## Integration with Existing Pipelines The robust mode is accessible from all convenience wrappers: ```{r wrappers, eval=FALSE} # missForest + smriti robust refinement smriti_forest(df, time_cols = 1:4, robust = TRUE) # missRanger + smriti robust refinement smriti_ranger(df, time_cols = 1:4, robust = TRUE) # Multiple imputation with robust targets on every replicate smriti_mi(df, time_cols = 1:4, m = 10, robust = TRUE) ```