bitBites (2026-01-21): TMLE, Claude skills, Stack

bitBite

Author

Chun Su

Published

January 21, 2026

R, data science and AI

Targeted Maximum Likelihood Estimation (TMLE) is a semiparametric causal inference method for estimating the average treatment effect (ATE). It yields consistent ATE estimates if either the outcome model or the treatment (propensity score) model is correctly specified, a property known as double robustness. When combined with ML models (particularly Super Learner ensembles), TMLE can improve accuracy when the exact model forms are unknown. Two blog posts by Ken Koon Wong, Bias, Variance, and Doubly Robust Estimation: Testing The Promise of TMLE in Simulated Data and Testing Super Learner’s Coverage - A Note To Myself, clarify the method through simulation and evaluate TMLE estimates in terms of bias, variance, and coverage.
Claude Skills: this term has come up frequently in my feeds over the last week. While “Tools” give an LLM specific capabilities, “Skills” provide it with a complete workflow. The D-AI-LY: An Autonomous Statistical Digest by Dmitry Shkolnik is a project-stage AI agent that automates feed ingestion for a statistical bulletin. The GitHub repo is also available.
Building an agent with {smolagents} by leveraging Hugging Face models. The GitHub notebook is available for following along with the screencast. The short course on building agents can be found in this playlist.
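
To make the double robustness of TMLE (first item above) concrete, here is a minimal sketch in plain Python. It is not taken from the linked posts: the data-generating process, the function names, and the choice of a deliberately misspecified outcome model (it ignores the confounder W) paired with a correct stratified propensity model are all assumptions made for illustration. The true ATE is 0.2 by construction, while the unadjusted difference in means is confounded.

```python
import math
import random


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def logit(p):
    return math.log(p / (1.0 - p))


def mean(xs):
    return sum(xs) / len(xs)


def simulate(n, seed=0):
    """Toy data (hypothetical): binary confounder W, treatment A, outcome Y."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        w = 1 if rng.random() < 0.5 else 0
        a = 1 if rng.random() < 0.2 + 0.6 * w else 0            # W drives treatment
        y = 1 if rng.random() < 0.2 + 0.2 * a + 0.4 * w else 0  # true ATE = 0.2
        data.append((w, a, y))
    return data


def tmle_ate(data):
    """Return (unadjusted difference, TMLE estimate) of the ATE."""
    # Initial outcome model Q0(A): deliberately misspecified, it ignores W.
    q0 = {a: mean([y for _, a_i, y in data if a_i == a]) for a in (0, 1)}
    # Propensity model g(W) = P(A=1 | W): correctly specified (per stratum).
    g = {w: mean([a for w_i, a, _ in data if w_i == w]) for w in (0, 1)}

    def clever(a, w):
        # "Clever covariate" used in the targeting step.
        return a / g[w] - (1 - a) / (1 - g[w])

    # Targeting step: one-parameter logistic regression of Y on the clever
    # covariate with offset logit(Q0), fitted by Newton's method.
    eps = 0.0
    for _ in range(50):
        score, info = 0.0, 0.0
        for w, a, y in data:
            h = clever(a, w)
            p = expit(logit(q0[a]) + eps * h)
            score += h * (y - p)
            info += h * h * p * (1.0 - p)
        step = score / info
        eps += step
        if abs(step) < 1e-10:
            break

    def q1(a, w):
        # Updated (targeted) outcome prediction.
        return expit(logit(q0[a]) + eps * clever(a, w))

    naive = q0[1] - q0[0]
    targeted = mean([q1(1, w) - q1(0, w) for w, _, _ in data])
    return naive, targeted


if __name__ == "__main__":
    naive, targeted = tmle_ate(simulate(20000, seed=1))
    print(f"unadjusted difference: {naive:.3f}")
    print(f"TMLE estimate:         {targeted:.3f}  (true ATE is 0.2)")
```

In this setup the targeting step alone has to remove the confounding bias, because the initial outcome model cannot. In practice both nuisance models would be fitted with flexible learners, which is where Super Learner comes in.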

scTenifoldKnk, published in 2022, is an in silico perturbation tool that performs virtual gene knockouts using only unperturbed single-cell RNA-seq data as input. It takes three steps to create a virtual KO effect:
1. constructing a wild-type gene regulatory network (GRN) from the data. The key technique used here is tensor decomposition.
2. generating a pseudo-KO GRN by setting the edges of the knocked-out gene to zero
3. using manifold alignment to compare the original and perturbed networks and rank genes by the degree of differential regulation inferred from their distances in the aligned manifolds.
In the recent study Transcriptome-wide Mendelian randomization and single-cell analysis during CD4+ T cell activation deciphers immunotherapeutic targets for colorectal cancer, this virtual KO approach was applied downstream of Mendelian randomization (MR) of dynamic single-cell eQTLs and colorectal cancer (CRC) GWAS to prioritize causal CRC gene candidates. This led to the identification of PARP14 and ORMDL3 as novel targets strongly associated with immunotherapy resistance and underscored the importance of CD4⁺ T cell activation in CRC progression.
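
The virtual knockout idea in the three steps above can be sketched on a toy network. This is an illustration under stated assumptions, not the scTenifoldKnk implementation: the GRN is a hand-written weight matrix rather than one inferred with tensor decomposition, "edges of the knocked-out gene" is taken to mean its outgoing edges (one row of the matrix), and step 3 is replaced by a plain Euclidean distance between each gene's wild-type and KO connectivity profiles instead of manifold alignment. Gene names and function names are hypothetical.

```python
import math


def knock_out(grn, genes, target):
    """Step 2 (sketch): copy the wild-type GRN and zero the target's outgoing edges."""
    k = genes.index(target)
    ko = [row[:] for row in grn]   # copy so the wild-type matrix is untouched
    ko[k] = [0.0] * len(genes)     # assumption: row k holds edges leaving gene k
    return ko


def rank_by_change(wt, ko, genes):
    """Step 3 (naive stand-in, NOT manifold alignment): rank genes by how far
    their incoming and outgoing edge weights moved between the two networks."""
    n = len(genes)
    dist = {}
    for i, gene in enumerate(genes):
        d2 = sum(
            (wt[i][j] - ko[i][j]) ** 2 + (wt[j][i] - ko[j][i]) ** 2
            for j in range(n)
        )
        dist[gene] = math.sqrt(d2)
    return sorted(dist.items(), key=lambda kv: kv[1], reverse=True)


# Toy wild-type GRN: wt[i][j] is the weight of the edge from gene i to gene j.
genes = ["G1", "G2", "G3", "G4"]
wt = [
    [0.0, 0.9, 0.1, 0.0],
    [0.0, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.0, 0.3],
    [0.0, 0.0, 0.0, 0.0],
]

ko = knock_out(wt, genes, "G1")
for gene, d in rank_by_change(wt, ko, genes):
    print(gene, round(d, 3))
```

Apart from the knocked-out gene itself, the genes most strongly regulated by G1 come out on top. The real method ranks genes in a jointly aligned low-dimensional space, which can also pick up indirect effects that this direct comparison misses.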

Stack is a new single-cell foundation model released by Arc Institute that builds on previous scRNA-seq foundation models such as Geneformer (2023) and scGPT (2024). The model introduces a context-aware architecture trained on ~150 million uniformly preprocessed human single cells from the AI-curated scBaseCount repository. Stack advances beyond previous models through three key innovations:
1. It tokenizes each cell’s gene expression vector (1 x p) into “gene module tokens” (d x token_n) that capture coherent biological groupings.
2. It employs a tabular transformer architecture with alternating intra- and inter-cellular attention, where intra-cellular attention learns how gene modules contribute to a cell’s identity and inter-cellular attention integrates information across all cells in a cell set (a single experiment sample) to model multi-cellular programs.
3. At the inference stage, Stack uses in-context learning: it takes a query cell set (e.g. WT condition in cell type A) and a prompt cell set (e.g. KO gene1 in cell type B) as input and leverages the relationships learned across contexts to simulate prompt-conditioned (e.g. disease, perturbation) gene expression in the query cell context (e.g. cell type, donors) without task-specific fine-tuning.
Adapted from Dong et al., Fig. 1B, to highlight the dimensions of the data at training stages 1-2.
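
To keep the data dimensions straight, here is a toy sketch of the alternating attention pattern from item 2, written for a cell set stored as nested lists of shape (n_cells x token_n x d). It is not Stack's code. It uses plain dot-product self-attention with no learned projections, assumes the gene module tokens are already computed, and assumes inter-cellular attention runs separately per token position. All function names are hypothetical.

```python
import math


def attend(vectors):
    """Dot-product self-attention over a list of d-dimensional vectors.
    No learned weights: each output is a softmax-weighted average of the inputs."""
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in vectors]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]   # numerically stable softmax
        total = sum(weights)
        out.append([
            sum(w * v[j] for w, v in zip(weights, vectors)) / total
            for j in range(d)
        ])
    return out


def intra_cell(cell_set):
    """Attention among the tokens of each cell, one cell at a time."""
    return [attend(tokens) for tokens in cell_set]


def inter_cell(cell_set):
    """Attention among the cells of the set, one token position at a time."""
    n_cells, n_tokens = len(cell_set), len(cell_set[0])
    per_token = [attend([cell[t] for cell in cell_set]) for t in range(n_tokens)]
    # per_token is indexed [token][cell]; transpose back to [cell][token].
    return [[per_token[t][c] for t in range(n_tokens)] for c in range(n_cells)]


def alternating_block(cell_set, n_layers=2):
    """Alternate intra- and inter-cellular attention; the shape is preserved."""
    for _ in range(n_layers):
        cell_set = inter_cell(intra_cell(cell_set))
    return cell_set


# Toy cell set: 3 cells, 2 gene module tokens per cell, embedding size d = 2.
cells = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 2.0]],
    [[0.5, 0.5], [1.0, 1.0]],
]
out = alternating_block(cells)
print(len(out), len(out[0]), len(out[0][0]))
```

The point of the sketch is only the flow of information: intra-cellular attention never looks outside a cell, while inter-cellular attention is what lets one cell's representation depend on the rest of the cell set.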