Small Models, Sparse Data: How BBB-Nuke Predicts Blood-Brain Barrier Penetration

ADMET prediction is one of the most consequential bottlenecks in drug discovery. For CNS therapeutics, the bottleneck is even narrower: only about 2% of small molecules cross the blood-brain barrier. The data to model this is sparse, noisy, and scattered across decades of literature. Here is how we built BBB-Nuke to work within those constraints, not around them.

The Sparse Data Problem in ADMET

Machine learning in drug discovery is often presented as a big-data problem. Train on millions of compounds, throw a transformer at the SMILES strings, and let the architecture figure it out. For properties like solubility or lipophilicity, where public datasets contain hundreds of thousands of measurements, this approach can work. But for ADMET endpoints that actually determine whether a drug succeeds in humans, the data landscape is remarkably thin. BBB penetration is a stark example: the B3DB dataset, the most comprehensive public collection, contains roughly 7,800 annotated compounds. Efflux transporter data is even sparser, often limited to a few hundred measurements per protein, scattered across heterogeneous assay conditions.

This scarcity is not an accident. BBB permeability is expensive to measure experimentally. PAMPA-BBB and MDCK assays provide surrogate readouts, but true in vivo brain penetration data requires rodent studies with tissue sampling. The result is a field where the most important question, “will this molecule reach the brain?”, has the least data available to answer it.

Why Smaller Models Win in Sparse Regimes

The instinct in modern ML is to scale: more parameters, more pretraining data, more compute. But in sparse ADMET regimes, this instinct is counterproductive. Large neural networks overfit small datasets. They memorize training examples rather than learning generalizable structure-activity relationships. The variance term in the bias-variance tradeoff dominates when N is small, and deep architectures amplify this.

Our experience building BBB-Nuke confirmed this empirically. Gradient-boosted trees trained on 67 carefully engineered features (10 physicochemical descriptors, 7 efflux transport signatures, and 50 fingerprint-derived PCA components) consistently outperformed neural baselines. The final classifier achieves 0.933 AUROC in 5-fold cross-validation, beating published large-model baselines including ADMETlab 2.0 (0.82) and LightBBB (0.84). The lesson is not that neural networks are bad. It is that when your training set is measured in thousands rather than millions, every feature must earn its place, and the model must be constrained enough to generalize.

In sparse data regimes, the quality of your features matters more than the capacity of your model. Sixty-seven features, each with a clear biophysical rationale, outperform black-box embeddings trained on orders of magnitude more data.
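As a rough illustration of the setup described above, the following Python sketch assembles the three feature groups into a single 67-dimensional vector and generates 5-fold splits. The block sizes (10, 7, 50) come from the text; the names FeatureBlocks and stratified_folds, the example descriptor names in the comments, and the choice to stratify the folds by class are assumptions made for illustration, not details of the BBB-Nuke codebase.

```python
import random
from dataclasses import dataclass
from typing import Iterator, List, Sequence, Tuple

# Block sizes as stated in the text: 10 + 7 + 50 = 67 features.
N_PHYSCHEM, N_EFFLUX, N_FP_PCA = 10, 7, 50


@dataclass(frozen=True)
class FeatureBlocks:
    """Hypothetical container for one molecule's three feature groups."""
    physchem: Sequence[float]  # e.g. logP, MW, TPSA (illustrative names only)
    efflux: Sequence[float]    # one substrate probability per efflux transporter
    fp_pca: Sequence[float]    # PCA components derived from a fingerprint

    def to_vector(self) -> List[float]:
        expected = (N_PHYSCHEM, N_EFFLUX, N_FP_PCA)
        actual = (len(self.physchem), len(self.efflux), len(self.fp_pca))
        if actual != expected:
            raise ValueError(f"expected block sizes {expected}, got {actual}")
        return [*self.physchem, *self.efflux, *self.fp_pca]


def stratified_folds(
    labels: Sequence[int], k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs with similar class balance per fold."""
    rng = random.Random(seed)
    fold_of = [0] * len(labels)
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            fold_of[i] = pos % k  # deal each class round-robin into k folds
    for fold in range(k):
        test = [i for i, f in enumerate(fold_of) if f == fold]
        train = [i for i, f in enumerate(fold_of) if f != fold]
        yield train, test


# Toy usage: one molecule's vector and the folds for 20 positives, 30 negatives.
vector = FeatureBlocks([0.0] * 10, [0.5] * 7, [0.1] * 50).to_vector()
toy_labels = [1] * 20 + [0] * 30
folds = list(stratified_folds(toy_labels))
```

Each fold's held-out predictions would then be scored with AUROC and averaged. With only thousands of labeled compounds, keeping the class ratio similar across folds helps keep the per-fold estimates comparable.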

Extracting Signal from Literature

If the structured databases are small, the literature is not. Over 27,000 papers across bioRxiv and NCBI PubMed contain experimental data on BBB permeability, efflux transport, and CNS drug disposition. The problem is that this data is locked in prose, tables, supplementary PDFs, and figure captions. We processed over 108 million tokens of scientific text through DeepSeek-based extraction pipelines, converting unstructured findings into structured, machine-readable annotations. This literature mining served two critical purposes. First, it expanded our training corpus. The B3DB dataset provided an initial 1,000 BBB-positive compounds; after CNS-MPO filtration and literature-augmented validation, we retained 529 high-confidence BBB-positive molecules that form the backbone of our positive training set. Second, and perhaps more importantly, it gave us the protein landscape.

The BBB-Nuke data pipeline: structured datasets and literature extraction converge on 65 barrier-functional proteins.

Mapping the Barrier Proteome

One of the most consequential decisions in building a BBB prediction model is determining which biological mechanisms to model. Most approaches treat the BBB as a passive filter: does the molecule have the right logP, the right molecular weight, the right polar surface area? BBB-Nuke takes a different view. The barrier is an active biological system, with 65 functional proteins, including influx transporters, efflux pumps, and metabolic enzymes, each recognizing specific molecular chemotypes. We identified these 65 proteins by mining the literature corpus and cross-referencing with UniProt annotations. UMAP clustering of their ligand-binding profiles reveals clear functional segregation: enzymes (CYP450 family, MAO-A, MAO-B, AChE) cluster tightly in shared ligand-recognition space, while transporters (EAAT1-3, LAT1, OATP1A2) form a distinct, more diffuse cluster consistent with their broader substrate specificity. Efflux proteins (MDR1, ABCG2, MRP1-5) cluster separately, closer to transporters than to enzymes, mirroring the biological reality of influx-efflux competition at the barrier.

Interactive UMAP projection of 65 barrier-functional proteins. Enzymes (blue), transporters (pink), and efflux (green) cluster in patterns that mirror their biological roles.

This spatial positioning is not decorative. It informs the architecture of BBB-Nuke. Rather than treating BBB penetration as a single classification problem, the pipeline models efflux transport explicitly. Seven random forest classifiers, one per major efflux transporter, predict substrate likelihood independently. These efflux signatures become features in the final classifier, allowing the model to capture the mechanistic reality that a molecule can have excellent passive permeability and still fail to accumulate in the brain because MDR1 pumps it back out.
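The two-stage design can be sketched in a few lines of Python. This is a minimal illustration, not the BBB-Nuke implementation: models are represented as plain callables, the constant scorers and toy_final_model are stand-ins invented so the example runs, and the transporter list is an assumption based on the efflux proteins named earlier (MDR1, ABCG2, MRP1-5), since the text does not state which seven are modeled.

```python
from typing import Callable, Dict, List, Sequence

# A "model" here is any callable mapping a feature vector to a probability.
Scorer = Callable[[Sequence[float]], float]

# Assumption: the seven transporters are the efflux proteins named in the text.
EFFLUX_TRANSPORTERS = ("MDR1", "ABCG2", "MRP1", "MRP2", "MRP3", "MRP4", "MRP5")


def efflux_signature(
    descriptors: Sequence[float], efflux_models: Dict[str, Scorer]
) -> List[float]:
    """First stage: one independent substrate probability per transporter."""
    return [efflux_models[name](descriptors) for name in EFFLUX_TRANSPORTERS]


def predict_bbb(
    descriptors: Sequence[float],
    efflux_models: Dict[str, Scorer],
    final_model: Scorer,
) -> float:
    """Second stage: the final classifier sees descriptors plus the signature."""
    features = [*descriptors, *efflux_signature(descriptors, efflux_models)]
    return final_model(features)


# --- Toy stand-ins so the sketch runs; real stages would be trained models. ---
def make_constant_scorer(p: float) -> Scorer:
    return lambda _descriptors: p


def toy_final_model(features: Sequence[float]) -> float:
    passive = features[0]  # pretend feature 0 is a passive-permeability score
    worst_efflux = max(features[-len(EFFLUX_TRANSPORTERS):])
    return passive * (1.0 - worst_efflux)


low_efflux = {name: make_constant_scorer(0.05) for name in EFFLUX_TRANSPORTERS}
mdr1_substrate = dict(low_efflux, MDR1=make_constant_scorer(0.9))

descriptors = [0.95]  # excellent passive permeability on this toy scale
p_clean = predict_bbb(descriptors, low_efflux, toy_final_model)
p_pumped = predict_bbb(descriptors, mdr1_substrate, toy_final_model)
```

In this toy setup the same passive-permeability score yields a much lower final prediction once the MDR1 stand-in flags the molecule as a likely substrate, which is the behavior the stacked design is meant to capture.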

Benchmarks

The standard metric in this space is AUROC on held-out compound sets. BBB-Nuke achieves 0.933 (the 5-fold cross-validation figure), compared to 0.72 for CNS-MPO (the industry-standard heuristic), 0.79 for BBB-Score, 0.82 for ADMETlab 2.0, and 0.84 for LightBBB. On an independent held-out benchmark (Benchmark 2.5), BBB-Nuke scores 0.810 AUROC, exceeding BBB-Score (0.790) on data the model was never trained on.

But AUROC alone does not capture what matters for drug discovery. The practical question is: how many molecules does the model let through that should not pass, and how many does it reject that would have worked? In a field where false positives cost months of synthesis and assay work, BBB-Nuke's precision advantage over threshold-based approaches is where the real value lands.

BBB Penetration Prediction: Model Comparison

AUROC on held-out test sets. Higher is better.

Model          AUROC
CNS-MPO        0.72
BBB-Score      0.79
ADMETlab 2.0   0.82
LightBBB       0.84
BBB-Nuke       0.933
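To make the two views of performance concrete, here is a small Python sketch that computes AUROC as a pairwise ranking probability, alongside precision and recall at a fixed cutoff. The toy labels and scores are invented for illustration and are not benchmark data, and the 0.70 cutoff is borrowed from the screening threshold reported later in this post, not an operating point published for these benchmarks.

```python
from typing import Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative.

    Ties count as half a win. The O(n_pos * n_neg) loop is fine for a demo;
    a rank-based formula is the usual choice for large sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def precision_recall_at(
    labels: Sequence[int], scores: Sequence[float], threshold: float
) -> Tuple[float, float]:
    """Precision and recall when calling score >= threshold BBB-positive."""
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented toy data: 1 = BBB-positive, 0 = BBB-negative.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.90, 0.80, 0.40, 0.75, 0.30, 0.10]

toy_auroc = auroc(toy_labels, toy_scores)
toy_precision, toy_recall = precision_recall_at(toy_labels, toy_scores, 0.70)
```

AUROC summarizes ranking quality over all thresholds, while precision at the chosen cutoff is what determines how many false positives reach synthesis and assay work. Two models with similar AUROC can differ noticeably on the second number.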

Screening a Billion Compounds

A model is only useful if it can operate at the scale the problem demands. The purchasable chemical space now exceeds 10 billion compounds across Enamine REAL, ZINC, and PubChem. We have screened over 1.02 billion compounds through BBB-Nuke, identifying 540 million hits at a P(BBB) threshold of 0.70. This is, to our knowledge, the largest BBB permeability screen ever conducted. The results are publicly available on HuggingFace (ATTN-Lab/bbbnuke-screening-1B). The screen itself serves a dual purpose: it provides immediate utility for researchers filtering compound libraries, and it generates the synthetic labels needed to train next-generation models. When curated experimental data covers thousands of compounds but the chemical space spans billions, model predictions on the broader space become a form of data augmentation, one where the augmented labels carry the biases of the original model but also capture structural patterns that no experimental dataset could cover at this scale.
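A screen of this size cannot hold the library in memory, so scoring has to be streamed. The sketch below shows one way to do that in Python under stated assumptions: score_batch is a hypothetical stand-in for whatever batch-prediction call the model exposes, the batch size is arbitrary, and the toy probabilities are made up. Only the 0.70 threshold comes from the text.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Tuple

# Hypothetical signature: a list of SMILES in, one P(BBB) per molecule out.
BatchScorer = Callable[[List[str]], List[float]]


def batched(items: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield lists of up to `size` items without materializing the input."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def screen(
    smiles_stream: Iterable[str],
    score_batch: BatchScorer,
    threshold: float = 0.70,
    batch_size: int = 10_000,
) -> Iterator[Tuple[str, float]]:
    """Lazily yield (smiles, p_bbb) for molecules at or above the threshold."""
    for batch in batched(smiles_stream, batch_size):
        for smiles, p_bbb in zip(batch, score_batch(batch)):
            if p_bbb >= threshold:
                yield smiles, p_bbb


# Toy scorer with made-up probabilities, standing in for the real model.
TOY_SCORES = {"CCO": 0.91, "c1ccccc1": 0.75, "CC(=O)O": 0.32}


def toy_score_batch(batch: List[str]) -> List[float]:
    return [TOY_SCORES.get(smiles, 0.0) for smiles in batch]


hits = list(screen(iter(TOY_SCORES), toy_score_batch, batch_size=2))
```

Because both the input and the output are iterators, the same loop works whether the source is a three-item dictionary or a file with a billion lines, and hits can be written out as they are found.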

Chemical Diversity at Scale

Scale without diversity is noise. To validate that the 1B screen covers meaningful chemical territory, we visualized the results using TMAP (Tree MAP), a dimensionality-reduction method purpose-built for large chemical libraries. TMAP preserves local neighborhood structure better than UMAP or t-SNE for molecular fingerprints, revealing scaffold clusters and chemotype boundaries that flat projections miss. The interactive maps below show the structural landscape of BBB-positive hits, BBB-negative compounds, and reference libraries (PubChem, Enamine REAL), colored by P(BBB) score, CNS-MPO, and physicochemical properties. What emerges is not a single dominant chemical family, but a distributed landscape of scaffolds spanning multiple drug-like chemotypes. That pattern is consistent with a model that generalizes across structural classes rather than memorizing a narrow training distribution.

Interactive Chemical Space Map

TMAP visualization of BBB-screened compounds, colored by P(BBB) score.

Synthetic Data as Infrastructure

The 1B screen is not the end product. It is infrastructure. In sparse data regimes, the most effective augmentation strategy is not to generate random molecules and predict their labels. It is to screen real, purchasable chemical space at scale and use the resulting predictions, combined with confidence estimates and applicability domain checks, as training signal for models that can learn representations no curated dataset could teach. This is the flywheel: curated data trains a compact, interpretable model. That model screens billions of compounds. The screen generates synthetic labels that, after filtering and validation, expand the training corpus for the next iteration. Each cycle widens the chemical coverage while the core model remains small, fast, and auditable. The alternative, waiting for experimental BBB data to accumulate at scale, would take decades.
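The filtering step in this loop can be sketched as follows. This is a minimal Python illustration under several assumptions: fingerprints are represented as sets of "on" bit indices, the applicability domain check is a nearest-neighbor Tanimoto similarity to the curated training set, and the cutoffs min_confidence and min_similarity are placeholder values. The text does not specify which confidence estimates or domain checks BBB-Nuke uses.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List, Sequence, Tuple

Fingerprint = FrozenSet[int]  # indices of the bits that are set


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two bit sets."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


@dataclass(frozen=True)
class Prediction:
    compound_id: str
    fingerprint: Fingerprint
    p_bbb: float


def select_pseudo_labels(
    predictions: Iterable[Prediction],
    training_fps: Sequence[Fingerprint],
    min_confidence: float = 0.95,  # placeholder cutoff, not from the source
    min_similarity: float = 0.40,  # placeholder cutoff, not from the source
) -> List[Tuple[str, int]]:
    """Keep confident predictions that lie close to the curated training set."""
    selected = []
    for pred in predictions:
        confident_pos = pred.p_bbb >= min_confidence
        confident_neg = pred.p_bbb <= 1.0 - min_confidence
        if not (confident_pos or confident_neg):
            continue  # the model is unsure, so the label is too noisy to reuse
        nearest = max(
            (tanimoto(pred.fingerprint, fp) for fp in training_fps), default=0.0
        )
        if nearest < min_similarity:
            continue  # outside the region where the model has seen data
        selected.append((pred.compound_id, int(confident_pos)))
    return selected


# Toy data, invented for illustration.
training = [frozenset({1, 2, 3, 4})]
candidates = [
    Prediction("A", frozenset({1, 2, 3, 5}), 0.98),     # confident, in domain
    Prediction("B", frozenset({1, 2, 3, 4}), 0.60),     # not confident
    Prediction("C", frozenset({10, 11}), 0.99),         # confident, out of domain
    Prediction("D", frozenset({1, 2, 3, 4, 5}), 0.02),  # confident negative
]
pseudo_labels = select_pseudo_labels(candidates, training)
```

A strict similarity cutoff keeps pseudo-labels trustworthy but limits how far each cycle can extend coverage, so the two thresholds trade label quality against new chemical territory.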

The response to sparse data is not to wait for more data or to build larger models. It is to build compact, mechanistically informed models that can operate at scale, and then use their predictions to bootstrap the next generation of training data.

BBB-Nuke is open source and available as both a Python package and an API. The 1B screening dataset is on HuggingFace. The TMAP visualizations are interactive and explorable. We believe the sparse data problem in ADMET is solvable, not by scaling models up, but by scaling data extraction, feature engineering, and screening infrastructure to match the size of the question.

Temitope Sobodu