How Vardera Used Weco to Push Their Models From 78% to 93%

June 9, 2026 • By Vayum Arora

"Weco has been a step-change multiplier for our engineering team and helped us close multiple customer contracts."

Vardera builds AI-driven authentication, appraisal, and valuation for resellers, auction houses, and collectors. Their ML team runs multiple classification models across image, text, and decision logic, and uses Weco as the optimization layer that lifts each one's metrics.

+17.7% F1 on the first Weco run against their PMI classifier

+18.9% precision at recall ≥ 0.40 across CatBoost runs

200+ Weco runs across Vardera's pipeline

Key Takeaways

- Vardera's ML team optimizes models across multiple classification problems: grading, authentication, and risk detection.
- F1 climbed 17.7% (0.583 → 0.686) against their PMI-based classifier.
- Precision at recall ≥ 0.40 climbed 18.9% (0.789 → 0.938) across CatBoost runs.
- Vardera has run 200+ Weco optimizations across the rest of their pipeline, moving key metrics by over 50%.
- Weco wrote features the team hadn't written: cross-field contradictions, joint encodings of individually-neutral signals, granular decomposition of aggregate scores, stylometric signals in free text, and missing data treated as information.
- Every change Weco proposed shipped as auditable code, reviewable as a normal PR, with no black-box embeddings.

Figure: Recall at a fixed false-positive rate, baseline versus Weco's best run across six FPR operating points. The largest recall gains come at the tightest false-positive budgets.

The problem class: adversarial classification

Authentication, grading, and counterfeit detection are all adversarial classification problems. The adversary adapts continuously. And the metric that matters in production isn't holdout AUC — it's precision at a fixed recall, or recall at a fixed false-positive budget, set by product and risk leaders rather than the modeler.
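To make the two operating-point metrics concrete, here is a minimal sketch in plain Python. It is not Vardera's evaluation code: the function names, the threshold sweep, and the default values are assumptions chosen for illustration, and both classes are assumed to be present in the labels.

```python
def _sweep(scores, labels):
    """Yield cumulative (tp, fp) counts as the threshold drops past each distinct score."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = fp = 0
    i = 0
    while i < len(pairs):
        current = pairs[i][0]
        # Consume tied scores together so a threshold never splits equal scores.
        while i < len(pairs) and pairs[i][0] == current:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        yield tp, fp


def precision_at_recall(scores, labels, min_recall=0.40):
    """Best precision over all thresholds whose recall is at least min_recall."""
    positives = sum(labels)  # assumes at least one positive example
    best = 0.0
    for tp, fp in _sweep(scores, labels):
        if tp / positives >= min_recall:
            best = max(best, tp / (tp + fp))
    return best


def recall_at_fpr(scores, labels, max_fpr=0.05):
    """Best recall over all thresholds whose false-positive rate is at most max_fpr."""
    positives = sum(labels)
    negatives = len(labels) - positives  # assumes at least one negative example
    best = 0.0
    for tp, fp in _sweep(scores, labels):
        if fp / negatives <= max_fpr:
            best = max(best, tp / positives)
    return best
```

Both functions take model scores (higher means riskier) and 0/1 labels. The recall floor and the false-positive budget are arguments so that product or risk owners can set them, which mirrors how the operating point is chosen outside the modeling team.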

Most teams hit the same ceiling. Hyperparameter tuning, ensembling, and adding more data all show diminishing returns past a competent baseline. The remaining headroom lives in the code itself: features, model architecture, decision logic. That's the slowest and most expensive layer to iterate on. Hiring another ML engineer costs $200K+ a year plus months of onboarding, and it doesn't change the iteration cadence.

Vardera was running multiple baseline approaches in active development — PMI-based scoring, Naive Bayes, and a TF-IDF centroid classifier — on an adversarial training set of 414 hand-labeled examples with a 3:1 safe-to-risky imbalance.
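For readers unfamiliar with PMI-based scoring, the sketch below shows one common way to build such a baseline. It is an illustration only and not Vardera's implementation: the tokenized input format, the smoothing scheme, and the averaging rule in `score` are all assumptions.

```python
import math
from collections import Counter


def train_pmi(docs, labels, smoothing=1.0):
    """Per-token PMI with the risky class (label 1), from document frequencies.

    PMI(token, risky) = log(P(risky | token) / P(risky)).
    Assumes both classes appear in the training labels.
    """
    n_docs = len(docs)
    p_risky = sum(labels) / n_docs
    df = Counter()        # documents containing the token
    df_risky = Counter()  # risky documents containing the token
    for tokens, y in zip(docs, labels):
        for t in set(tokens):
            df[t] += 1
            if y == 1:
                df_risky[t] += 1
    pmi = {}
    for t, count in df.items():
        # Smooth toward the class prior so rare tokens do not produce log(0).
        p_risky_given_t = (df_risky[t] + smoothing * p_risky) / (count + smoothing)
        pmi[t] = math.log(p_risky_given_t / p_risky)
    return pmi


def score(tokens, pmi):
    """Average PMI of the known tokens; 0.0 when no token was seen in training."""
    vals = [pmi[t] for t in set(tokens) if t in pmi]
    return sum(vals) / len(vals) if vals else 0.0
```

A positive score means the listing's tokens co-occur with the risky class more often than the class prior would predict. With only a few hundred labeled examples, the smoothing term matters because many tokens appear once or twice.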

Beneath all of this sat a single constraint: a classification model improves only as fast as an engineer can propose, build, and test new features and architectures, one experiment at a time.

The approach: what Weco does

Weco operates as an automated ML engineer. You point it at your training pipeline, set the metric you want moved, and walk away. An LLM proposes code changes, an evaluator built from your metric and data scores each one, and a tree search keeps context across attempts so good branches get extended and dead ends get pruned. The underlying algorithm, AIDE, is open source and topped OpenAI's MLE-Bench.
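The propose, score, and search loop described above can be sketched as a toy. This is not AIDE's or Weco's implementation: here a "candidate" is a single number standing in for a version of the source file, `propose` is a random perturbation standing in for an LLM edit, and the 80/20 selection rule is an assumption. Nothing is pruned explicitly; weak branches are simply chosen less often.

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    candidate: float                      # stand-in for one version of the code
    score: float                          # evaluator output for that version
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def tree_search(root_candidate: float,
                propose: Callable[[float, random.Random], float],
                evaluate: Callable[[float], float],
                steps: int = 50,
                seed: int = 0) -> Node:
    """Grow a tree of candidates and return the best-scoring node."""
    rng = random.Random(seed)
    root = Node(root_candidate, evaluate(root_candidate))
    nodes = [root]
    for _ in range(steps):
        # Mostly extend the best node so far; sometimes revisit another branch
        # so the search does not commit to one line of attack too early.
        if rng.random() < 0.8:
            parent = max(nodes, key=lambda n: n.score)
        else:
            parent = rng.choice(nodes)
        candidate = propose(parent.candidate, rng)
        child = Node(candidate, evaluate(candidate), parent)
        parent.children.append(child)
        nodes.append(child)
    return max(nodes, key=lambda n: n.score)


if __name__ == "__main__":
    best = tree_search(
        0.0,
        propose=lambda x, rng: x + rng.uniform(-1.0, 1.0),
        evaluate=lambda x: -(x - 3.0) ** 2,   # toy metric, maximized at x = 3
    )
    print(best.candidate, best.score)
```

Because every node keeps a link to its parent, the path from the root to the best node is a readable history of the changes that led to the final score.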

For Vardera, the first Weco run targeted F1 on the PMI-based scoring approach. F1 moved from 0.583 to 0.686 (up 17.7%), and along the way, Weco wrote features the team hadn't written.

Subsequent runs came after the team tightened their operating metric to precision at recall ≥ 0.40 and switched the underlying model to CatBoost. Across those runs, precision climbed from 0.789 to 0.938 (up 18.9%), and 100+ iterations later, the cumulative result landed at roughly a 50% improvement on their metrics.

What Weco produced, against the team's own production metrics:

Metric	Before Weco	After Weco	Change
F1 score	0.583	0.686	+17.7%
Precision at recall ≥ 0.40	0.789	0.938	+18.9%

The first result came from a single 50-step run on a PMI-based scoring approach the team had already hand-tuned to a standstill. The second came after the model was switched to CatBoost and the team tightened the metric to match their real production constraints. The team has since completed 200+ Weco runs across the rest of its pipeline.
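The percentages in the table are relative lifts over the baseline value, not absolute point gains. A short check, using only the before and after figures reported above (the helper name is ours):

```python
def relative_lift(before: float, after: float) -> float:
    """Percentage change of `after` relative to `before`."""
    return (after - before) / before * 100.0


f1_lift = relative_lift(0.583, 0.686)
precision_lift = relative_lift(0.789, 0.938)

print(f"F1: +{f1_lift:.1f}%")                                # F1: +17.7%
print(f"Precision at recall >= 0.40: +{precision_lift:.1f}%")  # +18.9%
```

In absolute terms the same results are gains of 0.103 in F1 and 0.149 in precision.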

The solution patterns: feature and logic discoveries Weco made

The interesting part isn't the score — it's what Weco actually wrote. A few patterns recurred across runs, and they generalize past any single classification problem:

- Cross-field contradictions. One field disagrees with another it claims to describe. For Vardera, that surfaced as year-of-issue declared in a title vs. tagged in metadata. The same shape generalizes to declared vs. observed device type, stated vs. measured geography, claimed credential vs. inferred capability.
- Joint encodings of individually-neutral features. A high-risk region paired with a high-value item class, scored as a single interaction rather than two additive signals.
- Granular decomposition of aggregate scores. Splitting a single risk score into its component counts (risky / safe / total) and using each as an independent input.
- Stylometric signals in free text. Capitalization ratio, repetition density in one field but not another, disclaimer text appearing in a description but absent from the title.
- Missing data as information. Treating an absent certification tag as suspicious rather than neutral when the rest of the listing makes a strong claim.
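The five patterns above can be sketched as ordinary feature code. Everything specific here is hypothetical and chosen for illustration: the listing field names (`title`, `description`, `metadata_year`, `region`, `item_class`, `certification`), the term lists, and the region and class sets are not Vardera's schema or Weco's output.

```python
import re

HIGH_RISK_REGIONS = {"region_a"}        # hypothetical
HIGH_VALUE_CLASSES = {"class_x"}        # hypothetical
RISKY_TERMS = {"replica", "unverified"}  # hypothetical
SAFE_TERMS = {"certified", "graded"}     # hypothetical
DISCLAIMER = "not authenticated"         # hypothetical


def extract_features(listing: dict) -> dict:
    title = listing.get("title", "")
    description = listing.get("description", "")
    feats = {}

    # 1. Cross-field contradiction: year in the title vs. year in metadata.
    match = re.search(r"\b(1[89]\d{2}|20\d{2})\b", title)
    title_year = int(match.group(1)) if match else None
    meta_year = listing.get("metadata_year")
    feats["year_mismatch"] = int(
        title_year is not None and meta_year is not None and title_year != meta_year
    )

    # 2. Joint encoding: one interaction flag instead of two additive signals.
    feats["risky_region_x_high_value"] = int(
        listing.get("region") in HIGH_RISK_REGIONS
        and listing.get("item_class") in HIGH_VALUE_CLASSES
    )

    # 3. Decomposition: expose the counts behind an aggregate risk score.
    tokens = re.findall(r"[a-z]+", f"{title} {description}".lower())
    feats["risky_count"] = sum(t in RISKY_TERMS for t in tokens)
    feats["safe_count"] = sum(t in SAFE_TERMS for t in tokens)
    feats["token_count"] = len(tokens)

    # 4. Stylometric signals: capitalization ratio and a disclaimer that
    #    appears in the description but not in the title.
    letters = [c for c in title if c.isalpha()]
    feats["title_caps_ratio"] = (
        sum(c.isupper() for c in letters) / len(letters) if letters else 0.0
    )
    feats["disclaimer_only_in_description"] = int(
        DISCLAIMER in description.lower() and DISCLAIMER not in title.lower()
    )

    # 5. Missing data as information: a strong claim with no certification tag.
    makes_claim = "authentic" in title.lower()
    feats["claim_without_certification"] = int(
        makes_claim and not listing.get("certification")
    )
    return feats
```

Each feature is a few lines of readable code, which is what makes this kind of output reviewable in a normal pull request.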

These aren't features AutoML proposes, because AutoML doesn't touch feature code. They aren't what a general-purpose LLM coding loop converges on inside a sane budget either. They surface when a code-level search has enough context to explore a domain and enough selection pressure to keep what works.

Vardera has since run 200+ Weco optimizations across the rest of their pipeline. The team also began using Weco as a second phase of training — running it on model outputs against ground truth to push final classifier accuracy higher.

The outcome: faster iteration without losing auditability

The bigger shift isn't the lift on any single metric. It's that the iteration cadence stopped being bound by how fast an engineer could propose new features.

Every ML team's improvement loop is rate-limited by code throughput: one experiment at a time, one engineer's working memory. Weco runs that loop in the background, in parallel, until the metric stops moving. Continuous re-optimization against drift becomes tractable rather than a budget decision: the cost of re-running on new data is small next to any serious loss rate.

Every change Weco proposes is ordinary code. Vardera reads it, reviews it, and version-controls it like any other PR. There's no black-box embedding, no model-as-a-service handoff.

Vardera didn't replace their ML engineer. They gave the one they already had a second one that ran overnight.

Try it on your own pipeline

The fastest way to find out whether Weco moves your metric is to point it at the pipeline you already have.

pip install weco

weco run --source train.py \
         --eval-command "python evaluate.py" \
         --metric recall_at_fpr \
         --goal maximize \
         --steps 50

--source is the file Weco rewrites. --eval-command is whatever you already run to score a model. If you'd rather see it run on your own stack first, book a 20-minute call.
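As a rough idea of what the evaluation side might look like, here is a hypothetical `evaluate.py`. It assumes Weco picks up the metric from the evaluation command's printed output as a `name: value` line matching `--metric`; confirm the expected format against the Weco documentation before relying on it. The labels and scores are placeholder data, and the 0.25 false-positive budget is arbitrary.

```python
# evaluate.py (hypothetical): scores the current model and prints one metric line.

def recall_at_fpr(scores, labels, max_fpr):
    """Best recall over thresholds whose false-positive rate stays within max_fpr."""
    positives = sum(labels)
    negatives = len(labels) - positives
    best = 0.0
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        if fp / negatives <= max_fpr:
            best = max(best, tp / positives)
    return best


def format_metric(name, value):
    """Render the metric as a single 'name: value' line."""
    return f"{name}: {value:.4f}"


def main():
    # Placeholder data. A real script would load a held-out set and score it
    # with the model produced by train.py.
    labels = [1, 1, 0, 1, 0, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
    print(format_metric("recall_at_fpr", recall_at_fpr(scores, labels, max_fpr=0.25)))


if __name__ == "__main__":
    main()
```

Keeping the held-out data and the metric definition in the evaluation script, outside the file being rewritten, means the search cannot improve the score by changing how it is measured.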
