Weco Now Integrates with LangSmith

LangSmith is great for measuring agent quality. But once you have the evals in place, the actual improvement loop is still mostly manual.
Most teams using LangSmith already have the foundation for optimization: a dataset, a target
function, and evaluators that score outputs. What still takes time is the part between
evaluations. Someone reads failures, forms a hypothesis, edits a prompt or some code, runs
evaluate() again, and repeats.
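That manual loop has a simple shape, which the following sketch makes concrete. The `evaluate_variant` and `propose_edit` helpers are hypothetical stand-ins for a real LangSmith `evaluate()` run and a human prompt edit; only the structure of the loop is the point.

```python
# Schematic of the manual improvement loop. evaluate_variant and
# propose_edit are stand-ins for a real LangSmith evaluate() call
# and a human editing a prompt by hand.

def evaluate_variant(prompt: str) -> float:
    """Stand-in scorer: rewards prompts that mention key behaviors."""
    score = 0.5
    if "cite sources" in prompt:
        score += 0.2
    if "concise" in prompt:
        score += 0.1
    return score

def propose_edit(prompt: str, attempt: int) -> str:
    """Stand-in for a human hypothesis: append one tweak per attempt."""
    edits = [" Always cite sources.", " Keep answers concise.", " Use bullet points."]
    return prompt + edits[attempt % len(edits)]

prompt = "Answer the user's HR question."
best_prompt, best_score = prompt, evaluate_variant(prompt)
for attempt in range(3):
    candidate = propose_edit(best_prompt, attempt)
    score = evaluate_variant(candidate)
    if score > best_score:  # keep only improvements
        best_prompt, best_score = candidate, score
```

Every pass through this loop is a person reading results, guessing, and re-running; that is the part Weco automates.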
Weco now integrates directly with LangSmith to automate that loop. Point Weco at your existing dataset and evaluators, and it iteratively searches for better code and prompt changes between evaluations while tracking the best-performing version it finds.
How It Works
Instead of writing a separate shell eval script or rebuilding your scoring logic somewhere else, you tell Weco which LangSmith dataset to evaluate against, which function to call, and which evaluators to use. Weco then modifies your source code, runs the LangSmith evaluation, compares scores, and keeps iterating.
The practical effect is that the LangSmith setup you already trust for measurement becomes the thing Weco optimizes against. You do not have to create a second evaluation pipeline just to get an optimization loop.
Why It Matters
Teams spend real time building evals: curating datasets, writing evaluators, and tuning judge prompts until the scores actually reflect quality. Once that work is done, the frustrating part is that improvement is still driven by manual trial and error.
This integration turns that existing investment into leverage. The same LangSmith setup you use to measure regressions and compare variants can now drive automated improvement runs too. That means less time translating eval feedback into edits by hand, and more time actually exploring the search space.
Key Capabilities
- Use the eval stack you already have. Run Weco directly against a LangSmith dataset, target function, and evaluators instead of maintaining a separate eval command.
- Combine code evaluators and LLM judges. Mix local Python evaluators with dashboard evaluators configured in LangSmith, then combine them with a custom metric function.
- Optimize on one split, validate on another. Use dataset splits to optimize on an optimization split and then run a separate holdout pass to check whether gains generalize.
- Run from the CLI or a browser-based wizard. Power users can configure everything with flags, while new users can launch the setup wizard and wire up a run visually.
- Handle async dashboard evaluators automatically. When LangSmith dashboard evaluators return scores asynchronously, Weco polls for results so you do not have to manage that part yourself.
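The metric-combination idea above can be sketched in a few lines: given per-evaluator scores, a custom metric function reduces them to the single number the optimization drives toward. The evaluator names and weights here are illustrative assumptions, not part of Weco's or LangSmith's API.

```python
# Sketch of combining several evaluator scores into one optimization
# metric. Evaluator names and weights are illustrative only.

def combined_metric(scores: dict[str, float]) -> float:
    """Weighted average of per-evaluator scores, each in [0, 1]."""
    weights = {
        "correctness": 0.6,   # e.g. a local Python evaluator
        "tone": 0.2,          # e.g. an LLM judge from the dashboard
        "conciseness": 0.2,   # e.g. another local evaluator
    }
    total = sum(weights[name] * scores[name] for name in weights)
    return total / sum(weights.values())

print(combined_metric({"correctness": 1.0, "tone": 0.5, "conciseness": 0.8}))
```

Weighting lets you encode which qualities matter most, so the search does not trade away correctness for marginal gains on style.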
Get Started
If you already have a LangSmith dataset and evaluators, you can connect Weco to that workflow today.
The full tutorial walks through a complete example, from dataset setup to optimization to holdout validation. If you want something concrete to start from, use the ZephHR LangSmith example.
If your target function calls an LLM provider directly, install the dependencies your agent already needs alongside Weco, then follow the tutorial to connect it to your LangSmith setup.
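For orientation, a LangSmith-style target function is just a callable from an example's inputs dict to an outputs dict. The sketch below stubs out the model call so it stays self-contained; in a real setup `fake_llm` would be replaced by your provider's client.

```python
# Minimal shape of a LangSmith-style target function: it maps an
# example's inputs dict to an outputs dict. The model call is stubbed
# here; a real target would call your LLM provider instead.

def fake_llm(prompt: str) -> str:
    """Stand-in for a provider call, so the sketch runs offline."""
    return f"Answer to: {prompt}"

def target(inputs: dict) -> dict:
    question = inputs["question"]
    answer = fake_llm(f"You are an HR assistant. {question}")
    return {"answer": answer}

print(target({"question": "How many vacation days do I have?"}))
```

Anything with this inputs-to-outputs shape can be pointed at a dataset and scored by your evaluators, which is all the integration needs to start iterating.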
Already using LangSmith for evals? Read the tutorial and put that eval stack to work driving optimization too. For questions, or help adapting it to your workflow, reach out at contact@weco.ai.


