Leveraging Claude to Generate CUDA Kernels and Enhance Open Models

Huggingface

This article explores the use of a new tool called 'upskill' to generate and evaluate agent skills, specifically focusing on using Claude to create CUDA kernels for diffusers models. The process aims to improve the capabilities of smaller, open-source models for complex, domain-specific tasks.


We Got Claude to Build CUDA Kernels and Teach Open Models!


This blog post walks through the process of using a new tool, upskill, to generate and evaluate agent skills with large models and then use those skills with smaller models. We will benchmark upskill on the task of writing CUDA kernels for diffusers models, but the process is broadly useful for cutting costs or for applying smaller models to hard, domain-specific problems.

What are agent skills?

In case you missed it, agent skills are taking the coding-agent world by storm. The concept is straightforward: define model context as files, with instructions as markdown and code as scripts. The file format makes skills easy to generate, share, and review. In short, they're a practical medium for sharing capabilities across models and tools, and they're most useful on hard problems and specific domains, not tasks the model can already do well.
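Concretely, a skill is just a directory containing a SKILL.md file (YAML metadata plus markdown instructions) and optional supporting scripts. A minimal sketch; the frontmatter fields follow the Agent Skills format, while the body is illustrative:

```markdown
---
name: kernel-builder-cuda-kernels
description: Build CUDA kernels with HuggingFace's kernel-builder
---

# Building CUDA kernels

1. Target the right GPU architecture (e.g. compute capability 9.0 for H100).
2. Structure the project the way kernel-builder expects, with a build.toml.
3. Validate by compiling and comparing against a PyTorch reference.
```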

This post showcases this process by using Claude to generate a skill file that open-source models can use for a complex and specialized task: writing CUDA kernels.
We first tried a simple skill based on existing documentation, and we found that it improved performance for some models, but not all. In fact, it could even degrade performance or increase token usage for some models. Check out the plot below to see each model's performance with and without the basic skill.

[Figure: model performance with and without the basic skill]

Now, let's walk through how you can use upskill to upskill your agents on hard problems, and measure performance.

1. Get the teacher (Claude Opus 4.5) to build a kernel

First, we use Claude Code to build a kernel interactively and export the trace. We worked through the process by instructing the agent, validating its output, and adding documentation links. This somewhat naive process is important because it reveals the model's initial challenges. You can iterate on this multiple times: try to solve the task with draft versions of the skill, experiment with smaller models, and each time instruct the agent to improve the skill and test it on the smaller model.

Here's an example of the skill that we created and have been using to build kernels. We started from this agent trace where the agent was able to build a kernel, but not without some help.

2. Make an agent skill from the trace

Once the teacher model has performed the task, we ask it to make a skill. There are a number of effective ways to do this.

In most cases, the first two options result in functional skills. However, the performance of an agent with the skill is unknown. That's where upskill is useful: it also generates test cases for your skill based on the trace, then compares results under both scenarios, using the trace or applying the skill. We see below that the original model (Claude Opus) achieved the same performance with and without the skill, which means the skill captured the task for this model. Great!

[Figure: Claude Opus performance with and without the skill]

3. Take your skill to an open source, smaller, or cheaper model

Finally, we need to transfer our newly created skill to the tool or model we intend to use. Most tools, including Codex, Cursor, and opencode, have settled on a consistent format for skills: a directory at {agent}/skills/{skill_name}/SKILL.md. We just need to copy the skill directory to this location.
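Installing the skill is then a single copy. A sketch, where `.claude` stands in for the `{agent}` directory; substitute your tool's config directory:

```shell
# Stand-in for the generated skill, so this snippet is self-contained;
# in practice the directory already exists from the generation step.
mkdir -p kernel-builder-cuda-kernels
printf -- '---\nname: kernel-builder-cuda-kernels\n---\n' > kernel-builder-cuda-kernels/SKILL.md

# Copy the skill directory into the agent's skills folder.
mkdir -p .claude/skills
cp -r kernel-builder-cuda-kernels .claude/skills/
```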

With upskill, we can pass a skill and a set of models to the eval command, and upskill will run the test cases on those models with and without the skill to compare performance. We can see here that the skill increases accuracy on some open models, but not on all.

[Figure: accuracy of open models with and without the skill]

In this case, we might want to iterate further on the gpt-oss skills by regenerating the skill with `upskill generate --from {skill}`.

There is more to agent skills than model accuracy. Often agents can reach a given accuracy with or without a skill; they just need to consume more tokens to get there. For recurring tasks, we want to optimize agents to use fewer tokens to achieve the same accuracy. The results below reveal another dimension of the skill: some models significantly reduce their token usage, while others use more tokens with the skill. For example, with moonshotai/Kimi-K2-Thinking the skill is clearly effective in terms of both accuracy and token usage. For Claude Opus 4.5, however, there is no clear accuracy gain and token usage increases, so you would not want to use this skill with Claude Opus 4.5.

[Figure: token usage with and without the skill across models]

tl;dr: try out and evaluate models with the skills you create. Use upskill eval or a similar tool to measure each model's performance with and without skills.

That's the high-level, end-to-end flow of upskilling your coding agents on hard problems. Try out upskill now:
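A sketch of the flow; the `generate` and `eval` subcommands are the ones mentioned in this post, but the exact arguments are assumptions, so check the upskill documentation:

```shell
upskill generate "Write CUDA kernels for diffusers models with kernel-builder"
upskill eval kernel-builder-cuda-kernels
```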

Deep dive tutorial into building kernels with agent skills

We now have a high-level understanding of how to upskill an agent. Let's look at the use case we solved: writing CUDA kernels.

We didn't just want to write kernel code, but to understand the full kernel-builder workflow: project structure, build.toml configuration, architecture-specific optimizations, and PyTorch bindings. This tutorial shows how upskill creates validated skills that actually work.

The kernel-builder-cuda-kernels skill teaches Claude everything it needs to know about CUDA development: which GPU architecture to target, how to structure a kernel-builder project, when to use shared memory versus registers, and how to write PyTorch bindings.

With this skill, you can tell Claude things like:
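For example, an illustrative prompt, not one from the original post:

```text
Create a kernel-builder project with a fused RMSNorm CUDA kernel for H100,
including PyTorch bindings and a build.toml targeting compute capability 9.0.
```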

And Claude will create the complete project structure, CUDA implementation, and build configuration—following the exact conventions that kernel-builder expects.

This isn't about generating boilerplate. The skill encodes domain expertise: H100 uses compute capability 9.0, shared memory should be aligned to 128 bytes, async memory copies require CUDA_ARCH >= 900. Knowledge that would take hours to gather from documentation gets packaged into ~500 tokens that load on demand.

Setup and Install

Install upskill:

Set your API key:
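The exact commands were not preserved here. Assuming upskill ships on PyPI, installation would be `pip install upskill` (an assumption; check the project README). Since the default generator is Claude, upskill presumably reads the Anthropic SDK's standard environment variable:

```shell
# Assumption: upskill reads the Anthropic SDK's standard environment variable.
export ANTHROPIC_API_KEY="sk-ant-..."   # replace with your real key
```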

That's it. upskill uses Anthropic's Claude Opus 4.5 model by default, but also supports OpenAI and local models as generators via OpenAI-compatible endpoints. Use the more expensive, higher-quality models to generate skills and the smaller ones to use them. Think Robin Hood.

Skill Generation

Let's walk through generating a skill that teaches agents how to build CUDA kernels with HuggingFace's kernels library.

Generate the Skill

Start with a clear task description:
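A plausible invocation; the `generate` subcommand appears elsewhere in this post, while the task string is ours:

```shell
upskill generate "Build CUDA kernels for diffusers models using HuggingFace's kernel-builder: project structure, build.toml, architecture flags, and PyTorch bindings"
```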

Above we used upskill, but any agent or chat tool that can export a trace would work.

Also, we could start from an existing skill and add to it:
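For instance, using the `--from` option; the instruction string is illustrative:

```shell
upskill generate --from kernel-builder-cuda-kernels "Add guidance on shared memory alignment and async copies for Hopper GPUs"
```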

upskill loads the existing skill, applies your improvements, and re-evaluates to ensure the changes help.

upskill creates a skill, generates test cases, evaluates performance, and refines based on failures:

The baseline shows how the model performs without any skill. The "with skill" result shows performance after the skill is injected into context. A 35% improvement means the skill is working.

The skill is saved as a directory following the Agent Skills specification:
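An illustrative layout; file names other than SKILL.md are examples:

```text
kernel-builder-cuda-kernels/
├── SKILL.md        # metadata and instructions, loaded on demand
└── scripts/        # optional helper scripts referenced by SKILL.md
```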

Evaluate on a Different Model

The important test is: does this skill help local or cheaper models to build kernels?
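A sketch of that evaluation; the model-selection flag is an assumption, and the model name is the one this post evaluates:

```shell
upskill eval kernel-builder-cuda-kernels --model "unsloth/GLM-4.7-Flash-GGUF:Q4_0"
```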

A 45% improvement on "unsloth/GLM-4.7-Flash-GGUF:Q4_0" means the skill successfully transfers domain knowledge from a capable model to a faster, cheaper one. Skills that work on weaker models generally work on stronger ones too.

This is the core value proposition: use expensive models to create skills, then deploy those skills with cheap or local models.

How the evaluation in upskill works

upskill uses a teacher-student approach to evaluation: the teacher model generates the test cases on which the student model is evaluated.

If you pass an existing skill to upskill eval, it will generate test cases for the skill and evaluate the model on them. Test cases are simple input/output pairs that verify the agent understands the task:
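The on-disk format is not shown here, but conceptually each test case pairs an input prompt with a checkable expectation. An illustrative example, not upskill's actual schema:

```json
{
  "input": "Write a CUDA kernel that adds two float tensors elementwise.",
  "expected": "The project builds with kernel-builder and the kernel matches torch.add on random inputs."
}
```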

We can also test how a skill performs across different models:
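A sketch; the `--model` flag is an assumption, and the model identifiers are illustrative, taken from models mentioned in this post:

```shell
upskill eval kernel-builder-cuda-kernels \
  --model claude-haiku \
  --model moonshotai/Kimi-K2-Thinking \
  --model openai/gpt-oss-20b
```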

This helps you find the cost-performance sweet spot: maybe Haiku with the skill is good enough for your use case, saving significant API costs.

What's Next

We've shown that upskill can create validated skills that transfer domain expertise from powerful models to cheaper ones. The kernel-builder skill is just one example of what's possible.

Some things to try:

The approach works for any specialized task where you'd otherwise write detailed prompts repeatedly. Skills are portable across Claude Code, Codex, Cursor, and other tools that support the Agent Skills specification.

Resources
