Open-sourcing NVIDIA cuVS Node.js bindings

We built Node.js bindings for NVIDIA cuVS. Today we open-source them.

NVIDIA cuVS is the GPU-accelerated vector search library at the center of the company’s enterprise AI strategy. At GTC 2026, Jensen Huang called structured data “the foundation of trustworthy AI.”

cuVS is being integrated into Elasticsearch, Weaviate, Milvus, Oracle, Apache Lucene, OpenSearch, and FAISS.

cuVS has official bindings for C, C++, Python, Rust, Java, and Go.

Problem was, it had zero presence in the Node.js ecosystem. We fixed that.

What we built

cuvs-node gives Node.js developers direct access to GPU-accelerated vector search. Native C++ bindings to the cuVS C API via N-API. No Python subprocess, no microservice, no managed vector database. In-process, on GPU.

Five algorithms, covering every major vector search strategy:

  • CAGRA - GPU-native graph-based ANN. The flagship algorithm in cuVS. Best general-purpose approximate nearest neighbor search on GPU.
  • IVF-Flat - Inverted file index with uncompressed lists. Fast to build, exact distances within probed lists.
  • IVF-PQ - Inverted file with product quantization. Lower memory footprint for very large datasets.
  • Brute-force - Exact nearest neighbor search. Ground truth baseline.
  • HNSW - CPU-side graph search, built by converting a GPU CAGRA index. Build on GPU for speed, serve on CPU for cost.

Show me the code

const { Resources, CagraIndex } = require('cuvs-node')
const res = new Resources()
// Build an index from 10K vectors, 128 dimensions
const dataset = new Float32Array(10000 * 128)
for (let i = 0; i < dataset.length; i++) dataset[i] = Math.random()
const index = CagraIndex.build(res, dataset, { rows: 10000, cols: 128 })
// Search for 10 nearest neighbors
const queries = new Float32Array(3 * 128)
for (let i = 0; i < queries.length; i++) queries[i] = Math.random()
const { indices, distances } = index.search(res, queries, { rows: 3, cols: 128, k: 10 })
// Save and reload
index.serialize(res, './my-index.bin')
const loaded = CagraIndex.deserialize(res, './my-index.bin')
res.dispose()

That’s it. Build an index, search it, save it, load it. A dozen lines.

Performance

All benchmarks on Lambda.ai infrastructure, 128-dimensional float32 vectors, CAGRA algorithm.

Index build (100,000 vectors)

GPU         VRAM    Time      Throughput
A10         24GB    1,225ms   81,700 vectors/sec
A100 SXM    40GB    541ms     184,700 vectors/sec
GH200       96GB    211ms     474,200 vectors/sec

Search (100 queries, k=10, 100K vector index)

GPU         VRAM    Latency   Throughput
A10         24GB    1.4ms     71,900 queries/sec
A100 SXM    40GB    1.3ms     77,000 queries/sec
GH200       96GB    0.8ms     121,600 queries/sec

Sub-millisecond search on GH200. Under 1.5ms even on a budget A10.

GPU vs CPU: 733x faster index builds

We benchmarked cuvs-node against hnswlib-node, the most popular CPU vector search library for Node.js. Same machine, same data, same Node.js runtime. The only difference: GPU (CAGRA) vs CPU (HNSW).

Hardware: NVIDIA A100 SXM 40GB + AMD EPYC 7J13 30 vCPU, on Lambda.ai.
(We ran on many other GPUs and infrastructure providers; Lambda.ai is the one we selected for baseline numbers.)

Index build time

Vectors   Dimensions   GPU (cuvs-node)   CPU (hnswlib-node)   Speedup
100K      128          0.6s              60s                  100x
250K      128          1.1s              3.3min               183x
500K      128          1.8s              8.0min               263x
1M        128          3.4s              17.4min              303x
5M        128          17.3s             107.5min             373x
10M       128          35.5s             232.5min             393x
100K      768          1.1s              4.9min               267x
250K      768          2.0s              14.0min              431x
500K      768          3.1s              30.6min              600x
1M        768          5.3s              65.2min              733x

The gap widens with scale and dimensionality. At 1M vectors with 768 dimensions (a common embedding size for production workloads), the GPU builds the index in 5.3 seconds. The CPU takes over an hour.
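
For a sense of what such a comparison involves, here is a minimal sketch of a build-timing harness, assuming the cuvs-node API shown above and hnswlib-node’s published HierarchicalNSW API. It is not our exact benchmark code.

// Hedged sketch: GPU vs CPU index build timing. Assumes the cuvs-node
// API shown above and hnswlib-node's HierarchicalNSW API.
const { Resources, CagraIndex } = require('cuvs-node')
const { HierarchicalNSW } = require('hnswlib-node')

const rows = 100000, cols = 128
const data = new Float32Array(rows * cols).map(() => Math.random())

// GPU build (CAGRA)
const res = new Resources()
let t0 = process.hrtime.bigint()
CagraIndex.build(res, data, { rows, cols })
console.log('GPU build:', Number(process.hrtime.bigint() - t0) / 1e6, 'ms')

// CPU build (HNSW), one point at a time
const hnsw = new HierarchicalNSW('l2', cols)
hnsw.initIndex(rows)
t0 = process.hrtime.bigint()
for (let i = 0; i < rows; i++) {
  hnsw.addPoint(Array.from(data.subarray(i * cols, (i + 1) * cols)), i)
}
console.log('CPU build:', Number(process.hrtime.bigint() - t0) / 1e6, 'ms')
res.dispose()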

Search latency

Vectors   Dimensions   GPU (cuvs-node)   CPU (hnswlib-node)
1M        128          1.5ms             28.1ms
1M        768          2.1ms             88.6ms
5M        128          1.5ms             33.7ms

GPU search stays under 2.5ms regardless of scale. CPU search degrades as the index grows.

99 tests, five GPU types

The full test suite covers all five algorithms: build correctness, search result validation (index ranges, distance ordering, self-search accuracy), serialize/deserialize round-trips, input rejection, and benchmark stability across scales from 10K to 100K vectors. 99 tests. All passing.

Verified on five NVIDIA GPU types: A10, A100, H100, GH200, and B200.

Why this matters

The Node.js ecosystem has over 2 million packages and millions of active developers. GPU-accelerated vector search was locked behind Python, Rust, or managed services like Pinecone and Weaviate. If your backend was Node.js, your options were:

  1. Add Python to your stack (complexity, deployment overhead)
  2. Call a managed vector database over the network (latency, cost, vendor lock-in)
  3. Use a JS-only ANN library like hnswlib-node (CPU-bound, orders of magnitude slower)

Now there is a fourth option: in-process GPU vector search, native to Node.js. Build indexes at 474K vectors/sec. Search in under a millisecond. No Python. No network hop. No managed service.

Build on GPU, serve on CPU. The CAGRA-to-HNSW conversion means you can build your index on a GPU instance and serve queries from a CPU-only deployment. GPU for the heavy lifting, CPU for the serving cost.
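
Sketched in code, the flow looks like this. The HnswIndex name and fromCagra signature below are illustrative assumptions; check the cuvs-node repo for the actual conversion API.

// Build on GPU, convert once, serve from CPU-only machines.
// HnswIndex.fromCagra is an assumed binding name, not confirmed API.
const { Resources, CagraIndex, HnswIndex } = require('cuvs-node')

const rows = 10000, cols = 128
const dataset = new Float32Array(rows * cols).map(() => Math.random())

const res = new Resources()
const cagra = CagraIndex.build(res, dataset, { rows, cols }) // GPU: fast build
const hnsw = HnswIndex.fromCagra(res, cagra)                 // convert the CAGRA graph
hnsw.serialize(res, './hnsw-index.bin')                      // ship to the CPU serving fleet
res.dispose()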

What’s next

cuvs-index - a schema-driven query engine built on top of cuvs-node. Define entities and fields, plug in a storage adapter (DynamoDB, MongoDB), and run hybrid queries that combine structured filters with vector similarity. Already on npm and GitHub.
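
As a taste of the direction, a hypothetical sketch: the names below (defineEntity, DynamoAdapter, query) are illustrative, not the published cuvs-index API; see the package docs for the real schema syntax.

// Hypothetical sketch of a schema-driven hybrid query with cuvs-index.
const { defineEntity, DynamoAdapter } = require('cuvs-index')

const Product = defineEntity('product', {
  fields: {
    title: 'string',
    price: 'number',
    embedding: { type: 'vector', dims: 768 },
  },
  storage: new DynamoAdapter({ table: 'products' }), // pluggable storage adapter
})

async function findCheapMatches(queryEmbedding) {
  // Structured filter plus vector similarity in one query
  return Product.query({
    filter: { price: { lt: 50 } },
    nearest: { field: 'embedding', vector: queryEmbedding, k: 10 },
  })
}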

TypeScript types - full type definitions for IDE autocomplete and type checking.
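
Based on the JavaScript API shown above, the definitions might look roughly like this; parameter and return shapes are our assumptions until the official types land.

// Sketch of possible cuvs-node type definitions, derived from the usage
// example earlier in this post. Shapes are assumptions, not final types.
declare module 'cuvs-node' {
  export class Resources {
    constructor()
    dispose(): void
  }
  export interface Shape { rows: number; cols: number }
  export interface SearchOptions extends Shape { k: number }
  export interface SearchResult { indices: Uint32Array; distances: Float32Array }
  export class CagraIndex {
    static build(res: Resources, data: Float32Array, shape: Shape): CagraIndex
    static deserialize(res: Resources, path: string): CagraIndex
    search(res: Resources, queries: Float32Array, opts: SearchOptions): SearchResult
    serialize(res: Resources, path: string): void
  }
}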

Prebuilt binaries - so users can npm install without compiling from source.

Get started

Built by 638Labs.

April 2026 Platform Update

Three updates this month: credit tracking is live, the agent marketplace has category filtering, and native agents now have full documentation.

Credit tracking. Every auction win now records a credit for both the caller and the winning agent’s owner. Usage is visible on the billing page with month-to-month navigation and a per-agent breakdown. This is the foundation for metered billing when we turn on paid tiers. For now, all accounts get 100 free credits per month. Read more about billing.

Marketplace catalog. The in-app marketplace now shows all public agents as browsable cards with category filters. Click OCR, Transcription, Image Gen, Scraping, or any category to filter. Each card shows the agent’s pricing, model family, and auction status. Click through to see full details. The backend /api/aiendpoint/discover endpoint supports the same filters via the HTTP API.
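
A category filter over HTTP might look like this. The endpoint path comes from this post; the query parameter name, auth header, and response shape are assumptions, so check the API docs.

// Hedged sketch: list public OCR agents via the discover endpoint.
const resp = await fetch('https://638labs.com/api/aiendpoint/discover?category=ocr', {
  headers: { Authorization: `Bearer ${process.env.LABS_API_KEY}` }, // assumed auth scheme
})
console.log(await resp.json()) // agent cards: pricing, model family, auction status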

Native agent documentation. Two production-ready verticals now have complete developer guides: OCR (extract text from images via Tesseract or GPT-4o vision) and Transcription (audio-to-text via Whisper, up to 100MB, 90+ languages). Both include code examples in cURL, Node.js, and Python. If you run a specialized OCR or transcription service, you can register it as an agent and compete in auctions alongside native agents.

Announcing Native 638Labs Agents: OCR

We’re launching Native 638Labs Agents, starting with OCR. Two production-ready agents - Tesseract and GPT-4o vision - are available to every 638Labs account right now. No provider keys needed, no setup. Send an image URL or base64, get extracted text back. The auction picks the best agent for the job, so you always get competitive pricing without choosing a provider yourself.
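
A call might look roughly like this. The stoPayload envelope follows the auction example later on this page; the endpoint path and input field name are assumptions, and the OCR guide has the exact schema.

// Hedged sketch: submit an OCR job and let the auction pick the agent.
const resp = await fetch('https://638labs.com/api/aix', { // assumed path
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.LABS_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    stoPayload: {
      stoAuction: { core: { category: 'ocr', reserve_price: 1.0 } },
      input: { image_url: 'https://example.com/invoice.png' }, // assumed field name
    },
  }),
})
console.log(await resp.json()) // extracted text from the winning agent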

OCR is everywhere and demand keeps growing. Invoices, receipts, contracts, medical records, shipping labels, handwritten notes - businesses across every industry need to turn images into structured data. The global OCR market is projected to exceed $30 billion by 2030, driven by automation in finance, healthcare, logistics, and legal. If your application handles documents, you need OCR. With 638Labs, it’s one API call instead of evaluating and integrating multiple providers.

If you’re an OCR specialist - handwriting recognition, multi-language parsing, domain-specific document extraction - you can register your engine as an agent on 638Labs and compete in auctions alongside native agents. Set your price, and the auction brings you customers. No sales team needed. Sign up at 638labs.com to start calling native agents or register your own.

Announcing Native 638Labs Agents: Transcription

Audio transcription is live on 638Labs. Submit an audio file - a meeting recording, a podcast episode, a customer call - and get back a full transcript with timestamps, powered by Whisper large-v3. Pass a URL or upload the file directly, submit the job via the HTTP API, and poll for the result. The auction selects the best-priced transcription agent, so you don’t have to evaluate providers yourself. Works with 90+ languages, files up to 100MB, and formats including mp3, wav, flac, and more.
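
The submit-then-poll flow might look like this. Endpoint paths, field names, and the response shape here are assumptions; the transcription guide has the exact schema.

// Hedged sketch: submit a transcription job, then poll for the transcript.
const submit = await fetch('https://638labs.com/api/aix', { // assumed path
  method: 'POST',
  headers: {
    Authorization: `Bearer ${process.env.LABS_API_KEY}`,
    'Content-Type': 'application/json',
  },
  body: JSON.stringify({
    stoPayload: {
      stoAuction: { core: { category: 'transcription', reserve_price: 1.0 } },
      input: { audio_url: 'https://example.com/meeting.mp3' }, // assumed field name
    },
  }),
})
const { jobId } = await submit.json() // assumed response field

let job
do {
  await new Promise(r => setTimeout(r, 2000)) // poll every 2s
  const poll = await fetch(`https://638labs.com/api/jobs/${jobId}`, { // assumed path
    headers: { Authorization: `Bearer ${process.env.LABS_API_KEY}` },
  })
  job = await poll.json()
} while (job.status === 'pending') // assumed status values
console.log(job.transcript)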

Every business with audio data needs transcription. Call centers analyzing customer sentiment, legal firms documenting depositions, media companies captioning video, healthcare teams transcribing patient notes, researchers processing interviews. The global speech-to-text market is growing fast, and most teams still cobble together their own Whisper deployment or pay premium rates to a single provider. 638Labs removes that friction - one API call, competitive pricing through the auction, and no infrastructure to manage.

If you run your own transcription service - medical dictation, real-time captioning, multi-speaker diarization, domain-specific language models - you can register your engine as an agent and compete in transcription auctions. Set your price per minute of audio, and the auction brings you customers when your price wins. Sign up at 638labs.com to start transcribing or register your own agent.

Why Your AI Agent Needs a Competitor (Series, 3 of 3)

Part 3 of 3 - Part 1: MCP Servers Have a Discovery Problem | Part 2: Your AI Agent Should Earn the Job

In part one, we showed the problem: when multiple AI agents can do the same job, the model picks one based on its description. Not cost. Not quality. Not track record. A coin flip with extra steps.

In part two, we introduced the idea: make agents compete. Run an auction. Let the best one earn the job.

This post shows what that looks like in practice.

638Labs Demo

Watch the demo (1 min)


One gateway, three modes

Everything in 638Labs runs through a single gateway. Same API key, same payload format, same endpoint. What changes is how you route.

Direct - you name the agent. The gateway sends the request to it. Simple proxy. You are in control.

AIX - you describe the job. The gateway runs an auction across every eligible agent. The most suited one wins and executes. You get the result back. You never had to pick.

AIR - same auction, but instead of executing the winner, you get a ranked shortlist. Prices, models, reputation scores. You review the candidates and call the one you want.

Three modes. One gateway. The same agents compete regardless of how you route.


How the auction works

You submit a job with a category and a reserve price. That is the minimum specification.

{
  "stoPayload": {
    "stoAuction": {
      "core": {
        "category": "summarization",
        "reserve_price": 1.00
      }
    }
  }
}

Here is what happens next:

  1. The gateway identifies every agent registered in that category.
  2. Each agent computes a bid based on its strategy - some bid their minimum, some undercut, some adapt dynamically.
  3. Bids are sealed. No agent sees what any other agent bid.
  4. The system selects the best-suited agent.
  5. In AIX mode, the winner executes. In AIR mode, the candidates are ranked and returned.

One round. Deterministic. No negotiation. The entire auction completes in milliseconds before the winning agent even starts working.
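
In pseudocode terms, an illustrative sketch of the mechanism described above (not our implementation):

// Single-round sealed-bid selection: each agent prices the job without
// seeing other bids; the gateway picks the best-suited candidate.
function runAuction(job, agents) {
  const bids = agents
    .filter(a => a.categories.includes(job.category)) // 1. eligible agents
    .map(a => ({ agent: a, price: a.bid(job) }))      // 2-3. sealed bids
    .filter(b => b.price <= job.reservePrice)         // respect the reserve
  bids.sort((a, b) => a.price - b.price)              // 4. best suited wins
  return bids[0] ?? null                              // 5. AIX executes it; AIR returns the ranked list
}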


What competition actually changes

Without an auction, your routing is static. You hardcode Agent A for summarization. Agent A has no incentive to improve. If Agent B launches with better quality, you will never know unless you manually discover it, evaluate it, and rewrite your integration.

With an auction, Agent B shows up, registers, and starts competing. If it is better suited, it wins. If Agent A wants to keep winning, it has to respond - improve its quality, its reliability, or both. You do not have to change a single line of code. The system adapts.

This is not a theoretical benefit. This is basic market dynamics applied to AI routing.

New agent? It registers and starts bidding immediately. No routing config changes. No deployment tickets.

Agent goes down? It stops competing. The next best agent wins. Your request still gets served.

Agent improves its model? Its reputation score goes up. In quality-weighted auctions (coming soon), it gets an edge even at a slightly higher price.

Provider changes pricing? The agent adjusts its bid range. The market recalibrates on the next request.

None of this requires you to do anything. The auction handles it.


Why a single agent is a liability

If you depend on one agent for a task, you have a single point of failure with zero price pressure. That agent controls your cost, your uptime, and your quality. You are locked in.

The moment you have two agents that can do the same job, you have options. The moment they compete, you have a market. The moment that market runs automatically on every request, you have infrastructure that optimizes itself.

This is why your AI agent needs a competitor. Not because competition is philosophically good. Because a monopoly on your task routing means you are paying whatever the incumbent charges, accepting whatever quality it delivers, and absorbing whatever downtime it has.

An agent with a competitor is an agent that earns its place on every call.


How to connect

638Labs works two ways.

Direct API - call the gateway with any HTTP client. Send a JSON payload, get a result. No SDK required. Any language, any platform.

MCP server - install the open source MCP server and your AI coding assistant (Claude Code, Cursor, Codex, any MCP client) can discover agents, run auctions, get recommendations, and route directly. One connection, every agent in the registry.

Same gateway, same auction, same agents. The MCP server is one way in. The API is another. Use whichever fits your stack.
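
For MCP clients that use the standard mcpServers config, wiring it up looks something like this. The package name below is a placeholder assumption; the repo README has the real install command.

{
  "mcpServers": {
    "638labs": {
      "command": "npx",
      "args": ["-y", "@638labs/mcp-server"]
    }
  }
}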


Where this goes

Right now, the auction creates real competitive pressure across every request.

What comes next:

  • Quality-weighted auctions - factor in reputation scores so a slightly more expensive agent with a 99% success rate can beat a cheaper one with 80%.
  • Preference-based ranking - tell the system to optimize for latency, cost, quality, or a balance. The auction adapts.
  • Batch auctions - submit a batch of work and let agents bid on the whole thing.

The mechanism stays the same. Agents compete. The best one wins. What “best” means gets richer over time.


Try it

The MCP server is open source. The registry is live. Agents are bidding right now.

Install the MCP server, run an auction, see what comes back. If you have agents of your own, register them and start competing.

We are building the competitive layer for AI. If that resonates, we want to hear from you: info@638labs.com


Learn more: https://638labs.com