
Building a Sub-Millisecond
Search Engine in Rust

How we built a hybrid BM25 + semantic search engine that handles 27K queries/sec with sub-millisecond p99 latency on a single node. Architecture, real benchmarks, and lessons learned.

<1ms p99 Query Latency · 27K Queries/sec · 21K Docs/sec Indexing · 766B Per Document

The Problem

Most teams reach for Elasticsearch or Solr when they need full-text search. These are solid tools — but they come with a cost: JVM overhead, GC pauses, cluster complexity, and query latencies that typically land in the 5–50ms range for simple queries. For many applications, that's fine.

We needed something different. Our requirements: sub-millisecond latency at p99, hybrid ranking combining BM25 text relevance with semantic similarity, and single-binary deployment — no JVM, no cluster coordinator, no YAML.

The answer was Rust.

Architecture Overview

The engine is structured as a set of modular Rust crates, each responsible for a single concern. The API layer is Axum + Tokio. The indexing and retrieval layer is built on Tantivy. Semantic search runs through Qdrant, and results from both paths are fused via Reciprocal Rank Fusion (RRF).

API layer: Axum HTTP, Query Parser, Auth Middleware
Search layer: BM25 (Tantivy), Semantic (Qdrant), RRF Fusion
Index layer: Inverted Index, Vector Store, Segment Merge
Storage layer: mmap Files, Redis Cache, Commit Log

Why Tantivy

Tantivy is a full-text search library written in Rust, inspired by Apache Lucene. Unlike Elasticsearch (which wraps Lucene in a JVM layer), Tantivy compiles to native code with zero garbage collection. It gives us direct control over memory layout, segment merging, and tokenization pipelines.

The key advantage: Tantivy's IndexReader uses memory-mapped segments with Arc-based sharing. Multiple search threads read from the same mapped memory without copying or locking. This is why concurrent throughput scales near-linearly.

Hybrid Ranking with RRF

We combine two ranking signals: BM25 lexical relevance from Tantivy, and semantic similarity from vector search in Qdrant.

These are fused using Reciprocal Rank Fusion, which merges ranked lists without requiring score normalization. The algorithm is O(n) in the number of results and adds negligible overhead — under 1ms even for 1,000 results.

// RRF fusion — O(n) merge of ranked lists
use std::collections::HashMap;

fn rrf_fuse(bm25: &[DocScore], semantic: &[DocScore], k: f32) -> Vec<DocScore> {
    let mut scores: HashMap<DocId, f32> = HashMap::new();

    // Each list contributes 1 / (k + rank) per document; enumerate() is
    // 0-based, hence the +1 to make ranks start at 1.
    for (rank, doc) in bm25.iter().enumerate() {
        *scores.entry(doc.id).or_default() += 1.0 / (k + rank as f32 + 1.0);
    }
    for (rank, doc) in semantic.iter().enumerate() {
        *scores.entry(doc.id).or_default() += 1.0 / (k + rank as f32 + 1.0);
    }

    // Rebuild DocScore values and sort by fused score, descending.
    let mut fused: Vec<DocScore> = scores
        .into_iter()
        .map(|(id, score)| DocScore { id, score })
        .collect();
    fused.sort_by(|a, b| b.score.total_cmp(&a.score));
    fused
}

Benchmark Results

All benchmarks were run on macOS x86_64, release build, with Criterion for statistical validity. The corpus consists of synthetic documents (educational content, song metadata, SCADA records) matching our production workload profile.

Indexing Throughput

At small corpus sizes, setup cost (directory creation, 50MB writer allocation) dominates. As the corpus grows, throughput stabilizes at 21K docs/sec. Serial and batch modes perform identically: the real bottleneck is Tantivy's internal tokenization and segment I/O, not lock overhead.

Corpus Size | Method | Time  | Docs/sec
1K          | serial | 2.41s | 416
1K          | batch  | 2.37s | 422
10K         | serial | 2.43s | 4,117
10K         | batch  | 2.52s | 3,963
50K         | serial | 2.51s | 19,881
50K         | batch  | 2.64s | 18,913
100K        | serial | 4.76s | 21,006
100K        | batch  | 5.08s | 19,690

Query Latency

The headline number: p99 latency stays under 1ms even at 50K documents. Median latency at 50K docs is 234 microseconds. For context, a typical Elasticsearch simple query on a comparable corpus returns in 5–15ms — roughly 20–50x slower.

Corpus | p50   | p95   | p99   | Mean
1K     | 119us | 296us | 335us | 136us
10K    | 131us | 294us | 537us | 160us
50K    | 234us | 712us | 984us | 290us
Why microseconds matter

At 984us p99, your slowest 1-in-100 query is still under a millisecond. This means you can layer additional processing — re-ranking, personalization, A/B test logic — and still return results in under 5ms total. That's the budget Elasticsearch uses for the search query alone.
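As a back-of-the-envelope check, the budget works out like this. The 984us search figure is the measured p99 from the table above; the other stage latencies are hypothetical placeholders:

```rust
fn main() {
    let search_p99_us = 984u64; // measured p99 for the search step (above)
    let rerank_us = 1_500u64; // hypothetical re-ranking stage
    let personalize_us = 1_000u64; // hypothetical personalization stage
    let ab_test_us = 200u64; // hypothetical A/B bucketing logic

    let total_us = search_p99_us + rerank_us + personalize_us + ab_test_us;
    println!("total budget used: {}us", total_us); // 3684us, under the 5ms budget
    assert!(total_us < 5_000);
}
```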

Concurrent Query Throughput

Throughput scales near-linearly with concurrent workers thanks to Tantivy's lock-free IndexReader architecture. At 100 concurrent workers, we sustain 27,287 queries per second on a single node.

Workers | Total Queries | Wall Time | Queries/sec
1       | 100           | 20.9ms    | 4,782
10      | 1,000         | 51.1ms    | 19,566
50      | 5,000         | 211.9ms   | 23,600
100     | 10,000        | 366.5ms   | 27,287

Query Parser Performance

The query parser handles simple terms, phrase queries, negation, and complex boolean expressions — all under 103 microseconds. This is pre-search overhead; it runs before the index is touched.

Query Type | Example                                  | Mean Latency
Simple     | fracciones matematicas                   | 76.7us
Phrase     | "tabla periodica" quimica                | 74.0us
Negation   | volcanes -tectonica ciencias             | 102.6us
Complex    | "sistema solar" -pluton type:educational | 85.2us

RRF Ranking Fusion

The hybrid ranking step adds negligible overhead. Even fusing 1,000 results from both BM25 and semantic pipelines takes under 650 microseconds.

Results to Fuse | Mean Latency
10              | 6.0us
100             | 58.3us
500             | 317.6us
1,000           | 647.6us

Memory Efficiency

The index uses approximately 766 bytes per document in RSS. A 50K document corpus adds only 36MB to the process memory footprint. Compare this to Elasticsearch, where a comparable index typically requires 2–5KB per document plus JVM heap overhead.

Metric              | Value
RSS before indexing | 56 MB
RSS after 50K docs  | 93 MB
Delta               | 36 MB
Bytes per document  | ~766 bytes

Deep Dive

Why Lock Overhead Is Zero

One surprising result: serial and batch indexing perform identically. The intuition is that batch mode should avoid lock contention, but in practice the bottleneck is Tantivy's internal tokenization and segment I/O — not the write lock.

Tantivy uses a single IndexWriter with an internal 50MB buffer. Documents are tokenized, analyzed, and written to in-memory segments before being flushed to disk. The write lock is held only during the brief commit() call, which triggers a segment flush. During normal indexing, the lock is uncontended.
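The pattern can be sketched with std alone. The `Writer` type below is a hypothetical stand-in, not Tantivy's actual `IndexWriter`: documents accumulate in a private buffer with no lock taken, and the shared lock is held only for the brief flush inside `commit()`:

```rust
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for the writer: a private in-memory buffer plus a
// shared, lock-protected list of published segments.
struct Writer {
    buffer: Vec<String>,                    // in-memory segment under construction
    segments: Arc<Mutex<Vec<Vec<String>>>>, // published, searchable segments
}

impl Writer {
    fn add_document(&mut self, doc: &str) {
        // No lock here: tokenization and buffering are the expensive part.
        self.buffer.push(doc.to_lowercase());
    }

    fn commit(&mut self) {
        // The lock is held only for this brief flush.
        let flushed = std::mem::take(&mut self.buffer);
        self.segments.lock().unwrap().push(flushed);
    }
}

fn main() {
    let segments = Arc::new(Mutex::new(Vec::new()));
    let mut writer = Writer { buffer: Vec::new(), segments: Arc::clone(&segments) };
    writer.add_document("Hello World");
    writer.add_document("Search Engine");
    writer.commit();
    println!("published segments: {}", segments.lock().unwrap().len());
}
```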

How Concurrent Reads Scale

The IndexReader in Tantivy creates lightweight Searcher instances that share the same memory-mapped segment files via Arc. No data is copied. Each search thread gets its own Searcher, but they all read from the same physical memory pages. The OS page cache does the rest.

This is fundamentally different from JVM-based engines, where each query allocates objects on the heap, increasing GC pressure under concurrent load.
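The sharing pattern can be illustrated with std alone; the `Vec<u32>` below stands in for a memory-mapped segment, and each thread's `Arc` clone is a pointer bump, not a copy of the data:

```rust
use std::sync::Arc;
use std::thread;

fn main() {
    // Stand-in for a memory-mapped segment: a large read-only buffer.
    let segment: Arc<Vec<u32>> = Arc::new((0..1_000_000).collect());

    // Each "searcher" thread shares the same underlying memory via Arc.
    let handles: Vec<_> = (0..8)
        .map(|_| {
            let seg = Arc::clone(&segment);
            thread::spawn(move || seg.iter().map(|&x| x as u64).sum::<u64>())
        })
        .collect();

    for h in handles {
        // Every reader sees the same data, with no copies and no locks.
        assert_eq!(h.join().unwrap(), 499_999_500_000);
    }
    println!("8 concurrent readers, zero copies");
}
```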

The Query Parser

Our query parser supports boolean operators, phrase queries with "quotes", field-specific queries with field:value syntax, negation with -term, and wildcard matching. It compiles to a Tantivy Box<dyn Query> tree in under 100 microseconds regardless of complexity.

The parser uses a single-pass approach: tokenize the input, classify each token (phrase boundary, negation prefix, field prefix, plain term), and build the query tree bottom-up. No backtracking, no ambiguity. O(n) in input length.
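A simplified std-only sketch of that classification pass (unlike the real parser, it ignores spaces inside quoted phrases):

```rust
#[derive(Debug, PartialEq)]
enum Token<'a> {
    Phrase(&'a str),         // "quoted"
    Negation(&'a str),       // -term
    Field(&'a str, &'a str), // field:value
    Term(&'a str),           // plain term
}

// Single pass: each whitespace-split token is inspected exactly once,
// with no backtracking. O(n) in input length.
fn classify(query: &str) -> Vec<Token<'_>> {
    query
        .split_whitespace()
        .map(|t| {
            if let Some(p) = t.strip_prefix('"').and_then(|p| p.strip_suffix('"')) {
                Token::Phrase(p)
            } else if let Some(n) = t.strip_prefix('-') {
                Token::Negation(n)
            } else if let Some((f, v)) = t.split_once(':') {
                Token::Field(f, v)
            } else {
                Token::Term(t)
            }
        })
        .collect()
}

fn main() {
    let tokens = classify("\"solar\" -pluton type:educational sistema");
    assert_eq!(
        tokens,
        vec![
            Token::Phrase("solar"),
            Token::Negation("pluton"),
            Token::Field("type", "educational"),
            Token::Term("sistema"),
        ]
    );
    println!("{:?}", tokens);
}
```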

RRF: Simple Beats Complex

We considered learned-to-rank models for combining BM25 and semantic scores. The problem: they require training data, add inference latency, and need retraining as the corpus changes. RRF has none of these problems.

The formula is trivial: for each document appearing in any ranked list, sum 1 / (k + rank) across all lists. Sort by score. Done. At k=60 (the standard constant), this produces results competitive with learned models on standard benchmarks, at a fraction of the complexity.
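As a concrete check of the formula: a document ranked 1st in the BM25 list and 2nd in the semantic list scores 1/61 + 1/62 ≈ 0.0325 at k=60:

```rust
fn main() {
    let k = 60.0_f32;
    // Document appears at rank 1 in BM25 and rank 2 in the semantic list.
    let score = 1.0 / (k + 1.0) + 1.0 / (k + 2.0);
    println!("fused score: {score:.4}"); // ≈ 0.0325
    assert!((score - 0.0325).abs() < 1e-3);
}
```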

Key Takeaways

Methodology note

All benchmarks were generated using a dedicated Rust binary (benchmark_report.rs) that creates synthetic documents matching our production schema, measures wall-clock time with std::time::Instant, and computes percentiles from 100–10,000 samples per measurement. The Criterion benchmark suite (core_benchmarks.rs) provides statistical validation with confidence intervals. Source code is available for audit.

Need this kind of performance
in your search stack?

We audit and architect Rust systems. 48-hour turnaround. Real findings, not checklists.

Start Your Audit: audit@newcool.io