The Problem
Most teams reach for Elasticsearch or Solr when they need full-text search. These are solid tools — but they come with a cost: JVM overhead, GC pauses, cluster complexity, and query latencies that typically land in the 5–50ms range for simple queries. For many applications, that's fine.
We needed something different. Our requirements: sub-millisecond latency at p99, hybrid ranking combining BM25 text relevance with semantic similarity, and single-binary deployment — no JVM, no cluster coordinator, no YAML.
The answer was Rust.
Architecture Overview
The engine is structured as a set of modular Rust crates, each responsible for a single concern. The API layer is Axum + Tokio. The indexing and retrieval layer is built on Tantivy. Semantic search runs through Qdrant, and results from both paths are fused via Reciprocal Rank Fusion (RRF).
Why Tantivy
Tantivy is a full-text search library written in Rust, inspired by Apache Lucene. Unlike Elasticsearch (which wraps Lucene in a JVM layer), Tantivy compiles to native code with zero garbage collection. It gives us direct control over memory layout, segment merging, and tokenization pipelines.
The key advantage: Tantivy's IndexReader uses memory-mapped segments with
Arc-based sharing. Multiple search threads read from the same mapped memory
without copying or locking. This is why concurrent throughput scales near-linearly.
Hybrid Ranking with RRF
We combine two ranking signals:
- BM25 from Tantivy — classic term-frequency relevance, fast and deterministic
- Semantic similarity from Qdrant — vector embeddings for meaning-based matching
These are fused using Reciprocal Rank Fusion, which merges ranked lists without requiring score normalization. The algorithm is O(n) in the number of results and adds negligible overhead — under 1ms even for 1,000 results.
```rust
use std::collections::HashMap;

// RRF fusion — O(n) merge of two ranked lists. Each list contributes
// 1 / (k + rank + 1) per document; contributions are summed, then
// documents are sorted by fused score, descending.
fn rrf_fuse(bm25: &[DocScore], semantic: &[DocScore], k: f32) -> Vec<(DocId, f32)> {
    let mut scores: HashMap<DocId, f32> = HashMap::new();
    for (rank, doc) in bm25.iter().enumerate() {
        *scores.entry(doc.id).or_default() += 1.0 / (k + rank as f32 + 1.0);
    }
    for (rank, doc) in semantic.iter().enumerate() {
        *scores.entry(doc.id).or_default() += 1.0 / (k + rank as f32 + 1.0);
    }
    let mut fused: Vec<_> = scores.into_iter().collect();
    // Fused scores are finite, so partial_cmp never returns None.
    fused.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    fused
}
```
Benchmark Results
All benchmarks were run on macOS x86_64, release build, with Criterion for statistical validity. The corpus consists of synthetic documents (educational content, song metadata, SCADA records) matching our production workload profile.
Indexing Throughput
At small corpus sizes, setup cost (directory creation, 50MB writer allocation) dominates. As the corpus grows, throughput stabilizes at 21K docs/sec. Serial and batch modes perform identically — Tantivy's internal tokenization and I/O are the real bottleneck, not lock overhead.
| Corpus Size | Method | Time | Docs/sec |
|---|---|---|---|
| 1K | serial | 2.41s | 416 |
| 1K | batch | 2.37s | 422 |
| 10K | serial | 2.43s | 4,117 |
| 10K | batch | 2.52s | 3,963 |
| 50K | serial | 2.51s | 19,881 |
| 50K | batch | 2.64s | 18,913 |
| 100K | serial | 4.76s | 21,006 |
| 100K | batch | 5.08s | 19,690 |
Query Latency
The headline number: p99 latency stays under 1ms even at 50K documents. Median latency at 50K docs is 234 microseconds. For context, a typical Elasticsearch simple query on a comparable corpus returns in 5–15ms — roughly 20–50x slower.
| Corpus | p50 | p95 | p99 | Mean |
|---|---|---|---|---|
| 1K | 119us | 296us | 335us | 136us |
| 10K | 131us | 294us | 537us | 160us |
| 50K | 234us | 712us | 984us | 290us |
At 984us p99, your slowest 1-in-100 query is still under a millisecond. This means you can layer additional processing — re-ranking, personalization, A/B test logic — and still return results in under 5ms total. That's the budget Elasticsearch uses for the search query alone.
Concurrent Query Throughput
Throughput scales near-linearly with concurrent workers thanks to Tantivy's lock-free
IndexReader architecture. At 100 concurrent workers, we sustain
27,287 queries per second on a single node.
| Workers | Total Queries | Wall Time | Queries/sec |
|---|---|---|---|
| 1 | 100 | 20.9ms | 4,782 |
| 10 | 1,000 | 51.1ms | 19,566 |
| 50 | 5,000 | 211.9ms | 23,600 |
| 100 | 10,000 | 366.5ms | 27,287 |
Query Parser Performance
The query parser handles simple terms, phrase queries, negation, and complex boolean expressions, all in roughly 100 microseconds or less. This is pre-search overhead; it runs before the index is touched.
| Query Type | Example | Mean Latency |
|---|---|---|
| Simple | fracciones matematicas | 76.7us |
| Phrase | "tabla periodica" quimica | 74.0us |
| Negation | volcanes -tectonica ciencias | 102.6us |
| Complex | "sistema solar" -pluton type:educational | 85.2us |
RRF Ranking Fusion
The hybrid ranking step adds negligible overhead. Even fusing 1,000 results from both BM25 and semantic pipelines takes under 650 microseconds.
| Results to Fuse | Mean Latency |
|---|---|
| 10 | 6.0us |
| 100 | 58.3us |
| 500 | 317.6us |
| 1,000 | 647.6us |
Memory Efficiency
The index uses approximately 766 bytes per document in RSS. A 50K document corpus adds only 36MB to the process memory footprint. Compare this to Elasticsearch, where a comparable index typically requires 2–5KB per document plus JVM heap overhead.
| Metric | Value |
|---|---|
| RSS before indexing | 56 MB |
| RSS after 50K docs | 93 MB |
| Delta | 36 MB |
| Bytes per document | ~766 bytes |
Deep Dive
Why Lock Overhead Is Zero
One surprising result: serial and batch indexing perform identically. The intuition is that batch mode should avoid lock contention, but in practice the bottleneck is Tantivy's internal tokenization and segment I/O — not the write lock.
Tantivy uses a single IndexWriter with an internal 50MB buffer.
Documents are tokenized, analyzed, and written to in-memory segments before being
flushed to disk. The write lock is held only during the brief commit() call,
which triggers a segment flush. During normal indexing, the lock is uncontended.
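As a toy illustration of this locking pattern (not Tantivy's actual API — its real IndexWriter buffers tokenized documents in a memory arena before flushing), a writer can buffer documents with no lock at all and take its lock only for the brief flush inside commit:

```rust
use std::sync::Mutex;

// Toy model of commit-scoped locking (illustrative only, not Tantivy's API).
struct ToyWriter {
    buffer: Vec<String>,           // per-writer in-memory segment
    committed: Mutex<Vec<String>>, // "on-disk" segments, guarded by a lock
}

impl ToyWriter {
    fn new() -> Self {
        ToyWriter { buffer: Vec::new(), committed: Mutex::new(Vec::new()) }
    }

    // Adding a document touches only the private buffer — no lock taken.
    fn add_document(&mut self, doc: &str) {
        self.buffer.push(doc.to_string());
    }

    // The lock is held only for the brief flush inside commit().
    fn commit(&mut self) -> usize {
        let mut segments = self.committed.lock().unwrap();
        segments.append(&mut self.buffer);
        segments.len()
    }
}
```

Under this shape, indexing throughput is bounded by the per-document work (tokenization, analysis), not by contention on the commit lock.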
How Concurrent Reads Scale
The IndexReader in Tantivy creates lightweight Searcher instances
that share the same memory-mapped segment files via Arc. No data is copied.
Each search thread gets its own Searcher, but they all read from the same
physical memory pages. The OS page cache does the rest.
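The sharing pattern can be shown with plain std types — a hypothetical read-only index shared across search threads via Arc, standing in for Tantivy's mmap-backed segment files:

```rust
use std::sync::Arc;
use std::thread;

// Stand-in for a memory-mapped segment: immutable once built, shared by Arc.
// Each worker gets a cheap pointer copy; the underlying data is never cloned.
fn concurrent_count(index: Arc<Vec<String>>, term: &str, workers: usize) -> usize {
    let term = term.to_string();
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let idx = Arc::clone(&index); // bumps a refcount, copies no data
            let term = term.clone();
            thread::spawn(move || idx.iter().filter(|d| d.contains(&term)).count())
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}
```

Because the shared data is immutable, no locks are needed on the read path — the same property that lets Tantivy's Searcher instances scale near-linearly.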
This is fundamentally different from JVM-based engines, where each query allocates objects on the heap, increasing GC pressure under concurrent load.
The Query Parser
Our query parser supports boolean operators, phrase queries with "quotes",
field-specific queries with field:value syntax, negation with -term,
and wildcard matching. It compiles to a Tantivy Box<dyn Query> tree
in under 100 microseconds regardless of complexity.
The parser uses a single-pass approach: tokenize the input, classify each token (phrase boundary, negation prefix, field prefix, plain term), and build the query tree bottom-up. No backtracking, no ambiguity. O(n) in input length.
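A minimal sketch of that single pass (hypothetical token types, not the production parser; multi-word phrases are omitted for brevity):

```rust
// Single-pass query-token classifier: one scan over whitespace-split tokens,
// no backtracking — each token's first characters decide its class.
#[derive(Debug, PartialEq)]
enum Token {
    Phrase(String),        // "quoted"
    Negated(String),       // -term
    Field(String, String), // field:value
    Term(String),          // plain term
}

fn classify(query: &str) -> Vec<Token> {
    query
        .split_whitespace()
        .map(|raw| {
            if raw.starts_with('"') {
                Token::Phrase(raw.trim_matches('"').to_string())
            } else if let Some(t) = raw.strip_prefix('-') {
                Token::Negated(t.to_string())
            } else if let Some((f, v)) = raw.split_once(':') {
                Token::Field(f.to_string(), v.to_string())
            } else {
                Token::Term(raw.to_string())
            }
        })
        .collect()
}
```

From the classified tokens, building the boolean query tree is a single bottom-up fold — which is why parse latency is flat across query complexity.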
RRF: Simple Beats Complex
We considered learning-to-rank models for combining BM25 and semantic scores. The problem: they require training data, add inference latency, and need retraining as the corpus changes. RRF has none of these problems.
The formula is trivial: for each document appearing in any ranked list,
sum 1 / (k + rank) across all lists. Sort by score. Done.
At k=60 (the standard constant), this produces results competitive with
learned models on standard benchmarks, at a fraction of the complexity.
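A worked example at k = 60: a document ranked 1st in one list and 3rd in the other scores 1/61 + 1/63 ≈ 0.0323, narrowly beating a document ranked 2nd in both lists (1/62 + 1/62 ≈ 0.0322) — consistently good placement and a single top placement end up nearly tied.

```rust
// RRF contribution for a single 1-based rank, with the standard k = 60.
fn rrf(rank: usize) -> f64 {
    1.0 / (60.0 + rank as f64)
}
```

The large k flattens the contribution curve, so no single list can dominate the fused ranking.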
Key Takeaways
- Rust + Tantivy eliminates the JVM tax. No GC pauses, no heap bloat, no warmup time. Your p99 is your p99 — not p99-minus-GC-pauses.
- Sub-millisecond p99 is achievable. At 50K documents, our worst 1-in-100 query returns in 984 microseconds. This leaves budget for application logic.
- Concurrency scales with zero effort. Tantivy's mmap + Arc architecture means you don't need to think about read locks. Just spawn more Searcher instances.
- RRF is the right default for hybrid search. No training, no tuning, O(n) complexity, sub-millisecond latency. Start here; reach for LTR only if you need it.
- 766 bytes per document. Your laptop can index a million documents and still have RAM to spare.
All benchmarks were generated using a dedicated Rust binary (benchmark_report.rs)
that creates synthetic documents matching our production schema, measures wall-clock time
with std::time::Instant, and computes percentiles from 100–10,000 samples
per measurement. The Criterion benchmark suite (core_benchmarks.rs) provides
statistical validation with confidence intervals. Source code is available for audit.
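The measurement loop can be sketched as follows — timing each query with std::time::Instant and taking nearest-rank percentiles over the sorted samples (the percentile method is an assumption; the harness may differ in detail):

```rust
use std::time::{Duration, Instant};

// Time n invocations of a query closure with wall-clock Instant.
fn time_queries<F: Fn()>(n: usize, query: F) -> Vec<Duration> {
    (0..n)
        .map(|_| {
            let start = Instant::now();
            query();
            start.elapsed()
        })
        .collect()
}

// Nearest-rank percentile over sorted samples (assumed methodology).
fn percentile(samples: &mut Vec<Duration>, p: f64) -> Duration {
    samples.sort();
    let rank = ((p / 100.0) * samples.len() as f64).ceil() as usize;
    samples[rank.saturating_sub(1).min(samples.len() - 1)]
}
```

With 100–10,000 samples per measurement, p99 is the 99th-from-the-top sorted sample (or deeper), which keeps tail estimates stable across runs.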