How it works — OpenAlex Search (a RoaringRange demo)

01 From work to document

The OpenAlex data, mapped

Each OpenAlex Work — a paper, book chapter, dataset, preprint — becomes one ranked document. Documents are numbered by cited_by_count, most-cited first, which is exactly what lets the popular “head” paint instantly.

What’s indexed

The searchable text of each work is concatenated and cut into trigrams, so a query matches across any field — a phrase from the abstract, an author, or a journal, not just the title. OpenAlex doesn’t ship abstracts as plain text; it stores an abstract_inverted_index — a map of each word to the positions where it occurs — so the builder reconstructs the running abstract from that index before indexing it (that’s what the abstract ← abstract_inverted_index chip below means).
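The reconstruction and trigram steps can be sketched as follows. This is a minimal illustration, not the builder's actual code: the function names are hypothetical, and lowercasing before cutting trigrams is an assumed normalization.

```python
def rebuild_abstract(inverted_index: dict[str, list[int]]) -> str:
    """Rebuild running text from a word -> positions map."""
    by_position = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = word
    # Emit the words in position order, separated by single spaces.
    return " ".join(by_position[pos] for pos in sorted(by_position))


def trigrams(text: str) -> set[str]:
    """All overlapping 3-character windows of the lowercased text."""
    s = text.lower()  # assumed normalization
    return {s[i:i + 3] for i in range(len(s) - 2)}


inverted = {
    "Deeper": [0], "neural": [1], "networks": [2], "are": [3],
    "more": [4], "difficult": [5], "to": [6], "train.": [7],
}
abstract = rebuild_abstract(inverted)
grams = trigrams(abstract)
```

A word that occurs several times simply lists several positions, so `{"a": [0, 2], "b": [1]}` rebuilds to `a b a`.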

title · abstract ← abstract_inverted_index · author names · host venue → trigrams

The five facets

Five fields from each work become filter categories — the .rrf sidecar. Counts are free; selecting one fetches a single small posting.

OR / AND · Within a field, categories OR together; across fields they AND. Everything is computed from doc-ID bitmaps, with no backend.
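The OR-within, AND-across rule can be sketched with plain Python sets standing in for roaring bitmaps. The function name and the field and category values are illustrative only.

```python
def apply_facets(candidates: set[int],
                 selections: dict[str, list[set[int]]]) -> set[int]:
    """Keep candidates that match at least one selected category per field."""
    result = set(candidates)
    for field, bitmaps in selections.items():
        if not bitmaps:
            continue  # nothing selected in this field: no constraint
        allowed = set().union(*bitmaps)  # OR within a field
        result &= allowed                # AND across fields
    return result


candidates = {0, 1, 2, 3, 4, 5}
selections = {
    "year": [{0, 1}, {2, 3}],  # two selected years, OR-ed together
    "type": [{1, 2, 5}],       # one selected type
}
filtered = apply_facets(candidates, selections)
```

Here the two years union to {0, 1, 2, 3}, and intersecting with the type bitmap leaves {1, 2}.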

What a result stores

The record behind each hit keeps just enough to render a card and link out — opaque bytes the format never inspects.

Example result card

title: Deep Residual Learning for Image Recognition

citations: 211,420

authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun

year: 2016 • host venue: IEEE CVPR • green OA

abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously…

id: W2194775991 → openalex.org/W2194775991

DOI lookup

A small companion index, the .rril file, maps each work’s DOI straight to its document. Pasting a DOI, or a doi.org URL, jumps to the exact work in a couple of byte-range reads instead of a text search. The format appears in section 04.

02 Query → range reads

Searching in the browser

A query tokenizes into trigrams, fetches matching dictionary blocks and head postings as parallel HTTP Range requests, ANDs the roaring bitmaps, and reads the page’s records — a few KB–MB in total. The multi-GB file is never downloaded whole.
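As a rough sketch of that path: the helper below builds a standard HTTP Range header (the end offset is inclusive), and the AND step intersects per-trigram postings. The toy postings are plain sets standing in for roaring bitmaps, and the function names and lowercasing are assumptions.

```python
def range_header(offset: int, length: int) -> dict[str, str]:
    """HTTP Range header for `length` bytes starting at `offset`."""
    return {"Range": f"bytes={offset}-{offset + length - 1}"}


def query_trigrams(query: str) -> list[str]:
    s = query.lower()  # assumed normalization
    return sorted({s[i:i + 3] for i in range(len(s) - 2)})


def and_postings(grams: list[str], postings: dict[str, set[int]]) -> list[int]:
    """Doc IDs present in every trigram's posting, in ascending order."""
    if not grams:
        return []
    sets = [postings.get(g, set()) for g in grams]
    # Ascending doc ID is also most-cited first, given the numbering.
    return sorted(set.intersection(*sets))


# Toy head postings, as if already fetched by parallel range reads.
postings = {"res": {0, 3, 8}, "esn": {0, 8, 9}, "sne": {0, 8}, "net": {0, 2, 8}}
hits = and_postings(query_trigrams("ResNet"), postings)
```

Because the intersection is sorted by doc ID, the first page of `hits` is already in citation order with no scoring step.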

One query, a near-constant number of round-trips — independent of corpus size.

flow · In-browser search · HTTP range reads


Two tiers in time: the popular head paints first, then the full tail fills itself in.

head / tail · Two-tier search · instant top-K, then the rest

Doc IDs are numbered by citation count, so the most-cited matches live in the head and a ranked top-K paints instantly from one cheap wave. Once the query settles, the tail auto-loads — intersecting the rarest posting first, then fetching only the roaring containers surviving candidates still touch (adjacent ones coalesced into single reads). A rare phrase of common trigrams costs a few hundred KB, not the tens of MB of reading every full tail.
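The tail strategy can be sketched in three small pieces: intersect starting from the rarest posting, work out which roaring containers the survivors touch (a container covers 65,536 consecutive IDs, so its key is the doc ID shifted right by 16 bits), and coalesce adjacent byte ranges into single reads. The function names and the `(offset, length)` range representation are illustrative assumptions.

```python
CONTAINER_BITS = 16  # one roaring container spans 65,536 consecutive doc IDs


def intersect_rarest_first(postings: list[set[int]]) -> set[int]:
    """Intersect starting from the smallest posting, stopping early if empty."""
    if not postings:
        return set()
    ordered = sorted(postings, key=len)
    result = set(ordered[0])
    for posting in ordered[1:]:
        result &= posting
        if not result:
            break
    return result


def surviving_containers(candidates: set[int]) -> list[int]:
    """Container keys that surviving candidates still touch."""
    return sorted({doc_id >> CONTAINER_BITS for doc_id in candidates})


def coalesce(ranges: list[tuple[int, int]]) -> list[tuple[int, int]]:
    """Merge (offset, length) byte ranges that touch or overlap."""
    merged: list[tuple[int, int]] = []
    for offset, length in sorted(ranges):
        if merged and offset <= merged[-1][0] + merged[-1][1]:
            prev_offset, prev_length = merged[-1]
            end = max(prev_offset + prev_length, offset + length)
            merged[-1] = (prev_offset, end - prev_offset)
        else:
            merged.append((offset, length))
    return merged


survivors = intersect_rarest_first([{1, 2, 3, 70000}, {2, 70000}, {2, 3, 70000, 200000}])
containers = surviving_containers(survivors)
reads = coalesce([(0, 100), (100, 50), (400, 10)])
```

In this toy run only containers 0 and 1 need fetching from the larger postings, and three byte ranges collapse into two reads.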

Boot is cheap because only the small top of each file loads up front.

boot · What boots · versus what stays in S3

At boot the browser downloads only the small top of each file — the index header + sparse index, the facet metadata + top category heads, and the record store’s header — a few MB total. Everything else (dictionary + postings, facet tails, record blobs — tens of GB) stays in S3 and is range-fetched only as a query needs it.

03 Meaning, not characters

Semantic & hybrid search

Trigram search matches characters — exact substrings. Semantic search matches meaning: the query and every work become vectors, and the nearest by cosine win — even with no shared words. Same static ethos — an IVFPQ index (the .rrvi file) boots once, and each query range-fetches only a few clusters.

From query to nearest works

The corpus is partitioned into nlist clusters by coarse centroids, and every vector is stored as a compact product-quantization code — 32 bytes, not a 2 KB float array — optionally rotated by OPQ for accuracy. Boot downloads the small boot region once: the OPQ rotation, the centroids, the PQ codebooks, and a cluster directory. Per query, the reader finds the nprobe nearest centroids in memory, range-fetches just those clusters’ code lists, and ranks them by asymmetric distance — the real query vector scored against quantized codes through precomputed tables, no full vectors needed.

query text → embed (512-d unit vector) → nprobe clusters (range reads) → PQ ADC rank → top-K doc IDs

The query path mirrors the trigram one — a near-constant number of round-trips, independent of corpus size.

flow · Semantic search · embed → range-fetch clusters

The query embeds to a 512-d vector (in the browser, or via the mode-1 Lambda), the reader picks the nprobe nearest centroids from the in-memory boot region, and range-fetches only those clusters’ PQ codes to rank by asymmetric distance. Out come ranked doc IDs in the same numbering as the text index — so records render through the very same path.
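A toy version of the probe-and-rank step, in pure Python. It is a sketch under simplifying assumptions: no OPQ rotation, codes quantize the vectors directly rather than residuals, squared Euclidean distance is used (for exact unit vectors it orders results the same way as cosine), and the dimensions are tiny. All function names are hypothetical.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_clusters(query, centroids, nprobe):
    """Indices of the nprobe coarse centroids closest to the query."""
    order = sorted(range(len(centroids)), key=lambda c: sq_dist(query, centroids[c]))
    return order[:nprobe]


def adc_tables(query, codebooks):
    """tables[m][k] = distance from the query's m-th sub-vector to codeword k."""
    m = len(codebooks)
    sub = len(query) // m
    return [
        [sq_dist(query[i * sub:(i + 1) * sub], codeword) for codeword in codebooks[i]]
        for i in range(m)
    ]


def adc_rank(tables, ids, codes, top_k):
    """Score each code by summing table lookups; the smallest distance wins."""
    scored = sorted(
        (sum(tables[i][c] for i, c in enumerate(code)), doc_id)
        for doc_id, code in zip(ids, codes)
    )
    return [doc_id for _, doc_id in scored[:top_k]]


query = [1, 0, 0, 1]
centroids = [[0, 0, 0, 0], [1, 0, 0, 1], [5, 5, 5, 5]]
codebooks = [[[0, 0], [1, 0]], [[0, 0], [0, 1]]]  # 2 subspaces, 2 codewords each
probed = nearest_clusters(query, centroids, nprobe=2)
tables = adc_tables(query, codebooks)
top = adc_rank(tables, ids=[7, 3, 9], codes=[(0, 0), (1, 1), (1, 0)], top_k=2)
```

The tables are computed once per query, so scoring each stored code costs one lookup per sub-vector and never touches a full float vector.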

Two ways to embed a query

The search machinery is identical; only the embedder differs. Each model defines its own vector space, so each mode has its own .rrvi — but the corpus is embedded once, offline, for $0 either way.

INVARIANT · Corpus and query must share the identical model, pooling, and prompt, or the query lands in the wrong space. Both modes guarantee it: the same model embeds both sides (the Lambda runs the very model the corpus was built with).

Facets over vector results

Vectors reuse the exact doc-ID numbering of the text index and the facet sidecar (vector_id == doc_id) — so the same roaring-bitmap facets apply with no remapping. Filtering a semantic search keeps only the candidate IDs that fall in the selected category bitmaps, the same membership test the trigram path already does. Two strategies trade recall for simplicity:

Counts are computed over the retrieved candidate pool rather than a complete bitmap — “semantic matches” is a ranked list, not an exact set — with the same in-wasm machinery and no backend. And hybrid inherits all of it: it fuses the filtered trigram and filtered semantic lists.
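A small sketch of filtering and counting over a ranked candidate pool, with sets standing in for the category bitmaps. The names are illustrative, and the filter shown is the simple "retrieve, then drop non-members" form.

```python
def filter_ranked(candidates: list[int], allowed: set[int]) -> list[int]:
    """Keep ranked candidates that fall in the allowed bitmap, preserving order."""
    return [doc_id for doc_id in candidates if doc_id in allowed]


def pool_counts(candidates: list[int],
                categories: dict[str, set[int]]) -> dict[str, int]:
    """Per-category counts over the retrieved pool, not the full corpus."""
    pool = set(candidates)
    return {name: len(pool & bitmap) for name, bitmap in categories.items()}


semantic_hits = [42, 7, 19, 3, 88]  # ranked by vector distance
categories = {"2016": {3, 7, 100}, "2020": {19, 42, 88, 500}}
only_2020 = filter_ranked(semantic_hits, categories["2020"])
counts = pool_counts(semantic_hits, categories)
```

Note that the counts describe the pool only: the category "2016" holds three works in this toy corpus, but just two of them were retrieved.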

Hybrid — reciprocal rank fusion

Run both. Trigram nails exact terms (names, identifiers, rare phrases); semantic catches paraphrase and synonymy. Reciprocal rank fusion blends the two ranked lists without needing comparable scores: each list contributes 1 / (k + rank) to a doc (k ≈ 60), so anything near the top of either ranks high.

RRF score(doc) = Σ 1 / (k + rank_list(doc)) across the trigram and semantic lists — exact matches and meaning matches fused into one ranking, no score calibration required.
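A minimal sketch of the fusion, assuming ranks start at 1 and ties are broken by the lower doc ID (both are assumptions, as is the function name):

```python
def rrf(ranked_lists: list[list[int]], k: int = 60) -> list[int]:
    """Fuse ranked lists of doc IDs by reciprocal rank."""
    scores: dict[int, float] = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; lower doc ID wins ties.
    return sorted(scores, key=lambda d: (-scores[d], d))


trigram_hits = [10, 20, 30]
semantic_hits = [20, 40, 10]
fused = rrf([trigram_hits, semantic_hits])
```

Doc 20 (ranks 2 and 1) edges out doc 10 (ranks 1 and 3), and both beat the docs that appear in only one list.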

.rrvi · Vector index · one static file

A 48-byte header, then the boot region — an optional OPQ rotation, the coarse centroids, the PQ codebooks, and a cluster directory — read once into memory. Each query fetches only the nprobe nearest clusters’ code lists ([u32 ids][u8 codes]). An optional .rrm2 (mode-2 static matrix) and .rrvr (bf16 full-vector re-rank sidecar, fetched only for the surviving top-K) ride alongside. The whole index is never downloaded.
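Decoding one fetched code list could look like the sketch below. The text gives the shape `[u32 ids][u8 codes]`; everything else here is an assumption: little-endian integers, all IDs stored before all codes, and an entry count `n` taken from the cluster directory.

```python
import struct


def parse_cluster(buf: bytes, n: int, code_bytes: int = 32):
    """Split a cluster's bytes into n doc IDs and n PQ codes (assumed layout)."""
    ids = list(struct.unpack_from(f"<{n}I", buf, 0))
    start = 4 * n  # codes begin right after the u32 ID array
    codes = [buf[start + i * code_bytes:start + (i + 1) * code_bytes] for i in range(n)]
    return ids, codes


# Toy cluster: 3 entries with 2-byte codes instead of 32.
buf = struct.pack("<3I", 5, 9, 12) + bytes(range(6))
ids, codes = parse_cluster(buf, n=3, code_bytes=2)
```

With 32-byte codes, each entry costs 36 bytes under this layout, which is what keeps a probed cluster small enough to range-fetch.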

04 One index, a family of sidecars

The file formats

One static text index plus a family of companion files. Every box below loads by HTTP byte-range; nothing runs server-side. Doc IDs are assigned by citation count throughout, so the most-cited works sit in the “head” and top-K is free.

.rrs · Text index · one static file

A 20-byte header, a small sparse index loaded once, a key-sorted dictionary, then popularity-split postings. The header now carries headBoundary — a dynamic head/tail split (a multiple of 65,536, raised for larger corpora) — so the head holds the most-popular docs and top-K stays free. Boot keeps the header + sparse index in memory; each query binary-searches it, fetches one dictionary block, then the head postings per trigram.
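The sparse-index step can be sketched with the standard `bisect` module. This assumes the sparse index holds the first key of each dictionary block, in sorted order; that layout, and the function names, are hypothetical.

```python
import bisect


def find_block(sparse_keys: list[bytes], key: bytes):
    """Index of the dictionary block that could contain `key`, or None."""
    i = bisect.bisect_right(sparse_keys, key) - 1
    return i if i >= 0 else None  # key sorts before the first block


def in_head(doc_id: int, head_boundary: int) -> bool:
    """Doc IDs below headBoundary belong to the popular head."""
    return doc_id < head_boundary


sparse_keys = [b"aaa", b"gra", b"pqr"]  # first key of blocks 0, 1, 2
block = find_block(sparse_keys, b"net")
```

The search itself runs entirely in memory; only the one dictionary block it selects has to be fetched.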

.rrf · Facet sidecar · filtering without a backend

Maps each category (year, type, language, …) to a doc-ID bitmap. Header, tables, and names load once — so listing categories and their full-corpus counts is free; selecting one fetches a single small head posting. Because facet doc IDs share the index’s popularity order, postings split at the same dynamic headBoundary as the text index. Within a field categories OR; across fields they AND.

.idx / .bin · Record store · doc ID → stored fields

A search returns ranked doc IDs; the store turns each back into its fields. An offset index (.idx) maps a doc ID to a byte range in the record blob (.bin) — two ranged reads, and a page of consecutive IDs is one contiguous slice. Records are opaque bytes, so the format never dictates your data model.
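To illustrate why a page of consecutive IDs is one slice, here is a sketch that assumes the .idx is a bare array of N+1 little-endian u64 offsets (a hypothetical layout; the real file may carry a header or use another width).

```python
import struct


def record_range(idx: bytes, first_id: int, count: int = 1) -> tuple[int, int]:
    """(start, end) bytes in .bin covering doc IDs first_id .. first_id+count-1."""
    (start,) = struct.unpack_from("<Q", idx, 8 * first_id)
    (end,) = struct.unpack_from("<Q", idx, 8 * (first_id + count))
    return start, end


# Build a toy store: three opaque records and their offset index.
records = [b"alpha", b"be", b"gamma!"]
blob = b"".join(records)
offsets = [0]
for record in records:
    offsets.append(offsets[-1] + len(record))
idx = struct.pack(f"<{len(offsets)}Q", *offsets)

start, end = record_range(idx, 1)
page_start, page_end = record_range(idx, 0, count=3)
```

In a browser the two lookups would themselves be ranged reads: one slice of the offset index, then one slice of the blob.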

.rril · DOI lookup · identifier → doc ID

Maps each work’s normalized DOI to a doc ID: a header, then fixed-size records sorted by hash. A lookup normalizes the query and binary-searches the hashes — a handful of 16-byte ranged reads — and a second “verify” hash rejects collisions. An exact DOI lands on the exact work, or nothing, with no text search.
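The lookup can be sketched end to end with an in-memory table. Several details are assumptions made for illustration: the 16-byte record is laid out as a u64 hash, a u32 verify hash, and a u32 doc ID; both hashes come from BLAKE2b; and normalization lowercases and strips a doi.org prefix. The DOIs are made-up examples.

```python
import hashlib
import struct

RECORD = struct.Struct("<QII")  # hypothetical layout: hash, verify hash, doc ID


def normalize_doi(text: str) -> str:
    s = text.strip().lower()  # DOIs are case-insensitive
    for prefix in ("https://doi.org/", "http://doi.org/", "doi:"):
        if s.startswith(prefix):
            s = s[len(prefix):]
    return s


def hashes(doi: str) -> tuple[int, int]:
    digest = hashlib.blake2b(doi.encode(), digest_size=12).digest()
    return int.from_bytes(digest[:8], "little"), int.from_bytes(digest[8:], "little")


def build(dois: list[str]) -> bytes:
    """Records sorted by primary hash; doc ID is the position in `dois`."""
    rows = sorted(hashes(normalize_doi(d)) + (doc_id,) for doc_id, d in enumerate(dois))
    return b"".join(RECORD.pack(*row) for row in rows)


def lookup(table: bytes, query: str):
    h, verify = hashes(normalize_doi(query))
    lo, hi = 0, len(table) // RECORD.size
    while lo < hi:  # binary search for the first record with hash >= h
        mid = (lo + hi) // 2
        if RECORD.unpack_from(table, mid * RECORD.size)[0] < h:
            lo = mid + 1
        else:
            hi = mid
    n = len(table) // RECORD.size
    while lo < n:  # scan records sharing the primary hash
        rec_hash, rec_verify, doc_id = RECORD.unpack_from(table, lo * RECORD.size)
        if rec_hash != h:
            break
        if rec_verify == verify:  # second hash rejects collisions
            return doc_id
        lo += 1
    return None


table = build(["10.1234/example.one", "10.1234/Example.Two"])
found = lookup(table, "https://doi.org/10.1234/EXAMPLE.TWO")
```

Each probe of the binary search reads one 16-byte record, so against a remote file it maps onto one small ranged read per step.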

.rrsc · Sort columns · secondary indexes & client-side re-rank

An optional, range-fetchable store of dense values per doc ID: the build-time counterpart of an alternate sort key. A search returns IDs in citation rank; an .rrsc column lets the reader fetch one value per candidate and select the top-K client-side (sort by date, rating, any metric), with ties broken by doc ID. The same container also backs a full second index: a u32 permutation column maps secondary_docid → primary_docid, so a results page (a contiguous run of secondary IDs) resolves back to primary IDs in a single ranged read.
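Client-side re-ranking with a sort column can be sketched as below. The column is modelled as a plain mapping from doc ID to value, and the permutation column as a list; both stand in for range-fetched data, and the function names are hypothetical.

```python
import heapq


def top_k_by_column(candidates, column, k, descending=True):
    """Top-K candidates by column value, ties broken by lower doc ID."""
    if descending:
        key = lambda doc_id: (-column[doc_id], doc_id)
    else:
        key = lambda doc_id: (column[doc_id], doc_id)
    return heapq.nsmallest(k, candidates, key=key)


def resolve_page(permutation, start, size):
    """Map a contiguous run of secondary IDs back to primary doc IDs."""
    return permutation[start:start + size]


year = {0: 2016, 1: 2020, 2: 2020, 3: 1999}  # toy "publication year" column
newest = top_k_by_column([0, 1, 2, 3], year, k=3)
oldest = top_k_by_column([0, 1, 2, 3], year, k=1, descending=False)
page = resolve_page([3, 0, 2, 1], start=1, size=2)
```

Docs 1 and 2 tie on year, so the lower doc ID (the more-cited work) comes first.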

Searching millions of papers, with no server.