OpenAlex Search · a RoaringRange demo · how it works
Powered by RoaringRange
Searching millions of papers, with no server.
This RoaringRange demo searches OpenAlex works entirely in your browser, with no backend.
The index is one static file on S3, and a query fetches just a few small byte-ranges over HTTP, never the whole file. It's powered by RoaringRange, and inspired by Lunr.js and Pagefind.
47.8M OpenAlex works indexed
0 servers · backends
~3 MB downloaded at boot
KB–MB fetched per query
01 · From work to document
The OpenAlex data, mapped
Each OpenAlex Work — a paper, book chapter, dataset, preprint — becomes one ranked document.
Documents are numbered by cited_by_count, most-cited first, which is exactly what lets the popular “head” paint instantly.
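The numbering can be illustrated with a minimal sketch. The function name and the tie-break on the OpenAlex id are assumptions made for the example, not details confirmed by the demo.

```python
def assign_doc_ids(works):
    """Number works by cited_by_count, most-cited first (doc ID 0 = most cited)."""
    # Assumption: ties are broken on the OpenAlex id so the numbering is deterministic.
    ordered = sorted(works, key=lambda w: (-w["cited_by_count"], w["id"]))
    return {w["id"]: doc_id for doc_id, w in enumerate(ordered)}


works = [
    {"id": "W1", "cited_by_count": 12},
    {"id": "W2", "cited_by_count": 950},
    {"id": "W3", "cited_by_count": 12},
]
doc_ids = assign_doc_ids(works)
```

Because low doc IDs are the most-cited works, any ascending prefix of a posting list is already in ranked order, which is why the head can be shown before the rest arrives.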
What’s indexed
The searchable text of each work is concatenated and cut into trigrams, so a query matches across any field —
a phrase from the abstract, an author, or a journal, not just the title.
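As a rough sketch of trigram tokenization: the lowercasing and whitespace collapsing below are assumptions, since the page does not say how the builder normalizes text. Containing every query trigram is a necessary condition for a substring match, so the helper is named `may_match` rather than `matches`.

```python
def trigrams(text):
    """Return the set of 3-character windows over normalized text."""
    # Assumption: lowercase and collapse runs of whitespace before windowing.
    norm = " ".join(text.lower().split())
    return {norm[i:i + 3] for i in range(len(norm) - 2)}


def may_match(query, document_text):
    """True when every query trigram also occurs somewhere in the document."""
    return trigrams(query) <= trigrams(document_text)


doc = "Deep Residual Learning for Image Recognition Kaiming He IEEE CVPR"
```

Here `may_match("residual learn", doc)` holds, and so does `may_match("kaiming", doc)`, because the author name was concatenated into the same searchable text.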
OpenAlex doesn’t ship abstracts as plain text; it stores an abstract_inverted_index — a map of
each word to the positions where it occurs — so the builder reconstructs the running abstract from that
index before indexing it (that’s what the abstract ← abstract_inverted_index chip below means).
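A minimal sketch of that reconstruction, assuming the inverted index arrives as a mapping from each word to a list of zero-based positions (the function name is hypothetical):

```python
def rebuild_abstract(inverted_index):
    """Turn {word: [positions]} back into running text."""
    by_position = {}
    for word, positions in inverted_index.items():
        for pos in positions:
            by_position[pos] = word
    # Emit the words in position order; gaps in the numbering are simply skipped.
    return " ".join(by_position[pos] for pos in sorted(by_position))


sample = {"to": [0, 4], "be": [1, 5], "or": [2], "not": [3]}
abstract = rebuild_abstract(sample)
```

A word that occurs several times lists several positions, so it is written back once per occurrence.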
Five fields from each work become filter categories — the .rrf sidecar. Counts are free; selecting one fetches a single small posting.
Year (publication_year): The work’s publication year.
Type (type): article, book-chapter, dataset, preprint, …
Open access (oa_status): gold, green, hybrid, bronze, or closed.
Language (language): Shown as a full language name.
Topic (primary_topic): Falls back to the first concept.
OR / AND: Within a field, categories OR together; across fields they AND. Everything is computed from doc-ID bitmaps, with no backend.
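The OR-within, AND-across rule can be sketched with plain Python sets standing in for roaring bitmaps. The data layout and function name here are illustrative assumptions.

```python
def apply_facets(candidates, selections, facet_bitmaps):
    """Keep candidates allowed by every field that has a selection.

    selections:    {field: [category, ...]}
    facet_bitmaps: {field: {category: set_of_doc_ids}}  (sets stand in for bitmaps)
    """
    allowed = set(candidates)
    for field, categories in selections.items():
        if not categories:
            continue  # nothing selected in this field, so it does not filter
        field_union = set()
        for category in categories:  # OR within a field
            field_union |= facet_bitmaps[field].get(category, set())
        allowed &= field_union  # AND across fields
    return sorted(allowed)  # ascending doc ID is citation-rank order


facet_bitmaps = {
    "year": {2016: {0, 3, 5}, 2017: {1, 4}},
    "oa_status": {"green": {0, 1, 2}, "gold": {3, 4}},
}
hits = apply_facets(
    range(6), {"year": [2016, 2017], "oa_status": ["green"]}, facet_bitmaps
)
```

In this toy data, works from 2016 or 2017 that are also green open access come out as doc IDs 0 and 1.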
What a result stores
The record behind each hit keeps just enough to render a card and link out — opaque bytes the format never inspects.
Example result card
Title: Deep Residual Learning for Image Recognition
Citations: 211,420
Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
Year: 2016 • Host venue: IEEE CVPR • OA: green
Abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously…
ID: W2194775991 → openalex.org/W2194775991
DOI lookup
A small companion index — the .rril file — maps each work’s DOI straight to its document.
Pasting a DOI, or a doi.org URL, jumps to the exact work in a couple of byte-range reads instead of a text search.
The format appears in section 04.
02 · Query → range reads
Searching in the browser
A query tokenizes into trigrams, fetches matching dictionary blocks and head postings as parallel HTTP Range requests,
ANDs the roaring bitmaps, and reads the page’s records — a few KB–MB in total. The multi-GB file is never downloaded whole.
One query, a near-constant number of round-trips — independent of corpus size.
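For readers unfamiliar with byte-range reads, here is a hedged sketch using only the Python standard library. The URL and offsets would come from the index; the helper names are made up for this example, and the demo itself runs in the browser rather than in Python.

```python
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def range_header(offset, length):
    """HTTP Range header value for `length` bytes starting at `offset`."""
    if length <= 0:
        raise ValueError("length must be positive")
    # The end position of an HTTP byte range is inclusive.
    return f"bytes={offset}-{offset + length - 1}"


def fetch_range(url, offset, length):
    """Fetch one byte range; a server that honors it answers 206 Partial Content."""
    request = urllib.request.Request(url, headers={"Range": range_header(offset, length)})
    with urllib.request.urlopen(request) as response:
        if response.status != 206:
            raise RuntimeError(f"expected 206, got {response.status}")
        return response.read()


def fetch_parallel(url, spans):
    """Issue several (offset, length) reads concurrently, preserving order."""
    with ThreadPoolExecutor(max_workers=8) as pool:
        return list(pool.map(lambda span: fetch_range(url, *span), spans))
```

Only `range_header` is pure; the two fetch helpers need a reachable server that supports range requests.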
flow: In-browser search · HTTP range reads
Two tiers in time: the popular head paints first, then the full tail fills itself in.
head / tail: Two-tier search · instant top-K, then the rest
Doc IDs are numbered by citation count, so the most-cited matches live in the head and a ranked top-K paints instantly from one cheap wave. Once the query settles, the tail auto-loads — intersecting the rarest posting first, then fetching only the roaring containers surviving candidates still touch (adjacent ones coalesced into single reads). A rare phrase of common trigrams costs a few hundred KB, not the tens of MB of reading every full tail.
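The tail strategy can be sketched as follows. Each posting is modeled as a dict from roaring container key (the doc ID's high 16 bits) to the doc IDs in that container, and a dict lookup stands in for a ranged read. The data model and names are assumptions for illustration only.

```python
def intersect_rarest_first(postings):
    """Intersect postings, reading only containers that survivors still touch.

    Returns (surviving_doc_ids, number_of_container_reads).
    """
    ordered = sorted(postings, key=lambda p: sum(len(ids) for ids in p.values()))
    survivors = set().union(*ordered[0].values())  # read the rarest posting fully
    reads = len(ordered[0])
    for posting in ordered[1:]:
        needed_keys = {doc_id >> 16 for doc_id in survivors}
        fetched = set()
        for key in needed_keys:
            if key in posting:
                fetched |= posting[key]
                reads += 1
        survivors &= fetched
    return survivors, reads


def coalesce(byte_ranges):
    """Merge touching or overlapping half-open (start, end) byte ranges."""
    merged = []
    for start, end in sorted(byte_ranges):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(r) for r in merged]


rare = {0: {5}, 2: {131080}}
common = {0: {1, 5, 9}, 1: {65540}, 2: {131080, 131090}, 3: {200000}}
survivors, reads = intersect_rarest_first([common, rare])
```

In this toy case the common posting has four containers, but only the two that the rare posting's survivors touch are read.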
Boot is cheap because only the small top of each file loads up front.
boot: What boots · versus what stays in S3
At boot the browser downloads only the small top of each file — the index header + sparse index, the facet metadata + top category heads, and the record store’s header — a few MB total. Everything else (dictionary + postings, facet tails, record blobs — tens of GB) stays in S3 and is range-fetched only as a query needs it.
03 · Meaning, not characters
Semantic & hybrid search
Trigram search matches characters — exact substrings. Semantic search matches meaning:
the query and every work become vectors, and the nearest by cosine win — even with no shared words.
Same static ethos — an IVFPQ index (the .rrvi file) boots once, and each query
range-fetches only a few clusters.
From query to nearest works
The corpus is partitioned into nlist clusters by coarse centroids, and every vector is stored as a
compact product-quantization code — 32 bytes, not a 2 KB float array — optionally rotated
by OPQ for accuracy. Boot downloads the small boot region once: the OPQ rotation, the centroids,
the PQ codebooks, and a cluster directory. Per query, the reader finds the nprobe nearest centroids
in memory, range-fetches just those clusters’ code lists, and ranks them by asymmetric distance
— the real query vector scored against quantized codes through precomputed tables, no full vectors needed.
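The probe-then-rank steps can be sketched in pure Python. This toy uses squared L2 distance (which orders unit vectors the same way as cosine), omits the OPQ rotation, and uses tiny made-up codebooks; none of the names come from the actual reader.

```python
def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroids(query, centroids, nprobe):
    """Indices of the nprobe coarse centroids closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: sq_dist(query, centroids[i]))
    return order[:nprobe]


def adc_tables(query, codebooks):
    """One lookup table per subspace: distance from the query slice to each codeword."""
    sub = len(query) // len(codebooks)
    return [
        [sq_dist(query[j * sub:(j + 1) * sub], codeword) for codeword in book]
        for j, book in enumerate(codebooks)
    ]


def adc_distance(code, tables):
    """Asymmetric distance: real query vs. quantized code, via table lookups only."""
    return sum(tables[j][c] for j, c in enumerate(code))


def rank_cluster(ids, codes, tables, k):
    scored = sorted((adc_distance(code, tables), doc_id) for doc_id, code in zip(ids, codes))
    return [doc_id for _, doc_id in scored[:k]]


query = [1.0, 1.0, 0.0, 0.0]
codebooks = [[[0.0, 0.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 1.0]]]  # 2 subspaces x 2 codewords
tables = adc_tables(query, codebooks)
top = rank_cluster([7, 9, 3], [(1, 0), (0, 0), (1, 1)], tables, k=2)
probes = nearest_centroids(query, [[0, 0, 0, 0], [1, 1, 0, 0], [1, 1, 1, 1]], nprobe=2)
```

A 32-byte code would be consistent with 32 subspaces of one u8 each over a 512-d vector, but that split is an inference rather than something the page states.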
query text → embed (512-d unit vector) → nprobe clusters (range reads) → PQ ADC rank → top-K doc IDs
The query path mirrors the trigram one — a near-constant number of round-trips, independent of corpus size.
The query embeds to a 512-d vector (in the browser, or via the mode-1 Lambda), the reader picks the nprobe nearest centroids from the in-memory boot region, and range-fetches only those clusters’ PQ codes to rank by asymmetric distance. Out come ranked doc IDs in the same numbering as the text index — so records render through the very same path.
Two ways to embed a query
The search machinery is identical; only the embedder differs. Each model defines its own vector space,
so each mode has its own .rrvi — but the corpus is embedded once, offline, for $0 either way.
In the browser — model2vec
mode 2 · no backend
A static model (potion-retrieval-32M) ported to wasm: tokenize (BERT WordPiece), average
the per-token vectors, normalize. No neural net, no API key, $0. The token matrix
(.rrm2, ~33 MB int8) downloads once and is browser-cached. Slightly lower quality than a transformer.
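The average-then-normalize step is small enough to sketch directly. Tokenization is assumed to have already produced token ids, and the int8 dequantization scale is left out; the function name and toy matrix are hypothetical.

```python
import math


def embed(token_ids, token_matrix):
    """Mean-pool the per-token rows, then L2-normalize to a unit vector."""
    dim = len(token_matrix[0])
    if not token_ids:
        return [0.0] * dim
    mean = [
        sum(token_matrix[t][d] for t in token_ids) / len(token_ids)
        for d in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean] if norm else mean


token_matrix = [[3.0, 0.0], [0.0, 4.0]]  # toy 2-token, 2-d matrix
vector = embed([0, 1], token_matrix)
```

There is no attention or feed-forward pass here, which is why this path can run in wasm without a neural runtime.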
Lambda-based — EmbeddingGemma
mode 1 · AWS Lambda
EmbeddingGemma-300M runs in a Lambda as ONNX (no PyTorch), fronted by CloudFront. A real
transformer with asymmetric prompts (query: vs document:) gives stronger
paraphrase recall; ~40 ms warm. Only the live query calls it — the corpus was embedded locally.
INVARIANT: Corpus and query must share the identical model, pooling, and prompt, or the query lands in the wrong space. Both modes guarantee it: the same model embeds both sides (the Lambda runs the very model the corpus was built with).
Facets over vector results
Vectors reuse the exact doc-ID numbering of the text index and the facet sidecar
(vector_id == doc_id) — so the same roaring-bitmap facets apply with no remapping.
Filtering a semantic search keeps only the candidate IDs that fall in the selected category bitmaps,
the same membership test the trigram path already does. Two strategies trade recall for simplicity:
Post-filter
retrieve → drop
Over-fetch a larger top-N, then drop IDs outside the selected bitmaps. Dead simple — one bitmap test per candidate — but a very selective filter can thin a page, so N is raised to compensate.
Pre-filter (ID selector)
filter while scanning
Push the allowed-set bitmap into the cluster scan: skip codes whose doc ID isn’t allowed and probe more clusters until top-K fills. Guarantees a full page even under a tight filter.
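The two strategies differ mainly in where the membership test happens. The sketch below uses a Python set for the allowed bitmap and a precomputed distance per doc ID; the stopping rule in `pre_filter` is a simplification assumed for illustration.

```python
def post_filter(ranked_ids, allowed, page_size):
    """Retrieve first, then drop IDs outside the allowed set."""
    return [doc_id for doc_id in ranked_ids if doc_id in allowed][:page_size]


def pre_filter(clusters_nearest_first, allowed, distance, page_size):
    """Skip disallowed IDs while scanning; probe more clusters until the page fills."""
    kept = []
    for cluster in clusters_nearest_first:
        kept.extend(doc_id for doc_id in cluster if doc_id in allowed)
        if len(kept) >= page_size:
            break
    return sorted(kept, key=lambda doc_id: distance[doc_id])[:page_size]


allowed = {1, 2, 8}
thin_page = post_filter([4, 9, 2, 7, 1], allowed, page_size=3)
distance = {2: 0.3, 1: 0.5, 8: 0.7}
full_page = pre_filter([[4, 9, 2], [7, 1], [8, 3]], allowed, distance, page_size=3)
```

With a selective filter the post-filtered page comes back short (two hits out of five candidates), while the pre-filter keeps probing until it has three.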
Counts are computed over the retrieved candidate pool rather than a complete bitmap —
“semantic matches” is a ranked list, not an exact set — with the same in-wasm machinery and no backend.
And hybrid inherits all of it: it fuses the filtered trigram and filtered semantic lists.
Hybrid — reciprocal rank fusion
Run both. Trigram nails exact terms (names, identifiers, rare phrases); semantic catches paraphrase and synonymy.
Reciprocal rank fusion blends the two ranked lists without needing comparable scores: each list contributes
1 / (k + rank) to a doc (k ≈ 60), so anything near the top of either ranks high.
RRF: score(doc) = Σ_list 1 / (k + rank_list(doc)), summed across the trigram and semantic lists. Exact matches and meaning matches are fused into one ranking, with no score calibration required.
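Reciprocal rank fusion fits in a few lines. This sketch assumes 1-based ranks and breaks score ties by doc ID; both are conventional choices rather than details confirmed by the page.

```python
def rrf(ranked_lists, k=60):
    """Fuse ranked lists of doc IDs; each list adds 1 / (k + rank) per doc."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties fall back to the lower (more-cited) doc ID.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


trigram_hits = [10, 20, 30]
semantic_hits = [20, 40, 10]
fused = rrf([trigram_hits, semantic_hits])
```

Doc 20 wins because it sits near the top of both lists, while docs that appear in only one list follow in order of their single rank.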
.rrvi: Vector index · one static file
A 48-byte header, then the boot region — an optional OPQ rotation, the coarse centroids, the PQ codebooks, and a cluster directory — read once into memory. Each query fetches only the nprobe nearest clusters’ code lists ([u32 ids][u8 codes]). An optional .rrm2 (mode-2 static matrix) and .rrvr (bf16 full-vector re-rank sidecar, fetched only for the surviving top-K) ride alongside. The whole index is never downloaded.
04 · One index, a family of sidecars
The file formats
One static text index plus a family of companion files. Every box below loads by HTTP byte-range; nothing runs server-side.
Doc IDs are assigned by citation count throughout, so the most-cited works sit in the “head” and top-K is free.
.rrs: Text index · one static file
A 20-byte header, a small sparse index loaded once, a key-sorted dictionary, then popularity-split postings. The header now carries headBoundary — a dynamic head/tail split (a multiple of 65,536, raised for larger corpora) — so the head holds the most-popular docs and top-K stays free. Boot keeps the header + sparse index in memory; each query binary-searches it, fetches one dictionary block, then the head postings per trigram.
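Two pieces of this lookup can be sketched without knowing the byte layout: the binary search over the sparse index, and the head/tail split at a boundary aligned to 65,536, the span of one roaring container. The in-memory shapes below are assumptions for the example.

```python
import bisect

CONTAINER_SPAN = 1 << 16  # 65,536 doc IDs per roaring container


def find_block(sparse_keys, trigram):
    """Index of the dictionary block whose first key is <= trigram, else None.

    sparse_keys: sorted first keys, one per dictionary block (assumed layout).
    """
    i = bisect.bisect_right(sparse_keys, trigram) - 1
    return i if i >= 0 else None


def split_posting(doc_ids, head_boundary):
    """Split ascending doc IDs into (head, tail) at a container-aligned boundary."""
    if head_boundary % CONTAINER_SPAN:
        raise ValueError("headBoundary must be a multiple of 65,536")
    head = [d for d in doc_ids if d < head_boundary]
    tail = [d for d in doc_ids if d >= head_boundary]
    return head, tail


block = find_block(["aaa", "dee", "res"], "eep")
head, tail = split_posting([3, 70000, 131072, 500000], 2 * CONTAINER_SPAN)
```

Keeping the boundary on a container edge means no container straddles the head and the tail.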
.rrf: Facet sidecar · filtering without a backend
Maps each category (year, type, language, …) to a doc-ID bitmap. Header, tables, and names load once — so listing categories and their full-corpus counts is free; selecting one fetches a single small head posting. Because facet doc IDs share the index’s popularity order, postings split at the same dynamic headBoundary as the text index. Within a field categories OR; across fields they AND.
.idx / .bin: Record store · doc ID → stored fields
A search returns ranked doc IDs; the store turns each back into its fields. An offset index (.idx) maps a doc ID to a byte range in the record blob (.bin) — two ranged reads, and a page of consecutive IDs is one contiguous slice. Records are opaque bytes, so the format never dictates your data model.
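A sketch of the two-read pattern, assuming the .idx holds N+1 little-endian u64 offsets into the .bin blob. That layout is a guess made for the example, and byte-string slices stand in for ranged reads.

```python
import struct


def build_store(records):
    """Pack opaque records into (idx_bytes, blob_bytes)."""
    offsets, blob = [0], b""
    for record in records:
        blob += record
        offsets.append(len(blob))
    return struct.pack(f"<{len(offsets)}Q", *offsets), blob


def read_page(idx, blob, first_id, count):
    """Return `count` consecutive records starting at doc ID `first_id`."""
    # Read 1: count + 1 consecutive offsets from the index.
    raw = idx[first_id * 8:(first_id + count + 1) * 8]
    offsets = struct.unpack(f"<{count + 1}Q", raw)
    # Read 2: one contiguous slice of the blob covering the whole page.
    chunk = blob[offsets[0]:offsets[-1]]
    base = offsets[0]
    return [chunk[offsets[i] - base:offsets[i + 1] - base] for i in range(count)]


idx, blob = build_store([b"alpha", b"be", b"gamma!"])
page = read_page(idx, blob, first_id=1, count=2)
```

The records stay opaque throughout: the store only ever deals in byte offsets.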
.rril: DOI lookup · identifier → doc ID
Maps each work’s normalized DOI to a doc ID: a header, then fixed-size records sorted by hash. A lookup normalizes the query and binary-searches the hashes — a handful of 16-byte ranged reads — and a second “verify” hash rejects collisions. An exact DOI lands on the exact work, or nothing, with no text search.
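The lookup can be sketched under stated assumptions: a 16-byte record of u64 lookup hash, u32 verify hash, and u32 doc ID, with both hashes cut from SHA-256. The real record layout, hash function, and normalization rules are not given on the page, the file header is omitted, and the DOIs below are placeholders.

```python
import hashlib
import struct

RECORD = struct.Struct("<QII")  # assumed: lookup hash, verify hash, doc ID = 16 bytes


def normalize_doi(text):
    text = text.strip().lower()
    for prefix in ("https://doi.org/", "http://doi.org/", "doi.org/", "doi:"):
        if text.startswith(prefix):
            return text[len(prefix):]
    return text


def hashes(doi):
    digest = hashlib.sha256(doi.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "little"), int.from_bytes(digest[8:12], "little")


def build_table(doi_to_doc_id):
    records = []
    for doi, doc_id in doi_to_doc_id.items():
        lookup_hash, verify_hash = hashes(normalize_doi(doi))
        records.append((lookup_hash, verify_hash, doc_id))
    return b"".join(RECORD.pack(*r) for r in sorted(records))


def lookup(table, query):
    target, verify = hashes(normalize_doi(query))
    lo, hi = 0, len(table) // RECORD.size
    while lo < hi:
        mid = (lo + hi) // 2
        # Each probe touches exactly one 16-byte record (a ranged read in the browser).
        rec_hash, rec_verify, doc_id = RECORD.unpack_from(table, mid * RECORD.size)
        if rec_hash < target:
            lo = mid + 1
        elif rec_hash > target:
            hi = mid
        else:
            return doc_id if rec_verify == verify else None
    return None


table = build_table({"10.1234/Example.One": 0, "10.1234/example.two": 5})
```

A binary search over N records needs about log2(N) probes, so even tens of millions of DOIs resolve in a few dozen small reads at most.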
The .rrsc file is an optional, range-fetchable store of dense values per doc ID: the build-time counterpart of an alternate sort key. A search returns IDs in citation rank; an .rrsc column lets the reader fetch one value per candidate and compute the top-K client-side (sort by date, rating, any metric), with ties broken by doc ID. The same container also backs a full second index: a u32 permutation column maps secondary_docid → primary_docid, so a results page (a contiguous run) resolves back to primary IDs in a single ranged read.
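Both uses can be sketched with Python lists standing in for the range-fetched columns. The function names and the descending default are assumptions for illustration.

```python
import heapq


def top_k_by_column(candidates, column, k, descending=True):
    """Re-rank candidate doc IDs by a per-doc value, ties broken by doc ID."""
    if descending:
        key = lambda doc_id: (-column[doc_id], doc_id)
    else:
        key = lambda doc_id: (column[doc_id], doc_id)
    return heapq.nsmallest(k, candidates, key=key)


def resolve_page(permutation, start, count):
    """Map a contiguous run of secondary doc IDs back to primary doc IDs."""
    # One slice stands in for a single ranged read of the u32 permutation column.
    return permutation[start:start + count]


year_column = [2016, 2020, 2020, 1998]  # value for doc IDs 0..3
newest = top_k_by_column([0, 1, 2, 3], year_column, k=3)
primary_ids = resolve_page([2, 1, 0, 3], start=1, count=2)
```

Docs 1 and 2 share a year, so the tie falls to the lower doc ID, which is also the more-cited work.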