OpenAlex Search · a RoaringRange demo · how it works

Powered by RoaringRange

Searching millions of papers, with no server.

This RoaringRange demo searches OpenAlex works entirely in your browser, with no backend. The index is one static file on S3, and a query fetches just a few small byte ranges over HTTP, never the whole file. It is inspired by Lunr.js and Pagefind.

47.8M OpenAlex works indexed
0 servers · backends
~3 MB downloaded at boot
KB–MB fetched per query
01 From work to document

The OpenAlex data, mapped

Each OpenAlex Work — a paper, book chapter, dataset, preprint — becomes one ranked document. Documents are numbered by cited_by_count, most-cited first, which is exactly what lets the popular “head” paint instantly.

What’s indexed

The searchable text of each work is concatenated and cut into trigrams, so a query matches across any field — a phrase from the abstract, an author, or a journal, not just the title. OpenAlex doesn’t ship abstracts as plain text; it stores an abstract_inverted_index — a map of each word to the positions where it occurs — so the builder reconstructs the running abstract from that index before indexing it (that’s what the abstract ← abstract_inverted_index chip below means).

title · abstract ← abstract_inverted_index · author names · host venue → trigrams
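The two builder steps described above, rebuilding the abstract and cutting text into trigrams, can be sketched as follows. This is a minimal illustration, not the demo's builder: the function names are hypothetical, and the normalization (lower-casing, character-level windows) is an assumption.

```rust
use std::collections::{BTreeSet, HashMap};

/// Rebuild running text from an OpenAlex-style inverted index
/// (word -> positions where it occurs).
fn reconstruct_abstract(inverted: &HashMap<String, Vec<usize>>) -> String {
    // Flatten to (position, word) pairs, then order them by position.
    let mut slots: Vec<(usize, &str)> = inverted
        .iter()
        .flat_map(|(word, positions)| positions.iter().map(move |&p| (p, word.as_str())))
        .collect();
    slots.sort_unstable();
    slots.iter().map(|&(_, w)| w).collect::<Vec<_>>().join(" ")
}

/// Distinct character trigrams of the lower-cased text (assumed normalization).
fn trigrams(text: &str) -> BTreeSet<String> {
    let chars: Vec<char> = text.to_lowercase().chars().collect();
    chars
        .windows(3)
        .map(|w| w.iter().collect::<String>())
        .collect()
}
```

Because every field is concatenated before this step, a trigram from an author name and one from the abstract end up in the same dictionary.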

The five facets

Five fields from each work become filter categories — the .rrf sidecar. Counts are free; selecting one fetches a single small posting.

Year · publication_year · The work’s publication year.
Type · type · article, book-chapter, dataset, preprint, …
Open access · oa_status · gold, green, hybrid, bronze, or closed.
Language · language · Shown as a full language name.
Topic · primary_topic · Falls back to the first concept.

OR / AND: Within a field, categories OR together; across fields they AND. Everything is computed from doc-ID bitmaps, with no backend.
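The OR-within, AND-across rule can be sketched with plain sets. This is an illustration only: a `BTreeSet<u32>` stands in for a roaring bitmap, and `DocSet`, `doc_set`, and `apply_facets` are hypothetical names, not the library's API.

```rust
use std::collections::BTreeSet;

/// Stand-in for a roaring bitmap: a sorted set of doc IDs.
type DocSet = BTreeSet<u32>;

/// Small helper to build a DocSet from a slice of IDs.
fn doc_set(ids: &[u32]) -> DocSet {
    ids.iter().copied().collect()
}

/// `selected[f]` holds the postings of the categories picked in field `f`.
/// Categories OR within a field, fields AND together, and the whole
/// filter ANDs with the text match.
fn apply_facets(text_match: &DocSet, selected: &[Vec<DocSet>]) -> DocSet {
    let mut result = text_match.clone();
    for field in selected {
        if field.is_empty() {
            continue; // nothing selected in this field: no constraint
        }
        // Union of every selected category in this field.
        let mut any = DocSet::new();
        for category in field {
            any.extend(category.iter().copied());
        }
        // Intersect with what survived the previous fields.
        result = result.intersection(&any).copied().collect();
    }
    result
}
```

For example, selecting two years and one open-access status keeps a document only if it is in either year and also has that status.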

What a result stores

The record behind each hit keeps just enough to render a card and link out — opaque bytes the format never inspects.

Example result card

title: Deep Residual Learning for Image Recognition
citations: 211,420
authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
year: 2016 · host venue: IEEE CVPR · OA: green
abstract: Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously…
id: W2194775991 → openalex.org/W2194775991

DOI lookup

A small companion index, the .rril file, maps each work’s DOI straight to its document. Pasting a DOI, or a doi.org URL, jumps to the exact work through a short binary search of 16-byte range reads instead of a text search. The format appears in section 04.

03 Meaning, not characters

Semantic & hybrid search

Trigram search matches characters — exact substrings. Semantic search matches meaning: the query and every work become vectors, and the nearest by cosine win — even with no shared words. Same static ethos — an IVFPQ index (the .rrvi file) boots once, and each query range-fetches only a few clusters.

From query to nearest works

The corpus is partitioned into nlist clusters by coarse centroids, and every vector is stored as a compact product-quantization code — 32 bytes, not a 2 KB float array — optionally rotated by OPQ for accuracy. Boot downloads the small boot region once: the OPQ rotation, the centroids, the PQ codebooks, and a cluster directory. Per query, the reader finds the nprobe nearest centroids in memory, range-fetches just those clusters’ code lists, and ranks them by asymmetric distance — the real query vector scored against quantized codes through precomputed tables, no full vectors needed.

query text → embed (512-d unit vector) → nprobe clusters (range reads) → PQ ADC rank → top-K doc IDs

The query path mirrors the trigram one — a near-constant number of round-trips, independent of corpus size.

Figure · Semantic search flow: embed the query, then range-fetch the nearest clusters
No backend in mode 2; one tiny embed call in mode 1. The rest is the same range-read path as the text index.

Browser: WASM reader (Rust)
1. Boot once: OPQ + centroids + PQ codebooks + directory (in memory).
2. Embed query → 512-d unit vector (model2vec·wasm or Lambda·Gemma).
3. Find the nprobe nearest centroids (in memory).
4. Fetch those clusters' codes → ADC-rank → top-K doc IDs (PQ: 32 B/doc; no full vectors).
5. Optional facet filter: keep IDs in selected bitmaps.
6. Fetch record JSON → render cards.

CloudFront → S3 (+ Lambda)
boot region: .rrvi · centroids + codebooks + dir
model matrix: .rrm2 (static · mode 2) or Lambda · Gemma ONNX (mode 1)
cluster code lists: .rrvi · [u32 ids][u8 codes] × nlist clusters
records.bin

Requests
GET boot region (once)
GET .rrm2 (once · mode 2) or POST /embed (mode 1)
GET nprobe cluster lists (parallel)
GET records (coalesced)

Legend: downloaded once · range-fetched per query · live (mode 1)

Rust: VectorIndex::open(f).search(query_vec, k, nprobe) → [ VectorHit { doc_id, score } ]
The query embeds to a 512-d vector (in the browser, or via the mode-1 Lambda), the reader picks the nprobe nearest centroids from the in-memory boot region, and range-fetches only those clusters’ PQ codes to rank by asymmetric distance. Out come ranked doc IDs in the same numbering as the text index — so records render through the very same path.
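Ranking by asymmetric distance can be sketched as two steps: build one lookup table per PQ subspace from the real query vector, then score each stored code with table lookups only. This is a sketch under assumptions: the function names are hypothetical, inner product is used as the score (reasonable for unit vectors), and the optional OPQ rotation is left out.

```rust
/// Per-subspace lookup tables: tables[j][c] = dot(query_j, codebooks[j][c]).
/// Assumes unit vectors, so a larger inner product means a nearer neighbor.
fn adc_tables(query: &[f32], codebooks: &[Vec<Vec<f32>>]) -> Vec<Vec<f32>> {
    let dsub = query.len() / codebooks.len();
    codebooks
        .iter()
        .enumerate()
        .map(|(j, book)| {
            // Slice of the query that belongs to subspace j.
            let q = &query[j * dsub..(j + 1) * dsub];
            book.iter()
                .map(|entry| q.iter().zip(entry).map(|(a, b)| a * b).sum::<f32>())
                .collect::<Vec<f32>>()
        })
        .collect()
}

/// Score every code in one cluster list ([ids][codes], m bytes per doc)
/// and keep the top-k (doc_id, score) pairs.
fn adc_rank(tables: &[Vec<f32>], ids: &[u32], codes: &[u8], k: usize) -> Vec<(u32, f32)> {
    let m = tables.len();
    let mut hits: Vec<(u32, f32)> = ids
        .iter()
        .zip(codes.chunks_exact(m))
        .map(|(&id, code)| {
            // One table lookup per subspace; no full vector is needed.
            let score = code
                .iter()
                .zip(tables)
                .map(|(&c, t)| t[c as usize])
                .sum::<f32>();
            (id, score)
        })
        .collect();
    // Best score first; ties fall back to the lower doc ID.
    hits.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    hits.truncate(k);
    hits
}
```

The tables are built once per query, so scoring a document costs m lookups and additions regardless of the vector dimension.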

Two ways to embed a query

The search machinery is identical; only the embedder differs. Each model defines its own vector space, so each mode has its own .rrvi — but the corpus is embedded once, offline, for $0 either way.

In the browser — model2vec
mode 2 · no backend
A static model (potion-retrieval-32M) ported to wasm: tokenize (BERT WordPiece), average the per-token vectors, normalize. No neural net, no API key, $0. The token matrix (.rrm2, ~33 MB int8) downloads once and is browser-cached. Slightly lower quality than a transformer.
Lambda-based — EmbeddingGemma
mode 1 · AWS Lambda
EmbeddingGemma-300M runs in a Lambda as ONNX (no PyTorch), fronted by CloudFront. A real transformer with asymmetric prompts (query: vs document:) gives stronger paraphrase recall; ~40 ms warm. Only the live query calls it — the corpus was embedded locally.
INVARIANT: Corpus and query must share the identical model, pooling, and prompt, or the query lands in the wrong space. Both modes guarantee it: the same model embeds both sides (the Lambda runs the very model the corpus was built with).

Facets over vector results

Vectors reuse the exact doc-ID numbering of the text index and the facet sidecar (vector_id == doc_id) — so the same roaring-bitmap facets apply with no remapping. Filtering a semantic search keeps only the candidate IDs that fall in the selected category bitmaps, the same membership test the trigram path already does. Two strategies trade recall for simplicity:

Post-filter
retrieve → drop
Over-fetch a larger top-N, then drop IDs outside the selected bitmaps. Dead simple — one bitmap test per candidate — but a very selective filter can thin a page, so N is raised to compensate.
Pre-filter (ID selector)
filter while scanning
Push the allowed-set bitmap into the cluster scan: skip codes whose doc ID isn’t allowed and probe more clusters until top-K fills. Guarantees a full page even under a tight filter.
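The difference between the two strategies can be sketched with plain sets. This is illustrative only: `BTreeSet<u32>` stands in for the allowed-set bitmap, both function names are hypothetical, and ADC scoring is omitted so that only the filtering logic remains.

```rust
use std::collections::BTreeSet;

/// Post-filter: take an over-fetched ranked list, drop IDs outside
/// `allowed`, and keep at most k. May return fewer than k.
fn post_filter(ranked: &[u32], allowed: &BTreeSet<u32>, k: usize) -> Vec<u32> {
    ranked
        .iter()
        .copied()
        .filter(|id| allowed.contains(id))
        .take(k)
        .collect()
}

/// Pre-filter: scan clusters nearest-first, skip disallowed IDs during
/// the scan, and keep probing until at least k candidates are collected
/// or the clusters run out. Returns the candidates and the probe count.
fn pre_filter(clusters: &[Vec<u32>], allowed: &BTreeSet<u32>, k: usize) -> (Vec<u32>, usize) {
    let mut kept = Vec::new();
    let mut probed = 0;
    for cluster in clusters {
        if kept.len() >= k {
            break;
        }
        probed += 1;
        kept.extend(cluster.iter().copied().filter(|id| allowed.contains(id)));
    }
    (kept, probed)
}
```

With a tight filter, the post-filter call can come back short, while the pre-filter call widens the probe until the page fills.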

Counts are computed over the retrieved candidate pool rather than a complete bitmap — “semantic matches” is a ranked list, not an exact set — with the same in-wasm machinery and no backend. And hybrid inherits all of it: it fuses the filtered trigram and filtered semantic lists.

Hybrid — reciprocal rank fusion

Run both. Trigram nails exact terms (names, identifiers, rare phrases); semantic catches paraphrase and synonymy. Reciprocal rank fusion blends the two ranked lists without needing comparable scores: each list contributes 1 / (k + rank) to a doc (k ≈ 60), so anything near the top of either ranks high.

RRF: score(doc) = Σ_list 1 / (k + rank_list(doc)), summed across the trigram and semantic lists. Exact matches and meaning matches fuse into one ranking, with no score calibration required.
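A minimal sketch of the fusion, assuming 1-based ranks and a tie-break on the lower doc ID (both are assumptions; the function name is hypothetical):

```rust
use std::collections::HashMap;

/// Fuse ranked doc-ID lists with reciprocal rank fusion.
/// Each list adds 1 / (k + rank) to a doc's score; ranks start at 1.
fn rrf(lists: &[&[u32]], k: f64) -> Vec<(u32, f64)> {
    let mut scores: HashMap<u32, f64> = HashMap::new();
    for list in lists {
        for (i, &doc) in list.iter().enumerate() {
            *scores.entry(doc).or_insert(0.0) += 1.0 / (k + (i + 1) as f64);
        }
    }
    let mut fused: Vec<(u32, f64)> = scores.into_iter().collect();
    // Highest fused score first; ties fall back to the lower doc ID.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```

A doc that appears in both lists collects two contributions, which is why a result near the top of either list, and especially of both, ranks high.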
.rrvi · Vector index · one static file

Figure · Vector index: one static file (.rrvi). IVFPQ: a boot region read once, then per-cluster PQ code lists range-fetched only for the nprobe nearest clusters.

File layout (legend: loaded at boot · range-fetched)
Header · 48 B · magic "RRVI" · version · dim · nlist · m · nbits · metric · flags (OPQ?)
OPQ rot. · D×D · optional
Centroids · nlist×D f32
PQ codebooks · m×256×dsub f32
Directory · nlist×(u64,u32)
Cluster lists · per cluster · range-fetched

Boot region (read once, kept in memory)
centroids: nlist coarse vectors → find the nprobe nearest
codebooks: m subspaces × 256 entries → score PQ codes
OPQ: optional D×D rotation → applied to the query
directory: per cluster → (byte offset, count)
Centroids dominate the boot size (≈ nlist × D × 4 bytes).

One cluster list (fetched only if among the nprobe nearest)
u32 doc IDs × count, then u8 PQ codes × count · m
Doc IDs share the text index's numbering → records & facets reuse them with no remapping. count comes from the directory; m bytes/doc (here 32).

Reader: boot once, then one wave of nprobe ranged reads
boot: header + OPQ + centroids + codebooks + directory → memory (one read).
per query: embed → nprobe nearest centroids → GET those clusters' codes → ADC rank → top-K doc IDs.

Rust
write: build_ivfpq_from_parts(parts).write(w)
read: VectorIndex::open(f).search(q, k, nprobe)
A 48-byte header, then the boot region — an optional OPQ rotation, the coarse centroids, the PQ codebooks, and a cluster directory — read once into memory. Each query fetches only the nprobe nearest clusters’ code lists ([u32 ids][u8 codes]). An optional .rrm2 (mode-2 static matrix) and .rrvr (bf16 full-vector re-rank sidecar, fetched only for the surviving top-K) ride alongside. The whole index is never downloaded.
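The boot region's size follows directly from the layout above. The sketch below adds up the parts; the function name is hypothetical, directory entries are assumed to be packed at 12 bytes (u64 + u32), and the nlist value in the usage note is made up for illustration, since the demo's actual nlist is not stated.

```rust
/// Bytes in the .rrvi boot region: header, optional OPQ rotation,
/// centroids, PQ codebooks, and the cluster directory.
fn boot_region_bytes(dim: u64, nlist: u64, m: u64, opq: bool) -> u64 {
    let header = 48;
    let rotation = if opq { dim * dim * 4 } else { 0 }; // D x D f32
    let centroids = nlist * dim * 4; // nlist x D f32
    let dsub = dim / m; // dimensions per PQ subspace
    let codebooks = m * 256 * dsub * 4; // m x 256 x dsub f32
    let directory = nlist * (8 + 4); // (u64 offset, u32 count), assumed packed
    header + rotation + centroids + codebooks + directory
}
```

With 512-d vectors, m = 32, and a hypothetical nlist of 1,024, this comes to 2,633,776 bytes without OPQ and 3,682,352 bytes with it; the centroids account for 2,097,152 of those bytes.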
04 One index, a family of sidecars

The file formats

One static text index plus a family of companion files. Every box below loads by HTTP byte-range; nothing runs server-side. Doc IDs are assigned by citation count throughout, so the most-cited works sit in the “head” and top-K is free.

.rrs · Text index · one static file

Figure · Text index: one static file (.rrs). A trigram dictionary + popularity-split postings; a few small ranged reads per query, independent of corpus size.

File layout (legend: loaded at boot · range-fetched)
Header · 20 B · magic "RRSI" · version · gramSize · ngrams · stride · headBoundary
Sparse index · 8 B × ⌈ngrams/stride⌉
Dictionary · 24 B/entry · key-sorted
Postings · per trigram · [ head ][ tail ]

One dictionary entry (24 B, sorted by key)
key u64 · 8 B
headOffset u64 · 8 B
headSize u32
tailSize u32
tail at headOffset + headSize, length tailSize
fullSize = headSize + tailSize

Postings, per trigram: [ head ][ tail ]
head: docs [0, headBoundary) · tail: docs ≥ headBoundary
headBoundary: dynamic, a multiple of 65,536 (default 65,536). Doc IDs descend by popularity → head = top-K.

Reader: two range-fetch tiers over one static file
boot: header (20 B) + sparse index → kept in memory, ~tens of KB.
per query: binary-search the sparse index → fetch one dictionary block → the head postings per trigram (AND smallest-first).
tails: fetched only when a query needs more than the head's top-K.
Boot is ~tens of KB; each query is a handful of small ranged reads, independent of corpus size.

Rust
write: build::write_index(w, gram, stride, entries)
read: Index::open(f).search(q, limit)
A 20-byte header, a small sparse index loaded once, a key-sorted dictionary, then popularity-split postings. The header now carries headBoundary — a dynamic head/tail split (a multiple of 65,536, raised for larger corpora) — so the head holds the most-popular docs and top-K stays free. Boot keeps the header + sparse index in memory; each query binary-searches it, fetches one dictionary block, then the head postings per trigram.
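The dictionary step can be sketched as follows: pick a block from the in-memory sparse index, then decode a 24-byte entry into the head and tail byte ranges to fetch. The names are hypothetical, little-endian encoding is an assumption, and so is the reading that sparse entry i holds the key of dictionary entry i × stride.

```rust
fn le_u64(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(b);
    u64::from_le_bytes(a)
}

fn le_u32(b: &[u8]) -> u32 {
    let mut a = [0u8; 4];
    a.copy_from_slice(b);
    u32::from_le_bytes(a)
}

/// One 24-byte dictionary entry: key u64, headOffset u64, headSize u32, tailSize u32.
#[derive(Debug, PartialEq)]
struct DictEntry {
    key: u64,
    head_offset: u64,
    head_size: u32,
    tail_size: u32,
}

impl DictEntry {
    /// Decode from 24 bytes (little-endian is an assumption).
    fn parse(b: &[u8; 24]) -> Self {
        DictEntry {
            key: le_u64(&b[0..8]),
            head_offset: le_u64(&b[8..16]),
            head_size: le_u32(&b[16..20]),
            tail_size: le_u32(&b[20..24]),
        }
    }

    /// Byte range [start, end) of the head posting, fetched for every query.
    fn head_range(&self) -> (u64, u64) {
        (self.head_offset, self.head_offset + self.head_size as u64)
    }

    /// Byte range [start, end) of the tail, which starts right after the head.
    fn tail_range(&self) -> (u64, u64) {
        let start = self.head_offset + self.head_size as u64;
        (start, start + self.tail_size as u64)
    }
}

/// Index of the dictionary block that could hold `key`, or None if the
/// key sorts before the first entry.
fn dict_block(sparse: &[u64], key: u64) -> Option<usize> {
    sparse.partition_point(|&k| k <= key).checked_sub(1)
}
```

Since the tail sits directly after the head, a reader that needs both can fetch them as one contiguous range.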
.rrf · Facet sidecar · filtering without a backend

Figure · Facet sidecar: companion file (.rrf). Categorical filters (year, type, language, …) with free counts; same range-fetch model as the index.

File layout (legend: loaded at boot · range-fetched on demand)
Header · 24 B · magic "RRSF" · ver · #fields · #cats · strBytes
Field table · 16 B/field
Category table · 36 B/category, key-sorted
Strings · names
Postings · per category, range-fetched
Header + tables + strings (a few KB) load once at boot → listing categories and their full-corpus counts is then free.

One category entry (36 B)
key u64 · headOff u64 · headSize u32 · tailSize u32 · cardinality u32 · nameOff u32 · nameLen u16
cardinality = full-corpus doc count → facet counts are free (no fetch)

Postings, per category: [ head ][ tail ]
head: docs [0, headBoundary) · tail: docs ≥ headBoundary
Split tracks the index's headBoundary, a multiple of 65,536 (default 65,536), raised for larger corpora. A ≤ ~8 KB head per category; tails only when paging in.

Filter semantics (mirrors a BitmapFilter)
Within a field, selected categories OR together; distinct fields AND. Result = textMatch AND filter, applied to the head first.
Live (search-filtered) counts = |resultBitmap ∩ categoryHead|, computed in memory over the query's head result.
Doc IDs share the index's popularity order, so facet postings split at the same dynamic headBoundary and range-fetch the same way (see format.svg).

Rust
write: build::write_facets(w, fields)
read: FacetIndex::open(f).counts(result)
Maps each category (year, type, language, …) to a doc-ID bitmap. Header, tables, and names load once — so listing categories and their full-corpus counts is free; selecting one fetches a single small head posting. Because facet doc IDs share the index’s popularity order, postings split at the same dynamic headBoundary as the text index. Within a field categories OR; across fields they AND.
.idx / .bin · Record store · doc ID → stored fields

Figure · Record store: a hit's stored fields (.idx + .bin). A search returns ranked doc IDs; the store maps each to its record bytes over HTTP Range. Search → details, no backend.

Legend: loaded at boot · range-fetched on demand

records.idx (offset index)
Header · 16 B · magic "RRSR" · ver · count N
Offsets · (N+1) × u64, one per doc ID · off[d] … off[d+1] bound record d

records.bin (record blob, in doc-ID (rank) order)
rec 0 · rec 1 · rec 2 · rec 3 · rec N-1

lookup(doc id d): two ranged reads
1. read 16 B at idx[16 + d*8] → (off[d], off[d+1])
2. read bin[off[d] … off[d+1]) → record bytes
A results page is consecutive doc IDs → one contiguous blob slice (a single fetch).

Records are opaque to the library: the container is standard, the encoding is yours. The store frames bytes for O(1) lookup by doc ID; what's inside (JSON, msgpack, …) and which fields it holds is the application's choice. RecordStore (the reader) returns the raw bytes; the app decodes them.

Rust
write: build::write_records(bin, idx, recs)
read: RecordStore::open(idx, bin).get_many(ids)
A search returns ranked doc IDs; the store turns each back into its fields. An offset index (.idx) maps a doc ID to a byte range in the record blob (.bin) — two ranged reads, and a page of consecutive IDs is one contiguous slice. Records are opaque bytes, so the format never dictates your data model.
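The two ranged reads reduce to a little offset arithmetic, sketched here. The function names are hypothetical and little-endian offsets are an assumption; the 16-byte header and 8-byte offsets come from the layout above.

```rust
/// Size of the records.idx header in bytes.
const IDX_HEADER: u64 = 16;

/// Byte range [start, end) in records.idx that holds (off[d], off[d+1]).
fn idx_range(doc_id: u64) -> (u64, u64) {
    let start = IDX_HEADER + doc_id * 8;
    (start, start + 16)
}

/// Decode the two offsets from that 16-byte read (little-endian assumed).
/// The result is the byte range [off[d], off[d+1]) inside records.bin.
fn bin_range(pair: &[u8; 16]) -> (u64, u64) {
    let mut lo = [0u8; 8];
    let mut hi = [0u8; 8];
    lo.copy_from_slice(&pair[..8]);
    hi.copy_from_slice(&pair[8..]);
    (u64::from_le_bytes(lo), u64::from_le_bytes(hi))
}

/// For a page of `len` consecutive doc IDs starting at `first`, one idx
/// read of len + 1 offsets bounds the whole contiguous blob slice.
fn page_idx_range(first: u64, len: u64) -> (u64, u64) {
    let start = IDX_HEADER + first * 8;
    (start, start + (len + 1) * 8)
}
```

Each returned pair maps directly onto an HTTP `Range: bytes=start-(end-1)` request header.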
.rril · DOI lookup · identifier → doc ID

Figure · DOI lookup: an exact identifier straight to a doc ID (.rril). Paste a DOI or a doi.org URL; a handful of 16-byte ranged reads land on the exact work, with no text search.

Legend: loaded at boot · range-fetched

work.rril (identifier index, sorted by hash)
Header · 16 B · magic "RRIL" · ver · count N
Records · fixed 16 B each, sorted by hash64(doi) · hash64 u64 · verify u32 · docId u32

Binary-search the sorted hashes: halve the window per 16 B probe (lo · mid · hi).
≈ ⌈log₂ N⌉ ≈ 26 probes for 47.8M

Resolve a DOI in four steps (each probe is one 16 B Range read)
1. normalize: strip the doi.org / https prefix, lower-case → hash64(doi)
2. binary-search the record hashes → the candidate record
3. verify: a second hash rejects collisions (a miss → no result)
4. docId → the record store renders the exact work
An exact identifier lands on an exact work, or nothing, without ever touching the text index.
Maps each work’s normalized DOI to a doc ID: a header, then fixed-size records sorted by hash. A lookup normalizes the query and binary-searches the hashes — a handful of 16-byte ranged reads — and a second “verify” hash rejects collisions. An exact DOI lands on the exact work, or nothing, with no text search.
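The four lookup steps can be sketched in memory. Everything here is a stand-in: FNV-1a is used only as a placeholder hash (the format's actual hash64 and verify functions are not stated), a slice replaces the 16-byte range reads, and the names and sample DOIs are hypothetical.

```rust
/// Step 1: strip a doi.org / https prefix and lower-case.
fn normalize_doi(input: &str) -> String {
    let s = input.trim().to_lowercase();
    for prefix in &["https://doi.org/", "http://doi.org/", "doi.org/"] {
        if let Some(rest) = s.strip_prefix(*prefix) {
            return rest.to_string();
        }
    }
    s
}

/// Placeholder 64-bit hash (FNV-1a with a seed mixed in); an assumption.
fn fnv1a64(data: &[u8], seed: u64) -> u64 {
    let mut h = 0xcbf29ce484222325u64 ^ seed;
    for &b in data {
        h ^= b as u64;
        h = h.wrapping_mul(0x100000001b3);
    }
    h
}

/// One fixed 16-byte record: hash64 u64, verify u32, docId u32.
#[derive(Clone, Copy)]
struct Rec {
    hash: u64,
    verify: u32,
    doc_id: u32,
}

fn make_rec(doi: &str, doc_id: u32) -> Rec {
    let key = normalize_doi(doi);
    Rec {
        hash: fnv1a64(key.as_bytes(), 0),
        verify: fnv1a64(key.as_bytes(), 1) as u32,
        doc_id,
    }
}

/// Steps 2 to 4: binary-search the sorted hashes, then check the second
/// hash so a collision yields no result instead of the wrong work.
fn lookup(recs: &[Rec], doi: &str) -> Option<u32> {
    let probe = make_rec(doi, 0);
    let i = recs.binary_search_by_key(&probe.hash, |r| r.hash).ok()?;
    if recs[i].verify == probe.verify {
        Some(recs[i].doc_id)
    } else {
        None
    }
}
```

Over HTTP, each comparison inside the binary search becomes one 16-byte range read, which is where the ⌈log₂ N⌉ probe count comes from.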
.rrsc · Sort columns · secondary indexes & client-side re-rank

Figure · Sort columns: re-rank by a secondary key (RRSC). A dense value-per-doc column; fetch a candidate set's values over HTTP Range and top-K client-side. Sort by date, rating, any metric.

Legend: loaded at boot · range-fetched

work.rrsc (dense columns, indexed by doc ID)
Header · 16 B · magic "RRSC" · ver · colCount · rows N · strBytes
Column table · 24 B / column
Names · string blob
Dense data · one column per sort key · rows × width · doc-ID order
value(d) = data[ dataOff + d·width ]

Re-rank a materialized candidate set: fetch values, then top-K in the browser
Fetch each candidate's value(d): offsets sorted & coalesced into a few spans → one concurrent wave, mirroring the record store.
pub_date column: d3 = 2019 · d7 = 2021 · d12 = 2015 · d18 = 2023
topk(candidates, k, descending): partial-sort client-side
candidates · citation rank: d3 d7 d12 d18
pub_date ↓ re-ranked · newest first: d18 d7 d3 d12
ties keep primary rank → “newest, then most-cited”
A few KB per page; the multi-GB column data is never read whole, only the candidates' cells.

Full second index: a second .rrs reindexed in secondary order + a u32 permutation column (secondary_docid → primary_docid). A result page is a contiguous run, so slice_u32(start, len) maps it back to primary IDs in one ranged read. Records & facets stay keyed by primary ID.
An optional, range-fetchable store of dense values per doc ID — the build-time counterpart of an alternate sort key. A search returns IDs in citation rank; an .rrsc column lets the reader fetch one value per candidate and top-K client-side (sort by date, rating, any metric), ties broken by doc ID. The same container also backs a full second index: a u32 permutation column maps secondary_docid → primary_docid, so a results page — a contiguous run — resolves back to primary IDs in a single ranged read.
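Client-side re-ranking can be sketched as a cell-offset calculation plus a stable sort. The names are hypothetical, values are modeled as `i64`, and a full stable sort stands in for the partial sort the reader is described as using.

```rust
/// Byte offset of doc d's cell in a dense column: dataOff + d * width.
fn cell_offset(data_off: u64, width: u64, doc_id: u64) -> u64 {
    data_off + doc_id * width
}

/// Re-rank candidates by their fetched column value, descending.
/// `candidates` arrive in primary (citation) order as (doc_id, value);
/// the sort is stable, so equal values keep their primary rank.
fn topk_desc(candidates: &[(u32, i64)], k: usize) -> Vec<u32> {
    let mut v = candidates.to_vec();
    v.sort_by(|a, b| b.1.cmp(&a.1)); // slice::sort_by is a stable sort
    v.into_iter().take(k).map(|(id, _)| id).collect()
}
```

Applied to the pub_date example above, candidates d3, d7, d12, d18 with years 2019, 2021, 2015, 2023 come back as d18, d7, d3, d12.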