Will Smidlein's Blog


Pagefind

Pagefind is a fully static search library that aims to perform well on large sites, while using as little of your users’ bandwidth as possible, and without hosting any infrastructure.

Delightful project I accidentally stumbled upon while building this very blog. It pre-computes all the search indexes at build time and then packages them into a gloriously simple frontend. Who knows if I’ll ever post enough that it’s worth using. For the time being, it lives at /search.

It spits out a directory structure like this:

dist/pagefind
├── fragment
│   ├── en_4733ec7.pf_fragment
│   ├── en_5c9a98e.pf_fragment
│   ├── en_7ca223a.pf_fragment
│   └── en_953c689.pf_fragment
├── index
│   └── en_4d96258.pf_index
├── pagefind-entry.json
├── pagefind-highlight.js
├── pagefind-modular-ui.css
├── pagefind-modular-ui.js
├── pagefind-ui.css
├── pagefind-ui.js
├── pagefind.en_f57a1155c8.pf_meta
├── pagefind.js
├── wasm.en.pagefind
└── wasm.unknown.pagefind

The .js, .css, and even the wasm stuff all made sense, but I was curious about the binary blobs in the .pf_fragment, .pf_index, and .pf_meta files.

Weirdly (and somewhat ironically), I could not find any documentation on the actual binary format the indexes are stored in. I poked around a bit before deciding to dig into the source code.

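With the help of Claude, I’ve figured out that they’re using Concise Binary Object Representation (CBOR) via the minicbor Rust crate, and I’ve sort of pieced together the root data structures, which are copied below along with links to the source. One wrinkle: the fragment struct derives Serialize rather than minicbor’s Encode like the other two, so fragments may well be serialized differently from the index and meta files.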
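To get a feel for what CBOR looks like on the wire, here is a minimal sketch of decoding the "head" of a single CBOR item using only the standard library. Every CBOR item starts with one byte whose top three bits are the major type (0 = unsigned int, 3 = text string, 4 = array, and so on) and whose low five bits either hold a small value directly or say how many bytes follow. The names CborHead and read_head are mine, not minicbor's or Pagefind's, and the sketch assumes you already have raw CBOR bytes in hand; whether the files on disk are additionally compressed or wrapped is not covered here.

```rust
/// The "head" of a CBOR data item: its major type and its argument.
/// (Hypothetical type for illustration; not part of minicbor or Pagefind.)
#[derive(Debug, PartialEq)]
pub struct CborHead {
    /// Top three bits of the initial byte, 0..=7.
    pub major: u8,
    /// A value, length, or element count, depending on the major type.
    pub argument: u64,
    /// How many bytes the head itself occupied.
    pub head_len: usize,
}

/// Decode the head of the CBOR item at the start of `bytes`.
/// Returns None for truncated input and for additional-info values 28..=31
/// (reserved and indefinite-length forms), which this sketch does not handle.
pub fn read_head(bytes: &[u8]) -> Option<CborHead> {
    let first = *bytes.first()?;
    let major = first >> 5;
    let info = first & 0x1f;

    // Number of extra big-endian bytes that carry the argument.
    let extra = match info {
        0..=23 => 0,
        24 => 1,
        25 => 2,
        26 => 4,
        27 => 8,
        _ => return None,
    };

    let argument = if extra == 0 {
        // Small arguments live directly in the low five bits.
        u64::from(info)
    } else {
        let tail = bytes.get(1..1 + extra)?;
        tail.iter().fold(0u64, |acc, b| (acc << 8) | u64::from(*b))
    };

    Some(CborHead { major, argument, head_len: 1 + extra })
}
```

For example, the byte 0x83 decodes to major type 4 with argument 3, meaning an array of three items, which is roughly the shape you would expect a struct with three numbered fields to take if it is encoded as an array.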

.pf_fragment

#[derive(Serialize, Debug, Clone)]
pub struct PageFragmentData {
    pub url: String,
    pub content: String,
    pub word_count: usize,
    pub filters: BTreeMap<String, Vec<String>>,
    pub meta: BTreeMap<String, String>,
    pub anchors: Vec<PageAnchorData>,
}

#[derive(Serialize, Debug, Clone)]
pub struct PageAnchorData {
    pub element: String,
    pub id: String,
    pub text: String,
    pub location: u32,
}

Code
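As a rough illustration of what the anchors list could be used for, here is a sketch that finds the nearest anchor at or before a given position on a page, which is the sort of lookup you would need to deep-link a result to a heading. It assumes location is a word offset into content; the struct alone does not confirm that. The struct is re-declared without the serde derive so the snippet compiles on its own, and anchor and anchor_before are hypothetical helpers of mine, not Pagefind functions.

```rust
/// Standalone copy of the anchor struct (serde derive dropped).
#[derive(Debug, Clone)]
pub struct PageAnchorData {
    pub element: String,
    pub id: String,
    pub text: String,
    pub location: u32,
}

/// Hypothetical convenience constructor for the examples.
pub fn anchor(element: &str, id: &str, location: u32) -> PageAnchorData {
    PageAnchorData {
        element: element.to_string(),
        id: id.to_string(),
        text: String::new(),
        location,
    }
}

/// Return the anchor with the greatest `location` that is still at or
/// before `word_pos`.
/// ASSUMPTION: `location` and `word_pos` are both word offsets into the
/// page content.
pub fn anchor_before(anchors: &[PageAnchorData], word_pos: u32) -> Option<&PageAnchorData> {
    anchors
        .iter()
        .filter(|a| a.location <= word_pos)
        .max_by_key(|a| a.location)
}
```

With anchors at positions 0 and 40, a match at position 55 would resolve to the second anchor, and a match at position 10 to the first.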

.pf_index

/// A single word index chunk: `pagefind/index/*.pf_index`
#[derive(Encode)]
pub struct WordIndex {
    #[n(0)]
    pub words: Vec<PackedWord>,
}

/// A single word as an inverse index of all locations on the site
#[derive(Encode, Clone, Debug)]
pub struct PackedWord {
    #[n(0)]
    pub word: String,
    #[n(1)]
    pub pages: Vec<PackedPage>,
}

/// A set of locations on a given page
#[derive(Encode, Clone, Debug)]
pub struct PackedPage {
    #[n(0)]
    pub page_number: usize, // Won't exceed u32 but saves us some into()s
    #[n(1)]
    pub locs: Vec<i32>,
}

Code
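The three structs above form a classic inverted index: word, then pages, then positions within each page. Here is a minimal sketch of querying that shape once it has been decoded into memory. The structs are re-declared without the minicbor derives so the snippet stands alone, and pages_for is a hypothetical helper that does an exact-match linear scan; it is not how Pagefind itself searches, and it only counts the entries in locs without interpreting them.

```rust
/// Standalone copies of the index structs (minicbor derives dropped).
pub struct PackedPage {
    pub page_number: usize,
    pub locs: Vec<i32>,
}

pub struct PackedWord {
    pub word: String,
    pub pages: Vec<PackedPage>,
}

pub struct WordIndex {
    pub words: Vec<PackedWord>,
}

/// For an exact word match, return (page_number, number of locations)
/// for every page the word appears on. Returns an empty Vec when the
/// word is not in this chunk.
pub fn pages_for(index: &WordIndex, term: &str) -> Vec<(usize, usize)> {
    index
        .words
        .iter()
        .find(|w| w.word == term)
        .map(|w| {
            w.pages
                .iter()
                .map(|p| (p.page_number, p.locs.len()))
                .collect()
        })
        .unwrap_or_default()
}
```

The page_number values are presumably indexes into the pages list in the meta file described next, which would be how a hit gets mapped back to a fragment, though I have not verified that.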

.pf_meta

/// All metadata we need to glue together search queries & results
#[derive(Encode, Debug)]
pub struct MetaIndex {
    #[n(0)]
    pub version: String,
    #[n(1)]
    pub pages: Vec<MetaPage>,
    #[n(2)]
    pub index_chunks: Vec<MetaChunk>,
    #[n(3)]
    pub filters: Vec<MetaFilter>,
    #[n(4)]
    pub sorts: Vec<MetaSort>,
}

/// Communicates the pagefind/index/*.pf_index file we need to load
/// when searching for a word that sorts between `from` and `to`
#[derive(Encode, PartialEq, Debug)]
pub struct MetaChunk {
    #[n(0)]
    pub from: String,
    #[n(1)]
    pub to: String,
    #[n(2)]
    pub hash: String,
}

#[derive(Encode, Debug)]
pub struct MetaPage {
    #[n(0)]
    pub hash: String,
    #[n(1)]
    pub word_count: u32,
}

Code
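The doc comment on MetaChunk describes how the frontend can avoid downloading every index file: each chunk covers the words that sort between from and to, so only the chunk covering the search term needs to be fetched. A minimal sketch of that selection follows. The struct is re-declared without the minicbor derive, chunk_for is a hypothetical helper, and the inclusive bounds and plain byte-wise string comparison are my assumptions rather than something confirmed from the Pagefind source.

```rust
/// Standalone copy of the chunk descriptor (minicbor derive dropped).
pub struct MetaChunk {
    pub from: String,
    pub to: String,
    pub hash: String,
}

/// Return the hash of the first chunk whose range contains `word`.
/// ASSUMPTION: both bounds are inclusive and ordering is Rust's default
/// byte-wise `str` comparison.
pub fn chunk_for<'a>(chunks: &'a [MetaChunk], word: &str) -> Option<&'a str> {
    chunks
        .iter()
        .find(|c| c.from.as_str() <= word && word <= c.to.as_str())
        .map(|c| c.hash.as_str())
}
```

The returned hash would then be used to build the file name under pagefind/index/, in the same way en_4d96258.pf_index appears in the directory listing earlier.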