BMI3241 · File Organization · Term Project

CardCatalog

Bitmap-indexed library catalog engine for million-scale query performance.

CardCatalog analyzes the dataset first, then builds a compact columnar storage layer and dense bitmap inverted indexes to answer title, author, and category queries in microseconds.

0
records indexed
0ms
load + index time
0MB
resident memory
~0ns
point lookup
0µs
compound query
0µs
prefix query
Data flow · books_dataset.txt → query engine record code set bit query
scroll to watch one record travel the whole engine, step by step ↓ follow the data journey
Step 1 · Parse

Read the line, end-first

41 ms
load + index
scroll

The dataset chose the architecture.

We measured books_dataset.txt before designing anything. Two properties — very low field cardinality with large result sets, and dense sequential identifiers — determined every decision that follows.

78 / 59 / 15
Low cardinality

Only 78 distinct titles, 59 authors and 15 categories span all 1M rows — each category alone covers ~64–65K records.

1 … 1M
Dense sequential IDs

Identifiers run 1…N with no gaps, so a record's position is just offset = id − 1. The id is never stored.

~66K
Large result sets

A single field value matches tens of thousands of records — the canonical case for dense bitmaps over sparse posting lists.

Instead of a generic record-by-record layout, CardCatalog exploits the measured structure of the dataset. Low-cardinality fields become compact single-byte codes, and each field value receives a dense bitmap over record positions — turning every count into a cached number and every intersection into a word-wise AND.

Five layers, one direction of flow.

From a memory-mapped source file down through dictionary-encoded columns and a dense bitmap index to the query layer. Scroll to activate each layer.

LAYER 01Source Fileon disk
  • books_dataset.txt — one million rows
  • Robust comma-aware parsing anchored from the end of the line
  • Resolves the 25,602 titles that contain unquoted commas
LAYER 02Loading Pipelinemmap · threads
  • Memory-mapped read-only file, advised for sequential access
  • Two-pass parallel parsing over disjoint ranges
  • Thread-local value sets merged into deterministic dictionaries
LAYER 03Storage~5 MB columns
  • Struct-of-arrays (columnar) layout — cache-efficient scans
  • title_code author_code category_code as byte arrays
  • year as 16-bit values · id implicit, not stored
LAYER 04Index~152 bitmaps
  • Dense bitmap inverted index — one bitmap per value
  • Word-wise AND / OR / AND-NOT, vectorised to AVX2
  • Cached cardinalities · POPCNT for counts
LAYER 05Query Layerµs latency
  • get_by_id · exact field query · count
  • Compound query · prefix query (dictionary range + OR)
  • Full CRUD with tombstone deletion

Watch the bitmaps resolve a query.

A simulated view of the index — not a live backend. Pick an operation and watch the 64-bit blocks combine into a result vector.

Query builder

Exact query: value → bitmap matches · µs
Note · Displayed query timings represent index-level bitmap operations and counting — not printing hundreds of thousands of records to the screen.

Milliseconds required. Microseconds achieved.

Median of 200 repetitions on the release build (-O3 -march=native -flto -pthread), measured at the three assignment scales.

Load time & resident memory
10K → 100K → 1M · both grow roughly linearly
Load time (ms) Resident memory (MB)
Query latency (log scale)
point access flat · set-returning queries scale with the result
get_by_id query_category compound prefix "The"
0 µs
get_by_id @ 1M
0 µs
count (cached) @ 1M
0 µs
query_category @ 1M
0 µs
compound (AND) @ 1M
Target
< 1 ms
Achieved (compound)
8.8 µs
Faster than target by
113×

Mutable, without losing the speed.

Every write keeps the columns, the bitmaps and the cached cardinalities consistent in constant time. Deletion never moves data.

CREATE

Append

Append at the next offset, encode each field, set three bitmap bits, grow the columns.

amortised O(1)
READ

Lookup

Compute offset = id − 1, check the tombstone bit, decode the byte codes.

constant O(1)
UPDATE

Re-point

Clear the old bitmap bit, set the new bitmap bit, adjust the code and the cardinality.

constant O(1)
DELETE

Tombstone

Set one bit in the shared deleted bitmap and keep the data in place; queries AND-NOT it out.

constant O(1)

Binary snapshot

Restart the system without re-parsing the source. The snapshot stores coded columns, the year column, dictionaries and the deleted bitmap; the bitmaps are rebuilt in memory on load.

0 ms
save
0 ms
load

The bugs that taught us the system.

A debugging story. Each fix sharpened a correctness or measurement guarantee.

Measured data. Purpose-built index. Microsecond queries.

CardCatalog demonstrates how file organization, indexing, and memory-layout decisions can transform a million-record dataset into a compact, high-speed query engine.