How It Works

The pipeline that turns your code into searchable, call-graph-aware semantic memory.

Both editions of Clean run the same pipeline. The difference is only where it runs — on your laptop (local) or on Clean's servers (cloud).

1. Parse

Clean walks the repo and parses every source file with tree-sitter, extracting each function, class, and method as a discrete entity. Supported languages are Python, JavaScript, and TypeScript. Each entity keeps its name, kind, file path, exact line range, and structural metadata (class name, decorators, whether it's exported).
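
To make the shape of a parsed entity concrete, here is a minimal sketch. It uses Python's standard-library ast module as a stand-in for tree-sitter and handles Python source only. The Entity class and extract_entities function are hypothetical names chosen for illustration, not Clean's actual API, and the "exported" flag is not modelled.

```python
import ast
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entity:
    name: str
    kind: str                      # "function", "class", or "method"
    file_path: str
    line_start: int
    line_end: int
    class_name: Optional[str] = None
    decorators: List[str] = field(default_factory=list)


def extract_entities(source, file_path):
    """Collect every class, function, and method in one Python source string."""
    entities = []

    def make(node, kind, class_name):
        return Entity(
            name=node.name,
            kind=kind,
            file_path=file_path,
            line_start=node.lineno,
            line_end=node.end_lineno,
            class_name=class_name,
            decorators=[ast.unparse(d) for d in node.decorator_list],
        )

    def visit(node, class_name=None):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                entities.append(make(child, "class", None))
                visit(child, class_name=child.name)
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                kind = "method" if class_name else "function"
                entities.append(make(child, kind, class_name))
                # Functions nested inside this one are plain functions.
                visit(child, class_name=None)
            else:
                visit(child, class_name)

    visit(ast.parse(source))
    return entities
```

Each Entity carries enough information (path plus line range) to point back at the exact span of source it came from.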

2. Build the call graph

As it parses, Clean extracts call relationships between entities — who calls whom. These edges (calls and called_by) are stored alongside each entity, so a search result can carry its neighbours without any extra lookups later. See Call graph context.
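
The calls and called_by edges can be illustrated with a small sketch, again using Python's ast module in place of tree-sitter. The build_call_graph function is a hypothetical helper, and it resolves callees by bare name only, which is a deliberate simplification; how Clean resolves calls is not described here.

```python
import ast


def build_call_graph(source):
    """Return (calls, called_by): dicts mapping a function name to a set of names."""
    tree = ast.parse(source)
    funcs = [n for n in ast.walk(tree)
             if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef))]
    known = {f.name for f in funcs}
    calls = {name: set() for name in known}
    called_by = {name: set() for name in known}

    for fn in funcs:
        for node in ast.walk(fn):
            if not isinstance(node, ast.Call):
                continue
            target = node.func
            if isinstance(target, ast.Name):         # foo()
                callee = target.id
            elif isinstance(target, ast.Attribute):  # obj.foo()
                callee = target.attr
            else:
                continue
            # Keep only edges to functions defined in this source.
            if callee in known:
                calls[fn.name].add(callee)
                called_by[callee].add(fn.name)
    return calls, called_by
```

Storing both directions means a result for one function can list its callers and callees without walking the graph again at query time.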

3. Embed

Each entity is embedded into a 384-dimensional vector with a local sentence-transformer model (all-MiniLM-L6-v2 by default). This is what makes search semantic: two functions that do similar things land near each other in vector space, even if they share no keywords.
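
The idea of "near each other in vector space" can be shown with cosine similarity, one common way to compare embeddings (the metric Clean uses is not stated here). The sketch below uses made-up 4-dimensional vectors in place of real 384-dimensional model output; the variable names are purely illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy vectors (assumed values, not real embeddings).
read_json = [0.9, 0.1, 0.0, 0.2]
load_config = [0.8, 0.2, 0.1, 0.3]
send_email = [0.0, 0.9, 0.4, 0.0]

similar = cosine_similarity(read_json, load_config)
different = cosine_similarity(read_json, send_email)
```

With these toy values, the two file-loading functions score far higher against each other than either does against the email function, even though their names share no words.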

4. Store

Vectors and metadata go into LanceDB. Crucially, LanceDB doesn't store bare vectors — each row holds the embedding next to the entity's file_path, line_start/line_end, name, full code, and call-graph edges. So a single nearest-neighbour query returns everything needed to locate the code; there's never a second "now go find it on disk" step.
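
A rough sketch of why co-locating the vector and the metadata matters: the nearest-neighbour lookup returns whole rows, so the caller already has the path, line range, and code. This is an in-memory illustration, not LanceDB code. The Row class and nearest function are hypothetical, the field names follow the ones listed above, and squared Euclidean distance is an assumed metric.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Row:
    vector: List[float]
    name: str
    file_path: str
    line_start: int
    line_end: int
    code: str
    calls: List[str]
    called_by: List[str]


def nearest(rows, query_vector, k=1):
    """Return the k rows closest to query_vector (squared Euclidean distance)."""
    def sq_dist(row):
        return sum((a - b) ** 2 for a, b in zip(row.vector, query_vector))
    return sorted(rows, key=sq_dist)[:k]


rows = [
    Row([0.9, 0.1], "parse_config", "app/config.py", 10, 24,
        "def parse_config(path):\n    return {}", ["read_file"], ["main"]),
    Row([0.1, 0.9], "send_email", "app/mail.py", 5, 19,
        "def send_email(to):\n    return True", [], ["notify"]),
]

hit = nearest(rows, [0.8, 0.2])[0]
location = f"{hit.file_path}:{hit.line_start}-{hit.line_end}"
```

Here location is built directly from the returned row; nothing has to be looked up on disk afterwards.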

5. Search

When your agent calls search_code, the query is embedded once and matched against the stored entity vectors by similarity. If the query contains identifier-shaped tokens, Clean blends in direct name and path matches (hybrid retrieval) so exact names rank correctly. The top result is expanded with its call-graph neighbourhood, and everything is returned as a compact tiered summary.
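
One way hybrid retrieval of this kind could work is sketched below. The rule for what counts as "identifier-shaped" (snake_case or camelCase), the boost weight, and the function names are all assumptions made for illustration; Clean's actual scoring is not documented here.

```python
import re


def identifier_tokens(query):
    """Return tokens that look like code identifiers (snake_case or camelCase)."""
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", query)
    return [t for t in tokens if "_" in t or re.search(r"[a-z][A-Z]", t)]


def hybrid_score(query, name, file_path, semantic_score, boost=0.3):
    """Add a lexical bonus to the semantic score when identifiers match."""
    lexical = 0.0
    for token in identifier_tokens(query):
        token = token.lower()
        if token == name.lower():
            lexical = max(lexical, 1.0)      # exact name match
        elif token in name.lower() or token in file_path.lower():
            lexical = max(lexical, 0.5)      # partial name or path match
    return semantic_score + boost * lexical


query = "where is parse_config called"
exact = hybrid_score(query, "parse_config", "app/config.py", 0.5)
unrelated = hybrid_score(query, "send_email", "app/mail.py", 0.6)
```

In this example the exact-name match outranks a result with a higher raw semantic score, which is the behaviour the blend is meant to produce. A query with no identifier-shaped tokens leaves the semantic ranking untouched.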

6. Expand on demand

Search responses deliberately withhold full source. Your agent pulls complete code only for the results it actually needs, via expand_result or get_source — keeping context-window usage low.
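
The two-step pattern can be sketched as a summary call that omits code and a separate call that returns it. The functions below are hypothetical stand-ins written for this example; they do not reproduce the real signatures of expand_result or get_source, and the stored entity is invented.

```python
# Hypothetical in-memory index keyed by result id.
ENTITIES = {
    "e1": {
        "name": "parse_config",
        "file_path": "app/config.py",
        "line_start": 10,
        "line_end": 24,
        "code": "def parse_config(path):\n    return {}",
    },
}


def summarize_result(entity_id):
    """Compact summary: identity and location only, no source text."""
    e = ENTITIES[entity_id]
    return {
        "id": entity_id,
        "name": e["name"],
        "location": f'{e["file_path"]}:{e["line_start"]}-{e["line_end"]}',
    }


def expand_result_sketch(entity_id):
    """Full source, fetched only when the agent asks for it."""
    return ENTITIES[entity_id]["code"]


summary = summarize_result("e1")
source = expand_result_sketch(summary["id"])
```

The agent reads the summary first and spends context on the full source only for results it decides to open.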

7. Stay fresh

On the local edition, Clean detects when an index is stale (the repo changed) and re-indexes only the changed files in the background, using git diffs or content hashes. Your search returns immediately against the current index; fresh vectors are ready next time. See incremental indexing.
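
Content-hash change detection, one of the two mechanisms mentioned above, can be sketched as follows. The changed_files helper and the choice of SHA-256 are assumptions for illustration; the hash function Clean uses is not specified here.

```python
import hashlib


def content_hash(data):
    """Hex digest of a file's bytes."""
    return hashlib.sha256(data).hexdigest()


def changed_files(indexed_hashes, current_files):
    """Compare stored hashes with current contents.

    indexed_hashes: dict of path -> hash recorded at the last index.
    current_files:  dict of path -> current file bytes.
    Returns (to_reindex, to_remove) as sorted lists of paths.
    """
    to_reindex = sorted(
        path for path, data in current_files.items()
        if indexed_hashes.get(path) != content_hash(data)
    )
    to_remove = sorted(p for p in indexed_hashes if p not in current_files)
    return to_reindex, to_remove


indexed = {
    "a.py": content_hash(b"def a(): pass\n"),
    "b.py": content_hash(b"def b(): pass\n"),
    "old.py": content_hash(b"x = 1\n"),
}
current = {
    "a.py": b"def a(): pass\n",        # unchanged
    "b.py": b"def b(): return 1\n",    # edited
    "new.py": b"def n(): pass\n",      # added
}
to_reindex, to_remove = changed_files(indexed, current)
```

Only the edited and newly added files are re-parsed and re-embedded; unchanged files keep their existing vectors.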

In one line

Parse with tree-sitter → build a call graph → embed every entity → store vectors + location together in LanceDB → answer semantic queries with tiered, call-graph-aware results → expand only what's needed.
