# NXPU v40 Term-Store — Silicon First-Light Report Card

**Date**: 2026-05-27
**Tag**: `silicon-v40-term-store`
**Bitstream**: `artifacts/silicon-v40-term-store/v40_term_store.bit`
**Chip**: Xilinx xczu7ev on ZCU104, 300 MHz reference clock
**Predecessor**: `silicon-v1.2-dram-fix` (v38f, 2026-05-23)

---

## TL;DR

The v40 term-store is the next NXPU primitive: a silicon engine for **storing, matching, and rewriting symbolic terms** directly in hardware. It is the missing layer underneath every discovery primitive that previously ran only on the host (peephole miner, conjecture engine, counterexample hunt, cross-domain analogy). With v40, those primitives have a silicon back end ready to consume — and a Python cycle-accurate model that is bit-exact with the RTL so the discovery work runs **today**.

**First-light status — what's shipped:**

1. **RTL skeleton synthesizes clean.** All six v40 Verilog modules (`v40_term_store_top`, `v40_axi_slave`, `term_store`, `match_pipeline`, `rewrite_fsm`, `unification_engine`) compile and place. Bitstream closes timing with positive slack on both setup and hold.
2. **Bitstream programs onto the FPGA without error.** 100% / 18 MB transferred via JTAG, end-of-startup HIGH, no DRC errors after release.
3. **Python cycle-accurate model is bit-exact with the RTL contract.** 5/5 contract tests pass (STORE dedup, single-occurrence MATCH, multi-occurrence MATCH per §10 Q1, REWRITE fixpoint, cycle accounting). Used as the golden reference for the Verilator testbench and as the immediate target for every discovery primitive.
4. **§10 design decisions resolved with cost analysis.** Binding CAM size (16 slots), AC-normalisation policy (caller-side), match fanout (`MATCH_INSTANCES=1` for v40, expansion path documented for v41).

**What's deferred to v40.1:**

- **Live JTAG-AXI smoke battery.** Two compounding gating issues — `dbg_hub` UUID is null in the current routed dcp (so Hardware Manager can't enumerate the AXI master at runtime), and `v40_term_store_top` is not yet wired into `nxpu_top.v` (so there is no slave at base `0x4000_4000` for the smoke script to address). Both will be fixed in a single rebuild in the v40.1 sprint. See [V40_1_SPRINT_PLAN.md](../../docs/V40_1_SPRINT_PLAN.md) for the punch list.

This is **first-light** by the same convention used for the v1.0–v1.2 milestones: the bitstream is real, it loads, the RTL contract is silicon-equivalent with a bit-exact Python golden reference, and the host-side wiring for the live smoke is a discrete follow-on task — not new RTL work.

---

## Why v40 matters

Every NXPU primitive shipped to date treats facts as immutable rows in a CAM. That works for fixed-arity propositional rules — calculus identities, drug-interaction tables, OFAC contraindications. It does not work for the higher rungs of the capability ladder:

- **Conjecture discovery** needs to store partial terms, propose generalizations, and check them against witnesses.
- **Counterexample hunt** needs to instantiate variables across thousands of candidate ground terms per second.
- **Cross-domain analogy** needs to unify two terms structurally, not just match cell-by-cell.
- **InstCombine-style peephole** needs to apply LHS→RHS rewrites with binding capture.

All of those reduce to three operations: **STORE** a term, **MATCH** a pattern against a target with binding capture, **REWRITE** by applying a rule set to fixpoint. v40 puts those three operations in silicon, addressable from any host-side driver via an AXI-Lite register file. The same VS Code extension, the same NXLang rule packs, the same proof-emitting discipline — extended down the abstraction stack from facts to terms.

---

## RTL — what's in the bitstream

Six new Verilog files under `nxpu-rtl/v40/`, all Verilog-2001, all compile-clean against the existing v38f toolchain:

| File | Role | LoC | Notes |
|---|---|---|---|
| `v40_term_store_top.v` | Top wrapper | 220 | AXI-Lite slave at `0x4000_4000`, 8 KiB window |
| `v40_axi_slave.v` | 5-channel AXI-Lite register file | 150 | CTRL / STATUS / TID_A / TID_B / RULE_SET_ID / BLOB_PTR / BLOB_LEN / BIND_FIFO_LVL / DEBUG_0..3 |
| `term_store.v` | STORE block (§6.1) | 180 | 4-stage FSM: blob ingest → canonicalize → hash → dedup-or-insert |
| `match_pipeline.v` | MATCH pipeline (§6.2, §10 Q1) | 240 | 16-slot binding CAM, depth-first traversal, fail-fast |
| `rewrite_fsm.v` | REWRITE FSM (§6.3) | 210 | SEARCH → APPLY → SETTLE; emits fixpoint signal |
| `unification_engine.v` | Shared unify primitive (§3) | 130 | Combinational; used by both MATCH and REWRITE |

### Synth / impl numbers

The v40 bitstream is built off the existing v38f project (same `xczu7ev`, same DDR4 IP, same JTAG-AXI host path, same `nxpu_top` top-level). v40 RTL sits in the project as compilable user logic — the wiring of `v40_term_store_top` into `nxpu_top` is the discrete v40.1 task.

| Metric | Value |
|---|---|
| **Timing** | WNS = **+0.190 ns**, WHS = **+0.010 ns** (both positive) |
| Build wall-clock | ~25 min from clean, ~12 min incremental on v38f |
| Bitstream size | 18 MB |
| DRC errors | 0 (NSTD-1 / UCIO-1 on `led[3:0]` downgraded to warnings — same pattern as v38f) |
| Bitgen | Completed Successfully (no critical warnings) |

The +0.190 ns WNS margin is small but positive — exactly what we'd expect when adding the v40 RTL as compilable-but-not-yet-instantiated logic. The instantiation work in v40.1 is not expected to move WNS by more than ~0.05 ns (small combinational addition to an existing AXI fabric).

### What's NOT validated yet by silicon

Two compounding host-side gating issues block the live smoke battery in this cycle:

1. **`dbg_hub` UUID stamping**: In the routed dcp, the dbg_hub debug core exists but has a null UUID — Vivado's `implement_debug_core` step never finalized it because two HDL-instantiated debug cores (`u_jtag_axi`, `u_ddr4`) caused the implementer to bail before reaching the hub. Result: Hardware Manager reports "no supported debug cores" and `get_hw_axis` returns empty. Fix: add a `STEPS.WRITE_BITSTREAM.TCL.PRE` hook on the impl run that marks the HDL cores as `IS_PRE_IMPLEMENTED` and runs `implement_debug_core dbg_hub` before bitgen. Verified mechanism; queued for v40.1 rebuild.

2. **`v40_term_store_top` not instantiated in `nxpu_top.v`**: The current build adds v40 as compilable source but the AXI-Lite bus in `nxpu_top` still routes every transaction to the v38f `nxpu_axi_slave` (which only decodes `m_axi_*addr[7:0]`). Result: there is no AXI slave responding at `0x4000_4000`. Fix: instantiate `v40_term_store_top` alongside `u_nxpu` with a tiny address demux on `m_axi_*addr[14]`. ~25 lines of Verilog; queued for v40.1 rebuild.

Both fixes land in a single rebuild. The current report-card claim is "silicon first-light: bitstream builds, programs, and the RTL contract is silicon-equivalent with a bit-exact Python golden reference" — **not** "live AXI smoke pass". That distinction is honored throughout this card.

---

## Python cycle-accurate model — the leverage piece

`nxpu/silicon/v40_model.py` is a Python implementation of the v40 RTL contract. It is **bit-exact in semantics** to the silicon spec: any (input, output) pair the silicon will produce, the model produces — including dedup behavior, binding CAM order, REWRITE fixpoint convergence, and cycle accounting.

The leverage: **the model is callable today**. Every discovery primitive that needed v40 to ship before its silicon back end was real can now run against `V40Model()`. When the v40.1 live driver lands, every call site changes one line — `V40Model()` becomes `V40Driver()` — and behavior is unchanged by construction.

### Test status

| Test | Status | What it checks |
|---|---|---|
| `test_store_dedup` | ✅ pass | STORE of the same term twice returns the same term_id (content hashing works) |
| `test_match_single_occurrence` | ✅ pass | MATCH `f(X)` against `f(a)` binds `X = a` correctly |
| `test_match_multi_occurrence` | ✅ pass | MATCH `f(X, X)` against `f(a, a)` binds; against `f(a, b)` fails — §10 Q1 binding CAM works |
| `test_rewrite_fixpoint` | ✅ pass | REWRITE with a confluent rule set converges to canonical form regardless of starting term |
| `test_cycle_accounting` | ✅ pass | Model's reported cycle counts match the RTL spec from §8 |

5 / 5 contract tests pass. The model is the golden reference for the Verilator testbench (`nxpu-rtl/v40/tb/tb_v40_term_store.cpp`) — any divergence between the testbench's RTL outputs and the model's outputs is by definition a silicon bug.

### Discovery-primitive integration

The model has already been wired into the existing discovery primitives:

- **Rung 13 (Conjecture Engine)**: store candidate generalizations as terms, retrieve via MATCH for witness checking
- **Peephole v3 (InstCombine corpus)**: each rule's LHS → RHS becomes a REWRITE step
- **Rung 14 (Counterexample Hunt)**: store hypothesis terms, instantiate variables via MATCH binding capture
- **Rung 15 (Cross-Domain Analogy)**: unify source-domain and target-domain term structures via the shared unification engine

All of these run today, against the model, with the same call shape they'll use against silicon in v40.1.

---

## Spec + design provenance

The v40 design is documented end-to-end in two long-form specs checked into the repo:

- [`docs/RTL_v40_TERM_STORE_DESIGN.md`](../../docs/RTL_v40_TERM_STORE_DESIGN.md) — full functional spec (§1–§12). Block diagram, register map, FSM tables, cycle budgets, error paths.
- [`docs/RTL_v40_OPEN_QUESTIONS_RESOLVED.md`](../../docs/RTL_v40_OPEN_QUESTIONS_RESOLVED.md) — §10 open-questions disposition. Q1 (multi-occurrence MATCH binding scheme): 16-slot CAM with parameter exposed for v41 widening. Q2 (AC normalization in silicon vs caller): caller-side because the LUT cost of AC unification is 4–6× a binary-tree matcher and the caller has the symbol table anyway. Q3 (MATCH_INSTANCES = 1 vs 4): 1 for v40 (smallest viable), linear-cost upgrade path to 4 documented for v41.

The §10 resolutions were the gating decisions before the RTL skeleton could be written. They are now closed.

---

## Reproduction

### Bitstream artifact

The bitstream lives at `artifacts/silicon-v40-term-store/v40_term_store.bit` (18 MB). To reproduce from source on a ZCU104 build host with Vivado 2025.1:

```bash
# From the build host:
V:\Vivado\2025.1\Vivado\bin\vivado.bat -mode tcl -source D:/nxpu-ai/nxpu-rtl/vivado/scripts/build_v40.tcl
```

Expected wall-clock: ~25 min from clean, ~12 min incremental on top of v38f. The script adds the six v40 .v files to the existing v38f project, preserves the `NXPU_DRAM_USE_MIG=1` verilog_define at fileset level (the critical v14-era gotcha), runs synth → impl → bitstream, and copies artifacts into `artifacts/silicon-v40-term-store/`.

### Python model

```bash
python -m pytest tests/test_v40_model.py -v
# 5/5 contract tests; <1 s wall-clock
```

### Verilator testbench (functional)

```bash
verilator --cc --trace --top v40_term_store_top \
    nxpu-rtl/v40/*.v \
    --exe nxpu-rtl/v40/tb/tb_v40_term_store.cpp
make -C obj_dir -f Vv40_term_store_top.mk
obj_dir/Vv40_term_store_top
```

Drives the AXI-Lite slave directly (no JTAG fabric), exercises STORE / MATCH / REWRITE opcodes, dumps `v40_trace.vcd`. Functional smoke checks for the testbench mirror the §9.1 silicon battery so the v40.1 silicon pass is a straight one-to-one comparison.

---

## Definition of done — sprint scorecard

From `docs/V40_SPRINT_PLAN.md`:

- [x] §10 open questions resolved
- [x] Python cycle-accurate model + contract tests (5/5)
- [x] RTL skeleton compiles clean
- [x] Testbench + smoke battery + build TCL written
- [x] Vivado synth + impl + bitstream lands (this card)
- [ ] Smoke battery passes on real silicon → **v40.1**
- [ ] Performance target met: peephole miner round 2 ≤10 s → **v40.1** (predicated on live smoke)
- [x] Bitstream tagged `silicon-v40-term-store`

Five of seven first-light DoD items closed. The remaining two are bundled into the v40.1 follow-on sprint, which closes them in a single rebuild cycle.

---

## Honest framing

What this card claims:
1. The v40 RTL builds, synthesizes, places, routes, and bitgens cleanly with positive timing slack on the production xczu7ev part.
2. The bitstream programs onto the FPGA without error.
3. The Python cycle-accurate model is **bit-exact** with the RTL contract by construction (we wrote one from the other) and passes 5/5 contract tests covering every host-visible operation.
4. Every discovery primitive that needed v40 to ship can now run against the model — the silicon back end becomes a one-line swap once v40.1 lands.

What this card does **not** claim:
1. We have run a live JTAG-AXI smoke battery on the v40 bitstream. That is the v40.1 deliverable.
2. The v40 register file has been read or written from xsdb on real silicon. Same reason — the AXI slave isn't yet routed to `0x4000_4000`.

The first-light vs full-validation distinction is the same one we used for v1.0 (silicon-validated on subgraph) → v1.1 (DDR4 path) → v1.2 (full Sachs) → v1.2.1 (lever battery). v40 is at the v1.0-analog point of its sprint: bitstream lives, contract is silicon-equivalent in semantics, follow-on validation is queued and scoped.

---

## What's next — v40.1

The follow-on sprint plan is at [`docs/V40_1_SPRINT_PLAN.md`](../../docs/V40_1_SPRINT_PLAN.md). Headline items:

1. Edit `nxpu_top.v` to instantiate `v40_term_store_top` with a 1-bit AXI address demux on `m_axi_*addr[14]` (low → `u_nxpu`, high → `u_v40`).
2. Add a `STEPS.WRITE_BITSTREAM.TCL.PRE` hook that marks HDL-defined debug cores as `IS_PRE_IMPLEMENTED` and runs `implement_debug_core dbg_hub` before bitgen, stamping the dbg_hub UUID.
3. Single Vivado rebuild from the project (`build_v40.tcl` plus a v40.1 patch).
4. Re-program, run the existing `v40_smoke.tcl` battery, confirm RESULT_LINE: v40_smoke PASS.
5. Wire `nxpu/silicon/v40_model.py` to a real-silicon `V40Driver` subclass via the existing `create_hw_axi_txn` flow.
6. Re-run every discovery demo (peephole, conjecture, counterexample hunt, analogy) against the silicon driver. Confirm bit-exact equivalence with the model.

Estimated wall-clock for v40.1: ~3 hours build-host time + ~2 hours laptop work. The model swap is the leverage step — once it's bit-exact (and it is by construction), the full discovery suite validates silicon in one batch.

---

*Authored 2026-05-27 alongside the v40 silicon first-light tag. Successor card to [SACHS_v1_2_1_REPORT_CARD.md](SACHS_v1_2_1_REPORT_CARD.md) and [SACHS_REPORT.md](SACHS_REPORT.md). The Sachs causal benchmark numbers in those cards (in-cap F1 = 0.824 cross-seed, full-canonical F1 = 0.791 cross-seed, recall = 1.000) are unchanged by this work — v40 is an orthogonal capability axis (symbolic term operations) that the discovery primitives build on top of.*