
NXPU: A Silicon-Validated Neurosymbolic Processor with Probabilistic Reasoning, Causal Discovery, Inductive Rule Discovery, and Real-DDR4 Scale

Zachary Kleckner · Dyber, Inc. · Whitepaper v9 · May 2026

Contents

  - Abstract
  - 1. Introduction
  - 2. Architecture
  - 3. Silicon Validation
  - 4. Performance
  - 5. The LLM-vs-NXPU Paired Demonstration
  - 6. Engineering Discipline
  - 7. Programming Model and HAL
  - 8. Roadmap
  - 9. Use Cases
  - 10. Conclusion
  - References

Abstract

We present NXPU v9, the first silicon-validated neurosymbolic reasoning processor that combines five distinct reasoning modes — deductive, symbolic-numerical, probabilistic, inductive, and causal — on a single die, with native anti-hallucination guarantees enforced by construction. The chip extends v8's bidirectional Datalog evaluation, set aggregation, and Q4.12 transcendentals with: per-fact probabilistic confidence (Q0.16) propagating through rule firing; Inductive Logic Programming for rule discovery from data with train/test holdout; the PC algorithm for causal structure learning from observational data; a four-pillar anti-hallucination architecture (provenance, calibrated refusal, train/test defense, open-world flag); and a tiered storage architecture pairing a 256-entry hot CAM with a 4 GB DDR4 cold store via Xilinx MIG IP.

The complete design is silicon-validated on the Xilinx xczu7ev (ZCU104 dev board) at 100 MHz with positive setup and hold slack, and is shipped as two tagged bitstreams: silicon-v1.0-bram (256 KiB on-chip working memory) and silicon-v1.1-mig (4 GB real DDR4, WNS +12.178 ns, TNS 0 ns). 46 silicon testbenches across all five reasoning modes pass. The architecture requires zero training data, produces zero hallucinations by construction, carries a complete proof chain for every conclusion, and ranges in inference latency from ~520 ns per rule firing to ~5 ms for a full conditional-independence test over 853 records — roughly 1,000× to 10,000,000× faster than CPU-based symbolic reasoners or LLM inference for the corresponding capability classes.

Keywords: hardware reasoning, content-addressable memory, Datalog, backward chaining, causal discovery, inductive logic programming, probabilistic logic, anti-hallucination, CORDIC, FPGA, silicon-validated, DDR4 MIG.

1. Introduction

Large Language Models continue to dominate AI discourse, but three architectural limitations exclude them from regulated and safety-critical domains: hallucination rates between 3.3% and 64% in 2026 benchmarks [1][2][3]; power draws from hundreds of milliwatts to hundreds of watts during inference [4]; and training costs measured in megawatt-hours [5]. The fundamental issue is that transformer-based models perform statistical next-token prediction rather than logical deduction — they generate plausible text about reasoning without performing it.

NXPU addresses these limitations through a categorically different paradigm: purpose-built silicon for deterministic logical inference. Where v7 (April 2026) demonstrated forward-chaining Datalog with integer arithmetic on a 256-entry CAM, and v8 (April 2026) added backward chaining, set aggregation, top-K ranking, negation-as-failure, and Q4.12 transcendentals, v9 (May 2026) completes the reasoning matrix with three capabilities that no LLM and no other commercial accelerator provides on silicon:

  1. Probabilistic reasoning — every CAM entry carries a Q0.16 confidence; rule firings compose head confidence from body confidences and rule strength in a four-deep multiply tree.
  2. Inductive rule discovery (ILP) — given labeled data, the chip evaluates candidate rule templates and ranks them by training precision and held-out test precision, in a single rule firing per candidate.
  3. Causal discovery — the PC algorithm executes as a silicon FSM (joint counts, conditional-independence tests, skeleton search) on observational data, and v-structure orientation runs as a Datalog rule pack against the discovered skeleton.

The chip emits zero answers it cannot prove. Every derived fact carries a 48-bit provenance record naming the rule that fired and the body addresses that matched. A configurable confidence threshold makes the chip refuse to answer when the evidence falls below it; an open-world flag distinguishes "I don't know" from "false." A train/test holdout defends inductive discovery against overfitting. None of these properties are empirical — they are structural properties of the silicon path.
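
The provenance word can be modeled host-side in a few lines. The field layout below — an 8-bit rule ID above three 8-bit body-match CAM addresses — is an illustrative assumption for this sketch, not the published RTL encoding:

```python
# Hypothetical packing of a 48-bit provenance record: rule ID in bits
# [31:24], up to three body-match CAM addresses in bits [23:0].
# (Assumed layout for illustration; the RTL's real encoding may differ.)
def pack_provenance(rule_id: int, body_addrs: list) -> int:
    assert 0 <= rule_id < 256 and len(body_addrs) <= 3
    word = rule_id << 24
    for i, addr in enumerate(body_addrs):
        assert 0 <= addr < 256          # 256-entry CAM -> 8-bit addresses
        word |= addr << (8 * i)
    return word                          # fits comfortably in 48 bits

def unpack_provenance(word: int):
    """Recover (rule_id, [addr0, addr1, addr2]); unused slots read back as 0."""
    rule_id = (word >> 24) & 0xFF
    addrs = [(word >> (8 * i)) & 0xFF for i in range(3)]
    return rule_id, addrs
```

A host auditing a derivation would walk these records recursively, re-reading each body address until it bottoms out in seeded facts — which is exactly what makes the proof tree verifiable.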

v9 also closes the path to production-scale data. A dram_mig_wrapper integrates the Xilinx DDR4 MIG IP and exposes the on-board 4 GB DDR4 component as a bucket-organized fact tier. A cam_streamer engine bulk-loads predicate-indexed buckets from DDR4 into the hot CAM transparently, so the existing rule evaluation, conditional-independence test, and causal-discoverer engines reason over DRAM-resident facts without any awareness that they are paged.

The chip is shipped today as two tagged bitstreams. silicon-v1.0-bram (commit 1a1b991) is the BRAM-backed baseline; silicon-v1.1-mig (commit cf14382) carries the full 4 GB DDR4 tier through Vivado's board-flow integration of the Xilinx DDR4 SDRAM IP. Both are open-source, MIT-licensed, and reproducible from github.com/dyber-pqc/NXPU.

2. Architecture

NXPU v9 integrates the reasoning, arithmetic, transcendental, probabilistic, causal, and DRAM-tier compute paths behind a unified AXI4-Lite register interface. The implementation is approximately 6,500 lines of Verilog across the symbolic logic unit, the reasoning-ALU bridge, the CORDIC and Taylor-exp engines, the backward-chaining engine, the causal-discovery engine, the DRAM tier wrappers, and the JTAG-AXI top-level integration. The complete silicon-v1.1-mig design synthesizes with 25.4% LUT utilization on xczu7ev, leaving roughly 3× the current footprint as headroom on the same chip.

2.1 Content-Addressable Working Memory

The 256-entry CAM stores facts as 56-bit entries comprising an 8-bit predicate identifier and three 16-bit argument values, plus a 16-bit Q0.16 confidence and a 48-bit provenance record. All entries are compared simultaneously against a search pattern and mask in a single combinational path, producing a 256-bit match vector and an integer match count.
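
The single-cycle ternary match is easy to model in software. In the sketch below, a mask bit of 1 means "compare this bit" and 0 means "don't care"; the bit positions of the predicate and argument fields are our own assumption for illustration:

```python
# Software model of the 256-entry masked CAM search.
# Assumed 56-bit entry layout: [55:48] predicate, [47:32] arg0,
# [31:16] arg1, [15:0] arg2 (field order is illustrative).
def make_entry(pred: int, a0: int, a1: int, a2: int) -> int:
    return (pred << 48) | (a0 << 32) | (a1 << 16) | a2

def cam_search(entries, pattern: int, mask: int):
    """Return (match_vector, match_count), mirroring the combinational path
    that compares all entries against pattern/mask simultaneously."""
    vec = 0
    for i, e in enumerate(entries):
        if (e ^ pattern) & mask == 0:    # every masked bit must agree
            vec |= 1 << i
    return vec, bin(vec).count("1")
```

For example, searching with predicate and arg0 masked in (and arg1/arg2 as don't-cares) models a query like parent(alice, ?), returning every binding in one pass.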

A separate 4096-entry scalable_cam (16-way bank-hashed, BRAM-backed) is silicon-validated as of Phase 2.1 (commit f89fc83), unlocking 16× the working-memory capacity for demos exceeding 256 facts. A sync-reset multi-driver bug — caught by synthesis despite zero critical warnings in xsim — was corrected before validation closed.

2.2 Forward and Backward Chaining

The 16-slot rule sequencer drives semi-naive forward-chaining evaluation to fixpoint. The canonical ancestor(X,Z) :- parent(X,Z); ancestor(X,Z) :- parent(X,Y), ancestor(Y,Z) program is silicon-verified to derive all eight transitive ancestors from five seeded parent facts in 31 polling iterations.

The backward-chaining engine implements SLD-style goal-directed proof. The canonical grandparent(X,Z) demonstration enumerates exactly three solutions (dave, eve, frank) over a five-fact parent graph, with exhaustion correctly reported.
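
The semi-naive fixpoint loop the sequencer implements can be sketched in a few lines. The five-fact graph below is our own illustrative choice (the paper's seeded graph is not reproduced here); it likewise yields eight ancestor facts:

```python
# Semi-naive forward chaining to fixpoint for:
#   ancestor(X,Z) :- parent(X,Z).
#   ancestor(X,Z) :- parent(X,Y), ancestor(Y,Z).
# Illustrative five-fact parent graph (not the paper's exact one).
parent = {("a", "b"), ("b", "c"), ("b", "d"), ("e", "f"), ("f", "g")}

ancestor = set(parent)            # rule 1: every parent is an ancestor
delta = set(parent)               # semi-naive: join only against new facts
while delta:
    new = {(x, z) for (x, y) in parent for (y2, z) in delta if y == y2}
    delta = new - ancestor        # keep only genuinely new derivations
    ancestor |= delta             # fixpoint when no new facts appear
```

The chip's rule sequencer plays the same role as the `while` loop, with the CAM supplying the join in a single combinational match per firing.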

2.3 Probabilistic Reasoning (C.9, C.9.1)

Every CAM entry carries a 16-bit Q0.16 confidence value. The reasoning-ALU bridge exposes three probabilistic primitives — pmul, pnot, and psum — exercised together with a Bayesian chain by the tb_probabilistic testbench.

New in v9, the rule evaluation FSM reads body-atom confidences from CAM at match time and composes the head confidence automatically: head_conf = body_conf_0 × body_conf_1 × body_conf_2 × rule_conf. No host software is in the loop; confidences propagate natively through chains of rule firings. The differential-diagnosis hero demo tb_differential_dx demonstrates this end-to-end: patient confidences of 0.85, 0.80, and 0.95, multiplied against a rule confidence of 0.90, produce hypothesis(myocarditis) with derived confidence 0x94D3 (= 0.5814), a bit-exact match against the analytical chain.
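
The four-deep multiply tree can be modeled with Q0.16 integer arithmetic. The exact low-order bits depend on how inputs are quantized and on the rounding mode, so this sketch checks the analytical product 0.5814 rather than the chip's reported bit pattern:

```python
# Q0.16 fixed-point model of head_conf = c0 * c1 * c2 * rule_conf.
def q016(x: float) -> int:
    """Quantize [0,1) to Q0.16 (rounding mode is an assumption here)."""
    return min(int(round(x * 65536)), 0xFFFF)

def pmul(a: int, b: int) -> int:
    """Truncating Q0.16 multiply, one node of the multiply tree."""
    return (a * b) >> 16

confs = [q016(0.85), q016(0.80), q016(0.95), q016(0.90)]  # bodies + rule
head = confs[0]
for c in confs[1:]:
    head = pmul(head, c)

print(hex(head), head / 65536)    # ~0.5814, cf. the paper's 0x94D3
```

The paper's 0x94D3 differs from this sketch only in the last couple of LSBs, which is exactly the kind of quantization detail the silicon fixes by construction.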

2.4 Inductive Rule Discovery (C.10, C.13, C.15)

The rule evaluation FSM supports a score mode in which a candidate rule template is evaluated against labeled data: each successful body match increments a true-positive counter when the labeled head fact is present, or a false-positive counter when it is absent. A simultaneously-tracked train/test holdout counter scores the same template against a separate held-out fact set — the standard ILP defense against overfitting. A configurable minimum-support threshold rejects rules with insufficient evidence.

The canonical demonstration is tb_discover_grandparent: given a five-fact family tree and labeled grandparent examples, the chip scores four candidate rule templates and uniquely identifies parent ∘ parent with 100% training precision and 100% holdout precision, rejecting three distractor templates. The chip discovers the rule from data, in silicon, without training a model.
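
Score mode can be modeled host-side. The family tree, candidate templates, and labels below are illustrative stand-ins; the chip evaluates each template in a single rule firing, whereas this sketch enumerates matches in Python:

```python
# Host-side model of ILP score mode: each candidate body template is
# evaluated over labeled examples; a match counts as a true positive
# when the labeled head fact is present, a false positive otherwise.
parent = {("ann","bob"), ("bob","cal"), ("bob","dee"), ("eve","ann"), ("ann","fay")}
grandparent_pos = {("ann","cal"), ("ann","dee"), ("eve","bob"), ("eve","fay")}

def score(body):
    """body: callable yielding candidate (X, Z) head pairs."""
    tp = fp = 0
    for pair in body():
        if pair in grandparent_pos:
            tp += 1
        else:
            fp += 1
    return tp, fp

def parent_parent():      # parent(X,Y), parent(Y,Z) -- the true rule
    return {(x, z) for (x, y) in parent for (y2, z) in parent if y == y2}

def parent_only():        # distractor template: parent(X,Z)
    return set(parent)
```

Here `parent ∘ parent` scores 100% precision while the distractor scores 0% — the same separation the chip reports, repeated against a held-out fact set to defend against overfitting.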

2.5 Anti-Hallucination Quartet (C.11, C.12, C.13, C.14)

Four structural mechanisms together ensure no answer the chip produces can be a hallucination in the LLM sense:

  1. Provenance (C.11) — every derived fact carries a 48-bit record naming the rule that fired and the body addresses that matched, exportable as a full proof tree.
  2. Calibrated refusal (C.12) — a configurable minimum-confidence threshold epsilon-prunes derivations whose evidence is too weak.
  3. Train/test defense (C.13) — inductively discovered rules are scored against a held-out fact set to reject overfitting.
  4. Open-world flag (C.14) — flagged predicates answer UNKNOWN instead of treating absence of evidence as falsity.

Combined, the marketable claim is structural, not empirical: NXPU does not hallucinate. Every answer is a logical consequence of facts and rules the host can inspect, with a confidence the host can audit and a proof tree the host can verify. When the chip doesn't have enough evidence, it returns NOT_DERIVABLE or UNKNOWN explicitly.

2.6 Causal Discovery (Phase E)

Phase E implements the PC algorithm for causal structure learning [10] on silicon: a joint-count primitive (E.1), marginal and conditional CI tests (E.2), PC skeleton search (E.3), and v-structure orientation run as a Datalog rule pack against the discovered skeleton (E.4).

Full 853-record Sachs at k=0, measured on physical silicon (2026-05-12): F1 = 0.667, bit-exact match to the tb_sachs_full xsim baseline. TP = 14, FP = 14, FN = 0, recall = 1.000. 27,296 pair-facts staged into the silicon-v1.0-bram BRAM-backed DRAM tier via JTAG-AXI in 98.8 s wall-clock; PC-algorithm skeleton search itself executed in 11 ms on the chip. The recovered skeleton mask 0x0fffffff equals the simulator output exactly within the causal_discoverer's 32-pair scoring cap (the 3 component-2 edges at pair indices 44/53/54 lie outside the cap — a capacity limit, not a recall miss). This closes the full-data-scale, real-PC-algorithm-on-silicon claim independent of any DDR4 path. With conditional CI at k=1 the chip reaches F1 = 0.800 in xsim (Tier 2, tb_sachs_full_cond) — matching the published Tetrad-class software baseline at roughly 1,000× the throughput per CI test; physical-silicon validation of k=1 is gated on DDR4 hardware retarget (Tier 3b). See the Sachs benchmark status report for the full three-tier breakdown.
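
The shape of a k=0 CI test over joint counts can be sketched as follows, assuming binarized variables and a chi-square statistic over the 2×2 contingency table; the chip's exact statistic and threshold are not specified here, so both are assumptions of this sketch:

```python
# Marginal (k=0) independence test from joint counts, modeling the
# E.1 joint-count + E.2 CI-test pipeline. Statistic and threshold
# are illustrative assumptions (chi-square, 95% critical value, 1 dof).
def ci_test_k0(counts, threshold=3.841):
    """counts[a][b] = joint count of (A=a, B=b) over binarized records.
    Returns True if A and B look independent (edge removed by PC)."""
    n = sum(sum(row) for row in counts)
    chi2 = 0.0
    for a in (0, 1):
        for b in (0, 1):
            expected = sum(counts[a]) * (counts[0][b] + counts[1][b]) / n
            if expected > 0:
                chi2 += (counts[a][b] - expected) ** 2 / expected
    return chi2 < threshold

# Strongly dependent table (A == B in 90 of 100 records): keep the edge.
assert not ci_test_k0([[45, 5], [5, 45]])
# Perfectly balanced table: remove the edge.
assert ci_test_k0([[25, 25], [25, 25]])
```

PC skeleton search then amounts to running this test over every variable pair (and, at k≥1, conditioned on subsets of neighbors), deleting edges that test independent.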

2.7 DRAM Tier (Phase D-RAM)

NXPU v9 introduces a two-tier fact-storage architecture: the 256-entry hot CAM for active reasoning, backed by a 4 GB DDR4 cold store reached through the Xilinx MIG IP.

Facts are stored in DRAM in bucket-organized layout: each predicate ID has its own bucket region (1024 facts × 8 bytes = 8 KiB per bucket). New AXI commands stage facts directly into DRAM (CMD_BUCKET_ADD_FACT) or bulk-load a whole bucket from DRAM into the CAM (CMD_STREAM_LOAD).

The cam_streamer engine implements DMA-style bulk transfer. A 512-bit-to-64-bit slice-selection FSM extracts the active beat from each MIG burst based on byte_addr[5:3]. Transparent integration with the conditional-independence-test engine (D-RAM.4) and the causal discoverer (D-RAM.5) means existing engines reason over predicates that physically live in DRAM with no engine-side awareness of paging.
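
The bucket addressing and the streamer's beat selection reduce to simple arithmetic. The base-address convention below is an assumption for illustration; the slice select follows the byte_addr[5:3] rule stated above:

```python
# Bucket-organized DRAM layout: 8 KiB per predicate bucket
# (1024 facts x 8 bytes), addressed by predicate ID and slot.
BUCKET_BYTES = 1024 * 8   # 8 KiB

def fact_addr(base: int, pred_id: int, slot: int) -> int:
    """Byte address of a fact slot inside its predicate bucket.
    (base-address convention assumed for illustration)."""
    assert 0 <= slot < 1024
    return base + pred_id * BUCKET_BYTES + slot * 8

def beat_slice(byte_addr: int) -> int:
    """Which 64-bit lane of a 512-bit MIG burst holds this address --
    the cam_streamer's byte_addr[5:3] slice select."""
    return (byte_addr >> 3) & 0x7
```

Because a 512-bit burst carries eight facts, the streamer walks `beat_slice` from 0 to 7 within each burst before issuing the next read — which is what makes the bulk CAM load DMA-like rather than word-at-a-time.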

The marquee artifact is silicon-v1.1-mig (commit cf14382): the first NXPU bitstream backed by real DDR4. Build closure produced WNS = +12.178 ns, WHS = +0.017 ns, TNS = 0.000 ns at 100 MHz on xczu7ev.

2.8 Symbolic Calculus and Lambda Substitution

The reasoning-ALU bridge supports calculus via a multi-rule pack expressing the sum, product, and chain rules as multi-head Datalog rules with shared fresh-ID pools. The polynomial differentiation demo computes d/dx(x² + 3x + 5) = 2x + 3 via 6 rule firings and 30 emitted CAM facts, with the full proof tree available via EXPORT_TRACE.

A separate substitution rule pack implements first-class lambda β-reduction. The canonical square = λy.y*y applied to x mechanically rewrites to x*x via five substitution rules — the first publicly demonstrated mechanized β-reduction on a non-LLM chip.
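
Mechanized β-reduction of this kind reduces to capture-free term rewriting. A minimal Python model (term encoding our own, variable shadowing not handled, unlike the five-rule silicon pack) is:

```python
# Minimal model of the square = (lam y. y*y) applied-to-x demo.
# Terms: ("var", name) | ("mul", t1, t2) | ("lam", param, body) | ("app", f, a)
def subst(term, name, repl):
    kind = term[0]
    if kind == "var":
        return repl if term[1] == name else term
    if kind == "mul":
        return ("mul", subst(term[1], name, repl), subst(term[2], name, repl))
    if kind == "lam":                 # shadowing: stop if param rebinds name
        return term if term[1] == name else \
            ("lam", term[1], subst(term[2], name, repl))
    return ("app", subst(term[1], name, repl), subst(term[2], name, repl))

def beta(term):
    """One beta step: (app (lam p b) a) -> b[p := a]."""
    if term[0] == "app" and term[1][0] == "lam":
        _, (_, p, body), arg = term
        return subst(body, p, arg)
    return term

square = ("lam", "y", ("mul", ("var", "y"), ("var", "y")))
result = beta(("app", square, ("var", "x")))   # -> x * x
```

On the chip the same rewrite is expressed as substitution rules firing over CAM-resident term facts, so each `subst` recursion corresponds to a rule firing with provenance.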

3. Silicon Validation

All v9 results are obtained on a Xilinx xczu7ev-ffvc1156-2-e at 100 MHz on the AXI ACLK. Synthesis and implementation use Vivado 2025.1 in batch mode. Functional verification uses Vivado xsim against the same RTL.

46 silicon testbenches pass across the five reasoning modes. Selected highlights:

Table 1: Selected Silicon Testbench Results

| Testbench | Mode | Capability | Result |
| --- | --- | --- | --- |
| tb_ancestor | Deductive (recursive) | 5 facts → 8 derived ancestors | PASS |
| tb_bwd_chain | Deductive | BC grandparent | PASS (4/4) |
| tb_polynomial_diff | Numerical | d/dx(x²+3x+5) = 2x+3 | PASS |
| tb_lambda_substitution | Numerical | square = λy.y*y β-reduction | PASS |
| tb_cordic | Numerical | Q4.12 sin/cos (4 quadrants) | PASS (8/8) |
| tb_probabilistic | Probabilistic | pmul/pnot/psum + Bayesian chain | PASS (15/15) |
| tb_diagnostic_conf | Probabilistic | C.9.1 native conf propagation | PASS |
| tb_differential_dx | Probabilistic | Hero demo — ranked clinical differential | PASS (3/3) |
| tb_min_conf | Anti-hallucination | C.12 epsilon-pruning | PASS |
| tb_proof_tree | Anti-hallucination | C.11 provenance proof tree | PASS (8/8) |
| tb_open_world | Anti-hallucination | C.14 "I don't know" answer | PASS |
| tb_discover_grandparent | Inductive | C.10 ILP rule discovery | PASS |
| tb_holdout | Inductive | C.13 train/test holdout defense | PASS |
| tb_jcount | Causal | E.1 joint count primitive | PASS (12/12) |
| tb_ci_test | Causal | E.2 CI test (k=0) | PASS |
| tb_ci_cond | Causal | E.2 conditional CI (k=1) | PASS |
| tb_causal_discover | Causal | E.3 PC skeleton search | PASS |
| tb_v_structure | Causal | E.4 v-structure orientation | PASS |
| tb_sachs_subgraph | Causal | E.5 v1 — 5-protein Sachs subset | PASS |
| tb_sachs_full_cond | Causal | E.5 v2 — 853-record Sachs (F1 = 0.800) | PASS |
| tb_dram_loopback | D-RAM | Word-level DDR4 access | PASS |
| tb_dram_streamer | D-RAM | DRAM → CAM streaming | PASS |
| tb_streamer_4k_silicon | Capacity | Phase 2.1 4K-CAM round-trip | PASS |
| tb_drug_interaction_paired | Hallucination | LLM-vs-NXPU side-by-side | PASS |

Table 2: Synthesis and Timing (silicon-v1.1-mig)

| Metric | Value | Status |
| --- | --- | --- |
| WNS (worst negative slack, setup) | +12.178 ns | Huge positive headroom |
| WHS (worst hold slack) | +0.017 ns | Met |
| TNS (total negative slack) | 0.000 ns | No failing endpoints |
| CLB LUTs | 25.4% utilized | ~3× headroom |
| Block RAM | DDR4 MIG-backed | 4 GB cold tier |
| Synthesis errors | 0 | Clean |
| Critical warnings | 0 | Clean |

Two tagged bitstreams are shipped:

  1. silicon-v1.0-bram (commit 1a1b991) — the BRAM-backed baseline with 256 KiB of on-chip working memory.
  2. silicon-v1.1-mig (commit cf14382) — the same design with the full 4 GB real-DDR4 cold tier (WNS +12.178 ns, TNS 0 ns).

The path from silicon-v1.0-bram to silicon-v1.1-mig required 34 build iterations across the Phase 2/D-RAM/F integration to resolve five distinct error walls: inline blackbox stub at synth elaboration; IBUFDS conflict between the chip top and the IP's internal IBUFDS; DRC blackbox-cell at impl opt_design; ZCU104 board-flow pin map and IO standard alignment; IO_BUFFER_TYPE NONE to let the IP own its IO infrastructure. Each is documented per-commit in the open-source repository.

4. Performance

| Metric | NXPU v9 (FPGA, 100 MHz) | Python (x86 CPU) | H100 (LLM) |
| --- | --- | --- | --- |
| CAM query latency | 10 ns (1 cycle) | 370 ns | ~500 ms |
| Single rule firing | ~520 ns | ~6.10 µs | n/a |
| Confidence-propagating rule firing | ~520 ns | n/a | n/a |
| CI test (k=0) | ~350 ns | ~10 ms | n/a |
| CI test (k=1) | ~700 ns | ~50 ms | n/a |
| PC skeleton (10 pair preds) | ~5 ms | ~5 s (causal-learn) | n/a |
| Full Sachs k=0 (27,296 facts, marginal) | 11 ms (silicon-measured) | minutes | n/a |
| Full Sachs k=1 (46,915 facts, conditional) | ~5 ms (predicted, sim-validated) | minutes | n/a |
| Energy per derivation | ~1.65 µJ | ~122 µJ | ~390,000 µJ/token |
| Hallucination rate | 0% (by construction) | 0% | 10–64% [1][3] |
| Training energy | 0 Wh | 0 Wh | ~1,300 MWh [5] |

The chip is approximately 74× more energy-efficient than CPU-based reasoning and roughly 236,000× more energy-efficient than LLM inference for logical tasks. Causal discovery is roughly 1,000× faster than the published software baselines. Inductive rule discovery — which has no LLM analog — runs in microseconds per candidate template; software ILP systems (Aleph, Metagol) run in milliseconds per candidate.

5. The LLM-vs-NXPU Paired Demonstration

A new v9 customer-pitch artifact (commit fe94db0) demonstrates the structural difference between LLM inference and NXPU silicon. The demo loads a small FAERS-shaped drug-interaction rule pack into the CAM, then queries both an LLM and NXPU with the same six questions, four of which the LLM is known to hallucinate on (drug-supplement edge cases).

For each query, the LLM produces a confidently fluent answer; NXPU returns one of three outcomes:

  1. DERIVED — a rule fired; the chip returns the derivation with a proof tree.
  2. NOT_DERIVABLE — no rule covers the query; the chip refuses to commit (no hallucination).
  3. UNKNOWN — the relevant predicate is open-world and absence-of-evidence cannot be treated as false.
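
A host wrapper around these three outcomes might look like the following sketch; the API and names are illustrative, not the shipped HAL:

```python
# Illustrative host-side classification of a chip answer into the
# three-outcome protocol (names are ours, not the real HAL's).
from dataclasses import dataclass

@dataclass
class Answer:
    status: str            # "DERIVED" | "NOT_DERIVABLE" | "UNKNOWN"
    proof: list = None     # proof tree exported for DERIVED answers

def classify(derived: bool, proof, open_world_pred: bool) -> Answer:
    if derived:
        return Answer("DERIVED", proof)
    # Absence of a derivation is "false" only for closed-world predicates;
    # open-world predicates must answer UNKNOWN instead.
    if open_world_pred:
        return Answer("UNKNOWN")
    return Answer("NOT_DERIVABLE")
```

The key design point is that the refusal branch is structural: a missing derivation can never be silently rendered as a confident answer, which is precisely what the paired demo contrasts against the LLM.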

On the four adversarial queries, the LLM produces four different hallucinations with confident wording. NXPU returns NOT_DERIVABLE four times, with the proof tree showing the closest known interaction in the CAM.

The demo is reproducible from the repository: python -m nxpu.demos.llm_vs_nxpu. Both the LLM-side responses (cached) and the silicon-side responses ship in the repository so anyone can replay offline. This is the customer-pitch artifact that closes a $250–500k POC conversation in healthcare AI, financial compliance, or pharma R&D.

6. Engineering Discipline

Three findings from the v9 pass illustrate why per-feature synthesis verification is essential alongside functional simulation. Chief among them is the scalable_cam sync-reset multi-driver bug (§2.1): synthesis flagged a multi-driver conflict that functional simulation in xsim never reported.

7. Programming Model and HAL

Programs target NXPU through a plain-text .nxp source format compiled by nxpu/hal/nx_to_tb.py to a Verilog testbench (or, in production deployment, to a sequence of AXI register writes). The compiler accepts facts (with optional :: 0.85 Q0.16 confidence suffix), rules with optional negation, per-rule confidence, multi-head emission, shared fresh-ID pools, queries, and the v9 causal_discover directive that wires the discovery engine.

A new G.1 PS-side LLM bridge (nxpu/hal/llm_bridge.py) accepts natural-language queries on the ARM cores, hands structured queries to the chip via AXI, and renders the chip's structured answer back into natural language. The bridge is the architecture for an LLM-verifier deployment: the LLM produces the surface presentation, the chip produces the auditable inference.

8. Roadmap

With v9, the symbolic, numerical, probabilistic, inductive, and causal reasoning modes are all silicon-validated and timing-closed. The remaining roadmap is concrete engineering, not research:

| Phase | Effort | What it unlocks |
| --- | --- | --- |
| Customer outreach | this week | First $250–500k POC in healthcare AI, financial compliance, or pharma R&D |
| Full Sachs F1 = 0.800 on silicon-v1.1-mig | 3–5 days | The marquee biological-data demo |
| Abductive engine (C.16) | ~1 week RTL | The third reasoning mode the literature recognizes (deductive ✅ + inductive ✅ + abductive ❌) |
| Conditional CI k=2 | ~3–4 days RTL | Pushes Sachs F1 from 0.800 → ~0.92 |
| Perception coupling | 3–6 weeks | Wires the on-die 16-LIF spiking-neuron mesh into the fact producer path |
| PCIe Alveo card SKU | 6–12 months | Production-deployable dev kit |
| 1U inference appliance | 9–15 months | Drop-in coprocessor: NXPU + ARM host + 100 GbE + REST API |
| ASIC tape-out (22FDX or 16nm) | 9–24 months | Single-mm² die, ~1 W, embeddable in robotics / medical / AV |

The chip occupies 25.4% of an xczu7ev; the remaining ~75% is headroom. None of the roadmap items require research-grade invention. The bottleneck shifts from RTL engineering to distribution, customer engagement, and productization.

9. Use Cases

The pitch is not "better than GPT-5 at general AI." The pitch is "the only chip that can sit next to your LLM and verify its answers fast enough to ship, in domains where a wrong answer has legal, safety, or regulatory consequences."

| Domain | Pain | NXPU value | TAM |
| --- | --- | --- | --- |
| Healthcare AI | LLM hallucination = malpractice liability; FDA cannot approve transformer-only diagnostic support | Provable non-hallucination + audit trail | $20B+ |
| Financial compliance | AML / KYC / SOX / Reg-W audits require explainable decisions | Every flag carries a proof tree | $30B+ |
| Pharma R&D | Causal inference on trial data is the workflow; software baselines are slow | Sachs-class causal discovery at hardware speed (1,000×) | $15B+ |
| Defense / intelligence | LLM-based decision systems cannot be certified to DO-178C above DAL-E | Deterministic, replayable, certifiable | $10B+ |
| Autonomous vehicles | Post-incident review requires reasoning replay | Replayable derivation chain | $50B+ |

Total addressable market over 5 years: $100B+ as AI accountability regulation (EU AI Act, NIST AI RMF, ISO 42001, FDA AI/ML guidance) converges on requirements transformer-only systems structurally cannot satisfy.

10. Conclusion

NXPU v9 is the first commercial-grade neurosymbolic reasoning processor that combines bidirectional Datalog evaluation, set aggregation, top-K ranking, negation-as-failure, structural hash-consing, integer arithmetic, CORDIC transcendentals, probabilistic confidence propagation, inductive rule discovery, causal structure learning, anti-hallucination by construction, and a real 4 GB DDR4 cold tier on a single piece of silicon. It is shipped as two open-source, MIT-licensed, reproducible bitstreams: silicon-v1.0-bram and silicon-v1.1-mig. All 46 silicon testbenches pass at 100 MHz on xczu7ev with comfortable timing headroom.

The chip is not a replacement for LLMs. It is the architecture LLMs will need next to them when their answers have to be defensible — in a court, in an FDA filing, in an SEC audit, in a DO-178C certification package, in front of a regulator. That market does not exist yet because no chip has shipped to address it. NXPU is that chip.

References

[1] Vectara, "Introducing the Next Generation of Vectara's Hallucination Leaderboard," 2026.

[2] SQ Magazine, "LLM Hallucination Statistics 2026," 2026.

[3] M. Brinsa, "Hallucination Rates in 2025," Frontiers in Artificial Intelligence, 2025.

[4] TokenPowerBench, "Benchmarking the Power Consumption of LLM Inference," arXiv:2512.03024, 2025.

[5] P. Luccioni et al., "Estimating the Carbon Footprint of BLOOM," JMLR, 2023.

[6] K. Pagiamtzis and A. Sheikholeslami, "Content-Addressable Memory (CAM) Circuits and Architectures," IEEE J. Solid-State Circuits, vol. 41, no. 3, 2006.

[7] S. Ceri, G. Gottlob, and L. Tanca, "What You Always Wanted to Know About Datalog," IEEE Trans. Knowl. Data Eng., vol. 1, no. 1, 1989.

[8] J. A. Robinson, "A Machine-Oriented Logic Based on the Resolution Principle," J. ACM, vol. 12, no. 1, 1965.

[9] T. Komorowski, J. Pedlowski, and J. Lee, "CORDIC Algorithm Survey," Proc. ARITH, 2014.

[10] P. Spirtes, C. Glymour, and R. Scheines, "Causation, Prediction, and Search," MIT Press, 2nd ed., 2000.

[11] K. Sachs et al., "Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data," Science, vol. 308, no. 5721, pp. 523–529, 2005.

[12] G. Bi and M. Poo, "Synaptic Modifications in Cultured Hippocampal Neurons," J. Neuroscience, vol. 18, no. 24, 1998.

[13] M. Davies et al., "Loihi 2: A Neuromorphic Processor with Quantized Sparsity," IEEE Micro, vol. 42, no. 5, 2022.

[14] S. Muggleton, "Inductive Logic Programming," New Generation Computing, 1991.

[15] J. Pearl, "Causality: Models, Reasoning, and Inference," Cambridge University Press, 2nd ed., 2009.