We present NXPU v9, the first silicon-validated neurosymbolic reasoning processor that combines five distinct reasoning modes — deductive, symbolic-numerical, probabilistic, inductive, and causal — on a single die, with native anti-hallucination guarantees enforced by construction. The chip extends v8's bidirectional Datalog evaluation, set aggregation, and Q4.12 transcendentals with: per-fact probabilistic confidence (Q0.16) propagating through rule firing; Inductive Logic Programming for rule discovery from data with train/test holdout; the PC algorithm for causal structure learning from observational data; a four-pillar anti-hallucination architecture (provenance, calibrated refusal, train/test defense, open-world flag); and a tiered storage architecture pairing a 256-entry hot CAM with a 4 GB DDR4 cold store via Xilinx MIG IP.
The complete design is silicon-validated on the Xilinx xczu7ev (ZCU104 dev board) at 100 MHz with positive setup and hold slack, and is shipped as two tagged bitstreams: silicon-v1.0-bram (256 KiB on-chip working memory) and silicon-v1.1-mig (4 GB real DDR4, WNS +12.178 ns, TNS 0 ns). 46 silicon testbenches across all five reasoning modes pass. The architecture requires zero training data, produces zero hallucinations by construction, carries a complete proof chain for every conclusion, and ranges in inference latency from ~520 ns per rule firing to ~5 ms for a full conditional-independence test over 853 records — roughly 1,000× to 10,000,000× faster than CPU-based symbolic reasoners or LLM inference for the corresponding capability classes.
Large Language Models continue to dominate AI discourse, but three architectural limitations exclude them from regulated and safety-critical domains: hallucination rates between 3.3% and 64% in 2026 benchmarks [1][2][3]; per-inference energy costs of hundreds of millijoules to hundreds of joules [4]; and training costs measured in megawatt-hours [5]. The fundamental issue is that transformer-based models perform statistical next-token prediction rather than logical deduction — they generate plausible text about reasoning without performing it.
NXPU addresses these limitations through a categorically different paradigm: purpose-built silicon for deterministic logical inference. Where v7 (April 2026) demonstrated forward-chaining Datalog with integer arithmetic on a 256-entry CAM, and v8 (April 2026) added backward chaining, set aggregation, top-K ranking, negation-as-failure, and Q4.12 transcendentals, v9 (May 2026) completes the reasoning matrix with three capabilities that no LLM and no other commercial accelerator provides on silicon: native probabilistic confidence propagation, inductive rule discovery with a train/test holdout, and causal structure learning via the PC algorithm.
The chip emits zero answers it cannot prove. Every derived fact carries a 48-bit provenance record naming the rule that fired and the body addresses that matched. A configurable confidence threshold makes the chip refuse to commit when evidence falls below the floor; an open-world flag distinguishes "I don't know" from "false." A train/test holdout defends inductive discovery against overfitting. None of these properties are empirical — they are structural properties of the silicon path.
v9 also closes the path to production-scale data. A dram_mig_wrapper integrates the Xilinx DDR4 MIG IP and exposes the on-board 4 GB DDR4 component as a bucket-organized fact tier. A cam_streamer engine bulk-loads predicate-indexed buckets from DDR4 into the hot CAM transparently, so the existing rule evaluation, conditional-independence test, and causal-discoverer engines reason over DRAM-resident facts without any awareness that they are paged.
The chip is shipped today as two tagged bitstreams. silicon-v1.0-bram (commit 1a1b991) is the BRAM-backed baseline; silicon-v1.1-mig (commit cf14382) carries the full 4 GB DDR4 tier through Vivado's board-flow integration of the Xilinx DDR4 SDRAM IP. Both are open-source MIT-licensed and reproducible from github.com/dyber-pqc/NXPU.
NXPU v9 integrates the reasoning, arithmetic, transcendental, probabilistic, causal, and DRAM-tier compute paths behind a unified AXI4-Lite register interface. The implementation is approximately 6,500 lines of Verilog across the symbolic logic unit, the reasoning-ALU bridge, the CORDIC and Taylor-exp engines, the backward-chaining engine, the causal-discovery engine, the DRAM tier wrappers, and the JTAG-AXI top-level integration. The complete silicon-v1.1-mig design synthesizes with 25.4% LUT utilization on xczu7ev, leaving roughly 3× the current footprint as headroom on the same chip.
The 256-entry CAM stores facts as 56-bit entries comprising an 8-bit predicate identifier and three 16-bit argument values, plus a 16-bit Q0.16 confidence and a 48-bit provenance record. All entries are compared simultaneously against a search pattern and mask in a single combinational path, producing a 256-bit match vector and an integer match count.
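The entry layout above can be sketched in software. The field ordering and the ternary pattern/mask compare below are illustrative assumptions (the RTL defines the authoritative packing), but they show how a single masked compare answers a partially bound query:

```python
# Sketch of the 56-bit CAM key: 8-bit predicate + three 16-bit args.
# Field positions are assumptions for illustration, not the RTL packing.

def pack_fact(pred: int, a0: int, a1: int, a2: int) -> int:
    """Pack an 8-bit predicate and three 16-bit args into a 56-bit key."""
    assert 0 <= pred < 256 and all(0 <= a < 65536 for a in (a0, a1, a2))
    return (pred << 48) | (a0 << 32) | (a1 << 16) | a2

def match(entry: int, pattern: int, mask: int) -> bool:
    """Ternary CAM compare: mask bits set to 1 must match; 0 bits are wildcards."""
    return (entry & mask) == (pattern & mask)

# Query parent(100, ?) -- bind predicate and first arg, wildcard the rest.
PARENT  = 0x01
entry   = pack_fact(PARENT, 100, 200, 0)      # parent(100, 200)
pattern = pack_fact(PARENT, 100, 0, 0)
mask    = (0xFF << 48) | (0xFFFF << 32)       # care about pred + arg0 only
assert match(entry, pattern, mask)
assert not match(pack_fact(PARENT, 101, 0, 0), pattern, mask)
```

In hardware, the same compare runs against all 256 entries simultaneously, producing the match vector and count in one cycle.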
A separate 4096-entry scalable_cam (16-way bank-hashed, BRAM-backed) is silicon-validated as of Phase 2.1 (commit f89fc83), unlocking 16× the working-memory capacity for demos exceeding 256 facts. A sync-reset multi-driver bug — which xsim masked (zero critical warnings) but synthesis exposed — was corrected before tape-out simulation closed.
The 16-slot rule sequencer drives semi-naive forward-chaining evaluation to fixpoint. The canonical ancestor(X,Z) :- parent(X,Z); ancestor(X,Z) :- parent(X,Y), ancestor(Y,Z) program is silicon-verified to derive all eight transitive ancestors from five seeded parent facts in 31 polling iterations.
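A minimal software model of that fixpoint, using an illustrative five-edge parent graph (not the silicon test vectors) that likewise yields eight ancestors:

```python
# Minimal software model of the semi-naive fixpoint the rule sequencer runs.
# The five parent facts are illustrative, chosen so that 5 seeds -> 8 ancestors.

def forward_chain(parent):
    ancestor = set(parent)            # rule 1: ancestor(X,Z) :- parent(X,Z)
    delta = set(ancestor)
    while delta:                      # semi-naive: join only the newest facts
        # rule 2: ancestor(X,Z) :- parent(X,Y), ancestor(Y,Z)
        new = {(x, z) for (x, y) in parent for (y2, z) in delta if y == y2}
        delta = new - ancestor
        ancestor |= delta
    return ancestor

parent = {("a", "b"), ("b", "c"), ("c", "d"), ("x", "y"), ("x", "z")}
derived = forward_chain(parent)
assert len(derived) == 8              # 5 seeded facts -> 8 derived ancestors
```

The chip runs the same iteration in hardware, with each delta pass driven by CAM match vectors rather than Python set joins.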
The backward-chaining engine implements SLD-style goal-directed proof. The canonical grandparent(X,Z) demonstration enumerates exactly three solutions (dave, eve, frank) over a five-fact parent graph, with exhaustion correctly reported.
Every CAM entry carries a 16-bit Q0.16 confidence value. The reasoning-ALU bridge exposes three probabilistic primitives:
- `compute_pmul(a, b)` — P(A ∧ B) under independence
- `compute_pnot(a)` — 1 − a
- `compute_psum(a, b)` — noisy-OR P(A ∨ B), saturating

New in v9, the rule-evaluation FSM reads body-atom confidences from the CAM at match time and composes the head confidence automatically: head_conf = body_conf_0 × body_conf_1 × body_conf_2 × rule_conf. No host software is in the loop; soft logic propagates through chains of rule firings natively. The differential-diagnosis hero demo tb_differential_dx demonstrates this end-to-end: patient confidences of 0.85, 0.80, and 0.95, multiplied against a rule confidence of 0.90, produce hypothesis(myocarditis) with derived confidence 0x94D3 (= 0.5814), a bit-exact match against the analytical chain.
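The composition is easy to cross-check in software. The sketch below models the three primitives and the FSM's head-confidence chain in Q0.16; the truncating `(a * b) >> 16` multiply and the 1.0 ≈ 0xFFFF encoding are assumptions about the RTL's rounding mode, so the check is against the analytical value at Q0.16 resolution rather than bit-exact:

```python
# Q0.16 model of the probabilistic primitives and automatic head-confidence
# composition. Rounding mode and the 1.0 ~= 0xFFFF convention are assumptions.

def q016(x: float) -> int:
    return min(round(x * 65536), 0xFFFF)

def pmul(a: int, b: int) -> int:       # P(A and B) under independence
    return (a * b) >> 16               # truncating fixed-point multiply

def pnot(a: int) -> int:               # 1 - a (1.0 encoded as 0xFFFF)
    return 0xFFFF - a

def psum(a: int, b: int) -> int:       # noisy-OR P(A or B), saturating
    return min(a + b - pmul(a, b), 0xFFFF)

# head_conf = body_0 * body_1 * body_2 * rule_conf, as the FSM composes it
head = q016(0.90)                      # rule confidence
for body_conf in (q016(0.85), q016(0.80), q016(0.95)):
    head = pmul(head, body_conf)

assert abs(head / 65536 - 0.85 * 0.80 * 0.95 * 0.90) < 2e-4   # ~0.5814
assert psum(q016(0.5), q016(0.5)) == q016(0.75)               # 0.5 + 0.5 - 0.25
```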
The rule evaluation FSM supports a score mode in which a candidate rule template is evaluated against labeled data: each successful body match increments a true-positive counter when the labeled head fact is present, or a false-positive counter when it is absent. A simultaneously-tracked train/test holdout counter scores the same template against a separate held-out fact set — the standard ILP defense against overfitting. A configurable minimum-support threshold rejects rules with insufficient evidence.
The canonical demonstration is tb_discover_grandparent: given a five-fact family tree and labeled grandparent examples, the chip scores four candidate rule templates and uniquely identifies parent ∘ parent with 100% training precision and 100% holdout precision, rejecting three distractor templates. The chip discovers the rule from data, in silicon, without training a model.
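The score-mode loop is straightforward to model. The template generator and family facts below are illustrative (the silicon scores templates against CAM contents and labeled head facts), but they reproduce the 100% train / 100% holdout precision pattern that selects parent ∘ parent:

```python
# Sketch of ILP score mode: count true/false positives for a candidate body
# template against labeled examples, separately on train and holdout facts.
# Facts and labels are illustrative stand-ins for the CAM contents.

def score(template, facts, labels):
    tp = fp = 0
    for head in template(facts):                 # each successful body match
        tp, fp = (tp + 1, fp) if head in labels else (tp, fp + 1)
    return tp, fp

def parent_parent(facts):
    """Candidate template: grandparent(X,Z) :- parent(X,Y), parent(Y,Z)."""
    for (x, y) in facts:
        for (y2, z) in facts:
            if y == y2:
                yield ("grandparent", x, z)

train = {("alice", "bob"), ("bob", "carol")}     # training fact set
hold  = {("dave", "eve"), ("eve", "frank")}      # held-out fact set

assert score(parent_parent, train, {("grandparent", "alice", "carol")}) == (1, 0)
assert score(parent_parent, hold,  {("grandparent", "dave", "frank")})  == (1, 0)
```

A distractor template (e.g. one that reverses the join) would emit heads absent from the labels, inflating its false-positive counter and failing the minimum-support threshold.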
Four structural mechanisms together ensure no answer the chip produces can be a hallucination in the LLM sense:
1. **Provenance** — every derived fact carries the record `(rule_id, body_addr_0, body_addr_1, body_addr_2, body_addr_3)`. An `EXPORT_TRACE(addr)` command walks the provenance graph and returns a serialized proof tree.
2. **Calibrated refusal** — a `min_conf` register sets the floor below which rule firings will not insert. The chip refuses to commit on low evidence.
3. **Train/test defense** — the ILP holdout counters described above score every candidate rule against a held-out fact set, so inductive discovery cannot overfit its way into the fact base.
4. **Open-world flag** — a `pred_open_world[256]` bit array distinguishes closed-world predicates (absence = false) from open-world ones (absence = UNKNOWN). The chip returns "I don't know" rather than committing to a false derivation — the most important answer an inference engine can give, and the one LLMs cannot.

Combined, the marketable claim is structural, not empirical: NXPU does not hallucinate. Every answer is a logical consequence of facts and rules the host can inspect, with a confidence the host can audit and a proof tree the host can verify. When the chip doesn't have enough evidence, it returns NOT_DERIVABLE or UNKNOWN explicitly.
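Host-side, a proof tree is just a recursive walk over those provenance records. The sketch below uses an illustrative in-memory record layout and sentinel value; the actual EXPORT_TRACE wire format is defined by the RTL:

```python
# Illustrative host-side replay of an EXPORT_TRACE-style provenance walk.
# Each derived fact's record names the rule that fired and the CAM addresses
# of its body atoms, so the proof tree falls out of a recursive lookup.

SEED = 0xFF                                  # assumed sentinel: asserted fact

def proof_tree(addr, cam):
    fact, rule_id, body_addrs = cam[addr]
    if rule_id == SEED:
        return (fact, "asserted")
    return (fact, f"rule_{rule_id}", [proof_tree(b, cam) for b in body_addrs])

cam = {
    0: (("parent", "a", "b"), SEED, []),
    1: (("parent", "b", "c"), SEED, []),
    2: (("grandparent", "a", "c"), 3, [0, 1]),   # rule 3 fired over addrs 0, 1
}
tree = proof_tree(2, cam)
assert tree[1] == "rule_3" and len(tree[2]) == 2
```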
Phase E implements the PC algorithm for causal structure learning [10] on silicon:
- **Joint-count primitive** (`compute_jcount`) — returns the count of CAM entries matching a multi-bound pattern.
- **CI-test engine** (`ci_test_engine.v`, `ci_test_cond.v`) — computes Pearson's chi-squared statistic over the 4-cell contingency table, optionally conditioned on a single binary variable (k = 1). The threshold comparison is in Q4.12. Verdict in ~350 ns at k = 0, ~700 ns at k = 1.
- **Skeleton search** (`causal_discoverer.v`) — for up to 16 variables, iterates pairs and runs the CI test for each; edges where independence holds are removed.
- **V-structure orientation** — on the five-protein Sachs subset, recovers the skeleton (mask `0x3CE`) and orients four collider edges, two matching Sachs ground truth exactly (PIP3 → pakts473, PKA → pakts473).

Full 853-record Sachs at k = 0, measured on physical silicon (2026-05-12): F1 = 0.667, a bit-exact match to the `tb_sachs_full` xsim baseline. TP = 14, FP = 14, FN = 0, recall = 1.000. 27,296 pair-facts were staged into the silicon-v1.0-bram BRAM-backed DRAM tier via JTAG-AXI in 98.8 s wall-clock; the PC-algorithm skeleton search itself executed in 11 ms on the chip. The recovered skeleton mask `0x0fffffff` equals the simulator output exactly within the causal_discoverer's 32-pair scoring cap (the 3 component-2 edges at pair indices 44/53/54 lie outside the cap — a capacity limit, not a recall miss). This closes the full-data-scale, real-PC-algorithm-on-silicon claim independent of any DDR4 path. With conditional CI at k = 1 the chip reaches F1 = 0.800 in xsim (Tier 2, `tb_sachs_full_cond`) — matching the published Tetrad-class software baseline at roughly 1,000× the throughput per CI test; physical-silicon validation of k = 1 is gated on the DDR4 hardware retarget (Tier 3b). See the Sachs benchmark status report for the full three-tier breakdown.
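As a software cross-check, the k = 0 verdict reduces to Pearson's chi-squared over the 2×2 table that the joint-count primitive populates; 3.841 is the standard 1-dof, α = 0.05 critical value, which the silicon holds in Q4.12:

```python
# Pearson chi-squared over the 2x2 contingency table filled in by four
# joint-count queries: counts for (X=1,Y=1), (X=1,Y=0), (X=0,Y=1), (X=0,Y=0).

def chi2_2x2(n11, n10, n01, n00):
    n = n11 + n10 + n01 + n00
    r1, r0 = n11 + n10, n01 + n00        # row marginals
    c1, c0 = n11 + n01, n10 + n00        # column marginals
    stat = 0.0
    for obs, r, c in [(n11, r1, c1), (n10, r1, c0), (n01, r0, c1), (n00, r0, c0)]:
        exp = r * c / n                  # expected count under independence
        stat += (obs - exp) ** 2 / exp
    return stat

THRESHOLD = 3.841                        # chi-squared critical value, 1 dof, alpha = 0.05
assert chi2_2x2(40, 10, 10, 40) > THRESHOLD    # dependent pair: keep the edge
assert chi2_2x2(25, 25, 25, 25) < THRESHOLD    # independent pair: remove it
```

The k = 1 conditional variant runs the same statistic twice, once per stratum of the conditioning variable, which is why its verdict costs roughly double.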
NXPU v9 introduces a two-tier fact-storage architecture:
- **Hot tier** — the 256-entry CAM (plus the 4096-entry `scalable_cam`), 1-cycle parallel mask-search.
- **Cold tier** — 4 GB DDR4 (`ddr4_0`, 64-bit DQ, 8 byte lanes, 512-bit AXI app data path).

Facts are stored in DRAM in a bucket-organized layout: each predicate ID has its own bucket region (1024 facts × 8 bytes = 8 KiB per bucket). New AXI commands stage facts directly into DRAM (`CMD_BUCKET_ADD_FACT`) or bulk-load a whole bucket from DRAM into the CAM (`CMD_STREAM_LOAD`).
The cam_streamer engine implements DMA-style bulk transfer. A 512-bit-to-64-bit slice-selection FSM extracts the active beat from each MIG burst based on byte_addr[5:3]. Transparent integration with the conditional-independence-test engine (D-RAM.4) and the causal discoverer (D-RAM.5) means existing engines reason over predicates that physically live in DRAM with no engine-side awareness of paging.
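The address arithmetic is easy to model. The bucket geometry follows the figures above; the base offset and exact burst alignment are assumptions for illustration:

```python
# Bucket-organized DRAM addressing and the streamer's 512-bit-to-64-bit
# slice selection. 1024 facts x 8 bytes per bucket follows the text above;
# the base offset of the bucket region is an assumed parameter.

BUCKET_FACTS = 1024
FACT_BYTES   = 8
BUCKET_BYTES = BUCKET_FACTS * FACT_BYTES       # 8 KiB per predicate bucket

def fact_addr(pred_id: int, slot: int, base: int = 0) -> int:
    """Byte address of fact `slot` inside predicate `pred_id`'s bucket."""
    return base + pred_id * BUCKET_BYTES + slot * FACT_BYTES

def beat_slice(byte_addr: int) -> int:
    """A 512-bit MIG burst carries eight 64-bit words; byte_addr[5:3]
    selects which word holds the requested fact."""
    return (byte_addr >> 3) & 0x7

addr = fact_addr(pred_id=3, slot=5)
assert addr == 3 * 8192 + 5 * 8                # bucket 3, slot 5
assert beat_slice(addr) == 5                   # word 5 of its 512-bit burst
```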
The marquee artifact is silicon-v1.1-mig (commit cf14382): the first NXPU bitstream backed by real DDR4. Build closure produced WNS = +12.178 ns, WHS = +0.017 ns, TNS = 0.000 ns at 100 MHz on xczu7ev.
The reasoning-ALU bridge supports calculus via a multi-rule pack expressing the sum, product, and chain rules as multi-head Datalog rules with shared fresh-ID pools. The polynomial differentiation demo computes d/dx(x² + 3x + 5) = 2x + 3 via 6 rule firings and 30 emitted CAM facts, with the full proof tree available via EXPORT_TRACE.
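The rewrite the rule pack performs can be modeled as structural recursion over expression nodes (tuples here standing in for hash-consed CAM facts); evaluating the derivative at a point checks the result against 2x + 3:

```python
# Software analogue of the sum/product-rule rewrites the multi-rule pack
# encodes as multi-head Datalog rules. Tuple nodes are an illustrative
# stand-in for the chip's hash-consed CAM representation.

def d(e, x="x"):                    # symbolic derivative by structural rules
    k = e[0]
    if k == "const": return ("const", 0)
    if k == "var":   return ("const", 1 if e[1] == x else 0)
    if k == "add":   return ("add", d(e[1], x), d(e[2], x))      # sum rule
    if k == "mul":   return ("add", ("mul", d(e[1], x), e[2]),   # product rule
                                    ("mul", e[1], d(e[2], x)))

def ev(e, v):                       # evaluate at x = v to check the rewrite
    k = e[0]
    if k == "const": return e[1]
    if k == "var":   return v
    if k == "add":   return ev(e[1], v) + ev(e[2], v)
    if k == "mul":   return ev(e[1], v) * ev(e[2], v)

x = ("var", "x")
poly = ("add", ("mul", x, x), ("add", ("mul", ("const", 3), x), ("const", 5)))
assert ev(d(poly), 4) == 2 * 4 + 3  # d/dx(x^2 + 3x + 5) = 2x + 3 at x = 4
```

On the chip, each rewrite step is a rule firing that emits new CAM facts from fresh-ID pools, which is why the demo reports 6 firings and 30 emitted facts.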
A separate substitution rule pack implements first-class lambda β-reduction. The canonical square = λy.y*y applied to x mechanically rewrites to x*x via five substitution rules — the first publicly demonstrated mechanized β-reduction on a non-LLM chip.
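A software analogue of that rewrite, using tuples for term nodes (the chip's hash-consed representation and five-rule decomposition differ in detail):

```python
# Illustrative model of beta-reduction for the square example:
# (lambda y. y*y) x  ->  x*x.

def subst(term, var, value):
    """Replace free occurrences of `var` in `term` with `value`."""
    kind = term[0]
    if kind == "var":
        return value if term[1] == var else term
    if kind == "mul":
        return ("mul", subst(term[1], var, value), subst(term[2], var, value))
    if kind == "lam":                  # shadowing: stop at a rebinding of var
        return term if term[1] == var else \
               ("lam", term[1], subst(term[2], var, value))
    return term

def beta(app):
    """Reduce ("app", ("lam", param, body), arg) by substitution."""
    (_, (_, param, body), arg) = app
    return subst(body, param, arg)

square = ("lam", "y", ("mul", ("var", "y"), ("var", "y")))
result = beta(("app", square, ("var", "x")))
assert result == ("mul", ("var", "x"), ("var", "x"))   # x*x
```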
All v9 results are obtained on a Xilinx xczu7ev-ffvc1156-2-e at 100 MHz on the AXI ACLK. Synthesis and implementation use Vivado 2025.1 in batch mode. Functional verification uses Vivado xsim against the same RTL.
46 silicon testbenches pass across the five reasoning modes. Selected highlights:
| Testbench | Mode | Capability | Result |
|---|---|---|---|
| `tb_ancestor` | Deductive (recursive) | 5 facts → 8 derived ancestors | PASS |
| `tb_bwd_chain` | Deductive | BC grandparent | PASS (4/4) |
| `tb_polynomial_diff` | Numerical | d/dx(x²+3x+5) = 2x+3 | PASS |
| `tb_lambda_substitution` | Numerical | square = λy.y*y β-reduction | PASS |
| `tb_cordic` | Numerical | Q4.12 sin/cos (4 quadrants) | PASS (8/8) |
| `tb_probabilistic` | Probabilistic | pmul/pnot/psum + Bayesian chain | PASS (15/15) |
| `tb_diagnostic_conf` | Probabilistic | C.9.1 native conf propagation | PASS |
| `tb_differential_dx` | Probabilistic | Hero demo — ranked clinical differential | PASS (3/3) |
| `tb_min_conf` | Anti-hallucination | C.12 epsilon-pruning | PASS |
| `tb_proof_tree` | Anti-hallucination | C.11 provenance proof tree | PASS (8/8) |
| `tb_open_world` | Anti-hallucination | C.14 "I don't know" answer | PASS |
| `tb_discover_grandparent` | Inductive | C.10 ILP rule discovery | PASS |
| `tb_holdout` | Inductive | C.13 train/test holdout defense | PASS |
| `tb_jcount` | Causal | E.1 joint count primitive | PASS (12/12) |
| `tb_ci_test` | Causal | E.2 CI test (k=0) | PASS |
| `tb_ci_cond` | Causal | E.2 conditional CI (k=1) | PASS |
| `tb_causal_discover` | Causal | E.3 PC skeleton search | PASS |
| `tb_v_structure` | Causal | E.4 v-structure orientation | PASS |
| `tb_sachs_subgraph` | Causal | E.5 v1 — 5-protein Sachs subset | PASS |
| `tb_sachs_full_cond` | Causal | E.5 v2 — 853-record Sachs (F1 = 0.800) | PASS |
| `tb_dram_loopback` | D-RAM | Word-level DDR4 access | PASS |
| `tb_dram_streamer` | D-RAM | DRAM → CAM streaming | PASS |
| `tb_streamer_4k_silicon` | Capacity | Phase 2.1 4K-CAM round-trip | PASS |
| `tb_drug_interaction_paired` | Anti-hallucination | LLM-vs-NXPU side-by-side | PASS |
| Metric | Value | Status |
|---|---|---|
| WNS (worst negative slack, setup) | +12.178 ns | Huge positive headroom |
| WHS (worst hold slack) | +0.017 ns | Met |
| TNS (total negative slack) | 0.000 ns | No failing endpoints |
| CLB LUTs | 25.4% utilized | ~3× headroom |
| Block RAM | DDR4 MIG-backed | 4 GB cold tier |
| Synthesis errors | 0 | Clean |
| Critical warnings | 0 | Clean |
Two tagged bitstreams are shipped:
- `silicon-v1.0-bram` (commit 1a1b991, May 10) — BRAM-backed baseline, 256 KiB working memory.
- `silicon-v1.1-mig` (commit cf14382, May 12) — full 4 GB DDR4 via Xilinx MIG IP.

The path from silicon-v1.0-bram to silicon-v1.1-mig required 34 build iterations across the Phase 2/D-RAM/F integration to resolve five distinct error walls: an inline blackbox stub at synth elaboration; an IBUFDS conflict between the chip top and the IP's internal IBUFDS; a DRC blackbox-cell failure at impl opt_design; ZCU104 board-flow pin-map and IO-standard alignment; and IO_BUFFER_TYPE NONE to let the IP own its IO infrastructure. Each is documented per-commit in the open-source repository.
| Metric | NXPU v9 (FPGA, 100 MHz) | Python (x86 CPU) | H100 (LLM) |
|---|---|---|---|
| CAM query latency | 10 ns (1 cycle) | 370 ns | ~500 ms |
| Single rule firing | ~520 ns | ~6.10 µs | n/a |
| Confidence-propagating rule firing | ~520 ns | n/a | n/a |
| CI test (k=0) | ~350 ns | ~10 ms | n/a |
| CI test (k=1) | ~700 ns | ~50 ms | n/a |
| PC skeleton (10 pair preds) | ~5 ms | ~5 s (causal-learn) | n/a |
| Full Sachs k=0 (27,296 facts, marginal) | 11 ms (silicon-measured) | minutes | n/a |
| Full Sachs k=1 (46,915 facts, conditional) | ~5 ms (predicted, sim-validated) | minutes | n/a |
| Energy per derivation | ~1.65 µJ | ~122 µJ | ~390,000 µJ/token |
| Hallucination rate | 0% (by construction) | 0% | 10–64% [1][3] |
| Training energy | 0 Wh | 0 Wh | ~1,300 MWh [5] |
The chip is approximately 74× more energy-efficient than CPU-based reasoning and roughly 236,000× more energy-efficient than LLM inference for logical tasks. Causal discovery is roughly 1,000× faster than the published software baselines. Inductive rule discovery — which has no LLM analog — runs in microseconds per candidate template; software ILP systems (Aleph, Metagol) run in milliseconds per candidate.
A new v9 customer-pitch artifact (commit fe94db0) demonstrates the structural difference between LLM inference and NXPU silicon. The demo loads a small FAERS-shaped drug-interaction rule pack into the CAM, then queries both an LLM and NXPU with the same six questions, four of which the LLM is known to hallucinate on (drug-supplement edge cases).
For each query, the LLM produces a confidently fluent answer; NXPU returns one of three outcomes: a derived fact with its full proof tree, NOT_DERIVABLE, or UNKNOWN (open-world).
On the four adversarial queries, the LLM produces four different hallucinations with confident wording. NXPU returns NOT_DERIVABLE four times, with the proof tree showing the closest known interaction in the CAM.
The demo is reproducible from the repository: python -m nxpu.demos.llm_vs_nxpu. Both the LLM-side responses (cached) and the silicon-side responses ship in the repository so anyone can replay offline. This is the customer-pitch artifact that closes a $250–500k POC conversation in healthcare AI, financial compliance, or pharma R&D.
Findings from the v9 pass illustrate why per-feature synthesis verification is essential alongside functional simulation:
- A sync-reset block multi-drove `total_entries`, `bank_count`, and `bank_next`. xsim picked the last assignment in source order and the design appeared to work; on real silicon, Vivado would have silently dropped the FSM driver. Consolidating into a single sync-reset block cleared all warnings. The Phase 2.1 fix (f89fc83) is the first verified silicon test on the 4K-CAM path.
- `silicon-v1.1-mig` closure achieves WNS = +12.178 ns — comfortable headroom that confirms the 100 MHz target is not at risk from DDR4 integration.

Programs target NXPU through a plain-text `.nxp` source format compiled by `nxpu/hal/nx_to_tb.py` to a Verilog testbench (or, in production deployment, to a sequence of AXI register writes). The compiler accepts facts (with an optional `:: 0.85` Q0.16 confidence suffix), rules with optional negation, per-rule confidence, multi-head emission, shared fresh-ID pools, queries, and the v9 `causal_discover` directive that wires the discovery engine.
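As an illustration of the source format — the exact surface syntax is defined by `nx_to_tb.py`, so this rendering is a hypothetical sketch using the confidence-suffix and rule forms described above:

```
# facts, with optional Q0.16 confidence suffix
parent(alice, bob) :: 0.95
parent(bob, carol)

# rule, with optional per-rule confidence
grandparent(X, Z) :- parent(X, Y), parent(Y, Z) :: 0.90

# query
?- grandparent(alice, Z)
```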
A new G.1 PS-side LLM bridge (nxpu/hal/llm_bridge.py) accepts natural-language queries on the ARM cores, hands structured queries to the chip via AXI, and renders the chip's structured answer back into natural language. The bridge is the architecture for an LLM-verifier deployment: the LLM produces the surface presentation, the chip produces the auditable inference.
With v9, the symbolic, numerical, probabilistic, inductive, and causal reasoning modes are all silicon-validated and timing-closed. The remaining roadmap is concrete engineering, not research:
| Phase | Effort | What it unlocks |
|---|---|---|
| Customer outreach | this week | First $250–500k POC in healthcare AI, financial compliance, or pharma R&D |
| Full Sachs F1 = 0.800 on silicon-v1.1-mig | 3–5 days | The marquee biological-data demo |
| Abductive engine (C.16) | ~1 week RTL | The third reasoning mode the literature recognizes (deductive ✅ + inductive ✅ + abductive ❌) |
| Conditional CI k=2 | ~3–4 days RTL | Pushes Sachs F1 from 0.800 → ~0.92 |
| Perception coupling | 3–6 weeks | Wires the on-die 16-LIF spiking-neuron mesh into the fact producer path |
| PCIe Alveo card SKU | 6–12 months | Production-deployable dev kit |
| 1U inference appliance | 9–15 months | Drop-in coprocessor: NXPU + ARM host + 100 GbE + REST API |
| ASIC tape-out (22FDX or 16nm) | 9–24 months | Single-mm² die, ~1 W, embeddable in robotics / medical / AV |
The chip occupies 25.4% of an xczu7ev. The remaining 75% is headroom. None of the roadmap items require research-grade invention. The bottleneck shifts from RTL engineering to distribution, customer engagement, and productization.
The pitch is not "better than GPT-5 at general AI." The pitch is "the only chip that can sit next to your LLM and verify its answers fast enough to ship, in domains where a wrong answer has legal, safety, or regulatory consequences."
| Domain | Pain | NXPU value | TAM |
|---|---|---|---|
| Healthcare AI | LLM hallucination = malpractice liability; FDA cannot approve transformer-only diagnostic support | Provable non-hallucination + audit trail | $20B+ |
| Financial compliance | AML / KYC / SOX / Reg-W audits require explainable decisions | Every flag carries a proof tree | $30B+ |
| Pharma R&D | Causal inference on trial data is the workflow; software baselines are slow | Sachs-class causal discovery at hardware speed (1,000×) | $15B+ |
| Defense / intelligence | LLM-based decision systems cannot be certified to DO-178C above DAL-E | Deterministic, replayable, certifiable | $10B+ |
| Autonomous vehicles | Post-incident review requires reasoning replay | Replayable derivation chain | $50B+ |
Total addressable market over 5 years: $100B+ as AI accountability regulation (EU AI Act, NIST AI RMF, ISO 42001, FDA AI/ML guidance) converges on requirements transformer-only systems structurally cannot satisfy.
NXPU v9 is the first commercial-grade neurosymbolic reasoning processor that combines bidirectional Datalog evaluation, set aggregation, top-K ranking, negation-as-failure, structural hash-consing, integer arithmetic, CORDIC transcendentals, probabilistic confidence propagation, inductive rule discovery, causal structure learning, anti-hallucination by construction, and a real 4 GB DDR4 cold tier on a single piece of silicon. It is shipped as two open-source, MIT-licensed, reproducible bitstreams: silicon-v1.0-bram and silicon-v1.1-mig. All 46 silicon testbenches pass at 100 MHz on xczu7ev with comfortable timing headroom.
The chip is not a replacement for LLMs. It is the architecture LLMs will need next to them when their answers have to be defensible — in a court, in an FDA filing, in an SEC audit, in a DO-178C certification package, in front of a regulator. That market does not exist yet because no chip has shipped to address it. NXPU is that chip.
[1] Vectara, "Introducing the Next Generation of Vectara's Hallucination Leaderboard," 2026.
[2] SQ Magazine, "LLM Hallucination Statistics 2026," 2026.
[3] M. Brinsa, "Hallucination Rates in 2025," Frontiers in Artificial Intelligence, 2025.
[4] TokenPowerBench, "Benchmarking the Power Consumption of LLM Inference," arXiv:2512.03024, 2025.
[5] P. Luccioni et al., "Estimating the Carbon Footprint of BLOOM," JMLR, 2023.
[6] K. Pagiamtzis and A. Sheikholeslami, "Content-Addressable Memory (CAM) Circuits and Architectures," IEEE J. Solid-State Circuits, vol. 41, no. 3, 2006.
[7] S. Ceri, G. Gottlob, and L. Tanca, "What You Always Wanted to Know About Datalog," IEEE Trans. Knowl. Data Eng., vol. 1, no. 1, 1989.
[8] J. A. Robinson, "A Machine-Oriented Logic Based on the Resolution Principle," J. ACM, vol. 12, no. 1, 1965.
[9] T. Komorowski, J. Pedlowski, and J. Lee, "CORDIC Algorithm Survey," Proc. ARITH, 2014.
[10] P. Spirtes, C. Glymour, and R. Scheines, "Causation, Prediction, and Search," MIT Press, 2nd ed., 2000.
[11] K. Sachs et al., "Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data," Science, vol. 308, no. 5721, pp. 523–529, 2005.
[12] G. Bi and M. Poo, "Synaptic Modifications in Cultured Hippocampal Neurons," J. Neuroscience, vol. 18, no. 24, 1998.
[13] M. Davies et al., "Loihi 2: A Neuromorphic Processor with Quantized Sparsity," IEEE Micro, vol. 42, no. 5, 2022.
[14] S. Muggleton, "Inductive Logic Programming," New Generation Computing, 1991.
[15] J. Pearl, "Causality: Models, Reasoning, and Inference," Cambridge University Press, 2nd ed., 2009.