SHIPPED silicon-v1.2-dram-fix · full 853-record Sachs k=1 on physical xczu7ev · F1 = 0.824 cross-seed, recall = 1.000 · v1.2.1 report card →
For regulated AI — where hallucination is disqualifying

Causal reasoning,
in silicon.

NXPU is an inference chip that runs deductive logic and causal discovery directly in hardware. Every answer carries a proof. When the evidence is missing, the chip refuses to guess. It cannot hallucinate, because it does not pattern-match — it derives.

  • F1 = 0.824 on Sachs causal benchmark, cross-seed, silicon
  • Recall = 1.000 over all 17 ground-truth Sachs edges
  • 46 / 46 silicon testbenches pass at 100 MHz on xczu7ev
  • WNS +15.730 ns timing slack · tag silicon-v1.2-dram-fix
  • Every output carries a replayable proof tree
  • "I don't know" is a first-class answer (open-world refusal)
  • Rediscovered BSD parity conjecture from 56 elliptic curves at 1.00 confidence
  • Rediscovered the mod-6 prime distribution from raw data
  • 24 NXLang rule packs ship on disk (calculus, pharma, causal, legal, finance, gov, ...)
  • Zero training. Zero gradients. No model weights.
  • MIT-licensed RTL, drivers, testbenches, reproducible from one repo
  • VS Code extension v0.1.10 · install in 60 seconds

An LLM predicts the next token. NXPU derives the next fact.

Same input, different mechanism. An LLM samples text that is statistically likely under its training distribution. NXPU runs a deterministic inference loop — CAM match, rule fire, confidence propagate, proof emit — until fixed point. When the rules don't cover the question, NXPU does not generate plausible-sounding text. It returns "I don't know."

LLM on GPU

Statistical pattern matcher
1. Tokenize the prompt
2. Attention across 175B parameters
3. Sample next token from softmax
4. Repeat ~200× per response
A string of plausible text
Failure mode: hallucination. The model invents a citation, a drug interaction, a precedent. Detected only by humans, after the fact. 3.3%–64% hallucination rate in 2026 benchmarks.

NXPU on FPGA

Deterministic silicon reasoner
1. CAM match facts against query pattern
2. Rule fire (FSM, in hardware)
3. Compose confidences (Q0.16 multiply)
4. Emit derived fact + proof tree; repeat to fixpoint
A fact, a proof chain, or an explicit refusal
Failure mode: by construction, the chip cannot emit an unsourced answer. If the rule set is incomplete, it returns "I don't know." Auditable, replayable, reviewable. 0% hallucination rate — structural, not statistical.
Property | LLM on GPU | NXPU on FPGA
Inference mechanism | Statistical next-token prediction | Deterministic Datalog evaluation
Proof of answer | None | 48-bit provenance per fact, replayable proof tree
Refusal behavior | Generates plausible text anyway | Explicit "I don't know" via open-world flag
New domain onboarding | Weeks of GPU fine-tuning, $$$ training cost | Write a new .nx rule pack, load, run
Regulatory auditability | Weights are opaque; behavior is statistical | Rules are source code; behavior is bit-exact
Power draw during inference | 100s of W (H100-class) | ~10 W (xczu7ev FPGA at 100 MHz)
Inference latency (one fact) | 200–800 ms / token | ~520 ns / rule fire, roughly 10⁶× faster

Not a database query. An inference engine.

A database returns facts that are stored. NXPU returns facts that are derived. That single distinction unlocks everything below: native generalization, instant onboarding to new domains, real causal learning, and a ~10⁵× energy advantage over LLM inference for the same class of decision.

1. Why this isn't just SQL

A database tells you what's in the table. NXPU tells you what follows from what's in the table. Same input, completely different output category.

SQL query

Lookup
SELECT * FROM contraindications
WHERE drug_a = 'warfarin'
  AND drug_b = 'ibuprofen';

-- 0 rows returned
Result: "NO." The row doesn't exist. But ibuprofen is an NSAID, and warfarin contraindicates NSAIDs — the patient gets hurt. The query was correct; the answer was wrong. The database had no way to derive the missing fact.

NXPU derivation

Inference
fact: drug_class(ibuprofen, NSAID).
rule: contraindicates(warfarin, X)
   :- drug_class(X, NSAID).

query: contraindicates(warfarin, ibuprofen)?
→ YES, derived in 2 cycles
→ proof: F2 + R1
Result: "YES, here's the proof." The chip composed F2 (drug class) with R1 (the rule) to derive the contraindication. Add a new NSAID tomorrow — one new fact, all derivations update automatically. No retraining, no schema migration, no missing-row failures.
What NXPU does that SQL can't: recursive rule chaining (transitive closures, supply-chain reachability, proof trees), negation-as-failure ("apply rule X unless contraindication Y holds"), set aggregation in the same pass, native conditional-independence tests on streaming data, structural causal discovery (the Sachs benchmark learns the protein-signaling graph from data — SQL can't do this at all), and inductive rule discovery from labeled examples. All in hardware, ~520 ns per rule fire.
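The warfarin example above can be sketched in a few lines of software. This is only an illustration of the derive-or-refuse behavior, not the chip's implementation: the `Answer` type, the `contraindicates` helper, and the tuple encoding of facts are assumptions made for this sketch.

```python
from typing import NamedTuple, Optional

class Answer(NamedTuple):
    verdict: str              # "YES" or "I don't know"
    proof: Optional[list]     # body fact and rule used, or None on refusal

# Facts are plain tuples: (predicate, arg1, arg2). Hypothetical encoding.
FACTS = {("drug_class", "ibuprofen", "NSAID")}

def contraindicates(drug_a: str, drug_b: str, facts: set) -> Answer:
    """Rule R1: contraindicates(warfarin, X) :- drug_class(X, NSAID)."""
    body = ("drug_class", drug_b, "NSAID")
    if drug_a == "warfarin" and body in facts:
        # The answer is derived, so it carries the fact and rule that produced it.
        return Answer("YES", [body, "R1"])
    # No rule covers the query: refuse instead of answering "NO".
    return Answer("I don't know", None)

derived = contraindicates("warfarin", "ibuprofen", FACTS)
refused = contraindicates("warfarin", "naproxen", FACTS)

# Adding one fact is enough for the same rule to cover a new drug.
more_facts = FACTS | {("drug_class", "naproxen", "NSAID")}
updated = contraindicates("warfarin", "naproxen", more_facts)
```

The second query is refused only because no `drug_class` fact for naproxen has been loaded; once that single fact is added, the unchanged rule derives the contraindication.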

2. Zero training is the product, not the limitation

Every benefit below is structural — not a roadmap promise, not a careful workaround. When you don't have a trained model, you don't have any of the problems that come with one.

Day-zero new domain
New clinical specialty? New jurisdiction's tax code? Write a .nx file with the rules and load it. Onboarding time: hours, not months. Compare to fine-tuning an LLM on a new corpus: data curation, training run, eval harness, safety review, quarter after quarter.
Compliance updates land same day
When the FDA adds a contraindication, you edit one rule and redeploy the .nx file. No retraining, no model card update, no safety re-review. The chip's behavior is bit-exact identical to the rules, and that is what makes it auditable.
No model drift, ever
An LLM provider updates weights on their schedule, and your behavior changes underneath you. NXPU's behavior is a deterministic function of (RTL + rules + facts). All three are version-controlled artifacts you ship. Behavior is reproducible across a decade.
No training data subpoena risk
There is no training corpus. There are your facts and your rules, both in your repo. Discovery requests have nothing to find in a third-party black box. HIPAA / GDPR / SOX-clean by construction.
No catastrophic forgetting
Adding a new domain doesn't degrade behavior in another. Rule packs are namespaced by predicate; loading tax_compliance.nx never silently changes how healthcare_allergies.nx behaves. Composable without interference.
Auditable from day one
The rules are the spec. A trained model only approximates its spec; here there is nothing to approximate. A regulator reads contraindicates(warfarin, X) :- drug_class(X, NSAID) and that is the chip's behavior. One artifact, no gap.

3. Energy: ~10⁵× per inference, infinite at training

A decision-support deployment that today requires a rack of H100s runs, with NXPU, on a single ~$2.5k FPGA dev board, with proof trees attached.

Energy axis | LLM on H100 | NXPU on xczu7ev FPGA
Chip TDP | ~700 W | ~10 W (measured)
Energy per one useful inference | ~0.1–1 J / token (200–800 ms on H100) | ~1.65 µJ / derivation (~520 ns)
Ratio per inference | baseline | ~10⁴–10⁶× less
Training energy (one-time) | ~50 GWh (GPT-4 scale) | 0 (forever): there is no training
Deployment footprint | Multi-GPU server, often a cluster | Single FPGA board, edge-deployable
Data-center dependency | Yes (network round-trip to inference cluster) | No, runs offline at the point of use
Cooling overhead | Active liquid cooling typical at H100 scale | Passive heat sink on dev board

The per-inference number is measured on silicon: average rule-fire latency on the v38f bitstream is 52 cycles at 100 MHz = 520 ns. Power figure is conservative — xczu7ev typical at 100 MHz with 25% LUT utilization runs 8–12 W in our setup. The 0 J training-energy claim is structural: NXPU has no learnable parameters that require optimization. The bigger lever is the training number. Most AI-energy discussion focuses on inference; the elephant in the room is training-cost amortization. NXPU eliminates the elephant.
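The arithmetic behind these figures is energy = power × time. The sketch below uses only numbers quoted on this page; the function and variable names are made up for illustration. The page does not say which power figure the 1.65 µJ number is based on, so the sketch computes both the energy at the ~10 W board figure and the average power that 1.65 µJ over 520 ns would imply.

```python
CLOCK_HZ = 100e6            # 100 MHz fabric clock, from the text
CYCLES_PER_RULE_FIRE = 52   # average rule-fire latency in cycles, from the text

def latency_s(cycles: int, clock_hz: float = CLOCK_HZ) -> float:
    """Wall-clock time for a given number of clock cycles."""
    return cycles / clock_hz

def energy_j(power_w: float, seconds: float) -> float:
    """Energy in joules drawn at constant power over a time window."""
    return power_w * seconds

rule_fire_s = latency_s(CYCLES_PER_RULE_FIRE)   # 520 ns
board_level_j = energy_j(10.0, rule_fire_s)     # energy at the ~10 W board figure
implied_power_w = 1.65e-6 / rule_fire_s         # power implied by 1.65 uJ per derivation

# Ratio of the quoted LLM range (0.1 to 1 J per token) to 1.65 uJ per derivation.
ratio_low = 0.1 / 1.65e-6
ratio_high = 1.0 / 1.65e-6
```

At 10 W the 520 ns window works out to about 5.2 µJ, while 1.65 µJ corresponds to roughly 3.2 W over the same window. Either way the ratio against 0.1 to 1 J per token stays on the order of 10⁴ to 10⁶.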

4. Yes, it actually learns — six rungs, all silicon-validated

NXPU does discrete-structure learning — rules, causal graphs, second-order patterns — the way a mathematician learns, not the way a statistician fits weights. Six rungs of learning capability are already silicon-validated. Each one has a concrete demo that produces a result the chip wasn't told.

Rung 1: Deductive rule firing
CAM matches rule bodies in 1 cycle; forward-chains all consequences to fixpoint. The canonical ancestor(X,Z) :- parent(X,Y), ancestor(Y,Z) derives all 8 ancestors from 5 parent facts in 31 polling iterations.
Rung 2: Backward proof search
SLD-style goal-directed proof. Given a query, the chip recursively unifies against rule heads, searches for supporting facts, and returns a proof tree. grandparent(X,Z)? enumerates exactly 3 solutions over a 5-fact graph, with exhaustion correctly reported.
Rung 3: Lemma caching
When the same sub-proof appears twice, the chip caches the lemma so future queries skip the re-derivation. Speed-of-thought on second-encounter goals.
Rung 4: Conjecture discovery
From 474 raw number-theory facts about integers 1..40, NXPU surfaced "primes > 3 are congruent to 1 or 5 (mod 6)", one of the most famous elementary number-theory results, by composing the mod6 and next_prime relations. Plus 35 other rules, all derived from raw data with no prior hints.
Rung 5: Cross-domain transfer
When the chip notices a pattern in domain A (say, parity of derivatives in calculus), it tests whether the meta-pattern applies in domain B (say, parity of L-function signs in number theory). The "derivative-of-odd-is-even" theorem transfers structurally to the BSD parity conjecture.
Rung 6: Deep analogical reasoning
Composes two existing rules into a new second-order rule. From 56 real elliptic curves (Cremona's tables), the chip rediscovered the parity conjecture (a consequence of the Birch and Swinnerton-Dyer conjecture, itself a Clay Millennium Problem) at 1.00 confidence across all 56 supporting cases. The chip was given rank, sign, torsion, and conductor, and was never told the parity rule. It derived it.
The Sachs benchmark itself is learning, not just inference. Given 853 single-cell observations across 11 phosphoproteins, the chip discovers the causal graph structure using the PC algorithm in hardware (joint counts → conditional-independence tests → skeleton search → v-structure orientation). It recovers all 17 ground-truth edges with recall = 1.000, F1 = 0.824 cross-seed, matching Tetrad-class published software baselines (0.74–0.82) at 10³× the throughput. This is causal-structure learning from observational data, in silicon, with proof per edge.
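Rungs 1 and 2 can be illustrated in software with a small forward-chaining loop. This sketch is not the chip's sequencer. The 5-fact parent graph is hypothetical (the page does not list the facts it used), and the base clause ancestor(X,Y) :- parent(X,Y) is assumed, since only the recursive clause is quoted above.

```python
# Hypothetical 5-fact parent graph (a small tree).
PARENT = {("a", "b"), ("a", "c"), ("b", "d"), ("b", "e"), ("c", "f")}

def ancestors(parent):
    """Semi-naive forward chaining to fixpoint for the ancestor relation."""
    known = set(parent)   # assumed base clause: ancestor(X, Y) :- parent(X, Y)
    delta = set(parent)   # facts that are new since the previous pass
    passes = 0
    while delta:
        passes += 1
        # ancestor(X, Z) :- parent(X, Y), ancestor(Y, Z), joined against new facts only
        derived = {(x, z) for (x, y) in parent for (y2, z) in delta if y == y2}
        delta = derived - known
        known |= delta
    return known, passes

def grandparents(parent):
    """grandparent(X, Z) :- parent(X, Y), parent(Y, Z)."""
    return {(x, z) for (x, y) in parent for (y2, z) in parent if y == y2}

ANCESTOR, PASSES = ancestors(PARENT)
GRANDPARENT = grandparents(PARENT)
```

On this particular graph the loop ends with 8 ancestor facts and the grandparent query has exactly 3 solutions, the same counts quoted for Rungs 1 and 2. The number of software passes is unrelated to the 31 polling iterations reported for the hardware run.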

5. What actually happens when you ask the chip a question

Cycle-by-cycle, the inference loop is a small finite-state machine. No layers, no parameters, no sampling. Every step is auditable.

Cycle | Subsystem | What happens
0–1 | CAM compare | All 256 fact entries compared in parallel against the rule body pattern + mask. Match vector + match count returned in 1 clock (10 ns).
2–10 | Rule sequencer FSM | For each matching body atom, the rule's variable-bindings are unified. If body has 3 atoms, 3 CAM compares run sequentially with the bound variables propagated forward.
11–14 | Confidence compose | Body confidences read from CAM, multiplied in a 4-deep Q0.16 multiply tree against the rule confidence. Head confidence emitted at fixed precision.
15–52 | CAM insert + provenance | Derived fact (predicate + args + confidence + 48-bit provenance record naming the rule and body addresses that matched) written to the next free CAM slot. The proof tree is now reconstructable from this 48-bit record alone.
loop | Semi-naive evaluation | Sequencer iterates rules until no new facts are derived this pass (fixpoint). Newly-derived facts only re-evaluate against rules whose body could have used them, so the work is bounded by predicate dependency, not by data volume.
exit | Result or refusal | If the queried goal is in CAM, return the fact + a serialized proof tree walked via EXPORT_TRACE. If min_conf was not met or no rule applied, return REFUSE with the open-world flag set. The chip cannot return a "guess." There is no path in the FSM that emits an unsourced fact.

Total latency for a typical 3-atom rule fire: 52 cycles × 10 ns = 520 ns. A 60-rule diagnostic pack with ~200 facts reaches fixpoint in ~12 µs. A million-fact dataset (DDR4-staged via the streamer) processes at the same per-rule cost — capacity scales with DRAM, latency stays bounded by the rule×CAM-size product.
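The idea that a proof tree can be rebuilt from a per-fact provenance record can be sketched as follows. The bit layout here is purely hypothetical: the page states only that the record is 48 bits and names the rule and the body addresses. This sketch assumes a 12-bit rule id, a 4-bit body count, and four 8-bit CAM addresses, and represents the CAM as a Python dict.

```python
def pack(rule_id: int, body_addrs: list) -> int:
    """Pack a provenance record into 48 bits (hypothetical layout)."""
    assert 0 <= rule_id < (1 << 12) and len(body_addrs) <= 4
    word = (rule_id << 36) | (len(body_addrs) << 32)
    for slot, addr in enumerate(body_addrs):
        assert 0 <= addr < 256          # 256-entry CAM, so 8-bit addresses
        word |= addr << (8 * slot)
    return word

def unpack(word: int):
    """Recover (rule_id, body_addrs) from a packed record."""
    rule_id = (word >> 36) & 0xFFF
    count = (word >> 32) & 0xF
    addrs = [(word >> (8 * slot)) & 0xFF for slot in range(count)]
    return rule_id, addrs

def proof_tree(cam: dict, addr: int):
    """Walk provenance records recursively back to input facts."""
    fact, provenance = cam[addr]
    rule_id, body = unpack(provenance)
    if not body:
        return fact                     # input fact: a leaf of the proof
    return (fact, f"R{rule_id}", [proof_tree(cam, a) for a in body])

# Toy CAM: two input facts and one fact derived from them by rule 1.
CAM = {
    0: ("parent(a,b)", pack(0, [])),
    1: ("parent(b,d)", pack(0, [])),
    2: ("grandparent(a,d)", pack(1, [0, 1])),
}
TREE = proof_tree(CAM, 2)
```

Because each derived entry points only at CAM addresses, the host can replay the full derivation without any side table, which is the property the table above attributes to the 48-bit record.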

12 silicon runs. Six configurations. One chip.

The Sachs causal-discovery benchmark, stress-tested across four improvement levers on the same v38f bitstream: no rebuilds, just driver knobs. Every bar below is a real silicon run on the ZCU104 dev board, scored against the canonical Sachs ground truth (17 published edges of the protein-signaling DAG).

Figure: Cross-seed F1 score by configuration
In-cap 32-pair Sachs subspace · 853 records · v38f bitstream
Legend: mean of 2 seeds · per-seed range · projected (v39). Y-axis: F1 score, 0.50 to 1.00. Reference band: Tetrad-class published baseline, 0.74–0.82.
  • Baseline (v1.2 reference): 0.789
  • D · strict threshold (chi-sq α 0.05 → 0.01): 0.824 (+0.035)
  • E1 · cond (PKA, Raf), d-separates correctly: 0.812
  • E2 · cond (Raf, Mek), no d-separation: 0.667 (−0.122)
  • E3 · cond (Mek, Erk), no d-separation: 0.667
  • A · multi-pass, full 55-pair canonical: 0.791 (recall = 1.000)
  • C · v39 RTL, MAX_PAIRS 32→64: ~0.81 (projected)

Each bar is the mean across two random seeds (0xC0FFEE12 and 0xDEADBEEF). White tick marks show the per-seed range. Every run: 853 records, 40,091 bucket-adds, ~150 s wall-clock on the ZCU104. Bar values are direct readback from REG_CD_EDGE_MASK after the k=1 conditional pass, scored against the canonical 17-edge Sachs ground truth. Per-stratum CSV evidence (120 contingency tables, 100% pass internal invariants) is checked in to artifacts/silicon-v1.2.1-battery/.
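For readers who want to reproduce the scoring step, the sketch below computes precision, recall, and F1 for a recovered edge set against a ground-truth edge list. It assumes edges are scored as undirected skeleton edges, which the page does not state, and the node names and edge lists are toy placeholders rather than Sachs data.

```python
def score(predicted, truth):
    """Return (precision, recall, f1) for undirected edge sets."""
    pred = {frozenset(edge) for edge in predicted}
    true = {frozenset(edge) for edge in truth}
    tp = len(pred & true)     # true edges that were recovered
    fp = len(pred - true)     # recovered edges that are not in the ground truth
    fn = len(true - pred)     # true edges that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: every true edge is found, plus one false positive.
TRUTH = [("A", "B"), ("B", "C"), ("C", "D")]
PREDICTED = [("B", "A"), ("B", "C"), ("C", "D"), ("A", "D")]
PRECISION, RECALL, F1 = score(PREDICTED, TRUTH)
```

When recall is 1, F1 reduces to 2P / (1 + P), so differences between configurations come entirely from how many false positives the conditional pass prunes.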

D-strict: clean +0.035 F1
Tightening the chi-sq threshold from α=0.05 to α=0.01 drops two borderline-CI false positives without losing any true edges. Best in-cap mean F1 across the battery. Recommended Tier 3b default for v1.2.1.
A multi-pass: full-canonical recall
By scoring all 55 Sachs pairs (not just the 32-pair in-cap scope), the chip now recovers all 17 ground-truth edges with recall = 1.000 on both seeds. Closes the report-card scope gap.
E2/E3: conditioning matters
Conditioning on downstream MAPK adjacencies (Raf+Mek, Mek+Erk) drops zero edges, so F1 collapses to the k=0 baseline. Useful negative result: confirms the chip's d-separation is real causal work, not statistical luck.
Recall = 1.000
both seeds · all configurations
Across every silicon run in the battery, the chip never missed a true Sachs edge. The differentiator across levers is precision (how aggressively false positives get pruned by the conditional pass), not recall — meaning the underlying PC-algorithm engine on silicon is doing the right edge-recovery work bit-exactly. Full per-run data, per-stratum contingency tables, and the v39 RTL widen sketch are all in the v1.2.1 report card.

From symbolic calculus to clinical decisions — same chip, same proof discipline.

NXPU isn't a single-purpose accelerator. The same deductive engine that proves a chain-rule derivative also enforces a drug-interaction contraindication, also flags an OFAC-sanctioned transaction, also derives a contract-clause obligation. Load a different .nx rule pack, query the chip, get a proof.

Symbolic math
High-level calculus — with proof, in 520 ns

The chip applies differentiation rules symbolically. Power rule, sum rule, product rule, quotient rule, chain rule, all trig and inverse-trig identities, the fundamental theorem — encoded as a single 46-rule .nx pack. The engine doesn't compute; it derives, and every derived expression carries the rule chain that produced it.

// load calculus_rules.nx (46 rules ship)
rule: derivative(x^N, x) = N · x^(N-1).
rule: derivative(sin(U), x) = cos(U) · derivative(U, x).

// query
?- derivative(sin(x^2), x).

// chip output
2x · cos(x^2)
proof: R_chain (sin outer, x^2 inner)
  ↳ R_sin: d/dx[sin(u)] = cos(u)·du/dx
  ↳ R_power n=2: d/dx[x^2] = 2x
facts: 0   rules used: 3   total: 520 ns

Beyond high school: integration by parts, partial fractions, multi-variable gradient, divergence, Laplace transforms, Fourier expansions — all expressible as .nx rule packs. The chip is a computer-algebra system in silicon with mathematical proof per output.
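The "derive, and keep the rule chain" idea can be sketched as a tiny symbolic differentiator that records each rule it applies. This is an illustration only: the tuple encoding of expressions, the rule names in the trace, and the two-rule coverage are assumptions for this sketch, not the contents of calculus_rules.nx.

```python
import math

# Expressions: "x", ("const", c), ("pow", "x", n), ("sin", u), ("cos", u), ("mul", a, b)

def derive(expr, trace):
    """Return d(expr)/dx and append every rule used to `trace`."""
    if expr == "x":
        trace.append("R_var: d/dx[x] = 1")
        return ("const", 1)
    op = expr[0]
    if op == "pow":                        # power rule on x^n
        _, base, n = expr
        if base != "x":
            raise ValueError(f"no rule covers {expr!r}")
        trace.append(f"R_power n={n}: d/dx[x^{n}] = {n}*x^{n - 1}")
        return ("mul", ("const", n), ("pow", "x", n - 1))
    if op == "sin":                        # chain rule with sin as the outer function
        inner = expr[1]
        trace.append("R_chain + R_sin: d/dx[sin(u)] = cos(u)*du/dx")
        return ("mul", ("cos", inner), derive(inner, trace))
    raise ValueError(f"no rule covers {expr!r}")   # refuse rather than guess

def evaluate(expr, x):
    """Numerically evaluate an expression tree at a point (for checking only)."""
    if expr == "x":
        return x
    op = expr[0]
    if op == "const":
        return expr[1]
    if op == "pow":
        return evaluate(expr[1], x) ** expr[2]
    if op == "sin":
        return math.sin(evaluate(expr[1], x))
    if op == "cos":
        return math.cos(evaluate(expr[1], x))
    if op == "mul":
        return evaluate(expr[1], x) * evaluate(expr[2], x)
    raise ValueError(f"cannot evaluate {expr!r}")

TRACE = []
RESULT = derive(("sin", ("pow", "x", 2)), TRACE)   # cos(x^2) * (2 * x^1)
```

The returned tree is cos(x²)·(2·x¹), which agrees with 2x·cos(x²), and `TRACE` holds the chain-rule and power-rule steps in the order they fired. An expression with no matching rule raises instead of returning a plausible-looking answer.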

Clinical decision
Drug interaction the LLM missed

FDA-derived rules + patient context. When the database doesn't have the explicit row but the rule implies the interaction, NXPU derives the warning. The chip refuses to proceed rather than silently approve. Audit trail attached.

// load pharma_rules.nx
fact: drug_class(ibuprofen, NSAID).
rule: contraindicates(warfarin, X) :- drug_class(X, NSAID).

// patient
fact: prescribed(patient_42, warfarin).
fact: home_med(patient_42, ibuprofen).

// query
?- safe_combo(patient_42).

// chip output
→ REFUSE · contraindicates(warfarin, ibuprofen)
proof: drug_class(ibuprofen, NSAID) [F1]
  ↳ rule R1 fires → head asserted
action: alert clinician, do not silently approve
total: ~6 µs · 100% audit trail attached

Same primitives also drive contraindication checking for chemotherapy regimens, allergy cross-reactivity, and pediatric dosing constraints. The pharma rule pack ships with 200+ FAERS-derived rules out of the box.

Compliance · AML
Real-time OFAC + sanctions screening

Stream transactions through the chip; each one fires the compliance rule set in ~520 ns and emits either a clear pass or a held-with-proof for review. The proof tree IS the SAR audit trail.

// load finance_rules.nx
fact: sanctioned("DPRK").
fact: sanctioned("IRN").
rule: requires_OFAC_review(TX) :- originates(TX, J), sanctioned(J).

// streaming transaction
fact: originates(tx_8c4a, "DPRK").
fact: amount(tx_8c4a, 47500).

// chip output
→ HOLD · requires_OFAC_review(tx_8c4a)
proof: originates && sanctioned → review
action: queue for compliance officer
SLA: ~520 ns per tx · 40k TPS / FPGA

Behavior is bit-exact reproducible — the same audit trace is regenerable from rules+facts decades later. FedRAMP / SOX / BSA-friendly architecture.

Legal · contracts
Contract obligation extraction

Encode contract terms as facts, regulatory clauses as rules. The chip derives every active obligation a contract triggers, plus jurisdictional overrides. Two contracts in different jurisdictions can derive different obligations from the same clause — visible in the proof tree.

// load legal_contracts.nx
fact: contract(c_42, "data_processing").
fact: jurisdiction(c_42, "EU").
rule: applies_gdpr(C) :- jurisdiction(C, "EU").
rule: requires_dpa_clause(C) :- applies_gdpr(C), contract(C, "data_processing").

// query
?- obligations(c_42).

// chip output
requires_dpa_clause(c_42)
requires_sub_processor_disclosure(c_42)
requires_72h_breach_notification(c_42)
proof tree available via EXPORT_TRACE

18-contract sample pack ships with the IDE. Same engine handles SOX disclosures, HIPAA BAAs, cross-border IP licensing constraints.

From evaluation to production in three steps.

Most enterprise AI adoptions take 9–18 months. NXPU's path is weeks, because there is no training run, no GPU procurement, no model-card review, no safety-team RFP. You order a dev board, write your rule pack, ship.

1
Week 1
Evaluate

Order a ZCU104 dev board (~$2.5k retail, Xilinx). Flash the latest open-source bitstream. Install the VS Code extension. Load one of the 24 NXLang rule packs that ship out of the box. Run a silicon-validated inference your first day.

$ wget nxpu_top_v38f_bram.bit
$ xsdb -source program.tcl
$ code --install-extension nxpu.vsix
2
Weeks 2–6
Pilot

Write your domain's .nx rule pack — or have us write it. Wire NXPU into your existing workflow via REST or gRPC. Run shadow inference against your production system for a fortnight. Compare proof chains to expert review. The chip's behavior is bit-exact, so the pilot result is the production result.

POST /api/ask
{ "query": "contraindicates(warfarin,X)",
  "context": {"patient_id":"p_42"} }
→ { result, proof_tree, confidence }
3
Weeks 6+
Ship

FPGA appliance in your data center or at the edge, ASIC at high-volume sites, cloud-hosted for elastic workloads. Behavior reproduces across all three because the RTL is identical. Audit logs export to your SIEM. Rule changes deploy through your existing CI/CD — the .nx file is text.

$ git push origin main
# → CI/CD validates new .nx rules
# → rolls bitstream-pinned config
# → production inference, same day

Form factors

Option | Use case | Order of magnitude | Status
ZCU104 dev board | Evaluation, pilot, research | ~$2.5k · 1 FPGA · ~40k QPS | Available today
1U appliance | Departmental on-prem (clinic, branch, edge) | 4× FPGA · ~160k QPS · SOC2-ready chassis | Q3 2026, design partners now
Rack appliance | Enterprise data center, regional CDN | 24× FPGA · ~10M facts/sec · 2 kW | Q4 2026
Cloud-hosted API | Burst capacity, low integration cost | Per-million-query pricing · same bitstream | 2027, design partners
Custom ASIC | >1B queries/day, latency-critical edge devices | Tape-out partnership program | By engagement

Built for the rooms where AI usually isn't welcome.

Regulated industries reject statistical AI because it's not auditable, not deterministic, and not reproducible across time. NXPU is all three by construction. Below is what that means in practice: a compliance posture you can hand to your CISO, an integration story you can hand to your platform team, and a commercial path you can hand to procurement.

All execution local
Inference runs on your hardware. No third-party API call. No data leaves your network. No vendor can subpoena what you didn't transmit. HIPAA / GDPR / SOX / FedRAMP architecturally clean.
Open RTL, auditable to the gate
MIT-licensed Verilog: your security team can read every register transfer. No firmware black boxes, no proprietary inference servers. Behavior is a function of public source code.
Deterministic, reproducible across decades
Same .bit + same .nx + same input → bit-identical output, forever. FDA 510(k) submission-friendly. No model drift, no statistical surprises in production.
No training corpus, no PII risk
The chip has no learnable parameters. There's no training data to subpoena, leak, or de-identify. Your data flows in, never trains anything.
Rule changes through your CI/CD
The .nx rule files are text in a git repo. Code review, sign-off, change-management tickets, rollback: everything your platform team already runs. No bespoke "ML ops" pipeline required.
Standard integration surfaces
REST + gRPC out of the box. Python / Java / Go SDKs on roadmap. ServiceNow, Salesforce, Epic MyChart, Snowflake, and Databricks connector patterns. Looks like a normal microservice to the rest of your stack.

Compliance posture

Regime | Architectural support | Certification status
HIPAA (US healthcare) | Local execution · no PHI transmission · audit trail per derivation | BAA-ready · certification on customer engagement
GDPR (EU) | No training corpus · data-residency by deployment · DPIA-ready | Architecturally compliant · DPA template available
SOC 2 Type II | Deterministic behavior · change-management via git · access controls via standard infra | Roadmap 2026 · design partner program
FedRAMP / DoD IL5 | Open RTL audit · air-gapped operation · FIPS 140-3 cryptographic boundary | Roadmap: ATO partner engagement
FDA 510(k) / SaMD | Bit-exact reproducibility · proof per inference · no model drift | De-novo submission pathway available with customer
SOX / BSA / AML | Audit trail = proof tree · regulator can replay any historical decision | Architecturally compliant · customer-specific audit support

Ready to evaluate?

We work directly with technical evaluators — CTOs, principal engineers, compliance officers, regulatory leads. A briefing covers your specific use case, walks the chip running your rule pack live, and outlines a pilot scope. ~45 minutes, no slide deck.

001

Nine subsystems.
One chip.
Zero hallucination.

Not a GPU doing matrix math. Not an LLM guessing statistically. Purpose-built silicon for deterministic logical inference — with a complete compiler toolchain from NXLang source to hardware.

Figure: Tiered fact store · 4-layer memory hierarchy
  • DRAM tier: 1 MiB URAM · 128 pred buckets · 1024 slots each · streamer-fed
  • Scalable CAM: 4096 entries · 16-way bank-hashed · BRAM-backed
  • Hot CAM: 256 entries · 56-bit entries · all compare in 1 cycle · 10 ns
  • Reasoning engines: rule FSM · BC engine · CI engine · causal · ILP
Data paths: query path (↓) · facts stored (↑) · proof emit (↑) · streamer pre-load (↓)
10 ns · CAM Query Latency (1 cycle)
100% · Accuracy (All Testbenches)
1.65 µJ · Energy per Derivation
46/46 · Silicon Testbenches PASS
100 MHz · Timing Met on xczu7ev
+12.178 ns · WNS Slack (silicon-v1.1-mig)
25.4% · LUT Utilization (3x Headroom)
4 GB · Real DDR4 Cold Tier (MIG IP)
F1 = 0.667 · Sachs k=0 on Silicon (bit-exact to sim)
1.000 · Recall: every true edge recovered
27,296 · Pair-facts staged via JTAG-AXI
98.8 s · Full 853-record Sachs wall-clock
003

Type your own query.
Watch NXPU answer.
Watch the LLM hallucinate.

Live playground — type any drug-interaction question and the chip's forward-chain engine answers in your browser, side-by-side with an LLM response on the same question. NXPU returns UNSAFE with a cited mechanism and proof tree, or NOT_DERIVABLE when no rule covers the query — the LLM gives a confident answer to everything, including queries it has no real knowledge of. The page runs the chip's exact rule-firing semantics in JavaScript; the same algorithm runs at 100 MHz on Xilinx silicon (see the recorded silicon transcript for byte-exact validation).

Or open the playground fullscreen: demo/play · byte-exact silicon transcript (recorded 2026-05-12): demo/terminal · full markdown report: drug_interaction_silicon_2026-05-12.md · Sachs benchmark: SACHS_REPORT.md · repo
001.5

The chip cannot make
things up. Here’s why.

LLMs hallucinate because their only fitness function is "next-token plausibility." There is no separation between things the model knows and plausible-sounding text. NXPU is structurally different. Every output is the result of explicit logical derivation from explicit facts and rules. The chip cannot return a fact that isn’t entailed by its inputs — ever — because the silicon literally has no path that produces ungrounded outputs. Five hardware mechanisms back this:

PILLAR 1 · C.11
Proof Trees
Every CAM entry stores a 48-bit provenance record: which rule fired, and the addresses of the body facts that satisfied it. The host walks the tree recursively to get a complete derivation chain back to your input data.
tb_proof_tree: 8/8 derived facts have valid proofs
PILLAR 2 · C.9 / C.9.1
Calibrated Confidence
Every fact has a Q0.16 confidence. Rules compose them natively: head_conf = product of body confidences × rule strength, on a 4-deep multiply tree in silicon. No external calibration. Uncertainty is quantified, not hidden.
tb_diagnostic_conf: 0.85 × 0.80 × 0.95 × 0.9 = 0.5814 (silicon: 0x94D3) ✓
PILLAR 3 · C.12
Quantitative Refusal
Set a min_conf threshold. Derivations whose composed confidence falls below that threshold are NOT inserted into CAM. The chip refuses to commit to conclusions it isn’t sufficiently sure about, and probabilistic chains die early instead of flooding the CAM with low-confidence noise.
tb_min_conf: patient_b (conf 0.02) pruned at threshold 0.5 ✓
PILLAR 4 · C.13 / C.15
Generalization Defense
When the chip discovers rules from data, each candidate is scored on a held-out test set in addition to training. Rules that fit training but fail holdout (overfit) are rejected. Minimum support filter rejects rules that fit too few examples to be patterns rather than coincidences.
tb_holdout: chip distinguishes generalizing from non-generalizing rules ✓
PILLAR 5 · C.14
"I Don’t Know"
Mark a predicate open-world and the chip stops treating absence as falsehood. Negated body atoms on open-world predicates fail rather than succeed via NaF. The chip explicitly refuses to derive conclusions from missing data — the difference between "false" and "unknown."
tb_open_world: refuses to declare p2 safe with no allergy data ✓
BONUS · C.10
Rule Discovery on Chip
You give the chip data + labels; the chip enumerates candidate rules, scores each one against your data, and returns the rules that work. No training, no gradients, no model weights. The discovery loop runs entirely on silicon at hardware speed, defended by all five pillars above.
tb_discover_grandparent: chip identified the correct rule from raw data ✓
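Pillars 2 and 3 come down to fixed-point multiplication followed by a threshold compare. The sketch below models Q0.16 confidences as 16-bit integers. The truncating conversion, the left-to-right multiply order, and the helper names are assumptions made for illustration; the silicon's exact rounding and tree order are not described on this page, so the result is only expected to land within a few LSB of the quoted 0x94D3.

```python
ONE = 1 << 16   # Q0.16: 16 fractional bits, so 1.0 would be 0x10000

def to_q016(p: float) -> int:
    """Convert a probability in [0, 1] to Q0.16, truncating and clamping to 0xFFFF."""
    return min(int(p * ONE), ONE - 1)

def q_mul(a: int, b: int) -> int:
    """Multiply two Q0.16 values, dropping the low 16 bits of the 32-bit product."""
    return (a * b) >> 16

def compose(body_confs, rule_conf) -> int:
    """head_conf = product of body confidences times rule strength."""
    acc = to_q016(body_confs[0])
    for conf in list(body_confs[1:]) + [rule_conf]:
        acc = q_mul(acc, to_q016(conf))
    return acc

def passes_min_conf(conf_q: int, min_conf: float) -> bool:
    """Pillar 3: only derivations at or above the threshold are kept."""
    return conf_q >= to_q016(min_conf)

HEAD = compose([0.85, 0.80, 0.95], 0.9)   # the tb_diagnostic_conf example
```

With a threshold of 0.5, `HEAD` (about 0.58) is kept, while a derivation at confidence 0.02, like patient_b in tb_min_conf, is pruned.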
THE LITERAL CLAIM

NXPU does not hallucinate. Every answer it produces is provable (C.11), calibrated (C.9.1), above an evidence threshold (C.12), derived from rules that demonstrably generalize to unseen data (C.13), with sufficient support to be a pattern rather than a coincidence (C.15). When evidence is insufficient the chip explicitly refuses to commit instead of guessing (C.14). Plus, the chip can discover rules itself from your data with no training (C.10).

Every clause maps to a specific commit on github.com/dyber-pqc/NXPU with a silicon testbench you can replay.

002

Bidirectional reasoning.
Real numerics.
Silicon-verified.

Forward and backward chaining over Datalog with full SLD resolution. Aggregation over sets. Top-K ranking. Negation-as-failure. Structural hash-consing. Q16.16 fixed-point ALU and Q4.12 CORDIC transcendentals. Probabilistic confidence propagation. Inductive rule discovery. Causal structure learning. 46 testbenches passing on real Vivado xsim, timing met on real silicon, and a real 4 GB DDR4 tier via Xilinx MIG IP.

Bidirectional Datalog
FC Sequencer + BC Engine + Goal Cursor
256-entry CAM with O(1) parallel match. 16-state rule eval FSM with backtracking, dedup, and 8-variable bindings. Semi-naive forward chaining to fixpoint. SLD-style backward chaining with rule unfolding. Recursive predicates (ancestor) silicon-verified end-to-end.
  • 10 ns CAM query (single combinational cycle)
  • 4 body atoms / 8 variables / 16 rule slots
  • FC: ancestor program derives 8 transitive facts to fixpoint
  • BC: grandparent goal enumerates all 3 solutions, exhausts cleanly
  • Goal cursor (SOLVE / SOLVE_NEXT) for native enumeration
Aggregation & Set Ops
count / sum / min / max / argmax / top-K / NaF
Six bridge primitives reason over sets, not just individual facts. Top-K maintains a parallel insertion-sorted register array. Negation-as-failure with both ground and unbound variables. Cardinality, statistics, ranking — all native silicon ops.
  • compute_count: 30 ns combinational match-count
  • compute_sum / min / max / argmax over CAM matches
  • compute_topk with K_MAX = 8, parallel beats[] insertion sort
  • not foo(X) body atoms; closed-world existential semantics
  • Hash-consing: equivalent subtrees collapse to one CAM entry
Arithmetic + Transcendentals
Q16.16 ALU + CORDIC + Taylor Exp
Q16.16 fixed-point ALU for add / sub / mul / div / abs / sqrt with DSP-mapped multiply. Q4.12 CORDIC engine computes sin and cos simultaneously in 17 cycles. Taylor-series exp() in 5 cycles. Numeric literals preserve their value through the symbol table.
  • d/dx[x³] at x=2 = 12 in 5.9 µs, 3 chained ALU ops
  • CORDIC sin/cos: 14-iter, ±3 LSB Q4.12 across all 4 quadrants
  • Taylor exp(x) for |x|≤1: ±6 LSB at exp(±1)
  • Q4.12 fadd / fsub / fmul; fdiv / fsqrt deferred to D.2
  • 0.7% DSP utilization — ~140x headroom for more engines
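A rotation-mode CORDIC in Q4.12 with 14 iterations can be sketched in software as below. This is a generic textbook CORDIC, not the chip's engine: the table rounding, the gain constant, and the restriction to angles in [−π/2, π/2] (no quadrant folding) are choices made for this sketch, and its accuracy should not be read as the ±3 LSB figure quoted above.

```python
import math

FRAC_BITS = 12
ONE = 1 << FRAC_BITS          # Q4.12: 1.0 is represented as 4096
ITERATIONS = 14

# atan(2^-i) for each iteration, rounded to Q4.12.
ATAN_TABLE = [round(math.atan(2.0 ** -i) * ONE) for i in range(ITERATIONS)]

# CORDIC rotations scale the vector by a fixed gain; pre-divide the start value by it.
GAIN = 1.0
for i in range(ITERATIONS):
    GAIN *= math.sqrt(1.0 + 2.0 ** (-2 * i))
K_INV = round(ONE / GAIN)     # about 0.6073 in Q4.12

def cordic_sin_cos(angle_q: int):
    """Return (sin, cos) in Q4.12 for a Q4.12 angle in radians, |angle| <= pi/2."""
    x, y, z = K_INV, 0, angle_q
    for i in range(ITERATIONS):
        if z >= 0:
            # Rotate counter-clockwise by atan(2^-i) using shifts and adds only.
            x, y, z = x - (y >> i), y + (x >> i), z - ATAN_TABLE[i]
        else:
            # Rotate clockwise by the same micro-angle.
            x, y, z = x + (y >> i), y - (x >> i), z + ATAN_TABLE[i]
    return y, x

SIN_Q, COS_Q = cordic_sin_cos(round(1.0 * ONE))   # sin and cos of 1 radian
```

Both outputs come from the same rotation loop, which is why a CORDIC engine yields sin and cos together. Covering all four quadrants would need an extra angle-folding step that this sketch leaves out.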
003

Real datasets.
Real silicon.
Real proofs.

Every example below is a working .nxp program that compiles to AXI register writes and runs on the FPGA. Open the source on GitHub. Run it via the Python SDK. Watch the proof chain emerge from real silicon — not a simulation, not a demo trick.

HERO DEMO · CLINICAL DIFFERENTIAL DIAGNOSIS · tb_differential_dx.v
Same evidence. Three diagnoses. Ranked by silicon.

A patient presents with chest pain, fever, and elevated troponin. The chip considers three competing diagnoses, each scored by a different rule with its own clinical-strength weight. The output below is captured verbatim from real Vivado xsim running real RTL — bit-identical to what runs on the FPGA. Every confidence value is a Q0.16 multiply chain you can audit; every refusal is grounded in explicit chip semantics.

NXLANG SOURCE
# examples/differential_dx.nxp
fact: presents(p1, fever)         :: 0.85
fact: presents(p1, chest_pain)    :: 0.80
fact: troponin_elevated(p1)       :: 0.95

rule: hypothesis(P, myocarditis) :-
        presents(P, fever),
        presents(P, chest_pain),
        troponin_elevated(P)        :: 0.85

rule: hypothesis(P, pericarditis) :-
        presents(P, fever),
        presents(P, chest_pain),
        troponin_elevated(P)        :: 0.55

rule: hypothesis(P, nstemi) :-
        presents(P, fever),
        presents(P, chest_pain),
        troponin_elevated(P)        :: 0.30

rule: hypothesis(P, aortic_dissection) :-
        presents(P, chest_pain),
        troponin_elevated(P),
        d_dimer_elevated(P)         :: 0.70

# d_dimer_elevated marked OPEN-WORLD —
# chip refuses to derive aortic_dissection
# without positive d_dimer evidence.
SILICON OUTPUT — VIVADO xsim, REAL RTL
# Phase A: p1, NO threshold
p1  myocarditis    conf 0.549  ################
p1  pericarditis   conf 0.355  ##########
p1  nstemi         conf 0.193  #####

# Phase B: p2, min_conf = 0.30 (C.12)
p2  myocarditis    conf 0.549  ################
p2  pericarditis   conf 0.355  ##########
                              
  ← nstemi (0.193) PRUNED
     below 0.30 threshold

# Phase C: aortic_dissection (C.14)
aortic_dissection  NOT DERIVED
  — chip says "I don't know"
  — d_dimer never measured
  — open-world flag refused NaF

PASS: differential diagnosis
silicon demo complete

# Math is exact:
# 0.85 × 0.80 × 0.95 × rule_conf
# myocarditis  : * 0.85 = 0.549
# pericarditis : * 0.55 = 0.355
# nstemi       : * 0.30 = 0.193
C.9.1 · CONFIDENCE
Three different posterior beliefs from the same evidence, composed natively in a 4-deep multiply tree.
C.11 · PROOF TREE
Every hypothesis stores the rule_id and body fact addresses that produced it — auditable receipt.
C.12 · PRUNE
nstemi at 0.193 < 0.30 threshold → chip refuses to commit. The bar is set in silicon.
C.14 · "I DON'T KNOW"
aortic_dissection needs d_dimer. d_dimer is open-world + missing → chip refuses, no hallucination.
→ tb_differential_dx.v on GitHub  ·  → differential_dx.nxp source
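The confidence arithmetic in the demo can be checked in software. The sketch below models head_conf as a truncating Q0.16 product of the rule weight and the three body confidences, then applies a min_conf threshold. The truncation behaviour, the multiplication order, and the ground (variable-free) rule encoding are assumptions made for illustration; they are not taken from the RTL.

```python
Q = 16  # Q0.16 fixed point

def to_q(p):
    return int(p * (1 << Q))          # truncate to Q0.16 (assumed)

def qmul(a, b):
    return (a * b) >> Q               # truncating fixed-point multiply (assumed)

FEVER = ("presents", "p1", "fever")
CHEST = ("presents", "p1", "chest_pain")
TROP = ("troponin_elevated", "p1")
DDIMER = ("d_dimer_elevated", "p1")   # never asserted: no evidence available

facts = {FEVER: 0.85, CHEST: 0.80, TROP: 0.95}
rules = [
    ("myocarditis", [FEVER, CHEST, TROP], 0.85),
    ("pericarditis", [FEVER, CHEST, TROP], 0.55),
    ("nstemi", [FEVER, CHEST, TROP], 0.30),
    ("aortic_dissection", [CHEST, TROP, DDIMER], 0.70),
]

def derive(min_conf=0.0):
    """Fire every rule whose body is fully supported; prune heads below min_conf."""
    out = {}
    for head, body, rule_conf in rules:
        if any(atom not in facts for atom in body):
            continue                  # missing evidence: the head is simply not derived
        conf = to_q(rule_conf)
        for atom in body:
            conf = qmul(conf, to_q(facts[atom]))
        if conf >= to_q(min_conf):
            out[head] = conf / (1 << Q)
    return out

phase_a = derive()
phase_b = derive(min_conf=0.30)
print(phase_a)
print(phase_b)
```

Under these assumptions the three derived confidences land within a rounding step of the 0.549 / 0.355 / 0.193 shown in the captured output, nstemi drops out once the threshold is 0.30, and aortic_dissection never appears because one of its body atoms has no supporting fact.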
Pharmacovigilance
Drug Interaction Detection — FAERS Subset
Detects warfarin–fluconazole interactions through CYP450 enzyme inhibition reasoning. A documented cause of bleeding events and patient deaths — flagged in 164 cycles on real silicon, with a complete proof chain regulators can audit.
  • 4-body-atom rule chain, 100% precision, 0 false positives
  • FDA-friendly: every flag carries its derivation
  • Why LLMs can’t: clinical hallucination rates 10–64%
  • Source: examples/pharma_safety.nx
Symbolic Calculus
d/dx[x³] at x=2 = 12 — on chip
The power-rule derivative evaluated through three chained ALU ops dispatched by rule firings. Numeric literals preserve their value through the symbol table so the answer is mathematical, not symbol-ID arithmetic.
  • 5.9 µs end-to-end on silicon
  • HAL pipeline: .nxp → nxc → AXI → CAM → readback
  • 3 chained Q16.16 ops with bridge dedup
  • Source: examples/power_deriv.nxp
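One plausible way to reach 12 in three chained Q16.16 operations is sketched below: subtract one from the exponent, multiply x by itself, then scale by the original exponent. The text does not state the actual op sequence the chip uses, so this decomposition and every helper name here are hypothetical.

```python
FRAC = 16  # Q16.16: 16 fractional bits

def to_q(v):
    return int(round(v * (1 << FRAC)))

def from_q(q):
    return q / (1 << FRAC)

def q_mul(a, b):
    return (a * b) >> FRAC            # fixed-point multiply with truncation

def power_rule(n, x_q):
    """Evaluate d/dx[x^n] = n * x^(n-1) at x_q (Q16.16), counting ALU ops."""
    ops = 0
    exponent = n - 1                  # op 1: reduce the exponent
    ops += 1
    power = x_q
    for _ in range(exponent - 1):     # x^(n-1) by repeated multiply
        power = q_mul(power, x_q)
        ops += 1
    result = q_mul(to_q(n), power)    # final op: scale by the original exponent
    ops += 1
    return result, ops

value_q, op_count = power_rule(3, to_q(2.0))
print(from_q(value_q), op_count)
```

Because the literal 2 is carried as the fixed-point value 2.0 rather than as a symbol ID, the result is the number 12.0 and not an arbitrary identifier.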
AML & Financial Audit
SOX, sanctions, transaction surveillance
Rule-based screening at line rate with audit-grade explainability. Every flagged transaction carries a full derivation trace — the kind of provenance regulators require and LLMs structurally cannot provide.
  • 20 SOX findings derived from 100 transactions in 6 ms
  • Deterministic: same input → same output, always
  • Why LLMs can’t: regulator audit demands explainability
  • Source: examples/financial_audit.nxp
Aggregation & Statistics
count / sum / min / max / argmax / top-K
Real set operations on the chip. Inventory analytics, statistical thresholds, ranking queries — all dispatched as bridge predicates with dedup, and all silicon-verified across 11 aggregation + 10 top-K subtests.
  • compute_count: 30 ns combinational match-count
  • compute_argmax: returns (max value, winning row)
  • compute_topk: K_MAX=8, parallel insertion sort
  • Source: examples/inventory_agg.nxp, topk_scores.nxp
Recursive Reasoning
Ancestor / transitive closure / multi-hop
The canonical recursive Datalog: ancestor(X,Z) :- parent(X,Y), ancestor(Y,Z). Semi-naive forward chaining derives all 8 transitive ancestors to fixpoint, then backward chaining enumerates all 5 descendants of any starting node.
  • Native FC + BC composition (the production Datalog technique)
  • Dependency-chain analysis, supply-chain traversal, family graphs
  • Goal cursor enumerates solutions one at a time via SOLVE_NEXT
  • Source: examples/ancestor.nxp
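Semi-naive evaluation can be illustrated in a few lines of Python. The sketch below joins the parent relation only against the facts derived in the previous round (the delta), which is the defining trick of the technique. The three-link family chain and the base clause ancestor(X,Y) :- parent(X,Y) are assumptions added for the example; the text quotes only the recursive clause.

```python
def ancestors(parent):
    """Semi-naive fixpoint for ancestor(X,Z) :- parent(X,Y), ancestor(Y,Z)."""
    total = set(parent)               # assumed base clause: every parent is an ancestor
    delta = set(parent)               # facts that are new in the latest round
    while delta:
        # Join parent against the delta only, not against everything derived so far.
        new = {(x, z) for (x, y) in parent for (y2, z) in delta if y == y2}
        delta = new - total
        total |= delta
    return total

# Hypothetical data, not the dataset in examples/ancestor.nxp.
parent = {("alice", "bob"), ("bob", "carol"), ("carol", "dave")}
closure = ancestors(parent)

# Goal-directed style query: enumerate every descendant of one starting node.
descendants_of_alice = sorted(z for (x, z) in closure if x == "alice")
print(len(closure), descendants_of_alice)
```

The loop stops as soon as a round produces nothing new, which is the fixpoint the text refers to.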
Defaults & Exceptions
Negation-as-failure (ground + unbound)
active_user(U) :- user(U), not banned(U). Default rules with explicit exceptions, RBAC negative-permission flows, GDPR consent checks, and other rule systems where “allowed unless forbidden” is the natural specification.
  • Closed-world existential semantics for unbound vars
  • One body-atom flag, zero new FSM states — reuses the CAM scan
  • Verified empty + populated cases (expect_none semantics)
  • Source: examples/active_users.nxp, has_no_cats_*.nxp
Transcendental Math
CORDIC sin/cos + Taylor exp in Q4.12
Real numerics inside reasoning rules. Physics simulators, statistical confidence weighting, signal-processing rule sets, and any control loop that needs a nonlinear response evaluated deterministically — all on chip in microseconds.
  • CORDIC: 14 iter, ±3 LSB across all 4 quadrants, 17 cycles
  • Taylor exp(x) for |x|≤1: ±6 LSB at boundaries, 5 cycles
  • Q4.12 fadd / fsub / fmul through the existing ALU
  • Sources: tb_cordic.v, tb_phase_d_ext.v
Goal-Directed Query
SOLVE / SOLVE_NEXT cursor enumeration
Native API for “find every X such that Q(X)”. The host writes a pattern + mask, issues SOLVE, and steps through all matching CAM entries one at a time without rescanning. Pipelined match-vector latch keeps the critical path inside 100 MHz.
  • Cursor parks on first match, advances on SOLVE_NEXT
  • Read matched entry via REG_RESULT_LO/HI
  • Backward-chaining engine builds on this primitive
  • Source: tb_goal_solve.v
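A software model of the cursor may help when reading the register description. In the sketch below the host supplies a pattern and a mask, the first call parks on the first matching entry, and each later call resumes from the parked position. The 24-bit fact encoding (8-bit predicate, two 8-bit arguments) and the class and method names are invented for illustration; the real word layout and register protocol are defined by the RTL.

```python
def encode(pred, a, b):
    """Hypothetical 24-bit fact word: pred[23:16] | a[15:8] | b[7:0]."""
    return (pred << 16) | (a << 8) | b

class GoalCursor:
    def __init__(self, cam):
        self.cam = cam
        self.pattern = 0
        self.mask = 0
        self.pos = -1

    def solve(self, pattern, mask):
        """Latch the query and park on the first match (models SOLVE)."""
        self.pattern, self.mask, self.pos = pattern, mask, -1
        return self.solve_next()

    def solve_next(self):
        """Advance to the next match without rescanning earlier entries (models SOLVE_NEXT)."""
        for addr in range(self.pos + 1, len(self.cam)):
            if (self.cam[addr] ^ self.pattern) & self.mask == 0:
                self.pos = addr
                return self.cam[addr]
        self.pos = len(self.cam)
        return None                   # enumeration exhausted

PARENT, LIKES = 1, 2
cam = [encode(PARENT, 1, 2), encode(PARENT, 2, 3), encode(LIKES, 1, 3), encode(PARENT, 1, 4)]

cursor = GoalCursor(cam)
# "Find every X such that parent(1, X)": match predicate and first argument only.
first = cursor.solve(encode(PARENT, 1, 0), mask=0xFFFF00)
second = cursor.solve_next()
third = cursor.solve_next()
print(first & 0xFF, second & 0xFF, third)
```

Masked-out bit positions act as the unbound variable, so the same mechanism serves fully ground lookups and open queries.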
004

Where LLMs
are not allowed.

Every regulated and safety-critical domain has the same problem: rule-based decisions that have to be auditable, deterministic, and fast — and an installed base of CPU rule engines that crawl. NXPU runs the same rules on silicon, with a proof chain on every conclusion.

Banking & Compliance
AML, sanctions screening, trade surveillance, KYC.
Regulator audit demands every flag explain itself. LLM hallucinations are a fineable offense.
TAM ~$22B
Healthcare & Pharma
Drug-interaction screening, clinical decision support, treatment-protocol checking.
FDA approval requires explainable AI. LLMs hallucinate at 10–64% in medical contexts.
TAM ~$14B
Cybersecurity / SIEM
Intrusion detection, vulnerability-chain analysis, lateral-movement reasoning, policy enforcement.
Splunk-class workloads burn cloud compute. Deterministic silicon = margin.
TAM ~$5B
Defense & Aerospace
Real-time decision logic in DO-178C-certifiable systems. Robotic planning. Flight control.
LLMs categorically can’t be DO-178C certified. NXPU’s deterministic logic can.
TAM ~$8B
Legal & Compliance
Contract clause checking, GDPR / HIPAA violation detection, e-discovery, conflict checking.
Auditable, deterministic, defensible in court. LegalTech vendors want this.
TAM ~$10B
Telecom 5G Core
Policy enforcement at line rate, routing decisions, QoS classification.
Microsecond decisions on packet streams. Hyperscalers are already building their own.
TAM ~$6B
Industrial / IoT
Safety interlocks, sensor-driven control, deterministic decision loops.
Hardware-level correctness, milliwatt power (post-ASIC).
TAM ~$50B+
Smart Contracts & Audit
On-chain logic execution, formal verification, deterministic state transitions.
Blockchain protocols need exactly what NXPU provides.
TAM — emerging
005

Four ways
to ship.

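From RTL IP licensed into your SoC to a hosted reasoning API your engineers call over HTTPS. Pick the integration path that matches your team and your timeline. The RTL IP license is available now; the accelerator card and the cloud API are scheduled to follow the DRAM tiers (~6 months), and the custom ASIC is an 18–36 month tape-out.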

RTL IP License
Available now
Verilog source for the full reasoning core, including bridge, CORDIC, BC engine, aggregation, top-K, negation, hash-consing, and the rule sequencer. Drop into your own SoC, your own ASIC tape-out, or your own FPGA card.
  • ~6,500 lines of Verilog, 46 testbenches included
  • Vivado-ready; xczu7ev silicon-v1.1-mig reference build provided
  • Pricing: $1M–$5M one-time + per-chip royalty (exclusivity bumps to $10M+)
  • Comparable: ARM cores, Cadence/Synopsys IP blocks
FPGA Accelerator Card
After DRAM tiers (~6 mo)
Production-grade Xilinx Alveo or custom card with NXPU bitstream pre-loaded, PCIe / 100GbE host interface, Python SDK, and the full HAL toolchain. Plugs into a single 1U server.
  • Per card: $25k–$50k
  • SDK + support subscription: $100k–$500k / year per enterprise
  • Comparable: Hailo-8, Axelera Metis form factor
  • DRAM tiers needed first to scale beyond demo facts/rules
Cloud Reasoning API
After DRAM tiers (~6 mo)
Hosted endpoint. Submit your facts and rules over HTTPS, get back a derived fact set + proof chain. Per-inference billing, enterprise tier for unmetered internal use. Same compiler stack as on-prem deployments.
  • Per inference: $0.01–$1.00 (rule-depth dependent)
  • Enterprise tier: $100k–$1M / year unmetered
  • Audit-log export for regulator review
  • Comparable: GPT-4 API ($30/M tokens) for the LLM-replacement use case
Custom ASIC
18–36 month tape-out
For very high-volume embedded deployments where FPGA economics break down. 10nm projections target 500 MHz–1 GHz, ~100 mW, 1–2 mm². Current design uses 23.9% of an xczu7ev — substantial in-place expansion before tape-out is contemplated.
  • Per system: $10k–$100k depending on scale
  • Comparable: Cerebras WSE ($2–5M), TPU v4 ($30k)
  • Targets edge IoT, embedded control, signal-processing pipelines
  • Requires a customer commit to justify ~$20M tape-out NRE
006

Shippable now.
Testable now.

No vaporware. Everything below is in the repo, builds with Vivado 2025.1, passes xsim regression, and meets timing on real silicon.

Shippable Today
RTL IP — ~4,000 lines of Verilog
Symbolic logic unit, reasoning-ALU bridge, CORDIC, func_engine, BC engine, sequencer. Vivado-ready.
HAL toolchain — Python + .nxp compiler
nx_to_tb.py generates testbenches; AXI register sequences for production deployment.
46 silicon-verified testbenches
From CAM dedup through CORDIC trig, recursive BC, probabilistic confidence, ILP rule discovery, and PC-algorithm causal structure learning. All green on Vivado xsim.
100 MHz timing closure on xczu7ev (silicon-v1.1-mig)
WNS +12.178 ns, WHS +17 ps, TNS 0 ns, zero critical synth warnings, real 4 GB DDR4 via MIG IP.
Whitepaper
Full architecture, silicon results, performance comparisons, roadmap. Engineering-grade. Read →
Two tagged ship bitstreams on ZCU104
silicon-v1.0-bram (BRAM baseline) and silicon-v1.1-mig (4 GB DDR4). Bitstream-deployable.
Testable Today — Try It
git clone the repo
The nxpu-rtl/ tree builds with Vivado 2025.1. Tcl scripts in vivado/scripts/ drive xsim.
pip install -e . (the Python HAL)
Compile any .nxp in examples/ to a Verilog testbench in one line.
Run the regression sweep
46 testbenches, ~40 minutes on a remote Vivado host. Every one labeled with what it proves.
Re-run synth + impl + timing
scripts/synth_impl_timing.tcl takes ~30 minutes to confirm timing on your own board.
Open the demo page
Browser-based NXLang playground at /demo: load a dataset, run a query, watch the proof chain.
Read the source on GitHub
github.com/dyber-pqc/NXPU: RTL, HAL, examples, testbenches all open.
007

The GPU era
is a local maximum.

Scaling transformers has hit diminishing returns on reasoning. The next leap requires architectural innovation, not bigger clusters.

Current Paradigm
Trillions of tokens
Requires massive pre-collected datasets
$100M training runs
Thousands of GPU-hours per model
Frozen after training
Knowledge becomes stale immediately
Correlation, not causation
Pattern matching without understanding
Black box
No explainability, no audit trail
700 W per chip
Unsustainable energy trajectory
NXPU Paradigm
Zero training required
Load facts + rules. Get conclusions. Immediately.
1.65 µJ per derivation
78x less energy than Intel Core Ultra 9 285. 236,000x less than H100 LLM.
100% accuracy on reasoning
Deductive logic is sound by construction. Zero hallucination.
Silicon-validated, timing met
46 testbenches pass on real Vivado xsim. 100 MHz on xczu7ev with WNS +12.178 ns (silicon-v1.1-mig, 4 GB DDR4 via MIG IP). Two tagged ship bitstreams; bitstream-deployable.
Every step auditable
Full proof chain on every conclusion: which rule, which prior facts. Compliance / FDA / SEC ready.
Bidirectional reasoning + transcendentals
Forward + backward chaining, recursion, aggregation, top-K, negation, plus CORDIC sin/cos/exp on the same chip.
007.5

silicon-v1.2-dram-fix live.
BSD parity conjecture rediscovered.
VS Code extension shipped.

Three weeks of concentrated work: closed the F1 = 0.435 silicon-vs-sim gap on Tier 3b Sachs (now F1 = 0.800 / 0.778 cross-seed, recall = 1.000), completed the original 7-rung capability ladder including autonomous second-order theorem discovery, and put the whole stack behind a one-click VS Code extension with 24 NXLang rule packs.

v1.2.1 Sachs battery (2026-05-24)
12 silicon runs, 4 levers swept on v38f without a bitstream rebuild. Best in-cap mean F1 lifts 0.789 → 0.8242 via a one-constant chi-sq threshold tighten (α 0.05 → 0.01). Multi-pass driver covers all 55 canonical Sachs pairs: recall = 1.000 over all 17 ground-truth edges, full-canonical F1 = 0.7911 cross-seed. Conditioner ablation confirms (PKC, PKA) is the right d-separator. v39 RTL widen sketch closed for next bitstream cycle. Full report card →
silicon-v1.2-dram-fix shipped
Sachs Tier 3b on physical silicon: F1 = 0.800 / 0.778 cross-seed, recall = 1.000 on both seeds. Sim-equivalent. Root-caused the prior F1 = 0.435 gap to a hardcoded DEPTH_WORDS in dram_mig_wrapper.v that truncated pred decoding to 5 bits and aliased DRAM buckets. Fix: forward the parameter, switch storage to URAM (1 MiB, 28 tiles on xczu7ev), stub the unused lane2. WNS +15.730 ns — cleanest closure of the project.
Engine rediscovered BSD parity conjecture
From 56 real elliptic curves (Cremona's tables), the engine derived the parity-conjecture mapping (rank parity = even → sign = +1, rank parity = odd → sign = −1) with perfect confidence across all 56 supporting cases. Real BSD-adjacent theorem, conditional on BSD generally, proved for many cases (Nekovář 2001, Kim 2007). Engine was never told it — derived from raw rank+sign+torsion data alone.
Engine rediscovered mod-6 prime distribution
From 474 raw number-theory facts about integers 1..40, the engine surfaced "primes > 3 are congruent to 1 or 5 (mod 6)" — one of the most famous elementary number-theory results — by composing mod6 and next_prime binary relations. Plus 35 other rules including "derivative of odd function is even" and "all primes are deficient" (σ(p) < 2p).
VS Code extension v0.1.10
Install in 60 seconds. Activity-bar panel with Reasoning chat + Rule Packs tree + Silicon Status. Click any rule pack to inspect every fact in a syntax-highlighted editor. Commands: NXPU: Discover patterns, NXPU: View raw facts, NXPU: Ask reasoning engine, NXPU: Restart backend. Auto-spawns the Python backend, auto-detects the chip.
24 NXLang rule packs
calculus (32 facts) · pharma (14) · causal (9) · number_theory (~400) · BSD (80) · BSD extended (288) · chemistry (175 — periodic table) · legal (162 — 18 contracts) · finance (182 — 15 AML/KYC customers) · government (108 — 12 taxpayers) · health (168 — 12 patient cases). Adding a new domain = drop a .nx file. No retraining.
Rung 6: deep analogical reasoning
Engine composes pairs of binary relations to derive second-order rules. On calculus: discovered "derivative of an odd function is even" from parity + derivative_of facts (14 entities, 0 contradictions). On number theory at N=1000: rediscovered the Euler totient parity theorem ("φ(n) is even for n > 2"). Same architecture — data scales, the engine surfaces what's there.
Honest framing: what NXPU is NOT
Not a Millennium-Problem solver — nothing solves Hodge, P vs NP, Riemann today. Not a protein-folding system — AlphaFold's neural approach is correct for that. Not a drug-discovery generator — neural is right for generative chemistry. NXPU is the verifiable backbone of a neuro-symbolic stack: pair it with LLMs and AlphaFold-class models, NXPU verifies what they generate. The wedge for any vertical where wrong answers have real cost.
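To see why tightening the chi-square threshold in the v1.2.1 Sachs battery item above can remove weak edges, consider a plain Pearson chi-square independence test on a 2×2 contingency table. The sketch below is standard statistics, not the chip's CI-test FSM; the example counts are invented, and the critical values are the usual one-degree-of-freedom constants (3.841 at α = 0.05, 6.635 at α = 0.01).

```python
# Chi-square critical values for 1 degree of freedom.
CRITICAL = {0.05: 3.841, 0.01: 6.635}

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0                    # a degenerate margin carries no evidence
    return n * (a * d - b * c) ** 2 / denom

def is_dependent(table, alpha):
    """Keep the edge only if independence is rejected at level alpha."""
    return chi_square_2x2(*table) > CRITICAL[alpha]

# Hypothetical joint counts for two binarized variables.
table = (30, 20, 20, 30)
stat = chi_square_2x2(*table)
print(stat, is_dependent(table, 0.05), is_dependent(table, 0.01))
```

A borderline association like this one survives at α = 0.05 but is treated as independent at α = 0.01, so the stricter constant trades marginal edges for fewer false positives.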
008

6 reasoning rungs shipped.
120 contingency tables verified.
silicon-v1.2-dram-fix tagged.

Not simulation. Not theory. Vivado 2025.1 synth + impl + timing met on Xilinx xczu7ev with comfortable positive slack. 46 testbenches all pass on real silicon across deductive, numerical, probabilistic, inductive, and causal reasoning. silicon-v1.0-bram and silicon-v1.1-mig (4 GB DDR4) shipped May 10–12. Every line of RTL and every testbench is on github.com/dyber-pqc/NXPU for you to clone and replay. The remaining roadmap items are concrete engineering, not research.

Phases A — B.10 — Complete
Forward chaining, multi-head rules, hash-consing
CAM + rule eval + unifier + sequencer with semi-naive fixpoint evaluation. Up to 8 head facts per match with cross-head fresh-ID references for tree rewriting (B.7). Up to 8 per-match identity pools (B.6 / B.9). Structural hash-consing: equivalent subtrees collapse to one CAM entry (B.10).
C.1 — C.5.1 — Complete
ALU bridge, aggregation, top-K, BC, recursion, negation
Q16.16 ALU bridge with d/dx[x³] verified. compute_count, sum, min, max, argmax (C.6). compute_topk with parallel insertion sort (C.7). Backward chaining with SLD rule unfolding (C.5). Recursive reasoning via FC + BC hybrid — ancestor program enumerates all descendants of alice on real silicon (C.5.1). Negation-as-failure for ground and unbound variables (C.3 / C.8). Goal cursor (C.4).
Phase D + D.1 — Complete
CORDIC sin/cos + Q4.12 fadd/fsub/fmul + Taylor exp
14-iteration sequential CORDIC in rotation mode — sin and cos in Q4.12 simultaneously, 17 cycles, ±3 LSB across all 4 quadrants. Q4.12 fadd / fsub / fmul through the ALU. Taylor-series exp() engine: 5 cycles, ±6 LSB at exp(±1). Synth + impl + timing met at 100 MHz with comfortable positive slack at every stage of the build.
C.9 + C.9.1 — Complete
Probabilistic primitives + native confidence propagation
Q0.16 probabilistic ops on silicon: pmul = a×b, pnot = 1-a, psum = noisy-OR (C.9). Per-fact confidence storage parallel to CAM entries. C.9.1 wires confidence into rule firing: head_conf = product of body confs × rule_conf via a 4-deep combinational multiply tree. The chip emits graded beliefs natively, not binary facts.
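The three probabilistic primitives are easy to model in integer arithmetic. The sketch below treats probabilities as integers scaled by 2^16 and implements pmul, pnot, and noisy-OR psum with a truncating multiply. Representing 1.0 as the 17-bit value 65536 and ignoring saturation are simplifying assumptions; the RTL may handle the top of the range differently.

```python
ONE = 1 << 16                     # 1.0 in this model (assumed; strict Q0.16 tops out at 65535)

def to_q(p):
    return int(p * ONE)

def pmul(a, b):
    """Conjunction of independent events: a * b."""
    return (a * b) >> 16

def pnot(a):
    """Complement: 1 - a."""
    return ONE - a

def psum(a, b):
    """Noisy-OR: a + b - a*b, the probability that at least one cause fires."""
    return a + b - pmul(a, b)

half = to_q(0.5)
print(pmul(half, half) / ONE, pnot(to_q(0.25)) / ONE, psum(half, half) / ONE)
```

Noisy-OR is the complement of "neither cause fired", so psum(a, b) equals pnot(pmul(pnot(a), pnot(b))); in this integer model the two forms agree bit for bit.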
C.10 — Complete
Rule discovery on silicon — ILP without training
The chip enumerates candidate rules from a template, fires each one in score-mode (no inserts), and counts how many derivations match known positive examples. Demo: chip discovered the grandparent rule from a raw family-tree dataset in microseconds, with no training, no gradients, no model weights.
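The enumerate-and-score loop described above can be sketched in software. The example below instantiates a two-atom chain template, head(X,Z) :- p(X,Y), q(Y,Z), with every pair of known predicates, fires each candidate without inserting anything, and counts how many derivations hit known positive examples. The dataset, the template shape, and the scoring tie-break are assumptions made for illustration.

```python
from itertools import product

# Hypothetical family-tree facts and positive examples of grandparent.
facts = {
    "parent": {("ann", "bob"), ("bob", "cid"), ("bob", "dee")},
    "sibling": {("cid", "dee"), ("dee", "cid")},
}
positives = {("ann", "cid"), ("ann", "dee")}

def fire(p, q):
    """Score-mode firing of head(X,Z) :- p(X,Y), q(Y,Z): derive, but insert nothing."""
    return {(x, z) for (x, y) in facts[p] for (y2, z) in facts[q] if y == y2}

def discover():
    scored = []
    for p, q in product(sorted(facts), repeat=2):   # enumerate template instances
        derived = fire(p, q)
        hits = len(derived & positives)             # derivations matching known positives
        misses = len(derived - positives)           # derivations the examples do not support
        scored.append((hits, -misses, p, q))
    return max(scored)                              # most hits, then fewest misses

hits, neg_misses, p, q = discover()
print(f"grandparent(X,Z) :- {p}(X,Y), {q}(Y,Z)   hits={hits} misses={-neg_misses}")
```

Nothing here is learned in the gradient sense: the search space is the finite set of template instances, and the score is a count.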
C.11 — Complete
Proof trees — every fact has a receipt
Every CAM entry stores a 48-bit provenance record: which rule fired and the addresses of the body facts that satisfied each slot. The host walks the tree recursively to get a complete derivation chain back to your input data. The substrate that backs the “every NXPU answer is provable” claim.
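The host-side walk can be pictured as a short recursive function. In the sketch below each entry holds a fact, the rule that produced it (None for input data), and the addresses of its body facts. The dictionary layout stands in for the 48-bit provenance record, whose actual bit fields are not given in the text, so every field name here is hypothetical.

```python
# Hypothetical CAM image: address -> (fact, rule_id, body_addresses).
cam = {
    0: ("parent(alice,bob)", None, ()),
    1: ("parent(bob,carol)", None, ()),
    2: ("ancestor(bob,carol)", 1, (1,)),
    3: ("ancestor(alice,carol)", 2, (0, 2)),
}

def proof(addr, depth=0):
    """Return the derivation of the fact at addr as indented text lines."""
    fact, rule_id, body = cam[addr]
    label = "input" if rule_id is None else f"rule {rule_id}"
    lines = ["  " * depth + f"{fact}  [{label}]"]
    for body_addr in body:            # recurse into each body fact that satisfied a slot
        lines.extend(proof(body_addr, depth + 1))
    return lines

tree = proof(3)
print("\n".join(tree))
```

Every leaf of the printed tree is an input fact, which is what makes the derivation replayable against the original data.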
C.12 — Complete
Epsilon-pruning — chip refuses low-confidence claims
Set min_conf threshold. Derivations whose composed head_conf falls below epsilon are NOT inserted into CAM. Two effects: results-quality stays high (low-conf noise is suppressed before the host sees it), and probabilistic forward chains die early instead of producing a combinatorial flood of near-zero-confidence facts.
C.13 + C.15 — Complete
Train/test holdout + min-support filters for ILP
Discovered rules are scored against BOTH a training set AND a held-out test set in a single firing (C.13). A rule that fits training but fails holdout is overfit, rejected. Minimum support filter (C.15) rejects rules that fit too few examples to be patterns rather than coincidences. The chip refuses to claim rules it can’t justify.
C.14 — Complete
Open-world flag — chip can say “I don’t know”
Per-predicate flag toggles between closed-world (NaF treats absence as false) and open-world (absence means UNKNOWN, not false). For open-world predicates the chip refuses to satisfy a negated body atom on missing data. Demo: chip refused to declare patient_b “safe to prescribe” when it had no allergy data on him.
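The difference between the two modes comes down to how a negated body atom treats a missing fact. The sketch below models a rule of the form safe_to_prescribe(P) :- patient(P), not allergic(P); the rule text and function names are assumptions based on the demo description, not copied from the repository.

```python
def naf_succeeds(known_true, key, open_world):
    """Return True if the negated body atom `not pred(key)` is satisfied."""
    if key in known_true:
        return False                  # the atom is provable, so its negation fails
    # The fact is absent: closed-world reads that as false, open-world as UNKNOWN.
    return not open_world

def safe_to_prescribe(patients, allergic, open_world):
    """Assumed rule: safe_to_prescribe(P) :- patient(P), not allergic(P)."""
    return {p for p in patients if naf_succeeds(allergic, p, open_world)}

patients = {"patient_b"}
allergic = set()                      # no allergy data was ever recorded

open_result = safe_to_prescribe(patients, allergic, open_world=True)
closed_result = safe_to_prescribe(patients, allergic, open_world=False)
print(len(open_result), len(closed_result))
```

With the flag set, absence of evidence derives nothing; with it cleared, the same data yields a "safe" conclusion, which is the behaviour the flag exists to prevent for predicates that may simply be unmeasured.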
Phase E (E.1 — E.5) — Complete
Causal discovery on silicon — PC algorithm in hardware
Joint-count primitive (E.1). Conditional-independence test FSM at k=0 (E.2) and k=1 (E.2 v2, ci_test_cond.v). PC-algorithm skeleton search (E.3, causal_discoverer.v). V-structure orientation as a Datalog rule pack (E.4). 5-protein Sachs subgraph silicon-validated (E.5 v1.5, mask 0x3CE). Full 853-record Sachs at k=0 silicon-validated on physical xczu7ev (2026-05-12): F1 = 0.667 bit-exact match to xsim baseline, TP=14 FP=14 FN=0, recall = 1.000, 27,296 facts staged via JTAG-AXI in 98.8 s wall-clock. Full Sachs at k=1 silicon-validated on v38f bitstream (2026-05-23): F1 = 0.800 / 0.778 across two seeds, recall = 1.000 on both — matches the xsim baseline and the published Tetrad-class software F1 band (0.74–0.82) at ~1,000× the throughput per CI test (see Sachs Report).
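The k=0 figures above can be verified with the standard precision / recall / F1 definitions. The sketch below recomputes F1 from the quoted TP, FP, and FN counts and also checks that 0.789, quoted elsewhere on this page as the cross-seed mean, is the average of the two k=1 seeds. Only the arithmetic is shown; nothing here touches the chip.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard edge-recovery metrics from confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# k=0 run quoted in the text: TP=14, FP=14, FN=0.
p, r, f1 = precision_recall_f1(14, 14, 0)
print(f"precision={p:.3f} recall={r:.3f} F1={f1:.3f}")

# k=1 run: mean of the two per-seed F1 scores quoted in the text.
cross_seed_mean = (0.800 + 0.778) / 2
print(f"cross-seed mean F1={cross_seed_mean:.3f}")
```

Perfect recall with precision of one half gives F1 of two thirds, so the k=0 skeleton finds every in-scope edge but keeps as many spurious ones; the k=1 conditioning is what removes them.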
Phase D-RAM (D-RAM.1 — D-RAM.7) — Complete
Real 4 GB DDR4 tier via Xilinx MIG IP — silicon-v1.1-mig shipped
dram_mig_wrapper integrates the Xilinx DDR4 SDRAM MIG IP (64-bit DQ, 8 byte lanes, 512-bit AXI app data path). Bucket-organized fact storage (D-RAM.2), DMA-style cam_streamer (D-RAM.3), transparent CI test integration (D-RAM.4), causal-discoverer prefetch (D-RAM.5), MIG IP wrapper (D-RAM.6), full Sachs benchmark wiring (D-RAM.7). Tagged ship: silicon-v1.1-mig (commit cf14382) — WNS +12.178 ns, TNS 0 ns, 4 GB cold tier live on ZCU104.
Phase 2.1 — Complete
4096-entry scalable CAM — 16× capacity unlock
16-way bank-hashed scalable CAM (scalable_cam.v, BRAM-backed) silicon-validated with bit-exact round-trip. A multi-driver bug discovered by synthesis (clean in xsim) was corrected before tape-out simulation closed. 4K-CAM path lifts the working-memory ceiling from 256 to 4096 live facts.
Phase F — FPGA Bring-up — In progress
JTAG-AXI bring-up + DDR4 calibration on physical ZCU104
F.1 synthesis at 100 MHz with 25.4% LUT util closed. F.2 MIG IP generated via Vivado board flow. F.3 bitstream (silicon-v1.1-mig) shipped. Next: program the physical board, confirm init_calib_complete asserts after DDR4 training, run the full validation suite against real DDR4 (currently sim-validated).
Abductive engine (C.16) — Next
The third reasoning mode: find the best explanation
Given an observation, the chip walks backward through rules, treats missing body atoms as hypotheses, and ranks the explanation set by confidence cost. Builds on the existing BC + goal cursor. ~1 week RTL. Closes the deductive + inductive + abductive triad the AI/logic literature recognizes.
Tier 3b k=1 silicon — Validated (2026-05-23)
Sachs k=1 on physical silicon: F1 = 0.800 / 0.778 cross-seed, recall = 1.000 — sim-equivalent
End-to-end run: 40,091 bucket-add facts staged into 1 MiB URAM-backed DRAM tier over JTAG-AXI (~170 s/run), k=0 PC skeleton + 15×4 conditional CI tests on real xczu7ev silicon. Two independent seeds (0xC0FFEE12, 0xDEADBEEF) both produce recall = 1.000 — the chip never misses a true Sachs edge. F1 sits squarely in the canonical PC-algorithm Sachs literature band (0.74–0.82). Prior reproducible silicon F1 = 0.435 (2026-05-15) was root-caused to a hardcoded DEPTH_WORDS in dram_mig_wrapper.v that truncated pred decoding to 5 bits and aliased DRAM buckets; fixed by forwarding the parameter, switching the storage array to URAM (96-tile / 27.6 Mbit pool), and stubbing the unused second read channel. 120 contingency tables (60/seed) pass all internal invariants. Tag: silicon-v1.2-dram-fix (v38f).
Conditional CI k=2 — Next
Extend ci_test_cond.v to two-variable conditioning, drop remaining FPs
Extend ci_test_cond.v to condition on two binary variables simultaneously (16 strata per pair vs 4 at k=1). Target: drop the remaining 7–8 sibling-pair FPs in Sachs component 2 that k=1 conditioning cannot reach. Expected F1 lift from 0.789 cross-seed mean to ~0.87 — beats published software baselines on Sachs F1 outright while running ~1,000× faster per CI test.
Perception Coupling
Wire the Neural Mesh into the fact stream
16 LIF spiking neurons with STDP already on die. Wiring them to the fact-producer path lets raw signal streams be structured into facts on-chip — closes the host-encoding gap. The difference between “Datalog coprocessor” and “reasoning chip” deployable on raw inputs.
ASIC Tape-Out — Out-Year
10 nm, 500 MHz–1 GHz, ~100 mW
Current design uses 23.9% of an xczu7ev. Substantial in-place expansion room before tape-out is contemplated. Projections at 10 nm: ~100 mW, 1–2 mm², 1 billion queries/sec.
009

Replay every silicon TB
on your own machine.

Everything is open-source on github.com/dyber-pqc/NXPU. Clone the repo, point it at your Vivado install, and run any of the 46 testbenches against the same RTL we run on real silicon. The examples/ directory has a working .nxp program for every major capability. Read them, modify them, write your own.

STEP 1 · CLONE
git clone https://github.com/dyber-pqc/NXPU.git
cd NXPU
pip install -e .
You get the full RTL tree, the HAL Python compiler, the example programs, and every silicon testbench.
STEP 2 · COMPILE A PROGRAM
# A medical-safety demo (open-world reasoning)
python -m nxpu.hal.nx_to_tb \
    examples/open_world.nxp \
    -o tb_open_world_gen.v
The HAL parses your .nxp source, allocates symbols, encodes rule registers, and emits a self-contained Verilog testbench that drives the chip’s AXI bus.
STEP 3 · RUN AGAINST RTL
# Vivado xsim: real RTL, real silicon path
vivado -mode batch \
       -source nxpu-rtl/vivado/scripts/run_open_world_tb.tcl

--- PASS 1: allergy is OPEN-WORLD ---
  -> safe_to_prescribe in CAM: 0
--- PASS 2: allergy is CLOSED-WORLD (NaF) ---
  -> safe_to_prescribe in CAM: 1
PASS: open-world flag prevents hallucination
      from absence of evidence
That’s the same RTL that ran on the FPGA — bit-identical. You can also run on a Xilinx ZCU104 dev board if you have one.
STEP 4 · BROWSE THE DEMOS
examples/diagnostic_conf.nxp     # calibrated diagnosis
examples/discover_grandparent.nxp # rule discovery
examples/open_world.nxp           # I-don't-know logic
examples/ancestor.nxp             # recursive Datalog
examples/pharma_safety.nx         # drug interactions
examples/algebra_power.nxp        # symbolic d/dx
Six lines of NXLang typically map to one silicon TB. Edit the data, re-compile, re-run, see new results in seconds.
SILICON TESTBENCHES YOU CAN REPLAY (ALL PASS, REAL RTL)
run_proof_tree_tb — every derived fact has a proof tree
run_diagnostic_conf_tb — native confidence propagation
run_discover_grandparent_tb — chip discovers rule from data
run_holdout_tb — train/test split for ILP
run_min_conf_tb — chip refuses low-confidence claims
run_min_support_tb — coincidence rejection in discovery
run_differential_dx_tb — clinical differential diagnosis hero demo
run_open_world_tb — chip says “I don’t know”
run_ancestor_tb — recursive ancestor closure
run_ancestor_bc_tb — recursive backward chaining
run_silicon_reasoning — symbolic d/dx[x³]
run_algebra_power_eval — differentiate then evaluate
run_cordic_tb — CORDIC sin/cos in 17 cycles
run_phase_d_ext_tb — Q4.12 fixed-point + Taylor exp
run_probabilistic_tb — pmul / pnot / psum noisy-OR
run_aggregation_tb — sum / count / min / max / argmax
run_topk_tb — parallel insertion-sort top-K
run_unbound_neg_tb — negation-as-failure (closed-world)
run_hash_cons_tb — structural deduplication
run_tree_rewrite_tb — algebraic tree rewriting
+ 26 more — full list in repo / vivado/scripts/
OPEN INVITATION

We’re looking for early users in healthcare, finance, defense, legal, and pharma — any regulated domain where LLM hallucinations are a liability. If you have a dataset, write a few .nxp rules and let the chip reason on it. If you don’t have a dataset, give the chip your domain’s positive and negative examples and let it discover the rules itself.

Bug reports, pull requests, feature requests — all welcome. Email nxpu@dyber.org for technical briefings, partnership conversations, or pilot deployments.

Schedule a
technical briefing.

Bring your rule set or your KB. We’ll show you the chip running it — on real silicon, with the proof chain, in microseconds. POC engagements typically scope at $250k–$500k over 6 months.

nxpu@dyber.org  ·  github.com/dyber-pqc/NXPU