A Different Architecture for a Different Problem
Transformer inference is memory-bound, not compute-bound. The bottleneck isn't how many operations you can perform—it's how fast you can feed data to those operations. Asimov is designed from first principles around this reality.
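As a rough sanity check, here is a back-of-envelope roofline comparison in Python showing why single-batch decode lands far on the memory-bound side. The hidden size, data type, and accelerator figures are illustrative assumptions, not Asimov specifications.

```python
# Back-of-envelope check that single-batch decode is bandwidth-bound.
# All numbers below (hidden size, dtype, accelerator specs) are
# illustrative assumptions, not Asimov specifications.

hidden = 8192            # model hidden size (assumed)
bytes_per_weight = 2     # FP16/BF16 weights (assumed)

# One matrix-vector product per weight matrix during decode:
flops = 2 * hidden * hidden                 # multiply + add per weight
bytes_moved = hidden * hidden * bytes_per_weight

intensity = flops / bytes_moved             # ~1 FLOP per byte for FP16

# A hypothetical accelerator with 500 TFLOP/s of compute and 3 TB/s of
# memory bandwidth needs ~167 FLOP/byte to stay compute-bound, roughly
# two orders of magnitude above what a decode GEMV offers.
peak_flops = 500e12
peak_bw = 3e12
balance_point = peak_flops / peak_bw

print(f"arithmetic intensity: {intensity:.1f} FLOP/B")
print(f"balance point:        {balance_point:.0f} FLOP/B")
print("memory-bound" if intensity < balance_point else "compute-bound")
```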
Key Specifications
Memory-First Design
Most AI accelerators maximize theoretical compute (FLOPs), then add memory as an afterthought. Asimov inverts this: we designed around memory bandwidth and capacity first, with compute balanced to match. The result is over 90% realized memory bandwidth on real Transformer workloads—versus under 30% for GPUs running the same models.
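A quick sketch of why realized bandwidth, rather than peak, determines decode throughput: every generated token has to stream the full weight set from memory once. The model size and bandwidth figures below are illustrative assumptions, not Asimov numbers.

```python
# Realized (not peak) bandwidth sets decode throughput, because each
# generated token streams all weights from memory once.
# Model size and bandwidth figures are illustrative assumptions.

weight_bytes = 70e9 * 2        # e.g. a 70B-parameter model in FP16 (assumed)
peak_bw = 1.0e12               # 1 TB/s peak memory bandwidth (assumed)

for name, utilization in [("90% realized", 0.90), ("30% realized", 0.30)]:
    effective_bw = peak_bw * utilization
    tokens_per_s = effective_bw / weight_bytes
    print(f"{name}: {tokens_per_s:.1f} tokens/s per model replica")
```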
LPDDR Over HBM
We chose commodity LPDDR5x over High Bandwidth Memory. HBM delivers impressive peak bandwidth on paper, but at extreme cost, power, and supply chain risk. Our architecture achieves comparable realized bandwidth from LPDDR—while delivering 6x the memory capacity per chip at dramatically lower system cost.
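The trade-off in rough numbers, using public ballpark figures for LPDDR5X packages and HBM3 stacks and assumed package counts; neither configuration is an Asimov specification.

```python
# Rough comparison of an LPDDR5X memory system against HBM stacks.
# Device figures are public ballpark numbers; package and stack counts
# are assumptions for illustration, not Asimov's configuration.

lpddr_pkg_bw  = 68e9     # ~68 GB/s per x64 LPDDR5X package at 8533 MT/s
lpddr_pkg_cap = 32e9     # 32 GB per package (assumed configuration)
hbm_stack_bw  = 819e9    # ~819 GB/s per HBM3 stack
hbm_stack_cap = 24e9     # 24 GB per stack

lpddr_packages = 16      # assumed per-chip LPDDR package count
hbm_stacks     = 4       # assumed HBM configuration for comparison

print("LPDDR:", lpddr_packages * lpddr_pkg_bw / 1e12, "TB/s peak,",
      lpddr_packages * lpddr_pkg_cap / 1e9, "GB")
print("HBM  :", hbm_stacks * hbm_stack_bw / 1e12, "TB/s peak,",
      hbm_stacks * hbm_stack_cap / 1e9, "GB")

# The LPDDR system trades peak bandwidth for far more capacity per chip;
# the architecture's job is to keep realized bandwidth near that peak.
```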
Dual-Hemisphere Architecture
Two identical hemispheres can operate independently on separate workloads or collaboratively on larger problems. Each hemisphere has its own memory subsystem, enabling efficient scaling from single-chip deployments to multi-chip Titan systems without architectural compromises.
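A conceptual sketch of the two deployment modes. The classes below are illustrative stand-ins, not Asimov's driver API, and the memory sizes are assumed.

```python
# Conceptual sketch of the two modes a dual-hemisphere chip supports.
# Hemisphere/Chip are illustrative stand-ins, not a real driver API.

from dataclasses import dataclass

@dataclass(frozen=True)
class Hemisphere:
    name: str
    memory_gb: int        # each hemisphere owns its own memory subsystem

@dataclass
class Chip:
    left: Hemisphere
    right: Hemisphere

    def independent(self, model_a: str, model_b: str) -> dict:
        # Two separate workloads, one per hemisphere, no shared state.
        return {self.left: model_a, self.right: model_b}

    def collaborative(self, model: str) -> dict:
        # One larger model split across both hemispheres, e.g. the two
        # tensor-parallel halves of every weight matrix.
        return {self.left: f"{model}[shard 0/2]",
                self.right: f"{model}[shard 1/2]"}

chip = Chip(Hemisphere("left", 192), Hemisphere("right", 192))   # 192 GB assumed
print(chip.independent("model-a", "model-b"))
print(chip.collaborative("model-large"))
```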
TransWarp Engine
At the heart of Asimov is a 512×128 systolic array running at 2 GHz, with weight memory co-located at each processing element. This architecture minimizes data movement—the dominant source of power consumption and latency in AI inference. The array reconfigures dynamically: 512×128 for FFN Matrix-Matrix Multiplication (GEMM), 128×512 for memory-bound Matrix-Vector Multiplication (GEMV) in attention.
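Working out the peak throughput implied by the stated array dimensions and clock, assuming the usual one multiply-accumulate per processing element per cycle (that convention is an assumption, not a stated spec):

```python
# Peak throughput implied by the stated array size and clock.
# Assumes one multiply-accumulate (2 FLOPs) per PE per cycle, the usual
# systolic-array convention; that per-PE rate is an assumption here.

pes = 512 * 128                 # processing elements in the array
clock_hz = 2e9                  # 2 GHz from the spec
flops_per_pe_cycle = 2          # multiply + add

peak_flops = pes * clock_hz * flops_per_pe_cycle
print(f"peak: {peak_flops / 1e12:.0f} TFLOP/s")   # ~262 TFLOP/s

# The same 65,536 PEs can be addressed as 512x128 (tall, for FFN GEMMs)
# or 128x512 (wide, for bandwidth-bound attention GEMVs); reconfiguring
# changes the dataflow, not the amount of compute.
```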
Streaming Vector Acceleration
Dedicated hardware performs softmax, RMSNorm, RoPE, SwiGLU, and other activation functions at line rate—no kernel launches, no memory round-trips. Your model runs as a continuous pipeline where vectors flow through matrix ops, normalization, and nonlinearities without ever stalling for the CPU. New activation functions can be supported without silicon changes.
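For reference, the math those units implement, written out in NumPy. These are software definitions of the standard operations only, not a description of how the hardware computes them.

```python
# Reference (software) definitions of the vector operations the streaming
# units implement; NumPy is used here only to pin down the math.

import numpy as np

def softmax(x):
    z = x - x.max()                  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def rmsnorm(x, gain, eps=1e-6):
    # Root-mean-square normalization with a learned per-channel gain.
    return x / np.sqrt(np.mean(x * x) + eps) * gain

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu(x, w_gate, w_up):
    # Gated activation used in most modern Transformer FFN blocks.
    return silu(x @ w_gate) * (x @ w_up)
```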
On-Chip General Purpose CPUs
Multiple on-chip ARMv9 64-bit general-purpose processor cores handle workload orchestration and provide a programmable escape hatch for frontier-model operations that don't fit standard patterns. Run custom logic when you need it, but keep the common path in dedicated hardware for deterministic latency and maximum throughput.
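A sketch of what that escape hatch might look like. The `custom_op` decorator and the scheduling hook it stands in for are purely illustrative, not a documented Asimov API; the top-k gating function is just one example of an operation that falls outside the fixed-function pipeline.

```python
# Hypothetical flow for an operation outside the fixed-function pipeline:
# the common path stays in dedicated hardware, the unusual op runs on the
# on-chip ARM cores. `custom_op` is an illustrative stand-in, not a real
# Asimov runtime API.

import numpy as np

def custom_op(fn):
    """Stand-in for a runtime hook that schedules `fn` on the ARM cores."""
    fn.runs_on = "arm_cores"
    return fn

@custom_op
def topk_gating(router_logits, k=2):
    # Example frontier-model operation: MoE top-k expert selection.
    idx = np.argpartition(router_logits, -k)[-k:]
    weights = np.exp(router_logits[idx])
    return idx, weights / weights.sum()
```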
Host Interface
PCIe Gen 6 with CXL support delivers 128 GB/s per hemisphere. The host handles tokenization and sampling; Asimov owns everything in between. Asimov's ability to operate independently might suggest the host processor would sit idle, but a fast host interface enables massive distributed KV cache management, multi-model loading, and more without slowing inference down.
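To put 128 GB/s in context, a back-of-envelope KV cache calculation using the standard per-token size formula. The model geometry below is an illustrative assumption.

```python
# Why host-link bandwidth matters for KV cache offload: the size of one
# request's cache and the time to move it over a 128 GB/s hemisphere link.
# Model dimensions are illustrative assumptions.

layers, kv_heads, head_dim = 80, 8, 128     # assumed model geometry
bytes_per_elem = 2                          # FP16 cache

# 2x for keys and values.
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
context = 131_072                           # 128K-token context
cache_bytes = kv_bytes_per_token * context

link_bw = 128e9                             # per-hemisphere PCIe Gen 6 + CXL

print(f"KV cache per token: {kv_bytes_per_token / 1024:.0f} KiB")
print(f"full context      : {cache_bytes / 1e9:.1f} GB")
print(f"swap time         : {cache_bytes / link_bw:.2f} s")
```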
Scale-Out Interconnect
16 Tbps of direct chip-to-chip bandwidth with no switches or NICs required. Point-to-point links scale to 16,384 chips in ring, torus, mesh, and other topologies for tensor parallelism, pipeline parallelism, and MoE expert parallelism.
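A rough feel for what that bandwidth buys, using the standard ring all-reduce cost model. The per-link share of the 16 Tbps aggregate, the ring size, and the activation size are all illustrative assumptions.

```python
# Back-of-envelope ring all-reduce time for tensor parallelism over the
# direct chip-to-chip links. The per-link share of the 16 Tbps aggregate
# and the activation size are illustrative assumptions.

chips = 8
aggregate_bw = 16e12 / 8                   # 16 Tbps -> 2 TB/s per chip
per_link_bw = aggregate_bw / 8             # assume 8 point-to-point links

tensor_bytes = 64 * 8192 * 2               # batch 64 x hidden 8192, FP16 (assumed)

# Standard ring all-reduce: each chip transfers 2*(N-1)/N of the tensor.
bytes_per_chip = 2 * (chips - 1) / chips * tensor_bytes
print(f"all-reduce time ~ {bytes_per_chip / per_link_bw * 1e6:.2f} us")
```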
