Stratum: System-Hardware Co-Design with Tiered Monolithic 3D-Stackable DRAM for Efficient MoE Serving
Abstract
As Large Language Models (LLMs) continue to evolve, Mixture of Experts (MoE) architecture has emerged as a prevailing design for achieving state-of-the-art performance across a wide range of tasks. MoE models use sparse gating to activate only a handful ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose Stratum, a system-hardware co-design for Mixture-of-Experts (MoE) model inference. The system is predicated on a future memory technology, Monolithic 3D-Stackable DRAM (Mono3D DRAM), integrated with a Near-Memory Processing (NMP) logic die via hybrid bonding. The core contribution is a set of co-optimizations across the stack. At the hardware level, the authors propose an "in-memory tiering" mechanism to exploit simulated latency variations across the vertical layers of the Mono3D DRAM. At the system level, they introduce a topic-aware scheduler that uses a lightweight classifier to predict query topics, mapping frequently used "hot" experts to the faster memory tiers. The authors claim significant improvements in throughput (up to 8.29×) and energy efficiency (up to 7.66×) over conventional GPU-HBM baselines.
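To make the placement idea concrete, the following minimal Python sketch illustrates hot-to-fast mapping of the kind summarized above: experts are ranked by predicted activation frequency for the current topic mix and assigned to tiers in order of increasing access latency. All names (`expert_hotness`, `num_tiers`, `tier_capacity`) are hypothetical illustrations, not identifiers from the paper.

```python
def place_experts(expert_hotness, num_tiers, tier_capacity):
    """Greedy hot-to-fast placement sketch (illustrative only).

    expert_hotness: dict {expert_id: predicted activation frequency for the
                          current topic mix}  (assumed input)
    num_tiers:      number of Mono3D latency tiers, fastest first
    tier_capacity:  experts per tier (assumed uniform)
    Returns {expert_id: tier_index}.
    """
    ranked = sorted(expert_hotness, key=expert_hotness.get, reverse=True)
    placement = {}
    for rank, expert in enumerate(ranked):
        tier = min(rank // tier_capacity, num_tiers - 1)
        placement[expert] = tier  # hottest experts land in the fastest tier
    return placement
```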
Strengths
- Comprehensive Scope: The paper attempts a full cross-stack analysis, from device-level simulation of Mono3D DRAM (Sec 6.1.1) to system-level serving policies (Sec 5). This holistic approach is commendable, as optimizations at one level often have unexamined consequences at others.
- Clear Problem Formulation: The work correctly identifies that MoE models, despite their computational sparsity, are fundamentally bottlenecked by the memory capacity and bandwidth required to store and access massive expert parameters. The motivation for exploring beyond-HBM memory architectures is well-founded.
- Detailed NMP Microarchitecture: The proposed NMP architecture in Figure 7 (page 6) is described with a reasonable level of detail, including PE, PU, and chip-level organization. This provides a concrete basis for the performance and area modeling, moving beyond high-level conceptual claims.
Weaknesses
The paper’s ambitious claims rest on a chain of optimistic assumptions and insufficiently validated components. The entire structure is fragile, and a weakness in any single link calls the overall conclusion into question.
- Foundational Reliance on Simulated, Forward-Looking Technology: The entire premise of the paper hinges on the specific performance characteristics of 1024-layer Mono3D DRAM, which does not exist commercially. The crucial latency variation between tiers (a 1.6x difference from fastest to slowest, Sec 6.2.1, pg 11), which is the sole motivation for the tiering mechanism, is derived from Coventor and NeuroSim simulations (Table 1, Figure 14). This is not measured data. If the real-world manufacturing process yields a device with less latency variation, or if thermal crosstalk between layers negates these differences under load, the primary benefit of Stratum's data placement strategy is severely diminished or eliminated. The paper presents these simulated results as fact without a sensitivity analysis.
- The Fragility of the Topic-Aware Scheduling: The system's performance is critically dependent on a "lightweight topic classifier" (Sec 5.1). The authors report 85.0% accuracy on the Chatbot Arena dataset (pg 9). This implies a 15% misclassification rate. A single misclassification would presumably place a "hot" expert on a slow tier, resulting in worst-case memory latency for that token's computation. The evaluation does not quantify the performance degradation under misclassification. This is a critical omission (a toy expected-latency model illustrating this point is sketched after this list). A 15% chance of hitting a major performance penalty is unacceptable in a production serving system. The system's performance in the face of this realistic failure mode is not analyzed.
- Unsubstantiated and Potentially Misleading Performance Claims: The headline claim of "up to 8.29×" improvement is a classic red flag. As seen in Figure 16 (pg 13), this peak number occurs for the smallest model (OLMOE) at a specific sequence length. The gains for the larger and more complex Llama-4-Scout model are a much more modest ~4.5×. More importantly, the fairness of the GPU baseline comparison is questionable. The paper states the baseline is vLLM on H100 GPUs, but provides no details on the configuration. Was the baseline fully optimized? Was it truly memory-bound, or was it bottlenecked elsewhere? Without a detailed roofline analysis or performance counter data from the GPU, it is impossible to verify that the baseline is not a strawman. The system is designed to excel at memory-bound tasks; the authors must first rigorously prove that the baseline workloads are indeed memory-bound.
- Practical Constraints are Glossed Over:
- Expert Swapping Cost: Table 4 (pg 12) claims a sub-1% time overhead for expert swapping. This analysis appears to assume ideal conditions. The cost of moving gigabytes of parameter data between DRAM tiers, even with the proposed row-swap buffer, is non-trivial. The paper does not analyze the impact of memory bank conflicts or contention on the ring network during these swaps, especially under heavy load.
- Thermal and Power Assumptions: The thermal analysis (Sec 6.2.2, pg 11) concludes a 45W power budget for the logic die is feasible with "high-end liquid cooling solutions." This is a best-case scenario. The paper does not model the performance impact of thermal throttling if this ideal cooling is not achieved. The power breakdown in Figure 15 (pg 11) is based on synthesis and simulation, which can often underestimate real-world dynamic power consumption.
- Generality: The entire optimization relies on requests having clear, classifiable topics with predictable expert affinities. The system's performance on workloads without this property (e.g., general conversation, creative writing, multi-topic queries) is unaddressed. This severely limits the claimed applicability of the approach.
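The two sensitivity concerns above (tier-latency spread and classifier accuracy) can be combined into a back-of-envelope expected-latency model of the kind the rebuttal should provide; the two-tier simplification and all numbers below are placeholder assumptions, not values from the paper.

```python
def expected_access_latency(t_fast, ratio, miss_rate):
    """Two-tier toy model: a correct topic prediction hits the fast tier,
    a misprediction pays the slow tier (t_fast * ratio)."""
    return (1.0 - miss_rate) * t_fast + miss_rate * (t_fast * ratio)

# With a 1.6x spread and a 15% miss rate the *mean* access is only ~9%
# worse than the ideal fast tier, but every missed request pays the full
# 1.6x, which is why the latency distribution (not just the mean) matters.
for ratio in (1.2, 1.6):
    for miss in (0.05, 0.15, 0.25):
        print(ratio, miss, expected_access_latency(1.0, ratio, miss))
```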
Questions to Address In Rebuttal
- Please provide a sensitivity analysis showing how Stratum's throughput advantage changes as the simulated latency variation across Mono3D DRAM tiers is reduced. For example, what is the performance gain if the fastest-to-slowest tier latency ratio is only 1.2x, rather than the assumed 1.6x?
- The topic classifier has a non-zero error rate. Please provide an ablation study that quantifies the impact of topic misclassification (e.g., at 5%, 15%, and 25% error rates) on the overall system throughput and latency distribution.
- Please provide evidence that the GPU baselines were not configured as a strawman. Specifically, provide profiler data (e.g., from NSight) for the H100 running vLLM to demonstrate that the workload is fundamentally memory-bandwidth-bound and that the GPU's compute resources are not being underutilized for other reasons.
- The expert swapping cost analysis in Table 4 seems to assume an idle system. How does this overhead change when swapping occurs concurrently with active inference requests that are contending for the same memory banks and on-chip network resources?
- How does the system perform on a mixed workload of queries where a significant fraction (e.g., 50%) has no strong topic affinity and thus activates experts in a pseudo-random or uniform pattern? This would test the system's performance when its primary optimization heuristic fails.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents Stratum, a comprehensive system-hardware co-design for accelerating the serving of Mixture-of-Experts (MoE) Large Language Models. The work's central and most compelling idea is the synergistic mapping between an emerging hardware technology and an emergent property of large AI models. Specifically, it leverages Monolithic 3D-Stackable DRAM (Mono3D DRAM), a technology characterized by non-uniform access latencies across its vertically stacked layers. The authors astutely observe that this physical heterogeneity can be exploited by mapping frequently accessed "hot" experts—identified via a topic-based prediction system—to the faster, upper memory tiers, while relegating "cold" experts to the slower, deeper tiers.
The proposed system integrates this tiered Mono3D DRAM with a near-memory processor (NMP) on a logic die, connected via high-density hybrid bonding. The co-design extends across the stack, encompassing a hardware architecture for the NMP (Section 3.2, page 5), operator mapping strategies for MoE and attention computations (Section 4, page 6), and a system-level scheduler that classifies incoming queries by topic to inform expert placement (Section 5, page 8). The cross-layer evaluation demonstrates significant improvements in throughput and energy efficiency over conventional GPU-HBM baselines.
Strengths
- Novel and Elegant Co-Design Synergy: The core contribution is not merely the application of a new memory technology, but the profound insight that connects the physical properties of that technology to the behavioral properties of the target application. The paper brilliantly turns a potential hardware drawback—the variable access latency of deep Mono3D DRAM stacks (visualized in Figure 2, page 3)—into a key architectural feature. This is elegantly matched with the observation of topic-specific expert affinity in MoE models (profiled in Figure 4, page 4), creating a powerful, cross-stack optimization principle. This represents a mature form of co-design that goes beyond simple acceleration.
- Forward-Looking and Relevant Problem Domain: The paper tackles two critical, forward-looking problems simultaneously: (1) the memory wall in serving extremely large models like MoEs, and (2) the architectural implications of next-generation 3D memory integration. By moving beyond the well-trodden ground of HBM-based PIM/NMP, the authors provide a valuable architectural blueprint for a technology (Mono3D DRAM) that is a strong candidate for future high-performance systems. This positions the work not as an incremental improvement, but as a pioneering exploration of a future design space.
- Comprehensive, Multi-Layered Approach: The strength of the work lies in its completeness. The authors have considered the problem from the device level (DRAM timing parameters in Table 1, page 10), through circuit and architecture (NMP design in Figure 7, page 6), and up to the system software level (topic-aware scheduling in Figure 6, page 5). This end-to-end perspective lends significant credibility to the claimed performance benefits, as it accounts for constraints and overheads at each layer of the system.
- Contextualization within the Field: The paper does a good job of situating itself relative to prior work in PIM/NMP for transformers (e.g., AttAcc, Duplex) and highlighting its key differentiators, primarily the shift to Mono3D DRAM and the exploitation of its unique properties (Section 7, page 13). It builds upon the established trend of moving compute closer to memory while introducing a novel axis of optimization (latency tiering).
Weaknesses
- Contingency on an Emerging Technology: The work's greatest strength is also its primary weakness. The entire premise and the impressive results are predicated on the maturation and adoption of Monolithic 3D DRAM as described. While this is a hallmark of forward-looking architectural research, the practical impact is contingent on manufacturing trends and the resolution of potential yield and thermal challenges associated with such dense 3D integration.
- Sensitivity to Model Behavior: The co-design is exquisitely tuned to the phenomenon of topic-expert specialization. This raises questions about its robustness. If future MoE training methodologies were to change—for instance, by explicitly encouraging more uniform expert usage to improve generalization—the core premise of Stratum's data placement strategy would be undermined. The system's performance is tightly coupled to a specific, albeit currently observed, emergent behavior of MoE models.
- Potential Overheads in Dynamic Scenarios: The paper demonstrates that the overhead of swapping experts between tiers is negligible for a given batch transition (Table 4, page 12). However, in a real-world serving scenario with a highly diverse and rapidly changing mix of query topics, the frequency of these swaps could increase. There is a potential risk of "memory thrashing" at the tier level if the topic distribution of incoming requests is chaotic, which could degrade performance in ways not fully captured by the current evaluation.
Questions to Address In Rebuttal
- Robustness to Model Drift: The core optimization relies on strong topic-expert affinity. How does the performance advantage of Stratum degrade as this affinity weakens? For example, what happens if the hot/cold expert distinction becomes less pronounced, with usage probabilities being more evenly distributed?
- Impact of Prediction Inaccuracy: The system's effectiveness hinges on an up-front, lightweight topic classifier. The evaluation in Section 5.1 (page 9) shows high accuracy, but not 100%. What is the performance penalty of a misclassification? For instance, if a "math" query is misclassified as "legal," the system would presumably preload the wrong experts into the fast tiers, leading to slower execution. Can the authors quantify this impact?
- Generalizability of the Architecture: The Stratum NMP and tiered memory system is highly optimized for the sparse, dynamic nature of MoE models. Does this specialized architecture offer significant benefits for other classes of models? For example, could the tiered memory system be repurposed to accelerate traditional dense transformers by placing attention KV caches in faster tiers, or is its utility fundamentally tied to the expert-based structure of MoEs?
- Scaling and Physical Constraints: The paper assumes up to 1024 vertically stacked layers. As the stack depth increases, the latency disparity between the top and bottom tiers also grows (as shown in Figure 14, page 11). Is there a point of diminishing returns where the slowest tier becomes so slow that it is impractical, or where thermal density becomes an insurmountable challenge for the NMP logic die?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper proposes Stratum, a system-hardware co-design for serving Mixture-of-Experts (MoE) Large Language Models. The system is built upon an emerging memory technology, Monolithic 3D-Stackable DRAM (Mono3D DRAM), integrated with a Near-Memory Processor (NMP) on a logic die via hybrid bonding.
The primary novel claim, as I deconstruct it, is twofold: 1. At the hardware level: The architectural exploitation of an inherent physical property of Mono3D DRAM—namely, the layer-dependent access latency variation (Figure 2, page 3)—to create fine-grained, physical, intra-chip memory tiers. 2. At the system level: A co-design that maps a known software behavior of MoE models—topic-based expert affinity ("hot" vs. "cold" experts)—directly onto these novel physical memory tiers to optimize data access.
The work claims to be the first to propose such a co-design leveraging this specific memory technology for MoE serving.
Strengths
The core strength of this paper lies in its identification and architectural exploitation of a device-level characteristic.
- Novel Architectural Insight: The central idea of turning a physical-layer non-ideality (latency variation across vertically stacked wordlines) into a system-level feature (memory tiering) is genuinely novel and insightful. Instead of designing for the worst-case latency as is conventional, the authors embrace the heterogeneity. This demonstrates deep, cross-layer thinking that is rare and commendable. This is detailed in Section 2.1 (page 3) and visualized in Figure 14 (page 11).
- Synergistic Co-Design: The novelty is further strengthened by the tight coupling between the hardware insight and the application domain. The concept of expert affinity in MoE models is known (e.g., [87], [33]), but prior work has not had a hardware substrate that so elegantly maps to this logical concept. The mapping of hot/cold experts to fast/slow physical tiers (Section 5.2, page 9) is a powerful and novel synthesis of existing ideas from different domains.
- Significant Delta from HBM-based PIM/NMP: The paper correctly differentiates itself from prior art in NMP for LLMs (e.g., Duplex [89], AttAcc [67]), which are based on HBM. The architectural shift to Mono3D DRAM with its dense hybrid bonding fundamentally changes the design constraints (higher internal bandwidth, no TSV bottleneck), justifying a new NMP design. The novelty here is in the adaptation to and exploitation of this new memory paradigm.
Weaknesses
My critique is centered on the precise boundaries of the novelty and the potential transience of the underlying physical premise.
- Constituent Ideas Are Not Novel: While the synthesis is novel, the paper could be more precise about the prior art for its constituent components. The observation of expert affinity is not new ([87]), memory tiering as a concept is decades old, and near-memory processing for transformers is an established research direction. The paper's novelty rests entirely on the combination and the physical mapping. The authors should frame their contribution more sharply as a novel architectural mapping rather than implying the invention of these base concepts.
- NMP Architecture is Evolutionary, Not Revolutionary: The proposed Stratum NMP architecture (Figure 7, page 6) is a well-engineered solution but does not appear to introduce fundamentally new processing concepts. It combines tensor cores, a ring interconnect, and SIMD special function units, which are all well-understood building blocks in accelerator design. The "delta" compared to prior NMP designs like Duplex [89] seems to be primarily in the interconnect topology (ring vs. global buffer/crossbar) and the direct integration with Mono3D banks. This is a significant engineering adaptation, but its conceptual novelty as a processor architecture is limited.
- Contingency on a Technological "Flaw": The entire premise of in-memory tiering hinges on the significant latency variation across Mono3D DRAM layers. This variation stems from the staircase structure for wordline contacts (Figure 2, page 3). It is conceivable that device and circuit designers will view this as a defect to be engineered away in future generations of the technology, striving for uniform access times. If they succeed, the core hardware motivation for this work vanishes. The novelty is thus tied to a potentially transient property of an emerging technology.
Questions to Address In Rebuttal
The authors should use the rebuttal to clarify the following points regarding the novelty and significance of their contribution.
- The concept of topic-based expert affinity and classifying experts as "hot" or "cold" has been explored for optimizing MoE serving on existing hardware (e.g., [87]). Please confirm that your primary novel contribution is not the identification of this affinity, but rather the creation of a new hardware architecture (tiered Mono3D) that provides a physical substrate for this logical classification, and the co-design that maps between them.
- The NMP architecture in Figure 7 integrates tensor cores, a ring network, and SIMD function units. Beyond the adaptation to leverage the high internal bandwidth of Mono3D DRAM, what are the fundamentally new architectural concepts in the processing unit or interconnect design itself when compared to the principles used in prior NMP systems like Duplex [89]?
- The proposed tiering mechanism is predicated on the access latency heterogeneity in Mono3D DRAM. How fundamental is this property? Is it not a target for elimination by device-level engineers in future iterations of the technology? How would the value proposition of Stratum change if next-generation Mono3D DRAM achieved, for instance, less than a 20% latency variation between the fastest and slowest layers?
- The system introduces significant complexity, including a topic classifier, a dynamic SLO-aware scheduler, and a memory mapper for expert swapping. Could a simpler baseline, such as caching the weights of the most globally popular experts (independent of topic) in a large SRAM on the logic die, achieve a substantial fraction of the performance benefit without the overhead of dynamic topic classification and physical data migration between tiers? A comparison against such a baseline would help quantify the benefit of the novel tiering mechanism itself (a concrete sketch of such a baseline is given after this list).
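As a concrete version of the simpler baseline suggested in the last question, the sketch below (hypothetical, not from the paper) selects the globally most-popular experts for a fixed on-die SRAM cache, independent of topic; comparing against it would isolate the incremental value of dynamic topic classification and tier migration.

```python
from collections import Counter

def global_popularity_cache(expert_trace, sram_slots):
    """Pick the globally hottest experts from a profiled activation trace.

    expert_trace: iterable of expert_ids activated across a request mix
                  (assumed profiling input)
    sram_slots:   number of experts the logic-die SRAM can hold (assumed)
    """
    counts = Counter(expert_trace)
    return {expert_id for expert_id, _ in counts.most_common(sram_slots)}
```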
Kelle: Co-design KV Caching and eDRAM for Efficient LLM Serving in Edge Computing
Abstract
Running Large Language Models (LLMs) on edge devices is crucial for reducing latency, improving real-time processing, and enhancing privacy. By performing inference directly on the device, data does not need to be sent to the cloud, ensuring faster ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose "Kelle," a hardware-software co-design for LLM inference on edge devices that uses eDRAM as the primary storage for the KV cache. The core contributions are twofold: 1) an algorithm named AERP that combines attention-based token eviction with a recomputation policy to manage the KV cache size, and 2) a two-dimensional adaptive refresh policy (2DRP) for the eDRAM that modulates refresh rates based on token importance and bit significance, thereby intentionally introducing errors to save power. These are implemented in a custom accelerator. The authors claim significant speedup (3.9×) and energy savings (4.5×) over an SRAM-based baseline. While the problem is relevant, the work rests on a foundation of questionable experimental design and insufficiently substantiated claims, particularly concerning the main performance comparison.
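For context, the eviction half of AERP as summarized here (and revisited in Review 3's comparison to H2O) amounts to scoring each cached token by its accumulated attention mass and evicting the lowest-scoring ones. The sketch below is a generic illustration of that idea, not the authors' exact Equation 3; array names and shapes are assumed.

```python
import numpy as np

def accumulated_attention_eviction(attn_weights, num_evict=1):
    """Generic heavy-hitter-style eviction sketch.

    attn_weights: (num_queries, num_cached_tokens) softmax attention
                  probabilities observed so far (assumed layout)
    Returns indices of the tokens with the smallest accumulated score.
    """
    scores = attn_weights.sum(axis=0)      # total attention each cached token received
    return np.argsort(scores)[:num_evict]  # evict the least-attended tokens
```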
Strengths
- Problem Formulation: The paper correctly identifies the KV cache as a primary memory bottleneck for LLM inference on resource-constrained edge devices, and the exploration of eDRAM as an alternative to SRAM is a logical direction.
- Ablation Study Structure: The authors have made an effort to isolate the performance contributions of their various techniques. The comparisons between `Original+SRAM`, `Original+eDRAM`, `AEP+SRAM`, and `AERP+SRAM` (Section 8.1.2, Figure 13) provide some insight into the relative benefits of each component of their design, assuming the baselines are fair.
- Comprehensive Algorithm Design: The proposed solution is multifaceted, addressing the eDRAM challenge from multiple angles: cache content management (AERP), refresh power (2DRP), and data lifetime (Kelle Scheduler).
Weaknesses
My primary role is to ensure the rigor of published work. This paper, in its current form, exhibits several critical weaknesses that undermine the validity of its conclusions.
- Fundamentally Confounded Main Comparison: The central claim of 3.9× speedup and 4.5× energy savings is derived from a comparison between `Kelle+eDRAM` and an `Original+SRAM` baseline. As stated in Section 8.1.1, the SRAM baseline is area-matched, resulting in a smaller 24×24 systolic array compared to Kelle's 32×32 array. This is an unacceptable confounder. The reported gains are not solely from the memory subsystem innovation but are significantly influenced by the fact that the Kelle system has nearly 80% more compute resources (32²/24² ≈ 1.78). This renders the headline performance numbers highly misleading. A valid comparison would require matching the compute capabilities.
- Insufficient Justification for Heuristics: The proposed AERP algorithm relies on a seemingly arbitrary threshold. In Section 4.1.2, the decision to recompute KV vectors for a token if it is retained in "at least 50% of the heads" lacks any theoretical or empirical justification. The paper presents no sensitivity analysis for this crucial parameter (the threshold is expressed as a single tunable parameter in the sketch after this list). How does performance change at 40% or 60%? Without this, the chosen value appears to be a "magic number" that may have been tuned for the specific experiments shown.
- Superficial Evaluation of Error Injection: The 2DRP policy is predicated on the idea that LLMs are tolerant to a certain level of data corruption in the KV cache. The authors' evaluation of this tolerance is dangerously shallow. They rely primarily on perplexity (PPL) (Figure 8), which is a coarse, statistical measure of fluency. The claim in Section 7.1 that an average retention failure rate of 2e-3 has a negligible impact is not sufficiently proven. While Table 5 shows "comparable" scores on TruthfulQA, a minor drop in accuracy on a multiple-choice benchmark does not adequately characterize the risk of catastrophic failures, such as factual hallucination or safety violations, which could be triggered by specific bit-flips in critical tokens. A system that knowingly corrupts data requires a far more rigorous and targeted evaluation of its failure modes.
- "Straw Man" Baseline Scheduler: The baseline computation pattern shown in Figure 12a appears intentionally inefficient, with long, serialized data dependencies that inflate the data lifetime in eDRAM. Any reasonably optimized system would attempt to co-schedule dependent operations to improve data locality and reduce lifetime. The gains attributed to the "Kelle Scheduler" may be significantly overstated by comparing against this naive baseline.
- Understated System Complexity: The paper proposes a complex memory subsystem (Section 5.1, Figure 10) with data split into four groups (MSB/LSB for HST/LST tokens) across 32 banks, managed by custom eviction and refresh controllers. While the area of the systolic evictor is mentioned (Section 8.1.4), the overhead of the intricate control logic, potential timing challenges, and addressing complexity for this fine-grained management is not adequately discussed or quantified.
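The 50%-of-heads rule questioned above reduces to a single tunable parameter, which is exactly what a sensitivity sweep would vary. A minimal sketch with a hypothetical interface:

```python
def should_store_input(retained_in_heads, num_heads, threshold=0.5):
    """Decide whether to keep the input vector x_n so its KV entries can be
    recomputed later instead of stored per head.

    threshold: the paper reportedly uses 0.5; the requested sensitivity
               analysis would sweep values such as 0.4 and 0.6.
    """
    return retained_in_heads / num_heads >= threshold
```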
Questions to Address In Rebuttal
The authors must address the following points directly to salvage the credibility of this work.
- Provide a new end-to-end performance comparison (speedup and energy) against an SRAM-based baseline that is compute-matched (i.e., also uses a 32×32 systolic array). This is the only way to isolate the true contribution of the Kelle memory system. The area and power of this new, larger SRAM baseline must be reported.
- Present a sensitivity analysis for the 50% popularity threshold in the recomputation policy (Section 4.1.2). How do system performance, accuracy, and storage savings vary as this threshold is changed? Justify your final choice of 50%.
- The evaluation of the 2DRP's error injection is insufficient. Please provide a more rigorous analysis of its impact on model factuality and safety. This should go beyond PPL and standard zero-shot tasks and utilize benchmarks specifically designed to detect factual inconsistency (e.g., FactScore) or adversarial safety risks.
- Justify the selection of the "Baseline" scheduler in Figure 12. Is this representative of typical, optimized LLM inference schedulers, or is it a worst-case scenario designed to amplify the benefits of the Kelle scheduler? Please compare against a more aggressive, locality-aware baseline scheduler.
- In Section 8.4.1, you claim Kelle can support long contexts (up to 60K tokens) by offloading to DRAM, stating that the paging process is simplified. However, prefetching this volume of KV data from DRAM for every token generation step would incur substantial latency and energy overhead. Please provide a quantitative analysis of this "paging" overhead and demonstrate that it does not negate the on-chip benefits for such long sequences (a back-of-envelope version of this calculation is sketched after this list).
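A back-of-envelope calculator of the kind requested in the last question is sketched below; the model dimensions and bandwidth are placeholder assumptions for illustration, not values from the paper.

```python
def kv_prefetch_cost(tokens, layers, kv_heads, head_dim, bytes_per_elem,
                     dram_gb_per_s):
    """Rough per-step KV traffic and transfer time if the retained cache
    must be streamed from off-chip DRAM for each generated token.
    All parameters are assumed for illustration."""
    total_bytes = tokens * layers * kv_heads * head_dim * bytes_per_elem * 2  # K and V
    seconds = total_bytes / (dram_gb_per_s * 1e9)
    return total_bytes, seconds

# Example with assumed values: 60K tokens, 32 layers, 8 KV heads,
# head_dim 128, fp16, 25 GB/s LPDDR:
#   kv_prefetch_cost(60_000, 32, 8, 128, 2, 25) -> ~7.9 GB, ~0.31 s per step,
# the kind of number the rebuttal would need to beat via prefetch overlap,
# partial fetch, or the smaller post-eviction cache.
```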
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper, "Kelle," presents a hardware-software co-design for efficient Large Language Model (LLM) inference on edge devices, targeting the significant bottleneck of the Key-Value (KV) cache. The core contribution is the principled replacement of traditional on-chip SRAM with embedded DRAM (eDRAM) as the primary storage for the KV cache. Recognizing that eDRAM's high density comes at the cost of power-intensive periodic refreshes, the authors propose a suite of tightly-coupled optimizations to mitigate this overhead. These include: 1) an Attention-based Eviction and Recomputation Policy (AERP) to manage the limited cache size and reduce the lifetime of stored data, and 2) a Two-Dimensional Adaptive Refresh Policy (2DRP) that cleverly exploits the inherent error tolerance of LLMs by reducing refresh rates for less significant bits and less important tokens. These policies are instantiated in a complete accelerator design featuring a custom memory subsystem and a novel "systolic evictor" to implement the AERP with minimal latency. The work demonstrates significant improvements in speed and energy efficiency, positioning eDRAM as a viable and powerful component for future edge AI systems.
Strengths
- Excellent Problem-Solution Fit and Timeliness: The paper identifies one of the most pressing problems in deploying modern AI: the memory capacity wall for LLM inference. The proposal to use eDRAM is a fantastic synthesis of an existing, mature technology with a new and critical problem domain. While eDRAM has been explored for CPU caches and CNN accelerators, its application to the LLM KV cache is both novel and exceptionally well-motivated. The authors correctly identify that the KV cache is fundamentally a capacity and bandwidth problem, which aligns perfectly with eDRAM's primary strengths over SRAM.
- Insightful and Elegant Co-design Principle: The true strength of this work lies not just in using eDRAM, but in the deep, synergistic co-design. The central insight, empirically validated in Figure 8 (page 5), is that LLMs exhibit graceful degradation under certain types of data corruption. The 2DRP policy (Section 4.2, page 5) is a direct and clever exploitation of this property, mapping a characteristic of the software model (variable importance of data) directly onto a physical knob in the hardware (refresh rate). This moves beyond a simple component swap into a truly holistic system design, which is commendable (the two-axis importance-to-refresh mapping is sketched as a small lookup table after this list).
- Comprehensive System-Level Approach: The authors present a complete system, from algorithm to architecture. They do not stop at the conceptual level but detail the hardware necessary to make their vision a reality, including the Kelle accelerator architecture (Figure 9, page 6), the memory subsystem layout (Figure 10, page 7), and the novel systolic evictor (Section 5.3, page 7). This end-to-end thinking significantly increases the credibility and impact of the work, demonstrating a clear path from idea to implementation.
- Thorough and Illuminating Evaluation: The experimental evaluation is extensive, covering multiple models, datasets, and a strong set of ablation studies (Section 8.3, page 11). The comparison against four distinct baselines in Section 8.1.1 (page 10) is particularly effective, as it allows the reader to disentangle the benefits derived from simply using eDRAM versus those from the AERP and 2DRP algorithms. This methodical breakdown provides clear evidence for the efficacy of each proposed contribution.
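The two-axis mapping praised in the second strength can be pictured as a small lookup of relaxed refresh intervals, one per (token importance, bit significance) group. The sketch below is illustrative only; the interval values are invented placeholders, not the paper's settings.

```python
# Illustrative refresh-interval table in microseconds (placeholder values):
# rows = token importance (HST/LST), columns = bit significance (MSB/LSB).
REFRESH_INTERVAL_US = {
    ("HST", "MSB"): 45,    # important tokens, significant bits: nominal refresh
    ("HST", "LSB"): 90,
    ("LST", "MSB"): 90,
    ("LST", "LSB"): 180,   # low-score tokens, low-order bits: most relaxed
}

def refresh_interval(token_class, bit_class):
    """Look up the relaxed refresh interval for a (token, bit) group."""
    return REFRESH_INTERVAL_US[(token_class, bit_class)]
```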
Weaknesses
- Limited Contextualization Against Other Memory Technologies: The paper does an excellent job of framing the SRAM vs. eDRAM trade-off. However, the broader field of computer architecture is actively exploring a rich landscape of emerging memory technologies (e.g., MRAM, RRAM, FeFETs) for on-chip memory. These technologies could potentially offer the density benefits of eDRAM without the refresh overhead, albeit with their own challenges (e.g., write latency, endurance). A discussion on where Kelle's principles fit within this wider context would strengthen the paper's long-term relevance. For instance, could the AERP policy be beneficial for NVMs to manage write endurance? This is a missed opportunity to position the work more broadly.
- Understated Implementation Complexity: The proposed co-design, particularly the 2DRP policy, introduces non-trivial control logic. The memory controller must now track importance scores and manage multiple, fine-grained refresh domains dynamically. While the paper quantifies the overhead of the systolic evictor (Section 8.1.4, page 11), a more detailed analysis of the control overhead for the memory system itself would be beneficial. The elegance of the idea may hide a significant complexity cost that should be acknowledged and justified more explicitly.
- The Fundamental Assumption of Error Tolerance: The 2DRP policy hinges on the assumption that LLMs are resilient to bit-flips in certain data. While the authors show this holds for their 16-bit evaluation, this property may become less robust as the field pushes towards aggressive, low-bit quantization (e.g., 4-bit, 3-bit, or even binary formats). In such schemes, every single bit carries substantially more information, and a single bit-flip could be catastrophic. The paper would be strengthened by a discussion of how its core principles might adapt or break under these more aggressive quantization scenarios.
Questions to Address In Rebuttal
- Could the authors elaborate on how their co-design principles, particularly AERP and the concept of mapping data importance to hardware properties, might compare to or be adapted for other emerging on-chip memory technologies like MRAM or RRAM, which offer density without refresh but introduce challenges like write endurance and latency?
- The 2DRP policy requires a sophisticated eDRAM controller capable of managing dynamic, fine-grained refresh domains. Could the authors provide a more detailed estimate of the area and power overhead of this control logic relative to a standard eDRAM controller? Is there a risk that this control complexity negates some of the energy savings from reduced refresh operations?
- The viability of 2DRP rests on the error tolerance of the LLM. How do the authors expect this approach to perform with models that are aggressively quantized to 4-bit or lower representations? Does the reduced information redundancy in low-bit formats present a fundamental challenge to a strategy that relies on tolerating data corruption?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper proposes "Kelle," a hardware-software co-design for efficient LLM inference on edge devices. The central idea is to use embedded DRAM (eDRAM) as the primary on-chip storage for the KV cache to leverage its high density and low leakage power compared to SRAM. To mitigate the primary drawback of eDRAM—the high cost of periodic refresh operations—the authors introduce a suite of techniques: (1) an Attention-based Eviction and Recomputation Policy (AERP) to manage the cache size and data lifetime, (2) a Two-Dimensional Adaptive Refresh Policy (2DRP) to reduce refresh frequency by exploiting data criticality, and (3) a custom hardware accelerator featuring a novel "Systolic Evictor" to implement these policies efficiently. The authors claim this co-design results in significant speedup and energy savings.
My review will assess the novelty of these constituent ideas by situating them within the landscape of prior art.
Strengths
The primary strength of this work lies in the synthesis of multiple techniques into a cohesive system targeted at a very specific and timely problem. While many of the individual concepts have precedent in other domains, their application and integration for managing an LLM KV cache in eDRAM is a novel endeavor.
The most genuinely novel contribution at the microarchitectural level appears to be the Systolic Evictor (Section 5.3, page 7). The concept of a computational unit that operates in a systolic, on-the-fly manner, tightly coupled with the main systolic array (RSA), to identify the eviction candidate without stalling the pipeline is a clever and, to my knowledge, new design. It solves a specific performance problem created by the proposed eviction policy in an elegant way.
Weaknesses
My main concern is that the paper presents several core algorithmic and policy-level ideas as fundamentally new, when they are in fact adaptations or direct parallels of well-established concepts from prior work. The "delta," or the degree of novelty, is smaller than implied.
- Use of eDRAM for On-Chip Acceleration: The foundational premise of using eDRAM as a dense, low-power alternative to SRAM for on-chip neural network accelerators is not new. This path has been well-trodden, particularly in the domain of CNNs. For instance, DaDianNao [15] explored eDRAM for machine learning supercomputers, and RANA [76] specifically proposed refresh-optimized eDRAM for efficient neural acceleration. The novelty here is the application to LLMs, not the core architectural choice.
- Attention-based Eviction Policy: The AERP eviction policy, which prunes tokens based on their summed attention scores (Equation 3, page 4), is conceptually almost identical to the "heavy-hitter" identification method proposed in H2O [98]. H2O also identifies important tokens by their high accumulated attention scores. The paper needs to clearly articulate the fundamental algorithmic innovation that differentiates its eviction policy from H2O, beyond the novelty of its hardware implementation (the Systolic Evictor). As it stands, the policy itself appears derivative.
- Two-Dimensional Adaptive Refresh Policy (2DRP): The 2DRP is presented as a key innovation, but it is a synthesis of two known principles.
- Importance-Aware Refresh: The idea of reducing refresh rates for less critical data is not new. RANA [76] did precisely this by linking refresh frequency to the impact of bit errors on CNN accuracy. Kelle's first dimension—refreshing High Score Tokens (HST) more frequently than Low Score Tokens (LST)—is a direct analogue of this principle applied to the LLM domain.
- Bit-level Differential Refresh: The second dimension—refreshing Most Significant Bits (MSBs) more frequently than Least Significant Bits (LSBs)—is a classic technique in approximate and fault-tolerant memory design. The principle that errors in LSBs are less impactful than errors in MSBs is fundamental.
- The novelty of 2DRP is therefore in the combination of these two axes for the specific use case of an LLM KV cache. It is a good piece of engineering, but not a fundamentally new concept in memory management.
- KV Vector Recomputation: The trade-off of computation for memory is a cornerstone of computer science. While applying it to the KV cache is logical, the policy of recomputing based on token "popularity" (Section 4.1.2, page 5) is a heuristic. The novelty is limited to the specific formulation of this heuristic.
In summary, the paper constructs a powerful system, but it does so primarily by adapting and combining existing ideas. The work would be stronger if it more accurately positioned its contributions as a novel synthesis and application of these principles, rather than implying they are new from first principles.
Questions to Address In Rebuttal
- Please explicitly detail the fundamental algorithmic difference between your attention-based eviction policy in AERP and the heavy-hitter identification method in H2O [98]. Why should your policy be considered a novel contribution distinct from this prior work?
- Could the authors reframe the contribution of 2DRP? Given prior art like RANA [76] on importance-aware refresh for NNs and established work on bit-level differential reliability, please clarify if the novelty lies in the discovery of these principles or in their specific synthesis and application to the LLM KV cache.
- The recomputation policy relies on a heuristic threshold where input vectors (`x_n`) are stored if the token is popular in "> 50% of the heads". How was this threshold determined? Please provide an analysis of the system's sensitivity to this specific value, as the robustness of a heuristic is key to evaluating its contribution.
LongSight: Compute-Enabled Memory to Accelerate Large-Context LLMs via Sparse Attention
Abstract
Large input context windows in transformer-based LLMs help minimize hallucinations and improve output accuracy and personalization. However, as the context window grows, the attention phase increasingly dominates execution time. Key–Value (KV) caching ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
This paper presents LongSight, an algorithm-hardware co-design that repurposes a CXL-based dense retrieval accelerator, DReX, for sparse attention in large-context LLMs. The core mechanism is a hybrid attention model: a dense sliding window of recent tokens is handled by the host GPU, while attention over the long-tail historical context is offloaded to DReX. This offload leverages a multi-stage filtering process, initiated by a sign-based filter (SCF) accelerated in-memory, to retrieve a top-k set of relevant Key-Value pairs. The authors claim this system can support context lengths up to one million tokens on a single GPU and achieves significant throughput improvements over dense baselines.
However, the work rests on a series of optimistic assumptions regarding the generalization of its sparse approximation, the practicality of its complex hyperparameter tuning, and the performance of an emulated hardware interface. The evaluation relies heavily on projections for its most impressive claims, leaving the robustness and real-world viability of the system in question.
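To fix ideas about the hybrid mechanism described above, the sketch below (assumed shapes and names, not the authors' code) combines exact attention over a recent window with attention over only the top-k retrieved historical keys; in LongSight the retrieval step is what gets offloaded to DReX.

```python
import numpy as np

def hybrid_attention(q, recent_k, recent_v, hist_k, hist_v, k_top):
    """q: (d,) query; recent_*: (W, d) dense window kept in GPU HBM;
    hist_*: (N, d) offloaded history; k_top: retrieved token budget.
    Retrieval is shown as an exact top-k dot product for clarity; the paper
    approximates it with sign-based filtering plus near-memory scoring."""
    idx = np.argsort(hist_k @ q)[-k_top:]          # top-k retrieval (the offloaded step)
    keys = np.concatenate([recent_k, hist_k[idx]])
    vals = np.concatenate([recent_v, hist_v[idx]])
    logits = keys @ q / np.sqrt(q.shape[0])
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return probs @ vals
```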
Strengths
- Pragmatic System Partitioning: The hybrid dense-sparse approach, which keeps recent tokens in GPU HBM for dense attention while offloading the historical context, is a logical and practical design choice. It correctly identifies that recent tokens are often most important, providing an accuracy backstop for the sparse approximation.
- Detailed Hardware-Aware Data Layout: The paper provides a well-considered mapping of the logical data hierarchy (Key Blocks, Context Slices, User Partitions) onto the physical hardware of DReX (Section 7.3.3, page 9). This demonstrates a clear understanding of the need to co-design data structures with the memory system to maximize parallelism and bandwidth.
- Repurposing Existing Hardware: The central idea of extending a dense retrieval accelerator for sparse attention is resourceful. It proposes a path to broader utility for specialized hardware, which is a commendable system-level goal.
Weaknesses
- Core Claims are Based on Extrapolation, Not Measurement: The headline claim of supporting and accelerating 1M token contexts is not empirically demonstrated. As noted in the caption of Figure 7 (page 12), performance for context lengths above 128K is "projected based on performance at 128K context." A projection is not a result. For a systems paper making bold performance claims, the lack of measured data at the most challenging scales undermines the entire contribution. The scaling of system bottlenecks (e.g., CXL overhead, data structure management) may not be linear as assumed.
- The Algorithmic Foundation is Potentially Fragile: The entire performance gain hinges on the effectiveness of the ITQ-enhanced SCF. The methodology for this is questionable:
- The ITQ rotation matrix is trained on a mere 1K-token sequence (Section 5.4, page 6) but is expected to generalize and remain effective across a 1,000,000-token context. This is a heroic assumption. There is no evidence that the statistical properties of Key/Query vectors remain stable enough for this to hold.
- The evaluation relies solely on perplexity. While a useful intrinsic metric, it is not a substitute for performance on downstream, long-context reasoning tasks. A small hit in perplexity can sometimes translate to a catastrophic failure in tasks requiring retrieval of specific, non-obvious facts from deep within the context.
- Impractical Hyperparameter Sensitivity: The authors themselves concede in Section 9.3 (page 13) that "optimal parameters (i.e., window size, k, and SCF thresholds) are heavily context-dependent and impact end-to-end performance." This is a critical flaw for a general-purpose accelerator. It implies that for any new model, dataset, or even target context length, a user must engage in a complex, multi-dimensional tuning sweep to achieve the reported performance. This severely limits the system's practical usability and suggests the proposed method is more of a brittle proof-of-concept than a robust solution.
- Insufficient and Misleading Baseline Comparisons: The primary baselines are 1-GPU and 2-GPU dense attention. It is a foregone conclusion that a sparse method will outperform a dense one at extreme context lengths. The more scientifically rigorous comparisons would be against state-of-the-art software-based sparse attention techniques (e.g., highly optimized block-sparse kernels) running on the same GPU hardware. The paper discusses such methods in the background (Section 3.1) but fails to compete against them in the evaluation (Section 9). The comparison against a simple sliding window attention (Figure 10, page 13) is a weak benchmark.
- Hardware Performance is Based on Emulation: The evaluation "emulate[s] the CXL interface using a dual-socket Intel Xeon...platform" (Section 8.2, page 11). CXL is a complex interconnect, and its real-world performance involves subtle effects of protocol overhead, contention, and NUMA effects that are difficult to capture with such an emulation. A paper proposing a CXL-based hardware solution must be held to a higher standard of fidelity in its performance model. Relying on an emulation for the critical data path injects significant uncertainty into the final performance numbers.
Questions to Address In Rebuttal
- Please provide a rigorous justification for projecting performance from 128K to 1M tokens. What evidence supports the assumption that no new system bottlenecks emerge at these larger scales? Can you provide data from a scaled-down model or hardware configuration that validates this linear scaling assumption?
- The ITQ matrix is trained on a 1K token sequence. Please provide an ablation study showing how the model's accuracy (in perplexity and, ideally, a downstream task like needle-in-a-haystack) degrades as the context length increases from 1K to 128K. How can the authors be confident this approach does not fail catastrophically at 1M tokens?
- Given that the system requires extensive, context-dependent hyperparameter tuning, what is the proposed methodology for a practitioner to deploy LongSight on a novel LLM or for a new application? Please quantify the tuning overhead.
- Why were state-of-the-art software-only sparse attention methods, which require no specialized hardware, omitted as primary performance baselines in Figure 7? A fair comparison would show where LongSight provides a benefit over what is achievable on the commodity GPU alone.
- How does your CXL emulation model account for protocol overhead and contention from multiple concurrent requests targeting the DReX device? Please quantify the sensitivity of your results to a 2x or 5x increase in CXL tail latency, which can be common in real-world systems under load.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces LongSight, an algorithm-hardware co-design framework to accelerate large language model (LLM) inference for extremely long contexts. The central and most compelling idea is the repurposing of a compute-enabled CXL memory expander called DReX, originally designed for dense retrieval, to serve as an active offloading device for the LLM's Key-Value (KV) cache. LongSight implements a hybrid attention mechanism: a conventional dense attention is performed on a sliding window of recent tokens stored in the GPU's local HBM, while sparse attention for the vast historical context is handled by DReX. The GPU offloads query vectors to DReX, which leverages its in- and near-memory processing capabilities to efficiently perform a top-k similarity search, returning only the most relevant Keys and Values. This approach dramatically reduces the memory and compute burden on the GPU, enabling a single-GPU system to efficiently handle context lengths of up to one million tokens, a scale that is currently only feasible with large multi-GPU setups.
Strengths
The primary strength of this work lies in its elegant synthesis of ideas from several distinct but converging fields of research.
- Novel Connection of Problem Domains: The authors make a brilliant conceptual leap by identifying the architectural similarity between top-k dense retrieval (the problem DReX was built for) and the core operation of a common form of sparse attention (finding the keys with the highest dot-product similarity to a query). By repurposing a specialized accelerator, they provide a powerful, hardware-grounded solution to the sparse attention problem, rather than just proposing another software-based heuristic. This connection is the paper's core intellectual contribution and is genuinely insightful.
- Addressing a Critical System Bottleneck: The paper tackles one of the most significant challenges in modern AI: the memory and computational explosion associated with large context windows. Their claim of enabling 1M token contexts on a single GPU (as shown in Figure 7, Page 12) is not merely an incremental improvement; it represents an order-of-magnitude increase in accessibility for this class of models. This has the potential to democratize research and application development for use cases that are currently out of reach for most, such as processing entire books, legal archives, or extensive codebases in a single inference pass.
- A Compelling Use Case for Modern Hardware Trends: LongSight serves as a powerful "killer app" for emerging architectural trends. It demonstrates the tangible benefits of:
- Compute Express Link (CXL): The low-latency, coherent memory sharing provided by CXL is essential for the fine-grained interaction between the GPU and DReX.
- Processing-in-Memory (PIM): The use of PIM for the initial sign-bit filtering stage (Section 7.1, Page 8) is a perfect application of the technology—a simple, highly parallelizable operation performed directly where the data resides, avoiding massive data movement.
- Disaggregated/Tiered Memory: The paper provides a concrete vision for a smarter memory hierarchy, where the CXL-attached tier is not just for passive capacity expansion but is an active computational partner to the main processor.
- Holistic Algorithm-Hardware Co-Design: The solution is thoughtfully designed across the entire stack. The algorithm (hybrid dense-sparse with ITQ-enhancement) is tailored to the hardware's capabilities (sign-based PIM filtering, near-memory dot-products). The data layout within DReX is carefully orchestrated to maximize parallelism (Figure 6, Page 10). This comprehensive, system-level thinking is commendable.
Weaknesses
While the core idea is powerful, its current presentation is tightly coupled to a specific research artifact, which raises questions about its generalizability.
- Dependence on the DReX Architecture: The work is presented as an extension of DReX [34]. While this makes for a concrete and well-evaluated system, it leaves the reader wondering about the fundamental principles. The paper would be significantly strengthened by abstracting away from DReX and defining the core requirements for a "LongSight-capable" memory device. What are the necessary PIM capabilities, the required CXL bandwidth, and the near-memory compute primitives that make this approach viable? Without this, the work risks being seen as a bespoke solution rather than a generalizable architectural paradigm.
- Complexity of Hyperparameter Management: The authors acknowledge that the system's performance and accuracy depend on a set of interdependent hyperparameters: the dense window size (W), the number of sparse tokens (k), and the per-head SCF thresholds. The Pareto frontier in Figure 10 (Page 13) shows that tuning is critical for optimal performance. This introduces a layer of complexity for practitioners. A more robust system would ideally feature a method for automatically and dynamically adapting these parameters.
- Limited Comparison to Alternative Sparsity Patterns: The paper focuses exclusively on top-k similarity as the criterion for sparsity. This is a reasonable and popular choice, but the field has explored other patterns (e.g., block-sparse, strided, global tokens as in Longformer [2]). A discussion of whether the DReX hardware could be programmed or adapted to accelerate other sparsity patterns would help contextualize the flexibility and limitations of their proposed hardware.
Questions to Address In Rebuttal
- The work is tightly coupled with the DReX architecture [34]. Can the authors elaborate on the core architectural requirements for a compute-enabled memory device to effectively implement the LongSight approach? For instance, what are the minimum PIM capabilities (e.g., is sign-bit XOR sufficient?) and CXL bandwidth needed? This would help readers understand how the concept could transcend this specific hardware implementation.
- The performance of LongSight appears sensitive to several hyperparameters (window size, k, thresholds), which may depend on the context length and task (Section 9.3, Page 13). Could the authors discuss the potential for automating this tuning process? Is there a risk that the overhead of finding optimal parameters could negate the performance benefits in some practical, dynamic deployment scenarios?
- The paper elegantly repurposes a dense retrieval accelerator for sparse attention. This concept of offloading key primitives to specialized CXL memory seems very powerful. Are there other computationally expensive primitives in modern transformer models (e.g., Mixture-of-Experts routing, speculative decoding verification) that could be similarly offloaded and accelerated by a DReX-like device? A brief discussion on the broader applicability of this "accelerated memory" paradigm would significantly enhance the paper's impact.
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents LongSight, an algorithm-hardware co-design for accelerating large-context LLM inference. The core idea is to repurpose DReX [34], a compute-enabled CXL memory expander originally designed for dense retrieval (e.g., for RAG), to accelerate the sparse attention component of LLM inference. The authors propose a hybrid attention algorithm where a GPU handles a dense sliding window of recent tokens, while the vast history of the KV cache is offloaded to DReX. Within DReX, attention is treated as a top-k vector similarity search, leveraging DReX's in-memory sign-bit filtering and near-memory dot-product accelerators to efficiently find the most relevant Key vectors. The claimed novelty lies in this specific repurposing and the co-design of the algorithm, system integration, and hardware scheduling to enable dynamic, low-overhead sparse attention at a massive scale.
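The sign-bit filtering referred to above can be illustrated with a small sketch: queries and keys are reduced to their sign bits (optionally after an ITQ-style rotation), and candidates are ranked by how many signs agree, which maps naturally onto XOR/popcount operations in memory. This is a conceptual illustration, not DReX's actual datapath; the rotation argument is an assumption.

```python
import numpy as np

def sign_concordance_scores(query, keys, rotation=None):
    """Count sign agreements between one query and each stored key.

    rotation: optional (d, d) ITQ-style rotation matrix learned offline
              (assumed); None means raw sign bits.
    """
    if rotation is not None:
        query, keys = query @ rotation, keys @ rotation
    q_signs = query >= 0
    k_signs = keys >= 0
    return (q_signs == k_signs).sum(axis=1)  # higher = more likely a top-k key

# Keys scoring above a per-head threshold would then be rescored exactly
# (full dot product) by the near-memory units.
```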
Strengths
The primary strength of this work is its novel synthesis of existing components to solve a well-defined and challenging problem. While individual elements are not entirely new, their combination is.
- Novel Application of a Specialized Accelerator: The central novel claim—repurposing a dense retrieval accelerator (DReX) for dynamic LLM attention—is a compelling one. Prior art in hardware-accelerated attention (e.g., NeuPIMs [9], AttAcc [29]) has largely focused on accelerating dense attention computations. LongSight's approach of treating attention as a hardware-accelerated retrieval problem is a significant conceptual departure.
- Addressing the Dynamic Update Challenge: The most significant point of novelty compared to other retrieval-based attention methods (e.g., Squeezed Attention [12], which uses standard ANNS) is how it handles the dynamic nature of the KV cache. Conventional ANNS methods require expensive index rebuilding upon data insertion, making them unsuitable for the per-token updates in autoregressive generation. LongSight's use of DReX's index-free, sign-concordance filtering (SCF) mechanism provides a novel and practical solution to this critical limitation. This is the key "delta" that makes the contribution significant.
- Novel System-Level Co-design: The contributions detailed in Section 7, particularly the extensions to the DReX CXL Controller (DCC) and the hierarchical data layout scheme (Figure 6), represent tangible, novel system design. This is not a simple "plug-and-play" use of DReX; it requires new hardware logic and a sophisticated software mapping to handle the granularity of multi-head, multi-layer, and multi-user attention requests.
Weaknesses
The work's novelty is somewhat constrained by its heavy reliance on a foundation of pre-existing concepts and a specific, previously proposed architecture.
-
Incremental Algorithmic Novelty: The hybrid dense-sparse attention algorithm itself is conceptually similar to established patterns. The idea of combining a dense sliding window for recency with a retrieval mechanism for long-term memory is present in various forms in prior work (e.g., Longformer [2], StreamingLLM [41]). The novelty is therefore not in the algorithm's high-level structure, but in its specific hardware-aware implementation.
-
Reliance on Closely-Related Prior Art: The proposed system is fundamentally an application built on top of DReX [34], a system proposed in a recent paper by many of the same authors. While repurposing an architecture is a valid contribution, it frames the work more as an extension or a new use case for DReX rather than a fundamentally new hardware architecture. The paper is transparent about this, but it bounds the scope of the novelty.
-
Synthesis vs. Fundamental Breakthrough: The work is a masterful piece of engineering and synthesis, combining ideas from sparse attention, vector databases, and near-data processing. However, it does not introduce a new, fundamental primitive. Its contribution is the clever and effective integration of existing primitives (CXL, PIM for filtering, near-memory acceleration) into a new system configuration.
Questions to Address In Rebuttal
-
The core idea of treating attention as a top-k retrieval problem is gaining traction, with Squeezed Attention [12] being a notable contemporary work. Your paper states that prior work supports a "fixed long context" (Section 4, page 4). Can you elaborate further on why standard ANNS methods are insufficient for the dynamic KV cache and how DReX's architectural features (specifically, the index-free filtering) are uniquely essential to overcoming this limitation? A more direct comparison would strengthen the novelty claim.
-
Your co-design is deeply tied to the specific architecture of DReX [34], including its two-stage filtering/scoring pipeline and PFU design. To what extent is the LongSight framework generalizable? Could the core principles be applied to other near-data or processing-in-memory architectures that may not feature the exact sign-concordance filtering mechanism? Or is the success of LongSight entirely contingent on the unique properties of DReX?
-
The use of Iterative Quantization (ITQ) [7] is presented as a key enabler for filter efficiency (Section 5.4, page 6). While effective, ITQ is a known technique for improving quantization performance. Is there any novelty in how ITQ is applied or integrated within the LongSight framework, or is its application here standard practice?
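For readers weighing this question, the standard ITQ procedure of Gong and Lazebnik is small enough to state directly. The NumPy sketch below alternates between binarizing the rotated data and re-solving an orthogonal Procrustes problem for the rotation; nothing here is specific to LongSight, and parameter names are illustrative. It is the textbook technique whose integration the question asks the authors to position.

```python
import numpy as np

def itq_rotation(v, n_iter=50, seed=0):
    # Iterative Quantization: learn an orthogonal rotation R that minimizes
    # the quantization loss ||sign(vR) - vR||_F, so sign bits preserve more
    # of the geometry. `v` is assumed to be zero-centered (and typically
    # PCA-projected) data of shape (n_points, n_bits).
    rng = np.random.default_rng(seed)
    c = v.shape[1]
    r, _ = np.linalg.qr(rng.standard_normal((c, c)))  # random orthogonal init
    for _ in range(n_iter):
        b = np.sign(v @ r)                  # fix R, update binary codes
        u, _, wt = np.linalg.svd(v.T @ b)   # fix B, orthogonal Procrustes step
        r = u @ wt
    return r

# Usage: binarize keys with the learned rotation before sign-bit filtering.
# codes = (centered_keys @ itq_rotation(centered_keys)) > 0
```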
ComPASS: A Compatible PIM Protocol Architecture and Scheduling Solution for Processor-PIM Collaboration
Abstract
With growing demands from memory-bound applications, Processing-In-Memory (PIM) architectures have emerged as a promising way to reduce data movement. However, existing PIM designs face challenges in compatibility and efficiency due to limited command/...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose ComPASS, a system aimed at improving the integration of Processing-In-Memory (PIM) devices into general-purpose systems. The solution consists of three main components: 1) a new memory command, PIM-ACT, intended to provide a compatible interface for initiating PIM operations across different architectures; 2) a PIM request generator within the memory controller to offload command generation from the host CPU; and 3) two scheduling policies, Static (ST-BLC) and Adaptive (AT-BLC) Throughput Balancers, to manage concurrent PIM and conventional memory requests. The evaluation, conducted via simulation, claims that the proposed system achieves high PIM performance (up to 10.75× GEMV speedup over non-PIM) while successfully co-existing with memory-intensive CPU workloads.
Strengths
- The paper addresses a significant and timely problem in computer architecture: the practical integration of PIM hardware. The challenges of protocol incompatibility and resource scheduling are indeed major barriers to adoption.
- The proposed solution is comprehensive, considering the protocol layer (PIM-ACT), hardware support (request generator), and system-level scheduling (ST-BLC/AT-BLC).
- The evaluation of the scheduling policies, particularly the demonstration in Figure 8 that AT-BLC can meet PIM Quality-of-Service (QoS) targets where other policies fail, presents a compelling case for the adaptive approach.
Weaknesses
My analysis reveals several significant weaknesses that undermine the paper's core claims of compatibility, practicality, and rigor.
-
The Claim of a "Compatible" Protocol is Overstated and Misleading. The central premise of a unified
PIM-ACTcommand is fundamentally weakened by the necessity of the "Architecture-Aware Optimization" (AAO) mechanism described in Section 4.5 (Page 5). The paper claimsPIM-ACTallows "different PIM devices to communicate with the host using the same PIM-ACT interface, ensuring compatibility." However, AAO requires the memory controller to load device-specific timing parameters, bank activation granularities, and command semantics from the SPD. This means the controller is not interacting with a unified interface; it is interacting with a configurable one that requires explicit, a priori knowledge of the specific PIM device it is controlling. This is a crucial distinction. The complexity is not removed; it is merely shifted to SPD tables and the MC's interpretation logic. This approach seems far from the "drop-in" compatibility implied. -
The Foundation of the Adaptive Scheduler (AT-BLC) Relies on an Unjustified Assumption. The paper's strongest result, the AT-BLC scheduler, is critically dependent on a "target completion time (T)" which "must be initialized before the PIM operation begins" (Section 5.4, Page 8). The paper provides no details on how this target T is determined. Is it provided by a compiler? A runtime profiler? An oracle? The feasibility and robustness of AT-BLC are entirely contingent on the accuracy of this prediction. The authors fail to analyze the sensitivity of their mechanism to mispredictions in T. A scheduler that only works with perfect future knowledge is of limited practical value. This omission represents a major logical flaw in the evaluation of the paper's primary scheduling contribution.
Key Overheads are Ignored or Dismissed Without Evidence.
- Host CPU Offload: The paper claims the PIM request generator alleviates the burden on host processor cores (Section 4.2, Page 4), but this crucial benefit is never quantified. There is no measurement of CPU cycles saved or utilization reduced. Without this data, the contribution of the request generator is unsubstantiated.
- Memory Management: The authors propose using HugeTLB for contiguous memory allocation (Section 4.6, Page 6). They acknowledge that memory compaction may be required if huge pages are fragmented but dismiss the overhead by claiming it is "amortized" for LLM inference. This is an unsupported assertion. For systems under high memory pressure or with different workload characteristics, compaction can induce significant, non-trivial latency spikes. The lack of any quantitative analysis on this front is a serious methodological weakness.
-
Performance Claims Lack Critical Nuance. The PIM-only evaluation in Section 7.1 (Page 9, Figure 7) shows that for GDDR6-AiM, PIM-ACT results in a performance regression for GEMV, even with AAO enabled (0.59% slower). For LPDDR-AiM, the regression is even larger (2.84% slower). To conclude from this data that the protocol "maintains performance comparable to... device-specific protocols" is an oversimplification. Furthermore, the system performance gain attributed to AAO is shown to be only 1.2-1.5% (Section 7.3, Page 12, Figure 10b). This marginal improvement calls into question whether the added complexity of the AAO mechanism is justified.
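To illustrate the configurability point raised in the first weakness above, the sketch below shows the kind of per-device descriptor a memory controller would need before it could interpret a "unified" PIM-ACT command. The field names and numbers are invented for illustration and are not taken from the paper or from any SPD standard.

```python
from dataclasses import dataclass

@dataclass
class PimDeviceProfile:
    # Per-device parameters the controller must know before interpreting a
    # PIM-ACT command. Field names and values are invented for illustration;
    # the paper loads analogous information from the SPD.
    bank_activation_granularity: int   # banks activated per PIM-ACT
    pim_act_latency_cycles: int        # device-specific timing
    optype_semantics: dict             # how each optype is executed

# The controller keys this table by device identity read out at boot, which
# is what makes the interface configurable rather than truly unified.
SPD_PROFILES = {
    "hbm_pim_like":   PimDeviceProfile(16, 40, {0x01: "bank-parallel MAC"}),
    "gddr6_aim_like": PimDeviceProfile(4, 55, {0x01: "per-channel GEMV step"}),
}

def issue_pim_act(device_id, optype):
    p = SPD_PROFILES[device_id]
    # A "unified" command still resolves to device-specific behavior here.
    return {"banks": p.bank_activation_granularity,
            "latency": p.pim_act_latency_cycles,
            "action": p.optype_semantics.get(optype, "undefined")}
```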
Questions to Address In Rebuttal
The authors must address the following points directly:
-
On Compatibility: Please reconcile the claim of a "compatible" and "unified" protocol with the explicit requirement for the memory controller to parse and implement device-specific behaviors via the AAO mechanism. Is "configurable" not a more accurate description? If so, how does this meaningfully reduce integration complexity compared to existing device-specific controller modifications?
-
On the AT-BLC Scheduler: The functionality of AT-BLC hinges entirely on the pre-supplied target time T.
- a) What specific mechanism is proposed to calculate T in a real system?
- b) Provide a sensitivity analysis showing how AT-BLC's performance (both PIM QoS and CPU throughput) degrades when T is mispredicted by ±10%, ±25%, and ±50%.
-
On Overheads:
- a) Provide quantitative data demonstrating the reduction in host CPU utilization as a direct result of the PIM request generator. Compare a system with the generator to one without, where the host CPU issues all micro-requests.
- b) What is the measured performance impact (e.g., tail latency for CPU requests, delay in PIM task initiation) of a memory compaction event triggered to create a huge page for a PIM workload?
-
On Performance Regressions: Please justify the claim that PIM-ACT offers "comparable" performance to native protocols when your own data (Figure 7) shows clear performance regressions for GEMV on AiM-based architectures. At what threshold does a performance loss become significant enough to disqualify a "compatible" claim?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
The paper presents ComPASS, a comprehensive solution aimed at solving the critical system-level integration challenges of Processing-In-Memory (PIM). The core contribution is a two-pronged approach that addresses both the hardware interface and the performance management aspects of PIM in general-purpose systems. The first prong is a compatible PIM protocol (PIM-ACT) and a memory controller extension (PIM request generator) designed to create a unified, flexible hardware interface that can support diverse PIM architectures (like HBM-PIM and GDDR6-AiM) while adhering to existing DRAM standards. The second prong is a set of scheduling policies, culminating in an adaptive throughput balancer (AT-BLC), to intelligently manage memory bus contention between PIM operations and conventional host CPU memory requests, ensuring that PIM meets its performance targets (QoS) without starving CPU applications.
Essentially, this work is not about inventing a new PIM device but rather about designing the standardized "plumbing" and "traffic control" necessary to make the burgeoning ecosystem of PIM devices a practical and integrated part of modern computer systems.
Strengths
-
Addresses a Critical and Timely Problem: The field has demonstrated the potential of PIM with several real-world devices. The primary bottleneck to widespread adoption is no longer the feasibility of in-memory computation itself, but the challenge of system integration. This paper correctly identifies the core obstacles—protocol incompatibility and resource contention—and proposes a holistic solution. It shifts the conversation from "Can we build PIM?" to "How do we use PIM effectively in a heterogeneous system?" This is exactly the direction the field needs to move in.
-
Pragmatic and Elegant Protocol Design: The PIM-ACT command is a clever and practical solution to the limited command space problem. By leveraging existing RFU commands or unused bits in commands like NOP (Section 4.1, Figure 3, page 4), it avoids the need for a radical departure from JEDEC standards. The inclusion of an optype field is the key to its power, creating a flexible abstraction layer. This allows device manufacturers to innovate on their specific PIM architectures while communicating with the host through a single, standardized command. The concept of Architecture-Aware Optimization (AAO) (Section 4.5, page 5), where device-specific timing or bank-grouping information is loaded from SPD, is an excellent mechanism for supporting this diversity without sacrificing compatibility. (A small encoding sketch after this list shows one way such a field can be packed into spare command bits.)
Holistic System-Level Perspective: The authors recognize that a protocol alone is insufficient. The tight coupling of the PIM-ACT protocol with the PIM request generator in the memory controller and the AT-BLC scheduler demonstrates a strong system-level understanding. The request generator offloads the host CPU from the tedious task of issuing micro-operations, connecting it to a long line of work on "macro instructions" (e.g., TRiM [48], AESPA [27]). The AT-BLC scheduler directly addresses the inevitable performance interference, transforming PIM from a disruptive accelerator into a well-behaved citizen in the memory subsystem.
Strong Connection to Real-World Architectures: The work is well-grounded by demonstrating how ComPASS can be applied to existing commercial PIM architectures like Samsung's HBM-PIM and SK Hynix's GDDR6-AiM (Section 4.7, Table 1, page 7). This case study is crucial as it proves the proposal is not merely a theoretical exercise but a viable path toward unifying disparate industry efforts under a common framework.
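The sketch below shows one way a PIM-ACT-style command could be packed into otherwise unused command bits. The 6-bit optype width is taken from the reviews' later discussion, but the overall layout (bank-mask width, field positions) is invented purely for illustration and is not the paper's actual encoding.

```python
# Hypothetical PIM-ACT packing: a 6-bit optype plus an 8-bit bank mask folded
# into reserved command bits. Field positions are illustrative only.
OPTYPE_BITS = 6
BANK_MASK_BITS = 8

def encode_pim_act(optype: int, bank_mask: int) -> int:
    assert 0 <= optype < (1 << OPTYPE_BITS)
    assert 0 <= bank_mask < (1 << BANK_MASK_BITS)
    return (optype << BANK_MASK_BITS) | bank_mask

def decode_pim_act(word: int) -> tuple[int, int]:
    return word >> BANK_MASK_BITS, word & ((1 << BANK_MASK_BITS) - 1)

# Example: optype 0x05 targeting banks 0-3.
word = encode_pim_act(0x05, 0b00001111)
assert decode_pim_act(word) == (0x05, 0b00001111)
```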
Weaknesses
-
Hardware Complexity and Overhead are Underexplored: The paper proposes adding a non-trivial "PIM request generator" and additional scheduling logic to the memory controller. While conceptually sound, the potential cost in terms of die area, power consumption, and design complexity is not quantified. For a solution targeting practical adoption, understanding this overhead is crucial. A simple analysis of the storage requirements (request/data buffers) and decoding logic would strengthen the proposal significantly.
-
The Software-Hardware Interface is Abstract: The AT-BLC scheduler relies on the host providing the target completion time (T) and total number of micro-requests (W) for a PIM operation. The paper briefly mentions this is handled by the OS and PIM libraries (Section 4.4, page 5), but this interface is a critical and complex part of the system. How are these values accurately estimated by the runtime? What is the mechanism for passing them to the memory controller? What happens when these predictions are inaccurate? The robustness of the adaptive scheduler depends heavily on the quality of these inputs, and this link feels underdeveloped.
Limited Applicability of the Scheduling Model: The evaluation focuses on large, monolithic GEMV operations typical of LLM inference. In this context, a pre-calculated W and T is plausible. However, the future of PIM may include more diverse workloads, such as graph analytics or database queries, which feature more irregular, data-dependent memory access patterns and potentially many smaller, concurrent PIM tasks. It is unclear how well the AT-BLC model would adapt to scenarios where PIM execution is less predictable and cannot be easily characterized by a single W and T pair.
Questions to Address In Rebuttal
-
Could the authors provide an estimate, even if high-level, of the hardware overhead (e.g., buffer sizes in KB, gate count estimate for logic) introduced by the PIM request generator and the additional scheduling queues in the memory controller?
-
Could you please elaborate on the software stack's responsibility for the AT-BLC? Specifically, how does the runtime or driver determine the W and T parameters for a given PIM task, and what is the proposed hardware mechanism for communicating this information to the memory controller?
The proposed adaptive scheduler (AT-BLC) appears well-suited for predictable, throughput-oriented PIM workloads like GEMV. Could you discuss how the ComPASS framework, and particularly the AT-BLC, might be extended or adapted to support more dynamic and latency-sensitive PIM workloads with less predictable execution times?
-
The Architecture-Aware Optimization (AAO) concept is very compelling as a way to future-proof the protocol. Do you envision this information being communicated solely through a static mechanism like SPD, or could there be a more dynamic, runtime registration process for new PIM device capabilities?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors propose ComPASS, a solution aimed at improving the compatibility and efficiency of integrating Processing-In-Memory (PIM) devices into general-purpose systems. The work identifies two primary challenges: the lack of a standardized PIM protocol and the difficulty of scheduling PIM and conventional memory requests concurrently. The proposed solution consists of three core components:
- PIM-ACT: A new, unified memory command that leverages unused command space (RFU or NOP bits) in existing DRAM standards to trigger multi-bank PIM operations. An "optype" field within the command allows it to be adapted to different PIM architectures.
- PIM Request Generator: A hardware unit within the memory controller that offloads the host CPU by receiving high-level "macro requests" and decomposing them into a sequence of low-level "micro requests" for PIM execution.
- Static and Adaptive Schedulers (ST-BLC and AT-BLC): Scheduling policies that manage the interleaving of PIM and non-PIM requests. While the static balancer uses a fixed threshold, the adaptive balancer dynamically adjusts scheduling priorities based on whether the PIM workload is meeting its Quality-of-Service (QoS) targets.
The central claim is that this combination of a compatible protocol and a QoS-aware scheduler provides a novel and effective solution for processor-PIM collaboration.
Strengths
The primary strength of this work lies in the synthesis of its components to address a practical and timely problem. While the individual concepts have precedents, their integration into a cohesive system architecture designed for compatibility is commendable.
The most novel element is the Adaptive Throughput Balancer (AT-BLC). The concept of a feedback loop where PIM execution progress (Wcur/Tcur vs. W/T') directly influences the memory scheduler's behavior (N) appears to be a new contribution in the context of PIM scheduling. This elevates the work beyond a simple static priority scheme.
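Read literally, the described feedback loop is simple enough to sketch. The toy scheduler below mirrors the review's W, T, and N in its variable names, but the +/-1 update rule and the bounds are guesses made for illustration, not the paper's policy.

```python
def at_blc_interleave(w_total, t_target, w_cur, t_cur, n_cur,
                      n_min=0, n_max=8):
    # Toy adaptive throughput balancer: compare observed PIM progress
    # (w_cur / t_cur) against the pace needed to finish w_total micro-requests
    # by t_target, then adjust N, the number of conventional requests
    # interleaved per PIM batch. The update rule is illustrative only.
    required_rate = w_total / t_target
    observed_rate = w_cur / max(t_cur, 1)
    if observed_rate < required_rate:      # PIM is behind its QoS target
        return max(n_min, n_cur - 1)       # give PIM more bus time
    return min(n_max, n_cur + 1)           # PIM is ahead; favor CPU traffic

# Example: halfway through the time budget but only a third of the work done.
n = at_blc_interleave(w_total=1_000_000, t_target=2_000_000,
                      w_cur=330_000, t_cur=1_000_000, n_cur=4)
assert n == 3
```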
The architectural decision to place the PIM request generator inside the memory controller (Section 4.2, Page 4), rather than within the PIM device itself, is a well-reasoned choice. It correctly identifies the limitation of prior work (e.g., TRiM [48], Darwin [32]) where an external generator obscures bank state from the MC, thereby preventing efficient request interleaving. This specific architectural delta is a key enabler for the proposed scheduling policies.
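The "macro instruction" idea amounts to expanding one high-level descriptor into a stream of per-bank commands inside the controller. A rough sketch follows; the macro-request fields and the micro-command format are invented for illustration and do not reproduce the paper's interface in Figure 4c.

```python
from dataclasses import dataclass
from typing import Iterator

@dataclass
class MacroRequest:
    # Illustrative macro-request fields: which operation to run, over how many
    # rows of operands, spread across how many PIM banks.
    optype: int
    base_row: int
    n_rows: int
    n_banks: int

def expand_macro(req: MacroRequest) -> Iterator[dict]:
    # The in-controller generator emits one micro request (a PIM-ACT plus its
    # row address) per bank per row, so the host CPU never issues the
    # low-level command stream itself. Because expansion happens inside the
    # memory controller, the scheduler still sees per-bank state and can
    # interleave conventional requests between these commands.
    for row in range(req.base_row, req.base_row + req.n_rows):
        for bank in range(req.n_banks):
            yield {"cmd": "PIM-ACT", "optype": req.optype,
                   "bank": bank, "row": row}

micro = list(expand_macro(MacroRequest(optype=0x01, base_row=0,
                                       n_rows=2, n_banks=4)))
assert len(micro) == 8
```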
Weaknesses
While the paper presents a well-engineered system, its core ideas, when deconstructed, are largely incremental advancements or applications of known principles from other domains. My primary concern is the degree of fundamental novelty.
-
The PIM-ACT Command: The idea of using reserved/unused command bits to introduce new functionality is a standard industry practice, not a novel academic concept. The true contribution here is the proposed protocol that uses the command, not the mechanism of the command itself. Furthermore, both HBM-PIM and GDDR6-AiM already have mechanisms to trigger multi-bank operations. ComPASS's proposal is a unifying abstraction layer, which is a valuable engineering contribution but a limited conceptual leap. The authors themselves demonstrate in Table 1 (Page 7) that PIM-ACT primarily serves as a wrapper for existing PIM device functionality.
-
The PIM Request Generator: The concept of a hardware unit that translates macro-instructions into micro-operations is not new. This is functionally analogous to DMA controllers, command queue (CQ) mechanisms in NVMe, or instruction decoders in other types of accelerators. As noted in the paper's own related work section (Section 8, Page 12), works like TRiM [48] and Darwin [32] have already proposed instruction generators for PIM. The novelty of ComPASS is limited to the placement of this generator within the MC to enable better scheduling, which, while important, is an architectural refinement rather than a new paradigm.
-
The Static Throughput Balancer (ST-BLC): This is a classic threshold-based or round-robin arbitration policy. The paper acknowledges a recent work [15] that is "conceptually similar to our ST-BLC" (Section 8, Page 12). This suggests that the static scheduling approach is not novel. The paper's main contribution in scheduling is therefore entirely dependent on the adaptive nature of AT-BLC.
In summary, the novelty of this paper is not in the invention of new foundational concepts, but in the clever integration and adaptation of existing ones to the specific problem of PIM-CPU co-execution. The contribution is more of an elegant system design than a fundamental breakthrough.
Questions to Address In Rebuttal
The authors should use the rebuttal to more sharply define the novelty of their contributions against the closest prior art.
-
Regarding the Static Throughput Balancer (ST-BLC), the paper concedes it is "conceptually similar" to the virtual channel-based scheduler in [15]. Can the authors elaborate on the novel delta between ST-BLC and this prior work? If the novelty is minimal, the paper's contribution should be more narrowly framed around the adaptive policy (AT-BLC).
-
The novelty of the PIM Request Generator rests on its placement in the MC. While this enables scheduling, prior works like TRiM [48] also used macro instructions. Beyond location, is there any novelty in the macro-request interface itself (Figure 4c, Page 5) or the decoding logic that allows it to be more general-purpose than prior PIM-specific generators?
-
The PIM-ACT protocol's claim to compatibility is evaluated on two similar GEMV-focused accelerators (HBM-PIM and GDDR6-AiM). How would this protocol generalize to a fundamentally different PIM architecture, such as the general-purpose RISC-based cores in UPMEM-PIM [10]? Would the proposed 6-bit optype and the macro-request format be sufficient to express the diverse instruction set of such a device without becoming a bottleneck or requiring an impractically large optype mapping? Please justify the claimed generality of the protocol beyond the domain of neural network accelerators.
PIM-CCA: An Efficient PIM Architecture with Optimized Integration of Configurable Functional Units
Abstract
Processing-in-Memory (PIM) is a promising architecture for alleviating data movement bottlenecks by performing computations closer to memory. However, PIM workloads often encounter computational bottlenecks within the PIM itself. As these workloads become ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose PIM-CCA, an architecture that integrates a Configurable Compute Accelerator (CCA) into a Processing-in-Memory (PIM) processor, modeled after the UPMEM DPU. The stated goal is to alleviate computational bottlenecks that emerge in PIM systems when memory access latency is reduced. They present a compiler-driven approach to identify and offload "hot code" regions—primarily multiplication-based sequences—to a small, specialized CCA. The evaluation, conducted on a cycle-accurate simulator, claims significant performance gains (up to 1.55x) with what is described as minimal hardware overhead (0.036%). The work also analyzes the relationship between tasklet count and performance in this new architecture.
Strengths
-
Problem Motivation: The paper correctly identifies a critical issue in PIM systems: the performance bottleneck can shift from memory access to computation once workloads are moved closer to memory (Section 2.3). This observation is valid and provides a solid foundation for the work.
-
Grounded Simulation: The use of uPIMulator, a simulator based on a commercially available PIM system (UPMEM), provides a realistic baseline for evaluation (Section 2.1). This is preferable to purely abstract models.
-
Compiler-Architecture Co-design: The authors recognize that a hardware accelerator is ineffective without corresponding compiler support. The inclusion of a compiler flow to detect and replace code sections is a necessary component of the proposed solution (Section 3.3.3).
Weaknesses
My primary concerns with this paper relate to the generalizability of its claims, the transparency of its overhead analysis, and the rigor of its evaluation methodology.
-
Over-specialization to a Weak Baseline: The entire premise appears to be built upon the specific and significant weakness of the baseline UPMEM DPU: its extremely inefficient multi-cycle integer multiplication implemented via mul_step instructions (Section 3.2, page 6). The impressive 1.55x speedup seems less a testament to the general utility of a CCA and more a result of patching this one specific flaw. The "hot codes" identified are dominated by operations that are slow on the baseline (Figure 4a). The claims of broad applicability to PIM are unsubstantiated, as other PIM architectures (e.g., HBM-PIM) may not share this particular bottleneck.
Misleading Hardware Overhead Metric: The reported 0.036% area overhead is highly suspect (Section 4.6, page 12). This figure is almost certainly calculated against the total area of the entire PIM chip, which is overwhelmingly dominated by the DRAM arrays themselves. This metric obscures the true cost of the modification. The overhead should be reported relative to the area of the DPU's processor logic, which would provide a far more honest assessment of the design's complexity and cost. Without this, the claim of "minimal overhead" is not credible.
-
Convenient Benchmark Exclusion: The exclusion of key benchmarks like BFS and SpMV is a major red flag (Section 4.1, page 10). The justification given is "limitations in the instruction set supported by our baseline simulator." This is insufficient. These benchmarks are characterized by irregular memory access patterns and different computational kernels than the dense linear algebra workloads that dominate the successful results. Their exclusion raises serious doubts about the robustness of the compiler and the applicability of the CCA beyond simple, regular arithmetic patterns. The work is therefore evaluated on a cherry-picked set of benchmarks that are predisposed to benefit from the proposed accelerator.
-
Limited "Configurability" and Compiler Fragility: The "Configurable" Compute Accelerator appears to be little more than a set of three hard-wired custom function units for multiplication, accumulation, and max (Table 1, page 10). This is not what is typically understood as a reconfigurable fabric (like a CGRA). Furthermore, the compiler's hot code detection mechanism appears to be a simple pattern-matching scheme based on predefined templates (Figure 10, Algorithm 1). It is unclear how this would scale to more complex code structures or identify opportunities that do not exactly match the pre-canned patterns. The robustness of this compiler is unproven.
-
Contradictory Statements: In Section 4.1, the paper states that the Needleman-Wunsch (NW) benchmark is unavailable in the simulator environment. However, the max operation, explicitly identified for NW, is included as a core CCA function (CCA code 0x2 in Table 1) and is designed into the hardware (Figure 7). Why design and include hardware for a benchmark you cannot run or evaluate? This internal inconsistency undermines confidence in the methodology.
Questions to Address In Rebuttal
The authors must address the following points to make this work acceptable:
-
Area Overhead: Please clarify the 0.036% area overhead claim. What is the total area used as the denominator for this calculation? Please provide the overhead as a percentage of the DPU's non-memory logic area to provide a more transparent comparison.
-
Baseline Dependency: The baseline UPMEM architecture's multi-cycle integer multiplication is a critical performance bottleneck. Could the reported speedups be primarily attributed to fixing this specific weakness, rather than a generalizable benefit of the CCA approach? How would PIM-CCA perform against a baseline PIM processor with a more reasonable, pipelined single-cycle integer multiplier?
-
Benchmark Exclusion: Provide a detailed technical justification for the exclusion of BFS and SpMV. Do these workloads contain computational patterns that the PIM-CCA compiler cannot identify or that the CCA hardware cannot accelerate? The absence of these key benchmarks casts doubt on the general applicability of the proposed solution.
-
Compiler Limitations: The compiler appears to rely on pattern matching for specific arithmetic sequences (Figure 10). What is the coverage of this approach? How does it handle hot code regions with complex control flow or patterns not pre-defined in the "logic palette"?
-
On the "Configurable" Claim: The CCA is configured for only three operation types. This seems more like a set of co-processors than a truly "configurable" accelerator. Please comment on the design's extensibility and the process for adding new, more complex functional units beyond the ones presented.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper, "PIM-CCA: An Efficient PIM Architecture with Optimized Integration of Configurable Functional Units," addresses an important, second-order problem emerging from the success of Processing-in-Memory (PIM) architectures: the creation of new computational bottlenecks within the memory device itself. The authors astutely observe that by alleviating the data movement bottleneck, many memory-intensive workloads become compute-bound on PIM's inherently resource-constrained processors (DPUs).
The core contribution is a holistic, co-designed solution that integrates a lightweight Configurable Compute Accelerator (CCA) into the PIM processor's pipeline. This is not merely a hardware proposal; it is a full system concept supported by a PIM-aware compiler that identifies and offloads hot computational subgraphs, and an analysis of the interplay between the accelerator and PIM's native task-based parallelism. The authors demonstrate through simulation that their PIM-CCA design can achieve up to a 1.55x performance improvement on compute-intensive kernels with a negligible hardware area overhead (0.036%), making it a practical proposal for real-world PIM systems.
Strengths
This is a well-conceived and timely piece of work that makes several strong contributions to the field.
-
Excellent Problem Formulation and Motivation: The paper's primary strength lies in its identification and clear articulation of a critical, next-generation problem for PIM. The analysis in Section 2.3 (page 4), particularly Figure 3, which shows the shift in workload characteristics from memory-bound to compute-bound once inside PIM, is insightful and provides a compelling motivation for the entire paper. This work is not solving a contrived problem; it is looking ahead at the natural evolution of PIM architectures and their limitations.
-
Elegant Synthesis of Architectural Concepts: This work sits at a valuable intersection of three major research areas: Processing-in-Memory, reconfigurable computing, and compiler-architecture co-design. Rather than inventing a new mechanism from scratch, the authors have skillfully applied the mature concept of a compiler-driven Configurable Compute Accelerator (CCA)—originally proposed for embedded systems—to the unique, resource-starved environment of a PIM processor. This cross-pollination of ideas is a powerful research methodology, and the result is an architecture that feels both novel in its application and grounded in established principles.
-
A Holistic, System-Level Perspective: The authors should be commended for not limiting their scope to just a hardware block. The proposal is a complete system. They have considered:
- The Hardware: A carefully pipelined CCA logic that respects the tight constraints of the baseline DPU (Section 3.3.1, page 7).
- The ISA: A minimal and compatible instruction set extension (cca and cca_move) that integrates cleanly into the existing pipeline (Section 3.3.2, page 7).
- The Compiler: A mechanism for automatically identifying and replacing hot code regions, which is essential for programmability and usability (Section 3.3.3, page 8).
- The Parallelism Model: An insightful analysis of how the CCA changes the optimal number of software tasklets, showing a deep understanding of the system's performance dynamics (Section 3.4, page 9).
-
Pragmatism and Realism: The proposed CCA is designed with the severe constraints of memory fabrication in mind. The extremely low reported area overhead (Section 4.6, page 12) makes this a highly practical and believable proposal. By focusing on accelerating a few key instruction patterns (like mul_step), the authors have found a "sweet spot" that delivers significant performance gains without requiring a radical or costly redesign of the PIM processor.
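As a rough illustration of why consolidating the sequence matters, the toy cycle model below compares a multiply expressed as a chain of mul_step-style shift-add iterations against a single fused operation. The step count and cycle costs are assumptions for illustration, not measured UPMEM numbers, and the whole-kernel speedups reported in the paper are much smaller because only part of each kernel is spent in these sequences.

```python
def mac_loop_cycles(n_elems, mul_steps=32, step_cycles=1,
                    cca_mul_cycles=4, add_cycles=1):
    # Toy model: a multiply-accumulate loop where each multiply is either a
    # chain of `mul_steps` shift-add instructions (baseline) or one fused CCA
    # operation. All cycle counts are illustrative assumptions.
    baseline = n_elems * (mul_steps * step_cycles + add_cycles)
    with_cca = n_elems * (cca_mul_cycles + add_cycles)
    return baseline, with_cca

base, cca = mac_loop_cycles(n_elems=1024)
print(f"baseline: {base} cycles, with CCA: {cca} cycles, "
      f"speedup ~{base / cca:.1f}x")   # ~6.6x under these toy assumptions
```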
Weaknesses
While the core idea is strong, the paper could be improved by broadening its contextual analysis and exploring the boundaries of its proposed solution.
-
Limited Scope of Evaluated CCA Operations: The evaluation, while thorough for what it covers, is heavily focused on accelerating multiplication via the mul_step sequence found in the UPMEM architecture (Table 1, page 10). This is certainly the most obvious bottleneck and the correct place to start. However, the paper's broader claim is about a configurable accelerator. The current evaluation makes the CCA feel more like a specialized, fixed-function "multiplication co-processor" rather than a truly flexible unit. The work would be much stronger if it demonstrated how the framework could target other, more diverse computational patterns, even if just through a design study.
Insufficient Discussion of Design-Space Alternatives: The paper rightly argues that simply making the main DPU core more complex is infeasible. However, the CCA is not the only possible solution. A key alternative would be to add a small, fixed SIMD/vector datapath to the DPU. This is a very common architectural pattern for boosting compute throughput. A discussion comparing and contrasting the PIM-CCA approach (with its fine-grained, irregular-pattern matching) against a more traditional SIMD approach (with its regular, data-parallel focus) is missing. This would help to better delineate the specific advantages of the CCA paradigm in the PIM context.
-
Potential Over-Fitting of the Compiler and Logic: The PIM-CCA compiler and the CCA logic palette construction (Section 3.3.4, page 9) appear to be tightly coupled to the hot code patterns identified in the benchmark suite. While this is a valid methodology, it leaves open the question of generality. How would the framework adapt if presented with a new class of workloads with entirely different computational bottlenecks (e.g., bit-level manipulation, cryptography, or complex address calculations)? A more robust discussion of the compiler's ability to discover and map new, unseen patterns would strengthen the paper's claims of flexibility.
Questions to Address In Rebuttal
-
Beyond the multiplication and accumulation patterns evaluated, what other types of computational hot spots common in PIM workloads did you identify? Could you provide an example of a non-MAC-style code region and briefly explain how your compiler and CCA design methodology could be applied to accelerate it?
-
Could you elaborate on why a configurable accelerator (CCA) is a more suitable choice for PIM than a more conventional microarchitectural enhancement like a small, 2 or 4-lane SIMD unit? What specific workloads or computational patterns would be well-served by the CCA that a SIMD unit would handle poorly, and vice-versa?
-
Regarding the compiler framework, what happens when it encounters a hot loop that contains a mix of operations, some of which are mappable to the CCA and some of which are not? Is the compiler capable of partial offloading, or does the entire loop have to remain on the scalar DPU core if a perfect pattern match is not found?
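One way to picture the partial-offloading question: a template matcher over the instruction stream that rewrites only the subsequences it recognizes and leaves everything else on the scalar core, as in the sketch below. The instruction mnemonics and the template are invented; this is not the paper's Figure 10 flow.

```python
# Hypothetical hot-code rewriting: replace a recognized mul_step chain with a
# single `cca` op and keep unmatched instructions on the scalar DPU core.
MUL_CHAIN = ["mul_step"] * 4          # illustrative template, not Figure 10's

def offload_hot_patterns(instrs):
    out, i = [], 0
    while i < len(instrs):
        window = instrs[i:i + len(MUL_CHAIN)]
        if window == MUL_CHAIN:
            out.append("cca 0x0")     # matched region goes to the accelerator
            i += len(MUL_CHAIN)
        else:
            out.append(instrs[i])     # everything else stays on the DPU core
            i += 1
    return out

loop_body = ["ld", "mul_step", "mul_step", "mul_step", "mul_step",
             "add", "branch_if"]      # a mixed loop: only the chain is mappable
print(offload_hot_patterns(loop_body))
# ['ld', 'cca 0x0', 'add', 'branch_if']
```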
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper proposes PIM-CCA, an architecture that integrates a compiler-guided Configurable Compute Accelerator (CCA) into a UPMEM-like Processing-in-Memory (PIM) system. The authors identify that as PIM systems alleviate memory-access bottlenecks, workloads can become compute-bound within the PIM units (DPUs) themselves. The proposed solution is to use a lightweight, reconfigurable CCA to offload these computational "hot code" regions, which are identified by a custom compiler that analyzes the application's dataflow graph. The work also includes an analysis of how this acceleration interacts with the PIM system's software-managed threading (tasklet) model. The primary claims are improved performance (up to 1.55x) with negligible hardware overhead (0.036%).
Strengths
-
Novel Synthesis of Concepts: The primary novel contribution of this paper is the application and adaptation of the Configurable Compute Accelerator concept, originally proposed by Clark et al. [6, 7], to the unique constraints and execution model of a modern, commercially-inspired PIM architecture. While PIM accelerators and CCAs exist independently, their synthesis to solve the emergent computational bottleneck within PIM is a novel research direction.
-
PIM-Specific Hardware Adaptation: The authors demonstrate a clear understanding of the target architecture's limitations. The design of the CCA is not a generic application but is tailored to the PIM environment. The insight to target low-level, multi-cycle instruction sequences (e.g., the mul_step operation on the UPMEM architecture, discussed in Section 3.2, page 6) and consolidate them into a single CCA operation is a well-reasoned, PIM-specific optimization that represents a clear delta over prior CCA work.
End-to-End System Proposal: The work presents a complete system view, including a compiler toolflow (Section 3.3.3, page 8) for pattern identification and code generation, a hardware architecture for the CCA itself (Section 3.3.1, page 6), and an analysis of its interaction with the system's concurrency model (Section 3.4, page 9). This holistic approach strengthens the contribution.
Weaknesses
-
Limited Delta from Foundational Prior Art: The core conceptual machinery is not new. The idea of a transparent, compiler-driven instruction set extension through a reconfigurable datapath is the central thesis of the original CCA papers [6, 7]. Similarly, the use of compiler analysis on dataflow graphs to identify and offload critical subgraphs is a foundational technique in compilation for custom hardware and reconfigurable computing. The paper's novelty rests almost entirely on the target of this existing methodology (PIM), not on a fundamental innovation in the methodology itself.
-
Questionable "Configurability" in Practice: The central premise of a CCA is its configurability to accelerate a diverse set of computational patterns. However, the evaluation relies on a CCA configured for a very limited set of operations: a 4-step multiply, an accumulation, and a maximum (Table 1, page 10). This configuration appears functionally closer to a collection of dedicated Custom Functional Units (CFUs) for multiply-accumulate and max operations rather than a truly "configurable" accelerator. The work does not sufficiently demonstrate that the overhead of the CCA's reconfigurable fabric is justified over simply implementing a few fixed-function units, which would be a far less novel contribution. The power of the general CCA framework seems underutilized and, consequently, its novelty is undermined.
-
Incremental Novelty of the Compiler Heuristic: The proposed "logic palette" construction algorithm (Algorithm 2, page 9) appears to be a straightforward greedy heuristic for mapping pattern graphs to hardware resources to maximize sharing. While necessary for their toolflow, the novelty of this specific algorithm, when compared to decades of prior art in High-Level Synthesis (HLS), logic synthesis, and technology mapping for FPGAs/CGRAs, is not clearly established. It solves a necessary problem for the authors but does not appear to be a standalone novel contribution in compiler or synthesis technology.
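For context on what a "straightforward greedy heuristic ... to maximize sharing" typically looks like, the sketch below merges the functional-unit requirements of several patterns by keeping, for each unit type, the maximum count any single pattern needs (valid when patterns never execute simultaneously). It is a generic resource-sharing baseline offered for illustration, not a reconstruction of the paper's Algorithm 2.

```python
from collections import Counter

def build_shared_palette(patterns):
    # Generic resource-sharing baseline: because the patterns are mutually
    # exclusive in time, the shared palette only needs, per unit type, the
    # maximum count required by any single pattern.
    palette = Counter()
    for ops in patterns:
        need = Counter(ops)
        for unit, count in need.items():
            palette[unit] = max(palette[unit], count)
    return dict(palette)

patterns = [
    ["mul", "mul", "add"],     # e.g., a partial-product + accumulate pattern
    ["add", "max"],            # e.g., a compare/select pattern
]
print(build_shared_palette(patterns))
# {'mul': 2, 'add': 1, 'max': 1}
```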
Questions to Address In Rebuttal
-
The key distinction of a CCA over a set of fixed-function units is its configurability. Given the evaluation uses only three core patterns (multiply, accumulate, max), please clarify the novelty and benefit of the generalizable CCA framework over simply proposing the addition of three dedicated CFUs to the DPU pipeline. What is the hardware overhead (area, wiring complexity) of the CCA's reconfigurability (e.g., the MUX network in Figure 7) compared to hard-wiring these three specific functions?
-
The paper positions itself in the PIM domain. However, other works have proposed co-locating more general-purpose reconfigurable logic (e.g., FPGAs) in the logic layer of 3D-stacked memory or on a DIMM. Please position PIM-CCA more clearly against these alternative forms of reconfigurable near-data processing. What is the fundamental novelty of the CCA model in this context compared to a small, near-memory FPGA or CGRA?
-
To substantiate the claim of a novel and generalizable framework, could the authors provide an example of a more complex or irregular computational pattern from a different application domain (e.g., bioinformatics, cryptography) that their PIM-CCA compiler (Figure 10, page 8) can successfully identify and for which the logic palette framework can generate an efficient CCA configuration? This would more convincingly demonstrate that the contribution is a new framework and not just a bespoke solution for accelerating GEMV-like kernels.
3D-PATH: A Hierarchy LUT Processing-in-memory Accelerator with Thermal-aware Hybrid Bonding Integration
Abstract
LUT-based processing-in-memory (PIM) architectures enable general-purpose in-situ computing by retrieving precomputed results. However, they suffer from limited computing precision, redundancy, and high latency of off-table access. To address these ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present 3D-PATH, a Processing-in-Memory (PIM) accelerator that utilizes 3D hybrid bonding to couple a DRAM die with a logic die. The core idea is a hierarchical Look-Up Table (LUT) system where a large DRAM-LUT stores precomputed results and a smaller, logic-based "fast-LUT" (implemented as a CAM) is used to accelerate searches and handle sparsity. The paper claims three main contributions: 1) A hierarchical, sparse-aware LUT architecture; 2) An efficient, multiplier-free method for floating-point (FP) computation; and 3) A "thermal-aware" hardware design to mitigate the thermal challenges of 3D stacking. While the paper addresses a relevant problem space, the methodology contains significant flaws, the evaluation relies on a weak baseline, and several core technical descriptions are either unclear or seemingly contradictory.
Strengths
- Circuit-Level Rigor: The design and analysis of the custom circuit components, specifically the 9T CAM cell for the fast-LUT and the self-throttling sense amplifier, are thorough. The use of post-layout simulations with Monte Carlo analysis (Section 7.2.1, page 10, Fig. 12) demonstrates robustness at the cell level.
- Detailed Thermal Modeling: The paper employs a professional tool (ANSYS Fluent) for thermal modeling (Section 6, page 7), providing a detailed analysis of heat dissipation under various cooling scenarios (Section 7.1, page 8). This is a commendable level of detail for an architecture paper.
- Problem Identification: The authors correctly identify key challenges in existing LUT-PIM architectures, namely limited precision, storage redundancy, and inefficient off-table access (Section 1, page 1).
Weaknesses
-
Fundamentally Unclear Floating-Point Implementation: The description of the floating-point multiplication mechanism (Section 4.4.2, page 5) is critically flawed. The authors claim a "lossless transformation" where the multiplication of mantissas (M_IN × M_W) is handled by a LUT. However, the paper states, "The LUT contains precomputed products of the input mantissa and the full-precision weight." This is a logical contradiction. The input mantissa (M_IN) is a dynamic value determined at runtime; one cannot precompute a LUT for all possible combinations of dynamic inputs and stored weights without an astronomically large table. The paper fails to specify if mantissa slicing or another approximation is used, which would invalidate the "lossless" claim. Without a clear and viable explanation of this core mechanism, all reported FP performance results are unsubstantiated.
Use of an Unjustified "Analytical Baseline": The primary baseline for comparison, "3D-base," is described as an "analytical baseline model" (Section 'Baseline', page 8). Its performance characteristics are derived from "prior 3D design studies," not from a concrete implementation or even a cycle-accurate simulation. This is unacceptable for a rigorous comparison. The reported performance gains over this baseline (e.g., 1.68× on AI workloads, Fig. 16) are rendered meaningless, as the baseline can be arbitrarily defined to inflate the benefits of the proposed work. This invalidates a major portion of the claimed performance improvements.
-
Overstated "Thermal-Aware" Contribution: The "thermal-aware hardware" (Section 5, page 5) consists primarily of two circuit-level power reduction techniques: a sign-magnitude adder to reduce bit toggling and a self-throttling sense amplifier that power-gates on a mismatch. While these are valid optimizations for power, they do not constitute a "thermal management" system. A true thermal-aware system typically involves temperature sensors and a dynamic policy (e.g., DVFS) to manage heat globally. The authors' design is reactive and localized; framing it as a comprehensive thermal solution is a significant overstatement. The solution is power-saving, not thermal-aware in the conventional sense.
-
Crucial Architectural Details are Missing: The paper omits several details essential for evaluating the architecture's viability:
- Fast-LUT Miss Policy: The entire performance model hinges on the fast-LUT. The authors provide no information on how a miss in the fast-LUT is handled. What is the performance penalty? Does it require a full scan or access to a different data structure in DRAM? This is a critical oversight.
- Fast-LUT Hit Rate: Despite the fast-LUT being central to the sparse-aware computation, the authors provide no data on its hit rates for the evaluated benchmarks. Without this data, it is impossible to assess its effectiveness.
- LUT Updates: The paper focuses exclusively on LUT reads. How are the DRAM and fast-LUTs populated and updated? DRAM writes are slow and power-intensive, which could be a major system bottleneck not accounted for in the evaluation.
-
Unfair GPU Baseline Comparison: For sparse workloads like ResNet-50 and LSS, the authors compare their sparse-aware accelerator against a GPU (Fig. 16). It is not stated whether the GPU baseline utilizes sparsity-aware optimizations (e.g., NVIDIA's cuSPARSE library, structured sparsity). If the comparison is against a dense GEMM implementation on the GPU, the reported speedups are misleading and simply reflect the benefits of sparsity itself, not necessarily the superiority of the proposed architecture.
Questions to Address In Rebuttal
-
Regarding the FP LUT (Section 4.4.2): Please provide a precise explanation of how the LUT for mantissa multiplication works.
- How can it be precomputed if the input mantissa is dynamic?
- If mantissa slicing is used, what is the bit-width of the slice, what is the resulting table size, and what is the impact on precision? Justify the "lossless" claim.
-
Regarding the "3D-base" Baseline (page 8): Please provide a detailed specification of the analytical model.
- What specific assumptions were made regarding its compute units, on-chip network, memory access latency, and power consumption?
- Justify why a simulated RTL or cycle-accurate model of a conventional 3D accelerator was not used as a more rigorous baseline.
-
Regarding the Fast-LUT (Section 4.2):
- Describe the full pipeline for a fast-LUT miss and quantify the associated performance penalty in cycles.
- Provide the measured hit rates for the fast-LUT across all evaluated AI benchmarks (ResNet-50, BERT, LSS, etc.).
-
Regarding Thermal Management (Section 5):
- Justify the use of the term "thermal-aware hardware." Does the design include any form of temperature sensing or global thermal feedback loop? If not, please re-frame the contribution as a power-efficiency optimization.
-
Regarding GPU Comparisons (Fig. 16):
- Clarify whether the GPU baseline for sparse models was configured to use hardware and software support for sparse matrix computation. If not, the comparison must be re-evaluated against a proper sparse-aware baseline.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents 3D-PATH, a novel processing-in-memory (PIM) architecture that synergistically combines three key technology trends: LUT-based computation, 3D hybrid bonding integration, and thermal-aware circuit co-design. The core contribution is the insight that the ultra-high bandwidth and fine-grained connectivity of hybrid bonding can be directly translated into high-throughput LUT operations. To manage the complexity and redundancy inherent in large LUTs, the authors propose a hierarchical structure: a small, fast, content-addressable LUT on a logic die acts as a filter and index for a large, high-capacity LUT implemented in a stacked DRAM die.
The work further extends this architectural foundation to address practical challenges. It introduces a multiplier-free method for floating-point operations and leverages the architecture's structure to handle sparse data efficiently. Recognizing that 3D stacking exacerbates thermal issues, the paper presents a holistic thermal co-design, including a self-throttling sense amplifier and a low-toggling adder, to mitigate hotspots without compromising performance. The result is a well-integrated, systems-level proposal that connects advances in semiconductor fabrication directly to architectural innovation for PIM.
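The hierarchical structure described above can be pictured as a two-level lookup: a small associative filter that admits only indices worth fetching, backed by the large table in DRAM. The Python sketch below captures that control flow only, with dictionaries standing in for the CAM and the DRAM bank; no claim is made about the actual miss handling, which Review 1 flags as unspecified.

```python
class HierarchicalLUT:
    # Control-flow sketch of the fast-LUT / DRAM-LUT hierarchy: the small
    # associative table admits only "useful" (e.g., non-zero) operand indices,
    # so redundant accesses to the large DRAM-resident table are skipped.
    def __init__(self, dram_lut, hot_indices):
        self.dram_lut = dram_lut            # large precomputed-result table
        self.fast_lut = set(hot_indices)    # CAM-like filter on the logic die

    def lookup(self, index):
        if index not in self.fast_lut:
            return 0                        # filtered out: treated as a zero operand
        return self.dram_lut[index]         # one high-bandwidth DRAM-LUT access

lut = HierarchicalLUT(dram_lut={i: i * i for i in range(16)},
                      hot_indices=[1, 3, 7])
results = [lut.lookup(i) for i in [0, 1, 2, 3, 7]]   # [0, 1, 0, 9, 49]
```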
Strengths
-
Timely and Visionary Synthesis: The paper's greatest strength is its successful synthesis of disparate but highly relevant research fields. It sits at the intersection of PIM architecture, advanced packaging, and low-power circuit design. Rather than treating hybrid bonding as a mere "faster wire," the authors use its unique properties to enable a new architectural paradigm (hierarchical LUT-PIM). This provides a compelling blueprint for what future heterogeneous systems could look like and is a significant contribution to the community's thinking about post-Moore's Law computing.
-
Holistic, Full-Stack Approach: The authors are to be commended for their end-to-end system perspective. They do not simply propose an abstract architecture; they ground it in a specific integration technology (hybrid bonding), identify the critical second-order problem that technology creates (thermal density, as discussed in Section 5.1, page 5), and propose concrete circuit-level solutions. This full-stack awareness, from device physics to architecture, is rare and makes the work far more credible and impactful.
-
Elegant Solutions to Known PIM Weaknesses: The paper directly tackles several well-known limitations of prior PIM and LUT-based accelerators:
- Sparsity: The hierarchical fast-LUT/DRAM-LUT design is a very clever mechanism for handling sparsity. By using the CAM-based fast-LUT to identify non-zero elements, the system avoids redundant lookups in the much larger and slower DRAM-LUT, connecting directly to a major trend in AI/ML workloads.
- Floating-Point Precision: The lack of efficient floating-point support has been a major barrier for PIM adoption in scientific computing and modern AI. The proposed transformation method (Section 4.4.2, page 5), which offloads the computationally intensive mantissa multiplication to the LUT while handling the exponent and sign in simple logic, is an elegant and practical solution.
- The Memory Wall: The core premise attacks the memory wall by leveraging the massive parallelism (4096 parallel banks) afforded by hybrid bonding, turning a potential bandwidth firehose into productive computation.
Weaknesses
While the hardware proposal is compelling, the paper is viewed primarily through an architectural and circuits lens, leaving some crucial system-level questions open.
-
Programmability and the Software Stack: The paper is largely silent on how a developer or compiler would target the 3D-PATH architecture. Is the hierarchical LUT managed transparently by a hardware controller, or does it require explicit software management? Decomposing functions into LUTs, partitioning them between the fast and slow tiers, and managing updates are non-trivial software problems. Without a clear programmability model, 3D-PATH risks being a powerful but unusable piece of hardware.
-
Overhead of LUT Generation and Updates: The analysis focuses almost exclusively on the inference or lookup phase. However, a key advantage of LUTs is their reconfigurability. The paper does not quantify the latency or energy costs associated with populating or updating the LUTs in DRAM. For applications where the function changes dynamically (e.g., during training phases of machine learning, or with adaptive algorithms), the cost of writing these large tables could become a significant performance bottleneck, potentially negating the benefits of fast lookups. The discussion in Section 7.3.1 (page 10) mentions the update process but provides no performance data.
-
Scalability of the Fast-LUT: The fast-LUT is based on Content Addressable Memory (CAM), which is effective but known to be power-hungry and area-inefficient compared to standard SRAM. The paper evaluates a 32Kb configuration. It is unclear how the architecture's efficiency would scale to problems requiring a much larger set of "hot" indices that do not fit in the fast-LUT. This could create a performance cliff where frequent misses in the fast-LUT lead to iterative, high-latency searches in the DRAM die, undermining the core performance claim.
Questions to Address In Rebuttal
-
Could the authors elaborate on the intended programmability model for 3D-PATH? What is the division of responsibility between the hardware controller and the software/compiler for managing the two-level LUT hierarchy?
-
The proposed method for FP multiplication is clever. However, accumulation is also a critical part of many workloads (e.g., GEMM). The paper states that the outer-product approach avoids the need for pre-alignment across different banks (Section 4.4.2, page 5). How and where are the final partial products accumulated, and how does the system handle the necessary exponent alignment and normalization during this final reduction step? What is the associated hardware cost and latency?
-
Could the authors provide an analysis or estimation of the performance and energy overhead for updating a DRAM-LUT bank? How does this "write" or "reconfiguration" cost compare to the "read" or "lookup" cost, and how does it affect the suitability of 3D-PATH for workloads with dynamic or frequently changing functions?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents 3D-PATH, a processing-in-memory (PIM) architecture that leverages 3D hybrid bonding to implement a hierarchical look-up table (LUT) system. The core claims of novelty appear to be four-fold:
- The primary architectural concept: a synthesis of 3D hybrid bonding with LUT-based PIM, creating a hierarchical system where a large, parallel DRAM-LUT is assisted by a small, fast SRAM-based LUT (fast-LUT) on the logic die.
- The use of this hierarchical LUT structure to perform sparsity-aware computation, where the fast-LUT acts as an index and filter to avoid redundant accesses to the main DRAM-LUT.
- A novel method for multiplier-free and, critically, alignment-free floating-point (FP) computation by combining representation transformation with a column-interleaved, outer-product data mapping.
- The introduction of specific thermal-aware hardware, namely a self-throttling sense amplifier (SA) and a sign-magnitude dual-adder tree, to mitigate the thermal challenges inherent to the 3D-stacked architecture.
While the overall synthesis of these concepts into a cohesive system demonstrates novelty, several of the underlying techniques are adaptations of established principles. The most significant novel contributions lie in the architectural pattern enabled by 3D integration and the specific method for alignment-free FP computation.
Strengths
The primary strength of this work is the novel architectural synthesis. The central idea to "transform the high bandwidth of 3D integration into high LUT throughput" (Section 1, page 2) is a genuinely new and insightful architectural paradigm for PIM. While LUT-PIM (e.g., pLUTo [17]) and 3D integration for memory have been explored separately, their co-design into a hierarchical system where the logic die's fast-LUT serves as a directory/filter for the DRAM die's data-LUT is a compelling and novel approach. This is not merely an integration but a fundamentally new way to structure a PIM accelerator.
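To make the directory/filter role concrete, a minimal functional sketch of such a two-level lookup is shown below. The class and function names (FastLUT, DramLUT, hierarchical_lookup) are hypothetical, and latencies, bank mapping, and replacement policy are deliberately abstracted away; this illustrates the reviewed idea, not the paper's implementation.

```python
# A minimal functional model of the hierarchical LUT idea (names are
# hypothetical; latencies and bank mapping are abstracted away).
class FastLUT:
    """Small directory on the logic die: maps 'hot' keys to DRAM-LUT rows."""
    def __init__(self, table=None):
        self.table = table or {}

    def lookup(self, key):
        return self.table.get(key)  # None models a fast-LUT miss


class DramLUT:
    """Large table of precomputed results held in the DRAM die."""
    def __init__(self, rows):
        self.rows = rows  # row address -> precomputed value

    def read(self, row):
        return self.rows[row]


def hierarchical_lookup(fast_lut, dram_lut, key, operand_is_zero):
    if operand_is_zero:
        return 0                   # sparsity filter: skip the DRAM access entirely
    row = fast_lut.lookup(key)
    if row is not None:
        return dram_lut.read(row)  # common case: one directed DRAM-LUT read
    # Fast-LUT miss: fall back to a slower access of the DRAM-LUT itself,
    # which is where frequent misses could become costly.
    return dram_lut.read(key)


# Example: key 7 is 'hot' and indexed by the fast-LUT; key 3 takes the miss path.
dram = DramLUT(rows={0: 0.0, 3: 1.5, 7: 2.25})
fast = FastLUT(table={7: 7})
print(hierarchical_lookup(fast, dram, key=7, operand_is_zero=False))  # 2.25
print(hierarchical_lookup(fast, dram, key=3, operand_is_zero=False))  # 1.5
```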
The second major novel contribution is the method for floating-point computation (Section 4.4.2, page 5). Prior PIM works that tackle FP arithmetic often struggle with the overhead of exponent alignment (e.g., FloatAP [71] uses bit-serial shifting). The proposed "alignment-free" method, achieved by mapping computation in an outer-product fashion where each bank handles an independent column, is a clever circumvention of this bottleneck. This represents a significant delta over the prior art in PIM systems and substantially broadens the applicability of LUT-based PIM.
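For reference in this discussion, the standard decomposition of a floating-point product is reproduced below in generic IEEE-style notation; the paper's Equation 2 may use different symbols.

$$x \times w = (-1)^{s_x \oplus s_w} \cdot 2^{\,e_x + e_w - \mathrm{bias}} \cdot \big(M_x \times M_w\big)$$

Only the mantissa product can become a pure table lookup; the exponent sum, normalization, and rounding remain post-lookup logic, and any subsequent accumulation of such products still requires aligning mantissas to a common (maximum) exponent before summation, which is the reduction step probed in the rebuttal questions below.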
Finally, the authors demonstrate a holistic design approach by identifying the thermal consequences of their architecture and proposing hardware solutions. This end-to-end consideration, from architecture to circuit, is commendable.
Weaknesses
My main concerns relate to the degree of novelty in some of the constituent components, particularly the thermal-aware hardware.
-
Self-Throttling Sense Amplifier: The concept of detecting a mismatch early in a CAM/TCAM search cycle and gating the discharge path to save power is a well-established technique in the circuit design literature. For example, selective-precharge schemes mentioned by the authors ([50], [60]) aim for a similar outcome. While the specific 7T circuit implementation in Figure 6 may be unique, the fundamental principle of "early gating to terminate current flow during mismatches" (Section 5.3.2, page 7) is not a fundamentally new invention. The novelty is more in its application to a thermal problem rather than a purely power-saving one, which is an incremental step.
-
Sign-Magnitude Low Toggling Adder: The use of sign-magnitude (SM) representation over two's complement (2C) to reduce switching activity in adders is a known low-power design strategy. The authors themselves cite prior work [3] that quantifies the energy efficiency benefits. The dual-adder tree is a standard approach to managing the complexity of SM addition/subtraction. Therefore, the novelty here is not in the arithmetic circuit itself, but in its integration into this specific architecture for the stated purpose of thermal mitigation.
The complexity of the overall 3D-PATH system is substantial, requiring advanced 3D integration. While the results for sparse workloads are impressive, the benefit of the hierarchical LUT for dense operations is less clear. For a dense matrix, the fast-LUT would presumably have a 100% hit rate, acting primarily as an address translator and introducing latency and power overhead without providing any filtering benefit. The novelty is therefore highly optimized for a specific data pattern (sparsity).
Questions to Address In Rebuttal
-
Regarding the thermal-aware hardware (Section 5.3, pages 6-7): Please clarify the novelty of the self-throttling SA and the SM dual-adder tree in the context of prior circuit-level art. Beyond applying known low-power techniques to a thermal problem, what is fundamentally new about the circuit topology or operation compared to existing mismatch-gated CAM SAs or low-power SM adders?
-
The hierarchical LUT is presented as a key innovation for handling sparsity. For a fully dense workload where no computation can be skipped, what is the performance and power overhead of the fast-LUT search step? Does the fast-LUT become a bottleneck or an inefficient power consumer in such a scenario, and how does the architecture's efficiency compare to a non-hierarchical "flat" LUT-PIM design in that specific case?
-
The alignment-free FP method is compelling. However, the transformation in Equation 2 (page 5) offloads only the mantissa multiplication ||M_IN × M_W||_LUT to the table. The OLU must still handle exponent addition and final normalization. Could you elaborate on the overhead and potential precision-handling complexities (e.g., for subnormal numbers, rounding modes) that must be managed in the OLU post-lookup? How does this complexity trade off against the benefit of avoiding pre-alignment?
One Flew over the Stack Engine’s Nest: Practical Microarchitectural Attacks on the Stack Engine
Abstract
Security research on modern CPUs has raised numerous concerns in recent years. These security issues stem from classic microarchitectural optimizations designed decades ago, without consideration for security. Stack pointer tracking, also known as the ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present a detailed microarchitectural analysis of the stack engine, a frontend optimization in modern x86 CPUs from Intel and AMD. They reverse engineer its behavior, focusing on the conditions that trigger a synchronization uop, namely unsupported operations and stack depth overflows. Based on these findings, they construct three leakage primitives and demonstrate their use in same-thread and cross-thread covert channels, as well as side-channel attacks against the cJSON and protobuf libraries. Finally, they identify and analyze undocumented MSRs ("chicken bits") on recent AMD CPUs that can disable the stack engine, measuring the performance impact of this mitigation. While the level of detail in the reverse engineering is commendable, the work suffers from questionable measurement reliability, a potentially contrived threat model, and claims that may not be fully substantiated by the provided evidence.
Strengths
-
Comprehensive Microarchitectural Analysis: The paper provides an impressively detailed investigation of the stack engine's behavior across a wide range of modern Intel and AMD microarchitectures, from Sandy Bridge to Zen 5. The systematic approach to characterizing properties like ARSP depth (Section 5.4) and support for add/sub (Section 5.5) is thorough.
Mitigation Analysis: The discovery and experimental validation of the undocumented MSRs on AMD Zen 4 and Zen 5 to control the stack engine (Section 9.1) is a significant finding. Measuring the real-world performance impact using SPEC CPU2017 provides a valuable data point on the trade-off between performance and security for this specific optimization.
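For context on what exercising such a mitigation entails operationally, a minimal sketch using the Linux msr character-device interface is shown below. The MSR address and bit position are placeholders, since the undocumented registers identified in the paper are not reproduced here; this is an illustration of the general mechanism, not the authors' procedure.

```python
# Toggling a hypothetical "chicken bit" via the Linux msr driver (modprobe msr).
# MSR_ADDR and BIT are placeholders for illustration only.
import os
import struct

MSR_ADDR = 0xC0011000   # hypothetical register number
BIT      = 1            # hypothetical bit position

def rdmsr(cpu, addr):
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        # The msr device returns the 8-byte MSR value at offset == MSR number.
        return struct.unpack("<Q", os.pread(fd, 8, addr))[0]
    finally:
        os.close(fd)

def wrmsr(cpu, addr, value):
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_WRONLY)
    try:
        os.pwrite(fd, struct.pack("<Q", value), addr)
    finally:
        os.close(fd)

def set_stack_engine_chicken_bit(cpu=0, disable=True):
    val = rdmsr(cpu, MSR_ADDR)
    val = val | (1 << BIT) if disable else val & ~(1 << BIT)
    wrmsr(cpu, MSR_ADDR, val)
```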
Weaknesses
-
Unreliable Measurements Undermine Core Observations: The foundation of this paper—the reverse engineering in Section 5—appears to be built on a shaky measurement methodology. The authors frequently acknowledge measurement issues, such as "performance counter inaccuracies" and "high jitter" on Intel CPUs (caption of Figure 4, page 7), and "excessive noise" preventing a conclusive analysis of speculative execution (Section 5.7, page 8). Most concerningly, when key observations contradict their model (e.g., the lack of a sync uop on Golden Cove and Zen 1), they dismiss it as a "performance counter bug" (Section 5.2, page 6) without providing concrete evidence to support this claim. This is a critical weakness. An alternative and equally plausible explanation is that the hardware behaves differently, which would invalidate the generality of their findings and the primitives built upon them. The burden of proof is on the authors to demonstrate that these are measurement errors, not fundamental behavioral differences.
-
Questionable Novelty and Robustness of Attack Techniques:
- The "new port contention technique" for cross-thread leakage (Section 7.2, page 10) is poorly motivated and described. The authors claim prior methods are insufficient for single-cycle uops but describe their solution as creating a dependency chain to delay execution. This is a standard approach to amplify contention signals, not a novel technique. The lack of a clear, formal description or a rigorous comparison against prior work makes the claim of novelty unsubstantiated.
- The attack primitives themselves show a lack of generality. The Sync+Reload primitive is rendered ineffective on Zen 5, the latest AMD architecture, because sync uops are dispatched unconditionally (Section 5.2). This forces the authors to develop a more complex Prime+Sync+Probe primitive. This suggests a fragmented and architecture-specific set of leakage methods rather than a universal principle.
-
Contrived Threat Model and Fragile Attacks: The practicality of the demonstrated side-channel attacks is debatable.
- The cJSON attack (Section 8.1) requires an attacker within the same address space to repeatedly invoke a parsing function on the same secret data but at different start offsets. This seems like a highly specific and unlikely scenario in a real-world setting like the FaaS environment described. A strong justification for the realism of this attack vector is missing.
- The protobuf attack (Section 8.2) is admitted to only work when default compiler optimizations (which use vector instructions that reset the stack engine) are not used. It relies on a specific *__get_packed_size function. This makes the attack fragile and opportunistic, rather than a general threat to applications using protobuf. The paper fails to argue for the prevalence of such vulnerable code patterns in real-world software.
-
Overstated Claims: The abstract claims this work is the "first reverse engineering of the stack engine." This overstates the contribution. The existence and basic function of the stack engine have been known and described in public documents, including Agner Fog's optimization manuals [18], for years. While this paper provides a much deeper security-focused analysis, it is not the first reverse engineering. This should be rephrased for accuracy.
Questions to Address In Rebuttal
-
Regarding the claim of a "performance counter bug" on Golden Cove and Zen 1 (Section 5.2), what evidence can you provide to prove that a sync uop is indeed dispatched but simply not counted, as opposed to not being dispatched at all? Without this proof, how can you be confident in the universality of your Sync+Reload primitive?
Please formalize your "new dispatch alignment technique" (Section 7.2). How does it fundamentally differ from established techniques that use dependency chains to amplify contention for measurement? Please provide a more rigorous comparison to prior work like that of Aldaya et al. [3].
-
Could you elaborate on a more concrete and plausible end-to-end attack scenario for the cJSON side channel (Section 8.1)? Specifically, how would an attacker in a realistic in-process sandboxing environment gain the ability to repeatedly trigger the victim's parsing function on the same data with byte-level control over the starting offset?
-
Given that one of your main primitives (Sync+Reload) does not work on the latest AMD CPUs and your demonstrated attacks rely on non-default compiler flags (protobuf) or a highly specific invocation pattern (cJSON), do you believe your work demonstrates a widespread, practical threat, or rather a narrow, opportunistic one? Please justify your assessment.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents a comprehensive, end-to-end investigation of the CPU stack engine—a long-standing but under-examined microarchitectural optimization—as a novel source of information leakage. The authors conduct a meticulous reverse engineering of the stack engine's behavior across a wide range of modern AMD and Intel processors, identifying the precise conditions that trigger synchronization events. Building on this foundational understanding, they construct a set of powerful leakage primitives. These primitives are then leveraged to build practical covert and side channels, culminating in a high-fidelity attack that exfiltrates structural information from a widely-used JSON parsing library, thereby leaking sensitive data. Crucially, the work does not stop at exploitation; the authors discover and verify undocumented "chicken bits" in recent AMD CPUs that can disable the stack engine, and they provide a sober analysis of the ~4% performance cost this mitigation incurs. This is a foundational piece of work that systematically transforms a seemingly benign performance optimization into a fully understood security liability.
Strengths
This paper has several significant strengths that place it firmly in the upper echelon of microarchitectural security research.
-
Novelty and Significance of the Target: While the community has spent years dissecting caches, branch predictors, and speculative execution, the stack engine has remained largely unexplored from a security perspective. This work is, to my knowledge, the first to subject it to a rigorous security analysis. By doing so, it opens up a new avenue of inquiry into how fundamental frontend optimizations can create security vulnerabilities, broadening the attack surface beyond the more commonly studied CPU components.
-
Methodological Rigor and Breadth: The authors’ approach is exceptionally thorough. The reverse engineering effort detailed in Section 5 (pages 5-8) is impressive, spanning multiple generations and architectures from both Intel and AMD. This provides a valuable comparative view and demonstrates that the vulnerability is not an isolated design flaw but an emergent property of a widely adopted optimization. The progression from reverse engineering to primitive-building to a full-fledged attack is logical and compelling.
-
A Complete "Vulnerability Lifecycle" Analysis: This is perhaps the paper’s greatest strength. It does not simply present an attack; it presents the entire story. It begins with curiosity about a microarchitectural feature, moves to deep understanding, demonstrates a practical exploit, and, most importantly, provides a concrete, hardware-verified mitigation. The discovery and functional analysis of the undocumented MSR bits in Section 9.1 (page 12) is a standout contribution, turning the paper from an offensive security work into a balanced and constructive piece of systems research. The performance evaluation of the mitigation provides the final, critical data point needed for CPU architects to make informed trade-off decisions.
-
Bridging Low-Level Primitives to High-Level Impact: The attack on the cJSON library (Section 8.1, page 11) is an excellent case study. It skillfully connects an esoteric microarchitectural effect (a sync uop being dispatched due to ARSP overflow) to a tangible security outcome (distinguishing between patient records based on the structure of parsed data). This demonstration is crucial for showing that the identified leakage channel is not merely theoretical but poses a genuine risk to real-world software.
Weaknesses
The weaknesses of this paper are minor and relate more to missed opportunities for contextualization than to flaws in the work itself.
-
Limited Contextualization within the Landscape of Frontend Attacks: The paper correctly identifies itself as a frontend attack in Section 10 (page 13). However, it could do more to position the stack engine channel relative to other known frontend channels (e.g., from the uop cache, Loop Stream Detector, or branch predictors). A brief discussion on the comparative properties—for instance, is the stack engine channel stealthier due to its transient nature? Is it lower or higher bandwidth? Is it more or less noisy than, say, a uop cache-based channel?—would help readers better situate this new vector within the broader taxonomy of microarchitectural threats.
-
Could Further Generalize the Underlying Principle: The paper correctly notes in Section 5.9 (page 8) that architectures like ARM and RISC-V do not require a stack engine due to their different ISA idioms. This is an important distinction. However, the underlying principle is that a performance optimization designed to handle a common software idiom (in this case, x86's push/pop stack management) creates state that can be leaked. The paper could strengthen its intellectual contribution by briefly speculating on whether analogous "idiom-specific" frontend optimizations in other ISAs might create similar, currently undiscovered, leakage vectors. This would elevate the core idea from an x86-specific finding to a more general principle of microarchitectural security.
Questions to Address In Rebuttal
-
The discovery of the MSR "chicken bits" is a fantastic contribution. Can the authors elaborate, even briefly, on the methodology used to find them? Was this the result of a brute-force MSR scan, or were there clues in documentation, patents, or kernel patches that pointed them in the right direction? Understanding this process could be valuable for other researchers.
-
How does the stack engine channel compare to other non-speculative, frontend channels in terms of its key properties? Specifically, considering a sophisticated defender, would this channel be considered more or less difficult to detect than a channel based on, for example, uop cache contention?
-
The paper makes a compelling case for the vulnerability of the stack engine in x86. Thinking more broadly, do the authors believe that the general principle—that stateful frontend optimizations for common software idioms are a ripe target for side channels—is applicable to other architectures? Could they provide a hypothetical example of what such an optimization and vulnerability might look like on an architecture like ARM or RISC-V?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents a systematic reverse engineering and security analysis of the "stack engine," a microarchitectural optimization in modern x86 CPUs that tracks the stack pointer (RSP) in the frontend to improve instruction-level parallelism. The authors characterize the behavior of this engine across a wide range of recent Intel and AMD microarchitectures, identifying its internal state (ARSP), its tracking depth, and the conditions that trigger a synchronization with the architectural RSP in the backend.
Based on these novel insights, the authors construct three new attack primitives (Direct Underflow, Sync+Reload, Prime+Sync+Probe) that exploit the stack engine's state to leak information. They demonstrate these primitives by building covert channels and a side-channel attack against the cJSON parsing library. Finally, they discover and analyze undocumented MSR "chicken bits" in recent AMD CPUs that can disable the stack engine, providing a mitigation and a way to quantify its performance impact.
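To make the mechanism behind these primitives easier to follow, a toy model of the behavior described above is sketched below. The tracking depth and per-instruction displacement are illustrative guesses, not values measured in the paper; the model only captures the idea of a bounded frontend offset that forces an observable synchronization on overflow or on an untracked RSP update.

```python
# Toy model of frontend stack-pointer tracking: a bounded internal offset (ARSP)
# accumulates push/pop displacements and emits a sync event when it overflows.
class StackEngineModel:
    def __init__(self, depth_bits=8):
        self.limit = 2 ** (depth_bits - 1)   # assumed tracking depth (illustrative)
        self.offset = 0
        self.syncs = 0                       # observable sync events

    def _apply(self, delta):
        self.offset += delta
        if abs(self.offset) >= self.limit:   # ARSP overflow -> sync with backend RSP
            self.syncs += 1
            self.offset = 0

    def push(self):
        self._apply(-8)                      # push rax: RSP -= 8

    def pop(self):
        self._apply(+8)                      # pop rax: RSP += 8

    def unsupported_rsp_op(self):
        self.syncs += 1                      # e.g., mov rsp, ... forces a sync
        self.offset = 0


# Prime+Sync+Probe intuition: the attacker "primes" the offset near the limit,
# so even a few victim pushes trigger an observable sync.
eng = StackEngineModel()
for _ in range(15):
    eng.push()                               # attacker prime phase
for _ in range(3):
    eng.push()                               # victim activity
print("sync observed:", eng.syncs > 0)
```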
The core novelty of this work lies in being the first to deeply investigate, characterize, and weaponize the stack engine as a source for information leakage. While the attack patterns are analogous to prior work in other domains (e.g., caches), the target, the channel, and the specific mechanisms are entirely new.
Strengths
-
Novelty of the Target Microarchitectural Unit: The primary strength of this paper is its focus on a previously unexamined microarchitectural component for security analysis. While the existence of stack pointer tracking is known in principle (citing Bekerman et al. [8] from 2000), this paper provides the first-ever detailed, empirical reverse engineering of its modern implementations (Section 5, pages 5-8). The characterization of properties like tracking depth (Figure 4, page 7), conditions for synchronization (Section 5.2, page 5), and support for add/sub (Section 5.5, page 7) across multiple generations of AMD and Intel CPUs is a significant and novel contribution to the community's understanding of the x86 frontend.
Novel Primitives Derived from First Principles: The attack primitives are not generic; they are meticulously derived from the specific behaviors uncovered during the reverse engineering phase. For example, Prime+Sync+Probe directly exploits the finite capacity of the internal ARSP register and the observable synchronization event on overflow. This tight coupling between reverse engineering and exploit development is a hallmark of high-quality microarchitectural security research. This represents the first time such stateful tracking in the stack engine has been exploited.
Novel Observation Technique for a Difficult Signal: The authors correctly identify that the signal from the stack engine—a single-cycle, independent ALU operation—is difficult to observe, especially cross-thread, and that prior port contention techniques [3, 23] are insufficient. Their development of a "tuned" port contention method with a dependency to align execution windows (Section 7.2, pages 10-11) is a subtle but important novel contribution in its own right, enabling the observation of this new class of faint signals.
-
Novel Discovery of Undocumented Mitigations: The discovery and characterization of undocumented "chicken bits" in AMD Zen 4 and Zen 5 CPUs to control the stack engine (Section 9.1, page 12) is a concrete and novel finding. This is not simply an application of existing knowledge but a genuine discovery that provides an immediate mitigation path and allows for a precise performance evaluation of the targeted feature.
Weaknesses
-
Conceptual Analogy to Existing Attack Patterns: While the target and mechanism are novel, the conceptual framework of the Prime+Sync+Probe primitive is a direct analogue to the classic Prime+Probe cache attack. The pattern is: (1) put the microarchitectural structure into a known state (prime), (2) let the victim execute, (3) check the state to see if the victim's activity changed it (probe). The paper should more explicitly position its contribution not as the invention of a new attack pattern, but as the novel discovery that the stack engine constitutes a previously unknown structure susceptible to this pattern.
The Demonstration is an Application, Not a Core Novelty: The successful attack on the cJSON library (Section 8.1, page 11) is an excellent demonstration of the primitives' effectiveness. However, from a novelty standpoint, this is an application of the core ideas rather than a new idea in itself. The core contribution remains the identification and exploitation of the stack engine, not the specific finding in a downstream software library.
Questions to Address In Rebuttal
-
The Prime+Sync+Probe primitive is functionally analogous to cache-based Prime+Probe. Can the authors elaborate on the non-trivial aspects of adapting this conceptual pattern to the stack engine's ARSP register? For instance, how does the non-cache-like, single-value nature of the ARSP state fundamentally differ from the set-based state of a cache during an attack?
In Section 5.7 ("Stack engine under speculation"), you confirm that sync operations are observable under transient execution but conclude that an attack "would only slightly expand the capabilities of a cross-thread attacker." This conclusion seems understated. Could a transient execution attack based on the stack engine enable leakage scenarios not possible with existing speculative attack primitives? Please clarify if there is a genuinely novel transient attack vector here that has been downplayed.
-
Your reverse engineering covers a wide range of x86 CPUs. You briefly state that ARM and RISC-V lack a stack engine due to their ISA design (Section 5.9, page 8). Are you aware of any analogous frontend optimizations in other non-x86 architectures that track register state (not just the stack pointer) in a similar stateful, finite-capacity manner that could be susceptible to the new class of primitives you have developed?
DExiM: Exposing Impedance-Based Data Leakage in Emerging Memories
Abstract
Emerging non-volatile memory (NVM) technologies, such as resistive RAM (ReRAM), ferroelectric RAM (FRAM), and magnetoresistive RAM (MRAM), are gaining traction due to their scalability, energy efficiency, and resilience to traditional charge-based ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper presents an investigation into impedance-based side-channel leakage in emerging non-volatile memories (NVMs), specifically ReRAM, FRAM, and MRAM. The authors utilize a Vector Network Analyzer (VNA) to measure the S11 reflection parameter on the power distribution network (PDN) of commercial memory chips. They claim that variations in the measured impedance correlate with the stored data's Hamming weight (inter-HW) and specific data patterns (intra-HW). Using a feature selection process followed by Principal Component Analysis (PCA) and standard machine learning classifiers, the authors report high classification accuracy, concluding that impedance-based leakage is an "exploitable phenomenon" in these memory technologies.
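For readers less familiar with one-port VNA measurements, the standard relation presumably used to convert the measured reflection coefficient into an impedance profile is

$$Z_{\mathrm{DUT}} = Z_0 \cdot \frac{1 + S_{11}}{1 - S_{11}},$$

where $Z_0$ is the reference impedance of the VNA port (typically 50 Ω). The data-dependence claims therefore rest on $S_{11}$, and hence the apparent PDN impedance, varying measurably with the stored state.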
Strengths
- Empirical Breadth: The study evaluates three distinct and commercially relevant NVM technologies (ReRAM, FRAM, MRAM) from multiple vendors. This provides a broader base for its claims than a study focused on a single device.
- Use of COTS Devices: The experiments are conducted on commercial off-the-shelf (COTS) components, which lends some external validity to the findings, as opposed to custom-designed test structures or simulations.
- Detailed Experimental Setup: The paper provides specific details regarding the VNA configuration, frequency sweep, and data acquisition process (Section 6.3), which is a necessary, albeit basic, requirement for reproducibility.
Weaknesses
My primary concerns with this manuscript center on the validity of its core premise, the methodology used to isolate the phenomenon, and the interpretation of the results.
-
Conflation of System and Component Impedance: The fundamental flaw in this work is the lack of evidence isolating the impedance contribution of the memory cells themselves from the rest of the system. The measurements are taken on the 3.3V power rail (Section 6.3), which represents the aggregate impedance of the entire memory chip (including I/O buffers, sense amplifiers, row/column decoders, charge pumps) and the test PCB itself (decoupling capacitors, power planes, traces). The authors provide no control experiments—such as measuring the chip in a quiescent state or a baseline measurement of the PCB without the chip—to substantiate that the observed variations originate specifically from the N-bit memory state and not from peripheral circuitry that is trivially data-dependent. The simplified models in Section 4 (Fig. 6, 7) are cell-level, yet the measurements are system-level. This is a critical, unbridged gap.
-
Oversimplified Theoretical Foundation: The analytical models presented in Section 4 are rudimentary and fail to account for the dominant non-idealities in any real-world circuit. The equations (Eq. 3-8) ignore parasitic capacitance and inductance, process variations, temperature effects, and noise, all of which would heavily influence impedance measurements in the GHz range. To present these idealized equations as the basis for a complex, noisy, real-world phenomenon and then claim experimental validation is a significant logical leap. The work does not demonstrate that the measured effects are governed by these models versus other, more plausible circuit dynamics.
-
Questionable Signal Processing and Feature Selection Pipeline: The methodology in Section 7.1.1 is not adequately justified.
- The choice of Pearson correlation for feature selection presupposes a linear relationship between impedance at a specific frequency and the stored data value. There is no theoretical or empirical justification provided for this assumption. Non-linear relationships would be missed entirely.
- The two-step approach of selecting features via correlation and then applying PCA is unorthodox. PCA is designed to find the axes of maximal variance in a dataset. By pre-filtering the data, the authors may be biasing the analysis and discarding components that, while having lower individual correlation, might be highly informative in combination. A more rigorous approach would apply PCA to the full spectrum and analyze the resulting components or compare the chosen pipeline against standard alternatives.
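A minimal version of the requested baseline comparison could look as follows. The arrays are synthetic placeholders for real S11 traces, and the classifier, dimensions, and feature count are arbitrary choices; this sketches the comparison methodology rather than reproducing the paper's setup.

```python
# Compare the paper-style pipeline (Pearson filter -> PCA -> classifier) against
# a baseline that applies PCA to the full spectrum. Data here is synthetic.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_traces, n_freqs = 900, 2001            # traces x frequency points (placeholder sizes)
X = rng.normal(size=(n_traces, n_freqs))  # stand-in for measured impedance spectra
y = rng.integers(0, 9, size=n_traces)     # e.g., Hamming weight 0..8 of the stored byte

def pearson_select(X, y, k=200):
    """Rank frequency points by |Pearson r| against the (numeric) label."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc * yc[:, None]).sum(axis=0) / (
        np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()) + 1e-12)
    return np.argsort(np.abs(r))[-k:]

clf = lambda: make_pipeline(StandardScaler(), PCA(n_components=20), SVC())

sel = pearson_select(X, y)
acc_filtered = cross_val_score(clf(), X[:, sel], y, cv=5).mean()  # correlation filter + PCA
acc_full     = cross_val_score(clf(), X,        y, cv=5).mean()   # PCA on full spectrum

print(f"Pearson+PCA: {acc_filtered:.3f}  |  full-spectrum PCA: {acc_full:.3f}")
```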
-
Overstatement of Classification Results: While the reported classification accuracies in Table 1 seem high, the presentation obscures critical details.
- The results are presented as ranges (e.g., F1-Score "84.3%-89.5%"). This is imprecise. Does this range represent variation across the 20 chips, across different NVM models within a technology type, or something else? This ambiguity hinders a critical assessment of the results' stability.
- The confusion matrices in Figure 10 reveal non-trivial misclassifications. For MRAM (Fig. 10c), distinguishing HW0 has a False Discovery Rate (FDR) of 31.5%, and classifying HW8 only achieves a 77% True Positive Rate (TPR). This suggests the channel is significantly noisier and less reliable than the summary tables imply. To label a phenomenon with such error rates as definitively "exploitable" requires a more nuanced discussion of the attack context (e.g., error correction requirements) which is absent.
Questions to Address In Rebuttal
The authors must address the following points to substantiate their claims:
- Isolation of Effect: How did the authors de-embed the impedance contribution of the NVM cells from the parasitic effects of the package, the PCB, and the chip's peripheral circuitry? Please provide control measurements or a detailed analysis that proves the observed impedance variations are not dominated by these other sources.
- Model vs. Reality: Can the authors justify the leap from the simplified, ideal circuit models in Section 4 to the complex, system-level measurements? How do these models account for the frequency-dependent behavior of the PDN, including decoupling capacitor resonances, which are known to create significant impedance variations?
- Methodological Justification: What is the justification for assuming a linear data-impedance relationship for the Pearson correlation feature selection? Please provide evidence comparing your two-step feature selection pipeline (correlation + PCA) with a baseline approach (e.g., PCA on the full spectrum) to demonstrate that your method is not biasing the results or discarding useful information.
- Interpretation of Results: Please clarify precisely what the performance ranges in Table 1 represent. Furthermore, please address the significant misclassification rates in Figure 10 and discuss how they impact the practical exploitability of this side channel, which you claim is a key finding. An attack with a ~23% error rate on certain data patterns (MRAM HW8) is far from a foregone conclusion.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces and validates a novel side-channel attack vector against emerging non-volatile memories (NVMs) like ReRAM, FRAM, and MRAM. The core contribution is the demonstration that the fundamental data storage mechanism of these memories—their impedance state—can be directly and non-invasively measured through the Power Distribution Network (PDN) to leak the stored data. The authors challenge the implicit assumption that the shift from charge-based to impedance-based storage in NVMs inherently improves security against physical attacks.
Using S-parameter analysis with a Vector Network Analyzer (VNA), the authors show that data-dependent impedance variations are statistically significant across a range of commercial memory chips. Their methodology successfully distinguishes not only between data patterns with different Hamming weights (inter-HW) but also between patterns with the same Hamming weight (intra-HW), enabling powerful template-based attacks. The work serves as a foundational exploration of a new class of hardware vulnerabilities, highlighting that the physical properties of data storage can themselves become a source of leakage.
Strengths
-
Strong Conceptual Contribution: The primary strength of this paper is its conceptual clarity and novelty. It pivots the focus of side-channel analysis for NVMs from dynamic leakage (e.g., power consumption during write operations) to static leakage rooted in the physical state of the memory cells themselves. This is a crucial insight that opens a new and previously overlooked avenue for security research. By framing impedance not as a passive circuit parameter but as an active source of information leakage, the authors provide a valuable new perspective for the hardware security community.
-
Excellent Empirical Grounding: The work is not merely theoretical. The authors provide a robust and convincing experimental evaluation across three major NVM technologies (ReRAM, FRAM, MRAM) from multiple vendors (Section 6.1.1, page 7). This breadth significantly strengthens the generality of their findings. The systematic analysis of both inter- and intra-Hamming weight variations (Section 7.1, page 8) demonstrates the high resolution of the attack vector and its potential to defeat simple countermeasures.
-
Bridging Disparate Fields: The paper successfully connects two traditionally separate domains: the device physics of emerging memories and the RF/microwave measurement techniques used in signal integrity analysis. The repurposing of VNA and S-parameter analysis (Section 2.3, page 4), typically used for characterizing high-frequency circuits, as a tool for security vulnerability discovery is both clever and effective. This interdisciplinary approach is a model for future work in physical side-channel analysis.
Weaknesses
-
Understated Positioning Against Existing Impedance SCA: While the paper has a good "Related Works" section (Section 10, page 12), it could more forcefully articulate its contribution in the context of prior impedance side-channel work. Research on impedance leakage in FPGAs (e.g., [13, 65]) has already established the general principle. The authors should more explicitly highlight why their work is a significant leap forward—namely, the transition from analyzing dynamic switching in logic elements to extracting static data from high-density memory arrays. This is a non-trivial distinction that elevates the paper's impact, and it deserves more emphasis.
-
Practicality of the Threat Model: The threat model (Section 5, page 7), which assumes physical access and an identical reference chip for template building, is standard for this type of academic research. However, the work would have a greater impact if it briefly explored the boundaries of this model. For instance, a discussion on the potential for remote sensing (e.g., via backscattering) or the robustness of templates across different manufacturing batches would help contextualize the real-world threat level.
-
Countermeasures Section is Preliminary: The discussion on countermeasures in Section 8 (page 10) is comprehensive in its breadth but lacks depth. It serves as a good catalog of potential defenses but does not provide much intuition on which methods would be most effective or efficient against this specific static leakage channel. While a full evaluation is beyond the scope of this paper, a more focused discussion or a preliminary simulation of a promising countermeasure would transition the work from purely identifying a problem to pointing more concretely toward solutions.
Questions to Address In Rebuttal
The authors have presented a compelling and well-executed piece of research. I would appreciate their thoughts on the following points to further strengthen the work:
-
On the Novelty of Application: The related work in [13, 65] has explored impedance leakage in FPGAs. Could the authors further clarify the fundamental differences and challenges in extending this concept from configurable logic to dense, static NVM arrays? What makes this a non-trivial extension that constitutes a significant new contribution?
-
On the Robustness of Templates: The evaluation relies on template attacks using 20 chips of each type, which is a good practice. How sensitive are these impedance templates to environmental factors like process variations, temperature, and aging? Could a template trained on one batch of chips be successfully used against another, or is per-device (or at least per-batch) profiling necessary for a real-world attack?
-
On the Future of Defenses: Section 8 provides a good overview of potential countermeasures. Based on your findings, do you have an intuition about which category (e.g., architectural randomization, software-based masking, or device-level modifications) would be the most effective and cost-efficient defense against this specific type of static impedance leakage? For instance, would techniques that randomize memory layout be more effective than those that add noise to the PDN?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present "DExiM," an experimental study demonstrating that emerging Non-Volatile Memory (NVM) technologies—specifically ReRAM, FRAM, and MRAM—leak information about their statically stored data through their impedance characteristics. The core methodology involves using a Vector Network Analyzer (VNA) to measure the S11 reflection parameter of the memory chip's Power Distribution Network (PDN), from which the impedance profile is derived. The authors show that these impedance profiles contain statistically significant variations corresponding to both inter-Hamming weight (different numbers of '1's) and intra-Hamming weight (same number of '1's in different positions) data patterns. They use this leakage to train machine learning classifiers to distinguish the stored data with high accuracy. The central claim is that this represents a new class of hardware vulnerability for these memory types.
Strengths
-
Novel Application of a Known Technique to a New Domain: The primary strength of this paper is its application of impedance-based side-channel analysis to the domain of static data leakage from dedicated emerging NVM chips. While impedance analysis itself is not new, its use here to non-invasively read data-at-rest from technologies that encode information directly as impedance is a novel and logical extension. The work successfully identifies a vulnerability that is a direct consequence of the fundamental operating principle of these memories.
-
Systematic and Rigorous Validation: The authors provide a compelling existence proof through a thorough experimental evaluation. The use of three distinct NVM technologies (ReRAM, FRAM, MRAM) from multiple vendors, with tests conducted on 20 distinct chips for each model (as mentioned in Section 6.1.1, page 7), demonstrates that the observed phenomenon is not an artifact of a single device or architecture but a more fundamental property. This systematic approach adds significant weight to the core claim.
-
Clear Differentiation from Power Analysis: The paper effectively argues why this leakage vector is distinct from and potentially more potent than traditional power analysis for this class of devices. As noted in the introduction (Section 1, page 2), the low-power and limited switching activity of NVMs can make power SCA difficult, whereas impedance is a static property that remains measurable, presenting a fundamentally different attack surface.
Weaknesses
-
Incremental Advance Over Existing Impedance Leakage Work: The core idea of using impedance as a side channel is not new. The authors themselves cite the most relevant prior art, notably Monfared et al. [65] ("LeakyOhm") and a related preprint by the current authors [13]. "LeakyOhm" conclusively demonstrated data-dependent impedance leakage from FPGA logic elements (LUTs, flip-flops) and used it to mount a successful key-recovery attack on an AES implementation. The "delta" in this work is the change of target from programmable logic (FPGAs) to dedicated memory chips (NVMs). While this is a valid and important distinction, the conceptual framework—measuring data-dependent impedance variations via the PDN—is functionally identical. The paper's novelty rests entirely on this change of target, which makes the contribution more of an extension of an existing attack class to a new component type, rather than the discovery of a fundamentally new attack class.
-
Lack of a Demonstrated Novel Attack: The analysis stops at data classification. Prior work [65] took the critical next step of leveraging the identified leakage to perform a full cryptographic key extraction. By limiting the scope to classification of raw data patterns, the authors demonstrate a channel but not a novel attack with demonstrated real-world impact. This makes the contribution feel incomplete from a security perspective and lessens the significance of the novel finding. The paper hypothesizes about consequences like key extraction (Section 9, page 11) but does not provide evidence.
-
Generic and Non-Novel Countermeasures: The discussion of countermeasures in Section 8 (page 10) is a high-level overview of established side-channel defense principles (masking, randomization, ECC-aware design). There is no novel countermeasure proposed that is specifically tailored to the unique physical properties of impedance leakage in NVMs. This section lacks originality and reads like a summary of textbook defenses.
Questions to Address In Rebuttal
-
The closest prior art [65] demonstrated impedance leakage in FPGAs. Please elaborate on the fundamental physical differences in the source of impedance variations between a configured FPGA logic element (as in [65]) and a static NVM cell (as in your work). A more detailed comparison would help solidify the novelty of your contribution beyond simply a change in the device under test.
-
Given that prior work on impedance leakage [65] culminated in a full AES key extraction, why did your investigation stop at classifying 8-bit data patterns? Can you provide evidence or a compelling argument that the signal-to-noise ratio and observability of your discovered channel are sufficient for a similar, more complex attack on a real cryptographic primitive stored in one of these NVMs?
-
The countermeasures proposed in Section 8 are largely generic. Can you propose at least one concrete, novel countermeasure that is specifically designed to mitigate impedance leakage in NVMs at either the circuit or architectural level, which would not be a straightforward application of existing power/EM side-channel defenses?
Sonar: A Hardware Fuzzing Framework to Uncover Contention Side Channels in Processors
Abstract
Contention- based side channels, rooted in resource sharing, have emerged as a significant security threat in modern processors. These side channels allow attackers to leverage timing differences caused by conflicts in execution ports, caches, or ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper presents Sonar, a pre-silicon fuzzing framework that purports to systematically uncover contention-based side channels in processors. The core thesis is that multiplexers (MUXes) are the primary loci of resource contention, and by identifying these MUXes, one can define contention-critical states. The framework then uses these states, particularly the timing interval between requests (reqsIntvl), to guide testcase generation towards triggering contentions. The authors evaluate Sonar on two open-source RISC-V processors (BOOM and NutShell) and claim to have discovered 14 side channels, 11 of which are presented as new.
While the engineering of the framework appears functional for its defined scope, its fundamental premise is an oversimplification of microarchitectural contention. The methodology for guiding the fuzzer is naive, and the claims regarding the novelty and exploitability of the discovered vulnerabilities are not sufficiently substantiated.
Strengths
- Automatable Heuristic for Contention Points: The core idea of leveraging MUX structures as a heuristic to identify potential contention points (Section 5.1, page 4) provides a scalable, automated starting point for analysis. The bottom-up tracing method is a logical way to group cascaded 2:1 MUXes into a single n:1 contention point (a minimal sketch of this tracing follows this list).
- Targeted Root-Cause Analysis: The dual-differential comparison method (Section 7, page 7), which correlates instruction commit cycle differences (CCD) with changes in contention-critical states, is a methodologically sound approach for narrowing down the source of a detected timing leak, reducing manual debugging effort.
- Concrete Implementation: The framework is implemented and evaluated on two non-trivial out-of-order processor designs, demonstrating that the proposed instrumentation and analysis pipeline is functional.
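As referenced in the first strength above, a toy sketch of such bottom-up tracing over a MUX netlist is given below. The netlist encoding is invented for illustration; Sonar itself operates on FIRRTL, and real designs carry far richer structure.

```python
# Bottom-up MUX tracing sketch: starting from a 2:1 MUX output, inputs that are
# themselves MUX outputs are expanded recursively, so a cascade collapses into
# one n:1 contention point. The netlist format here is invented for illustration.

# signal name -> (input_a, input_b) for every 2:1 MUX output in the design
muxes = {
    "arb_out": ("mux_l", "req_c"),
    "mux_l":   ("req_a", "req_b"),
}

def trace_contention_point(out_sig, muxes):
    """Return the set of leaf request signals competing for `out_sig`."""
    if out_sig not in muxes:          # leaf: an actual requester, not a MUX output
        return {out_sig}
    leaves = set()
    for inp in muxes[out_sig]:
        leaves |= trace_contention_point(inp, muxes)
    return leaves

print(trace_contention_point("arb_out", muxes))
# -> {'req_a', 'req_b', 'req_c'}: a 3:1 contention point rooted at arb_out
```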
Weaknesses
-
Oversimplified Contention Model: The central premise that MUXes are the definitive "hotspots" for all significant resource contention is fundamentally flawed. While MUXes are ubiquitous in signal selection, complex contention scenarios often arise from more intricate, stateful arbiters, queues with occupancy-dependent backpressure, and shared functional units whose contention logic is not fully captured by analyzing MUX input trees. This foundational assumption limits the scope of discoverable vulnerabilities to only those that manifest as simple signal selection conflicts, potentially missing more subtle and complex contention channels.
-
Naive Fuzzing Guidance Metric: The guidance metric, which exclusively seeks to minimize the request interval (reqsIntvl) (Section 6.2.1, page 6), is myopic. While forcing requests to be simultaneous (interval of zero) is a valid strategy for triggering simple volatile contentions, it fails to account for more complex scenarios. Many powerful side channels require precise, non-zero timing relationships (e.g., one request arriving exactly N cycles after another to exploit pipeline staging, buffer states, or arbiter fairness timers). The framework's singular focus on minimizing this interval is a greedy approach that likely prevents the discovery of such vulnerabilities and can easily get trapped in local optima where simple contentions mask the path to more complex ones.
Overstated Novelty of Findings: The claim of uncovering 11 "previously unknown" side channels (Abstract, page 1) is not supported by a rigorous analysis of the results. A close review of Table 3 (page 10) reveals that many of these "new" channels are simply instances of well-understood contention principles manifesting on the specific microarchitectures of BOOM and NutShell:
- S1-S4 (TileLink Contention): This is a textbook case of bus/interconnect contention. Discovering that a long-running transaction can block a shorter one on a shared bus is not a novel vulnerability class.
- S5 (MSHR Contention): This is a specific manifestation of MSHR pressure, a known side-channel vector. The paper itself notes its relation to Speculative Interference Attacks [11]. The "false sharing path blocking" appears to be a name for a specific trigger condition, not a new fundamental mechanism.
- S11 & S12 (L1 DCache Contention): These are intra-thread variants of classic cache contention attacks (e.g., Prime+Probe, Flush+Reload). The novelty is limited to demonstrating they can be triggered without SMT, which is an interesting but incremental finding, not a new class of vulnerability.
- The paper fails to demonstrate the discovery of a truly novel class of contention vulnerability.
-
Insufficient Evaluation of Exploitability: The exploitability analysis (Section 7.3, page 7 and Section 8.5, page 12) is superficial. Applying a generic Meltdown-style template demonstrates the existence of a transient execution window but provides no rigorous analysis of the signal-to-noise ratio, the difficulty of timing the attack under realistic system load, or the precise attacker capabilities required. The complete failure to construct a working PoC on NutShell (accuracy <2%) is a major red flag that is inadequately explained away by "earlier exception detection." This failure strongly suggests that the detected timing variations on NutShell are practically unexploitable and should not be classified as significant vulnerabilities without further evidence.
-
Limited Generalizability: The evaluation is confined to two academic, open-source RISC-V processors. The findings cannot be assumed to generalize to commercial-grade processors from vendors like Intel, AMD, or ARM, which feature far more complex and proprietary microarchitectures, interconnects, and prefetchers. Many of the uncovered issues appear to be artifacts of specific design choices in BOOM and NutShell rather than fundamental, universal principles.
Questions to Address In Rebuttal
-
Regarding the contention model: Can you provide a compelling argument or evidence that contention mechanisms not directly represented as MUX cascades (e.g., complex token-based arbiters, buffer occupancy management) are adequately covered by your approach? Please provide an example of a known side channel based on such a mechanism and explain how Sonar would detect it.
-
Regarding the guidance metric: How does the reqsIntvl minimization strategy avoid missing vulnerabilities that require a precise, non-zero timing interval between contending requests? Have you considered alternative guidance metrics that reward specific temporal patterns rather than just simultaneity?
Regarding novelty: Please justify the claim that channels S1-S4 and S11-S12 represent novel vulnerability classes, distinct from previously documented bus/interconnect and cache-based contention attacks, respectively. What is the fundamental new principle being demonstrated beyond its occurrence in a specific RISC-V core?
-
Regarding exploitability: Beyond citing "earlier exception detection," what specific microarchitectural investigation was performed to explain the <2% accuracy for the PoCs on NutShell? Does this result not suggest that the timing variations Sonar detects can be academic artifacts with no practical security impact? Why should these be considered exploitable vulnerabilities?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents Sonar, a novel pre-silicon fuzzing framework designed to systematically uncover contention-based side channels in processor designs. The authors' core contribution is a microarchitecture-aware approach that departs from traditional random instruction fuzzing. Sonar is built on the key insight that resource contention often manifests at multiplexers (MUXes) within the circuit.
The framework operates in three stages:
1. It first uses a bottom-up tracing methodology starting from MUX outputs to automatically identify "contention points" and their associated states within the RTL design.
2. It then employs a state-guided fuzzing loop, using the timing interval between requests (reqsIntvl) at these contention points as a fitness metric to guide testcase mutation, progressively driving the design towards a state of contention.
3. Finally, it uses a "dual-differential comparison" method—comparing both instruction commit times and the microarchitectural contention states under different secret values—to accurately detect and pinpoint the root cause of the side channels.
Evaluated on the BOOM and NutShell open-source RISC-V cores, Sonar successfully identified 14 contention side channels, 11 of which are previously undocumented.
Strengths
The primary strength of this paper lies in its elegant and effective conceptual bridge between the architectural and microarchitectural domains for security verification.
-
A Powerful, Foundational Heuristic: The central idea to model resource contention as a problem of MUX arbitration (Section 5.1, page 4) is a significant contribution. It provides a concrete, automatable, and scalable way to find potential contention "hotspots" across a complex processor design without needing manual specifications. This moves the field beyond simply fuzzing instructions and hoping for the best, towards a more targeted and intelligent search. It's a foundational insight that could inform future work in hardware security verification.
-
Effective State-Guided Fuzzing: The paper successfully translates this microarchitectural insight into a practical feedback mechanism for fuzzing. Using the request interval (reqsIntvl) as a fitness function (Section 6.2.1, page 6) is a clever way to solve the notoriously difficult problem of triggering timing-sensitive events. It creates a gradient that the fuzzer can descend to force simultaneous requests, something that random mutation would struggle to achieve efficiently. This methodology represents a significant maturation of hardware fuzzing techniques for security (a minimal sketch of such a feedback loop follows this list).
Contextual Significance and Strong Results: This work is situated perfectly at the intersection of hardware verification and computer security. The need for pre-silicon detection of side channels is well-established, as post-silicon fixes are immensely costly. By finding 11 new, potentially exploitable channels in well-regarded open-source designs like BOOM (Table 3, page 10), the authors provide compelling evidence that their framework is not just a theoretical novelty but a practical and necessary tool. It effectively addresses a known gap left by both less-targeted fuzzers (like SpecDoctor) and less-scalable formal methods (like UPEC).
-
End-to-End Systematization: Sonar is presented not as a single trick, but as a complete, end-to-end framework. The combination of automated hotspot identification, guided triggering, and the dual-differential analysis for precise detection creates a systematic and repeatable process. This level of automation is crucial for integrating security analysis into the standard hardware design lifecycle.
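As referenced in the second strength above, a toy feedback loop around a reqsIntvl-style fitness function might look like the following. The simulator and mutation operator are stand-ins, so this illustrates the guidance principle rather than Sonar's actual implementation.

```python
# Toy reqsIntvl-guided loop: keep mutants that bring two competing requests
# closer together in time until they arrive simultaneously.
import random

def run_simulation(testcase):
    """Stand-in for RTL simulation: derive the cycle at which each competing
    requester fires from the testcase bytes (a real run would simulate the DUT)."""
    return {"req_a": sum(testcase[0::2]) % 41,
            "req_b": sum(testcase[1::2]) % 41}

def reqs_intvl(arrivals):
    """Fitness: gap between competing requests at one contention point
    (0 means simultaneous arrival, i.e. contention is actually exercised)."""
    return abs(arrivals["req_a"] - arrivals["req_b"])

def mutate(testcase):
    t = list(testcase)
    t[random.randrange(len(t))] = random.randrange(256)  # perturb one "instruction"
    return t

random.seed(1)
best = [random.randrange(256) for _ in range(16)]
best_fit = reqs_intvl(run_simulation(best))
for _ in range(2000):
    cand = mutate(best)
    fit = reqs_intvl(run_simulation(cand))
    if fit < best_fit:          # keep mutants that reduce the request interval
        best, best_fit = cand, fit
    if best_fit == 0:           # simultaneous requests: hand the testcase to detection
        break
print("final reqsIntvl:", best_fit)
```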
Weaknesses
The weaknesses of the paper are primarily related to the boundaries of its core assumptions and its positioning relative to adjacent methodologies. These are opportunities for clarification and future exploration rather than fatal flaws.
-
Conceptual Limits of the MUX Model: The MUX-centric view is powerful for arbitrated resources (ports, buses, interconnects), but its applicability to other forms of contention is less clear. For example, the paper reports finding contention on a non-pipelined Multiply-Divide Unit (S13). While this is a valid contention channel, it stems from the unit's internal "busy" state rather than an input MUX selecting between competing requests. The paper would be strengthened by a discussion on the conceptual limits of the MUX model and how it generalizes (or doesn't) to other forms of resource occupancy.
-
Generalizability of the Implementation: The framework's implementation relies on analyzing FIRRTL, a high-level intermediate representation generated from Chisel (Section 8.2, page 8). While this is pragmatic for academic research using open-source cores, it raises questions about its applicability in industrial settings, which predominantly use Verilog/SystemVerilog and may not provide access to such a high-level IR. A discussion on the challenges and potential pathways for adapting the "bottom-up tracing" to standard Verilog RTL or even gate-level netlists would significantly broaden the perceived impact of this work.
-
Nuance in Comparison to Formal Methods: The paper positions itself against formal methods primarily on the basis of scalability. This is a fair and important point. However, the discussion could be more nuanced. Formal methods provide proofs of absence (within certain bounds), a powerful guarantee that fuzzing cannot offer. A brief exploration of the fundamental trade-offs would be valuable. Are there classes of contention channels Sonar might miss that a formal tool could theoretically find, and vice versa?
Questions to Address In Rebuttal
-
Could the authors elaborate on the conceptual boundaries of their MUX-based contention model? For instance, how does it capture contention for non-pipelined functional units (e.g., S9, S13 in Table 3) where the bottleneck isn't necessarily request arbitration at an input MUX, but rather the busy state of the unit itself? Is the framework implicitly monitoring this through other MUXes (e.g., on the writeback path), or does this suggest the need for a hybrid model?
-
The reliance on FIRRTL is pragmatic for the chosen evaluation targets. Can the authors comment on the feasibility of adapting the "bottom-up tracing" methodology to more standard, lower-level representations like Verilog RTL or even gate-level netlists, which would be crucial for industrial adoption?
-
Beyond scalability, could you provide more insight into the fundamental trade-offs between Sonar's dynamic approach and formal verification methods like UPEC? Are there specific types of contention channels (perhaps those requiring extremely complex state setup) that one is inherently better suited to find than the other?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents Sonar, a pre-silicon fuzzing framework for detecting contention-based side channels in processor RTL designs. The authors propose a three-part methodology. First, they introduce a technique to systematically identify potential contention points by treating multiplexers (MUXes) as proxies for resource arbitration and using a bottom-up tracing algorithm. Second, they employ a guided fuzzing strategy where the feedback metric is the timing interval between competing requests (reqsIntvl) at these MUX-defined contention points, with the goal of minimizing this interval to trigger simultaneous contention. Third, they use a "dual-differential" analysis method to confirm the side channel, which compares both instruction commit-timing differences and the underlying microarchitectural contention states under different secret values.
The core novelty of this work rests on the combination of these three ideas: the systematic, structural identification of contention points via MUX analysis to guide a fuzzer, the use of a microarchitectural timing interval as a direct feedback loop for mutation, and the automated root-cause analysis that links observed timing effects to specific contention events. While individual components build on existing concepts (guided fuzzing, differential testing), their synthesis into a cohesive framework targeted specifically at contention side channels appears to be a novel contribution to the field of pre-silicon hardware security verification.
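To make the feedback loop concrete, the following is a minimal sketch of an interval-guided fuzzing loop of the kind described above; the helper names (run_rtl_sim, mutate), the trace format, and the loop structure are assumptions for illustration, not Sonar's actual implementation.

# Illustrative sketch of interval-guided fuzzing as described in this summary.
# run_rtl_sim and mutate are hypothetical stand-ins supplied by the harness.
def reqs_intvl(trace):
    """Smallest cycle gap between competing requests observed at a monitored
    MUX-defined contention point (lower is better; 0 means simultaneous)."""
    cycles = sorted(event["cycle"] for event in trace)
    return min((b - a for a, b in zip(cycles, cycles[1:])), default=float("inf"))

def fuzz(seed, run_rtl_sim, mutate, budget=1000):
    best, best_fit = seed, reqs_intvl(run_rtl_sim(seed))
    for _ in range(budget):
        cand = mutate(best)
        fit = reqs_intvl(run_rtl_sim(cand))
        if fit < best_fit:          # descend the gradient toward simultaneous requests
            best, best_fit = cand, fit
        if best_fit == 0:           # contention actually triggered
            break
    return best, best_fit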
Strengths
From the perspective of novelty, the paper has several significant strengths:
-
S1: A Novel Heuristic for Locating Contention. The central idea of using MUXes as a structural signature for contention points (Section 5.1, page 4) is a genuinely new and clever heuristic for the purpose of fuzzing. Prior fuzzing work (e.g., SpecDoctor, SIGFuzz) has relied on more abstract or effect-based monitoring (e.g., transient state coverage, commit timing). Sonar's approach is the first I am aware of to systematically parse the circuit structure itself (via FIRRTL) to derive targets for contention, which is a significant methodological advancement. The "bottom-up tracing" method to identify the full set of inputs to a cascaded MUX is a concrete, novel algorithm that enables this approach.
-
S2: A Novel Feedback Mechanism for Triggering Contention. Guided fuzzing is not new, but the choice of feedback is critical and domain-specific. The authors' proposal to use the inter-request timing interval (reqsIntvl) as the fitness function for the fuzzer (Section 6.2.1, page 6) is a novel application tailored to the unique challenge of triggering timing-sensitive microarchitectural events. Driving mutations to explicitly minimize this value is a far more directed strategy than the random or coverage-guided approaches of prior work and represents a new technique in the hardware fuzzer's toolkit.
-
S3: A Novel Method for Automated Root Cause Analysis. Standard differential testing can reveal the existence of a timing leak, but not necessarily its cause. Sonar’s "dual-differential comparison" method (Section 7, page 7) is a notable contribution. By simultaneously comparing the Commit Cycle Difference (CCD) to isolate the affected instruction and the contention state logs to identify the responsible MUX, the framework moves beyond mere detection to automated attribution. This linking of effect to cause is a qualitative improvement over prior art and represents a novel analysis workflow.
Weaknesses
The primary weaknesses relate to the framing of the novelty and the evaluation of its significance.
-
W1: Overstated "First Mover" Claim. The abstract claims Sonar is the "first systematic and automated fuzzing framework designed to uncover contention side channels". This claim is too strong and potentially misleading. Frameworks like SpecDoctor [25] and SIGFuzz [26] are also fuzzing frameworks that can, and do, uncover contention side channels (e.g., port contention is a known Spectre-v4 variant). The true novelty of Sonar is not that it finds them, but how it finds them. The claim should be refined to state that it is the first framework to use structural analysis of arbitration logic (MUXes) to systematically guide a fuzzer towards triggering these contentions. The innovation is in the specific guidance mechanism, not in being the first tool in the category.
-
W2: Insufficient Differentiation from Conceptually Adjacent Ideas. The concept of resource contention is fundamental to computer architecture. While using MUXes for fuzzing is new, the idea that arbiters are contention points is not. The paper would be stronger if it more clearly delineated why its automated MUX-based approach provides a significant advantage over a simpler, more manual approach where a designer might identify major arbiters (e.g., for execution ports, memory ports, bus interfaces) as monitoring targets. The novelty lies in the automation and scale, and this should be emphasized more.
-
W3: Complexity vs. Benefit Analysis. The proposed methodology introduces significant complexity: a full-design RTL pass for MUX tracing and extensive instrumentation for dynamic reqsIntvl monitoring. The overheads are non-trivial, as shown in Table 2 (page 9). For the novelty to be truly significant, the benefit must clearly outweigh this cost. The comparison with SpecDoctor in Section 8.3.4 (page 9) shows Sonar triggers more contentions, which is a good result. However, this comparison doesn't fully assess the trade-off. Is it possible that a much simpler heuristic (e.g., monitoring a handful of key pipeline stall signals) could achieve 80% of the results with 10% of the complexity? A discussion on this trade-off would help solidify the significance of the proposed novel, but complex, approach.
Questions to Address In Rebuttal
-
Please refine your primary novelty claim. Given that prior fuzzers like SpecDoctor [25] can detect contention-based channels, can you clarify precisely what makes Sonar the "first" in its class? Is the novelty more accurately described as being the first structurally-guided fuzzer for this purpose?
-
The core of your guidance is the MUX-based heuristic. How robust is this heuristic? Are there common sources of resource contention in modern processors that are not implemented with straightforward MUX trees and would therefore be missed by Sonar? Conversely, does the automated tracing produce a large number of "contention points" at MUXes that are architecturally uninteresting or do not represent a meaningful shared resource, leading to wasted effort by the fuzzer?
-
Your reqsIntvl feedback mechanism is innovative but seems computationally intensive. For a complex contention point with N inputs, the fuzzer must track O(N^2) request pairs. How does the framework scale as the number of identified contention points and their input fan-in increases, especially in a large, complex SoC design beyond the evaluated BOOM/NutShell cores?
Symbiotic Task Scheduling and Data Prefetching
Abstract
Task-parallel programming models enable programmers to extract parallelism from irregular applications. Since software-based task-parallel runtimes impose crippling overheads on fine-grain tasks, architects have designed manycores with hardware support ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper presents a symbiotic hardware prefetcher, the Task-Seeded Prefetcher (TSP), and a prefetch-aware task scheduler, the Memory Response Task Scheduler (MRS), designed for fine-grain, task-parallel manycore systems. The central thesis is that task descriptors, available to the scheduler before task execution, can seed a prefetcher to overcome the limitations of conventional prefetchers, which are ineffective for short-lived tasks. MRS then leverages the status of these prefetches to reorder task execution, prioritizing tasks whose data is already in cache. The authors evaluate this system across four scheduler types and two cache sizes, claiming significant speedups.
While the problem is well-defined and the high-level approach is conceptually plausible, the work suffers from a critical methodological flaw in its core premise of "predictability," and the evaluation reveals fundamental negative interactions that the proposed mechanisms exacerbate rather than solve. The paper identifies these serious pathologies but dismisses them as out of scope, undermining the claimed general applicability and robustness of the proposed solution.
Strengths
-
Problem Motivation: The paper correctly identifies a critical performance bottleneck in hardware-assisted task-parallel systems: the ineffectiveness of conventional memory prefetchers for very short tasks. The data in Figure 1 and Table 1 provides clear evidence of both the performance gap due to memory latency and the short duration of tasks, strongly motivating the need for a new approach.
-
Core Concept: The fundamental idea of using the task descriptor (function pointer and arguments), which is known well in advance of execution, to initiate prefetching is sound and represents a logical direction for these architectures.
-
Evaluation Breadth: The authors conduct an extensive evaluation, covering four distinct scheduling policies (combinations of speculative/non-speculative and random/spatial mapping) and two different cache configurations. This breadth attempts to demonstrate the general utility of their mechanisms.
Weaknesses
-
Circular Definition of Predictability: The paper's central claim rests on the "predictability" of memory accesses. However, this predictability is never independently validated. Table 1 (page 3) presents "Predictable loads (%)" as evidence, but the text on page 3 reveals how this is measured: by "counting the accesses for which TSP is able to issue prefetches." This is a tautology. The evidence for predictability is that the proposed predictor can predict them. This fails to provide any objective, foundational proof that these access patterns are inherently simple. A rigorous analysis would first classify access patterns (e.g., affine on arguments, multi-level pointer chasing) and then demonstrate the mechanism's coverage, rather than defining coverage as the metric itself.
-
Fundamental Negative Interaction 1: Priority Inversion: The symbiotic relationship between TSP and MRS is shown to be actively harmful for priority-ordered algorithms. The authors explicitly document a 1.3x slowdown for sssp-r on the NONSPECSPAT 1/16-cache system (Section 5.2, page 9). They correctly diagnose the cause: MRS's reordering of tasks based on data availability causes severe "priority inversions," leading to a 2.1x increase in the number of tasks executed. Their response to this fundamental design flaw is to state, "We leave applying MRS to such specialized schedulers as future work." This is unacceptable. A general-purpose scheduler that breaks a canonical, high-performance graph algorithm is not a general-purpose solution. The paper demonstrates a core tension between latency hiding and work efficiency and fails to resolve it.
-
Fundamental Negative Interaction 2: Increased Speculation Aborts: The evaluation shows that for msf-o, the proposed mechanisms lead to a performance degradation due to increased task aborts (Section 5.2, page 9). Aggressive prefetching, by its nature, brings data into the cache earlier, widening the window for data conflicts in a speculative system. This increases the likelihood of aborts, wasting significant work. Again, the authors diagnose a critical negative interaction that their "symbiotic" system creates but offer no solution within their framework. This suggests the two components are not truly symbiotic but can be mutually destructive under common conditions.
-
Inefficient Prefetching for Common Control Flow: The paper acknowledges that TSP prefetches aggressively for "moot tasks"—tasks that exit early based on a data-dependent condition (Section 5.3, page 12). This generates useless memory traffic and pollutes the cache. The authors' suggested remedy is to use a different architecture entirely (Hive), which is an admission that their mechanism is inefficient for a prevalent pattern in irregular applications. A robust prefetcher must be more intelligent about control flow.
-
Ad-Hoc Training Mechanism: The training FSM in Figure 5d relies on a "Randomize Dependence" transition when a learned pattern repeatedly fails. The paper provides no justification for this seemingly ad-hoc heuristic or analysis of its convergence properties. In a system with many potential dependencies (arguments and prior loads), random guessing could be highly inefficient and fail to converge on the correct pattern in a timely manner, if at all.
Questions to Address In Rebuttal
-
Please provide a mechanism-independent analysis of the memory access patterns in your benchmarks to substantiate the claim of high predictability. For instance, what percentage of dynamic loads are simple affine functions of task arguments, and what percentage are more complex indirect patterns?
-
The slowdown in sssp-r is attributed to priority inversion caused by MRS. This is not a minor outlier but a failure on a canonical algorithm for which priority scheduling is essential. How would you modify MRS to balance the goals of latency hiding and maintaining priority order to prevent such work-inefficiency pathologies? Simply stating it is "future work" is insufficient.
-
Your system increases abort rates in speculative execution for benchmarks like msf-o. How can the "symbiotic" relationship between TSP and MRS be modified to mitigate this effect? For example, should MRS consider the likelihood of data conflicts when reordering tasks?
-
Can you provide a more rigorous justification for the "Randomize Dependence" step in your training FSM? What is the search space for this randomization, and what is the expected convergence time for a task with N arguments and M prior loads? Have you considered more structured search heuristics?
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents a symbiotic co-design of a data prefetcher (TSP) and a task scheduler (MRS) for fine-grained, hardware-assisted, task-parallel systems. The core thesis is that in such systems—where conventional prefetching fails due to the short lifetime and unpredictable dispatch order of tasks—the scheduler and prefetcher must work in tandem.
The authors' solution facilitates a bidirectional information flow:
1. Scheduler to Prefetcher: The scheduler "seeds" the Task-Seeded Prefetcher (TSP) with descriptors of tasks waiting in the queue, providing the crucial lookahead needed to initiate timely memory requests. TSP learns the memory access patterns of task functions, including direct, indirect, and striding patterns, based on their arguments.
2. Prefetcher to Scheduler: The Memory Response Task Scheduler (MRS) leverages the status of these prefetches. It adjusts the baseline scheduling policy to prioritize dispatching tasks whose data has already arrived in the cache, thereby improving core utilization and hiding memory latency.
This symbiotic approach is shown to yield significant speedups (gmeans up to 1.4x, peak 3.1x) on systems that are already highly optimized, effectively unlocking the performance potential that was previously gated by memory latency.
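As an illustration of the prefetcher-to-scheduler direction, here is a minimal sketch of data-readiness-aware dispatch on top of a priority-ordered queue; the Task fields, the prefetch_done oracle, and the bounded reorder window are assumptions for illustration, not the paper's hardware policy.

# Illustrative sketch of MRS-style dispatch over a baseline priority order.
# Assumes a non-empty queue; all names are hypothetical.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Task:
    priority: int        # baseline order (lower = more urgent)
    descriptor: tuple    # (function pointer, arguments) used to seed the prefetcher

def dispatch(queue: List[Task], prefetch_done: Callable[[Task], bool],
             window: int = 4) -> Task:
    """Prefer a task whose prefetched data has arrived, but only look a small
    window past the head so the baseline priority order is not discarded."""
    head = sorted(queue, key=lambda t: t.priority)[:window]
    ready = [t for t in head if prefetch_done(t)]
    chosen = ready[0] if ready else head[0]   # fall back to strict priority
    queue.remove(chosen)
    return chosen

Bounding the reorder window is one conceivable way to cap the priority-inversion pathology discussed under Weaknesses; the paper's actual policy is not reproduced here.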
Strengths
The primary strength of this work lies in its elegant and insightful framing of a fundamental problem. It sits squarely at the intersection of two major research thrusts in computer architecture: overcoming the memory wall and exploiting fine-grained parallelism.
-
Identifies and Solves a Critical Bottleneck: The hardware task-parallelism model, exemplified by prior work like Swarm [39], was a major step forward for irregular applications. However, as this paper correctly and clearly articulates in Section 2 (page 2), this model created a new problem: it broke the assumptions that underpin decades of prefetching research. This paper doesn't just identify this gap; it provides a compelling and well-motivated solution. It is an essential, enabling technology for this entire class of architectures.
-
Novelty of the "Symbiotic" Co-design: While the components build on established concepts (e.g., TSP's learning mechanism is an intelligent adaptation of indirect memory prefetchers like IMP [94]), the true novelty is the tight, closed-loop integration. The idea that the task queue is not just a work container but a rich source of semantic lookahead for the memory system is powerful. Furthermore, using prefetch status to guide scheduling (MRS) moves beyond simple locality-aware scheduling into a new domain of "data-readiness-aware" scheduling. The illustrative example in Figure 4 (page 5) is exceptionally effective at communicating this core contribution.
-
Excellent Contextualization and Synthesis: The authors demonstrate a panoramic understanding of the field. They astutely draw parallels between their general-purpose approach and the hard-wired data-fetching mechanisms found in specialized accelerators. In essence, TSP and MRS can be seen as a way to achieve the data-fetching efficiency of an accelerator within a flexible, general-purpose, programmable tasking model. This is a significant conceptual contribution.
-
Strong and Convincing Evaluation: The experimental methodology is thorough, testing the proposed techniques across four distinct scheduler designs and two different cache configurations. The comparison of a 1/16th size cache with TSP/MRS against a full-size cache without a prefetcher is particularly powerful. The results, shown in Figure 6 (page 10), often demonstrate that a smaller, "smarter" cache with this symbiotic system is more effective than a large, "dumb" one, making a strong case for the area efficiency of this approach.
Weaknesses
The weaknesses of the paper are not in its core idea, which is sound, but in the completeness of the symbiotic relationship and its potential interactions with other system dynamics.
-
First-Order Symbiosis Ignores System-Level Pathologies: The paper is commendably transparent about scenarios where the system degrades performance, such as for sssp-r on the non-speculative spatial scheduler or msf-o on the speculative one (Section 5.2, page 9). The authors attribute this to negative interactions with priority inversion and increased speculation aborts. This suggests that the current symbiosis is "first-order"—it optimizes for the data readiness of individual tasks but is blind to second-order, system-wide effects like speculation pressure or fairness. A deeper integration might involve the prefetcher providing contention information to the scheduler, or the scheduler informing the prefetcher about a task's likelihood of being aborted.
-
Limited Scope of Predictability: The prefetching model is based on affine functions of task arguments and prior loads (Equation 1, page 4). This is a practical and effective choice for the workloads studied, but it represents a specific point in the design space. More complex, non-affine access patterns (e.g., hash-based or input-dependent control flow) would not be captured. While this is a reasonable limitation, the work would be stronger if it discussed where it fits on the spectrum between simple stride prefetchers and more heavyweight solutions like runahead execution.
-
Unexplored Behavior on "Friendly" Workloads: The evaluation focuses exclusively on irregular workloads, which is the target domain. However, to understand its place in a truly general-purpose system, it would be valuable to understand how TSP/MRS behaves on workloads where conventional stream/stride prefetchers excel. Is there a risk of negative interference, or does the system gracefully handle these cases as well? This would help contextualize the overall utility and robustness of the design.
Questions to Address In Rebuttal
-
The performance degradation due to increased aborts and priority inversion is a fascinating result. It suggests that aggressively reordering tasks via MRS can disrupt the delicate balance of speculative and priority-ordered systems. Could the symbiotic relationship be extended? For example, could the scheduler throttle MRS’s reordering when speculative state storage is nearly full, or when a large priority gap emerges between the highest-priority task and the highest-priority ready task?
-
Regarding the training mechanism for TSP: How quickly does it adapt to phase changes within an application, where a task function's memory access patterns might change? The FSM in Figure 5d (page 6) seems robust, but some discussion on training latency and adaptability would strengthen the paper.
-
Could you elaborate on the potential of this architecture beyond latency hiding? The prefetcher learns a task’s memory footprint. Could this information be used for other optimizations, such as guiding data placement in a NUCA system or informing fine-grained power management by predicting memory- or compute-intensive phases of a task? This speaks to the broader potential of the symbiotic model.
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents two symbiotic hardware mechanisms, the Task-Seeded Prefetcher (TSP) and the Memory Response Task Scheduler (MRS), designed to address the memory latency bottleneck in manycore systems with hardware-assisted, fine-grain task parallelism. The authors identify a key problem: conventional prefetchers are ineffective for very short tasks (e.g., <200 cycles) because the task completes before the prefetcher can be trained and issue timely requests.
The core idea is to leverage the hardware task scheduler, which has access to task descriptors (function pointers and arguments) before a task is dispatched. TSP uses these descriptors to "seed" a prefetch engine, learning and predicting memory access patterns (argument-based, indirect, and striding) well before the task executes. MRS then augments the scheduler's dispatch logic, using feedback on prefetch completion from TSP to prioritize tasks whose data is already available in the cache, thereby reordering execution to maximize core utilization.
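For concreteness, here is a minimal sketch of argument-seeded address prediction in the style attributed to TSP (affine in a task argument, with chaining for indirect accesses); the pattern encoding and the read_mem helper are illustrative assumptions, not the paper's microarchitecture.

# Illustrative sketch: each learned pattern predicts addr = scale * ref + offset,
# where ref is either a task argument or the value loaded by an earlier pattern
# (chaining yields indirect accesses). All names are hypothetical.
from typing import Callable, List

def predict_prefetches(task_args: List[int],
                       patterns: List[dict],
                       read_mem: Callable[[int], int]) -> List[int]:
    addrs, produced = [], {}                 # produced[i] = value loaded by pattern i
    for i, p in enumerate(patterns):
        kind, idx = p["ref"]
        ref = task_args[idx] if kind == "arg" else produced[idx]
        addr = p["scale"] * ref + p["offset"]
        addrs.append(addr)
        produced[i] = read_mem(addr)          # may seed later (indirect) patterns
    return addrs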
Strengths
The primary strength of this work lies in its novel integration of two distinct hardware components—the task scheduler and the data prefetcher—to solve a well-defined and challenging problem. While the constituent concepts have appeared in prior art, their specific combination and application context are new.
-
Novel Prefetch Trigger Mechanism: The central novel idea is using the task descriptor from a hardware task queue as the seed for a sophisticated data prefetcher. Conventional hardware prefetchers are triggered by the program counter (PC) and the stream of memory accesses from an executing thread. Prior work on programmer-assisted prefetching for tasks (e.g., Tesseract [5], Minnow [95]) required explicit hints or separate prefetch tasks. TSP proposes an automatic mechanism that learns complex access patterns initiated not by execution, but by the mere act of a task being queued for future execution. This is a genuinely new trigger for this class of prefetcher.
-
Novel Scheduler-Prefetcher Symbiosis: The feedback loop where the prefetcher's status directly and dynamically influences the scheduler's dispatch decision (MRS) is a novel form of hardware integration for general-purpose task-parallel systems. While the high-level concept of data-driven or locality-aware scheduling exists, MRS implements this as a reactive optimization on top of a conventional priority-ordered scheduler. It doesn't replace the scheduling paradigm but augments it, as clearly illustrated in Figure 4 (page 4). This symbiotic relationship is the paper's key conceptual contribution.
-
Clear Articulation of the "Delta": The paper does a commendable job of positioning its contributions against prior art. It correctly identifies that its learning mechanism for indirect and affine access patterns is an adaptation of the principles from the Indirect Memory Prefetcher (IMP) [94], as stated in Section 3.2 (page 5). The authors do not claim to have invented this learning logic, but rather to have found a new and effective way to trigger and apply it. This honest and precise positioning strengthens the paper's novelty claims.
Weaknesses
The novelty of this work is in the combination and application, not in the fundamental building blocks themselves. A critical assessment reveals the following:
-
Constituent Mechanisms are Adaptations of Prior Art: The address prediction logic within TSP—predicting addresses as an affine function of a reference value (address = scale * [ref] + offset) and chaining these predictions for indirect accesses—is functionally identical to the core mechanism of IMP [94] and its successors like DMP [30]. The novelty is confined to the source of the ref value (a task argument from the scheduler vs. a value from a striding access stream).
-
Conceptual Overlap with Dataflow Principles: The core idea of MRS—delaying task execution until its inputs are available—is the foundational principle of dataflow computing. While the authors' implementation as a reordering heuristic within a priority queue is a novel implementation for this domain, the underlying concept is not new. The paper would be stronger if it more explicitly differentiated its approach from classical dataflow execution models, particularly highlighting how MRS preserves a baseline priority ordering that pure dataflow lacks.
Questions to Address In Rebuttal
-
Generalizability of the Learning Mechanism: The TSP implementation appears tightly coupled to the affine function prediction model adapted from IMP [94]. Given the significant body of work on more advanced data prefetchers that recognize more complex patterns (e.g., graph traversals, irregular pointer chains), could TSP's front-end (the "task-seeding" trigger) be decoupled from its back-end (the prediction engine)? How much of the proposed architecture is specific to this one learning model versus being a general framework for pre-execution prefetching?
-
Distinction from Dataflow Models: Please clarify the novelty of the MRS policy compared to task scheduling in dataflow or stream processing architectures, where computation is naturally triggered by data availability. The key distinction appears to be that MRS modifies an existing priority order rather than deriving the schedule solely from data dependencies. Can the authors elaborate on why this distinction is significant and what benefits it provides over a pure data-driven model in the context of their target workloads?
-
Defense of "First" Claim: The conclusion (Section 7, page 13) claims TSP is the "first prefetcher to target hardware task-parallel systems without requiring programmer assistance." While this appears true within the specific domain of general-purpose manycore architectures, some specialized accelerators (e.g., for graph processing or deep learning) use work descriptors to pre-stage data automatically. Can the authors confirm the novelty of their integrated scheduler-prefetcher design against these more domain-specific architectures which may employ conceptually similar hardware mechanisms?
Software Prefetch Multicast: Sharer-Exposed Prefetching for Bandwidth Efficiency in Manycore Processors
Abstract
As the core counts continue to scale in manycore processors, the increasing bandwidth pressure on the network-on-chip (NoC) and last-level cache (LLC) emerges as a critical performance bottleneck. While shared-data multicasting from the LLC can alleviate ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper proposes Software Prefetch Multicast (SPM), a software-hardware co-design to improve bandwidth efficiency in manycore processors. The core idea is to use a new "sharer-exposed" prefetch instruction, generated by software analysis, to inform the hardware (specifically, the LLC) of which cores form a "sharer group." Within a group, a designated "leader" core issues prefetches, which trigger a multicast of the shared data from the LLC to all "follower" cores. The followers block their own redundant prefetches. To handle thread execution variance, the design incorporates a timeout mechanism for followers and a leader-switching mechanism. The authors claim significant NoC bandwidth savings and speedups over baseline systems.
However, the proposed mechanism introduces substantial complexity and a cascade of corrective measures (timeouts, leader switching) to patch fundamental vulnerabilities in its core leader-follower model. The evaluation, while showing positive results on regular benchmarks, raises serious questions about the general applicability, hidden overheads, and robustness of the system in real-world scenarios.
Strengths
- Problem Motivation: The paper correctly identifies the escalating bandwidth pressure on the NoC and LLC in manycore systems as a critical bottleneck, providing a solid motivation for the work.
- Leveraging Software Insight: The high-level approach of using static (or compiler) information to guide hardware coherence actions is a valid research direction, potentially avoiding the pitfalls of purely speculative hardware predictors.
- Targeted Comparison: The analysis in Section 4.2 (page 8), which evaluates performance when the working set size exceeds the LLC capacity, is a relevant stress test that effectively highlights a key limitation of directory-history-based schemes like Push Multicast.
Weaknesses
-
Fragility and Compounding Complexity: The core leader-follower model is inherently fragile to thread asynchrony, a common case in parallel applications. The authors acknowledge this but address it with a series of complex, reactive patches. A follower waits (potentially stalling) for a slow leader. If it waits too long, it triggers a Timeout (Section 3.5, page 5), generating a unicast request and response, which partially defeats the purpose of multicast. If timeouts become frequent, a Leader-Followers Switching Mechanism (Section 3.6, page 6) is invoked, adding yet another layer of state management and network traffic. This design appears to replace the challenge of hardware prediction with an equally, if not more, complex challenge of hardware-based runtime synchronization management.
-
Introduction of New Performance Bottlenecks: The "Waiting Strategy" is a double-edged sword. While it reduces upstream traffic from followers, it can actively degrade performance by forcing a faster core to stall waiting for a multicast initiated by a lagging leader. The paper's own data confirms this failure mode. In Figure 13 (page 8), SPM performs worse than the standard L1Bingo-L2Stride prefetcher on backprop. The authors state this is because "a demand request may need to wait in the follower private cache." This is a critical admission that the mechanism can be actively harmful, yet this trade-off is not sufficiently analyzed.
-
Underestimation of System Overheads:
- Hardware Cost: The claim of "light" hardware overhead (Section 4, page 7) is questionable. A 64-core system requires 304 Bytes/LLC slice, 280 Bytes/L2 cache, and a 64-entry (464 Bytes) Waiting Table per L2. This totals (64 cores * (280 B + 464 B)) + (64 slices * 304 B) ≈ 67 KB of configuration/state SRAM across the chip, which is not a trivial hardware cost.
- Configuration Traffic: The Configuration Stage (Section 3.4, page 4) requires each sharer core to broadcast a config_req to all LLC slices. For a group of 16 sharers in a 64-slice system, this is 16 * 64 = 1024 configuration messages. While likely small packets, this initial broadcast storm is a non-trivial network overhead that is not quantified. The claim of "0.1 us" overhead is for a single round trip and does not account for this broadcast traffic or scenarios with many sharer groups.
- Context Switch Penalty: The description of context switch handling in Section 3.7 (page 6) is superficial. A thread switch-out triggers a de-allocation message, broadcast to all LLCs. A new thread switch-in, if part of a sharer group, must then re-initiate the entire configuration stage. The performance penalty of this complete teardown and rebuild of sharer state upon a context switch seems prohibitive and is not evaluated.
-
Evaluation Scope and Parameter Tuning:
- Benchmark Selection: The chosen benchmarks (e.g., cachebw, multilevel, conv3d) are characterized by highly regular, statically analyzable, bulk-synchronous sharing patterns. The proposed approach is tailor-made for these best-case scenarios. Its applicability to workloads with more dynamic, irregular, or pointer-based sharing is entirely unproven.
Timeout Threshold. As shown in Figure 19 (page 10), selecting a suboptimal value (e.g., 128 cycles instead of 512 forcachebw) can significantly degrade performance. The paper provides no methodology for how this critical parameter would be determined for arbitrary applications in a production environment, rendering the design impractical.
- Benchmark Selection: The chosen benchmarks (e.g.,
-
Inconsistent Bandwidth Savings Claims: Figure 17 (page 9) shows that for shared data, SPM results in slightly more ejected flits at the L2 cache than the baseline. The authors explain this is due to timeout-triggered unicasts and multicasts arriving at cores that do not yet need the data. This contradicts the narrative of pure bandwidth efficiency; the system reduces upstream request traffic but can increase downstream data traffic, including useless traffic that pollutes private caches.
Questions to Address In Rebuttal
-
The SPM design seems to be a cascade of fixes: the Timeout mechanism fixes the slow-leader problem, and the Leader Switching mechanism fixes the chronic slow-leader problem. Can the authors justify this layered complexity, and why is it superior to a simpler mechanism that embraces asynchrony, such as allowing followers to issue their own requests to be coalesced at the LLC?
-
Please provide a detailed analysis of the performance degradation seen in backprop (Figure 13, page 8). Specifically, quantify the average stall time introduced by the Waiting Strategy in follower caches and explain why this penalty is more severe than the latency savings from multicasting in this particular workload.
-
Regarding overheads:
- Please justify the claim that ~67KB of distributed state SRAM for a 64-core system is "light."
- Please quantify the network traffic (in flits) and latency of the Configuration Stage for a 16-sharer group in the 64-core system. How does this overhead scale as the number of distinct sharer groups in an application increases?
- What is the estimated performance impact (in cycles) of a single context switch, including the full de-allocation and re-configuration sequence described in Section 3.7 (page 6)?
-
The Timeout Threshold is a critical performance parameter. How do you propose this value be set in a real system where application behavior is not known a priori? Would this require a complex hardware/software runtime tuning mechanism, adding even more complexity to the design?
-
Given that SPM can increase the amount of data traffic ejected to private caches (Figure 17, page 9), some of which may be unused, how does the system ensure that the net effect is a reduction in energy consumption, not just a shift in where bandwidth is consumed?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents Software Prefetch Multicast (SPM), a software-hardware co-design aimed at mitigating the on-chip network (NoC) and last-level cache (LLC) bandwidth bottleneck in manycore processors. The core contribution is a mechanism that allows software (i.e., the compiler or programmer) to explicitly communicate the complete set of sharing cores ("sharer groups") to the hardware. This is achieved through new sharer-exposed prefetching instructions.
The hardware uses this information to perform precise, bandwidth-efficient multicasting. To handle the practical challenge of thread-to-thread performance variation, the authors propose a robust leader-follower model. In this model, only one designated "leader" thread issues the prefetch for the group, while "follower" threads block their own redundant prefetches and await the multicast data. Crucially, the system includes a timeout and dynamic leader-switching mechanism to ensure laggard leaders do not stall the entire group, adding a layer of resilience to the design. The evaluation demonstrates significant NoC bandwidth savings (42-50%) and substantial speedups (geomean of 1.28x-1.38x) on 16- and 64-core systems.
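To illustrate the follower-side behavior described above, a minimal sketch of the waiting strategy with a timeout and a leader-switch trigger follows; the threshold values, counters, and helper callbacks are assumptions for illustration, not the hardware of Sections 3.5-3.6.

# Illustrative sketch of a follower's handling of a demand access to a line it
# expects via multicast. All names and policies here are hypothetical.
TIMEOUT_THRESHOLD = 512   # cycles; the paper sweeps this parameter (Figure 19)
SWITCH_AFTER = 4          # assumed: consecutive timeouts before re-electing a leader

def follower_access(line, waited_cycles, timeout_count,
                    issue_unicast, request_leader_switch):
    if line.arrived:                          # multicast data already delivered
        return line.data, timeout_count
    if waited_cycles < TIMEOUT_THRESHOLD:     # keep waiting; suppress redundant request
        return None, timeout_count
    issue_unicast(line.addr)                  # timeout: fall back to a unicast fetch
    timeout_count += 1
    if timeout_count >= SWITCH_AFTER:         # chronic lag: ask the LLC to switch leader
        request_leader_switch()
        timeout_count = 0
    return None, timeout_count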
Strengths
-
Elegant and Direct Solution to an Information Problem: The fundamental strength of this work lies in its diagnosis of the core problem: hardware, on its own, has an incomplete picture of application-level data sharing. Speculative approaches like Push Multicast [12] are clever but are ultimately constrained by the limitations of historical predictors (e.g., directory capacity, silent evictions, first-access misses). SPM elegantly sidesteps this by creating a direct channel for the software, which possesses ground-truth knowledge of sharing patterns, to inform the hardware. This shifts the paradigm from speculative prediction to explicit direction, which is a powerful and promising approach.
-
Pragmatic Handling of System Dynamics: A purely static leader-follower model would be too brittle for real-world execution. The paper's most impressive design feature is its clear-eyed acknowledgement and handling of thread variation (as motivated in Section 2.3, page 3). The timeout mechanism combined with the dynamic leader-switching algorithm (Section 3.6, page 6) provides the necessary resilience to make the co-design practical. This elevates the work from a simple "what if" idea to a well-considered system. The ablation study in Figure 22 (page 11) effectively validates the necessity of these components.
-
Strong Placement within the Research Landscape: The authors successfully position their work as a synthesis of several long-standing research threads. It leverages the precision of software prefetching, applies the bandwidth-saving principles of multicasting (seen in early work like the NYU Ultracomputer [10]), and embodies the modern trend of software-hardware co-design. It provides a compelling alternative to purely hardware-based coherence prediction schemes [18, 19, 23] by trading hardware complexity and speculation for software hints and ISA support.
-
Well-Scoped and Insightful Future Work Discussion: The discussion in Section 6 (page 12) is a significant strength, as it demonstrates a broad understanding of the system-level implications. The authors thoughtfully consider the interaction with hardware prefetchers, the path toward compiler automation, and the potential impact of new instructions like AMX. This shows maturity and provides a clear roadmap for how this foundational idea could be expanded and integrated into future systems.
Weaknesses
While the core idea is strong, its applicability and system-level integration could be further explored.
-
Software Scope and Generality: The current evaluation relies on manual analysis and insertion of SPM instructions into workloads with regular, statically analyzable sharing patterns (e.g., dense linear algebra). The true test of such a co-design is its generality. It remains an open question how effective SPM would be for applications with more dynamic, input-dependent, or irregular sharing (e.g., graph analytics, sparse solvers, or certain pointer-intensive data structures). While compiler support is noted as future work, the fundamental recognizability of sharer groups is a prerequisite.
-
Configuration and Context-Switch Overhead: The configuration stage (Section 3.4, page 4) involves broadcasting requests to establish sharer groups and leader/follower roles. The paper asserts this overhead is low, but in a workload characterized by many frequent, short parallel regions, this setup/teardown cost could become significant. Similarly, the process for handling context switches (Section 3.7, page 6) involves de-allocation and re-configuration messages, which could add non-trivial overhead in a heavily multi-programmed environment. The impact of these transient states on performance is not fully characterized.
-
Interaction with the Coherence Protocol: The paper focuses primarily on the data movement and bandwidth aspects of the design. However, the interaction with the underlying cache coherence protocol (e.g., MESI) is not fully detailed. For example, what happens if a follower thread attempts to issue a store (a Request for Ownership) to a cache line that is currently in-flight as a read-only multicast prefetch triggered by the leader? This could introduce complex races or require additional logic in the cache controllers that is not discussed.
Questions to Address In Rebuttal
-
Beyond the evaluated workloads, could the authors comment on the applicability of SPM to applications with more dynamic or opaque sharing patterns, such as graph processing? Is the mechanism fundamentally limited to statically-determinable sharing, or is there a path to supporting more irregular workloads, perhaps through runtime profiling as hinted at in Section 6?
-
Could the authors elaborate on the scalability of the configuration stage? While the latency for a single configuration is low, can they provide analysis or data on the potential performance impact in a scenario with thousands of small, distinct parallel regions, where the configuration overhead might become a dominant factor?
-
Can the authors clarify the interaction with the base MESI coherence protocol? Specifically, how are potential write-after-read races handled if a follower thread issues a store to an address that is currently being prefetched via multicast for the group? Does the blocked prefetch request in the follower's private cache also stall subsequent stores to the same address until the multicast data arrives?
-
The Timeout Threshold appears to be a sensitive tuning parameter (Figure 19, page 10). Does this parameter require per-application tuning for optimal performance, or have the authors identified heuristics that would allow for a robust, system-wide default value?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper proposes Software Prefetch Multicast (SPM), a software-hardware co-design to improve bandwidth efficiency in manycore processors by optimizing the handling of shared data. The core claim is that by exposing software-level knowledge of data sharers to the hardware, a more precise and timely multicast can be achieved, overcoming the limitations of prior hardware-only prediction or request coalescing schemes.
The authors identify three primary contributions as novel:
1. A new ISA extension in the form of a "sharer-exposed" prefetching instruction that carries a Group ID, and a corresponding software-hardware interface for configuring these sharer groups.
2. A microarchitecture centered on a "leader-follower" model, where a single designated "leader" thread issues the multicast-triggering prefetch on behalf of the entire group.
3. A dynamic leader-switching mechanism, based on timeouts, to adapt to thread execution variation and prevent stalls.
My analysis concludes that while the individual concepts (prefetching, multicasting, leader election) are not new in isolation, their specific synthesis into a coherent software-directed multicast framework represents a novel and meaningful contribution. However, the degree of this novelty is evolutionary, not revolutionary, and comes at the cost of significant mechanism complexity that warrants scrutiny.
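As a point of reference for the discussion below, a minimal sketch of LLC-side sharer-group bookkeeping keyed by Group ID; the Share Map layout, message handling, and leader-election rule are assumptions for illustration, not the paper's exact structures.

# Illustrative sketch of sharer-group state at an LLC slice. Hypothetical names.
class ShareMap:
    def __init__(self):
        self.groups = {}   # group_id -> {"sharers": set of core ids, "leader": core id}

    def config_req(self, group_id, core_id):
        g = self.groups.setdefault(group_id, {"sharers": set(), "leader": None})
        g["sharers"].add(core_id)
        if g["leader"] is None:
            g["leader"] = core_id          # assumed: first registrant becomes leader

    def on_leader_prefetch(self, group_id, addr):
        g = self.groups[group_id]
        return [(addr, core) for core in g["sharers"]]   # multicast targets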
Strengths
-
Novelty of the Core Abstraction: The central novel idea is the "sharer-exposed prefetch" instruction (Section 3.2, page 4). This fundamentally shifts the paradigm from hardware inferring sharers to software declaring them. This is a significant delta from the closest prior art, Push Multicast [12], which relies on historical sharer information stored in the directory. SPM uses a priori knowledge from the program structure, which is inherently more accurate for predictable sharing patterns and does not suffer from limitations like directory capacity or the need for silent eviction protocols. Similarly, it is a clear step beyond generic hints like Intel's MOVRS [14], which indicates data is "read-shared" but crucially does not identify the specific set of sharers. The SPM instruction's Group ID provides this missing link.
-
Novel Solution to a Known Problem: The paper correctly identifies thread variation as a fundamental limiter for passive multicast/coalescing techniques (Section 2.3, page 3). The proposed dynamic leader-switching mechanism (Section 3.6, page 6) is a novel microarchitectural solution tailored specifically to their leader-follower model. While dynamic adaptation is a common design pattern, its application here—using follower timeouts to trigger a leader re-assignment at the LLC—is a new and well-motivated part of the overall design. It directly addresses a primary weakness of more static or predictive approaches.
-
Clear Articulation of the "Delta": The authors have done a commendable job of positioning their work against existing solutions. The introduction and related work sections (Section 1, page 1 and Section 5, page 11) clearly delineate the conceptual differences between SPM and techniques like GPU packet coalescing [22], Stream Floating [29], and especially Push Multicast [12]. This demonstrates a strong awareness of the prior art and helps isolate their specific novel claims.
Weaknesses
-
Evolutionary, Not Revolutionary, Novelty: The overall concept, while new in its specific implementation, can be viewed as the logical synthesis of several existing ideas. We have software prefetching [6], multicast for shared data [10], and software hints for hardware [14]. SPM combines these into a more powerful, explicit mechanism. The leader/follower model also has conceptual overlaps with helper threading for prefetching [20, 24], where one thread does work (prefetching) on behalf of others. While the SPM leader also performs application work, the functional similarity reduces the perceived novelty of this aspect of the design. The contribution is in the engineering of the combined system, not in a singular, groundbreaking new concept.
-
Significant Complexity for the Achieved Benefit: The proposed mechanism is substantially complex. It requires ISA extensions, new configuration-stage messages (config_req, config_rsp), and new state-holding structures in both the L2 private cache (Leader/Followers Lookup Map, Follower Waiting Table) and the shared LLC (Share Map) (Figures 9 and 10, page 5). This is in addition to the timeout counters, comparators, and logic for the dynamic leader-switching protocol. The evaluation shows geomean speedups of 1.28x-1.38x. While respectable, it is critical to question whether this level of performance gain justifies the introduction of such a multifaceted and invasive hardware mechanism. The novelty here comes with a high complexity tax.
-
The Software Oracle Assumption: The entire premise of SPM's novelty and effectiveness rests on the ability of the compiler or programmer to correctly and comprehensively identify sharer groups statically. The paper focuses on applications with regular, explicit sharing patterns. The novelty of the approach is less clear for applications with dynamic, input-dependent, or pointer-chasing sharing patterns where static analysis is intractable. The proposed mechanism provides no novel way to handle these cases, falling back to standard behavior.
Questions to Address In Rebuttal
-
Complexity Accounting vs. Prior Art: The authors critique Push Multicast [12] for requiring in-network filtering. However, SPM introduces a multi-step configuration protocol, three new tables across L2 and LLC, and a timeout/leader-switching mechanism. Could the authors provide a reasoned, apples-to-apples comparison of the hardware complexity (e.g., estimated storage overhead in KB, logic gate count) of SPM versus the in-network filtering and directory modifications required by Push Multicast?
-
Justification for Group IDs over Simpler Hints: The core ISA novelty is the Group ID. Could the authors elaborate on why a simpler mechanism would not suffice? For example, an enhanced prefetch_shared instruction broadcast from one core, with other cores using a lightweight hardware mechanism to snoop and suppress their own redundant prefetches for a short time window. Please defend why the explicit configuration and management of numbered Group IDs is essential and justifies its complexity over a less stateful design.
-
Overhead of Configuration: The configuration stage is presented as a prerequisite for the multicast operation. For programs with many fine-grained parallel regions, this configuration protocol (broadcast config_req from all sharers, LLC determination, multicast config_rsp) must be invoked repeatedly. At what frequency of parallel region invocation would the overhead of this novel configuration stage begin to negate the benefits of the subsequent multicast?
-
Robustness of the Leader/Follower Model: The paper mentions that context switches require wiping the sharer state and re-configuring (Section 3.7, page 6). This seems to be a significant vulnerability of the proposed stateful model. How does the performance of this novel mechanism degrade in a multi-programmed environment with frequent context switching, compared to a more stateless predictive scheme like Push Multicast?
RICH Prefetcher: Storing Rich Information in Memory to Trade Capacity and Bandwidth for Latency Hiding
Abstract
Memory systems characterized by high bandwidth and/or capacity alongside high access latency are becoming increasingly critical. This trend can be observed both at the device level—for instance, in non‑volatile memory—and at the system level, as seen in ...
Reviews
Review 1
Review Form:
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper proposes RICH, a hardware prefetcher designed for memory systems with high capacity and latency. RICH combines spatial prefetching across multiple region granularities (2 KB, 4 KB, 16 KB). To manage the metadata for larger regions, it employs a hierarchical storage mechanism, keeping frequently used patterns on-chip and offloading less frequent ones to main memory. The design uses a multi-offset trigger mechanism to improve accuracy for large regions and a priority-based arbitration scheme to select the best prefetch size. The authors evaluate RICH against several spatial prefetchers, claiming a 3.4% performance improvement over the state-of-the-art Bingo prefetcher in a conventional system, and an 8.3% improvement in a simulated high-latency system.
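To make the arbitration step concrete, a minimal sketch of fixed-priority selection across the three region sizes follows; the summary only establishes that the 2 KB (PC, address) path has the highest priority, so the remaining ordering and the candidate format are assumptions for illustration.

# Illustrative sketch of fixed-priority region arbitration. Hypothetical names;
# the ordering of the 4 KB and 16 KB paths below is an assumption.
PRIORITY = ["2KB_pc_addr", "4KB_pc_offsets", "16KB_pc_offsets"]  # high -> low

def arbitrate(candidates):
    """candidates: dict mapping source name -> list of candidate prefetch addresses.
    Returns the prefetches of the highest-priority source that produced any."""
    for source in PRIORITY:
        if candidates.get(source):
            return source, candidates[source]
    return None, []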
Strengths
- The paper addresses a relevant and forward-looking problem: how to design prefetchers for future memory systems where capacity and bandwidth are plentiful, but latency is a major bottleneck.
- The initial motivation for exploring multiple prefetch region sizes is well-founded, as demonstrated by the analysis in Figure 1 (page 3).
- The evaluation is extensive, utilizing a wide range of benchmarks (SPEC06, SPEC17, Ligra, Parsec) and comparing against multiple relevant prior works (Bingo, PMP, SMS, SPP-PPF).
Weaknesses
The paper’s central claims rest on a complex design whose trade-offs are not rigorously analyzed or justified. The performance benefits appear modest relative to the introduced complexity and potential hidden costs.
-
Insufficient Analysis of Off-Chip Overheads: The core premise of RICH is to "store rich information in memory." However, the cost of this strategy is inadequately quantified. The paper fails to report the most critical overhead metric: the amount of main memory bandwidth consumed by its own metadata reads and writes. This traffic must compete with demand accesses and the prefetched data itself. The analysis in Figure 14 (page 11), which shows performance under varying total bandwidth, is insufficient as it doesn't isolate the contribution of this self-inflicted overhead. Without this data, the claim of "strategically consuming" bandwidth is unsubstantiated.
-
Unconvincing Latency Tolerance Argument for Off-Chip Metadata: The justification for moving 16 KB patterns off-chip hinges on Figure 5 (page 5), which claims that a 50ns additional latency results in less than 15% degradation in "prefetch opportunities." This analysis is flawed for two reasons:
- The y-axis is labeled "Late Prefetches," which is not the same as lost opportunities and is not formally defined. A 15% increase in late prefetches could represent a significant performance impact.
- The central argument of the paper is that RICH excels in high-latency systems (Section 5.3). However, the analysis in Figure 5 is not connected to the high-latency system evaluation in Figure 13. If system memory latency is already high (e.g., baseline + 120ns), the latency to fetch off-chip metadata will also be substantially higher, likely far exceeding the 50ns tested in Figure 5 and invalidating its conclusion.
-
Arbitrary Design Choices and Unjustified Heuristics: The design of RICH is replete with magic numbers and heuristics presented without sufficient justification.
- Trigger Mechanism: Why were (PC, 5 offsets) for 16 KB and (PC, 3 offsets) for 4 KB chosen (Section 3.1)? Figure 3 (page 4) shows a clear trade-off between accuracy and coverage. The paper provides no evidence that these specific points are optimal or robust across workloads.
- Arbitration Priority: The fixed priority scheme in the Region Arbitration unit (Section 4.2, Step P3) is presented as a given. Why is giving the 2 KB (PC, address) prefetch the highest priority optimal? This could potentially block a more beneficial, albeit slightly later, 16 KB prefetch. No alternative arbitration schemes are discussed or evaluated.
- Prefetch Count Threshold: The threshold of 30 prefetches before offloading a 16 KB pattern to memory (Section 4.1) is stated to be based on "experiments," but no data or sensitivity analysis is provided to support this specific value (beyond the brief analysis in Figure 17).
-
Questionable Iso-Storage Comparison: The iso-storage comparison in Section 5.6 (page 11) aims to show RICH is more storage-efficient. However, the "Enhanced Bingo" baseline is constructed by increasing the PHT's associativity. This is not necessarily the most effective way to utilize a larger storage budget for Bingo; increasing the number of entries may have been more beneficial. This choice of a potentially weak baseline undermines the claim that RICH is better at converting storage to performance.
-
Mismatched Complexity and Benefit: In the conventional system, RICH provides a modest 3.4% geomean speedup over Bingo. This marginal gain comes at the cost of a significantly more complex design, involving three parallel lookup paths, multi-offset tracking FIFOs, a complex arbitration unit, and the entire machinery for managing on-chip/off-chip metadata. The engineering cost and potential for critical path elongation are non-trivial and are not justified by such a small improvement.
Questions to Address In Rebuttal
-
Please provide a quantitative breakdown of main memory bandwidth utilization. Specifically, what percentage of total bandwidth is consumed by RICH's off-chip metadata reads and writes across the evaluated workloads? How does this overhead traffic impact performance, especially in bandwidth-limited scenarios?
-
The paper argues for RICH's strength in high-latency systems (Figure 13, page 10). Please reconcile this with the off-chip metadata access analysis in Figure 5 (page 5). How does the performance of the off-chip metadata mechanism hold up when the baseline memory latency is already increased by 120ns?
-
Please provide a rigorous justification for the choice of 5 and 3 offsets for the 16 KB and 4 KB region triggers, respectively. A sensitivity analysis showing why these specific values are optimal across a range of workloads is required.
-
In the iso-storage comparison (Section 5.6, page 11), please justify why increasing the PHT associativity was chosen as the method to scale Bingo's storage budget, as opposed to other methods like increasing the number of entries.
-
Table 3 (page 6) shows that for some workloads (e.g., roms-294, roms-1070), the 200-entry on-chip PHT cache for 16 KB patterns has poor coverage (32.8% and 38.7%). What is the performance impact on these specific workloads, which must frequently stall for high-latency off-chip metadata accesses? Do these cases suffer a performance degradation compared to the baseline?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents RICH, a novel hardware prefetcher designed to address the growing challenge of high memory latency in modern and future computer systems. The core contribution is a philosophical shift in prefetcher design: instead of being constrained by scarce on-chip resources, RICH strategically leverages abundant off-chip memory capacity and bandwidth to store "rich" prefetching metadata. This enables more powerful prefetching techniques that would otherwise be infeasible.
Specifically, RICH makes two key contributions. First, it implements a multi-scale spatial prefetching mechanism that operates on 2 KB, 4 KB, and 16 KB regions, using a sophisticated multi-offset trigger and arbitration system to balance timeliness, coverage, and accuracy. Second, it introduces a hierarchical on-chip/off-chip metadata storage architecture. Latency-sensitive training and control data, along with patterns for small regions and a cache for large-region patterns, are kept on-chip. The bulk of the large-region (16 KB) patterns, which are shown to be more tolerant to retrieval latency, are offloaded to main memory. The evaluation demonstrates that RICH outperforms state-of-the-art spatial prefetchers like Bingo and PMP, with its performance advantage becoming more pronounced as memory latency increases, validating the paper's central thesis.
Strengths
-
Timely and Highly Relevant Thesis: The paper's fundamental premise is exceptionally well-aligned with dominant trends in the memory subsystem landscape. The authors correctly identify that technologies like CXL-based memory pooling, NVM, and even generational shifts in DDR standards are prioritizing capacity and bandwidth at the expense of latency (Section 1 and 2.1, page 1-2). RICH is not just another prefetcher; it is an architectural response to this macro trend. This makes the work significant and forward-looking.
-
Novel Architectural Pattern for Prefetching: The hierarchical on-chip/off-chip metadata storage is the paper's most compelling contribution. While temporal prefetchers have previously used off-chip memory to store long access streams, applying this concept to the metadata of a spatial prefetcher is a powerful idea. The analysis in Section 3.2 (page 5) that carefully partitions metadata based on latency tolerance, access frequency, and size is insightful and forms the foundation for a practical design. This demonstrates a thoughtful co-design between the algorithm and its physical implementation.
-
Effective Use of "Rich" Metadata: The paper successfully avoids the trap of simply scaling up existing designs. The multi-region prefetching strategy (Insight 1, page 3) and the use of multi-offset triggers to improve accuracy for large regions (Insight 2, page 4) are clever mechanisms that directly convert the availability of more metadata into tangible performance gains. The region arbitration logic (Section 4.2, page 7) elegantly balances the high-performance potential of large regions with the low misprediction cost of small ones.
-
Strong Supporting Evaluation: The experimental results, particularly the sensitivity studies, provide strong evidence for the authors' claims. The analysis of performance under increased memory latency (Figure 13, page 10) is crucial, as it directly validates that RICH is well-suited for the future systems it targets. Similarly, the performance breakdown analysis (Figure 15, page 11) effectively demonstrates that each component of the design—from the multi-offset trigger to the region arbitration—contributes meaningfully to the final result.
Weaknesses
-
Understated Relationship with Temporal Prefetching: While the authors correctly differentiate their work from temporal prefetchers like STMS (Section 2.3, page 3), they could strengthen their positioning by more deeply analyzing the conceptual parallels. The idea of using off-chip memory for prefetcher state is a shared principle. A more detailed discussion could highlight the unique challenges of applying this to spatial metadata (e.g., lack of sequentiality, pattern indexing, latency tolerance of pattern fetches) and thus better underscore the novelty of their architectural solution.
-
Limited Exploration of System-Level Implications: The proposal to dedicate a 128 KB region of main memory per core (Section 4.4, page 8) for prefetcher metadata introduces a new, architecturally-visible resource. The paper assumes a static allocation by the OS at boot time (Section 4.1, page 7), which is a practical starting point. However, this opens up a host of system-level questions that are not addressed:
- Virtualization: How would a hypervisor manage and partition this off-chip PHT space for multiple guest VMs?
- Security: Could this shared memory structure become a side-channel for inferring memory access patterns between processes or VMs?
- Dynamic Allocation: Could the OS dynamically resize or page this metadata space based on application needs?
Acknowledging these issues would provide a more complete picture of how RICH would integrate into a full system stack.
-
Design and Verification Complexity: The proposed architecture, with its concurrent lookups for three different region sizes, complex arbitration logic, and asynchronous off-chip metadata management, represents a significant increase in design complexity compared to traditional prefetchers. While the performance gains appear to justify this, a brief discussion on the practical challenges of verification and timing closure for such a unit would add a valuable layer of pragmatism to the proposal.
Questions to Address In Rebuttal
-
Could the authors elaborate on the key architectural and algorithmic differences that make offloading spatial pattern metadata to main memory a distinct and more challenging problem than offloading the access stream histories used by temporal prefetchers?
-
The paper focuses on prefetching across 4 KB page boundaries by using virtual addresses. A major trend in systems is the use of huge pages (e.g., 2 MB, 1 GB) to reduce TLB pressure. How might the RICH architecture leverage knowledge of huge pages? It seems that confirming an access is within a 2 MB huge page could make the 16 KB region prefetching even more aggressive and accurate, potentially unlocking further performance.
-
Regarding the off-chip metadata store, have the authors considered the implications for multi-socket systems connected via CXL? If a thread migrates to a core on a different socket, would its RICH metadata need to be migrated as well? Does this present an opportunity for a shared, system-wide metadata store, or would per-core locality remain paramount?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors propose RICH, a hardware prefetcher designed for future memory systems characterized by high capacity and high latency. The central thesis is to trade abundant memory capacity and bandwidth to hide latency by storing a large amount of prefetching metadata. The paper claims novelty through a combination of three core ideas: (1) a multi-scale spatial prefetcher that simultaneously tracks 2 KB, 4 KB, and 16 KB regions; (2) a "multi-offset" trigger mechanism that uses a sequence of memory accesses, rather than a single access, to validate and trigger prefetches for larger regions; and (3) a hierarchical storage system that keeps latency-sensitive and frequently used metadata on-chip while offloading the large, less-frequently-used 16 KB region patterns to main memory. The authors demonstrate that this combination allows RICH to outperform state-of-the-art spatial prefetchers, particularly in high-latency scenarios.
Strengths
The primary strength of this work lies in its synthesis of existing concepts into a genuinely novel architecture for spatial prefetching. While individual components may have conceptual predecessors, their integration here is new and well-motivated.
-
The Multi-Offset Trigger Mechanism: The most significant novel contribution is the use of multiple access offsets to trigger a spatial pattern prefetch (Section 3, Insight 2, page 4). Conventional spatial prefetchers like SMS [11] or Bingo [12] use a single (PC, offset) or (PC, address) pair. By requiring a sequence of offsets (e.g., 5 for a 16 KB region), RICH creates a more robust trigger that significantly improves accuracy for large-region prefetching. This mechanism is conceptually novel as it borrows a validation principle often seen in temporal/stream prefetching (i.e., confirming a sequence) and applies it to trigger a purely spatial pattern lookup.
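For concreteness, a minimal sketch of the triggering idea described above, with all structure names, the FIFO-free bookkeeping, and the 64 B block granularity assumed for illustration rather than taken from the paper:

```python
# Illustrative sketch of a multi-offset trigger: a large-region pattern lookup
# fires only after several distinct offsets have been observed for the same
# (PC, region) pair. Names and bookkeeping are hypothetical.
from collections import defaultdict

REGION_BITS = 14          # 16 KB region
BLOCK_BITS = 6            # 64 B cache blocks
TRIGGER_OFFSETS = 5       # number of confirming offsets discussed in the review

pending = defaultdict(set)   # (pc, region_id) -> distinct block offsets seen so far

def on_access(pc: int, addr: int):
    region_id = addr >> REGION_BITS
    offset = (addr >> BLOCK_BITS) & ((1 << (REGION_BITS - BLOCK_BITS)) - 1)
    key = (pc, region_id)
    pending[key].add(offset)
    if len(pending[key]) >= TRIGGER_OFFSETS:
        del pending[key]
        return ("lookup_16KB_pattern", pc, region_id)   # enough evidence: trigger the PHT lookup
    return None                                          # keep accumulating confirmations
```

The trade-off that makes this novel is also what the rebuttal questions probe: each additional confirming offset improves accuracy but delays, and sometimes forfeits, the prefetch.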
-
Novel Application of Hierarchical Storage: The idea of offloading prefetcher metadata to main memory is not new; it is the cornerstone of temporal prefetchers like STMS [16] and MISB [15], which the authors correctly cite. However, the novelty in RICH is the specific application of this concept to bit-vector-based spatial patterns and the design of a coherent tiered system around it. The on-chip 16 KB PHT acts as a cache for the off-chip main PHT, complete with mechanisms like the Valid Map and a prefetch count threshold to manage the off-chip traffic. This is a novel architectural pattern for spatial prefetchers.
-
Coherent Multi-Scale Architecture: While the observation that different workloads benefit from different region sizes is not new, the creation of a prefetcher that concurrently tracks, triggers, and arbitrates between multiple fixed region sizes is a novel architectural choice. The arbitration logic, which prioritizes based on accuracy and misprediction cost (Figure 4, page 5) and de-duplicates requests (Section 4.2, Step P3, page 7), represents a clear and novel design for coordinating these different scales.
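To illustrate the arbitration point, a minimal sketch assuming candidates from the three region paths are ordered by a fixed priority and de-duplicated at block granularity; the priority order matches this review's reading of Step P3, everything else is assumed:

```python
# Illustrative region arbitration: fixed priority across region sizes plus
# de-duplication of individual block prefetches.
PRIORITY = {"2KB": 0, "4KB": 1, "16KB": 2}   # lower value = higher priority (assumed encoding)

def arbitrate(candidates):
    """candidates: list of (region_size, [block_addresses]) from the three lookup paths."""
    issued, seen = [], set()
    for region, blocks in sorted(candidates, key=lambda c: PRIORITY[c[0]]):
        for blk in blocks:
            if blk not in seen:          # de-duplicate requests across region sizes
                seen.add(blk)
                issued.append((region, blk))
    return issued
```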
Weaknesses
The paper's claims of novelty must be carefully contextualized against prior art, and the justification for the substantial increase in design complexity needs to be robust.
-
Incrementalism of Individual Concepts: While the synthesis is novel, some of the foundational ideas are well-established. The core idea of trading memory capacity for performance is a fundamental principle in computer architecture. Off-chip metadata storage has been explored extensively in the temporal prefetching domain. The paper would be stronger if it more explicitly delineated the novel mechanisms required for their spatial implementation from the known general concept of off-loading.
-
Significant Increase in Design Complexity: The proposed RICH architecture is substantially more complex than its predecessors. It effectively runs three parallel training and lookup pipelines for the different region sizes, which feed into a non-trivial Region Arbitration unit (Figure 8, page 8). The management of the off-chip PHT adds further control logic. The 3.4% performance gain over Bingo in a conventional system seems marginal given this complexity. While the 8.3% gain in a high-latency system is more compelling, it is crucial to question if a simpler mechanism could have achieved similar results.
-
Under-explored Implementation Overheads: The paper briefly mentions on page 6 that "virtual addresses are used to ensure pattern continuity" for 16 KB regions that cross 4 KB page boundaries. This is a critical implementation detail with non-trivial consequences. It implies that every prefetch request generated for a 16 KB region might require address translation, potentially putting pressure on the TLB. This overhead is not quantified or discussed in detail, yet it is a direct consequence of a key novel feature (large region support).
Questions to Address In Rebuttal
-
On the Multi-Offset Trigger: The selection of the number of offsets for the triggers (5 for 16 KB, 3 for 4 KB) appears to be an empirical choice. Could the authors provide more insight into the sensitivity of the accuracy/coverage trade-off to this hyperparameter? Is there a principled reason for these specific values, or were they simply the optimal points found in your design space exploration?
-
Distinction from Prior Off-Chip Metadata Schemes: Please elaborate on the key mechanistic differences between RICH's off-chip management for spatial patterns and the scheme used by a temporal prefetcher like STMS [16]. Beyond the difference in metadata content (bit-vectors vs. address streams), what novel challenges did you face and solve? For instance, how critical is the prefetch count thresholding mechanism for managing off-chip bandwidth, and is this distinct from prior work?
-
Quantifying Control Logic Complexity: The paper provides a clear breakdown of storage overheads (Table 4, page 8). However, it does not address the area and power cost of the additional control logic, particularly the three parallel lookup units and the Region Arbitration logic. Can the authors provide an estimate of this overhead to give a more complete picture of the prefetcher's cost?
-
Impact of Virtual Address Translation: Could you clarify the performance impact of using virtual addresses for 16 KB prefetching? How frequently do these prefetches require new TLB lookups, and what is the potential for this to become a performance bottleneck, especially in multi-core scenarios where TLB pressure is higher?
DECA: A Near-Core LLM Decompression Accelerator Grounded on a 3D Roofline Model
Abstract
To alleviate the memory bandwidth bottleneck in Large Language Model (LLM) inference workloads, weight matrices are stored in memory in quantized and sparsified formats. Hence, before tiles of these matrices can be processed by in-core generalized matrix ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors identify that LLM inference on modern CPUs is bottlenecked by the software-based decompression of quantized and sparsified weight matrices, which precedes their processing by hardware GeMM engines like Intel's TMUL. To address this, they first propose a 3D performance model, "Roof-Surface," to characterize the interaction between memory bandwidth, vector decompression units, and matrix execution units. Based on insights from this model, they propose DECA, a near-core hardware accelerator to offload decompression. Finally, they introduce a new ISA extension, TEPL, to enable efficient, low-latency, out-of-order invocation of DECA. Their simulation results claim up to a 4x speedup on compressed GeMM kernels and a 1.6x-1.9x reduction in next-token latency for large LLMs compared to an optimized software baseline.
Strengths
- The paper addresses a relevant and timely problem: the performance bottleneck of weight decompression for LLM inference on CPUs.
- The use of a state-of-the-art software baseline (Intel's libxsmm kernels) provides a strong point of comparison for the proposed hardware solution.
- The fundamental idea of decomposing the performance problem into three interacting domains (memory, vector, matrix) is a logical approach to bottleneck analysis.
Weaknesses
My primary concerns with this submission revolve around the foundational model's validity, the justification for costly ISA extensions, and the fidelity of the evaluation methodology.
-
The Roof-Surface Model is an Oversimplification that Ignores First-Order Effects: The central thesis rests on the Roof-Surface model providing "clear insights." However, the model is an idealized representation of pipelined throughputs that systematically ignores crucial, real-world system effects. In Table 2 (page 5), the authors' own data reveals significant discrepancies between the model's predictions (R-S) and measured reality (Real). For instance, for BF16_50%, the model predicts 11.8 TFLOPS while the real performance is 9.2 TFLOPS—a 28% overestimation. The authors dismiss this as "non-plotted factors such as memory latency or cache latency." These are not minor factors; they are fundamental to system performance. A model that cannot account for over 25% of the performance limitation is not a sound foundation upon which to base microarchitectural design decisions. The claims of the model's explanatory power are therefore overstated.
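To make the objection concrete, the kind of three-way bound being criticized can be sketched as the minimum of three idealized ceilings; only the 11.8 and 9.2 TFLOPS figures are quoted from this review's reading of Table 2, and the ceiling expressions are assumptions, not the paper's exact formulation:

```python
# Minimal sketch of a three-way "roof" bound: attainable throughput is capped by
# the slowest of the memory, vector-decompression, and matrix domains.
def roof_surface_bound(mem_bw_TBps, flops_per_byte, vec_TOPS, flops_per_vec_op, matrix_peak_TFLOPS):
    mem_roof = mem_bw_TBps * flops_per_byte        # bandwidth-limited ceiling
    vec_roof = vec_TOPS * flops_per_vec_op         # decompression-limited ceiling
    return min(mem_roof, vec_roof, matrix_peak_TFLOPS)

# The complaint in one line: even a correct min() of ideal roofs overshoots a
# latency-bound reality (numbers quoted above for BF16_50%).
predicted, measured = 11.8, 9.2
print(f"overestimation vs. measured: {(predicted - measured) / measured:.0%}")   # ~28%
```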
-
Insufficient Justification for a New ISA Extension: The introduction of the TEPL instruction is a significant architectural modification that sets a very high bar for justification. The authors motivate TEPL by comparing it to a store-based invocation method that requires explicit memory fences (Figure 8, page 7), which is known to be inefficient for fine-grained core-accelerator communication. This comparison appears to be against a strawman baseline. The paper fails to explore or evaluate more sophisticated software-only or architectural mechanisms that could mitigate this latency without requiring bespoke ISA changes, such as optimized polling, doorbell registers with better scheduler integration, or advanced prefetching schemes. The conclusion that TEPL is necessary is therefore premature and insufficiently supported.
-
Ambiguous Simulation Fidelity: The evaluation is conducted using an "in-house simulator based on Sniper" (Section 7.1, page 9). Sniper is an interval-based simulator, which raises questions about its cycle-level accuracy for modeling the complex out-of-order execution and pipeline interactions that are central to the benefits claimed for TEPL. The paper provides no details on how the simulator models the TEPL Queue, speculative instruction issue, or the squash signaling between the core and DECA. Without rigorous validation or more detailed explanation of the simulation infrastructure, the results concerning TEPL's ability to "hide communication latency" remain unsubstantiated.
-
Comparison Against Unrealistic Alternatives: In Figure 14 (page 11), the authors compare DECA against "More AVX Units" and "Wider AVX Units." This comparison is not based on a realistic microarchitectural model, but on an abstract thought experiment of "removing the dynamic instructions from 3 out of 4 iterations." This completely sidesteps the immense microarchitectural challenges (e.g., L1 cache port scaling, register file bandwidth, instruction issue width) that make such a design infeasible, a point the authors themselves concede in Section 4. This comparison is therefore misleading, as it pits a detailed simulation of the proposed accelerator against a non-physical, idealized abstraction of an alternative.
Questions to Address In Rebuttal
-
Regarding the Roof-Surface model: How can the model be considered a reliable guide for microarchitecture design when it fails to account for system effects (latency, caching) that result in a ~20-30% performance deviation, as shown in your own results (Table 2)? Please justify why these first-order effects were excluded from your "3D performance model."
-
Regarding the TEPL ISA extension: Before proposing a new instruction, what alternative software-only or existing architectural mechanisms for low-latency core-accelerator communication were evaluated and why were they deemed insufficient? Please provide data to show that a fenced, store-based invocation is the strongest possible baseline short of new ISA.
-
Regarding simulation: Can you provide specific details on how your Sniper-based simulator models the out-of-order execution, speculative issue, and retirement of TEPL instructions? Specifically, how are structural hazards on the DECA loaders and dependencies with the core's ROB modeled to ensure the performance claims are not an artifact of an overly optimistic simulation environment?
-
Regarding the DSE in Section 8.2: The selection of the "best" {W,L} pair for DECA appears to be optimized to solve the problem as defined by your own idealized Roof-Surface model. Given the model's discussed inaccuracies, how can you be certain that this configuration is truly optimal for a real system where other factors (e.g., latency) are at play?
-
The abstract claims a next-token generation time reduction of "1.6x-1.9x." However, in Table 6 (page 12), the Llama2-70B result for N=16 and Q8_30% shows a speedup of 116.6 / 75.7 ≈ 1.54x, which is outside the claimed range. Please clarify the precise set of results used to establish this range.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents a holistic, model-driven approach to solving the weight decompression bottleneck in Large Language Model (LLM) inference on CPUs. The authors identify that with high-bandwidth memory (HBM) and powerful in-core matrix engines (like Intel's AMX), the software-based decompression of compressed (quantized and sparsified) weights becomes the new performance limiter.
The core contribution is threefold:
1. A Novel Performance Model: The "Roof-Surface," a 3D extension of the classic Roofline model, which elegantly visualizes the performance bounds imposed by the three key interacting resources: memory bandwidth, vector unit throughput, and matrix unit throughput.
2. A Near-Core Accelerator: DECA, a specialized hardware unit designed to offload dequantization and de-sparsification from the CPU's vector units. The design of DECA is explicitly motivated and guided by insights from the Roof-Surface model.
3. An ISA Extension: The Tile External Preprocess & Load (TEPL) instruction, which enables efficient, out-of-order, low-latency communication between the CPU core and the DECA accelerator, effectively hiding communication overheads by overlapping them with computation.
The evaluation, conducted on a simulated 56-core server, demonstrates that this co-designed system can accelerate compressed matrix multiplication kernels by up to 4x over optimized software and reduce end-to-end next-token latency for large LLMs like Llama2-70B by 1.6-1.9x.
Strengths
The primary strength of this paper lies in its principled, top-to-bottom systems thinking. It is a superb example of how insightful performance modeling can directly inform and justify hardware and ISA design.
-
The Roof-Surface Model: This is the paper's most significant and enduring contribution. The classic 2D Roofline model is a cornerstone of performance analysis, but it falls short when a third resource becomes a first-order performance limiter. The authors' extension to a 3D "surface" (Section 4, page 5) is both intuitive and powerful. It provides a clear, quantitative language to explain why the software-only approach fails on HBM-equipped systems (it becomes "VEC-bound") and precisely what level of acceleration is needed to overcome this bottleneck without over-provisioning hardware. This model has the potential for broad applicability beyond this specific problem.
-
Holistic Co-Design: The paper does not simply propose an accelerator in a vacuum. Instead, it presents a complete solution. The Roof-Surface model identifies the problem, the DECA accelerator provides the hardware muscle, and the TEPL instruction provides the necessary low-level communication primitive to make the hardware effective. This tight integration of modeling, microarchitecture, and ISA is the hallmark of a high-quality computer architecture paper. The motivation for TEPL, contrasting the inefficient store-based invocation (Figure 8, page 7) with the streamlined out-of-order TEPL approach (Figure 9, page 7), is particularly compelling.
-
Timeliness and Practical Impact: The work addresses a critical, real-world bottleneck for a major workload. As CPUs continue to integrate HBM and more powerful matrix engines to stay competitive with GPUs for AI inference, the "software glue" problem will only become more acute. This paper provides a clear, well-reasoned blueprint for how CPU architects can solve this problem, potentially making high-performance, low-latency LLM inference more accessible and cost-effective on general-purpose servers. The comparison in Table 7 (page 12), which shows the proposed system closing a significant portion of the performance gap with a contemporary GPU, underscores this practical relevance.
-
Excellent Contextualization: The authors do a commendable job of situating their work within the broader landscape of in-core and near-core acceleration. The taxonomy presented in Table 8 (page 13) is particularly useful, clearly differentiating DECA from prior work by its unique combination of support for flexible compression, cooperation with matrix units (rather than replacing them), and fine-grained, low-overhead interleaving with the core.
Weaknesses
The weaknesses of the paper are minor and mostly relate to the boundaries of its exploration rather than fundamental flaws in the core idea.
-
Limited Scope of Evaluated Compression Schemes: The evaluation focuses on established but relatively simple schemes (unstructured sparsity, BF8, MXFP4). The field of model compression is evolving rapidly, with more complex methods like product quantization, non-uniform quantization, and structured sparsity patterns gaining traction. While the LUT-based design of DECA suggests flexibility, it is not immediately clear how it would handle schemes that require more complex arithmetic than a simple lookup and scaling operation.
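For reference, the lookup-plus-scale reconstruction that a LUT-based datapath covers naturally looks roughly like the following; the codebook values, group size, and scaling scheme are assumed for illustration, and schemes needing extra per-weight arithmetic fall outside this pattern:

```python
# Minimal sketch of LUT-based dequantization: 4-bit code -> table value -> per-group scale.
import numpy as np

lut = np.linspace(-1.0, 1.0, 16, dtype=np.float32)   # assumed 4-bit codebook

def dequantize_tile(codes_u4: np.ndarray, group_scale: np.ndarray) -> np.ndarray:
    # codes_u4: integer codes in [0, 15], shape (groups, weights_per_group)
    # group_scale: one scale per quantization group
    return lut[codes_u4] * group_scale[:, None]

codes = np.random.randint(0, 16, size=(8, 32))
scales = np.random.rand(8).astype(np.float32)
weights = dequantize_tile(codes, scales)
# Product quantization or non-uniform schemes that sum multiple codebook entries
# per weight would not map onto this single lookup-and-scale step as-is.
print(weights.shape)
```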
-
Deeper Architectural Implications of TEPL: The TEPL instruction is well-motivated, but its introduction has ripple effects on the core's front-end, scheduler, and ROB. While the paper describes the high-level mechanism (TEPL Queue, speculative issue), a more in-depth discussion of the complexity and area/power cost of these changes to a modern out-of-order core would strengthen the proposal.
-
Unexplored Generality of the Roof-Surface Model: The authors correctly claim in Section 9.2 (page 13) that the Roof-Surface model can be generalized. This is one of the most exciting aspects of the work. However, this claim would be significantly bolstered by briefly applying the model to another, different problem domain to demonstrate its broader utility beyond LLM decompression (e.g., a bioinformatics pipeline, a graphics rendering stage, or a data analytics query).
Questions to Address In Rebuttal
-
The LUT-based dequantization in DECA is flexible for existing formats. How would the architecture adapt to future, more complex schemes that may not be easily captured by a lookup table, such as those requiring on-the-fly arithmetic to reconstruct weights? Does the DECA pipeline have extensibility points for such cases?
-
The paper's strongest contribution is arguably the Roof-Surface model. Can the authors provide another concrete example of a workload with a three-way (or more) dependency chain where their generalized modeling approach from Section 9.2 would yield non-obvious insights that a traditional Roofline analysis would miss?
-
Regarding the TEPL ISA extension: What alternative, less invasive core-accelerator communication mechanisms were considered? For instance, could a system of memory-mapped FIFOs combined with intelligent core-side prefetching and synchronization instructions achieve a similar level of latency hiding without the complexity of a new, blocking, out-of-order instruction? What was the deciding factor that made TEPL the superior choice?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents three distinct but interconnected contributions aimed at accelerating compressed Large Language Model (LLM) inference on CPUs equipped with in-core matrix engines. The core problem—that software-based decompression of quantized/sparsified weights becomes a bottleneck—is well-established. The authors' proposed solution consists of:
- A novel performance analysis framework called the Roof-Surface model, a 3D extension of the traditional 2D roofline model. This model aims to correctly identify bottlenecks in a pipelined system involving memory, vector units, and matrix units.
- A near-core decompression accelerator named DECA, which offloads de-quantization and de-sparsification from the CPU's vector units to feed the in-core matrix engine.
- A new ISA extension, Tile External Preprocess and Load (TEPL), designed for efficient, out-of-order invocation of the DECA accelerator, hiding communication latency.
My analysis concludes that while the constituent hardware components of the accelerator are based on known techniques, the specific combination of the performance model, the cooperative accelerator architecture, and the tight ISA integration represents a significant and novel system-level contribution. The novelty lies not in a single isolated idea, but in the synergistic design where each component is justified by and enhances the others.
Strengths
The primary strength of this paper is the introduction of multiple layers of novelty that build upon each other cohesively.
-
Novelty of the Performance Model: The Roof-Surface model (Section 4, page 5) is a genuinely novel extension of performance modeling for a specific, important class of problems. While multi-dimensional performance models are not unheard of, the specific formalization of a three-stage, pipelined dependency (Memory -> Vector -> Matrix) with corresponding arithmetic intensities (
AIXMandAIXV) is new. It correctly identifies the vector-decompression bottleneck where the traditional 2D roofline model fails (as shown in Figure 3b, page 4). This model is not just a theoretical exercise; the authors effectively use it to motivate the need for an accelerator and to perform a quantitative design space exploration for it (Section 8.2, page 11). -
Novelty in Architectural Division of Labor: The concept of a near-core accelerator is not new. However, DECA's novelty lies in its specific role as a cooperative pre-processing unit for an existing, powerful in-core engine (the TMUL). Much prior work on in/near-core accelerators for sparse workloads (e.g., VEGETA [39], RASA [40]) proposes replacing or heavily augmenting the core's compute units to handle compressed formats directly. In contrast, DECA decouples the decompression task from the GeMM task, offloading the former to a specialized unit while leaving the latter to the highly-optimized TMUL. This specific architectural partitioning is a novel approach to the problem.
-
Novelty of the ISA Integration: The TEPL instruction (Section 5.3, page 7) is a sophisticated and novel solution to the command-and-control problem for tightly-coupled accelerators. The naive approach using memory-mapped stores and loads is correctly identified as inefficient due to serialization and exposed latency. TEPL's design as a single instruction that both triggers the accelerator and receives the result, while being fully integrated into the core's out-of-order engine (using renaming and speculative execution), is a clean and powerful abstraction. This goes significantly beyond the prior art of loosely-coupled accelerators or those requiring no core modifications (e.g., SPADE [19], as noted in Table 8, page 13). The integration is the key innovation here.
Weaknesses
From a novelty perspective, the weaknesses are minor and relate to the granularity of the claims.
-
Constituent Components are Not Fundamentally New: The microarchitecture of the DECA pipeline itself (Section 6, page 8) is a synthesis of well-understood components. It uses Look-Up Tables (LUTs) for dequantization, POPCNT and Parallel Prefix Sum circuits for bitmask processing, and a crossbar for expansion. None of these blocks are, in isolation, novel inventions. The paper's novelty claim rests on their specific arrangement and dimensioning, which is guided by the Roof-Surface model. This is a system-level novelty, not a circuit-level one.
-
Limited Exploration of Non-ISA Alternatives: The paper makes a strong case for TEPL by demonstrating the flaws in a simple store-based invocation mechanism. However, the design space of command-and-control without new instructions is broader. For example, a system with dedicated command queues managed by the accelerator, which the core polls, could potentially offer better performance than the baseline store-with-fence approach. While I suspect the authors' TEPL solution is superior, a more thorough dismissal of advanced non-ISA-modifying alternatives would strengthen the novelty claim of requiring an ISA extension.
Questions to Address In Rebuttal
-
The paper's related work mentions NVIDIA's TMA [52] as a mechanism for supplying data to Tensor Cores and suggests augmenting it with DECA-like features could be an interesting direction (Section 9.1, page 13). Could the authors more precisely articulate the novel "delta" between the proposed TEPL instruction and the existing TMA mechanism? TMA is also, in essence, an accelerator for managing tile movement. How is TEPL's tight integration with the CPU's speculative, out-of-order core fundamentally different from the integration of TMA within an SM?
-
The generality of the Roof-Surface model is briefly discussed (Section 9.2, page 13) by proposing a generalized equation (3). Beyond this abstract formulation, could the authors provide one concrete example of another existing, real-world architecture or application (outside of CPU LLM decompression) where this 3-domain (or n-domain) pipelined model would provide insights that a traditional 2D roofline model would miss?
-
Regarding the TEPL design, was an alternative that relied on a dedicated, hardware-managed command FIFO between the core and DECA considered? Such a design might avoid the full complexity of a new, renamed instruction class while still allowing for out-of-order dispatch and avoiding memory fences. What would be the performance and complexity trade-offs of such a design compared to TEPL?
StreamTensor: Make Tensors Stream in Dataflow Accelerators for LLMs
Abstract
Efficient execution of deep learning workloads on dataflow architectures is crucial for overcoming memory bottlenecks and maximizing performance. While streaming intermediate results between computation kernels can significantly improve efficiency, ...
Reviews
Review 1
Review Form:
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper presents StreamTensor, a compiler framework designed to automate the generation of stream-based dataflow accelerators for LLMs from PyTorch models. The core contributions are an "iterative tensor" (itensor) type system for encoding stream layouts, a hierarchical design space exploration methodology, and an LP-based algorithm for FIFO sizing. The authors evaluate their framework on an FPGA, claiming significant latency and energy efficiency improvements over prior FPGA-based works and contemporary GPUs.
However, the work is undermined by questionable experimental comparisons, unsubstantiated claims regarding the systematic nature of its optimizations, and a concerning lack of critical hardware implementation details. While the proposed abstractions are interesting, the evidence provided is insufficient to validate the claimed performance superiority.
Strengths
- Formalism for Stream Layouts: The itensor type system (Section 3.1, page 3) provides a structured and verifiable way to represent streamed tensor data. The ability to analytically infer the need for and the minimum size of a layout converter (Algorithm 1, page 9) based on type information is a sound concept.
- Analytical FIFO Sizing: The formulation of the FIFO sizing problem as a linear programming task (Section 5.3.4, page 11) is a clear and defensible analytical contribution. Modeling token production/consumption with piecewise linear functions (Figure 8, page 10) provides a formal basis for optimizing inter-kernel delays.
- End-to-End Automation: The demonstrated pipeline from a high-level model in PyTorch down to a hardware bitstream (Figure 4, page 4) represents a substantial engineering effort and addresses a key productivity challenge in hardware acceleration.
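For concreteness, an illustrative sketch of what a type in the spirit of the "Formalism for Stream Layouts" point above might carry: an element type, the logical tensor shape, an iteration space, and an affine-style map from iteration indices to tensor coordinates. Field names and the compatibility check are hypothetical, not the paper's actual IR.

```python
# Hypothetical itensor-like type and a producer/consumer stream-order check.
from dataclasses import dataclass
from itertools import product
from typing import Callable, Tuple

@dataclass(frozen=True)
class ITensor:
    elem_type: str                                   # e.g. "bf16"
    tensor_shape: Tuple[int, ...]                    # logical tensor being streamed
    iter_space: Tuple[int, ...]                      # loop bounds of the streaming order
    index_map: Callable[..., Tuple[int, ...]]        # iteration indices -> tensor coordinates

def stream_order(t: ITensor):
    return [t.index_map(*idx) for idx in product(*(range(b) for b in t.iter_space))]

def needs_converter(producer: ITensor, consumer: ITensor) -> bool:
    """If the producer emits elements in a different order than the consumer
    expects, a layout converter (with some minimum buffer) must be inserted."""
    return stream_order(producer) != stream_order(consumer)

# Example: the same 4x4 tensor streamed row-major by the producer but expected
# column-major by the consumer -> a converter is required.
row_major = ITensor("bf16", (4, 4), (4, 4), lambda i, j: (i, j))
col_major = ITensor("bf16", (4, 4), (4, 4), lambda j, i: (i, j))
print(needs_converter(row_major, col_major))   # True
```

Encoding the access order in the type (rather than in a kernel schedule) is what allows this check, and the converter-size inference that follows from it, to be done declaratively at fusion time.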
Weaknesses
-
Fundamentally Flawed Experimental Comparisons: The central claims of performance superiority are built on unsound comparisons.
- Quantization Mismatch: The authors' implementation uses a W4A8 quantization scheme. However, the primary FPGA baseline, DFX [29], uses FP16 (Table 6, page 12). A comparison across such drastically different numerical precisions is invalid. W4A8 designs are inherently smaller and faster, but this is an advantage of the quantization scheme, not necessarily the compiler framework. The performance gains cannot be cleanly attributed to StreamTensor's contributions.
- Hardware Mismatch: The authors use an AMD U55C FPGA, while both Allo [15] and DFX [29] use the U280. While from the same family, these are different parts with different resource counts and characteristics, further confounding the comparison.
- Misleading GPU Comparison: The results in Table 5 (page 12) show that for Time-To-First-Token (TTFT), the A100 GPU is between 3.97x and 31.99x faster. The authors focus on total latency, but for many LLM applications (e.g., interactive chat), TTFT is the critical metric. Framing the overall result as a win obscures a significant performance deficit in a key area.
-
Missing Essential Hardware Metrics: For a paper proposing an FPGA accelerator framework, the complete omission of post-place-and-route resource utilization data (LUTs, FFs, BRAM, DSPs) is a critical flaw. Without this data, it is impossible to assess the actual efficiency of the generated designs. The memory reduction shown in Figure 10a (page 13) is only for intermediate results and offers no insight into the total on-chip resource cost, which is essential for judging feasibility and scalability.
-
Overstated Claims of "Systematic Exploration": The paper repeatedly claims to "systematically explore" the design space (Abstract, Section 1.4). However, the methods described in Section 5.1 (page 8) are a collection of heuristics: "naive tiling," "intensity-aware algorithm" for unrolling, and a "heuristic that moves reduction loops outward" for permutation. These are reasonable heuristics, but they do not constitute a systematic exploration. The term implies a more exhaustive or provably optimal search, which is not what is being performed.
-
Fragile Assumptions in FIFO Sizing Model: The LP model for FIFO sizing relies on static kernel latencies obtained from profiling (Section 5.3.1, page 10). This assumes a deterministic execution environment. The model's sensitivity to deviations between profiled and actual run-time behavior is not analyzed. Furthermore, the claim that the "memory utilization of stream FIFOs is negligible" (Section 5.3.4, page 11) is a strong assertion that is not backed by data. In a complex graph with hundreds of inter-kernel connections, the aggregate BRAM usage of these FIFOs could become significant.
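To illustrate why the static-latency assumption matters, here is a minimal sketch assuming the piecewise-linear token model takes the "initial delay D, then one token every II cycles" form; the numbers are illustrative, not from the paper:

```python
# Required FIFO depth as the peak backlog between a producer's and a consumer's
# piecewise-linear token counts; a run-time latency deviation shifts the answer.
def tokens(t: int, delay: int, ii: int) -> int:
    return 0 if t < delay else (t - delay) // ii + 1

def required_depth(prod, cons, horizon=10_000):
    return max(tokens(t, *prod) - tokens(t, *cons) for t in range(horizon))

profiled = required_depth(prod=(10, 1), cons=(60, 1))   # depths sized from profiling
slower   = required_depth(prod=(10, 1), cons=(90, 1))   # consumer latency 50% higher at run time
print(profiled, slower)   # 50 vs. 80: a FIFO sized for the profile now stalls the producer
```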
-
Weak Handling of Dynamic Control Flow: The framework's approach to dynamicism (Section 5.3.5, page 11) is to either fall back to host execution or rely on "shape hints" to bound dynamic tensors. This is not a solution but rather an avoidance of the core challenge of compiling dynamic workloads to a static dataflow architecture. This severely limits the practical applicability of the framework to models with any data-dependent control flow.
Questions to Address In Rebuttal
- Please provide a justification for comparing your W4A8 implementation against an FP16 baseline (DFX). How can the performance gains be attributed to your compiler framework rather than the fundamentally less complex arithmetic of the 4-bit quantization scheme?
- Provide detailed post-place-and-route resource utilization reports (LUTs, BRAMs, DSPs, etc.) for each of the evaluated LLM designs on the U55C FPGA. How close are these designs to the resource limits of the target device?
- Please reconcile the claim of "systematically exploring" the design space with the described use of separate, non-exhaustive heuristics for tiling, unrolling, and permutation.
- What is the performance degradation of your FIFO sizing solution if the kernel latencies measured during profiling differ from their real runtime values by 10%, 20%, or 50% due to runtime variance?
- Provide data to support the claim that FIFO memory utilization is "negligible." For the most complex model evaluated (e.g., Llama), what is the total on-chip BRAM consumed by all stream FIFOs combined, and what percentage of the total available BRAM does this represent?
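For scale, a back-of-the-envelope estimate of the kind this question asks for; every number below (FIFO count, depth, stream width, device capacity) is assumed for illustration, and shallow or narrow FIFOs may of course map to LUTRAM instead of BRAM:

```python
# Rough aggregate BRAM cost of many wide, shallow stream FIFOs.
import math

fifo_count   = 300                      # inter-kernel stream FIFOs in a large design (assumed)
fifo_depth   = 64                       # tokens per FIFO (assumed)
stream_bits  = 512                      # stream width in bits (assumed)
bram36_bits  = 36 * 1024                # capacity of one BRAM36 block

brams_per_fifo = max(
    math.ceil(fifo_depth * stream_bits / bram36_bits),   # capacity-limited
    math.ceil(stream_bits / 72),                          # width-limited (72-bit BRAM36 ports)
)
total_brams  = fifo_count * brams_per_fifo
device_brams = 2000                     # order-of-magnitude device capacity (assumed)
print(f"{total_brams} BRAM36 blocks ≈ {100 * total_brams / device_brams:.0f}% of the device")
```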
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents StreamTensor, an end-to-end compiler framework designed to automate the mapping of PyTorch-based Large Language Models (LLMs) onto stream-based dataflow accelerators, with a specific evaluation on FPGAs. The central challenge addressed is the immense difficulty and error-prone nature of manually designing efficient dataflow hardware, particularly in managing inter-kernel data streaming to overcome memory bottlenecks.
The authors' core contribution is the introduction of a novel iterative tensor (itensor) type system. This abstraction is the linchpin of their entire framework. By explicitly encoding the layout, access pattern, and iteration space of a data stream, itensor elevates the compiler's understanding from a simple memory-mapped tensor to a structured, flowing data entity. This formalization enables a series of powerful, automated optimizations that were previously ad-hoc or intractable: seamless kernel fusion, automatic generation of minimal-cost layout converters, and systematic resource allocation. The paper demonstrates the effectiveness of this approach by achieving significant latency and energy efficiency improvements over state-of-the-art FPGA solutions and competitive GPUs for LLM inference.
Strengths
The true strength of this paper lies in its synthesis of ideas from compiler theory, high-level synthesis, and computer architecture to create a cohesive and powerful automation framework.
-
The
itensoras a Unifying Abstraction: The most significant contribution is theitensortype system (Section 3.1, page 3). For decades, compilers for spatial and dataflow architectures have struggled to bridge the semantic gap between imperative code (or static dataflow graphs) and the physical reality of streaming data. Traditional tensor types represent a block of memory;itensorrepresents a protocol for accessing that data over time. This is a profound and elegant conceptual leap. It provides the formal underpinning necessary for the compiler to reason about stream compatibility, a problem that has historically plagued HLS tools and required manual intervention. This is what allows StreamTensor to confidently fuse any two kernels, inserting a provably correct and minimal converter if their stream protocols (itensortypes) do not match. -
Systematic Exploration of the Design Space: The paper wisely decomposes the notoriously complex accelerator design problem into a hierarchy of three distinct but interconnected spaces: Linalg Tiling, Kernel Fusion, and Resource Allocation (Figure 4, page 4). This structured approach transforms what is often an unmanageable, holistic design challenge into a series of more constrained, solvable optimization problems. For instance, the token behavior model and subsequent LP formulation for FIFO sizing (Section 5.3, page 10) is an excellent example of applying formal methods to a specific sub-problem (Pitfall 4) that is often solved with heuristics or over-provisioning. This brings a much-needed rigor to the field.
-
Bridging the Gap Between AI Frameworks and Dataflow Hardware: This work sits at a critical intersection. On one side, you have the immense productivity of frameworks like PyTorch. On the other, you have the potential performance and efficiency of dataflow architectures like FPGAs, CGRAs (e.g., AMD Versal [24]), and custom DSAs (e.g., SambaNova [43], Groq [1]). The bridge between them has been a rickety, manual, and expert-driven process. StreamTensor represents one of the most serious and complete attempts to build a robust, automated highway. By starting from a high-level model and generating hardware, it lowers the barrier to entry for utilizing these powerful but esoteric architectures.
-
Excellent Problem Motivation and Positioning: The authors do a superb job in Section 1.3 ("Pitfalls," page 2) of articulating the precise, thorny issues that have limited prior work. They correctly identify inter-kernel correlations, memory management, fusion compatibility, and FIFO sizing as the key hurdles. Their entire framework is then built to systematically knock down each of these barriers. This clear problem-solution mapping makes the paper's contributions easy to understand and appreciate.
Weaknesses
The weaknesses of the paper are primarily related to its current scope and the assumptions it makes, which are understandable for a pioneering work but important to acknowledge.
-
Hardware Abstraction and Portability: While built on the general MLIR framework, the implementation and evaluation are tightly coupled to a specific HLS flow (Vitis for AMD FPGAs). The true potential of a framework like StreamTensor is its ability to target a class of dataflow architectures. It is not yet clear how the concepts, particularly the resource allocation and cost models (e.g., fusion cost in Algorithm 2, page 9), would translate to a more structured CGRA with a dedicated network-on-chip versus the "soft" logic of an FPGA. This limits the generality of the claims, though the future work section (Section 8, page 14) rightly identifies this as a key next step.
-
Handling of Dynamicism: The framework's strength lies in analyzing and optimizing statically determined dataflow graphs. The handling of dynamic behavior (data-dependent control flow, dynamic tensor shapes) is pragmatically offloaded to the host CPU (Section 5.3.5, page 11). This is a common and reasonable strategy, but it sidesteps the core challenge that many real-world applications, beyond the autoregressive decoding of LLMs, possess. The framework's performance is predicated on a largely static view of the computation.
-
Scalability to Multi-Device Systems: The current model for kernel fusion and partitioning is implicitly single-chip. The paper sets the maximum fusion cost
Cmaxto the on-chip memory of a single FPGA. However, deploying large LLMs requires partitioning across multiple accelerators. While the paper notes this is out of scope, the lack of a clear story for multi-chip partitioning and communication scheduling is a major limitation for practical, large-scale deployment. Theitensorconcept could potentially be extended to describe inter-chip streams, but this is a non-trivial research problem.
Questions to Address In Rebuttal
The authors are encouraged to use the rebuttal to comment on the broader implications and future trajectory of this work.
-
Extensibility of
itensor: Theitensortype system is beautifully suited for the dense, regular streaming patterns found in LLM transformers. How would this abstraction need to evolve to capture more complex or irregular dataflow patterns, such as those in Graph Neural Networks (GNNs) or models with heavy use of sparse tensors? Could it, for example, encode streams of indices and values separately and describe their relationship? -
Path to Architectural Portability: Beyond the mention in future work, could you elaborate on the key changes required to retarget StreamTensor from an FPGA HLS backend to a more structured dataflow architecture like a CGRA or a dataflow ASIC? What are the primary architectural parameters that would need to be exposed to the compiler's resource allocation and cost models?
-
Resilience to Inaccurate Profiling: The LP-based FIFO sizing model (Section 5.3.4) relies on kernel latencies and throughputs obtained from a profiling step. In real systems, these values can fluctuate due to data-dependent execution paths, thermal management, or resource contention. How sensitive is the generated solution to inaccuracies in these profiled values? Is there a path toward a hybrid compile-time/run-time approach where buffer sizes could be adapted based on observed behavior?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents StreamTensor, a compiler framework designed to automate the generation of stream-based dataflow accelerators for LLMs, starting from a PyTorch model. The authors identify several key pitfalls in existing dataflow design paradigms, including inter-kernel correlation, external memory management, and FIFO sizing.
The central claims of novelty appear to be twofold:
1. The introduction of an "iterative tensor" (itensor) type system (Section 3.1, page 3), which explicitly encodes the temporal streaming layout of a tensor through an iteration space and an affine map. This type system is used to enable automated kernel fusion, stream layout converter generation, and type-based verification.
2. A piecewise linear function-based model for token behavior (Section 5.3, page 10) that captures both transient and steady-state dynamics of dataflow kernels. This model is used to formulate FIFO sizing as a linear programming (LP) problem, providing an analytical solution to avoid deadlocks and minimize resource utilization.
The paper builds an end-to-end compilation pipeline around these core ideas, transforming high-level Linalg IR into a dataflow IR and ultimately to hardware.
Strengths
The primary strength of this work lies in the conceptual novelty of its core abstractions, which address long-standing challenges in dataflow compilation with new formalisms.
-
Novelty of the
itensorType System: The core contribution, theitensortype, is genuinely novel. While prior tensor compilers and scheduling languages (e.g., Halide [46], TVM [16]) provide mechanisms to describe tiled and permuted access patterns procedurally through a schedule, StreamTensor elevates this description to a first-class, declarative type. This is a significant conceptual shift. By encoding the stream's access pattern directly into the type, the compiler can perform static verification of stream compatibility between a producer and consumer (as illustrated in Figure 5, page 4). This is a more powerful and scalable approach than relying on procedural analysis of two separate kernel schedules. The ability to analytically derive the minimal required stream layout converter and its buffer size (Algorithm 1, page 9) directly from the type mismatch is a direct and elegant consequence of this novel abstraction. This goes beyond prior type systems like Graphene's [28], which, as the authors correctly note, focus on static memory layout rather than the dynamic streaming order. -
Novelty in FIFO Sizing Formulation: The problem of buffer sizing is not new; it is a foundational topic in dataflow modeling [10, 26, 38]. However, much of the classic work focuses on steady-state analysis in models like Synchronous Dataflow (SDF). The proposed model in Section 5.3 is novel in its explicit and analytical modeling of both the initial transient phase (initial delay, D) and the steady-state phase (pipeline II, L) using piecewise linear functions. This provides a more accurate representation of the behavior of coarse-grained, deeply pipelined hardware kernels than steady-state models alone. Furthermore, framing the global FIFO sizing problem as an LP problem that minimizes inter-kernel delays (a proxy for buffer size) subject to satisfying data dependencies across all graph paths is an elegant and powerful formulation. This analytical approach is a clear advancement over prior automated methods that rely on time-consuming simulation [30].
-
A Coherent, Hierarchical Abstraction: The framework demonstrates a well-conceived hierarchy, using the
itensortype at a high level to orchestrate complex dataflow transformations like kernel fusion, and then lowering this abstraction to a more conventional stream/buffer representation for code generation. This separation of concerns allows the novel type system to be maximally effective at the right level of abstraction.
Weaknesses
While the core ideas are novel, the evaluation of their specific benefits and the discussion relative to the closest conceptual prior art could be strengthened.
-
Insufficient Differentiation from Scheduling DSLs: The paper contrasts
itensorwith traditional tensor types but does not sufficiently discuss the delta between its declarative, type-based approach and the procedural approach of powerful scheduling DSLs like Halide or TVM's TensorIR. A Halide schedule also contains all the information needed to determine a stream's layout. The key difference is that in Halide, this is an attribute of the computation (Func), whereas here it is an attribute of the data (itensor). The authors should more explicitly argue why the latter is superior for the task of composing pre-existing dataflow kernels, which is the central challenge in this domain. The novelty is in the representation, but its superiority over alternative representations is not fully established. -
Novelty of Design Space Exploration (DSE) Algorithms: In Section 5, the paper describes the exploration of three design spaces. However, the algorithms employed for this exploration (e.g., intensity-driven unrolling, heuristic permutation) are themselves standard practice. The novelty here is that the
itensorframework enables a more systematic exploration, but the exploration techniques themselves do not appear to be novel contributions. The paper should be clearer in distinguishing between the novel framework and the standard optimization heuristics applied within it. -
Complexity vs. Benefit Justification: The proposed framework, particularly the
itensortype and the multi-level IR, introduces considerable compiler complexity. The paper demonstrates strong end-to-end results, but it does not isolate the benefit of its novel contributions. For example, the LP-based FIFO sizing is elegant, but how much better is it in practice (in terms of area and performance) than the simpler "Conservative" strategy mentioned in Section 5.3.3? Without an ablation study directly comparing the outcome of the novel LP algorithm to simpler heuristics, it is difficult to judge if the added complexity provides a marginal or a transformative benefit.
Questions to Address In Rebuttal
-
The central novelty is the
itensortype. Could the authors elaborate on the fundamental advantages of a declarative, type-based representation of stream layout compared to deriving the same information from a procedural schedule, as is done in compilers like Halide or TVM? Specifically, how does theitensortype simplify or enable optimizations that would be difficult or impossible otherwise? -
Regarding the novel FIFO sizing model in Section 5.3.4, could the authors provide an ablation study comparing the FIFO sizes and resulting performance/latency from their LP formulation against the simpler "Conservative" strategy and a naive heuristic (e.g., sizing based on peak throughput difference)? This would help quantify the concrete benefit of this novel analytical model.
-
The
itensortype system appears well-suited for dense, affine access patterns. What are its limitations? How would the system be extended to support more dynamic or irregular streaming patterns, such as those arising from sparse tensor operations or data-dependent control flow? Is the proposed formalism extensible to these cases?
Chameleon: Adaptive Caching and Scheduling for Many-Adapter LLM Inference Environments
Abstract
The effectiveness of LLMs has triggered an exponential rise in their deployment, imposing substantial demands on inference clusters. Such clusters often handle numerous concurrent queries for different LLM downstream tasks. To handle multi-task settings ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper presents Chameleon, an LLM serving system for multi-adapter workloads, aiming to improve latency and throughput over existing systems like S-LoRA. The authors identify two primary bottlenecks: the overhead of loading LoRA adapters from host to GPU memory and head-of-line (HoL) blocking caused by workload heterogeneity. To address these, Chameleon introduces two main components: (1) a dynamic, software-managed cache for LoRA adapters in otherwise idle GPU memory, managed by a cost-aware eviction policy, and (2) a non-preemptive, multi-queue scheduler that classifies requests based on a "Weighted Request Size" (WRS) to prioritize smaller requests without starving larger ones. The evaluation claims significant reductions in P99 TTFT latency (80.7%) and a 1.5x throughput improvement over S-LoRA under high load.
Strengths
- The paper provides a solid characterization of the problem space in Section 3. The analysis effectively demonstrates that adapter loading contributes significantly to latency, especially with tensor parallelism (Figure 5), and that PCIe bandwidth becomes a bottleneck with many unique adapters (Figure 4). This motivation is clear and well-supported by their initial experiments.
- The core concept of caching adapters in GPU memory is a logical and powerful solution to the identified loading overhead. The ablation study (Figure 11, "ChameleonNoSched") clearly shows that this caching mechanism is responsible for the majority of the performance gains, confirming the validity of this approach.
- The evaluation is extensive in scope, covering multiple loads, LLM sizes, GPU memory configurations, and a multi-GPU tensor parallelism setup. The scalability analysis in Section 5.5 and 5.6 provides useful data points on the system's behavior under different hardware constraints.
Weaknesses
My primary concerns with this paper relate to the potential for over-tuning on the evaluation workload, the questionable justification for the scheduler's complexity, and a failure to address critical resource trade-offs.
-
Generalizability of "Adaptive" Policies is Unsubstantiated: The system's core policies rely on magic numbers derived from offline profiling of a single trace, which fundamentally undermines the claim of being adaptive.
- The cost-aware eviction policy (Section 4.2, page 6) uses the formula Score = F×Frequency + R×Recency + S×Size. The coefficients are statically set to F=0.45, R=0.10, S=0.45 based on "offline profiling of industrial traces [41]". This is a classic case of overfitting the policy to the evaluation data. There is no evidence that these weights are optimal, or even effective, for workloads with different characteristics (e.g., the WildChat or LMSYS traces also used in the paper). A minimal sketch of how I read this scoring function follows this list.
- Similarly, the Weighted Request Size (WRS) formula (Section 4.3, page 7) uses coefficients A=0.4 and B=0.6. The authors state these are based on "sensitivity studies and on profiling", which is vague and irreproducible. This critical component, which governs the entire scheduling process, appears to be manually tuned to the specific workload and hardware used in the evaluation.
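For concreteness, the following minimal sketch captures my reading of the eviction score as stated in Section 4.2. Only the linear form and the coefficients F=0.45, R=0.10, S=0.45 come from the paper; how each term is normalized is my own assumption, since the paper does not specify it.

```python
from dataclasses import dataclass

# Coefficients as reported in Section 4.2 (tuned offline on the industrial traces [41]).
F, R, S = 0.45, 0.10, 0.45

@dataclass
class AdapterStats:
    hits: int           # accesses observed in the current window
    last_access: float  # timestamp of the most recent access (seconds)
    size_mb: float      # adapter footprint in GPU memory

def eviction_score(a: AdapterStats, now: float,
                   max_hits: int, max_size_mb: float, window_s: float) -> float:
    """Score = F*Frequency + R*Recency + S*Size (higher = more worth keeping).
    The [0, 1] normalization of each term is an assumption on my part."""
    frequency = a.hits / max(max_hits, 1)
    recency = 1.0 - min((now - a.last_access) / window_s, 1.0)
    size = a.size_mb / max(max_size_mb, 1e-9)  # bigger adapters cost more to reload
    return F * frequency + R * recency + S * size

def pick_victim(cache: dict, now: float, window_s: float = 60.0) -> str:
    """Evict the cached adapter with the lowest score."""
    max_hits = max(a.hits for a in cache.values())
    max_size = max(a.size_mb for a in cache.values())
    return min(cache, key=lambda k: eviction_score(cache[k], now,
                                                   max_hits, max_size, window_s))
```

Nothing in this formulation adapts online: F, R, and S are frozen after offline fitting to a single trace, which is precisely the concern raised above.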
-
Scheduler Complexity is Not Justified by Gains: The paper introduces significant complexity with its dynamic, K-Means-based multi-queue scheduler. However, the evidence shows this complexity yields marginal benefits.
- In Section 5.4.5 and Figure 22 (page 12), the authors compare their dynamic queue organization against a simple, static 4-queue setup. The dynamic approach provides only a 10% reduction in P99 TTFT at high load. This minor improvement does not seem to justify the overhead and complexity of periodically running K-Means clustering to reconfigure queues and quotas.
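To make the complexity I am questioning concrete, the sketch below shows roughly what a periodic K-Means-based queue (re)configuration entails. The choice of K=4, the sliding window, and the use of WRS values as the one-dimensional clustering feature are my assumptions about the described mechanism, not details taken verbatim from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def queue_boundaries(wrs_values: np.ndarray, k: int = 4) -> list[float]:
    """Cluster recent Weighted Request Sizes and derive the size boundaries
    between queues as midpoints of adjacent (sorted) cluster centroids."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0)
    km.fit(wrs_values.reshape(-1, 1))
    centers = np.sort(km.cluster_centers_.ravel())
    return [(a + b) / 2 for a, b in zip(centers[:-1], centers[1:])]

def assign_queue(wrs: float, boundaries: list[float]) -> int:
    """Smallest requests land in queue 0, the 'express lane'."""
    return int(np.searchsorted(boundaries, wrs))

# Recomputed every scheduling epoch over a sliding window of recent requests.
recent_wrs = np.random.lognormal(mean=6.0, sigma=1.0, size=2048)
bounds = queue_boundaries(recent_wrs)
queue_id = assign_queue(float(recent_wrs[0]), bounds)
```

Even in this toy form, the clustering must be rerun periodically and the per-queue quotas recomputed on every reconfiguration; per Figure 22, all of this machinery buys roughly 10% P99 TTFT over four statically defined queues.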
-
Baseline Comparisons May Be Unfair: The paper's claims of superiority are predicated on comparisons that may not be rigorous.
- The comparison against a Shortest-Job-First (SJF) scheduler (Figure 8 and Figure 15) highlights starvation of long requests. However, any production-grade SJF scheduler for latency-sensitive systems incorporates an aging mechanism to mitigate starvation. The paper cites µServe [46], which uses such a mechanism, but it is unclear if the baseline they implemented is this robust version or a naive, strawman SJF. Without aging, the reported starvation is an expected artifact, not a novel finding; a sketch of the kind of aging-aware baseline I have in mind follows this list.
- The ablation study in Figure 11 shows that the scheduler ("ChameleonNoCache") provides a very small improvement over S-LoRA. The vast majority of the gains come from the cache. This suggests the scheduling contribution is minor.
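For reference, the kind of aging-aware SJF baseline I would consider fair is sketched below. The linear aging form and the rate constant are illustrative assumptions of mine, not a claim about what µServe [46] or the authors' baseline actually implements.

```python
import time

AGING_RATE = 0.05  # fraction of predicted size forgiven per second of waiting (illustrative)

class AgingSJF:
    """Shortest-Job-First with linear aging: a request's effective size shrinks the
    longer it waits, so long requests are eventually dispatched instead of starving."""

    def __init__(self):
        self._pending = []  # list of (arrival_time, predicted_size, request_id)

    def submit(self, request_id: str, predicted_size: float) -> None:
        self._pending.append((time.monotonic(), predicted_size, request_id))

    def next_request(self) -> str:
        now = time.monotonic()

        def effective_size(entry):
            arrival, size, _ = entry
            return size * (1.0 - AGING_RATE * (now - arrival))

        best = min(self._pending, key=effective_size)
        self._pending.remove(best)
        return best[2]
```

Without some such mechanism, reporting that plain SJF starves long requests (Figures 8 and 15) demonstrates a known property of the textbook algorithm rather than a deficiency that Chameleon uniquely overcomes.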
-
Critical Resource Trade-Off is Ignored: The fundamental premise is to use "idle GPU memory" for the adapter cache. This memory is in direct contention with the KV cache, which is a primary determinant of system throughput via batch size. The paper fails to analyze this trade-off. It is entirely plausible that allocating this "idle" memory to expand the KV cache in the S-LoRA baseline would allow for larger batch sizes, yielding a throughput improvement that could rival or exceed that of Chameleon. By not evaluating this alternative configuration, the paper presents an incomplete and potentially misleading picture of performance.
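To illustrate why this trade-off matters, here is a back-of-the-envelope comparison under assumptions I am supplying myself: Llama-2-7B-like dimensions (32 layers, 4096 hidden), FP16 storage, and rank-64 LoRA applied to the four attention projections. None of these numbers come from the paper.

```python
# All shapes and byte counts below are my illustrative assumptions, not the paper's.
BYTES = 2                       # FP16
layers, hidden, rank = 32, 4096, 64

# KV cache cost per generated token: one K and one V vector of size `hidden` per layer.
kv_bytes_per_token = 2 * layers * hidden * BYTES                  # 512 KiB

# One rank-64 LoRA adapter on the q, k, v, o projections:
# each projection adds A (hidden x rank) and B (rank x hidden).
adapter_bytes = layers * 4 * 2 * hidden * rank * BYTES            # ~128 MiB

tokens_displaced = adapter_bytes / kv_bytes_per_token             # ~256 tokens

print(f"KV per token: {kv_bytes_per_token / 2**20:.2f} MiB")
print(f"One adapter:  {adapter_bytes / 2**20:.0f} MiB "
      f"= KV capacity for ~{tokens_displaced:.0f} tokens")
```

Under these assumptions, every cached rank-64 adapter costs the KV capacity of roughly 256 tokens, i.e., a handful of cached adapters displaces one or more whole requests from the batch. Whether the adapter cache or the larger S-LoRA batch wins is exactly the comparison the paper omits.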
Questions to Address In Rebuttal
- Regarding the eviction policy coefficients (F, R, S) and WRS weights (A, B): Please provide evidence of their robustness. How do the results change when applying the exact same coefficients (derived from the Azure trace) to the WildChat and LMSYS traces? If they must be re-tuned for each trace, the claim of adaptivity is weak.
- The dynamic queue management shows only a 10% benefit over a static configuration (Fig. 22) at high load. Can the authors provide a compelling justification for this added complexity? What is the computational overhead of the periodic K-Means clustering and how does it affect the system, especially during load spikes?
- Please clarify the implementation of the SJF baseline used in Section 5.3. Does it include an aging mechanism to prevent starvation, as is standard practice and described in the work [46] you cite? If not, why is this considered a fair comparison?
- The adapter cache competes for GPU memory with the KV cache. Please provide an analysis comparing Chameleon's use of idle memory for adapter caching against an alternative scenario: S-LoRA is configured to use the same amount of "idle" memory to expand its total KV cache capacity, which would enable larger effective batch sizes. How does S-LoRA's throughput compare under that configuration?
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents Chameleon, an LLM inference serving system designed for environments with a large number of fine-tuned adapters, a scenario popularized by techniques like Low-Rank Adaptation (LoRA). The authors identify two primary bottlenecks that emerge in these "many-adapter" settings: (1) the I/O overhead of frequently loading adapter weights from host memory to GPU memory, which contends for PCIe bandwidth, and (2) the increased workload heterogeneity introduced by adapters of varying ranks (sizes) and popularity, which exacerbates tail latency issues like head-of-line blocking.
Chameleon's core contribution is to treat the adapters themselves as a first-class resource to be managed. It introduces a synergistic two-part solution: an adaptive adapter cache that utilizes otherwise idle GPU memory to store popular or costly-to-load adapters, and an adapter-aware multi-queue scheduler that classifies requests based on a weighted size (including input, predicted output, and adapter rank) to provide a fast lane for small requests while preventing starvation for large ones. The work is positioned as a significant enhancement over state-of-the-art systems like S-LoRA, which load adapters on-demand. The evaluation demonstrates substantial improvements in P99 latency (80.7% reduction) and throughput (1.5x increase) under high load.
Strengths
-
Excellent Problem Formulation and Characterization: The paper's primary strength lies in its clear identification and meticulous characterization of an important, emerging problem. Before presenting their solution, the authors dedicate Section 3 to demonstrating why existing systems are insufficient. Through targeted experiments, they convincingly show that adapters are a non-trivial source of performance heterogeneity (Figures 2 and 3), that adapter loading is a significant and scalable bottleneck (Figures 4 and 5), and that the opportunity to solve this (idle GPU memory) exists but is dynamic (Figure 6). This foundational analysis is what makes the proposed solution so compelling.
-
Elegant Application of Classic Systems Principles: The core ideas of Chameleon—caching and multi-level feedback queue scheduling—are not new to computer science, but their application to this specific domain is novel and elegant. The paper effectively recasts the adapter management problem into a familiar resource management paradigm. The adapter cache is a clever use of a transient resource (idle GPU memory), and its cost-aware eviction policy (Section 4.2) correctly recognizes that not all cache misses are equal, drawing a parallel to classic web and database caching systems. Similarly, the multi-queue scheduler (Section 4.3) is a well-understood technique for balancing responsiveness and fairness, which the authors have adapted skillfully to the unique sources of heterogeneity in LLM inference.
-
Holistic and Synergistic System Design: The two main components of Chameleon are not independent add-ons; they are designed to work together. The scheduler's awareness of adapter rank informs the cache manager's decisions about which adapters are valuable to keep resident. For example, scheduling a request with a large adapter that is already cached is much cheaper than scheduling one with a small adapter that needs to be loaded. This synergy between scheduling and caching is a hallmark of a well-thought-out system design. The architecture shown in Figure 9 clearly illustrates this interplay.
-
Significant and Practical Impact: The work addresses a real-world challenge. As organizations increasingly rely on fine-tuning for personalization and task-specialization, efficiently serving thousands of adapters will become a critical cost and performance issue. Chameleon offers a practical, software-only solution that could be integrated into popular serving frameworks (like vLLM, TGI, etc.). The reported performance gains, especially the dramatic reduction in P99 tail latency, are highly significant and would translate directly to improved user experience and lower operational costs in production environments.
Weaknesses
While the paper is strong, there are areas where the broader context and system dynamics could be explored more deeply.
-
Interaction Dynamics with the KV Cache: The paper's central premise for the adapter cache is the existence of "idle GPU memory." However, in a heavily loaded serving system, the primary consumer of dynamic GPU memory is the KV cache. The paper treats these two as simply competing for space. A deeper analysis of their interaction would be beneficial. For instance, a burst of long requests could cause the KV cache to expand rapidly, forcing the eviction of many adapters from the Chameleon cache. This could, in turn, increase the latency of the next batch of requests, which now must reload those adapters, creating a potential for performance oscillations or thrashing between the two memory consumers. A discussion of this dynamic would strengthen the paper.
-
Static Nature of the Eviction Policy: The cost-aware eviction policy (Section 4.2) uses a linear combination of frequency, recency, and size, with coefficients (F=0.45, R=0.10, S=0.45) set via offline profiling. While this is a reasonable approach, it raises questions about its robustness across different workloads. A workload with high temporal locality might benefit from a higher weight on recency (R), while a workload with stable popularity might benefit from a higher weight on frequency (F). The paper could be improved by either demonstrating the policy's robustness or discussing a more adaptive mechanism for tuning these weights online.
-
Limited Discussion on Predictor Accuracy: The scheduler relies on a BERT-based proxy model to predict output length, which is a key component of the Weighted Request Size (WRS). The sensitivity analysis in Section 5.4 (Figure 19) shows the system is reasonably robust, but the discussion could be expanded. How does poor prediction accuracy impact fairness? Could it lead to requests being systematically misclassified into the wrong queue, effectively undermining the scheduler's design?
Questions to Address In Rebuttal
-
Regarding the interaction between the adapter cache and the KV cache: Could you elaborate on the potential for thrashing between these two memory consumers under volatile loads? Does Chameleon implement any mechanism to coordinate between the KV cache manager and the adapter cache manager to prevent such negative feedback loops?
-
The eviction policy coefficients are tuned offline. How sensitive are the overall performance gains to these specific values (F=0.45, R=0.10, S=0.45)? Have you explored how these optimal weights might change for workloads with different characteristics (e.g., streaming vs. request-response, different adapter popularity distributions)?
-
The WRS formula in Section 4.3 weights the normalized output size more heavily (B=0.6) than the input size (A=0.4). Could you provide more intuition for this choice? Is it primarily because the decode phase, which depends on output length, typically dominates total execution time?
-
Your work is a significant step forward for node-level scheduling in many-adapter environments. How do you see these ideas integrating with cluster-level schedulers? For example, could a cluster scheduler use information about which adapters are cached on which nodes to make more intelligent request routing decisions?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present Chameleon, an LLM inference serving system designed for multi-task environments that leverage LoRA adapters. The paper claims two primary novel contributions: 1) an adaptive cache for LoRA adapters in GPU memory to mitigate loading overheads, and 2) an adapter-aware multi-queue scheduler to reduce head-of-line blocking and improve tail latency. The system is implemented on top of S-LoRA and evaluated against it, demonstrating significant improvements in P99 TTFT latency and throughput.
While the engineering effort is commendable and the performance results are strong, my review finds that the core ideas presented as novel are, in fact, direct applications of long-established principles from computer systems, operating systems, and networking. The novelty lies in the specific application of these principles to the domain of LoRA adapter serving, not in the invention of new caching or scheduling paradigms.
Strengths
- Problem Identification: The paper does an excellent job characterizing (Section 3, pages 3-4) the specific performance bottlenecks in many-adapter serving environments, correctly identifying adapter loading overhead and increased workload heterogeneity as critical issues.
- System Integration: The integration of the caching and scheduling components is well-executed. The synergy between the two, where the scheduler's decisions inform the cache manager, demonstrates solid system design.
- Domain-Specific Heuristic: The Weighted Request Size (WRS) formula (Section 4.3, page 7) is a logical, domain-specific heuristic. Including adapter rank alongside input and output size is a sensible extension to prior request-sizing metrics.
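As a concrete reading of that heuristic, the sketch below folds normalized input size, predicted output size, and adapter rank into a single score. Only the weights A=0.4 and B=0.6 are taken from the paper (Section 4.3); the adapter term's weight C and all normalization constants are placeholders of my own, since the exact form is not reproduced here.

```python
# A and B are the paper's reported weights; C and the normalization caps are
# illustrative placeholders, not values from the paper.
A, B, C = 0.4, 0.6, 0.2
MAX_IN_TOKENS, MAX_OUT_TOKENS, MAX_RANK = 4096, 2048, 128

def weighted_request_size(input_tokens: int,
                          predicted_output_tokens: int,
                          adapter_rank: int) -> float:
    """One plausible reading of WRS: a weighted sum of normalized input size,
    predicted output size, and adapter rank."""
    return (A * (input_tokens / MAX_IN_TOKENS)
            + B * (predicted_output_tokens / MAX_OUT_TOKENS)
            + C * (adapter_rank / MAX_RANK))

# A short prompt with a low-rank adapter scores far below a long predicted
# generation carrying a rank-64 adapter, and is routed to a faster queue.
small = weighted_request_size(128, 64, 8)
large = weighted_request_size(1024, 1024, 64)
```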
Weaknesses
My evaluation is focused exclusively on conceptual novelty. In this regard, the paper's claims are significantly overstated.
-
Adapter Caching Lacks Conceptual Novelty: The proposed "adapter cache" is a standard memory cache. The idea of keeping frequently-used, read-only data in a faster tier of the memory hierarchy (GPU memory) to avoid fetching it from a slower tier (host memory via PCIe) is a foundational concept in computer architecture and systems.
- Prior Art on Policy: The "Cost-Aware Eviction Policy" (Section 4.2, page 6) is a linear combination of frequency, recency, and size/cost (Score = F×Frequency + R×Recency + S×Size). This is a classic formulation for cost-aware caching heuristics. In fact, the paper itself cites GDSF [5] (page 11), a well-known algorithm from 1998 that uses a Frequency × Cost / Size heuristic. The proposed policy is a variation on this theme, not a new invention. The novelty is in the tuning of weights (F, R, S), not the approach itself.
- Prior Art on Mechanism: The dynamic sizing of the cache based on available memory is a standard practice in software-managed caches where memory is shared with an application (in this case, the KV cache). The claim of introducing "the first cache design for LoRA adapters" (page 2) is only true in the most literal sense; it is the first application of a standard cache to this specific data type, which does not constitute a novel contribution to the field of computer systems.
-
Multi-Queue Scheduling is Well-Established Prior Art: The proposed "adapter-aware multi-queue scheduler" is conceptually identical to decades of work on preventing Head-of-Line (HoL) blocking in schedulers for operating systems, routers, and web servers.
- Prior Art on Structure: The core idea of partitioning tasks by size into different queues to provide an "express lane" for short jobs is not new. The paper itself acknowledges this by citing Size-Interval Task Assignment (SITA) [7, 15] and Q-Zilla [35] in the related work section (page 13). SITA, proposed in 1999, is the direct intellectual ancestor of this scheduling approach.
- Incremental Heuristic: The only new element is the WRS formula, which adds AdapterSize to the sizing metric. Prior work like µServe [46] already uses predicted output length. This is an incremental, though logical, extension of an existing heuristic, not a fundamentally new scheduling algorithm. Applying K-Means to dynamically determine queue boundaries is also an application of a standard clustering algorithm, not a novel scheduling concept.
-
Complexity vs. Benefit: The paper introduces significant machinery (output length prediction, K-Means clustering, dynamic quota calculation, cost-aware cache management) to implement these known concepts. While the performance gains are substantial compared to a simple FIFO baseline (S-LoRA), it is unclear how much of this gain comes from the well-understood benefits of multi-queue scheduling vs. the specific "adapter-aware" component. The contribution is better framed as a comprehensive engineering effort to apply best practices to a new domain, rather than a source of new foundational ideas.
Questions to Address In Rebuttal
-
Please clarify the conceptual novelty of the adapter cache beyond it being a standard software-managed, cost-aware cache. What fundamentally distinguishes the eviction policy from the family of algorithms represented by GDSF [5], which also balances frequency, cost, and size?
-
The paper's scheduler is structurally and functionally analogous to SITA [7, 15]. Given this, can the authors articulate the core novel contribution in their scheduler beyond the domain-specific WRS sizing heuristic? Is the claim one of a new scheduling paradigm, or a new and effective application of an existing one?
-
To better isolate the novelty, could the authors compare Chameleon's scheduler not just to FIFO and SJF, but to a SITA-like scheduler that uses a simpler sizing heuristic (e.g., only predicted output tokens)? This would help quantify the specific benefit of making the scheduler "adapter-aware," which appears to be the primary delta over extensive prior art.
Coruscant: Co-Designing GPU Kernel and Sparse Tensor Core to Advocate Unstructured Sparsity in Efficient LLM Inference
Abstract
In the era of large language models (LLMs) and long-context generation, model compression techniques such as pruning, quantization, and distillation offer effective ways to reduce memory usage. Among them, pruning is constrained by the difficulty of ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present Coruscant, a co-designed software kernel and hardware architecture extension aimed at accelerating LLM inference by exploiting unstructured sparsity in the 30-70% range. The work is composed of three main contributions: (1) a bitmap-based sparse format designed to improve compression ratios over existing formats like CSR in this moderate sparsity regime; (2) a corresponding GPU SpMM kernel that uses this format to reduce data movement; and (3) a proposed hardware modification to the Tensor Core, the "Coruscant Sparse Tensor Core," which integrates a "Bitmap Decoder" to operate directly on the compressed format. The authors claim significant speedups over cuBLAS and Flash-LLM, and further gains with their proposed hardware.
Strengths
-
Problem Motivation: The paper correctly identifies a critical gap in the literature. While many works focus on extreme sparsity (>90%) or rigidly structured sparsity (2:4), the moderate, unstructured sparsity regime (30-70%) common in accuracy-preserving LLM pruning is underserved by existing hardware and software. Figure 1 provides a standard but clear motivation for focusing on the memory-bound nature of the decode stage.
-
Identification of Format Weakness: The analysis in Section 2.1 and Figure 3 effectively demonstrates that conventional index-based sparse formats (CSR, COO) are inefficient and can even lead to memory expansion at the target sparsity levels. This provides a solid rationale for exploring alternative representations like bitmaps.
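The arithmetic behind this observation is easy to reproduce. The sketch below compares per-matrix storage for FP16 values under a CSR layout with 32-bit indices against a bitmap layout that spends one metadata bit per element; the index widths and the omission of any tile padding are my simplifications, not the paper's measured numbers.

```python
def dense_bytes(rows, cols, elem_bytes=2):
    return rows * cols * elem_bytes

def csr_bytes(rows, cols, density, elem_bytes=2, idx_bytes=4):
    nnz = int(rows * cols * density)
    # non-zero values + column indices + row pointers
    return nnz * elem_bytes + nnz * idx_bytes + (rows + 1) * idx_bytes

def bitmap_bytes(rows, cols, density, elem_bytes=2):
    nnz = int(rows * cols * density)
    # packed non-zero values + 1 presence bit per element
    return nnz * elem_bytes + (rows * cols) // 8

rows = cols = 4096
for sparsity in (0.3, 0.5, 0.7):
    d = dense_bytes(rows, cols)
    c = csr_bytes(rows, cols, 1.0 - sparsity)
    b = bitmap_bytes(rows, cols, 1.0 - sparsity)
    print(f"{int(sparsity * 100)}% sparse: dense {d / 2**20:.1f} MiB, "
          f"CSR {c / 2**20:.1f} MiB, bitmap {b / 2**20:.1f} MiB")
```

At 30-50% sparsity the CSR representation is larger than the dense matrix it replaces, while the bitmap variant already compresses; this simple model is consistent with the trend shown in Figure 3, though, as I argue below, the real kernel's padding overhead erodes part of that advantage in practice.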
Weaknesses
My analysis reveals several significant methodological and analytical flaws that call the paper's central claims into question.
-
Highly Questionable Hardware Simulation Methodology: The evaluation of the proposed Coruscant Sparse Tensor Core (STC) is fundamentally weak. In Section 4, the authors state they "simulate the behavior...by removing bitmap decoding and shared memory access writes...and retaining the standard HMMA instructions." This is not a simulation; it is a simplistic analytical projection. This approach completely ignores potential microarchitectural side effects, such as pipeline stalls caused by the new decoder logic, increased register file pressure, or contention on operand buses. It assumes the proposed "Bitmap Decoder" is a zero-cost, zero-latency oracle. Consequently, all performance results for the Coruscant STC (e.g., in Figures 11b, 15, 19) are speculative at best and likely overly optimistic.
-
Inadequate Comparison to Hardware-Accelerated Baselines: The comparison against NVIDIA's 2:4 sparsity in Section 5.3.4 and Figure 21 is telling. The authors' software kernel is slower than the 2:4 kernel, and even their proposed hardware (Coruscant STC) only "nearly matches" the 2:4 kernel's latency at 50% sparsity. They attribute the baseline's advantage to higher warp occupancy, but this is not an excuse—it is a critical performance factor. Their proposed dataflow and kernel implementation appear to be less efficient at utilizing the GPU's resources than the industry standard. The paper fails to make a convincing case for unstructured sparsity when it cannot clearly outperform the existing, hardware-accelerated structured sparsity solution at its home turf of 50% sparsity.
-
Discrepancy Between Theoretical and Actual Compression: There is a clear disconnect between the ideal compression ratios presented in Figure 3 and the actual memory footprint reduction achieved by the kernel, as discussed in Section 5.2.2. The authors admit this is due to padding added "to fully utilize the GPU memory bandwidth." This overhead is non-trivial and weakens the core premise that their format leads to superior memory efficiency in practice. The paper lacks a rigorous quantification of this padding overhead across different matrix sizes and sparsities, making the real-world benefit of their format ambiguous.
-
Superficial Hardware Overhead Analysis: The area and power analysis in Section 5.3.2 and Table 2 is not credible. The authors report synthesizing the Bitmap Decoder in a 45nm process and then scaling the results to 7nm using generic equations from Stillmaker et al. [48]. This methodology is widely understood to be inaccurate for modern nodes, as it fails to account for the dominance of wire delay, leakage power, and complex physical design rules. A proper hardware analysis would require synthesis in a modern PDK or a detailed, bottom-up analysis of the circuit. The presented numbers appear to be a back-of-the-envelope calculation designed to minimize the perceived cost of their hardware addition.
-
Weak Link Between Performance Claims and Accuracy Motivation: The paper begins by arguing that unstructured sparsity is superior for maintaining model accuracy (Table 1, Figure 2). However, the core end-to-end evaluation in Section 5.1 and Figure 15 only reports performance metrics (tokens/sec). It is never explicitly shown that the models pruned to 30/50/70% for this performance evaluation actually maintain their accuracy. A fair comparison would require evaluating performance against a dense baseline of equivalent accuracy, which may involve techniques other than pruning. The paper implicitly compares a lower-accuracy pruned model to a full-accuracy dense model, which inflates the perceived efficiency gains.
Questions to Address In Rebuttal
The authors must provide clear and convincing responses to the following points:
-
On Hardware Methodology: Justify the use of an analytical model for the Coruscant STC performance evaluation. Provide evidence, such as from a detailed microarchitectural simulator (e.g., GPGPU-Sim), that your model accurately captures pipeline dynamics and that the Bitmap Decoder does not introduce new performance bottlenecks. Without this, the STC results are unsubstantiated.
-
On Baseline Performance: Explain, with detailed profiling data (e.g., from Nsight Compute), the precise reasons for the Coruscant kernel’s lower performance and resource utilization compared to the 2:4 cuSPARSELt kernel at 50% sparsity. Why should the community adopt a more complex unstructured approach if it fails to outperform the simpler structured alternative in a direct comparison?
-
On Compression Overhead: Provide a new table or figure that directly compares the "ideal" compression ratio of your format (from Figure 3) against the "actual" memory footprint of the kernel (including all padding and metadata overheads) for every data point presented in the evaluation (all sparsities, matrix sizes, and batch sizes).
-
On Accuracy in Evaluation: For the end-to-end results in Figure 15, what are the measured perplexity scores for the pruned Llama 2 models at 30%, 50%, and 70% sparsity? How do these scores compare to the dense baseline? The claimed speedups are meaningless without the context of the accuracy trade-off.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents Coruscant, a compelling co-design of a GPU kernel and a sparse tensor core architecture aimed at making unstructured sparsity practical for Large Language Model (LLM) inference. The authors identify a critical gap in the current landscape: state-of-the-art pruning techniques naturally produce unstructured sparsity in the 30%-70% range, yet modern hardware, notably NVIDIA's 2:4 semi-structured sparsity, cannot efficiently exploit it, forcing a trade-off between model accuracy and hardware performance.
Coruscant's core contribution is a full-stack solution to this problem. It introduces:
1. A memory-efficient, bitmap-based sparse format designed for this moderate sparsity regime.
2. A software-only GPU kernel that leverages this format on existing commercial GPUs to reduce memory footprint and outperform dense (cuBLAS) and other sparse (Flash-LLM) kernels.
3. A minimal, pragmatic extension to the GPU's Sparse Tensor Core (STC) that directly decodes the bitmap format in hardware, eliminating the software decompression overhead for even greater performance gains.
The work is well-motivated by the memory-bound nature of the LLM decode phase and provides a clear pathway for the machine learning and computer architecture communities to reconcile the benefits of unstructured pruning with the demands of efficient hardware execution.
Strengths
The primary strength of this paper is its insightful positioning and holistic approach to a significant, real-world problem.
-
Excellent Problem Formulation: The authors have correctly identified a crucial point of friction between two advancing fields. On one hand, LLM pruning research (e.g., SparseGPT, Wanda) is pushing towards more flexible, unstructured methods to preserve accuracy. On the other, the hardware community has converged on rigid, structured solutions like 2:4 sparsity for efficiency. Coruscant builds a much-needed bridge between these two worlds. The analysis in Section 2.1 and Figures 2 & 3 effectively establishes the need for a solution in the 30-70% sparsity range, a zone where prior formats are shown to be inefficient.
-
Pragmatic Full-Stack Co-Design: This is not just a theoretical hardware proposal. The authors provide an immediately useful software kernel for today's GPUs and a well-defined, minimally invasive hardware modification for tomorrow's. This two-pronged approach dramatically increases the work's potential impact. It allows the community to adopt the ideas in software now while providing a clear, low-overhead blueprint for hardware vendors. The proposed "Bitmap Decoder" is a small, targeted addition, not a complete redesign, which makes it far more plausible for adoption.
-
Strong Contextualization and Evaluation: The paper does an admirable job of placing itself within the vast literature of sparse computation and LLM optimization. The comparisons are not just against the standard dense baseline (cuBLAS) but also against relevant, cutting-edge sparse kernels like Flash-LLM, SparTA, and Sputnik (Figure 16). Furthermore, the architectural comparison against prior academic STC designs like DSTC and RM-STC (Section 5.3.3) is particularly insightful, correctly arguing that Coruscant's simpler, memory-focused design is better suited for the "skinny matrix" SpMM workloads of LLM inference, as opposed to the compute-bound workloads those prior works targeted.
-
Connecting Pruning to System-Level Benefits: The work successfully translates the benefits of its approach from kernel-level speedups to tangible system-level advantages. The end-to-end evaluation (Section 5.1, Figure 15) shows not only increased token throughput but also the ability to run larger batch sizes by freeing up VRAM, directly addressing the "Out of Memory" errors that plague LLM serving. This demonstrates a deep understanding of the end-user's problem.
Weaknesses
The weaknesses of the paper are minor and primarily relate to missed opportunities to further broaden its context and impact.
-
Limited Discussion on Interaction with Other Compression Techniques: The paper is laser-focused on sparsity. However, in practice, sparsity is almost always combined with quantization. There is no discussion of how the Coruscant format would interact with weight quantization schemes (e.g., 8-bit or 4-bit integers). Would the bitmap overhead become more significant relative to the compressed data? Does the STC design need modification to handle quantized data types? Acknowledging this interaction is crucial for a complete system-level solution.
-
Focus on Weight Sparsity Only: The work is entirely centered on static, unstructured weight sparsity. Another emerging area of interest is activation sparsity, which is dynamic and data-dependent. While this is clearly outside the paper's primary scope, the core idea of a hardware-accelerated bitmap decoder could potentially be relevant there as well. A brief mention of this as a future direction would strengthen the paper's long-term vision.
-
The "Why Now?" Argument Could Be Sharpened: The motivation is good, but it could be even more powerful. The paper explains what the problem is, but it could spend more time on why solving it is becoming existential for the future of LLMs. As models grow and context windows expand, the memory footprint of weights (even without the KV cache) becomes a fundamental limiter. Framing Coruscant not just as an optimization but as an enabling technology for future, more powerful models on commodity hardware would elevate its perceived importance.
Questions to Address In Rebuttal
-
Synergy with Quantization: Could the authors comment on the synergy or potential conflicts between their bitmap-based format and popular quantization schemes (e.g., INT8, FP8, or INT4)? How would the memory footprint and hardware design change if Coruscant were to support pruned and quantized models?
-
The Next Bottleneck: Coruscant convincingly argues for a solution to the weight memory bandwidth bottleneck in sparse LLM inference. With this addressed by the proposed STC, what do the authors foresee as the next major bottleneck? Does this approach simply shift the performance problem to another part of the system, perhaps the instruction frontend or the register file bandwidth, especially given the more complex decoding logic?
-
Generality Beyond LLMs: While the work is excellently motivated by LLM inference, the proposed format and hardware seem more general. Have the authors considered its applicability to other domains that feature moderate, unstructured sparsity, such as graph neural networks (GNNs) or certain scientific simulations? How would the "skinny matrix" assumption hold up in those domains, and would the design still be effective?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents "Coruscant," a co-designed system comprising a bitmap-based sparse format, a corresponding GPU SpMM kernel, and a minimal extension to the GPU Tensor Core, all aimed at accelerating LLM inference with unstructured weight sparsity. The authors identify a critical gap in existing solutions, which are either inefficient for moderate sparsity levels (30-70%) or enforce restrictive semi-structured patterns like 2:4 sparsity. The core proposal is to use a simple bitmap to represent sparsity within tiles, which improves compression for this specific sparsity range, and then to design a software kernel and a hardware "Bitmap Decoder" to process this format efficiently for the memory-bound, skinny matrix multiplications characteristic of LLM decode steps.
Strengths
-
Specialization as a Novel Contribution: The primary strength and novel aspect of this work is not the invention of a new primitive, but the highly effective specialization of existing concepts for a specific, important workload. The authors correctly identify that prior sparse tensor core designs (e.g., DSTC, RM-STC) were targeted at general-purpose, compute-bound problems, leading to complex and area-intensive hardware. Coruscant’s novelty lies in its insight that the LLM inference problem (sparse-weight, dense-activation SpMM) allows for a significant simplification:
- It only requires single-operand sparsity, eliminating the need for complex scatter-gather and merge logic for the output matrix that plagues dual-operand sparse designs.
- It targets a memory-bound regime, correctly prioritizing compression efficiency (via larger tile sizes) over maximizing computation skipping. This is a crucial and well-justified trade-off.
-
Architectural Elegance and Simplicity: The proposed hardware modification, the "Bitmap Decoder" (Figure 14, page 7), is minimalistic. It reuses the existing accumulation and data paths of a dense tensor core, adding only the logic to decode the bitmap and select the appropriate non-zero values from registers. This stands in stark contrast to the significant architectural rework proposed in prior art for unstructured sparsity. The claimed low area overhead (Table 2, page 10) makes this a practical and economically viable proposal, which is a form of novelty in itself.
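To be precise about what that decoder has to do, the toy sketch below reconstructs one dense tile row from a bitmap and its packed non-zero values, using a prefix popcount to locate each value. This is my illustration of the general bitmap-decoding technique, not the authors' microarchitecture.

```python
def decode_row(bitmap: int, packed_values: list[float], width: int = 16) -> list[float]:
    """Expand one tile row: bit i of `bitmap` marks element i as non-zero, and its
    value sits at index popcount(bitmap & ((1 << i) - 1)) of the packed array."""
    row = []
    for i in range(width):
        if bitmap & (1 << i):
            idx = bin(bitmap & ((1 << i) - 1)).count("1")  # prefix popcount
            row.append(packed_values[idx])
        else:
            row.append(0.0)
    return row

# Example: bitmap 0b1011 with packed values [a, b, c] expands to [a, b, 0, c, 0, ...].
expanded = decode_row(0b1011, [1.0, 2.0, 3.0])
```

In the software kernel, this expansion is performed through shared memory before the HMMA instructions are issued; the proposed Bitmap Decoder performs an equivalent selection directly in the operand path, which is consistent with the small area overhead claimed in Table 2.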
-
Kernel-Level Innovation: The co-designed GPU kernel contains a subtle but important novel element: the use of column-wise tiling to mitigate shared memory bank conflicts during the software-based decompression stage (Figure 9, page 6). This is a non-obvious implementation detail that directly addresses a key performance bottleneck for this class of algorithm and demonstrates a deep understanding of the GPU execution model.
Weaknesses
-
Fundamental Primitives are Not Novel: The primary weakness from a novelty perspective is that the core building blocks are well-established.
- Bitmap-based Sparse Formats: Representing sparse data with bitmaps is a classic technique. It is not a novel concept.
- Hardware Decoders for Bitmaps: The idea of a hardware unit that processes a bitmap to gate or select data is also not fundamentally new.
- Sparse Tensor Cores for Unstructured Sparsity: The most direct prior art, DSTC [55] and RM-STC [21], which the authors thankfully cite and compare against in Section 5.3.3, have already proposed the use of bitmap-based formats for unstructured sparsity in tensor cores.
Therefore, the claim to novelty must rest entirely on the specific architectural choices and co-design for the LLM workload, not on the invention of the underlying mechanisms. The paper's framing could be clearer on this point; it sometimes implies the format itself is the primary innovation, when in fact the true innovation is the simplified architecture it enables for this specific problem.
-
Limited Scope of Novelty: The contribution is sharply focused on skinny SpMM. While this is the paper's stated goal, it means the proposed architecture is less general than prior work. The novelty is derived from removing generality, which is a valid but narrow path for contribution. The paper would be strengthened by more explicitly defining the boundaries where its approach is superior and where more general (but complex) approaches like DSTC would be preferable.
Questions to Address In Rebuttal
-
Given that DSTC [55] and RM-STC [21] have already proposed bitmap-based sparse tensor cores for unstructured sparsity, please articulate precisely what the core architectural novelty of the Coruscant STC is, beyond a change in tile size and specialization for single-sided sparsity. The rebuttal should focus on why this specialization constitutes a significant inventive step over these prior works, rather than an incremental engineering optimization.
-
The column-wise tiling strategy (Figure 9, page 6) to avoid shared memory bank conflicts is a key part of the software kernel's performance. Is this technique novel in the context of GPU-based sparse matrix decompression, or are there precedents for this specific approach in prior literature on SpMM or other sparse kernels?
-
The paper's core trade-off is prioritizing memory compression over computation skipping, which is well-suited for the target memory-bound workload. Can you characterize the crossover point (e.g., in terms of batch size 'N' or a machine's operational intensity) where the computation-skipping benefits of a more complex architecture like DSTC would begin to outweigh the compression benefits of Coruscant? A clear analysis of this trade-off boundary would better situate the novelty of your contribution.
Accelerating Retrieval Augmented Language Model via PIM and PNM Integration
Abstract
Retrieval- Augmented Language Models (RALMs) integrate a language model with an external database to generate high-quality outputs utilizing up-to-date information. However, both components of a RALM system, the language model and the retriever, suffer ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper proposes MNM, a heterogeneous computing architecture integrating Processing-In-Memory (PIM) on the HBM core die and Processing-Near-Memory (PNM) on the logic die to accelerate Retrieval-Augmented Language Models (RALMs). The authors claim that by offloading memory-bound GEMV operations of the language model to PIM and retrieval-specific tasks (vector search, sorting) to PNM, their architecture can overcome memory bandwidth bottlenecks inherent to conventional GPU-based systems. They supplement this hardware with a novel scheduling strategy, "selective batching and early generation," which purportedly overlaps retrieval and generation to reduce idle cycles. The authors report significant performance speedups (up to 29.2x) and energy savings (up to 71.5%) over an NVIDIA H100 GPU baseline.
Strengths
-
Problem Characterization: The initial workload analysis in Section 3 (Figure 4, page 5) is thorough and correctly identifies the memory-bound nature of key RALM components. The roofline model analysis effectively motivates the need for a non-conventional architectural solution by demonstrating that both the language model's MHA kernels and the retriever's PQ code scan/top-k selection are far from the compute-bound regime on a state-of-the-art GPU.
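The arithmetic-intensity argument is easy to reproduce for the GEMV case. The peak-throughput and bandwidth figures below are commonly quoted H100 SXM ballpark numbers that I am assuming for illustration, not values taken from the paper.

```python
# FP16 GEMV y = W @ x with W of shape (m, n): 2*m*n FLOPs, while at least the
# m*n weight elements (2 bytes each) must be streamed from memory.
def gemv_intensity(m: int, n: int, elem_bytes: int = 2) -> float:
    flops = 2 * m * n
    bytes_moved = (m * n + m + n) * elem_bytes
    return flops / bytes_moved

PEAK_FLOPS = 989e12   # ~989 TFLOP/s FP16 tensor peak (assumed H100 SXM figure)
PEAK_BW = 3.35e12     # ~3.35 TB/s HBM3 bandwidth (assumed H100 SXM figure)
ridge_point = PEAK_FLOPS / PEAK_BW          # ~295 FLOP/byte

ai = gemv_intensity(4096, 4096)             # ~1 FLOP/byte
print(f"GEMV intensity = {ai:.2f} FLOP/byte vs. ridge point = {ridge_point:.0f} FLOP/byte")
```

A kernel operating near 1 FLOP/byte on a machine whose ridge point sits in the hundreds is bandwidth-bound by more than two orders of magnitude, which matches the regime in which Figure 4 places both the MHA kernels and the retriever's PQ code scan.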
-
Task Partitioning Rationale: The architectural decision to partition tasks—assigning simple, regular GEMV operations to PIM units and more complex, specialized retrieval logic to PNM—is well-reasoned (Section 4.1, page 6). This approach correctly acknowledges the fabrication and area constraints of logic on a DRAM die versus the flexibility of a logic die.
Weaknesses
My primary concerns with this work lie in the realism of the evaluation methodology, the downplaying of critical trade-offs, and the lack of rigor in overhead analysis.
-
Unjustified Quality-Performance Trade-off: The proposed "Early Generation" scheduling scheme is presented as a key innovation, yet it comes at the cost of model accuracy. Figure 8 (page 9) explicitly shows an increase in perplexity as batch size and nprobe scale. While the authors frame this as a smaller degradation compared to a GPU-based equivalent, it is a degradation nonetheless. The paper fails to adequately justify this trade-off. An architecture that achieves speedup by compromising the correctness of the model's output is fundamentally flawed. The core premise of RALM is to improve generation quality with retrieved data; this scheduler actively works against that goal by using stale data.
Questionable Area and Power Overheads: The overhead analysis in Section 6.6 (page 13) relies on highly suspect assumptions. The area of PIM logic is synthesized for a 14nm process and then scaled by a factor of "10x" to approximate a DRAM process node. This scaling factor is a crude approximation based on a single reference and lacks the necessary justification for a rigorous hardware paper. Consequently, the claim of a "15.0% overhead" on the core die is built on a weak foundation. This figure is not minor; a 15% reduction in memory cell area is a commercially prohibitive cost that the authors dismiss too readily. The power figures in Table 3 (page 10) also appear optimistic, and it is unclear if they account for all sources of leakage and activity under realistic concurrent operation.
-
Oversimplified Thermal Modeling: The thermal analysis presented in Section 4.1 (page 6) to justify PNM logic placement is superficial. The use of a simple 1D compact layered-conduction model is inadequate for a complex 3D-stacked device like HBM. This model ignores lateral heat spreading, the cumulative effect of hotspots from underlying PIM logic, and the thermal impact of the high-density TSV arrays. Presenting a temperature rise of less than 0.5°C based on such a model is unconvincing and potentially misleading. Real-world thermal throttling could easily negate the claimed performance benefits.
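For clarity about what a 1D compact layered-conduction model actually computes, the sketch below is the standard series-thermal-resistance form. Every layer thickness, conductivity, power, and area value here is an illustrative placeholder of mine, not a number from the paper; the omissions baked into the model (lateral spreading, hotspot superposition, TSV conduction paths) are exactly the objection raised above.

```python
# 1D series conduction: each layer contributes R = t / (k * A); dT = P * sum(R).
def delta_t_1d(power_w: float, area_m2: float,
               layers: list[tuple[float, float]]) -> float:
    """layers: (thickness_m, conductivity_W_per_mK) per layer. Returns the
    temperature rise in kelvin assuming uniform, strictly vertical heat flow."""
    r_total = sum(t / (k * area_m2) for t, k in layers)
    return power_w * r_total

# Illustrative stack (placeholder values): logic-die Si, bond/underfill, DRAM-die Si.
stack = [(50e-6, 140.0), (10e-6, 1.5), (50e-6, 140.0)]
print(f"dT = {delta_t_1d(1.0, 1e-4, stack):.3f} K")
```

A model of this form reports sub-degree rises almost by construction, because the dissipated power is assumed to spread uniformly over the full die area; concentrated hotspots under the PNM logic and their lateral coupling are precisely what it cannot capture, so the sub-0.5°C figure carries little evidentiary weight.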
-
Unquantified System-Level Complexity: The proposed architecture introduces significant complexity that is not accounted for in the evaluation.
- The MNM Controller (Figure 6, page 7), which translates GPU instructions into MNM commands, is a critical component whose design, latency, and area/power overheads are completely ignored.
- The scheme to reorder PQ codeword IDs to solve memory alignment issues requires a host-side mapping table. The memory footprint and lookup latency of this table are never quantified. For a large database, this could be a non-trivial overhead.
-
Speculative Scalability Claims: The model and database scaling analysis in Section 6.5 (page 13) is based entirely on projection, not simulation or measurement. While such analyses can be illustrative, the claims made here are presented with undue confidence. The argument that MNM suffers from lower communication overhead than a baseline multi-GPU system is unsubstantiated, as the authors fail to quantify the command and data traffic their own scaled-up system would require between the host, GPUs, and multiple MNM-enabled HBM stacks.
Questions to Address In Rebuttal
-
Provide a rigorous justification for the 10x area scaling factor used for PIM logic. How does this factor account for differences in cell libraries, routing density, and process design rules between a mature logic process and a modern DRAM process?
-
The "Early Generation" scheduler trades model accuracy (perplexity) for latency. At what point does this degradation become unacceptable for a production-level RALM? Please provide a sensitivity analysis showing how perplexity scales with more extreme retrieval latencies and characterize the point at which the model's output quality is critically compromised.
-
Elaborate on the cost of the 15.0% PIM area overhead on the HBM core die. Quantify this in terms of lost memory capacity per die and per stack. How does this impact the commercial viability of such a memory product compared to standard HBM?
-
Address the limitations of the 1D thermal model. Acknowledge the potential impact of 3D thermal effects and explain why the risk of thermal throttling from the proposed PNM logic is negligible, especially when placed adjacent to high-activity PHY and TSV regions.
-
Provide a detailed architectural design for the "MNM Controller" on the GPU side, including its projected area, power, and the latency it adds to the command path. Furthermore, quantify the memory footprint and access overhead of the host-side mapping table required for PQ codeword ID reordering for the largest dataset used.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents MNM, a heterogeneous computing architecture designed to accelerate Retrieval-Augmented Language Models (RALMs). The core contribution is the insightful co-design of a hardware system and a scheduling policy that addresses the two distinct, memory-bound bottlenecks inherent in RALMs. The authors propose integrating both Processing-In-Memory (PIM) on the HBM core die and Processing-Near-Memory (PNM) on the HBM logic die. The PIM units are leveraged to accelerate the GEMV-heavy attention operations in the language model, while the more flexible PNM logic is dedicated to the lookup- and sort-intensive tasks of the vector search-based retriever. This hardware architecture is complemented by a novel "selective batching and early generation" scheduling strategy that exploits the hardware's capabilities to maximize the overlap between token generation and retrieval, mitigating the idle cycles common in batched RALM inference.
Strengths
-
Excellent Problem Characterization and Motivation: The authors perform a thorough workload analysis (Section 3, page 5) that correctly identifies the fundamental challenge in RALM acceleration: it is not a monolithic problem. They astutely recognize that both the language model and the retriever are memory-bound, but for different reasons. The language model's attention is limited by bandwidth for GEMV operations, while the IVF-PQ retriever is constrained by irregular memory access patterns (LUTs) and sorting. This clear diagnosis provides a strong foundation for their proposed solution.
-
Holistic, System-Level Approach: This work is a superb example of system-level co-design. Rather than focusing on a narrow hardware optimization, the authors consider the entire RALM pipeline. They propose a hardware solution (MNM) and a software scheduling policy that are synergistic. The hardware enables more efficient, concurrent operations, and the scheduling policy is explicitly designed to exploit this new capability. This holistic view is a significant strength and reflects a mature understanding of the problem domain.
-
Elegant and Well-Justified Architectural Mapping: The core architectural idea—mapping the two distinct RALM workloads to different parts of an HBM stack—is elegant and compelling. Placing simple, massively parallel MAC units (PIM) in the core die for the regular GEMV operations is a natural fit. Simultaneously, using the logic die for a more specialized PNM accelerator for retrieval tasks is equally wise, as it allows for more complex logic (e.g., sorters, arbiters) without the process technology constraints of the DRAM die. This "right tool for the right job" approach is the paper's central technical insight.
-
Strong Contextual Fit within Current Research Trends: This work is exceptionally well-positioned within the broader landscape of computer architecture and AI systems. It directly engages with several critical, contemporary research threads:
- The Memory Wall: It is a direct attack on the memory wall for a critical emerging workload.
- Heterogeneous Computing: It moves beyond the traditional CPU-GPU dichotomy and embraces a more specialized, heterogeneous model integrating PIM and PNM.
- System-Level AI Acceleration: It recognizes that accelerating AI is about more than just speeding up matrix multiplies; it involves optimizing the entire data-to-answer pipeline, including data retrieval. This paper serves as a valuable case study for the future of AI system design.
Weaknesses
While the core idea is strong, the paper could be improved by broadening its contextual discussion to better situate its specific design choices.
-
Limited Discussion of Alternative Heterogeneous Designs: The proposed PIM/PNM integration within a single HBM stack is a compelling design point. However, the academic landscape includes other heterogeneous proposals. For instance, the recent work on HeterRAG [60] proposes using HBM-PIM for generation (low latency) and DIMM-based PIM for retrieval (high capacity). A discussion contrasting the MNM approach (all-in-HBM, maximizing bandwidth) with an approach like HeterRAG's would strengthen the paper by exploring the fundamental trade-off between retrieval database capacity and access latency. This would help readers understand the specific part of the design space that MNM is optimizing for.
-
Clarity on the Interdependence of Hardware and Software Contributions: The "Early Generation" scheduling scheme appears highly effective in conjunction with the MNM hardware. However, its performance on conventional systems is less clear. The perplexity analysis in Figure 8 (page 9) is a good start, but the paper could more explicitly articulate whether the scheduling strategy is a general contribution applicable to any RALM system, or if its benefits are primarily unlocked by the dramatically reduced retrieval latency provided by the MNM architecture. Clarifying this relationship would help delineate the impact of the hardware and software components of the work.
Questions to Address In Rebuttal
-
The choice to integrate both PIM and PNM accelerators within the HBM stack prioritizes bandwidth and low-latency communication between the two components. Could the authors elaborate on the trade-offs of this approach compared to a disaggregated design, such as the one proposed in HeterRAG [60], which might offer much larger retriever database capacity by using DIMM-based memory for retrieval? In what scenarios or scales of RALM deployment would the MNM design be most advantageous?
-
The proposed "Early Generation" scheduling is shown to be highly effective with MNM. Could you further elaborate on its applicability to conventional GPU-based systems? Given that GPUs would have a much longer retrieval latency, would the window for "early generation" shrink to the point of being ineffective, or does the selective batching component still provide significant benefits on its own?
-
The current PNM design is tailored for IVF-PQ retrieval. Looking forward, retrieval methods are evolving to include more complex logic, such as multi-hop reasoning or graph traversal. How extensible is the proposed PNM architecture to support these more compute-intensive retrieval tasks? Would it require a more general-purpose programmable core on the logic die, and what would be the implications for area and power?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper proposes MNM, a heterogeneous PIM/PNM architecture integrated within a High Bandwidth Memory (HBM) stack to accelerate Retrieval-Augmented Language Models (RALMs). The core idea is to partition the RALM workload according to its distinct computational characteristics and map each part to a specialized memory-centric compute unit. Specifically, the authors propose using Processing-In-Memory (PIM) on the HBM core die to accelerate the GEMV-heavy attention operations of the language model, while using Processing-Near-Memory (PNM) on the HBM logic die to accelerate the lookup- and sort-intensive tasks of the vector search retriever. This architecture is coupled with a novel scheduling policy, "Early Generation," which builds upon selective batching to overlap retrieval and generation phases, aiming to maximize throughput.
Strengths
The primary strength of this work lies in its elegant synthesis and co-design of existing concepts into a cohesive and well-motivated architecture.
- Novelty in Integration: The central novel claim is the tight, heterogeneous integration of both PIM and PNM accelerators within a single HBM stack, with each component tailored to a different part of the RALM workload. While PIM and PNM have been proposed for these tasks separately, their co-location and co-design within one memory device to serve a single application is a distinct architectural proposition.
- Workload-Specific Partitioning: The motivation for the heterogeneous design is well-argued. The paper correctly identifies the distinct bottlenecks of RALMs—GEMV-bound attention in the language model and irregular memory access patterns in the retriever (Section 3, page 5). The proposed mapping of these workloads to PIM (for its high internal bandwidth) and PNM (for its more complex logic capability) is logical and architecturally sound.
- Hardware-Software Co-Design: The proposed "Early Generation" scheduling scheme appears to be a direct consequence of the underlying hardware's capabilities. The ability to perform concurrent PIM and PNM operations, enabled by features like the dual row buffer (Section 4.1, page 6), allows for a more aggressive overlapping than would be possible in systems with physically separate accelerators. This synergy between the proposed hardware and software is a notable strength.
Weaknesses
My primary concern is that while the integration is novel, the fundamental building blocks and conceptual approaches are largely derivative of prior art. The paper could do a better job of positioning its contribution against these closely related works.
-
Constituent Ideas are Not New: The core ideas, taken in isolation, are well-established.
- PIM for Attention: Using HBM-PIM to accelerate GEMV operations in transformer attention is not a new idea. Prior works like AttAcc [80] and NeuPIM [24] have already established this approach. The PIM component of MNM appears functionally similar to these proposals.
- PNM for Vector Search: Accelerating vector search using near-memory processing is also a known technique. Works like Chameleon [34] and FAANS [32] have proposed dedicated accelerators (on FPGAs or DIMMs) for IVF-PQ and other approximate nearest neighbor search algorithms. The PNM component of MNM addresses the same problem with a similar approach.
-
Conceptual Overlap with Prior Heterogeneous Systems: The concept of a heterogeneous PIM architecture for RALMs has been recently proposed. HeterRAG [60] specifically proposes combining HBM-based PIM for generation with DIMM-based PIM for retrieval. MNM's core conceptual contribution—partitioning the RALM workload across different PIM/PNM technologies—is therefore not entirely novel. The delta between MNM and HeterRAG is primarily in the implementation: MNM integrates both units within a single HBM stack, whereas HeterRAG uses separate memory systems. The paper fails to discuss this crucial piece of prior art and articulate the specific benefits of its tightly-coupled approach over HeterRAG's disaggregated one.
-
Incremental Novelty in Scheduling: The proposed scheduling scheme is an intelligent adaptation of existing techniques. The idea of overlapping retrieval and generation was explored in PipeRAG [35]. The concept of dynamically managing requests to improve utilization is the basis of continuous/selective batching, as seen in systems like Orca [99]. The authors' "Early Generation" scheme is a clever combination of these ideas, but its fundamental novelty is tied exclusively to its co-design with the MNM hardware rather than being a new scheduling paradigm in itself.
Questions to Address In Rebuttal
The authors should address the following points to clarify the novelty and significance of their contribution.
-
Comparison with HeterRAG [60]: Please provide a detailed, quantitative comparison against HeterRAG. What are the specific architectural and performance advantages of integrating both PIM and PNM within a single HBM stack versus HeterRAG's approach of using separate HBM-PIM and DIMM-PIM systems? Does the tight coupling provide benefits beyond reduced physical distance, such as a more unified programming model or lower-latency coordination?
-
Clarifying Scheduling Novelty: The "Early Generation" scheduling is enabled by the MNM architecture. Can the authors pinpoint the exact architectural feature(s) that make this scheduling scheme uniquely effective or even possible? For instance, is it the dual row buffer that allows for contention-free concurrent access, which would not be possible with other PIM designs? How would a PipeRAG-style scheduling perform on the MNM architecture, and vice-versa?
-
Generality of the PNM Accelerator: The PNM design is highly optimized for IVF-PQ-based retrieval. How adaptable is this design to other, more modern vector search algorithms like HNSW? A truly novel contribution would ideally show a path toward broader applicability. Is the PNM component a fixed-function unit for IVF-PQ, or is it a more general-purpose programmable engine for near-memory retrieval acceleration?
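To make the fixed-function question above concrete, here is a minimal sketch, assuming a standard IVF-PQ asymmetric-distance formulation, of the regular gather-and-accumulate kernel a PNM engine could hard-wire; all names, shapes, and data are illustrative, not taken from the paper.

```python
import numpy as np

# Minimal IVF-PQ asymmetric-distance scan: the kind of fixed, regular kernel a
# near-memory engine can hard-wire. Shapes and data below are illustrative.
def ivfpq_scan(lut, codes):
    """lut: (M, 256) per-subspace distance table built from the query.
    codes: (N, M) uint8 PQ codes stored near memory.
    Returns approximate distances for the N candidate vectors."""
    M = codes.shape[1]
    # One table lookup per sub-quantizer, then a sum: no data-dependent control flow.
    return sum(lut[m, codes[:, m]] for m in range(M))

# By contrast, HNSW traversal is a data-dependent walk over per-node neighbor
# lists: each step's memory address depends on the previous step's result, so
# a fixed gather/accumulate datapath like the one above does not cover it.
query_lut = np.random.rand(8, 256).astype(np.float32)                 # M = 8 subspaces (assumed)
db_codes = np.random.randint(0, 256, size=(1024, 8), dtype=np.uint8)  # 1024 candidates (assumed)
distances = ivfpq_scan(query_lut, db_codes)
```

Whether the PNM unit is a fixed datapath for exactly this loop, or a programmable engine that could also execute an HNSW-style walk, is the crux of the question.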
HEAT: NPU-NDP HEterogeneous Architecture for Transformer-Empowered Graph Neural Networks
Abstract
Transformer-empowered Graph Neural Networks (TF-GNNs) are gaining significant attention in AI research because they leverage the front-end Transformer’s ability to process textual data while also harnessing the back-end GNN’s capacity to analyze graph ...
Reviews
Review 1
Review Form:
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose HEAT, a heterogeneous NPU-NDP architecture designed to accelerate inference for Transformer-empowered Graph Neural Networks (TF-GNNs). The architecture assigns the compute-intensive Transformer frontend to a Neural Processing Unit (NPU) and the memory-intensive Graph Neural Network (GNN) backend to a DIMM-based Near-Data Processing (NDP) unit. The paper introduces three primary optimizations: (1) a topology-aware encoding scheme that quantizes vertex features based on their degree; (2) a two-phase subgraph scheduling strategy to reduce DRAM bank conflicts and redundant memory accesses; and (3) a decoupled dataflow to enable parallel execution of the NPU and NDP. The authors claim significant speedup and energy efficiency gains over a high-performance GPU and state-of-the-art specialized accelerators.
Strengths
-
Problem Formulation: The paper correctly identifies a relevant and challenging problem. The performance bottleneck created by the sequential, two-phase nature of TF-GNNs—combining a compute-bound Transformer with a memory-bound GNN—is well-established. The roofline analysis in Figure 3(b) (page 4) provides clear motivation for a heterogeneous approach.
-
Architectural Concept: The high-level decision to map the Transformer to a compute-centric NPU and the GNN to a memory-centric NDP is logical and well-aligned with the computational characteristics of each model component.
Weaknesses
The paper's claims rest on a foundation of overly simplistic assumptions, questionable experimental design, and insufficiently justified heuristics. The reported performance gains appear to be a product of these weaknesses rather than a demonstrated architectural superiority.
-
Under-justified Topology-Aware Encoding: The central premise of the topology-aware encoding (Section 4, page 5) hinges on the simplistic assumption that vertex degree is a sufficient proxy for importance. The authors provide only a coarse masking experiment (Figure 4, page 4) as justification. This neglects more nuanced centrality metrics (e.g., PageRank, betweenness centrality) that often provide a more accurate measure of a node's influence in a graph. The choice of degree feels arbitrary and is not rigorously defended against alternatives, making the entire optimization seem superficial. Furthermore, the selection of the hyperparameter α (Section 6.4.2, page 11) is presented as a simple trade-off, but no sensitivity analysis is provided to show how this choice interacts with different graph structures or model architectures. (A short sketch after this list illustrates how the same mechanism could be driven by other importance metrics.)
-
Questionable Optimality of Scheduling Heuristics: The proposed subgraph scheduling strategy (Section 5.3, page 7) is based on a series of greedy heuristics. While the intra-SG scheduling algorithm aims to balance bank accesses, its optimality is not discussed. The paper fails to quantify how close the resulting schedule is to an optimal or even a lower-bound solution for bank conflict minimization. Similarly, the inter-SG scheduling relies on a heuristic that "subgraphs sampled from neighboring starting points tend to have more overlapped vertices." The generalizability of this observation across diverse and complex graph structures is asserted, not proven. The entire scheduling component lacks theoretical grounding.
-
Unfair and Potentially Misleading Baselines: The primary weakness lies in the evaluation methodology (Section 6.1, page 9).
- The "FACT+MEGA" baseline appears to be a strawman. The authors have constructed a baseline by serially chaining two independent, state-of-the-art accelerators. This naive composition is inherently limited by the sequential data dependency that HEAT is specifically designed to break. It does not represent a genuinely co-designed heterogeneous competitor and is therefore an unfair point of comparison. HEAT's outperformance of this baseline is expected and fails to demonstrate superiority over a competently designed alternative.
- The performance comparison against the A100 GPU lacks critical details. It is unclear what level of software optimization (e.g., CUDA graph fusion, optimized libraries) was applied to the GPU baseline. A poorly optimized GPU baseline can artificially inflate the speedup of a custom accelerator.
-
Insufficient Modeling of System-Level Overheads: The evaluation relies on simulation (ONNXim, DRAMsim3), but the paper provides insufficient detail on the modeling of the interconnect between the host, NPU, and NDP. The "NPU Memory Bus" in Figure 7 (page 6) is a black box. The practical overheads of communication, synchronization, and data coherency in a real system with such a decoupled dataflow are non-trivial and could significantly erode the claimed performance gains. The paper seems to assume an idealized communication fabric.
-
Neglect of Dynamic Graph Scenarios: The paper dismisses the cost of its offline scheduling algorithms by comparing it to the one-time cost of model training (Table 5, page 9). This is a critical omission for inference workloads. In many real-world applications (e.g., social networks, recommendation systems), graphs are dynamic and evolve over time. The proposed offline approach would require frequent and costly re-computation, rendering it impractical. The paper completely fails to address the applicability of its methods to dynamic graphs.
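As flagged in the first weakness above, the quantization mechanism only needs a per-vertex importance score, so the choice of metric can be evaluated separately from the hardware. Below is a minimal sketch of importance-driven precision assignment under that assumption; the thresholds, bit-widths, and toy scores are illustrative and not taken from the paper.

```python
def assign_precision(importance, hi_frac=0.2, mid_frac=0.5,
                     hi_bits=8, mid_bits=4, lo_bits=2):
    """importance: dict vertex -> score (degree, PageRank, ...).
    Returns dict vertex -> encoding bit-width for the Transformer front-end."""
    ranked = sorted(importance, key=importance.get, reverse=True)
    n = len(ranked)
    bits = {}
    for rank, v in enumerate(ranked):
        if rank < hi_frac * n:
            bits[v] = hi_bits            # most "important" vertices: full precision
        elif rank < (hi_frac + mid_frac) * n:
            bits[v] = mid_bits
        else:
            bits[v] = lo_bits            # tail vertices: aggressive quantization
    return bits

degree = {"a": 120, "b": 3, "c": 45, "d": 9, "e": 1}               # toy degree scores
pagerank = {"a": 0.31, "b": 0.05, "c": 0.22, "d": 0.30, "e": 0.12}  # toy PageRank scores
print(assign_precision(degree))     # degree-driven assignment
print(assign_precision(pagerank))   # same mechanism, different importance metric
```

Because the mechanism is metric-agnostic, a head-to-head accuracy comparison between degree and alternative centrality measures would directly address the concern.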
Questions to Address In Rebuttal
- Please provide a more rigorous justification for using vertex degree as the sole metric for importance in your topology-aware encoding. Have you evaluated other centrality metrics, and if so, how do they compare in the accuracy-performance trade-off?
- Can the authors provide an analysis of the optimality of the greedy subgraph scheduling algorithm? For instance, how does it compare to a theoretical upper bound or a more complex scheduling method like simulated annealing for a small, representative graph?
- Please defend the fairness of the FACT+MEGA baseline. Why is a simple sequential combination of two distinct accelerators a valid point of comparison for your co-designed architecture, rather than a hypothetical but more fairly integrated heterogeneous baseline?
- Provide more details on the communication model assumed between the NPU and NDP. What bandwidth and latency are assumed for the "NPU Memory Bus," and how do these assumptions impact the overall performance, especially in the decoupled dataflow? Please conduct a sensitivity analysis on these parameters.
- How would the proposed offline scheduling strategies adapt to dynamic graphs where the topology changes over time? What is the overhead of re-computing the schedule, and at what rate of graph change does this overhead nullify the benefits of the scheduling?
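The last question can be framed as a simple amortization argument; the sketch below, with purely illustrative constants, computes the break-even recompute rate beyond which re-scheduling a dynamic graph costs more than the scheduling saves.

```python
# Back-of-the-envelope model: re-scheduling pays off only while its amortized
# cost stays below the per-query latency it saves. All constants are assumptions.
def breakeven_recompute_rate(sched_cost_s, saving_per_query_s, queries_per_s):
    """Maximum schedule-recompute rate (per second) before the recompute cost
    cancels the scheduling benefit."""
    saved_per_second = saving_per_query_s * queries_per_s
    return saved_per_second / sched_cost_s

# e.g. 30 s to recompute the schedule, 2 ms saved per query, 1,000 queries/s:
rate = breakeven_recompute_rate(30.0, 0.002, 1000)
print(f"break-even: at most {rate:.3f} recomputes per second (~once every {1/rate:.0f} s)")
```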
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
The authors present HEAT, a heterogeneous hardware architecture designed to accelerate Transformer-empowered Graph Neural Networks (TF-GNNs). The paper identifies a key challenge in this emerging model class: the front-end Transformer is compute-bound, while the back-end GNN is memory-bound, and existing accelerators typically focus on only one of these domains. HEAT addresses this by co-designing a Neural Processing Unit (NPU) for the Transformer and a Near-Data Processing (NDP) unit for the GNN.
The central contribution is not merely the hardware itself, but a set of co-design principles that exploit the coupling between the Transformer front-end and the GNN back-end. These include: (1) a topology-aware encoding scheme that uses vertex importance (degree) from the graph to apply variable precision quantization in the Transformer, reducing its computational load; (2) a subgraph scheduling strategy for the GNN that bundles and reorders subgraphs to mitigate DRAM bank conflicts and improve data locality; and (3) a decoupled dataflow that breaks the strict sequential dependency between the two stages, enabling parallel execution on the NPU and NDP. The paper demonstrates significant speedup and energy efficiency gains over a high-performance GPU and state-of-the-art specialized accelerators.
Strengths
-
Novel Cross-Domain Co-Optimization: The key intellectual contribution of this paper is the idea of using information from one domain (graph topology) to optimize computation in another (Transformer encoding). The topology-aware encoding scheme (Section 4, page 5) is an elegant insight. It recognizes that not all nodes in a graph are equally important and that this importance, relevant to the GNN, can be back-propagated to guide the precision of the feature extraction in the Transformer. This is a powerful co-design principle that moves beyond optimizing individual components in isolation.
-
A Holistic, System-Level Perspective: The authors have successfully avoided the pitfall of local optimization. Instead of just bolting a Transformer accelerator onto a GNN accelerator, they have re-architected the entire execution flow. The decoupled dataflow (Section 5.4, page 8), enabled by their heterogeneous hardware split, is a prime example of this. It directly addresses the pipeline bubble that would otherwise form due to the inherent dependency, showing a deep understanding of the full system's performance limiters.
-
Addresses a Timely and Forward-Looking Problem: This work is situated at the confluence of several important trends: the rise of large language models (Transformers), their integration into structured data analysis (GNNs), and the persistent need for specialized hardware. TF-GNNs represent a class of complex, multi-stage AI models that are becoming more prevalent. This paper provides not just a point solution, but a potential architectural template for accelerating future composite AI systems.
-
Strong Supporting Optimizations: The subgraph scheduling strategy (Section 5.3, page 7) is a well-reasoned and effective technique for tackling the notorious memory access challenges of GNNs. The distinction between intra-SG scheduling (for bank parallelism) and inter-SG scheduling (for temporal locality via a Hot Buffer) is particularly clever and demonstrates a nuanced approach to memory system optimization.
Weaknesses
My critiques are less about fundamental flaws and more about the boundaries and future implications of the proposed ideas.
-
Limited Exploration of "Importance" Heuristics: The entire topology-aware encoding scheme hinges on using vertex degree as a proxy for importance. While the authors validate this heuristic empirically (Figure 4, page 4), vertex degree is a relatively simple centrality measure. In more complex graphs, other metrics like PageRank, betweenness centrality, or learned attention scores might be better indicators of importance. The work would be strengthened by a discussion on the sensitivity of the architecture to this choice and its generalizability beyond degree-based importance.
-
Static Nature of the Optimizations: The proposed optimizations, particularly the topology-aware encoding and subgraph scheduling, appear to be performed offline based on a static graph structure. This is suitable for many inference scenarios, but the paper does not address how HEAT would perform in dynamic settings, such as social networks with evolving connections or real-time recommendation systems where graph structures change frequently. This static assumption limits the applicability to a subset of real-world problems.
-
Potential Programmability and Abstraction Challenges: The co-designed nature of HEAT, while powerful, implies a tight coupling between the algorithm and the hardware. It is unclear how a developer would program such a system for a new or slightly different TF-GNN model. The paper would benefit from a brief discussion on the software stack or compiler support required to map new models onto this heterogeneous architecture and manage its unique dataflow.
Questions to Address In Rebuttal
-
Could the authors comment on the sensitivity of their topology-aware encoding scheme to the choice of vertex importance metric? Have they considered or run preliminary experiments with other metrics (e.g., PageRank) and how might that impact the trade-off between computation reduction and accuracy?
-
The decoupled dataflow is a key enabler of parallelism. Could the authors provide more insight into the overhead and management of the "Encode Table" (Section 5.4, page 8)? Specifically, how does the system handle synchronization between the NPU and NDP to ensure vertex features are ready when needed without introducing significant control overhead?
-
How would the proposed subgraph scheduling strategy adapt to a streaming or dynamic graph scenario where the graph topology changes over time? Would the intra-SG and inter-SG schedules need to be recomputed frequently, and what would be the associated overhead?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper proposes HEAT, a heterogeneous NPU-NDP architecture designed to accelerate Transformer-empowered Graph Neural Networks (TF-GNNs). The authors identify the performance challenges arising from the compute-bound Transformer front-end and the memory-bound GNN back-end. HEAT's contributions are presented as a three-pronged optimization strategy: (1) A topology-aware encoding scheme that uses GNN vertex importance (proxied by node degree) to apply variable-precision quantization to the front-end Transformer. (2) A two-phase subgraph scheduling strategy for the GNN back-end that bundles subgraphs to mitigate DRAM bank conflicts and reorders execution to improve data locality for "hot" vertices. (3) A decoupled dataflow that enables pipelined execution by having the Transformer on the NPU generate only the vertex features currently required by the GNN on the NDP.
My analysis concludes that the work contains one core, genuinely novel idea—the co-design linking graph topology to Transformer quantization. The other two contributions, while well-engineered and effective, are applications of well-established principles from the systems and architecture communities to the specific TF-GNN workload. The novelty in those areas lies in the specific heuristics and implementation, rather than in the concepts themselves.
Strengths
-
Novel Cross-Domain Optimization: The most significant and novel contribution is the topology-aware encoding algorithm (Section 4, page 5). The idea of using a structural property from the back-end graph computation (node degree) to guide the precision of the front-end text-encoding Transformer is a genuinely new co-design principle. Prior work on Transformer quantization is typically data-driven (e.g., ANT), while GNN quantization focuses on features within the GNN domain (e.g., MEGA). Linking the two is a clever insight specific to the TF-GNN model class and represents a clear conceptual advance.
-
Holistic System Co-design: The authors present a complete system that addresses bottlenecks across the entire TF-GNN pipeline. Rather than focusing on just the Transformer or just the GNN in isolation, the architecture and its accompanying algorithms consider the interplay between the two, which is commendable.
-
Well-Defined Problem and Solution: The paper does an excellent job of motivating the problem with clear profiling data (Figure 3, page 4) and systematically proposing solutions for each identified challenge. The connection between an observation (e.g., irregular memory access) and the proposed solution (e.g., subgraph scheduling) is logical and well-articulated.
Weaknesses
My critique focuses on the degree of conceptual novelty of the latter two contributions, which appear to be expert applications of existing ideas rather than the creation of new ones.
-
Incremental Novelty in Subgraph Scheduling: The subgraph scheduling strategy (Section 5.3, page 7), while effective, is conceptually an extension of known techniques. The problem of optimizing GNN memory access by reordering computations to improve locality and mitigate bank conflicts is well-trodden ground in GNN accelerator literature (e.g., I-GCN [12], GraNDe [59]). The authors' contribution is a specific two-phase greedy heuristic: bundling subgraphs based on complementary Bank_MAX/Bank_MIN access patterns is a new algorithm, but the underlying principle of co-scheduling tasks to balance resource utilization is a classic systems problem. Similarly, reordering based on neighboring starting points to exploit "hot vertices" is a direct application of the principle of temporal locality. The novelty here is algorithmic and heuristic, not conceptual.
-
Decoupled Dataflow as Applied Pipelining: The proposed decoupled dataflow (Section 5.4, page 8) is a direct application of producer-consumer pipelining to the TF-GNN workload. The "naive dataflow" (Figure 11b) is a classic bulk-synchronous process. The "decoupled dataflow" (Figure 11c) introduces fine-grained pipelining by enabling the consumer (NDP) to start as soon as the producer (NPU) provides the first piece of necessary data. This is a fundamental optimization in computer systems. The use of an "Encode Table" to avoid re-computation is a form of memoization, another standard technique. While this is excellent system engineering and crucial for performance on their heterogeneous architecture, it does not represent a fundamentally new concept in dataflow or execution models.
-
Complexity vs. Benefit Justification: The system introduces significant offline and online complexity (topology analysis, two-phase scheduling, inter-module control for the decoupled dataflow). The performance gains are substantial. However, much of this gain comes from applying known systems principles. The primary novel idea (topology-aware quantization) provides a 1.5x speedup in the ablation study (Figure 17, page 11, base to v1). The subsequent, more conventional optimizations (scheduling and pipelining) provide the remaining boost to the final performance. This suggests that while the system is effective, its performance is built upon a single core novel idea augmented by strong, but standard, engineering.
Questions to Address In Rebuttal
-
On Subgraph Scheduling: Could the authors please clarify the conceptual novelty of the intra-SG scheduling algorithm beyond being a new greedy heuristic for the well-known problem of DRAM bank-conflict reduction in irregular applications? What is the core insight that distinguishes it from prior work on data rearrangement or task scheduling for GNNs or other graph algorithms?
-
On the Generality of the Core Novelty: The paper's primary novel claim hinges on using node degree as a proxy for vertex importance to guide Transformer quantization. This is compelling for the social network and citation network graphs evaluated. How does this principle generalize to graph types where high-degree nodes are not necessarily the most informative (e.g., hubs in protein-protein interaction networks, or in knowledge graphs where relation types are more important than connectivity)? Is there a risk of this optimization being a domain-specific heuristic rather than a general principle for TF-GNNs?
-
On System-Level Conflicts: The decoupled dataflow necessitates fine-grained, on-demand feature generation from the Transformer on the NPU. This contrasts with conventional high-performance Transformer inference, which relies heavily on batching large numbers of inputs to maximize PE utilization. Does this on-demand execution model introduce significant throughput degradation or efficiency loss on the NPU, and if so, how does HEAT mitigate this? Is there a trade-off between latency reduction from pipelining and the NPU's overall throughput?
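To ground the batching-versus-latency trade-off in the question above, here is a minimal Python sketch of the decoupled producer/consumer pattern with an Encode-Table-style memo; the look-ahead batch size is the knob that trades front-end utilization against how soon the back-end can start. The names, toy encoder, and batching policy are illustrative assumptions, not the paper's implementation.

```python
def encode_batch(vertices):
    # Stand-in for the Transformer front-end; returns one feature per vertex.
    return [f"feat({v})" for v in vertices]

def feature_stream(requests, batch_size=4):
    encode_table = {}                        # memoize: encode each vertex once
    for i, v in enumerate(requests):
        if v not in encode_table:
            # Encode v plus a small look-ahead batch of upcoming, not-yet-encoded
            # requests, so the encoder is not fed one vertex at a time.
            batch = [v]
            for w in requests[i + 1:]:
                if w not in encode_table and w not in batch:
                    batch.append(w)
                if len(batch) == batch_size:
                    break
            for u, feat in zip(batch, encode_batch(batch)):
                encode_table[u] = feat
        yield v, encode_table[v]             # the GNN back-end can proceed now

# A back-end iterating over feature_stream([...]) starts aggregating after the
# first small batch instead of waiting for the whole graph to be encoded;
# raising batch_size improves encoder utilization but delays that first hand-off.
```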
RayN: Ray Tracing Acceleration with Near-memory Computing
Abstract
A desire for greater realism and increasing transistor density has led the GPU industry to include specialized hardware for accelerating ray tracing in graphics processing units (GPUs). Ray tracing generates realistic images, but even with specialized ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose RayN, a near-memory computing (NMC) architecture to accelerate ray tracing, specifically Bounding Volume Hierarchy (BVH) traversal, by placing specialized RT units on the logic layer of 3D-stacked DRAM. The paper identifies memory latency from pointer-chasing as the primary bottleneck and argues that NMC is a suitable solution. It explores three memory controller designs (JEDEC-compatible, Unified, Hybrid), a BVH partitioning heuristic to mitigate load imbalance, and evaluates the performance, energy, and area implications.
While the problem is well-motivated, the proposed solutions rest on several questionable assumptions. The most performant architectural proposals require non-standard memory interfaces with no clear path to adoption. The core technical contribution of load balancing is based on a weak heuristic, which is demonstrated to be ineffective by the authors' own results. Finally, the energy and area claims are undermined by a flawed evaluation methodology that omits key components from the analysis.
Strengths
- Problem Identification: The paper correctly identifies and quantifies that BVH traversal is highly sensitive to memory latency (Figure 1, page 1). The limit study (Figure 3, page 3) provides a reasonable, albeit optimistic, upper bound on the potential for improvement.
- Architectural Exploration: The consideration of three different memory controller architectures (Section 3.1, pages 4-5) demonstrates a thoughtful approach to the design space, even if the conclusions drawn from this exploration are problematic.
- Simulation Infrastructure: The use of a detailed, cycle-level simulator (Vulkan-Sim) integrated with a memory simulator (Ramulator) is appropriate for this type of architectural study.
Weaknesses
-
Impractical Architectural Assumptions: The highest-performing "Unified" and "Hybrid" architectures (Section 3.1.2, 3.1.3, page 5) are explicitly not compliant with the HBM JEDEC standard. They require fundamental changes to memory protocols, pins, and host-side controllers. Such proposals are largely academic exercises without a convincing argument for how the entire GPU and memory manufacturing ecosystem would adopt these changes. The "JEDEC-compatible" design is a poor alternative, requiring full BVH duplication and disabling memory channel interleaving, making it a performance strawman.
-
Ineffective Load Balancing Heuristic: The paper claims that a key challenge is partitioning the BVH to mitigate load imbalance. However, the proposed solution is demonstrably weak.
- The load estimation metric proposed in Equation 1 (page 8) is Volume(root node) * depth(partition). The authors' own analysis in Figure 10 (page 8) shows a very weak correlation between this metric and the actual intersection count, with an average correlation of approximately 0.6 and much lower for several scenes. A heuristic with such low predictive power is fundamentally flawed. (A sketch of how this correlation check can be reproduced follows this list.)
- Figure 14 (page 11, bottom) provides direct evidence of this failure. It shows the ratio of maximum to minimum intersection tests across memory modules. A perfectly balanced system would have a ratio of 1. The authors report an average ratio of 3.14 for their BLAS Breaking method. This indicates a severe load imbalance remains, directly contradicting the paper's claims of effective partitioning.
-
Unsubstantiated Energy Savings: The claim of a 70% average energy reduction (Figure 17, page 11) is based on a critical methodological omission. As stated in Section 5 (page 9), the "Power usage of near-memory RT units is not measured." The analysis only accounts for the reduction in data movement energy, while completely ignoring the power consumption of the newly added compute units. Placing active logic on the HBM die is a significant thermal challenge, and to ignore its power contribution makes the entire energy analysis unreliable and misleading.
-
Questionable Area Overhead Analysis: The area estimation for the near-memory RT unit (Section 6.3, page 11) is derived by scaling a 65nm mobile GPU architecture (SGRT [51]) to a 12nm process. Technology scaling laws are notoriously unreliable across such a vast gap in process nodes, design goals (mobile vs. high-performance logic), and transistor types. The resulting claim of a trivial 0.78% area overhead is likely a significant underestimate and lacks sufficient rigor.
-
Ambiguous Performance Claims: The abstract claims a "3.0x speedup on average," but Figure 11 (page 9) shows this is the result of their best-performing, non-standard architecture (H+BB) compared to the baseline. The speedup over the most relevant prior work, Treelets [21], is closer to 2.0x. This represents a form of "resultsmanship" that inflates the contribution.
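As flagged in the load-balancing weakness above, the correlation claim is straightforward to check independently; the sketch below, using made-up partition data, shows how Equation 1's estimate would be compared against measured per-partition intersection counts. The numbers are illustrative assumptions, not values from the paper.

```python
import math

def load_estimate(root_volume, partition_depth):
    # Equation 1 as described in the paper: Volume(root node) * depth(partition)
    return root_volume * partition_depth

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

partitions = [(10.0, 3), (2.5, 7), (40.0, 2), (1.0, 9)]        # (volume, depth), hypothetical
measured_intersections = [12_000, 30_000, 9_500, 41_000]       # hypothetical per-partition counts
estimates = [load_estimate(v, d) for v, d in partitions]
print("correlation:", round(pearson(estimates, measured_intersections), 2))
```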
Questions to Address In Rebuttal
- Regarding the Unified and Hybrid architectures: Beyond stating that they are non-standard, what is the realistic path to industry adoption for a proposal that requires coordinated changes from GPU vendors, memory manufacturers, and standards bodies like JEDEC?
- Given the poor correlation shown in Figure 10 and the severe measured imbalance in Figure 14, how can the authors justify that their load estimation heuristic (Equation 1) is a valid or meaningful contribution? Why were more sophisticated, possibly runtime-informed, balancing strategies not considered?
- Please provide a detailed justification for omitting the power consumption of the near-memory RT units. Can the authors provide a sensitivity study or a first-order estimation of this power and re-evaluate their energy savings claims, accounting for the strict thermal design power (TDP) constraints of an HBM logic die?
- Can the authors provide a more robust justification for their area scaling methodology? For instance, by comparing their scaled estimates to known block sizes from more modern, publicly available die shots of high-performance logic.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents RayN, a near-memory computing architecture designed to accelerate ray tracing by tackling the latency bottleneck of Bounding Volume Hierarchy (BVH) traversal. The core contribution is a holistic system co-design that places specialized, but simple, Ray Tracing (RT) units within the logic layer of 3D stacked DRAM (e.g., HBM). This approach is supported by two key pillars: 1) a thoughtful exploration of memory controller architectures (JEDEC-compatible, Unified, and Hybrid) to manage concurrent access between the host GPU and the near-memory units, and 2) a novel software partitioning algorithm, "BLAS Breaking," that subdivides the scene geometry's acceleration structure to improve load balancing across memory modules. The authors demonstrate through simulation that RayN can achieve an average speedup of 3.0x over a conventional GPU baseline and a 2.2x speedup over a state-of-the-art prefetching technique, all while reducing energy consumption by 70% and incurring a negligible area overhead of ~0.78%.
Strengths
This is a well-executed and compelling piece of work that sits at the intersection of computer graphics, architecture, and the growing field of near-memory computing. Its primary strengths are:
-
A Holistic and Pragmatic System View: The authors' most significant contribution is not just proposing "PIM for ray tracing," but thoughtfully considering the entire system stack. The exploration of three different memory controller designs in Section 3.1 (page 5) is a standout feature. It demonstrates a deep awareness of the practical challenges of deploying near-memory solutions, weighing the trade-offs between standard compliance (JEDEC), raw performance (Unified), and a practical compromise (Hybrid). This elevates the work from a purely academic exercise to a plausible architectural proposal.
-
Connecting Two Converging Trends: The paper effectively synthesizes two major trends in high-performance computing: the specialization of hardware for graphics (dedicated RT cores) and the push towards processing-in-memory/near-memory computing (PIM/NMC) to overcome the memory wall. By identifying BVH traversal as a memory-latency-bound, pointer-chasing problem (as strongly motivated by Figure 1), the authors find a "killer app" for NMC in the graphics domain, which has historically driven memory system innovation (e.g., GDDR).
-
Novel and Well-Motivated Algorithmic Co-design: The "BLAS Breaking" partitioning scheme described in Section 4.1 (page 7) is an elegant solution. Instead of reinventing the wheel, it cleverly adapts the existing TLAS/BLAS dichotomy found in modern graphics APIs like Vulkan and DirectX. By breaking down large BLASes into smaller, more numerous partitions, the system can achieve better load balancing across the distributed near-memory RT units, as shown in Figure 9. This synergy between the hardware proposal and the software algorithm is a key strength.
-
Strong and Clearly Attributed Results: The reported 3.0x speedup is substantial. Crucially, the authors provide excellent analysis to explain why this speedup is achieved. Figure 12 clearly shows the latency reduction for near-memory accesses, while Figure 13 demonstrates a massive (78%) reduction in memory transactions issued by the GPU's main RT units. This detailed breakdown makes the performance claims highly credible and provides valuable insight into the system's behavior. The minimal area overhead and significant energy savings further solidify the proposal's value.
Weaknesses
The paper is strong, but its potential impact could be further clarified by addressing a few points where the current analysis feels incomplete.
-
Oversimplified Load Balancing Heuristic: The core of the memory placement strategy relies on the load estimation heuristic in Equation 1 (page 8). As the authors honestly show in Figure 10, the correlation between this estimate and the actual work is modest. While the sensitivity study suggests the system is robust even in a worst-case imbalance, this heuristic remains the Achilles' heel of the software design. The performance gains could be highly dependent on camera paths and scene layouts that happen to align well with this simple volume-times-depth metric. The work would be stronger if it explored or at least discussed more sophisticated alternatives.
-
Limited Scope of Dynamic Workloads: The evaluation is based on rendering multiple frames of a scene by changing the camera position, which simulates some level of dynamism. However, real-time gaming and interactive applications feature far more complex dynamics, including object deformation, destruction, and streaming assets. These scenarios would necessitate frequent refitting or rebuilding of BLASes. The paper does not analyze the overhead of re-partitioning and re-distributing these dynamic BLASes across memory modules, which could potentially erode the performance gains in highly fluid scenes.
-
Positioning Relative to Cache-Centric Solutions: The paper positions itself against a strong prefetching baseline (treelets). However, it could better contextualize its contribution by briefly discussing how RayN compares, conceptually, to a brute-force approach of simply scaling up on-GPU caches (L2/L3). While NMC is likely the superior path for this kind of irregular access pattern, explicitly stating why—for instance, the inefficiency of caching vast, sparsely accessed BVH trees versus moving the computation—would strengthen the argument that RayN represents a fundamentally better architectural direction, not just an alternative one.
Questions to Address In Rebuttal
-
Regarding the load estimation heuristic (Equation 1, page 8): Have the authors considered alternative or complementary metrics? For example, in a real-time context, could profiling data from the N-1 frame be used to dynamically adjust the load estimates for the Nth frame, creating a more adaptive and accurate placement strategy over time? (A sketch of this feedback idea follows this list.)
-
The paper's design partitions the BVH tree geographically. Could it also be partitioned based on ray type? For instance, in many scenes, a small number of complex objects are responsible for most reflection/refraction effects. Could BLASes for these "hero" objects be handled differently (e.g., duplicated or prioritized) to optimize the traversal of secondary rays, which often exhibit less coherence than primary rays?
-
How does the proposed system handle the interaction between BVH traversal and programmable shaders (e.g., intersection shaders or any-hit shaders)? The paper mentions support for them (Section 4.3, page 8), but running complex, arbitrary shader code on the simple near-memory RT units seems challenging. Could you elaborate on the limitations of the programmable logic in the near-memory units and the mechanism for handing off to the main GPU shader cores if a complex shader is encountered?
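For the frame-to-frame feedback idea raised in the first question above, here is a minimal sketch of one possible blending rule: rescale the previous frame's measured counts to the static estimate's scale and mix the two. The blend factor and all data are illustrative assumptions.

```python
def adaptive_estimates(static_estimates, prev_frame_counts=None, alpha=0.5):
    """Blend Equation-1-style static estimates with measured counts from frame N-1."""
    if prev_frame_counts is None:                       # first frame: static only
        return list(static_estimates)
    scale = sum(static_estimates) / sum(prev_frame_counts)
    return [
        (1 - alpha) * est + alpha * meas * scale        # exponential-style blend
        for est, meas in zip(static_estimates, prev_frame_counts)
    ]

static = [30.0, 17.5, 80.0, 9.0]                   # per-partition static estimates (assumed)
frame_n_minus_1 = [12_000, 30_000, 9_500, 41_000]  # measured intersections from frame N-1 (assumed)
print(adaptive_estimates(static, frame_n_minus_1))
```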
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper proposes RayN, a near-memory computing architecture designed to accelerate ray tracing. The core idea is to place dedicated ray tracing hardware units (RT units), functionally similar to those on a modern GPU die, directly into the logic layer of 3D-stacked memory modules (e.g., HBM). The stated goal is to mitigate the high memory latency inherent in the pointer-chasing nature of Bounding Volume Hierarchy (BVH) tree traversal.
To support this architectural proposal, the authors introduce three distinct memory controller configurations (JEDEC-compatible, Unified, Hybrid) to manage concurrent memory access from the host GPU and the near-memory RT units. Furthermore, they propose a novel BVH partitioning algorithm, "BLAS Breaking," which leverages the existing TLAS/BLAS API structure to divide the BVH tree across multiple memory modules. This partitioning is guided by a new load-balancing heuristic. The authors claim their Hybrid configuration with BLAS Breaking achieves an average speedup of 3.0x over a baseline GPU.
Strengths
The primary strength of this paper lies in the novel synthesis of two existing, but separate, technology trends: specialized hardware for ray tracing and near-memory computing.
-
Novelty of Architectural Synthesis: While near-memory computing (NMC) is a well-explored field, and on-die RT units are now industry standard, the proposal to place specialized RT accelerator logic directly on the memory logic die appears to be a genuinely novel concept. Prior work on NMC for GPUs has typically focused on offloading kernels to general-purpose cores (e.g., Hsieh et al., "TOM," ASPLOS 2016, ref [37]) or accelerating different domains like graph processing (e.g., Ahn et al., ISCA 2015, ref [7]). This paper correctly identifies a key workload (BVH traversal) that is an excellent candidate for specialized NMC and proposes a specific, bespoke hardware solution.
-
Novelty in Problem-Specific Algorithms: The architectural idea is supported by a novel partitioning algorithm tailored specifically for BVH trees. The "BLAS Breaking" method (Section 4.1, page 7) is a clever adaptation that leverages domain knowledge of how graphics APIs already structure scenes. The load estimation heuristic presented in Equation 1 (page 8),
Volume(root node) × depth(partition), is a simple but new contribution for predicting ray tracing workload distribution without runtime camera information. This demonstrates a complete system view, moving beyond just the hardware placement.
Weaknesses
While the central synthesis is novel, several of the supporting components are derivative of prior art, and the evaluation does not sufficiently defend the necessity of the proposed specialized hardware against alternative NMC approaches.
-
Incremental Novelty of Memory Controller Designs: The three controller architectures presented in Section 3.1 (pages 4-5) are not fundamentally new paradigms for managing shared memory access in an NMC system. The "JEDEC-compatible" design, which partitions memory channels, is a standard technique to avoid structural hazards in heterogeneous systems. The "Unified" controller is conceptually similar to architectures where the main memory controller is moved off-host. The "Hybrid" model is a pragmatic engineering compromise between the two. The problem of enabling concurrent access for near-data accelerators and a host is well-known, with prior work like Cho et al. (ISCA 2020, ref [20]) exploring similar challenges. The novelty here is in the application and trade-off analysis, not in the underlying controller concepts.
-
Limited Novelty and Efficacy of the Partitioning Heuristic: The proposed load-balancing heuristic (Equation 1, page 8) is acknowledged to have a "small" correlation with the actual measured load (Figure 10, page 8). The results confirm this, showing a remaining load imbalance with a max/min ratio of 3.14 on average (Figure 14, page 11). While the heuristic itself is new, it is a very simple model. The field of workload prediction and cost modeling for tree traversal is mature, particularly in the database domain (e.g., query plan optimization). The paper fails to justify why a more sophisticated model was not explored, which weakens the contribution of this novel, but seemingly ineffective, algorithm.
-
Failure to Contrast with Software on General-Purpose NMC: The paper's core claim rests on the need for specialized near-memory RT units. However, it fails to provide a comparison against a functionally similar system that uses general-purpose near-memory cores, such as those proposed in TOM (ref [37]) or implemented in commercial products like UPMEM's PIM system (ref [4]). The authors state that prior NMC graph architectures "lack the specialized accelerator hardware required" (Section 1, page 1), but this is an assertion, not a quantified conclusion. Without data showing that a software-based BVH traversal on a general-purpose near-memory core is insufficient, the central novel claim—that specialized hardware is the right solution—remains unsubstantiated. The delta between this work and a software-based NMC approach is unclear.
Questions to Address In Rebuttal
-
Regarding the memory controller designs (Section 3.1), please clarify the precise delta between your proposed solutions and the state-of-the-art in arbitrating between a host processor and a near-memory accelerator. Beyond the application to ray tracing, what is the fundamental novel contribution in the controller logic or protocol itself?
-
The proposed load-balancing heuristic (Equation 1) shows weak correlation to the actual load. Can the authors justify the decision not to explore more advanced predictive models? Have you considered prior art in cost estimation for tree-based data structures from other domains (e.g., databases) or even simple machine learning models trained offline on representative camera paths?
-
The central thesis is that specialized near-memory hardware is required. To justify this novel hardware proposal, please provide a quantitative comparison or a well-reasoned estimate of the performance of a software-based BVH traversal algorithm running on an array of general-purpose, low-power cores in the memory logic die (a configuration proposed by prior work such as ref [37] or ref [71]). Without this, it is difficult to assess whether the complexity of adding new, specialized RT units is justified over a more flexible software-based approach.
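To make the software-alternative question concrete, here is a minimal sketch of the BVH traversal loop a general-purpose near-memory core would run: the arithmetic per step is small, but every iteration is a data-dependent pointer chase, which is exactly the behaviour the specialized RT units target. The node layout and intersection test are simplified assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Node:
    lo: Tuple[float, float, float]           # AABB min corner
    hi: Tuple[float, float, float]           # AABB max corner
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    prims: Optional[List[int]] = None        # leaf: primitive ids

def hit_aabb(origin, inv_dir, lo, hi):
    """Slab test; inv_dir holds 1/direction per axis."""
    tmin, tmax = 0.0, float("inf")
    for o, inv, l, h in zip(origin, inv_dir, lo, hi):
        t0, t1 = (l - o) * inv, (h - o) * inv
        tmin, tmax = max(tmin, min(t0, t1)), min(tmax, max(t0, t1))
    return tmin <= tmax

def traverse(root, origin, inv_dir):
    stack, hits = [root], []
    while stack:                              # the pointer-chasing loop
        node = stack.pop()
        if not hit_aabb(origin, inv_dir, node.lo, node.hi):
            continue
        if node.prims is not None:
            hits.extend(node.prims)           # leaf: primitive tests would go here
        else:
            stack += [node.left, node.right]
    return hits

left = Node((0, 0, 0), (1, 1, 1), prims=[7])
right = Node((1, 1, 1), (2, 2, 2), prims=[9])
root = Node((0, 0, 0), (2, 2, 2), left=left, right=right)
print(traverse(root, (0.5, 0.5, -1.0), (1e9, 1e9, 1.0)))   # ray along +z hits prim 7
```

Profiling this loop on general-purpose near-memory cores versus the proposed RT units is what would quantify the delta the question asks for.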
Pimba: A Processing-in-Memory Acceleration for Post-Transformer Large Language Model Serving
Abstract
Transformers are the driving force behind today’s Large Language Models (LLMs), serving as the foundation for their performance and versatility. Yet, their compute and memory costs grow with sequence length, posing scalability challenges for long-context ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper proposes PIMBA, a Processing-in-Memory (PIM) accelerator designed to serve both transformer and "post-transformer" Large Language Models (LLMs), such as those based on State Space Models (SSMs). The authors' central thesis is that a common "state update" operation in post-transformer models is memory-bandwidth-bound, similar to the attention mechanism in transformers. PIMBA's architecture is based on two primary ideas: (1) a State-update Processing Unit (SPU) shared between two DRAM banks using an interleaved access pattern to improve utilization, and (2) the use of MX8 quantized arithmetic to achieve a better accuracy-area tradeoff compared to other low-precision formats. The authors claim significant throughput improvements over GPU and GPU+PIM baselines with minimal accuracy loss.
Strengths
- The paper correctly identifies a relevant and timely research direction: accelerating the emerging class of post-transformer LLMs.
- The workload characterization in Section 3, which identifies the state update operation as a potential bottleneck under batched inference, provides a reasonable starting point for the investigation.
- The analysis considers the critical trade-off between quantization-induced accuracy degradation and hardware area overhead (Section 4.2), which is a necessary component of any proposal involving low-precision arithmetic.
Weaknesses
-
Unsupported Foundational Motivation: The paper's motivation hinges on the claimed superiority of post-transformer models. Figure 1(a) presents a comparison where Mamba-2 achieves 4.5% higher accuracy than a baseline transformer. This result is cited from an external source [15] without providing the necessary context to validate its fairness (e.g., equivalent training compute, data, and model tuning). Without this validation, the entire premise that the community should invest in specialized hardware for these models is built on a potentially specious claim.
-
Oversimplification of the "State Update" Primitive: The paper generalizes the core operations of diverse architectures (SSMs, linear attention, RNNs) into a single "state update" primitive, formalized in Equation 2 (Page 4). This abstraction is a significant simplification. It glosses over key differences, such as the scalar decay in Mamba-2 versus the vector-based gating in GLA. The paper provides no evidence or sensitivity analysis to demonstrate that this single, generalized hardware implementation can efficiently serve these varied computational patterns without significant performance penalties or architectural compromises for one model type versus another.
-
Insufficient Justification for the Choice of MX8: The argument for using MX8 over int8 rests on the claim that int8 incurs "substantial area overhead" due to the need for dequantization and requantization for addition operations (Section 4.2, Page 6). This claim is asserted but not rigorously substantiated. The paper fails to provide a quantitative, apples-to-apples area comparison between the components of their proposed MX Adder (Figure 9b), which includes shifters and comparison logic, and a standard int8 datapath for the same function. Without this direct comparison, the claim that MX8 is Pareto-optimal remains an unproven assertion.
-
Weak and Potentially Biased Baselines: The experimental comparison relies on baselines that appear to be deliberately weakened. The GPU+PIM baseline is described as a "time-multiplexed design" that explicitly lacks the "access interleaving technique of PIMBA" (Section 6.1, Page 10). This constructs a strawman; a fair comparison would be against a more aggressively pipelined PIM baseline, allowing the reader to properly assess the incremental benefit of PIMBA's specific design choices. By comparing against a seemingly self-crippled baseline, the reported speedups of up to 2.1x are likely inflated.
-
Questionable Architectural Novelty: The core architectural proposal of sharing a processing unit between two banks using "access interleaving" (Section 5.2, Page 7) is presented as a key innovation. However, both resource sharing to amortize area cost and interleaving to mitigate memory bank contention are foundational techniques in computer architecture and parallel systems. The paper fails to adequately position this contribution with respect to prior art, thereby overstating its novelty.
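To pin down what the interleaving buys relative to the classic principle it applies, here is a back-of-the-envelope throughput model; the one-cycle read/compute/write stages and the bank-occupancy assumption are illustrative, not taken from the paper.

```python
def updates_per_cycle(num_banks, num_units, r=1, c=1, w=1):
    """Sustained state updates per cycle, limited by whichever is scarcer:
    bank occupancy (each op holds its bank for r+c+w cycles) or compute units."""
    bank_limit = num_banks / (r + c + w)
    unit_limit = num_units / c
    return min(bank_limit, unit_limit)

print(updates_per_cycle(num_banks=1, num_units=1))   # time-multiplexed:    ~0.33
print(updates_per_cycle(num_banks=2, num_units=1))   # shared, interleaved: ~0.67
print(updates_per_cycle(num_banks=2, num_units=2))   # per-bank units:      ~0.67, twice the logic
```

Under these assumptions the shared unit matches per-bank throughput at roughly half the logic; whether that constitutes a new architectural idea or a careful application of interleaving is precisely what the rebuttal should clarify.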
Questions to Address In Rebuttal
-
Regarding the core motivation in Figure 1: Can you provide concrete evidence that the 2.7B parameter transformer and Mamba-2 models are fairly compared in terms of training data, total training FLOPs, and hyperparameter tuning? If not, how does this uncertainty affect the premise of your work?
-
Regarding the generalized state update primitive (Equation 2): Please provide a quantitative analysis of the performance and area efficiency of your proposed SPU when executing the specific, non-generalized operations of RetNet, GLA, and HGRN2. How much efficiency is lost by forcing these distinct operations onto your unified hardware?
-
Regarding the choice of MX8 over int8: Please provide a detailed, post-synthesis area breakdown of the MX Adder and MX Multiplier versus an equivalent int8 datapath that includes the necessary dequantization/requantization logic. This data is required to substantiate the claim of Pareto optimality made in Figure 6.
-
Regarding the GPU+PIM baseline: Please justify the decision to use a non-interleaved, "time-multiplexed" design as your primary PIM baseline. Why was a more competitive, pipelined PIM design not used for comparison? How much of your reported speedup is attributable to simply having a better pipeline structure versus the baseline you designed?
-
Regarding the accuracy results in Table 2: Was any form of quantization-aware training or post-quantization fine-tuning used to achieve the reported near-zero accuracy degradation? If only post-training quantization was used, the results are unusually strong and require further explanation and validation.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents PIMBA, a Processing-in-Memory (PIM) accelerator designed for serving both existing Transformer-based Large Language Models (LLMs) and the emerging class of "post-transformer" models (e.g., State Space Models like Mamba-2, linear attention).
The authors' core contribution is founded on a crucial unifying insight: despite their algorithmic differences, the performance of both model classes during batched inference is fundamentally bottlenecked by memory bandwidth. For Transformers, this is the well-known attention mechanism; for post-transformers, it is a newly identified bottleneck in the "state update" operation.
Based on this, the authors propose a unified PIM architecture that can efficiently execute both types of operations. The design is guided by two key principles derived from their analysis: (1) The varied primitives in state update operations make per-bank PIM logic area-inefficient, motivating a shared "State-update Processing Unit" (SPU) that interleaves access between two banks. (2) Post-transformer models are sensitive to quantization, and the authors identify Microsoft's MX format (specifically MX8) as a Pareto-optimal choice for balancing accuracy and hardware area in a PIM context. The resulting system, PIMBA, demonstrates significant throughput improvements over GPU and GPU+PIM baselines while maintaining model accuracy and adhering to practical area constraints.
Strengths
-
Timeliness and Important Vision: This work is exceptionally timely. As the research community actively seeks alternatives to the quadratically scaling Transformer, the question of how to build hardware that supports this transition is critical. The paper's vision of a unified serving system that bridges the gap between today's Transformers and tomorrow's post-transformers is both ambitious and highly valuable. It provides a practical roadmap for evolving our hardware infrastructure.
-
Excellent Unifying Abstraction: The paper's primary strength lies in its abstraction of the core performance problem. By identifying the "state update" as the conceptual parallel to "attention" and demonstrating through roofline analysis (Figure 1b, page 2) and latency breakdowns (Figure 3, page 4) that both are memory-bound, the authors distill a complex landscape into a single, addressable hardware challenge. This insight forms a powerful foundation for their entire work.
-
Principled, Data-Driven Hardware Design: The design of PIMBA is not arbitrary; it follows directly from well-articulated principles and thorough analysis.
- The decision to share an SPU between two banks is a clever solution to the area-throughput tradeoff identified in their analysis of pipelined vs. time-multiplexed PIM designs (Section 4.1, page 6).
- The selection of the MX8 format is convincingly justified through a detailed accuracy-area tradeoff analysis (Figure 6, page 6), which clearly shows its superiority over other formats like int8 (too much area) and low-precision floating point (poor accuracy) for this specific workload. This is an excellent piece of co-design.
-
Comprehensive Scope and Evaluation: The authors evaluate their proposal against a broad set of modern architectures—four distinct post-transformer models, a hybrid model (Zamba2), and a traditional Transformer (OPT). The evaluation across multiple scales (up to 70B parameters) and on a wide range of metrics (throughput, latency, energy, accuracy, and area) lends significant credibility to their claims. The minimal accuracy degradation shown in Table 2 (page 11) is particularly compelling evidence for the viability of their chosen quantization strategy.
Weaknesses
My concerns are less about flaws in the existing work and more about the boundaries of its claims and its integration into the broader systems landscape.
-
Robustness of the "State Update" Generalization: The authors unify several post-transformer operations into a single generalized state update form (Equation 2, page 4). This is elegant and effective for the models studied. However, the post-transformer field is nascent and evolving rapidly. It is conceivable that a future dominant architecture might introduce primitives (e.g., complex non-linearities, different data dependencies) that do not map cleanly to this structure. This could limit the future-proofing of the PIMBA design.
-
System-Level Integration Challenges: The paper effectively addresses the accelerator microarchitecture but is lighter on its integration into a full-fledged, dynamic serving system. Modern LLM serving schedulers (like those in vLLM or Orca) use sophisticated techniques like continuous batching and preemption to maximize GPU utilization. The paper acknowledges that PIMBA operates in a "blocked manner" with the GPU (Section 8, page 13), which inherently leads to utilization gaps. While the authors suggest leveraging techniques from NeuPIMs [28], a more detailed discussion of how PIMBA's deterministic, command-based execution model would coexist with the highly dynamic and asynchronous nature of a production-level scheduler would strengthen the paper's system-level contribution.
-
The Software Hurdle: As with all novel hardware, the software and programmability aspect is a major barrier to adoption. The paper mentions extending CUDA and defining custom DRAM commands (Section 5.1, page 7), which is a non-trivial engineering effort. A deeper contextualization of the required software changes would provide a more complete picture of the path to real-world deployment.
Questions to Address In Rebuttal
-
Regarding the generalized state update operation (Equation 2): Could the authors comment on the potential limitations of this formulation? Have they considered any emerging post-transformer algorithms that might challenge this abstraction, and how might PIMBA be adapted to accommodate them?
-
Could the authors elaborate on the system-level scheduling vision? How would a system using PIMBA-enabled memory efficiently manage the pipeline bubbles created during the hand-offs between GPU and PIM execution, especially in a dynamic, multi-user environment managed by a sophisticated scheduler?
-
The Pareto-optimality of the MX8 format was compellingly demonstrated for the Mamba-2 model (Figure 6). Was this detailed tradeoff analysis also performed for the other SU-LLMs (e.g., RetNet, GLA)? Confirming that MX8 remains the optimal choice across this diverse set of models would further strengthen this key design decision.
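For readers less familiar with the format, here is a rough, illustrative model of a shared power-of-two-scale block format in the spirit of MX8; the block size, element width, and rounding are assumptions, and this is not the exact OCP MX definition. The need to align two blocks with different shared scales before adding them is where the extra adder logic discussed in the trade-off above comes from.

```python
import numpy as np

def quantize_block(x, elem_bits=8):
    """One power-of-two scale shared by the whole block; small signed elements."""
    scale = 2.0 ** int(np.ceil(np.log2(np.max(np.abs(x)) + 1e-30)))
    qmax = 2 ** (elem_bits - 1) - 1
    q = np.clip(np.round(x / scale * qmax), -qmax, qmax).astype(np.int32)
    return scale, q

def dequantize_block(scale, q, elem_bits=8):
    return q.astype(np.float64) * scale / (2 ** (elem_bits - 1) - 1)

x = np.random.randn(32)                         # one 32-element block (assumed size)
scale, q = quantize_block(x)
err = np.max(np.abs(x - dequantize_block(scale, q)))
print(f"shared scale: {scale}, max abs quantization error: {err:.4f}")
```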
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents PIMBA, a Processing-in-Memory (PIM) accelerator designed to serve both transformer and, more critically, "post-transformer" Large Language Models (LLMs) like State Space Models (SSMs) and their variants. The authors first identify a common, memory-bound "state update" operation that unifies various post-transformer architectures. They then propose a novel PIM architecture to accelerate this operation alongside standard attention. The core architectural claims of novelty are (1) a State-update Processing Unit (SPU) that is shared between two memory banks, using an "access interleaving" technique to mitigate read/write hazards and maximize throughput within a constrained area budget, and (2) a State-update Processing Engine (SPE) within the SPU that uses custom microarchitecture for element-wise multiplication and addition on the MX low-precision data format, extending its use beyond its original dot-product design.
Strengths
The primary strength of this paper lies in its well-defined and well-defended novel contributions at multiple levels of abstraction.
-
Novel Workload Abstraction: The identification and generalization of the "state update" operation (Equation 2, Section 3.1, page 4) across a diverse set of emerging post-transformer models (Mamba-2, GLA, RetNet, HGRN2) is a significant conceptual contribution. While prior work has analyzed individual models, this paper is the first I have seen to propose a unified hardware target based on this common algorithmic pattern. This insight alone provides a strong foundation for the work.
-
Novel Architectural Technique: The "access interleaving" mechanism where a single SPU is shared between two banks (Section 5.2, page 7) is an elegant and novel solution to a practical PIM design problem. The authors correctly identify the area-throughput tradeoff between naive time-multiplexed and fully-pipelined per-bank designs (Section 4.1, page 6). Their proposed solution, which pipelines operations by alternating reads and writes between two banks into a single SPU, effectively achieves the throughput of a per-bank design with roughly half the area overhead. This is a clever application of pipeline hazard mitigation in a new domain.
-
Novel Microarchitectural Design: The paper's novelty extends to the microarchitecture of the SPE (Section 5.3, page 8). While the MX format itself is not new [16], its application has been predominantly for GEMM/dot-product operations. The authors’ contribution is the design of bespoke MX Multiplier and MX Adder units for element-wise operations, which are critical for the "state update" primitive. This extension of the MX format's utility is a non-trivial and novel piece of engineering that is directly justified by their compelling area-accuracy tradeoff analysis (Figure 6, page 6).
-
Novel Empirical Insights: The quantization analysis (Section 3.2, page 5) provides a genuinely new finding. The observation that post-transformer state updates are highly susceptible to the "swamping effect" with standard low-precision floating-point formats (e.g., e4m3, e5m2) is a crucial insight that distinguishes this workload from transformer KV cache quantization. The subsequent identification of MX8 with stochastic rounding as a Pareto-optimal solution in the context of PIM area constraints is a strong, data-backed novel claim.
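To make the swamping concern concrete, the toy example below reproduces the effect in NumPy. float16 is used as a stand-in because NumPy has no native e4m3/e5m2 type, and the specific values are illustrative rather than drawn from the paper.

```python
import numpy as np

# Toy reproduction of the "swamping effect": once the running state is large, small
# per-token updates fall below the accumulator's precision and are silently dropped.
# float16 stands in for e4m3/e5m2 (no native FP8 in NumPy); values are illustrative.
state_lp = np.float16(2048.0)   # low-precision accumulated state
state_ref = np.float64(2048.0)  # high-precision reference
update = 0.5                    # small per-step contribution

for _ in range(100):
    state_lp = np.float16(state_lp + np.float16(update))  # rounds back to 2048.0 each time
    state_ref += update

print(state_lp)   # 2048.0 -- every update was swamped
print(state_ref)  # 2098.0 -- what the state should have accumulated
```

Stochastic rounding, the mitigation the review credits the authors with identifying, breaks exactly this pattern: the sum occasionally rounds up instead of always rounding back, so the small updates survive in expectation.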
Weaknesses
My criticisms are primarily focused on contextualizing the novelty and identifying elements that are more evolutionary than revolutionary.
-
Incremental System Integration: The overall system design, particularly the software stack and host interface, heavily leverages prior art. The authors explicitly state their system architecture is "similar to the existing PIM-based LLM serving systems [28, 54, 55, 67]" and that the software stack is "based on a prior work, HBM-PIM [40]" (Section 5.1, page 6). The proposed custom DRAM commands (ACT4, REG_WRITE, etc., in Section 5.5, page 8) are logical extensions of command-based PIM interfaces seen in prior works and do not represent a fundamental shift in the PIM execution model. The novelty is clearly in the PIM logic itself, not its system-level integration.
-
Application of a Known Principle: The concept of interleaving accesses between memory banks to improve functional unit utilization is a classic computer architecture principle. While its application to solve the state update read/write hazard in PIM is novel and well-executed, the underlying idea of hiding latency by finding parallelism between independent resources is not fundamentally new. The paper would be strengthened by acknowledging this principle and more clearly articulating how its constraints and implementation differ in the PIM context.
Questions to Address In Rebuttal
-
On the Novelty of Access Interleaving: The proposed SPU shared between two banks is a cornerstone of your architectural contribution. Can you clarify if any prior work in the broader near-data processing or PIM literature has employed a similar shared-unit, interleaved-bank access scheme to resolve read-after-write or write-after-read hazards, even if for a different application (e.g., database operations, graph processing)?
-
On the Generality of the "State Update" Primitive: Your design is predicated on the "state update" abstraction (Equation 2). This holds for the current crop of post-transformer models. However, this field is evolving rapidly. How robust is your architecture to potential future post-transformer models that may deviate from this specific pattern of
decay ⨀ state + outer(k, u)? Is the SPE's functionality general enough, or is PIMBA's novelty tightly coupled to this specific formulation?
-
On the Microarchitectural Delta of the MX SPE: The paper proposes custom MX multipliers and adders (Figure 9, page 8). Beyond the logic for handling the shared exponents and microexponents, how much of the core datapath (i.e., the mantissa computation) differs from standard integer multiplication/addition? Please quantify the "delta" in hardware complexity between your proposed units and a hypothetical design that dequantizes MX to a shared internal format (e.g., FP16), performs standard FP operations, and then re-quantizes. This would help solidify the claimed area benefits.
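As a companion to the first question above on access interleaving, the ping-pong schedule at issue can be sketched as a toy cycle-level model. Bank assignment, stage names, and step counts below are invented for illustration and are not the authors' design.

```python
# Toy model of access interleaving: one shared state-update unit alternates which bank
# it reads from and which bank it writes back to each step, so a single bank never sees
# a read and a write in the same cycle (the read/write hazard the paper targets).
def interleaved_schedule(steps: int):
    for i in range(steps):
        rd_bank = i % 2            # bank supplying the next state row
        wr_bank = (i - 1) % 2      # bank receiving the previously updated row
        yield (
            i,
            f"read  bank{rd_bank} state[{i}]",
            f"update state[{i}]",
            f"write bank{wr_bank} state[{i - 1}]" if i > 0 else "write (idle)",
        )

for cycle, rd, ex, wr in interleaved_schedule(6):
    print(f"cycle {cycle}: {rd:<22} | {ex:<16} | {wr}")
```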
GateBleed: Exploiting On-Core Accelerator Power Gating for High Performance and Stealthy Attacks on AI
Abstract
As power consumption from AI training and inference continues to increase, AI accelerators are being integrated directly into the CPU. Intel’s Advanced Matrix Extensions (AMX) is one such example, debuting in the 4th Generation Intel Xeon Scalable CPU, ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present "GateBleed," a purported timing side channel in Intel's Advanced Matrix Extensions (AMX) on 4th Gen Xeon CPUs. The central claim is that an undocumented, staged power gating mechanism creates distinct, reuse-distance-dependent latencies for AMX instructions. The paper attempts to build upon this primitive to demonstrate three classes of attacks: AI privacy leaks (membership inference, MoE expert routing), a generic microarchitectural magnifier, and a high-bandwidth remote covert channel for Spectre-style attacks. The authors claim their attack is both high-performance and stealthy, evading state-of-the-art hardware attack detectors.
While the initial characterization of the timing anomaly is notable, the paper's primary claims regarding the attack's real-world viability, novelty of impact, and stealthiness are insufficiently supported by the provided evidence. The work suffers from a combination of threat model contrivances, overstated conclusions, and a lack of rigorous, controlled comparisons against established baselines.
Strengths
- Systematic Characterization of a Timing Anomaly: The authors have performed a detailed characterization of AMX instruction latency as a function of idle duration (Figure 1, page 2). The investigation in Section 4.3 to rule out confounding factors like DVFS, C-states, and value dependency is methodical and represents a solid piece of reverse engineering.
- Demonstration of a Primitive: The paper successfully identifies a timing differential and demonstrates that it can be triggered. The core phenomenon—that an AMX instruction can take ~50 cycles or ~20,000 cycles depending on recent activity—is clearly established on the tested hardware.
- Breadth of Exploration: The authors apply their primitive to multiple domains (AI privacy, remote channels), showing ambition in exploring the potential impact of the vulnerability.
Weaknesses
-
Insufficient Proof of Root Cause: The central pillar of this paper is that the timing variance is caused by "undocumented power gating." The primary evidence for this is the correlation between latency states and package-level power consumption shown in Figure 5 (page 8). While suggestive, this is not definitive proof. Package-level power is a coarse metric, and the observed drop could be correlated with, but not directly caused by, the same mechanism responsible for the latency. Without more direct evidence (e.g., from on-chip sensors, thermal analysis, or official documentation), attributing this exclusively to power gating is an unsubstantiated leap. The mechanism could be a more complex, undocumented power/clock management state machine.
-
Contrived Threat Models for AI Attacks: The demonstrated AI attacks rely on scenarios that lack clear justification for their real-world prevalence.
- The Mixture-of-Experts (MoE) attack (Section 6.2) achieves its claimed 100% accuracy only with a "layer gap ≥ 8" between experts. This is an extreme case of architectural heterogeneity. The authors provide no evidence that such architecturally imbalanced MoE models are common or practical in production. The attack's effectiveness degrades significantly as the experts become more similar (Table 4, page 11), which is the more likely scenario.
- The Membership Inference Attack (MIA) (Section 6.3) claims its 81% accuracy "rivals or exceeds prior attacks." This claim is made without a crucial baseline. The authors should have implemented a standard, confidence-score-based MIA on the exact same early-exit Transformer model to provide a direct comparison. Without this control, it is impossible to assess whether this complex timing-based approach offers any practical advantage over well-established, and potentially simpler, methods.
-
Overstated Claims of Performance and Stealth:
- Remote Channel Performance: The claimed 70,000x leakage rate improvement over NetSpectre (Section 6.4) is sensationalized. The term "production network" is used without any quantitative characterization of its properties (i.e., baseline latency, jitter, packet loss). Network noise is the single most important variable for remote timing attacks. Without these statistics, the comparison is meaningless and the results are not reproducible or generalizable. The presented violin plots in Figure 9 (page 12) show clear separation, but this could easily be a result of a low-noise, low-contention network path.
- Stealthiness: The claim of evading SOTA detectors (Section 6.7) is not rigorously proven. The authors state that including the
EXE.AMX_BUSY performance counter "did not improve the models' performances." This is a weak dismissal. A thorough analysis would require detailing how this feature was incorporated into the detectors' models and presenting the resulting detection accuracy. It is plausible that a detector specifically designed to look for sparse, high-latency AMX executions could be effective. The current analysis is insufficient to support the strong claim of being "effectively invisible" (page 13).
-
Limited Scope of Hardware Evaluation: The experiments were conducted on a single CPU model (Intel Xeon Gold 5420+). The paper makes broad claims about "Intel AMX," but provides no evidence that this specific timing behavior is present across the entire Sapphire Rapids family, let alone subsequent generations like Emerald Rapids. The findings may be specific to a particular stepping or microcode version of a single product.
Questions to Address In Rebuttal
-
Root Cause: Beyond the correlation with package power, what further evidence can you provide to definitively attribute the observed latency stages to a power gating mechanism, as opposed to another undocumented microarchitectural state change (e.g., clock gating, firmware-managed idle state)?
-
MoE Attack Realism: Please provide justification or citations from deployed systems showing that MoE models with significant architectural heterogeneity (e.g., an 8-layer difference between experts) are a realistic threat model, rather than a constructed best-case scenario for your attack.
-
MIA Baseline: Please provide results from a direct, controlled comparison of your timing-based MIA against a standard confidence-score-based MIA performed on the identical early-exit Transformer model and dataset. This is essential to substantiate the claim that your attack "rivals or exceeds" prior work.
-
Remote Channel Characterization: Please provide quantitative metrics for the "production network" used in Section 6.4, including mean latency, standard deviation (jitter), and packet loss rates over the experiment's duration. How does GateBleed's performance degrade as jitter increases?
-
Stealth Analysis: Please provide a more detailed analysis of the attempt to retrain SOTA detectors. Specifically, describe the feature engineering for the
EXE.AMX_BUSY counter and present the full confusion matrix for the retrained detectors. Why, precisely, does this feature fail to enable detection?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces GATEBLEED, a novel and potent timing side channel rooted in the aggressive power gating mechanism of Intel's on-core AI accelerator, Advanced Matrix Extensions (AMX). The core contribution is the discovery that the time required to "wake up" the AMX unit from various power-gated states creates a massive and easily measurable timing discrepancy (up to 20,000 cycles).
The authors masterfully demonstrate that this is not merely an interesting microarchitectural quirk but a versatile and powerful attack primitive with two significant applications:
1. A new vector for AI privacy attacks: They show how GATEBLEED can be used to perform high-accuracy membership inference and infer routing decisions in Mixture-of-Experts (MoE) models. Crucially, these attacks operate purely on timing, without needing access to model outputs like logits or confidence scores, thereby bypassing a whole class of traditional defenses.
2. A generic, high-performance, and stealthy attack magnifier: They repurpose GATEBLEED as a transmission channel for a remote Spectre attack that is robust to network noise, and as a magnifier that can amplify subtle microarchitectural events to bypass timer coarsening defenses.
The work provides a thorough characterization of the vulnerability, identifies exploitable gadgets in major ML libraries, demonstrates end-to-end attacks, and evaluates potential mitigations.
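One way to see why the size of the gap matters is a back-of-the-envelope threshold decoder: with a ~50 vs ~20,000 cycle separation, even heavy timing jitter leaves the two states distinguishable from a single observation. The Gaussian jitter model below is synthetic and purely illustrative; only the two latency figures are taken from the summary above.

```python
import random

# Synthetic illustration of threshold decoding over the GATEBLEED timing gap. The ~50 and
# ~20,000 cycle figures come from the summary above; the jitter model and trial counts
# are invented for illustration and say nothing about any real network.
FAST, SLOW = 50, 20_000
THRESHOLD = (FAST + SLOW) / 2

def observe(bit: int, jitter_sigma: float) -> float:
    """One noisy latency observation for a transmitted bit."""
    return (SLOW if bit else FAST) + random.gauss(0.0, jitter_sigma)

def bit_error_rate(jitter_sigma: float, trials: int = 10_000) -> float:
    wrong = 0
    for _ in range(trials):
        bit = random.getrandbits(1)
        wrong += int((observe(bit, jitter_sigma) > THRESHOLD) != bool(bit))
    return wrong / trials

for sigma in (100, 1_000, 5_000):
    print(f"jitter sigma = {sigma:>5} cycles -> bit error rate {bit_error_rate(sigma):.4f}")
```

Under the same toy model, a Hertzbleed-scale gap of a few hundred cycles would require substantial averaging to decode, which is the contrast the strengths below draw.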
Strengths
The true strength of this paper lies in its synthesis of several research domains and its demonstration of a fundamental vulnerability with far-reaching implications.
-
Bridging Hardware Architecture and AI Security: The most significant contribution is the direct line it draws from a low-level hardware power optimization to high-level AI privacy risks. While prior works have attacked AI models via side channels (e.g., Cache Telepathy [144]), GATEBLEED exposes a new, more fundamental leakage source. By linking the conditional execution inherent in modern models (like MoEs and early-exit networks) to the conditional activation of a hardware accelerator, the paper reveals that architectural choices in ML now have direct, exploitable hardware security consequences. This is a timely and important connection.
-
Exceptional Signal-to-Noise Ratio and Practicality: The vulnerability's defining characteristic is the sheer magnitude of the timing delta. A 20,000-cycle gap is orders of magnitude larger than those seen in previous power/frequency-based attacks like Hertzbleed [135] (~200 cycles). This is not an incremental improvement; it is a categorical shift. This massive signal makes the attack resilient to real-world noise (as demonstrated in the remote Spectre attack in Section 6.4, page 11) and resistant to standard defenses like timer coarsening, lending the attacks a degree of practicality rarely seen in academic side-channel research.
-
Stealth and Evasion of Existing Defenses: The paper convincingly argues that GATEBLEED is exceptionally stealthy. Because the attack can be triggered by a single instruction following a period of natural idleness, it leaves minimal footprint and evades state-of-the-art microarchitectural attack detectors (as shown in Table 5, page 13). This is a critical finding, as it suggests that current defense paradigms, which often look for anomalous patterns like high cache miss rates or TLB flushing, are blind to this class of vulnerability.
-
Broad Applicability as a Generic Primitive: Beyond the novel AI attacks, the positioning of GATEBLEED as a generic magnifier and covert channel primitive (Section 5.2, page 8 and Section 6.4, page 11) significantly broadens the paper's impact. It provides the security community with a powerful new building block that could enable or enhance a wide range of microarchitectural attacks, especially in constrained or noisy environments where they were previously infeasible.
Weaknesses
The paper's core ideas are strong, and the execution is thorough. The weaknesses are less about flaws and more about opportunities to further explore the context and boundaries of the findings.
-
Limited Discussion on Threat Model Practicality in Cloud Environments: The most powerful local attacks rely on the "AMX Usage" threat model, where an attacker process is co-located on the same physical core as the victim. While possible, modern hypervisor and OS schedulers in multi-tenant cloud environments are increasingly sophisticated and may actively work to isolate workloads, potentially making same-core co-residency a rare or difficult-to-achieve condition. A deeper discussion of the real-world probability and techniques for achieving this co-residency would strengthen the paper's claims about practical risk.
-
The Scope of Generality: The paper demonstrates GATEBLEED's capability as a magnifier by amplifying a cache hit/miss timing difference. This is a clear and effective proof of concept. However, to truly cement its status as a generic magnifier, it would be beneficial to discuss or demonstrate its application to other, more subtle microarchitectural events, such as contention on execution ports or scheduler queues.
-
A Singular Focus on AMX: The work is an excellent deep dive into Intel's AMX. However, AMX is just one example of a broader trend toward on-core, specialized accelerators (e.g., NPUs in consumer chips, Google's TPUs). The paper would be even more impactful if it briefly contextualized its findings within this trend, speculating on whether similar design principles (i.e., aggressive power gating for high-power, intermittently-used units) might lead to analogous vulnerabilities in other vendors' hardware.
Questions to Address In Rebuttal
-
Regarding the threat model for the co-resident attacks: Could the authors comment on the prevalence of same-core scheduling in major public cloud environments (e.g., AWS, GCP, Azure)? Are there known techniques an attacker could use to increase the likelihood of being scheduled on the same core as a target victim process?
-
The paper compellingly demonstrates GATEBLEED as a magnifier for a cache timing delta. Could the authors elaborate on its potential to amplify other, more subtle microarchitectural events? Is the 20,000-cycle "cliff" sensitive enough to be tipped by phenomena smaller than an L3 cache miss, such as contention for a specific execution port?
-
Given the industry-wide trend towards integrating specialized accelerators directly onto the CPU die, do the authors believe that power gating mechanisms in other on-core accelerators (e.g., NPUs, integrated GPUs) represent a likely new frontier for GATEBLEED-style vulnerabilities? Are there fundamental architectural reasons why this vulnerability might be unique to AMX, or is it likely a more generalizable problem?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present GateBleed, a timing side channel vulnerability found in the power gating mechanism of Intel's Advanced Matrix Extensions (AMX) on-core AI accelerator. The core of the work is the discovery and characterization of an undocumented, multi-stage power management feature that introduces reuse-distance-dependent latency variations of up to 20,000 cycles.
The authors leverage this new primitive in three primary ways:
1. As a novel vector for AI privacy attacks (Membership Inference and Mixture-of-Experts routing leakage) that relies solely on timing and not on model outputs like confidence scores.
2. As a generic, single-instruction "magnifier" to amplify subtle microarchitectural state differences, making them observable even with coarse timers.
3. As a high-bandwidth covert channel for remote attacks like Spectre, which is shown to be resilient to network noise where prior art fails.
The paper claims novelty in being the first to exploit accelerator power optimizations for AI privacy attacks and in creating a uniquely stealthy and powerful microarchitectural magnifier.
Strengths
The primary strength of this paper is the discovery of a genuinely new and potent microarchitectural primitive. My analysis of prior art confirms the following points of novelty:
-
A New Primitive, Not Just a New Target: The core discovery—a multi-stage, stepped power gating mechanism local to the AMX unit (as detailed in Section 4, pages 6-8 and Figure 1, page 2)—is fundamentally different from previously known power-related side channels.
- It is distinct from DVFS-based channels like Hertzbleed [135], as the authors demonstrate the effect persists at fixed frequencies (Section 4.3, page 7).
- It is distinct from whole-core sleep state channels like IdleLeak [105], as it is independent of core C-states.
- Crucially, it is distinct from the closest related work, Thor [29], which also targets AMX. Thor describes an operand-dependent timing variation in a single low-power state. GateBleed describes a reuse-distance-dependent timing variation across five distinct power-gating stages. The root cause is different, and the latency magnitude of GateBleed appears to be two orders of magnitude larger, making it a far more powerful primitive.
-
Novel Advancement in Magnifier Design: The formulation of GateBleed as a single-instruction magnifier (Section 5.2, page 8) is a significant conceptual advance over prior art. Existing magnifiers like Hacky Racers [143] require complex, carefully crafted instruction sequences to exploit Instruction-Level Parallelism, leaving a large and detectable footprint. Microscope [117] requires privileged OS-level access. The GateBleed magnifier is unprivileged, requires only a single instruction following a passive wait, and its mechanism (hardware power-up latency) is entirely new in this context. This represents a qualitative leap in simplicity and stealth.
-
New Vector for AI Privacy Attacks: While hardware side-channel attacks on AI models are known (e.g., Cache Telepathy [144]), they typically target static model artifacts (recovering weights or architecture). GateBleed introduces a novel attack surface: leaking private, input-dependent dynamic decisions (e.g., early-exit paths, expert routing) by observing the side effects of accelerator power management. This is a new conceptual link between hardware power optimization and AI privacy that has not been explored before.
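The conceptual link this strength describes — an input-dependent control-flow decision becoming visible because only one path wakes the accelerator — can be caricatured in a few lines. Everything below (latency constants, the serve() framing, the decision being recovered) is invented for illustration and is not the authors' attack code.

```python
import random
import time

# Caricature of the leakage path: in an early-exit or MoE model, one branch never touches
# the accelerator while the other pays its wake-up cost, so wall-clock time alone reveals
# the branch taken. All constants and names here are invented for illustration.
FAST_PATH_S = 1e-4      # early exit / cheap expert: accelerator never woken
DEEP_PATH_S = 2e-3      # deep path: stand-in for a cold accelerator wake-up

def serve(exits_early: bool) -> None:
    time.sleep(FAST_PATH_S if exits_early else DEEP_PATH_S)

def observe_decision() -> tuple[bool, bool]:
    secret = bool(random.getrandbits(1))        # the private routing / early-exit decision
    start = time.perf_counter()
    serve(secret)
    elapsed = time.perf_counter() - start
    guess = elapsed < (FAST_PATH_S + DEEP_PATH_S) / 2
    return secret, guess

hits = sum(s == g for s, g in (observe_decision() for _ in range(200)))
print(f"recovered {hits}/200 dynamic decisions from timing alone")
```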
Weaknesses
My critique is not that the work lacks novelty, but that its novelty could be framed with greater precision and its implications explored more broadly.
-
Positioning: A Potent Instance or a New Class? The paper convincingly presents a new vulnerability. However, it is an instance of a broader, known principle: power state transitions incur latency penalties. The exceptional novelty here stems from the undocumented, multi-stage, and high-latency nature specific to AMX. The authors should more clearly position their contribution: is this a one-off implementation flaw in AMX, or is it the first documented example of a new class of vulnerabilities we should expect in future tightly-integrated, aggressively power-managed accelerators?
-
Limited Generalizability of the Primitive: The discovered primitive is, by definition, specific to a particular microarchitecture (Intel AMX on Sapphire Rapids and newer). While the applications are more general, the root cause is narrow. The paper would be strengthened by a discussion on the architectural trends that led to this design choice and whether similar vulnerabilities are likely to exist in other on-core accelerators (e.g., NPUs in consumer SoCs, GPU Tensor Cores) that also employ aggressive power gating. Without this, the novelty risks being perceived as narrow, even if it is deep.
Questions to Address In Rebuttal
-
The authors cite Thor [29] as a related work that also finds a timing channel in AMX. Could the authors dedicate a paragraph to explicitly contrasting the root cause of GateBleed (reuse-distance-dependent staging) with that of Thor (operand-dependent leakage)? This would further solidify the novelty of the discovered primitive.
-
In Section 6.7 (page 12), the authors claim GateBleed evades state-of-the-art detectors. Is this evasion simply because current detectors are not instrumented to monitor AMX-specific performance counters, or is there a more fundamental reason why this primitive is inherently difficult to detect? For example, does a single AMX instruction after a long, passive wait period fall below the anomaly detection threshold of any conceivable event-based detector?
-
The core mechanism is tied to AMX. Could the authors speculate on the likelihood of finding similar multi-stage, high-latency power gating channels in other classes of on-core accelerators? What architectural pressures (e.g., thermal density, power budget sharing) would incentivize designers to create such leaky optimizations? This would help frame the work's conceptual contribution beyond a single-vendor implementation.
Athena: Accelerating Quantized Convolutional Neural Networks under Fully Homomorphic Encryption
Abstract
Deep learning under FHE is difficult due to two aspects: (1) formidable amount of ciphertext computations like convolutions, so frequent bootstrapping is inevitable which in turn exacerbates the problem; (2) lack of the support to various non-linear ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper proposes "Athena," a framework and co-designed hardware accelerator for executing quantized convolutional neural networks (CNNs) under fully homomorphic encryption (FHE). The core methodological deviation from prior work is the rejection of the conventional CKKS scheme for approximate arithmetic in favor of an integer-based FHE scheme (akin to BFV). Non-linear operations (i.e., activation functions) are handled via a "Functional Bootstrapping" (FBS) mechanism, which is effectively a polynomial evaluation of a Look-Up Table (LUT). The authors claim this approach allows for much smaller cryptographic parameters, resulting in significant speedups (1.5x-2.3x) and EDP improvements (3.8x-9.9x) over state-of-the-art CKKS-based FHE accelerators, with negligible accuracy degradation relative to a quantized plaintext baseline.
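For context on the "polynomial evaluation of a LUT" characterization, one standard interpolation identity (not necessarily the construction the paper uses) shows why the table size, and hence the plaintext modulus t, drives the cost:

```latex
% For prime t, Fermat's little theorem gives (x-a)^{t-1} \equiv 1 for x \neq a and 0 for
% x = a, so any table f : \mathbb{Z}_t \to \mathbb{Z}_t is realized by the degree-(t-1)
% polynomial below -- one term per table entry, which is where O(t) scaling comes from.
% This is a standard identity, not necessarily the paper's exact construction.
f(x) \;\equiv\; \sum_{a \in \mathbb{Z}_t} f(a)\,\bigl(1 - (x - a)^{\,t-1}\bigr) \pmod{t}
```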
Strengths
- Problem Simplification: The fundamental premise of leveraging model quantization to simplify the underlying cryptographic problem is sound. Moving from the complexities of approximate arithmetic in CKKS to integer arithmetic is a valid research direction that could plausibly reduce overhead.
- Hardware-Software Co-design: The authors have clearly considered the interplay between their proposed framework and the hardware needed to execute it. The design of specialized units like the FRU to accelerate the specific bottlenecks of their framework (namely FBS) demonstrates a coherent design philosophy.
Weaknesses
My primary concerns with this submission revolve around the rigor of the analysis, the fairness of the experimental comparisons, and the overstatement of the framework's generality and novelty.
-
Insufficient Noise Analysis: The noise analysis presented is superficial and unconvincing. In Section 3.2.2 (p. 5), the authors introduce a rounding noise
e_ms during modulus switching. They claim it has "minimal impact" because it contaminates LSBs. Figure 4 (p. 7) shows that this "minimal" impact results in data error ratios of up to 11% in some layers of ResNet-56. A claim that this level of error has no significant cumulative effect on a deep network is extraordinary and requires extraordinary proof, which is not provided. The analysis lacks any formal treatment of how this error propagates and accumulates across dozens of layers. Relying on final accuracy metrics for just a few models is insufficient to validate the robustness of this noise-handling approach.
Misleading Performance Comparisons: The performance evaluation in Section 5.2.2 (p. 10) constitutes a severe apples-to-oranges comparison. The baseline accelerators (CraterLake, ARK, SHARP) are designed to handle the significantly more complex and general-purpose CKKS scheme, which includes computationally intensive operations for bootstrapping and complex number arithmetic. Athena, by design, targets a much simpler cryptographic problem (integer-only arithmetic). It is therefore unsurprising that a specialized accelerator for a simpler problem outperforms a general-purpose accelerator for a harder one. Figure 8 (p. 11), which shows that CKKS accelerators are ill-suited for the Athena workload, does not support the authors' claims of superiority; it merely states the obvious. A fair comparison would involve implementing the Athena framework on a configurable platform against a CKKS implementation on the same platform, or comparing against a hypothetical CKKS accelerator that is also specialized for quantized workloads.
-
Questionable Novelty and Scalability of "Functional Bootstrapping" (FBS): The FBS mechanism, described in Section 3.2.3 (p. 6), is presented as a key innovation. However, it is fundamentally a known technique: evaluating a LUT via polynomial interpolation. The complexity analysis in Table 3 (p. 7) states the complexity of FBS is O(t), where
t is the plaintext modulus. For t = 65537, this is computationally massive. While the BSGS optimization (Algorithm 2) reduces ciphertext multiplications to O(√t), the number of scalar multiplications and additions remains a direct function of t. The paper's impressive performance results seem to be achieved not by an algorithmically superior method, but by applying brute-force hardware (a 16-block FRU array) to this high-complexity problem. This raises serious questions about scalability.
-
Limited Generality and Fragile Parameterization: The entire framework hinges on the plaintext modulus
t being large enough to contain all intermediate inner product results. The authors select t = 65537, which Figure 4 shows is just sufficient for the tested benchmarks. The paper provides no methodology for selecting t for an arbitrary new model, nor does it analyze the sensitivity of the framework to this choice. If a deeper or wider network architecture requires a larger dynamic range, t would need to increase, causing the complexity of FBS to explode and likely rendering the approach impractical. The claim that Athena can "support any type of activation functions" is thus unsubstantiated, as any function requiring a larger LUT or higher precision would break the current parameterization.
Questions to Address In Rebuttal
The authors must address the following points directly to make a case for this paper's acceptance:
- Regarding Noise Propagation: Please provide a formal analysis of the propagation and accumulation of the
e_ms error introduced during modulus switching. Demonstrate, either through formal proof or extensive empirical evaluation on significantly deeper networks (e.g., >100 layers), that this error remains bounded and does not catastrophically degrade accuracy.
- Regarding Baselines: Please justify the fairness of comparing your specialized, integer-only accelerator against general-purpose CKKS accelerators. To make a convincing case, provide a comparison against a more appropriate baseline, for instance, an implementation of your framework on a general-purpose FHE accelerator or a theoretical analysis against a CKKS flow optimized for quantized inference.
- Regarding FBS Scalability: Given the O(t) complexity of the underlying FBS operation, how does the system's performance and hardware cost scale as
t increases? What is the practical upper bound on t before the FBS latency and area make the accelerator infeasible?
- Regarding Generality: How would a user of the Athena framework determine the required value for the plaintext modulus
t for a new, arbitrary CNN model? What happens if this required t is larger than the 17-bit value used in this work?
- Regarding Complex Functions: The paper briefly mentions a three-step process for Softmax. This appears to involve multiple FBS evaluations and a costly ciphertext-ciphertext multiplication. Please provide a detailed breakdown of the latency, noise growth, and complexity for this operation, as it is a critical component in many classification networks.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces "Athena," a novel co-designed framework and hardware accelerator for performing Convolutional Neural Network (CNN) inference under Fully Homomorphic Encryption (FHE). The core contribution is a paradigm shift away from the prevailing approach of using approximate arithmetic schemes like CKKS on floating-point models. Instead, Athena leverages the mature field of Quantized CNNs (QCNNs), whose integer-based arithmetic is a natural fit for integer-based FHE schemes like BFV.
This fundamental pivot allows for significantly smaller cryptographic parameters, which in turn leads to dramatically reduced ciphertext sizes and on-chip memory requirements. The framework handles the critical non-linear activation functions not with inaccurate polynomial approximations, but with an exact "functional bootstrapping" (FBS) mechanism. This mechanism, reminiscent of TFHE's programmable bootstrapping, performs a lookup-table operation that can implement any arbitrary activation function and simultaneously handles the re-quantization (remapping) step. The authors present a full-stack solution, from the five-step cryptographic loop (Section 3.1, page 4) to a specialized accelerator architecture designed to handle the framework's unique computational bottlenecks, particularly the FBS step. The result is a system that claims near-plaintext accuracy with significant performance and efficiency gains over state-of-the-art CKKS-based accelerators.
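A plaintext mock-up (no cryptography) of the merged activation-plus-remapping table that this summary describes might look as follows. The shift, clamp range, and centering convention are illustrative assumptions; only t = 65537 and the 7-bit w7a7 range follow figures cited elsewhere in these reviews.

```python
# Plaintext mock-up of a single FBS-style lookup table that applies the activation AND
# the re-quantization (remapping) in one step. SHIFT and the centering convention are
# illustrative assumptions; T and the 7-bit range follow the reviews' quoted figures.
T = 65537       # plaintext modulus
SHIFT = 8       # assumed requantization rescale (divide by 256)
QMAX = 127      # 7-bit activation range (w7a7)

def centered(v: int) -> int:
    """Interpret a residue mod T as a signed accumulator value."""
    return v - T if v > T // 2 else v

def relu_then_remap(v: int) -> int:
    acc = max(0, centered(v))        # ReLU on the signed accumulator
    return min(QMAX, acc >> SHIFT)   # rescale and clamp back to the 7-bit domain

fbs_table = [relu_then_remap(v) for v in range(T)]   # one entry per plaintext value

print(fbs_table[5000])       # 5000 >> 8 = 19
print(fbs_table[T - 5000])   # a negative accumulator clamps to 0 under ReLU
```

Swapping relu_then_remap for a sigmoid, a pooling helper, or an identity-with-remapping table changes only the table contents, which is the generality this review highlights.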
Strengths
-
Elegant Core Idea and Problem Reframing: The primary strength of this work lies in its insightful connection between two distinct domains: Quantized Neural Networks and Integer-based FHE. Instead of trying to force the square peg of floating-point neural networks into the round hole of approximate FHE (CKKS), the authors correctly identify that the integer-only nature of QCNNs is a perfect match for schemes like BFV. This reframing of the problem is the paper's most significant intellectual contribution and the source of all subsequent benefits.
-
Practicality and Feasibility: The consequences of this pivot are profound. As shown in Table 1 (page 3) and Table 8 (page 12), the required ciphertext and key sizes are drastically reduced, leading to a >4x reduction in on-chip scratchpad memory compared to leading FHE accelerators like CraterLake and ARK. This is not an incremental improvement; it is a step-change in hardware feasibility and potential cost, making private ML inference a much more tangible reality.
-
Generalized and Accurate Non-Linearity Handling: The use of functional bootstrapping (FBS) is a powerful choice that solves a major pain point in FHE-based ML. The reliance on Taylor/Chebyshev polynomial approximations in CKKS-based systems is a notorious source of error and requires expert tuning (as shown in Figure 1, page 3). Athena's FBS approach is general—it can implement ReLU, Sigmoid, and even complex pooling operations with perfect precision within the quantized domain. Merging the activation function and the remapping step into a single LUT operation is a particularly clever co-design choice.
-
Strong Full-Stack Co-Design: The paper presents a convincing end-to-end solution. The software framework's five-step loop is directly reflected in the hardware design. The identification of FBS as the new performance bottleneck (as opposed to NTT in traditional designs) and the subsequent design of the versatile FRU and the pipelined two-region dataflow (Section 4.3, page 9) demonstrates a deep understanding of the entire stack. The results in Figure 8 (page 11), where the Athena framework is simulated on prior hardware, effectively argue for the necessity of this specialized accelerator.
Weaknesses
While the work is strong, its focus is sharp, leaving some broader contextual questions open. These are not so much flaws as they are opportunities for strengthening the paper.
-
Limited Discussion on the Quantization Process: The paper assumes the availability of a pre-trained QCNN. However, the process of creating a high-accuracy QCNN, often through Quantization-Aware Training (QAT), is non-trivial. More importantly, the choices made in the FHE framework (e.g., the plaintext modulus
t in Section 3.3, page 7) are deeply intertwined with the quantization strategy (e.g., the bit-width and range of intermediate values). A brief discussion on this interplay would strengthen the paper by showing how the two domains must be co-designed, not just cascaded.
-
Architectural Generality: The work is presented and evaluated exclusively on CNNs. While this is a critical workload, the field of deep learning is rapidly expanding, with Transformers becoming dominant in many areas. Transformers also rely heavily on quantization for efficient inference. How would the Athena framework and accelerator adapt to the different computational patterns of a Transformer (e.g., large matrix multiplications and the Softmax function)? Addressing this would elevate the contribution from a "solution for CNNs" to a more general "framework for quantized models."
Questions to Address In Rebuttal
-
Could you elaborate on the interplay between the quantization strategy and the selection of FHE parameters? Specifically, how does the choice of the plaintext modulus
t (65537 in your work) constrain or inform the quantization process (e.g., the 7-bit w7a7 scheme)? Is there a systematic way to co-optimize these parameters?
-
The paper's evaluation focuses entirely on CNNs. Could the authors comment on the applicability of the Athena framework to other quantized architectures like Transformers? The Softmax operation, in particular, seems like a perfect candidate for the FBS mechanism. Would the accelerator's dataflow, designed for convolutions, be efficient for the large matrix multiplications in a Transformer's attention and feed-forward layers?
-
In the performance comparison in Table 6 (page 10), the paper compares against baselines running CKKS-based models. Could you provide more detail on how the "computational complexity of other benchmarks" was normalized to that of ResNet-20 for a fair comparison, particularly for ResNet-56 which was not reported by all baselines?
-
Step 4 of the Athena framework ("Packing") involves a "homomorphic decryption" of LWE ciphertexts. This is a powerful but potentially costly primitive. In the execution time breakdown (Figure 9, page 11), this operation does not appear to be explicitly separated. Could you clarify where its cost is accounted for and its relative significance to the overall latency?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper introduces "Athena," a framework and co-designed hardware accelerator for performing inference on quantized convolutional neural networks (QCNNs) under fully homomorphic encryption (FHE). The central claim of novelty lies not in the creation of a new cryptographic primitive, but in the formulation of a new end-to-end framework that departs from the dominant CKKS-based paradigm for FHE-based machine learning.
The core idea is to build a processing loop around integer-based FHE (specifically, BFV-like operations) tailored for QCNNs. This loop systematically manages computation and noise through five steps: (1) a linear operation using coefficient encoding, (2) modulus switching to reduce noise, (3) ciphertext conversion from RLWE to LWE, (4) homomorphic decryption and packing back into RLWE, and (5) a "Functional Bootstrapping" (FBS) step. The most significant aspect of this proposed framework is that the FBS step unifies three traditionally separate operations: the noise-clearing bootstrap, the non-linear activation function evaluation, and the integer re-quantization/remapping. The authors claim this synthesis allows for smaller cryptographic parameters and higher accuracy compared to prior CKKS-based approaches, and they present a hardware architecture specifically designed to accelerate the bottlenecks of this new workflow.
Strengths
-
A Novel Conceptual Framework: The primary strength and most significant novel contribution is the deliberate departure from the CKKS-based paradigm for deep learning inference. While prior work has focused on approximating real-number arithmetic with CKKS, Athena embraces the discrete nature of quantized networks and builds a native integer-based FHE pipeline. This represents a distinct and valuable alternative direction in the design space.
-
Unification of Operations in Functional Bootstrapping: The application of functional bootstrapping to simultaneously perform bootstrapping, non-linear activation, and remapping (as described in Section 3.2.3, Page 6) is a highly novel synthesis of existing ideas. Prior works have used bootstrapping to manage noise and separate techniques (e.g., polynomial approximation) for activations. Merging these, along with the crucial remapping step required for quantization, into a single LUT-based operation is a clever and powerful simplification of the inference flow. This is the paper's most compelling technical insight.
-
A Coherent, Self-Contained Workflow: The proposed five-step loop (Figure 2, Page 4) is a well-defined and repeatable process for executing QCNN layers under FHE. It presents a structured methodology for managing noise and evaluating functions that is distinct from the ad-hoc parameter tuning often required in leveled CKKS schemes. This structured workflow itself can be considered a novel contribution to the field.
-
Novelty in Hardware Co-Design: The accelerator architecture is not merely a collection of standard FHE components. The design of the versatile "FBS and RNS Base changing unit" (FRU) and the two-region dataflow (Section 4.3, Page 9) is a direct and novel consequence of the proposed software framework. The design correctly identifies FBS as the new system bottleneck (a shift away from NTT in many prior works) and dedicates significant, specialized resources to it.
Weaknesses
-
Novelty is in Synthesis, Not Primitives: The paper's novelty is almost entirely based on the clever combination and application of pre-existing techniques. The authors appropriately cite prior art for the core FBS primitive ([29]), coefficient encoding for convolutions ([16, 21]), and RLWE-to-LWE conversions ([12]). While the synthesis is novel, the paper could be more explicit in delineating what is being adopted versus what is being created. The contribution is the recipe, not the ingredients.
-
Incremental Novelty in Encoding: The paper contrasts its coefficient encoding scheme with Cheetah [16] (Table 2, Page 5). The claimed improvement appears to be a different batching strategy (prioritizing output channels) to improve data locality for subsequent steps, rather than a fundamentally new encoding method. The novelty here is an optimization tailored to their specific framework, which, while valuable, is an incremental advancement over prior art.
-
Complexity vs. Benefit Justification: The proposed five-step loop introduces significant complexity, involving two forms of ciphertext (RLWE, LWE) and multiple conversions between them (Steps 3, 4, and the final S2C step). While the results are impressive, the justification for this specific complex pathway over a simpler, purely BFV-based or TFHE-based approach could be stronger. The benefit is clear, but it comes at the cost of a non-trivial pipeline that requires specialized hardware (like the SE unit) to remain efficient.
Questions to Address In Rebuttal
-
The core of your non-linear evaluation rests on functional bootstrapping, which essentially performs a LUT lookup. Prior works such as PEGASUS [30] and TOTA [42] have also proposed using TFHE-style bootstrapping for LUT-based evaluation of non-linear functions in FHE. Please clearly articulate the delta between Athena's approach and these prior works. Is the primary novelty the integration of the remapping step into the LUT, the specific five-step pipeline in which it is embedded, or another factor?
-
Regarding the coefficient encoding for linear layers (Section 3.2.1, Page 5), please clarify the precise novel contribution over the methods used in Cheetah [16] and NeuJeans [21]. Is the novelty in the encoding itself, or is it strictly in the batching strategy that arranges data to benefit the subsequent sample extraction step?
-
The proposed framework unifies activation and remapping within the FBS step. How does the framework handle layers that do not have a non-linear activation (e.g., a convolution followed directly by a pooling or batch normalization layer)? Does this require an "identity-with-remapping" LUT, and if so, what is the performance and complexity overhead compared to a more direct remapping operation? This would help clarify the generality of the proposed novel framework.
ccAI: A Compatible and Confidential System for AI Computing
Abstract
Confidential xPU computing has emerged as a prominent technique for effectively securing users’ AI computing workloads on heterogeneous systems equipped with xPUs. Although the industry adopts this technology in cutting-edge hardware (e.g. NVIDIA H100 GPU)...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present ccAI, a hardware-software co-design intended to provide confidential computing for AI workloads on heterogeneous systems with legacy xPUs. The system's architecture is anchored on a hardware module, the PCIe Security Controller (PCIe-SC), which intercepts and secures PCIe traffic between a Trusted VM (TVM) and an xPU. This is complemented by a software component, the "Adaptor," which operates within the TVM to manage security operations without modifying the user application or the xPU's native driver stack. The core proposition is that by operating at the PCIe packet level, ccAI can offer a compatible, transparent, and secure solution for a wide range of xPUs that lack native confidential computing features.
Strengths
- Problem Motivation: The paper correctly identifies a critical and practical gap in the current ecosystem: the vast majority of deployed xPUs lack the confidential computing capabilities of cutting-edge hardware like the NVIDIA H100, yet they process sensitive AI workloads. Addressing this is a worthwhile endeavor.
- Architectural Approach: The central idea of leveraging the PCIe interconnect as a universal enforcement point is logical. Since PCIe is the de facto standard, this approach has the potential to be more broadly applicable than solutions tied to specific xPU architectures or TEE designs.
- Prototyping Effort: The authors have clearly invested significant engineering effort in building a functional prototype. Implementing the PCIe-SC on an FPGA and integrating it with five distinct, real-world xPUs from multiple vendors (NVIDIA, Tenstorrent, Enflame) is a non-trivial accomplishment and lends a degree of credibility to the feasibility of the design.
Weaknesses
My primary concerns with this submission relate to the strength of its claims regarding compatibility, security rigor, and the representativeness of the performance evaluation.
-
Overstated Compatibility and Transparency: The claim of "no xPU SW changes" (Table 2, page 10) is a significant overstatement. The paper details the introduction of a new kernel module,
ccAI_adaptor, within the TVM (Section 7.1, page 8). This module is not part of the original xPU software stack and creates a new, non-trivial dependency. It interacts with the PCIe-SC, allocates memory, and handles encryption. Any update to the native xPU driver that alters its memory access patterns, DMA semantics, or MMIO interactions could break the assumptions made by the Adaptor, requiring a corresponding update. This is not seamless transparency; it is shifting the modification burden from the driver itself to an adjacent, tightly-coupled kernel module. The compatibility is therefore fragile. -
Insufficient Scrutiny of the TCB and Security Guarantees: The security of the entire system hinges on the correctness of the PCIe-SC and its configuration.
- Hardware Complexity: The PCIe-SC implementation consumes 218.6K ALUTs and 195.7K logic registers (Table 3, page 11). This is a substantial and complex piece of hardware, effectively a sophisticated NIC/firewall on the PCIe bus. A hardware design of this complexity is a major attack surface in itself. The paper provides no evidence of formal verification or rigorous testing to ensure the hardware is free of critical bugs that could be exploited to bypass security policies.
- Packet Filter Brittleness: The security enforcement relies on a set of L1/L2 table rules (Figure 5, page 6). How are these rules generated? The paper glosses over this critical process. For any new xPU or even a new driver version, this rule set must be perfectly defined to distinguish between benign and malicious traffic. An incorrect or incomplete rule set is tantamount to an open door. This suggests a high operational burden and a high risk of misconfiguration, which undermines the practical security guarantees.
- Unaddressed Threat Vectors: The paper dismisses side-channel attacks as orthogonal (Section 2.2, page 4). However, the PCIe-SC itself introduces new potential side channels. For instance, packet processing time within the PCIe-SC could vary depending on the security action taken (e.g., pass-through vs. decryption). An attacker monitoring PCIe bus traffic timing could potentially infer information about the operations being performed. This is not an orthogonal issue; it is a new vulnerability created by the proposed architecture.
-
Unconvincing Performance Evaluation: The reported low overhead figures (0.05% - 5.67%) appear to be the result of testing under favorable, compute-bound conditions, which do not sufficiently stress the ccAI architecture.
- Non-Representative Workloads: The majority of the LLM evaluations, particularly in Figure 9 (page 11), are performed with a batch size of 1. While useful for latency measurements, this configuration is heavily compute-bound and minimizes the relative impact of I/O overhead. In real-world cloud inference scenarios, throughput is maximized by using larger batch sizes, which dramatically increases the volume of data traversing the PCIe bus. The evaluation fails to demonstrate how ccAI performs under such I/O-intensive, high-throughput conditions where the per-packet overhead of the PCIe-SC would be most pronounced.
- Insufficient Stress Testing: The stress test in Section 8.6 (page 12) artificially limits PCIe bandwidth but does not present a workload that saturates the link. The crucial test is one where the application is bottlenecked by PCIe bandwidth in the baseline configuration. Only then can the true overhead of ccAI's cryptographic and filtering operations be accurately measured. The current test does not provide this insight.
Questions to Address In Rebuttal
-
Please reconcile the claim of "no software changes" with the necessity of the
ccAI_adaptor kernel module. How do the authors guarantee that the Adaptor will remain compatible across future, potentially significant updates to the proprietary xPU drivers it must coexist with?
-
Given the substantial complexity of the PCIe-SC hardware (Table 3), what specific steps were taken to verify its functional correctness and security properties? Was any form of formal methods or exhaustive simulation employed to prove it is not a source of vulnerabilities itself?
-
The paper's performance claims hinge on benchmarks that are not I/O-bound (e.g., batch size 1). Please provide performance evaluation data for a high-throughput scenario where the workload is designed to saturate the PCIe bus (e.g., large-batch inference on a model with significant weight loading, or a data-intensive training workload).
-
Please elaborate on the generation and management process for the Packet Filter rules (Figure 5). Is this a manual, per-device, per-driver process? If so, how does this scale and how do you prevent human error from introducing critical security flaws? What is the performance penalty for rule lookup as the rule set grows in complexity?
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents ccAI, a novel system designed to provide confidential computing for a wide range of AI accelerators (xPUs). The authors identify a critical gap in the current landscape: while cutting-edge hardware like the NVIDIA H100 offers built-in confidentiality, the vast majority of deployed accelerators lack such features, and existing academic solutions often suffer from poor compatibility or require significant application/driver modifications.
The core contribution of ccAI is its architectural choice to enforce security at the PCIe interconnect level. This is achieved through a hardware-software co-design comprising two key components: 1) a PCIe Security Controller (PCIe-SC), a hardware module that sits between the host and the xPU to intercept, filter, and process all PCIe packets, and 2) a software "Adaptor" running within a Trusted VM (TVM) on the host, which transparently manages the secure workflow without altering the user application or the native xPU drivers. By treating the PCIe packet stream as a universal interface, ccAI aims to deliver a compatible, transparent, and secure solution for heterogeneous AI computing environments.
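To make the packet-level enforcement point concrete, a toy version of the rule lookup both reviews scrutinize might look like the sketch below. The TLP type strings, address windows, and action names are all invented for this sketch and are not ccAI's actual rule format.

```python
from dataclasses import dataclass

# Toy illustration of per-packet policy dispatch at the PCIe boundary: classify a
# transaction by type and address window, then dispatch it to a handler. Field names,
# windows, and actions are invented for this sketch, not ccAI's rule format.
@dataclass
class Rule:
    tlp_type: str      # e.g. "MemWr", "MemRd"
    addr_lo: int
    addr_hi: int
    action: str        # e.g. "decrypt_payload", "encrypt_reply", "validate_mmio"

RULES = [
    Rule("MemWr", 0x0000_0000, 0x0FFF_FFFF, "decrypt_payload"),  # DMA writes into xPU memory
    Rule("MemRd", 0x0000_0000, 0x0FFF_FFFF, "encrypt_reply"),    # DMA reads back to the TVM
    Rule("MemWr", 0xF000_0000, 0xF00F_FFFF, "validate_mmio"),    # doorbell / MMIO window
]

def classify(tlp_type: str, addr: int) -> str:
    for rule in RULES:
        if rule.tlp_type == tlp_type and rule.addr_lo <= addr <= rule.addr_hi:
            return rule.action
    return "drop"   # default-deny; an incomplete rule set degrades exactly this posture

print(classify("MemWr", 0x0010_0000))   # decrypt_payload
print(classify("CfgWr", 0x0000_0000))   # drop
```

The open engineering question, which Review 1 presses on, is who writes and maintains such a rule set for each xPU and each driver revision.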
Strengths
-
A Pragmatic and Powerful Architectural Choice: The single most important idea in this paper is the decision to place the security boundary on the PCIe bus. This is a brilliant move that reframes the problem. Instead of trying to secure the infinitely complex internals of various xPUs or modify proprietary driver stacks, the authors treat the accelerator as a black box and secure its sole communication channel with the host. This abstraction is the key to achieving the paper's primary goal of compatibility. It elegantly sidesteps the vendor-specific details that have plagued many previous approaches.
-
Excellent Contextualization and Problem Framing: The authors demonstrate a clear and panoramic understanding of the confidential computing landscape. Figure 1 (Section 3, page 3) is particularly effective, providing a concise taxonomy of existing approaches (TEE-based, HW-based, TDISP, etc.) and clearly positioning ccAI as a distinct and complementary solution. The paper correctly identifies that while standards like TDISP are the long-term future, there is a pressing, immediate need for a solution that can secure the massive installed base of legacy hardware. ccAI is presented not just as a research project, but as a practical answer to a real-world market and operational gap.
-
Strong System-Oriented Design: The work goes beyond a simple conceptual model. The authors have considered the full lifecycle of a secure workload, including a secure boot process for the PCIe-SC, remote attestation protocols (Section 6, page 7), and key management. The design of the Packet Filter and Packet Handlers (Section 4, page 5) shows a thoughtful approach to balancing security and performance by categorizing packet types and applying tailored protection policies. This level of detail suggests a mature and well-considered system design.
-
Comprehensive and Convincing Evaluation: The experimental validation is a significant strength. By implementing a prototype and testing it across five distinct xPUs from three different vendors (NVIDIA GPUs, a Tenstorrent NPU, and an Enflame GPU, as detailed in Section 7, page 8), the authors provide strong evidence for their central claim of compatibility. The low performance overheads reported across a wide range of Large Language Models (LLMs) further bolster the argument for the system's practicality.
Weaknesses
While the core idea is strong, the paper could be improved by addressing the practical implications of its proposed hardware.
-
The Practicality of the PCIe-SC: The entire system hinges on the existence and deployment of the PCIe-SC hardware module. While an FPGA prototype is excellent for academic validation, the path to real-world deployment is a significant hurdle. Is this envisioned as a standalone PCIe card that an xPU plugs into? An integrated component on future motherboards? An offering from a third-party hardware security company? The paper lacks a discussion of the form factor, cost, and supply chain implications, which are crucial for assessing its potential for widespread adoption in cloud data centers.
-
Scalability and Performance Ceiling: The evaluation demonstrates low overhead on current hardware. However, the bandwidth requirements of next-generation accelerators are growing exponentially. The prototype is based on an Intel Agilex 7 FPGA. A critical question is whether this architecture can scale to saturate the PCIe Gen5/Gen6 links of future flagship GPUs without becoming the primary performance bottleneck. While an ASIC implementation would be faster, some analysis of the architectural throughput limits would strengthen the paper's claims of future viability.
-
The Trusted Computing Base (TCB) of the Intermediary: By introducing the PCIe-SC, the authors have created a new, critical piece of hardware that becomes part of the TCB. The security analysis in Section 8.2 (page 10) is good, but it could further explore the threat model against the PCIe-SC itself. What is the mechanism for securely updating its firmware? How is its physical integrity maintained beyond the proposed chassis sealing? While the TCB is small compared to a full driver stack, its criticality warrants a more in-depth discussion.
Questions to Address In Rebuttal
-
Could the authors elaborate on the envisioned deployment model for the PCIe-SC in a typical cloud environment? What would be the most likely path to market for such a device, and what are the primary barriers to its adoption by cloud providers?
-
The paper compares ccAI's performance overhead to the NVIDIA H100's confidential mode. Could you discuss the architectural trade-offs? While ccAI is more compatible, does the H100's tight integration of security features provide fundamental performance or security advantages that an external PCIe device can never fully match?
-
How do the authors see ccAI co-existing with the emerging TDISP standard in the long term? Is ccAI primarily a bridging solution for the next 5-10 years until TDISP is ubiquitous, or does the packet-level inspection and policy enforcement of the PCIe-SC offer complementary security guarantees that would remain valuable even in a TDISP-enabled ecosystem?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper proposes ccAI, a system for retrofitting confidential computing capabilities onto legacy, general-purpose xPUs (GPUs, NPUs, etc.) within a heterogeneous cloud environment. The authors' central claim of novelty rests on their specific architectural approach, which aims to provide strong security guarantees with high compatibility and user transparency, a combination they argue is lacking in prior art.
The system is composed of two primary components: (1) a hardware module, the PCIe Security Controller (PCIe-SC), which is physically interposed between the host's PCIe bus and the target xPU, and (2) a software component, the Adaptor, which resides in a Trusted VM (TVM) and coordinates with the PCIe-SC. The core mechanism involves the PCIe-SC intercepting all PCIe traffic to and from the xPU at the packet level. It uses a set of filtering rules and handlers to enforce security policies, such as encrypting/decrypting data payloads for DMA and validating MMIO operations, while remaining transparent to the unmodified xPU driver and user application.
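To make the described interception flow concrete, the following is a minimal, hypothetical Python sketch of packet-level filtering and handling. The rule fields, handler names, and the toy XOR cipher are our own illustrative assumptions, not the paper's PCIe-SC design, which operates on PCIe TLPs in hardware with a real cryptographic engine and its own rule tables.

```python
# Illustrative model of packet-level interception: classify a transaction,
# look up a rule by requester ID and address range, and apply its policy.
from dataclasses import dataclass

@dataclass
class TLP:
    kind: str          # "MMIO_WR", "MMIO_RD", "DMA_WR", "DMA_RD"
    requester_id: int
    addr: int
    payload: bytes = b""

@dataclass
class Rule:
    lo: int
    hi: int
    requester_id: int
    policy: str        # "pass", "validate", "encrypt"

def xor_cipher(data: bytes, key: int) -> bytes:
    # Stand-in for a real authenticated-encryption engine; illustrative only.
    return bytes(b ^ key for b in data)

def filter_packet(tlp: TLP, rules: list[Rule], key: int) -> TLP | None:
    for r in rules:
        if r.requester_id == tlp.requester_id and r.lo <= tlp.addr < r.hi:
            if r.policy == "pass":
                return tlp
            if r.policy == "validate":
                # e.g., reject writes into a sensitive register window
                return tlp if tlp.kind != "MMIO_WR" else None
            if r.policy == "encrypt":
                return TLP(tlp.kind, tlp.requester_id, tlp.addr,
                           xor_cipher(tlp.payload, key))
    return None  # default-deny unmatched traffic

rules = [Rule(0x0000, 0x1000, 7, "validate"),    # MMIO register window
         Rule(0x10000, 0x20000, 7, "encrypt")]   # DMA buffer region
print(filter_packet(TLP("DMA_WR", 7, 0x10040, b"secret"), rules, key=0x5A))
```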
The core novelty is not in the individual ideas of hardware-based security or using TEEs, but in the specific architectural synthesis: using an external, packet-level PCIe security module to create a confidentiality boundary for unmodified, legacy hardware and software stacks. This positions ccAI as a solution for the vast ecosystem of existing accelerators that lack the built-in features of NVIDIA's H100 or do not comply with emerging standards like TDISP.
Strengths
-
Novel Architectural Niche: The primary strength of this work is its novel architectural arrangement. While prior art has explored confidential accelerators, they typically fall into three categories that ccAI cleverly sidesteps:
- Device-Integrated Hardware (e.g., NVIDIA H100 [50]): These solutions are vendor-specific, proprietary, and only available on the latest, most expensive hardware. ccAI's approach of externalizing the security module makes it, in principle, vendor-agnostic and applicable to legacy devices.
- TEE-based Software Modifications (e.g., Cronus [40], CAGE [85]): These systems often require significant and complex modifications to the xPU driver stack to partition it and run critical parts within a TEE. ccAI's proposed hardware/software co-design aims for a much smaller software footprint (the "Adaptor") and claims to leave the complex driver stack untouched, a significant delta in terms of transparency and compatibility.
- Forthcoming Standards (e.g., TDISP): These require compliance from the CPU, platform, and the xPU device itself. The novelty of ccAI is that it provides a concrete solution for the present-day reality where such compliant hardware is not widely deployed.
-
Packet-Level Abstraction: Anchoring the security mechanism at the level of the PCIe packet (as detailed in Section 4, page 5) is a powerful and novel choice for this specific problem domain. PCIe is the lingua franca of host-device communication. By operating at this level, ccAI establishes a uniform enforcement point that is conceptually independent of the specific xPU architecture (e.g., GPU vs. NPU), thus providing a more generalizable solution than prior works that are often tailored to a specific accelerator's command submission workflow.
Weaknesses
-
The "In-line Security Appliance" Precedent: While the application to confidential xPU computing is novel, the fundamental concept of an in-line hardware appliance on a communication bus (be it PCIe, Ethernet, etc.) that filters, modifies, and secures traffic is not entirely new. The paper would be strengthened by more clearly positioning its novelty against this broader class of "bump-in-the-wire" security devices. The novelty is in the details of the packet handlers and the co-design with the TVM-side Adaptor, not the general idea of intercepting traffic.
-
Generalization Claim vs. Inherent Specificity: The paper's core claim is broad compatibility. However, the true novelty of a general solution is tested by its ability to handle diversity without becoming a collection of special cases. The "Packet Filter" (Section 4.1, page 6) relies on rules based on address spaces and requester IDs. Different xPUs and their drivers use MMIO and DMA regions in vastly different, sometimes idiosyncratic, ways. The paper does not provide sufficient evidence that a simple, generalizable rule set can be defined for truly disparate devices (e.g., an NVIDIA GPU vs. a Tenstorrent NPU) without requiring extensive, device-specific reverse engineering and tuning. The delta between ccAI and a device-specific solution may be smaller in practice than claimed.
-
Handling of Proprietary Sidebands and Packets: The architectural novelty assumes all critical communication happens over standard, well-documented PCIe packets. However, many complex devices use vendor-defined message types or sideband communication channels that may not be visible or interpretable as standard DMA/MMIO. The brief mention of "Customized packets" in the Discussion (Section 9, page 12) acknowledges this but doesn't fully address the threat to the novelty of a universal solution. If the PCIe-SC cannot parse or secure these proprietary flows, the security guarantees are incomplete, reducing the significance of the advancement.
Questions to Address In Rebuttal
-
The novelty of the "Adaptor" component hinges on it being a lightweight, minimally intrusive module. For a completely new xPU not evaluated in the paper (e.g., an AMD GPU or a Google TPU), what would be the precise engineering effort required to develop the necessary Adaptor hooks and Packet Filter rules? Please provide a concrete, step-by-step process. This will help clarify whether the solution is genuinely general or just a framework for creating bespoke drivers.
-
The packet filtering mechanism (Figure 5, page 6) is key to the design. Can the authors provide a concrete example of a non-trivial security policy that differentiates between two distinct operations on the same GPU? For example, how would the L1/L2 table rules distinguish between a legitimate kernel launch command write and a malicious MMIO write attempting to access a configuration register that could compromise isolation?
-
The proposed PCIe-SC is a novel hardware component. Its viability depends on its ability to scale with technology. The prototype is evaluated on a PCIe 4.0 system. What are the architectural bottlenecks in the PCIe-SC design (e.g., table lookup logic, cryptographic engine throughput) that would need to be overcome to support the line rates and lower latencies of future PCIe 6.0 or 7.0 interfaces? Is the proposed architecture fundamentally scalable?
Ironman: Accelerating Oblivious Transfer Extension for Privacy-Preserving AI with Near-Memory Processing
Abstract
With the wide application of machine learning (ML), privacy concerns arise with user data as they may contain sensitive information. Privacy-preserving ML (PPML) based on cryptographic primitives has emerged as a promising solution in which an ML model is ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose "Ironman," a Near-Memory Processing (NMP) architecture to accelerate Oblivious Transfer Extension (OTE), a critical component in many Privacy-Preserving Machine Learning (PPML) frameworks. The proposal involves a hardware/software co-design approach: for the compute-bound Single-Point Correlated OT (SPCOT) sub-protocol, they introduce an m-ary GGM tree expansion using a ChaCha-based PRG instead of the standard AES. For the memory-bound Learning Parity with Noise (LPN) sub-protocol, they propose an NMP architecture with a memory-side cache and a pre-sorting algorithm to improve data locality. While the paper identifies the correct bottlenecks, its central claims are predicated on a number of strong, and in some cases, unrealistic, hardware assumptions, and the reported performance gains appear to be misleadingly framed.
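For readers unfamiliar with the GGM-tree trade-off at issue, the back-of-the-envelope sketch below (our own, with illustrative parameters and a simplified one-PRG-call-per-internal-node cost model) shows why an m-ary expansion with a long-output PRG such as ChaCha reduces the number of PRG invocations relative to a binary tree; real binary AES-based implementations may differ in per-node cost.

```python
# Count PRG calls needed to expand a GGM-style tree to num_leaves leaves,
# assuming one PRG call per internal node that yields `arity` children
# (e.g., ChaCha's 512-bit block seeding four 128-bit children).
def prg_calls(num_leaves: int, arity: int) -> int:
    calls, level = 0, 1
    while level < num_leaves:
        calls += level          # one PRG call per internal node on this level
        level *= arity
    return calls

n = 1 << 14                      # e.g., 16K leaves in one SPCOT instance
print(prg_calls(n, 2))           # binary tree:  16383 calls
print(prg_calls(n, 4))           # 4-ary tree:    5461 calls (~3x fewer)
```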
Strengths
- The fundamental diagnosis of the performance bottlenecks in PCG-style OTE is sound. The identification of SPCOT as compute-bound and LPN as memory-bandwidth-bound (Section 1, page 2, Figure 1(c)) correctly motivates the need for distinct optimization strategies.
- The algorithmic optimization for SPCOT, specifically the combination of m-ary tree expansion with a ChaCha8-based PRG, is a logical approach to reducing the total number of PRG calls. The ablation study presented in Figure 13(a) (Section 6.2, page 11) provides clear evidence that this combined approach is superior to applying either optimization in isolation.
- The proposed index sorting algorithm for the LPN phase (Section 5.3, page 9) is an intelligent technique to mitigate the performance degradation from irregular memory accesses. The use of offline, compile-time sorting for a fixed matrix is a valid optimization strategy in principle.
Weaknesses
- Fundamentally Unrealistic Hardware Model: The entire proposal hinges on the ability to place a custom, high-throughput "ChaCha8 Core" and other specialized logic (e.g., the Unified Unit) on the buffer chip of a DRAM DIMM (Section 5.1, page 7, Figure 9). The authors themselves concede in Section 5.1.3 (page 8) that deploying this on existing commercial NMP hardware like UPMEM or HBM-PIM would present "certain challenges" and would ultimately "require replacing the existing hardware with our custom ASIC." This admission relegates the work to a purely theoretical exercise. It is not an acceleration on near-memory processing as it exists, but on a hypothetical, full-custom memory system that is not commercially viable or available.
- Misleading Presentation of Performance Gains: The abstract and headline results prominently feature a "39.2-237.4× improvement in OT throughput." However, the end-to-end application speedup for actual PPML frameworks is a far more modest 2.1-3.4× (Table 5, Section 6.5, page 11). While solving one bottleneck to reveal another (communication) is a common outcome, framing the work around a component speedup that is orders of magnitude larger than the real-world application benefit is misleading to the reader. The true impact is much smaller than suggested.
- Insufficiently Rigorous Baseline Comparisons: The paper compares against a "full-thread CPU implementation" and a GPU implementation. The GPU baseline, an NVIDIA A6000, is a powerful accelerator. However, its implementation is described in a single sentence (Section 6.1, page 10), lacking any detail on the level of optimization. Without evidence of a highly optimized CUDA implementation that effectively leverages GPU architecture for this specific task, the 40.31x speedup claimed over the GPU baseline is suspect and likely inflated by comparing against a strawman implementation.
- Unsupported Claims of Negligible Overhead: The authors claim in Section 5.1.3 (page 8) that the cost of offloading data to the host CPU "becomes negligible" because generation can be overlapped with transmission. This is an oversimplification that ignores the complexities of host-device synchronization, instruction dispatch overhead, and potential bus contention. This critical system-level cost is dismissed without any quantitative analysis or simulation data to support the claim.
- Limited Applicability of LPN Optimization: The index sorting algorithm, which is key to the LPN speedup, requires the index matrix A to be fixed and known at compile time (Section 5.3, page 9). This is a strong assumption that may not hold for all PPML scenarios or future cryptographic protocols. The paper fails to discuss the implications or performance impact if this matrix were dynamic or generated on-the-fly.
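As context for this weakness, the sketch below (our own simplification, not the paper's column-swapping and row-look-ahead algorithm) casts the LPN expansion as a sparse XOR over a fixed index matrix. It also shows why reordering the index matrix is a purely offline transformation: it is only possible when the matrix is known ahead of time, which is exactly the assumption being questioned.

```python
# LPN expansion as a sparse GF(2) matrix-vector product: each output bit is
# the XOR of k seed bits selected by a row of the (fixed) index matrix A.
# Pre-sorting indices within a row is a valid offline step because XOR is
# commutative; it is meant only to illustrate locality-oriented reordering.
import random

def lpn_expand(seed_bits: list[int], index_rows: list[list[int]]) -> list[int]:
    out = []
    for row in index_rows:
        acc = 0
        for idx in row:
            acc ^= seed_bits[idx]
        out.append(acc)
    return out

def presort_rows(index_rows: list[list[int]]) -> list[list[int]]:
    # Offline step (requires A to be fixed at compile time): sort indices so
    # consecutive accesses touch nearby memory locations.
    return [sorted(row) for row in index_rows]

k, n, N = 10, 1 << 12, 1 << 14
seed = [random.randint(0, 1) for _ in range(n)]
A = [[random.randrange(n) for _ in range(k)] for _ in range(N)]
assert lpn_expand(seed, A) == lpn_expand(seed, presort_rows(A))  # same result
```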
Questions to Address In Rebuttal
- Regarding the hardware model: Can you justify the practicality of your proposed architecture? Please provide a clear pathway for implementing this design on any existing or near-future NMP platform without requiring a full-custom ASIC on the DIMM buffer chip. If no such pathway exists, the contribution must be re-framed as purely theoretical.
- Regarding the performance claims: Please reconcile the 237.4x component speedup highlighted in the abstract with the ~3x end-to-end application speedup from Table 5. Why is the most prominent number in the paper not representative of the actual system-level impact?
- Regarding the GPU baseline: Please provide concrete details of your GPU implementation. Specifically, what CUDA kernels were designed, how was work distributed across thread blocks, and what memory optimizations (e.g., shared memory usage, coalesced access patterns) were employed? This is necessary to validate that your comparison is fair.
- Regarding the offloading cost: Please provide a quantitative breakdown (e.g., from your ZSim simulation) of the overhead associated with the host processor dispatching NMP instructions and receiving the final COT correlations. How does this overhead scale with the number of correlations, and under what specific conditions is it truly "negligible"?
- Regarding the LPN sorting algorithm: Please clarify the assumptions under which the index matrix A can be considered static and sorted offline. What is the performance impact on the LPN stage if this assumption is violated and the matrix must be processed without prior sorting?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents "Ironman," a specialized hardware accelerator designed to address a critical performance bottleneck in Privacy-Preserving Machine Learning (PPML): the Oblivious Transfer Extension (OTE) protocol. The authors correctly identify that as other parts of PPML frameworks are optimized, the cryptographic machinery for handling non-linear functions—which heavily relies on OTE—becomes the dominant cost.
The core contribution is a holistic hardware/software co-design. The authors cleverly partition the OTE protocol into its two main components: the computation-bound SPCOT and the memory-bound LPN. For SPCOT, they propose replacing the standard binary AES-based tree expansion with a more hardware-friendly 4-ary tree using a custom ChaCha8 core, significantly reducing the number of primitive operations. For LPN, they leverage a Near-Memory Processing (NMP) architecture with memory-side caches and a novel offline index sorting algorithm to mitigate the performance degradation from LPN's irregular memory access patterns. The design is thoughtfully unified to support the switching of sender/receiver roles inherent in many MPC protocols. The simulation-based results demonstrate a substantial 39x-237x speedup for the OTE protocol itself and a compelling 2.1x-3.4x end-to-end latency reduction for complex CNN and Transformer models within modern PPML frameworks.
Strengths
-
Excellent Problem Identification and Contextualization: The paper is situated at the crucial intersection of computer architecture, applied cryptography, and secure AI. The authors have correctly diagnosed that OTE is the next major frontier for optimization in making large-scale PPML practical. By profiling real frameworks (Figure 1, page 2), they provide strong motivation that this is not a theoretical problem, but a real-world engineering challenge for the deployment of private AI.
-
Sophisticated HW/SW Co-Design: This is the paper's most significant strength. Rather than naively accelerating the canonical OTE algorithm, the authors re-evaluate the algorithmic components from first principles with hardware in mind. The decision to move from a standard binary GGM tree with AES to a 4-ary tree with a custom ChaCha8 primitive (Section 4, page 6) is an outstanding example of co-design. It showcases a deep understanding of both the cryptographic requirements and the hardware implementation trade-offs, leading to a 6x performance improvement for the SPCOT component (Figure 13, page 11).
-
Sound and Modern Architectural Approach: The use of Near-Memory Processing (NMP) for the memory-bound LPN component is a well-justified and modern architectural choice. The design insight that LPN's random access patterns can be regularized via offline sorting (Section 5.3, page 9) and serviced efficiently by distributed memory-side caches and rank-level parallelism is very compelling. This correctly maps the problem's characteristics (low compute intensity, high memory bandwidth demand) to an appropriate architectural solution.
-
Enabling Practical Private AI for Complex Models: By achieving significant end-to-end speedups, this work pushes the boundary of what is considered feasible for PPML. The evaluation on large Transformer models (ViT, BERT, GPT-2) is particularly important, as these models are often considered prohibitively expensive to run in a secure context. This work provides a credible pathway to deploying such state-of-the-art models with strong privacy guarantees.
Weaknesses
While the core ideas are strong, the paper could be improved by addressing the following points, which are more about completeness than fundamental flaws:
-
Positioning of Novelty: The paper's novelty lies in the synthesis and application of existing ideas to a new, important domain. NMP, custom cryptographic cores, and index sorting for sparse operations are established techniques. The paper would be stronger if it more explicitly framed its contribution not as the invention of these components, but as the first successful synthesis of them into a coherent accelerator for the PCG-style OTE bottleneck.
-
Practicality of NMP Integration: The paper relies on a simulated NMP environment. While this is standard for architectural research, a brief discussion on the practical path to deployment would be valuable. For instance, what modifications would be needed in the OS/memory controller to manage the NMP units? How would the Ironman software stack integrate with existing PPML frameworks like PyTorch or TensorFlow, which are often the front-ends for these secure back-ends?
-
Overhead of Offline Pre-processing: The index sorting algorithm for LPN is performed offline. The paper states this cost is amortized because the LPN matrix is fixed. While true for many inference scenarios, it would be useful to quantify this one-time cost. Furthermore, a discussion on scenarios where the matrix might not be fixed (e.g., certain types of secure training or dynamic systems) would add nuance and scope the applicability of this specific optimization.
Questions to Address In Rebuttal
I am broadly supportive of this work and believe it makes a valuable contribution. I encourage the authors to use the rebuttal to address the following to strengthen the final paper:
-
Regarding the LPN optimization (Section 5.3, page 9): Could you quantify the one-time computational cost of the offline column and row sorting algorithm? How does this cost scale with the size of the LPN matrix, and at what point might it become a consideration in the overall workflow?
-
Regarding the architectural scalability: Your evaluation shows strong scaling up to 16 ranks. As you scale to a larger number of DIMMs/ranks, do you foresee the final XOR reduction in the DIMM-NMP module (Figure 9b, page 8) becoming a new bottleneck, and if so, how might your design address this?
-
Regarding the algorithmic co-design (Section 4.1, page 6): You make a compelling case for using a 4-ary tree with ChaCha8. Could you provide a little more architectural insight into why ChaCha8 is so well-suited for a pipelined hardware implementation compared to, for example, a pipelined AES implementation? Is it primarily the longer output, or are there also advantages in the simplicity of its core operations (Add-Rotate-XOR) that lead to a more area- or power-efficient pipeline?
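For reference, the ChaCha quarter round at the heart of this question (standard per RFC 8439, reproduced in Python for illustration; the paper's ChaCha8 core is not shown) consists solely of 32-bit additions, XORs, and fixed rotations, which is the usual argument for its suitability to shallow, table-free hardware pipelines compared to AES's S-box substitutions.

```python
# Standard ChaCha quarter round (RFC 8439): Add-Rotate-XOR only, no lookup
# tables. ChaCha8 applies 8 rounds (4 column passes + 4 diagonal passes) of
# four quarter rounds each over a 16-word state, yielding a 512-bit block.
MASK = 0xFFFFFFFF

def rotl32(x: int, r: int) -> int:
    return ((x << r) | (x >> (32 - r))) & MASK

def quarter_round(a: int, b: int, c: int, d: int) -> tuple[int, int, int, int]:
    a = (a + b) & MASK; d = rotl32(d ^ a, 16)
    c = (c + d) & MASK; b = rotl32(b ^ c, 12)
    a = (a + b) & MASK; d = rotl32(d ^ a, 8)
    c = (c + d) & MASK; b = rotl32(b ^ c, 7)
    return a, b, c, d

print(quarter_round(0x11111111, 0x01020304, 0x9b8d6f43, 0x01234567))
```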
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents "Ironman," a hardware accelerator architecture for PCG-style Oblivious Transfer Extension (OTE), targeting the performance bottleneck in modern Privacy-Preserving Machine Learning (PPML) frameworks. The authors identify two key components of OTE: the computation-bound Single-Point Correlated OT (SPCOT) and the memory-bandwidth-bound Learning Parity with Noise (LPN). To address these, they propose a two-pronged solution: 1) a novel "hardware-aware" m-ary GGM tree expansion algorithm for SPCOT that uses a ChaCha-based PRG instead of AES, and 2) a Near-Memory Processing (NMP) architecture with an index sorting algorithm to improve data locality for LPN.
My primary concern is the degree of conceptual novelty. While the paper claims to be the first customized accelerator for PCG-style OTE, which appears to be true, the underlying techniques employed are largely adaptations of well-known principles from other domains. The primary contribution is therefore the synthesis and application of these existing ideas to a new problem, rather than the invention of fundamentally new architectural or algorithmic concepts.
Strengths
-
First-Mover on a Relevant Problem: To the best of my knowledge, this is the first work to propose a dedicated, end-to-end hardware architecture for accelerating PCG-style OTE. Previous work on OT acceleration (e.g., POTA [94]) has focused on other variants like IKNP-style OT, which have different performance characteristics. Identifying and targeting this specific, computationally intensive protocol is a timely contribution.
-
Hardware/Algorithm Co-Design for SPCOT: The most novel element of this work is the proposed co-design for SPCOT in Section 4 (Page 6). The idea of replacing the standard 2-ary AES-based GGM tree with an m-ary ChaCha-based tree is a clever, hardware-centric optimization. While m-ary trees have been explored for V-OLE [88], the explicit analysis and motivation for coupling this with a specific, hardware-friendly PRG like ChaCha8 to reduce operator count and improve area/power efficiency (Table 2, Page 5) constitutes a legitimate, albeit specialized, engineering novelty.
Weaknesses
-
Limited Conceptual Novelty in LPN Acceleration: The proposed solution for the memory-bound LPN component lacks fundamental novelty. The core ideas are:
- Applying NMP: Using Near-Memory Processing to accelerate a memory-bandwidth-bound problem with low computational intensity is the foundational premise of NMP research. The application here is a straightforward, albeit effective, use case. It does not introduce a new NMP paradigm.
- Index Sorting: The "Index Sorting Algorithm for Memory-side Cache" described in Section 5.3 (Page 9) is conceptually identical to decades of work on improving data locality for Sparse Matrix-Vector Multiplication (SpMV). The authors themselves formulate LPN as an SpMV problem. Techniques such as column and row reordering to cluster non-zero elements and improve cache utilization are standard practice in the high-performance computing (HPC) and compiler communities. The application to LPN is new, but the technique itself is not.
-
"Unified Architecture" is Standard Design Practice: The claim of novelty for a unified architecture supporting both sender and receiver roles (Section 5.2, Page 8) is overstated. Designing a datapath with shared resources (like an XOR tree) that can be reconfigured through control logic to perform slightly different but related functions is a standard, fundamental principle of efficient hardware design to minimize area. While a necessary feature for a practical implementation, it does not constitute a novel research contribution.
-
The "Delta" is Primarily Application-Specific: The paper's main strength—being the first accelerator for PCG-style OTE—is also the source of its weakness from a novelty perspective. The contributions are highly specific to this protocol. The core takeaways for a general hardware architect are "use NMP for memory-bound problems" and "reorder memory accesses to improve locality," both of which are already known. The paper does an excellent job of system integration and engineering, but the set of new, generalizable concepts is small.
Questions to Address In Rebuttal
The authors should focus their rebuttal on clarifying the conceptual advances beyond the direct application to PCG-style OTE.
-
On LPN Acceleration: The authors frame the LPN operation as an SpMV problem (Section 5.3, Page 9). Please explicitly differentiate your proposed "Column Swapping + Row Looking-ahead" sorting algorithm from existing graph partitioning or matrix reordering algorithms used to optimize SpMV performance in the HPC domain. What is fundamentally new about your sorting approach that is uniquely tailored to the structure of LPN and not just an application of a known technique?
-
On Conceptual Contribution: Beyond being the first architecture for this specific task, what is the single most important conceptual and generalizable contribution of this work? Is there a new principle of cryptographic acceleration or NMP design that a future architect, working on a different protocol, could learn and apply from this paper?
-
On the "Hardware-Aware" m-ary Tree: The co-design of the m-ary tree with ChaCha is presented as a key contribution. Could the authors elaborate on whether this principle is more profound than a simple substitution of a more parallelizable PRG? For instance, does the structure of the ChaCha output influence the optimal choice of
m in a way that would not apply to other long-output PRGs? This would help solidify the "co-design" aspect as more than just a component swap.
Dissecting and Modeling the Architecture of Modern GPU Cores
Abstract
GPUs are the most popular platform for accelerating HPC workloads, such as artificial intelligence and science simulations. However, most microarchitectural research in academia relies on simulators that model GPU core architectures based on designs that ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors attempt to reverse engineer the core microarchitecture of modern NVIDIA GPUs (Turing, Ampere, and Blackwell) through a series of microbenchmarks. Based on their inferences, they propose a new core model for the Accel-sim framework. The paper claims this new model significantly improves simulation accuracy, reducing the Mean Absolute Percentage Error (MAPE) from over 34% to approximately 13.5% for an Ampere GPU when compared to the baseline Accel-sim. The primary contributions center on elucidating the function of compiler-managed control bits for dependency tracking, a new issue scheduler policy, and refined models for the register file and memory pipeline.
Strengths
- The ambition of the work is commendable. Tackling the opaque nature of modern commercial GPU architectures is a difficult and labor-intensive task.
- The paper provides a wealth of specific, quantitative data in its tables (e.g., Table 2, Memory instruction latencies) which, if accurate, could serve as a useful reference.
- The authors have clearly invested significant effort in creating and running numerous microbenchmarks to generate the timing data that forms the basis of their hypotheses.
Weaknesses
My primary concerns with this manuscript are the opacity of the methodology, the logical leaps made from limited evidence, and the potential for an unfair baseline comparison, which may inflate the significance of the reported results.
-
Methodological Rigor and Reproducibility: The reverse engineering methodology described in Section 3 is presented anecdotally. The authors provide two illustrative examples (Listing 1, scheduler policy) but fail to describe the systematic process and full scope of their investigation. How many microbenchmark variants were run to derive each conclusion? What was the process for ruling out alternative hypotheses? Without a rigorous account of the methodology, the findings appear to be a collection of observations from hand-picked scenarios rather than the result of a comprehensive, scientific dissection. The work is therefore difficult to validate or trust.
-
Overstatement of Inferred "Discoveries": The paper presents its hypotheses as definitive facts. For instance, the proposed "Compiler Guided Greedy Then Youngest (CGGTY)" issue policy (Section 5.1.2) is a strong claim based on insufficient evidence. Figure 4 shows only three specific scenarios with homogeneous, independent instructions. This is hardly a comprehensive evaluation required to definitively characterize a complex scheduler. How does the policy behave under heavy memory contention, with long-latency instructions, or around synchronization primitives? The evidence supports CGGTY as a plausible hypothesis for a limited case, not as a confirmed mechanism. This pattern of over-claiming permeates the paper (e.g., register file organization, front-end policy).
-
Questionable Baseline for Comparison: The validation in Section 7 hinges on a comparison against "the Accel-sim simulator." It is well-known that the public Accel-sim model is largely based on the much older Tesla/Fermi architecture. Comparing a new model tailored for Ampere against a decade-old architectural model and then claiming a >20% reduction in MAPE is not a fair or insightful comparison. It proves that Ampere is different from Fermi, which is already known. The authors have not demonstrated that their model is superior to a state-of-the-art academic model properly configured for a more recent architecture. This appears to be a "strawman" comparison.
-
Unsupported Claims Regarding the Blackwell Architecture: The claim to have modeled the very recent Blackwell architecture with high accuracy (17.41% MAPE, Table 4) is not adequately substantiated. In Section 6, the authors mention only supporting new SASS instructions and extending the L2 hashing function. These are minor adjustments and it is highly improbable that they capture the full microarchitectural evolution from Ampere to Blackwell. This claim feels premature and potentially misleading.
-
Contamination of Validation Results: A critical detail is buried at the end of Section 6: for some kernels, where SASS code was unavailable, a "hybrid mode" using "traditional scoreboards" was employed. This is a major confounding variable. The paper’s central thesis is about the accuracy of a new model based on compiler-set control bits. However, the final accuracy numbers are an amalgam of this new model and an entirely different, traditional dependency model. The authors do not state for which of the 128 benchmarks this hybrid mode was used, making it impossible to assess the true accuracy of their primary contribution.
Questions to Address In Rebuttal
The authors must provide clear and concise answers to the following questions to alleviate the concerns raised.
-
Regarding your methodology (Section 3), can you provide a quantitative summary of the reverse engineering effort? For example, for the issue scheduler policy alone, how many distinct microbenchmark scenarios (beyond the three in Figure 4) were constructed and tested to validate the CGGTY policy against other plausible alternatives (e.g., LRR, GTO)?
-
Regarding the baseline (Section 7.2), please specify the exact version and configuration of Accel-sim used as the baseline. Justify why this configuration, which primarily models an older architecture, is considered a fair and relevant baseline for comparison against modern Ampere/Turing hardware.
-
Regarding the "hybrid mode" for dependency tracking mentioned in Section 6, please provide a list of the benchmarks from your validation suite (Table 3) that required this mode. Furthermore, present a separate MAPE analysis for the subset of benchmarks that ran purely on your proposed control-bit model versus the subset that used the hybrid scoreboard model.
-
Regarding your Blackwell model, please provide a comprehensive list of all microarchitectural changes implemented beyond the Ampere model. How do you justify that this limited set of changes is sufficient to claim an accurate model of the Blackwell core architecture, rather than simply an "Ampere-plus" model that coincidentally performs well on your chosen benchmarks?
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents a comprehensive reverse-engineering study of modern NVIDIA GPU core microarchitectures, from Turing to the latest Blackwell generation. The authors' primary contribution is bridging the significant and growing gap between the aging architectural models used in academic simulators (like Accel-sim) and the reality of contemporary commercial hardware. Through meticulous micro-benchmarking using hand-written SASS code, the work uncovers crucial, previously undocumented design details. These include the compiler-driven mechanism for managing data dependencies (replacing traditional hardware scoreboards), a novel "Compiler Guided Greedy Then Youngest" (CGGTY) issue policy, and detailed models of the multi-banked register file, its associated cache, and the memory pipeline. The authors integrate these findings into a new, publicly available simulator model, demonstrating a dramatic improvement in accuracy—reducing the Mean Absolute Percentage Error (MAPE) from over 34% to 13.45% on an Ampere GPU compared to the baseline Accel-sim. This work serves as both a significant contribution to the scientific understanding of modern GPU design and a vital update to the community's research infrastructure.
Strengths
-
High-Impact Contribution to Research Infrastructure: The most significant strength of this work is its direct and potent address of the "relevance crisis" in academic GPU simulation. For over a decade, much of the community's research has been predicated on simulators modeling architectures from the Tesla era (circa 2006). This paper performs the herculean task of updating our collective understanding and providing a tangible, validated tool that will elevate the quality and relevance of future research in areas like scheduling, memory systems, and compiler optimizations for GPUs. The validation across three major architectural generations (Turing, Ampere, Blackwell) underscores its immediate and likely future utility.
-
Unveiling a Fundamental Architectural Paradigm Shift: The paper's detailed exposition of the software-hardware co-design for dependency management (Section 4, page 3 and Section 7.5, page 13) is a profound insight. The move away from complex, power-hungry hardware scoreboards towards compiler-managed
Stall counters and Dependence counters represents a major philosophical shift in GPU design. This finding contextualizes modern GPUs within a broader architectural trend towards compiler-led complexity management, reminiscent of VLIW principles. This discovery alone is a major contribution to computer architecture literature.
-
Methodological Depth and Rigor: The authors' methodology of using carefully crafted, low-level SASS microbenchmarks to probe the hardware's behavior is commendable. This is a non-trivial undertaking that requires deep expertise. The detailed examples provided (e.g., Listing 1 for register file conflicts, the analysis in Figure 4 for the scheduler policy) lend significant credibility to their inferred models. This empirical, bottom-up approach is exactly what is needed to demystify these otherwise black-box systems.
-
Holistic and Coherent Model: Unlike prior works that often focused on reverse-engineering a single component (e.g., a cache, a specific unit), this paper presents a coherent model of the entire core pipeline. It successfully connects the dots between the front-end fetch/decode, the issue logic, the register file, and the memory subsystems, showing how they interact. The discovery of the CGGTY issue policy and its interaction with the compiler-set
Yield bit is a perfect example of this holistic view (a small sketch of this issue policy follows this list).
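To illustrate the policy referred to above, here is a toy selection function. It is our own abstraction: the warp bookkeeping, the assumption that a larger warp ID means a younger warp, and the tie-breaking are illustrative, not the paper's hardware-level description.

```python
# Toy model of a "greedy then youngest" warp selection guided by a
# compiler-set Yield bit: keep issuing from the same warp while it is ready
# and has not been told to yield; otherwise pick the youngest ready warp.
from dataclasses import dataclass

@dataclass
class Warp:
    warp_id: int                 # larger id = younger warp (assumption)
    ready: bool = True
    yield_bit: bool = False      # set by the compiler on the last issued inst.

def select_warp(warps: list[Warp], last_issued: int | None) -> int | None:
    # Greedy phase: stay on the previously issued warp if it can still issue.
    if last_issued is not None:
        w = warps[last_issued]
        if w.ready and not w.yield_bit:
            return last_issued
    # Fallback: youngest ready warp.
    ready = [w for w in warps if w.ready]
    return max(ready, key=lambda w: w.warp_id).warp_id if ready else None

warps = [Warp(0), Warp(1), Warp(2, yield_bit=True), Warp(3)]
print(select_warp(warps, last_issued=2))   # Yield set on warp 2 -> picks warp 3
print(select_warp(warps, last_issued=0))   # warp 0 still ready  -> stays greedy
```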
Weaknesses
As a contextual analyst, I view these less as flaws and more as inherent limitations or opportunities for deeper discussion.
-
Inherent Ambiguity of Reverse Engineering: The work constructs a highly plausible and well-validated model, but it is ultimately an inferred model. The authors are commendably transparent about this (e.g., acknowledging in Section 5.1, page 6, that they "could not find a model that perfectly fits all the experiments"). The paper could benefit from a brief, consolidated discussion on the limitations of this approach and the confidence bounds on their conclusions. While the MAPE reduction is impressive, it's important to frame the resulting model as a powerful and accurate approximation, not necessarily ground truth.
-
Limited Exploration of the "Why": The paper excels at detailing the "what" (the mechanisms) and the "how" (their operation). However, it offers little speculation on the "why"—the architectural trade-offs that likely motivated these design choices. For example, why did NVIDIA pivot so heavily to compiler-managed dependencies? Was the primary driver area savings, power reduction, clock speed improvements, or simpler hardware verification? Adding a short discussion section to hypothesize on these design rationales would elevate the paper from a descriptive masterpiece to a more complete architectural analysis.
-
NVIDIA-Centric Focus: The paper's deep dive is exclusively on NVIDIA architectures. This is a practical and understandable choice given NVIDIA's market position and the sheer scope of the work. However, it implicitly positions the NVIDIA way as the modern GPU way. While a brief comparison to AMD's
waitcnt is included (Section 5, page 5), the work would be even more valuable to the broader community if it could contextualize its findings by more explicitly contrasting them with the known design philosophies of other major GPU vendors.
Questions to Address In Rebuttal
-
The core of your work rests on a complex, inferred model of the GPU pipeline. Beyond the aggregate MAPE scores, were there any specific micro-architectural behaviors or corner-case instruction sequences that your final model still struggled to accurately predict? Discussing these outliers could provide valuable clues for future reverse-engineering efforts.
-
Could you elaborate on the likely architectural motivations behind NVIDIA's shift to the compiler-managed dependency system? What are the primary trade-offs (e.g., in terms of area, power, performance, and compiler complexity) compared to the traditional hardware scoreboard approach that your own results in Section 7.5 (page 13) so clearly favor?
-
This work represents a monumental effort to re-synchronize academic tools with a fast-moving industry target. Looking forward, how sustainable is this manual, intensive reverse-engineering process? Does your experience suggest any potential avenues for semi-automating the discovery of such microarchitectural properties for future GPU generations, to ensure the community's research tools do not fall so far behind again?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present a work of microarchitectural reverse engineering and modeling for modern NVIDIA GPU cores (Turing, Ampere, and Blackwell). The central claim is that existing academic simulators are based on antiquated architectural assumptions (dating back to the Tesla architecture) and are therefore inaccurate. This paper seeks to remedy this by discovering and modeling key features of modern designs. The core novel claims center on the detailed mechanics of a compiler-hardware co-design for managing data dependencies, a specific issue scheduler policy ("Compiler Guided Greedy Then Youngest"), the internal structure of the register file (RF) and its cache, and the absence of an operand collector stage. These findings are integrated into the Accel-sim framework, and the resulting model is shown to be significantly more accurate than the baseline.
Strengths
The primary strength of this paper is the novelty of its empirical findings. While many papers propose new architectural ideas, this work provides a rare and valuable service by reverse-engineering and documenting a complex, proprietary, state-of-the-art commercial architecture. The novelty is not in inventing new mechanisms, but in revealing existing ones for the first time in the public domain.
-
Novel Semantics of Dependence Management: The most significant novel contribution is the detailed elucidation of the software-based dependence management system (Section 4, pages 3-4). The existence of compiler-inserted "hints" is not new; this was observed in the Kepler architecture and noted by Jia et al. [46, 47] for Volta/Turing. However, prior work has not provided a functional, mechanistic model. This paper's detailed breakdown of the
Stall counter for fixed-latency hazards and the system of six Dependence counters (SBx registers) with producer-increment/consumer-decrement semantics for variable-latency hazards is a genuinely new contribution to public knowledge. The explanation of the DEPBAR.LE instruction and the Yield bit provides a complete, plausible model that has not been described elsewhere (a minimal sketch of these counter semantics follows this list).
-
Refutation of the Operand Collector Assumption: The paper makes a strong claim that modern NVIDIA GPUs do not use an operand collector unit (Section 5.3, page 8). This is a direct and novel refutation of a core assumption in the widely-used GPGPU-Sim and Accel-sim models. The authors' reasoning—that variable latency from an operand collector would break the compiler's ability to calculate static stall cycles—is logical and compelling. Disproving a long-held assumption in a dominant model is a significant form of novelty.
-
Specific Characterization of the Issue Scheduler: While greedy scheduling policies are well-known, the specific "Compiler Guided Greedy Then Youngest" (CGGTY) policy (Section 5.1.2, page 6) is a novel finding. This moves beyond the canonical "Greedy Then Oldest" (GTO) policy and demonstrates a tight coupling between the scheduler's fallback mechanism (picking the youngest warp) and the compiler's explicit instructions to yield the pipeline (
Yield bit). The detailed experimental timeline in Figure 4 provides strong evidence for this specific, previously undocumented policy.
-
Timeliness of Blackwell Modeling: The claim of being the first work to provide an accurate, validated model for the NVIDIA Blackwell architecture (Section 1, page 1 and Section 7.2, page 12) is a strong point of novelty. Given the recent release of this architecture, this contribution is at the cutting edge of academic GPU modeling.
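As a reading aid for the dependence-management mechanism discussed above, the following is a simplified software model of the counter semantics as the reviews describe them. It is an illustrative abstraction with invented class and method names, not the hardware implementation or the simulator's code.

```python
# Simplified model: fixed-latency hazards are covered by compiler-encoded
# stall cycles; variable-latency producers increment one of six per-warp
# dependence counters (SB0-SB5) at issue and decrement it at writeback, and
# a consumer waits (DEPBAR-like) until the named counter drains to zero.
class WarpDependenceState:
    def __init__(self) -> None:
        self.sb = [0] * 6              # six dependence counters per warp

    def issue_variable_latency(self, sb_idx: int) -> None:
        self.sb[sb_idx] += 1           # producer increments its assigned counter

    def writeback(self, sb_idx: int) -> None:
        self.sb[sb_idx] -= 1           # completion decrements it

    def can_issue(self, wait_sb: int | None, stall_cycles_left: int) -> bool:
        if stall_cycles_left > 0:      # fixed-latency hazard not yet elapsed
            return False
        return wait_sb is None or self.sb[wait_sb] == 0

state = WarpDependenceState()
state.issue_variable_latency(sb_idx=0)                     # e.g., a load tags SB0
print(state.can_issue(wait_sb=0, stall_cycles_left=0))     # False: load in flight
state.writeback(sb_idx=0)
print(state.can_issue(wait_sb=0, stall_cycles_left=0))     # True: dependency cleared
```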
Weaknesses
The weaknesses of the paper, from a novelty perspective, lie in areas where the contributions are more confirmatory or incremental rather than fundamentally new concepts.
-
Incremental Novelty of the RF Cache: The concept of a compiler-managed register file cache for GPUs is not a new idea. The paper's model is explicitly and correctly identified as being "similar to the work of Gebhart et al. [34]" (Section 5.3.1, page 8). While the reverse-engineered parameters—such as one entry per bank and specific software management via a "reuse bit"—are new empirical details for NVIDIA's implementation, the core architectural concept has been established in prior art for over a decade. The delta here is one of implementation-specifics, not of fundamental mechanism.
-
Assumed Prefetcher Design: The front-end model relies on a simple stream buffer for instruction prefetching (Section 5.2, page 7). This is a classic mechanism, and the authors state that they "suspect it is a simple scheme" and "assume" its size is 8 entries. While this assumption is validated by the model's overall accuracy, the contribution lacks the rigor of the reverse engineering applied elsewhere. As a contribution, proposing a standard, well-known prefetcher is not novel.
-
Known Trade-off Analysis: The analysis in Section 7.5 (page 13) comparing the discovered software-based dependence management to a traditional hardware scoreboard is an evaluation of a known design trade-off. The conclusion that a software-managed approach has lower area overhead is expected. The value is in quantifying this trade-off with realistic parameters, but this does not represent a new conceptual insight into computer architecture.
Questions to Address In Rebuttal
-
On the Fetch Policy's Novelty: The front-end fetch policy is assumed to mirror the issue scheduler's greedy logic (Section 5.2, page 7). This is presented as a plausible assumption rather than a direct finding. What experiments were performed to rule out other well-known fetch policies (e.g., round-robin, ICOUNT)? How much of the model's accuracy hinges on this specific assumption, which appears less rigorously proven than other claims in the paper?
-
Clarifying the Delta vs. Gebhart et al.: The paper acknowledges the similarity of its RF cache model to that of Gebhart et al. [34]. Beyond the parametric differences and the lack of a two-level scheduler, could the authors more sharply define the novel architectural principle, if any, that their findings reveal? Is NVIDIA's implementation merely a modern instantiation of the 2011 concept, or is there a more fundamental conceptual difference that this review has missed?
-
On the Generality and Longevity of Findings: The detailed semantics of the control bits and dependence counters are a key contribution. These were reverse-engineered across three recent architectures. Based on this, do the authors have evidence to suggest that this specific mechanism is a stable, long-term feature of NVIDIA's design philosophy, or is it an artifact of a particular architectural era? How much risk is there that the next generation invalidates this detailed model, thereby limiting the durable novelty of these specific findings?
Interleaved Bitstream Execution for Multi-Pattern Regex Matching on GPUs
Abstract
Pattern matching is a key operation in unstructured data analytics, commonly supported by regular expression (regex) engines. Bit-parallel regex engines compile regexes into bitstream programs, which expose fine-grained parallelism and are well-suited for ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper introduces BITGEN, a code generator that compiles multi-pattern regular expressions into bitstream programs for execution on GPUs. The central thesis is that a "sequential block-wise" execution of these programs is inefficient due to poor data reuse and high memory traffic. The authors propose an "interleaved execution" model, fusing all bitstream instructions into a single GPU kernel loop. To address the primary technical challenge of cross-block data dependencies introduced by shift operations, they present three techniques: 1) Dependency-Aware Thread-Data Mapping (DTM) which resolves dependencies via selective recomputation, 2) Shift Rebalancing (SR) which restructures the dataflow graph to reduce synchronization, and 3) Zero Block Skipping (ZBS) which adds conditional checks to avoid computation on sparse, zero-filled blocks. The authors claim their system achieves a geometric mean speedup of 19.5x over ngAP, a state-of-the-art GPU regex engine.
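To make the interleaved execution model and the recomputation-based dependency handling concrete, here is a scalar Python sketch. It is our own toy (matching the literal "ab"); the block size, overlap handling, and function names are assumptions, there is no GPU parallelism, and it mirrors the paper's idea only at a high level.

```python
# Scalar sketch of interleaved block-wise bitstream execution with
# overlap recomputation. Bitstreams are Python ints (bit k = input position
# k). Pipeline: mark 'a' positions, shift by 1, AND with 'b' positions.
# The shift needs 1 bit from the previous block, so each block recomputes
# an overlap window instead of synchronizing through memory.
def char_stream(text: str, ch: str) -> int:
    bits = 0
    for i, c in enumerate(text):
        if c == ch:
            bits |= 1 << i
    return bits

def match_ab(text: str) -> int:
    return (char_stream(text, "a") << 1) & char_stream(text, "b")

def match_ab_blockwise(text: str, block: int, overlap: int) -> int:
    result = 0
    for start in range(0, len(text), block):
        lo = max(0, start - overlap)               # recompute window start
        window = text[lo:start + block]
        local = match_ab(window) >> (start - lo)   # drop recomputed prefix
        mask = (1 << min(block, len(text) - start)) - 1
        result |= (local & mask) << start
    return result

text = "xxabyyabzzab" * 5
assert match_ab_blockwise(text, block=8, overlap=1) == match_ab(text)
```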
Strengths
- Correct Problem Identification: The paper correctly identifies a critical performance bottleneck in the naive GPU implementation of bitstream-based regex matching—namely, the memory traffic and lack of data reuse inherent in executing instructions in separate, sequential loops (as described in Section 3.2, page 4).
- Sound Core Concept: The proposed interleaved execution model is a logical approach to improving data locality. Fusing operations into a single loop to keep intermediate results in registers is a well-established optimization principle, and its application here is well-motivated.
- Comprehensive Technical Approach: The authors demonstrate a thorough understanding of the challenges arising from their proposed execution model. The three primary contributions (DTM, SR, ZBS) form a cohesive set of solutions that systematically address the problems of cross-block dependencies, synchronization overhead, and redundant computation.
Weaknesses
My primary concerns with this work relate to the robustness of the core methodology, the fairness of the experimental comparison, and the practical scalability of the proposed system.
-
Unresolved Failure Mode in Core Technique: The Dependency-Aware Thread-Data Mapping (DTM) relies on recomputing data from an overlapping window of the previous block. The paper itself acknowledges a critical limitation: the required overlap distance can exceed the size of a single block, making dependencies unresolvable with the current method (Section 8.2, page 12, "Discussion: Limits of Overlap Distance in DTM"). The authors propose a "fallback mechanism" but state, "We leave the fallback mechanism as future work." This is a significant flaw. The system, as presented, is not robust. The evaluation may have succeeded only because the chosen benchmarks (Table 1, page 10) do not contain patterns (e.g.,
/.*/ on long lines, or large bounded repetitions) that trigger this known failure condition. The claims of the paper rest on a technique that is demonstrably incomplete.
Questionable Comparison to ngAP: The headline claim of a 19.5x speedup over ngAP is staggering and requires extraordinary evidence. However, the comparison appears to be between two fundamentally different execution paradigms. BITGEN uses a bit-parallel model derived from Parabix, while ngAP uses an automata-based model. The performance of these models is highly dependent on the specific characteristics of the regex set and input data. The paper offers a brief explanation that ngAP suffers from irregular memory access and limited worklist size (Section 8.1, page 10), but fails to provide a rigorous analysis of why the chosen benchmarks are so particularly unsuited to an automata-based approach and so perfectly suited to a bitstream approach. Without this analysis, the 19.5x speedup seems less like a fundamental improvement and more like an artifact of benchmark selection that favors the authors' paradigm.
-
Overstated Novelty of Zero Block Skipping: The paper claims that interleaved execution "enables" Zero Block Skipping (ZBS), implying that it is not possible in a sequential model (Section 6, page 9). This is an overstatement. A sequential block-wise implementation could readily check if an intermediate bitstream block is all zeros after one kernel finishes and, if so, skip the launch of subsequent kernels that operate on it. While the authors' fused approach may implement this check more efficiently (avoiding kernel launch overhead), it does not uniquely enable it. The claim should be tempered to reflect a benefit of efficiency rather than capability.
-
Inherent Scalability Limitation: The execution model assigns a regex group to a single Cooperative Thread Array (CTA) to "simplify synchronization" (Section 9, page 13). The authors admit this comes "at the cost of per-regex scalability." This is not a minor trade-off; it is a primary architectural bottleneck. For any workload dominated by one or a few highly complex regexes, the system has no mechanism to parallelize the work beyond a single SM. The impressive throughput numbers reported may only be achievable on workloads with thousands of simple regexes that can be evenly distributed. This severely limits the applicability of the system in more challenging, real-world scenarios.
Questions to Address In Rebuttal
-
Regarding the unhandled failure mode of DTM: Please provide experimental results for regex patterns known to generate large overlap requirements (e.g.,
(.*){N} or [a-z]{16384,} on matching input). How does the system perform, and at what point does it fail? Justifying the system's correctness requires addressing this worst-case behavior, not just the average cases presented.
Regarding the 19.5x speedup over ngAP: Can you provide a more detailed, quantitative analysis of the benchmark regexes to demonstrate that the performance gap is not merely an artifact of comparing two disparate execution models on a dataset that strongly favors one? For instance, what is the ratio of literal-heavy patterns versus patterns with complex control flow (alternation, Kleene stars), and how do the two systems' performances correlate with these features?
-
Regarding the single-CTA execution model: Please quantify the performance limitation of this design. For example, what is the throughput when matching only the single most complex regex from the Snort or ClamAV benchmarks across the entire GPU, and how does this compare to the throughput of Hyperscan on a single CPU core? This would provide a more honest assessment of the system's scalability limitations.
-
Regarding the recomputation overhead of DTM: Table 5 (page 12) shows the average-case overhead is low. What is the maximum observed recomputation overhead for any single iteration in your experiments, and under what conditions does it occur? The average can hide pathologically expensive iterations that may impact tail latency.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents BITGEN, a code generator that fundamentally rethinks how bit-parallel regular expression programs are executed on GPUs. The authors identify a critical performance bottleneck in the straightforward approach: executing bitstream instructions sequentially across the entire input (instruction-wise) leads to poor data reuse and massive memory traffic for intermediate results.
The core contribution is a shift to an interleaved (block-wise) execution model, where a block of data is processed through all bitstream instructions before moving to the next block. This model is enabled by a clever technique called Dependency-Aware Thread-Data Mapping (DTM), which resolves the cross-block data dependencies inherent in SHIFT operations not by complex state-forwarding mechanisms, but by selectively recomputing a small number of overlapping bits. This core idea is further enhanced by two well-motivated compiler optimizations: Shift Rebalancing to shorten critical dependency paths and Zero Block Skipping to exploit data sparsity.
The result is a system that achieves a remarkable 19.5x geometric mean speedup over a state-of-the-art GPU automata-based engine (ngAP) and significantly outperforms highly optimized multi-threaded CPU engines like Hyperscan on many workloads.
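The zero block skipping optimization mentioned in this summary can be illustrated with a minimal sketch (ours, not the paper's kernel code): when an intermediate bitstream block is all zeros, a downstream AND that consumes it yields zero regardless of its other operand, so the block's work can be guarded out.

```python
# Minimal illustration of zero block skipping over per-block bitstreams.
def and_blocks(src_blocks: list[int], other_blocks: list[int]) -> list[int]:
    out, skipped = [], 0
    for s, o in zip(src_blocks, other_blocks):
        if s == 0:                 # zero block: result is trivially zero
            out.append(0)
            skipped += 1
            continue
        out.append(s & o)          # representative downstream instruction
    print(f"skipped {skipped}/{len(src_blocks)} blocks")
    return out

blocks = [0, 0, 0b1010, 0, 0b0001, 0]          # sparse intermediate stream
other  = [0b1111] * len(blocks)
print(and_blocks(blocks, other))
```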
Strengths
-
A Powerful and Elegant Core Idea: The central concept of switching from an instruction-wise to a block-wise execution model is a paradigm shift for this problem on GPUs. The decision to resolve dependencies via recomputation is particularly insightful. It trades a small amount of redundant computation—a resource GPUs have in abundance—for a massive reduction in memory traffic and synchronization overhead, which are often the true bottlenecks. This is a beautiful example of architecture-aware algorithm design. The clarity of this idea is well-illustrated in the contrast between Figure 1(a) and Figure 1(b) on page 1.
-
Exceptional Performance and Empirical Validation: The performance gains reported are not merely incremental; they represent an order-of-magnitude improvement over prior GPU art. The thorough evaluation against established CPU and GPU baselines (Section 8, pages 10-13) convincingly demonstrates the effectiveness of the proposed approach. The performance breakdown analysis (Figure 12, page 11) is a particular strength, as it clearly isolates the contribution of each proposed technique, confirming that the entire system is well-designed, not just a single trick. The analysis of DRAM traffic reduction (Table 4, page 12) directly validates the paper's initial hypothesis about data reuse.
-
Excellent Synthesis of Existing Concepts into a Novel Framework: This work does not exist in a vacuum, and its strength lies in how it connects disparate fields. It builds upon the bitstream program representation from the CPU-focused world of Parabix [24, 50]. It then applies classic compiler optimization principles, such as dependency graph restructuring (Shift Rebalancing) and value-based optimization (Zero Block Skipping), to this representation. Finally, it tailors the execution model specifically for the GPU architecture (data-parallelism, memory hierarchy, CTA-local synchronization). The result is a system that is greater than the sum of its parts and successfully ports a powerful but previously CPU-bound paradigm to the GPU with resounding success.
-
Opens a New Avenue for GPU Stream Processing: While the paper is framed around regex matching, the core pattern of "interleaved execution with selective recomputation" has broader implications. It provides a template for executing any data-parallel stream processing algorithm with local, sliding-window dependencies (e.g., 1D stencils, certain sequence alignment tasks) on a GPU without suffering from the "halo exchange" problem at the global memory level. The paper effectively introduces a new, powerful pattern for GPU computing.
Weaknesses
-
Limited Discussion of Scalability Beyond a Single CTA: The current model simplifies synchronization by assigning a group of regexes to a single Cooperative Thread Array (CTA), as mentioned in Sections 3.1 and 9. While this is a pragmatic choice for the multi-pattern problem, it sidesteps the challenge of scaling a single, extremely complex regex pattern across multiple CTAs. Such a scenario would require inter-CTA synchronization to handle cross-block dependencies, a notoriously difficult problem on GPUs. A more thorough discussion of the architectural implications and potential solutions for this would strengthen the paper's positioning.
-
Understated Generality of the Core Technique: The authors frame their work tightly around regex matching. However, as noted in the strengths, the underlying technique of resolving sliding-window dependencies via recomputation is potentially much more general. The paper could have a greater impact if it dedicated a short section to discussing how this execution model could be abstracted and applied to other stream-processing domains that have similar dependency patterns. This would elevate the contribution from a "fast regex engine" to a "novel parallel execution strategy for stream computations."
-
Potential for Pathological Cases in Dynamic Overlap: The paper handles dynamic overlap distances arising from `while` loops (Figure 7, page 7), which is a non-trivial accomplishment. However, the analysis of its limits feels brief (Discussion in Section 8.2, page 12). Regular expressions can contain nested Kleene stars (e.g., `/(a*b*)*/`), which could potentially lead to data-dependent overlap requirements that grow very rapidly. A deeper discussion of the theoretical bounds of this overlap, and of the performance cliff that might occur if it exceeds the practical limit (e.g., the size of a block), would provide a more complete picture of the technique's robustness.
Questions to Address In Rebuttal
-
Regarding the single-CTA execution model: What are the primary challenges you foresee in extending the Dependency-Aware Thread-Data Mapping to a multi-CTA model for a single, large bitstream program? Would it require global synchronization, or could a more asynchronous, pipelined model between CTAs be devised?
-
The core idea of trading recomputation for memory locality is powerful. Do you view this as a general-purpose parallel programming pattern for GPUs? Could the BITGEN compiler framework be extended to support other stream-processing DSLs that exhibit similar local dependency structures?
-
Could you elaborate on the worst-case behavior of the dynamic overlap distance calculation? Are there specific classes of regular expressions or input data patterns that would cause the required overlap Δ to become prohibitively large, and if so, how does your system currently handle such a scenario? You mention a "fallback mechanism" as future work; could you briefly sketch what that might entail?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present BITGEN, a code generator for compiling multi-pattern regular expressions into efficient GPU kernels. The central claim of novelty lies in a shift from the standard sequential, instruction-by-instruction execution of bitstream programs to an "interleaved" execution model. In this model, all bitstream instructions are fused into a single block-wise loop. To make this non-trivial transformation correct and efficient, the authors introduce three key techniques: 1) Dependency-Aware Thread-Data Mapping (DATDM), which resolves cross-block data dependencies (primarily from SHIFT operations) via selective recomputation; 2) Shift Rebalancing, a compiler pass that restructures the dataflow graph to shorten dependency chains; and 3) Zero Block Skipping, which exploits bitstream sparsity within the interleaved model to avoid redundant work. The authors claim this new execution model and its enabling techniques yield significant performance improvements over state-of-the-art CPU and GPU regex engines.
Strengths
My evaluation is based exclusively on the novelty of the proposed ideas relative to prior art.
-
Novel Execution Model for Bitstream Programs: The core concept of interleaving the execution of a complete bitstream program on a per-data-block basis is a genuine departure from the established model described in works like Parabix [24, 50], which processes one instruction at a time across all data. While loop fusion is a classic compiler optimization, its application here is non-trivial due to the unique cross-block, data-dependent control flow inherent to bitstream programs. The novelty is not "loop fusion" itself, but the creation of a viable framework to apply it to this specific, challenging domain on GPUs.
-
Principled Resolution of Cross-Block Dependencies: The primary obstacle to the interleaved model is resolving dependencies from `SHIFT` operations across block boundaries. The authors' proposed solution, Dependency-Aware Thread-Data Mapping (DATDM), is novel. Instead of attempting complex state-forwarding mechanisms, they opt for selective recomputation. The method for systematically analyzing the dataflow graph to calculate the required overlap distance (Δ), including handling dynamic distances from `while` loops (Section 4.2, page 6), appears to be a new and well-defined technique. It provides a principled way to manage the recompute-vs-communicate trade-off for this problem.
-
Novel Compiler Optimization (Shift Rebalancing): The concept of operand rewriting (e.g., `(A >> n) & B` -> `(A & (B << n)) >> n`) is based on fundamental algebraic properties. However, the systematic application of this identity in a dedicated compiler pass to rebalance a bitstream program's dataflow graph for improved GPU instruction-level parallelism (Section 5, page 7) is a novel contribution in compilation engineering. It addresses a specific performance bottleneck (long dependency chains) created by the bitstream abstraction; a small mechanical check of the identity is sketched after this list.
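For concreteness, the rebalancing identity can be checked mechanically for fixed-width logical shifts. The sketch below is my own illustration of why the rewrite is sound and why it shortens the critical path; it says nothing about how BITGEN's pass selects rewrite sites.

```cpp
#include <cassert>
#include <cstdint>

int main() {
    const uint64_t A = 0xDEADBEEFCAFEF00DULL;  // assume: result of a long dependency chain
    const uint64_t B = 0x0123456789ABCDEFULL;  // assume: available early
    for (unsigned n = 0; n < 64; ++n) {
        uint64_t lhs = (A >> n) & B;            // original form: A's shift sits on the critical path
        uint64_t rhs = (A & (B << n)) >> n;     // rebalanced: B's shift can execute before A arrives
        assert(lhs == rhs);                     // identity holds for logical shifts at fixed width
    }
    return 0;
}
```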
Weaknesses
My concerns relate to the boundaries and fundamental nature of the claimed novelty.
-
Limited Discussion of the Novelty's Generality: The core enabling technique, DATDM, relies on recomputing a finite number of bits from an adjacent block. As the authors themselves briefly acknowledge in "Discussion: Limits of Overlap Distance in DTM" (Section 8.2, page 12), regex constructs like `/.*/` on long lines can create overlap requirements that exceed the block size, breaking the model. The authors dismiss this by stating their benchmarks do not trigger this and suggesting a "fallback mechanism" as future work. For a core contribution, this is a significant limitation. The novelty is presented as a general model, but its applicability appears bounded in a way that is not fully explored.
-
Incremental Nature of Secondary Optimizations: While Shift Rebalancing and Zero Block Skipping are novel in their specific application, the underlying concepts are familiar. Operand rewriting for balancing expression trees is a known compiler technique, and exploiting sparsity is fundamental in many high-performance domains. The novelty here lies in the synergy with the interleaved model. However, if a different method were to solve the cross-block dependency problem, the contribution of these secondary optimizations would be diminished. Their novelty is contingent on the novelty of the DATDM approach.
-
Overlapping Concepts with Prior Art in Speculative Execution: The paper differentiates itself from Qiu et al. [59] by stating they handle data dependencies via recomputation, whereas Qiu et al. handle control dependencies via speculation on CPUs (Section 9, page 13). This distinction is valid but narrow. Both works address the challenge of breaking sequential dependencies in a parallelized stream-processing context. The choice of recomputation over speculation for GPUs is a key engineering decision, but the fundamental problem being tackled—enabling non-sequential processing of a dependent stream—shares significant conceptual overlap. The paper could do more to argue why recomputation is fundamentally a more novel or superior approach in the GPU context, beyond empirical results.
Questions to Address In Rebuttal
-
Regarding the limitations of DATDM: Can the authors provide a more concrete analysis of which specific, real-world regex constructs (beyond `/.*/`) would break the proposed model by requiring an unbounded or excessively large overlap distance? How would the proposed "fallback to sequential execution" interact with the rest of the fused kernel, and what would its performance implications be?
-
The paper distinguishes its dependency resolution from Qiu et al. [59]. Could the authors elaborate on whether speculation and recomputation are mutually exclusive solutions for this problem on GPUs? Could a hybrid approach exist? Please provide a more fundamental argument as to why recomputation is the superior and more novel paradigm for resolving data dependencies from `SHIFT` operations in a massively parallel, high-latency GPU environment compared to a speculative model.
-
The Shift Rebalancing optimization is presented as an iterative, greedy algorithm. Is there any theoretical basis to suggest this approach nears an optimal DFG balance for this problem, or is it purely a heuristic? The novelty of the engineering would be strengthened if its theoretical properties were better understood.
SoftWalker: Supporting Software Page Table Walk for Irregular GPU Applications
Abstract
Address translation has become a significant and growing performance bottleneck in modern GPUs, especially for emerging irregular applications with high TLB miss rates. The limited concurrency of hardware Page Table Walkers (PTWs), due to their small and ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present SoftWalker, a framework that offloads GPU page table walks from dedicated hardware PTWs to software threads, termed "Page Walk Warps" (PW Warps), running on SMs. The central thesis is that for irregular applications, the primary bottleneck in address translation is not the latency of a single page walk, but the severe queueing delay caused by contention for a limited number of hardware PTWs. By leveraging "idle" GPU cycles to execute page walks in software, the authors claim to achieve massive parallelism, thereby eliminating queueing delay. The proposal is supported by two main architectural additions: dedicated PW Warps to isolate translation from computation, and "In-TLB MSHRs" to expand the capacity for tracking outstanding misses by repurposing L2 TLB entries. The authors claim an average speedup of 2.24x (3.94x for irregular workloads).
While the diagnosis of the problem (PTW contention) is sound, the proposed solution appears to be a costly and complex hardware/software co-design masquerading as a simple software framework. The evaluation relies on a favorable baseline and assumptions that seem to minimize the very real overheads introduced by the software-based approach, and the security implications are not sufficiently addressed.
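For reference, the control flow being moved into software is essentially a standard multi-level radix page-table walk. The sketch below is my own generic illustration (x86-64-style levels over a simulated physical memory), not the authors' PW Warp routine, which relies on the proposed LDPT-style instructions rather than ordinary loads.

```cpp
#include <cstdint>
#include <vector>

// Minimal sketch of a 4-level radix walk, the per-miss routine a software
// page-table walker conceptually executes. All constants assume an
// x86-64-style layout purely for illustration.
constexpr int      LEVELS       = 4;
constexpr int      BITS_PER_LVL = 9;      // 512 entries per table
constexpr int      PAGE_SHIFT   = 12;     // 4 KiB pages
constexpr uint64_t PTE_PRESENT  = 0x1;
constexpr uint64_t FRAME_MASK   = 0x000FFFFFFFFFF000ULL;

// phys_mem is indexed in 8-byte words; returns 0 on a non-present entry (fault).
uint64_t software_walk(const std::vector<uint64_t>& phys_mem,
                       uint64_t page_table_base, uint64_t vaddr) {
    uint64_t table = page_table_base;                     // physical address of root table
    for (int level = LEVELS - 1; level >= 0; --level) {
        int shift = PAGE_SHIFT + level * BITS_PER_LVL;
        uint64_t index = (vaddr >> shift) & ((1ULL << BITS_PER_LVL) - 1);
        uint64_t pte = phys_mem[(table >> 3) + index];    // stands in for a TLB-bypassing PTE load
        if (!(pte & PTE_PRESENT)) return 0;               // fault path (not modeled here)
        table = pte & FRAME_MASK;                         // next-level table, or final frame
    }
    return table | (vaddr & ((1ULL << PAGE_SHIFT) - 1));  // frame base + page offset
}
```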
Strengths
- Problem Characterization: The paper provides a compelling analysis motivating the work. The identification in Section 3.2 (Page 5) that queueing delay constitutes up to 95% of the total page walk latency for irregular workloads is a strong and clear insight.
- Empirical Motivation: The microbenchmark results from a real NVIDIA A2000 GPU (Figure 4, Page 3), which demonstrate a 4x increase in memory access latency with 256 concurrent walks, provide a solid, real-world grounding for the contention problem.
- Clear Presentation of Core Concept: The conceptual overview in Figure 1 and the latency comparison in Figure 9 (Page 6) effectively communicate the paper's core trade-off: accepting a modest increase in per-walk latency to drastically reduce total latency by eliminating queueing.
Weaknesses
-
Mischaracterization as a "Software" Solution: The proposal is repeatedly framed as shifting work "from fixed-function hardware to software execution." This is misleading. SoftWalker requires significant and non-trivial hardware modifications.
- Dedicated PW Warp Contexts (Section 4.2, Page 7): The paper states that SoftWalker provisions "dedicated architectural slots for the PW Warp, including an instruction buffer entry, scoreboard entries, and SIMT stack entries." This is not a "software" change; it is a modification to the core SM pipeline and resource allocation hardware. The claim of "minimal hardware overhead" is therefore unsubstantiated. The overhead analysis in Section 5.2 (Page 9) only quantifies storage (1470 bits), ignoring the cost and complexity of the required control logic modifications in the warp scheduler and pipeline.
- Required ISA Extensions (Section 4.3, Page 7): The proposal mandates four new instructions (`LDPT`, `FL2T`, `FPWC`, `FFB`). Adding and decoding new instructions, particularly privileged ones like `LDPT`, which bypasses the TLB, represents a fundamental change to the processor's ISA and hardware, not a simple software routine.
-
Unconvincing Security Model: The security implications of the new ISA are not rigorously handled.
- The `LDPT` instruction, which loads a page table entry using a physical address and bypasses the TLB, is extremely powerful. The authors' security argument in Section 5.1 (Page 9) relies entirely on the premise that only the isolated PW Warp can execute it. However, the paper fails to describe the hardware mechanism that enforces this restriction. What prevents a malicious user-space kernel from discovering the opcode for `LDPT` and using it to read arbitrary physical memory? A robust security model would require hardware-level privilege checks, which are not discussed.
-
Flawed Performance Trade-offs and Evaluation: The evaluation appears to be constructed to amplify the benefits of SoftWalker while obscuring its costs.
- The "In-TLB MSHR" is TLB Pollution: The mechanism described in Section 4.5 (Page 8) repurposes valid L2 TLB entries to store miss metadata. For any application that is not pathologically irregular—i.e., any application with some degree of spatial or temporal locality—this will lead to the eviction of useful translations, increasing the overall TLB miss rate. The authors tacitly admit this weakness by proposing a "Hybrid Approach" (Section 5.4) for regular applications, which confirms the pure software walker is detrimental in those cases. The evaluation lacks experiments on mixed regular/irregular workloads that would expose this critical flaw.
- Unrealistic Communication Latency Modeling: The performance of SoftWalker is highly sensitive to the communication latency between the L2 TLB and the SM executing the page walk. In Section 6.1 (Page 11), the authors state this is modeled as "equal to the L2 TLB access latency." This seems optimistic. The process involves the Request Distributor, the SoftWalker Controller, and instruction fetch/execution on the SM pipeline, which likely incurs additional cycles beyond a simple L2 access. The sensitivity study in Figure 22 (Page 13) shows performance degrading with higher latency, underscoring the criticality of this assumption, which itself lacks rigorous validation.
- Weak Baseline: The baseline architecture uses 32 hardware PTWs. While this is a plausible configuration for some GPUs, high-end designs may feature more. By comparing against a relatively constrained baseline, the severity of the queueing problem is maximized, making the gains from SoftWalker appear larger than they might be against a more robust hardware alternative.
Questions to Address In Rebuttal
- Please provide a detailed hardware cost analysis (in terms of area and design complexity) for the "dedicated architectural slots" required for a PW Warp in each SM. How does this compare to the cost of simply adding more conventional hardware PTWs (e.g., doubling them to 64)?
- What specific hardware mechanism prevents a user-level thread from executing the `LDPT` instruction? Is there a new privilege level or hardware flag, and if so, what is its cost and how is it managed?
- The In-TLB MSHR mechanism evicts potentially useful L2 TLB entries. Please provide an evaluation of a workload that mixes frequent, localized memory accesses (which benefit from the TLB) with sporadic, irregular accesses. I expect this to demonstrate performance degradation due to TLB pollution from In-TLB MSHRs.
- Please provide a justification for modeling the L2-to-SM request distribution and software walker initiation latency as being equal to a single L2 TLB access. A more detailed cycle breakdown of this critical path is needed to validate the performance claims.
- How does the performance of SoftWalker compare against a baseline with 128 PTWs? Your own area analysis in Figure 15 (Page 10) suggests that 128 PTWs is a comparable design point in terms of area cost. A direct performance comparison is necessary for a fair evaluation.
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces SoftWalker, a novel framework that fundamentally shifts GPU page table walking from fixed-function hardware to a scalable, software-based execution model. The authors identify that for emerging irregular applications (e.g., graph analytics, sparse linear algebra), the primary bottleneck in address translation is not the latency of a single page walk, but the massive queueing delay caused by contention on a small, fixed number of hardware Page Table Walkers (PTWs).
The core contribution is to leverage the GPU's inherent massive thread-level parallelism and abundant stall cycles to perform these walks in software. SoftWalker dynamically dispatches specialized, lightweight "Page Walk Warps" (PW Warps) on the SMs to handle TLB misses concurrently. To support this high degree of parallelism, the paper also introduces "In-TLB MSHRs," a clever mechanism that repurposes underutilized L2 TLB entries to track outstanding misses, thereby overcoming the limited capacity of dedicated hardware MSHRs. The evaluation demonstrates significant performance improvements, particularly for irregular workloads, by effectively eliminating translation queueing delays.
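To illustrate what repurposing an L2 TLB entry as a miss-tracking slot might look like, here is a rough data-structure sketch. Every field and function name is my own guess at one plausible organization, not the paper's actual design.

```cpp
#include <cstdint>

// Rough sketch of an L2 TLB entry that can operate either as a translation
// or, when the dedicated MSHR file is full, as an "In-TLB MSHR" tracking an
// outstanding miss. Layout and names are assumptions for illustration only.
struct L2TlbEntry {
    uint64_t vpn;
    bool     valid;
    bool     is_mshr;        // false: holds a translation; true: tracks an outstanding miss
    uint64_t ppn;            // meaningful only when is_mshr == false
    uint64_t waiter_mask;    // meaningful only when is_mshr == true: requesters to wake on fill
};

// On a page-walk fill, an entry in MSHR mode wakes its waiters and then reverts
// to being an ordinary translation entry for the newly walked page.
template <typename WakeFn>
void on_fill(L2TlbEntry& e, uint64_t new_ppn, WakeFn&& wake) {
    if (e.is_mshr) wake(e.waiter_mask);
    e.is_mshr = false;
    e.ppn     = new_ppn;
    e.valid   = true;
}
```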
Strengths
-
Elegant and Foundational Re-imagination of a Core Problem: The central idea of SoftWalker is exceptionally strong. It revisits the classic architectural debate between hardware and software-managed address translation, but brilliantly re-contextualizes it for the unique characteristics of a GPU. The authors correctly observe that while software page walks are too costly in traditional CPUs due to expensive traps and context switches, they are a natural fit for GPUs, where near-zero-cost warp switching is a foundational design principle. This transforms what is typically a liability for irregular workloads—long-latency memory stalls—into an opportunity to perform productive, parallel work. The conceptual diagram in Figure 1 (page 2) powerfully illustrates this conversion of stall cycles into progress.
-
Excellent Problem Analysis and Motivation: The paper does an outstanding job of motivating the work. The analysis in Section 2.2 and 3.2 is compelling. In particular, Figure 7 (page 5), which breaks down page walk latency, provides the "smoking gun" for the entire paper: queueing delay constitutes up to 95% of the total latency for irregular workloads. This clear, data-driven insight makes the authors' subsequent design choices feel not just logical, but necessary.
-
Holistic and Practical System Design: The proposal is not a single-trick pony. The authors demonstrate a deep, system-level understanding by identifying and addressing the next bottleneck. After proposing thousands of software walkers, they rightly identify that the limited number of hardware MSHRs would become the new point of contention. Their solution, In-TLB MSHRs (Section 4.5, page 8), is a clever and resource-efficient technique that again leverages an underutilized resource (the L2 TLB itself) to solve the problem. This foresight strengthens the entire proposal.
-
Awareness of Broader Architectural Context: SoftWalker fits neatly into a broader research theme of using "helper threads" or "assist warps" to accelerate system-level tasks (e.g., CABA [93] for compression). This work provides one of the most compelling use cases for this paradigm by applying it to the fundamental process of address translation. Furthermore, the inclusion of a hybrid approach (Section 5.4, page 10) demonstrates pragmatism. The authors acknowledge that their software-first approach might penalize latency-sensitive regular workloads and propose a practical path to deployment that retains existing hardware, making the idea far more compelling for real-world adoption.
Weaknesses
While the core idea and execution are strong, the paper could be strengthened by a deeper exploration of its broader implications.
-
Security Implications of Privileged Software Execution: Section 5.1 (page 9) provides a solid initial discussion of resource isolation. However, moving a privileged operation—one that deals directly with physical page table addresses—into a software-like execution flow on the SM represents a significant shift in the GPU's security and trust model. While direct access between warps is prevented, the discussion could benefit from considering more subtle side-channel attacks (e.g., through timing, cache, or memory controller contention) that could arise from co-locating these privileged PW Warps with untrusted user warps on the same SM resources.
-
Interaction with Future Heterogeneous Memory Systems: The paper positions itself in the context of current GPU architectures. However, the field is rapidly moving towards more complex, heterogeneous systems enabled by interconnects like CXL. In these environments, address translation may become more complex, potentially spanning multiple memory domains with different latency characteristics. The paper would be more forward-looking if it discussed how the SoftWalker model might extend to these multi-hop translation scenarios, where a single page walk could involve traversing structures across a CXL link.
-
Software Ecosystem Complexity: The proposal relies on a host driver to pre-load the PW Warp code and orchestrate its execution. While architecturally sound, this introduces new complexity into the driver and runtime stack. A brief discussion on the anticipated software engineering effort and the interface between the hardware controller and the driver would add a valuable layer of practical analysis.
Questions to Address In Rebuttal
-
Regarding security, can the authors elaborate on the threat model they considered? Beyond direct register/shared memory access, what is their assessment of potential timing or contention-based side channels between a privileged PW Warp and co-resident user warps? Does giving the PW Warp highest scheduling priority introduce any predictable timing patterns that could be exploited?
-
How does the SoftWalker framework adapt to future memory systems? Specifically, in a CXL-enabled system with tiered memory, a page walk might require traversing page tables located in remote, high-latency memory. How would the fixed instruction sequence of a PW Warp handle such variable and potentially very long latencies? Does this scenario weaken the benefits of the software approach?
-
The hybrid model is a key practical feature. Could the authors provide more detail on the policy within the Request Distributor that chooses between hardware PTWs and software PW Warps? For instance, how does it handle the transition when hardware PTWs become fully saturated? Is there a risk of creating bubbles or inefficiencies in the hand-off process itself?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper, "SoftWalker," proposes a framework for handling GPU page table walks in software rather than with fixed-function hardware Page Table Walkers (PTWs). The core idea is to leverage the GPU's own massive thread-level parallelism by dispatching dedicated, lightweight software threads ("Page Walk Warps" or PW Warps) on Streaming Multiprocessors (SMs) to resolve TLB misses. To support this high degree of parallelism, the authors also introduce "In-TLB MSHRs," a mechanism that repurposes underutilized L2 TLB entries to track outstanding misses when the dedicated MSHR hardware is saturated. The authors claim this software-defined approach fundamentally addresses the scalability limitations of hardware PTWs, especially for irregular applications with high TLB miss rates.
My analysis concludes that the fundamental concept of software-managed address translation is not new. However, the paper's primary novel contribution is the adaptation and architectural integration of this concept into the unique, massively parallel execution model of modern GPUs. The novelty lies not in the "what" (software page walks) but in the "how" and "where" (using the GPU's own compute fabric at scale).
Strengths
The main strength of this work is its core insight: the trade-offs that made software-managed translation unattractive for latency-sensitive CPUs are inverted in the context of a throughput-oriented GPU. The authors correctly identify (Section 3.3, page 5) that the high context-switching overhead of CPU exception handling is a key impediment, whereas GPUs are designed for near-zero-cost switching between thousands of threads. Exploiting this architectural feature to parallelize a fundamental OS-level task is a conceptually novel application.
The supporting architectural mechanisms, while drawing from existing concepts, are integrated in a novel way to solve this specific problem:
- PW Warp Isolation: The concept of a specialized, privileged warp is a necessary and non-trivial extension of the "assist warp" pattern seen in prior work (e.g., CABA [93] for compression). Applying this pattern to a privileged and security-sensitive task like page table walking, with the required resource partitioning, represents a significant delta.
- In-TLB MSHRs: While the paper honestly cites prior art for "In-Cache MSHRs" ([21] Farkas and Jouppi, 1994), its application to a TLB to solve the secondary bottleneck created by their own high-throughput walker is a clever and contextually novel engineering step. Without this, the primary contribution would be less effective.
Weaknesses
The primary weakness from a novelty perspective is that the paper's foundational ideas are adaptations of well-established concepts from other domains. A critical reader with deep knowledge of prior art will immediately recognize the parallels.
- Software-Managed TLBs: The concept of a software handler for a TLB miss is decades old, dating back to architectures like MIPS and Alpha. The paper acknowledges this for CPUs in Section 3.3 (page 5) but could do more to frame its contribution as a novel re-imagining of this old idea for a new architectural paradigm, rather than presenting software page walks as a fundamentally new idea in and of itself.
- Use of "Assist Warps": The idea of co-opting GPU execution resources for system-level or helper tasks is not entirely new. Prior work such as CABA [93] for data compression and CUDA-DMA [10] for memory copies established the pattern of warp specialization. SoftWalker's novelty is in the target application (address translation) and the required privilege level, which is a significant distinction, but the underlying mechanism of using a specialized warp is an extension of this existing pattern.
- In-TLB MSHRs: As noted, this is a direct adaptation of the "In-Cache MSHR" concept. The novelty is in the application, not the core mechanism. The paper is transparent about this, but it must be recognized that this is an incremental, albeit important, innovation.
The complexity of the proposed ISA extensions (Section 4.3, page 7) and new hardware components (SoftWalker Controller, Request Distributor) is non-trivial. While the performance gains for irregular workloads are substantial (average 3.94x speedup), the justification for this added complexity rests entirely on the importance of this specific workload class. For regular workloads, the system introduces a performance regression, which is mitigated by a hybrid model that essentially retains the old hardware path. This suggests the novel mechanism is a specialized accelerator rather than a universal replacement, somewhat narrowing the scope of the innovation's impact.
Questions to Address In Rebuttal
- The concept of software-managed address translation has a long history in CPU architectures. Please elaborate further on the specific architectural features of GPUs (beyond fast context switching) that make this old idea newly viable. Conversely, what undiscovered challenges arise when porting this model from a single-core, deep-pipeline CPU to a massively parallel, wide-pipeline SM?
- Please position the "PW Warp" concept more directly against the "assist warp" pattern from prior work like CABA [93]. What are the fundamental architectural differences required to support a privileged, system-critical task like address translation compared to an application-level helper task like compression? Specifically, how is security and isolation enforced at the hardware level beyond what was proposed in previous assist warp schemes?
- The novelty of "In-TLB MSHRs" lies in its application to a new structure. Were there any non-obvious technical challenges in adapting the In-Cache MSHR idea to a TLB? For instance, do the tag/data organization, replacement policies, or interactions with page table updates in memory introduce complexities not present in a traditional data cache?
LATPC: Accelerating GPU Address Translation Using Locality-Aware TLB Prefetching and MSHR Compression
Abstract
Modern Graphics Processing Units (GPUs) support virtual memory to ease programmability and concurrency, but still suffer from significant address translation overhead due to frequent Translation Lookaside Buffer (TLB) misses and limited TLB Miss-Status ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose LATPC, a hardware mechanism to accelerate GPU address translation by exploiting intra-warp VPN regularity. It comprises three main components: a "Regularity Detector" to identify strided VPN access patterns within a warp, Locality-Aware MSHR Compression (LATC) to merge multiple TLB miss requests into a single compressed MSHR entry, and Locality-Aware TLB Prefetching (LATP) to batch page table walks for PTEs residing in the same page table. The authors claim a 1.47x geometric mean speedup over a baseline system, asserting superiority over existing prefetchers and state-of-the-art speculative translation mechanisms like Avatar.
While the proposed mechanism demonstrates significant speedup on paper, the evaluation rests on several questionable assumptions, particularly regarding the implementation of prior work and the practicality of the proposed hardware. The claims of effectiveness are potentially inflated due to an uncharitable comparison against the state-of-the-art.
Strengths
- Problem Identification: The paper correctly identifies TLB MSHR contention and page table walker (PTW) occupancy as key bottlenecks in modern GPU address translation, as substantiated by the initial analysis in Figure 1. This motivation is sound.
- Core Insight: The observation that VPNs within a warp often exhibit strong regularity with respect to thread index rather than temporal access order (Section 4.1, Figure 7) is a valuable insight. Shifting the frame of reference for pattern detection from temporal to spatial (within the warp) is a logical approach.
- Component-wise Analysis: The evaluation effectively isolates the performance contributions of the prefetching (LATP) and compression (LATC) components (Figure 17a), which provides clarity on the source of the claimed performance gains.
Weaknesses
-
Unconvincing Comparison to Prior Art: The evaluation of prior work, particularly Valkyrie [14], appears to be based on a configuration that does not represent its full potential. The authors' own discussion (Section 7, page 12) reveals that simply adjusting the L2 TLB MSHR configuration improves Valkyrie's GMean speedup from 1.03x to 1.20x. This is a critical admission. It suggests the main evaluation in Figure 17 significantly understates the performance of a key state-of-the-art competitor, thereby inflating the relative gains of LATPC. A fair comparison would use an optimized configuration for all evaluated schemes.
-
Overly Optimistic Hardware Cost and Complexity: The practicality and cost of the per-SM Regularity Detector are questionable. The working model (Figure 12) processes one unique VPN per cycle. For a warp with high page divergence (e.g., 32 unique pages), this implies a serialization latency of up to 32 cycles before the full access pattern is determined and can be acted upon. The assumption of a single-cycle latency for this entire detection process (Section 5.5, page 8) is not justified for this clear worst-case scenario and seems entirely unrealistic. This un-accounted for latency could erode a significant portion of the claimed performance benefits.
-
Ambiguous and Contradictory "Accuracy" Metric: The claims regarding prefetch accuracy are confusing and poorly defined. In Section 6.5 (page 11), the authors state that for several workloads, LATPC issues no incorrect prefetches, resulting in "100% accuracy," but then choose to report this as 0% "to avoid overstating the benefits." This is methodologically unsound. If no prefetches are issued, accuracy is an undefined or irrelevant metric. If prefetches are issued and all are correct, the precision is 100%. Reporting 0% is not conservative; it is incorrect and obfuscates the true behavior of the prefetcher on those workloads. The standard metrics of coverage and precision should be used clearly and consistently.
-
Limited Architectural Scope of Evaluation: The evaluation is based on a simulated configuration modeled after the NVIDIA Turing architecture (RTX 2060-like, Section 6.1, page 9). It is unclear how the findings generalize to more recent architectures like Ampere or Hopper, which feature significantly larger L2 TLBs (e.g., 4,096 entries, as mentioned in Section 6.7) and different memory subsystem characteristics. The sensitivity study in Figure 22b shows that simply increasing the baseline's L2 TLB entry count closes the performance gap. This suggests that the problem LATPC solves may be less severe on newer architectures, potentially making this complex hardware solution less impactful. The paper's conclusions about the general necessity of LATPC are therefore not fully supported.
-
Unsubstantiated Claim on Regularity in "Irregular" Workloads: The paper claims that even "irregular" workloads exhibit an "almost-strided pattern" within a warp (Section 4.1, page 4). However, the evidence provided in Figure 8 is weak. It shows that there are, on average, 2.49 unique strides. While less than 32, this is far from "almost-strided" and suggests a more complex pattern than a single stride can capture. The Regularity Detector as designed (Figure 12) appears to only capture a single dominant stride at a time, potentially leaving performance on the table for these more complex patterns.
Questions to Address In Rebuttal
- Please provide a revised performance comparison (revising Figure 17) where Valkyrie is evaluated using the configuration that yields a 1.20x speedup, as discussed in Section 7. Justify why the original, lower-performing configuration was chosen for the primary evaluation.
- Provide a detailed timing analysis of the Regularity Detector for a worst-case scenario (e.g., a warp with 16 or 32 unique VPNs that do not form a simple stride). How is the multi-cycle detection latency modeled in the simulator? Justify the single-cycle latency assumption used for the hardware cost analysis in Section 5.5.
- Please clarify the prefetcher evaluation metrics. Re-plot Figure 19 using the standard definitions of prefetch coverage (prefetched misses that become hits / total misses) and prefetch precision (correct prefetches / total prefetches issued). Explain how cases with zero issued prefetches are handled.
- Given that newer architectures like Ampere have 4,096 L2 TLB entries, and your own sensitivity study (Figure 22b) shows that LATPC with 2,048 entries only marginally outperforms a baseline with 4,096 entries, please discuss the relevance and expected performance benefit of LATPC on such modern or future GPUs.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces LATPC, a mechanism to accelerate GPU address translation by exploiting the often-regular spatial patterns of memory accesses within a single warp instruction. The authors correctly identify that frequent TLB misses and contention for limited Miss-Status Holding Register (MSHR) entries are primary bottlenecks in modern GPU virtual memory systems.
Instead of adopting traditional CPU-style temporal prefetching (predicting the next access for a thread), LATPC makes the key insight that the collection of accesses within a single warp instruction often forms a predictable, strided pattern. It proposes a holistic, two-part solution to leverage this insight: 1. Locality-Aware TLB Prefetching (LATP): A "Regularity Detector" identifies strided VPN patterns within a warp's unique memory requests. This information is used to batch page table walks, fetching multiple related Page Table Entries (PTEs) in a single, efficient operation that leverages DRAM row buffer locality. 2. Locality-Aware TLB MSHR Compression (LATC): The same regularity information is used to compress what would be multiple individual TLB miss requests into a single, compact MSHR entry, dramatically reducing contention on this critical resource.
The work is evaluated using a cycle-level simulator across 24 workloads, demonstrating a significant 1.47x geometric mean speedup over a baseline system, outperforming several existing prefetching schemes and the state-of-the-art speculative approach, Avatar.
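The spatial-regularity argument is easy to see with a small sketch: for a strided warp access, the lanes' VPNs form a short arithmetic sequence whose last-level PTEs typically fall within one page-table page, which is what makes a batched walk cheap. The code below is my own illustration (the known leaf-table address, index width, and entry size are assumptions), not the paper's LATP mechanism.

```cpp
#include <cstdint>
#include <vector>

// For a strided warp access, lane i touches vaddr_i = base + i * elem_stride,
// so the unique VPNs form a short arithmetic sequence and their last-level PTE
// addresses are adjacent within one 4 KiB page-table page, i.e. one batched fetch.
constexpr uint64_t PAGE_SHIFT = 12;

std::vector<uint64_t> last_level_pte_addrs(uint64_t pt_page_base,     // leaf table PA (assumed known)
                                           uint64_t base_vaddr,
                                           uint64_t elem_stride_bytes) {
    std::vector<uint64_t> ptes;
    for (int lane = 0; lane < 32; ++lane) {
        uint64_t vpn = (base_vaddr + lane * elem_stride_bytes) >> PAGE_SHIFT;
        uint64_t leaf_index = vpn & 0x1FF;                 // 9-bit last-level index
        ptes.push_back(pt_page_base + leaf_index * 8);     // 8-byte PTEs
    }
    return ptes;   // adjacent VPNs -> adjacent PTE addresses -> good DRAM row locality
}
```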
Strengths
The paper's value lies in its elegant re-framing of the GPU address translation problem and its well-designed, synergistic solution.
-
A Fundamental and GPU-Native Insight: The primary strength is its core conceptual contribution: shifting the focus of TLB prefetching from temporal prediction (what will this thread do next?) to spatial regularity (what are this thread's neighbors doing right now?). This is a profoundly important distinction. While prior work has struggled to apply CPU prefetching concepts to the chaotic temporal access streams of GPUs (as shown well in Figures 4 and 5, page 3), this paper embraces the GPU's fundamental execution primitive—the warp—as the source of predictability. The data presented in Figure 8 (page 4), showing that even "irregular" workloads exhibit few unique VPN strides within a warp, provides strong evidence for this foundational premise.
-
Holistic and Synergistic Design: LATPC is not a single trick; it is a well-conceived, holistic solution that attacks the problem from two critical angles. LATC directly addresses the resource contention bottleneck (MSHRs), while LATP addresses the latency bottleneck (page table walks). The two components are powered by the same underlying insight and mechanism (the Regularity Detector), making the design elegant. Figure 11 (page 5) provides an excellent timeline visualization of how these two components work together to resolve queueing delays that plague the baseline system.
-
Strong Grounding in the Current Research Landscape: The authors do an excellent job of situating their work. They not only compare against a gamut of traditional prefetchers but also against Avatar [84], a very recent and philosophically different (speculative) approach. The analysis in Section 6.5 (page 11), which shows that LATPC is not only competitive with Avatar but can be combined with it for even greater gains, is particularly valuable. This demonstrates a mature understanding of the field, positioning LATPC not just as a replacement for other techniques, but as a powerful, orthogonal component in the architect's toolkit.
-
Connecting Hardware Principles: The work effectively connects two well-understood hardware principles in a novel way. It takes the concept of memory access coalescing, a cornerstone of GPU performance, and applies it to the metadata of memory access—the address translations themselves. Furthermore, by batching page table walks, it recognizes that the page table itself is just another data structure in memory and that accesses to it can benefit from DRAM locality, a concept explored in other contexts but applied very effectively here.
Weaknesses
The weaknesses are less about fatal flaws and more about the boundaries of the proposed idea and areas that could be explored more deeply.
-
Characterizing the Limits of "Regularity": While the paper demonstrates benefits even for workloads classified as "irregular," the true strength of the "Regularity Detector" seems tied to almost-strided access patterns. The paper would be strengthened by a more explicit discussion of the mechanism's behavior in the face of truly pathological, pointer-chasing workloads where intra-warp VPNs might have no discernible stride or structure. How gracefully does LATPC degrade to the baseline performance in such scenarios?
-
Interplay with the Upstream Coalescer: The Regularity Detector is situated after the TLB coalescer, which provides it with a set of unique VPNs. The paper's discussion in Section 7 (page 12) on the sorting requirement feels a bit like an addendum but is, in fact, central to the mechanism's real-world efficacy. The ability to detect a consistent stride depends heavily on the order in which the unique VPNs are processed. A deeper analysis of the sensitivity of the Regularity Detector to the output order and behavior of a realistic hardware coalescer would lend more robustness to the claims.
-
Overhead vs. Simpler Alternatives: The hardware overhead analysis in Section 5.5 (page 8) is thorough and shows the cost is modest. However, the overall design complexity (a new detector unit, modified MSHR tags, and modified PTW logic) is non-trivial. The sensitivity study in Section 6.7 (page 11) convincingly shows that LATPC can outperform a baseline with more resources. Still, a more direct comparison would be valuable: for the same transistor budget as the entire LATPC mechanism, how much would performance improve by simply building a larger, more associative L2 TLB or adding more standard PTWs? This would help contextualize LATPC within the broader space of architectural trade-offs.
Questions to Address In Rebuttal
-
Could the authors better characterize the performance of LATPC on workloads with highly unstructured, pointer-chasing memory access patterns (e.g., graph analytics on sparse, high-degree graphs), where intra-warp VPNs may have no discernible stride? At what point does the overhead of the Regularity Detector fail to provide a benefit?
-
The authors briefly mention the impact of sorted vs. unsorted VPNs from the coalescer in Section 7. Could they elaborate on the sensitivity of the Regularity Detector to the output order of the coalescer? How much of the claimed 1.47x speedup depends on the coalescer producing a thread-index-ordered list of unique VPNs, and what is the expected performance if this order is not guaranteed?
-
From a designer's perspective, is there a crossover point where simply investing the area and power budget of LATPC into more conventional resources (e.g., doubling the L2 TLB entries or adding 50% more PTWs) would yield equivalent or better performance across this set of workloads? The sensitivity study shows LATPC scales well, but a direct iso-area comparison would be insightful.
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper introduces LATPC, a hardware mechanism designed to accelerate GPU address translation. The authors' central claim of novelty rests on a specific insight: while the temporal sequence of TLB accesses in a GPU is often chaotic, there exists significant spatial regularity in the Virtual Page Numbers (VPNs) requested by threads within a single warp instruction. LATPC is a three-part system built around this insight: 1) a Regularity Detector that identifies stride-based patterns among the unique VPNs of a warp; 2) a Locality-Aware TLB MSHR Compression (LATC) scheme that represents multiple strided TLB misses in a single MSHR entry using a <Base VPN, Stride, Valid Mask> format; and 3) a Locality-Aware TLB Prefetching (LATP) mechanism that leverages this information to issue batched page table walks, exploiting DRAM row buffer locality at the final level of the page table.
My review will focus exclusively on the novelty of this core idea and its implementation.
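As a concrete reading of the claimed mechanism, the sketch below builds a `<Base VPN, Stride, Valid Mask>` entry from a warp's unique VPNs under a single-stride assumption. The struct layout, field widths, and fallback behavior are my own choices for illustration.

```cpp
#include <cstdint>
#include <vector>

// One compressed miss entry covering up to 32 strided VPNs.
struct CompressedMshrEntry {
    uint64_t base_vpn;
    int64_t  stride;      // in pages
    uint32_t valid_mask;  // bit i set: base_vpn + i*stride is an outstanding miss
};

// Returns false if the unique VPNs do not fit a single-stride pattern, in which
// case a conventional per-miss MSHR allocation would be used instead.
bool try_compress(const std::vector<uint64_t>& unique_vpns, CompressedMshrEntry& out) {
    if (unique_vpns.size() < 2 || unique_vpns.size() > 32) return false;
    int64_t stride = int64_t(unique_vpns[1]) - int64_t(unique_vpns[0]);
    if (stride == 0) return false;
    out = {unique_vpns[0], stride, 0};
    for (size_t i = 0; i < unique_vpns.size(); ++i) {
        int64_t delta = int64_t(unique_vpns[i]) - int64_t(out.base_vpn);
        if (delta % stride != 0) return false;             // not on the detected stride
        int64_t idx = delta / stride;
        if (idx < 0 || idx >= 32) return false;            // must fit the 32-bit mask
        out.valid_mask |= (1u << static_cast<unsigned>(idx));
    }
    return true;
}
```

Note that the valid mask tolerates gaps in the sequence, so a single dominant stride with missing positions still compresses into one entry.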
Strengths
-
Novel Core Insight: The primary strength of this paper is its shift in analytical perspective. The vast majority of prior art in TLB prefetching, particularly work adapted from CPUs, analyzes a temporal stream of misses from a single execution context (e.g., Sequential, Stride, and Distance prefetchers, as correctly identified by the authors in Section 3.1, page 3). The authors convincingly argue and demonstrate (most effectively in Figure 7, page 4) that for GPUs, this temporal view is noisy and lacks predictable patterns. Their proposed alternative—a spatial analysis across the threads of a single warp instruction—is a genuinely novel approach for the problem of TLB prefetching. It recasts the problem from time-series prediction to spatial pattern recognition at the instruction level.
-
Novel Hardware Mapping of the Insight (LATC): The proposed modification to the L1 TLB MSHR structure is a direct and elegant implementation of the core insight. While miss coalescing is a known technique, it typically applies to multiple requests for the same resource (e.g., the same cache line). LATC, in contrast, coalesces requests for different but predictably related resources (strided VPNs). The use of a `<Base VPN, Stride, Valid Mask>` representation within a single MSHR entry (Section 5.3, page 7) is a novel mechanism in the context of TLB miss handling. It effectively creates a compressed representation of a "miss stream" that exists spatially within the warp, not temporally.
-
Cohesive, End-to-End System: The novelty is not confined to a single component. It flows logically from the initial observation (Section 4) to the pattern detection (Regularity Detector, Section 5.2), miss status handling (LATC, Section 5.3), and finally to the page walk optimization (LATP, Section 5.4). This tight integration, where the prefetch generation directly informs a specialized page walk batching strategy to exploit DRAM characteristics, constitutes a complete and novel system design.
Weaknesses
-
Constituent Concepts are Not Fundamentally New: While the synthesis is novel, the underlying concepts are not. Stride detection is a classical technique. The idea of representing a regular region of memory with a base and bounds/stride has appeared in various forms, such as in stream buffers and region-based prefetching for data caches. The concept of batching requests to improve memory-level parallelism is also well-established. The paper's novelty hinges entirely on the application of these concepts to the specific problem of warp-level, inter-thread TLB misses. The authors should more clearly position their work as a novel application and synthesis of existing principles rather than the invention of fundamentally new ones.
-
Simplicity of the Regularity Detector: The proposed Regularity Detector (Figure 12, page 6) only identifies a single, constant stride. While the data in Figure 8 (page 4) suggests this is often sufficient, it is a very simple pattern. This mechanism is not novel from a hardware complexity standpoint. The contribution is its purpose, not its implementation. The work could be perceived as less groundbreaking because it does not attempt to identify more complex or multiple interleaved stride patterns, which have been explored in other prefetching domains. The "delta" over a simple stride detector is zero.
-
Insufficient Differentiation from Conceptual Prior Art: The authors do an excellent job differentiating from the specific CPU/GPU prefetchers they evaluate. However, the conceptual link to stream prefetching is strong. A stream prefetcher identifies an access stream (base address + stride), reads ahead, and stores data in a buffer. LATPC identifies a VPN stream (base VPN + stride), "pre-misses" ahead, and stores the status in a compressed MSHR entry. The paper would be stronger if it acknowledged this conceptual parallel and clearly articulated the key differences in the problem constraints (e.g., handling misses vs. prefetching data, interacting with the page walk mechanism vs. the data cache).
Questions to Address In Rebuttal
-
On MSHR Compression (LATC): The core of your hardware novelty appears to be the `<Base VPN, Stride, Valid Mask>` MSHR entry. Can you confirm if this exact representation for compressing multiple, distinct, in-flight misses has been proposed in prior art for any miss-handling structure (e.g., data cache MSHRs, L2 TLB MSHRs, etc.)? Please clarify the precise delta between LATC and prior work on miss coalescing or MSHR compression.
-
On the Regularity Detector: The decision to detect only a single stride per pattern appears pragmatic. What is the estimated performance impact of this limitation? What percentage of warp memory instructions exhibit more complex, yet still regular, patterns (e.g., multiple interleaved strides) that your current detector cannot capture? This will help quantify the sufficiency of your proposed novel, but simple, detector.
-
On the Novelty of Batched Page Walks (LATP): The paper cites prior work on enhancing page table walkers (e.g., [60], [97], [98]). Please articulate more sharply the novelty of LATP's batching mechanism compared to these works. Is the novelty simply that the batch is generated by your prefetcher, or is the mechanism for exploiting DRAM row buffer locality during the walk itself fundamentally different from prior proposals for batching page walks?
S-DMA: Sparse Diffusion Models Acceleration via Spatiality-Aware Prediction and Dimension-Adaptive Dataflow
Abstract
Diffusion Models (DMs) have demonstrated remarkable performance in a variety of image generation tasks. However, their complex architectures and intensive computations result in significant overhead and latency, posing challenges for hardware deployment. ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present S-DMA, a software-hardware co-design framework intended to accelerate sparse Diffusion Models (DMs). The core contributions are threefold: 1) A "Spatiality-Aware Similarity" (SpASim) method that reduces the complexity of sparsity prediction from O(N²) to O(N) by assuming local similarity; 2) A "NAND-based Similarity" computation that approximates cosine similarity using bitwise operations on sign or most significant bits (MSBs) to reduce hardware overhead; and 3) A "Dimension-Adaptive Dataflow" designed to unify sparse convolution and GEMM operations for processing on a dedicated PE array. The authors claim significant speedup and energy efficiency improvements over a baseline GPU and state-of-the-art (SOTA) DM accelerators.
Strengths
-
Problem Identification: The paper correctly identifies a critical bottleneck in prior work: the computational overhead of the sparsity prediction step itself can negate the benefits of sparse computation (Challenge 1 and 2, Figure 2, page 3). Focusing on reducing this overhead is a valid and important research direction.
-
Comprehensive Co-design: The proposed solution is a full-stack effort, spanning from algorithmic heuristics (SpASim, NAND-similarity) to microarchitectural implementations (SP²U, reduction network). This holistic approach is commendable.
-
Operator Unification: The dimension-adaptive dataflow (Section 3.3, page 6; Figure 8, page 7) is a technically sound approach to homogenize the dataflow for different sparse operator types (convolution and GEMM), which is a non-trivial challenge in hardware design.
Weaknesses
My primary concerns with this manuscript revolve around the fragility of its core assumptions, the justification for its approximation methods, and the soundness of its experimental comparisons.
-
Unjustified Locality Heuristic (SpASim): The central premise of SpASim—that semantically similar tokens are spatially proximal—is a strong heuristic that lacks robust validation.
- The evaluation is performed on standard datasets (COCO, GSO) which are dominated by images with clear subjects and backgrounds, naturally favoring such a locality assumption. The work presents no evidence of how SpASim performs on adversarial inputs designed to violate this assumption (e.g., complex textures, abstract patterns, or images with fine-grained, distributed details).
- The "adaptive" selection of the window size
K(Algorithm 1, page 6) is performed offline. This is a misnomer; the system is not adaptive at runtime. A single, pre-calibratedKvalue is used for all inference tasks, which is brittle and may perform poorly on out-of-distribution inputs.
-
Extreme and Under-analyzed Similarity Approximation: The NAND-based similarity is a radical simplification of cosine similarity.
- For SeS, using only the sign bit discards all magnitude information. For SpS, the paper claims to use MSBs because the distribution is non-negative (Section 3.2, page 6), but provides no analysis on how many bits are used or a sensitivity analysis of quality vs. the number of bits. Is a single MSB truly sufficient to capture similarity in a "long-tail positive distribution"? This seems highly improbable and is not substantiated.
- The hardware savings reported in Figure 7 (page 6) are against an XNOR-based design, not the full MAC and normalization pipeline of cosine similarity. This presents the savings in an overly favorable light.
-
Unsupported "Zero-Latency" Claims: The paper repeatedly makes claims of "no additional inference latency" or "fully overlapped" operation for its auxiliary hardware components.
- For the SP²U's sorting mechanism (Section 4.2, page 7), the claim that sorting is "fully overlapped" and introduces "no additional inference latency" is unsubstantiated. A formal analysis of potential pipeline hazards or stalls is required. It is difficult to believe there are no conditions under which the main PE array would have to wait for the SP²U.
- Similarly, the Sparsity-Aware Reduction Network (Section 4.4, page 8) claims its accumulation can be "fully overlapped with PE computation, introducing no additional inference delay." Re-accumulating partial results from different PE lines based on dynamic sparsity masks is a complex routing and synchronization problem. Without cycle-level simulation data or a detailed pipeline diagram, this claim is not credible.
-
Fundamentally Flawed Baseline Comparisons: The experimental evaluation, particularly against SOTA accelerators, is misleading.
- As acknowledged in Section 5.3 (page 10), the chosen baselines (Cambricon-D, Ditto) primarily exploit inter-step temporal sparsity. S-DMA exploits intra-step spatial/semantic sparsity. These are orthogonal, not competing, optimization strategies. Claiming a 7.05x speedup over them is an apples-to-oranges comparison and does not represent a legitimate scientific advance over their techniques. A proper comparison would be against an accelerator designed for intra-step sparsity or a system that integrates both approaches.
- The reported speedups over the NVIDIA A100 GPU (up to 51.11x, Figure 12, page 10) are suspiciously high. This typically indicates that the GPU baseline is not sufficiently optimized. The authors provide no details on the implementation of "GPU+Sparsity." Was this implemented using highly-optimized CUDA kernels and libraries like cuSPARSE, or a naive high-level implementation? Without these details, the GPU baseline appears to be a strawman.
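To illustrate the kind of sensitivity analysis requested in Weakness 2, the following is a minimal sketch, assuming a plain 8-bit quantization step, synthetic long-tail non-negative activations, and a configurable bit count; none of these details are taken from the paper. It compares a top-MSB bitwise agreement score against true cosine similarity.

```python
import numpy as np

def cosine_sim(a, b):
    # Full-precision reference: dot product with L2 normalization.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def msb_sim(a, b, n_bits=1, total_bits=8):
    # Bitwise proxy: quantize to unsigned 8-bit, keep only the top n_bits,
    # and score the fraction of positions whose kept bits agree exactly.
    qa = np.clip(a / a.max() * 255, 0, 255).astype(np.uint8) >> (total_bits - n_bits)
    qb = np.clip(b / b.max() * 255, 0, 255).astype(np.uint8) >> (total_bits - n_bits)
    return float(((qa ^ qb) == 0).mean())

rng = np.random.default_rng(0)
dim, pairs = 64, 2000
ref, proxy = [], []
for _ in range(pairs):
    a = rng.gamma(2.0, 1.0, dim)                      # long-tail, non-negative "SpS-like" vector
    b = a + rng.uniform(0, 2) * rng.gamma(2.0, 1.0, dim)
    ref.append(cosine_sim(a, b))
    proxy.append(msb_sim(a, b, n_bits=1))

corr = np.corrcoef(ref, proxy)[0, 1]
print(f"Pearson correlation, 1-MSB proxy vs. cosine similarity: {corr:.3f}")
```

Sweeping n_bits upward in a harness of this form, run on the paper's actual activation traces and tied to generation quality, is precisely the evidence the rebuttal should provide.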
Questions to Address In Rebuttal
The authors must address the following points to establish the validity of their work:
-
On SpASim's Robustness: Provide evidence that the SpASim method is robust. This should include evaluation on a dataset specifically curated to challenge the spatial locality assumption. Furthermore, justify the use of an offline-tuned, fixed K and discuss the performance degradation when an input image violates the characteristics of the tuning set.
-
On NAND-based Similarity: Please provide a detailed sensitivity analysis for the NAND-based similarity metric. Specifically for SpS, clarify precisely how many MSBs are used and show how generation quality (e.g., FID, CLIP) degrades as the number of bits is reduced. Justify why this coarse approximation does not lead to catastrophic failures in semantic understanding.
-
On Architectural Latency Claims: Substantiate the claims of "no additional latency" for the SP²U sorter and the reduction network. Provide pipeline diagrams and/or cycle-level performance data demonstrating the absence of stalls across a range of sparsity patterns, including highly irregular ones.
-
On Experimental Baselines: Justify the direct comparison of S-DMA to accelerators (Cambricon-D, Ditto) that optimize for a completely different and orthogonal type of sparsity. Acknowledge that these are not competing approaches and re-frame the results accordingly. Provide comprehensive details on the implementation and optimization level of the GPU baselines to prove they are not strawmen.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents S-DMA, a comprehensive software-hardware co-design framework for accelerating sparse diffusion models (DMs). The authors correctly identify that while sparsity offers a promising path to reduce the immense computational cost of DMs, existing methods are critically hampered by two second-order effects: 1) the significant computational overhead of predicting where sparsity exists, and 2) the hardware inefficiency of processing the diverse and irregular sparsity patterns that emerge across different operators, namely convolutions (CONV) and general matrix multiplications (GEMM).
To address this, S-DMA proposes a holistic solution. On the software side, it introduces a "Spatiality-Aware Similarity" (SpASim) method that leverages the inherent local correlations in image data to reduce the complexity of sparsity prediction from O(N²) to O(N). It further proposes a hardware-friendly, NAND-based similarity computation to replace expensive multiply-accumulate operations. On the hardware side, the work designs a dedicated accelerator featuring a novel "Dimension-Adaptive Dataflow." This key architectural contribution unifies the execution of sparse CONV and sparse GEMM operations into a single, efficient GEMM-based pipeline, overcoming a major challenge in accelerating hybrid models. The architecture is supported by a lightweight sparsity prediction unit (SP²U) and a sparsity-aware reduction network. The authors demonstrate significant speedup and energy efficiency gains over both high-end GPUs and other state-of-the-art DM accelerators.
Strengths
The true strength of this paper lies in its holistic and deeply integrated approach to a complex problem. It moves beyond simply applying known sparsity techniques and instead re-evaluates the entire sparse inference pipeline from first principles.
-
Excellent Problem Formulation: The authors' core insight is that the cost of finding sparsity can negate its benefits. By framing the problem around the three challenges in Figure 2 (page 3)—prediction complexity, prediction overhead, and low PE utilization—they provide a clear and compelling motivation for their work. This demonstrates a mature understanding of the practical barriers to deploying sparse acceleration.
-
Elegant Algorithmic and Architectural Synergy: The proposed solutions are not independent optimizations but a tightly coupled set of ideas. The SpASim algorithm is motivated by the spatial locality of the target domain (images), and the NAND-based similarity is a direct consequence of designing an algorithm with hardware implementation costs in mind. The "Dimension-Adaptive Dataflow" is the centerpiece of this synergy, providing a hardware substrate that can efficiently execute the sparse workloads created by the software-side prediction. This transformation of sparse convolution into a structured sparse GEMM (Section 3.3, page 6) is a particularly clever contribution that avoids the well-known overheads of traditional im2col-based approaches (a toy illustration of the general gather-to-GEMM idea appears after this list).
-
Strong Contextualization within the Field: The paper does an excellent job of placing itself within the broader landscape of AI acceleration. It correctly identifies the limitations of prior work, such as the unsuitability of sign-based similarity from ViT accelerators (like AdapTiV) for the non-negative attention maps in DMs (Section 2.2, page 4). It also distinguishes itself from other DM accelerators (like Cambricon-D and Ditto) by tackling a different and complementary form of sparsity (semantic/spatial vs. temporal/value). This demonstrates a nuanced understanding of the research frontier.
-
Significant Potential Impact: Diffusion models are a dominant workload in generative AI, and their computational demands are a major bottleneck. S-DMA provides a compelling blueprint for future specialized hardware. By making sparsity practical and efficient, this work could significantly reduce the latency and energy cost of DM inference, enabling their deployment in a wider range of applications, from on-device editing to real-time content generation. The core ideas—especially the unified dataflow—could also prove influential for accelerating other hybrid CNN-Transformer architectures.
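For readers less familiar with the technique praised in point 2, the following toy sketch shows the general gather-to-dense-GEMM principle. It is my own illustration under assumed shapes and a 1x1 convolution, not the paper's dimension-adaptive dataflow.

```python
import numpy as np

def sparse_pointwise_conv(x, w, mask):
    """Toy 1x1 convolution computed only at 'active' spatial positions.

    x:    (H, W, C_in) activations
    w:    (C_in, C_out) pointwise-conv weights
    mask: (H, W) boolean map of positions worth computing
    """
    H, W, C_in = x.shape
    active = np.flatnonzero(mask.reshape(-1))        # gather step: indices of active tokens
    gathered = x.reshape(H * W, C_in)[active]        # dense (N_active, C_in) block
    out_active = gathered @ w                        # compute step: one dense GEMM
    out = np.zeros((H * W, w.shape[1]), dtype=x.dtype)
    out[active] = out_active                         # scatter step: write results back
    return out.reshape(H, W, w.shape[1])

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16, 32)).astype(np.float32)
w = rng.standard_normal((32, 64)).astype(np.float32)
mask = rng.random((16, 16)) < 0.25                   # ~25% of positions are "active"
y = sparse_pointwise_conv(x, w, mask)
print(y.shape, float(np.abs(y[~mask]).max()))        # inactive positions stay zero
```

The gather and scatter steps are cheap index operations; the payoff is that the inner compute is a single dense GEMM that maps well onto a PE array, which is the property the paper's dataflow exploits at full convolution and attention scale.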
Weaknesses
The weaknesses of the paper are primarily related to its scope and the exploration of its boundaries. The core contribution is sound, but its context could be further enriched.
-
Limited Discussion on Generalizability: The framework is highly optimized for the U-Net-based architectures common in DMs. While this specialization is a strength, the paper would benefit from a discussion on how the core concepts might generalize. For instance, could the dimension-adaptive dataflow be applied to other models that mix attention and convolution, such as vision transformers with convolutional stems or mobile-friendly hybrid networks? A brief exploration of this could broaden the paper's perceived impact.
-
Sensitivity of the SpASim Method: The performance of the SpASim method relies on the window size K, which is determined offline (Algorithm 1, page 6). The evaluation in Section 5.2 (page 9) shows this works well for the tested benchmarks, but a deeper analysis of its robustness would be welcome. How sensitive is the performance to this hyperparameter? For example, would a model trained on a different data distribution or a highly unusual user prompt require re-profiling to find a new optimal K? (A toy sketch of the windowed-similarity idea and its dependence on K follows this list.)
-
Lack of Comparison with Structured Sparsity: The work focuses on exploiting dynamic, fine-grained sparsity. It would be valuable to briefly contrast this with structured sparsity approaches (e.g., block or vector sparsity). Structured methods are often considered easier to support in hardware and could present a different trade-off between compression rate and hardware complexity. Including this discussion would provide a more complete picture of the design space.
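To make point 2 concrete, here is a toy version of windowed similarity; it is my own illustration of the general local-window idea under assumed token counts and dimensions, not the paper's Algorithm 1. Each token is compared only against its K-wide neighborhood, so cost grows as N·K instead of N², and K directly encodes the locality assumption whose robustness is in question.

```python
import numpy as np

def windowed_max_similarity(tokens, K):
    """For each token, the best cosine similarity within a +/- K//2 window (O(N*K) work)."""
    n = len(tokens)
    normed = tokens / (np.linalg.norm(tokens, axis=1, keepdims=True) + 1e-12)
    best = np.full(n, -1.0)
    half = K // 2
    for i in range(n):
        lo, hi = max(0, i - half), min(n, i + half + 1)
        sims = normed[lo:hi] @ normed[i]
        sims[i - lo] = -np.inf                      # ignore self-similarity
        best[i] = sims.max()
    return best

rng = np.random.default_rng(0)
tokens = rng.standard_normal((1024, 64))
for K in (8, 32, 128):
    covered = windowed_max_similarity(tokens, K)
    # Larger K recovers more of the global best match, at roughly K/N of the O(N^2) cost.
    print(f"K={K:4d}: mean best-in-window similarity = {covered.mean():.3f}")
```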
Questions to Address in Rebuttal
-
The core architectural contribution is the dimension-adaptive dataflow that unifies sparse CONV and GEMM. Could the authors comment on the applicability of this technique to other hybrid CNN-Transformer architectures outside the diffusion model space, such as those used in object detection or semantic segmentation?
-
The selection of the local window hyperparameter K is performed offline. How robust is a pre-selected K value to variations in input prompts or generation tasks? For instance, does an image edit focusing on a tiny detail versus a large global change affect the optimal K, and if so, how does S-DMA handle such dynamic variation?
-
The paper's evaluation focuses on its sparsity-centric approach in isolation. How do the authors envision S-DMA synergizing with orthogonal acceleration techniques for DMs, such as quantization, knowledge distillation, or the differential/temporal computing exploited by competitors like Cambricon-D? Are the performance gains expected to be additive?
-
The NAND-based similarity is a creative and effective hardware-aware approximation. Have the authors considered if this low-cost hardware primitive could be adapted for other similarity-based tasks in machine learning beyond sparsity prediction, such as in retrieval or clustering algorithms?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper introduces S-DMA, a software-hardware co-design framework to accelerate Diffusion Models (DMs) by exploiting semantic and spatial sparsity. The authors propose three core contributions: (1) a "Spatiality-Aware Similarity" (SpASim) algorithm that reduces the complexity of sparsity prediction from O(N²) to O(N) by leveraging local similarity; (2) a "NAND-based Similarity" computation method that replaces expensive multipliers with bitwise logic for both symmetric (SeS) and non-negative (SpS) activation distributions; and (3) a "Dimension-Adaptive Dataflow" and corresponding hardware that unifies sparse convolution and sparse GEMM operations into a dense GEMM format.
While the paper presents a well-integrated and high-performing system, my analysis concludes that the foundational ideas behind each of the core contributions are not new. Instead, they represent effective adaptations and combinations of well-established principles from the fields of efficient Transformers and sparse hardware acceleration, applied specifically to the domain of Diffusion Models. The novelty is therefore in the application and system-level integration, not in the core concepts themselves.
Strengths
- System-Level Co-Design: The work is a comprehensive example of software-hardware co-design, connecting algorithmic optimizations directly to bespoke hardware units (SP²U, reduction network).
- Problem Formulation: The authors correctly identify and articulate the key challenges (Section 1, page 2, Figure 2) in accelerating sparse DMs: the overhead of sparsity prediction and the difficulty of handling heterogeneous sparse operators.
- Holistic Sparsity Support: The framework's ability to handle both semantic sparsity (token merging) and spatial sparsity (image editing) within a unified architecture is a notable engineering achievement.
Weaknesses
The primary weakness of this paper, from the perspective of novelty, is that its core contributions are derivations of prior art.
-
Spatiality-Aware Similarity (SpASim) is an application of Local Attention: The central idea of SpASim (Section 3.1, page 5) is to reduce a quadratic O(N²) similarity computation to linear O(N) by restricting comparisons to local windows. This concept is the cornerstone of numerous efficient Transformer models developed over the past several years to overcome the exact same bottleneck. Architectures like the Swin Transformer (windowed attention) or methods like Longformer (sliding window attention) are built on this exact principle of exploiting locality to make attention tractable. While the authors apply this to the sparsity prediction step for DMs, the algorithmic principle of trading global comparison for local comparison to achieve linear complexity is not a novel contribution.
-
NAND-based Similarity is an incremental extension of Bitwise Similarity Proxies: The proposal to use cheap bitwise operations as a proxy for expensive cosine similarity (Section 3.2, page 6) is not new. The authors themselves reference AdapTiV [48], which uses XNOR-based sign similarity for this purpose in Vision Transformers. The authors' claim to novelty rests on adapting this for DMs, where SpS activations are non-negative, by using the Most Significant Bit (MSB) instead of the sign bit. Using MSBs as a low-cost proxy for magnitude is a standard technique in approximate computing. The move from XNOR to NAND gates (Figure 7, page 6) is a minor circuit-level optimization. Therefore, this contribution is a small, albeit clever, delta over existing work, adapting a known technique to a slightly different data distribution.
-
Dimension-Adaptive Dataflow is a form of Sparse Data Compaction: The proposed dataflow (Section 3.3, page 6, Figure 8) aims to unify sparse convolution and GEMM by transforming sparse convolution into a dense GEMM operation. This is achieved by gathering active tokens/channels and permuting them into a dense block. The concept of converting sparse operations into dense ones by gathering non-zero elements to feed a dense systolic array or PE array is a foundational technique in sparse accelerator design. While this method avoids the memory overhead of the classic im2col transformation for sparse inputs, the "gather-compute-scatter" pattern is not a new architectural paradigm. The novelty lies in the specific permutation strategy for DMs, not in the fundamental approach of data compaction for efficient hardware utilization.
Questions to Address In Rebuttal
-
Regarding SpASim: The authors must explicitly differentiate their contribution from the large body of existing work on local and windowed attention in the Transformer literature. Beyond applying a known technique to a new problem (sparsity prediction in DMs), what is the fundamental algorithmic novelty?
-
Regarding NAND-based Similarity: Can the authors argue that the extension from sign-based similarity (as in AdapTiV [48]) to a hybrid sign/MSB-based approach is a non-obvious conceptual leap? Given that using MSBs as magnitude comparators is a common heuristic, the rebuttal should clarify why this specific adaptation constitutes a significant novel contribution.
-
Regarding the Dimension-Adaptive Dataflow: Please contrast the proposed dimension permutation technique with other structured sparsity or data compaction schemes in the hardware accelerator literature. How does this approach differ fundamentally from prior "gather-compute-scatter" architectures designed to handle sparse activations? The defense of novelty should focus on the architectural concept, not just its specific tuning for DM workloads.
LLM.265: Video Codecs are Secretly Tensor Codecs
Abstract
As the parameter size of large language models (LLMs) continues to expand, the need for a large memory footprint and high communication bandwidth have become significant bottlenecks for the training and inference of LLMs. To mitigate these bottlenecks, ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
This paper proposes repurposing video codecs, specifically components of H.264/H.265, as a general-purpose method for compressing tensors in large language models (LLMs). The authors coin this method "LLM.265" and claim it is effective for weights, KV cache, activations, and gradients, for both inference and training. They leverage existing hardware video encoders/decoders (NVENC/NVDEC) on GPUs for implementation and further propose a custom, optimized "three-in-one" hardware codec design based on their findings. The central thesis is that the statistical properties of tensors are sufficiently analogous to those of natural images to make video compression techniques highly effective.
Strengths
- Novel Application of Existing Hardware: The idea to leverage the perpetually idle NVENC/NVDEC hardware for a new, relevant workload (tensor compression) is resourceful and pragmatic from a systems perspective.
- Ambitious Scope: The authors attempt to create a unified compression framework that applies to a wide variety of tensor types across different stages of the LLM lifecycle (inference and training). This contrasts with existing methods that are typically specialized for one tensor type (e.g., weights only).
- Empirical Breadth: The paper presents experiments across multiple models (LLaMA-2/3, Pythia, T5, ViT), tasks (reasoning, classification, training), and tensor types, demonstrating a commendable effort to validate the proposed method's versatility.
Weaknesses
The paper's claims rest on a foundation of weak analogy, questionable experimental comparisons, and a critical omission of performance metrics that are essential to its core value proposition.
-
The Core Analogy is Flawed and Overstated: The central claim "Video Codecs are Secretly Tensor Codecs" is a dramatic overstatement. The authors' own analysis in Section 3.1 (Figure 2, p. 4) shows that Inter-Frame Motion Prediction—a critical component responsible for the high efficiency of modern video codecs—is actively harmful, increasing the required bitrate. The method's success thus relies on cherry-picking specific components (Intra-Prediction, DCT, Entropy Coding) from the video codec pipeline while discarding others. The paper should be reframed to argue that intra-frame image compression techniques are effective for tensors, which is a far less sensational but more accurate claim.
-
Critical Omission of Latency Overheads: The paper's thesis is predicated on improving efficiency by reducing data movement. However, it completely fails to report the latency of the compression (encoding) and decompression (decoding) operations. This is a fatal flaw. For communication-bound scenarios, the time saved by transferring less data can be easily nullified by the computational overhead of the codec itself. Without wall-clock time comparisons for end-to-end steps (e.g., time per training step, total inference latency), the claims of improved performance are unsubstantiated and potentially misleading.
-
Contradictory and Unconvincing Training Results: In Section 5.1 (p. 9), the authors claim that with their compressed training method the "final validation perplexity is 36.7, which is lower than that of full-precision training." This is in direct contradiction with their own plot in Figure 9(b), where the uncompressed baseline clearly converges to a much lower perplexity of ~24. A perplexity of 36.7 is a significant degradation in model quality, not an improvement. This erroneous claim severely undermines the credibility of the training-related results.
-
Unfair Experimental Comparisons: The weight compression experiments in Section 4.1 (Figure 5, p. 6) compare LLM.265 against baselines like GPTQ and AWQ. However, the authors employ a "variable bit-width" search for their method, effectively performing a fine-grained hyperparameter optimization of bit allocation across layers. The baselines are typically evaluated at fixed, uniform bitrates. This is not an apples-to-apples comparison; LLM.265 is given an optimization advantage that the baselines are not. The performance gap may be attributable to this search strategy rather than the inherent superiority of the codec itself.
-
Insufficient Justification for Efficacy: The explanation for why the method works (Section 3.1, p. 4) relies heavily on qualitative arguments and visual inspection. The claim that the "channel-wise distribution property" creates "edges and planar blocks that are similar to real-world images" is a weak, non-rigorous analogy. A formal statistical analysis comparing the properties of tensor sub-blocks to the assumptions underpinning intra-prediction and DCT would be required to make this a convincing argument. (A sketch of one such check appears after this list.)
-
Speculative and Overreaching Hardware Proposal: The paper makes a significant leap from empirical results on existing hardware to a detailed proposal for a new "three-in-one" codec (Section 7). The evaluation of this proposed hardware is based on synthesis of open-source RTL and an analytical model, not a physical implementation. The performance and energy claims derived from this model (Figure 16, p. 13) are therefore highly speculative and should be presented with much greater caution.
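One example of the kind of quantitative check Weakness 5 asks for is sketched below; the block size, the synthetic channel-wise structure, and the energy metric are my own assumptions rather than details from the paper. It measures how much of a block's energy a 2-D DCT concentrates into its low-frequency corner, which is the property transform coding actually relies on.

```python
import numpy as np

def dct_matrix(n):
    # Orthonormal DCT-II basis (the transform family used, up to integer approximation, in H.26x).
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    m = np.cos(np.pi * (2 * i + 1) * k / (2 * n)) * np.sqrt(2 / n)
    m[0, :] = np.sqrt(1 / n)
    return m

def energy_in_low_freqs(block, keep):
    d = dct_matrix(block.shape[0])
    coeffs = d @ block @ d.T                         # 2-D DCT of the block
    return np.sum(coeffs[:keep, :keep] ** 2) / np.sum(coeffs ** 2)

rng = np.random.default_rng(0)
n, keep = 16, 4
# Synthetic "weight" block with a channel-wise scale, mimicking the smooth per-channel
# structure the paper attributes to LLM weight tensors.
channel_scale = rng.uniform(0.5, 2.0, size=(1, n))
block = rng.standard_normal((n, n)) * 0.05 + channel_scale
print(f"structured block : {energy_in_low_freqs(block, keep):.2%} of energy in {keep}x{keep} corner")
print(f"pure noise block : {energy_in_low_freqs(rng.standard_normal((n, n)), keep):.2%}")
```

Running this kind of measurement on real weight sub-blocks, together with the analogous residual analysis for each intra-prediction mode, would replace the visual analogy with numbers.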
Questions to Address In Rebuttal
-
Please provide end-to-end wall-clock timing results for your key experiments. Specifically:
- For the inference scenario in Section 4.2, what is the total latency per generated token, including KV cache compression/decompression, when compared to an uncompressed baseline?
- For the training scenarios in Section 5, what is the time-per-step (including codec latency and data transfer) for your method versus the uncompressed and baseline methods?
-
Please clarify the statement in Section 5.1 (p. 9) that a final perplexity of 36.7 is "lower than that of full-precision training." Your own Figure 9(b) clearly shows the uncompressed baseline achieves a perplexity of ~24. Is this a typo, or a misinterpretation of the results?
-
Can you justify the fairness of comparing your variable bit-width compression scheme against fixed bit-width quantization baselines in Figure 5? Please provide an ablation study where LLM.265 is constrained to a fixed bitrate across all layers to enable a more direct comparison.
-
Beyond the visual analogy presented in Section 3.1, can you provide a more rigorous, quantitative analysis demonstrating that the statistical properties of LLM tensors align with the assumptions made by the H.265 intra-prediction modes?
-
Given that a core component of video codecs (inter-frame prediction) is detrimental to tensor compression, do you agree that the paper’s primary claim should be narrowed from "video codecs" to "intra-frame image compression techniques"?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents a novel and intriguing approach to tensor compression for large language models (LLMs), proposing that standard video codecs (like H.264/H.265) can serve as effective, general-purpose "tensor codecs." The core thesis is that the underlying principles of video compression—specifically transform coding (DCT), intra-frame prediction, and entropy coding—are surprisingly well-suited for compressing the various tensors encountered in LLM workloads, including weights, activations, KV caches, and gradients.
The authors introduce a framework, LLM.265, which leverages the existing, and often idle, hardware video encoders and decoders (NVENC/NVDEC) on modern GPUs. This approach is positioned as a "general-purpose" and "versatile" alternative to the current landscape of specialized, data-dependent quantization techniques. The paper provides extensive empirical evidence showing that LLM.265 achieves state-of-the-art compression rates across both inference and distributed training scenarios. Building on this insight, the authors conclude with a compelling architectural proposal for a customized, high-throughput "three-in-one" codec optimized for tensors, videos, and images, demonstrating its potential for significant area and energy savings in future accelerators.
Strengths
The primary strength of this work is its beautiful and non-obvious central idea. The connection between video signal processing and tensor compression is a fantastic example of lateral thinking that bridges two seemingly disparate domains. It shifts the conversation from designing bespoke numerical formats to repurposing a mature, highly optimized technology stack.
-
A Unifying Framework for Compression: The most significant contribution is the "general-purpose" nature of the proposed solution. The current state of LLM compression is fragmented: one uses techniques like GPTQ/AWQ for weights, different methods for KV cache, and yet another set (e.g., 1-bit Adam) for gradients during training. LLM.265 offers a single, unified mechanism for all of them, as clearly illustrated in Figure 1 (page 1). This simplification of the software and hardware stack is a massive advantage for system design and deployment.
-
Pragmatic Use of Existing Hardware: The decision to leverage the on-chip NVENC/NVDEC hardware is architecturally astute. These are powerful, fixed-function accelerators that are typically idle during the compute-heavy phases of LLM execution. Tapping into this underutilized resource provides a low-cost, immediately applicable pathway for accelerating data movement without requiring new hardware. This is a classic architectural win.
-
Versatility and Robustness: The authors convincingly argue that their method is "versatile" because it is data-independent, requiring no calibration or warm-up periods (Section 4, page 6). This is a critical advantage over methods like GPTQ that depend on a calibration set, whose quality can affect final model performance. The ability to achieve fractional bitrates is another subtle but powerful feature that distinguishes it from standard integer quantization.
-
Excellent Forward-Looking Architectural Analysis: The paper goes beyond a mere software proposal and provides a thoughtful vision for future hardware. The analysis in Section 6 (page 10), particularly the die area comparison in Figure 12, effectively argues for the cost-efficiency of integrating high-throughput tensor codecs. The proposed "Three-in-one codec" in Section 7 (page 11) is a well-reasoned design that balances the needs of multimedia and AI workloads, demonstrating a clear path from the paper's core insight to next-generation silicon. This is precisely the kind of content that elevates a paper at a top-tier architecture conference.
Weaknesses
While the core idea is compelling, the paper could be strengthened by addressing a few key points that are currently underexplored.
-
The Bottleneck of Existing Hardware: The authors rightly acknowledge that the throughput of current NVENC/DEC units (~1.1 GB/s, as stated in Section 6.1, page 10) is a limitation. However, the paper does not sufficiently contextualize how severe this bottleneck is. Modern intra-node interconnects like NVLink can exceed 900 GB/s. A 1.1 GB/s codec throughput would be a significant bottleneck for communication over NVLink, potentially negating the benefits of compression. The current implementation seems most viable for inter-node communication over slower networks like Ethernet, but this trade-off is not explicitly discussed. (A back-of-the-envelope illustration of this trade-off follows this list.)
-
Opacity of the Initial Quantization Step: The method requires converting FP16 tensors to 8-bit integers before they can be processed by the hardware video codec (Section 3.2, page 5). This initial quantization step is itself a form of compression. The paper lacks a clear ablation study that disentangles the effects of this initial FP16-to-INT8 conversion from the subsequent H.265 encoding. It leaves the reader wondering: how much of the final compression-accuracy trade-off is attributable to the simple 8-bit quantization, and how much additional benefit is truly provided by the more complex video codec pipeline?
-
Lack of Broader Context on General-Purpose Compression: While the work is well-positioned against domain-specific quantization methods, it would benefit from comparison with other hardware-accelerated, general-purpose lossless/lossy compressors (e.g., Zstd, Blosc, etc.). This would help to establish whether the primitives in video codecs are uniquely suited for this task, or if other compression schemes could offer similar benefits. The comparison in Section 7.1 (page 12) is excellent for the custom hardware proposal but is missing from the main evaluation of LLM.265.
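A back-of-the-envelope model makes the severity of this bottleneck concrete. The sketch below uses the ~1.1 GB/s codec figure from the paper but assumes a 4x compression ratio and round-number link bandwidths of my own choosing, and it models the pipelined steady state where the slowest stage dominates.

```python
def effective_gbps(link_gbps, codec_gbps=None, ratio=1.0):
    """Steady-state tensor throughput of a pipelined transfer: the slowest stage wins."""
    if codec_gbps is None:
        return link_gbps                              # uncompressed baseline
    # Encoder and decoder each process the full tensor; the link carries 1/ratio of it.
    return min(codec_gbps, link_gbps * ratio)

codec_gbps, ratio = 1.1, 4.0                          # ~NVENC rate from the paper; assumed compression ratio
for name, link_gbps in [("1 GbE   (~0.125 GB/s)", 0.125),
                        ("100 GbE (~12.5 GB/s) ", 12.5),
                        ("NVLink  (~900 GB/s)  ", 900.0)]:
    base = effective_gbps(link_gbps)
    comp = effective_gbps(link_gbps, codec_gbps, ratio)
    print(f"{name}: {base:8.3f} GB/s uncompressed vs {comp:6.3f} GB/s through the codec")
# Under these assumptions the codec path only wins when the raw link is slower than
# the codec itself (~1.1 GB/s), which supports inter-node use but not NVLink traffic.
```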
Questions to Address In Rebuttal
-
Could the authors provide an ablation study that isolates the impact of the initial FP16-to-INT8 quantization? For example, what is the accuracy when compressing tensors using only 8-bit quantization versus the full LLM.265 pipeline at a comparable bitrate? This would clarify the unique contribution of the video codec algorithms. (A minimal sketch of such a baseline follows this list.)
-
Can you elaborate on the performance implications of the ~1.1 GB/s NVENC throughput? In which specific distributed training/inference scenarios (e.g., inter-node vs. intra-node communication, specific network fabrics) does the communication saving from compression outweigh the latency introduced by the codec itself?
-
The finding that "Inter-Frame Motion Prediction Does not Work" (Section 3.1, page 5) is fascinating. Could you speculate on the deeper implications of this? Does the lack of inter-layer correlation in LLM weights suggest something fundamental about the learned representations, for instance, that layers learn relatively independent features? This could be a valuable insight for the broader deep learning community.
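The baseline needed for Question 1 is simple to construct. The sketch below is my own minimal per-tensor symmetric INT8 quantizer (the tensor shape and scale rule are assumptions), intended only to isolate the error introduced before any video-codec stage runs.

```python
import numpy as np

def int8_roundtrip(x):
    """Symmetric per-tensor INT8 quantize/dequantize, mimicking the codec's 8-bit input format."""
    scale = np.abs(x).max() / 127.0 + 1e-12
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float32) * 0.02   # stand-in weight matrix
w_hat = int8_roundtrip(w)
rel_err = np.linalg.norm(w - w_hat) / np.linalg.norm(w)
print(f"relative error from the INT8 step alone: {rel_err:.4f}")
# Comparing downstream accuracy of this baseline against the full LLM.265 pipeline at a
# matched bitrate would show what the video-codec stages add beyond plain 8-bit quantization.
```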
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents the core claim that standard video codecs (H.264/H.265) can be repurposed as highly effective, general-purpose tensor codecs for Large Language Models (LLMs). The authors demonstrate this through three primary contributions: (1) An empirical investigation showing that the stages of a video codec's intra-frame pipeline (specifically entropy coding, DCT, and intra-prediction) are surprisingly effective at compressing various LLM tensors. (2) A system, "LLM.265," that leverages the existing, and typically idle, hardware video encoders/decoders (NVENC/NVDEC) on commercial GPUs for this task. (3) A proposal for a future "three-in-one" hardware codec that enhances a video codec's architecture for tensor compression while retaining video capabilities. My review will assess the novelty of these claims against prior art.
Strengths
The primary strength of this paper is its significant conceptual novelty.
-
A Genuinely New Systems-Level Insight: The central idea of repurposing on-chip hardware video codecs for tensor compression is, to my knowledge, entirely new. While tensor compression itself is a crowded field (quantization, pruning, etc.), this work sidesteps the conventional software- or custom-hardware-centric approaches. The insight that a significant, specialized hardware unit on the GPU die is sitting idle during LLM workloads and can be co-opted for a critical bottleneck (data movement) is a powerful and novel systems contribution. This is not merely a new algorithm but a new paradigm for resource utilization in existing hardware.
-
Novel Explanation of Mechanism: The paper does not simply state "it works" but provides a novel analysis of why it works in Section 3.1. While the use of Discrete Cosine Transform (DCT) for compressing data with spatial locality is not new (it's the basis of JPEG), its application here is justified with a novel lens: mitigating outliers (Figure 3). This is a distinct and more modern justification than the traditional signal-processing argument of energy compaction. Furthermore, the identification that intra-frame prediction successfully captures the channel-wise structure of weight tensors (Figure 4) while inter-frame prediction fails is a new and valuable empirical finding that informs their entire approach.
-
Novel Architectural Proposal Grounded in Evidence: The proposed "three-in-one" codec in Section 7 is a novel hardware design. Its novelty does not come from inventing a new compression algorithm from scratch, but from the principled modification and augmentation of a known, highly-optimized architecture. By identifying the uselessness of inter-frame prediction for tensors, the authors propose excising it to save area and power, and then re-investing those resources to boost throughput for the shared intra-frame pipeline. This "renovate, don't rebuild" approach is a novel design philosophy in the space of hardware accelerators for compression and stands in contrast to ground-up designs.
Weaknesses
My critiques are not to claim the work is derivative, but to more precisely circumscribe the boundaries of its novelty.
-
Constituent Components are Not New: The novelty lies in the synthesis, not the ingredients. The paper should be careful not to overstate the novelty of the underlying algorithms. DCT, context-adaptive entropy coding (CABAC), and predictive coding are all decades-old pillars of compression. The paper’s contribution is the discovery that this specific combination, designed for natural images, is unexpectedly effective for the statistical distributions found in LLM tensors. The title's use of "Secretly" is rhetorical; this is an empirical discovery of an emergent property, not the uncovering of a hidden, intentional design.
-
Novelty of Hardware Proposal vs. Prior Art Could Be Sharpened: In Section 7, the paper proposes a new hardware design. While I assess this design as novel, its positioning relative to other hardware compression accelerators could be more explicit. For example, Atalanta [40] proposes hardware for CABAC-based compression. The key novelty delta here is that this paper's proposal reuses the entire video codec frontend (prediction, transform) and is presented as a multi-purpose unit, whereas Atalanta is a more focused, from-scratch tensor-only block. A more direct discussion of these differing design philosophies would better highlight the specific novelty of their approach.
Questions to Address In Rebuttal
-
The use of DCT for neural network weight compression is not, in itself, a new idea; prior work has explored compressing weights in the frequency domain. Can the authors precisely articulate what novel benefit is gained by using the entire intra-frame video pipeline (i.e., prediction -> transform -> quantization -> entropy coding) over a simpler, custom pipeline that might only use DCT and entropy coding? Is there a synergistic effect between the stages that is critical to the observed performance?
-
The "zero-cost" hardware argument for using NVENC/NVDEC is compelling. However, the data path requires converting tensors from FP16/BF16 to 8-bit integers on the CUDA cores before sending them to the hardware unit. At what point (e.g., for smaller tensors or lower communication bandwidths) does the overhead of this data conversion and the API latency negate the benefits of using the "free" hardware? A characterization of this trade-off would strengthen the novelty claim by defining its practical boundaries.
-
Regarding the proposed three-in-one codec (Section 7), the novelty stems from adapting an existing architecture. This path implies accepting certain legacy design constraints from the original video codec. What are these constraints, and how do they compare to the freedom of a "clean slate" design for a tensor-only codec? In essence, what is the fundamental trade-off between the efficiency gained from reuse and the potential performance lost from not designing a purely optimal tensor compression pipeline from first principles?
HLX: A Unified Pipelined Architecture for Optimized Performance of Hybrid Transformer-Mamba Language Models
Abstract
The rapid increase in demand for long-context language models has revealed fundamental performance limitations in conventional Transformer architectures, particularly their quadratic computational complexity. Hybrid Transformer-Mamba models, which ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper proposes HLX, a unified hardware accelerator designed for Hybrid Transformer-Mamba language models. The authors identify performance bottlenecks in the two primary kernels, FlashAttention-2 (FA-2) and State-Space Duality (SSD), when executed on modern GPUs. To address these, they introduce two novel fine-grained pipelined dataflows: PipeFlash, to hide non-MatMul operational latency in attention, and PipeSSD, a fused and pipelined execution for Mamba-2's core computation to reduce memory traffic and pressure. The paper presents a unified hardware architecture (URSC) to execute these dataflows and evaluates it via a cycle-level simulator against GPU (A100, H100) and TPU baselines. The authors claim significant improvements in compute utilization, kernel-level speedup, end-to-end latency, and area/power efficiency.
Strengths
-
Well-Defined Problem: The performance analysis in Section 3 is competent. The paper correctly identifies known limitations of GPUs for these workloads: inter-operation dependencies in FA-2/FA-3 and the severe memory-bound nature of SSD. The identification of excessive on-chip memory requirements (642KB for a fused SSD block) as a primary blocker for performant GPU execution is a crucial and valid insight.
-
Sound Core Concepts: The proposed dataflows, PipeFlash and PipeSSD, are logical responses to the identified problems. Employing fine-grained pipelining to hide latency (PipeFlash) and to manage intermediate data size (PipeSSD) are well-established hardware acceleration principles. The block-level fusion of SSD operations is a direct and appropriate strategy to counter its low arithmetic intensity.
-
Methodologically Sound Evaluation Framework: The use of a cycle-level simulator and comparison against strong, contemporary baselines (A100, H100) using optimized kernels (FA-2, FA-3, provided SSD) is appropriate. The scaling of the proposed architecture's specifications (HLX30/60) to match the theoretical throughput and memory bandwidth of the baselines provides a reasonable basis for comparison.
Weaknesses
-
Optimistic Pipeline Balancing Claims: The paper claims its pipeline balancing scheme is robust, maintaining high utilization with less than 2% variation across model configurations (Section 4, page 9). This is highly suspect. The proposed method of adjusting the number of processed rows works cleanly only when model dimensions (block_size, d_head, d_state) are integer multiples of one another and of the pipeline depth. Real-world models often use dimensions that would break this harmony (e.g., d_head = 96, not 64 or 128). The paper provides no analysis of pipeline efficiency, stall cycles, or utilization degradation under these more realistic, non-ideal conditions. The claim of "nearly 100% compute utilization" is an idealized best-case scenario presented as a general property. (A simple padding model after this list quantifies the utilization ceiling such dimensions impose.)
Understated Overhead of "Unification": The analysis in Table 3 (Section 6, page 12) claims a mere 3-4% area and power overhead for supporting both Transformer and Mamba-2 kernels compared to a specialized, single-purpose design. This figure seems implausible. The Reconfigurable Vector Processing Engine (RVPE) and Update Engine (UpE) must contain significant, distinct datapath and control logic for operations that are not shared (e.g., softmax vs. cumsum/softplus, reciprocal for attention vs. none for state updates). The cost of muxing, expanded microcode, and control flow logic to manage two fundamentally different dataflows is likely far greater than stated. The analysis lacks the necessary detail to substantiate this claim.
-
Diminishing Returns on Batching: The results in Figure 17 (page 11) reveal a critical weakness that is not sufficiently emphasized: the speedup advantage of HLX over GPUs decreases as the batch size increases. The paper attributes this to GPUs leveraging increased parallelism, but this framing downplays the issue. It indicates that HLX is primarily optimized for low-batch scenarios and its architectural advantage erodes significantly in high-throughput inference settings, which are economically critical. This is a fundamental limitation of the architecture's scalability.
-
Complete Omission of Decode-Phase Performance: The paper focuses exclusively on the prefill stage of inference. A key motivation for using Mamba-based models is their efficient O(1) state update during the auto-regressive decode phase. The paper claims HLX is "well-suited" for this (Section 4, page 9) but provides zero evidence, simulation data, or analysis. The architectural requirements for efficient single-token decoding (low latency, high occupancy with minimal work) are vastly different from those for parallel prefill. Without this analysis, the evaluation of a "Hybrid Transformer-Mamba" accelerator is critically incomplete.
-
Narrow Scope of Attention Variants: The evaluation is confined to standard multi-head attention (as implemented in FA-2/FA-3). The brief mention of applicability to GQA/MLA in Section 6 is speculative and insufficient. Variants like Grouped-Query Attention fundamentally alter the K and V tensor shapes relative to Q, which would directly impact the assumptions made in the PipeFlash datapath and the pipeline balance calculations. A claim of general applicability requires empirical validation, not a hand-waving assertion.
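The padding model referenced in Weakness 1 bounds the utilization impact of non-divisible dimensions; the 64-wide lane assumption and the d_head values below are mine, chosen only to illustrate the arithmetic.

```python
import math

def tile_utilization(dim, lane_width):
    """Useful work / issued work when 'dim' is processed in fixed-width tiles."""
    tiles = math.ceil(dim / lane_width)
    return dim / (tiles * lane_width)

lane_width = 64                                   # assumed datapath width
for d_head in (64, 80, 96, 112, 128):
    u = tile_utilization(d_head, lane_width)
    print(f"d_head={d_head:4d}: best-case utilization {u:6.1%}")
# d_head = 96 on 64-wide lanes already caps utilization at 75% before any pipeline
# stalls are counted, which is why the <2% variation claim needs non-ideal data points.
```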
Questions to Address In Rebuttal
-
Provide quantitative data on the pipeline utilization of HLX when running a model with non-ideal dimensions (e.g., d_head = 96, block_size = 192, or a non-power-of-two value). What is the performance degradation compared to the idealized cases presented?
-
Present a detailed area and power breakdown of the RVPE and UpE components, clearly delineating the logic dedicated solely to attention, solely to Mamba-2, and shared between them. Justify how the total overhead for unification amounts to only 3-4%.
-
Address the performance trend with increasing batch size. Is there a batch size at which the H100 GPU baseline would match or exceed the performance of HLX for end-to-end inference? If so, what is it?
-
Provide a thorough analysis of the HLX architecture's performance during the auto-regressive decode phase for a representative sequence length. What is the single-token latency, and how does the architecture's utilization hold up in this serial, memory-latency-bound phase?
-
Demonstrate the claimed flexibility by providing performance results (speedup and utilization) for HLX running an attention layer with Grouped-Query Attention (GQA). How is the pipeline balancing in Figure 13 affected when the number of K/V heads is a fraction of the Q heads?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces HLX, a unified and pipelined hardware accelerator specifically designed for the emerging class of Hybrid Transformer-Mamba language models. The authors correctly identify a key challenge in this domain: these models exhibit heterogeneous computational patterns, with performance bottlenecks shifting between the attention kernel (FlashAttention-2) and the state-space model kernel (State-Space Duality, SSD) depending on the sequence length.
The core contribution is a holistic, algorithm-hardware co-design solution. The authors propose two novel, fine-grained pipelined dataflows: "PipeFlash" to hide non-MatMul latency in attention, and "PipeSSD" to fuse the disparate operations of the SSD algorithm, thereby increasing data reuse and reducing memory traffic. These dataflows are instantiated on a unified hardware architecture, the "Unified Reconfigurable Streamlined Core" (URSC), which is capable of efficiently executing both computational patterns. The paper presents compelling simulation results demonstrating significant improvements in compute utilization, latency, and power/area efficiency compared to high-end GPUs like the A100 and H100.
Strengths
The true strength of this paper lies in its insightful contextualization and response to a clear and present trend in large language model architecture.
-
Exceptional Timeliness and Problem Formulation: The research community is rapidly converging on hybrid architectures as a pragmatic solution for long-context modeling, blending the recall of attention with the efficiency of SSMs. This paper is not just chasing a trend; it is one of the first to deeply analyze the systems-level performance implications of this architectural synthesis and propose a dedicated hardware solution. The analysis in Section 1 and Figure 1 perfectly frames the problem of shifting bottlenecks, which is the central motivation for a unified architecture.
-
Strong Algorithm-Hardware Co-Design Philosophy: The paper's most significant contribution is not merely the hardware, but the co-designed dataflows, PipeFlash and PipeSSD. Instead of accelerating the existing kernels as-is, the authors re-architect the computation to be pipeline-friendly. PipeSSD, in particular (Section 4.1, Figure 10), is an excellent example of this. It takes the five distinct GPU kernels of the baseline SSD and fuses them into a single, streamlined, multi-stage pipeline, fundamentally changing the execution model to favor on-chip data movement and reuse. This demonstrates a deep understanding of where the true inefficiencies lie.
-
A Unified Architecture for a Hybrid Future: The design of the URSC is a direct and elegant answer to the problem statement. By creating a flexible core with a Dot-Product Engine (DPE), a Reconfigurable Vector Processing Engine (RVPE), and an Update Engine (UpE), the authors provide a substrate that can be configured to map both the PipeFlash and PipeSSD pipelines (as shown beautifully in Figure 12). This moves beyond the typical dichotomy of accelerators for either attention or SSMs (as seen in prior work like SOFA and MARCA, which they correctly position in Section 6) and instead provides a blueprint for accelerating composite models.
-
Connecting Algorithmic Theory to Hardware Reality: The work successfully bridges the gap between the theoretical properties of models and their practical performance. It correctly identifies why FA-2/FA-3 saturate in utilization (inter-operation dependency) and why SSD is memory-bound on GPUs (high intermediate data volume, low reuse across kernels). The proposed solutions directly target these identified root causes.
Weaknesses
The weaknesses of the paper are primarily related to the scope of its evaluation and its forward-looking positioning, rather than fundamental flaws in the core idea.
-
Prefill-Centric Evaluation: The entire evaluation focuses on the "prefill" or "encoding" phase, where a long context is processed in parallel. While this is a critical part of long-context inference, the paper completely omits an analysis of the autoregressive "decode" phase, where tokens are generated one at a time. This phase is notoriously memory-bandwidth bound and has very different performance characteristics. A truly "unified" solution for inference must be efficient in both regimes. The authors claim in Section 5.1 (page 9) that HLX is "well-suited" for this, but without data, this remains an unsubstantiated claim. (A one-line roofline estimate after this list shows how different this regime is.)
-
The Moving Target of GPU Architectures: The comparisons to A100 and H100 are fair contemporary baselines. However, GPU architectures are not static. The very limitations HLX exploits (e.g., rigid SIMT execution for heterogeneous warps, coarse-grained memory movers) are areas of active research and development by GPU vendors. The paper would be strengthened by a discussion of how its architectural advantages would hold up against a future GPU that might incorporate more flexible pipeline support or more powerful asynchronous execution primitives.
-
Generalizability to Future Model Variants: The authors briefly touch upon the applicability of PipeFlash to variants like GQA and MLA (Section 6, page 13), arguing that the core computation remains the same. While this is likely true, this work opens up the question of how such a pipelined architecture would handle more radically different future models, such as those with highly dynamic data-dependent routing (e.g., Mixture-of-Experts) or fine-grained sparsity. The reconfigurable nature of the RVPE suggests potential, but this is an unexplored frontier.
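The decode-phase gap flagged in Weakness 1 can be bounded with a one-line roofline estimate; the model size, cache size, and bandwidth figures below are my own assumptions, not numbers from the paper.

```python
def decode_token_latency_ms(param_bytes, kv_cache_bytes, mem_bw_gbs):
    """Bandwidth-bound floor for one autoregressive step at batch size 1."""
    bytes_moved = param_bytes + kv_cache_bytes
    return bytes_moved / (mem_bw_gbs * 1e9) * 1e3

params_gb = 7 * 2               # assumed 7B-parameter hybrid model held in FP16
kv_gb = 1.5                     # assumed KV/state cache for a long context
for name, bw in [("HBM-class,   2000 GB/s", 2000), ("LPDDR-class,  200 GB/s", 200)]:
    t = decode_token_latency_ms(params_gb * 1e9, kv_gb * 1e9, bw)
    print(f"{name}: >= {t:.2f} ms per token")
# In this latency-bound regime pipeline balance matters far less than keeping any
# unit busy at all, which is exactly the analysis the paper does not provide.
```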
Questions to Address In Rebuttal
-
The evaluation is centered on the prefill phase. Could the authors provide some analysis, even if qualitative or theoretical, on how the HLX architecture and its fine-grained pipeline would perform during the memory-bandwidth-bound autoregressive decoding phase? How would pipeline stalls be managed when processing a single token, and what would the expected utilization be?
-
The paper makes a compelling case against current GPU architectures. However, what are the fundamental architectural advantages of the proposed URSC that could not be reasonably integrated into a next-generation GPU? In other words, is the proposed execution model a specialized one-off, or does it offer general principles that could inform the evolution of commercial accelerators?
-
Could the authors elaborate on how the proposed PipeSSD dataflow, which is tailored for Mamba-2's SSD, would need to be adapted to handle other structured SSM variants, such as the original Mamba or newer models like Zamba2 (Ref [15, 16]) which may have different internal operations or data dependencies? This would help clarify the robustness of the proposed architectural template.
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present HLX, a unified, pipelined hardware accelerator designed for Hybrid Transformer-Mamba language models. The central thesis is that the heterogeneous computational patterns of the two core kernels—FlashAttention-2 (FA-2) for the Transformer portion and State-Space Duality (SSD) for the Mamba-2 portion—create shifting bottlenecks that limit performance on general-purpose hardware like GPUs.
The paper’s novel claims are encapsulated in two proposed dataflows and one unified architecture: 1. PipeFlash: A fine-grained, row-level pipelined dataflow for attention computations, designed to hide the latency of non-matrix multiplication (non-MatMul) operations by mitigating inter-operation dependencies present in block-level approaches like FA-2. 2. PipeSSD: A novel dataflow for Mamba-2’s SSD kernel that first fuses the distinct computational steps (chunk cumsum, chunk state, etc.) into a single conceptual block-level operation and then applies a fine-grained, dependency-aware pipeline to this fused kernel. 3. A Unified Hardware Architecture (URSC): A specialized hardware core designed explicitly to execute both PipeFlash and PipeSSD efficiently, bypassing the limitations the authors identify in GPU SIMT execution models for this type of heterogeneous pipelining.
Strengths
From a novelty perspective, the paper’s primary strengths are:
-
A Novel Strategy for the SSD Kernel: The most significant novel contribution is the approach to accelerating the SSD kernel. While operator fusion is a known technique (e.g., FlashAttention), its application to the five distinct and memory-intensive kernels of SSD (as shown in Figure 5, page 4) appears to be new. The subsequent proposal of a fine-grained pipeline (PipeSSD) to manage the complex row-wise and column-wise dependencies within this newly fused kernel is a non-trivial and novel contribution. The analysis in Section 3.2 (page 6) correctly identifies that naively fusing SSD on a GPU fails due to on-chip memory constraints, which provides a strong motivation for their novel hardware/software co-design approach.
-
Architectural Specialization for Fine-Grained Pipelining: The closest prior art for pipelining attention is FlashAttention-3 [47], which introduces a 2-stage asynchronous pipeline using warp specialization on NVIDIA's Hopper architecture. PipeFlash differentiates itself by proposing a much finer granularity (row-level, as shown in Figure 9, page 6) and a multi-stage pipeline implemented on a specialized, non-SIMT architecture (the URSC). This represents a novel architectural path, departing from the "make the GPU do it" approach and instead arguing for specialized hardware to overcome fundamental GPU limitations (cited in Section 3.3, page 6) for this workload.
-
The Unified Nature of the Accelerator: The current landscape of accelerators for large language models has bifurcated, with works focusing on Attention (e.g., SOFA [52]) or SSMs (e.g., MARCA [26], VGA [25]) separately. The proposal of a single, unified architecture that natively supports both computational patterns of an emerging and important class of hybrid models is a timely and novel contribution. The overhead analysis in Table 3 (page 12) suggests this unification is achieved with high efficiency, which strengthens the novelty claim.
Weaknesses
My concerns are focused on precisely delineating the boundaries of the novelty against existing concepts:
-
Conceptual Proximity of PipeFlash to FA-3: The conceptual foundation of PipeFlash—overlapping non-MatMul computation with MatMul computation in attention—is identical to that of FlashAttention-3. FA-3 uses a producer-consumer model with specialized warps. PipeFlash uses a multi-stage pipeline on a specialized datapath. While the implementations are worlds apart (GPU vs. ASIC), the core algorithmic insight is the same. The paper’s novelty here is less about a new pipelining idea and more about a new, specialized hardware implementation of that idea. This distinction should be made clearer.
-
"Fusion" as an Application of a Known Principle: The claim that "no research has yet fused SSD" (Section 2, page 2) may be accurate for published literature, but the principle of fusing memory-bound operators to improve arithmetic intensity is a cornerstone of performance engineering. The novelty lies not in the idea of fusion itself, but in the specific method for managing the complex dependencies of the SSD algorithm post-fusion. The paper's contribution is the design of the PipeSSD dataflow that makes fusion practical, not the abstract idea of fusion.
-
Under-explored Comparison to Reconfigurable Architectures: The URSC is described as a "unified reconfigurable streamlined core." The field of reconfigurable computing has a long history. While the paper compares HLX to GPUs and other LLM accelerators, it does not situate its reconfigurable datapath (RVPU in Figure 11c, page 8) within the context of prior work on reconfigurable dataflow architectures. It is unclear if the reconfigurability itself contains novel mechanisms or if it simply uses standard techniques to switch between the dataflows required for PipeFlash and PipeSSD.
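To quantify what is at stake in Weakness 2, a toy arithmetic-intensity model is sketched below; the FLOP count and tensor sizes are assumed values, not the real SSD kernel shapes.

```python
def arithmetic_intensity(flops, bytes_in, bytes_out, bytes_intermediate, fused):
    """FLOPs per DRAM byte; fusion keeps intermediates on chip."""
    traffic = bytes_in + bytes_out
    if not fused:
        traffic += 2 * bytes_intermediate     # each intermediate is written, then re-read
    return flops / traffic

flops = 4e9                                    # assumed work for one fused block
bytes_in, bytes_out, bytes_mid = 64e6, 64e6, 512e6
unfused = arithmetic_intensity(flops, bytes_in, bytes_out, bytes_mid, fused=False)
fused = arithmetic_intensity(flops, bytes_in, bytes_out, bytes_mid, fused=True)
print(f"unfused: {unfused:5.1f} FLOP/byte   fused: {fused:5.1f} FLOP/byte")
# The hard part, and the paper's claimed contribution, is scheduling the fused block
# so its working set fits on chip; the fusion principle itself is the easy half.
```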
Questions to Address In Rebuttal
-
Please elaborate on the fundamental novelty of the PipeFlash dataflow when compared to the asynchronous pipelining in FlashAttention-3. Is the contribution a new pipelining concept for attention, or is it a more efficient hardware implementation of the known producer-consumer pipelining principle, made possible only by a specialized, non-SIMT architecture?
-
Regarding the fusion of SSD kernels: Is the novelty in the idea of fusing these kernels, or is it in the specific dependency analysis and pipeline design (PipeSSD) that overcomes the on-chip memory barriers that prevent this "obvious" optimization from working on GPUs? Clarifying this would sharpen the paper's claimed contribution.
-
Could the authors contrast the architectural novelty of the URSC with prior work in reconfigurable dataflow computing? Specifically, what is the key architectural innovation within the RVPU's "local NoC" and associated units that enables it to efficiently handle the distinct requirements of both the softmax-centric PipeFlash and the cumsum-centric PipeSSD with minimal overhead, beyond simply instantiating the necessary functional units?
ORCHES: Orchestrated Test-Time-Compute-based LLM Reasoning on Collaborative GPU-PIM HEterogeneous System
Abstract
Recent breakthroughs in AI reasoning, enabled by test-time compute (TTC) on compact large language models (LLMs), offer great potential for edge devices to effectively execute complex reasoning tasks. However, the intricate inference pipelines associated ...
Reviews
Review 1
Review Form:
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose ORCHES, a heterogeneous GPU-PIM system designed to accelerate Test-Time-Compute (TTC) based LLM reasoning on edge devices. The paper identifies three primary challenges in TTC workloads: (1) variable parallelism complicating scheduling, (2) inter-step dependencies hindering pipelining, and (3) memory fragmentation from branch pruning. To address these, ORCHES introduces three corresponding techniques: (1) an adaptive workload assignment strategy, (2) a branch prediction mechanism to enable speculative pipelining, and (3) a memory management scheme to mitigate fragmentation. The system is evaluated via simulation, and the authors claim significant speedups (4.16× for text, 3.10× for vision) over a baseline GPU implementation.
Strengths
-
Problem Formulation: The paper provides a clear and structured breakdown of the unique computational challenges posed by TTC-based reasoning pipelines (Section 3, page 4). The identification of variable parallelism, branch dependencies, and memory fragmentation as key barriers is logical and well-articulated.
-
Comprehensive Solution: The proposed ORCHES framework is comprehensive, with each of its three core techniques (T1, T2, T3) directly targeting one of the identified challenges. This demonstrates a thorough approach to system design.
-
Detailed Mechanisms: The paper details the mechanisms for its proposed techniques, including the analytical models for workload partitioning (Section 4.2, page 6) and the history alignment strategy for the candidate predictor (Section 4.3.1, page 8).
Weaknesses
My primary concerns with this manuscript center on the evaluation methodology, the lack of crucial performance-cost analysis for the proposed techniques, and the potential for an overstated problem definition.
-
Questionable Evaluation Baseline and Methodology:
- The performance claims are based on a simulation framework extended from AttAcc [25]. While leveraging existing simulators is standard practice, the complexity of the proposed scheduling and memory management in ORCHES raises concerns about the fidelity of a simulation. Real-world overheads from the OS, memory controller contention, and interconnect latency in a tightly-coupled heterogeneous system are notoriously difficult to model accurately.
- The comparison against AttAcc [25] and Duplex [40] in Figure 11 is fundamentally flawed. These systems are designed for general LLM inference, not the highly specialized, multi-step, branch-intensive workloads of TTC. Comparing a purpose-built system (ORCHES) to systems not designed for the target workload inflates the perceived benefits. The most critical baseline—a highly optimized software-only implementation of the same TTC reasoning pipeline on the baseline GPU (NVIDIA AGX Orin)—appears to be missing. Without this, it's impossible to discern how much of the speedup comes from the novel hardware and how much could be achieved through superior software scheduling on existing hardware.
-
Unquantified Misprediction Penalty:
- Technique 2 relies on a candidate verification predictor to enable speculative execution. Table 4 shows the predictor achieves ~78% accuracy after applying the "history alignment" mechanism. While an improvement, a 22% misprediction rate is substantial. The paper states that on a misprediction, the system must "roll back to the correctly selected candidate and regenerate the output" (Section 4.3.1, page 8). However, the latency cost of this rollback and regeneration process is never quantified. Without knowing the misprediction penalty, the entire benefit of the pipelining technique is unsubstantiated. A high penalty could easily negate the gains from the 78% of correct predictions. The case studies in Figure 13 show only the ideal scenario and are not sufficient evidence. A back-of-envelope model of this trade-off is sketched after this list.
-
Under-substantiated Overhead Claims for Memory Management:
- Technique 3 introduces a complex memory management system involving an address cache, dynamic reorganization, and a controller-side buffer. The authors claim in Section 5.5 (page 12) that the average runtime overhead of this reorganization is "only 0.12%, which is negligible in practice." This figure seems extraordinarily low for a process that involves tracking fragmentation and physically moving KV cache segments in memory. The paper provides no breakdown of how this 0.12% was calculated, what operations it includes (e.g., data movement, metadata updates), or how frequently the reorganization is triggered. This claim lacks credibility without a detailed analysis.
-
Problem Framing May Be an Artifact of a Specific Setup:
- Challenge 2, "Branch Dependencies Hinder Pipeline Execution," is primarily motivated by the scenario where the Process Reward Model (PRM) is significantly larger than the policy model (e.g., an 8B PRM verifying a 1B policy model, as shown in Figure 6, page 5). While this may be a valid configuration from a specific paper [18], it is not a fundamental, immutable property of TTC. The performance bottleneck is a direct consequence of an algorithmic choice. The paper frames this as a general hardware challenge, but one could just as easily argue it is a software problem that could be mitigated by using more balanced model sizes. The generalizability of this "challenge" is therefore questionable.
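To make Weakness 2 concrete, the back-of-envelope model below combines the reported ~78% predictor accuracy with placeholder values for the speculation benefit and the rollback penalty; neither placeholder comes from the paper, which is precisely the gap the rebuttal should fill.

```python
# Expected per-step time under speculative pipelining with an imperfect predictor.
# Only the ~78% accuracy comes from the paper (Table 4); the benefit and penalty
# values are hypothetical placeholders the rebuttal should replace with real data.
def expected_step_time(t_base, overlap_saving, rollback_penalty, accuracy):
    t_hit = t_base - overlap_saving          # speculation hides part of the step
    t_miss = t_base + rollback_penalty       # rollback + regeneration on a miss
    return accuracy * t_hit + (1 - accuracy) * t_miss

t = expected_step_time(t_base=1.0, overlap_saving=0.4, rollback_penalty=0.6, accuracy=0.78)
print(f"expected time per step: {t:.2f}x baseline")   # 0.82x with these placeholders
# Break-even: the gain vanishes once rollback_penalty exceeds
# overlap_saving * accuracy / (1 - accuracy), roughly 1.4x of a baseline step here.
```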
Questions to Address In Rebuttal
The authors must address the following points to establish the technical soundness of their work:
-
Misprediction Penalty: What is the precise latency cost, in cycles or milliseconds, of a single branch misprediction event in your proposed system? Please provide a detailed breakdown of the rollback and regeneration overhead. How does this penalty affect the overall speedup when factoring in the ~22% misprediction rate?
-
Memory Reorganization Overhead: Provide a detailed breakdown of the 0.12% runtime overhead claimed for Technique 3. This breakdown should include the cost of data movement (reads and writes), metadata management for the address cache, and the computational cost of the reorganization logic itself. How was this measured in the simulator?
-
Baseline Justification: Please justify the choice of AttAcc and Duplex as primary comparison points, given they are not optimized for TTC workloads. More importantly, provide performance data comparing ORCHES against a state-of-the-art, software-only TTC implementation (using, for example, optimized kernels, batching, and scheduling) running on the standalone baseline AGX Orin GPU.
-
Analytical Model Fidelity: The offline and online scheduling strategies (Technique 1) depend on an analytical performance model (Equations 1-7). What evidence is there that this simplified model accurately predicts the performance of complex operators on real heterogeneous hardware, accounting for factors like cache contention and memory interference?
-
Sensitivity to Model Configuration: How do the reported speedups change if the policy model and PRM are similarly sized (e.g., 3B policy and 3B PRM)? Does "Challenge 2" cease to be a significant bottleneck, and if so, how does that impact the contribution of Technique 2 to the overall performance?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents ORCHES, a heterogeneous GPU and Processing-in-Memory (PIM) system co-designed to accelerate a specific and increasingly important class of workloads: multi-step, Test-Time-Compute (TTC) based LLM reasoning. The authors' core contribution is not merely the application of PIM to LLMs, but the insightful identification of a new set of system-level challenges unique to these reasoning pipelines. They astutely observe that TTC workloads are fundamentally different from standard, single-step LLM inference.
The authors categorize these new challenges into three key barriers: 1. Variable Parallelism (C1): The workload dynamically shifts between compute-bound (e.g., verification/prefilling) and memory-bound (e.g., policy model decoding), complicating static scheduling. 2. Branch Dependencies (C2): The sequential nature of the reasoning steps (generation followed by verification) creates pipeline stalls that hinder throughput. 3. Memory Fragmentation (C3): The pruning of unsuccessful reasoning "branches" leads to sparse and fragmented memory, which degrades the performance of memory-sensitive architectures like PIM.
In response, ORCHES proposes a tightly integrated set of three corresponding techniques: adaptive workload assignment (T1), speculative branch-aware pipelining (T2), and fragmentation-aware memory structuring (T3). The work positions itself as a forward-looking solution for enabling complex AI reasoning on resource-constrained edge devices, demonstrating significant speedups over state-of-the-art baselines in simulation.
Strengths
-
Excellent Problem Formulation and Contextualization: The primary strength of this paper lies in its clear and compelling problem definition. The authors do an exceptional job of explaining why existing LLM acceleration techniques, which are optimized for monolithic inference, are insufficient for the emerging paradigm of TTC. The breakdown of the problem into the three challenges (C1, C2, C3) in Section 3 (pages 4-5) is insightful and provides a strong foundation for the proposed solutions. This work successfully frames TTC not just as another LLM task, but as a distinct algorithmic workload with unique system-level implications.
-
Elegant Synthesis of Cross-Disciplinary Concepts: The proposed solutions are a thoughtful synthesis of ideas from different domains of computer science. Technique 2 (Branch Prediction Facilitating Pipelining, Section 4.3, page 8) is a particularly clever adaptation of a cornerstone concept from classical CPU architecture—speculative execution—to hide the latency of inter-step dependencies in a reasoning pipeline. Similarly, Technique 3 (Memory Structuring, Section 4.4, page 9) draws parallels to memory management and garbage collection strategies from operating systems. This cross-pollination demonstrates a deep understanding of systems design and elevates the work beyond a simple application-specific accelerator.
-
A Forward-Looking Perspective on AI Systems: This research is timely and significant. The field of AI is rapidly moving beyond simple text generation towards more complex, multi-step reasoning, as seen in agentic systems, Chain-of-Thought, and Tree-of-Thoughts. This paper is one of the first to tackle the systems-level challenges of these algorithms head-on. By treating the entire reasoning process as the target for optimization, ORCHES provides a blueprint for a new class of "reasoning accelerators." Its focus on enabling compact models to achieve the performance of much larger ones through efficient computation is a critical direction for deploying advanced AI on the edge.
-
Systematic and Coherent Solution: The one-to-one mapping of the proposed techniques (T1, T2, T3) to the identified challenges (C1, C2, C3) results in a very coherent and compelling narrative. The system feels thoughtfully architected rather than being a collection of disparate optimizations. The ablation studies presented in the evaluation (e.g., Figure 12, page 11) effectively demonstrate that each component contributes meaningfully to the overall performance, reinforcing the validity of the initial problem analysis.
Weaknesses
While the core ideas are strong, the paper could be strengthened by addressing the following points, which are more about depth and potential limitations than fundamental flaws.
-
Reliance on Simulation: The evaluation is conducted entirely within a simulated environment. While this is standard practice for novel architecture proposals, the significance of the results hinges on the fidelity of the underlying performance and power models, especially in a complex heterogeneous system. A deeper discussion on the calibration of the simulator against real hardware (beyond referencing prior work) would build more confidence in the reported speedup and energy figures.
-
Scope of TTC Generalizability: The paper focuses on a specific TTC structure involving a policy model and a process reward model (PRM). However, the landscape of reasoning algorithms is evolving. It is unclear how well the ORCHES design principles would map to other structures, such as Monte Carlo Tree Search (MCTS) in AlphaCode-style generation, or agentic workflows that involve external tool use and dynamically change the nature of the computation at each step. The current design is tightly coupled to the generate-verify loop.
-
Overhead Analysis of Memory Management: Technique 3 is presented as a highly effective solution to memory fragmentation, with the authors stating the average runtime overhead is a "negligible" 0.12% (Section 5.5, page 12). This figure seems exceptionally low for a process that involves tracking, buffering, and reorganizing memory. A more detailed cost-benefit analysis is warranted. For example, what is the latency of the reorganization process itself, and how does its trigger policy (e.g., after 3-5 steps) impact performance under different reasoning depths and branch widths? There might be corner cases where this overhead becomes more significant.
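As a sanity check on that figure, the rough model below plugs in assumed values for the data volume, copy bandwidth, and per-step latency; only the 3-5-step trigger interval is taken from the paper. The point is how sensitive the claimed 0.12% is to parameters the paper does not report.

```python
# Back-of-envelope check on the 0.12% claim. Only the 3-5-step trigger interval
# is taken from the paper; every other value is an assumption for illustration.
bytes_moved_per_reorg = 64 * 2**20          # assume ~64 MiB of KV segments compacted
effective_bandwidth  = 100e9                # assume ~100 GB/s usable for the copy
reorg_latency = bytes_moved_per_reorg / effective_bandwidth     # ~0.67 ms per event

step_latency = 0.2                          # assume ~200 ms per reasoning step
steps_between_reorgs = 4                    # paper: triggered roughly every 3-5 steps
overhead = reorg_latency / (steps_between_reorgs * step_latency)
print(f"reorg ~{reorg_latency*1e3:.2f} ms, overhead ~{overhead*100:.2f}%")   # ~0.08%
# Halving the bandwidth or the step latency, or doubling the bytes moved, already
# pushes this past 0.12% -- hence the request for a measured breakdown.
```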
Questions to Address In Rebuttal
-
The proposed system is expertly tailored to the generate-verify structure of the evaluated TTC pipelines. Could the authors comment on the applicability of the ORCHES framework to other multi-step reasoning paradigms like Tree-of-Thoughts (ToT), which involves more complex state evaluation and backtracking, or agentic systems that might call external APIs, introducing unpredictable latency? Does the core principle of separating and speculating on distinct computational steps still hold?
-
Regarding Technique 2 (Branch Prediction), Table 4 (page 11) shows that the history alignment mechanism significantly improves prediction accuracy. Could you provide more insight into the performance trade-offs? Specifically, what is the misprediction penalty in terms of latency or wasted work, and how does this penalty interact with the predictor's accuracy to determine the overall speedup from speculation?
-
Could the authors provide a more detailed breakdown of the 0.12% runtime overhead claimed for Technique 3? Specifically, what is the latency of a single memory reorganization operation, and what is the typical frequency of this operation in your benchmarks? Understanding these two factors would help clarify how the overhead remains so low across different workloads.
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper introduces ORCHES, a heterogeneous GPU-PIM system designed to accelerate Test-Time-Compute (TTC) based Large Language Model (LLM) reasoning. The authors first identify a set of challenges unique to TTC workloads that are not present in standard single-step LLM inference: C1) variable parallelism complicating scheduling, C2) inter-step branch dependencies hindering pipelining, and C3) branch pruning inducing memory fragmentation. To address these, the authors propose a system integrating three primary techniques: T1) adaptive workload assignment between GPU and PIM, T2) a speculative, branch-aware pipelining mechanism, and T3) a fragmentation-aware memory structuring scheme.
My analysis concludes that while the system-level integration and the specific application to the TTC reasoning problem are well-executed, the novelty of the core underlying techniques is limited. Many of the proposed solutions are adaptations of well-established concepts from heterogeneous computing, speculative execution, and memory management. The primary contribution of this work is therefore not the invention of new primitives, but rather the insightful characterization of the TTC workload and the synthesis of existing ideas into a cohesive system to solve that specific problem.
Strengths
-
Novel Problem Characterization: The paper's most significant novel contribution is its in-depth analysis and characterization of the TTC-based LLM reasoning workload in Section 3 (Pages 4-5). The identification of dynamically evolving compute patterns due to the changing ratio of shared-to-unique KV caches (Section 3.1.2) is a sharp and valuable insight that clearly distinguishes this workload from standard LLM serving. This analysis provides a strong motivation for a specialized solution.
-
System-Level Synthesis: The authors have assembled a coherent system by integrating techniques from different domains. The novelty lies in this synthesis—recognizing that a combination of adaptive scheduling, speculation, and custom memory management is required to holistically address the TTC problem on a GPU-PIM architecture.
-
Refinement of an Existing Idea: Within the broader "branch prediction" technique (T2), the proposed "history alignment" strategy (Section 4.3.1, Figure 9c, Page 8) is a clever and potentially novel refinement. Using the more accurate historical scores from the large verification model to condition the lightweight prediction model is a non-obvious mechanism to improve the accuracy of a speculative process.
Weaknesses
My primary concerns relate to the novelty of the core technical contributions when evaluated individually against prior art.
-
T1: Adaptive Assignment is a Known Concept: The core idea of partitioning workloads between heterogeneous processors (GPU and a co-processor like PIM) based on their arithmetic intensity (Figure 4, Page 5) is a foundational principle of heterogeneous computing. This methodology has been explored for decades in the context of CPU-GPU systems. While the online compensation model (Section 4.2.2) adds a dynamic element, the fundamental approach of mapping compute-bound kernels to the GPU and memory-bound kernels to a memory-centric accelerator is not new.
-
T2: "Branch Prediction" is Conceptually Indistinguishable from Speculative Decoding: The proposed "branch prediction" mechanism (Section 4.3, Page 8) is a direct application of the "draft-then-verify" paradigm, which is the cornerstone of speculative decoding in LLMs. The body of work on speculative decoding is extensive (e.g., Chen et al., 2023, "Accelerating large language model decoding with speculative sampling"; Leviathan et al., 2023, "Fast inference from transformers via speculative decoding"). The authors' mechanism uses a smaller model (a subset of the PRM layers) to "draft" a likely path, which is then "verified" by the larger model. This is functionally identical to speculative decoding, merely applied to reasoning branches instead of token sequences. The paper acknowledges this in Related Work (Section 6, Page 12) but does not sufficiently differentiate its core mechanism as a novel contribution. The renaming of the technique to "branch prediction" does not create novelty.
-
T3: Memory Structuring Leverages Standard Techniques: The techniques proposed for memory management (Section 4.4, Page 9) are a combination of well-known solutions.
- Memory Compaction: The process of reorganizing memory to eliminate fragmentation ("holes") is a classic technique used in garbage collectors and memory management units for decades.
- Caching and Buffering: The use of an address cache and a controller-side buffer are standard architectural optimizations to reduce latency and manage data movement.
- Overlap with PageAttention: The problem of managing a dynamic and sparse KV-cache has been famously addressed by PageAttention (Kwon et al., 2023). PageAttention uses a virtual-to-physical mapping akin to OS page tables to handle non-contiguous memory blocks. ORCHES instead appears to enforce contiguity via periodic reorganization. While the implementation differs, the high-level problem it solves is not new, and the paper should provide a more direct and rigorous comparison to this state-of-the-art baseline. The claim that T3 "achieves both the elimination of memory waste and the contiguous storage" (Section 6, Page 13) is the key delta, but it comes at the cost of reorganization overhead, a trade-off that is not fully explored against prior art.
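The toy comparison below illustrates the trade-off that should be quantified: per-access indirection in a PageAttention-style scheme versus periodic data movement in a compaction-based scheme. Every parameter is an illustrative assumption, not a measurement from either system.

```python
# Toy trade-off model: indirection-on-every-access (PageAttention-style) versus
# periodic compaction (ORCHES-style). All values are illustrative assumptions.
accesses_per_step = 1_000_000
indirection_cost  = 2e-9                        # assume ~2 ns extra per KV access
paged_overhead = accesses_per_step * indirection_cost            # ~2 ms per step

bytes_moved_per_compaction = 32 * 2**20         # assume ~32 MiB moved each time
bandwidth = 100e9                               # assume ~100 GB/s for the copy
steps_between_compactions = 4
compaction_overhead = (bytes_moved_per_compaction / bandwidth) / steps_between_compactions

print(f"paged: ~{paged_overhead*1e3:.2f} ms/step, compaction: ~{compaction_overhead*1e3:.2f} ms/step")
# Which approach wins flips with these parameters (access count, block-table hit
# rate, copy bandwidth) -- the sensitivity the rebuttal should quantify.
```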
Questions to Address In Rebuttal
-
Regarding T2 (Branch Prediction): Please articulate the fundamental conceptual novelty of your proposed "branch prediction" mechanism (Section 4.3) compared to the existing body of work on speculative decoding. Beyond the application context (reasoning steps vs. output tokens), what makes the core "draft-then-verify" process presented here different and novel? Is the "history alignment" technique the sole point of novelty in this contribution?
-
Regarding T3 (Memory Management): Please provide a more detailed comparison of your memory reorganization approach (Section 4.4) with PageAttention. Specifically, can you quantify the performance trade-offs between your approach (which incurs runtime overhead for compaction to maintain contiguity) and the PageAttention approach (which avoids compaction overhead but may incur latency penalties from non-contiguous memory access patterns)? Why is your chosen approach superior for a collaborative GPU-PIM system?
-
Regarding Complexity vs. Benefit: The proposed system introduces significant complexity with three distinct optimization techniques running concurrently. The online scheduling compensation in T1, for example, relies on an analytical model that may have its own inaccuracies. Could the authors demonstrate that this combined complexity provides a benefit that is substantially greater than applying just one or two of the more novel refinements (e.g., only the history-aligned speculation)? Is it possible that a simpler, static partitioning scheme combined with your memory manager would yield a large fraction of the benefits with much lower complexity?
LoopFrog: In-Core Hint-Based Loop Parallelization
Abstract
To scale ILP, designers build deeper and wider out-of-order superscalar CPUs. However, this approach incurs quadratic scaling complexity, area, and energy costs with each generation. While small loops may benefit from increased instruction-window sizes ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose LoopFrog, a hardware/software co-design for speculative in-core loop parallelization on wide out-of-order processors. The scheme uses compiler-inserted hint instructions (detach, reattach, sync) to delineate parallelizable loop regions, which are then executed on lightweight, OS-transparent "threadlets." Data versioning and conflict detection are managed by a new microarchitectural unit, the Speculative State Buffer (SSB), and a conflict detector. The authors claim a geometric mean speedup of 9.5% on SPEC CPU 2017 over a strong 8-wide baseline, with what they characterize as "modest" overheads.
However, the work's central claims are predicated on several critical idealizations in the experimental methodology, most notably a perfect conflict detection mechanism and an oracle-based loop selection strategy. These assumptions sidestep the most challenging practical aspects of thread-level speculation, calling into question the validity and achievability of the reported results.
Strengths
- Strong Baseline: The evaluation is performed against a convincing, aggressive 8-wide out-of-order CPU model (Table 1, page 9). This provides a challenging baseline and ensures that the reported speedups are not merely due to deficiencies in the baseline architecture.
- ISA Design: The hint-based ISA extension is a reasonable approach. It offloads the difficult problem of dynamic task detection to the compiler, which is the appropriate place for it, while maintaining backward compatibility.
- Detailed Analysis: The paper provides a breakdown of performance gains into sub-categories (Table 2, page 11), attempting to explain the sources of speedup. This analysis, particularly the identification of prefetching effects, is insightful, even if it simultaneously weakens the paper's main thesis.
Weaknesses
- Idealized Conflict Detection: The single greatest flaw in this study is the idealization of the conflict detector. The authors state in Section 6.1 (page 9) that their simulation models "No false positives" for the conflict check, which they propose to implement with Bloom filters. They dismiss the impact of false aliasing as a "second-order effect" (page 8). This is fundamentally incorrect. A single false positive in a conflict detector can cause the erroneous squash of an entire epoch of potentially useful work. The resulting performance degradation is a first-order effect. Without modeling a realistic conflict detector with a non-zero false-positive rate, the performance results are unreliable at best. A rough expected-cost sketch of this effect follows this list.
- Unrealistic Compilation and Loop Selection: The compiler's role is severely overstated. The study relies on manual #pragma annotations to simulate "perfect static loop selection" (Section 5.1, page 8). This completely avoids the notoriously difficult problem of identifying profitable loops automatically. Furthermore, the hint insertion pass is naive; as stated in Section 5.3 (page 8), it "does not consider through-memory LCDs." This limitation excludes a vast and important class of loops, meaning the technique is only applicable to loops that were largely parallel to begin with. The study is therefore evaluating a best-case scenario that is unlikely to be realized by a real-world compiler.
- Insufficient Coherence and Multi-Core Analysis: The mechanism for preserving the memory model described in Section 4.1.4 (page 6) is hand-waved. The SSB is said to "send coherence messages" to acquire lines and is squashed if another core requests a line in an incompatible state. The entire evaluation is performed in a single-core context. This is a critical omission. In any realistic multi-threaded application running on a multi-core system, coherence traffic is constant. The rate of squashes due to external invalidations could easily overwhelm any benefit from speculation. The lack of any multi-core evaluation renders the memory model claims unsubstantiated.
- Conflation of Parallelism with Prefetching: The analysis in Section 6.4.2 (page 11) reveals that a significant portion of the performance gain (35% of the total, combining "Branch conditions" and "Data values") comes from the prefetching side-effects of failed speculation. This raises a critical question: is LoopFrog an effective parallelization technique, or is it an exceedingly complex and expensive hardware prefetcher? A rigorous study would compare these gains against a state-of-the-art stride or stream prefetcher, which could potentially achieve similar benefits with a fraction of the complexity.
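Returning to Weakness 1, the sketch below shows why a non-zero false-positive rate is plausibly first-order: it combines the paper's 9.5% geomean gain with assumed values for the number of conflict checks per epoch and the cost of a spurious squash, neither of which is reported.

```python
# Why a non-zero false-positive rate is plausibly first-order: each false positive
# can squash an entire epoch. The 9.5% geomean gain is the paper's number; the
# checks-per-epoch and per-squash cost below are assumptions for illustration.
def net_gain(base_gain, checks_per_epoch, fp_rate, squash_cost):
    p_spurious_squash = 1 - (1 - fp_rate) ** checks_per_epoch
    return base_gain - p_spurious_squash * squash_cost

for fp in (0.001, 0.01):
    g = net_gain(base_gain=0.095, checks_per_epoch=200, fp_rate=fp, squash_cost=0.10)
    print(f"fp={fp}: residual gain ~{g*100:.1f}%")
# ~7.7% at a 0.1% false-positive rate and ~0.8% at 1% under these assumptions --
# a sensitivity the paper must report rather than dismiss as second-order.
```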
Questions to Address In Rebuttal
- Please provide a sensitivity study showing the impact of a realistic Bloom filter-based conflict detector on the geometric mean speedup. What is the performance degradation at plausible false-positive rates (e.g., 0.1%, 1%)? Justify the claim that this is a "second-order effect."
- The compiler ignores memory-carried dependencies (Section 5.3). What percentage of total execution time in the SPEC benchmarks is spent in loops that are disqualified by this constraint? How does this limitation affect the overall applicability of your technique?
- How does the LoopFrog mechanism behave in a multi-core context running a workload with true sharing (e.g., a parallel benchmark like PARSEC)? Specifically, what is the frequency of speculation squashes caused by coherence requests from other cores, and what is the resulting performance impact?
- Please provide a more direct comparison to justify the complexity of LoopFrog. How does the 35% of your speedup attributed to prefetching effects compare to the gains from enabling or enhancing a state-of-the-art hardware prefetcher in your baseline system?
- The area overhead of 12-17% relative to a non-SMT core (Section 6.8) is substantial. The cost analysis via CACTI appears to neglect the control logic complexity for the SSB, the multi-versioned read logic, and the integration with the core's coherence protocol. Can you provide a more comprehensive estimate of these logic overheads?
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents LoopFrog, an in-core, hint-based microarchitectural technique for speculative loop parallelization. The work is motivated by the diminishing returns and resource underutilization seen in modern, wide out-of-order processors. The core idea is to revive Thread-Level Speculation (TLS) in a modern context by using compiler-inserted hints to spawn lightweight, OS-transparent "threadlets" within a single core. These threadlets speculatively execute future loop iterations, "leapfrogging" beyond the main thread's instruction window to expose medium-grained parallelism.
The authors propose a set of hardware extensions, most notably a Speculative State Buffer (SSB) that manages speculative memory state in a way that is contained within the core and respects the architectural memory model. Using a modified LLVM compiler and gem5 simulation of an aggressive 8-wide baseline, the authors demonstrate a geometric mean speedup of 9.5% on SPEC CPU 2017 with what they argue are modest hardware overheads. The work represents a compelling synthesis of classic speculative parallelization ideas with the realities of modern CPU design.
Strengths
-
High Conceptual and Contextual Relevance: The paper tackles one of the most significant challenges in high-performance computing today: the stagnation of single-thread performance. It correctly identifies that simply widening cores is becoming inefficient (as shown in their motivational Figure 1, page 2). By targeting the "medium-granularity" parallelism that falls between traditional ILP and coarse-grained TLP, LoopFrog addresses a well-known and valuable performance gap.
-
An Elegant Synthesis of Established Concepts: The true strength of this work lies not in a single novel mechanism, but in its masterful integration of ideas from several research domains. It draws from:
- Classic TLS/SpMT (e.g., Multiscalar, STAMPede) for the core concept of speculative tasking.
- Simultaneous Multithreading (SMT) for the underlying microarchitectural substrate of sharing resources between thread contexts.
- Hardware Transactional Memory (HTM) for the principles of speculative state buffering and conflict detection, which are clearly reflected in the design of the SSB.
- Compiler-Architecture Co-design (e.g., Tapir) for the use of lightweight, semantics-preserving hints as the interface between software and hardware.
This synthesis results in a design that is far more practical and less disruptive than many of its historical predecessors. By confining speculation entirely within the core and making it transparent to the OS and memory system, the authors have significantly lowered the barrier to potential real-world adoption.
-
Strong and Well-Analyzed Empirical Results: A geometric mean speedup of 9.2-9.5% on SPEC CPU suites is a significant result that would be highly attractive to processor designers. The authors go beyond simply presenting the final number. The detailed breakdown of performance sources in Table 2 (page 11) is particularly insightful, revealing that the benefits arise not just from "true parallelism" but also from powerful prefetching side-effects that resolve hard-to-predict branches. This level of analysis provides a deep understanding of why the mechanism works and builds confidence in the results.
-
Thoughtful Design for a Modern System: The design carefully considers key requirements for modern architectures, such as preserving the memory consistency model (Section 4.1.4, page 6). This is a critical detail that was often a stumbling block for earlier TLS systems that exposed speculative state more widely. The granular conflict checking (as opposed to cache-line level) is another pragmatic choice that, as their sensitivity study shows, is key to avoiding false conflicts.
Weaknesses
-
The Compiler's Crucial Role is Underdeveloped: The evaluation relies on manual #pragma annotations to select loops for parallelization, which the paper describes as simulating "perfect static loop selection." While the hint insertion is automated, the far more challenging problem of identifying profitable loops automatically and robustly is left as future work. The entire system's practical success hinges on a compiler that can make intelligent decisions about which loops to annotate, avoiding the slowdowns the authors themselves mention (up to 10%). The paper would be stronger if it explored heuristics for this process or showed sensitivity to a non-perfect selection. A minimal sketch of such a heuristic follows this list.
Overhead and Complexity Analysis is High-Level: The area and power analysis in Section 6.8 (page 12) is based on high-level models (CACTI) and analogies to existing SMT overheads. While a reasonable first-order approximation, the complexity of the proposed structures, particularly the SSB, may be understated. The logic for performing parallel, multi-versioned reads and snooping coherence traffic could have non-trivial implications for timing, verification effort, and power consumption that are not fully captured by this analysis.
-
Lack of Comparison to an "Iso-Area" Alternative: The paper provides a strong performance evaluation against its own baseline. However, to fully contextualize the efficiency of LoopFrog, it would be beneficial to compare it against an alternative design that uses a similar increase in hardware resources. For example, how does a 4-threadlet LoopFrog core compare to a baseline core that is simply made wider (e.g., 9- or 10-wide) or equipped with a much larger re-order buffer or a next-generation hardware prefetcher, assuming a similar transistor budget? This would help answer whether speculative threadlets are the most efficient use of that additional silicon.
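On the first weakness, the sketch below illustrates the kind of static profitability filter a production compiler pass would need; the fields and thresholds are hypothetical and are not the authors' selection criteria.

```python
# A minimal static profitability filter of the kind a real pass would need.
# Thresholds and fields are hypothetical, not the authors' selection criteria.
from dataclasses import dataclass

@dataclass
class LoopInfo:
    avg_trip_count: float      # profile-based or statically estimated
    body_insts: int            # instructions per iteration
    has_memory_lcd: bool       # possible loop-carried dependence through memory

def profitable_for_threadlets(loop: LoopInfo) -> bool:
    if loop.has_memory_lcd:             # the current hint pass already bails out here
        return False
    if loop.avg_trip_count < 8:         # too few iterations to amortize spawn cost
        return False
    if not (16 <= loop.body_insts <= 512):
        return False                    # too small fits the window anyway; too large overflows the SSB
    return True

print(profitable_for_threadlets(LoopInfo(avg_trip_count=100, body_insts=64, has_memory_lcd=False)))
# A production pass would also need a dynamic squash-rate estimate, which static
# analysis alone cannot provide -- the crux of the concern above.
```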
Questions to Address In Rebuttal
-
The reliance on manual loop selection is a significant limitation on the path to a practical system. Could the authors elaborate on the primary challenges a fully automated compiler would face in making these decisions? Based on your analysis, what static or dynamic features (e.g., trip count, body size, memory access patterns, branch predictability) are the best indicators of a loop being profitable for LoopFrog, and how might a compiler collect and use this information?
-
The analysis in Table 2 (page 11) is excellent and reveals that 35% of the total speedup comes from "Prefetching" effects (primarily resolving branch conditions faster). This suggests that a significant benefit comes from running ahead, even if the speculation ultimately fails. How does this implicit prefetching capability compare to what could be achieved with a state-of-the-art, dedicated hardware prefetcher (e.g., one that can prefetch down complex pointer chains or recognize indirect branch patterns)? Is it possible that a more advanced but less complex prefetcher could capture a large fraction of this particular benefit?
-
The Speculative State Buffer (SSB) is the heart of the hardware proposal. The read path requires a parallel lookup across all active threadlet slices plus the L1D, followed by logic to merge the results into the correct version for the reading threadlet (Section 4.1.3, page 6). Could you comment on the potential timing implications of this logic? Is there a risk that it could extend the critical path of a load-to-use dependency and impact the core's clock frequency?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present LoopFrog, an in-core speculative execution framework for parallelizing loops within a single, wide out-of-order processor core. The stated goal is to utilize the spare execution resources evident in modern superscalar designs, thereby tapping into "medium-granularity" parallelism that is too small for traditional thread-level parallelism (TLP) across cores but too large to be fully captured by the instruction window of a single thread (ILP).
The proposed mechanism relies on compiler-inserted ISA hints (detach, reattach, sync) to delineate loop iterations. The microarchitecture uses these hints to spawn lightweight, OS-transparent "threadlets" that execute future iterations speculatively. A key component is the Speculative State Buffer (SSB), which buffers speculative memory writes, versions data, and detects inter-threadlet dependency violations. The entire mechanism is designed to be contained within the core, hiding the speculative state from the broader memory system and other cores. The authors report a geometric mean speedup of 9.5% on SPEC CPU 2017.
My analysis concludes that while the engineering and evaluation are sound, the core concepts are a synthesis of well-established prior art. The novelty is not in the invention of a new mechanism, but rather in the specific integration and application of existing techniques (SMT-based TLS, HTM-like memory versioning, hint-based ISAs) to the context of modern, wide superscalar cores. The "delta" over prior art is tangible but evolutionary, not revolutionary.
Strengths
-
Clear Problem Definition: The paper effectively frames its motivation around the diminishing returns of widening superscalar processors, clearly illustrated by the divergence of IPC and commit utilization in Figure 1 (page 2). This provides a compelling, modern context for revisiting speculative execution techniques.
-
Architectural Containment: The decision to confine all speculative state and management logic within a single core (the "in-core" aspect) is a significant and novel design choice compared to many prior TLS systems (e.g., STAMPede [28], Swarm [12]) that require modifications to the multi-core cache coherence protocol. This self-containment greatly improves the feasibility of deployment in commercial systems.
-
Granular Dependency Tracking: The analysis in Section 6.6 (page 11) demonstrating the performance advantage of sub-cache-line granularity for conflict detection is a valuable contribution. Many historical TLS/HTM systems operate at cache-line granularity, which is known to suffer from false sharing. By designing and evaluating a system with 4-byte granules, the authors directly address this known limitation and show it is a key enabler for their performance.
Weaknesses
-
Synthesis of Existing Concepts: The primary weakness from a novelty standpoint is that LoopFrog is a composite of previously published ideas.
- In-Core SMT-based TLS: The core execution model of using multiple hardware contexts within a single core to run speculative threads is not new. It was a central idea in early work like the Dynamic Multithreading Processor (DMP) [1] and Implicitly-Multithreaded Processors (IMT) [21]. LoopFrog's "threadlets" are functionally identical to the speculative SMT threads in these proposals.
- Speculative Memory Buffering: The Speculative State Buffer (SSB) is functionally analogous to the memory versioning systems proposed in countless TLS and Hardware Transactional Memory (HTM) papers. Its role in buffering writes, detecting read-after-write hazards across threads, and versioning data is a foundational concept in speculative parallelization. The description in Section 4.1 strongly echoes the logic of Log-based or Eager versioning HTM systems.
- Hint-Based ISA: The use of architectural hints to guide speculative parallelization has also been explored. The detach/reattach hints are conceptually similar to the spawn/sync primitives in Tapir [23] (which the authors cite) and the fork/join model used to delineate tasks in ordered parallelism work like Swarm [12]. The innovation here is in the target of the hints (an in-core microarchitecture) rather than the concept of the hints themselves.
-
Marginal Delta Over Prior Art: The authors argue in Section 7 (page 12) that the gains from prior SMT-based TLS have been "superseded by progress in CPU core design." While plausible, the paper does not sufficiently articulate the specific architectural "delta" that makes LoopFrog succeed where its predecessors may have faltered. Is it simply the availability of more idle resources, or is there a fundamental architectural innovation in LoopFrog beyond scaling up old ideas? The novelty claim rests heavily on this distinction, which needs to be sharpened.
-
Performance Gains vs. Complexity: The implementation of the SSB, conflict detector, and checkpointing logic introduces significant design complexity, regardless of the final silicon area. A crucial finding in the paper's own analysis (Section 6.4.2, page 11) is that a large fraction of the total speedup (32% + 3% = 35%) comes from the prefetching side-effects of (often failed) speculation. This raises a critical question: could a significant portion of the 9.5% geomean speedup be achieved with a much simpler, dedicated hardware prefetcher aware of loop structures, without the full complexity of speculative execution, state buffering, and squash logic? This potential for a simpler alternative dilutes the perceived value of the proposed complex mechanism.
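Treating the paper's own attribution as additively decomposable (a rough approximation for estimation purposes only), the split looks as follows:

```python
# Rough split of the reported gain using the paper's own attribution (32% + 3% = 35%
# from prefetching effects), treated as additively decomposable for estimation.
total_speedup   = 0.095                       # 9.5% geomean (paper)
prefetch_share  = 0.35                        # Section 6.4.2 attribution
prefetch_pts    = total_speedup * prefetch_share            # ~3.3 percentage points
parallelism_pts = total_speedup - prefetch_pts              # ~6.2 percentage points
print(f"prefetching: ~{prefetch_pts*100:.1f} pts, reuse/parallelism: ~{parallelism_pts*100:.1f} pts")
# If a runahead or scout-thread scheme could recover most of the ~3.3 points at a
# fraction of the SSB's complexity, the versioned-speculation case rests on ~6.2 points.
```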
Questions to Address In Rebuttal
-
Differentiation from SMT-TLS: Please articulate the specific, fundamental microarchitectural differences between LoopFrog and prior SMT-based TLS proposals like IMT [21] and DMP [1]. Beyond the argument that baseline cores are now wider, what core mechanism in LoopFrog is novel and essential to its success that was absent in this prior art?
-
Comparison to HTM: The SSB's functionality closely mirrors that of an Eager versioning Hardware Transactional Memory system. Could the authors contrast their detach/reattach/sync model for loops against a hypothetical implementation where each loop iteration is simply wrapped in an HTM transaction? What are the fundamental performance and complexity advantages of the LoopFrog model that would justify it over leveraging an existing HTM implementation?
Justification of Speculative Execution for Prefetching: Given that over a third of the performance benefit derives from prefetching effects (as detailed in Section 6.4.2), please justify the necessity of the full speculative execution and state versioning framework. Could a simpler "scout thread" or a runahead execution scheme, which executes instructions only for their prefetching side-effects and discards all results, achieve a comparable performance gain with substantially lower hardware complexity than the proposed SSB and conflict detector?
Multi-Stream Squash Reuse for Control-Independent Processors
Abstract
Single-core performance remains crucial for mitigating the serial bottleneck in applications, according to Amdahl’s Law. However, hard-to-predict branches pose significant challenges to achieving high Instruction-Level Parallelism (ILP) due to frequent ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose "Multi-Stream Squash Reuse," a microarchitectural technique to recover useful work from multiple, non-contiguous previously squashed instruction streams. This extends prior work like Dynamic Control Independence (DCI) and Register Integration (RI), which are typically limited to reusing work from only the most recent squashed path. The central mechanism is the "Rename Mapping Generation ID" (RGID), a versioning tag for architectural-to-physical register mappings, intended to track data dependencies across disjoint execution contexts. The authors present an implementation integrated into the Fetch and Rename stages and report average IPC improvements of 2.2% on a subset of SPECint2006, 0.8% on SPECint2017, and 2.4% on GAP benchmarks.
Strengths
-
Sound Motivation: The paper correctly identifies a limitation in existing squash reuse schemes. The analysis in Section 2.2.5 (Figure 4) provides empirical evidence that a non-trivial fraction of reconvergence opportunities (15-43% in some SPEC benchmarks) involves streams other than the immediately preceding one. This quantitatively justifies the exploration of a multi-stream approach.
-
Plausible Core Mechanism: The RGID concept is a conceptually straightforward approach to tracking the temporal state of register mappings. By versioning the mappings themselves, it avoids the complexities of reconstructing and comparing full dependency graphs across streams.
Weaknesses
My primary concerns with this work center on the practical viability of the proposed solution, the significance of the results relative to the induced complexity, and an insufficient analysis of critical corner cases, particularly memory dependencies.
-
Marginal Performance Gains for Significant Hardware Complexity: The headline results are underwhelming. An average IPC gain of 0.8% on SPECint2017 and 2.2% on SPECint2006 is deep in the territory of diminishing returns. Yet, the hardware required to achieve this is substantial: two-dimensional Wrong-Path Buffers (WPBs), Squash Logs, per-architectural-register RGID counters, extensions to the RAT and ROB to store RGIDs, and non-trivial reconvergence detection logic (Section 3.4, Figure 7) and reuse test logic (Section 3.5, Figure 8). The authors have not made a convincing case that this complexity is justified for such a minor performance uplift on standard workloads.
-
Superficial Treatment of Memory Order Violations: The handling of memory dependencies, a notoriously difficult problem for speculative reuse techniques, is a critical weakness. In Section 3.8, the authors propose two options: a Bloom filter or re-executing all reused load instructions. They state, "In our evaluation, we choose to implement the latter mechanism for simplicity." This is a significant concession that undermines the entire premise of "reuse." Re-executing all loads is not reuse; it is a "verification" that consumes execution ports and energy, and its performance cost could easily negate the gains from reusing ALU instructions. The paper provides no data on what percentage of reused instructions are loads, nor the performance penalty incurred by this re-execution policy. This is not a minor implementation detail; it is fundamental to the correctness and performance of the scheme.
-
Unconvincing Critical Path Analysis: The authors claim their modifications do not affect the processor's critical path, but the evidence is thin. The reuse test logic presented in Section 3.5 and Figure 8 introduces a dependency for the Nth instruction's reuse test on the reuse test results of the preceding N-1 instructions in the same cycle. While they argue this is overshadowed by the existing register dependency check (Reg CMP), adding any serial dependency chain to the rename stage is hazardous. The post-synthesis results in Table 4 report up to 41 logic levels for an 8-wide machine. For a target 2 GHz clock (a 500 ps cycle time), this path depth (~12 ps/level) is extremely aggressive and likely on the critical path, contrary to the authors' assertions.
RGID Management is Under-specified: The paper mentions a "global RGID reset" when counters overflow or other conditions are met (Section 3.4). This is a coarse-grained, disruptive event that disables the entire mechanism. The authors provide no analysis on the frequency of these resets. If overflows are common, the actual achievable performance will be lower than what is simulated, as the mechanism will be periodically unavailable. This is a crucial parameter for understanding the real-world efficacy of the RGID system.
-
Selective Evaluation: The authors state in Section 4 that they "select benchmarks from the SPECint2006, 2017 suites that have a branch misprediction rate of more than 3%." This constitutes a form of selective reporting. While justified for studying the mechanism, it inflates the perceived average benefit. The impact on the full, unmodified SPEC suites would provide a more honest assessment of the technique's value to a general-purpose processor and would almost certainly be much lower than the already marginal numbers reported.
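To illustrate the dilution effect, the sketch below recomputes the geometric mean under the assumption that the excluded benchmarks see near-zero benefit; the benchmark counts and the 0.2% residual gain are assumptions, not reported data.

```python
# How subsetting inflates the headline number. The +2.2% subset average is reported;
# the number of excluded benchmarks and their assumed ~0.2% gain are hypothetical.
import math

def geomean(xs):
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

subset   = [1.022] * 10      # stand-in for the selected high-misprediction benchmarks
excluded = [1.002] * 10      # assumed near-zero benefit on the omitted benchmarks

print(f"subset geomean:     {geomean(subset):.4f}")            # ~1.0220
print(f"full-suite geomean: {geomean(subset + excluded):.4f}") # ~1.0120
# The headline +2.2% would fall to roughly +1.2% under these assumptions, which is
# why full-suite results are requested in Question 5.
```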
Questions to Address In Rebuttal
The authors must address the following points to substantiate their claims:
-
Cost/Benefit Justification: Given the added hardware complexity (WPBs, Squash Logs, RAT/ROB extensions), how do you justify an average IPC improvement of less than 2.5% on SPEC and GAP suites? Provide a detailed area and power breakdown versus a baseline core.
-
Memory Dependencies: Please provide a quantitative analysis of your chosen memory hazard solution (re-executing all loads). Specifically:
- What percentage of instructions that passed the RGID reuse test were loads?
- What is the performance impact of re-executing these loads versus truly reusing their results? Please provide data showing the IPC gain before and after accounting for load re-execution.
- Why was a proper analysis of a Bloom filter approach, including its false-positive rate and performance impact, omitted?
-
RGID Overflow: What is the frequency of the "global RGID reset" event in your SPEC and GAP simulations? What is the performance cost associated with the periods where the squash reuse mechanism is disabled awaiting re-synchronization?
-
Critical Path: Can you provide a more rigorous analysis comparing the timing of your full reuse-test logic path (41 levels for 8-wide) against the critical path of a comparable, baseline high-frequency rename stage? The claim that this logic is "overshadowed" requires stronger evidence than is provided.
-
Full Suite Results: Please provide performance results for the entire SPECint2006 and SPECint2017 suites, not only the subset with high misprediction rates, to demonstrate the technique's overall value.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper addresses the long-standing problem of branch misprediction penalties in high-performance processors. The authors observe that existing techniques for "squash reuse"—recovering useful work from a mispredicted path—are overly constrained. They typically only consider reconvergence between the current correct path and the single, most recent incorrect path, often limiting them to simple if-else control structures.
The core contribution of this work is the concept of Multi-Stream Squash Reuse, a mechanism that generalizes reuse to enable the current instruction stream to reconverge with any of several previously squashed streams. To achieve this, the authors introduce an elegant and novel mechanism called Rename Mapping Generation ID (RGID). Each time an architectural register is mapped to a new physical register, the mapping is tagged with a unique, incrementing ID. By comparing the RGIDs of an instruction's source operands on the current path with those on a past squashed path, the processor can quickly and robustly verify data-flow integrity without complex dependency tracking.
The authors implement this scheme with modest hardware extensions to the fetch (Wrong-Path Buffers) and rename (Squash Log) stages and demonstrate average IPC improvements of 2.2% on SPECint2006 and 2.4% on GAP benchmarks, with significant gains on select workloads.
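The RGID check described above is simple enough to state in a few lines. The sketch below captures only the versioning-and-compare idea; the structure names are illustrative and the physical register file, Wrong-Path Buffers, and Squash Log bookkeeping are abstracted away.

```python
# Minimal sketch of the RGID versioning-and-compare idea described above. Structure
# names are illustrative; this is not the paper's microarchitecture.
class RenameState:
    def __init__(self, num_arch_regs: int):
        self.rgid = [0] * num_arch_regs        # current generation per architectural register

    def rename_dest(self, arch_reg: int):
        self.rgid[arch_reg] += 1               # a new mapping gets a new generation ID

class SquashedEntry:
    """Result saved from a squashed stream, tagged with the source generations
    observed when the instruction originally executed."""
    def __init__(self, src_regs, src_rgids, value):
        self.src_regs, self.src_rgids, self.value = src_regs, src_rgids, value

def reuse_test(state: RenameState, entry: SquashedEntry) -> bool:
    # Reusable iff every source mapping still carries the generation it had on the
    # squashed path -- i.e., no intervening rename has redefined the register.
    return all(state.rgid[r] == g for r, g in zip(entry.src_regs, entry.src_rgids))
```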
Strengths
-
Excellent Problem Formulation and Motivation: The paper's primary strength lies in its clear and compelling motivation. The authors effectively argue that the traditional model of reconvergence is too simplistic for modern complex control flow. The distinction between "software-induced" and "hardware-induced" multi-stream reconvergence (Section 2.2.2, page 3) is particularly insightful, highlighting that out-of-order execution itself can create complex reconvergence scenarios that simpler models would miss. The profiling data in Figure 4 (page 5), which shows that non-neighboring stream reconvergence constitutes a significant fraction (15-43%) of opportunities, provides strong evidence that this is a problem worth solving.
-
Elegant and Scalable Core Mechanism (RGID): The RGID concept is the technical heart of the paper and its most significant contribution. It is a beautifully simple solution to the complex problem of tracking data-flow equivalence across multiple divergent speculative paths. It neatly sidesteps the major issues of prior table-based schemes like Register Integration (RI), such as table conflicts and the overhead of transitive invalidations (as well-argued in Section 3.7, page 10). It also appears more scalable than extending queue-based schemes like Dynamic Control Independence (DCI), which would require managing complex poison vectors across multiple stream segments. The RGID is a powerful abstraction for representing dynamic data versions.
-
Contextualization within the Field: This work fits perfectly within the lineage of research on Control Independence and squash reuse, pioneered by works from Sohi, Smith, and Rotenberg. It can be seen as a direct and logical evolution of foundational papers like Register Integration [24] and Dynamic Control Independence [5]. Where RI used physical register names and DCI used a single shadow ROB, this work introduces a more general versioning system (RGIDs) to create a more powerful and flexible framework. The authors do an excellent job of positioning their work relative to these predecessors in Section 3.7.
-
Thorough and Credible Evaluation: The experimental methodology is solid. The use of gem5 with detailed modeling, combined with SPEC and GAP benchmarks, is appropriate. Crucially, the inclusion of a direct comparison against a re-implementation of Register Integration (Figure 12, page 13) strengthens their claims. Furthermore, the post-synthesis complexity analysis for the critical logic components (Table 4, page 13) adds a layer of practicality and credibility, showing that the proposed hardware is feasible within a reasonable area and power budget.
Weaknesses
While the core idea is strong, its practical realization as presented has a few points that could be strengthened:
-
Handling of Memory Dependencies: The proposed solution for handling memory order violations (Section 3.8, page 10) feels underdeveloped compared to the elegance of the RGID mechanism for registers. The authors evaluate a simplified approach of re-executing all reused load instructions, which seems to partially defeat the purpose of reuse. While this is acknowledged as a choice for simplicity, it leaves a significant question mark over the true potential of the technique. The performance gains might be considerably higher if a more sophisticated memory dependency mechanism, such as the proposed Bloom filter, were implemented and evaluated.
-
Practicality of RGID Overflow and Reset: The paper briefly mentions a global reset mechanism to handle RGID counter overflows (Section 3.4, page 8). However, the performance implications of this are not explored. Halting the acceptance of new squashed streams, even temporarily, could introduce performance jitter or bubbles that negate some of the gains, especially in programs with very long-running phases of high branch misprediction rates. Some data on the frequency of these resets and their performance cost would be valuable.
-
Modest Gains on Newer Benchmarks: The average IPC improvement on SPECint2017 is notably lower (0.8%) than on SPECint2006 (2.2%). This suggests that either the control-flow patterns in the newer suite offer fewer multi-stream reconvergence opportunities, or that other bottlenecks (e.g., memory dependencies, cache misses) are more dominant, limiting the impact of this optimization. A deeper analysis of this discrepancy would strengthen the paper.
Questions to Address In Rebuttal
-
On Memory Dependencies: The decision to re-execute all loads is a major simplification. Could the authors provide data on what fraction of the total reuse opportunities come from load instructions in their key benchmarks? This would help the committee understand how much performance is being left on the table by the current evaluation model and assess the importance of developing a more sophisticated memory hazard detection scheme.
-
On RGID Overflow: Could the authors provide statistics on the frequency of RGID overflows and the subsequent mechanism resets during the evaluated benchmark runs? What is the performance sensitivity to the "fixed number of instructions" that must be committed before the mechanism is re-enabled?
-
On SPECint2017 Performance: Could the authors offer a more detailed hypothesis for the lower average gains on SPECint2017? Is this an artifact of the specific benchmarks chosen, or does it reflect a broader trend in modern software where this class of optimization provides diminishing returns?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper proposes "Multi-Stream Squash Reuse," a technique to recover useful work from multiple, non-contiguous, previously squashed instruction streams following a branch misprediction. This extends the scope of traditional squash reuse, which typically only considers reconvergence with the single most recent mispredicted path. The core enabling mechanism is a novel concept called Rename Mapping Generation ID (RGID), a versioning tag applied to each architectural-to-physical register mapping. By comparing RGIDs between a currently fetching instruction and its counterpart in an old squashed stream, the system can verify data integrity and reuse previously computed results.
My assessment is that the core idea of extending squash reuse to multiple streams, enabled by the novel RGID mechanism, represents a genuine contribution to the field. The mechanism avoids known pitfalls of prior art. However, the demonstrated performance benefits appear modest relative to the proposed hardware complexity, raising questions about the practical utility of this novel concept.
Strengths
-
Novelty of Scope (The "Multi-Stream" Concept): The primary novelty lies in identifying and targeting reconvergence opportunities beyond the immediate predecessor stream. Prior art, most notably Dynamic Control Independence (DCI) [5], established the queue-based approach but limited its scope to a single squashed stream. This paper convincingly argues in Section 2.2 that both software control-flow structures and hardware phenomena (out-of-order branch resolution) create scenarios where the correct path might reconverge with an "ancestor" squashed stream. To my knowledge, this is the first work to systematically design a mechanism to exploit this.
-
Novelty of Mechanism (The "RGID" Concept): The proposed RGID mechanism is an elegant and novel solution to the data dependency tracking problem in this complex multi-stream context. It differs fundamentally from prior approaches:
- It avoids the table-based structure of Dynamic Instruction Reuse (DIR) [30] and Register Integration (RI) [24], thereby sidestepping the issues of table conflicts and transitive invalidations, as correctly articulated in Section 3.7.
- It offers a more scalable approach than naively extending DCI's poison-vector method. As the authors argue in Section 2.2.3, tracking dependency segments across N streams with poison vectors would lead to significant management complexity. RGIDs provide a decentralized check: two execution states are equivalent with respect to a register if their RGIDs match. This is a clever way to compare states without needing to reconstruct the full path between them.
-
Clear Articulation of the "Delta" from Prior Art: The authors demonstrate a strong command of the literature. Section 3.7 provides a direct and accurate comparison against DIR, RI, and DCI, clearly isolating what makes their approach different and, in their view, superior. This clarity is commendable.
Weaknesses
-
Marginal Performance Gains vs. Significant Complexity: The central weakness is the trade-off between the novelty and its impact. The proposed architecture introduces non-trivial hardware: Wrong-Path Buffers (WPBs), a multi-stream Squash Log, and additional storage in the RAT and ROB for RGIDs (summarized in Table 2). The post-synthesis results in Table 4 confirm this, showing thousands of µm² in area and a critical path for the reuse test that scales with pipeline width (e.g., 41 logic levels for an 8-wide machine). In return for this complexity, the average IPC gains are 2.2% on SPECint2006, 0.8% on SPECint2017, and 2.4% on GAP. While maximum gains on specific benchmarks like astar are notable (8.9%), the average improvements across standard suites are low for a mechanism of this complexity. A truly innovative idea should ideally provide a more compelling performance-per-transistor argument.
-
Unexplored Practicalities of RGID Management: The RGID concept, while novel, has potential failure modes that are not fully characterized. The paper mentions in Section 3.4 that a "global RGID reset" is triggered on overflow or after repeated overflow events, causing a "temporary halt" in accepting new squashed streams. This sounds disruptive. The frequency of these events and the precise performance penalty are not quantified. If RGID counters are small, this reset mechanism could frequently disable the entire benefit of the multi-stream reuse, undermining the proposal's value.
-
The Opportunity is Niche: The authors' own profiling in Figure 4 shows that for a majority of benchmarks (especially in GAP), "simple reconvergence" — the kind that can be handled by a single-stream DCI-like scheme — constitutes the vast majority of opportunities. The combined software- and hardware-induced multi-stream opportunities are significant in only a handful of the SPEC benchmarks shown (omnetpp, astar). This suggests that the novel problem the paper solves, while real, may not be prevalent enough to justify a general-purpose hardware solution.
Questions to Address In Rebuttal
-
Complexity vs. Benefit Justification: Can the authors provide a more direct analysis of the efficiency of their proposal? For instance, what is the IPC-per-area (mm²) or IPC-per-watt (mW) improvement of your technique over the baseline? A 2% IPC gain for a 5% area/power increase might be acceptable, but the current presentation makes this trade-off difficult to assess.
-
RGID Overflow Analysis: Please provide data on the frequency of RGID overflows and the resulting global resets during the benchmark runs. How much performance is lost due to the "temporary halt" described in Section 3.4? This is critical to understanding if the mechanism is robust in practice.
-
Characterization of Ideal Workloads: The novelty of this work lies in addressing multi-stream reconvergence. Could you provide a more detailed characterization of the program structures (e.g., deeply nested loops with data-dependent exits, complex recursive functions) that generate these opportunities? This would help justify the design by clearly defining the domain where its novel capabilities provide a substantial advantage over single-stream approaches.
Drishti: Do Not Forget Slicing While Designing Last-Level Cache Replacement Policies for Many-Core Systems
Abstract
High-performance Last-level Cache (LLC) replacement policies mitigate off-chip memory access latency by intelligently determining which cache lines to retain in the LLC. State-of-the-art replacement policies significantly outperform policies like LRU. ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
This paper posits that state-of-the-art LLC replacement policies, such as Hawkeye and Mockingjay, are sub-optimal in many-core systems with sliced LLCs. The authors identify two primary deficiencies: 1) "myopic" reuse predictions resulting from access streams being scattered across slices, and 2) "underutilized" sampled sets where randomly selected sets for monitoring receive too few misses to provide useful training data. To address this, the paper proposes "Drishti," a set of two enhancements: a per-core, globally-accessible reuse predictor to create a non-myopic view, and a dynamic sampled cache (DSC) that intelligently selects high-miss-rate sets for monitoring. The authors claim that these enhancements significantly improve the performance of existing policies, with Mockingjay's geomean speedup over LRU on a 32-core system increasing from 6.7% to 13.2%.
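For readers less familiar with the baseline being enhanced, the following Python sketch contrasts the per-slice (myopic) training the review describes with the per-core, globally fed predictor Drishti proposes. Counter widths, thresholds, and the slice hash are illustrative assumptions rather than values from the paper, and the sketch ignores the interconnect latency discussed below.

```python
from collections import defaultdict

class ReusePredictor:
    """PC-indexed predictor trained by a sampled cache: 'cache-friendly' PCs
    are those whose lines tend to be reused before eviction."""
    def __init__(self):
        self.counters = defaultdict(int)    # PC -> saturating counter

    def train(self, pc, reused, max_val=7):
        delta = 1 if reused else -1
        self.counters[pc] = max(0, min(max_val, self.counters[pc] + delta))

    def predict_friendly(self, pc):
        return self.counters[pc] >= 4

def slice_of(addr, num_slices):
    return hash(addr) % num_slices       # stand-in for the real slice hash

# Baseline: one predictor per slice; each sees only the accesses hashed to it.
per_slice = [ReusePredictor() for _ in range(4)]

# Drishti-style: one predictor per core, fed by that core's sampled accesses to
# *all* slices, so a single PC's scattered stream trains a single table.
per_core = [ReusePredictor() for _ in range(4)]

def on_sampled_access(core, pc, addr, reused):
    per_slice[slice_of(addr, 4)].train(pc, reused)   # myopic view
    per_core[core].train(pc, reused)                 # global view
```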
Strengths
- The paper correctly identifies the "myopic" prediction problem in sliced LLC architectures as a potential performance limiter. The scattering of a single PC's access stream across multiple per-slice predictors is a valid architectural concern.
- The motivation for the proposed enhancements is supported by some initial data analysis (e.g., Figures 2, 3, and 5), which illustrates the access scattering and skewed miss distribution that form the basis of the authors' arguments.
- The authors provide an ablation study (Figure 17) that attempts to isolate the performance contributions of each proposed mechanism (the global predictor and the dynamic sampled cache).
Weaknesses
-
Critical Dependency on Unrealistic Hardware: The central and most significant weakness is the proposal's complete reliance on a dedicated, low-latency, three-cycle interconnect (NOCSTAR). This is not a simple policy tweak but a fundamental alteration of the on-chip network fabric. The authors' own data in Figure 11a demonstrates that without this idealized network, their proposal results in a significant performance slowdown (up to 9% on average for 32 cores). This makes NOCSTAR a hard requirement, not an optimization. The practicality, area, power, and complexity costs of adding a second, dedicated network are non-trivial and are not sufficiently justified against the performance gains. The proposal's viability hinges entirely on this assumption.
-
Selective Scope and Inconsistent Baseline Comparison: The authors explicitly state in the introduction (Section 1, Page 1) that they "do not consider machine learning (ML) [55] and reinforcement learning (RL) [38]-based LLC replacement policies." This is a critical omission, as these policies represent the current state-of-the-art frontier. This selective exclusion creates a simplified problem space where Drishti's enhancements may appear more effective than they would against stronger, more adaptive baselines. This exclusion is then directly contradicted in Section 6 (Page 12), where the authors claim applicability to and provide results for CHROME [38] (RL-based) and Glider [55] (deep learning-based). This inconsistency suggests either a flawed initial premise or a post-hoc attempt to broaden the paper's applicability without a rigorous, head-to-head comparison in the main evaluation.
-
Fragile Justification for Dynamic Sampled Cache (DSC): The justification for the DSC relies heavily on workloads with highly skewed miss distributions (e.g., mcf in Figure 5a). The authors concede that for workloads with uniform distributions (lbm in Figure 5c), this mechanism is ineffective and must be disabled via a detection heuristic (Section 4.2, Page 7). This admission reveals the DSC is not a universally applicable improvement but a mode-based optimization that adds the complexity of phase/behavior detection logic. It is unclear how this mechanism performs on the wide spectrum of workloads between these two extremes. The claim of a net hardware saving (Table 3) is also tenuous; while the sampled cache structure is smaller, the proposal adds k-bit saturating counters for every LLC set, selection logic, and the aforementioned NOCSTAR interconnect. The overall system complexity is demonstrably higher.
-
Insufficient Detail in Motivational Analysis: In the motivational analysis (Section 3.1, Figure 3), the methodology for simulating the "global view" predictor is not sufficiently detailed. It appears to be an idealized oracle with perfect and instantaneous access to all sampled set information across all slices. This sets an unachievably high bar and may inflate the perceived potential of a global predictor, making the authors' practical implementation seem closer to the ideal than it actually is. The performance gap between this oracle and the authors' NOCSTAR-based implementation is not quantified.
Questions to Address In Rebuttal
-
Please provide a detailed analysis of the performance of Drishti using the existing on-chip mesh interconnect, assuming realistic contention and latency derived from the increased predictor traffic. Given that your own data (Figure 11) shows a performance loss without NOCSTAR, how can this proposal be considered a practical enhancement to replacement policies rather than a coupled co-design of a policy and a new interconnect?
-
Please justify the exclusion of ML/RL-based policies like CHROME or Glider from the main evaluation (Section 5), especially given that you test against them in Section 6. To properly situate Drishti's contribution, a full comparison against these true state-of-the-art policies is necessary in the main results tables and figures.
-
How does the Dynamic Sampled Cache (DSC) perform on workloads where the miss distribution across sets is relatively flat but not perfectly uniform (i.e., not as skewed as mcf, but not as flat as lbm)? What is the performance and hardware overhead of the workload detection logic that is required to enable/disable the DSC?
-
Clarify the precise simulation setup for the idealized "global view" predictor used in Figure 3. Is this a contention-free model with zero-cycle access latency? Please provide data comparing this ideal oracle to your implemented per-core global predictor with the three-cycle NOCSTAR latency to properly frame the implementation's efficiency.
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper, "Drishti," addresses a timely and important disconnect between the design of state-of-the-art Last-Level Cache (LLC) replacement policies and their deployment in modern many-core systems. The authors correctly observe that while advanced policies like Hawkeye and Mockingjay show significant promise, their evaluation has largely been confined to monolithic LLC models. In contrast, commercial processors utilize sliced, non-uniform cache architectures (NUCA).
The core contribution is not a new replacement policy, but rather a set of two well-motivated architectural enhancements that make existing policies "slice-aware." The authors first identify the problem of "myopic predictions," where per-slice predictors have an incomplete view of a program's global reuse behavior. Their solution is a clever compromise: a local (per-slice) sampled cache that feeds into a per-core, yet globally-aware, reuse predictor. Second, they identify that randomly selected sampled sets are often "underutilized," receiving too few misses to effectively train the predictor. They propose a dynamic sampled cache that intelligently selects sets with high miss rates (MPKA) to maximize learning efficiency. The paper demonstrates that these enhancements, when applied to Hawkeye and Mockingjay, can substantially boost their effectiveness, for instance, nearly doubling the performance gains of Mockingjay on a 32-core system.
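A minimal sketch of the dynamic sampled cache's selection logic as described above: per-set saturating miss counters accumulated over an epoch, with the highest-miss sets chosen as the next sampled (training) sets. Epoch boundaries, counter width, and the reset policy are assumptions for illustration rather than details from the paper.

```python
import heapq

class DynamicSampledSetSelector:
    """Per-set k-bit saturating miss counters; at the end of each epoch the
    highest-miss sets replace the previously chosen sampled sets."""
    def __init__(self, num_sets, num_sampled, k_bits=6):
        self.max_count = (1 << k_bits) - 1
        self.counters = [0] * num_sets
        self.num_sampled = num_sampled
        self.sampled_sets = set(range(num_sampled))   # initial (arbitrary) choice

    def on_llc_miss(self, set_index):
        if self.counters[set_index] < self.max_count:
            self.counters[set_index] += 1

    def end_of_epoch(self):
        # Choose the sets with the most misses as the new training sets.
        hottest = heapq.nlargest(self.num_sampled, range(len(self.counters)),
                                 key=self.counters.__getitem__)
        self.sampled_sets = set(hottest)
        self.counters = [0] * len(self.counters)
        return self.sampled_sets
```

The detection heuristic that disables this mechanism for uniform miss distributions (raised by Review 1) would sit on top of such a selector.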
Strengths
-
Excellent Problem Formulation: The paper's greatest strength is its premise. It tackles a practical and increasingly relevant issue that is often overlooked in academic studies. By grounding their work in the reality of commercial sliced LLCs (as mentioned in Section 1, page 1), the authors immediately establish the significance of their investigation. This is a crucial step in bridging the gap between theoretical microarchitectural improvements and real-world processor design.
-
Clear, Data-Driven Motivation: The motivation presented in Section 3 (page 3) is compelling. Figure 2, which quantifies how many Program Counters (PCs) are confined to a single slice, provides a powerful and intuitive argument for why per-slice predictors are inherently myopic. Furthermore, the analysis of underutilized sampled sets (Section 3.2, page 4), culminating in the simple yet powerful experiment in Table 1, perfectly justifies the need for a more intelligent set selection mechanism.
-
Elegant and Generalizable Solutions: The proposed enhancements are not overly complex and demonstrate a deep understanding of the design trade-offs. The "per-core yet global" predictor is a pragmatic solution that balances the need for a global view against the communication overhead of a fully centralized structure. The dynamic sampled cache uses a well-understood technique (saturating counters) to solve the problem in a simple and effective manner. Most importantly, as shown in Section 6 (page 12), these ideas are not limited to Hawkeye and Mockingjay. The authors demonstrate their applicability to a wider class of prediction-based policies (including SHiP++, CHROME, and Glider), elevating the work from a specific optimization to a more general architectural principle.
-
Contextualization within the Field: This work fits beautifully into the ongoing evolution of cache management. It follows the trajectory from simple heuristics (LRU), to predictive policies (RRIP, SHiP), to emulating optimality with large-scale tracking (Hawkeye, Mockingjay). Drishti represents the next logical step: adapting these sophisticated predictors to the physical and distributed reality of modern hardware. It addresses the "systems" aspect of a microarchitectural problem.
Weaknesses
-
Framing as an "Enhancement": While technically accurate, framing the work solely as an enhancement to existing policies may undersell its conceptual contribution. The core ideas—decoupling the scope of sampling from the scope of prediction and making the sampling process adaptive—are fundamental principles that could inform the design of future replacement policies from the ground up.
-
Reliance on a Dedicated Interconnect: The proposal for a dedicated interconnect (NOCSTAR, Section 4.1.4, page 6) to handle predictor traffic is a practical solution, but it introduces additional design complexity and area/power overhead, however modest. While the authors justify its low cost, it represents an additional system component that must be validated. An exploration of alternative approaches, such as leveraging quality-of-service (QoS) mechanisms on the main network-on-chip (NoC), would have strengthened this aspect of the proposal.
-
Trace-Driven Simulation Limitations: The use of a trace-driven simulator (ChampSim) is standard practice and perfectly acceptable for this type of study. However, it inherently cannot capture complex feedback loops where changes in cache behavior might alter the application's execution path, memory access timing, or prefetching behavior. This is a minor limitation but worth noting, as the true system-level impact could differ slightly in an execution-driven environment.
Questions to Address In Rebuttal
-
The utility of each enhancement is analyzed in Figure 17 (page 10). It appears the move to a global predictor provides the largest performance jump, with the dynamic sampled cache (DSC) providing a further, significant boost. Could the authors confirm this interpretation and comment on the synergy between the two proposals? Are the gains largely additive, or is there a super-linear effect where the DSC is particularly effective because it is training a more powerful global predictor?
-
Regarding the NOCSTAR interconnect, could the authors elaborate on why a dedicated network is preferable to using a high-priority virtual channel on the existing NoC? While the paper argues for low latency, could a QoS approach provide "good enough" latency without the cost of dedicated wiring and arbiters?
-
The paper effectively shows scalability up to 128 cores. As we look toward future systems with hundreds of cores, does the "per-core yet global" predictor model begin to face new scalability challenges? For example, does the storage for the per-core predictors become prohibitive, or does the aggregate traffic to these distributed predictors start to congest the dedicated interconnect?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper identifies a valid and often overlooked problem: state-of-the-art Last-Level Cache (LLC) replacement policies, such as Hawkeye and Mockingjay, which were designed for monolithic caches, suffer from degraded performance on the sliced LLC architectures common in modern many-core processors. The authors attribute this degradation to two primary factors: "myopic predictions" from per-slice predictors that lack a global view of an application's reuse behavior, and "underutilized sampled sets" where randomly chosen sets for monitoring see too few misses to provide a useful training signal.
To address this, the paper proposes "Drishti," a set of two hardware enhancements. The first is a "per-core yet global" reuse predictor architecture, which aims to provide a global view of a core's access patterns without the bottleneck of a single centralized predictor. This is paired with a local, per-slice sampled cache. The second enhancement is a "dynamic sampled cache," which eschews random set selection in favor of a mechanism that periodically identifies and samples LLC sets with the highest miss rates (MPKA), thereby focusing the monitoring effort where it is most impactful.
Strengths
-
Problem Identification: The paper correctly identifies a significant gap between academic research on LLC replacement policies and the reality of commercial hardware. The analysis in Section 3.1 (page 3), particularly Figures 2 and 3, provides a clear and compelling demonstration of the "myopic" prediction problem in sliced LLCs. This is a valuable insight.
-
System Integration: The authors propose a complete system solution. The two enhancements are designed to work in concert, addressing two distinct but related weaknesses of existing policies in a sliced environment. The consideration of interconnect traffic and the proposal to use a dedicated interconnect (NOCSTAR) shows a thoroughness in engineering the solution.
Weaknesses
My evaluation is centered on the fundamental novelty of the proposed ideas. While the engineering and integration are competent, the core concepts appear to be reformulations or applications of pre-existing principles.
-
"Per-Core yet Global" Predictor Lacks Conceptual Novelty: The core idea here is to overcome the limitations of distributed, uncoordinated decision-making by introducing a shared, global perspective. This is a foundational concept in distributed systems and parallel computing, not a new one. The paper itself acknowledges that a fully centralized predictor is a "trivial solution" (Abstract, page 1). The proposed "per-core yet global" architecture is an engineering trade-off to manage the scalability of this known solution. It is essentially a form of state replication (one predictor instance per core) where each replica is updated by a single logical stream (the core's accesses). While this specific arrangement might be new in the context of LLC predictors, the architectural pattern itself is not novel. The problem of balancing global state with distributed access is well-trodden ground. The novelty lies in the application, not the invention.
-
"Dynamic Sampled Cache" is an Application of a Known Principle: The insight that random sampling can be inefficient and that monitoring should focus on "hotspots" or high-activity regions is not new. The principle of dynamically identifying high-pressure resources to guide policy is seen in other areas of cache management. For instance:
- Utility-Based Cache Partitioning (UCP): This class of techniques monitors the marginal utility (e.g., miss rate reduction) of allocating cache resources to different applications. This inherently involves identifying which applications/cores are creating the most memory pressure.
- Adaptive Set Management: Prior work like The V-Way Cache [47] and The Set Balancing Cache [50] dynamically adjust cache associativity or line placement based on monitoring pressure within individual sets. The underlying principle is the same: identify high-contention sets and act upon them.
Drishti's contribution is to apply this principle of "focus on the hotspots" to the specific problem of selecting which sets to sample for training a reuse predictor. The mechanism proposed—using per-set saturating counters to track MPKA—is a straightforward heuristic implementation of this principle. The idea is an incremental refinement of sampling strategy, not a fundamentally new concept.
-
Overall Contribution is Systematization, Not Invention: The primary contribution of this paper is the careful identification of a real-world system issue and the competent engineering of a solution by combining and adapting existing architectural principles. It is a work of system integration. While valuable, it does not present a novel algorithmic or architectural paradigm for cache replacement. The "delta" over prior art is in the specific application and combination of ideas, which, while effective, is not large from a conceptual standpoint.
Questions to Address In Rebuttal
-
The concept of creating a global view to overcome myopic local decisions is a classic problem. Please articulate the fundamental architectural novelty of the "per-core yet global" predictor beyond it being a point in the design space between fully distributed and fully centralized state. What makes this specific arrangement conceptually distinct from other forms of replicated, coherent state in parallel architectures?
-
Please contrast the core principle of your dynamic sampled cache with prior work in adaptive cache management (e.g., adaptive associativity, cache partitioning) that also dynamically identify high-pressure sets/regions to guide policy decisions. Is the novelty in the principle itself, or solely in its application to selecting predictor training sets?
-
The proposed hardware changes (a dedicated NOCSTAR interconnect, per-set MPKA counters) are significant. Given that the underlying concepts are adaptations of existing principles, could a substantial fraction of the performance gain be achieved through a less complex approach, perhaps by leveraging existing coherence traffic or performance counters to approximate a global view and identify high-miss sets? Please justify why this level of hardware complexity is necessary for the claimed advance.
A TRRIP Down Memory Lane: Temperature-Based Re-Reference Interval Prediction For Instruction Caching
Abstract
Modern mobile CPU software poses challenges for conventional instruction cache replacement policies due to its complex runtime behavior, which causes high reuse distances between executions of the same instruction. Mobile code commonly suffers from large ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose TRRIP, a software-hardware co-design for instruction cache replacement. The core mechanism leverages Profile-Guided Optimization (PGO) to classify code into "hot," "warm," and "cold" temperatures. This temperature information is stored in page table entries (PTEs) via an OS interface and subsequently passed to the L2 cache controller with each memory request. The hardware then uses this hint to modify a baseline RRIP replacement policy, prioritizing the retention of "hot" instruction lines. The paper claims this approach reduces L2 instruction MPKI by 26.5%, leading to a 3.9% geomean speedup on mobile-like workloads, with minimal hardware modifications.
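To ground the mechanism being evaluated, here is a minimal Python sketch of what a temperature-modified RRIP fill and victim-selection path could look like. The insertion values, demotion rule, and class names are illustrative assumptions; the paper's Algorithm 1 defines the actual policy.

```python
HOT, WARM, COLD = 2, 1, 0       # temperature hint carried from the PTE with each request
MAX_RRPV = 3                    # 2-bit RRIP re-reference prediction value

class Line:
    def __init__(self, tag, temp):
        self.tag = tag
        self.temp = temp
        # Hot code is inserted predicting near-immediate re-reference.
        self.rrpv = 0 if temp == HOT else (1 if temp == WARM else MAX_RRPV - 1)

class TrripSet:
    """One cache set under a temperature-modified RRIP policy (illustrative only)."""
    def __init__(self, ways):
        self.lines = [None] * ways

    def victim_way(self):
        while True:
            for way, line in enumerate(self.lines):
                if line is None or line.rrpv == MAX_RRPV:
                    return way
            # No candidate: age the set. Hot lines are demoted more slowly by
            # skipping them whenever something else can still be aged.
            demotable = [l for l in self.lines if l.temp != HOT]
            for line in (demotable or self.lines):
                line.rrpv = min(MAX_RRPV, line.rrpv + 1)

    def access(self, tag, temp):
        for line in self.lines:
            if line and line.tag == tag:
                line.rrpv = 0            # hit: promote
                return
        self.lines[self.victim_way()] = Line(tag, temp)   # miss: fill
```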
Strengths
-
Problem Identification: The motivation presented in Section 2 is well-founded. Figure 3 effectively demonstrates that even after PGO-based layout optimizations, hot instruction code still suffers from high reuse distances, clearly identifying a remaining performance gap that pure software or conventional hardware policies fail to address.
-
Pragmatic Hinting Mechanism: The proposal to pass software hints to hardware via existing, implementation-defined PTE bits (as detailed in Section 3.1, page 5) is a practical design choice. It correctly identifies and avoids the significant barrier of modifying the Instruction Set Architecture (ISA), which has doomed many similar co-design proposals.
-
Baseline Policy: The choice to build upon RRIP is sensible. RRIP is a strong and widely recognized baseline, making the claimed improvements more credible than if they were compared against a weaker policy like pure LRU.
Weaknesses
My primary concerns with this work center on the validity of its evaluation methodology and the overstatement of its practical applicability.
-
Lack of Representativeness in Benchmarks: The paper's central premise is to solve a problem observed in "modern mobile system software" (Section 2.1, page 2), citing components like UI frameworks, renderers, and interpreters (Figure 1). However, the evaluation in Section 4 is conducted on a suite of "proxy benchmarks" (e.g., clang, gcc, deepsjeng, omnetpp). There is no evidence provided to substantiate the claim that the instruction access patterns and memory behavior of these proxy applications are representative of the actual mobile system components they are meant to mimic. This creates a fundamental disconnect between the problem being motivated and the problem being solved.
-
Simulation Fidelity and Key Omissions: The evaluation relies on the Sniper simulator, which is trace-based. As the authors themselves concede in Section 4.1 (page 7), this means the simulation does not model wrong-path execution. For a study focused on the CPU frontend, this is a critical omission. Instruction prefetchers, especially aggressive ones, frequently operate on wrong paths, polluting the cache. The interaction between a replacement policy and wrong-path prefetches is a first-order effect, and its absence calls the accuracy of the MPKI and speedup results into serious question. The "pseudo-FDIP" prefetcher model further weakens the setup.
-
Unfair Comparison to Prior Art: The implementation of competing state-of-the-art techniques, particularly Emissary, is described as being done "to the best of our ability" (Section 4.3, page 8). This phrasing suggests a potential lack of fidelity to the original proposal. Emissary's mechanism is tightly coupled to a specific microarchitecture's stall signals. It is unclear if the authors' implementation on a different simulation infrastructure accurately captures its behavior. Consequently, the performance comparison may be unfairly skewed in TRRIP's favor due to a suboptimal implementation of the competition.
-
Superficial Analysis of Practical Limitations:
- PGO Brittleness: The entire system hinges on the quality and representativeness of PGO profiles. The paper leverages an existing PGO flow but fails to discuss the significant engineering challenges of maintaining profile freshness and coverage for complex, rapidly evolving system software. An application's behavior might drift from the profile, rendering TRRIP's temperature hints incorrect and potentially degrading performance.
- Page Size Issues: The analysis in Section 4.9 (page 11) acknowledges that larger page sizes can cause a single page to contain code of multiple temperatures, corrupting the hint. The proposed solutions—"adding padding" or "disable marking"—are hand-wavy and their performance implications are not evaluated. Adding padding increases the code footprint, while disabling marking negates the benefit of TRRIP on those pages. This is a non-trivial practical issue that is inadequately addressed.
- The "Zero-Cost" Fallacy: While the paper claims minimal hardware changes by reusing PTE bits, it glosses over the significant, cross-stack engineering cost. Coordinating changes between the compiler, OS, and multiple hardware teams to correctly implement and validate this feature is a monumental undertaking. Claiming this is "practical and adoptable" without discussing this process is misleading.
Questions to Address In Rebuttal
-
Please provide quantitative evidence (e.g., analysis of instruction stream deltas, call graph similarity, cache access patterns) to demonstrate that the chosen proxy benchmarks are indeed representative of the real mobile system software components whose problems motivate this work (as shown in Figure 1).
-
The authors' implementation of Emissary is stated as a best-effort port. Can you elaborate on the specific microarchitectural signals used to drive your Emissary implementation and justify why this is a fair and faithful representation of the original work, whose performance is heavily dependent on those signals?
-
TRRIP is fundamentally a static, profile-based optimization. How does the system handle code for which no PGO profile exists (e.g., dynamically loaded third-party libraries, JIT-compiled code) or situations where the runtime behavior significantly deviates from the training profile? Does TRRIP not risk pessimizing the cache for hot code in these common scenarios?
-
Given that your simulation is trace-based and does not model wrong-path execution, how can you be confident in your results? A key function of a replacement policy is to mitigate cache pollution from sources like overly aggressive or inaccurate prefetching, which predominantly occurs on wrong paths. Please justify why this omission does not invalidate your conclusions regarding MPKI reduction and speedup.
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents TRRIP, a software-hardware co-design for improving instruction cache performance on mobile platforms. The core contribution is a pragmatic, end-to-end system that leverages existing Profile-Guided Optimization (PGO) infrastructure to classify code into "temperature" tiers (hot, warm, cold). This temperature information is then passed from the compiler to the hardware through the operating system, using existing, implementation-defined bits in the Page Table Entries (PTEs). The hardware cache controller uses this simple hint to modify the baseline RRIP replacement policy, giving strong priority to hot instruction lines to prevent their premature eviction. The authors demonstrate that this lightweight approach yields a 3.9% geomean speedup by reducing L2 instruction MPKI by 26.5% on PGO-optimized mobile proxy benchmarks, with negligible power and area overhead.
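As a concrete illustration of the compiler side of this flow, the sketch below classifies functions into hot/warm/cold code sections from PGO execution counts using cumulative-coverage thresholds. The thresholds, section names, and count-based metric are assumptions for illustration; the paper's toolchain may classify at a different granularity or by different criteria.

```python
def classify_temperatures(profile_counts, hot_coverage=0.9, warm_coverage=0.99):
    """Split functions into hot/warm/cold sections from PGO execution counts.

    profile_counts: dict mapping function name -> dynamic execution count.
    The coverage thresholds and section names are illustrative placeholders.
    """
    total = sum(profile_counts.values()) or 1
    ranked = sorted(profile_counts.items(), key=lambda kv: kv[1], reverse=True)
    sections, running = {}, 0.0
    for name, count in ranked:
        running += count / total
        if running <= hot_coverage:
            sections[name] = ".text.hot"      # pages later marked hot in the PTEs
        elif running <= warm_coverage:
            sections[name] = ".text.warm"
        else:
            sections[name] = ".text.cold"
    return sections

# The linker then groups these sections onto separate pages so the loader/OS
# can tag each page's PTE with the matching temperature bits.
print(classify_temperatures({"render": 9_000, "parse": 800, "init": 5}))
```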
Strengths
The true strength of this paper lies in its elegant integration of existing, mature technologies into a novel and highly practical system. It stands as an excellent example of a co-design that respects the constraints of real-world product development.
-
Pragmatism and Adoptability: The authors' primary design pillar—"No additions/modifications to ISA"—is a crucial one. Many academic proposals in this space fail to gain traction because they require fundamental, costly changes. By instead utilizing existing architectural features like the implementation-defined bits in ARM PTEs (as mentioned in Section 3.3, page 6), TRRIP presents a solution with a remarkably low barrier to adoption. This focus on practicality is the paper's most significant contribution.
-
Clear and Compelling Motivation: The background analysis in Section 2 is thorough and effectively builds the case for TRRIP. The authors don't just state that frontend stalls are a problem; they demonstrate it with data on real mobile system software (Figure 1). Crucially, the analysis in Section 2.4 and Figure 3 isolates the core issue: even after PGO, frequently executed hot code suffers from high re-reference intervals, making it vulnerable to eviction. This insight provides a sharp, focused target for their solution.
-
Elegant System-Level Integration: The proposed flow, beautifully illustrated in Figure 4 (page 5), connects disparate parts of the system stack—the compiler, the object file format, the OS loader, and the microarchitecture—into a cohesive whole. The use of the OS and page tables as the conduit for compiler-derived information is a clever and efficient communication mechanism. This is co-design done right: not just proposing a new hardware feature in isolation, but architecting the flow of information across the entire system.
-
Strong and Relevant Evaluation: The authors compare TRRIP against a strong suite of modern replacement policies, including CLIP, SHiP, and Emissary. This demonstrates a solid understanding of the state-of-the-art. By outperforming these more complex, purely hardware-based solutions on average, the paper makes a compelling case for its simpler, co-designed approach.
Weaknesses
The paper's weaknesses are less about fundamental flaws and more about the inherent trade-offs of its chosen approach. Exploring these boundaries would strengthen the work.
-
The "Coverage" Limitation: The most significant limitation of TRRIP is that its benefits are confined to code that has been compiled through its specific PGO-enabled toolchain. As the authors themselves astutely analyze in Section 4.6 ("Coverage of Costly Instruction Misses"), costly misses in third-party libraries, dynamically-linked system code, or JIT-compiled code will not be covered. While PGO is widely used for first-party system components, a modern mobile environment is a heterogeneous ecosystem. This static, ahead-of-time dependency is the Achilles' heel of the approach when compared to purely dynamic hardware solutions like Emissary, which can react to any code being executed.
-
Static Profiles vs. Dynamic Behavior: PGO provides a static snapshot of program behavior based on a specific set of training inputs. The paper acknowledges performance degradation can occur due to profile mismatch (footnote 1, page 3) but does not fully explore the system's robustness. How gracefully does TRRIP handle significant application phase changes or workloads that deviate substantially from the profiling runs? Its static hints could become counter-productive in such scenarios.
-
Interaction with Other Frontend Mechanisms: The paper reasonably claims that instruction prefetching is an orthogonal technique. However, the interaction is likely more complex. An aggressive hardware prefetcher could be a primary source of cache pollution that evicts the very "hot" lines TRRIP is trying to protect. A deeper analysis of the interplay between TRRIP's priority scheme and prefetcher-induced cache pressure would provide a more complete picture of its system-level impact.
Questions to Address In Rebuttal
-
Quantifying the Coverage Gap: Regarding the "coverage" limitation, could the authors provide an estimate of what percentage of total execution cycles in a typical, interactive mobile usage scenario (e.g., app launch, web browsing) is spent in code that would not be visible to the TRRIP toolchain (e.g., third-party SDKs, JIT code from web engines)? This would help contextualize the real-world impact of the approach.
-
Robustness to Profile Mismatch: Have the authors performed experiments to measure the sensitivity of TRRIP to profile quality? For instance, evaluating the benchmarks using a profile generated from a completely different input set could reveal how TRRIP performs under sub-optimal conditions and whether it risks significant performance degradation compared to a baseline RRIP.
-
Synergy with Prefetching: Instead of being merely orthogonal, could TRRIP's temperature hints actively improve prefetching? For example, could the hardware use the "hot" hint to protect a cache set from polluting prefetches, or perhaps use the "cold" hint to identify regions where prefetching should be more aggressive?
-
Generalizability of "Temperature": The conclusion suggests applying the TRRIP philosophy to other structures like the BTB and TLB. What would be the analogous PGO-derived metric for "temperature" for these structures? Is it simply execution frequency, or would a more nuanced metric (e.g., branch misprediction rate for BTB entries, TLB miss rate for pages) be required?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper proposes TRRIP, a software-hardware co-design for instruction cache replacement. The core idea is to leverage Profile-Guided Optimization (PGO) in the compiler to classify code into "temperature" categories (hot, warm, cold). This temperature information is then communicated to the hardware via implementation-defined bits in the page table entries (PTEs), a mechanism supported by modern architectures like ARM. The hardware cache replacement policy, based on RRIP, uses these temperature hints to prioritize hot instruction lines, inserting them with a higher priority (Immediate re-reference) and demoting them more slowly, aiming to reduce frontend stalls from instruction cache misses. The authors claim a 3.9% geomean speedup and a 26.5% reduction in L2 instruction MPKI over a baseline RRIP policy on PGO-optimized mobile workloads.
Strengths
-
Practicality of the Communication Mechanism: The most significant aspect of this work is its focus on a practical, non-intrusive communication path between software and hardware. By leveraging existing, implementation-defined page table attribute bits (e.g., ARM's PBHA), the authors sidestep the need for ISA extensions, which is a notoriously high barrier to adoption for co-designed techniques. This makes the proposal more plausible for real-world implementation than many of its predecessors.
-
Low Implementation Overhead: The proposed hardware modification is minimal, essentially adding a small amount of conditional logic to an existing RRIP controller based on the temperature hint (Algorithm 1, page 6). As shown in Table 4 (page 9), the area and power overheads are negligible. The software complexity is also low, as it builds upon existing PGO infrastructure.
-
End-to-End System Integration: The paper presents a complete, vertically integrated solution spanning the compiler, OS, and microarchitecture. The flow from PGO analysis to ELF sectioning to loader/OS page table population to hardware action is clearly articulated (Figure 4, page 5).
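A minimal sketch of the communication path this flow relies on: two page-attribute bits carry the temperature from the PTE to the L2 request. The bit position, encoding, and request fields are hypothetical placeholders, not values from the paper or the ARM specification.

```python
# Hypothetical encoding of code temperature into two implementation-defined
# PTE attribute bits (e.g. a PBHA-like field); the bit position is illustrative.
TEMP_COLD, TEMP_WARM, TEMP_HOT = 0b00, 0b01, 0b10
PTE_TEMP_SHIFT = 59                      # placeholder position, not from the paper

def set_page_temperature(pte, temp):
    return (pte & ~(0b11 << PTE_TEMP_SHIFT)) | (temp << PTE_TEMP_SHIFT)

def temperature_of(pte):
    return (pte >> PTE_TEMP_SHIFT) & 0b11

def l2_request(vaddr, pte):
    # The iTLB supplies the attribute bits alongside the translation, so the
    # hint travels to the L2 controller without any ISA change.
    return {"addr": vaddr, "is_ifetch": True, "temp": temperature_of(pte)}

pte = set_page_temperature(0, TEMP_HOT)
print(l2_request(0x4000_1000, pte))      # {'addr': ..., 'is_ifetch': True, 'temp': 2}
```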
Weaknesses
-
Limited Conceptual Novelty: The core concept of using compiler-generated hints to guide cache replacement is not new. The work is an evolutionary step that combines known concepts into a new, practical package. The primary novel element is not the idea itself, but the specific implementation path.
- Prior Art: The most relevant prior work is Ripple [39], which also uses PGO to guide instruction cache replacement in data centers. Ripple's goal is nearly identical to TRRIP's: use offline profiles to identify and protect important instruction cache lines. The key difference, and TRRIP's main contribution over Ripple, is the communication mechanism. Ripple proposes new ISA instructions (crm.set, crm.unset), whereas TRRIP uses PTE bits. While this difference is critical for practicality, it means the fundamental concept of "PGO-guided I-cache replacement" has been previously established.
-
Granularity of Hints: The proposed mechanism provides hints at the granularity of an OS page (typically 4KB or 16KB). As the authors acknowledge in Section 4.9 (page 11), a single page can contain code of mixed temperatures, especially as page sizes increase. This can lead to coarse-grained, potentially inaccurate prioritization, where cold code on a "hot" page is prioritized, or hot code on a "warm" page is not. While the paper suggests mitigation strategies, this appears to be a fundamental limitation of the chosen communication mechanism compared to more fine-grained, instruction-based hints.
-
Static Nature of Hints: The approach is entirely dependent on a static, offline PGO profile. It cannot adapt to dynamic phase changes in application behavior where the "hot" code paths might shift. In such scenarios, the static hints could become stale and potentially degrade performance by protecting code that is no longer critical. Purely hardware-based adaptive schemes, such as Emissary [45] (which tracks frontend stalls at runtime), do not suffer from this limitation, though they come with their own hardware overheads. The novelty of TRRIP's approach does not address this long-standing issue with static optimization.
Questions to Address In Rebuttal
-
Differentiation from Ripple [39]: The authors should more directly and explicitly contrast their work with Ripple. Beyond the acknowledged difference in the software-hardware interface (PTE bits vs. ISA extension), are there any other fundamental, conceptual, or algorithmic differences in how the PGO data is used to inform the replacement policy? The contribution would be stronger if it were framed as a more practical and lightweight implementation of the principle established by Ripple, rather than a wholly new concept.
-
Impact of Page-Level Granularity: The analysis in Section 4.9 (page 11) and Table 5 shows the number of pages used but does not quantify the performance impact of mixed-temperature pages. Can the authors provide data on how much of the hot code resides on pages that are not marked as hot? What is the performance loss when a truly hot cache line resides on a page that is classified as warm or cold due to the surrounding code? This is key to understanding the trade-off made for the practical communication mechanism.
-
Dynamic Behavior and Stale Profiles: How does TRRIP's performance hold up when the execution profile deviates significantly from the training profile used for PGO? A sensitivity study showing performance against varying inputs would help quantify the robustness of this static approach. How does it compare to a purely dynamic hardware scheme like Emissary [45] under such workload shifts?
-
Interaction with Code Layout Optimizations: PGO is already used for hot-cold splitting and basic block reordering, which primarily improve spatial locality to reduce I-cache misses. TRRIP aims to improve temporal locality. Is the 3.9% speedup measured on top of a baseline that already includes aggressive PGO-based code layout? Could the authors clarify if there is a synergistic or potentially overlapping effect between these optimizations?
NetZIP: Algorithm/Hardware Co-design of In-network Lossless Compression for Distributed Large Model Training
Abstract
In distributed large model training, the long communication time required to exchange large volumes of gradients and activations among GPUs dominates the training time. To reduce the communication times, lossy or lossless compression of gradients and/or ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose NetZIP, an algorithm/hardware co-design for in-network lossless compression to accelerate distributed large model training. The approach consists of two parts: NetZIP-algorithm, which applies bit/byte grouping and delta-value transformations to make gradients and activations more compressible, and NetZIP-accelerator, a proposed bump-in-the-wire hardware block within a NIC to perform these operations with low latency. The central claim is that this co-design achieves superior compression ratios compared to standard lossless algorithms and results in a 35% reduction in total training time by mitigating communication bottlenecks. However, the evaluation's heavy reliance on simulation for its primary system-level claims, combined with a hardware prototype that is an emulation rather than an integrated system, raises significant questions about the validity and practical achievability of the reported end-to-end performance gains.
Strengths
- Problem Motivation: The paper correctly identifies a critical bottleneck in distributed training and provides a clear motivation for exploring lossless compression, detailing the shortcomings of both lossy approaches (convergence issues) and standard lossless methods on commodity hardware (high latency overhead).
- Data Characterization: The analysis of the bfloat16 representation of gradients and activations in Section 5.1 (page 6, Figure 5) is sound. Identifying the low entropy in exponent bits versus the high entropy in mantissa bits provides a solid, data-driven foundation for the proposed byte- and bit-grouping techniques (a sketch of the byte-grouping idea follows this list).
- Baseline Evaluation: The experimental results in Section 4, which quantify the performance of standard lossless algorithms (LZ4, Zstd, etc.) on commodity CPU, GPU, and SNIC platforms, are thorough. Table 3 (page 6) effectively establishes a crucial baseline, demonstrating that a naive application of these algorithms increases, rather than decreases, total communication latency. This provides strong justification for the need for a specialized solution.
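To make the byte-grouping step motivated by the bfloat16 characterization above concrete, here is a minimal Python sketch operating on bfloat16 values given as 16-bit integers. The exact grouping granularity (byte versus bit) and any metadata handling in NetZIP-algorithm are not reproduced; this is an illustrative transform only.

```python
def byte_group_bf16(words):
    """Byte-group a tensor of bfloat16 values given as 16-bit integers.

    The high byte of each bfloat16 holds the sign and most exponent bits
    (low entropy, highly repetitive); the low byte is mostly mantissa
    (near-random). Grouping the bytes into two planes lets an LZ-style
    compressor exploit the repetitive plane instead of seeing it interleaved
    with noise."""
    hi = bytes((w >> 8) & 0xFF for w in words)
    lo = bytes(w & 0xFF for w in words)
    return hi + lo

def ungroup_bf16(blob):
    n = len(blob) // 2
    hi, lo = blob[:n], blob[n:]
    return [(h << 8) | l for h, l in zip(hi, lo)]

# Example with values in a narrow range (typical of gradients): the high-byte
# plane is dominated by a handful of exponent patterns and compresses well.
words = [0x3DCC, 0x3DCD, 0x3DCB, 0x3DCE]       # bfloat16 bit patterns near 0.1
grouped = byte_group_bf16(words)
assert ungroup_bf16(grouped) == words
```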
Weaknesses
- Evaluation Methodology Relies on Unvalidated Simulation: The headline claim of a 35% reduction in training time is not derived from a real-world, at-scale hardware deployment. Instead, it is the output of the SimAI simulator (Section 6.3, page 12). The authors provide no evidence validating SimAI's accuracy against a physical cluster for this specific workload. More critically, the simulator is fed latency values obtained from a separate FPGA emulation (Section 6.1, page 9). This creates a fragile, multi-layered abstraction from reality, making the final end-to-end numbers speculative rather than empirically proven.
- Hardware Claims are Based on Emulation and Projection, Not Integration: The "NetZIP-accelerator" as tested is not an integrated NIC ASIC. It is an FPGA connected externally to a standard NIC (Figure 9, page 9). This "bump-in-the-wire" setup is a proof-of-concept at best and fails to capture the realities of on-chip resource contention, memory bandwidth, and power envelopes within a real NIC ASIC. Furthermore, the claims regarding ASIC area and power in Section 5.2 (page 9) are merely projections based on a methodology from 2010, which is insufficient evidence for a hardware-centric claim in 2025.
- Limited Applicability of Delta Compression: The evaluation is conducted on fine-tuning tasks using the Alpaca dataset (Section 6.1, page 10). The core assumption of delta compression is that values change incrementally between iterations. While this may hold during fine-tuning, it is unlikely to be true during the volatile, early stages of pre-training a model from scratch, where gradients can experience large fluctuations. The paper makes a general claim about "distributed large model training" but only provides evidence for a specific, and arguably more favorable, sub-domain.
- Insufficient Justification for the Delta Base Value Heuristic: The paper concedes that true delta compression is infeasible due to memory constraints and instead proposes using a single base value (the minimum value in a layer) for subtraction (Section 5.1, "Delta Value Compression", page 7). This is a critical simplification. There is no theoretical or empirical justification provided for why the minimum value is a suitable or optimal choice over other statistics like the mean, median, or a learned scalar. The impact of this heuristic on compression effectiveness across different model architectures and training phases is left unexplored (a sketch contrasting base-value choices follows this list).
- Strawman Comparison to Lossy Compression: The comparison with lossy compression in Section 6.4 (Figure 14, page 12) is weak. The authors compare NetZIP only against top-K sparsification. This ignores a vast body of work on more sophisticated lossy techniques, such as adaptive quantization, error feedback (e.g., EF-SGD), and gradient-norm-based methods, which are known to achieve high compression ratios with minimal impact on convergence. To claim superiority, a comparison against the state-of-the-art in lossy compression is required, not a baseline method.
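To frame the ablation requested below (and flagged in the delta-base weakness above), here is a minimal Python sketch of the single-base delta transform with a swappable base statistic. The bfloat16 conversion uses simple truncation and all values are illustrative; nothing here reproduces the paper's actual encoder.

```python
import struct

def bf16_bits(x):
    """bfloat16 bit pattern of a Python float: the top 16 bits of its float32
    encoding (truncation; a real encoder may round)."""
    return struct.unpack(">I", struct.pack(">f", x))[0] >> 16

def delta_transform(values, base="min"):
    """Subtract one per-layer base value before compression. The paper uses the
    layer minimum; `base` lets the mean be swapped in for the ablation asked
    for above."""
    b = min(values) if base == "min" else sum(values) / len(values)
    # Whether the resulting deltas are more compressible than the raw values
    # depends on the data distribution, which is exactly the open question.
    return b, [bf16_bits(v - b) for v in values]

base_min, bits_min = delta_transform([0.101, 0.103, 0.102, 0.104], base="min")
base_mean, bits_mean = delta_transform([0.101, 0.103, 0.102, 0.104], base="mean")
print([hex(p) for p in bits_min])
print([hex(p) for p in bits_mean])
```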
Questions to Address In Rebuttal
- How have you validated the accuracy of the SimAI network simulation for this specific communication pattern against a physical testbed? Please provide data comparing simulated and real-world communication times on a smaller-scale cluster (e.g., 8-16 nodes) to justify its use for the 512-node extrapolation.
- The hardware evaluation uses an external FPGA. How would the performance (latency, throughput) and resource costs (area, power) be affected by a true integration into a modern NIC ASIC, considering shared resources such as on-chip memory controllers and PCIe interfaces? Why should the projected ASIC figures be considered reliable?
- Please provide compression ratio data for NetZIP's delta compression algorithm during the initial epochs of pre-training a large model from a random initialization. How does its effectiveness compare to the fine-tuning results presented?
- Please provide an ablation study justifying the choice of the layer's minimum value as the base for delta compression. How does this heuristic compare, in terms of compression ratio, against using the layer's mean or median value?
- The comparison to lossy compression is limited to top-K. Please provide a time-to-accuracy comparison between NetZIP and a state-of-the-art lossy compression algorithm (e.g., QSGD with error feedback) to demonstrate the practical advantage of your lossless approach.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents NetZIP, an algorithm/hardware co-design for in-network lossless compression to accelerate distributed large model training. The authors identify a critical gap: lossy compression techniques can harm model convergence, especially for activations, while existing lossless compression methods are too slow on current platforms (CPU/GPU/SNIC) to overcome their own latency overhead.
The core contribution is a two-part solution. First, NetZIP-algorithm is a data transformation technique that analyzes and restructures gradients and activations at the bit and value level (via byte/bit grouping and delta compression) to make them significantly more compressible by standard algorithms. Second, NETZIP-accelerator is a lightweight, "bump-in-the-wire" hardware accelerator integrated into a NIC that implements this transformation alongside a simple, fast compressor like LZ4. This co-design approach aims to deliver both high compression ratios and extremely low latency, thereby reducing the end-to-end communication time that dominates training. The authors demonstrate through comprehensive experiments and large-scale simulation that NetZIP can reduce total training time by up to 35% for models like Llama-3 and GPT-3.
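The framing above (compression must outrun its own latency overhead) can be made explicit with a back-of-the-envelope model; all numbers in the sketch are illustrative assumptions, not measurements from the paper.

```python
def comm_time_us(size_mb, bandwidth_gbps, comp_ratio=1.0, comp_lat_us=0.0):
    """Transfer time for one message: (de)compression latency plus wire time
    of the compressed payload. The inputs below are illustrative only."""
    wire_us = (size_mb / comp_ratio) * 8_000 / bandwidth_gbps   # MB -> microseconds
    return comp_lat_us + wire_us

baseline   = comm_time_us(64, 100)                                       # no compression
cpu_codec  = comm_time_us(64, 100, comp_ratio=1.5, comp_lat_us=20_000)   # slow codec dominates
inline_nic = comm_time_us(64, 100, comp_ratio=1.5, comp_lat_us=50)       # line-rate accelerator
print(baseline, cpu_codec, inline_nic)   # compression only pays off in the last case
```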
Strengths
-
Excellent Problem Scoping and Motivation: The paper does a superb job of positioning its contribution. The authors clearly establish the limitations of the two dominant alternatives. In Section 3 (pages 3-4), they show that lossy compression, while reducing per-iteration time, can increase total training time due to accuracy degradation. Then, in Section 4 (pages 5-6), they convincingly demonstrate that off-the-shelf lossless compression on existing hardware platforms actually increases communication time. This framing creates a clear and compelling need for the proposed solution.
-
Insightful Data-Driven Algorithm Design: The strength of NetZIP-algorithm lies in its foundation of careful data analysis. The insights from Figure 5 (page 6)—that exponent bits are structured while mantissa bits are random, and that value distributions are narrow—directly motivate the byte/bit grouping strategy. Similarly, the intuition that values change slowly between iterations motivates the delta compression scheme, which is validated in Figure 7 (page 7). This is not a brute-force approach; it's an elegant solution derived from understanding the unique properties of the target data.
-
Strong Systems-Level Co-design Thinking: This work is a prime example of successful algorithm/hardware co-design. The authors recognize that heavy algorithms like Zstd/Deflate, while offering better compression, are too costly for a low-latency hardware implementation. By designing an algorithm that makes data highly compressible even for a simple method like LZ4, they enable a hardware design that is fast, efficient, and practical to integrate into a NIC ASIC. The design space exploration in Section 5.2 (page 8) and the subsequent accelerator architecture in Figure 8 are logical consequences of this co-design philosophy.
-
High Potential for Practical Impact: The work addresses a real, expensive, and worsening bottleneck in AI infrastructure. By focusing on a lossless approach, NetZIP sidesteps the complex and often unpredictable effects of lossy compression on model convergence. Its ability to compress activations is particularly significant, as this has been a major challenge for prior work. If integrated into future NICs, this technology could substantially reduce the cost and time of training large models, especially for users relying on public cloud infrastructure with limited network bandwidth, as motivated in Figure 3 (page 4).
Weaknesses
-
Limited Engagement with Modern Parallelism Strategies (FSDP): While the paper evaluates a standard DP/TP/PP setup, the related work section (Section 7, page 13) acknowledges but then sidesteps Fully Sharded Data Parallelism (FSDP). FSDP is now the dominant strategy for training very large models, and it fundamentally changes communication patterns from the AllReduce of gradients to ReduceScatter and AllGather of parameters. This will change the size, structure, and timing of the data chunks being sent over the network. The paper would be significantly stronger if it included an analysis, even speculative, of how NetZIP's data assumptions and performance benefits would translate to an FSDP environment.
-
Scope of Data Analysis (Fine-tuning vs. Pre-training): The experiments are based on collecting gradients and activations during fine-tuning (Section 6.1, page 10). During this phase, model weights and activations are expected to change incrementally, making the delta compression scheme particularly effective. However, during the initial stages of pre-training from scratch, gradients can be much larger and more chaotic. The core assumptions about small deltas may not hold as strongly, potentially reducing the effectiveness of the algorithm. A brief analysis of data from early-stage pre-training would help generalize the paper's claims.
-
Hardware Implementation Realism: The hardware evaluation relies on an FPGA-based prototype connected externally to a standard NIC, with performance plugged into a simulator. This is a common and reasonable methodology. However, the true challenge lies in the tight integration of such a "bump-in-the-wire" accelerator into a real, high-performance NIC data path, especially one supporting RDMA. A deeper discussion of the practical challenges—such as managing buffer pressure when compression ratios vary, handling packet reordering, and interacting with the NIC's transport-level logic without adding significant latency—would add valuable context to the ASIC projection claims (Section 5.2, page 9).
Questions to Address In Rebuttal
-
Regarding FSDP: Could the authors speculate on how the communication patterns of FSDP (specifically, the AllGather of sharded parameters) would affect the performance of NetZIP? Would the data chunks still exhibit the same compressibility properties observed in the paper's AllReduce-centric evaluation?
-
Regarding Pre-training: The paper's analysis is based on fine-tuning. Have the authors examined data from early-stage pre-training? How does the effectiveness of the delta compression scheme, in particular, change when gradients and activations are more volatile?
-
Regarding Hardware Integration: Could the authors elaborate on the potential challenges of integrating the NETZIP accelerator directly into a modern NIC's data path alongside an RDMA engine? Specifically, how would the system handle situations where the compressed payload size exceeds the original packet buffer, and how is flow control managed between the DMA engine and the compressor?
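As background for the third question, the generic guard a bump-in-the-wire compressor needs against incompressible payloads looks roughly like the following. This is a sketch of the standard raw-fallback framing pattern, not NetZIP's wire format; zlib stands in for a hardware LZ4 core and the buffer size is hypothetical.

```python
import struct
import zlib

MAX_PAYLOAD = 4096  # hypothetical NIC packet-buffer budget

def frame(raw: bytes) -> bytes:
    # Ship the compressed form only when it both shrinks the payload and fits
    # the buffer; otherwise fall back to the raw bytes behind a one-byte flag,
    # so a poor compression ratio can never overflow the original buffer.
    comp = zlib.compress(raw, level=1)
    if len(comp) < len(raw) and len(comp) + 5 <= MAX_PAYLOAD:
        return struct.pack("!BI", 1, len(raw)) + comp
    return struct.pack("!BI", 0, len(raw)) + raw

def unframe(frame_bytes: bytes) -> bytes:
    flag, _orig_len = struct.unpack("!BI", frame_bytes[:5])
    body = frame_bytes[5:]
    return zlib.decompress(body) if flag else body

assert unframe(frame(b"\x00" * 1024)) == b"\x00" * 1024
```

The rebuttal should explain how an equivalent bound is enforced in hardware and how the variable output size interacts with the DMA engine's flow control.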
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present NetZIP, a hardware/software co-design aimed at reducing communication latency in distributed large model training through in-network lossless compression. The core proposal consists of two parts: (1) NetZIP-algorithm, a set of pre-processing techniques (byte/bit grouping and delta compression) designed to transform gradient and activation data to be more amenable to standard lossless compression; and (2) NetZIP-accelerator, a "bump-in-the-wire" hardware architecture integrated into a NIC that implements these algorithms and a lightweight compressor (LZ4) to minimize overhead.
My analysis concludes that while the constituent algorithmic components have clear and strong precedents in prior art, their specific synthesis and application to the lossless compression of both gradients and activations for large model training, coupled with the proposed in-network hardware architecture, represents a novel system-level contribution. However, the paper overstates its algorithmic novelty and fails to properly contextualize its methods against existing, functionally analogous techniques.
Strengths
-
Novel Architectural Proposal: The "bump-in-the-wire" accelerator architecture (Section 5.2, page 8, Figure 8) is a key novel contribution. By placing the compression/decompression logic directly in the NIC's datapath between the DMA engine and protocol engines, the design compellingly addresses the overhead of data movement to/from host CPU, GPU, or even a PCIe-attached accelerator on a SmartNIC, which the authors correctly identify as a major performance bottleneck (Section 5.2, page 7). This architectural insight is well-argued and significant.
-
Application to Activations: The paper's focus on compressing not only gradients but also activations is a noteworthy distinction from the bulk of prior work in training communication reduction, which has myopically focused on gradients. The analysis showing that activations constitute a significant portion of communication traffic (21-49% in their models, Section 4, page 4) validates this focus and represents a novel problem framing.
-
Co-design Synergy: The primary strength of the work lies in the co-design itself. The choice of a lightweight algorithm (LZ4) is justified not on its standalone compression ratio (which is poor) but on its hardware efficiency, which enables a high-throughput, low-latency implementation. This efficiency is then leveraged by the algorithmic pre-processing, which specifically enhances the compressibility for LZ-style algorithms. This tight coupling is the essence of the proposed system's novelty.
Weaknesses
My singular focus is novelty, and on this front, the claimed algorithmic contributions are substantially weaker than presented.
-
Bit/Byte Grouping is Not New: The core idea of reorganizing data by grouping bits or bytes of similar entropy to improve compressibility is not novel. This technique is functionally analogous to the "Lane Compression" method proposed by Ko et al. [23], which the authors cite in their related work (Section 7, page 13) but do not compare against. Lane Compression also groups bit positions into different "lanes" based on entropy to aid subsequent compression. While the application domain differs (model parameters for inference vs. gradients/activations for training), the fundamental algorithmic principle is identical. The paper must acknowledge this prior art directly and significantly temper its claims of algorithmic novelty in this area.
-
Delta Compression is a Standard Technique: The proposed "Delta Value Compression" (Section 5.1, page 7) is a straightforward application of delta encoding, one of the oldest and most fundamental techniques in data compression. It is used ubiquitously in version control, video codecs, and data backup systems. While its application to gradients in training is logical and effective, as shown by the authors' analysis, it does not represent a new algorithmic concept. The adaptation of using a minimum value per layer as a base instead of the previous iteration's full tensor is a practical engineering choice to manage memory, not a fundamental innovation.
-
Overstated Algorithmic Contribution: Due to the points above, the paper's narrative of proposing new algorithms is misleading. The novelty is not in the invention of these techniques, but in their application and hardware co-design. The contribution would be stronger if it were framed as "a novel system architecture that effectively applies and accelerates known data transformation techniques for in-network training communication," rather than implying the creation of new compression algorithms.
Questions to Address In Rebuttal
The authors should use the rebuttal to clarify the precise boundaries of their novel contributions.
-
Please explicitly differentiate the proposed "bit/byte grouping" from the "Lane Compression" method in [23]. Beyond the difference in application domain (training vs. inference), what is the fundamental algorithmic distinction in how data is transformed to improve compression? Why was this highly relevant work not discussed in the main body of the paper (e.g., in Section 5.1)?
-
Given that delta compression is a well-established technique, could the authors refine their claim regarding the novelty of "Delta Value Compression"? Is the novelty simply its application in this context, or is there a more subtle algorithmic innovation that I have missed?
-
The paper notes that FSDP is a key distributed training paradigm that it could not evaluate (Section 7, page 13). FSDP fundamentally changes communication by sharding parameters and optimizer states, potentially altering the iterative similarity of communicated data. How do the authors expect the effectiveness of their delta compression scheme to be impacted in an FSDP context, where a given worker may not see the same tensor slice in consecutive iterations? Does this limit the novelty of the approach to specific parallelism strategies (DP/TP/PP)?
Characterizing the Efficiency of Distributed Training: A Power, Performance, and Thermal Perspective
Abstract
The rapid scaling of Large Language Models (LLMs) has pushed training workloads far beyond the limits of single-node analysis, demanding a deeper understanding of how these models behave across large-scale, multi-GPU systems. In this paper, we present a ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present a characterization study of distributed Large Language Model (LLM) training. They evaluate performance, power, and thermal behavior across three modern GPU platforms (NVIDIA H100/H200, AMD MI250) using several dense and sparse models. The study examines the effects of different parallelism strategies (Tensor, Pipeline, Data, Expert) and common software optimizations like activation recomputation and compute-communication overlap. The authors conclude with several insights, including the nuanced trade-offs between scale-up and scale-out systems, inefficiencies in hybrid parallelism schemes, the limits of microbatch scaling, and the performance impact of thermal imbalances. Finally, they attempt to project their findings to datacenter-scale systems via simulation.
Strengths
- Experimental Testbed: The study is conducted on an impressive and relevant set of modern hardware (H200, H100, MI250). Access to such platforms for a systematic study is a notable strength.
- Breadth of Experiments: The authors have undertaken a significant experimental effort, covering multiple models, parallelism configurations, and optimization techniques. This provides a broad, albeit thin, survey of the design space.
- Inclusion of Thermal Analysis: The focus on thermal imbalance (Section 6) and its concrete impact on clock throttling (Figure 17) is a valuable contribution. It highlights a practical, physical-layer constraint often overlooked in purely algorithmic performance studies.
Weaknesses
The paper's ambition is commendable, but its execution lacks the necessary rigor, leading to shallow analyses and insufficiently substantiated claims.
-
Insufficient Methodological Detail: The foundation of any characterization study is its measurement methodology, which is inadequately described here.
- The authors state they use a "modified version of Zeus" (Page 4, Section 3.1) for energy and telemetry. The nature and impact of these modifications are not specified. What is the measurement overhead? How was the modified tool validated against ground truth? Without this information, the fidelity of all power, thermal, and utilization data is questionable.
- For the AMD MI250 platform, the paper states that "smaller versions of GPT-3 and Llama-3" were used due to memory constraints (Page 5, Section 3.2). The process of scaling down these models is not detailed. Architectural changes during scaling (e.g., number of layers vs. hidden dimension) can drastically alter the compute-to-communication ratio, making the cross-platform comparison to the full-size models on NVIDIA hardware potentially invalid. The claim of providing "valuable cross-platform insights" is therefore weakened.
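For reference, the kind of sidecar telemetry loop whose overhead and fidelity are being questioned can be sketched directly against NVML. This is a generic pynvml polling sketch, not the authors' modified Zeus; sampling period and duration are illustrative.

```python
import time
import pynvml

def sample_gpu_telemetry(period_s=0.1, duration_s=5.0):
    # Poll power, temperature, and SM clock for every visible GPU. The
    # measurement overhead is roughly one NVML query set per GPU per period,
    # which is exactly the cost the review asks the authors to quantify.
    pynvml.nvmlInit()
    handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
               for i in range(pynvml.nvmlDeviceGetCount())]
    samples = []
    t_end = time.time() + duration_s
    while time.time() < t_end:
        ts = time.time()
        for i, h in enumerate(handles):
            samples.append({
                "t": ts,
                "gpu": i,
                "power_w": pynvml.nvmlDeviceGetPowerUsage(h) / 1000.0,
                "temp_c": pynvml.nvmlDeviceGetTemperature(h, pynvml.NVML_TEMPERATURE_GPU),
                "sm_clock_mhz": pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_SM),
            })
        time.sleep(period_s)
    pynvml.nvmlShutdown()
    return samples
```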
-
Superficial Analysis and Overstated Insights: The paper identifies several well-known phenomena but fails to provide deep, quantitative explanations for their root causes.
- Scale-up vs. Scale-out (Section 4.1): The conclusion that the optimal strategy "depends on model size, sparsity, and parallelism strategy" is not a novel insight. The analysis attributes performance differences to "communication locality" and "inter-node traffic" (Page 5), but fails to provide a quantitative breakdown. For instance, in the GPT3-175B case where H200 excels, what precise percentage of the performance gain is due to avoiding inter-node AllReduce versus exploiting higher intra-node NVLink bandwidth for Tensor Parallelism? The kernel breakdowns in Figure 3 are a start, but the narrative connecting them to the high-level claims is tenuous.
- Limits of Microbatch Scaling (Section 5): The paper correctly observes that increasing microbatch size can harm performance but attributes this to vague causes like "communication bandwidth saturation" and "pipeline-induced stalls" (Page 10, Insight box). Which specific communication fabric is saturating (PCIe, InfiniBand)? Figure 15 shows that AllReduce and SendRecv time increases, but provides no evidence of why. Is this due to increased message size leading to network congestion, or is it a tail-latency effect from stragglers? The analysis stops short of identifying the true bottleneck.
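One way to substantiate or refute the saturation claim is to convert measured collective times into achieved bus bandwidth and compare it against the nominal link rate. A minimal sketch using the usual ring AllReduce accounting follows; the numbers are illustrative, not the paper's measurements.

```python
def ring_allreduce_bus_bw(bytes_per_gpu, time_s, world):
    # Achieved "bus bandwidth" in GB/s for a ring AllReduce, using the standard
    # 2*(N-1)/N traffic factor (as in nccl-tests). Comparing this against the
    # nominal NIC/NVLink rate shows whether the fabric is actually saturated.
    traffic = 2.0 * (world - 1) / world * bytes_per_gpu
    return traffic / time_s / 1e9

# Hypothetical numbers: a 2 GiB gradient AllReduce over 64 GPUs taking 180 ms.
print(round(ring_allreduce_bus_bw(2 * 2**30, 0.180, 64), 1), "GB/s")
```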
-
Unsupported Extrapolation and Speculation:
- The projection to 8K-GPU systems in Section 7.1 is the paper's most significant flaw. The authors switch from empirical measurement on at most 64 GPUs to simulation with Astra-Sim. The paper provides zero detail on how the simulator was parameterized or calibrated against their real-system measurements. Network simulators are notoriously difficult to configure accurately, and without a rigorous calibration methodology, the results presented in Figure 22 are purely speculative and cannot be considered a valid extension of the paper's empirical findings. This section undermines the work's grounding in real-world characterization.
-
Inclusion of Underdeveloped and Contradictory Results:
- The "thermal-aware pipeline parallelism strategy" presented at the end of Section 6 and in Figure 21 is a premature and distracting inclusion. It is presented as a solution, yet the results are mixed: a meager 4% efficiency gain for Llama3 is contrasted with a 7% efficiency degradation for GPT3-175B. The paper glosses over this negative result. Such a preliminary and inconclusive experiment does not belong in a characterization paper and weakens its focus and credibility.
Questions to Address In Rebuttal
The authors must provide precise, data-driven answers to the following questions:
- Regarding your "modified version of Zeus" (Page 4): What specific modifications were made? Provide data on the validation of this tool and quantify its measurement overhead on the system.
- Regarding the scaled-down models for the MI250 (Page 5): Detail the exact methodology used to scale down GPT-3 and Llama-3. Provide evidence that these smaller variants maintain the same fundamental bottleneck characteristics (e.g., compute-bound vs. memory-bound, communication patterns) as their full-sized counterparts, thereby justifying the cross-platform comparison.
- Regarding the claim of "communication bandwidth saturation" with larger microbatches (Page 10): Provide specific data from your telemetry (e.g., PCIe bus utilization, InfiniBand NIC throughput) that directly demonstrates saturation of a specific hardware resource. Correlate this saturation point with the observed performance degradation.
- Regarding the thermal-aware scheduling experiment (Page 12, Figure 21): Explain the 7% performance degradation observed for GPT3-175B. Given this negative result, what is the justification for including this experiment as a positive contribution?
- Regarding the 8K GPU extrapolation (Page 12, Section 7.1): Provide a complete and detailed account of the calibration process for Astra-Sim. How were kernel latencies, network parameters, and collective communication models from your 32/64-GPU empirical measurements translated to the simulator to ensure the validity of the 8K-GPU projections?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents a comprehensive, multi-faceted characterization of distributed training for large language models (LLMs). The authors move beyond traditional performance metrics (e.g., throughput) to provide a holistic analysis that incorporates power consumption, hardware utilization, and thermal behavior. By conducting experiments on a diverse set of modern hardware platforms (NVIDIA H100/H200, AMD MI250) and across various models and parallelism strategies (TP, PP, DP, EP), the authors empirically investigate the complex, second-order effects that arise at scale.
The core contribution is the assertion, backed by extensive data, that optimizing large-scale training requires a "full-stack" perspective. The paper demonstrates that software-level decisions—such as the choice of parallelism strategy, microbatch size, and optimizations like activation recomputation—have profound and often non-intuitive interactions with the physical realities of the underlying hardware, including network topology, power limits, and thermal dissipation. Key findings include the nuanced trade-offs between scale-up and scale-out systems, the communication inefficiencies of certain parallelism combinations (TP+PP), the performance pitfalls of excessive microbatch scaling, and the significant impact of thermal imbalance on system reliability and throughput.
Strengths
This is an excellent and timely study that provides a much-needed bridge between the domains of ML systems software and computer architecture/datacenter operations. Its primary strengths are:
-
Holistic and Grounded Perspective: The single most important strength of this paper is its commitment to analyzing the entire stack. While many papers focus on optimizing parallelism strategies in the abstract (e.g., Megatron-LM, DeepSpeed), and others focus on datacenter efficiency, this work is one of the first to rigorously connect the two. It moves the conversation from idealized performance models to the messy, physical realities of running these workloads, which is where the next set of bottlenecks lies. The focus on thermal throttling and power draw (Sections 5 & 6, pages 9-11) is particularly novel and significant for the training domain.
-
Methodological Rigor and Relevance: The experimental setup is state-of-the-art and highly relevant. The use of H100, H200, and MI250 GPUs covers the most important accelerator architectures in the market today. The choice of workloads, from dense models like GPT-3 and Llama3 to sparse MoE models like Mixtral, ensures the findings are broadly applicable. The fine-grained telemetry collection provides a solid empirical foundation for the paper's claims.
-
Revealing Non-Obvious Interactions: The paper excels at uncovering insights that challenge conventional wisdom. For example:
- The finding that higher-memory "scale-up" systems (32xH200) can outperform "scale-out" systems (64xH100) in communication-heavy regimes (Section 4.1, page 5) highlights that raw aggregate compute is not the only factor in performance.
- The demonstration that increasing microbatch size can harm performance due to communication saturation and thermal stress (Section 5, page 9, Figure 13) provides a crucial, practical guideline that contradicts the simplistic "bigger is better" assumption.
- The clear visualization of thermal imbalance due to physical node layout (Section 6, page 10, Figure 17) and its direct link to performance throttling is a powerful demonstration of how physical constraints impact distributed algorithms.
-
Connecting to Broader Research Agendas: This work provides foundational data that can inform multiple research communities. It implicitly challenges automated parallelism frameworks (e.g., Alpa) to incorporate physical constraints into their cost models. It provides concrete motivation for work on topology-aware communication collectives (e.g., TACCL, TopoOpt). Finally, it extends the investigation of power- and thermal-aware scheduling, previously explored for inference (e.g., TAPAS, DynamoLLM), into the synchronous, long-running, and highly-coupled domain of LLM training.
Weaknesses
The weaknesses of the paper are primarily related to its framing and the scope of its conclusions, rather than fundamental flaws in the methodology or results.
-
Characterization vs. Solution: The paper is, at its heart, a characterization study. It does an excellent job of identifying and quantifying problems. While the brief exploration of thermal-aware pipeline scheduling is a step in the right direction (Section 7, page 12), it feels preliminary compared to the depth of the problem analysis. The paper would be strengthened if the authors were more explicit in framing their work as foundational characterization that motivates the need for new solutions, rather than presenting a complete solution itself.
-
Generalizability of Topology-Specific Findings: The results are derived from specific cluster topologies (e.g., NVIDIA HGX nodes connected via InfiniBand). While these are common, the authors could do more to discuss how their findings might change in systems with different network topologies (e.g., Dragonfly, optical circuit switches) or cooling systems (e.g., direct liquid cooling). For instance, the severe thermal imbalance shown in Figure 17 might be less pronounced in a liquid-cooled system, which would alter the trade-off calculus for different parallelism strategies.
-
Depth of Causal Analysis: The paper establishes strong correlations between software choices and physical effects (e.g., PP-heavy configurations lead to higher peak power). However, a deeper microarchitectural analysis explaining why these patterns emerge would be beneficial. For example, what specific resource contention (e.g., on memory controllers, PCIe bus) during compute-communication overlap leads to the observed thermal stress? While likely outside the primary scope, adding some discussion here would further solidify the paper's contribution.
Questions to Address In Rebuttal
-
The thermal-aware pipeline stage placement experiment (Section 7, page 12) is very promising. Could you elaborate on the limitations of this approach? For instance, how does the strategy adapt if the "hot" and "cold" GPUs change during a long training run, and how much of the performance gain is tied to the specific model architecture (e.g., number of layers being divisible by the number of stages)?
-
Your work compellingly demonstrates the importance of physical node layout and network topology. How do you foresee your key insights—particularly regarding scale-up vs. scale-out and TP+PP inefficiencies—translating to future disaggregated or chiplet-based systems where memory, compute, and networking resources may be composed in more flexible ways?
-
The industry is increasingly moving towards advanced cooling solutions like direct liquid cooling to manage the thermal density of modern accelerators. How would such a technology alter the conclusions of your study? Would it eliminate the thermal bottleneck entirely, or would it simply shift the bottleneck to another system component (e.g., power delivery, interconnect bandwidth)?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present a comprehensive characterization study of distributed Large Language Model (LLM) training. The work's central thesis is that traditional performance-centric analysis is insufficient, and a holistic perspective incorporating power consumption and thermal behavior is necessary to understand system efficiency. The authors evaluate various parallelism strategies (Tensor, Pipeline, Data, Expert) and common optimizations (activation recomputation, compute-communication overlap) across modern hardware platforms (NVIDIA H100/H200, AMD MI250).
The paper does not propose a new algorithm, hardware architecture, or training technique. Its claim to novelty rests entirely on the insights derived from its multi-faceted characterization. Specifically, it claims to be the first to systematically and quantitatively link specific software-level parallelism and optimization choices to their physical, second-order consequences, such as thermal imbalance, power excursions, and clock throttling, in the context of state-of-the-art LLM training systems.
Strengths
The primary contribution of this work is the rigorous documentation of system behaviors that, while perhaps suspected by practitioners, have not been systematically studied and published in an academic venue. The novelty is not in the invention of a new method, but in the elucidation of previously unquantified, non-obvious system interactions.
Specific novel insights include:
-
The Counter-Intuitive Limits of Microbatch Scaling: While conventional wisdom suggests larger microbatches are better if memory allows, this paper provides concrete evidence that beyond an optimal point, they create bursty execution patterns that lead to higher peak power and worsened thermal throttling, ultimately degrading performance (Section 5, page 9, Figure 13). This is a significant finding that challenges common heuristics.
-
Quantification of Thermal Imbalance Impact: The paper moves beyond acknowledging thermal issues to showing a direct causal link between server airflow design (Figure 16, page 10), GPU placement, the chosen parallelism strategy (high-PP configurations), and persistent clock throttling on specific GPUs (Figure 17, page 11). This demonstrates that applying uniform software optimizations to physically non-uniform hardware is a flawed strategy.
-
Inefficiency of Combined Parallelism Strategies: The analysis revealing that the combination of Tensor Parallelism and Pipeline Parallelism (TP+PP) leads to underutilization of PCIe bandwidth due to sparse, uncoordinated SendRecv calls is a specific and novel finding (Section 4.2, page 7). This identifies a concrete inefficiency in current software frameworks that was not previously highlighted.
Weaknesses
My main critique stems from the definition of novelty. This is fundamentally a characterization study, an established research methodology. The work synthesizes existing measurement tools and techniques to analyze existing software on existing hardware.
-
Lack of a Constructive Contribution: The paper is diagnostic, not prescriptive. It identifies and quantifies numerous problems but stops short of proposing and evaluating a novel mechanism to solve them. For example, after demonstrating the negative impact of thermal imbalance, the authors offer recommendations but do not implement a novel thermal-aware scheduler. The brief experiment on thermal-aware placement in the discussion (Section 7, page 12) is a simple heuristic (placing heavy stages on cold GPUs) and is presented more as a proof-of-concept than a fully-fledged novel algorithm. The core of the paper remains observational.
-
Conceptual Overlap with Prior Art: While the authors focus on LLM training, the core concept of co-designing software with an awareness of power and thermal constraints is not new. Prior work in HPC and datacenter management has long explored thermal-aware job scheduling. Google's technical blog post, "Balance of power: A full-stack approach to power and thermal fluctuations in ML infrastructure" [16], which the authors cite, discusses this exact problem space at a high level. This paper's contribution is the granular, quantitative analysis for specific LLM parallelism strategies, which makes it an incremental, albeit valuable, advancement over the conceptual state-of-the-art rather than a paradigm shift.
-
Dependence on Established Optimizations: The techniques analyzed—activation recomputation [28], compute-communication overlap [75], FSDP [82]—are all well-established. The paper's contribution is to map their secondary effects, not to introduce a new optimization.
In essence, the paper provides a new, high-resolution map of a known territory. The map is useful and reveals details not seen before, but it is not the discovery of a new continent.
Questions to Address In Rebuttal
-
The paper's core novelty lies in its insights. Could the authors consolidate their findings and explicitly state which of their documented system interactions they believe were truly unknown to the community (both academic and industrial) prior to this work, versus those that were "common knowledge" but previously unquantified?
-
The work expertly diagnoses the sub-optimality of applying uniform optimizations to thermally heterogeneous hardware. The lack of a proposed novel mechanism to address this is a significant limitation. Can the authors justify why they chose not to propose and evaluate a new scheduling algorithm based on their findings? Does the brief thermal-aware placement experiment (page 12) contain a novel algorithmic component that was not fully elaborated upon?
-
How does this work's contribution differ fundamentally from the full-stack power/thermal analysis described in prior industrial reports, such as Google's blog post [16]? Please precisely articulate the delta beyond being a more detailed, academic study. Is the novelty simply the level of granularity, or is there a conceptual advance?
SkipReduce: (Interconnection) Network Sparsity to Accelerate Distributed Machine Learning
Abstract
The interconnection network is a critical component for building scalable systems, as its communication bandwidth directly impacts the collective communication performance of distributed training. In this work, we exploit interconnection network sparsity (...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose SkipReduce, a technique to accelerate distributed training by deterministically omitting entire communication steps within the Ring AllReduce algorithm's Reduce-Scatter phase. The central idea is that by skipping steps, the communication latency is proportionally reduced. To mitigate potential accuracy degradation from biased gradient skipping, the authors introduce "Random SkipReduce," which varies the skipped gradient slices across iterations, and "Selective SkipReduce," which avoids skipping gradients from pre-determined "important" layers. The authors claim that SkipReduce provides up to a 1.58x speedup in time-to-accuracy over baseline AllReduce and other communication reduction techniques on an 8-GPU system.
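To fix notation for the critique below, the mechanism as this reviewer understands it can be simulated in a few lines. This is a schematic sketch of skipping the first S steps of the Reduce-Scatter phase, not the authors' NCCL implementation.

```python
import numpy as np

def ring_reduce_scatter(chunks, skip_steps=0):
    # chunks[r][c] is rank r's local copy of chunk c (scalars here for brevity).
    # A full ring runs n-1 steps; SkipReduce-style truncation starts the loop
    # later, so each finally-owned chunk misses the contributions of
    # `skip_steps` ranks. Those gradients are effectively zeroed, not
    # renormalized away; Random SkipReduce would vary which ranks are missed
    # from iteration to iteration.
    n = len(chunks)
    for step in range(skip_steps, n - 1):
        for r in range(n):                    # peers exchange concurrently in
            c = (r - step) % n                # reality; simulated sequentially
            chunks[(r + 1) % n][c] += chunks[r][c]
    return chunks

# 4 ranks with unit-valued chunks: rank 0 owns chunk 1 after the phase.
full = ring_reduce_scatter([np.ones(4) for _ in range(4)], skip_steps=0)
part = ring_reduce_scatter([np.ones(4) for _ in range(4)], skip_steps=1)
print(full[0][1], part[0][1])   # 4.0 vs 3.0: one contribution skipped
```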
Strengths
- Simplicity and Low Overhead: The core mechanism of skipping communication steps is conceptually simple and, as implemented, introduces negligible computational overhead compared to compression-based methods like top-k sparsification or low-rank approximation, which require significant pre/post-processing.
- Direct Integration: The proposed method is implemented directly within NCCL, demonstrating a plausible path for integration into existing deep learning frameworks without requiring major architectural changes.
- Problem Formulation: The paper correctly identifies the growing problem of communication overhead in distributed training and explores a direction—coarse-grained communication omission—that is distinct from fine-grained compression.
Weaknesses
The paper's claims rest on a foundation that appears methodologically fragile and insufficiently validated. My primary concerns are as follows:
-
Fundamentally Flawed Baseline Comparison: The central performance claims are benchmarked against a fundamentally flawed and non-functional baseline, "Ideal Top-k." In Section 5.1 (page 9), the authors explicitly state, "we implement an idealized version of top-k, in which the indices and values of the selected gradients are communicated as if they were dense tensors... this implementation is not functionally correct as it does not support actual sparse reductions." This is not a valid scientific comparison. It compares a real, working implementation (SkipReduce) against a hypothetical best-case scenario for a competitor that ignores the very real-world overheads (e.g., COO format conversion, irregular memory access, sparse reduction kernels) that make top-k methods challenging. The reported speedups in Figure 11 and Figure 12 are therefore highly misleading, as they do not reflect a comparison against a viable, state-of-the-art alternative.
-
Fragile Heuristic for Selective SkipReduce: The concept of "Selective SkipReduce" relies on a fragile and poorly justified heuristic. In Section 4.4 (page 8), "important gradients" are identified as those in the global top-25% by magnitude, sampled at the end of the first epoch. The gradient distribution of a model is known to be highly dynamic, shifting significantly from early-stage training to convergence (a point the authors themselves make in Figure 2). A decision based on a single snapshot after the first epoch is unlikely to remain optimal for the entire duration of training. No evidence is provided to support the robustness of this one-time decision. This introduces an ad-hoc hyperparameter (the sampling point) and undermines the generality of the selective approach.
-
Insufficient Justification for "Skipping" over "Dropping": The authors differentiate between "skipping" (effectively zeroing out a gradient's contribution) and "dropping" (removing the gradient and adjusting the divisor for the average). In Figure 15 (page 10), they show "DropReduce" performs worse but offer a weak explanation: "by completely removing the gradients, the contribution of the remaining gradients is amplified, possibly creating bias." This is a hand-wavy argument. Adjusting the divisor is a mathematically sound way to compute a correct average over the non-dropped elements. The claim that zeroing is superior requires a more rigorous theoretical or empirical analysis, as it is non-obvious why introducing zeros (a specific value) is inherently better than re-normalizing.
-
Inadequate Scale of Evaluation: The empirical evaluation is conducted exclusively on an 8-GPU system (Section 5.1, page 8). While useful for initial validation, this scale is insufficient to substantiate claims about accelerating large-scale machine learning. The communication-to-computation ratio, and thus the potential benefit of techniques like SkipReduce, changes dramatically when scaling to hundreds or thousands of workers. The "analytical evaluation" for other topologies in Section 6.1 (page 10) is not a substitute for empirical results and offers no proof that the method's accuracy/performance trade-off holds at scale.
-
Post-Hoc Modifications to the Algorithm: The introduction of a "warm-up" period for Sharded Data Parallel (SDP) training (Section 6.3, page 12) appears to be an ad-hoc modification required to make the technique work in that context. This was not presented as a core component of the SkipReduce algorithm. Its necessity complicates the method, adds another hyperparameter (the warm-up duration), and raises questions about whether similar "fixes" are needed for other models or training regimes not tested in the paper.
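For reference, the one-shot layer-selection heuristic questioned in the second weakness presumably amounts to something like the following. This is one plausible reading; the function name and the per-layer scoring granularity are assumptions, not taken from the paper.

```python
import torch

def pick_protected_layers(named_grads, frac=0.25):
    # Score each layer by mean absolute gradient from a single snapshot and
    # protect the top `frac` from skipping. Whether that early snapshot stays
    # representative over a full training run is exactly the robustness
    # question raised above.
    scores = {name: g.abs().mean().item() for name, g in named_grads.items()}
    k = max(1, int(len(scores) * frac))
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

# Toy snapshot: four "layers" with different gradient scales.
snapshot = {f"layer{i}": torch.randn(1024) * s
            for i, s in enumerate([1.0, 0.1, 0.01, 0.5])}
print(pick_protected_layers(snapshot, frac=0.25))   # almost surely {"layer0"}
```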
Questions to Address In Rebuttal
- Please justify the use of a non-functional "Ideal Top-k" baseline. To make a credible claim, the authors must either compare against a real, functional top-k implementation (e.g., from [25, 42]) or significantly temper their performance claims and acknowledge the comparison is only against a theoretical, unachievable lower bound.
- Provide evidence that selecting "important" layers based on the gradient distribution after a single, initial epoch is a robust heuristic. For example, show how the set of important layers evolves throughout training and demonstrate that the initial choice remains valid.
- Provide a more rigorous mathematical or empirical justification for why "skipping" (zeroing) gradients results in better accuracy than "dropping" (re-normalizing the average). The current explanation is insufficient.
- How can the results from an 8-GPU experiment be extrapolated to large-scale systems where communication patterns and bottlenecks are vastly different? Please provide a more compelling argument for the scalability of SkipReduce's benefits.
- Is the "warm-up" period for SDP a general requirement for SkipReduce when applied to weight communication, or a specific fix for the LLaMA model? How should a user determine the necessity and duration of such a warm-up?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces "SkipReduce," a novel and pragmatic approach to accelerate collective communication in distributed deep learning. The core contribution is the idea of coarse-grained gradient skipping, implemented by intentionally omitting a fraction of the communication steps in a standard ring-based AllReduce algorithm. This method directly reduces communication time by shortening the critical path of the collective operation.
The authors astutely identify that while prior gradient compression techniques (e.g., top-k sparsification, low-rank approximation) also reduce data transfer, they often introduce significant computational overhead for selection, compression, and reconstruction. This overhead can negate the benefits, especially on modern systems with high-bandwidth interconnects. SkipReduce's primary advantage is its near-zero computational overhead, allowing it to scale effectively with improving hardware.
The paper further refines this core idea by introducing two key enhancements: 1) Random SkipReduce, which randomizes the skipped gradient slices in each iteration to mitigate the bias and accuracy loss of naively skipping the same slices repeatedly, and 2) Selective SkipReduce, which leverages the insight that gradient importance is non-uniform across model layers, allowing for targeted skipping of less critical layers to preserve model accuracy. The authors demonstrate through comprehensive experiments that SkipReduce provides a significant speedup in time-to-accuracy for various models, positioning it as a powerful and practical alternative to existing communication acceleration techniques.
Strengths
-
Elegant and Practical Core Idea: The central concept of skipping collective communication steps is both simple and powerful. It reframes the problem of gradient sparsification from a data-centric view (which gradients to send?) to a system-centric one (which communication steps can be omitted?). This leads to an implementation with minimal overhead, which is a critical differentiator from much of the prior work. The argument presented in Section 3 and validated in Figures 3 and 4 (pages 4-5) is particularly compelling, showing how the computational overhead of techniques like PowerSGD and top-k becomes a performance limiter on high-bandwidth networks, a limitation that SkipReduce elegantly sidesteps.
-
Excellent Contextualization and Exploration of Broader Implications: This work stands out for its panoramic view. The authors do not just present a performance optimization; they explore its deeper connections within the field.
- Connection to Regularization: The analysis in Section 6.4 (page 12), which positions SkipReduce as a form of coarse-grained gradient dropout, is insightful. Demonstrating that it is complementary to traditional Dropout (Figure 20) elevates the contribution from a mere systems trick to a technique with potential benefits for model generalization.
- Applicability to Advanced Parallelism: The extension to Sharded Data Parallelism (SDP) in Section 6.3 (page 11) is forward-looking and demonstrates the versatility of the core idea. The discussion of the challenges (aligning skipped weights and gradients) and the novel implications (creating an ensemble of subnetworks) opens up fascinating new research avenues.
- Analysis of Alternative Topologies: The analytical exploration of how SkipReduce would behave with tree and halving-doubling topologies in Section 6.1 (page 10) shows a thorough understanding of the collective communication landscape and strengthens the paper's claims of generalizability.
-
Thorough and Methodologically Sound Evaluation: The experimental validation is robust. The use of time-to-accuracy (TTA) as the primary metric is the correct choice, as it holistically captures the trade-off between per-iteration speedup and convergence behavior. The comparison against a well-implemented "ideal top-k" baseline and PowerSGD across a diverse set of workloads (CNN, and large-scale Transformers like BERT and LLaMA) provides a convincing case for SkipReduce's effectiveness, especially in the high-bandwidth regime common in modern GPU clusters. The ablations, such as comparing Static vs. Random SkipReduce (Figure 7, page 7), are well-designed and clearly justify the design choices.
Weaknesses
-
Limited Discussion on Hyperparameter Tuning: SkipReduce introduces a new critical hyperparameter: the skipping ratio. The paper primarily evaluates a fixed ratio (e.g., 50%). However, there is little guidance on how a practitioner should select this value. The observation in Figure 2 (page 4) that gradient sparsity changes dynamically throughout training suggests that a static skipping ratio may be suboptimal. A discussion on the sensitivity to this parameter or the potential for an adaptive skipping schedule would significantly enhance the practical utility of the work.
-
Theoretical Intuition Could Be Deepened: The paper's motivation rests on the empirical observation of gradient sparsity. While it correctly cites prior work [46] on the convergence of random gradient compressors, the connection could be more deeply explored. Why is skipping coarse-grained, contiguous slices in a ring AllReduce so effective? There seems to be an implicit assumption that information is sufficiently redundant or evenly distributed such that dropping entire slices is tolerable. A more formal or intuitive explanation of the interplay between the ring algorithm's specific data flow and gradient information structure would strengthen the paper's foundations.
Questions to Address In Rebuttal
-
Could the authors elaborate on the sensitivity of SkipReduce's performance (both TTA and final accuracy) to the skipping ratio? Is there a clear "sweet spot," and does it vary significantly across different model architectures or datasets? Could you provide guidance on how a practitioner might approach tuning this hyperparameter?
-
The connection to regularization is compelling. The finding that SkipReduce and Dropout are complementary (Figure 20, page 12) suggests they might be regularizing the model in different ways. Could you speculate on the mechanisms? For instance, does Dropout prevent co-adaptation of individual neurons, while SkipReduce's coarse-grained nature prevents co-adaptation of larger, structurally-defined groups of parameters (i.e., those grouped into a slice)?
-
The analysis of Sharded Data Parallelism (SDP) in Section 6.3 is fascinating. In this setting, skipping steps in the weight AllGather phase is framed as training a unique subnetwork on each GPU. This seems conceptually similar to ensemble methods. Did you observe if this ensembling effect led to benefits beyond what was seen in the data-parallel case, such as improved model calibration or robustness, even if final accuracy was slightly lower?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper introduces SkipReduce, a novel method for accelerating collective communication in distributed deep learning. The core idea is to modify the standard ring AllReduce algorithm by intentionally skipping a subset of the communication steps in the Reduce-Scatter phase. This action is functionally equivalent to dropping entire slices of gradients in a coarse-grained manner, which directly reduces the total communication time. The authors augment this core idea with two key refinements: 1) "Random SkipReduce," which randomizes the skipped slices across iterations to prevent systemic bias, and 2) "Selective SkipReduce," which applies the skipping mechanism only to model layers deemed less important based on gradient magnitudes. The work positions itself as a low-overhead alternative to existing communication reduction techniques like top-k sparsification and gradient quantization, particularly for high-bandwidth systems where the computational overhead of those methods can negate their benefits.
Strengths
-
Novelty of the Core Mechanism: The central contribution—modifying the communication protocol itself by truncating its steps—is genuinely novel. Prior art has extensively focused on modifying the data being sent (e.g., sparsification via top-k, compression via quantization or low-rank factorization). SkipReduce, in contrast, modifies the protocol execution. Instead of sending modified data for 2(N-1) steps, it sends unmodified data for 2(N-1)-S steps. This shift in approach from data manipulation to protocol manipulation is a significant and clever conceptual leap.
-
Elegant Simplicity: The proposed modification is remarkably simple. As described in Algorithm 2 (page 6), the implementation appears to only require changing the starting index of a loop within the Reduce-Scatter phase. This stands in stark contrast to the significant engineering complexity of efficient top-k selection, sparse format handling (and the associated "fill-in" problem), or the computational cost of low-rank factorization (e.g., PowerSGD). The low complexity is a major strength.
-
Clear Articulation of the Problem Niche: The authors correctly identify a critical gap in existing work. As shown in their motivational analysis (Section 3, Figures 3 and 4), the computational overhead of many compression techniques makes them less effective on modern, high-bandwidth interconnects. By proposing a method with virtually zero computational overhead, the paper targets a relevant and increasingly important regime of distributed training.
Weaknesses
-
Conceptual Overlap with Fault-Tolerant Collectives: While the motivation is novel (proactive acceleration), the effect (an incomplete gradient reduction) bears a strong conceptual resemblance to fault-tolerant or resilient collective communication algorithms designed for unreliable networks (e.g., MLT [52], OptiReduce [53]). These works also deal with dropped packets/gradients and proceed with an incomplete result to avoid tail latency. The paper mentions these works as orthogonal (Section 2.3, page 4), but the distinction is not sufficiently sharp. A more direct comparison explaining why intentionally skipping steps is fundamentally different from reactively handling dropped steps would strengthen the novelty claim.
-
The Novelty is Tightly Bound to a Specific Protocol: The entire mechanism and its elegant simplicity are described almost exclusively in the context of Ring AllReduce. While Section 6.1 (page 10) provides a high-level analytical discussion of other topologies (tree, halving-doubling), it lacks the concrete, algorithmic detail needed to establish the novelty of the idea beyond the ring structure. For a tree-based AllReduce, it is not immediately obvious what "skipping a step" means and whether it would yield a similarly clean reduction in communication time without significant algorithmic changes. The novelty, as presented, is somewhat narrow.
-
Incremental Novelty of Refinements: The refinements, while valuable, are applications of well-known principles to the new SkipReduce mechanism. Randomization to mitigate bias ("Random SkipReduce") is a standard technique in machine learning. Similarly, using gradient magnitude as a proxy for importance ("Selective SkipReduce") is the foundational heuristic for all top-k sparsification methods. While the application of these ideas is new in this specific context, the ideas themselves are not. The paper's primary novelty lies squarely in the core SkipReduce protocol modification, not these extensions.
Questions to Address In Rebuttal
-
Please elaborate on the novelty of SkipReduce in contrast to fault-tolerant collective communication schemes (e.g., MLT [52], OptiReduce [53]) that also result in incomplete reductions. While the motivation (performance vs. resilience) differs, the functional outcome has strong similarities. What is the fundamental distinction that makes SkipReduce a novel contribution beyond a change in intent?
-
The core mechanism is elegantly demonstrated on Ring AllReduce. Can the authors provide a more concrete algorithmic sketch of how SkipReduce would be implemented on a non-ring topology, such as a tree or recursive doubling algorithm? The current analysis in Section 6.1 (page 10) is primarily theoretical and does not sufficiently demonstrate that the core novel idea is generalizable.
-
The concept of selectively skipping less important layers (Section 4.4, page 7) relies on gradient magnitude, a well-established heuristic borrowed from the gradient sparsification literature. Could the authors clarify the novel contribution of this aspect beyond applying a known heuristic to the new SkipReduce primitive?
Optimizing All-to-All Collective Communication with Fault Tolerance on Torus Networks
Abstract
Large- scale distributed processing is extensively employed for large model inference and training, such as Deep Learning Recommendation Models (DLRMs) and Mixture-of-Experts (MoE) models. However, the All-to-All collective, with its complex point-to-point ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present a suite of algorithms for All-to-All collective communication on torus networks, targeting both fault-free and fault-tolerant scenarios. For the fault-free case, they propose HalfRing for single-dimension communication and DimRotation for multi-dimensional scheduling. For scenarios with link failures, they introduce FoldedRing as a basic recovery mechanism and MATE/MATEe as an acceleration technique that leverages links from other dimensions. The paper claims significant performance speedups over a baseline Ring algorithm and Google's state-of-the-art routing on TPUv4 clusters.
However, the work rests on a series of questionable assumptions and methodological choices that undermine the validity of its core claims. The comparison against state-of-the-art is fundamentally flawed, the reported real-machine performance directly contradicts the simulation results for the fault-tolerant case, and crucial algorithmic details for complex scenarios are omitted entirely.
Strengths
- The paper addresses a timely and relevant problem: the All-to-All bottleneck in large-scale distributed training, particularly for torus networks where contention is a primary concern. The added focus on fault tolerance is also well-motivated.
- The core ideas of HalfRing (utilizing bidirectional links for shortest paths) and DimRotation (balancing load across dimensions) are conceptually straightforward and intuitive for improving upon a simplistic baseline.
Weaknesses
-
Fundamental Methodological Mismatch in SOTA Comparison: The authors' primary comparison is against Google's DOR/WFR routing on TPUv4. Their proposed methods are based on a fine-grained, hop-by-hop, store-and-forward scheduling model. In contrast, modern interconnects like Google's ICI utilize hardware-based, low-latency wormhole or virtual-cut-through routing. Comparing a contention-avoiding store-and-forward algorithm against a contention-prone but low-latency wormhole routing algorithm is a category error. The performance characteristics are entirely different; the former trades higher per-hop latency for algorithmic simplicity and contention avoidance, while the latter optimizes for latency at the cost of requiring complex hardware-level flow control. The paper makes no attempt to justify this apples-to-oranges comparison or to model the latency and resource overheads of its own store-and-forward approach fairly against a hardware-based one.
-
Selection of a Weak Baseline: The claimed speedups (e.g., 2.28x in Figure 11) are presented relative to a "Ring algorithm with pipeline scheduling." This appears to be a strawman. While the basic Ring algorithm is standard, more sophisticated pipeline scheduling schemes that mitigate bubbles more effectively than the simple one implied in Figure 7a exist in the literature. By selecting a potentially weak baseline, the performance gains of
HalfRing and DimRotation are likely inflated.
Contradictory and Alarming Real-Machine Results: The most significant flaw is the stark contradiction between simulation and real-world measurement. For the fault-tolerant case, simulations claim MATE achieves a ~1.37x speedup over the fault-free baseline. However, the real-machine experiments reported in Section 5.8 and Figure 19 show that MATE achieves a speedup of 0.77x—a 23% slowdown compared to the baseline. The authors briefly attribute this to "greater complexity" and "frequent interruptions," but this explanation is insufficient. This result invalidates the central claim that MATE is an effective acceleration technique in practice; on the contrary, it suggests the overheads of the proposed scheme are so severe that they negate any theoretical benefits.
-
Oversimplified and Underdeveloped Fault-Tolerant Mechanisms:
- The MATE algorithm's core mechanism relies on "offline performance analysis" (Section 4.2, page 8) to allocate data for acceleration. This critical procedure is never detailed. How is this analysis performed? Is it a heuristic or an optimal solver? Without this information, the algorithm is not reproducible or verifiable.
- The
-
Unsubstantiated Claims of Generality: The paper claims DimRotation handles mixed-radix torus networks well (Section 3.2, page 6), but provides no analysis or experiments to support this. In mixed-radix systems, communication time per dimension is inherently unbalanced. It is not self-evident that simply rotating the dimension order would be optimal, as the longest-latency dimension would remain the bottleneck regardless of its position in the schedule.
Questions to Address In Rebuttal
-
Please provide a rigorous justification for comparing your software-based, store-and-forward scheduling model against Google's hardware-based, wormhole-routing model (DOR/WFR). Address the fundamental differences in latency, buffer requirements, and contention management.
-
Defend your choice of "Ring algorithm with pipeline scheduling" as the primary baseline. Provide evidence that this baseline is representative of common practice and not an artificially weak competitor.
-
The central contradiction: Your simulations show MATE providing a >1.3x speedup under failure, yet your own real-machine tests show it causes a >20% slowdown (0.77x). Please provide a detailed, quantitative explanation for this discrepancy. Why should the community trust the simulation results when they are invalidated by physical measurement?
-
Provide the full algorithm and a conflict-freedom analysis for the "two-acceleration-phase MATEe scheme" used to handle multiple link failures on a single ring, as mentioned in Section 5.7.
-
Detail the exact "offline performance analysis" procedure used to schedule data transfers in MATE. What are the inputs, the objective function, and the algorithm used to determine the communication schedule? How is the fraction parameter for MATEe determined?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents a suite of contention-free algorithms and scheduling policies to optimize the All-to-All collective communication primitive on torus networks, a topology critical to modern large-scale ML systems like Google's TPU clusters. The authors address this well-known bottleneck from two angles: a fault-free scenario and a fault-tolerant one.
For the fault-free case, they propose the HalfRing algorithm, which optimizes single-dimension communication by using bidirectional links for shortest-path routing, and DimRotation scheduling, which orchestrates communication across multiple dimensions to eliminate pipeline bubbles and maximize bandwidth utilization.
For the fault-tolerant case, they introduce FoldedRing to handle a single link failure within a ring, and more importantly, the MATE scheduling strategy. MATE's core insight is to leverage healthy links from other, parallel dimensions to accelerate the necessarily slower communication on the faulty ring. The work is evaluated through extensive simulation, showing significant speedups over baseline ring algorithms and, notably, over the sophisticated hardware routing schemes used in Google's TPUv4 clusters.
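For readers unfamiliar with the terminology, the basic HalfRing observation can be stated in a few lines. This is only the reviewer's sketch of the shortest-path idea; the paper's contribution is the full contention-free, pipelined schedule built on top of it.

```python
def halfring_hops(src, dst, n):
    # Shortest-path hop count on an n-node bidirectional ring: take whichever
    # direction is shorter, so the worst case is floor(n/2) hops instead of the
    # n-1 hops of a unidirectional ring.
    fwd = (dst - src) % n
    return min(fwd, n - fwd)

# On an 8-node ring the farthest destination is 4 hops away in either direction.
assert max(halfring_hops(0, d, 8) for d in range(8)) == 4
```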
Strengths
-
Significant and Timely Problem: The paper tackles a problem of immense practical importance. As acknowledged in the introduction (Section 1, page 1), All-to-All communication is a dominant performance bottleneck for training and inference of massive Mixture-of-Experts (MoE) and Deep Learning Recommendation Models (DLRM). Optimizing this primitive on widely-deployed torus networks is a high-impact endeavor.
-
Elegant Core Contribution in MATE: While the fault-free optimizations (HalfRing, DimRotation) are solid, principled improvements on existing ideas, the MATE scheduling concept for fault tolerance is the paper's most significant and novel contribution. The idea of "borrowing" bandwidth from healthy dimensions to compensate for a localized failure in another (as visualized in Figure 9, page 7) is an elegant solution. It transforms the problem from merely circumventing a failure to actively mitigating its performance impact using the network's inherent multi-dimensional resources. This is a powerful new perspective on fault-tolerant collective design.
Principled, Contention-Free Design: The authors' choice to decompose multi-hop transfers into a sequence of orchestrated single-hop, store-and-forward steps is a classic and robust approach to designing collective algorithms. By doing so, they guarantee contention-free communication, which sidesteps the complex and often unpredictable congestion issues that can plague hardware-based, multi-hop routing schemes, even sophisticated ones. This principled design is a key reason for their strong performance.
-
Strong and Contextually-Aware Evaluation: The evaluation is comprehensive and compelling. The authors don't just compare against a weak baseline; they go head-to-head with Google's dimension-order routing (DOR) and wild-first routing (WFR), which are state-of-the-art industrial solutions for the exact same problem on the same hardware topology (Section 5.4, page 10). Demonstrating a 1.57x-1.61x speedup over this baseline is a very strong result. Furthermore, the analysis of real model performance (Section 5.5, page 11) and non-uniform communication patterns (Section 5.6, page 12) grounds the work in practical, real-world scenarios.
Weaknesses
While the paper is strong, there are areas where its context and limitations could be further explored. My points are less about flaws and more about understanding the work's boundaries.
-
Practical Overheads of Store-and-Forward: The store-and-forward approach, while eliminating network contention, can introduce its own overheads, such as per-hop latency, memory buffer pressure, and CPU/local controller costs for managing the fine-grained steps. The paper's analytical model (Table 1, page 5) simplifies this, and the real-machine experiments (Figure 19, page 12) hint at this complexity, showing non-trivial "Startup Time" and "Interruption" costs. While the net result is still a win, a deeper discussion of these practical trade-offs would strengthen the paper (see the cost sketch after this list).
-
Specialization to Torus Networks: The proposed solutions are exquisitely tailored to the properties of a torus network (specifically, its orthogonal dimensions and wrap-around links). This specialization is a strength, as it allows for high performance. However, it also means the direct contributions are not applicable to other popular large-scale topologies, such as fat-trees or Clos networks, which are common in GPU-based clusters. This is not a flaw, but a boundary condition worth acknowledging more explicitly when discussing the work's impact.
-
Complexity of MATE Scheduling: The implementation of MATE requires an offline analysis to allocate data volumes across the healthy "acceleration planes" (Section 4.2, page 7-8). For a single, known link failure, this seems tractable. However, in scenarios with multiple or dynamic failures, this scheduling problem could become significantly more complex. The paper touches on multiple faults (Section 5.7, page 12), but the scalability and complexity of the scheduler itself are important practical considerations that could be discussed further.
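On the store-and-forward overheads noted above, a standard latency/bandwidth cost sketch (the reviewer's notation, not the paper's Table 1 model) makes the trade-off explicit: with S single-hop steps, per-step startup cost α, chunk size m, and link bandwidth β,

```latex
\[
  T_{\text{store-and-forward}} \;\approx\; S\left(\alpha + \frac{m}{\beta}\right)
  \;=\; S\alpha + \frac{S\,m}{\beta}.
\]
```

The Sα term is pure orchestration overhead: it grows with hop count and with finer-grained chunking, which is consistent with the non-trivial "Startup Time" visible in the real-machine breakdown of Figure 19.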
Questions to Address In Rebuttal
-
Your real-machine performance breakdown (Figure 19b, page 12) shows that MATE has a noticeably higher "Startup Time" and overall "Interruption" time compared to the baseline. Could you elaborate on the source of this overhead? Does it stem from the increased complexity of the multi-path scheduling logic in the PyTorch Distributed backend, and do you see avenues for optimizing this control-plane aspect of your proposal?
-
The MATE scheduling strategy relies on an offline calculation to partition the workload. How does the complexity of this calculation scale with the number of network dimensions and the number/pattern of link failures? Is there a point where the scheduling becomes computationally prohibitive or where finding an optimal partition is an NP-hard problem?
-
The field has seen a growing interest in automated synthesis of collective algorithms (e.g., SCCL, TACOS, as mentioned in Section 6.1). How do you see your manually designed, principled algorithms co-existing with this trend? Could the core insight of MATE—borrowing inter-dimensional bandwidth for fault tolerance—be used as a heuristic or a core principle to guide a future automated synthesizer for resilient collectives?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents a suite of algorithms and scheduling strategies for optimizing All-to-All collective communication on torus networks, considering both fault-free and fault-tolerant scenarios. For the fault-free case, the authors propose the HalfRing algorithm for single-dimension communication and DimRotation for multi-dimensional scheduling. For scenarios with link failures, they introduce the FoldedRing algorithm and the MATE scheduling strategy, which leverages links from other dimensions to accelerate communication on a faulty ring.
While the paper presents a comprehensive and well-engineered system, the core novelty of its constituent components varies significantly. The proposed fault-tolerant scheduling strategy, MATE, appears to be a genuinely novel contribution to collective communication design. However, the remaining components—HalfRing, DimRotation, and FoldedRing—are largely adaptations or specific implementations of well-established principles in parallel algorithms and network routing. The primary contribution of this work thus lies in the clever MATE concept and the integration of these techniques into a cohesive, contention-free framework for All-to-All.
Strengths
- Novel Fault-Tolerant Scheduling (MATE): The most significant and novel contribution of this paper is the MATE scheduling strategy (Section 4.2, page 7). The idea of dynamically constructing parallel communication paths for nodes on a faulty ring by "borrowing" links from orthogonal dimensions is a clever departure from simple packet-level rerouting (like Google's WFR [78]). This rescheduling of a collective's sub-problem onto entirely different physical resources is a new and powerful concept for algorithmic fault tolerance.
- Comprehensive System Design: The authors have successfully integrated their proposed techniques into a complete, end-to-end, contention-free solution for All-to-All on a torus. The combination of intra-dimension algorithms and inter-dimension scheduling is systematically handled for both fault-free and faulty cases.
- Strong Baseline Comparison: The evaluation against Google's DOR and WFR routing schemes on TPUv4-like configurations provides a strong and relevant point of comparison, lending credibility to the performance results.
Weaknesses
The central weakness of this paper is the limited novelty of several of its core algorithmic claims. While presented as new proposals, they represent specific instances of long-standing concepts.
-
HalfRing Lacks Algorithmic Novelty: The HalfRing algorithm (Section 3.1, page 5) is described as leveraging bidirectional links for shortest-path communication. This is the fundamental principle behind any optimal All-to-All algorithm on a ring. The idea of splitting the message for the diametrically opposite node is a textbook method for achieving optimal bandwidth utilization and has been implicitly or explicitly part of optimal ring algorithms for decades. For example, the work of Lam et al. [44] on optimal personalized communication on tori already established the theoretical performance limits that such an approach would achieve. The contribution here is an implementation, not a new algorithm.
-
DimRotation is an Incremental Scheduling Pattern: The DimRotation scheduling scheme (Section 3.2, page 6) is an elegant way to avoid pipeline bubbles. However, the core concept of staggering or rotating the order of operations across parallel units to improve resource utilization is a well-known technique in parallel scheduling. While it is an improvement over the simple pipeline shown in Figure 7a, it is not a fundamentally new scheduling paradigm. The novelty is limited to the specific cyclic assignment of starting dimensions to chunks.
-
FoldedRing is a Reapplication of a Known Concept: The FoldedRing algorithm (Section 4.1, page 7) repairs a broken link by using the reverse channel to form a longer, logical ring. This concept is functionally identical to fault-tolerance strategies seen in other contexts. For instance, Google's "AltRing" algorithm for All-Reduce [42] employs a similar strategy of rerouting to maintain a logical ring in the presence of node failures on a 2D mesh. The idea of "folding" a path back on itself to bypass a fault is a classic routing technique. The authors' contribution is applying this known method to their specific store-and-forward All-to-All implementation, which does not constitute a novel algorithmic discovery.
Questions to Address In Rebuttal
The authors should use the rebuttal to clarify the precise "delta" between their proposals and existing art, and to defend the significance of their most novel contribution.
-
On HalfRing and FoldedRing: The HalfRing algorithm appears to be an implementation of the theoretically optimal strategy for ring All-to-All. Similarly, FoldedRing seems analogous to existing fault-tolerant ring repair mechanisms. Could the authors please clarify what is conceptually novel about these two algorithms beyond their application within the paper's specific store-and-forward scheduling framework?
-
On the Novelty of MATE: The MATE scheduler is the most compelling contribution. However, its efficacy depends on the availability of otherwise idle links in orthogonal dimensions. How would MATE's performance and implementation complexity be affected in a scenario where multiple collective operations are overlapped, potentially creating contention for the "borrowed" links?
-
On Complexity vs. Benefit: The proposed methods rely on fine-grained, software-driven, store-and-forward scheduling, which is inherently more complex to orchestrate than the hardware-based, dimension-order routing (DOR) used as a baseline. For the fault-free case, the combined HalfRing + DimRotation approach yields a ~1.57x speedup over DOR on a single TPUv4 pod (Section 5.4, page 11). Is this performance gain substantial enough to justify the significant increase in software complexity and scheduling overhead compared to simpler, hardware-managed routing?
Titan-I: An Open-Source, High Performance RISC-V Vector Core
Abstract
Vector processing has evolved from early systems like the CDC STAR-100 and Cray-1 to modern ISAs like ARM’s Scalable Vector Extension (SVE) and RISC-V Vector (RVV) extensions. However, scaling vector processing for contemporary workloads presents ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present Titan-I (T1), a parameterizable, out-of-order (OoO), lane-based RISC-V vector core generator. The paper introduces several microarchitectural techniques aimed at improving scalability and performance, including a floor-planning solver, a dedicated permutation unit, and a shadow cache for mask registers. The evaluation section presents performance comparisons against high-end GPUs (NVIDIA GA102/GB202) and contemporary CPU vector architectures (HiSilicon KP920, SpacemiT X60).
While the ambition of the project is noted, the paper’s central claims of superior performance are predicated on a narrow and highly favorable set of benchmarks. Several key architectural decisions appear to trade correctness and generality for performance in specific scenarios, and the quantitative analysis of the design's own scaling properties is questionable. The comparisons to other architectures, particularly GPUs, are fundamentally flawed, raising serious doubts about the validity of the conclusions drawn.
Strengths
- Open-Source Contribution: The commitment to providing an open-source RTL generator is a significant contribution that enables community verification and research.
- Area Efficiency: The reported area efficiency against the HiSilicon KP920 core (Section 6.2.1, page 12) is impressive, assuming the area measurement methodology for the baseline is sound. Achieving comparable performance in 19% of the area is a noteworthy engineering result.
- Focus on Scalability: The paper correctly identifies critical bottlenecks in scaling vector architectures (permutation, masking, scheduling) and makes a concerted effort to address them, even if the proposed solutions have weaknesses.
Weaknesses
-
Fundamentally Flawed GPU Comparison: The central claim of outperforming NVIDIA GPUs is based on an inappropriate comparison. The authors benchmark T1 against a single Streaming Multiprocessor (SM) on two integer-only cryptographic workloads (NTT, MMM) (Section 6.1, page 11). These workloads are known to be pathologically ill-suited for the SIMT execution model of GPUs and are perfectly suited for a wide-vector architecture. This is a clear case of cherry-picking benchmarks to maximize the perceived advantage. The comparison completely ignores workloads where GPUs excel, such as dense floating-point linear algebra or highly divergent codes. The claim of "outperforming" a GPU is meaningless without a balanced and representative benchmark suite.
-
Questionable Area Scaling Analysis: The area scaling results presented in Figure 4 (page 6) are highly suspect. Specifically, Figure 4c suggests that increasing the LaneScale (i.e., making individual lane datapaths wider) leads to a decrease in the total area of T1. This is counter-intuitive and physically implausible. While the authors attribute this to sharing control logic, such a dramatic reduction in total area (nearly 40% when moving from LaneScale=1 to 4) is an extraordinary claim that requires a much more detailed explanation and justification than is provided; a back-of-envelope amortization model is sketched after this list. It suggests a potential flaw in the area model or an omission of critical information.
Unsafe Architectural Shortcuts for Exception Handling: The mechanism for handling long-latency indexed memory operations relies on a "chicken bit" in a CSR to suppress access-fault exceptions (Section 4.2.1, page 7 and Section 4.2.6, page 9). This is not a robust architectural solution; it is a hardware hack that offloads the burden of ensuring memory safety entirely to software. For a core intended for high-performance, general-purpose workloads, this is a critical design flaw. It renders the core unsuitable for environments where precise exceptions are required for memory management, debugging, or security.
-
Misleading Comparison to In-Order Cores: The performance comparison against the SpacemiT X60 (Section 6.2.3, page 12) results in a claimed 8.05x speedup. However, the X60 is a simpler, in-order core. It is neither surprising nor particularly insightful that a complex OoO architecture significantly outperforms an in-order one. This comparison serves more to inflate T1's performance numbers than to provide a meaningful benchmark against a peer competitor.
-
Insufficient Detail on Novel Contributions: The "coarse-grained floor-planning solver" (Section 4.1.1, page 6) is presented as a key innovation. However, the paper provides no details on the heuristic algorithm itself. Without this information, it is impossible to assess its novelty or effectiveness beyond the single, potentially cherry-picked example in Figure 5. It is unclear how this differs from standard P&R scripting. Similarly, the "shadow mask" cache (Section 4.1.2, page 6) is described, but the mechanism for ensuring correctness and handling coherence with pending writes to v0 is glossed over, despite this being a potential serialization point.
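On the LaneScale area question above, the only arithmetically plausible route to a shrinking total area is amortization of per-lane fixed costs. The following back-of-envelope model uses the reviewer's own notation and assumptions (total datapath width W held fixed, LaneScale s, lane count L = W/s, per-slice datapath area a_dp, per-lane fixed control area a_ctrl), not the authors' area data:

```latex
\[
  A_{\text{total}}(s) \;=\; \frac{W}{s}\bigl(s\,a_{\text{dp}} + a_{\text{ctrl}}\bigr)
  \;=\; W\,a_{\text{dp}} + \frac{W}{s}\,a_{\text{ctrl}}.
\]
```

This expression decreases in s only through the control term: a nearly 40% drop in total area would require a_ctrl to be more than half of the lane area at s = 1, and would additionally require that intra-lane crossbars not grow super-linearly with s. Both of these are exactly what the rebuttal needs to substantiate with a component-level breakdown.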
Questions to Address In Rebuttal
-
On GPU Benchmarking: Can the authors justify their claim of GPU superiority by providing performance comparisons on workloads where GPUs are traditionally strong, such as FP32/FP16 SGEMM, stencil computations, or graph analytics? If not, the claims regarding GPU performance should be significantly moderated to reflect the narrow, integer-only context.
-
On Area Scaling: Please provide a detailed breakdown and justification for the claim in Figure 4c that total core area decreases as LaneScale increases. What specific components are shrinking, and why does this effect overwhelm the expected area increase from wider datapaths and crossbars within the lane?
On Exception Handling: Please defend the use of a "chicken bit" for indexed memory operations. How does this design support robust, general-purpose software that relies on precise exceptions for virtual memory paging, memory protection, or runtime error handling?
-
On the Floorplan Solver: Please provide sufficient detail on the heuristic algorithm used in your floorplan solver to allow the reader to understand its novelty and distinguish it from a trivial script that drives a standard placement tool. What are the constraints and the objective function it optimizes?
-
On the X60 Comparison: Please clarify if the SpacemiT X60 core used for comparison is an in-order design. If so, please justify why this is presented as a meaningful comparison for an OoO architecture, rather than an expected outcome.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents Titan-I, an open-source, highly parameterized generator for an out-of-order (OoO) RISC-V vector (RVV) processor. The work's core contribution is a suite of microarchitectural innovations designed to holistically address the long-standing challenge of scaling both data-level parallelism (DLP), via wide vector datapaths and long vector lengths (VLEN), and instruction-level parallelism (ILP), via fine-grained, OoO execution.
The authors identify that traditional superscalar OoO techniques do not scale well to the massive state of wide vector registers, while traditional vector machines often sacrifice ILP. Titan-I tackles this gap with several key techniques: a coarse-grained floor-planning solver to manage routing complexity in wide designs, a datapath-wide permutation unit to efficiently handle shuffle-heavy workloads, a shadow cache for mask registers to reduce broadcast traffic, and a novel "issue-as-commit" mechanism to decouple the scalar and vector pipelines. Central to its ILP capabilities is a fine-grained chaining microarchitecture that operates at the sub-register (lane) level.
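To make the chaining-granularity point concrete, here is a toy timing model (the reviewer's illustration with hypothetical names such as first_issue_cycle; it is not T1's scoreboard): register-granular chaining forces a dependent operation to wait for the producer's last element, while element-granular chaining lets it start as soon as its first operand element is written back.

```python
def first_issue_cycle(producer_writeback, granularity, vlen):
    """Toy model: earliest cycle a dependent vector op may begin consuming.

    producer_writeback(i) -> cycle at which the producer writes element i.
    'register' waits for the whole destination register; 'element' chains at
    element granularity. Illustrative only, not the paper's mechanism.
    """
    if granularity == "register":
        return max(producer_writeback(i) for i in range(vlen))
    return producer_writeback(0)  # element-granular: start with element 0

# Example: producer retires 4 elements per cycle starting at cycle 10, VLEN=64.
wb = lambda i: 10 + i // 4
print(first_issue_cycle(wb, "register", 64))  # 25 -> 15 cycles of lost overlap
print(first_issue_cycle(wb, "element", 64))   # 10 -> chaining begins immediately
```

A DLEN-granular scheme, such as the one the paper attributes to Saturn, would land between these two extremes.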
The authors validate their architecture with strong empirical results, including two ASIC tape-outs. Their evaluations show Titan-I outperforming high-end NVIDIA GPUs on specific cryptographic kernels and demonstrating competitive performance-per-area against a high-performance ARM SVE core (HiSilicon KP920) in HPC workloads. The work is presented not as a single point design, but as a flexible generator, positioning it as a significant contribution to the open-source hardware ecosystem.
Strengths
-
Addresses a Fundamental Architectural Trade-off: The paper targets the difficult, yet crucial, intersection of CPU and GPU design philosophies. The quest to unify high single-thread performance (ILP) with massive data throughput (DLP) is a central theme in modern computer architecture. Titan-I offers a compelling and well-reasoned "vector-first" approach to this problem, standing as a modern successor to the philosophy of classic vector supercomputers like the Cray-1.
-
Holistic, System-Level Design: The strength of this work lies not in a single trick, but in the co-design of multiple solutions to solve the wider problem of scalability. The authors correctly identify that simply widening a datapath creates cascading problems in routing, control logic, and data movement. Their solutions—the floorplanner for physical layout (Section 4.1.1, page 6), the dedicated permutation unit for data shuffling (Section 4.1.3, page 7), and the mask register cache (Section 4.1.2, page 6) for predication—demonstrate a deep understanding of both the logical and physical barriers to scaling.
-
Credible and Impressive Empirical Validation: The authors provide a robust evaluation against relevant and powerful commercial counterparts. Comparing against NVIDIA GPUs for crypto and a flagship ARM server CPU for HPC is ambitious and gives the results significant weight. The fact that the project has yielded two physical tape-outs (Section 5.3, page 10) lends enormous credibility to the claimed performance and area results, moving it beyond a purely academic simulation study.
-
Contribution as an Open-Source Generator: Perhaps the most significant aspect of this work is that its deliverable is a highly parameterized generator. This elevates its potential impact substantially. Instead of a single, static design, the authors provide a framework that can be adapted to different application domains and PPA (Power, Performance, Area) targets, from edge devices to data centers. This is a powerful enabler for the RISC-V ecosystem and the broader hardware community.
Weaknesses
While the microarchitectural contributions are excellent, the paper could be strengthened by providing more context on its place within a complete system, particularly regarding software and memory.
-
The Software Abstraction Challenge: The paper rightly celebrates its hardware's performance, but this performance is unlocked via hand-tuned assembly or a custom MLIR-based toolchain (Section 5.1, page 10). A significant advantage of competitors like NVIDIA is the maturity and accessibility of the CUDA ecosystem. For Titan-I to have broader impact, the path from high-level code (e.g., C++, Python) to high-performance vectorized execution needs to be more thoroughly explored. The current presentation leaves the impression that achieving these results requires expert-level programming effort, which could limit its adoption.
-
The Scalar Core as a Potential Bottleneck: The architecture is presented as a vector co-processor that relies on a scalar core for control flow, address generation, and non-vector instructions. The "issue-as-commit" policy (Section 4.2.1, page 7) is a clever way to decouple the pipelines, but the performance of many real-world vector applications (e.g., sparse matrix operations) is often limited by the efficiency of the scalar code that prepares the data for the vector unit. The paper provides little detail on the assumed capabilities of the scalar core or the potential for it to become an "Amdahl's Law" bottleneck.
-
System-Level Memory Integration: The paper demonstrates impressive memory latency tolerance (Section 6.2.2, page 12) and discusses a Memory Management Unit (MMU) as future work (Section 7, page 13). However, in a real system, the interaction with virtual memory—handling page faults, TLB misses, and maintaining coherence with other cores in an SoC—is a first-order design constraint, not just a feature to be added later. A deeper discussion of how the massive, long-latency memory accesses would be managed within a virtual memory system would contextualize the design's practicality for general-purpose computing.
Questions to Address In Rebuttal
-
Could the authors elaborate on the maturity and usability of their MLIR-based software toolchain? For the HPC benchmarks, how much of the performance was achieved through fully automatic auto-vectorization versus manual intervention or the use of compiler intrinsics?
-
The "issue-as-commit" mechanism effectively decouples the scalar and vector units. However, what are the key performance considerations for the scalar core that pairs with Titan-I? Are there specific workload characteristics (e.g., complex address generation, frequent loop-carried dependencies) where the scalar front-end is likely to become the primary performance limiter?
-
The paper's discussion on memory latency tolerance is a highlight. Could the authors comment on how the design's philosophy of handling long-latency operations would be extended to manage system-level complexities like page faults? Would a page fault on one element of a VLEN=16384 operation require stalling the entire vector instruction for its duration, and what would be the performance implications?
-
Section 7 mentions the possibility of adopting a Multiple Streaming Processor (MSP) approach, akin to the Cray X-MP, to improve TLP. Given the generator's flexibility, have the authors explored configurations that might partition a wide Titan-I core into several narrower, independent vector contexts? This seems like a natural architectural evolution to bridge the gap between this powerful single-thread vector model and the many-thread GPU model.
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present Titan-I (T1), a generator for an open-source, out-of-order (OoO), lane-based RISC-V Vector (RVV) core. The central thesis is that a combination of novel microarchitectural techniques can overcome the traditional scaling challenges of vector processors, enabling simultaneous scaling of both Data-Level Parallelism (DLP) and Instruction-Level Parallelism (ILP). The proposed contributions are a collection of solutions targeting specific bottlenecks: a coarse-grained floor-planning solver and a dedicated permutation unit for DLP scaling, and a fine-grained chaining mechanism, a scalar-vector decoupling policy ("issue-as-commit"), and specialized memory scheduling for ILP. The authors provide an open-source RTL generator and validate their design with two ASIC implementations and extensive performance comparisons.
My review focuses exclusively on the novelty of the proposed ideas relative to the vast body of prior work in vector and parallel architectures.
Strengths
The primary strength of this work lies not in a single, revolutionary concept, but in the clever synthesis and specific implementation of multiple techniques, some of which represent a genuine delta over prior art.
-
Fine-Grained, Configurable Chaining: The concept of chaining dates back to the Cray-1. However, its application in modern, highly-laned architectures presents new challenges. The closest academic prior art cited, Berkeley's Saturn [51], implements chaining at a coarse "DLEN-granular" level. T1's proposal for configurable, fine-grained chaining that can operate down to the element (ELEN) level (Section 4.2.2, page 8) is a significant and novel advancement. It directly addresses the need for maximizing pipeline utilization in the presence of wide, partitioned datapaths, a key limitation of prior academic designs like Ara [9, 35], which lacks chaining entirely.
-
Physical Design-Aware Microarchitecture: The introduction of a "coarse-grained floor-planning solver" (Section 4.1.1, page 6) is a noteworthy contribution. While floorplanning is a standard part of physical design, explicitly incorporating a heuristic solver into the architectural design flow of a generator to minimize cross-lane routing latency is novel. It acknowledges that at the scale the authors are targeting, physical realities can no longer be an afterthought for the microarchitect. This is a commendable step towards a true hardware-software-physical co-design methodology.
-
Specific Bottleneck Alleviation: The "shadow-cache for mask registers (v0)" (Section 4.1.2, page 6) is an elegant solution to a well-known and painful bottleneck in lane-based RVV designs. While caching is a fundamental computer science concept, the creation of a specialized, dedicated cache within the permutation unit to solve the v0 broadcast problem is a specific, novel, and practical microarchitectural innovation.
Weaknesses
While the paper contains novel elements, several of the core ideas are clever engineering applications of well-established architectural concepts. The paper would be stronger if it more precisely delineated its contributions from this foundational work instead of presenting them as entirely new pillars.
-
Derivative ILP Concepts: Several techniques presented to enhance ILP are modern implementations of classic ideas.
- "Issue-as-commit" (Section 4.2.1, page 7): This is a form of scalar-vector decoupling. The idea that a scalar core can run ahead of a long-latency vector unit as long as there are no dependencies is conceptually similar to Decoupled Access/Execute (DAE) architectures and the function of scoreboards in early machines like the CDC 6600. The contribution here is a specific, low-overhead scoreboard implementation for RVV, not a fundamentally new execution model. The novelty is in the implementation, not the concept.
- "Memory Interleaving" and "Memory Delay Slot" (Sections 4.2.5 and 4.2.6, page 9): Overlapping load and store operations and scheduling independent instructions to hide memory latency are canonical compiler and architecture techniques. The paper presents a robust implementation for RVV (e.g., the Conflict Region Table), but the underlying principles are not new.
-
Conflation of Contributions: The paper aggregates a large number of techniques. This makes it difficult to assess the novelty and benefit of each individual contribution. The impressive final results are a product of the entire system, but the value of, for instance, the floor-planning solver is not isolated from the value of the fine-grained chaining. An ablation study would be necessary to truly weigh the merit of each proposed idea against its complexity. The performance gains are substantial, but it's unclear if they come from one or two key breakthroughs or the aggregation of many marginal improvements.
Questions to Address In Rebuttal
-
Clarification on Chaining Novelty: The claim of "fine-grained chaining" is compelling. Could the authors please elaborate on the delta between their linked-list scoreboard approach (Section 4.2.4, page 8) and other forms of fine-grained dependency tracking in prior SIMD or vector architectures, academic or industrial? Is the novelty in the data structure, the configurability, or both?
-
Comparison to Classic Decoupling: Please contrast the "issue-as-commit" mechanism with classic scoreboard-based designs and more formal Decoupled Access/Execute architectures. What is the precise, novel contribution beyond applying a known decoupling principle to the specific scalar-vector interface of RVV?
-
Quantifying Individual Contributions: The paper presents a complex system with multiple interacting optimizations. To better assess the significance of each novel idea, can the authors provide any data (even from simulation) that isolates the performance benefit of the key contributions? For example, what is the performance of T1 with fine-grained chaining disabled (i.e., DLEN-granular like Saturn)? What is the performance impact of using a naive, grid-like floorplan instead of the solver's output?
-
Generality of the Floor-Planner: The floor-planning solver is an interesting design-time contribution. Is the solver's heuristic specifically tuned for the permutation patterns of RVV, or does it represent a more generalizable approach for optimizing communication in tiled accelerator architectures?
Recommendation: Accept with minor revisions.
The paper presents a powerful and well-engineered vector core. While some of its ILP-enhancing concepts are derived from established principles, the work contains several genuinely novel and significant contributions, particularly in its approach to fine-grained chaining and the integration of physical design constraints into the microarchitecture generator. The rebuttal should focus on more precisely situating their work in historical context and, if possible, providing data to deconvolve the benefits of their many contributions.
SHADOW: Simultaneous Multi-Threading Architecture with Asymmetric Threads
Abstract
Many important applications exhibit shifting demands between instruction-level parallelism (ILP) and thread-level parallelism (TLP) due to irregular sparsity and unpredictable memory access patterns. Conventional CPUs optimize for one but fail to balance ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present SHADOW, an asymmetric Simultaneous Multi-Threading (SMT) architecture that concurrently executes out-of-order (OoO) and in-order (InO) threads on a single core. The stated goal is to dynamically balance Instruction-Level Parallelism (ILP) and Thread-Level Parallelism (TLP) to better suit workloads with irregular memory access patterns, such as sparse matrix multiplication. The core mechanism involves partitioning core resources (e.g., the register file) between a small number of heavyweight OoO threads and a larger number of lightweight InO threads, with a standard software work-stealing mechanism distributing tasks. While the paper identifies a valid and well-known problem, the proposed solution's evaluation, claims of efficiency, and novelty are not rigorously substantiated, and critical design aspects such as security are inadequately addressed.
Strengths
- Problem Formulation: The paper correctly identifies the performance challenges of sparse, memory-bound workloads on conventional CPU architectures and the inherent tension between deep ILP extraction and wide TLP scaling.
- Core Concept: The high-level idea of combining OoO and InO execution contexts within a single core to adapt to dynamic workload phases is conceptually plausible.
- Detailed Case Study: The analysis of Sparse Matrix Multiplication (SpMM) across a range of sparsities (Section 5.3, pages 10-11) provides the paper's most compelling, albeit narrow, evidence for the architecture's potential benefits under specific memory-constrained conditions.
Weaknesses
-
Unsubstantiated and Implausible Overhead Claims: The central claim of a "just 1% area and power overhead" (Abstract; Section 5.4, page 12) is based on a "modified McPAT" model. This is insufficient evidence. The paper neither details the modifications made to the tool nor justifies them. The architecture introduces non-trivial hardware: per-thread InO FIFO queues, a multi-ported scoreboard for InO dependency checking, and additional multiplexing/demultiplexing logic in the fetch, decode, and issue stages (Table 4, page 11). To claim these structures collectively contribute only 1% overhead without a detailed, verifiable analysis is not credible.
-
Grossly Inadequate Security Analysis: The security implications of this new SMT design are dismissed in a single, hand-waving paragraph (Section 3.11, page 7). In the current landscape of microarchitectural attacks, proposing a novel resource-sharing scheme without a thorough analysis of its vulnerability to contention-based side- and covert-channels is a critical omission. Stating that comprehensive protection is "beyond this paper's scope" is unacceptable. The design directly creates new shared resources (e.g., InO issue logic, scoreboards) that could be exploited.
-
Flawed and Unconvincing Competitive Analysis: The comparison to FIFOShelf is not based on a faithful implementation but on a "roof-lined" model whose parameters seem arbitrary (Section 5.3.1, page 10; Figure 13, page 10). The authors claim this provides an "optimistic upper bound," but the justification for the chosen parameters (e.g., "a doubled ROB dedicated to the OoO path") is absent. Without a principled or direct comparison, the claims of SHADOW's superiority over related speculative-steering approaches are unsupported.
-
Overstatement of Novelty in Work Distribution: The paper presents the "dynamic work distribution mechanism" (Section 3.9, page 7) as a key feature. However, Algorithm 1 is a textbook implementation of a work-stealing queue using a shared counter protected by a mutex (a minimal sketch of this pattern follows this list). The "emergent" property where faster threads take more work is a fundamental, long-understood characteristic of this pattern, not a novel co-design. The contribution is merely the application of a standard software library pattern on their hardware, not a novel hardware-software mechanism.
-
Contradictory Performance Rationale: Table 3 (page 10) claims the best configuration for Backprop (1 OoO + 3 InO) provides a 3.16x speedup by alleviating "RF, ROB and cache contention." While InO threads do not consume ROB entries or OoO PRF entries, the baseline is a single OoO core. The critical comparison should be against symmetric SMT configurations (e.g., 2-OoO, 3-OoO). Figure 4 shows that 3-OoO performs worse than 1-OoO+4-InO, but the text needs to prove more rigorously that the resource savings from using InO threads outweigh the performance loss from their simpler pipelines, especially compared to a well-provisioned symmetric 2-OoO SMT core, which is the industry standard.
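For reference, the shared-counter distribution pattern characterized above as a standard library idiom is roughly the following. This is the reviewer's minimal sketch (chunk size, task count, and names are placeholders), not the authors' Algorithm 1:

```python
import threading

TOTAL_TASKS = 10_000
CHUNK = 64

counter = 0
lock = threading.Lock()

def worker(process_task):
    """Each thread repeatedly claims the next CHUNK of task ids.

    Faster threads (e.g. a high-IPC OoO context) return to the counter more
    often and therefore process more chunks; this is the long-understood
    'emergent' balancing property, not a hardware mechanism.
    """
    global counter
    while True:
        with lock:                       # claim [start, end) atomically
            start = counter
            end = min(start + CHUNK, TOTAL_TASKS)
            counter = end
        if start >= TOTAL_TASKS:
            return
        for task_id in range(start, end):
            process_task(task_id)

# Usage: spawn one worker per hardware thread context.
threads = [threading.Thread(target=worker, args=(lambda t: None,))
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```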
Questions to Address In Rebuttal
- Overhead: Please provide a detailed breakdown of the modifications made to McPAT. Justify the area and power models used for the new structures (InO FIFOs, scoreboards, multiplexers). How can these additions be credibly contained within a 1% total core overhead budget?
- Security: Given that SMT security is a first-order design concern, provide a detailed analysis of potential new contention channels introduced by SHADOW. How would the shared fetch policy, InO/OoO RS arbitration, and the InO scoreboard be secured against malicious threads?
- Comparison: Justify the specific parameters chosen for the FIFOShelf roofline model. Why is this configuration a fair and representative upper bound for a state-of-the-art speculative instruction steering architecture? A more principled comparison is required.
- Work Stealing: Please clarify the novelty of the work distribution mechanism. Is there any hardware support for this beyond providing distinct thread contexts, or is the contribution entirely the use of a standard software Pthreads library?
- Critical Path: Section 3.10 (page 7) cites MorphCore's 2.5% frequency impact and claims a similar effect for SHADOW due to multiplexers. Can you provide a more rigorous analysis of the critical path impact? Specifically, how does the logic for selecting between ready OoO and InO instructions affect the issue stage's timing?
OS/Runtime Interaction: The shdw_cfg instruction (Section 3.3, page 5) implies OS modification and a reconfiguration process. What is the latency of this context switch and reconfiguration? How does this latency impact workloads with frequent phase changes or fine-grained multitasking?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents SHADOW, a novel asymmetric Simultaneous Multi-Threading (SMT) architecture. The core contribution is the ability to execute both traditional high-ILP out-of-order (OoO) threads and lightweight high-TLP in-order (InO) threads concurrently within a single processor core. By dynamically balancing the workload between these asymmetric thread types via a software work-stealing mechanism, SHADOW aims to adapt to applications with shifting parallelism characteristics, particularly memory-bound and sparse workloads that suffer from underutilized resources on conventional core designs. The authors demonstrate significant performance gains (up to 3.16x, with a 1.33x average) over a baseline OoO core with minimal (1%) area and power overhead.
Strengths
The true strength of this work lies in its elegant synthesis of several established architectural concepts into a coherent and compelling new design point.
-
A Novel and Elegant Approach to ILP-TLP Balancing: The problem of balancing Instruction-Level Parallelism (ILP) and Thread-Level Parallelism (TLP) is a classic challenge in computer architecture. Historically, designs have polarized towards one extreme (e.g., wide OoO cores for ILP) or the other (e.g., Sun's Niagara for TLP). More recent approaches have explored heterogeneity at the core level (e.g., ARM's big.LITTLE) or temporal mode-switching within a core (e.g., MorphCore). SHADOW’s contribution of enabling simultaneous, intra-core heterogeneity is a genuinely new perspective. It avoids the OS scheduling complexity of inter-core migration and the potential stalls of mode-switching by allowing both execution styles to coexist and dynamically share resources. This is a very clever architectural solution.
-
Excellent Positioning within the Design Space: The paper does a good job of situating SHADOW relative to its closest relatives. The distinction from MorphCore (simultaneous execution vs. mode-switching) and from speculative steering approaches like FIFOShelf (partitioning at the thread-level vs. the instruction-level) is clearly articulated in Section 2.2 (page 3). This highlights the pragmatism of the SHADOW design, which avoids the complexities of speculative recovery across different execution paths. It finds a sweet spot in the design space that was previously unoccupied.
-
Pragmatic Hardware-Software Co-Design: The architecture does not rely on exotic hardware or a new ISA. Instead, it makes modest, well-motivated modifications to a conventional OoO pipeline. Critically, it leverages a well-understood software work-stealing paradigm (Algorithm 1, page 8) for load balancing. This emergent balancing, where faster (high-IPC OoO) threads naturally claim more work, is an efficient and decentralized control mechanism. It connects the hardware architecture to decades of research in parallel runtime systems (e.g., Cilk, TBB, OpenMP) and makes the design more readily programmable.
-
Strong Problem Motivation: The choice to focus on sparse and memory-bound workloads is timely and important. These workloads are prevalent in critical domains like machine learning, graph analytics, and scientific computing, and they are notoriously difficult for conventional architectures to accelerate efficiently. By demonstrating significant speedups on benchmarks like SpMM, Backprop, and APSP, the paper makes a strong case for its real-world relevance.
Weaknesses
While the core idea is strong, the evaluation and discussion are focused primarily on the microarchitecture, leaving its broader system-level implications less explored.
-
Limited System-Level Scope: The current model is constrained to either multiple single-threaded applications or a single multi-threaded application (as stated in Section 3.3, page 5). This is a significant limitation for a general-purpose processor operating in a modern multi-tasking OS. The paper does not adequately explore the challenges of resource partitioning, scheduling, and ensuring fairness or QoS if multiple SHADOW-aware applications were to run concurrently. This is the biggest barrier between the current concept and a deployable system.
-
Under-explored Cache and Prefetcher Dynamics: The paper reports that cache-sensitive kernels can slow down due to contention (Section 5.1, page 9, Figure 12). This is a crucial interaction. The lightweight InO threads, while not adding ROB pressure, will still aggressively issue memory requests. How do their access patterns interact with the (potentially more regular) access patterns of the OoO thread? It seems likely that the mix of memory streams could confuse conventional stride or stream prefetchers, potentially degrading performance for both thread types. A deeper analysis of this cross-thread resource contention, especially on the memory subsystem, would strengthen the paper.
-
Security Implications Acknowledged but Not Addressed: Section 3.11 (page 7) correctly identifies that resource sharing in SMT creates security vulnerabilities. While it notes that SHADOW's InO lanes reduce some attack surfaces by eliminating speculation, it does not explore the new, potentially subtle channels that arise from the interaction between OoO and InO threads sharing the L1 cache, execution units, or other structures. Given the intense focus on SMT security post-Spectre, this aspect warrants more than a brief mention and a pointer to prior work.
Questions to Address In Rebuttal
The authors have presented a compelling architectural idea. To better understand its potential, I would appreciate their thoughts on the following:
-
OS Scheduling Interaction: Beyond the "delegate thread" for configuration, how do you envision a modern OS scheduler (like the Linux CFS) interacting with a SHADOW core? Would the OS need to be aware of the OoO/InO asymmetry to make intelligent placement decisions, for example, by prioritizing latency-sensitive threads for the OoO slots?
-
Hardware-Software Interface for Work Stealing: The current work-stealing mechanism is purely software-based. Could the hardware provide performance counters or hints to the runtime—for instance, an indicator of the OoO thread's ROB/LSQ occupancy or recent IPC—to help the software make more informed decisions about work chunk size or stealing strategy?
-
Applicability to Other Domains: While the paper focuses on sparse workloads, the core idea of balancing ILP and TLP seems broadly applicable. Could you comment on how SHADOW might perform on other types of workloads, such as those with producer-consumer patterns, where one thread might be compute-bound (ideal for OoO) while another is I/O-bound (potentially benefiting from a lightweight InO thread)?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The central novel claim of this paper is the design of an asymmetric Simultaneous Multi-Threading (SMT) core that concurrently executes both full-fledged out-of-order (OoO) software threads and lightweight in-order (InO) software threads. The proposed architecture, SHADOW, partitions front-end resources (e.g., renaming logic, reorder buffer) to cater to these two distinct thread types while allowing them to share back-end execution units. The stated goal is to create a single core that can dynamically and efficiently balance Instruction-Level Parallelism (ILP), extracted by the OoO thread(s), and Thread-Level Parallelism (TLP), provided by the numerous InO threads. This adaptability is presented as a solution for workloads, such as sparse matrix multiplication, that exhibit shifting parallelism characteristics.
My assessment, based on an extensive review of prior art in computer architecture, is that the core architectural concept—the simultaneous co-existence and execution of distinct OoO and InO software threads within a single SMT core—is a novel contribution. While constituent ideas (SMT, heterogeneous architectures, lightweight threads) are well-established, their synthesis into this specific microarchitecture represents a new and unexplored design point.
Strengths
-
Clear Novel Contribution in SMT Design: The paper successfully differentiates its core idea from the closest prior art. The key differentiator is simultaneity and thread-level granularity.
- Unlike MorphCore [56], which performs whole-core mode-switching between OoO and InO execution, SHADOW allows both modes to operate concurrently. This is a fundamental architectural distinction that enables the handling of workloads with mixed-parallelism phases without incurring mode-switch penalties.
- Unlike FIFOShelf [53] and FIFOrder [6], which speculatively steer instructions from a single thread down different paths, SHADOW partitions entire software threads non-speculatively. This is a much coarser-grained approach that avoids the significant hardware complexity of cross-path dependency tracking, speculative wakeup, and misprediction recovery, making the design arguably more practical.
-
Elegant Synthesis of Existing Concepts: The novelty here is not in the invention of a brand-new mechanism from scratch, but in the clever synthesis of established principles. The idea of adding lightweight, non-speculative execution capabilities to a complex OoO core is a direct and logical approach to tackling the underutilization of back-end resources during memory stalls. The proposed implementation, which leverages per-thread FIFO queues for InO instructions to bypass the complex rename/ROB pipeline (Figure 5, page 4), is a clean and low-overhead design.
-
Low-Complexity Architectural Delta: The authors claim the modifications result in only a 1% area and power overhead (Section 5.4, page 12). From a novelty standpoint, this is crucial. The proposal does not require a radical redesign of the core; rather, it augments an existing OoO pipeline with simple, parallel structures. This demonstrates that a significant gain in adaptability can be achieved with a minimal and non-disruptive change, which enhances the value of the new idea.
Weaknesses
-
The Software Mechanism Lacks Novelty: While the hardware architecture is novel, the mechanism for workload distribution—software-based work stealing (Algorithm 1, page 8)—is a standard, widely-used technique in parallel programming (e.g., Intel TBB, Cilk). The paper should be more explicit that the novelty lies exclusively in the hardware that makes this standard software pattern highly effective, rather than in the pattern itself. The dynamic adaptation is an emergent property of existing software running on new hardware, not a feature of a new co-designed algorithm.
-
Under-explored Connection to Intra-Core Heterogeneity: The paper correctly distinguishes its approach from inter-core heterogeneous systems like ARM's big.LITTLE. However, the novelty could be framed more powerfully by situating it within the broader landscape of "intra-core heterogeneity." The introduction focuses on the ILP-TLP trade-off but could benefit from a clearer articulation of how SHADOW presents a new path to achieving heterogeneity inside the core boundary, contrasting it more sharply with prior academic concepts like Core Fusion [28] which composed cores, rather than decomposing thread types.
-
Static Nature of the Asymmetry: The novelty is the flexible mix of OoO and InO threads. However, the mechanism to configure this mix, the shdw_cfg instruction (Section 3.3, page 5), appears to be a one-time setup at the start of an application or context switch. This feels like a missed opportunity. A truly dynamic architecture might allow for the promotion/demotion of threads between OoO and InO modes during execution, which would represent an even greater delta over the prior art. As presented, the "dynamic" balancing is in the work distribution, not in the hardware's configuration itself post-spawn.
Questions to Address In Rebuttal
-
Dynamic Reconfiguration: The shdw_cfg instruction (Section 3.3, page 5) appears to set the core's asymmetric configuration for an application's lifetime or until the next context switch. Is there any fundamental architectural barrier to reconfiguring the OoO/InO mix dynamically within a single application run without a full OS context switch? For example, could a user-level runtime library trigger a reconfiguration if it detects a persistent phase change in the application? This speaks to the true dynamism and novelty of the design.
Comparison to Fine-Grained Heterogeneity: Could the authors further elaborate on the trade-offs between their coarse-grained, thread-level asymmetry and the finer-grained, instruction-level asymmetry of FIFOShelf/FIFOrder? While SHADOW is clearly simpler, are there classes of workloads with very rapidly changing ILP characteristics where an instruction-level approach, despite its complexity, might prove superior? A deeper analysis would better solidify the novelty and contribution of the specific design point SHADOW occupies.
-
Novelty Beyond a Single OoO Thread: The paper's most compelling results often come from a 1 OoO + N InO configuration. How does the novelty and benefit of the architecture hold up when scaling to multiple OoO threads (e.g., 2 OoO + 2 InO)? My concern is that contention between two complex OoO threads on shared resources (ROB partitions, LSQ, register file) could negate the benefits of the InO threads, making the novel concept less effective beyond the single "main thread" use case.
ATR: Out-of-Order Register Release Exploiting Atomic Regions
Abstract
Modern superscalar processors require large physical register files to support a high number of in-flight instructions, which is crucial for achieving higher ILP and IPC. Conventional register renaming techniques release physical registers conservatively, ...
Reviews
Review 1
Review Form:
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose a register renaming technique, "ATR," that aims to reduce physical register file pressure by releasing registers early. The central concept is the "atomic commit region," defined as a sequence of instructions containing no branches or potential exceptions. The authors claim this allows for safe, out-of-order register release without the need for checkpointing or complex recovery mechanisms associated with fully speculative techniques. They present analysis suggesting a significant portion of registers (17% in SPECint) are allocated within these regions and show IPC improvements, particularly for small register file configurations.
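To make the claimed mechanism concrete, the following toy model sketches the rename-stage scan as this reviewer understands it (the function early_release_candidates, its field names, and the region-boundary test are assumptions for illustration, not the paper's hardware): an old physical mapping becomes an early-release candidate only if both its allocation and its redefinition fall inside the same branch-free, exception-free region.

```python
def early_release_candidates(window):
    """window: renamed instructions in program order; each a dict with
    'is_branch', 'may_except', 'dst_arch' (or None), 'new_preg'.

    Returns physical registers that are candidates for "atomic" early release:
    the old mapping was created inside the current region and is redefined in
    that same region, with no branch or potentially excepting instruction in
    between. Toy software model only, not the proposed hardware.
    """
    candidates = []
    defined_in_region = {}  # arch reg -> preg allocated within the current region
    for inst in window:
        if inst["is_branch"] or inst["may_except"]:
            defined_in_region.clear()  # region boundary: restart eligibility tracking
            continue
        dst = inst["dst_arch"]
        if dst is None:
            continue
        if dst in defined_in_region:
            # The earlier preg can be freed once its last consumer executes,
            # without waiting for commit: a flush of the region discards both
            # the old and the new mapping together.
            candidates.append(defined_in_region[dst])
        defined_in_region[dst] = inst["new_preg"]
    return candidates

# Example: r1 is redefined with no branch/exception in between, so p5 is a
# candidate; the potentially excepting op ends the region before r2's reuse.
window = [
    {"is_branch": False, "may_except": False, "dst_arch": "r1", "new_preg": "p5"},
    {"is_branch": False, "may_except": False, "dst_arch": "r2", "new_preg": "p6"},
    {"is_branch": False, "may_except": False, "dst_arch": "r1", "new_preg": "p7"},
    {"is_branch": False, "may_except": True,  "dst_arch": None, "new_preg": None},
    {"is_branch": False, "may_except": False, "dst_arch": "r2", "new_preg": "p8"},
]
print(early_release_candidates(window))  # ['p5']
```

Note that how may_except is defined, and in particular whether memory instructions count as potentially excepting at rename, directly determines how often such a scan finds candidates; this is exactly the concern raised in the first weakness below.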
Strengths
-
Problem Motivation: The paper does an excellent job of motivating the problem. The introductory discussion and Figure 1 clearly and effectively illustrate that physical register file pressure is a first-order performance limiter in modern wide-issue superscalar processors. The analysis is sound and sets a clear context for the work.
-
Opportunity Analysis: The lifecycle analysis of a physical register in Section 3.1 is well-structured and provides a useful taxonomy ("In-use," "Unused," "Verified-unused"). Figure 4, which quantifies the time spent in each state, effectively frames the performance opportunity that early release schemes, including the one proposed, aim to capture.
-
Conceptual Framework: The core idea of identifying an "atomic commit region" as a basis for a safer form of early release is a logical middle ground between overly conservative commit-order release and aggressive, complex speculative release. Conceptually, it presents a potentially valuable point in the design space.
Weaknesses
My primary concerns with this work center on the fundamental safety claims, the practical implications of the design, and the significance of the results when placed in context.
-
The Definition of "Atomic" is Fundamentally Unsafe Regarding Exceptions: The paper's central claim of safety hinges on identifying "atomic commit regions" at the rename stage. These regions must exclude "exception-causing instructions" (Abstract, page 1). However, the paper fails to address how it can possibly know, at rename, whether a memory instruction (e.g., a load or store) will cause an exception like a page fault. A memory access is always potentially faulting until it has been executed and its address translated by the memory subsystem. By classifying all memory instructions as potentially exception-causing and thus breaking atomic regions, the length of these regions would become trivial, likely consisting of only a few ALU instructions. Conversely, if the authors are not considering memory instructions as exception-causing at rename, their technique is no longer safe or non-speculative with respect to exceptions, directly contradicting their claims (e.g., "our approach is safe providing precise exceptions," Section 1, page 2). This appears to be a critical, unaddressed flaw in the core premise.
-
Impractical Interrupt Handling: The proposed handling for interrupts in Section 4.1 (page 5) is concerningly simplistic and heavyweight. The authors suggest either draining the ROB (introducing potentially unbounded latency) or flushing the entire pipeline and re-executing. While they argue this "does not violate correctness," it sidesteps the immense performance implications. A high-priority interrupt in a real-time system or even a standard OS timer tick could trigger a catastrophic performance loss with the flush-based approach. The paper provides zero evaluation of the performance overhead of this mechanism, which is a major omission for a technique intended for high-performance processors.
-
Optimistic Hardware Implementation and Timing: The hardware described in Section 4.2.2 (page 7) for bulk-marking ptags as "no-early-release" appears complex for the rename stage. The logic must read all current architectural-to-physical mappings from the SRT upon renaming any branch or exception-causing instruction. The authors propose pipelining this critical path, but the analysis in Section 5.5 that dismisses the impact of a 2-cycle delay is unconvincing. It relies on averages from Figure 14, which can easily hide worst-case scenarios where short-lived registers are redefined within the pipeline delay, completely negating the benefit of ATR for those instances. The claim that this complex logic can be pipelined to run at 4GHz+ after synthesis at 2.6GHz seems optimistic.
-
Marginal Gains Over a Stronger Baseline: While the speedups for a 64-entry register file are large, this is an artificially constrained configuration for the simulated Golden Cove-like core, which in reality has a 512-entry ROB and would be paired with a much larger register file. In the more realistic 224-entry configuration (Figure 10), the proposed "atomic" scheme provides only a 1.48% speedup. More importantly, when combined with a non-speculative early release (nonspec-ER) scheme, the additional benefit of ATR is a mere 0.37% for SPECint. This suggests that a well-implemented conventional early release scheme already captures most of the available benefits, making the significant added complexity of ATR difficult to justify for such a negligible incremental improvement.
Questions to Address In Rebuttal
-
Please clarify your precise criteria for identifying an instruction as "non-exception-causing" at the rename stage. Specifically, how are memory access instructions handled? If they are treated as potentially causing exceptions (as they should be), please provide data on the resulting (likely much smaller) size of atomic regions and the impact on your reported 17%/13% opportunity in Figure 6. If they are not, please justify how your mechanism can be considered "safe" with respect to precise exceptions.
-
The proposed interrupt handling mechanisms (ROB drain or flush) introduce significant, unevaluated performance penalties. Please quantify the performance impact of your chosen mechanism under a realistic interrupt workload (e.g., a 1ms timer tick) and justify its viability in a general-purpose processor.
-
The performance impact of the N-stage pipeline delay for the bulk-marking logic was dismissed as "negligible" based on average register lifetimes. What percentage of atomic early-release opportunities are lost specifically due to this 2-cycle delay? How does this impact registers with lifetimes shorter than this delay?
-
For the realistic 224-entry register file configuration, the "combined" scheme shows only a 0.37% improvement over the "nonspec-ER" baseline. Given the added hardware complexity for atomic region detection, consumer counting, and the intricate flush-recovery logic (Section 4.2.4), how do you justify the value proposition of ATR for such a marginal gain over an existing technique?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces "ATR" (ATomic register Release), a novel technique for improving the efficiency of physical register file (PRF) utilization in modern out-of-order processors. The core problem it addresses is the conservative nature of conventional register release, where a physical register is only freed after a subsequent instruction that redefines the same architectural register commits. Existing "early release" solutions are often either speculatively unsafe (requiring complex recovery mechanisms like shadow register files) or non-speculatively safe but still overly conservative (requiring the redefining instruction to be past all unresolved branches and potential exceptions).
The key insight of ATR is to identify and exploit "atomic commit regions"—sequences of instructions guaranteed to contain no conditional branches or exception-causing instructions. The authors posit that within such a region, a physical register can be safely released out-of-order as soon as its last consumer has executed and it has been redefined, even if older, unresolved branches exist outside the region. This safety is guaranteed because any misprediction of an older branch would flush the entire atomic region, making the early release moot, while a correct prediction ensures the region will eventually commit, validating the release. The authors propose a low-overhead hardware mechanism to identify these regions and track consumer counts, demonstrating significant IPC improvements on resource-constrained PRFs or, alternatively, substantial reductions in required PRF size for equivalent performance.
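A minimal sketch of the release condition as summarized above may help the reader; the data structures and field names below are invented for illustration and are not taken from the paper.

```python
# Illustrative sketch of the ATR-style release rule; all structures are invented.
from dataclasses import dataclass

@dataclass
class PhysReg:
    pending_consumers: int = 0          # consumers renamed but not yet executed
    redefined_in_region: bool = False   # a younger def of the same arch reg in the same atomic region

def try_release(preg: PhysReg, inside_atomic_region: bool) -> bool:
    """Inside an atomic region, a register may be freed as soon as its last consumer
    has executed and it has been redefined, even if an older branch outside the
    region is still unresolved (a misprediction would flush the whole region)."""
    return inside_atomic_region and preg.pending_consumers == 0 and preg.redefined_in_region

r = PhysReg(pending_consumers=1, redefined_in_region=True)
print(try_release(r, inside_atomic_region=True))   # False: a consumer is still in flight
r.pending_consumers = 0
print(try_release(r, inside_atomic_region=True))   # True: safe to return to the free list
```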
Strengths
-
Elegant Core Concept: The central idea of using "atomic commit regions" as a basis for safe, out-of-order release is both insightful and elegant. It carves out a well-defined space between the aggressive-but-complex speculative approaches (e.g., checkpointing) and the safe-but-conservative non-speculative ones. It's a clever microarchitectural observation that directly translates into a practical optimization.
-
Strong Contextualization and Motivation: The paper does an excellent job positioning itself within the broader landscape of PRF management techniques. The background (Section 2) and related work (Section 6) sections clearly delineate how ATR differs from and improves upon prior art. The motivation provided in the introduction, supported by Figure 1, effectively communicates the criticality and timeliness of the problem.
-
Pragmatic and Plausible Implementation: The proposed hardware mechanism (Section 4.2) appears practical and low-cost. Augmenting the physical register table with a small consumer counter and adding logic to the rename stage to detect the boundaries of atomic regions is a far lighter-weight solution than implementing a full shadow register file or complex checkpoint-and-restore logic. The overhead analysis in Section 4.4, including the synthesis results, further strengthens the claim of practicality.
-
Demonstrated Orthogonality: A key strength of this work is the demonstration that ATR is not a mutually exclusive alternative but a complementary technique. The evaluation in Figure 10, which shows that combining ATR with a traditional non-speculative early release scheme yields the best results, highlights its value. This suggests that ATR could be integrated into existing high-performance cores as another tool in the architect's toolbox for managing register pressure.
Weaknesses
-
Limited Scope of "Atomic Regions": The definition of an atomic region is necessarily strict to ensure safety (no branches, no loads/stores, etc.). While the analysis in Figure 6 shows a respectable opportunity (13-17% of registers), it also implicitly highlights that over 80% of registers are outside the scope of this specific optimization. The paper could be strengthened by a discussion on the potential for, or challenges of, relaxing this definition. For instance, could branches with extremely high confidence or memory operations with predictable latency (e.g., L1 hits) be incorporated into a "quasi-atomic" region concept? This feels like a natural and important direction for future work that is worth acknowledging.
-
Brief Treatment of Interrupts: The handling of precise interrupts is addressed in Section 4.1 by suggesting either draining the ROB or a full flush after the active atomic regions are resolved. While functionally correct, this could introduce significant and unpredictable interrupt latency. In many domains (e.g., real-time systems, high-frequency trading), this latency is a critical design parameter. A more in-depth analysis of the performance implications of this design choice, perhaps quantifying the frequency of interrupts and the resulting pipeline drain cycles, would make the proposal more robust.
-
Missed Connection to Trace-Based Mechanisms: The concept of identifying and optimizing branch-free regions of code is reminiscent of work on trace caches and other trace-based processors (e.g., Rotenberg et al. [28]). While the goal is different (instruction supply vs. register release), the underlying principle of leveraging linear code sequences is similar. A brief discussion situating ATR in relation to these concepts could provide a richer context, exploring potential synergies or design trade-offs if both mechanisms were present in a core.
Questions to Address In Rebuttal
-
The core contribution hinges on the definition of an atomic region. Could the authors elaborate on the sensitivity of their results to this definition? For instance, what is the impact on the opportunity space (the percentage of "atomic registers") if load instructions that are guaranteed L1D hits are allowed within a region? Is there a path toward a more flexible, dynamic definition of atomicity?
-
Regarding interrupt handling (Section 4.1), the proposal to drain the pipeline could be a performance concern. Could you provide data on the frequency of interrupts in the SPEC workloads and estimate the average number of stall cycles this mechanism might introduce per interrupt event? How does this compare to the baseline architecture's interrupt handling latency?
-
The combined scheme (ATR + non-speculative ER) shows the most promise. This suggests a hybrid approach is optimal. Do the authors envision a scenario where the processor could dynamically choose which release policy to apply based on runtime characteristics, or is a static, combined implementation the most logical endpoint for this line of research?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper introduces "ATR: Out-of-Order Register Release Exploiting Atomic Regions," a technique aimed at alleviating physical register file pressure. The central claim is that by identifying "atomic commit regions"—sequences of instructions guaranteed to contain no branches or exception-causing instructions—it is possible to safely release a physical register out-of-order, even before the redefining instruction has been pre-committed. This carves out a new design point between aggressive but unsafe speculative early release schemes (which require complex checkpointing and recovery) and safe non-speculative schemes (which are conservative, waiting for all prior branches to resolve). The authors implement this idea and show modest IPC speedups for constrained register files (5.13% for 64-entry) and a more significant reduction in register file size (27.1%) for a minimal performance loss.
Strengths
The primary strength of this paper lies in its core conceptual contribution. My analysis confirms that the central idea is, in fact, novel.
-
Novel Condition for Safe Release: The prior art in safe, non-speculative early release, such as Monreal et al. [19], hinges on the redefining instruction becoming pre-committed (i.e., all older control flow instructions have resolved). This is a global property of the instruction stream. ATR replaces this global condition with a local one: whether the producer, all its consumers, and the redefiner exist within a dynamically identified atomic region. This allows release to occur even in the presence of an older, unresolved, long-latency branch, something existing safe techniques cannot do. This "local atomicity" as a sufficient condition for safe out-of-order release is a new insight.
-
Clear Delimitation from Prior Art: The paper does an admirable job of positioning its contribution relative to the vast body of work on register management. It correctly identifies the unsafe nature of purely speculative techniques (e.g., Moudgill [22], Ergin [6]) and the conservative nature of existing non-speculative techniques [19]. ATR cleverly exploits a property of instruction sequences to achieve safety without the conservatism of waiting for pre-commit.
-
Thorough Analysis of the Opportunity: The analysis in Section 3, particularly Figure 6 (page 5), is commendable. Quantifying the fraction of registers that fall within non-branch, non-exception, and fully atomic regions provides a solid theoretical foundation for the work and demonstrates that the opportunity ATR targets is non-trivial.
Weaknesses
While the core idea is novel, its practical realization and significance are open to critique.
-
Marginal Impact for Realistic Configurations: The novelty's impact diminishes as the machine configuration scales. A 1.48% speedup for a 224-entry register file is a very marginal gain. While the authors pivot to a register-file-size reduction argument (Figure 15, page 11), this reframes the contribution from a performance enhancement to a cost-saving measure. The novelty is not in question, but its ability to drive significant performance in modern, well-provisioned cores is. A truly groundbreaking idea should ideally provide more substantial benefits.
-
Non-Trivial Implementation Complexity: The proposed mechanism for identifying atomic regions and managing register state is not simple. The "bulk no-early-release" logic described in Section 4.2.2 (page 7) requires setting the state of numerous ptags in parallel whenever a branch or exception-causing instruction is renamed. The authors themselves note this may require pipelining, which introduces latency into the redefinition signal—the very signal that enables early release. While their sensitivity study suggests a 2-cycle delay has minimal impact, this adds a non-trivial piece of timing-sensitive logic to the already critical rename stage. Furthermore, the "Double-Free Avoidance" mechanism (Section 4.2.4, page 7) adds state (two bits per architectural register) and complex logic to the flush recovery path. The novelty comes at the cost of tangible complexity.
-
Restrictive Definition of "Atomic Region": The definition of an atomic region is extremely strict: no conditional branches, no indirect jumps, and no exception-causing instructions. The exclusion of exception-causing instructions effectively bars all memory operations (loads/stores) from these regions. This severely limits the length and prevalence of qualifying atomic regions, capping the potential of the proposed technique. The novelty is confined to very specific, short instruction sequences.
Questions to Address In Rebuttal
-
Significance of the Contribution: Given the modest IPC gains (1.48%) on the 224-entry register file configuration, which is closer to modern core designs, what is the most compelling argument for a processor architect to adopt the added hardware complexity of ATR over existing, simpler non-speculative schemes? Is the primary benefit area/power reduction rather than performance?
-
Boundaries of the "Atomic Region" Concept: The definition of an atomic region seems overly restrictive by excluding all loads and stores due to the possibility of page faults. Have the authors considered relaxing this definition? For instance, could loads that are provably non-faulting (e.g., stack-relative accesses within the mapped stack frame) be permitted within an atomic region? Such a relaxation would significantly increase the applicability of ATR and strengthen the novelty of the contribution.
-
Scalability of the Invalidation Logic: The bulk invalidation logic (Figure 9, page 7) must check and potentially set the no-early-release status for all ptags referenced in the SRT when a branch is renamed. For an 8-wide x86 machine, this is a substantial number of ptags. Could the authors elaborate on the scalability and timing implications of this logic for machines wider than the 6-wide Golden Cove core modeled? Does the fan-in/fan-out of this logic create a potential timing bottleneck in the rename stage for future, wider designs?
Vegapunk: Accurate and Fast Decoding for Quantum LDPC Codes with Online Hierarchical Algorithm and Sparse Accelerator
Abstract
Quantum Low-Density Parity-Check (qLDPC) codes are a promising class of quantum error-correcting codes that exhibit constant-rate encoding and high error thresholds, thereby facilitating scalable fault-tolerant quantum computation. However, real-time ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present "Vegapunk," a software-hardware co-design framework for decoding quantum Low-Density Parity-Check (qLDPC) codes. The core proposal involves an offline, SMT-based strategy to transform a given check matrix into a partially block-diagonal form, purportedly to mitigate quantum degeneracy. This is followed by an online, hierarchical greedy algorithm, implemented on a custom FPGA accelerator, to perform the decoding under a real-time (<1µs) latency constraint. The authors claim that this approach achieves decoding accuracy on par with the state-of-the-art BP+OSD decoder while meeting the strict latency requirements for fault-tolerant quantum computing.
Strengths
- The work addresses the critical and well-understood trade-off between accuracy and latency in qLDPC decoding, a central challenge for the field.
- The software-hardware co-design approach is a logical direction for tackling this problem, correctly identifying that algorithmic improvements must be paired with dedicated hardware to meet real-time constraints.
- The use of an offline pre-computation step to simplify the online decoding problem is a sound principle.
Weaknesses
My primary concerns with this submission relate to the feasibility of the offline step, the significant limitations of the online greedy heuristic, and the overstatement of the empirical results.
-
Computational Feasibility of SMT-based Decoupling: The paper's proposal hinges on an offline SMT optimization to find suitable transformation (T) and permutation (P) matrices (Section 4.2, page 5). The authors claim this "only needs to be performed once per check matrix" (page 6) but provide no data on the computational cost of this step. The search space for these matrices is combinatorially vast, and solving such an SMT problem is likely NP-hard. It is entirely plausible that this offline step is computationally intractable for the very codes for which this decoder is intended. Without reporting the wall-clock time required by the Z3 solver for the benchmarked codes (especially [[784,24,24]]), the entire method's practicality is unsubstantiated.
-
Fundamentally Limited Greedy Heuristic: The online hierarchical algorithm is a greedy search that is hard-capped at a maximum iteration count of M=3 (Section 6.4, page 12). This explicitly limits the search for the "right error" (r) to a maximum Hamming weight of 3. This is a severe and arbitrary constraint. While low-weight errors are most probable, error correction codes are defined by their ability to correct errors up to a certain weight ((d-1)/2). By design, this decoder will fail on any valid correctable error pattern that requires a right-part error of weight 4 or more. Claiming accuracy "on par with BP+OSD" (a far more exhaustive search method) is therefore a significant overstatement. The justification provided in Figure 13 (page 13) merely shows diminishing returns for their specific heuristic; it does not prove that higher-weight errors are irrelevant.
-
Unsupported Complexity Claims vs. Implementation: The complexity analysis (Section 4.4, page 8) claims logarithmic scaling with sufficient parallelism, suggesting P > n*K parallel processing units. For the [[784,24,24]] code, this implies P > 3920 * 14 ≈ 55,000 units. It is highly doubtful that the FPGA implementation reported in Table 4 (page 12) instantiates this many parallel cores. The paper fails to state how many parallel HDUs were actually implemented, creating a disconnect between the theoretical scaling argument and the physical reality of the accelerator that produced the results. The reported latency is likely for a configuration with far less parallelism, rendering the O(log n + S) complexity claim moot (a quick arithmetic check of the implied unit count appears after this list).
-
Inconsistent Evaluation and Misleading Comparisons:
- Accuracy: The LER plots in Figure 10 (page 10) show that Vegapunk is frequently outperformed by the BP+OSD-CS(7) baseline, especially at lower physical error rates which are the target regime for fault tolerance (e.g., plots a1, a3). The claim of being "on par" is not supported by the authors' own data; in several key instances, there is a clear accuracy penalty.
- Noise Models: The evaluation employs a circuit-level noise model for BB codes but a less realistic phenomenological noise model for HP codes (Section 6.1, page 10). This inconsistency undermines the generality of the conclusions and prevents a fair comparison of performance across the different code families. A rigorous evaluation requires a consistent and realistic noise model for all tested codes.
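For completeness, the arithmetic behind the parallelism figure quoted in the third weakness above is easy to reproduce; the factors 3920 and 14 are the n and K values implied by the paper as read in this review, and the actual HDU count per device remains unreported.

```python
# Reproducing the arithmetic behind the P > n*K claim quoted above.
# 3920 and 14 are the n and K values implied by this review's reading of the paper;
# the actual number of HDUs on the FPGA is not reported, which is the point.
n, K = 3920, 14
required_units = n * K
print(required_units)   # 54880, i.e. roughly the 55,000 parallel units quoted above
```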
Questions to Address In Rebuttal
-
Regarding the offline SMT decoupling (Section 4.2): What is the actual wall-clock time required by the Z3 solver for the largest codes presented, namely [[784,24,24]] and [[1488,30,7]]? Please provide evidence that this offline step is computationally feasible for codes relevant to fault-tolerant computing.
-
The online algorithm is limited by M=3, capping the searchable Hamming weight of the "right error" to 3. Please provide a rigorous justification for why this hard limit does not unacceptably compromise decoding accuracy. What is the fraction of correctable errors (within the code's distance) that are missed due to this constraint, and how does this affect the error floor?
The parallel complexity analysis (Section 4.4) relies on a number of parallel units (
P) that appears unrealistic for an FPGA implementation. Please clarify the exact number of parallel Hierarchical Decoding Units (HDUs) instantiated in the evaluated FPGA design for each code. How does the latency scale if the number of HDUs is fixed while the code sizenincreases? -
In several plots in Figure 10 (e.g., a1 for [[72,12,6]] and a3 for [[108,8,10]]), Vegapunk's Logical Error Rate is visibly higher than that of BP+OSD-CS(7). How do the authors reconcile this observable accuracy gap with the claim of being "on par" with the state-of-the-art?
-
Please justify the use of two different noise models (circuit-level for BB codes, phenomenological for HP codes). Would the conclusions for HP codes remain the same if a more realistic circuit-level noise model were applied?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents "Vegapunk," a comprehensive software-hardware co-design framework aimed at solving the critical accuracy-latency tradeoff in the decoding of quantum Low-Density Parity-Check (qLDPC) codes. The central challenge in this domain is that fast decoders like Belief Propagation (BP) are inaccurate due to issues like quantum degeneracy, while accurate decoders like BP with Ordered Statistics Decoding (BP+OSD) are far too slow for the real-time requirements of fault-tolerant quantum computers (typically < 1µs).
Vegapunk's core contribution is a novel two-pronged strategy:
-
Offline SMT-Optimized Decoupling: The authors reframe the problem by performing a one-time, offline transformation of the qLDPC check matrix. Using a Satisfiability Modulo Theories (SMT) solver, they find a transformation that converts the wide, unstructured check matrix into a more manageable form: a series of smaller, independent diagonal blocks and a sparse residual matrix. This pre-computation directly mitigates the quantum degeneracy issue that plagues BP, fundamentally simplifying the online decoding task.
-
Online Hierarchical Greedy Algorithm & Accelerator: The structured matrix produced by the offline step enables a much simpler online decoding algorithm. Instead of complex message passing, Vegapunk uses a hierarchical, greedy search to find the most likely error pattern. The authors then design a dedicated, sparse hardware accelerator that fully exploits the parallelism and sparsity inherent in this new algorithmic structure, allowing it to meet the stringent real-time deadline.
Experimental results demonstrate that Vegapunk achieves decoding latencies under 1µs for significant qLDPC codes (up to [[784,24,24]]) while maintaining a logical error rate comparable to the highly accurate but slow BP+OSD, effectively breaking the existing tradeoff.
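To make the decoupled structure concrete, the toy below is this reviewer's cartoon of the guess-and-solve idea, assuming the offline step has already reduced the left block to the identity (the real decomposition uses diagonal blocks and is far larger); the matrix, syndrome, and weight cap are all invented.

```python
# Toy guess-and-solve over a decoupled check matrix, assuming the offline step has
# reduced the left block to the identity. Everything here is invented for
# illustration; it is not the paper's decoder.
import itertools
import numpy as np

def toy_decode(A, s, max_weight=3):
    """Syndrome model over GF(2): s = l XOR (A @ r). Guess a low-weight right error r,
    solve the left error l directly, and keep the lowest-weight overall solution."""
    n_right = A.shape[1]
    best = None
    for w in range(max_weight + 1):
        for cols in itertools.combinations(range(n_right), w):
            r = np.zeros(n_right, dtype=np.uint8)
            r[list(cols)] = 1
            l = (s + A @ r) % 2            # left part solves in one step
            total = int(l.sum() + r.sum())
            if best is None or total < best[0]:
                best = (total, l, r)
    return best

A = np.array([[1, 0, 1],
              [0, 1, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=np.uint8)   # sparse residual block (made up)
s = np.array([1, 1, 0, 1], dtype=np.uint8)  # observed syndrome (made up)
weight, l, r = toy_decode(A, s)
print(weight, l, r)   # the lowest-weight explanation flips a single right-part bit
```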
Strengths
-
High Conceptual Significance and Impact: This work addresses what is arguably one of the most significant architectural bottlenecks for next-generation quantum computers. The field has largely acknowledged that surface codes have poor scaling overhead, and qLDPC codes are the leading contender to replace them. However, the lack of a practical, real-time decoder has been the primary barrier to their adoption. By providing a plausible and well-evaluated solution that achieves both speed and accuracy, this work could fundamentally alter the trajectory of quantum architecture research, shifting focus toward the practical implementation of these more efficient codes.
-
Novel and Elegant Problem Reframing: The most intellectually compelling aspect of this paper is the offline decoupling strategy (Section 4.2, page 5). The application of SMT solvers—a tool from the formal methods and verification community—to restructure a coding theory problem is a brilliant piece of interdisciplinary thinking. It transforms a difficult, recurring online problem into a (potentially very hard) one-time offline problem and a much simpler online one. This "pre-computation" approach is an elegant way to manage complexity and is a powerful design pattern that could inspire solutions in other areas.
-
Holistic Full-Stack Approach: The authors demonstrate a mature understanding of the problem by providing a full-stack solution. They do not merely propose a new algorithm in isolation; they consider its hardware mapping from the outset. The co-design of the hierarchical algorithm with the sparse accelerator (Section 5, page 8) is crucial. This demonstrates a deep appreciation for the fact that in the realm of quantum computer architecture, algorithms and hardware are inextricably linked. This end-to-end perspective, from abstract matrix transformation to FPGA implementation details, makes the work far more credible and impactful.
-
Strong and Well-Contextualized Empirical Evaluation: The experimental results are thorough and convincing. The authors benchmark against the correct state-of-the-art baselines (BP for speed, BP+OSD for accuracy) using relevant and challenging qLDPC code families (BB and HP codes). The demonstration of sub-microsecond latency for a large [[784,24,24]] code is a headline result that will capture the community's attention. The LER curves in Figure 10 (page 10) clearly show that they have achieved their goal of matching BP+OSD's accuracy, solidifying their central claim.
Weaknesses
While the core idea is powerful, the paper could be strengthened by addressing the broader implications and potential limitations of its approach.
-
Scalability and Cost of the Offline Step: The SMT-based decoupling is the "secret sauce," but the paper provides limited discussion on the practical cost of this step. SMT problems can have exponential complexity. While performing a difficult computation offline is acceptable, it is crucial to understand its limits. How long did the Z3 solver take for the [[784,24,24]] code? How will this scale to the even larger codes required for full-scale fault tolerance? A discussion of the computational complexity and resource requirements of the offline step would provide a more complete picture of the framework's practicality for future codes.
-
Generality of the Decoupling Strategy: The method is demonstrated successfully on BB and HP codes. The authors provide some intuition for why a block-diagonal structure can be found for these specific code families (Section 4.2, "Caveats of Check Matrix Decomposition," page 6). However, the broader applicability of this SMT formulation to any arbitrary qLDPC code remains an open question. Is there a risk that for some code families, the SMT solver might fail to find a useful, sparse decomposition? A more general discussion on the conditions under which this method is expected to succeed would be valuable.
-
Limited Intuition on the Online Heuristic: The online algorithm is a greedy search with a very small maximum iteration count (M=3). The ablation study in Figure 13 (page 13) empirically justifies this choice by showing diminishing returns. However, the paper could offer a deeper theoretical or intuitive explanation for why such a simple heuristic is so effective. Is it purely that most correctable errors have low weight, or does the SMT decoupling process structure the problem in such a way that the residual syndrome becomes "easy" to solve greedily? Connecting the success of the online algorithm more explicitly to the properties of the offline transformation would strengthen the narrative.
Questions to Address In Rebuttal
-
Could the authors please provide data on the runtime and resource usage of the offline SMT decoupling step for the largest codes tested? What are their projections for how this offline cost will scale as qLDPC codes grow in size in the future?
-
Can the authors comment on the expected generality of their SMT-based decoupling? Are there known properties of qLDPC check matrices (e.g., related to their construction) that would make them more or less amenable to this transformation?
-
The success of the simple, low-iteration greedy search is a key enabler for the framework's speed. Could the authors provide more intuition as to why this approach is sufficient? Does the offline decoupling effectively "pre-solve" the hardest parts of the problem, leaving a residual search space that is easily navigable by a greedy algorithm?
-
On a lighter note, is there a specific inspiration behind the name "Vegapunk"? It is a distinctive and memorable choice.
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper, "Vegapunk," presents a software-hardware co-design framework for decoding quantum Low-Density Parity-Check (qLDPC) codes. The authors claim to achieve both high accuracy (comparable to BP+OSD) and low, real-time latency (< 1µs), addressing a well-known trade-off in the field. The central thesis is that a computationally intensive offline pre-processing step can transform the qLDPC check matrix into a structure that enables a very simple, fast, and parallelizable online decoding algorithm.
The core of the proposed novelty lies in using Satisfiability Modulo Theories (SMT) solvers to perform this offline matrix transformation. This "decoupling" yields a matrix with a block-diagonal structure and a sparse remainder. The online decoding algorithm is a hierarchical, greedy search that first guesses the error components corresponding to the sparse part of the matrix and then directly solves for the errors related to the block-diagonal part. A dedicated accelerator is designed to be a direct hardware mapping of this online algorithm.
While the constituent concepts (greedy algorithms, matrix transformations, hardware acceleration) are not new in isolation, their synthesis, particularly the use of an SMT solver to find an optimal matrix structure for decoding, appears to be a genuinely novel approach in this domain.
Strengths
-
Novel Offline Formulation: The primary and most significant novel contribution is the offline, SMT-optimized decoupling strategy (Section 4.2, page 5). While SAT/SMT solvers have been explored in classical coding theory for tasks like finding minimum-weight codewords, their application here is fundamentally different and new. The authors are not using the SMT solver to decode, but rather to restructure the decoding problem itself. Formulating the search for optimal transformation (T) and permutation (P) matrices as a constrained optimization problem for an SMT solver (Figure 5, page 5) is a clever and, to my knowledge, unprecedented method for preparing a qLDPC code for hardware-accelerated decoding.
-
Synergistic Algorithm-Architecture Co-Design: The novelty is reinforced by the tight coupling between the offline transformation and the online algorithm. The online hierarchical greedy algorithm (Section 4.3, page 6) is specifically enabled by the structure created by the SMT solver. While greedy search is a standard heuristic, its application here—guessing the "right error" r to simplify the solution for the "left error" l—is a direct consequence of the D' = ([diag(D_i)], A) decomposition. This demonstrates a strong system-level novelty where the offline step creates a problem that the simple online step is uniquely suited to solve.
-
Complexity Shifting as an Architectural Principle: The work presents a clear and novel architectural trade-off: shifting immense computational complexity from the time-critical online path to a one-time, offline pre-computation. This is a powerful design pattern, and its application to the qLDPC decoding problem appears to be a new and insightful contribution.
Weaknesses
My concerns are not that the core idea has been done before, but rather with the characterization and evaluation of the novel components themselves.
-
Under-explored Scalability of the Core Novelty: The central claim hinges on the offline SMT step, yet the paper provides zero data on its computational cost. SMT is NP-hard. The "Caveats of Check Matrix Decomposition" section (page 6) briefly discusses the process but completely omits any discussion of the SMT solver's runtime. How long did it take to find the transformation matrices for the [[784,24,24]] BB code? Hours? Days? Weeks? Does the solver time scale polynomially or exponentially with the code size? Without this information, the practicality of the entire approach for future, much larger qLDPC codes remains a major open question. The novelty is significant, but its viability is not established.
-
Limited Generalizability of the Novel Method: The SMT formulation is evaluated only on highly structured Bivariate Bicycle (BB) and Hypergraph Product (HP) codes. As noted on page 6, the structure of these codes provides hints for the decomposition (e.g., K = max(min(l, m), l x m) for BB codes). It is unclear whether the SMT-based approach is a general tool or if its success is an artifact of the inherent algebraic structure of these specific code families. A truly novel method should demonstrate robustness on less structured or more generic qLDPC code constructions.
-
Incremental Novelty of the Online Algorithm: The online hierarchical algorithm, when viewed in isolation, is a simple greedy search. Its novelty is almost entirely derived from the pre-transformed matrix it operates on. Conceptually, the strategy of guessing a small number of error bits to resolve a syndrome is related to the post-processing in Ordered Statistics Decoding (OSD) or decimation strategies used in other decoders (e.g., BPGD). The paper should more clearly articulate the fundamental delta between its "guess-and-solve" method and these prior conceptual frameworks, beyond simply stating that it operates on a different matrix structure.
Questions to Address In Rebuttal
The authors should address the following points to solidify the significance of their novel contributions:
-
Scalability of the SMT Solver: Please provide the runtime of the Z3 SMT solver for the largest BB code ([[784,24,24]]) and the largest HP code ([[1488,30,7]]) presented in the paper. Can you provide any data or theoretical arguments regarding how this runtime scales with n (number of qubits)?
-
Generalizability: Have you attempted to apply your offline SMT decoupling strategy to other families of qLDPC codes that lack the clean algebraic structure of BB or HP codes? If the method failed or was too slow, this would be crucial information for understanding the boundaries of this novel technique.
-
Algorithmic Distinction: Please provide a more detailed comparison between the online hierarchical algorithm and decimation-based approaches like BPGD [48]. Both involve iteratively making high-confidence "guesses" to simplify the remaining problem. What is the fundamental algorithmic innovation in your online approach beyond the fact that it leverages the pre-computed matrix structure?
OneAdapt: Resource-Adaptive Compilation of Measurement-Based Quantum Computing for Photonic Hardware
Abstract
Measurement-based quantum computing (MBQC), a.k.a. one-way quantum computing (1WQC), is a universal quantum computing model, which is particularly well-suited for photonic platforms. In this model, computation is driven by measurements on an entangled ...
Reviews
Review 1
Review Form:
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose OneAdapt, a compilation framework for measurement-based quantum computing (MBQC) on photonic hardware. The work introduces a new intermediate representation (IR) that extends the prior FlexLattice IR by (1) enforcing a bound on the length of temporal edges and (2) allowing "skewed" temporal edges between nodes at nearby 2D locations on different layers. The compiler includes two main optimization passes: a "dynamic refresh" mechanism to manage node lifetime and a "2D-bounded temporal routing" algorithm to leverage the new skewed edges. The authors claim significant reductions in execution time (1D depth) compared to both a modified version of the OnePerc compiler and a more rigid cluster-state-based approach. The framework's adaptability to various hardware constraints is evaluated, along with a preliminary extension to fault-tolerant quantum computing (FTQC) using surface codes.
Strengths
- Problem Motivation: The paper correctly identifies a critical gap in existing MBQC compilers for photonic systems. The tension between the rigidity of cluster states and the potentially unbounded resource requirements of the more flexible FlexLattice IR is a well-articulated and important problem.
- IR Design: The proposed IR features—bounded-length and skewed temporal edges—are well-grounded in the physical realities and capabilities of fusion-based photonic architectures. Capturing these hardware characteristics at the IR level is a sound design choice.
- Core Algorithm Concept: The central idea of "dynamic refresh" (Section 4.3) is a logical and more granular alternative to the "periodic refresh" strategy from prior work. Performing refreshes on an as-needed basis, prioritized by computational relevance, has clear potential to avoid the overhead of dedicated refresh layers.
Weaknesses
My primary concerns with this paper relate to the justification for key claims, the strength of the experimental baselines, and the oversimplification of critical hardware effects.
-
Unsupported Claims Regarding Skewed Edge Implementation: The paper makes the strong claim that skewed edges can be implemented with "a slight algorithm modification, without requiring additional hardware capabilities" (Section 3.1, page 5) and that the associated overheads are "negligible" (Section 4.5, page 9). This is insufficiently substantiated.
- PL Ratio: The analysis in Section 5.5 (Figure 12) assesses the Physical-to-Logical (PL) ratio overhead by randomly selecting skewed edges. This is not representative of a real compilation scenario where structured algorithms create deterministic and potentially conflicting routing patterns (i.e., routing hotspots). The conclusion that the overhead is negligible is therefore unconvincing.
- Fidelity: The authors acknowledge that skewed edges may require longer fusion paths, which degrades fidelity. They then hand-wave this concern away by arguing that the reduction in 1D depth reduces the total edge count, thus "improving the overall fidelity" (Section 4.5, page 9). This is a qualitative argument without any quantitative backing. A paper focused on hardware-adaptive compilation must include a concrete fidelity model to properly evaluate such trade-offs. The absence of one is a major flaw.
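For illustration, a minimal version of the kind of fidelity model being requested might look like the sketch below; the per-fusion error rate and the edge and fusion counts are placeholders chosen by this reviewer, not values from the paper.

```python
# Minimal depolarizing-style fidelity sketch of the trade-off highlighted above:
# skewed edges may need more fusions per logical edge, but reduce the total edge count.
# All numbers are placeholders; none come from the paper.
p_fusion = 1e-4                      # assumed error probability per physical fusion

def logical_fidelity(num_edges, fusions_per_edge, p=p_fusion):
    """Crude model: each fusion succeeds independently with fidelity (1 - p)."""
    return (1.0 - p) ** (num_edges * fusions_per_edge)

baseline = logical_fidelity(num_edges=3000, fusions_per_edge=4)   # straight paths only
with_skew = logical_fidelity(num_edges=1000, fusions_per_edge=6)  # longer paths, fewer edges
print(f"baseline: {baseline:.4f}, with skewed edges: {with_skew:.4f}")
```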
-
Potentially Weak or Unfair Baselines: The impressive performance gains reported (e.g., 3.48x in Table 1) are questionable due to the construction of the baselines.
- Baseline 1 (OnePerc): The authors modify OnePerc by forcing it to perform periodic refreshes and restricting its scheduling (Section 5.1, page 9) to make it "more comparable." This appears to be a post-hoc modification that forces the baseline into a regime it was not designed for, potentially crippling its performance. The data in Table 1 (page 10) shows OnePerc's compiled temporal edge length (Df (compiled)) far exceeds the target Df = 20, while OneAdapt meets it. This does not demonstrate that OneAdapt is 3.48x faster; it demonstrates that OneAdapt can satisfy a constraint that the authors' modified version of OnePerc cannot. This is a much weaker claim.
- Baseline 2 (Qiskit to Cluster State): Compiling a circuit to a rigid 3D cluster state is a known-to-be-inefficient method that flexible compilers like OnePerc and OneAdapt are explicitly designed to outperform. While useful as a sanity check, the large 6.7x improvement over this baseline is expected and does not constitute a strong result on its own.
-
Over-reliance on Empirically Tuned Heuristics: The compiler's performance hinges on several heuristics. The refresh prioritization scheme (Section 4.3, page 7) is reasonable but not deeply analyzed for failure modes. More critically, the "Refresh Percentage Tuning" (Equation 1, page 8) relies on a parameter p, which is empirically set to 0.4 based on the data in Table 3 (page 12). This suggests the system is sensitive and has been tuned for the specific benchmarks tested, raising questions about its generalizability.
-
Superficial FTQC Analysis: The extension to FTQC (Section 5.6, page 12) is underdeveloped. The baseline is a "static strategy that interleaves QEC patches uniformly." This appears to be a strawman. The field of FTQC compilation has more sophisticated resource management schemes. Without comparing against a stronger, state-of-the-art dynamic scheduling baseline, the claimed 3.33x improvement is not credible.
Questions to Address In Rebuttal
-
On Skewed Edges: Please provide a quantitative analysis of the physical overheads of skewed edges.
- a) Re-evaluate the PL ratio using the specific, deterministic routing patterns generated for the benchmark circuits in your evaluation, not random sampling.
- b) Introduce a simple, explicit fidelity model (e.g., constant depolarizing error per fusion) and show how the final logical fidelity is affected by the trade-off between longer paths for skewed edges and a lower total 1D depth.
-
On Baselines: Please provide a stronger justification for your choice and modification of the OnePerc baseline. Is it not possible to configure OnePerc in a different manner to more effectively handle temporal edge constraints, even if not explicitly bounded? A fair comparison requires demonstrating that you are comparing against the baseline operating in a reasonable, if not optimal, configuration.
-
On Heuristics: With respect to the refresh percentage bound p, how sensitive is the compiler's performance to this parameter? Please provide data showing how the performance improvements hold up across a wider range of p values and justify why p=0.4 is a robust choice and not simply an artifact of overfitting to the selected benchmarks.
-
On FTQC: Please provide citations and justification that your "static strategy" baseline for FTQC scheduling is representative of a state-of-the-art approach. If it is not, please explain why a more advanced dynamic scheduling algorithm was not used for comparison.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces OneAdapt, a resource-adaptive compiler for Measurement-Based Quantum Computing (MBQC) specifically tailored for photonic hardware. The work identifies critical limitations in prior Intermediate Representations (IRs), such as the FlexLattice IR, which lack adaptivity to realistic hardware constraints. The core contribution is a novel, more hardware-aware IR and two associated optimization passes designed to bridge this gap.
The new IR extends FlexLattice in two significant ways: 1. It enforces a bound on the length of temporal edges, directly modeling the physical constraint of finite photon delay lines. 2. It introduces skewed temporal edges, which connect nodes at nearby but different 2D spatial locations across time steps, exploiting a latent capability of fusion-based architectures.
To manage and leverage this new IR, the authors propose two key compiler optimizations: 1. Dynamic Refreshing: An intelligent, on-demand node refresh mechanism that prevents temporal edge lengths from exceeding the hardware-imposed limit, in contrast to less efficient periodic refresh schemes. 2. 2D-Bounded Temporal Routing: A routing algorithm that utilizes the new skewed edges to achieve more efficient mappings, reducing both the required 2D hardware area and the 1D temporal depth of the computation.
The paper evaluates OneAdapt against both the state-of-the-art photonic compiler (OnePerc) and a circuit-model-based compiler (Qiskit), demonstrating significant improvements in execution depth while respecting hardware constraints. The framework is also extended to the fault-tolerant setting, showing substantial depth reduction for surface code implementations.
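As a point of reference, the dynamic-refresh idea summarized above can be caricatured in a few lines; the data structures, the Df bound, and the refresh margin below are illustrative choices by this reviewer, not the OneAdapt implementation.

```python
# Cartoon of on-demand refreshing: a node parked in a delay line must be refreshed
# before its temporal edge length would exceed the hardware bound Df. All names and
# the margin policy are illustrative; this is not the OneAdapt implementation.
DF_BOUND = 20          # maximum temporal edge length supported by the delay lines
MARGIN = 2             # refresh a little before the hard limit is reached

def nodes_to_refresh(live_nodes, current_layer, budget):
    """live_nodes: dict node_id -> layer where the node was last emitted/refreshed.
    Returns up to `budget` node ids, most at-risk first."""
    urgent = [(current_layer - born, nid)
              for nid, born in live_nodes.items()
              if current_layer - born >= DF_BOUND - MARGIN]
    urgent.sort(reverse=True)                  # oldest (most urgent) nodes first
    return [nid for _, nid in urgent[:budget]]

live = {"q0": 3, "q1": 15, "q2": 20}
print(nodes_to_refresh(live, current_layer=21, budget=2))   # ['q0']: age 18 is within the margin of Df
```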
Strengths
The primary strength of this paper is its thoughtful and pragmatic approach to co-designing a quantum compiler IR with the realities of a promising hardware platform. It successfully moves the compilation stack for photonic MBQC from a realm of idealized assumptions toward practical implementation.
-
Clear Context and Problem Formulation: The paper does an outstanding job of contextualizing its contribution. Figure 1 (page 2) is particularly effective, providing a concise visual history of MBQC IRs and clearly positioning this work as the next logical step in balancing expressive power with hardware feasibility. The motivation is compelling and grounded in real physical limitations (finite delay lines) and opportunities (underutilized fusion pathways).
-
Significant Technical Contributions: The two core ideas are both novel and impactful.
- The concept of dynamic refresh is an elegant solution to the bounded-delay problem. By refreshing nodes based on their individual "time-to-live" in a delay line rather than through rigid, periodic cycles, the compiler can minimize the overhead associated with maintaining quantum states over time.
- The introduction and exploitation of skewed edges is a keen architectural insight. Recognizing that the underlying fusion-based hardware can support these connections without modification and then building a compiler pass to leverage them for routing is a prime example of effective hardware/software co-design. It unlocks a new dimension for optimization that was previously ignored.
-
Demonstrated Resource Adaptivity: The paper's central claim of "resource-adaptivity" is well-supported by the evaluation (Section 5, starting page 9). The experiments showing trade-offs between 2D hardware size and the available delay line length (varying Df in Figure 10, page 11) are crucial. This capability elevates the compiler from a mere translator to a vital tool for architects exploring the vast design space of future photonic systems.
-
Forward-Looking Scope (FTQC): The inclusion of a study on fault-tolerant quantum computing (FTQC) using surface codes (Section 5.6, page 12) is a significant strength. It demonstrates that the proposed techniques are not limited to the NISQ era but provide a scalable path toward fault tolerance, which is the ultimate goal. The reported 3.33x depth reduction in this context is highly promising.
Weaknesses
The weaknesses of the paper are primarily related to the scope of its analysis and potential missed connections, rather than fundamental flaws in the core ideas.
-
Hardware Model Fidelity: The paper's model of the photonic hardware, while more realistic than its predecessors, is still an abstraction. The crucial PL Ratio (Physical-to-Logical layer ratio) is treated as a fixed parameter based on prior simulations. However, the effectiveness of the proposed routing strategies, especially for skewed edges, likely depends heavily on the actual connectivity of the physical layer after probabilistic fusions. A discussion on how the compiler's performance degrades as the fusion success rate drops (and the physical graph becomes sparser) would add significant depth and realism. Section 5.5 touches on this but could be more integrated with the main results.
-
Narrow Focus on Fusion-Based MBQC: The work is situated entirely within the fusion-based MBQC paradigm. While this is a leading approach for photonics (e.g., PsiQuantum), it is not the only one. Other paradigms, such as continuous-variable (CV) cluster state generation or direct circuit-model implementations on different photonic architectures, exist. A brief acknowledgment of these alternatives in the introduction or related work would help to better delineate the specific domain of the paper's contribution and strengthen its overall positioning within the broader landscape of photonic quantum computing.
-
Fidelity vs. Resource Costs: The evaluation focuses exclusively on architectural metrics: 1D depth, 2D size, and temporal edge length. However, the introduction of skewed edges, as the authors briefly note in Section 4.5 (page 9), may require longer physical fusion paths, potentially leading to lower fidelity per logical edge. The paper lacks a quantitative analysis or even a qualitative discussion of this critical trade-off. A 3x reduction in depth is less compelling if it comes at the cost of a 10x increase in the logical error rate.
Questions to Address In Rebuttal
-
The effectiveness of 2D-bounded temporal routing relies on finding connected paths in the physical substrate. How sensitive are the reported depth and size improvements to the fusion success rate? Is there a percolation threshold below which the advantage of skewed edges diminishes because the required skewed paths are rarely available?
-
The dynamic refresh mechanism is governed by a refreshing bound br, which is tuned by the parameter p (Equation 1, page 8). The paper identifies p=0.4 as a "sweet spot." Could the authors elaborate on the methodology used to determine this value? Furthermore, how sensitive are the results to variations in p, and could an adaptive scheme that dynamically tunes p based on program characteristics yield even better performance?
-
Regarding the implementation of skewed edges (Section 4.5), the paper argues that they can lead to longer fusion paths and fidelity degradation. Can the authors provide a more concrete analysis of this trade-off? For instance, for a given reduction in 1D depth, what is the estimated increase in the total number of physical fusions required, which could serve as a proxy for fidelity cost? This would provide a more holistic view of the optimization's overall benefit.
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents OneAdapt, a compiler for Measurement-Based Quantum Computing (MBQC) targeting photonic hardware. The core contribution is twofold: 1) an extension of the Intermediate Representation (IR) from prior work, and 2) two new compilation passes designed to leverage this extended IR. Specifically, the authors extend the FlexLattice IR, introduced in OnePerc [74], by incorporating two new features: a hard constraint on the length of temporal edges and the allowance of "skewed" temporal edges that connect nodes at nearby, but not identical, 2D locations on different time-like layers. The new compilation passes, "Dynamic Refreshing" and "2D-bounded Temporal Routing," are designed to enforce the length constraint and exploit the skewed edges, respectively. The authors claim this new framework leads to significant reductions in program depth and hardware requirements compared to prior art.
Strengths
The paper's primary strength lies in its clearly articulated and well-justified novelty. The contributions are not presented in a vacuum but as a direct and intelligent evolution of a specific, state-of-the-art predecessor (OnePerc [74]).
-
Novelty of the IR Extension: The introduction of "skewed edges" is a genuinely novel concept in the context of MBQC compilation for photonic systems. While routing is a general concept, embedding this specific form of spatially-offset temporal connectivity directly into the IR is a clever architectural insight. It correctly identifies a latent capability in the underlying fusion-based hardware model—that path-finding between resource states is not strictly limited to a vertical "stack"—and proposes a formal IR feature to exploit it. This is a significant conceptual leap beyond the strictly columnar temporal connections in the original FlexLattice IR.
-
Algorithmic Novelty: The proposed "Dynamic Refreshing" mechanism (Section 3.2, page 6) is a substantial improvement over the "periodic refreshing" from OnePerc. The latter is a brute-force, synchronous method, whereas the proposed dynamic approach is an asynchronous, needs-based, and more granular scheduling algorithm. The use of a feedback mechanism based on the number of constraint-driven refreshes (Section 4.4, page 8, equation 1) to tune the refresh-to-computation ratio is a sophisticated and novel heuristic in this domain. This represents a significant increase in the algorithmic elegance for managing a critical resource constraint (photon storage time).
-
Synergistic Design: The two primary novelties (the extended IR and the new passes) are not independent but work in synergy. The skewed edges are not merely an addition; they are the feature that enables the more powerful 2D-bounded temporal routing. This tight coupling between the IR design and the compiler optimizations demonstrates a mature co-design approach, which itself is a valuable contribution. The resulting performance gains are not marginal; a 3.48x average depth reduction over OnePerc (Table 1, page 10) is substantial and directly validates the benefit of the novel ideas presented.
Weaknesses
My concerns are not with the existence of novelty, but with the rigor used to justify the practicality and scope of the novel claims.
-
Under-substantiated Feasibility of Skewed Edges: The entire premise of the skewed edge novelty rests on the claim that its implementation requires minimal overhead. Section 4.5 (page 9) states that a skewed IR edge "can be formed easily by a skewed fusion path" and that the only required change is "to allow path searching between qubits corresponding to IR nodes at nearby 2D locations." This assertion is the lynchpin for the paper's most innovative feature, yet it is treated with surprising brevity. Allowing paths to deviate from a straight column could significantly increase the complexity and runtime of the (2+1)-D path searching algorithm used for renormalization. It could also increase the likelihood of routing conflicts between different logical edges, potentially increasing the physical-to-logical layer ratio (PL Ratio). While the paper claims in Section 5.5 (page 11) that this effect is "negligible" for a skew distance of 1, this feels more like an empirical observation than a rigorous justification. The novelty is strong, but its practical foundation feels shaky.
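To make the search-volume concern tangible, the toy breadth-first reachability count below contrasts skew-0 and skew-1 path searching on a small lattice; the lattice model and neighbor rules are this reviewer's simplification, not the paper's renormalization algorithm.

```python
# Toy (2+1)-D path search: from layer t to t+1 a path may stay in the same 2D cell
# (skew 0) or also step to a neighboring cell (skew 1). This is a simplification of
# fusion path searching, meant only to show how the explored volume grows with skew.
from collections import deque

def reachable_cells(start_xy, layers, grid, skew):
    """Count lattice cells reachable from start_xy within `layers` time steps."""
    moves = [(0, 0)] + ([(1, 0), (-1, 0), (0, 1), (0, -1)] if skew >= 1 else [])
    frontier, seen = deque([(start_xy[0], start_xy[1], 0)]), {(start_xy[0], start_xy[1], 0)}
    while frontier:
        x, y, t = frontier.popleft()
        if t == layers:
            continue
        for dx, dy in moves:
            nxt = (x + dx, y + dy, t + 1)
            if 0 <= nxt[0] < grid and 0 <= nxt[1] < grid and nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return len(seen)

print(reachable_cells((5, 5), layers=4, grid=12, skew=0))  # 5 cells: a straight column
print(reachable_cells((5, 5), layers=4, grid=12, skew=1))  # far more cells to consider
```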
-
Arbitrary Limitation on Novelty: The skewed edges are restricted to a Hamming distance of 1. While this is a practical choice that delivers good results, the paper does not explore the reasoning behind this specific limit. Is this a fundamental constraint imposed by the physics or connectivity of the hardware, or is it merely the first parameter value the authors tried? The novelty of the skewed edge concept would be significantly strengthened by a characterization of the design space. An analysis of the trade-offs (e.g., impact on PL Ratio, path-finding complexity, potential for 1D depth reduction) for a skew distance > 1 would provide a much deeper understanding of this new IR feature. As it stands, the innovation feels like a single point-solution rather than the introduction of a new, well-understood architectural knob.
Questions to Address In Rebuttal
-
The feasibility of skewed edges is central to this work. Can the authors elaborate on the algorithmic complexity of the modified (2+1)-D path searching? Does searching a larger volume for paths for each logical edge measurably increase the runtime of the renormalization step, which is a critical part of the real-time hardware operation? Can you provide stronger evidence that path congestion and the PL Ratio are not adversely affected in denser, more complex programs than those in the benchmark suite?
-
Regarding the scope of the skewed edge concept: Please justify the choice of a Hamming distance of 1. Is this limit based on a physical constraint in the fusion-based architecture model, or is it an empirical choice? A brief discussion on the projected overheads and potential benefits of allowing a larger skew distance would help establish the generality of this novel contribution.
-
The dynamic refresh algorithm is an interesting scheduling solution that prioritizes nodes based on their computational relevance and storage time. This bears a conceptual resemblance to deadline-driven scheduling in real-time systems or advanced cache/register management in classical compilers. Was this novel approach inspired by prior art in other domains of computer science? Placing this quantum compilation technique in a broader context could help clarify its fundamental contribution.
MUSS-TI: Multi-level Shuttle Scheduling for Large-Scale Entanglement Module Linked Trapped-Ion
Abstract
Trapped-ion computing is a leading architecture in the pursuit of scalable and high fidelity quantum systems. Modular quantum architectures based on photonic interconnects offer a promising path for scaling trapped ion devices. In this design, multiple ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose a compilation framework, MUSS-TI, tailored for a hypothetical trapped-ion architecture they term EML-QCCD. This architecture segregates trap regions into specialized zones (storage, operation, optical). The compiler employs a multi-level scheduling heuristic, analogous to classical memory hierarchies, to manage qubit movement between these zones. The primary claims are significant reductions in shuttle operations and execution time compared to existing QCCD compilers, leading to improved final state fidelity.
Strengths
- The paper identifies a relevant problem: compiling for modular, zoned trapped-ion architectures is a critical next step for scalability.
- The inclusion of an ablation study (Section 5.4, Figure 8) to dissect the performance contribution of different components (mapping vs. SWAP insertion) is a methodologically sound practice.
Weaknesses
- Unfair Baseline Comparison: The central weakness of this work is the comparison of a specialized compiler (MUSS-TI) on its target specialized architecture (EML-QCCD) against baseline compilers [13, 55, 70] designed for generic QCCD grids. The baselines are not zone-aware and are thus fundamentally disadvantaged. The reported performance gains are therefore unsurprising and likely exaggerated. A rigorous evaluation would require adapting the baseline heuristics to be zone-aware or demonstrating MUSS-TI's superiority on a standard, non-zoned architecture.
- Oversimplified and Potentially Biased Fidelity Model: The conclusions regarding fidelity are heavily dependent on the chosen model (Section 4, page 7). The quadratic decay of two-qubit gate fidelity with ion number (1 - εN²) directly drives the conclusion of an "optimal" trap size in Section 5.3. The fixed fidelity for remote entanglement (0.99) is highly optimistic. The heating model (-kn) is simplistic and lacks detailed justification for the chosen parameter k = 0.001. The paper fails to provide a sensitivity analysis of its results with respect to these crucial, and debatable, modeling assumptions; a parameter sweep of the kind sketched after this list would be a minimal first step.
- Heavily Parameterized Heuristics: The SWAP insertion strategy (Section 3.3, page 6) relies on manually tuned "magic numbers" (k = 8, T = 4). The paper provides scant justification for these specific values and does not explore how performance changes with different parameter choices. The claim that k can be adjusted based on "an understanding of the locality of the input circuits" is unsubstantiated and not demonstrated.
- Hypothetical Architecture: The work is predicated on the EML-QCCD architecture, which, while plausible, is presented without a detailed hardware analysis. Claims that it is "more achievable" than other proposals like TITAN (page 4, sec 2.3) are asserted rather than proven. The compiler's performance is inextricably linked to this architecture's specific layout, making the results less generalizable.
- Strained Analogy: The framing of the problem as analogous to "multi-level memory scheduling" (Section 3, page 4) is more of a narrative convenience than a technically rigorous mapping. The physical realities of ion shuttling—high latency, induced heating, and connectivity constraints—differ fundamentally from data movement in classical memory hierarchies, and the analogy may obscure more than it clarifies.
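To make the requested sensitivity analysis concrete, a minimal sketch follows. It reuses only the functional forms and values quoted in this review (the 1 - εN² gate-fidelity decay, the 0.99 remote-entanglement fidelity, and k = 0.001 for heating); the multiplicative composition into an end-to-end figure, the exponential form of the heating term, the ε values swept, and the gate/shuttle counts are illustrative assumptions, not the paper's model.

```python
# Illustrative sensitivity sweep over the fidelity-model parameters named in this
# review (epsilon, k, remote-entanglement fidelity). The way the terms are
# combined into an end-to-end estimate is an assumption for illustration only.
import math

def end_to_end_fidelity(n_ions, n_local_gates, n_remote_gates, n_shuttles,
                        epsilon=1e-4, k=0.001, f_remote=0.99):
    f_local = 1.0 - epsilon * n_ions ** 2   # quadratic decay with ion count (as quoted)
    heating = math.exp(-k * n_shuttles)     # heating penalty; exponential form assumed here
    return (f_local ** n_local_gates) * (f_remote ** n_remote_gates) * heating

# Sweep the debatable parameters to see how strongly conclusions depend on them.
for eps in (5e-5, 1e-4, 2e-4):              # epsilon values are hypothetical
    for f_rem in (0.95, 0.99):
        f = end_to_end_fidelity(n_ions=20, n_local_gates=500,
                                n_remote_gates=50, n_shuttles=200,
                                epsilon=eps, f_remote=f_rem)
        print(f"epsilon={eps:.0e}, f_remote={f_rem}: fidelity ≈ {f:.3f}")
```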
Questions to Address In Rebuttal
- How would you justify the fairness of comparing your zone-aware compiler against zone-unaware baselines? Can you provide results for an experiment where the baseline algorithms are modified to be aware of the EML-QCCD zones, or where MUSS-TI is benchmarked against them on a standard QCCD grid architecture?
- Please provide a sensitivity analysis for the key parameters in your fidelity model (ε, k) and your SWAP insertion heuristic (k, T). How robust are your claimed improvements to variations in these assumptions? Specifically, how do the results change if the fiber entanglement fidelity is lowered from the optimistic 0.99?
- The ablation study (Figure 8) suggests that the SABRE-style mapping provides the vast majority of the performance improvement. Can you quantify the individual contributions of the LRU replacement policy and the multi-level scheduling heuristic, independent of the SABRE mapping? This would clarify whether the core "MUSS" concept is truly the main driver of performance.
- Can you elaborate on the claim that the EML-QCCD architecture is "more readily achievable" (page 3)? What specific fabrication or control challenges present in architectures like TITAN are circumvented by this design, and what new challenges does it introduce?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces MUSS-TI, a compilation framework designed for a promising class of large-scale trapped-ion quantum computers: Entanglement Module Linked Quantum Charge-Coupled Devices (EML-QCCD). The core and most compelling contribution of this work is the application of a classical computer architecture concept—the multi-level memory hierarchy—to the complex problem of quantum circuit compilation. The authors intelligently map the distinct functional zones of the EML-QCCD architecture (storage, operation, and optical zones) to different levels of a memory hierarchy. By doing so, they can leverage well-established scheduling policies, such as Least Recently Used (LRU), to manage the movement (shuttling) of qubits. This elegant analogy allows their compiler to make informed decisions about where and when to move qubits, with the primary goal of minimizing the costly shuttling operations that introduce latency and decoherence. The paper provides a comprehensive evaluation showing significant reductions in shuttle counts and execution time, particularly for medium- and large-scale applications, thereby making a strong case for the viability of both the proposed compiler and the underlying EML-QCCD architecture.
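To illustrate the kind of policy reuse this analogy enables, here is a minimal, self-contained sketch of LRU-style management of a capacity-limited operation zone. It is not MUSS-TI's actual scheduler; the class name, zone capacity, gate stream, and shuttle accounting are placeholder assumptions.

```python
# Minimal sketch of LRU-style management of a capacity-limited "operation zone".
# Illustrative only: it does not reproduce MUSS-TI's scheduler; capacity and
# shuttle accounting are placeholder assumptions.
from collections import OrderedDict

class OperationZoneLRU:
    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()   # qubit id -> True, ordered by recency of use
        self.shuttles = 0               # count of shuttle-in / shuttle-out events

    def touch(self, qubit):
        """Bring a qubit into the operation zone, evicting the LRU qubit if full."""
        if qubit in self.resident:
            self.resident.move_to_end(qubit)                # "cache hit": update recency only
            return
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)               # evict least recently used qubit
            self.shuttles += 1                              # shuttle evicted qubit to storage
        self.resident[qubit] = True
        self.shuttles += 1                                  # shuttle requested qubit in

zone = OperationZoneLRU(capacity=4)
for q0, q1 in [(0, 1), (2, 3), (0, 2), (4, 5), (1, 4)]:     # hypothetical two-qubit gate stream
    zone.touch(q0)
    zone.touch(q1)
print("shuttle events:", zone.shuttles)
```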
Strengths
-
Powerful and Elegant Central Analogy: The single greatest strength of this paper is its central idea: framing the qubit scheduling problem as a memory hierarchy management problem. This is a beautiful piece of conceptual synthesis. It takes a notoriously complex, multi-variable optimization problem in the quantum domain and maps it to a problem space that has been deeply studied for decades in classical computer architecture. This not only provides an intuitive framework for reasoning about qubit locality and movement but also unlocks a rich toolbox of existing scheduling heuristics (like LRU, as demonstrated here). This reframing is a significant intellectual contribution.
-
Architectural Foresight and Relevance: The work is not performed in a vacuum; it directly addresses the software challenges of a highly plausible next-generation hardware architecture. The EML-QCCD model, as described in Section 1 and Figure 2 (page 3), represents a credible path toward scaling trapped-ion systems by modularizing them. By developing a compiler specifically for this architecture, the authors are engaging in crucial hardware-software co-design. This work anticipates the needs of hardware experimentalists and provides a foundational software layer that will be necessary to make such systems programmable and efficient. It connects directly to the trend of building distributed and modular quantum systems, as seen in works like TITAN [11] and various experimental demonstrations of photonic links.
-
Comprehensive and Scalable Evaluation: The evaluation is thorough and persuasive. The authors benchmark MUSS-TI against several relevant prior works across small, medium, and large-scale applications. The results presented in Table 2 (page 8) and Figure 6 (page 8) consistently show dramatic improvements in shuttle count and execution time. Furthermore, the analysis extends beyond simple metrics to include crucial investigations into the impact of trap capacity (Section 5.3, page 9) and an ablation study of the compiler's own techniques (Section 5.4, page 9), lending significant weight to their conclusions. This demonstrates that the benefits are not just theoretical but are robust across different system parameters and application types.
-
Bridging Disciplinary Gaps: This paper serves as an excellent bridge between the fields of quantum computing and classical computer architecture. It demonstrates that the challenges emerging in scalable quantum systems are not entirely alien; rather, they are new instantiations of fundamental computer science problems related to locality, communication, and resource management. This work can inspire further cross-pollination of ideas, which will be essential for building the full quantum computing stack.
Weaknesses
While the core idea is strong, the work could be further contextualized and its underlying assumptions explored more deeply. These are not fatal flaws but rather opportunities for strengthening the work.
-
Simplification of the Cost Model: The fidelity model presented in Section 4 (page 7) is a necessary and reasonable simplification for a compiler-level study. However, the costs of the different "memory accesses" are highly complex. For instance, the error mechanisms of an intra-QCCD MS gate (in the "operation zone") are very different from those of a photonically mediated remote entanglement gate (in the "optical zone"). The current framework appears to treat the hierarchy as a linear progression of cost/speed, but the reality might be more nuanced. The paper would be strengthened by a discussion of how the framework might adapt if, for example, the fidelity of remote gates improved dramatically, potentially changing the optimal scheduling strategy.
-
Limited Exploration of the Analogy's Full Potential: The authors successfully apply the concept of a memory hierarchy and an LRU replacement policy. However, the analogy has much deeper potential. Classical systems employ sophisticated techniques like prefetching, different write-back/write-through policies, and dynamic cache partitioning. While beyond the scope of this initial work, a discussion of how these more advanced concepts might translate to the qubit scheduling problem would elevate the paper's vision and impact. For instance, could static analysis of the quantum circuit's dependency graph (DAG) enable predictive "qubit prefetching"?
-
Scalability of the Classical Compilation: The paper focuses on the performance scalability of the quantum circuit execution. The analysis of the compiler's own classical runtime (Section 5.6, page 10) is present but brief. The observed spikes in compilation time in Figure 10 suggest that for extremely large circuits, the classical overhead of making these sophisticated scheduling decisions could become non-trivial. This is a common challenge in advanced compilation, but it warrants a more detailed discussion about the trade-offs and potential bottlenecks in the classical control system.
Questions to Address In Rebuttal
-
The memory hierarchy analogy is the paper's most significant contribution. Could the authors elaborate on how this analogy might be extended? For instance, could data-flow analysis of the circuit's DAG be used to implement a form of "qubit prefetching," where qubits are speculatively moved to higher-level zones (e.g., from storage to operation) in anticipation of their use, potentially hiding shuttle latency?
-
The principles of MUSS-TI are developed for the EML-QCCD architecture. How generalizable is this multi-level scheduling concept? Could a similar framework be applied to other emerging modular architectures, such as networked superconducting processors with different tiers of coupler speeds, or neutral atom arrays with physically separate "storage" and "interaction" zones?
-
The current work prioritizes minimizing shuttling, which is a key bottleneck. However, the cost landscape is dynamic. How would the MUSS-TI framework adapt if future hardware advancements dramatically changed the relative costs of operations—for example, if fiber-based remote entanglement (Level 2) became significantly faster or higher fidelity than local MS gates (Level 1)? Does the framework allow for flexible cost functions to re-prioritize scheduling decisions based on evolving hardware realities?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present MUSS-TI, a compiler framework designed to optimize shuttle scheduling for a large-scale, modular trapped-ion architecture termed Entanglement Module Linked Quantum Charge-Coupled Device (EML-QCCD). The central claim to novelty lies in the application of a classical multi-level memory hierarchy analogy to the problem of qubit movement. Specifically, the authors map the distinct functional zones of the EML-QCCD architecture—storage, operation, and optical zones—to different levels of a memory hierarchy (L0, L1, and L2/CPU, respectively). This conceptual framework is then used to drive a scheduling algorithm that employs policies directly inspired by classical cache management, such as a Least Recently Used (LRU) policy for qubit "eviction" from high-value zones. This is complemented by a tailored, lookahead-based SWAP gate insertion heuristic to manage qubit placement across different QCCD modules. The paper claims this new approach significantly reduces shuttle operations and improves overall circuit fidelity compared to existing compilers.
Strengths
-
Novelty of the Core Conceptual Framework: The primary strength of this paper is the successful and novel application of a well-understood concept from classical computer architecture (multi-level memory/cache hierarchies) to a fundamentally different domain (quantum ion shuttling). While prior works [13, 55, 69] have developed heuristics for shuttle optimization, they have not, to my knowledge, explicitly formalized the problem using this powerful analogy. The mapping of physical zones to memory levels, as detailed in Section 3 (page 4), provides a new and intuitive lens through which to structure the optimization problem. This conceptual bridge is the paper's most significant and original contribution.
-
Effectiveness of the Novel Heuristics: The new perspective is not merely a semantic relabeling; it directly inspires effective algorithmic choices. The use of an LRU replacement policy for managing limited space in the operation/optical zones is a direct and logical consequence of the cache analogy. This appears to be a genuinely new approach for qubit management in this context. The substantial performance gains reported in Table 2 and Figure 6 (page 8), which are far from marginal, serve as strong evidence that this novel conceptual framework leads to a superior class of scheduling heuristics.
Weaknesses
-
The Architectural Premise is Evolutionary, Not Revolutionary: The EML-QCCD architecture, which underpins the entire scheduling model, is itself not a wholly new invention. The concept of functionally distinct zones (storage, interaction, readout) has been a cornerstone of QCCD proposals since the original work by Kielpinski, Monroe, and Wineland [30]. Similarly, the use of photonic interconnects to link modules is a widely explored avenue for scaling [1, 33, 61]. The paper is compiling for a specific, plausible, and well-motivated instantiation of these existing ideas. The novelty lies in the compiler, not the hardware concept, and this distinction could be made clearer. The contribution is a novel solution for an evolved architecture, not the invention of the architecture itself.
-
Supporting Techniques are Adaptations of Prior Art: While the synthesis is novel, several key components of the algorithm are direct adaptations of existing techniques.
- LRU Policy: The qubit replacement scheduler (Section 3.2, page 5) is explicitly an LRU policy, a foundational algorithm in classical OS and architecture. Its application here is novel, but the algorithm itself is not.
- SWAP Insertion: The use of a lookahead-based search to determine the utility of inserting SWAP gates is conceptually similar to established techniques in qubit routing for other platforms, most notably SABRE [37], which the authors also adapt for initial mapping. The novelty in their SWAP insertion method (Section 3.3, page 6) is confined to the specific cost metric (W(qi, cj)) tailored for their inter-module architecture, which is an incremental, albeit useful, advancement.
The paper’s main achievement is the integration of these ideas under a new, cohesive framework, rather than the invention of each individual part from first principles.
Questions to Address In Rebuttal
-
The central novelty is the mapping of the EML-QCCD architecture to a classical memory hierarchy. Beyond providing an intuitive vocabulary (e.g., 'cache miss', 'eviction'), what fundamental scheduling advantages does this analogy unlock that could not be achieved through a more conventional graph-based heuristic cost model that simply assigned higher costs to shuttles into/out of specific zones? Please elaborate on the specific algorithmic choices that are uniquely enabled by this perspective.
-
The SWAP insertion policy described in Section 3.3 (page 6) relies on a lookahead window of k = 8 layers in the DAG. How was this value determined? Section 5.5 (page 9) suggests that the optimal k is application-dependent. How sensitive is the performance of MUSS-TI to this hyperparameter, and does this sensitivity undermine the claim of a generally applicable, robust framework?
-
The proposed multi-level scheduling is tightly coupled to the three-zone (storage, operation, optical) EML-QCCD architecture. How would the MUSS-TI framework generalize to a different trapped-ion architecture with, for instance, only two distinct zone types (e.g., storage and a combined operation/optical zone) or a more complex hierarchy with four or more levels? Does the novelty of the approach persist if the underlying architecture does not map as cleanly to a three-level memory system?
Rasengan: A Transition Hamiltonian-based Approximation Algorithm for Solving Constrained Binary Optimization Problems
Abstract
Constrained binary optimization is a representative NP-hard problem in various domains, including engineering, scheduling, and finance. Variational quantum algorithms (VQAs) provide a promising methodology for solving this problem by integrating the power ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose "Rasengan," a variational quantum algorithm for constrained binary optimization. The core idea deviates from traditional VQAs by expanding the search space from a single known feasible solution rather than shrinking it from a global superposition. This expansion is driven by a proposed "transition Hamiltonian," derived from the homogeneous basis of the problem's linear constraints. To make the algorithm deployable, the authors introduce several optimization techniques: Hamiltonian simplification/pruning, a "segmented execution" strategy to manage circuit depth, and a "purification" step for error mitigation. The paper claims significant improvements in both accuracy (ARG) and circuit depth over existing methods like penalty-term QAOA and commute-Hamiltonian-based QAOA (Choco-Q).
While the paper presents a novel approach built on a sound linear algebra foundation, its claims of superiority rest on a series of methodological choices and dramatic optimizations that warrant severe scrutiny. The core algorithm appears undeployable in its pure form, and its claimed practical success is almost entirely dependent on a segmented, measurement-heavy execution model that challenges its classification as a cohesive quantum algorithm.
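To fix terminology for the points below, the following schematic sketch shows the segment-by-segment structure this review attributes to the method: each segment is executed and measured, and only a classical probability distribution is handed to the next segment. The run_segment stub and its random "evolution" are stand-ins for hardware execution, not the paper's implementation; segment count, shot count, and the initial bitstring are hypothetical.

```python
# Schematic sketch of a segmented quantum-classical loop: each segment is run and
# measured, and only a classical distribution is passed forward. `run_segment`
# stands in for hardware execution and is simulated here with random sampling.
import random
from collections import Counter

def run_segment(init_distribution, shots=1024):
    """Stand-in for one circuit segment: sample an input state from the classical
    distribution, then apply a placeholder 'evolution' (flip one random bit)."""
    states, weights = zip(*init_distribution.items())
    samples = random.choices(states, weights=weights, k=shots)
    outcomes = []
    for s in samples:
        i = random.randrange(len(s))
        outcomes.append(s[:i] + ("1" if s[i] == "0" else "0") + s[i + 1:])
    counts = Counter(outcomes)
    return {state: c / shots for state, c in counts.items()}

distribution = {"0101": 1.0}          # start from one known feasible bitstring
for _ in range(3):                    # three segments with a classical handoff between them
    distribution = run_segment(distribution)
print(sorted(distribution.items(), key=lambda kv: -kv[1])[:5])
```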
Strengths
-
Conceptual Foundation: The core concept of building the solution space from a known feasible solution (xp) and the null space (u) of the constraint matrix (C) is sound from a linear algebra perspective. Framing this expansion in the language of a "transition Hamiltonian" is a clear and direct way to map the classical concept to a quantum operator.
Problem Formulation: The paper is well-structured and clearly identifies a key weakness in existing VQAs for constrained problems—namely, the inefficiency of searching a vast, mostly infeasible space. The proposed "expand-from-feasible" approach is a logical alternative.
-
Ablation Study: The inclusion of an ablation study (Section 5.6, Figures 15 and 16) is commendable, as it provides some insight into the relative impact of each proposed optimization. However, the interpretation of these results is debatable, as I will detail below.
Weaknesses
-
The "Segmented Execution" Breaks Coherent Quantum Evolution: The paper's claim to deployability hinges critically on the "segmented execution" strategy (Section 4.2). This technique partitions the sequence of Hamiltonians, executes each segment, measures the result, and uses the output probability distribution to initialize the next segment. This is not a continuous, coherent quantum evolution. It is a sequence of distinct quantum-classical steps. By collapsing the wavefunction to a classical probability distribution between segments, any quantum coherence and entanglement that might have been built up across segments is destroyed. This raises a fundamental question: is Rasengan a true quantum algorithm, or a classical heuristic that uses a quantum computer as a parameterized sampler at each step? The claim that it "preserves the probability information" (Section 4.2, Page 7) is insufficient; the crucial aspect of a quantum algorithm is the evolution of amplitudes in a superposition, which is explicitly broken here.
-
Misleading Baseline Comparisons: The primary claims of algorithmic superiority are based on potentially unfair comparisons. In Table 2, the baselines (HEA, P-QAOA, Choco-Q) are run with a fixed five layers. However, the paper's own analysis in Figure 9 demonstrates that the performance of Choco-Q improves dramatically with more layers, approaching Rasengan's accuracy at 14 layers. Therefore, the headline claim of a 4.12x accuracy improvement (from the abstract) is derived from comparing Rasengan against under-parameterized and non-optimized baselines. A rigorous comparison would benchmark algorithms against a fixed resource budget (e.g., total circuit depth or number of CNOTs), not a fixed, arbitrary layer count.
-
Grossly Overstated Real-Hardware Performance: The 379x improvement claim on real hardware (Section 5.4, Figure 11) is profoundly misleading. The data shows that the baseline algorithms produce ARGs greater than 90, which indicates a complete failure to find any meaningful solution—they are effectively outputting noise. Rasengan's achievement is not outperforming them in a meaningful way, but simply not failing as catastrophically. Attributing a "379x improvement" to this is statistically questionable. This success is likely due to the segmented, shallow-depth nature of the execution and the aggressive post-selection, which makes it more resilient to decoherence, rather than inherent algorithmic superiority.
-
"Purification" is Trivial Post-Selection: The "Error Mitigation by Purification" (Section 4.3) is presented as a novel technique. It is not. It is simply checking whether a measurement result satisfies the classical constraints (Cx = b) and discarding it if it does not. This is standard post-selection (a minimal sketch of the filtering follows this list). While necessary, its impact is misrepresented. The 100% in-constraints rate reported in noisy experiments (Figure 11b) is not an achievement of the algorithm in handling noise, but a direct artifact of this filtering process. Furthermore, the massive 303x ARG improvement on hardware attributed to Opt 3 (which includes purification) in Figure 16 reveals that this filtering step is responsible for the vast majority of the claimed performance, masking the poor quality of the raw quantum output.
-
Contradictory Circuit Complexity Narrative: The paper initially presents a transition Hamiltonian that requires "34k CX gates", where k is the number of non-zero elements in the basis vector (Section 3.2, Page 5). For non-trivial problems, this is undeployable on any NISQ device. The paper only achieves a deployable depth (~50, Section 4) via the "segmented execution" method, which, as argued in point #1, fundamentally changes the nature of the algorithm. The narrative presents a single algorithm, "Rasengan," but its theoretical formulation and its practical implementation are two vastly different procedures.
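For reference, the filtering described in the fourth weakness amounts to the following minimal sketch; the constraint matrix, right-hand side, and raw counts are hypothetical.

```python
# Minimal sketch of constraint-based post-selection: keep only measured bitstrings
# that satisfy Cx = b. The matrix, right-hand side, and counts are hypothetical.
import numpy as np

C = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]])
b = np.array([1, 1])

raw_counts = {"1010": 412, "0101": 388, "1100": 97, "0011": 61, "1111": 42}

def satisfies_constraints(bitstring):
    x = np.array([int(ch) for ch in bitstring])
    return np.array_equal(C @ x, b)

purified = {s: n for s, n in raw_counts.items() if satisfies_constraints(s)}
kept, total = sum(purified.values()), sum(raw_counts.values())
print(purified)                                   # {'1010': 412, '0101': 388}
print(f"post-selection keep rate: {kept / total:.2%}")
```

The keep rate printed here is exactly the kind of raw-output statistic the rebuttal should report alongside any post-selected ARG numbers.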
Questions to Address In Rebuttal
-
Please provide a theoretical justification for how "segmented execution" (Section 4.2), which involves intermediate measurements and classical handoffs, preserves the quantum computational advantage of superposition and entanglement across the entire problem evolution. How does this method differ from a classical search heuristic that uses a quantum device as a parameterized sampler at each step?
-
Given that Figure 9 shows Choco-Q's ARG improves significantly with more layers, how can the authors justify the fixed 5-layer comparison in Table 2 and the resulting 4.12x improvement claim? A fair comparison would require comparing both algorithms under an equivalent resource budget (e.g., total circuit depth or execution time).
-
Can the authors defend the 379x hardware improvement claim (Figure 11) when the baselines are performing at near-random levels (ARG > 90)? Does this result demonstrate genuine algorithmic superiority, or merely that the baselines are more fragile to noise and Rasengan's segmented, post-selected approach is less so?
-
Please clarify the contribution of the "purification" technique (Section 4.3). Since this is equivalent to post-selecting valid solutions, how does it address the underlying noise in the quantum computation itself, rather than simply filtering the final, noisy output distribution? Does this post-selection not risk biasing the final solution, especially if noise affects different basis states non-uniformly?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces Rasengan, a novel variational quantum algorithm (VQA) for solving constrained binary optimization problems. The work's essential contribution is a fundamental inversion of the typical VQA search strategy. Instead of starting with a superposition of all possible states (both feasible and infeasible) and attempting to "shrink" the search space towards an optimal solution, Rasengan begins with a single, classically-found feasible solution and uses a specially constructed "transition Hamiltonian" to "expand" the search space to cover all other feasible solutions.
This transition Hamiltonian is elegantly derived from the homogeneous basis vectors (the null space) of the problem's linear constraint matrix, ensuring that every step of the quantum evolution remains strictly within the feasible subspace. The authors complement this core algorithmic idea with a suite of pragmatic algorithm-hardware co-design techniques—including Hamiltonian simplification, pruning of redundant transitions, segmented execution to manage circuit depth, and a purification step for error mitigation. These optimizations make the conceptually powerful idea deployable on near-term quantum hardware. The paper presents a comprehensive evaluation demonstrating significant improvements in both accuracy and circuit depth over existing state-of-the-art methods like penalty-based and commute-Hamiltonian QAOA.
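The classical backbone of this construction can be stated in a few lines: if Cxp = b and Cu = 0, then C(xp + u) = b, so moving along homogeneous basis vectors never leaves the constraint surface. The sketch below illustrates this with a hypothetical constraint matrix; it works over the reals for simplicity (binary feasibility additionally requires entries to stay in {0,1}, which the ± signs handle in this toy case) and does not reproduce the paper's quantum operator construction.

```python
# Worked example of the classical identity behind the expansion: starting from one
# feasible solution xp of Cx = b, adding any null-space vector u of C (Cu = 0)
# yields another point satisfying the constraints, since C(xp + u) = Cxp = b.
import numpy as np
from sympy import Matrix

C = np.array([[1, 1, 0, 0],
              [0, 0, 1, 1]])
b = np.array([1, 1])
xp = np.array([1, 0, 1, 0])                # one feasible binary solution, found classically

# Homogeneous basis: vectors spanning the null space of C (exact, via sympy).
null_basis = [np.array(v).astype(float).flatten() for v in Matrix(C).nullspace()]

print("C @ xp =", C @ xp)                   # [1 1], equal to b
for u in null_basis:
    for sign in (+1, -1):                   # moving along ±u stays on the surface Cx = b
        x_new = xp + sign * u
        print(x_new, "->", C @ x_new)
```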
Strengths
-
Powerful Conceptual Shift: The most significant strength of this work is its core idea of "expanding" the feasible space rather than "shrinking" the total space (beautifully illustrated in Figure 2, page 4). This directly addresses a primary source of inefficiency in many quantum optimization algorithms, which expend vast computational resources navigating a huge, mostly infeasible search space. By constraining the search to the relevant subspace from the very beginning, the algorithm promises a much more efficient use of quantum resources.
-
Elegant Connection Between Linear Algebra and Quantum Dynamics: The formulation of the transition Hamiltonian (Section 3.1, page 4) based on the homogeneous basis vectors of the constraint system Cx = b is a particularly insightful contribution. It provides a rigorous mathematical bridge between the classical, algebraic structure of the problem's constraints and the quantum evolution needed to explore the solution space. This is a far more structured and problem-aware approach than the generic mixers used in standard QAOA or the adaptive layers of a Hardware Efficient Ansatz (HEA).
Pragmatic Co-Design for NISQ-era Reality: The authors do not stop at a purely theoretical proposal. The optimization techniques presented in Section 4 (page 6) show a mature understanding of the practical limitations of current quantum computers. Segmented execution, in particular, is a clever strategy to execute deep logical circuits on hardware with short decoherence times. This combination of a strong theoretical idea with practical, hardware-aware engineering is a major strength and is what makes the impressive real-hardware results (Figure 11, page 10) believable and impactful.
-
Strong Empirical Validation: The experimental evaluation is extensive and convincing. The authors test their method against relevant baselines across five different problem domains (Table 2, page 8), analyze scalability (Figure 10, page 9), and demonstrate performance on real IBMQ hardware. The reported 4.12x accuracy improvement over the state-of-the-art Choco-Q and the staggering 379x improvement on a real quantum device are compelling evidence of the method's potential.
Weaknesses
My critiques are less about fundamental flaws and more about clarifying the boundaries and assumptions of the proposed method, which will help contextualize its applicability.
-
Dependence on an Initial Feasible Solution: The entire framework is predicated on the ability to classically and efficiently find at least one feasible solution to initialize the algorithm. The authors state this can often be done in linear time (Section 5.1, page 7), which is true for the benchmarks chosen (e.g., facility location, set cover). However, for other classes of problems, particularly those with very tight or complex constraints (e.g., certain integer linear programs), finding any feasible solution can be an NP-hard problem in itself. This assumption represents a key boundary on the applicability of Rasengan.
-
Limited Scope of Constraints: The methodology is presented for problems with linear equality constraints (Cx = b). While the paper notes that inequalities can be handled by introducing slack variables, this standard technique can significantly increase the number of required qubits, which is the most precious resource in the NISQ era. The work would be strengthened by a more thorough discussion of its applicability and potential performance degradation when applied to problems dominated by inequalities or those with non-linear constraints.
Implicit Assumptions on Basis Sparsity: The performance of the algorithm, especially the complexity of the transition Hamiltonian circuit, depends on the number of non-zero elements in the homogeneous basis vectors. While the Hamiltonian simplification technique (Section 4.1, page 6) is designed to mitigate this, the paper does not fully explore the worst-case scenario where the basis vectors are pathologically dense. This could lead to circuits that remain prohibitively deep even after optimization, representing another potential boundary condition.
Questions to Address In Rebuttal
-
Could the authors comment on the applicability of Rasengan to problems where finding an initial feasible solution is itself computationally difficult? Does the framework offer any recourse in such scenarios, or would it be considered out-of-scope for this method?
-
Regarding the handling of inequality constraints, could the authors provide some analysis or experimental data on how the introduction of slack variables impacts Rasengan's performance? Specifically, how does the increase in qubit count affect the circuit depth and overall accuracy compared to methods that handle inequalities more natively?
-
While the Hamiltonian simplification and pruning are shown to be effective on the chosen benchmarks, could the authors elaborate on the theoretical limitations of these techniques? Are there known problem structures or constraint topologies that would result in dense homogeneous basis vectors for which these optimizations would provide limited benefit?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper introduces "Rasengan," a variational quantum algorithm (VQA) for constrained binary optimization problems with linear constraints. The central claim of novelty lies in its core methodology: instead of starting with a superposition of all possible states and "shrinking" the search space towards a feasible optimum (the standard QAOA paradigm), Rasengan starts from a single, classically-determined feasible solution and "expands" the search space to cover all other feasible solutions.
This expansion is achieved through a novel construct the authors term the "transition Hamiltonian," which is derived directly from the homogeneous basis of the problem's linear constraint matrix. The algorithm iteratively applies these transition Hamiltonians to navigate exclusively between basis states corresponding to feasible solutions. The paper also presents a set of co-design optimizations (Hamiltonian simplification/pruning, segmented execution, and purification) to make this approach viable on NISQ-era hardware.
Strengths
My evaluation identifies the following genuinely novel contributions:
-
Core Algorithmic Paradigm Shift: The primary strength and most significant novel contribution is the conceptual reversal from a "shrinking" to an "expanding" search space. To my knowledge, existing VQAs for this problem class, such as penalty-based QAOA or commute-Hamiltonian QAOA (like Choco-Q [43]), operate by either penalizing infeasible states or designing mixers that confine the evolution within the feasible subspace. Rasengan's approach of structured, generative exploration from a single point using the algebraic properties of the constraints is a fundamentally new and distinct VQA design pattern. This is clearly articulated in Figure 2 (page 4).
-
The "Transition Hamiltonian" Construct: The specific formulation of the transition Hamiltonian H(u) in Section 3.1 (Equation 7, page 5) is a novel piece of technical machinery. It translates a vector from the constraint's null space (u) directly into a quantum operator that performs a transition between two specific feasible solutions. This creates a direct, programmable link between the algebraic structure of the problem and the quantum evolution, which is a more tailored approach than the generic mixers used in standard QAOA.
Decoupling of Objective Function from Quantum Evolution: A subtle but profound novelty is that Rasengan’s quantum circuit is entirely agnostic to the objective function. Standard QAOA requires encoding the objective function into a (potentially complex) phase-separating Hamiltonian. Here, the quantum evolution's sole purpose is to navigate the feasible space. The objective function is evaluated purely classically on the measurement outcomes. This offloading of complexity from the quantum to the classical component represents a new way of structuring a hybrid quantum-classical workflow for optimization.
Weaknesses
My analysis identifies areas where the novelty is either overstated or not sufficiently demarcated from existing concepts:
-
Limited Novelty of Optimization Techniques in Isolation: The three optimization techniques presented in Section 4 (page 6) are novel in their specific integration but are conceptually derivative of prior art.
- Segmented Execution (Section 4.2, page 7): The idea of breaking a deep circuit into shallower ones is well-established in the field, often under names like "circuit cutting" or in the context of dynamic circuits. The paper does not adequately compare its "probability-preserving" approach to this body of work to clearly isolate its novel contribution.
- Error Mitigation by Purification (Section 4.3, page 7): This is a form of post-selection based on constraint satisfaction. Post-selection is a standard, albeit costly, error mitigation technique. Its application between segments is a specific implementation choice, but the underlying concept is not new.
- Hamiltonian Simplification (Section 4.1, page 6): The greedy search for a sparser basis for the null space is a classical heuristic. While its application to reducing quantum circuit complexity is clever, the novelty lies in the application, not the technique itself.
-
Dependence on Classical Pre-Processing: The algorithm's novelty is purely in the quantum search phase. The entire framework is predicated on the ability to classically and efficiently find an initial feasible solution and the homogeneous basis ({u}). For problems where this classical pre-computation is itself NP-hard, the practical novelty of the quantum speedup is diminished. The paper should more clearly define the boundary of its contribution, acknowledging this reliance.
Constrained Scope of Novelty: The formulation of the transition Hamiltonian is explicitly tied to linear equality constraints (Cx = b). The paper makes no claims about extending this core novel idea to problems with non-linear or inequality constraints, thus limiting the generality of the proposed paradigm.
Questions to Address In Rebuttal
The authors should use the rebuttal to clarify the precise boundaries of their novel contributions:
-
Regarding "Segmented Execution" (Section 4.2, page 7): Can you elaborate on the key difference between your proposed method and established circuit cutting techniques? Is the primary novelty simply the dynamic re-allocation of shots based on intermediate probabilities, or is there a more fundamental distinction?
-
The decoupling of the objective function from the quantum circuit is a significant architectural change from QAOA. To your knowledge, is Rasengan the first VQA for optimization that uses the quantum evolution solely for state space navigation while leaving the objective evaluation entirely classical? Please position this contribution with respect to prior art.
-
The proposed method is applicable to problems with linear constraints. How would the core novel idea—the construction of a "transition Hamiltonian" from the problem structure—be formulated for problems with non-linear constraints? Does the paradigm break down, or is there a conceivable path to generalization?
-
The complexity of finding the homogeneous basis is a classical pre-processing cost. For which important problem classes is this step guaranteed to be polynomial, and are there classes where it is not? How does this affect the overall novelty of the quantum algorithm's contribution to solving the end-to-end problem?
Chasoň: Supporting Cross HBM Channel Data Migration to Enable Efficient Sparse Algebraic Acceleration
Abstract
High bandwidth memory (HBM) equipped sparse accelerators are emerging as a new class of accelerators that offer concurrent accesses to data and parallel execution to mitigate the memory bound behavior of sparse kernels. However, because of their ...
Reviews
Review 1
Review Form:
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper proposes Chasoň, an HBM-based streaming accelerator for sparse matrix-vector multiplication (SpMV). The central contribution is a novel scheduling scheme, Cross-HBM Channel out-of-order Scheduling (CrHCS), designed to mitigate the processing element (PE) underutilization found in prior work like Serpens. CrHCS enables the migration of non-zero data elements across HBM channels to fill pipeline stalls that would otherwise occur. The authors implement Chasoň on an AMD Alveo U55C FPGA and evaluate it against Serpens, high-end GPUs, and a CPU, claiming significant improvements in performance and energy efficiency stemming from improved PE utilization.
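As context for the questions below, the kind of statistic the first question asks for can be estimated even with a toy model. The sketch below uses synthetic Bernoulli work streams and ignores RAW-dependency checks entirely, so it illustrates only the shape of the measurement, not Chasoň's actual behavior; all parameters are hypothetical.

```python
# Toy estimate of how often a stalled slot in one channel's stream could be filled
# by pending work from the next channel. Streams are synthetic Bernoulli sequences
# and RAW checks are ignored; this is not a model of Chasoň itself.
import random

def stall_fill_fraction(cycles=100_000, p_work_i=0.3, p_work_next=0.3, seed=0):
    rng = random.Random(seed)
    stalls = filled = 0
    backlog_next = 0                       # elements the next channel has queued up
    for _ in range(cycles):
        if rng.random() < p_work_next:     # next channel receives a new element
            backlog_next += 1
        if rng.random() < p_work_i:        # this channel has its own work this cycle
            continue
        stalls += 1                        # otherwise the PE group would stall
        if backlog_next > 0:               # borrow one element from the neighbor
            backlog_next -= 1
            filled += 1
    return filled / stalls if stalls else 0.0

for p in (0.1, 0.3, 0.5, 0.7):
    frac = stall_fill_fraction(p_work_i=p, p_work_next=p)
    print(f"p_work={p}: fraction of stalls fillable ≈ {frac:.2f}")
```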
Strengths
- The paper correctly identifies a significant and well-known limitation in state-of-the-art HBM-based sparse accelerators: PE stalls due to workload imbalance across rows mapped to a single HBM channel. The motivation presented in Section 2 is clear and logically sound.
- The proposed solution, migrating non-zeroes between channels, is a conceptually logical extension to the existing intra-channel scheduling schemes.
- The experimental evaluation is broad, utilizing a large set of 800 matrices and comparing against multiple relevant platforms (FPGA, GPU, CPU).
Weaknesses
My primary concerns with this work center on the validity of its core assumptions, the fairness of its experimental comparison, and the analysis of its performance claims.
-
Unsubstantiated Core Scheduling Assumption: The entire efficacy of CrHCS rests on a critically optimistic and unproven assumption. In Section 3.3, the authors state, "Our experiments have shown that CrHCS never fails to find a RAW dependency-free value to migrate." This is an exceptionally strong claim presented without any supporting data, theoretical proof, or statistical analysis. The motivational example in Figure 2c conveniently shows a perfect pipeline with 0% underutilization, which is predicated on the constant availability of suitable non-zero candidates from the neighboring channel. This appears to be a best-case scenario presented as a general capability. The paper fails to characterize the conditions under which a suitable candidate would not be available, which would break the core premise of the work.
-
Flawed and Unfair Baseline Comparison: The comparison to the primary baseline, Serpens, is questionable.
- The authors report their implementation of Serpens achieves only 223MHz, while Chasoň achieves 301MHz on the same U55C platform (Section 5.2). A ~35% frequency advantage for Chasoň is a major confounding variable. The authors attribute this to "reduced logic congestion," but this is a vague justification. A more rigorous analysis is required to prove that their port of Serpens is faithful and optimized, and not simply a poor implementation used to inflate Chasoň's relative performance.
- The resource comparison in Table 1 shows that Chasoň consumes significantly more resources than Serpens (+58% LUT, +66% FF, +57% DSP). The authors dismiss the need for an iso-area comparison by asserting that Serpens could not benefit from more PEs (Section 4.5). This is an unsubstantiated claim and sidesteps a fundamental requirement for fair hardware accelerator comparisons. The authors are comparing a larger, more complex, and higher-frequency design to a smaller, simpler, and slower one. The performance gains are therefore not solely attributable to the CrHCS scheduling algorithm.
-
Misleading Attribution of Performance Gains: The paper attributes its speedup over Serpens to "data transfer reduction" (Figure 15, Section 6.2). This is a confusing and misleading framing. Both Chasoň and Serpens must process the exact same number of non-zero elements. The issue with Serpens is not that it "transfers zeros," but that its pipeline stalls for lack of available work. Chasoň avoids these stalls by transferring useful non-zero data from another channel. The gain comes from increased temporal utilization of the datapath (i.e., stall reduction), not from reducing the total volume of data transferred from HBM. This mischaracterization obscures the true mechanism of improvement and suggests a misunderstanding of the performance dynamics. A simple decomposition of the measured speedup into a clock-frequency factor and a pipeline-occupancy factor, sketched after this list, would make the true source of the gains explicit.
-
Introduction of New, Unanalyzed Bottlenecks: The architectural components required to support CrHCS (Router, Reduction Unit, Re-order Unit) introduce non-trivial latency and complexity. The authors' own analysis provides evidence of this. On page 12, they note that the C5 matrix, despite having a larger "data transfer reduction" factor, achieves lower speedup than the MY matrix because its larger column dimension creates contention and latency in the ScUGs and Reduction Unit. This is a critical finding: the overhead of the proposed architecture can itself become the primary performance bottleneck, potentially negating the benefits of CrHCS. This trade-off is not systematically explored or quantified.
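To make the requested attribution concrete, the following sketch separates an observed speedup into a clock-frequency factor and a pipeline-occupancy factor. The frequencies are the ones quoted in this review (223 MHz and 301 MHz); the utilization values are placeholders, since the per-matrix figures would have to come from the authors.

```python
# Decomposes an observed speedup into a clock-frequency factor and a
# pipeline-occupancy factor. Frequencies are those quoted in this review;
# utilization values are placeholders.
def decompose_speedup(f_base_mhz, f_new_mhz, util_base, util_new):
    freq_factor = f_new_mhz / f_base_mhz        # gain attributable to the higher clock alone
    occupancy_factor = util_new / util_base     # gain attributable to fewer stall cycles
    return freq_factor, occupancy_factor, freq_factor * occupancy_factor

freq, occ, total = decompose_speedup(f_base_mhz=223, f_new_mhz=301,
                                     util_base=0.30, util_new=0.95)
print(f"frequency factor: {freq:.2f}x, occupancy factor: {occ:.2f}x, combined: {total:.2f}x")
```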
Questions to Address In Rebuttal
-
Regarding the core assumption of CrHCS: Please provide statistical data from your 800-matrix evaluation that quantifies the probability of finding a RAW-dependency-free non-zero element in the neighboring channel for any given stall cycle. How does this probability change with matrix structure and sparsity? On what basis can the authors claim this "never fails"?
-
Regarding the baseline comparison: Please provide a detailed justification for the 35% frequency advantage over your implementation of Serpens. Furthermore, please provide a detailed argument for why a rigorous iso-resource comparison is not necessary, or alternatively, provide such a comparison. For instance, what would the performance of Serpens be if it were scaled to consume the same resources as Chasoň?
-
Regarding performance attribution: Please clarify the "data transfer reduction" claim. Is it not more accurate to state that the improvement comes from reducing pipeline stall cycles? Please re-frame the argument in terms of pipeline occupancy or throughput.
-
Regarding new bottlenecks: The paper admits that the reduction and re-ordering logic can become a bottleneck (C5 vs. MY matrix example). Please provide a sensitivity analysis that characterizes how the latency of these new architectural units impacts the overall speedup as a function of matrix properties (e.g., number of columns, non-zero distribution). At what point do the overheads of Chasoň's architecture outweigh the benefits of stall reduction?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces Chasoň, an HBM-based streaming accelerator for Sparse Matrix-Vector Multiplication (SpMV). The central contribution is a novel out-of-order scheduling scheme, "Cross-HBM Channel OoO Scheduling" (CrHCS), which enables dynamic data migration across HBM channels. The authors identify a key limitation in state-of-the-art accelerators like Serpens: while they parallelize work across HBM channels, their scheduling is confined within a channel ("intra-channel"). This leads to significant PE underutilization when the matrix rows assigned to a particular channel are sparse.
Chasoň's CrHCS addresses this by allowing a PE group associated with one channel to "borrow" non-zero elements from a neighboring channel's data stream, effectively filling its own pipeline stalls. The paper presents the necessary architectural support for this scheme, including data tagging, a routing mechanism within the PEs, and dedicated on-chip memory structures to manage results from both "private" and "shared" (migrated) data. The authors implement Chasoň on an AMD Alveo U55C FPGA and demonstrate significant improvements in PE utilization, leading to substantial performance and energy efficiency gains over Serpens, as well as modern GPU and CPU platforms.
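To make the data-tagging idea concrete, here is an illustrative packing of a tagged non-zero element. The reviews establish only that each 64-bit element carries pvt and PE_src flags (see the implementation strength below); the remaining field names, the field widths, and the bit layout chosen here are assumptions for illustration.

```python
# Illustrative packing of a tagged non-zero element. Only the pvt and PE_src flags
# and the 64-bit width come from the reviews; all widths and the layout are assumed.
from dataclasses import dataclass
import struct

@dataclass
class TaggedNonZero:
    row: int      # local row index within the PE group (width assumed: 10 bits)
    col: int      # column index into the dense vector  (width assumed: 17 bits)
    value: float  # non-zero value, assumed fp32 in the low 32 bits
    pvt: bool     # True if the element belongs to this channel's own rows
    pe_src: int   # id of the PE that should receive the partial sum (assumed: 4 bits)

def pack(e: TaggedNonZero) -> int:
    """Pack into a 64-bit word under the assumed layout: value | pe_src | pvt | col | row."""
    val_bits = struct.unpack("<I", struct.pack("<f", e.value))[0]
    word = val_bits                                # bits [31:0]
    word |= (e.pe_src & 0xF) << 32                 # bits [35:32]
    word |= (int(e.pvt) & 0x1) << 36               # bit  [36]
    word |= (e.col & 0x1FFFF) << 37                # bits [53:37]
    word |= (e.row & 0x3FF) << 54                  # bits [63:54]
    return word

example = TaggedNonZero(row=5, col=1234, value=0.75, pvt=False, pe_src=3)
print(hex(pack(example)))
```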
Strengths
This is a strong paper with a clear, compelling, and well-executed core idea.
-
Excellent Problem Formulation and Motivation: The paper does an outstanding job of contextualizing its contribution. It clearly explains the evolution from row-based to PE-aware scheduling (Section 2.2, pages 2-3) and uses a simple, effective diagram (Figure 2, page 3) to illustrate the limitations of the current state-of-the-art. The motivation is powerfully reinforced by Figure 3 (page 4), which quantifies the PE underutilization problem across a large dataset, showing that ~70% underutilization is common. This makes the need for a better solution undeniable.
-
A Novel and Elegant Core Concept: The idea of inter-channel data migration is a genuinely new perspective in the design space of HBM-based sparse accelerators. The prevailing wisdom has been to treat the HBM channels as independent data streams feeding isolated processing element groups (PEGs). Chasoň cleverly breaks this isolation to solve the fundamental problem of load imbalance. This is a microarchitectural solution to a classic parallel computing challenge, and it feels both innovative and intuitive. It's a natural evolution of out-of-order principles applied at a coarser, inter-channel granularity.
-
Thorough and Credible Implementation: This is not merely a conceptual or simulation-based study. The authors provide a detailed architectural design (Section 4, pages 5-7) and a full implementation on a modern FPGA platform. The design considers the practical details required to make CrHCS work correctly, such as the pvt and PE_src flags for data provenance and the Rearrange Unit to ensure final results are correctly ordered. Achieving a 301MHz clock frequency on the Alveo U55C with this level of complexity lends significant credibility to the work.
Strong Connection to the Broader Landscape: The work is well-positioned within the literature. It correctly identifies Serpens [71], Sextans [72], and other HBM-based accelerators as its direct lineage. By building upon and significantly improving a known, open-source architecture (Serpens), the authors make their contribution easy to understand and appreciate. The principle of dynamically sharing work to smooth out irregularities connects to broader themes in parallel processing, systolic array design (e.g., [35]), and even processing-in-memory concepts where data locality and movement are paramount.
Weaknesses
The weaknesses are minor and relate more to exploring the full design space of the proposed idea rather than fundamental flaws in the presented work.
-
Limited Exploration of Migration Policies: The implementation restricts data migration to only the immediate next channel (as mentioned in Section 6.1, page 10). While justified by on-chip resource constraints, this feels somewhat arbitrary from a conceptual standpoint. A richer discussion on the design space of migration policies would strengthen the paper. For instance, what are the trade-offs of bidirectional migration (borrowing from channel i-1 and channel i+1), or a wider but shallower "neighborhood" of channels? This seems like a critical design parameter that is fixed here without much exploration.
Interplay with Static Data Partitioning: The effectiveness of the dynamic migration scheme is likely coupled with the initial static assignment of matrix rows to HBM channels. The paper does not discuss this relationship. For example, if a "poor" static partitioning creates a pattern where a sparse channel is always followed by another sparse channel, the "next-channel" policy would be ineffective. A brief analysis of this interplay would add depth.
-
Analysis of Latency Overhead: While the paper successfully argues that reducing stalls and data transfers yields a large net performance gain, the latency overhead of the new architectural units (Router, Reduction Unit, Reorder Unit) is not explicitly quantified. The authors state these units are "fundamental" (Section 6.2.2, page 12), but a more direct analysis—perhaps in terms of additional pipeline stages or critical path impact—would provide a more complete picture of the architectural trade-offs.
Questions to Address In Rebuttal
-
The current CrHCS implementation limits data migration to the immediate "next" channel in a unidirectional manner. Could the authors elaborate on the design trade-offs here? Would a bidirectional (i.e., borrowing from both previous and next channels) or a wider neighborhood migration scheme be feasible, and what would be the expected impact on resource usage (especially URAMs) and performance?
-
How sensitive is the proposed scheme to the initial static mapping of matrix rows to HBM channels? For example, if the workload distribution across channels is highly correlated (e.g., alternating blocks of dense and sparse regions), could the fixed "next channel" migration policy become a bottleneck?
-
The paper's core idea seems highly generalizable. Beyond the successful application to SpMV, have the authors considered its potential for other sparse kernels that are often accelerated on HBM platforms, such as SpMM (Sparse-Matrix Dense-Matrix Multiplication) or SpGEMM (Sparse-Matrix Sparse-Matrix Multiplication), where load imbalance across parallel units can be even more severe? (The authors briefly touch on SpMM in Section 7.1, but I would be interested to hear more on the potential challenges).
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper introduces "Chasoň," an HBM-based streaming accelerator for sparse matrix-vector multiplication (SpMV). The central claim to novelty lies in its non-zero scheduling scheme, "Cross-HBM Channel out-of-order Scheduling" (CrHCS). This scheme is designed to address the problem of processing element (PE) underutilization, which the authors identify as a key weakness in prior art like Serpens [71]. Where Serpens and similar works use intra-channel out-of-order scheduling, CrHCS extends this to inter-channel scheduling. This is achieved by allowing non-zero data elements scheduled for one HBM channel to be migrated to an adjacent channel to fill stalls in its data stream. To support this, the authors propose an architecture with specific hardware for routing, re-ordering, and reducing partial sums originating from these migrated, or "shared," data elements.
Strengths
-
Clear Articulation of the Novel "Delta": The paper does an excellent job of isolating its contribution relative to the immediate state-of-the-art. It correctly identifies that HBM-based accelerators like Serpens [71] and Sextans [72] treat HBM channels as independent data streams feeding dedicated PE groups. The core novel idea of Chasoň is to break this strict channel-to-PEG isolation at the data scheduling level. The shift from an intra-channel to an inter-channel scheduling scope is a specific and well-defined advancement.
-
Novel Application of a Known Concept: The underlying principle of CrHCS is a form of work stealing or dynamic load balancing—transferring work from a source with available tasks to an idle resource. While work stealing is a classic concept in parallel computing, its application in this specific context is novel. The authors have adapted this general principle to the unique constraints of a streaming HBM-FPGA architecture, where data is pre-scheduled into fixed-width streams. This is not a trivial application and represents a new approach for this class of accelerators.
-
Enabling Architectural and Data Structure Novelty: To realize the CrHCS scheme, the authors introduce necessary and novel modifications. The inclusion of the pvt (private) and PE_src (source PE) flags in the 64-bit data element (Section 3.2, page 4) is a simple but clever mechanism to handle the bookkeeping of migrated data without requiring complex control logic. The architectural additions—specifically the Router within the PE and the system-level Reduction and Re-order Units (Section 4, pages 5-7)—are a direct and logical consequence of the scheduling scheme and are new in their composition and purpose for this problem.
Weaknesses
-
Conceptual vs. Foundational Novelty: While the application of inter-channel data migration is novel for HBM-based sparse accelerators, the paper could be stronger by explicitly positioning CrHCS within the broader literature of work stealing and dynamic load balancing. The current framing presents the idea as wholly new, whereas its foundational concept is well-established. Acknowledging this and then detailing why their implementation is non-trivial and unique to the HBM architecture would more accurately place the contribution.
-
Incremental Nature of Architectural Components: The hardware units proposed to support CrHCS (Router, Reduction Unit, Re-order Unit) are, in isolation, standard digital design building blocks. The Router is a multiplexer-based structure, the Reduction Unit is an adder tree, and the Re-order Unit is a form of sorter or stream aligner. The novelty is not in the components themselves, but entirely in their synthesis to support the new scheduling algorithm. The paper is clear about their function, but it's important to recognize that the architectural novelty is compositional, not elemental.
-
Limited Scope of the Novel Mechanism: The implementation of CrHCS is constrained to migrating data only from the immediate next channel (e.g., channel i+1 to channel i). The authors justify this by citing on-chip memory limitations on the target FPGA (Section 6.1, page 10). While this is a practical engineering decision, it significantly limits the generality and novelty of the proposed data migration scheme. A truly novel interconnection scheme might support more flexible migration patterns (e.g., any-to-any, or from the two nearest neighbors). The current implementation is a 1D, unidirectional work-stealing approach, which is the simplest possible form of inter-resource cooperation.
Questions to Address In Rebuttal
-
On the Generality of CrHCS: The decision to migrate data only from the immediate next channel appears pragmatic but limits the solution's generality. Could the authors elaborate on the architectural complexity (e.g., interconnect, URAM scaling for partial sums, and Re-order Unit complexity) of a more flexible "any-to-any" or "bi-directional nearest-neighbor" channel migration scheme? What is the anticipated crossover point where the hardware overhead and control complexity would outweigh the benefits of filling more stalls?
-
Comparison to Classic Work Stealing: Please contrast CrHCS with classic work-stealing algorithms (e.g., Cilk). What specific constraints of the HBM/FPGA streaming architecture (e.g., fixed AXI widths, offline scheduling, lack of a shared task queue) prevent a more traditional implementation, and how does CrHCS's offline, stream-injection approach uniquely address them?
-
Overhead of the Novel Scheduler: The paper focuses on the hardware implementation, but CrHCS requires a more sophisticated offline scheduler compared to the simpler round-robin assignment in Serpens. What is the computational overhead of this new scheduling step? While this cost is likely amortized over many runs, a quantification would help assess if the novelty in scheduling introduces a new, non-trivial bottleneck in the overall workflow before execution can even begin.
A Probabilistic Perspective on Tiling Sparse Tensor Algebra
Abstract
Sparse tensor algebra computations are often memory-bound due to irregular access patterns and low arithmetic intensity. We present D2T2 (Data-Driven Tensor Tiling), a framework that optimizes static coordinate-space tiling schemes to minimize memory ...
Reviews
Review 1
Review Form:
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper presents D2T2, a framework for optimizing static tiling schemes for sparse tensor algebra. The core idea is to move beyond conservative, worst-case tiling by building a probabilistic model based on high-level statistics extracted from the input tensors. This model is used to predict memory traffic for various tile shapes and sizes, enabling a search for a traffic-optimal configuration that does not require specialized hardware. The authors evaluate D2T2 against state-of-the-art methods like Tailors and DRT, claiming significant performance improvements.
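For concreteness, the optimization loop described amounts to something like the sketch below (my own reconstruction; the statistics and the traffic model are crude stand-ins for the paper's probabilistic formulation):

```python
import numpy as np

def density_stats(m):
    """Toy stand-in for D2T2's statistics: shape plus overall density.  The real
    framework extracts much richer per-fiber statistics from the CSF format."""
    return {"shape": m.shape, "density": np.count_nonzero(m) / m.size}

def predicted_traffic(sa, sb, tile, elem_bytes=4):
    """Toy expected-traffic model for C = A @ B under a uniform, independent-density
    assumption: every output tile of shape (ti, tj) streams an A panel (ti x K) and a
    B panel (K x tj), so A is re-read J/tj times and B is re-read I/ti times.  The
    point is only the shape of the model (aspect ratio trades A re-fetches against
    B re-fetches), not its fidelity."""
    (I, K), (_, J) = sa["shape"], sb["shape"]
    ti, tj = tile
    a_panel = sa["density"] * ti * K * elem_bytes
    b_panel = sb["density"] * K * tj * elem_bytes
    return (I / ti) * (J / tj) * (a_panel + b_panel)

def search_tiling(a, b, candidate_tiles):
    """Pick the statically-applied tile shape with the lowest predicted traffic."""
    sa, sb = density_stats(a), density_stats(b)
    return min(candidate_tiles, key=lambda t: predicted_traffic(sa, sb, t))

rng = np.random.default_rng(0)
A = (rng.random((1024, 1024)) < 0.01) * 1.0
B = (rng.random((1024, 1024)) < 0.05) * 1.0
print(search_tiling(A, B, [(32, 512), (128, 128), (512, 32)]))
# -> (512, 32): tall tiles amortize re-reads of the denser operand B
```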
However, the work rests on a series of questionable assumptions regarding the statistical independence of tensor structures, and its experimental validation against prior art is methodologically compromised. While the goal is commendable, the rigor of the modeling and evaluation is insufficient to substantiate the paper's primary claims.
Strengths
- Hardware-Agnostic Approach: The primary conceptual strength is the aim to improve static tiling without mandating specialized hardware support (unlike DRT's dynamic aggregation or Tailors' overflow buffers). This makes the proposed technique potentially more generalizable.
- Data-Driven Philosophy: The fundamental motivation to use the statistical properties of the input data to inform tiling is sound. Moving away from worst-case (dense) assumptions is a necessary step for efficient sparse computation.
- Identification of a Correlation Proxy: The introduction of the Corrs statistic (Section 4.4, page 6) to model output reuse is a necessary attempt to patch the model's core assumption of independence. Figure 8 demonstrates a clear, if simplistic, relationship between this metric and optimal tile shape.
Weaknesses
-
Fundamentally Flawed Modeling Assumption: The entire probabilistic model is built on a demonstrably false premise. In Section 4.2 (page 5), the authors state: "We estimate this probability... by assuming that the nonzero structures of A and C are uncorrelated." This assumption of independence is incorrect for a vast majority of real-world sparse computations. For example, in graph analytics (A x A^T), the structures of the input and output are highly correlated. Banded matrices, diagonally-dominant systems, and tensors from physical simulations all exhibit strong structural correlations. The Corrs proxy introduced in Section 4.4 is a limited, pairwise correction that is insufficient to capture the full, complex nature of these dependencies. The authors' own validation in Section 5.3 (page 8) and Figure 5 reveals this weakness, where they admit that for the A x A^T kernel, "the tiles may be correlated, leading to under-estimation of valid intersections." A model that is systematically biased for such a common kernel is not robust; a toy illustration of this failure mode is sketched after this list.
-
Methodologically Unsound Comparison with DRT: The comparison against DRT in Section 6.2 (page 10) is invalid. The authors use two different backends for the evaluation: DRT's performance is measured using its dedicated simulator, while D2T2's is measured using a TACO-based backend. The justification provided is that "the DRT simulator was unable to map D2T2-generated configurations to micro-tiles for most matrices." This is not a justification; it is a critical confounding variable. The claim that the backends produce comparable results (within <5%) is only validated for the Conservative tiling scheme. This says nothing about their relative performance on the highly non-uniform, rectangular tiles that D2T2 generates. In fact, the DRT simulator's failure to map these tiles suggests a fundamental incompatibility in the execution models, making any direct comparison of traffic numbers meaningless. The reported 1.13x traffic reduction over DRT is therefore an unsubstantiated claim.
-
Insufficient Validation on Higher-Order Kernels: The evaluation of TTM and MTTKRP in Section 6.3 (page 10) is weak. While the paper uses real-world tensors (from Frostt, Facebook), the other operands are random matrices with uniform sparsity. The text states, "we use a random matrix with dimensions T3 × max (T1, T2)". Real-world tensor operations often involve operands that share structural properties. Using random matrices fails to test the model against the structured correlations that are common in these applications and which the model is likely to mis-predict.
-
Misleading Overhead Analysis: The overhead analysis in Section 6.5 (page 11) reports the overhead of statistics collection (9.3%) and optimization (7.9%) relative to the initial tiling time. Tiling is typically a very small fraction of the total end-to-end execution time. Framing the overhead in this manner minimizes its perceived cost. A more transparent analysis would present this overhead as a percentage of the total application runtime for a set of representative problems.
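As flagged in point 1, a toy illustration of the failure mode (my own construction, not the paper's model): for the diagonal output blocks of A x A^T, the two supposedly independent nonzero events are in fact the same event, so the independence estimate collapses.

```python
import numpy as np

rng = np.random.default_rng(0)
n, tile_k, p = 4096, 64, 0.05

# One k-tile worth of columns from a random sparsity pattern for A.  In A @ A^T the
# second operand's column j has exactly the same support as row j of A, so for the
# diagonal output blocks "A[i,k] nonzero" and "C[k,i] nonzero" are the same event.
A = rng.random((n, tile_k)) < p

empirical = float(np.mean(A.any(axis=1)))   # P(row i of A and column i of A^T intersect in the tile)
independent = 1 - (1 - p * p) ** tile_k     # what the uncorrelated model predicts

print(f"empirical={empirical:.3f}  independence estimate={independent:.3f}")
# With these parameters the independence estimate is several times too small,
# which is the "under-estimation of valid intersections" the authors acknowledge.
```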
Questions to Address In Rebuttal
-
Please provide a rigorous justification for the uncorrelation assumption that underpins your traffic model (Section 4.2). Given that many important sparse kernels (e.g., A x A^T) explicitly violate this assumption, how can the model be considered general? How does the simple pairwise Corrs statistic (Section 4.4) account for non-local or higher-order correlations present in real-world data?
-
How can the performance claims against DRT (Section 6.2) be considered valid when entirely different execution backends were used? Please provide evidence that the performance characteristics of the TACO and DRT backends are identical for the non-uniform, rectangular tiles generated by D2T2, not just for simple square tiles. Absent this evidence, the comparison must be removed.
-
In Figure 5, the model systematically underestimates traffic for A x A^T. Can the authors provide a more rigorous analysis of this model error and explain why relative accuracy is sufficient for optimization when the absolute error is significant and biased?
-
Why were random matrices used for the evaluation of TTM and MTTKRP in Table 4, rather than leveraging multiple real-world tensors which possess complex, correlated structures that would more rigorously test the model's assumptions?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
The authors present D2T2 (Data-Driven Tensor Tiling), a framework for generating data-aware, static tiling schemes for sparse tensor algebra computations. The core contribution is a probabilistic modeling approach that uses high-level statistics, extracted efficiently from the input tensors' compressed formats, to predict memory traffic for different tiling configurations. By modeling the probability of access and the expected size of tiled data, D2T2 searches for non-uniform tile shapes and sizes that minimize total memory movement. Crucially, this optimization is performed offline during a pre-processing step, allowing the resulting static tiling scheme to be executed on accelerators without requiring specialized dynamic tiling hardware. The paper demonstrates that this approach can achieve performance superior to state-of-the-art dynamic (DRT) and overflow-based (Tailors) tiling schemes, positioning it as a highly practical solution for a broad range of hardware.
Strengths
-
Elegant and Practical Core Idea: The central thesis—that a probabilistic model of data distribution can inform a superior static tiling scheme—is both powerful and pragmatic. It carves out a valuable and previously underexplored design point between two extremes: overly conservative static tiling (which ignores data properties) and complex hardware-intensive dynamic tiling (which is costly and less portable). This "software-first" approach to data-awareness is the paper's most significant strength.
-
Broad Deployability: By obviating the need for specialized hardware like dynamic tile aggregators (DRT) or overflow-capable buffers (Tailors), the D2T2 framework is immediately applicable to a wide array of existing and future sparse accelerators. The evaluation on the real-world Opal CGRA (Section 6.4, page 10) is a compelling demonstration of this, showing significant speedups on a hardware platform not co-designed with this specific tiling strategy in mind. This greatly enhances the potential impact of the work.
-
Strong Contextualization and Clear Problem Framing: The authors do an excellent job of positioning their work within the broader landscape of sparse tensor computation. The introduction clearly articulates the "utilization problem" and the limitations of prior art. The summary in Table 1 (page 3) is particularly effective, immediately clarifying the trade-offs between different tiling schemes and highlighting D2T2's unique value proposition.
-
Comprehensive Experimental Analysis: The evaluation is thorough. The authors not only compare against relevant state-of-the-art systems on their original benchmarks but also validate their predictive model's accuracy (Section 5.3, page 8), analyze the framework's runtime overhead (Section 6.5, page 11), and perform insightful ablation studies on the importance of different statistics (Section 6.7, page 12). This rigorous approach builds significant confidence in the results.
Weaknesses
-
Model Fidelity on Correlated Data: The model's primary simplification appears to be the assumption of uncorrelated inputs when estimating output traffic. The authors rightly note this leads to under-prediction in cases like A x A^T (Figure 5, page 9), where the input structures are perfectly correlated. While they argue the model still captures relative trends correctly, this is a fundamental limitation. For many real-world problems (e.g., graph analytics involving adjacency matrices and their transposes), such correlations are the norm, not the exception. The "Corrs" statistic is a clever proxy to patch this, but it feels like a heuristic layered on top of a model that doesn't natively handle this phenomenon.
-
Generalizability of the Statistical Framework: The paper introduces a set of key statistics (e.g., PrTileIdx, ProbIndex, Corrs). These appear highly effective for the matrix- and 3-tensor-based kernels evaluated. However, it is less clear how this specific set of statistics would generalize to more complex, higher-order tensor expressions with multiple contraction loops. The concept of Corrs as a 1D correlation over the contracted dimension (Section 4.4, page 6) is intuitive for matrix multiplication, but its extension to, for instance, a tensor-times-matrix operation with two shared indices might require a more sophisticated, multi-dimensional correlation model.
-
Exploration of the Tiling Search Space: The shape optimization is guided by varying a "Reorder Factor (RF)" that controls the tile aspect ratio while preserving area (Section 5.2, page 8). This is a reasonable and practical heuristic. However, the true optimization space of tile shapes is vast. A deeper discussion on why this 1D parameter sweep is a sufficient exploration would be beneficial. It's possible that for some tensor structures, more exotic tile shapes not discoverable through this method could yield even better results.
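For reference, my reading of what such an area-preserving sweep looks like; the exact RF parameterization in the paper may differ:

```python
def rf_sweep(base_ti, base_tj, max_rf=16):
    """Generate candidate tile shapes at (roughly) constant area by scaling the
    aspect ratio with a power-of-two Reorder Factor in both directions."""
    shapes = [(base_ti, base_tj)]
    rf = 2
    while rf <= max_rf:
        shapes.append((base_ti * rf, max(1, base_tj // rf)))   # wider along i
        shapes.append((max(1, base_ti // rf), base_tj * rf))   # wider along j
        rf *= 2
    return shapes

print(rf_sweep(128, 128, max_rf=4))
# -> [(128, 128), (256, 64), (64, 256), (512, 32), (32, 512)]
```

Because the sweep is a single 1D walk over aspect ratios, any tile shape outside this family is never considered, which is the concern raised above.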
Questions to Address In Rebuttal
-
On the Impact of Model Correlation: Regarding the model's under-prediction for correlated inputs (Section 5.3), could you elaborate on how this might affect tile shape selection for computations where output reuse is the dominant performance factor? Is it possible that the model, even while capturing the relative trend, might still erroneously prefer an outer-product-like shape (minimizing input re-fetches) when a square-like shape (maximizing output reuse) would have been globally optimal for that specific correlated input pair?
-
On the 'Sweet Spot' for This Approach: A key contribution is enabling data-awareness for static tiling. This implies a trade-off. Could you comment on the data distributions or problem types that define the "sweet spot" for D2T2? Conversely, are there specific sparsity structures (e.g., power-law graphs with extreme variance in degree) where the global statistics used by D2T2 might be misleading, making a fully dynamic, fine-grained approach like DRT inherently superior despite its hardware cost?
-
On the Future of Probabilistic Modeling: Your work successfully applies a probabilistic lens to tiling. Looking forward, how could this modeling framework be extended beyond tiling? For instance, could these same data statistics be used to guide other crucial compiler decisions, such as data layout transformations (e.g., choosing between CSF, COO, etc., on a per-tensor basis) or kernel fusion strategies for sparse computations?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents D2T2, a framework for optimizing static, coordinate-space tiling schemes for sparse tensor algebra computations. The core idea is to leverage a probabilistic model, built from high-level statistics extracted from the input tensors' compressed sparse fiber (CSF) format, to predict memory traffic for various tile shapes and sizes. The framework then searches for a configuration that minimizes this predicted traffic. The authors claim this approach provides a data-driven, static tiling solution that outperforms conservative static methods as well as more complex dynamic (DRT) and aggressive overflow-based (Tailors) schemes, without requiring specialized hardware support.
My analysis focuses exclusively on the novelty of this core idea. While performance modeling for compilers is not new, the specific formulation and application presented here contain genuinely novel elements.
Strengths
-
Novel Heuristic for Output Reuse: The most significant novel contribution is the formulation of the Corrs statistic (Section 4.4, page 6, Equation 11). The idea of estimating output data reuse by measuring the shifted self-intersection of fibers along a contracted dimension is a clever and computationally inexpensive proxy for an otherwise extremely complex phenomenon (a rough sketch of such a statistic appears after this list). To my knowledge, this specific analytical heuristic for guiding tile shape selection in sparse computations is new. Prior work, such as Cohen [6], focused on predicting the final nonzero count of a product, not on modeling traffic for intermediate tiled execution.
-
Identifies and Fills a Clear Gap in the Literature: The paper correctly identifies a gap between (1) overly conservative static tiling, (2) hardware-intensive dynamic tiling (DRT [26]), and (3) aggressive static tiling that requires specialized hardware for buffer overflows (Tailors [40]). D2T2 proposes a solution that resides in a novel design point: a data-aware static tiling scheme that does not require specialized hardware. This positioning itself represents a novel contribution to the design space of sparse tiling strategies.
-
Novel Application of Probabilistic Modeling to Tile Shape: While Tailors [40] uses statistical analysis of data occupancy, its primary goal is to optimize tile size by targeting a specific overflow rate. D2T2’s model is distinct in that it is used to evaluate the trade-offs of different tile shapes (i.e., aspect ratios), as shown in the Gustavson's Matmul example (Section 3, page 3) and the tile shape optimization logic (Section 5.2, page 8). The explicit modeling of how both input re-fetches and output reuse change with tile aspect ratio, guided by statistics like
Corrs, is a significant delta over prior art.
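As flagged in point 1, here is a rough rendering of the kind of shifted self-intersection statistic described, computed on a 0/1 sparsity pattern whose rows are indexed by the contracted dimension. This is my own reconstruction; the paper's Equation 11 may differ in detail.

```python
import numpy as np

def corrs_like(mask, max_shift=8):
    """Rough proxy for output reuse: how much do fibers at contracted positions k and
    k+s touch the same output coordinates?  `mask` is a 0/1 pattern whose rows are
    indexed by the contracted dimension.  High values mean consecutive contracted
    slices hit similar outputs, so tiles spanning more of that dimension reuse more."""
    m = mask.astype(bool)
    nnz = m.sum(axis=1)
    scores = []
    for s in range(1, max_shift + 1):
        overlap = (m[:-s] & m[s:]).sum(axis=1)              # fiber k vs fiber k+s
        denom = np.maximum(1, np.minimum(nnz[:-s], nnz[s:]))
        scores.append(float((overlap / denom).mean()))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
band = (np.eye(256) + np.eye(256, k=1) + np.eye(256, k=-1)) > 0
rand = rng.random((256, 256)) < band.mean()
print(corrs_like(band), corrs_like(rand))   # the structured pattern scores much higher
```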
Weaknesses
While the core idea is novel, its foundation rests on assumptions and simplifications whose novelty and robustness could be better defended.
-
Reliance on a Standard Independence Assumption: The probabilistic model simplifies the calculation of joint probabilities by assuming that the nonzero structures of different input tensors are uncorrelated (Section 4.2, page 5). This is a common assumption in the field and therefore not novel; however, it is also a known weakness. The model's predictive power is likely degraded in cases of high correlation, such as the A * A^T computation, which is common in graph analytics. The authors show in Figure 5a (page 9) that the model can have high absolute error in these cases, yet they claim it preserves relative trends. The novelty of the work would be strengthened by a more formal analysis of why the relative accuracy is maintained despite the foundational assumption being violated.
-
Limited Novelty in Higher-Order Kernel Modeling: The paper's novel modeling concepts are most thoroughly developed and validated for SpMSpM. The extension to higher-order kernels like TTM and MTTKRP is presented primarily through experimental results (Table 4, page 10). It is unclear how the core novel idea, the Corrs statistic, is generalized to computations with multiple contracted indices or more complex data dependencies. If the extension is a straightforward application, its novelty is limited. If it required significant new modeling insights, those are not detailed, leaving the novelty of this aspect of the work unsubstantiated.
Questions to Address In Rebuttal
-
The model's utility hinges on its ability to make correct relative comparisons between tile shapes, even when absolute traffic predictions are inaccurate (as seen for A*A^T in Figure 5). Could the authors provide a more rigorous explanation for why the model's ranking of tile configurations remains robust even when the core assumption of tensor independence is strongly violated? Is there an underlying structural property of the model that makes it resilient in this way?
-
Regarding the higher-order tensor kernels (TTM, MTTKRP), could the authors elaborate on how the novel Corrs statistic, or a conceptual equivalent, is applied? For example, in MTTKRP, reductions occur over multiple modes. How is output reuse modeled in this context? Is the model simply a product of pairwise Corrs-like terms, or does it involve a more novel, higher-order correlation statistic?
-
The statistics used by the model, such as PrTileIdx and ProbIndex, are averaged over all tiles. For tensors with highly non-uniform distributions (e.g., a small dense block in a large sparse matrix), such averaging might obscure critical local characteristics. Does this lead to suboptimal tiling choices? Was a more localized statistical model considered, and if so, what was the trade-off in complexity versus benefit?
Bootes: Boosting the Efficiency of Sparse Accelerators Using Spectral Clustering
Abstract
Sparse matrix-matrix multiplication (SpGEMM) is crucial in many applications, with numerous recent efforts focused on optimizing it. The row-wise product has emerged as a favorable SpGEMM dataflow due to its balanced performance, but it alone is ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper proposes Bootes, a preprocessing framework for sparse matrix-matrix multiplication (SpGEMM) aimed at reducing off-chip memory traffic for row-wise product dataflows. The approach consists of two main components: (1) a row reordering technique based on spectral clustering to group rows with similar column access patterns, and (2) a decision tree model to perform a cost-benefit analysis, determining if reordering is beneficial and selecting the number of clusters (k). The authors claim that Bootes significantly reduces preprocessing time and memory footprint compared to prior reordering techniques (Gamma, Graph, Hier) while achieving superior memory traffic reduction on several state-of-the-art accelerator models.
However, the work suffers from several methodological weaknesses and overstated claims that undermine its conclusions. The claims of optimality are unsubstantiated, the novelty of the core algorithm implementation is questionable, and the evaluation of the end-to-end impact is fundamentally flawed, ignoring critical use cases for SpGEMM.
Strengths
- The paper addresses a well-recognized and important problem: the high cost of preprocessing for sparse matrix reordering and the need for a cost-benefit analysis to justify its application.
- The authors evaluate their technique against multiple state-of-the-art accelerator architectures (Flexagon, GAMMA, Trapezoid), which provides a broader context for the claimed traffic reductions than a single-point design.
- The use of a decision tree to automate the parameter selection (k) and the reordering decision is a reasonable direction, moving beyond heuristics that require manual tuning.
Weaknesses
-
The Claim of "Optimal" Reordering is Unsubstantiated: The abstract repeatedly uses strong terms like "optimally reorder" and "maximize reuse." Spectral clustering is a heuristic that finds good, but not provably optimal, graph partitions. There is no theoretical proof or empirical evidence provided to suggest that the partitioning found by this method is optimal for minimizing memory traffic in a row-wise SpGEMM dataflow with finite cache constraints. This represents a significant overstatement of the method's capabilities.
-
Novelty of the "Optimized Implementation" is Misleading: In Section 3.1.2 (Page 7), the authors describe their optimizations. However, Algorithm 4 and the surrounding text make it clear that the implementation relies on standard, highly-optimized libraries like SciPy (scipy.sparse.linalg.eigsh) and Scikit-learn (sklearn.cluster.KMeans). While using efficient libraries is good engineering, presenting this as a novel research contribution to algorithm optimization is questionable. The core contribution is the application of these standard tools, not the creation of a new, faster spectral clustering algorithm; the entire pipeline reduces to a few library calls, as sketched after this list.
-
The Decision Tree's Robustness and Evaluation are Superficial:
- The "88% accuracy" claim in Section 5.1 (Page 9) is ambiguous. Does this refer to the binary classification accuracy (reorder vs. no-reorder) or the accuracy of selecting the exact best k value? These are different tasks with different implications. A clear confusion matrix or a more detailed breakdown of performance is required.
- The 10% memory traffic reduction threshold for deciding to reorder is arbitrary and lacks justification. Why not 5% or 15%? This critical hyperparameter of the decision model itself is not rigorously analyzed.
- The model was trained on a curated dataset from SuiteSparse and SNAP. There is no analysis of its generalizability to out-of-distribution matrices with fundamentally different sparsity structures, which is a common failure mode for such learned models.
-
Scalability Analysis Lacks Rigor:
- The time complexity analysis in Table 2 (Page 5) hinges on g, the average number of nonzeros per row in the similarity matrix S = A * A^T. The authors state g is "typically small" but fail to analyze its behavior. For matrices with even a few dense columns, S can become catastrophically dense, invalidating the claims of efficiency and low memory footprint. A worst-case analysis is missing.
- In the scalability plot (Figure 5, Page 11), the x-axis is labeled "Matrix Size," which is ambiguous. It should be clarified whether this refers to the number of rows, columns, nonzeros, or the product of dimensions. Furthermore, using a log scale for preprocessing time can mask super-linear behavior. The claim of superior scalability requires a more transparent presentation.
-
The End-to-End Speedup Analysis is Fundamentally Flawed: This is the most critical weakness. The authors state on Page 12, "the row reordering has a minimal impact on the multiplication phase execution time." This means the end-to-end speedup reported in Figure 6 is almost entirely due to faster preprocessing. This analysis is only valid for a single, non-iterative SpGEMM computation. In many real-world applications (e.g., scientific computing, graph algorithms), the preprocessing cost is amortized over hundreds or thousands of subsequent computations using the same reordered matrix. In such a scenario, a method with a higher preprocessing cost but better traffic reduction (and thus potentially faster kernel execution) would be superior. The paper completely ignores this amortization context, rendering its end-to-end speedup claims misleading for a large class of important applications.
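As noted in point 2, the pipeline in question reduces to a handful of library calls on the row-similarity graph. The sketch below is my own reconstruction for reference (the Laplacian normalization, eigenvector count, and eigensolver mode are assumptions), not Algorithm 4 itself:

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh
from sklearn.cluster import KMeans

def spectral_row_reorder(A: sp.csr_matrix, k: int) -> np.ndarray:
    """Return a row permutation of A that groups rows with similar column support."""
    pattern = A.copy()
    pattern.data[:] = 1.0                      # work on the 0/1 sparsity pattern
    S = (pattern @ pattern.T).tocsr()          # S[i, j] = number of shared column coordinates
    S.setdiag(0)
    L = laplacian(S, normed=True)
    # Eigenvectors for the smallest eigenvalues of the normalized Laplacian embed the rows.
    _, vecs = eigsh(L, k=k, which="SA")
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vecs)
    return np.argsort(labels, kind="stable")   # rows of the same cluster become adjacent

# Usage: A_reordered = A[spectral_row_reorder(A, k=8), :]
```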
Questions to Address In Rebuttal
-
Please provide a justification for the term "optimal" reordering. Either provide theoretical proof that spectral clustering yields an optimal row permutation for this problem or revise the claims to reflect the heuristic nature of the algorithm (e.g., "effective," "high-quality").
-
Clarify the novelty of the "optimized implementation" of spectral clustering. Is the contribution the application of existing libraries, or did the authors develop novel algorithmic enhancements to the standard spectral clustering pipeline?
-
Regarding the decision tree:
- Provide a precise definition of the "88% accuracy" metric.
- Justify the choice of the 10% traffic reduction threshold. How sensitive are the final results to this value?
- Discuss the potential limitations of the model when faced with matrices whose sparsity patterns are not represented in the training set.
-
Regarding scalability (Figure 5):
- Define "Matrix Size" on the x-axis explicitly.
- Provide an analysis or at least empirical data on how the density of the similarity matrix (g) scales with the properties of the input matrix A, especially in challenging cases.
-
The end-to-end speedup analysis must be revisited. Provide an analysis that considers the amortization of preprocessing costs. For instance, show how the total time (preprocessing + N * kernel_time) for Bootes compares to baselines as N (the number of multiplications) increases. At what value of N do the benefits of Bootes' faster preprocessing become less significant than the kernel-time improvements from other methods, if any?
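The arithmetic requested in question 5 is simple to state up front; the numbers in the example are placeholders, not measurements:

```python
def break_even_runs(pre_a, ker_a, pre_b, ker_b):
    """Number of SpGEMM executions N at which method A's total time
    (pre_a + N * ker_a) first drops below method B's (pre_b + N * ker_b).
    Returns None if A never catches up (A's kernel is not faster)."""
    if ker_a >= ker_b:
        return None
    return (pre_a - pre_b) / (ker_b - ker_a)

# Placeholder example: a slower-to-preprocess method whose reordering yields a
# faster kernel overtakes a fast-preprocessing method after roughly N runs.
print(break_even_runs(pre_a=12.0, ker_a=0.8, pre_b=1.0, ker_b=0.9))  # -> 110.0
```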
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
The paper introduces Bootes, a novel preprocessing framework designed to enhance the efficiency of sparse matrix-matrix multiplication (SpGEMM) on modern accelerators that utilize a row-wise product dataflow. The authors identify a key limitation in existing systems: irregular memory access to the second matrix (B) severely degrades performance by reducing data reuse and increasing off-chip traffic.
The core contribution is twofold:
1. It proposes using spectral clustering, a powerful graph-theoretic technique, to reorder the rows of the first matrix (A). This approach differs from prior heuristic and greedy methods by capturing the global structure of the matrix's sparsity pattern, leading to a more optimal grouping of rows with similar column access patterns.
2. It incorporates a cost-benefit analysis via a lightweight decision tree model. This model intelligently predicts whether reordering will be beneficial for a given matrix and hardware target, and if so, selects the optimal number of clusters (k), effectively creating a practical, automated, and hardware-aware optimization framework.
The authors demonstrate through extensive simulation on three state-of-the-art accelerator designs (Flexagon, GAMMA, Trapezoid) that Bootes significantly reduces off-chip memory traffic (up to 2.31x) and substantially accelerates the preprocessing stage itself (geomean of 11.61x) compared to previous reordering techniques.
Strengths
This is a strong paper with a compelling central idea that is well-executed and evaluated.
-
Conceptual Leap in Reordering Strategy: The primary strength of this work lies in its reframing of the row reordering problem. Instead of relying on local or greedy heuristics as seen in prior work (e.g., GAMMA's windowed approach or the graph-based nearest-neighbor traversal discussed in Section 2.2), Bootes applies a principled, global optimization method. Spectral clustering, which uses the eigenvectors of the graph Laplacian, is fundamentally designed to find optimal "cuts" in a graph. By mapping the reordering task to this domain, the authors leverage a powerful mathematical tool that considers the entire matrix structure simultaneously. The visual results in Figure 2 (e-i) provide a clear and intuitive validation of this method's superiority.
-
A Holistic and Practical Framework: The inclusion of the decision tree is a mark of mature, systems-oriented thinking. Many papers propose an optimization without thoroughly considering its overhead or applicability. Bootes, by contrast, explicitly models the cost-benefit trade-off. This "meta-optimization" layer, which decides if and how to apply the core algorithm, makes the entire system more robust and practical for real-world deployment where matrices have diverse characteristics. It elevates the work from a single algorithm to a complete, intelligent framework.
-
Generalizability and Robust Evaluation: The authors' decision to evaluate Bootes on three distinct accelerator architectures with different cache sizes and PE counts is commendable. The consistent memory traffic reduction shown in Figure 4 demonstrates that the benefits of the proposed reordering are fundamental and not tied to a single microarchitectural design. This significantly strengthens the claim that Bootes is a widely applicable technique for the entire class of row-wise SpGEMM accelerators.
-
Excellent Problem Contextualization: The paper does an admirable job in Section 2 of situating its contribution within the landscape of existing reordering techniques. The clear explanation and analysis of the limitations of Gamma, Graph, and Hierarchical clustering methods effectively motivate the need for a new approach and provide a solid foundation for comparison.
Weaknesses
The weaknesses are minor and relate more to missed opportunities for deeper contextualization rather than fundamental flaws.
-
Limited Positioning within the Machine Learning Literature: While the paper expertly applies spectral clustering, it could benefit from a brief discussion on why this specific clustering method is uniquely suited for this problem compared to other advanced, non-linear clustering techniques (e.g., DBSCAN, affinity propagation). The core justification for spectral clustering lies in its connection to graph partitioning, but explicitly stating this and contrasting it with other methods would further solidify the novelty and principled nature of the choice.
-
Amortization of Preprocessing Overhead: The paper convincingly shows that Bootes' preprocessing is much faster than its competitors (Figure 5). However, preprocessing is still an overhead that must be justified. The paper would be strengthened by a more direct analysis of the amortization cost. For instance, for a given matrix, how many SpGEMM executions are required for the energy/time saved from reduced memory traffic to outweigh the initial cost of running Bootes? This would provide clearer guidance on the application scenarios (e.g., iterative solvers, GNN training) where Bootes is most impactful.
-
Black-Box Nature of the Decision Tree: The features used for the decision tree are listed in Section 3.2, but the intuition behind their predictive power is somewhat brief. A more detailed analysis or an ablation study on feature importance could provide valuable insights into what structural properties of a sparse matrix make it a good or bad candidate for reordering. This would add another layer of scientific understanding to the work.
Questions to Address In Rebuttal
-
The choice of spectral clustering is the cornerstone of your work. Could you elaborate on why this method is fundamentally better suited for capturing the global structure of matrix access patterns compared to other global clustering algorithms? Did you consider any alternatives, and if so, how did they perform?
-
Regarding the practical application of Bootes, can you provide a quantitative estimate for the "break-even" point? For example, for a representative matrix from your workload set, how many SpGEMM computations are needed to fully amortize the preprocessing time and energy overhead of your method?
-
In your implementation (Algorithm 4), the number of eigenvectors computed is the same as the number of clusters (k) for k-means. Is this linkage necessary, or could performance be further improved by decoupling these two parameters (e.g., using more eigenvectors to create a richer embedding space before clustering into a smaller number of final groups)?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper proposes "Bootes," a preprocessing framework for SpGEMM accelerators that use a row-wise product dataflow. The stated goal is to reduce off-chip memory traffic by reordering the rows of the first input matrix (A) to improve data reuse for the second matrix (B). The authors identify three core contributions: (1) the application of spectral clustering for the row reordering task, (2) an optimized implementation of this algorithm to reduce its overhead, and (3) a decision tree model to determine if reordering is beneficial and to select the optimal hyperparameter (k) for a given matrix and target accelerator. The method is evaluated on three simulated state-of-the-art accelerators, showing significant reductions in memory traffic and preprocessing time compared to prior reordering techniques.
Strengths
-
The Decision Tree Component: The integration of a cost-benefit analysis via a decision tree (Section 3.2, page 7) is the most genuinely novel contribution of this work. Prior art, with the exception of [27], often applies reordering indiscriminately. Bootes's ability to predict not only if reordering is beneficial but also the optimal cluster count (k) for a specific hardware target addresses a critical and practical gap in the field. This elevates the work from a simple algorithm replacement to a more intelligent, adaptive framework.
-
Effective Implementation of a Known Algorithm: While spectral clustering is a well-established algorithm, its computational expense is a known barrier. The authors have clearly engineered an efficient implementation that leverages the inherent sparsity of the problem (Section 3.1.2, page 7). The results in Figure 5 (page 11), which show Bootes having a lower preprocessing time than simpler greedy and hierarchical methods, are impressive and demonstrate significant engineering novelty in adapting a complex algorithm to this specific domain.
-
Strong Empirical Results: The proposed method demonstrates a clear and substantial performance improvement over the cited prior art (Gamma [77], Graph [4], Hier [27]) across multiple metrics (memory traffic, preprocessing time) and target architectures. This delta is large enough to be considered a significant advancement.
Weaknesses
My primary critique centers on the framing of the work's core novelty. The paper presents the use of spectral clustering as a fundamentally new approach, but this claim requires significant qualification.
-
Incremental Algorithmic Novelty: The core idea of using a clustering algorithm to group similar rows is not new. The cited prior work, Hier [27], already established the use of hierarchical clustering for this exact problem. Therefore, the contribution of Bootes is the replacement of one clustering algorithm (hierarchical) with another (spectral). While spectral clustering may be more effective for capturing global structure and non-linear patterns, this is an incremental step on an existing path, not the creation of a new path. The conceptual leap is small.
-
Similarity Metric is Functionally Identical to Prior Art: The paper's key insight is described as "using a similarity matrix that captures the structural properties of matrix A" (Abstract, page 1), which is computed as A * A^T (Section 3.1.1, page 6). This matrix explicitly calculates the number of shared column coordinates between every pair of rows. This is not a new concept for similarity. The "Graph" reordering paper [4] constructs a graph where edge weights "represent the number of shared column coordinates between two rows" (Section 2.2.2, page 4). This is functionally the exact same similarity metric, merely represented implicitly in a graph structure rather than explicitly as a dense or sparse similarity matrix; the equivalence is easy to verify on the binary sparsity pattern, as sketched after this list. The paper should acknowledge this and frame its contribution not as a new similarity metric, but as a more robust clustering algorithm that leverages this existing metric.
-
Limited Scope of Prior Art Search: The paper compares itself to a narrow set of very recent SpGEMM accelerator papers. However, matrix reordering to improve locality is a classic problem in high-performance and scientific computing. Spectral methods (bisection, partitioning) have been used for decades to reorder matrices for factorization and parallel processing by partitioning the underlying graph. The connection between this work and the vast body of literature on graph partitioning for sparse matrix computations (e.g., using tools like METIS/ParMETIS, Scotch, or Chaco) is not discussed. While the application context (row-wise SpGEMM on accelerators) is new, the fundamental idea of using spectral methods to reorder a matrix is not.
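As noted in point 2, the functional equivalence of the two similarity metrics is easy to check on the binary sparsity pattern (my own snippet, not code from either paper):

```python
import scipy.sparse as sp

A = sp.random(200, 300, density=0.05, format="csr", random_state=0)

pattern = A.copy()
pattern.data[:] = 1.0
S = (pattern @ pattern.T).toarray()    # Bootes-style explicit similarity matrix

i, j = 3, 17
cols_i = set(A.indices[A.indptr[i]:A.indptr[i + 1]])
cols_j = set(A.indices[A.indptr[j]:A.indptr[j + 1]])

# Graph [4]-style edge weight: number of shared column coordinates between rows i and j.
assert S[i, j] == len(cols_i & cols_j)
print(S[i, j], len(cols_i & cols_j))
```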
Questions to Address In Rebuttal
-
Beyond stating that spectral clustering is better for irregular structures, can the authors precisely articulate the conceptual delta that makes their use of clustering novel compared to Hier [27]? Why should this be considered a new approach rather than a superior algorithm choice within an existing framework?
-
Please explicitly compare the similarity metric derived from A * A^T with the graph-based similarity used in Graph [4]. Is there any functional difference, or is the novelty purely in the explicit matrix computation and the subsequent use of an eigensolver?
-
Could the authors comment on why they did not compare their approach to established graph partitioning libraries (e.g., METIS) applied to the row-similarity graph of matrix A? These tools are highly optimized and often serve as the baseline for any new partitioning or ordering scheme in the scientific computing domain.
-
The decision tree is a strong contribution. Could you provide an ablation study showing the performance of Bootes using a fixed, median k (e.g., k = 8) versus the performance achieved using the decision tree to select k? This would help isolate the benefit derived specifically from this novel component of your framework.
Misam: Machine Learning Assisted Dataflow Selection in Accelerators for Sparse Matrix Multiplication
Abstract
The performance of Sparse Matrix-Matrix Multiplication (SpGEMM), a foundational operation in scientific computing and machine learning, is highly sensitive to the diverse and dynamic sparsity patterns of its input matrices. While specialized hardware ...
Reviews
Review 1
Review Form:
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper presents Misam, a framework that uses a machine learning model (a decision tree) to select an optimal dataflow for Sparse Matrix-Matrix Multiplication (SpGEMM) on an FPGA. The framework consists of a lightweight predictor to select one of several pre-compiled FPGA bitstreams and a "reconfiguration engine" that decides whether the performance gain of switching to a new bitstream justifies the runtime reconfiguration overhead. The authors evaluate this approach on a Xilinx Alveo U55C FPGA, comparing it against CPU (MKL), GPU (cuSPARSE), and a state-of-the-art accelerator (Trapezoid, via simulation).
Strengths
- The work addresses the well-known and valid problem that no single SpGEMM dataflow is optimal across the wide spectrum of sparsity patterns found in real-world applications.
- The evaluation includes comparisons against strong, relevant baselines, namely Intel MKL and NVIDIA cuSPARSE, which are standard practice in the field.
Weaknesses
The paper suffers from several critical flaws in its core claims, methodology, and analysis that undermine its conclusions.
-
Fundamental Contradiction Regarding Reconfiguration Feasibility: The central premise of the paper is dynamic, runtime reconfiguration to adapt to workloads. The authors propose reconfiguration at the tile level (Section 3.3, page 7). However, they later admit that full bitstream reconfiguration on their target hardware (Xilinx U55C) takes 3-4 seconds (Section 6.1, page 11). This overhead is orders of magnitude larger than the execution time of a single tile multiplication, rendering the proposed tile-level switching completely infeasible. The authors acknowledge this ("making tile-level reconfiguration suboptimal"), which creates a fatal internal contradiction. They propose a mechanism and then immediately concede its impracticality on their evaluation platform without providing a viable alternative.
-
Misleading and Unsubstantiated Performance Claims: The headline claim of "up to a 10.76× speedup" (Abstract, page 1) is derived from a single, cherry-picked workload ("cg14-cg14" in Figure 8, page 9). This is an extreme outlier and not representative of the overall performance. Furthermore, the paper reports a geometric mean speedup of 2.74x only for cases where reconfiguration occurs (Section 5.2, page 9). This is a biased statistic. A rigorous analysis would report the geometric mean speedup across the entire workload suite, accounting for the overhead in all cases, including those where the engine correctly decides not to reconfigure and thus forgoes potential gains. The current presentation inflates the perceived benefit.
-
Ambiguity in Problem Definition (SpMM vs. SpGEMM): The paper is titled and framed around SpGEMM (Sparse-Sparse). However, a detailed reading of the proposed hardware (Section 3.2, page 5) reveals that Designs 1, 2, and 3 are clearly based on the Sextans accelerator [86], which is a SpMM (Sparse-Dense) architecture. Design 4 is presented as the SpGEMM accelerator. The workload categories (e.g., HS×D, MS×D) are SpMM problems, yet they are evaluated with a framework purported for SpGEMM. This lack of precision is confusing and calls into question whether the designs and evaluation truly address the sparse-sparse problem comprehensively.
-
Inadequate Evaluation of the ML Model's Impact: The paper claims 90% accuracy for the ML predictor (Section 5.1, page 9). However, this "accuracy" is a simple classification metric. The more important question is the performance impact of the 10% of mispredictions. The authors state this results in a "slight slowdown of 1.06×" (Section 5.1), but this single average value obscures the potential for catastrophic performance degradation in specific cases. For a misprediction to be truly "slight," the performance difference between the chosen incorrect design and the optimal one must be minimal. Given the performance variation shown in Figure 3, this is not a safe assumption. A distribution of the performance loss for mispredicted cases is required.
-
Arbitrary Threshold in Reconfiguration Engine: The decision to reconfigure is based on a user-defined threshold, set to 20% in the experiments (Section 3.3, page 7). This value is presented without justification. The entire system's performance is sensitive to this hyperparameter, yet no sensitivity analysis is provided. A rigorous study would demonstrate how performance changes as this threshold is varied.
-
Questionable Fidelity of Simulator-Based Comparison: The comparison against Trapezoid is performed using Trapezoid's "cycle-accurate simulator" (Section 4, page 8). While common, this is not a direct hardware-to-hardware comparison. The paper provides no information about how the simulator was configured or validated, leaving open the possibility of an unfair comparison. Claims of superiority over a simulated baseline are weaker than those against a hardware implementation.
Questions to Address In Rebuttal
- Please reconcile the core proposal of tile-level dynamic reconfiguration with the stated 3-4 second reconfiguration overhead on your target FPGA. How can such a system be practical for any realistic workload?
- Provide the end-to-end geometric mean speedup of Misam across all 116 workloads, factoring in reconfiguration overhead where applied and performance loss where reconfiguration is (correctly or incorrectly) avoided. Do not limit the statistic to a biased subset of the data.
- Clarify precisely which of your designs are for SpMM and which are for SpGEMM. Re-evaluate and present your results with a clear mapping between workload type (SpMM/SpGEMM) and the designs capable of handling them.
- Instead of a single average, please provide a cumulative distribution function (CDF) or histogram of the performance degradation for the 10% of cases where the ML model mispredicts the optimal design.
- Justify the selection of the 20% reconfiguration threshold. Please provide a sensitivity analysis showing how the system's overall performance changes as this threshold is varied from 0% to 100%.
- What steps were taken to ensure a fair, apples-to-apples comparison between your hardware prototype and the simulated Trapezoid baseline? Please provide details on the configuration and validation of the simulator.
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces Misam, a framework designed to address a critical challenge in sparse matrix multiplication (SpGEMM) acceleration: the efficient selection of the optimal hardware dataflow at runtime. The authors correctly identify that while flexible accelerators supporting multiple dataflows are emerging, they lack a principled, low-overhead mechanism to choose the right dataflow for a given input's unique sparsity pattern.
The core contribution of this work is a synergistic two-part solution. First, a lightweight decision tree model predicts the optimal hardware configuration based on a small set of salient matrix features. Second, an intelligent reconfiguration engine uses this prediction to perform a cost-benefit analysis, determining if the performance gain from switching to the optimal configuration justifies the non-trivial overhead of reconfiguring the FPGA. This holistic approach bridges the gap between specialized, high-efficiency accelerators and general-purpose, adaptable ones, offering a practical path toward performance portability for sparse computations.
Strengths
The true strength of Misam lies in its elegant synthesis of ideas from machine learning, computer architecture, and reconfigurable computing to solve a well-defined and increasingly important problem.
-
Addressing a Crucial Gap in the Literature: The field has seen a progression from fixed-dataflow sparse accelerators (e.g., OuterSPACE, MatRaptor) to more flexible architectures (e.g., Trapezoid, Flexagon). However, this flexibility introduced a new "selection problem" that was largely left to heuristics or offline profiling. This paper directly tackles this second-order problem, which is a natural and necessary next step for the field. The work provides a formal, data-driven solution where previously there was an ad-hoc one.
-
Pragmatic and Well-Justified System Design: The authors' choice of a decision tree for the prediction model is astute. In a domain where inference overhead is paramount, selecting a simple, low-latency model over a more complex one (like a deep neural network) is the right engineering decision. Furthermore, the inclusion of the reconfiguration engine (Section 3.3, page 7) is the paper's most sophisticated contribution. It elevates the work from a simple "ML for architecture" paper to a genuine systems paper that understands and models real-world trade-offs between performance and overhead. This cost-benefit analysis is critical for any practical application of runtime reconfiguration.
-
Demonstrated Generality of the Core Concept: Perhaps the most compelling evidence for the paper's impact is the experiment in which the Misam selection model is applied to the dataflows of a different accelerator, Trapezoid (Section 6.3, Figure 13, page 12). By achieving 92% accuracy in predicting the best dataflow for a completely different hardware target, the authors prove that their contribution is not merely a set of FPGA bitstreams, but a portable methodology for dataflow selection. This insight suggests the Misam framework could be adapted as a "scheduler" for a wide range of current and future multi-dataflow accelerators, whether they are FPGAs, ASICs, or even CGRAs.
-
Clear Problem Framing and Strong Empirical Support: The paper is exceptionally well-motivated. Figure 1 (page 1) and Figure 3 (page 3) effectively communicate the core problem: sparsity patterns are diverse, and no single accelerator design is universally optimal. The experimental results, particularly the breakdown in Figure 8 (page 9) showing when reconfiguration is beneficial and when it is not, provide strong, transparent validation of the system's utility.
Weaknesses
The weaknesses of the paper are less about the core idea and more about the practical limitations imposed by the current state of the implementation platform.
-
High Reconfiguration Overhead as a Practical Bottleneck: The authors are commendably transparent about the 3-4 second full reconfiguration time on their target FPGA (Section 6.1, page 11). This is a significant overhead that constrains the framework's applicability to scenarios where workloads are very long-running or where the sparsity regime is static for a long time. While the framework itself is sound, this hardware limitation means Misam cannot, in its current form, adapt to fine-grained sparsity changes (e.g., between layers of a neural network). The paper would benefit from a more explicit discussion of the "timescale of adaptation" that is practical today.
-
Limited Novelty in Accelerator Microarchitecture: The paper's contribution is at the system and control level, not the microarchitectural level. The presented FPGA designs are described as adaptations of prior work, such as Sextans. While this is a perfectly valid approach—building upon existing components to create a novel system—it's important to frame the work's novelty correctly. The innovation is in the orchestration and intelligent management of these hardware resources, not in the resources themselves.
Questions to Address In Rebuttal
-
On the Timescale of Adaptability: The review touches on the high reconfiguration overhead. Could the authors characterize the "break-even" point for computation? For example, how many floating-point operations must a SpGEMM task contain for the potential speedup to overcome the multi-second reconfiguration cost? This would help readers understand the target application domain more clearly—is it for large-scale scientific computing jobs, or could it be adapted for batched inference workloads in machine learning?
-
Feature Extraction Overheads: The performance breakdown (Figure 12, page 11) shows that feature extraction is a small but non-zero cost. As matrix sizes shrink, this fixed cost could become a larger fraction of the total runtime. Could you discuss how the overhead of feature extraction scales with matrix dimensions and non-zeros, and if there is a problem size below which Misam's predict-and-reconfigure approach is no longer beneficial?
-
Extensibility to Other Decision Metrics: The current model is trained to optimize for performance (latency). However, in many contexts (especially edge computing), energy efficiency is the primary concern. How easily could the Misam framework be retrained to optimize for energy, power, or a combined metric like Energy-Delay Product? Does this simply require relabeling the training data, or would the optimal features themselves likely change?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents "Misam," a framework for accelerating Sparse Matrix-Matrix Multiplication (SpGEMM) on FPGAs. The authors' core claim to novelty lies in the synergistic combination of three main components: (1) a lightweight machine learning (ML) model (a decision tree) for runtime selection of the optimal hardware dataflow based on input matrix features; (2) the use of FPGA reconfiguration to load specialized, resource-efficient hardware designs (bitstreams) on demand, rather than using a single, monolithic, general-purpose design; and (3) an intelligent "reconfiguration engine" that performs a cost-benefit analysis to determine if the performance gain from switching to a new hardware configuration justifies the significant overhead of FPGA reconfiguration.
Strengths
The primary strength of this work is the novelty of the proposed framework as a whole. While the individual technological pillars it rests on are not new, their synthesis into a closed-loop, adaptive system for sparse acceleration is a notable contribution.
-
Principled, Runtime Decision-Making: The central novel idea is the replacement of static configurations, offline profiling (as in Flexagon [66]), or ad-hoc heuristics with a formal, ML-driven, cost-aware runtime decision process. The system doesn't just predict the best dataflow; it explicitly models the trade-off between the potential performance gain and the reconfiguration cost (Section 3.3, Page 7). This cost-benefit analysis is a crucial and novel step that moves beyond prior work, which often assumes a static choice or ignores switching costs.
-
Specialization via Reconfiguration as an Alternative to Monolithic Design: The authors identify a key weakness in recent flexible accelerators: hardware underutilization (Section 1, Page 2). Their proposed solution—using separate, lean bitstreams and reconfiguring—is a novel architectural philosophy for this problem space. It directly counters the trend of building single, large, flexible-but-underutilized ASICs or FPGA overlays. The novelty lies in architecting for specialization and dynamism, rather than for static generality.
-
Demonstrated Portability of the Core Idea: The authors show that their ML selection model can be trained on and applied to the dataflows of a different accelerator, Trapezoid [102], achieving 92% accuracy and significant speedups (Section 6.3, Page 12). This is a powerful demonstration that their core novel contribution—the selection mechanism—is a generalizable concept, not just an artifact of their specific FPGA designs.
Weaknesses
The evaluation of novelty must be precise. The paper's contribution is in the synthesis, not in the invention of its constituent parts.
-
Constituent Technologies are Not Novel: The paper's claims must be carefully scoped. Using ML models to select optimal kernels or hardware parameters is an established technique in auto-tuning and compilers. Likewise, using dynamic FPGA reconfiguration to swap hardware modules is a decades-old concept. The paper would be strengthened by more explicitly stating that its novelty lies in the framework that unifies these existing techniques to solve a specific problem in sparse acceleration, rather than implying novelty in the parts themselves.
-
Insufficient Delta Compared to Strong Heuristics: The core of the selection mechanism is a lightweight decision tree. As shown in Figure 4 (Page 4), the decision hinges on a small number of key features such as Tile_1D_Density and A_load_imbalance_row. A key question of novelty is whether this ML approach provides a significant "delta" over a well-crafted, non-ML heuristic. It is conceivable that a simple set of if-then-else rules thresholding these same features could achieve comparable accuracy (a sketch of such a rule set appears after this list). The paper lacks a comparison against such a strong heuristic baseline, making it difficult to assess whether the complexity of an ML toolchain is truly justified over a simpler, domain-specific rule set.
-
Novelty is Muted by Hardware Limitations: The reconfiguration engine is a novel concept, but its practical benefit is severely constrained by the target platform's 3-4 second reconfiguration time (Section 6.1, Page 11). This overhead means the adaptive framework is only beneficial for extremely long-running computations where the cost can be amortized. While the idea is sound, its novelty is primarily academic on current-generation FPGAs for many real-world use cases. The contribution is more of a forward-looking proof-of-concept than an immediately applicable technique, a point that should be made more clearly.
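For reference, the kind of hand-tuned baseline requested in the second weakness above could be as small as the sketch below; the feature names follow Figure 4, while the thresholds and dataflow labels are invented for illustration and are not taken from the paper.

```python
# Hypothetical hand-tuned heuristic baseline (not from the paper). Feature
# names follow Figure 4; thresholds and dataflow labels are assumptions.

def pick_dataflow(tile_1d_density: float, a_load_imbalance_row: float) -> str:
    if tile_1d_density > 0.05:
        # Relatively dense tiles: assume a compute-bound dataflow wins.
        return "dataflow_A"
    if a_load_imbalance_row > 2.0:
        # Severe row imbalance: assume a row-balanced dataflow wins.
        return "dataflow_B"
    # Default for very sparse, well-balanced inputs.
    return "dataflow_C"
```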
Questions to Address In Rebuttal
-
Could the authors please clarify the novelty of their work by explicitly distinguishing their framework-level contribution from the prior art in the individual domains of (a) ML-based algorithm selection and (b) dynamic hardware reconfiguration?
-
To justify the novelty of using an ML model for selection, could the authors provide a comparison against a strong, non-ML heuristic baseline? For example, a set of hand-tuned rules based on thresholds for the top 2-3 features identified in Figure 4. How much better is the decision tree than what a domain expert could construct in a short time?
-
The reconfiguration engine's cost-benefit analysis is central to the paper's novelty. How would this model and its utility change if applied to a Coarse-Grained Reconfigurable Architecture (CGRA) with reconfiguration times in the microsecond-to-millisecond range, as mentioned in Section 6.1? Does the framework's novelty hold or strengthen in a regime where the reconfiguration cost is orders of magnitude lower?
AxCore: A Quantization-Aware Approximate GEMM Unit for LLM Inference
Abstract
Large Language Models (LLMs) have become foundational to modern natural language processing, yet their immense computational and memory demands pose major obstacles for efficient inference. Transformer-based LLMs rely heavily on floating-point general ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper presents AxCore, a general matrix-matrix multiplication (GEMM) unit designed for Large Language Model (LLM) inference. The core contribution is the replacement of conventional floating-point multipliers with a multiplier-free design based on Floating-Point Multiplication Approximation (FPMA), which utilizes integer addition. This approach is integrated into a mixed-precision systolic array that directly operates on 4-bit quantized floating-point (FP4) weights and 16-bit floating-point (FP16) activations. The authors propose several supplementary techniques to maintain accuracy, including a method for handling subnormal numbers (SNC), a constant-based error compensation scheme, and an adaptive format-aware quantization algorithm. The paper claims significant improvements in compute density and competitive or superior model accuracy compared to both conventional FP units and state-of-the-art INT4-based accelerators.
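For reference, the FPMA idea referred to above (a Mitchell-style logarithmic approximation in which a floating-point multiply is replaced by integer addition of the operands' bit patterns) can be sketched in a few lines. This is the plain uniform-precision FP32 case for positive, normal inputs, offered only to ground the terminology; it is not the paper's mixed-precision FP4xFP16 datapath.

```python
import struct

# Mitchell-style FPMA sketch for positive, normal FP32 inputs only; this is
# the textbook approximation the review refers to, not AxCore's mpFPMA unit.

def f2i(x: float) -> int:
    return struct.unpack("<I", struct.pack("<f", x))[0]

def i2f(n: int) -> float:
    return struct.unpack("<f", struct.pack("<I", n & 0xFFFFFFFF))[0]

BIAS = 127 << 23  # FP32 exponent bias, pre-shifted into the exponent field

def fpma_mul(a: float, b: float) -> float:
    """Approximate a*b by adding the operands' bit patterns and removing one
    copy of the exponent bias (treats the mantissa field as log2(1+m))."""
    return i2f(f2i(a) + f2i(b) - BIAS)

print(fpma_mul(1.5, 2.25), 1.5 * 2.25)  # approx 3.25 vs exact 3.375
```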
Strengths
- Direct Targeting of a Key Bottleneck: The work correctly identifies the FP multiplier as a primary contributor to area and power costs in GEMM units and proposes a direct architectural alternative. The ambition to create a multiplier-free design is commendable.
- System-Level Approach: The authors do not merely propose an approximate arithmetic trick but consider its integration into a full system, including a systolic array architecture (Section 5, page 7) and dataflow optimizations like Correction Advancing and Normalization Postponing (Section 5.3, page 8).
- Inclusion of an Ablation Study: The authors provide a breakdown of accuracy improvements from their various techniques in Table 2 (page 11), which is a necessary step to justify the inclusion of each component (SNC, error compensation, etc.).
Weaknesses
My primary concerns with this paper stem from questionable methodological choices in the evaluation that appear to conflate distinct contributions, leading to claims that are not rigorously supported by the evidence provided.
-
Confounded Accuracy Evaluation: The central claim that AxCore achieves "comparable or better perplexity" (Abstract, page 1) is not soundly demonstrated. The paper compares AxCore—which includes both an approximate compute core and a novel format-aware quantization scheme—against baselines using standard quantization methods (e.g., GPTQ for FIGNA, as stated in Section 6.5.1, page 11). The observed accuracy improvements (e.g., PPL of 9.78 for AxCore vs. 9.82 for FPC and 9.95 for FIGNA on OPT-30B, Figure 1, page 2) are more likely attributable to the sophisticated quantization scheme rather than the approximate nature of the compute unit itself. An approximate method should not, by definition, be more accurate than an exact one given the same inputs. This is an apples-to-oranges comparison that conflates the benefits of a quantization algorithm with the performance of the underlying hardware. The paper lacks a crucial baseline: an exact FP4xFP16 multiplier using the authors' proposed format-aware quantization. Without this, the isolated impact of the mpFPMA approximation on accuracy remains unknown and unsubstantiated.
-
Oversimplified and Poorly Justified Error Compensation: The proposed "Mean-Based Constant Compensation" (Section 4.3.2, page 6) is a significant point of weakness. Equation (11) defines the correction constant C₁ by averaging the approximation error across all possible mantissa combinations. This implicitly assumes a uniform distribution of mantissa values. This assumption is highly unlikely to hold for the weights and, particularly, the activations within a neural network, which are known to have highly structured, non-uniform distributions. The paper provides no analysis to show that this single, pre-computed constant is robust across different models, layers, or even different input prompts that would induce varying activation distributions. This appears to be a crude heuristic, and its effectiveness is not convincingly demonstrated beyond the global perplexity numbers; a sketch of the kind of sensitivity check that would address this appears after this list.
-
Unaddressed Implications of Stochastic Rounding: The handling of subnormal numbers relies on a "random selection policy" (Section 4.2.2, page 5) to mitigate rounding bias. This is a form of stochastic rounding. The implementation details in Section 5.2.2 (page 8) suggest this is not truly random but is derived from an activation bit. Regardless of the source, this introduces non-determinism into the computation, meaning identical inputs can produce different outputs on subsequent runs. This is a critical flaw for many deployment scenarios, especially those requiring verifiability, debugging, and regulatory compliance. The authors fail to discuss the implications of this non-determinism or quantify its impact on numerical stability.
-
Inadequate Comparison of Number Formats: The paper claims superiority over INT4-based designs like FIGNA. However, this comparison is fraught. FP4 and INT4 are fundamentally different 4-bit representations. While the authors claim FP4 offers "higher accuracy potential" (Section 2.3, page 3), this inherent advantage of the number format is used to claim superiority for their architecture. A rigorous comparison would require demonstrating AxCore's benefits over an architecture using an exact FP4 multiplier, or by implementing the proposed approximation techniques within an integer-based framework to provide a more direct comparison to INT4 designs.
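As noted in the second weakness above, the uniform-mantissa assumption behind C₁ is easy to probe numerically. The sketch below estimates the mean relative error of a Mitchell-style approximate multiply under two input distributions; it is purely illustrative, and the samplers are assumptions rather than the paper's calibration procedure.

```python
import random
import struct

# Illustrative sensitivity check (not the paper's methodology): compare the
# mean relative error of Mitchell-style FPMA under uniformly distributed
# mantissas (the assumption behind the constant C1) versus a rough
# Gaussian-like stand-in for network values.

def f2i(x): return struct.unpack("<I", struct.pack("<f", x))[0]
def i2f(n): return struct.unpack("<f", struct.pack("<I", n & 0xFFFFFFFF))[0]
BIAS = 127 << 23

def fpma_mul(a, b):
    return i2f(f2i(a) + f2i(b) - BIAS)

def mean_rel_error(sampler, trials=100_000):
    acc = 0.0
    for _ in range(trials):
        a, b = sampler(), sampler()
        exact = a * b
        acc += (fpma_mul(a, b) - exact) / exact
    return acc / trials

uniform_mantissa = lambda: random.uniform(1.0, 2.0)          # uniform mantissa, as Eq. (11) assumes
gaussian_like = lambda: abs(random.gauss(0.0, 1.0)) + 1e-3   # crude stand-in for weights/activations

print(mean_rel_error(uniform_mantissa), mean_rel_error(gaussian_like))
```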
Questions to Address In Rebuttal
The authors must address the following points to substantiate their claims:
-
On Confounded Evaluation: Can you provide perplexity results for a baseline system that uses your proposed adaptive format-aware quantization but with a conventional, exact FP4xFP16 multiplier? This is the only way to decouple the effects of your quantization algorithm from your approximate compute core and fairly assess the accuracy degradation caused by mpFPMA.
-
On Error Compensation: Please provide an analysis of the sensitivity of model accuracy to the constant compensation value C₁. How does the actual distribution of mantissas in representative LLMs (e.g., OPT, LLaMA) compare to the uniform distribution assumed in your derivation of C₁, and what is the resulting error if the distributions diverge significantly?
-
On Non-Determinism: Please clarify the exact mechanism for implementing the "random selection policy" for subnormal rounding. Acknowledge and discuss the implications of the resulting non-determinism for bit-for-bit result reproducibility in a production environment. If the method is pseudo-random, what is its period and have you analyzed potential correlations?
-
On Baseline Fairness: Given that your primary architectural claim rests on the efficiency of the mpFPMA unit, why was the decision made to present the main accuracy results (Table 2) against baselines using different number formats (INT4) and different quantization algorithms (GPTQ)? How can you claim the architecture itself is superior when so many other variables differ?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
The paper presents AxCore, a novel hardware accelerator architecture for Large Language Model (LLM) inference. The core contribution lies in the synergistic combination of two distinct but complementary research thrusts: weight-only quantization and approximate computing via Floating-Point Multiplication Approximation (FPMA). By fusing these concepts, the authors propose a multiplier-less mixed-precision GEMM unit where expensive floating-point multipliers are entirely replaced by simple integer adders within a systolic array.
To make this fusion practical and accurate, the paper introduces several key innovations: (1) a specialized Processing Element (PE) that performs mixed-precision FPMA directly on compressed weights (e.g., FP4) and high-precision activations (e.g., FP16); (2) a lightweight but critical accuracy preservation strategy that includes a novel hardware unit for Subnormal Number Conversion (SNC), constant-based error compensation, and an adaptive, block-wise selection of FP4 formats; and (3) a set of clever systolic array optimizations, such as "Correction Advancing" and "Normalization Postponing," that reduce hardware redundancy and complexity.
The authors demonstrate through extensive evaluation that AxCore achieves significant improvements in compute density (up to 12.5x over conventional FP units) and energy efficiency, while maintaining model accuracy on par with, or even exceeding, state-of-the-art INT4-based accelerator designs.
Strengths
This is a well-executed and insightful piece of work that makes a valuable contribution to the field of hardware acceleration for AI. Its primary strengths are:
-
Elegant Conceptual Fusion: The most significant strength of this paper is its successful synthesis of ideas from approximate computing and model quantization. Rather than simply applying FPMA to a quantized model, the authors have deeply considered the second-order effects and co-designed the hardware and software components. This creates a cohesive system where the approximation, quantization format, and hardware microarchitecture are mutually supportive. This work serves as an excellent case study in bridging the gap between computer arithmetic theory and practical accelerator design for modern workloads.
-
Addressing a Critical, Understated Problem: The paper's focus on handling subnormal numbers in low-bit floating-point formats (Section 4.2, page 5) is particularly commendable. As the field aggressively pushes towards 4-bit and smaller data types, the limited exponent range makes subnormal values far more common than in traditional FP32/FP16 arithmetic. The authors correctly identify that this breaks the mathematical assumptions of FPMA and would otherwise lead to significant accuracy degradation (as shown in Figure 4, page 4). Their proposed Subnormal Number Conversion (SNC) is a pragmatic and well-justified solution to a real, emerging problem that many other works in this space overlook.
-
Strong Co-design Philosophy: The work is a prime example of effective software-hardware co-design. The offline, adaptive format-aware quantization strategy (Section 4.4, page 6) is not just a software trick; it is enabled by a hardware design that can concurrently support multiple FP4 formats without significant overhead. Similarly, the mean-based error compensation is an offline analysis that translates into a simple, low-cost online hardware correction. This holistic approach leads to a far more optimized result than if the software and hardware were designed in isolation.
-
Thorough and Convincing Evaluation: The evaluation is comprehensive and well-structured. The ablation study presented in Table 2 (page 11) is particularly powerful, as it clearly dissects the contribution of each accuracy-preserving technique (SNC, error compensation, etc.). This provides convincing evidence that the final excellent results are not accidental, but a direct consequence of their design choices. The comparison against strong and recent baselines, including the INT4-based FIGNA and LUT-based FIGLUT, grounds the performance claims in the context of the current state-of-the-art.
Weaknesses
The paper is strong, and its weaknesses are more related to missed opportunities for broader contextualization and exploration rather than fundamental flaws.
-
Limited Exploration of the Design Space: The paper is heavily focused on a W4A16 mixed-precision scenario. While this is a highly relevant and popular configuration for LLM inference, the underlying principles seem more general. The discussion could be strengthened by exploring how the AxCore philosophy might extend to other data formats, such as FP8 (as used in NVIDIA's Hopper) or even non-standard formats like block floating point. This would help contextualize where the FPMA-based approach is most effective and where its limitations might lie.
-
Positioning Relative to Logarithmic Computing: The FPMA technique is explicitly based on Mitchell's approximation, which is a cornerstone of Logarithmic Number Systems (LNS). The paper could benefit from briefly positioning itself within this broader historical context of computer arithmetic. AxCore can be seen as a highly specialized, lightweight, and workload-aware application of LNS principles, avoiding the overhead of a general-purpose LNS datapath (e.g., lookup tables for log/antilog conversion) by keeping activations in the linear domain. Acknowledging this connection would not diminish the novelty but rather highlight how the authors have cleverly adapted and simplified a classic idea for the modern LLM domain.
-
Potential Brittleness of Constant-Based Compensation: The mean-based error compensation (Section 4.3.2, page 6) is an elegant, low-cost solution. However, by using a single pre-computed constant, there is a small risk that its effectiveness is dependent on the data distribution of the calibration set. While the results show this works well, a brief discussion of the sensitivity of this method to out-of-distribution shifts in activation statistics would add nuance.
Questions to Address In Rebuttal
-
Regarding the Subnormal Number Conversion (SNC) unit: The use of stochastic rounding for ambiguous cases (Section 4.2.2, page 5) is an interesting choice to mitigate bias. Could the authors comment on the hardware cost and complexity of the random bit generation required for this? Was a simpler deterministic rounding scheme (e.g., round-to-nearest-even) evaluated, and what was its comparative impact on model accuracy?
-
Regarding the adaptive format-aware quantization: The selection is made from a set of three representative FP4 formats (E3M0, E2M1, E1M2). What is the sensitivity of the final accuracy to this specific set of choices? Is there a point of diminishing returns, or could further gains be realized by considering a wider array of custom FP4 formats?
-
The core idea is potent for weight-only quantization. Could the authors speculate on the applicability of the AxCore architecture to future scenarios that might also quantize activations (e.g., W4A8)? What would be the primary new challenges in extending the mpFPMA concept to handle two low-bit, approximate inputs, particularly concerning alignment and dynamic range?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents AxCore, a general matrix-matrix multiplication (GEMM) unit designed for accelerating Large Language Model (LLM) inference. The central claim of this paper is the design of a mixed-precision GEMM unit that replaces conventional floating-point multipliers with an approximate method, Floating-Point Multiplication Approximation (FPMA), based on integer addition. The architecture is tailored for weight-only quantization, performing direct computation on low-bit quantized weights (e.g., FP4) and high-precision activations (e.g., FP16). To maintain accuracy, the authors propose a set of supporting techniques, including a Subnormal Number Conversion (SNC) unit, a constant-based error compensation scheme, and an adaptive format-aware quantization strategy.
While the constituent concepts—FPMA, mixed-precision GEMM, and adaptive quantization—are not novel in isolation, the primary contribution of this work lies in their synthesis into a cohesive, mixed-precision architecture. The specific hardware co-design choices made to enable this synthesis, particularly the handling of subnormal numbers in low-bit formats and the "correction advancing" systolic array optimization, represent the most significant novel elements.
Strengths
-
Novel Synthesis of Known Concepts: The core novelty of AxCore is not in the invention of a new algorithm but in the architectural synthesis of two distinct techniques: mixed-precision GEMM and multiplier-free FPMA. To my knowledge, prior work on FPMA has largely focused on uniform-precision arithmetic. This paper is one of the first to explore and design a dedicated hardware architecture for the mixed-precision case (mpFPMA), which is highly relevant for modern LLM inference.
-
Problem-Driven Novelty in Subnormal Handling: The paper correctly identifies that low-bit floating-point formats (e.g., FP4) produce a high frequency of subnormal numbers, which breaks the mathematical assumptions of standard FPMA. The proposed Subnormal Number Conversion (SNC) unit (Section 4.2, page 5) is a novel and practical solution to a problem that arises directly from the core architectural choice. This demonstrates a thoughtful approach to addressing the second-order effects of the main idea.
-
Novel Application of Optimization Principles: The "Correction Advancing" technique described in Section 5.3.1 (page 8) is a clever application of the well-known principle of factoring out common computations. While the principle itself is not new, its application to the specific bias and error correction terms (B1 and C1) of the mpFPMA formulation is a novel design choice that yields a tangible reduction in per-PE complexity.
Weaknesses
-
Limited Foundational Novelty of Core Ideas: The paper's novelty rests on synthesis and implementation rather than on a fundamentally new concept.
- The FPMA technique itself is decades old, dating back to Mitchell's logarithm approximation [35].
- The use of FPMA for neural networks has been explored in several recent pre-prints and papers, a point the authors acknowledge. The work by Luo and Sun [33], "Addition is all you need for energy-efficient language models," presents a conceptually very similar motivation.
- Weight-only mixed-precision GEMM is the industry standard for LLM quantization and is implemented in accelerators like FIGNA [22], which is used as a baseline.
- Adaptive, block-wise quantization based on data distribution has been proposed in prior work such as ANT [19] and Olive [47]. The novelty here appears to be more in implementation and integration with the FPMA pipeline rather than a fundamentally new quantization strategy.
-
The "Delta" Over Prior Art Needs Sharper Definition: The paper does an adequate job of presenting its own contributions but could do more to precisely delineate its novelty against the closest and most recent prior art. The line between AxCore and other contemporary FPMA-for-NNs work is fine and rests primarily on the specific architectural choices for the mixed-precision case.
-
Marginal Novelty in Some Supporting Techniques: The mean-based constant error compensation (Section 4.3.2, page 6), while practical, is an incremental improvement over other error compensation techniques for FPMA. It is essentially a pre-calculation of the average error, a standard statistical technique. Its main virtue is simplicity, not conceptual novelty.
Questions to Address In Rebuttal
-
The recent work by Luo and Sun [33], and others exploring addition-based approximations for LLMs, seems to pursue a very similar goal. Please clarify the precise novel contributions of the AxCore architecture over this and similar contemporary works. Is the primary distinction simply the hardware implementation of the mixed-precision case, or is there a more fundamental algorithmic or architectural difference?
-
The paper claims that existing FPMA work largely ignores the issue of subnormal numbers (Section 3.1, page 3). Could the authors substantiate this claim with a broader survey of prior art in approximate computing? While plausible for high-precision formats, it is critical to establish that this problem was indeed unaddressed for the low-bit formats where it becomes prominent.
-
Given prior work on adaptive data-aware quantization like ANT [19], what is the key conceptual novelty of the proposed format-aware quantization (Section 4.4, page 6) beyond its co-design and integration within the AxCore pipeline?
-
The proposed accuracy preservation schemes (SNC, constant error compensation) add specific hardware complexity. Is the accuracy gain from these specific techniques substantial enough to justify their design cost over simpler, previously proposed FPMA error correction schemes (e.g., small LUTs or simple bias terms), especially when applied to the mixed-precision domain? A direct comparison would strengthen the argument for this specific novel approach.
Amove: Accelerating LLMs through Mitigating Outliers and Salient Points via Fine-Grained Grouped Vectorized Data Type
Abstract
The quantization of Large Language Models (LLMs) poses significant challenges due to the heterogeneous nature of feature point distributions in low-bit quantization scenarios, including salient points, normal outliers, and massive outliers. These ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper proposes "Amove," a hardware-software co-design framework for quantizing Large Language Models (LLMs). The core contribution is a "fine-grained grouped vectorized data type" that uses a residual approximation mechanism to represent per-cluster scale factors. This mechanism relies on a shared base scale factor for a coarse-grained group, a shared residual, and low-bit encodings for fine-grained clusters. The authors claim this approach balances accuracy and memory overhead, enabling effective 4-bit weight-activation quantization. The proposed architecture is implemented and evaluated on both GPU tensor core and systolic array designs, with claims of significant speedup and energy reduction over existing methods.
While the problem statement is relevant, the work rests on a fragile foundational assumption and is marred by significant methodological flaws in its evaluation, leading to claims that appear unsubstantiated upon close inspection.
Strengths
- Problem Motivation: The paper correctly identifies the fundamental trade-off in LLM quantization: the tension between the accuracy benefits of fine-grained quantization and the associated memory and computational overhead of managing numerous scale factors (Section 3, page 3).
- Architectural Scope: The authors provide designs for both GPU tensor core-style and systolic array-based accelerators (Section 5, page 7-8). This demonstrates a comprehensive consideration of hardware implementation targets.
- Core Concept: The idea of approximating scale factors from a shared base and a residual is, in principle, a plausible approach to reducing metadata overhead.
Weaknesses
-
Unjustified and Contradicted Foundational Assumption: The entire residual approximation mechanism is predicated on the observation that scale factor distributions are "light-tailed" (Motivation I, Section 4.1, page 4). This is presented as a general property. However, the authors' own analysis in Section 6.4 and Figure 14 (page 12) directly contradicts this generalization. They show that the Bloom-3B model has a kurtosis of ~3.31, which is characteristic of a mesokurtic or even leptokurtic distribution, not a light-tailed (platykurtic) one. The authors' solution—to simply reduce the residual group size—is not a fix but an admission that the core assumption is not robust and requires parameter tuning that undermines the claimed efficiency. The framework's central premise is therefore not universally applicable but an empirical heuristic that fails on certain models.
-
Fundamentally Flawed Experimental Comparison: The performance evaluation is built on an invalid "apples-to-oranges" comparison. In Section 6.1 (Hardware Baselines, page 9), the authors state: "ANT, Olive, Tender, INT-Sym, and MX apply quantization only to linear layers, whereas Amove supports quantization for both linear and attention layers." Consequently, the perplexity results in Table 4 (page 10) and, more critically, the speedup and energy results in Figures 10 and 11 (page 11) are misleading. Amove naturally achieves higher speedups because it quantizes a larger portion of the model's computational graph (i.e., the attention layers), which the baselines are not configured to do. A rigorous comparison would either (a) restrict Amove to quantizing only linear layers to match the baselines, or (b) compare Amove against baselines that are also configured to quantize attention layers. As presented, the reported speedups are inflated and do not reflect a fair comparison of the underlying quantization technology.
-
Contradictory and Misleading Overhead Claims: The paper claims its architectural extensions are "lightweight" with "negligible area overhead" (Section 5.5, page 8). Table 7 (page 12) supports this by showing a total area overhead of only 1.62% for the tensor core. However, this top-level number masks the true cost. Figure 13 (page 12) reveals that the area and power of the core computational units—the PE and the Thread Group—increase by 16.2%/17.1% and 11.7%/12.9%, respectively, compared to a baseline INT4 design. An increase of over 15% in the core compute engine is not "negligible." Presenting the overhead relative to the full chip/SM area, which is dominated by memory and control logic, is a classic method of minimizing the perceived cost of a significant modification to the datapath.
-
Unaccounted Online Computational Overhead: Algorithm 1 (page 6) specifies that for activations, the residual R is computed online. The formula in line 13 involves a summation across all elements within a coarse-grained group. This is a non-trivial computation that must be performed for every group of activations during inference. The paper provides no analysis of the latency of this online calculation, nor is there any evidence that this latency is modeled in the performance simulators used for the results in Section 6.3. This omission means the reported performance likely neglects a critical source of overhead, casting further doubt on the validity of the speedup claims.
-
Insufficient Simulator Validation: The performance claims are based on modified simulators (TimeLoop and a BitMoD-based accelerator simulator). The authors report a validation against real hardware with a mean absolute percentage error of 4.59% (Figure 9, page 9). For a paper making claims of >2x speedup, a nearly 5% simulation error margin is substantial and weakens the confidence in the results. Furthermore, the validation was performed using a W8A8 workload, not the proposed 4-bit Amove format. It is not demonstrated that the simulator is accurate for the very data type and operations being proposed.
Questions to Address In Rebuttal
-
The performance comparison in Table 4 and Figures 10/11 appears to be fundamentally unsound, as Amove quantizes both linear and attention layers while the baselines are configured to quantize only linear layers. Please provide a revised comparison where either Amove is restricted to linear-only quantization or the baselines are configured to quantize both layer types. Without this, the claimed speedups are not credible.
-
Please reconcile the claim of "negligible area overhead" (Section 5.5) with the data in Figure 13, which shows a >16% area and >17% power increase for the core Processing Element (PE) relative to a baseline INT4 PE. Why should this be considered "lightweight"?
-
The core mechanism relies on a "light-tailed" scale factor distribution, yet your own results show this does not hold for models like Bloom-3B. How does the framework perform on models that violate this assumption without resorting to smaller group sizes, which increases overhead and complexity? Does this not limit the generality of your approach?
-
Algorithm 1 implies an online computation of the residual R for activations. What is the precise cycle latency of this computation for a given group size? Was this latency modeled and included in the end-to-end performance results presented in Figures 10 and 11? If not, please justify this omission.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
The paper introduces Amove, a hardware-software co-design framework for accelerating Large Language Models (LLMs). The core contribution is a novel method for managing the metadata overhead associated with fine-grained quantization. The authors observe that the per-group scale factors, which are essential for maintaining accuracy in low-bit quantization, exhibit a light-tailed statistical distribution. Based on this insight, they propose a "residual approximation mechanism" where a coarse-grained group of values shares a high-precision base scale factor, while smaller, fine-grained clusters within that group are represented by low-bit encoded residuals or offsets.
This mechanism is encapsulated in a flexible, fine-grained grouped vectorized data type. A key strength of this data type is its ability to provide a unified solution for both low-bit weight-only and weight-activation quantization modes, which are often handled by separate, specialized solutions. The authors demonstrate the practical viability of their approach by detailing its implementation on both GPU tensor core-like and systolic array-based architectures, showing significant speedup and energy reduction over existing state-of-the-art methods.
Strengths
This paper is a strong contribution to the field of LLM acceleration, effectively synthesizing algorithmic insights with practical hardware design.
-
Elegant Core Idea with Strong Motivation: The central idea of compressing scale factor information via a base-plus-residual approximation is both elegant and well-motivated. The authors correctly identify that as quantization granularity becomes finer to preserve accuracy, the memory and bandwidth overhead of the scale factors themselves becomes a primary bottleneck. The analysis in Section 3 (page 3), particularly the observation of the light-tailed distribution of scale factors (Figure 5c, page 4), provides a compelling statistical justification for their entire approach. This transforms the problem from storing many independent values to efficiently encoding small deviations from a shared mean.
-
Unification of Quantization Modes: A significant contribution is the framework's ability to seamlessly support both weight-only and weight-activation quantization. The current landscape is fragmented, with many techniques and hardware designs specializing in one or the other (as shown in their comparative analysis in Table 1, page 4). Amove provides a unified substrate that can handle both, which is a major advantage for hardware designers who seek to build general-purpose accelerators, and for software developers who can target a single format.
-
Excellent System-Level Co-Design: The work is a prime example of effective co-design. The authors do not stop at the algorithmic level; they propose a concrete data type format (Figure 6b, page 5), detail the necessary architectural extensions for both GPU-style and systolic array accelerators (Section 5, pages 7-8), and even consider the programming model via a custom Smma instruction (Section 5.6, page 8). This holistic approach bridges the gap between theoretical quantization research and practical implementation.
Compelling Orthogonality and Integration: One of the most powerful arguments for Amove's potential impact is its demonstrated compatibility with existing quantization algorithms. The results in Section 6.5 (page 12) show that Amove is not just another competing quantization scheme, but rather an underlying data representation that can be used to enhance established methods like GPTQ, AWQ, and OmniQuant. This positions Amove as a foundational technology that can uplift the entire ecosystem, rather than just a standalone solution.
Weaknesses
The weaknesses of the paper are more related to the boundaries of its exploration rather than fundamental flaws in the core idea.
-
Robustness of the Core Statistical Assumption: The effectiveness of the residual approximation hinges on the "light-tailed" nature of the scale factor distribution. While the authors validate this across several models (Figure 14, page 12), the landscape of model architectures is constantly evolving (e.g., Mixture-of-Experts, state-space models). The work would be strengthened by a more thorough stress test of this assumption. What happens in pathological cases where the distribution is heavy-tailed or multi-modal? While smaller group sizes can mitigate this, a discussion of the framework's performance degradation and potential adaptations would be valuable.
-
Practicality of the Offline Residual Search: Algorithm 1 (page 6) outlines an offline search to find the optimal residual R for weight quantization. While this is a one-time, pre-processing cost, its computational complexity is not analyzed. For future models with trillions of parameters, the scalability of this search could become a practical concern. A brief analysis of this complexity would add to the paper's practical grounding.
Positioning Relative to Non-Uniform Quantization: The paper primarily situates itself within the context of uniform, group-wise integer quantization. However, another active area of research involves non-uniform formats (e.g., logarithmic, posit-like representations) which are inherently suited to handling distributions with wide dynamic ranges. A brief discussion of how Amove's residual mechanism compares or could potentially be combined with such non-uniform schemes would help to more fully map its place in the broader quantization landscape.
Questions to Address In Rebuttal
-
Regarding the core assumption of light-tailed scale factor distributions (Figure 14): Have the authors investigated models or specific layer types (e.g., in MoE models) where this assumption is weaker? How gracefully does the residual approximation mechanism degrade in such scenarios, and could the framework be adapted (e.g., by dynamically adjusting the residual bit-width) to handle them?
-
Could the authors comment on the computational cost and scalability of the offline residual search (Algorithm 1, page 6) for quantizing weights? While it is a pre-processing step, understanding this cost is important for its practical adoption on foundation models that continue to grow in size.
-
The paper successfully demonstrates that Amove can serve as a "backend" data format for various quantization algorithms (Section 6.5). Could the authors speculate on the compatibility or potential synergy of the Amove data type with non-uniform quantization schemes, such as those that use logarithmic representations to manage outliers?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper, "Amove," presents a hardware-software co-design framework for quantizing Large Language Models (LLMs). The central claim of novelty rests on a "residual approximation mechanism" designed to manage the memory overhead of scale factors in fine-grained quantization. The authors propose a new vectorized data type that encodes a coarse-grained group of values by using a single shared base scale factor, a single shared residual, and a low-bit encoding for each fine-grained cluster within the group. This allows each cluster to approximate its unique scale factor as a function of these shared parameters (Scale_cluster ≈ Scale_base - Encoding * Residual). The paper demonstrates this concept through a data type, its integration with existing quantization algorithms, and proposed modifications to GPU tensor cores and systolic array architectures to support it efficiently.
Strengths
The primary strength of this paper lies in its well-defined and targeted novel contribution. While many components of the work build on existing concepts, the core idea is distinct:
-
Novel Mechanism for Scale Factor Compression: The specific method of approximating per-cluster scale factors using a shared base, a shared residual, and a per-cluster encoding is, to my knowledge, a novel formulation in the context of LLM quantization. Prior art has focused on either quantizing the scale factors themselves (e.g., VS-Quant [13]) or using a single shared scale factor for a group (e.g., standard group-wise quantization, MX format [65]). The Amove mechanism offers a middle ground, enabling variation within a group without incurring the full memory cost. It parameterizes the quantization scales in a structured, compressible way.
-
Identifies and Addresses a Key Bottleneck: The authors correctly identify that the memory and bandwidth overhead of scale factors is the primary impediment to adopting finer-grained quantization, despite its known accuracy benefits (as shown in their own Figure 2, page 2). The proposed solution directly and elegantly targets this specific problem.
-
Clear Articulation of the "Delta" over Prior Art: The paper implicitly and explicitly positions its contribution relative to the closest prior art, the Microscaling (MX) data format [65]. While MX also uses a vectorized format with a shared scale (exponent), Amove's novelty lies in adding the residual and encoding terms. This "delta" is what allows Amove to support finer-grained clusters within a coarse-grained block, which is a significant functional and conceptual advancement over MX's single-scale-per-block approach.
Weaknesses
The weaknesses of the paper relate not to a lack of novelty in the core idea, but to the assumptions that underpin it and the exploration of its boundaries.
-
Over-reliance on an Empirical Assumption: The entire residual approximation mechanism is motivated by the observation that scale factors in fine-grained settings exhibit a "light-tailed distribution" (Observation I, page 4). This is a strong assumption. While the authors provide some evidence in Figure 5(c) and Figure 14, the robustness of the framework is questionable for models or data distributions that might produce heavy-tailed or multi-modal scale factor distributions. The paper shows one less-ideal case (Bloom-3B in Figure 14, page 12) and states that reducing the group size helps, but this does not fully address the fragility of the core assumption. A more rigorous analysis of failure modes is warranted.
-
The Novelty is Highly Specific: The contribution is a clever engineering trick for compressing scale factors, not a fundamental new theory of quantization. While effective, its novelty is confined to this specific mechanism. The architectural modifications, for instance, are logical consequences of the data format design rather than being novel architectural primitives in their own right. They are essentially designing a specialized functional unit for their data type, which is standard practice in co-design papers.
-
Insufficient Justification for Design Choices: The framework introduces several new hyperparameters: the bit-widths for the shared scale (S) and residual (R), the encoding bits (E), and the group/cluster sizes. The paper presents two fixed configurations ("Aggressive" and "Conservative") but provides little insight into the methodology for choosing these parameters. For instance, in Section 6.1 (page 9), the residual search is defined over [-1, 1] with a granularity of 0.01. This seems arbitrary. A sensitivity analysis exploring the trade-offs between these parameters would be necessary to understand the design space this novel mechanism opens up.
Questions to Address In Rebuttal
-
On the Novelty Compared to Hierarchical Formats: The proposed mechanism Scale ≈ Base + E * R can be viewed as a form of structured, linear quantization of the scale factors themselves. Can the authors more explicitly contrast their approach with other hierarchical quantization formats (e.g., Figna [31], which also uses hierarchical representations) and explain why their specific linear approximation model is better suited for this problem?
On the Robustness of the Core Assumption: What is the performance of Amove on a model that is known to have a pathological, non-light-tailed distribution of scale factors? How does the fitting error (as shown in Figure 15) behave in such a case, and what is the resulting impact on model perplexity? Does the mechanism degrade gracefully, or does it fail catastrophically?
-
On the Online Computation of the Residual: Algorithm 1 (page 6) describes different methods for computing the residual R for weights (offline search) and activations (online average deviation). The online computation for activations is a simple average, which seems far less precise than the offline MSE-based search for weights. Could you quantify the accuracy impact of this simplification? What is the hardware cost and latency of performing a more rigorous online search for the activation residual, and why was this option discarded?
Justification of the Residual's Dynamic Range: The residual is calculated from deviations from a base scale factor. What is the empirical distribution of these residuals? Is an 8-bit FP8 representation for the residual (as used in Amove-Conservative/Aggressive) typically sufficient, or are there cases where the required residual value exceeds this range, leading to clipping and accuracy loss?
MX+: Pushing the Limits of Microscaling Formats for Efficient Large Language Model Serving
Abstract
Reduced-precision data formats are crucial for cost-effective serving of large language models (LLMs). While numerous reduced-precision formats have been introduced thus far, they often require intrusive modifications to the software frameworks or are ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors identify that low-bit microscaling (MX) formats, particularly MXFP4, suffer significant performance degradation in LLMs due to large quantization errors on outlier activation values. They propose MX+, an extension to the MX standard, which repurposes the exponent bits of the largest value in a block (the Block Max, or BM) to serve as additional mantissa bits, thereby increasing its precision. The location of this BM is stored using an additional 8-bit index per block. The authors evaluate this proposal via software emulation, a proposed software implementation using an extra matrix multiplication, and a new hardware design, claiming significant accuracy improvements with negligible overhead.
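For the bit-level intuition behind this repurposing, the redundancy can be stated compactly. The scale-selection rule below is the common MX convention of deriving the shared exponent from the block maximum; it is assumed here for illustration rather than quoted from the paper.

```latex
% Sketch of the redundancy MX+ exploits, assuming the shared scale is derived
% from the block maximum (the usual MX convention).
\[
X \;=\; 2^{\left\lfloor \log_2 \max_i |x_i| \right\rfloor \;-\; e_{\max}},
\qquad
\left\lfloor \log_2 \frac{|x_{\mathrm{BM}}|}{X} \right\rfloor \;=\; e_{\max},
\]
% i.e., the Block Max always lands in the top exponent bin of the element
% format (e.g., E2M1 for MXFP4), so its private exponent bits carry no
% information and can be reinterpreted as extra mantissa bits for that one
% element, at the cost of an index identifying which element is the BM.
```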
Strengths
-
Clear Problem Motivation: The paper provides a convincing analysis in Section 3 (pages 2-3) that pinpoints the source of performance degradation in low-bit MX formats. The investigation in Figures 3 and 4 correctly identifies that activation outliers (BMs) are the primary cause of high quantization error, both for the outliers themselves and for the other elements in the block (NBMs) that are scaled by them.
-
Intuitive Core Mechanism: The central insight—that the BM element's exponent is implicitly known and its bits can be repurposed for precision—is a simple and clever observation. This forms a logical basis for the proposed format extension.
-
Broad Empirical Evaluation: The authors have evaluated their proposal across a wide range of modern LLMs (OPT, Llama, Mistral, etc.) and on various academic benchmarks (perplexity and zero-shot tasks), lending some weight to their accuracy claims.
Weaknesses
My primary concerns with this manuscript center on logical inconsistencies between claims and evidence, questionable methodological choices in comparative analysis, and an underestimation of the proposed overheads.
-
Contradictory Claims of "Non-Intrusive" and "Negligible Slowdown": The abstract claims MX+ is a "non-intrusive extension" with "negligible slowdown." The evidence presented contradicts this directly.
- The proposed software implementation (Section 5.2, page 6) requires decomposing the BM and executing a second, sparse MMA operation. Figure 11 (page 10) shows this incurs a 1.54x slowdown in the prefill stage. This is not a "negligible" overhead.
- The proposed hardware implementation (Section 6, page 7) requires adding a "BM Detector," "Forward and Swap Unit," and "BM Compute Unit" to the Tensor Core's Dot Product Engine (DPE). Modifying the internal pipeline of a highly optimized unit like a Tensor Core is, by definition, an intrusive architectural change, not a "non-intrusive extension."
-
Unconvincing Hardware Overhead Analysis: The hardware analysis in Section 7.4 and Table 5 (page 10) is based on a 28nm technology node. This is a severely outdated process for evaluating hardware intended for state-of-the-art accelerators, which utilize 4nm or 5nm nodes. Area and power do not scale linearly between such disparate nodes. The authors' assertion that the overhead "would be even smaller if fabricated using more advanced node" is an unsubstantiated claim, not a rigorous analysis. This choice of process node appears designed to minimize the reported overhead figures.
-
Potential for Unfair Baseline Comparisons: In Section 8.1 and Table 7 (page 11), the authors compare MX+ against several other quantization schemes. For ANT, OliVe, and Tender, they create their own variants ("MX-ANT", "MX-OliVe", "MX-Tender") to support "finer-grained grouping." This raises a significant red flag. It is unclear if these re-implementations are faithful to the original work or are unoptimized strawman versions that serve to inflate the relative performance of MX+. Without a detailed description of this re-implementation and validation against the original papers' results, the integrity of this comparison is questionable.
-
Downplayed Storage and Bandwidth Overhead: The MX+ format requires an additional 8 bits per 32-element block to store the BM index. For MXFP4, a block is 32 elements * 4 bits/element = 128 bits. The overhead is therefore 8 / 128 = 6.25%. In the memory-bandwidth-bound decode phase of LLM inference, a 6.25% increase in data movement from memory is not "negligible" and will directly impact latency and energy consumption. The authors fail to quantify the performance impact of this additional bandwidth pressure.
-
Limited Scope of the Core Idea: The authors suggest applicability to non-FP formats like MXINT8 and other industry formats like NVFP4 (Section 8.2, page 12). However, the results for MXINT8+ in Table 10 show almost no benefit, and the proposed extension for NVFP4 is speculative and acknowledges it fails for blocks with small-magnitude values. This suggests the technique is a point solution for a narrow class of FP-based block formats rather than a broadly applicable principle.
Questions to Address In Rebuttal
-
Please reconcile the claims of "non-intrusive" and "negligible slowdown" with the evidence of a 1.54x prefill slowdown in the software implementation and the required modifications to the Tensor Core datapath in the hardware proposal. Which claim is correct?
-
Please provide a principled justification for using a 28nm process for the hardware overhead analysis. Can you provide a more rigorous projection of the area and power overhead on a contemporary 4nm process, accounting for differential scaling of logic and memory, as well as leakage?
-
Please provide evidence that your "MX-" implementations of ANT, OliVe, and Tender (Table 7) are fair, optimized, and faithful comparisons to the original published works. What steps were taken to ensure you were not comparing against weakened baselines?
-
Please provide an empirical analysis of the performance impact (latency, energy) of the 6.25% bandwidth overhead from the BM indices, especially in decode-bound scenarios with long output sequences. On what grounds is this overhead considered "negligible"?
-
The performance improvement of MXFP4++ over MXFP4+ appears marginal in many perplexity results (e.g., Table 3, Llama-3.1-8B, Mistral-7B). Does the added complexity of managing a second, decoupled scale factor for NBMs justify this minor gain?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces MX+, a non-intrusive extension to the emerging industry-standard Microscaling (MX) data formats, designed to improve the performance of Large Language Models (LLMs) under ultra-low-bit quantization. The authors correctly identify that the primary obstacle to aggressive 4-bit quantization (specifically, for activations) is the presence of high-magnitude outliers, which cause significant quantization error for both themselves and the other values within their block.
The core contribution is an elegant, format-level solution to this problem. The authors leverage the insight that in a block floating-point (BFP) format like MX, the largest value in a block (the "Block Max" or BM) implicitly has its exponent set to the maximum representable value. Therefore, its explicit exponent field is redundant. MX+ repurposes this redundant exponent field as an extended mantissa, affording the outlier element significantly higher precision without changing its bit-width. This simple change dramatically improves model accuracy for 4-bit formats with negligible storage and computational overhead, making W4A4 (4-bit weights and activations) inference far more viable. The authors support their proposal with a comprehensive evaluation, including software emulation, a detailed hardware implementation proposal for GPU Tensor Cores, and comparisons against a wide array of existing quantization schemes.
Strengths
-
High-Impact Problem and Timely Contribution: The work is situated squarely at the epicenter of a critical challenge in ML systems: enabling efficient 4-bit inference for LLMs. More importantly, by building directly upon the OCP Microscaling (MX) standard [47], the paper ensures its immediate relevance to the direction the industry is already heading. It is not an academic exercise in a vacuum but a direct and practical response to the limitations of an emerging standard.
-
Elegant and Pragmatic Solution: The core idea of repurposing the BM's exponent field is exceptionally clever in its simplicity. Unlike many outlier-mitigation techniques that require complex and often data-dependent pre-processing (e.g., SmoothQuant [63], QuaRot [3]), MX+ is a self-contained, format-level fix. This pragmatism is its greatest strength; it minimizes changes to the software stack and proposes a highly plausible, low-overhead hardware modification (Section 6, page 7). This makes the barrier to real-world adoption remarkably low.
-
Comprehensive Contextualization and Evaluation: The authors have done an excellent job placing their work within the broader landscape. The analysis in Section 2 (page 2) provides a clear overview of industry-driven BFP variants. The empirical evaluation is thorough, spanning multiple models and scales, and the comparisons in Section 8 (page 11) against a host of other academic and industry proposals (Atom, Olive, Tender, etc.) are invaluable for understanding where MX+ fits. The demonstration of synergy with an orthogonal method like Activation-aware Weight Quantization (AWQ) in Table 8 (page 11) is particularly insightful and showcases a deep understanding of the field.
Weaknesses
-
Novelty is in the Solution, Not the Problem: The paper's primary novelty lies in the elegance of its solution. The underlying problem—that outliers in activations are the bane of low-bit quantization—is widely known and has been the subject of intense research for several years. The paper would benefit from making this distinction clearer; its contribution is a powerful new tool in the fight against outliers, rather than the discovery of the fight itself.
-
Understated Philosophical Distinction: The paper compares MX+ to many algorithmic quantization schemes but could more forcefully articulate the fundamental difference in approach. MX+ is a format-level solution, while techniques like SmoothQuant are algorithm-level data transformations. These are not necessarily mutually exclusive. The paper demonstrates this empirically with AWQ, but a more explicit discussion of this format-vs-algorithm dichotomy could strengthen the paper's positioning and help readers understand the unique niche MX+ occupies.
-
The "Multiple Outliers" Analysis Feels Incomplete: The analysis in Section 8.3 (page 12) on addressing multiple outliers per block is a welcome addition, but it feels somewhat brief. The conclusion that handling the top-1 or top-2 outliers provides the most benefit is a significant finding. This could be elevated from a secondary analysis to a more central point about the inherent trade-offs of this solution class. It suggests that while the MX+ approach is highly effective, there may be a fundamental "accuracy ceiling" that can only be surpassed by combining it with the aforementioned algorithmic techniques.
Questions to Address In Rebuttal
-
On Synergy with Algorithmic Pre-processing: The paper demonstrates a successful combination of MX+ with AWQ for weights. Could the authors elaborate on the potential synergy between MX+ for activations and algorithm-level pre-processing techniques like SmoothQuant? For example, could one first apply a light-touch version of SmoothQuant to reduce the magnitude of the most extreme outliers, and then use MXFP4+ to more faithfully represent the now-smaller (but still significant) outliers? Does this combination offer a better accuracy-complexity trade-off?
-
The "Sweet Spot" of Complexity: The analysis in Section 8.3 indicates that focusing on the single largest element (BM) provides the best return on investment. Could the authors expand on why they believe this is the case? Is it because in most LLM activations, there is typically one dominant outlier per block, or is the complexity of tracking and encoding a second outlier (e.g., requiring more index bits, more complex hardware logic) simply not worth the marginal accuracy gain? This would help solidify the design choice of MX+.
-
Path to Real-World Adoption: The proposed hardware extension in Section 6 is clean and appears to have low overhead. Beyond technical feasibility, what do the authors see as the main non-technical hurdles to the adoption of MX+ into the official OCP MX specification or into proprietary hardware like NVIDIA's Tensor Cores? Does the added instruction complexity or the need to manage the BM index metadata present any unforeseen challenges for compiler developers or system architects?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors propose MX+, a non-intrusive extension to the industry-standard Microscaling (MX) data formats, designed to improve the accuracy of low-bit large language model inference. The core claim of novelty rests on a specific insight into block floating-point (BFP) representations like MXFP4: the largest magnitude element in a block, the "Block Max" (BM), which determines the shared block-wide exponent, will itself always be quantized using the maximum representable private exponent for its data type. The authors identify that the exponent bits for this specific BM element are therefore redundant. The proposed contribution, MX+, repurposes these redundant exponent bits as additional mantissa bits for the BM element only. This increases the precision of the most significant outlier in the block without changing the element bit-width or requiring different compute paths for other elements. The authors present both software emulation results demonstrating significant accuracy improvements and a hardware design for integrating MX+ into a GPU Tensor Core.
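As a sanity check on the core claim, the following short sketch (this reviewer's own, using a simplified E2M1 grid and the usual power-of-two shared-scale rule, not anything from the paper's artifact) confirms that the Block Max always saturates its private exponent field, which is exactly the redundancy MX+ repurposes:

```python
import numpy as np

# Illustrative sketch only (not the paper's code): a toy MXFP4-like quantizer.
# Assumes E2M1 elements (1 sign, 2 exponent, 1 mantissa bit; 0.5 is the lone
# subnormal; max magnitude 6.0) and shared scale 2^(floor(log2(max)) - emax),
# with emax = 2 for E2M1, roughly following the OCP MX convention.
E2M1_GRID = [(0, 0, 0.0), (0, 1, 0.5)] + [
    (e, m, (2.0 ** (e - 1)) * (1.0 + 0.5 * m)) for e in (1, 2, 3) for m in (0, 1)
]

def quantize_e2m1(x):
    """Round a non-negative scaled value to the nearest E2M1 magnitude (clamped at 6.0),
    returning its (exponent_field, mantissa_bit)."""
    x = min(x, 6.0)
    e, m, _ = min(E2M1_GRID, key=lambda t: abs(t[2] - x))
    return e, m

rng = np.random.default_rng(0)
for _ in range(1000):
    block = np.abs(rng.normal(size=32)) + 1e-6             # one 32-element MX block
    shared_exp = int(np.floor(np.log2(block.max()))) - 2   # emax(E2M1) = 2
    scaled = block / (2.0 ** shared_exp)                   # block max lands in [4, 8)
    e_bm, _ = quantize_e2m1(scaled[block.argmax()])
    assert e_bm == 3  # the Block Max always gets the top exponent field
print("Block Max private exponent saturated in every block")
```

Under this reading, MX+ simply reinterprets those two always-saturated exponent bits as extra mantissa for the BM element, which is why no second datapath or variable-width encoding is required.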
Strengths
-
A Genuinely Novel Representational Trick: The central idea of identifying and repurposing the redundant private exponent bits of the Block Max element in an MXFP-style format appears to be a novel contribution. While the problem of handling outliers in quantization is extremely well-trodden, the mechanism proposed here is unique. Prior art typically attacks this problem by either (a) pre-processing tensors to reduce outlier magnitude (e.g., SmoothQuant, QuaRot), or (b) using explicit mixed-precision formats where outliers are stored and computed using a different data type entirely (e.g., INT8 for outliers, INT4 for others, as in Atom or Olive). MX+ is distinct because it achieves higher precision for the outlier within the same uniform low-bit data stream. This is an elegant and clever format-level optimization.
-
Pragmatic Grounding in an Existing Standard: The contribution is not an entirely new format created in a vacuum; it is a direct and backward-compatible extension of the OCP Microscaling formats. This grounding in an emerging industry standard makes the idea more than a mere academic curiosity. The non-intrusive nature—maintaining a fixed bit-width per element—is a significant strength over schemes that introduce unaligned memory access patterns.
-
Clear Identification of the Enabling Condition: The authors correctly identify that this trick is specifically enabled by the architecture of the MXFP formats, which feature both a shared block-level exponent and a per-element private exponent. Simpler BFP formats like MSFP (Section 2, page 2), which lack a per-element exponent field, would not permit this specific form of bit repurposing. This demonstrates a clear understanding of the design space and pinpoints the exact source of the exploitable redundancy.
Weaknesses
-
Narrowly Scoped Novelty: While the core mechanism is novel, its applicability is inherently limited to a specific subclass of BFP formats. The contribution is an incremental, albeit very clever, optimization on a pre-existing format, rather than a fundamentally new framework for quantization. The paper's novelty is thus contingent on the existence and adoption of the MXFP format itself. This is not a critique of the idea's value, but an observation on the scope of its conceptual advancement.
-
"Non-Intrusive" Claim Understates Hardware Complexity: The authors repeatedly use the term "non-intrusive." While this holds true from a software and memory layout perspective (uniform bit-width), the proposed hardware implementation in Section 6 (page 7) is decidedly intrusive to the Tensor Core's Dot Product Engine (DPE). The design adds a BM Detector, a Forward and Swap Unit (FSU), and a dedicated BM Compute Unit (BCU). These are non-trivial additions to what is typically a highly optimized, rigid datapath. This represents a significant deviation from a standard DPE and incurs area, power, and, critically, design validation costs. The novelty here is in the format, but the proposed hardware to support it is a specialized datapath modification, the complexity of which is somewhat downplayed.
-
The Underlying Problem is Not New: The paper addresses the age-old problem of outliers. The novelty is in the solution's mechanism, not in the problem definition or the high-level strategy ("give more precision to outliers"). The paper's framing could more sharply distinguish its method from the well-established goal shared by dozens of other papers in this area.
Questions to Address In Rebuttal
-
On Hardware Novelty vs. Complexity: The term "non-intrusive" seems to conflict with the hardware design presented in Section 6. The proposed BM Detector, FSU, and BCU represent a specialized architectural modification. Can the authors justify this claim more rigorously? Specifically, how does the complexity and novelty of this hardware modification compare to the alternative of implementing dual-path MAC units (e.g., for INT4/INT8) as seen in prior hardware-centric outlier-aware works?
-
On Prior Art of Bit Repurposing: The core idea is repurposing redundant bits in a numerical format for an alternative semantic meaning (extended precision). While its application to BFP outliers seems new, have the authors conducted a broader search for similar bit-repurposing concepts in other numerical formats or computer arithmetic contexts, beyond the standard use for signaling NaNs or subnormals? Please clarify if this specific type of representational optimization has any precedent in other domains.
-
On the Limits of the Contribution: The analysis of multiple outliers in Section 8.3 (page 12) suggests that the proposed method, which targets a single BM element, captures the lion's share of the benefit, with diminishing returns for handling a second or third outlier. Does this imply that the novel mechanism is fundamentally a "one-shot" trick, and that the residual problem of multiple co-located outliers requires reverting to more traditional (and less novel) techniques like grouping or explicit mixed precision? This would help frame the true boundary and impact of this specific novel contribution.
Micro-MAMA: Multi-Agent Reinforcement Learning for Multicore Prefetching
Abstract
Online reinforcement learning (RL) holds promise for microarchitectural techniques like prefetching. Its ability to adapt to changing and previously-unseen scenarios makes it a versatile technique. However, when multiple RL-operated components compete for ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper addresses the issue of resource contention among multiple, independently operating Reinforcement Learning (RL) prefetchers in a multicore system. The authors frame this as a Multi-Agent Reinforcement Learning (MARL) problem where agents converge to a sub-optimal Nash Equilibrium. They propose µMAMA, a hierarchical system featuring a central "arbiter" agent that coordinates distributed multi-armed bandit agents. This arbiter learns to enforce globally beneficial joint actions stored in a Joint Action Value (JAV) cache, overriding local agent decisions when necessary. The authors claim that µMAMA improves system throughput by 2.1% and fairness (Harmonic Mean Speedup) by 10.4% on an 8-core system compared to uncoordinated agents, without requiring software support for workload profiling.
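Before turning to strengths and weaknesses, it is worth being concrete about the coordination pattern under review. The following Python sketch is this reviewer's own reconstruction of the arbiter/JAV-cache idea from the summary above; the class names, the epsilon-greedy switch, and the cache capacity are assumptions, not details from the paper:

```python
import random

# Hedged sketch of the reviewed coordination pattern, not the authors' design.
# Names (JAVCache, epsilon, etc.) and the epsilon-greedy rule are assumptions.
class JAVCache:
    """Tiny store of the best-scoring joint actions (one action per local agent)."""
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.entries = []                       # list of (joint_action, reward)

    def update(self, joint_action, reward):
        self.entries.append((tuple(joint_action), reward))
        self.entries = sorted(self.entries, key=lambda e: e[1], reverse=True)[: self.capacity]

    def best(self):
        return list(self.entries[0][0]) if self.entries else None

def arbiter_step(local_agents, jav_cache, global_reward_fn, epsilon=0.2):
    """One coordination timestep: explore locally with probability epsilon (or if the
    cache is empty), otherwise enforce the best known joint action."""
    cached = jav_cache.best()
    if cached is None or random.random() < epsilon:
        joint_action = [agent.choose() for agent in local_agents]   # local exploration
    else:
        joint_action = cached                                       # global exploitation
    reward = global_reward_fn(joint_action)      # e.g., an estimated weighted speedup
    jav_cache.update(joint_action, reward)
    return joint_action, reward
```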
Strengths
- Problem Formulation: The paper correctly identifies a valid and challenging problem in multicore systems architecture. Framing the competition between RL-based components through the lens of MARL and game theory (Section 2.2) is appropriate and provides a solid theoretical foundation for the work.
- Fairness/Throughput Analysis: The evaluation of the system's flexibility in trading off throughput for fairness is a good contribution (Section 6.4, Figure 14). Demonstrating that the reward function can be tuned to target different points on the performance-fairness Pareto frontier is a compelling use of the RL framework.
- Hierarchical Approach: The core idea of using a central supervisor to guide distributed agents is a standard and sensible pattern in MARL. The architecture, which attempts to balance local exploration with global exploitation, is logical in its structure.
Weaknesses
My primary concerns with this paper are the marginal performance gains for the primary throughput metric, the questionable assumptions underpinning the core reward estimation heuristic, and an incomplete evaluation that may overstate the benefits of the proposed technique.
-
Insignificant Throughput Improvement and Performance Regressions: The headline claim of a 2.1% average throughput improvement on an 8-core system (Figure 9) is tenuous at best. This level of improvement is well within the range of noise for architectural simulations and is insufficient to justify the proposed hardware complexity. More damningly, the per-workload breakdown in Figure 10b reveals that µMAMA results in a performance slowdown for a non-trivial number of workload mixes. The paper fails to provide any analysis or explanation for these performance regressions, which undermines the claim of intelligent, system-aware coordination. An effective coordinator should, at minimum, not make the system worse.
-
Fundamentally Flawed Reward Heuristic: The entire premise of a software-transparent µMAMA hinges on the ability to estimate system-level rewards at runtime. The proposed heuristic for estimating multicore slowdown (S_MP in Equation 4, page 5) is based solely on the fraction of L2 misses a core contributes. This is a gross oversimplification. Multicore contention is a complex phenomenon driven by interconnect traffic, memory controller queuing, row buffer conflicts, and cache coherence traffic—not just raw L2 misses. The authors themselves prove the weakness of this heuristic in Section 6.6.3, where the µMAMA-Profiled version (using ground-truth data) achieves a significantly higher 3.0% speedup over Bandit. This 0.9% gap is a direct measurement of the error introduced by the flawed heuristic, and it accounts for nearly half of the claimed benefit over the baseline.
Choice of Baseline: The evaluation exclusively compares against the Micro-Armed Bandit prefetcher [17]. While a relevant work, the authors' own data in Figure 9 suggests this may not be the strongest baseline. At 8 cores, the non-RL Bingo prefetcher appears to outperform the independent Bandit agents, making it a more challenging and appropriate baseline. The entire motivation of contention (Figure 3) is built on the observation that Bandit becomes overly aggressive with more cores. It is not demonstrated that other state-of-the-art prefetchers exhibit this same pathological behavior. The paper's claims would be substantially more credible if µMAMA's benefits were demonstrated against a system of independent, state-of-the-art non-RL prefetchers.
-
Understated Complexity and Scalability Concerns: The paper describes µMAMA as "light-weight," but the proposed mechanism introduces non-trivial complexity. It requires a central unit, additional on-chip network traffic for synchronization and state-sharing, and a multi-step communication protocol (Figure 8). The "planning one step ahead" mechanism is presented as a solution to latency, but it does not eliminate the need for a system-wide synchronization barrier at each timestep. The requirement for a "majority of local agents" to check in (Section 4.3.1) is a potential scalability bottleneck, as the timestep duration will be dictated by stragglers. The analysis in Section 4.4 seems optimistic and does not adequately address potential queuing delays or network congestion in larger systems (e.g., 32+ cores).
Questions to Address In Rebuttal
- Please justify how a 2.1% average throughput improvement, which includes performance regressions on several workloads (Figure 10b), is sufficient to warrant the implementation of the µMAMA architecture. Please provide a detailed characterization of the workloads where µMAMA degrades performance and explain the mechanism causing this failure.
- The reward heuristic in Equation 4 is a critical point of weakness. Please provide a quantitative analysis of the error between your L2-miss-based estimation of S_MP and the ground-truth values. Justify why this heuristic is sufficient, given that it ignores other major contributors to memory system contention.
- Please justify the selection of uncoordinated Micro-Armed Bandit agents as the sole baseline for comparison. How does µMAMA's performance improvement change if the baseline is, instead, a system of uncoordinated Bingo prefetchers, which your own data in Figure 9 suggests is a stronger performer at 8 cores?
- The timestep synchronization protocol described in Section 4.3.1 appears to be a scalability concern. Please elaborate on how this mechanism would perform in a system with 32 or 64 cores. Specifically, how would the variance in local agent step completion times affect overall system performance and the learning rate?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper identifies and addresses a critical problem arising at the intersection of online machine learning and multicore processor design: the "Tragedy of the Commons" among independent, RL-based hardware prefetchers. The authors compellingly demonstrate that when multiple cores each use a greedy RL prefetcher (like Micro-Armed Bandit) to optimize their local performance, they collectively create excessive memory bandwidth contention, leading to a globally sub-optimal outcome.
The proposed solution, µMAMA, is a lightweight, hierarchical multi-agent reinforcement learning (MARL) framework. It introduces a central "arbiter" agent that supervises the distributed local prefetcher agents. This arbiter learns to dynamically decide between two modes: (1) allowing local agents to explore actions independently to find new policies, or (2) enforcing a known, high-performing "joint action" for all prefetchers simultaneously from a small cache of globally-optimal policies (the JAV cache). The framework is notable for its practical design, including a hardware-friendly heuristic for estimating system-wide performance at runtime and the ability to be reconfigured to optimize for different system goals, such as throughput or fairness, by simply changing its reward function.
The evaluation shows that µMAMA improves system throughput by 2.1% and fairness (Harmonic Mean Speedup) by 10.4% on an 8-core system compared to uncoordinated agents. Crucially, its benefits are shown to be most significant in systems with constrained memory bandwidth, where coordination is most needed.
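For context on the two objectives discussed here, the sketch below restates the textbook definitions of Weighted Speedup and Harmonic Mean Speedup; it is this reviewer's own illustration and is not tied to the paper's implementation:

```python
# Standard multiprogrammed-workload metrics (textbook definitions, not from the paper).
# ipc_shared[i] / ipc_alone[i] is core i's speedup relative to running alone.
def weighted_speedup(ipc_shared, ipc_alone):
    """Throughput-oriented reward: sum of per-core relative speedups."""
    return sum(s / a for s, a in zip(ipc_shared, ipc_alone))

def harmonic_mean_speedup(ipc_shared, ipc_alone):
    """Fairness-oriented reward: harmonic mean of per-core relative speedups."""
    n = len(ipc_shared)
    return n / sum(a / s for s, a in zip(ipc_shared, ipc_alone))

# Example: one core starved, one boosted. WS barely moves, HS drops sharply.
print(weighted_speedup([1.8, 0.3], [1.0, 1.0]))       # ~2.1
print(harmonic_mean_speedup([1.8, 0.3], [1.0, 1.0]))  # ~0.51
```

Swapping which of these is fed back as the arbiter's reward is, per the summary above, all that is needed to move along the throughput-fairness frontier.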
Strengths
-
Excellent Problem Formulation and Contextualization: The paper's primary strength is its clear and insightful framing of a practical microarchitectural challenge within a robust theoretical context. By explicitly connecting the issue of prefetcher contention to foundational concepts from MARL and Game Theory (e.g., selfish agents converging to an inefficient Nash Equilibrium, as illustrated in Section 2.2, page 2), the authors elevate the discussion beyond a simple hardware tuning problem. The motivation presented in Section 3 is particularly strong, with Figure 3 (page 3) providing a powerful empirical justification for the work by showing that the Bandit prefetchers become pathologically aggressive as the core count scales.
-
An Elegant and Pragmatic System Design: The µMAMA architecture is a clever solution to a problem with an intractably large search space. Instead of a monolithic RL agent, the authors propose a hierarchical system that balances local exploration with global exploitation. This "arbiter" model is a well-known pattern in MARL theory, and its application here is both novel and fitting. The design demonstrates a keen awareness of hardware reality through its lightweight nature, its latency-hiding communication protocol (Section 4.3.2, page 7), and its use of simple, computable heuristics (Equation 5, page 6) rather than requiring costly offline profiling.
-
Flexibility and Exploration of the Throughput-Fairness Tradeoff: Perhaps the most significant contribution of this work is its demonstration of a flexible framework for policy optimization. By showing that the system's objective can be shifted from throughput (Weighted Speedup) to fairness (Harmonic Mean Speedup) by simply altering the reward calculation (Section 4.2.5, page 7), the authors present µMAMA not just as a better prefetcher, but as a generalizable resource management technique. The Pareto frontier plot in Figure 14 (page 11) is an outstanding piece of analysis that beautifully visualizes this tradeoff space, allowing system designers to choose an operating point that best fits their application's needs. This is a crucial capability for modern heterogeneous computing environments.
-
Strong Connection Between Thesis and Results: The experimental results directly validate the paper's core thesis. The finding that µMAMA's advantage over independent agents grows as memory bandwidth becomes more constrained (Figure 11, page 10) provides direct evidence that the system is correctly identifying and mitigating the negative effects of resource contention.
Weaknesses
My critiques are less about flaws in the existing work and more about its current boundaries and un-explored connections.
-
Limited Scope of Agent Coordination: The work convincingly demonstrates coordination among a homogeneous set of identical, Bandit-based L2 prefetchers. This is an excellent starting point, but it represents the simplest case of multi-agent coordination. The real world is messier. Future systems will likely involve heterogeneous agents—both in terms of their learning algorithms (e.g., a Bandit-based agent coexisting with a more complex one like Pythia) and their location in the system (e.g., coordinating an L1 prefetcher with an L2 prefetcher and a DRAM controller). The current µMAMA framework, with its fixed-size joint actions, may need significant adaptation to handle such heterogeneity. The paper acknowledges this briefly in Future Work (Section 7), but the potential challenges are non-trivial.
-
Reliance on a Specific Behavioral Proxy for Reward: The runtime reward estimation hinges on the heuristic that per-core L2 misses are a strong proxy for a workload's sensitivity to memory contention (Section 4.2.1, page 5). While the results show this works well, and the comparison to a profile-guided version (Section 6.6.3) shows it's a reasonable approximation, this ties the system's intelligence to a specific, indirect "sensor" of system state. This approach is common and necessary in hardware, but it would be valuable to understand its limits. Are there classes of applications (e.g., those with low miss rates but high bandwidth demands due to streaming behavior) where this heuristic could mislead the arbiter, causing it to deprioritize the wrong workloads? This connects to the broader research challenge of identifying robust, low-cost features for online learning in microarchitecture.
Questions to Address In Rebuttal
-
Could the authors elaborate on how the µMAMA framework might be extended to coordinate a heterogeneous set of agents? For instance, what are the primary challenges in coordinating an L1 prefetcher and an L2 prefetcher, which operate on different timescales and have different action spaces? How would the JAV cache represent a "joint action" in such a scenario?
-
The reward estimation relies on L2 misses to approximate a workload's slowdown due to contention (S_MP). Have the authors considered or identified scenarios where this heuristic might be less effective? For example, in a system with significant prefetcher-induced cache pollution but not necessarily overwhelming bandwidth demand, could this metric lead the arbiter to make sub-optimal decisions?
The arbiter in its current form acts as a high-level bandit, choosing between local control or a static joint policy from the JAV cache. This is stateless. Do the authors see a path toward a more state-aware coordinator? For example, could the arbiter's decision be conditioned on a global state vector (e.g., total LLC misses, memory bandwidth utilization) to learn a more dynamic, system-wide control policy?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper identifies the well-known problem of resource contention among multiple independent reinforcement learning (RL) agents in a multicore system, specifically focusing on Micro-Armed Bandit prefetchers. When each agent selfishly optimizes its own core's IPC, the system can converge to a globally sub-optimal state (a Nash Equilibrium) due to excessive memory bandwidth consumption. To address this, the authors propose µMAMA, a centralized supervisor that coordinates these distributed bandit agents. The core mechanism involves a high-level RL agent (an "arbiter") that learns to decide between allowing the local agents to explore independently or enforcing a known, high-performing "joint action" stored in a Joint Action Value (JAV) cache. The system is designed to be lightweight and adaptable to different optimization goals, such as throughput or fairness.
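To ground the game-theoretic framing used throughout this review, a two-core toy payoff matrix (this reviewer's own numbers, chosen only to illustrate the structure) shows how mutually aggressive prefetching can be a Nash equilibrium even though both cores would prefer the coordinated outcome:

```python
# Toy illustration of the Nash-equilibrium framing (all numbers are invented).
# Each core picks a prefetch aggressiveness; its payoff is its own IPC.
PAYOFF = {  # (core0_action, core1_action) -> (ipc0, ipc1)
    ("moderate", "moderate"): (1.00, 1.00),
    ("moderate", "aggressive"): (0.80, 1.05),   # the aggressive core wins bandwidth
    ("aggressive", "moderate"): (1.05, 0.80),
    ("aggressive", "aggressive"): (0.85, 0.85), # shared bandwidth saturates
}

def best_response(options, other_action, me):
    """Selfish best action for core `me` given the other core's action."""
    return max(options, key=lambda a: PAYOFF[(a, other_action) if me == 0 else (other_action, a)][me])

# Whatever the other core does, "aggressive" is each core's selfish best response...
assert best_response(("moderate", "aggressive"), "moderate", 0) == "aggressive"
assert best_response(("moderate", "aggressive"), "aggressive", 0) == "aggressive"
# ...so (aggressive, aggressive) is the equilibrium, yet it is Pareto-dominated
# by (moderate, moderate): 0.85 < 1.00 for both cores.
print(PAYOFF[("aggressive", "aggressive")], "<", PAYOFF[("moderate", "moderate")])
```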
Strengths
- Clear Problem Formulation: The paper does an excellent job of framing a known microarchitectural issue (prefetcher contention) within the formal language of Multi-Agent Reinforcement Learning (MARL), correctly identifying the behavior as convergence to an undesirable Nash Equilibrium.
- Pragmatic, Hardware-Aware Design: The proposed solution is a simplification of more complex hierarchical MARL schemes, tailored for a low-overhead hardware implementation. The JAV cache and the simple bandit-based arbiter are plausible hardware structures.
- Demonstration of Flexibility: The evaluation effectively shows how changing the reward function (e.g., from Weighted Speedup to Harmonic Speedup) allows the single architecture to target different points on the throughput-fairness Pareto frontier (Figure 14). This is a strong feature of the design.
Weaknesses
The central weakness of this paper, from a novelty perspective, is that its core conceptual framework is a direct application of well-established principles in hierarchical and cooperative MARL, which have also been previously explored in the computer architecture domain.
-
Lack of Algorithmic Novelty: The fundamental idea of a high-level supervisor or coordinator managing a set of low-level, distributed agents is a cornerstone of Hierarchical RL (HRL) and cooperative MARL. The authors themselves concede that their solution "resembles some of the latter works" in HRL (Section 2.2.3, page 3) and is designed "in a similar fashion as prior theoretical work [39]" (Section 4.1, page 4). Reference [39] (Ontañón, 2017) explicitly deals with combinatorial multi-armed bandits, where the goal is to find the best combination of actions—which is precisely what the µMAMA arbiter and JAV cache aim to do by learning and enforcing optimal joint actions. The contribution is therefore not the invention of a new coordination paradigm, but rather the specific engineering and application of a known paradigm to bandit-based prefetchers.
-
Prior Art in Architecture: The concept of using hierarchical RL for resource management in multicores is not new. For instance, Jain et al. [24] proposed a hierarchical Q-learning approach to co-optimize cores, caches, and the NoC. While their work targeted different components (DVFS, LLC partitioning) and used a different algorithm (Q-learning vs. MABs), the architectural pattern of a high-level learning agent coordinating low-level actions is functionally identical. The "delta" between µMAMA and this prior work is the choice of learning algorithm and the specific target, not the hierarchical coordination concept itself.
-
Marginal Gains vs. Added Complexity: The proposed system introduces a non-trivial level of complexity: a central unit, a communication protocol for synchronization and reward propagation, and additional storage for the arbiter and JAV cache. The reported average throughput improvement is modest, at 2.1% for an 8-core system (Figure 9, page 9). While the gains are higher in more constrained scenarios or for fairness, it is questionable whether this learning-based approach offers a significant advantage over simpler, non-learning coordination heuristics (e.g., a centralized bandwidth arbiter or a global throttling mechanism like in Ebrahimi et al. [13]) that would be far less complex to implement and verify. The novelty does not appear to be disruptive enough to justify the complexity for the performance delivered.
Questions to Address In Rebuttal
-
Please clarify the fundamental algorithmic novelty of the arbiter/JAV cache mechanism. How does this system differ conceptually from the combinatorial MAB framework in [39], which also seeks to find optimal combinations of actions? Beyond the application domain, what is the new theoretical or algorithmic insight?
-
The paper's novelty rests heavily on being a superior coordination mechanism. Please provide a comparison against a state-of-the-art heuristic-based (non-RL) prefetcher coordination policy (e.g., a feedback-directed throttling mechanism). This is crucial to establish that the complexity of a MARL approach is truly necessary and superior to simpler engineered solutions.
-
The JAV cache stores a small, fixed number of globally "best" joint actions. How does this approach avoid premature convergence to a sub-optimal joint policy, especially in complex workloads with many distinct execution phases? It seems plausible that the system could overfit to a policy that was optimal in an early phase, and the limited size of the JAV cache would prevent it from discovering a new, globally optimal policy later on.
Ghost Threading: Helper-Thread Prefetching for Real Systems
Abstract
Memory latency is the bottleneck for many modern workloads. One popular solution from literature to handle this is helper threading, a technique that issues light-weight prefetching helper thread(s) extracted from the original application to bring data ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper presents "Ghost Threading," a helper-threading prefetching technique intended for deployment on existing commercial processors. The core proposition is to use an idle SMT context on the same physical core to run a distilled version of the main thread's code (a "p-slice") to prefetch data. To manage the classic timeliness problem in prefetching, the authors propose a novel synchronization mechanism that uses the x86 serialize instruction to slow down the helper thread when it runs too far ahead. The authors claim this software-only approach achieves a 1.33× geometric mean speedup over a single-threaded baseline on an Intel Core i7 processor, outperforming conventional software prefetching and SMT-based parallelization.
However, a rigorous examination reveals that the work rests on several questionable assumptions and significant limitations. The claims of being "software-only" and practical for "real systems" are undermined by a critical dependence on very specific hardware features, a stark performance gap between manual and automated implementations, and poor scalability beyond a single core.
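For concreteness, the run-ahead control loop at the heart of the proposal can be sketched as follows. This is a conceptual reconstruction by the reviewer, not the authors' code: the threshold names follow Figure 4d, but their values, the iteration counters, and the stall modeling are invented, and the real mechanism issues serialize on the helper's SMT context rather than calling a function:

```python
# Conceptual reconstruction of the run-ahead throttling policy under review
# (the reviewer's own, not the authors' code). Threshold names follow Figure 4d;
# the values and counters are invented for illustration.
TOO_FAR = 100   # assumed: helper leads by >= 100 loop iterations -> throttle
CLOSE = 20      # assumed: helper leads by <= 20 iterations -> run freely

def helper_should_stall(main_iter, helper_iter, currently_stalling):
    """Hysteresis between CLOSE and TOO_FAR keeps prefetches timely but not evicted."""
    distance = helper_iter - main_iter
    if distance >= TOO_FAR:
        return True            # too far ahead: prefetched lines risk eviction before use
    if distance <= CLOSE:
        return False           # too close: prefetches would not hide memory latency
    return currently_stalling  # in between: keep whatever mode we were in

# Example trace: the helper advances only when the policy lets it.
main_iter, helper_iter, stalling = 0, 0, False
for _ in range(300):
    stalling = helper_should_stall(main_iter, helper_iter, stalling)
    if not stalling:
        helper_iter += 2       # the distilled p-slice runs faster than the main loop
    main_iter += 1
print("final run-ahead distance:", helper_iter - main_iter)  # converges into the CLOSE..TOO_FAR window
```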
Strengths
- Clever Use of an Existing Instruction: The proposal to use the serialize instruction for lightweight thread throttling is an inventive, non-obvious application of a feature designed for other purposes. It avoids the high overhead of OS-level context switching.
- Evaluation on Real Hardware: The evaluation is conducted on a modern, commercially available processor, not a simulator. This provides tangible evidence of the technique's behavior in a real-world microarchitectural environment, which is a commendable aspect of the methodology.
- Demonstrated Single-Core Performance: For a specific subset of memory-bound, single-threaded workloads, the hand-tuned implementation of Ghost Threading does demonstrate a significant performance improvement (e.g., the synthetic Camel benchmark in Figure 3, page 5), validating that the core concept can be effective under ideal conditions.
Weaknesses
-
Mischaracterization as "Software-Only": The central claim of a "software-only" solution (Abstract, page 1) is misleading. The technique is critically dependent on two non-trivial and non-ubiquitous hardware features: Simultaneous Multithreading (SMT) and the
serializeinstruction, which the paper notes is only available on recent generations of Intel processors. This is not a generally applicable software technique but rather a software trick that targets a very specific hardware configuration. This positioning against "prior work relying on... hardware support" is therefore disingenuous. -
Unsubstantiated Claims about the Synchronization Mechanism: The paper claims that using
serializeis an "almost ideal solution" because it "stops the pipeline from fetching" and thus "only consumes modest backend resources" (Section 1, page 2). This is a strong microarchitectural claim made without any supporting evidence. No performance counter data is presented to show the actual impact on front-end vs. back-end utilization, port contention, or other shared resources. It is entirely possible that this blunt-force stalling mechanism introduces inefficiencies that are not accounted for. The mechanism controls runahead in terms of loop iterations (Figure 10, page 12), which is merely a proxy for prefetch timeliness. The authors provide no evidence that the chosen iteration distance thresholds are optimal for hiding memory latency without causing cache pollution. -
Infeasibility of Automation Undermines Practicality: The results in Figure 6 (page 9) show a stark contrast between the manually extracted Ghost Threads (1.33x geomean speedup) and the compiler-extracted version (1.11x geomean speedup). A 22% performance degradation from automation is an enormous gap that effectively invalidates the claim of this being a practical, general-purpose technique. The authors attribute this to "unnecessary control flow" (Section 6.1, page 9) but do not elaborate on why their compiler pass is incapable of solving this. If the technique requires heroic, manual, application-specific effort to be effective, its real-world impact is negligible.
-
Poor Multi-core Scalability: The paper positions Ghost Threading for "many modern workloads," which are overwhelmingly multi-threaded and run on multi-core systems. However, the multi-core evaluation in Figure 9 (page 11) shows that the performance advantage of Ghost Threading over SMT OpenMP diminishes as the core count increases, and both techniques show poor scaling. A technique that provides its main benefit on a single core and provides marginal gains in a realistic multi-core scenario is of limited utility. The paper fails to adequately analyze the source of this limitation (e.g., memory bandwidth or last-level cache contention).
-
Methodology Relies on "Magic Numbers" and Manual Tuning: The heuristic for selecting target loads (Section 4.1, page 5) relies on arbitrary, hard-coded thresholds (CPI > 21, loop size > 10 instructions, coverage > 15%). There is no justification for these values or a sensitivity analysis to show their robustness. Furthermore, the authors admit that the crucial synchronization hyper-parameters are tuned manually by profiling, stating that a predictive model is "beyond the scope of this work" (Section 4.3.2, page 7). A technique that requires extensive, manual, per-application tuning is not a robust or deployable solution.
Questions to Address In Rebuttal
-
Can the authors justify the "software-only" label given the technique's absolute dependence on SMT and the
serializeinstruction, both of which are specific hardware features unavailable on many systems? -
Please provide quantitative, microarchitectural evidence (e.g., from performance counters) to support the claim that the
serializeinstruction allows the main thread to effectively utilize back-end resources while the helper thread is throttled. -
The performance gap between the manual and compiler-automated versions is a critical flaw for practical adoption. What are the specific, fundamental compiler analysis challenges that prevent the automated version from matching the manual one? Are these challenges solvable, or is manual extraction an intrinsic requirement for good performance?
-
The load selection heuristic uses several hard-coded constants (CPI > 21, etc.). How were these specific values determined? Please provide a sensitivity analysis showing how performance changes as these thresholds are varied. Are these values specific to the Intel Core i7-12700 architecture?
-
Given the diminishing returns in the multi-core study (Figure 9), what is the primary bottleneck (e.g., memory bandwidth, LLC contention) that limits the scalability of Ghost Threading? How does this affect its viability for the "modern workloads" it claims to target?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents "Ghost Threading," a software-only helper-thread prefetching technique designed to mitigate memory latency on modern, commodity processors. The central problem it addresses is that while helper threading is a well-established concept in the literature for improving single-thread performance, most proposed schemes rely on bespoke hardware support for inter-thread synchronization, rendering them impractical for real systems.
The authors' core contribution is a novel, low-overhead synchronization mechanism that repurposes an existing ISA feature—the serialize instruction on modern Intel processors—to cheaply throttle a prefetching helper thread running on a sibling SMT context. This clever reuse of an instruction allows the "ghost thread" to run far enough ahead of the main application thread to be effective, but not so far ahead that its prefetches are evicted from the cache before use. By making this critical synchronization step software-only and efficient, the authors make helper threading a viable technique on today's hardware.
The paper provides a strong empirical evaluation on an Intel Core i7 processor, demonstrating significant geometric mean speedups of 1.33x over a single-threaded baseline, and outperforming state-of-the-art software prefetching (1.25x) and SMT-based parallelization (1.11x). Notably, these benefits are maintained and even amplified on a busy server with high memory bandwidth pressure, highlighting the technique's robustness.
Strengths
-
Novelty in Practicality: The primary strength of this work lies not in inventing helper threading, but in making it practical. The insight to use the serialize instruction as a lightweight, user-space throttling mechanism is a genuinely clever piece of systems-level thinking. It connects a known high-level performance problem with a low-level architectural feature in a non-obvious way. This is precisely the kind of work that bridges the gap between academic exploration and real-world utility.
Excellent Contextualization and Motivation: The authors do a superb job in Section 3 (page 3) of carving out the specific problem space where Ghost Threading excels. The use of the Camel benchmark to illustrate the shortcomings of both conventional software prefetching and naive SMT parallelization is very effective. It clearly shows that there is a "sweet spot" for workloads with high-latency loads followed by non-trivial computation, and that Ghost Threading is tailored to fill this gap.
-
Robust and Realistic Evaluation: The evaluation is thorough and convincing. Comparing against both a strong software prefetching technique [3] and SMT parallelization is the correct methodology. The decision to evaluate on both an idle and a busy server (Section 6.3, page 10) is particularly commendable, as it demonstrates that the technique is not just a "laboratory" optimization but can contend with resource pressure in a more realistic environment. The inclusion of energy measurements (Section 6.2, page 9) further strengthens the paper's claims of practical benefit.
-
Enabling Future Research: By providing a practical software primitive for fine-grained inter-thread synchronization on SMT cores, this work could serve as a foundational building block for other techniques beyond prefetching. One could imagine its use in speculative execution schemes, runtime code optimization, or other scenarios requiring tightly-coupled but independent threads of execution without OS or hardware intervention.
Weaknesses
From a contextual standpoint, the main weaknesses relate to the generality and automation of the approach, which are typical for this stage of research but important to acknowledge.
-
Dependence on a Specific Architectural Feature: The core mechanism is tied to the serialize instruction, which is only available on recent-generation Intel processors. This inherently limits the portability of the technique to other architectures (e.g., AMD, ARM, RISC-V). While this does not diminish the novelty of the idea, it frames it as an Intel-specific optimization for now, rather than a universally applicable technique. The paper would benefit from a discussion of whether analogous mechanisms exist on other platforms.
Manual Tuning and Heuristics: The current implementation relies on profiling to identify target loads and, more critically, manual tuning of the synchronization hyperparameters (e.g., the runahead distance thresholds mentioned in Section 4.3.2, page 7). This raises questions about the robustness and sensitivity of the technique. A highly-tuned "hero run" is valuable, but its practical impact is limited if achieving similar results requires extensive, expert-driven effort for every new application.
-
Scope of Application: As the authors' own motivation shows, Ghost Threading is not a panacea. It targets a specific class of memory-bound loops. While the paper defines a clear heuristic for identifying these loops (Section 4.1, page 5), a broader characterization of the application domains where this is likely to succeed would be valuable for positioning the work.
Questions to Address In Rebuttal
-
Architectural Generality: Could the authors elaborate on the potential for implementing this technique on other architectures? Have they investigated whether AMD, ARM, or RISC-V processors have user-space instructions that could produce a similar low-overhead pipeline stall, even if not explicitly designed for serialization? A brief discussion on this would significantly broaden the perceived impact of the work.
-
Sensitivity of Hyperparameters: The effectiveness of the throttling mechanism seems to depend on carefully tuned parameters that define the "too close" and "too far" distances (Figure 4d, page 6). Could the authors provide a sensitivity analysis for one or two key benchmarks? Showing that the performance benefit holds across a reasonable range of parameter values—or clarifying how sharply it drops off—would provide crucial insight into the technique's practical robustness.
-
Interaction with Hardware Prefetchers: Modern processors have aggressive hardware prefetchers. How does Ghost Threading interact with them? Is it possible that the ghost thread's memory accesses "train" the hardware prefetcher in a beneficial way? Conversely, could there be scenarios where the software prefetches from the ghost thread conflict with the hardware prefetcher, leading to resource contention (e.g., in the MSHRs) or cache pollution?
-
Path to Full Automation: The paper mentions a prototype compiler pass for automating thread extraction (Section 4.4, page 7). What do the authors see as the most significant challenge in moving from this prototype to a fully automated system? Is it the p-slice extraction, or is it the much harder problem of automatically determining the optimal synchronization hyperparameters without manual profiling and tuning?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents "Ghost Threading," a software-only helper-thread prefetching technique designed for modern processors with Simultaneous Multithreading (SMT). The authors correctly identify that the central challenge in helper threading is achieving timely and low-overhead inter-thread synchronization. The core novel claim of this work is the use of the existing x86 serialize instruction as a lightweight mechanism to throttle a helper thread, preventing it from running too far ahead of the main thread. This approach avoids the high costs of OS-level synchronization and the need for bespoke hardware modifications proposed in prior work. The paper demonstrates that this technique provides significant performance improvements over baseline sequential execution, a state-of-the-art software prefetcher, and SMT-based parallelization on a range of memory-intensive workloads.
While the broader concepts of helper threading and its implementation on SMT contexts are well-established, the specific synchronization mechanism proposed here is, to my knowledge, genuinely novel.
Strengths
-
Novel Synchronization Mechanism: The primary strength of this paper is the identification and application of the serialize instruction for inter-thread throttling. This is a clever exploitation of an existing ISA feature for a purpose it was likely not designed for. Prior art in efficient helper-thread synchronization has largely bifurcated into two camps: (1) high-overhead software techniques (e.g., OS system calls) or (2) hypothetical hardware mechanisms requiring ISA extensions or microarchitectural changes (e.g., Kim et al. [23], Collins et al. [8]). This work carves out a new, practical path by finding a "sweet spot" on existing commercial hardware. The mechanism is functionally distinct from spin-loops using pause (which still consume backend resources) or memory fences (which serve a different purpose). The authors correctly identify in Section 4.3.1 (page 7) that serialize stalls the frontend, which is the ideal behavior for this use case.
Pragmatism and Real-System Implementation: The entire contribution is grounded in what is possible on today's hardware. This stands in contrast to a significant body of work in the prefetching and helper-threading space that relies on simulation of non-existent hardware. The focus on a "software-only" solution for "real systems" is not just rhetoric; the core idea is immediately deployable on a specific but important class of modern processors.
Weaknesses
-
Limited Generality of the Novel Contribution: The core novelty is inextricably tied to the serialize instruction, which is only available on recent Intel processors. The paper's title, "Helper-Thread Prefetching for Real Systems," is perhaps too broad. A more accurate description would be "for recent Intel systems." The lack of discussion regarding equivalent mechanisms on other major architectures (e.g., AMD, Arm) significantly constrains the applicability of the novel idea. Without a path to portability, the contribution, while clever, risks being a niche optimization rather than a general technique.
The Novelty is Confined to the "How," Not the "When": The paper introduces a new way to implement throttling but relies on existing, manual methods to decide when to throttle. As acknowledged in Section 4.3.2 (page 7), the crucial hyperparameters that control the inter-thread distance are tuned manually via profiling. This is a well-known and persistent weakness in this domain. While the
serializeinstruction provides a better tool, the fundamental challenge of dynamically determining the optimal runahead distance remains unsolved. A more groundbreaking contribution would have coupled the novel mechanism with a novel, automated policy for its application. -
Gap Between Manual and Automated Implementations: The evaluation reveals a significant performance discrepancy between the manually implemented Ghost Threads (1.33x geomean speedup) and the compiler-extracted version (1.11x geomean speedup), as shown in Figure 6 (page 9). This suggests that the p-slice extraction and synchronization code placement are non-trivial problems for a compiler. While the novel synchronization primitive is simple, its effective integration into an automated framework appears to be an open challenge. This weakens the claim of practical, widespread adoption, as manual intervention is still required for optimal performance.
Questions to Address In Rebuttal
-
On Portability: Could the authors comment on the feasibility of this technique on non-Intel architectures? Are there instructions or mechanisms on modern AMD or Arm processors that could serve a similar function to serialize (i.e., a low-overhead, user-space instruction that stalls the frontend pipeline of one SMT thread without heavily consuming backend resources)? If not, should the contribution be framed more narrowly?
On the Compiler Gap: The performance drop from manual to compiler-extracted threads is substantial (from 1.33x to 1.11x). Can the authors elaborate on the specific compiler analysis or transformation challenges that account for this gap? Is it primarily due to difficulties in precise p-slice extraction, or are there other factors? Is this gap fundamental, or could it be closed with more sophisticated compiler techniques?
-
On Robustness: The synchronization mechanism relies on manually tuned distance thresholds (TOO_FAR, CLOSE, etc., shown in Figure 4d, page 6). How sensitive is the performance of Ghost Threading to these parameters? For instance, if the optimal TOO_FAR distance is 100 iterations, what is the performance impact if the tuned value is 80 or 120? A sensitivity analysis would help clarify whether the technique is robust or requires fragile, workload-specific tuning.
Elevating Temporal Prefetching Through Instruction Correlation
Abstract
Temporal prefetchers can learn from irregular memory accesses and hide access latencies. As the on-chip storage technology for temporal prefetchers’ metadata advances, enabling the development of viable commercial prefetchers, it becomes evident that the ...
Reviews
Review 1
Review Form:
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper proposes Kairos, a hardware temporal prefetcher designed to improve upon existing on-chip temporal prefetchers like Triangel. The central claim is that Kairos can achieve superior performance at a fraction of the hardware cost by focusing on "critical" memory access instructions. The proposed mechanism involves three phases: (1) detecting instructions that contribute disproportionately to cache misses, (2) evaluating the effectiveness of prefetches issued for these instructions, and (3) dynamically partitioning metadata storage using a PID controller. The authors present simulation results showing that Kairos outperforms a baseline IP-stride prefetcher and the state-of-the-art Triangel on a range of SPEC CPU and CloudSuite benchmarks.
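To fix ideas, the detection rule in phase (1) can be paraphrased with the following sketch. It is the reviewer's reconstruction: only the 12.5% threshold comes from the paper; the window length, the counter structure, and the fact that flagged IPs keep contributing to the denominator are simplifying assumptions (the last of these is precisely where Weakness #4 below diverges):

```python
from collections import Counter

# Hedged reconstruction of the critical-instruction detection idea (Section 3.2.1):
# an IP becomes "critical" once it accounts for more than 12.5% of the misses seen
# in an observation window. Window length and data structures are assumptions.
CRITICALITY_THRESHOLD = 0.125
WINDOW = 4096                      # assumed number of L2 misses per observation window

def detect_critical_ips(miss_trace):
    """miss_trace: iterable of instruction addresses (IPs) that missed in the cache."""
    critical, window_counts, total = set(), Counter(), 0
    for ip in miss_trace:
        window_counts[ip] += 1
        total += 1
        if window_counts[ip] / total > CRITICALITY_THRESHOLD:
            critical.add(ip)       # this IP now earns temporal metadata
        if total == WINDOW:        # start a fresh observation window
            window_counts.clear()
            total = 0
    return critical

# A skewed trace: one IP causes half the misses and is flagged; the long tail is not.
trace = [0x400a10] * 2048 + list(range(2048))
print(sorted(hex(ip) for ip in detect_critical_ips(trace)))
```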
Strengths
- Low Hardware Overhead: The most compelling aspect of the design is its exceptionally low fixed hardware overhead, reported as 251 bytes. If the performance claims hold, this represents a significant improvement in the efficiency of temporal prefetching.
- Focus on Prefetch Utility: The design correctly identifies that not all temporal metadata is equally useful. The "Coverage-Based Classification" (Section 3.3.2, page 5) attempts to directly measure and act upon the utility of generated prefetches, which is a logical design principle.
- Extensive Workload Evaluation: The authors evaluate their proposal against a broad set of single-core and multi-core workloads, including memory-intensive SPEC benchmarks and representative cloud applications, which provides a comprehensive performance picture.
Weaknesses
My primary concerns with this submission relate to the unsubstantiated nature of key design parameters and the potential for exaggerated performance claims. The methodology appears to be built on a series of heuristics whose effectiveness may be brittle and over-fitted to the selected benchmarks.
-
Arbitrary and Unjustified Thresholds: The core mechanisms of Kairos are governed by several "magic numbers" that are presented without justification.
- In the Critical Instruction Detection (Section 3.2.1, page 4), an instruction is deemed critical if its miss contribution exceeds 12.5%.
- In the Coverage-Based Classification (Section 3.3.2, page 5), instructions are classified as 'Positive' if coverage is ≥ 87.5% and 'Negative' if coverage is < 12.5%. There is no theoretical or empirical justification provided for these specific values. Without a sensitivity analysis, it is impossible to know if these are robust parameters or if they have been finely tuned to maximize performance on the evaluated workload suite, potentially leading to poor performance on other applications.
-
Unsupported Claim of "System-Agnostic" Partitioning: In Section 3.4.2 (page 6), the authors claim their PID controller parameters are "system-agnostic" because they address a "fundamental" performance trade-off. This is a highly suspect claim. The optimal balance between metadata storage and cache capacity is inherently dependent on system parameters such as LLC size, associativity, memory latency, and bandwidth. Asserting that a single set of coefficients (α=1.0, β=-0.3, γ=0.1) is universally applicable without providing evidence from simulations on varied system configurations is a significant overstatement.
-
Exaggeration of Claims: The abstract and introduction make bold claims that are not precisely supported by the data presented in the paper.
- The abstract claims a reduction in "storage overhead by two orders of magnitude." The data shows Kairos at 251 B (Table 1, page 7) and the competitor Triangel at 17.63 KiB (Table 3, page 8). This is a factor of ~72x (18053 / 251), which is not two orders of magnitude (100x). This is a significant exaggeration.
- The abstract claims Kairos "outperforms state-of-the-art Triangel by 10.1%." However, the text in Section 4.2.1 (page 8) reports speedups of 1.25x for Kairos and 1.15x for Triangel. The relative improvement is (1.25 / 1.15) - 1 ≈ 8.7%. This discrepancy undermines confidence in the authors' analysis.
-
Potential Flaw in the Detection Mechanism: The critical instruction detection mechanism (Section 3.2.1, page 4) appears biased. Once an instruction is classified as "critical," its subsequent misses are not counted towards the total miss counter. This means the first few instructions to cross the threshold will artificially inflate the relative importance of subsequent instructions, potentially leading to incorrect classifications. The "Bias Mitigation" described in Section 3.2.2 does not fully resolve this issue; it merely ensures frequently-missing non-critical IPs are not evicted from the Detecting Unit, but does not fix the skewed denominator (total miss counter) used for the criticality test itself.
Ambiguity in Comparison Fidelity: The paper compares against a reimplementation of Triangel from a public GitHub repository (Section 4.1, page 8). While common practice, there is no validation presented to assure the reader that this implementation faithfully reproduces the performance of the original work or is configured optimally. Without such validation, the possibility of comparison against a weakened baseline cannot be ruled out.
Questions to Address In Rebuttal
The authors must address the following points to substantiate the claims made in this paper:
- Provide a rigorous justification for the selection of the 12.5% and 87.5% thresholds used in the detection and classification mechanisms. A sensitivity analysis showing how performance varies with these parameters is required to demonstrate their robustness.
- Substantiate the claim that the PID controller parameters are "system-agnostic." Please provide simulation data showing the performance of Kairos with the chosen parameters on systems with different LLC sizes, latencies, and/or memory bandwidths.
- Please correct or justify the quantitative claims in the abstract regarding storage overhead reduction ("two orders of magnitude") and performance improvement over Triangel ("10.1%"). The numbers presented in the body of the paper do not appear to support these figures.
- Explain how the critical instruction detection mechanism avoids the measurement bias described in Weakness #4. How do you ensure that instructions that become critical later in an observation window are not unfairly disadvantaged?
- What steps were taken to ensure your implementation of the SOTA competitor, Triangel, is a high-fidelity and fair baseline for comparison?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents Kairos, a novel hardware prefetcher that targets temporal memory patterns. The work is situated within the ongoing effort to create efficient, fully on-chip temporal prefetchers. Its core contribution is a conceptual shift in how potentially useful metadata is identified and managed. Instead of first sampling memory addresses for repetition (like SOTA prefetcher Triangel), Kairos first identifies "critical" memory access instructions (IPs) that are responsible for a disproportionate number of cache misses. Only for these high-impact IPs does it attempt to learn and store temporal correlation metadata. Furthermore, Kairos introduces a lightweight, coverage-based feedback mechanism to continuously evaluate the utility of the prefetches generated by each IP, dynamically retaining useful metadata and discarding ineffective entries. The authors demonstrate that this instruction-centric approach allows Kairos to outperform the state-of-the-art Triangel by 10.1% while requiring less than 2% of the dedicated hardware storage, making it a highly efficient and practical design.
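Because the coverage-based feedback is central to the argument that follows, a minimal sketch may help. It is the reviewer's reconstruction: only the 87.5% and 12.5% thresholds come from the paper (Section 3.3.2); treating coverage as hits over issued prefetches, evaluated per observation interval, is an assumption:

```python
# Hedged sketch of the coverage-based metadata classification described in the
# reviews. Only the 87.5% / 12.5% thresholds come from the paper; how coverage
# is counted (here: useful prefetches / issued prefetches) is an assumption.
POSITIVE_THRESHOLD = 0.875
NEGATIVE_THRESHOLD = 0.125

def classify_ip(prefetches_issued, prefetches_hit):
    """Return 'positive', 'negative', or 'neutral' for one instruction's metadata."""
    if prefetches_issued == 0:
        return "neutral"
    coverage = prefetches_hit / prefetches_issued
    if coverage >= POSITIVE_THRESHOLD:
        return "positive"   # keep (and protect) this IP's temporal metadata
    if coverage < NEGATIVE_THRESHOLD:
        return "negative"   # drop the metadata; it mostly pollutes the cache
    return "neutral"

print(classify_ip(64, 60), classify_ip(64, 4), classify_ip(64, 30))
# -> positive negative neutral
```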
Strengths
The primary strength of this work lies in its elegant reconceptualization of the metadata filtering problem for temporal prefetching.
-
A Powerful Conceptual Simplification: The distinction between Kairos and its predecessors, particularly Triangel, is fundamental. Where previous work adopted a bottom-up, data-centric view ("do these addresses repeat?"), Kairos adopts a top-down, control-flow-centric one ("does this instruction matter?"). This is beautifully illustrated in Figure 3 (page 3). By focusing on the source of cache misses (the IP) rather than the symptoms (the miss addresses), Kairos creates a much more direct and efficient path to identifying patterns that are worth learning. This insight alone is a significant contribution to the field.
-
Exceptional Hardware Efficiency: The most striking result is the dramatic reduction in storage overhead. At just 251 bytes (Table 1, page 7), Kairos is two orders of magnitude smaller than Triangel (17.63 KiB) and Triage (24.14 KiB). This isn't just an incremental improvement; it fundamentally changes the practicality of deploying sophisticated temporal prefetching. Such a low area cost makes the technique viable for a much broader range of designs, from high-performance cores to more area-constrained mobile or embedded systems. This efficiency is a direct result of the conceptual simplification mentioned above.
-
Robust Feedback Loop: The second principle of Kairos—evaluating metadata based on prefetch coverage rather than just metadata reuse—is a crucial and well-executed idea. Many prefetchers can become polluted by metadata that is frequently accessed but generates useless prefetches. By directly measuring success (i.e., a subsequent access hitting a prefetched line), Kairos ensures that its precious metadata storage is dedicated to patterns that are demonstrably improving performance.
-
Strong Contextualization and Motivation: The authors do an excellent job positioning their work. The introduction (Section 1) clearly articulates the historical evolution from off-chip to on-chip temporal prefetchers and identifies metadata pollution as the key remaining challenge. The analysis in Figure 2 (page 3), showing that a few instructions cause the majority of misses, provides a clear and compelling motivation for their instruction-centric approach.
Weaknesses
The paper's core ideas are strong, but the presentation leaves a few areas where the design choices could be more rigorously justified.
-
Parameter Sensitivity and "Magic Numbers": The design relies on several key thresholds whose derivations are not fully explained. For example, an instruction is deemed "critical" if its miss contribution exceeds 12.5% in an observation window (Section 3.2.1, page 4). Similarly, metadata is classified as "Positive" or "Negative" based on 87.5% and 12.5% coverage rates, respectively (Section 3.3.2, page 5). While these values may be well-tuned for the evaluated workloads, the paper would be strengthened by a sensitivity analysis showing how performance varies with these parameters. This would provide confidence in the robustness of the approach across different applications and microarchitectures.
-
Justification for PID Controller Parameters: The use of a PID controller for dynamic partitioning is a clever application of control theory. However, the claim in Section 3.4.2 (page 6) that the chosen parameters are "system-agnostic" is a very strong one that needs more support. While the high-level rationale for the P, I, and D terms is sound, the direct mapping from performance metrics to control adjustments is complex. It would be beneficial to understand if this "agnostic" nature holds across systems with different cache hierarchies, memory latencies, or core counts. A minimal, generic sketch of such a controller is included after this list for reference.
-
Interaction with Other Prefetchers: The paper rightly positions Kairos within a system that includes a baseline IP-stride prefetcher and evaluates a composite "Kairos+Berti" prefetcher (Figure 16, page 11). This is a realistic context. However, the paper could delve deeper into the nature of the interaction. For instance, do Kairos's temporal prefetches and a spatial prefetcher's requests ever conflict or create resource contention (e.g., in the MSHRs or LLC PQ)? A more detailed discussion of how these different prefetching paradigms coexist, and whether any coordination is needed, would be a valuable addition for system integrators.
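For reference, the kind of controller at issue in the PID-related weakness above can be sketched as follows. The gains, setpoint, and clamping bounds below are placeholders of my own, not the values from Section 3.4.2; the point is simply that each of these constants is a tuning decision whose robustness across memory systems needs evidence.

```python
# Generic PID controller sketch (illustrative only; gains and setpoint are assumed,
# not taken from the paper). The controlled quantity stands in for the cache
# capacity granted to prefetcher metadata.

class PIDController:
    def __init__(self, kp, ki, kd, setpoint):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.setpoint = setpoint
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, measurement):
        error = self.setpoint - measurement
        self.integral += error
        derivative = error - self.prev_error
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Example: steer observed prefetch coverage toward a target by resizing the
# metadata partition. Whether fixed gains behave well across memory systems
# with different latencies is exactly the "system-agnostic" question above.
pid = PIDController(kp=4.0, ki=0.5, kd=1.0, setpoint=0.6)   # target 60% coverage (assumed)
metadata_ways = 4
observed_coverage = 0.35                                     # hypothetical sample
adjustment = pid.update(observed_coverage)
metadata_ways = max(0, min(16, metadata_ways + round(adjustment)))
print(metadata_ways)
```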
Questions to Address In Rebuttal
-
Could the authors elaborate on the methodology for selecting the 12.5% criticality threshold and the coverage-based classification thresholds? How sensitive is Kairos's performance to these specific values, and how might they be tuned for different core configurations?
-
The paper claims the PID controller parameters are system-agnostic. Could the authors provide further evidence or a more detailed argument to support this claim, perhaps by showing performance sensitivity to these parameters or discussing their robustness across different memory system timings?
-
The "Kairos+Berti" result in Figure 16 is intriguing, demonstrating orthogonality. Could the authors provide more insight into the mechanisms of interaction between Kairos and a sophisticated spatial prefetcher? Specifically, can they discuss how a production system might manage the combined bandwidth pressure and potential for cache pollution from both prefetchers operating simultaneously?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents Kairos, a hardware temporal prefetcher designed to improve metadata utilization efficiency. The authors identify that prior temporal prefetchers either store metadata indiscriminately or employ complex and potentially slow sampling mechanisms to filter useful metadata. The core idea of Kairos is to address this through a three-stage pipeline: (1) A lightweight "Detecting Unit" identifies "critical" instructions based on their contribution to the total miss count, avoiding expensive pre-sampling. (2) A "Training Unit" then evaluates the prefetches generated from these critical instructions by tracking their actual prefetch coverage, classifying instructions as positive, negative, or neutral. (3) A "Partition Unit" uses a Proportional-Integral-Derivative (PID) controller to dynamically adjust the LLC partition size dedicated to metadata storage based on observed prefetch utility. The authors claim this approach significantly improves performance over the state-of-the-art (Triangel) while drastically reducing the required hardware storage overhead.
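As a reference point for the discussion that follows, the coverage-based classification of stage (2) can be rendered schematically as below. Only the 87.5% and 12.5% cut-offs come from the paper's description; the function and label names are mine.

```python
# Sketch of coverage-based classification (illustrative; only the 87.5%/12.5%
# thresholds come from the paper's description, the rest is assumed structure).

POSITIVE_THRESHOLD = 0.875
NEGATIVE_THRESHOLD = 0.125

def classify_ip(prefetches_issued, prefetches_hit):
    """Label an instruction by the measured coverage of its prefetches."""
    if prefetches_issued == 0:
        return "neutral"
    coverage = prefetches_hit / prefetches_issued
    if coverage >= POSITIVE_THRESHOLD:
        return "positive"   # metadata retained and prioritized
    if coverage <= NEGATIVE_THRESHOLD:
        return "negative"   # metadata evicted / learning disabled
    return "neutral"

assert classify_ip(64, 60) == "positive"
assert classify_ip(64, 4) == "negative"
assert classify_ip(64, 30) == "neutral"
```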
Strengths
The primary novelty of this work lies not in the invention of a new fundamental mechanism, but in the clever synthesis and application of existing concepts to create a new, highly efficient feedback loop for temporal prefetching.
-
Shift from Reuse to Utility: The most significant conceptual advance is the shift from evaluating metadata based on its reuse rate (the core idea in Triangel's sampling) to evaluating it based on the measured utility (i.e., prefetch coverage) of the prefetches it generates. The "Coverage-Based Classification" described in Section 3.3.2 (page 5) is a direct and elegant mechanism for this. While prefetcher effectiveness has always been a goal, explicitly tying the coverage metric back to the generating instruction to control its metadata's lifecycle is a novel and powerful feedback mechanism.
-
Novel Application of Control Theory: While PID controllers are not new to computer architecture, their application here to dynamically partition a shared cache for prefetcher metadata based on a continuous utility signal is a novel use case. This appears to be a more sophisticated and potentially more stable approach than the discrete set-dueling mechanisms seen in prior work like Triangel.
-
Efficiency of the Criticality Heuristic: The method for identifying critical instructions (Section 3.2.1, page 4)—a simple counter-based approach that triggers when an instruction's misses exceed a percentage of the total—is a departure from the complex, multi-level sampling structures of Triangel. While simple heuristics are not new, proposing one that is demonstrably less complex and more effective than the state-of-the-art is a valid and significant contribution. The novelty here is in achieving superior results with a simpler, more reactive method.
Weaknesses
My critique focuses on the degree to which the constituent parts of Kairos are truly novel and the robustness of the claims made about their composition.
-
Novelty by Composition, Not Invention: The work's novelty is almost entirely derived from the composition of pre-existing ideas. Instruction-based prefetching (Triage, Triangel), feedback mechanisms, and PID controllers are all established concepts. The paper frames its contribution as a new prefetcher, "Kairos," but it could also be viewed as an evolutionary step: taking an instruction-based temporal prefetcher and replacing its filtering and resource management modules with more efficient alternatives. The authors should be more precise in delineating which parts are novel applications versus truly new mechanisms.
-
Overstated "System-Agnostic" Parameters: In Section 3.4.2 (page 6), the authors claim the PID controller parameters are "system-agnostic" and provide design guidelines. This is a very strong claim. Control theory parameters are notoriously sensitive to the dynamics of the system they are controlling (e.g., latency, bandwidth, application behavior). The justification provided is high-level ("inherent performance trade-off") and lacks rigorous proof or extensive sensitivity analysis. Without this, the novelty of the PID controller is weakened, as it may appear to be a solution that is simply well-tuned to the authors' specific simulation environment rather than a generally applicable principle.
-
Simplicity of the Detection Heuristic: The 12.5% miss contribution threshold for identifying a "critical" instruction seems arbitrary. While its simplicity is a strength, it also raises questions about its robustness. In programs with rapidly changing phases or diffuse miss sources, such a simple threshold could lead to instability—either failing to identify critical instructions or incorrectly flagging transient ones. The novelty of this simple mechanism is contingent on it being robust, which is not fully explored.
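To make the phase-transition concern concrete, the toy trace below assumes a global miss counter that is never reset and shows how stale history can delay a later-phase instruction from crossing a 12.5% share. This illustrates the worry, not the paper's actual windowing or reset policy.

```python
# Toy illustration of the phase-transition concern (assumed mechanism: a global
# miss counter that is never reset; the paper's actual policy may differ).

THRESHOLD = 0.125

def is_critical(ip_misses, total_misses):
    return total_misses > 0 and ip_misses / total_misses > THRESHOLD

# Phase 1: one IP dominates and accumulates 10,000 misses.
total = 10_000

# Phase 2: a different IP becomes the real problem, but must out-miss the stale history.
ip_b = 0
while not is_critical(ip_b, total):
    ip_b += 1
    total += 1

print(ip_b)  # ~1,429 misses are needed before the new IP is flagged
```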
Questions to Address In Rebuttal
-
Could the authors clarify the core novelty of their contribution? Is it the specific synthesis of these three known techniques (filtering, coverage-feedback, PID control), or do they claim novelty in any of the individual mechanisms themselves?
-
Please provide a stronger defense for the claim that the PID parameters are "system-agnostic." A sensitivity analysis showing the performance impact of varying the PID coefficients (α, β, γ) and thresholds (θ, τ) would be necessary to substantiate this claim. How does performance degrade as these parameters deviate from the chosen values?
-
The 12.5% threshold for the Detecting Unit is a critical parameter. How was this value determined? Please provide data showing performance sensitivity to this threshold. How does Kairos's detection mechanism behave during program phase transitions where the set of high-miss-rate instructions may change rapidly? Is it possible for the
total_miss_counter to be dominated by a few early-phase instructions, thus preventing later-phase instructions from ever being marked as critical (as per the mechanism in Section 3.2.2)?
Quartz: A Reconfigurable, Distributed-Memory Accelerator for Sparse Applications
Abstract
Iterative sparse matrix computations lie at the heart of many scientific computing and graph analytics algorithms. On conventional systems, their irregular memory accesses and low arithmetic intensity create challenging memory bandwidth bottlenecks. To ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present Quartz, a reconfigurable, distributed-SRAM accelerator architecture for iterative sparse applications. The central thesis is that existing architectures are either too specialized (unprogrammable) or too general (inefficient). Quartz proposes to bridge this gap using a programming model based on Einsum notation to generate short, dataflow tasks that are executed on reconfigurable processing elements (PEs). A significant part of the contribution is a novel set of data partitioning techniques, including a two-phase approach, designed to minimize communication and balance load for computations involving both static and dynamic sparsity patterns. The paper claims a geometric mean speedup of 21.4x over a state-of-the-art baseline, "Dalorex++," and 93x over an H100 GPU.
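For orientation, the general flavor of the claimed programming model, lowering an Einsum such as y[i] = sum_j A[i,j] * x[j] into per-tile tasks over a row partition, can be sketched as follows. This is my own illustration of the concept; it is not the paper's extended notation or toolchain.

```python
# Illustrative lowering of an Einsum-style SpMV onto tile-local tasks
# (my own sketch of the general idea; not the paper's extended notation).

# Sparse matrix A in coordinate form, and a dense vector x.
A = [(0, 1, 2.0), (1, 0, 1.0), (2, 3, 4.0), (3, 2, 0.5)]   # (row, col, value)
x = [1.0, 2.0, 3.0, 4.0]
NUM_TILES = 2

def owner_tile(row):
    # Assumed partitioning rank: rows are distributed across tiles.
    return row % NUM_TILES

# "Compile": split the global Einsum y[i] = sum_j A[i,j] * x[j] into
# short per-tile multiply-accumulate tasks.
tasks = {t: [] for t in range(NUM_TILES)}
for row, col, val in A:
    tasks[owner_tile(row)].append((row, col, val))

# "Execute": each tile runs its tasks against locally gathered x entries.
y = [0.0] * len(x)
for tile, tile_tasks in tasks.items():
    for row, col, val in tile_tasks:
        y[row] += val * x[col]    # in hardware, x[col] may arrive via a network message

print(y)  # [4.0, 1.0, 16.0, 1.5]
```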
Strengths
-
Problem Formulation: The paper correctly identifies a critical and persistent challenge in high-performance computing: the memory bandwidth bottleneck for irregular, sparse computations. The motivation is clear and well-established.
-
Systematic Programming Model: The conceptual framework of translating high-level Einsum cascades into partitioned, tile-local tasks (Section 3) is methodical. It provides a structured way to think about decomposing these complex computations for a spatial architecture.
-
Novelty in Partitioning: The work's focus on partitioning for workloads with mixed static-dynamic operand sparsity (Section 5.3) addresses a known limitation of prior techniques. The proposed two-phase clustering and hypergraph partitioning strategy is a non-trivial contribution to the problem space.
Weaknesses
My primary concerns with this manuscript center on the validity of its core performance claims, which appear to rest on a questionable evaluation methodology and an overstatement of the system's practical usability.
-
Unconvincing Baseline Comparison: The central claim of a 21.4x speedup is made against "Dalorex++." However, as described in Section 6.1.4, this is not the original Dalorex system but rather a model constructed by the authors: "We model Dalorex by replacing the Quartz specialized PEs with a model of an in-order RISC core..." within their own simulation framework. This is a critical methodological flaw. The performance of this re-implemented baseline is not validated against the original work's published results or open-source simulator. It is entirely possible that this custom baseline model is unoptimized or fails to capture key performance characteristics of the actual Dalorex architecture, thus artificially inflating the speedup of Quartz. Claims of superiority must be demonstrated against faithful, validated representations of prior art, not potentially biased re-implementations.
-
The "Programmability" Claim is Premature: The paper puts forth a "flexible programming model" as a key contribution. However, the authors explicitly state that the compilation from Einsums to executable tasks is not automated: "...we leave it to future work to encode them as compiler passes" (Section 3, page 4). A programming model without a corresponding compiler or automated toolchain is a conceptual framework, not a programmable system. The claim of high programmability is therefore unsubstantiated, as the actual effort required to map a new application to Quartz remains undefined and potentially prohibitive.
-
Prohibitive and Narrowly Justified Partitioning Cost: The paper reports a geometric mean partitioning time of 18.5 minutes (Section 6.4). The authors justify this extreme offline cost by arguing it can be amortized over millions of runs in domains like navigation or physics simulations. This argument severely limits the applicable domain of Quartz. Any application where the input graph or matrix structure changes with non-trivial frequency would be impractical, as it would require re-running this costly partitioning step. The work fails to adequately characterize the large class of problems for which this static-structure assumption does not hold.
-
Insufficient Architectural Validation: The performance results are derived from a custom cycle-accurate simulator (Section 6.1.1). The paper states the simulator's functional correctness was validated against CPU implementations, but provides no information on how the performance model itself was validated. Without validation against RTL models or hardware prototypes, the cycle-level accuracy of the PE design, network contention model, and memory subsystem remains an open question. Similarly, the area and power estimates (Section 6.5) are derived from standard tools (CACTI, etc.) using a predictive PDK (ASAP7). While a common practice, these are high-level estimates, making the claims of 21x higher performance-per-area over a production chip like the H100 GPU appear highly optimistic and speculative.
Questions to Address In Rebuttal
The authors should address the following points directly:
-
Regarding the Dalorex++ baseline: Can you provide evidence that your model of the Dalorex PE and system is faithful to the original architecture? Specifically, what was the rationale for not using the publicly available Dalorex simulator, and can you provide a comparison of your model's performance against theirs on a common benchmark to establish its validity?
-
Regarding the compiler: Please clarify the current status of the compilation toolchain. How much of the process described in Section 3 is currently automated? What is the extent of manual intervention required to take a new application, express it as an Einsum cascade, and generate the PE configurations and task mappings for Quartz?
-
Regarding partitioning applicability: Can you more precisely define the application domains where the assumption of a static sparsity pattern over millions of runs holds? Conversely, please acknowledge and discuss the application domains for which the 18.5-minute partitioning overhead would make Quartz an impractical solution.
-
Regarding dynamic matrix sparsity: The paper focuses on handling dynamic sparsity in vectors (e.g., the BFS frontier). How would the proposed architecture and partitioning scheme handle problems where the sparsity structure of the primary matrix operand changes over time, as is common in applications like adaptive mesh refinement or dynamic graph analysis? Would this necessitate a full, multi-minute re-partitioning?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents Quartz, a complete co-designed system for accelerating iterative sparse matrix computations. The core contribution is the synthesis of a reconfigurable, task-based dataflow architecture with a large, distributed on-chip SRAM memory system. This architectural vision is supported by two critical enabling components: a systematic programming model that translates high-level computations expressed in Einsum notation into hardware-friendly tasks, and a novel set of data partitioning techniques designed to simultaneously minimize communication and balance load, even for challenging applications with mixed static and dynamic sparsity. The authors position Quartz as a solution to the well-known tradeoff in this space, where prior architectures have sacrificed either performance for programmability (using general-purpose cores) or programmability for performance (using fixed-function hardware). The work claims significant speedups over both a state-of-the-art academic accelerator (Dalorex) and a commercial GPU (NVIDIA H100).
Strengths
The true strength of this paper lies in its ambitious and coherent system-level vision. The authors are not merely proposing a new hardware unit; they are presenting a cohesive solution that spans the stack from a high-level programming abstraction down to the microarchitecture and the physical data layout.
-
Elegant Synthesis of Architectural Concepts: Quartz effectively bridges several distinct domains of computer architecture research. It combines the high aggregate bandwidth of distributed-SRAM systems (seen in systems like Cerebras, Groq, and Dalorex) with the computational efficiency of Coarse-Grained Reconfigurable Arrays (CGRAs) and the dynamic execution model of dataflow machines. As the authors’ taxonomy in Table 1 (page 3) correctly identifies, this combination occupies a novel and promising point in the design space. By placing reconfigurable compute within a distributed memory fabric, the paper re-frames the core challenges away from managing cache hierarchies and toward managing data distribution and network traffic, which it then proceeds to solve.
-
A Principled Approach to Programmability: The use of an extended Einsum notation as the programming entry point (Section 3, page 4) is a standout feature. This provides a formal, mathematical foundation for describing a wide range of sparse computations. The subsequent "Einsum Cascade" methodology for systematically partitioning the computation across tiles and lowering it to concrete hardware tasks is a powerful idea. This offers a credible path toward a compiler that can target this complex hardware, addressing the programmability challenge that has plagued many specialized accelerators.
-
Sophisticated Solution to a Classic Hard Problem: Data partitioning is the Achilles' heel of many distributed systems. The paper correctly identifies the dual objectives of load balancing and communication minimization. The proposed two-phase partitioning strategy for non-all-active algorithms (Section 5.3, page 9 and Figure 12) is particularly insightful. The idea of first using graph partitioning on the input graph to create behaviorally-related clusters, and then using hypergraph partitioning to distribute each cluster across the tiles, is a novel and well-motivated heuristic for handling the unpredictable nature of dynamic sparsity. It strikes a pragmatic balance between locality and parallelism.
-
Convincing and Well-Structured Evaluation: The evaluation is thorough and compelling. The head-to-head comparison with Dalorex++ provides a clear measure of the benefits of reconfigurable compute over general-purpose cores in this context. The ablation studies (Figure 1, page 2, and Figure 15, page 12) are excellent, clearly attributing performance gains to the specific hardware and partitioning contributions. The paper is also forthright about the high offline cost of its partitioning scheme (Section 6.4, page 12) and provides a reasonable justification for why this is acceptable for the target application domain.
Weaknesses
The weaknesses of the paper are largely related to the scope and boundaries of its ambitious claims, rather than fundamental flaws in the core ideas.
-
The "Programmability" Boundary: The Einsum-based programming model is powerful for computations that fit within the tensor algebra paradigm. However, the true test of general programmability is how the system handles algorithms with more complex, data-dependent control flow that is not easily captured by map/reduce-style operations. While BFS is a good start, it would be valuable to understand the conceptual limits of this model. What kinds of sparse algorithms would be difficult or impossible to express and execute efficiently in Quartz?
-
Compiler and Task Generation Complexity: The paper presents the Einsum-to-task translation as a systematic process, which is a great first step. However, it also notes that the compiler passes are future work. This glosses over what is likely a significant engineering and research challenge. For instance, how does the system handle an application that requires a new task type that maps poorly to the existing functional units in the reconfigurable PE? Does this require a hardware redesign, or is the fabric flexible enough to accommodate a wide diversity of task primitives? The scalability of this compilation process as application complexity grows remains an open question.
Questions to Address In Rebuttal
-
Could the authors elaborate on the limitations of the extended Einsum programming model? Can you provide an example of a sparse algorithm that would be a poor fit for the Quartz programming model and architecture, and explain why? This would help to better define the boundaries of the proposed solution.
-
Regarding the programming model, what happens when a new application is introduced that requires a new set of task types? Is the PE fabric designed to be general enough to construct arbitrary task dataflows, or is it optimized for a certain family of primitives (e.g., scale-and-accumulate)? How would the system handle a task that requires, for example, complex bitwise operations or transcendental functions?
-
The paper makes a strong case for amortizing the high partitioning cost (Section 6.4, page 12). However, in a scenario where the sparsity pattern changes frequently (e.g., dynamic graphs) or where interactive, low-latency execution is needed for a single run, this cost could be a major barrier. Have the authors considered faster, lower-quality partitioning heuristics as a potential fallback, and how might that impact the performance tradeoff?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper proposes Quartz, a distributed-memory, reconfigurable accelerator for iterative sparse applications. The authors identify that existing distributed-SRAM systems are limited by either the poor performance of general-purpose cores (e.g., Dalorex) or the poor programmability of fixed-function units (e.g., Azul).
To address this, the authors make three core claims of novelty:
1. A Novel Architecture: The synthesis of a distributed all-SRAM memory system with task-driven, message-triggered reconfigurable processing elements (PEs).
2. A Novel Programming Model: A systematic method to translate high-level computations expressed as Einsum cascades into short, hardware-friendly tasks, where the Einsum notation is extended to explicitly represent data partitioning and inter-tile communication.
3. A Novel Partitioning Heuristic: A two-phase partitioning technique that claims to be the first to simultaneously optimize for communication minimization and load balance for workloads exhibiting both static (matrix) and dynamic (vector) sparsity.
The paper demonstrates significant performance gains over a state-of-the-art distributed-SRAM system (Dalorex++) and a high-end GPU (NVIDIA H100).
Strengths
The primary novel contribution of this work, and its most significant strength, lies in the partitioning methodology for mixed static-dynamic sparsity workloads (Section 5.3, page 9). To my knowledge, the authors' claim of being the first to explicitly tackle this specific problem by combining graph and hypergraph partitioning is accurate. Prior work has focused on the simpler case of all-static sparsity (as in Azul [30]) or has prioritized one objective over the other (e.g., communication minimization via coordinate-space tiling or load balancing via round-robin). The proposed two-phase heuristic—first clustering via graph partitioning to capture temporal locality, then distributing each cluster via hypergraph partitioning—is an elegant and clever approach to a notoriously difficult problem. The results in Figure 15 (page 12) compellingly demonstrate that this novel technique is essential for achieving good performance on non-all-active algorithms like BFS.
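A schematic of the two-phase idea, under the assumption that off-the-shelf graph and hypergraph partitioners are available, is given below. The helpers graph_partition() and hypergraph_partition() are hypothetical stand-ins (no specific tool or API is implied), and the real pipeline in the paper is considerably more involved.

```python
# Schematic of the two-phase partitioning idea (illustration only). The helper
# functions below are hypothetical placeholders for real partitioners.

def graph_partition(adjacency, num_clusters):
    # Placeholder: a real implementation would minimize edge cut so that
    # vertices likely to be active together land in the same cluster.
    return {v: hash(v) % num_clusters for v in adjacency}

def hypergraph_partition(vertices, num_tiles):
    # Placeholder: a real implementation would balance load and minimize
    # communication when spreading one cluster's rows across all tiles.
    return {v: i % num_tiles for i, v in enumerate(sorted(vertices))}

def two_phase_partition(adjacency, num_clusters, num_tiles):
    cluster_of = graph_partition(adjacency, num_clusters)        # phase 1: locality
    placement = {}
    for c in range(num_clusters):
        members = [v for v, cl in cluster_of.items() if cl == c]
        tile_of = hypergraph_partition(members, num_tiles)        # phase 2: balance
        placement.update({v: (c, tile_of[v]) for v in members})
    return placement   # vertex -> (cluster, tile)

adjacency = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
print(two_phase_partition(adjacency, num_clusters=2, num_tiles=4))
```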
A secondary, but still significant, novel aspect is the specific co-design of the architecture and programming model. While the constituent parts are not new in isolation, their synthesis is. The extension of the Einsum framework (Section 3.2, page 5) to include a partitioning rank and "distribution tensors" (T1, T2) to explicitly model inter-tile communication is a conceptually clean and powerful abstraction. Mapping this abstraction onto an architecture of many simple, fast-reconfiguring PEs that execute short, message-triggered tasks represents a novel design point not fully explored by prior art.
Weaknesses
While the paper presents a compelling system, the novelty of some of its core ideas must be carefully contextualized against a backdrop of extensive prior work.
-
The Architectural Paradigm is an Integration, Not an Invention. The core architectural template is a synthesis of well-established concepts.
- Distributed All-SRAM Architectures: This model is the foundation for numerous prior works, including Dalorex [59], Azul [30], and commercial systems like the Groq TSP [2] and Cerebras WSE [20]. The novelty is not in the memory system itself.
- Reconfigurable Compute for Sparsity: Using coarse-grained reconfigurable arrays (CGRAs) to accelerate sparse computations is also a known technique, explored in systems like PolyGraph [24] and Onyx [45]. The key difference is that those systems were coupled with conventional shared memory hierarchies.
- Task-Based/Message-Driven Execution: The execution model is functionally very similar to classic active messages [28] and more recent triggered-instruction paradigms [61]. The concept of dispatching small compute tasks in response to network messages is not fundamentally new.
The true architectural contribution is therefore the specific integration of these three concepts to create a scalable fabric of CGRAs, each with high-bandwidth local SRAM. This is a strong engineering contribution, but it should not be mistaken for the invention of a fundamentally new architectural class.
-
The "Systematic" Programming Model May Obscure Manual Effort. The paper claims a "systematic, highly automatable process" (Section 3, page 4) for converting Einsums to tasks. However, the critical step of formulating the initial iterative Einsum cascade for a given application is non-trivial. The derivation of the four distinct task types for BFS (page 6), for instance, requires significant insight into the algorithm's structure. The paper defers the actual implementation of this compilation flow, stating, "we leave it to future work to encode them as compiler passes." This weakens the claim of a fully realized, systematic programming model and suggests the current process may rely on considerable developer effort to fit applications into the required Einsum structure. The novelty is in the proposed flow, but its generality and automation remain unproven.
-
High Cost of Novelty. The novel partitioning heuristic comes at a steep price: a gmean of 18.5 minutes of offline preprocessing time (Section 6.4, page 12). While the authors provide a valid justification for workloads where this cost can be amortized, this fundamentally limits the application domain of Quartz. The novelty of the solution must be weighed against this practical constraint. It is a specialized solution for nearly-static problems, not a general-purpose accelerator for ad-hoc sparse computations.
Questions to Address In Rebuttal
-
Architectural Delta: Could the authors more precisely articulate the novel architectural mechanisms in Quartz compared to a hypothetical system combining a triggered-instruction execution model (e.g., Parashar et al. [61]) with a reconfigurable fabric like that in Onyx [45]? Beyond the memory subsystem, what is fundamentally new in the PE design or execution model that is essential for this problem domain?
-
Generality of the Programming Model: The translation from Einsum to tasks is presented as systematic. Could the authors walk through how a more complex algorithm not evaluated in the paper (e.g., betweenness centrality or a non-linear solver) would be expressed in their extended Einsum notation? How confident are the authors that this process can be fully automated without algorithm-specific templates or transformations?
-
Robustness of the Partitioning Heuristic: The two-phase partitioning is a heuristic. Have the authors identified any graph topologies (e.g., expander graphs, or graphs with no clear community structure) where the initial graph clustering phase might provide a poor partition, leading the second hypergraph phase to produce a suboptimal result for load balancing? How sensitive is the final performance to the quality of the initial clustering?
SeaCache: Efficient and Adaptive Caching for Sparse Accelerators
Abstract
Sparse tensor computations are highly memory-bound, making on-chip data reuse in SRAM buffers critical to the performance of domain-specific sparse accelerators. On-demand caches are commonly used in recent sparse accelerators, due to the advantage of ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
This paper, "SeaCache," proposes a suite of techniques to improve on-chip caching for sparse tensor accelerators. The authors identify two primary weaknesses in prior work: inefficient mapping of variable-length sparse fibers to fixed-size cache blocks, and the high implementation cost of theoretically optimal guided replacement policies like gLRU. To address this, they propose (1) a "fiber packing and splitting" scheme to improve space utilization, (2) a "guided LFU" (gLFU) policy as a cheaper alternative to gLRU, augmented with virtual tags, and (3) a two-phase adaptive mechanism to partition cache space between data and guide metadata. The authors evaluate their design using a custom simulator and claim significant average speedups over several state-of-the-art cache designs.
While the problems identified are valid, the proposed solutions rest on a series of strong assumptions, ad-hoc design choices, and insufficiently justified claims. The experimental methodology also raises concerns about the fairness of the baseline comparisons.
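For readers unfamiliar with the mechanisms under review, a minimal software model of a guided-LFU set with tag-only ("virtual") entries is sketched below. The structure, sizes, and update rules are assumptions for illustration, not the paper's microarchitecture.

```python
# Minimal software model of a guided-LFU set with "virtual" (tag-only) entries
# (illustration of the concept only; field names, sizes, and policies are assumed).

class GLFUSet:
    def __init__(self, num_ways=8, num_virtual=4):
        self.ways = {}        # fiber_id -> guide/access counter (data resident)
        self.virtual = {}     # fiber_id -> counter, tracked without data storage
        self.num_ways = num_ways
        self.num_virtual = num_virtual

    def bump(self, fiber_id):
        """Counter update driven by prefetched guide metadata or demand accesses."""
        if fiber_id in self.ways:
            self.ways[fiber_id] += 1
        else:
            self.virtual[fiber_id] = self.virtual.get(fiber_id, 0) + 1
            if len(self.virtual) > self.num_virtual:          # keep only the hottest
                coldest = min(self.virtual, key=self.virtual.get)
                del self.virtual[coldest]

    def insert(self, fiber_id):
        """On a fill, evict the least-frequently-guided resident fiber if needed."""
        start = self.virtual.pop(fiber_id, 0)                 # inherit the virtual count
        if len(self.ways) >= self.num_ways:
            victim = min(self.ways, key=self.ways.get)
            del self.ways[victim]
        self.ways[fiber_id] = start + 1

s = GLFUSet()
for f in [3, 3, 7, 3, 9]:
    s.bump(f)          # guide metadata indicates these fibers will be reused
s.insert(3)            # fiber 3 arrives with its accumulated frequency
print(s.ways)          # {3: 4}
```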
Strengths
- The paper correctly identifies critical and well-known challenges in designing caches for sparse accelerators, namely the difficulty of handling highly variable fiber lengths and the practical overhead of implementing oracle-guided replacement policies.
- The work is ambitious in scope, attempting to provide a holistic solution that touches on data mapping, replacement policy, and dynamic resource partitioning.
- The authors recognize the need to evaluate against a scratchpad-style baseline (Tailors), which is a relevant comparison point in the accelerator design space.
Weaknesses
The paper's core contributions are undermined by several significant methodological and technical weaknesses:
-
The Fiber Packing/Splitting Scheme Introduces Unjustified Constraints and Brittle Failure Modes.
- The packing mechanism requires that fibers have contiguous IDs to be packed together (Section 4.1, page 5). This is a strong, application-dependent assumption that is not justified. What is the frequency of this pattern in real workloads, and how much performance is lost when it does not occur? The mechanism appears tailored to a best-case scenario.
- The splitting mechanism has a hard limit on the number of segments, derived from the need to uniquely identify segments using only the lowest 16 bits of the original fiber ID (Section 4.1, page 6). The authors claim "Almost all sparse matrices in SuiteSparse are within this limit." This is not rigorous. The solution for fibers exceeding this limit—"reduce the tile size, or simply discard the remaining elements"—is a brittle failure mode that can cause a catastrophic performance drop. The performance impact of this edge case is not evaluated.
-
The Superiority and Practicality of Guided LFU (gLFU) is Not Convincingly Argued.
- The paper claims gLFU is a "practical" and "near-optimal" policy. However, its implementation still requires an additional port on the cache tag array to handle counter updates from prefetched metadata (Section 4.2, page 7). This is a non-trivial hardware cost in terms of area, power, and potentially timing closure, which undermines the claim of it being a "much cheaper" alternative to gLRU.
- The introduction of "virtual tags" feels like an ad-hoc patch. As Figure 5 (page 7) clearly shows, the integrated counter design ("gLFU w/ 0 vtag") performs very poorly compared to an idealized gLFU. The virtual tags are a workaround to recover lost performance, but this adds complexity and overhead. The choice of 4 virtual tags is empirical and not derived from first principles. This suggests the core integrated design is flawed.
- The claim that idealized gLFU outperforms idealized gLRU (Figure 5) because "gLRU cannot achieve Belady's Optimal with a limited prefetch size" (page 7) is a hand-wavy argument. A more rigorous analysis is required to explain why frequency information (LFU) would be superior to recency information (LRU) in the context of limited lookahead. It is entirely possible this is an artifact of the specific benchmarks and prefetch window size chosen.
-
The Adaptive Prefetch Sizing Mechanism is Overly Complex and Its Effectiveness is Unproven.
- The mechanism relies on simulated annealing (Algorithm 1, page 8), a complex heuristic with several magic numbers (e.g., the initial temperature, cooling schedule, 20% miss/discard threshold). The sensitivity of the final performance to these parameters is not explored, making the design's robustness questionable. For concreteness, a schematic version of such a loop is sketched after this list.
- The authors claim the mechanism can "reach a size within 95% performance of the optimal selection" (page 8), but provide no evidence to substantiate this. How was the "optimal selection" determined for comparison? An exhaustive search is implied but not described. Without this data, the 95% figure is an unsubstantiated assertion.
-
Baseline Comparisons Appear Unfair and Potentially Misleading.
- The implementation of SpArch's online gLRU is based on the authors' own "assumed efficient implementation" (page 4), as they note the original paper lacked microarchitectural details. This creates a high risk of a strawman comparison. The performance of SpArch is highly sensitive to how its gLRU metadata cache is managed, and it is unclear if the authors' implementation is representative or fair.
- X-Cache is configured with 16-byte blocks while SeaCache uses 64-byte blocks. This is a fundamental architectural difference. While the paper's goal is to improve large-block performance, this setup makes a direct, apples-to-apples comparison difficult. A sensitivity analysis showing how X-Cache performs with larger blocks (and suffers from underutilization) would be necessary to fairly contextualize SeaCache's contribution. The block size sensitivity study in Figure 14 (page 13) is only performed for SeaCache itself.
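For concreteness, a schematic version of the adaptive prefetch-sizing loop discussed above might look like the following. Only the 20% miss/discard trigger is taken from the paper's description; the temperature schedule, step sizes, and scoring function are placeholder assumptions.

```python
# Schematic simulated-annealing-style prefetch-size search (illustrative only;
# the 20% trigger is quoted from the paper's description, everything else is assumed).

import random

def tune_prefetch_size(measure, initial_size=8, temperature=1.0, cooling=0.9,
                       min_size=1, max_size=64, steps=50, trigger=0.20):
    size = initial_size
    best_size, best_score = size, measure(size)
    for _ in range(steps):
        miss_rate, discard_rate, score = best_score
        if miss_rate < trigger and discard_rate < trigger:
            break                                   # both signals healthy: stop adapting
        candidate = max(min_size, min(max_size, size + random.choice([-4, -2, 2, 4])))
        cand_score = measure(candidate)
        # Accept better candidates always; worse ones with temperature-scaled probability.
        if cand_score[2] > score or random.random() < temperature:
            size = candidate
            if cand_score[2] > best_score[2]:
                best_size, best_score = candidate, cand_score
        temperature *= cooling
    return best_size

# 'measure' stands in for one monitoring interval, reporting
# (miss_rate, discard_rate, score); this fake prefers sizes near 16.
def fake_measure(size):
    miss = abs(size - 16) / 64
    discard = max(0, size - 16) / 64
    return (miss, discard, 1.0 - miss - discard)

print(tune_prefetch_size(fake_measure, initial_size=48))
```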
Questions to Address In Rebuttal
- Please provide data on how frequently the "contiguous fiber ID" assumption for packing holds true across the evaluated benchmarks. What is the performance impact when this assumption is violated?
- Regarding fiber splitting, what percentage of fibers in the evaluated tiled matrices exceed the 4096-block (~256 KB) length limit? Quantify the performance degradation when a tile size must be reduced or data is discarded due to this limitation.
- Can you provide a quantitative area and power comparison between your gLFU implementation (including the extra tag port and virtual tags) and the assumed gLRU implementation for SpArch? The claim that gLFU is "much cheaper" must be backed by data, not just qualitative argument.
- Please provide the data that substantiates the claim that the adaptive prefetch algorithm converges to within 95% of the optimal performance. This requires showing, for each benchmark, the performance of the chosen prefetch size versus the empirically-found optimal prefetch size.
- How can you justify that your implementation of SpArch's online gLRU is a fair and representative baseline, given that you had to assume the microarchitectural details? Specifically, could a different metadata management scheme for SpArch have significantly improved its performance?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents SeaCache, a holistic on-chip cache management system designed specifically for sparse tensor accelerators. The authors identify two primary, well-known challenges in this domain: 1) the inefficient mapping of variable-length sparse data structures (fibers) onto fixed-size cache blocks, leading to poor space utilization, and 2) the prohibitive implementation cost of theoretically optimal, guided replacement policies (like gLRU) that leverage the structure of one operand to predict access patterns for another.
To address this, SeaCache proposes a synergistic set of three techniques:
1. Fiber Packing and Splitting: A mapping scheme that allows multiple short fibers to be packed into a single cache block or a single long fiber to be split across multiple blocks, thus maximizing space utilization.
2. Guided LFU (gLFU) with Virtual Tags: A practical replacement policy that substitutes the complex, pointer-heavy implementation of guided LRU with a simpler, counter-based guided LFU. This is augmented with "virtual tags" to track important, non-resident blocks, improving accuracy without the full cost of an idealized implementation.
3. Two-Phase Adaptive Partitioning: A mechanism that dynamically partitions the shared cache space between the actual tensor data and the prefetched metadata required by the guided replacement policy, starting with an offline estimate and tuning it online based on runtime metrics.
The authors integrate these techniques into a sparse accelerator design and demonstrate significant performance improvements (2.8x on average) over several state-of-the-art sparse cache designs.
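To illustrate the first technique at a conceptual level, the sketch below packs short, contiguously numbered fibers into fixed-size blocks and splits long fibers across several blocks. The block size, element size, and contiguity rule are simplifications based on the description above, not the design's exact metadata layout.

```python
# Conceptual sketch of fiber packing/splitting onto fixed-size cache blocks
# (simplified illustration; the real design's metadata encoding differs).

BLOCK_BYTES = 64
ELEM_BYTES = 8          # assumed size of one (coordinate, value) element
ELEMS_PER_BLOCK = BLOCK_BYTES // ELEM_BYTES

def map_fibers_to_blocks(fiber_lengths):
    """fiber_lengths: {fiber_id: number of nonzero elements}. Returns a block list."""
    blocks = []
    pending = []                      # short fibers waiting to share a block
    pending_elems = 0
    for fid in sorted(fiber_lengths):
        n = fiber_lengths[fid]
        if n >= ELEMS_PER_BLOCK:      # long fiber: split across several blocks
            for seg in range((n + ELEMS_PER_BLOCK - 1) // ELEMS_PER_BLOCK):
                remaining = n - seg * ELEMS_PER_BLOCK
                blocks.append({"fibers": [(fid, seg)],
                               "elems": min(ELEMS_PER_BLOCK, remaining)})
            continue
        # Short fiber: pack with neighbors while the block has room and the
        # fiber IDs stay contiguous (the condition the Guardian review questions).
        if pending and (pending_elems + n > ELEMS_PER_BLOCK or fid != pending[-1][0] + 1):
            blocks.append({"fibers": pending, "elems": pending_elems})
            pending, pending_elems = [], 0
        pending.append((fid, 0))
        pending_elems += n
    if pending:
        blocks.append({"fibers": pending, "elems": pending_elems})
    return blocks

print(map_fibers_to_blocks({0: 3, 1: 2, 2: 6, 3: 20}))
```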
Strengths
The primary strength of this work lies in its synthesis of multiple ideas into a cohesive and practical solution for a difficult, real-world problem. It moves beyond proposing a single, isolated optimization and instead presents a complete, well-engineered system.
-
Contextual Problem-Solving: The authors correctly identify and tackle the key pragmatic challenges that prevent the straightforward application of conventional caching to sparse accelerators. The work is well-grounded in the limitations of prior art, such as the mapping issues in X-Cache [30] and the high overhead of guided LRU in designs like InnerSP [3] and SpArch [38].
-
Adaptation of Broad Concepts to a Specific Domain: This paper is an excellent example of applying established computer science principles to a specialized domain with great effect.
- The "fiber packing and splitting" mechanism (Section 4.1, page 5) can be seen as a domain-specific adaptation of cache compression techniques. By leveraging the known structure and read-only nature of the sparse data, it achieves the goal of increasing effective capacity without the overhead of general-purpose compression algorithms.
- The move from guided LRU to guided LFU (Section 4.2, page 6) represents a classic and valuable architectural trade-off. It knowingly sacrifices theoretical optimality for a massive reduction in implementation complexity (counters vs. linked lists), a trade-off that is often justified in hardware design. The use of virtual tags is a clever microarchitectural trick to claw back some of the lost predictive accuracy.
- The adaptive partitioning of the cache between data and metadata (Section 4.3, page 7) is effectively a control system for a shared resource. The two-phase approach, combining offline analysis with online, metric-driven tuning, is a robust and well-principled methodology.
-
Strong Empirical Results: The evaluation is thorough, comparing against multiple relevant baselines (including a scratchpad design) across a wide variety of real-world sparse matrices. The significant speedups reported are compelling and demonstrate that the proposed combination of techniques is highly effective. The breakdown of contributions in Figure 10 (page 12) is particularly insightful, showing how each component builds upon the last.
Weaknesses
The paper is strong, and my points here are less about fundamental flaws and more about opportunities for deeper contextualization and discussion.
-
Accumulated Complexity: While each individual technique is well-justified, their combination results in a cache controller of considerable complexity. The system requires fuzzy tag matching for packed fibers, adjusted ID calculation for split fibers, an extra tag port for counter updates, virtual tags, and the online monitoring and control logic for adaptive partitioning. The paper presents an area breakdown (Table 2, page 10), which is good, but a more direct discussion of the complexity trade-off would be welcome. Is there a point of diminishing returns where a simpler subset of these techniques might be sufficient?
-
Sensitivity of the Adaptive Mechanism: The online adaptive partitioning scheme uses a simulated annealing-like approach with several heuristics and hyperparameters (e.g., the 20% miss/discard rate threshold, the specific temperature schedule). While the results are good, the robustness of this control mechanism could be explored further. How sensitive is the final performance to these choices? Is it possible for the system to get trapped in a poor local optimum for matrices with unusual phase-change behavior?
-
Positioning Against Scratchpads: The authors make a convincing case for on-demand caching over scratchpads in Section 2.3 (page 3) for the general sparse case. Their performance win over the Tailors [35] scratchpad design validates this. However, it would strengthen the paper to also discuss the inverse: are there specific scenarios (e.g., static, well-known sparsity patterns where compiler analysis is more tractable) where a sophisticated scratchpad management scheme might still outperform SeaCache? Acknowledging the boundaries of the proposed solution's superiority would provide a more complete picture.
Questions to Address In Rebuttal
-
Regarding the complexity mentioned above, can the authors provide a more direct comparison of the area and power overhead of the SeaCache controller logic (the "Internal cache modifications" and other custom logic in Table 2) relative to the simpler controllers in the baseline designs like InnerSP or X-Cache? This would help quantify the cost of the performance gain.
-
Could the authors comment on the sensitivity of the two-phase adaptive prefetcher (Algorithm 1, page 8)? How much tuning was required to find the 20% threshold and the annealing schedule? Have you tested its stability on datasets with dynamic or rapidly changing access patterns?
-
The core argument of the paper is the superiority of a highly-optimized cache. Could you elaborate on the fundamental information gap that prevents a compiler-managed scratchpad from achieving similar performance? Is it primarily the unpredictability of operand intersections, or are there other factors where on-demand hardware fetching provides an insurmountable advantage?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper, "SeaCache," proposes a cache architecture for sparse accelerators, aiming to improve upon existing designs by addressing two key issues: inefficient mapping of variable-length sparse fibers to fixed-size cache blocks, and the high implementation cost of theoretically optimal replacement policies. The authors introduce three primary techniques: (1) a "fiber packing and splitting" scheme to improve space utilization, (2) a "guided LFU" (gLFU) replacement policy as a cheaper alternative to guided LRU (gLRU), and (3) a two-phase adaptive mechanism for partitioning the cache space between tensor data and the replacement policy's guide metadata.
While the paper presents a well-engineered system with impressive performance gains, a critical analysis of the prior art reveals that the novelty lies less in foundational new concepts and more in the clever synthesis and specific implementation of existing principles for the sparse accelerator domain. The core ideas of packing variable data, using predictive information for replacement, and dynamically partitioning resources are all well-established. The contribution is in the specific mechanisms developed to realize these ideas in a cohesive and effective system.
Strengths
The primary strength of this work is the novel synthesis of several known techniques into a practical system that solves a clear problem. While the individual components may have conceptual precedents, their combination and specific hardware implementation are non-trivial.
-
The "Virtual Tags" Mechanism: The introduction of "virtual tags" (Section 4.2, page 7) to support the gLFU policy is a genuinely clever and non-obvious hardware technique. The problem—that counters for important but currently uncached fibers would be lost—is a direct consequence of moving from an idealized model to a practical, integrated one. Using tag-only entries to store this predictive metadata is an elegant solution and appears to be a novel implementation detail for this context.
-
Adaptive Metadata Sizing Mechanism: The two-phase adaptive mechanism for partitioning the cache between B-tensor data and A-tensor metadata (Section 4.3, page 8) is a sophisticated control system. While dynamic resource partitioning is a known concept, the specific control loop—using "discard rate" on the prefetch side and "miss rate" on the access side as inputs to a simulated annealing-based adjustment policy—is a novel application for this particular architectural trade-off.
Weaknesses
My main concern revolves around the fundamental novelty of the core claims when deconstructed and compared to prior art across different domains. The paper frames its contributions as major new techniques, but they are more accurately described as specific, albeit effective, adaptations of existing ideas.
-
Fiber Packing and Splitting is Conceptually Analogous to Cache Compression: The idea of mapping variable-sized logical units into fixed-size physical cache blocks (Section 4.1, page 5) is the central premise of compressed caches. Prior work, such as [27, 28, 29] cited by the authors, has explored packing multiple compressed blocks into a single cache line to increase effective capacity. SeaCache's scheme is a simplified version of this, where the "compression" is simply the absence of zero values, and the metadata (Cnt and Extra bits) serves the same role as the metadata in a compressed cache: locating the sub-blocks. The novelty is limited to the application of this known principle to the fiber-ID-based indexing scheme, not the concept of packing/splitting itself.
-
Guided LFU is an Engineering Trade-off, Not a New Policy Paradigm: The central idea of "guided replacement" is directly inherited from prior work on gLRU, as acknowledged with citations to P-OPT [4], InnerSP [3], and SpArch [38]. The authors' contribution is to replace the LRU mechanism (recency/distance tracking) with an LFU mechanism (frequency counting). The LFU vs. LRU debate is one of the oldest in cache design. Proposing LFU as a cheaper alternative to LRU in this specific "guided" context is a sound engineering decision, but it does not constitute a fundamentally new replacement policy concept. The novelty is in the practical hardware implementation (e.g., virtual tags), not the policy itself.
-
Marginal Gains vs. Added Complexity: The paper argues that gLFU is cheaper than gLRU. While it avoids wide pointers for linked lists, the proposed solution adds its own complexity: virtual tags, an extra read port on the tag array to handle counter updates, and "fuzzy" tag comparators to handle packed fibers. This feels less like a clear complexity reduction and more like a complexity shift. A more rigorous analysis is needed to demonstrate that this new set of complexities is truly more beneficial than the old one, beyond the empirical performance results. For example, the 2.8x average speedup over X-Cache is a result of both a better replacement policy (LRU vs gLFU) and a much better mapping scheme. It is difficult to isolate the benefit of the novel aspects alone.
Questions to Address In Rebuttal
-
On Fiber Packing/Splitting: Please elaborate on the novelty of this scheme beyond its application to sparse fibers. What is fundamentally different about this mapping compared to prior compressed cache designs that also handle variable-length data units by packing them into fixed-size blocks with offset metadata?
-
On Virtual Tags: Could the authors clarify the novelty of the "virtual tags" mechanism? Are there precedents for tag-only cache entries used to store predictive metadata for uncached items in other domains (e.g., advanced prefetchers that track miss streams, or metadata for off-chip resources)? A clearer positioning against the landscape of predictive hardware structures would strengthen this claim.
-
On Adaptive Partitioning: The use of a simulated annealing-like heuristic to manage the data/metadata partition is interesting. Can the authors confirm if this is a novel application of this algorithm for cache partitioning, or has this heuristic been explored for similar dynamic resource trade-offs in prior architectural work? Please clarify the "delta" from existing dynamic resource management schemes, particularly those in the multi-core or QoS domains.
NetSparse: In-Network Acceleration of Distributed Sparse Kernels
Abstract
Many hardware accelerators have been proposed to accelerate sparse computations. When these accelerators are placed in the nodes of a large cluster, distributed sparse applications become heavily communication-bound. Unfortunately, software solutions to ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present NetSparse, a suite of four hardware mechanisms designed to accelerate communication in distributed sparse computations. The proposed mechanisms include: (1) a Remote Indexed Gather (RIG) operation offloaded to the NIC to reduce host overhead, (2) runtime filtering of redundant requests, (3) concatenation of requests to the same destination to improve goodput, and (4) in-switch caching to serve local rack requests. The evaluation, performed on a simulated 128-node cluster, claims substantial performance improvements (up to 38x over a single node) compared to software-only approaches. However, these claims are predicated on a purely simulated evaluation methodology with highly idealized and arguably unfair baseline comparisons, and the proposed architectural changes introduce significant complexity and scalability concerns that are not fully addressed.
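As a software analogue of the NIC-side request path summarized above, the sketch below takes a batch of gather indices, filters duplicates, and bundles the survivors per destination node. The packet format, the partitioning function, and the concatenation limit are assumptions for illustration only.

```python
# Software analogue of the request path: deduplicate gather indices and bundle
# requests per destination so header cost is amortized. Purely illustrative;
# packet formats and limits are assumed, not taken from the paper.

MAX_INDICES_PER_PACKET = 32   # assumed concatenation limit

def build_request_packets(indices, elems_per_node):
    """indices: global element indices to gather; elems_per_node: partition size."""
    seen = set()
    per_dest = {}
    for idx in indices:
        if idx in seen:                      # runtime filtering of redundant requests
            continue
        seen.add(idx)
        dest = idx // elems_per_node         # owner node under a block partition
        per_dest.setdefault(dest, []).append(idx % elems_per_node)
    packets = []
    for dest, offsets in per_dest.items():   # concatenate requests per destination
        for i in range(0, len(offsets), MAX_INDICES_PER_PACKET):
            packets.append({"dest": dest,
                            "offsets": offsets[i:i + MAX_INDICES_PER_PACKET]})
    return packets

# Example: six requested indices yield four unique requests in two packets.
print(build_request_packets([5, 7, 5, 130, 131, 7], elems_per_node=128))
```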
Strengths
-
Problem Characterization: The paper does a commendable job of identifying and quantifying the inefficiencies of existing software-based communication strategies for sparse kernels (Section 3, page 3). The analysis of Sparsity-Unaware (SU) and Sparsity-Aware (SA) approaches, highlighting issues like redundant transfers, low line utilization, and header overheads, is thorough and provides strong motivation.
-
Comprehensive Proposal: The set of four proposed hardware mechanisms is cohesive and targets different aspects of the communication bottleneck. The design attempts to address the problem at multiple points in the system, from the host-NIC interface down to the network switch fabric.
-
Ablation Study: The inclusion of an ablation study (Section 9.2, Table 8, page 12) is a positive step, as it attempts to isolate the performance contribution of each proposed mechanism. This provides some insight into which components of NetSparse are most impactful under different conditions.
Weaknesses
-
Fundamentally Flawed Evaluation Methodology: The paper's conclusions are built entirely on simulation (SST, Section 8.2), with no real hardware prototype or testbed validation. More critically, the baselines used for comparison are constructed in a way that appears to artificially inflate the benefits of NetSparse.
- The SUOpt baseline is a theoretical optimum (100% line utilization, no overheads), not a realistic system. While useful as a ceiling, comparing against it overstates practical gains.
- The SAOpt baseline is particularly concerning. The authors state they calibrate its software overheads by measuring performance between cores on the same node on the NCSA Delta system (Section 8.1, Figure 10, page 10). This is a methodologically invalid proxy for inter-node communication overheads, which involve network protocol stacks, OS bypass mechanisms (like RDMA), and physical network latencies that are entirely different from on-chip interconnects.
- The resource allocation for the comparison is indefensible. In SAOpt, the authors use all 64 CPU cores for communication tasks. In contrast, NetSparse requires only a single CPU core to manage the RIG units. This is a classic apples-to-oranges comparison that pits a resource-starved version of the proposed system against a resource-saturated version of the baseline, creating a misleadingly large performance gap.
- The
-
Questionable Architectural Practicality and Scalability: The paper hand-waves away significant architectural challenges.
- The concatenation mechanism requires one Concatenation Queue (CQ) per destination for both reads and responses, totaling
2(N-1) queues. The authors acknowledge this scales poorly (Section 7.2, page 9) and propose "virtualizing the CQs" as a solution without providing any design details, implementation costs, or analysis of the performance overhead of such a dynamic management system. This is a critical scaling limitation that is dismissed as future work.
- The concatenation mechanism requires one Concatenation Queue (CQ) per destination for both reads and responses, totaling
-
Insufficient Handling of System-Level Realities:
- Packet Loss: The paper assumes a lossless network (Section 7.1, page 9). The proposed recovery mechanism—a watchdog timer that fails an entire RIG operation if a single response is lost—is exceptionally coarse-grained. For a batch of 32k nonzeros, the loss of one packet would trigger a catastrophic failure and re-transmission of the entire batch, which is highly inefficient. A robust system requires a more fine-grained error-handling protocol.
- Load Imbalance: The authors' own analysis (Section 9.4, Figure 19, page 13) shows that inter-node communication imbalance is a primary performance limiter for several benchmarks. NetSparse accelerates the communication itself but does nothing to address this underlying imbalance, which is a function of the data partitioning. Therefore, NetSparse is a hardware solution to a problem that might be better solved, or at least significantly mitigated, by software partitioning algorithms. The paper fails to explore this trade-off.
Questions to Address In Rebuttal
-
Please provide a rigorous justification for using intra-node software overhead measurements (Figure 10) to model inter-node communication in your
SAOpt baseline. How does this calibration account for the distinct overheads of a real RDMA network stack and physical link traversal?
Please defend the 64-to-1 CPU core allocation disparity between the
SAOptbaseline and the NetSparse evaluation. How would the results change ifSAOptwere run on a single core, or if NetSparse were required to use more host resources for more complex control flow? -
The proposed solution to the
2(N-1)concatenation queue scaling problem is to "virtualize the CQs." Please provide a concrete microarchitectural design for this mechanism. What are the SRAM/CAM overheads, control logic complexity, and performance penalties (e.g., added latency for dynamic allocation) of this virtualized system? -
Regarding packet loss, is failing and re-issuing an entire multi-thousand-request RIG operation a practical or scalable recovery strategy? Have you considered alternative, more fine-grained acknowledgement or re-transmission schemes that could be integrated with the RIG unit?
-
Your analysis in Section 9.4 correctly identifies load imbalance as a key performance bottleneck. Given that advanced graph partitioning algorithms can mitigate this imbalance, could a sophisticated software-only approach with better partitioning outperform a simple partitioning scheme accelerated by NetSparse hardware? Where is the break-even point?
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents NetSparse, a comprehensive, hardware-centric approach to accelerate communication in distributed sparse computations. The authors correctly identify that as single-node sparse accelerators become more powerful, communication rapidly becomes the dominant bottleneck in large clusters. Traditional software-based communication strategies are shown to be highly inefficient, either by transferring vast amounts of redundant data (Sparsity-Unaware) or by suffering from high software overheads and network underutilization (Sparsity-Aware).
The core contribution is a co-designed system of hardware extensions for both SmartNICs and network switches, comprising four key mechanisms:
1. Remote Indexed Gather (RIG) Offload: A new NIC primitive that offloads the generation of fine-grained remote memory requests from the host CPU to specialized hardware units on the NIC.
2. Redundant Request Filtering/Coalescing: In-NIC hardware to eliminate duplicate requests for the same remote data at runtime.
3. Request Concatenation: A low-level protocol implemented in both NICs and switches to bundle multiple small requests for the same destination into a single larger network packet, thus amortizing header overhead.
4. In-Switch Caching: A novel, data-plane-updatable cache in the Top-of-Rack (ToR) switch to serve requests for remote data locally within a rack, exploiting inter-node data reuse.
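For readers unfamiliar with the access pattern being offloaded by the first mechanism, a minimal functional sketch is given below; the names (owner_of, remote_read, batched_read) are hypothetical illustrations, not the paper's API, and result ordering is ignored for simplicity.

```python
# Hypothetical sketch of the "read index list, fetch remote data" pattern that the
# RIG primitive offloads. All names are illustrative, not the paper's API.
from typing import Callable, Sequence


def host_driven_gather(indices: Sequence[int],
                       owner_of: Callable[[int], int],
                       remote_read: Callable[[int, int], bytes]) -> list[bytes]:
    """Sparsity-aware software baseline: one fine-grained remote read per nonzero
    index, each paying per-request software and per-packet header overhead."""
    return [remote_read(owner_of(i), i) for i in indices]


def rig_style_gather(indices: Sequence[int],
                     owner_of: Callable[[int], int],
                     batched_read: Callable[[int, list[int]], list[bytes]]) -> list[bytes]:
    """What the RIG unit does conceptually: group indices by destination node and
    issue them as bulk requests, leaving filtering and concatenation to NIC hardware."""
    by_dest: dict[int, list[int]] = {}
    for i in indices:
        by_dest.setdefault(owner_of(i), []).append(i)
    results: list[bytes] = []
    for dest, idxs in by_dest.items():
        results.extend(batched_read(dest, idxs))
    return results
```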
Through detailed simulation of a 128-node cluster, the authors demonstrate that their approach yields a 38x speedup over a single-node system, vastly outperforming an optimized software baseline (3x speedup) and achieving over half the performance of an ideal system with zero communication overhead.
Strengths
This is a strong systems paper with a clear vision and significant potential impact. Its primary strengths are:
-
Excellent Problem Motivation: The authors do an outstanding job in Section 3 of quantifying the problem with existing software approaches. Using data from a real-world supercomputer, they show that the Sparsity-Unaware approach can have a useful-to-redundant transfer ratio of over 1:1900 (Table 1, page 3) and that the Sparsity-Aware approach can result in network line utilization below 1% (Table 2, page 3). This provides a compelling, data-driven justification for a hardware-level intervention.
-
Holistic, Synergistic Design: The paper's main strength lies in its recognition that this is not a problem that can be solved by a single trick. The four proposed mechanisms are not independent; they are synergistic. The RIG offload increases the rate of request generation, which creates the traffic density needed for the concatenation and filtering mechanisms to be effective. The in-switch cache then acts as a final optimization layer, capturing a different form of locality. This end-to-end, co-designed vision from the NIC to the switch is the paper's most significant contribution.
-
Contextualization within Modern Trends: This work fits perfectly at the confluence of several major trends in high-performance computing and architecture:
- Domain-Specific Architectures (DSAs): It extends the concept of acceleration beyond the processor and into the network fabric itself, arguing for an application-aware network.
- In-Network Computing (INC): It is a prime example of INC, moving computation (filtering, caching logic) closer to the data as it transits the network.
- SmartNICs/DPUs: It provides a "killer app" for the capabilities of modern SmartNICs, showing how their processing power can be harnessed for something beyond storage or security offloads.
-
Thorough and Convincing Evaluation: The evaluation methodology is sound. Using an idealized software baseline (SAOpt) makes their performance gains more credible. The end-to-end results (Figure 13, page 11) are impressive and clearly demonstrate the system's value. Furthermore, the ablation study (Table 8, page 12) effectively teases apart the contribution of each mechanism, and the sensitivity analysis (Section 9.3) explores the design space thoroughly. The inclusion of a hardware overhead analysis (Section 9.5) adds a crucial layer of practicality to the proposal.
Weaknesses
The weaknesses of the paper are minor relative to its strengths and mostly relate to the scope and complexity of the proposed hardware.
-
Significant Switch Architecture Modification: The proposed switch architecture with a second crossbar and a new "middle pipe" layer (Figure 8, page 8) is a non-trivial hardware change. While the authors provide a reasonable justification, the paper could be strengthened by discussing alternative, potentially less invasive, designs. For instance, could similar caching functionality be implemented within a more traditional, single-pipeline switch architecture, perhaps at the cost of some latency or throughput? A discussion of this trade-off would add valuable depth.
-
Limited Scope of Kernels: The work is heavily optimized for and evaluated on sparse-dense or sparse-vector multiplication patterns (SpMM, SpMV, SDDMM). A major challenge in the field is sparse-sparse matrix multiplication (SpGEMM), which introduces more complex, two-sided communication patterns. While the authors note this as future work, a brief discussion of how the NetSparse primitives might (or might not) apply to these more complex kernels would help in understanding the proposal's generality.
-
Static Parameterization: The authors rightly point out in their analysis (Section 9.4) that many of the system's parameters (e.g., RIG batch size, concatenation delay) are chosen statically. This leaves performance on the table and suggests that a crucial software/hardware co-design element—a runtime system for dynamically tuning these parameters—is missing. While this is likely beyond the scope of a single paper, acknowledging it more explicitly as a key component for a production-ready system would be beneficial.
Questions to Address In Rebuttal
-
Regarding the switch architecture (Section 6.2.1), can the authors elaborate on the decision to introduce a second crossbar? What are the primary limitations of a single, deeply pipelined switch architecture that led to this design? Could a recirculating packet design within a single-crossbar switch achieve similar functionality for cache hits, and what would be the performance implications?
-
The paper's primitives seem exceptionally well-suited for the one-sided "gather" communication pattern in SpMM. Could the authors comment on the challenges of applying NetSparse to kernels like SpGEMM, which often require two-sided communication and have less predictable data access patterns? Would the RIG primitive need to be fundamentally redesigned?
-
The sensitivity analysis shows that optimal performance depends on tuning parameters like the RIG batch size (Figure 15, page 12) and concatenation delay (Figure 17, page 12). How do the authors envision these parameters being set in practice? Would this require offline profiling for each sparse matrix, or could a lightweight online runtime make these decisions dynamically based on observed traffic characteristics?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper, "NetSparse," proposes a holistic, hardware-centric architecture to accelerate communication in distributed sparse computations. The core claim of novelty rests on the synthesis of four specific hardware mechanisms: 1) A Remote Indexed Gather (RIG) operation offloaded to SmartNICs to reduce host overhead; 2) Hardware-based filtering and coalescing of redundant property requests on the NIC; 3) A low-level protocol for concatenating requests to the same destination within both NICs and switches; and 4) A data-plane updatable hardware cache within Top-of-Rack (ToR) switches to serve local requests for remote data.
While individual concepts such as caching, request coalescing, and packet aggregation have historical precedents in various domains, the paper's primary novel contribution is the specific, synergistic application and hardware instantiation of these ideas tailored to the unique communication patterns of distributed sparse kernels. The most significant novel element is the proposal for a fast, data-plane-updated switch cache, which stands in contrast to prior work on control-plane-managed in-network caches.
Strengths
The paper's novelty is most apparent in the following areas:
-
Data-Plane-Updated In-Switch Cache: The proposal for the "Property Cache" in Section 6.2 (page 7) is a significant departure from prior art like NetCache [42]. The authors correctly identify that the short-lived nature of sparse kernel iterations makes control-plane cache management infeasible. By proposing a mechanism for data-plane updates (where response packets populate the cache), they architect a genuinely new solution for this specific problem domain. The corresponding switch architecture with middle pipes (Figure 8, page 8) is a concrete and novel proposal to enable this functionality without disrupting the primary forwarding path.
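To make the distinction from control-plane-managed caches concrete, a minimal behavioral model of an update-on-response cache is sketched below, assuming a simple direct-mapped organization; this illustrates the idea only, not the paper's switch pipeline.

```python
# Minimal behavioral model of a data-plane-updated property cache, assuming a
# direct-mapped organization. This is an illustration of the update-on-response
# idea, not the paper's switch microarchitecture.
class PropertyCache:
    def __init__(self, num_sets: int = 1024):
        self.num_sets = num_sets
        self.tags = [None] * num_sets
        self.data = [None] * num_sets

    def lookup(self, key: int):
        s = key % self.num_sets
        return self.data[s] if self.tags[s] == key else None  # hit -> serve within the rack

    def on_response(self, key: int, value: bytes) -> None:
        # Key difference from control-plane-managed designs such as NetCache: the
        # passing response packet itself installs the entry, with no controller
        # involvement, so the cache is useful within a single short kernel iteration.
        s = key % self.num_sets
        self.tags[s], self.data[s] = key, value
```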
-
The RIG Abstraction as a Semantic Offload: The concept of a Remote Indexed Gather (RIG) operation (Section 4, page 4) is a powerful and novel semantic offload. While RDMA provides primitive one-sided operations, the RIG encapsulates the entire "read index list, fetch remote data" pattern common in sparse computations. Offloading this entire pattern to a specialized "RIG Unit" (Section 5, page 5) is a novel step beyond simply offloading individual reads. It fundamentally changes the host-NIC interaction model for this class of problems.
-
Synergistic System Co-design: The primary strength of the work is not in a single isolated idea, but in the holistic co-design of the four mechanisms. For example, the RIG Units generate a high-rate stream of Property Requests (PRs), which in turn creates the opportunity for the hardware Concatenators to be effective. The switch cache then acts as a sink for many of these requests, reducing network traffic further. This tight integration of mechanisms at different points in the network (NIC and switch) represents a novel systems-level contribution.
Weaknesses
My analysis of prior art reveals conceptual overlap that tempers the novelty claims for some of the constituent mechanisms. The paper would be stronger if it more explicitly positioned its work against these broader concepts:
-
Request Coalescing is Not Fundamentally New: The "Property Request Filtering and Coalescing" mechanism (Section 4, page 4) is, at its core, a form of hardware-based request deduplication. This concept is well-established in other areas, such as memory controllers coalescing requests to the same DRAM row or write-combining buffers in CPUs. While the specific implementation via an "Idx Filter" and "Pending PR Table" is tailored to this problem, the underlying principle is an application of a known technique, not a de novo invention.
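A minimal sketch of the generic pending-request-table principle referred to here follows; it illustrates the established technique (MSHR-style coalescing), not the paper's exact Idx Filter / Pending PR Table design.

```python
# Sketch of the generic request-deduplication principle: a pending-request table,
# analogous to MSHR-style coalescing in memory controllers. Illustrative only; it
# is not the paper's exact Idx Filter / Pending PR Table implementation.
from collections import defaultdict


class PendingRequestTable:
    def __init__(self):
        self.waiters = defaultdict(list)   # remote index -> local consumers waiting on it

    def issue(self, index: int, consumer: int) -> bool:
        """Return True only if a new network request must actually be sent."""
        first = index not in self.waiters
        self.waiters[index].append(consumer)
        return first                        # duplicate requests are filtered / coalesced

    def complete(self, index: int, value) -> list[tuple[int, object]]:
        """Fan the single response back out to every coalesced consumer."""
        return [(c, value) for c in self.waiters.pop(index, [])]
```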
-
Packet Concatenation is an Established Principle: The idea of concatenating multiple smaller messages into one larger packet to amortize header overhead (Section 6.1, page 6) is a foundational concept in networking. Host-based implementations like TCP Segmentation Offload (TSO) and Generic Receive Offload (GRO) have existed for decades. The novelty here lies in the in-network implementation that can combine requests from different threads or even different source nodes (at the switch). However, the paper presents the concept of concatenation itself as a primary contribution, when the real novelty is its specific, dynamic implementation in the data plane.
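The header-amortization arithmetic that motivates concatenation is simple enough to state directly; the header and payload sizes below are assumed illustrative values, not figures from the paper.

```python
# Illustrative header-amortization arithmetic behind request concatenation.
# Header and payload sizes are assumed values, not figures from the paper.
HEADER_B, PAYLOAD_B = 64, 8
for k in (1, 16, 64):
    eff = k * PAYLOAD_B / (k * PAYLOAD_B + HEADER_B)
    print(f"{k:3d} requests/packet -> link efficiency {eff:.0%}")
```

The principle is old; what would be novel is performing this bundling dynamically in the data plane, across threads and source nodes.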
-
Framing of Novelty: The paper occasionally frames established principles as novel contributions. A more precise framing would be to acknowledge the established principles (e.g., request coalescing, packet aggregation) and then clearly state that the novelty lies in the specific, high-performance hardware architecture that implements these principles for the domain of sparse computations.
Questions to Address In Rebuttal
The authors should address the following questions to better delineate the novelty and justify the complexity of their proposal:
-
On the RIG Abstraction: The proposed RIG operation appears to be a batch of independent reads. Could a similar performance benefit be achieved with a simpler hardware primitive, such as support for a "chained list" of RDMA Read operations, which might require less specialized hardware than the full RIG Unit shown in Figure 5? Please clarify why the proposed RIG abstraction is fundamentally superior to simpler extensions of existing RDMA verbs.
-
On In-Switch Concatenation Complexity: The proposal to perform concatenation within the switch (Section 6.1.2, page 7) and the associated architectural changes (e.g., the second crossbar in Figure 8) introduce significant complexity to the switch ASIC. What is the performance delta of performing concatenation only at the NIC versus performing it at both the NIC and the switch? The rebuttal must justify that the marginal benefit of cross-node concatenation within the switch is substantial enough to warrant this radical departure from standard switch design.
-
On Cache Coherence and Updates: The novel data-plane cache update mechanism appears to implicitly assume that the properties being fetched are read-only for the duration of a kernel's iteration. What is the proposed mechanism for invalidation or updates if a property value changes at its source host mid-iteration? Please clarify the precise consistency model the Property Cache guarantees and the workload assumptions this model relies upon.
ColumnDisturb: Understanding Column-based Read Disturbance in Real DRAM Chips and Implications for Future Systems
Abstract
We experimentally demonstrate a new widespread read disturbance phenomenon, ColumnDisturb, in real commodity DRAM chips. By repeatedly opening or keeping a DRAM row (aggressor row) open, we show that it is possible to disturb DRAM cells through a DRAM ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper presents an experimental characterization of a purported new read disturbance phenomenon, "ColumnDisturb," on a large set of commodity DRAM chips. The authors claim that repeatedly activating a single aggressor row can induce bitflips in cells sharing the same columns across multiple subarrays, affecting thousands of rows. They attribute this phenomenon to a bitline-voltage-induced disturbance mechanism. The paper characterizes this effect across various parameters (temperature, data patterns, etc.), argues that it worsens with technology scaling, and concludes that it has severe implications for the robustness of future systems and the efficacy of retention-aware refresh schemes.
Strengths
-
Large-Scale Experimental Study: The characterization is performed on a substantial number of devices (216 DDR4 and 4 HBM2 chips) from three major manufacturers. This breadth lends some credence to the claim that the observed effects are not isolated to a single device or manufacturer.
-
Systematic Parameter Sweep: The authors conduct a comprehensive set of experiments, systematically varying operational parameters such as temperature, data pattern, timing parameters (tAggOn), and aggressor location. This methodological approach is commendable for its thoroughness.
-
Clear Presentation of Data: The results are, for the most part, presented clearly. Figures such as Figure 2 provide a compelling visual summary of the central claim, contrasting the alleged ColumnDisturb failures with RowHammer, RowPress, and retention failures.
Weaknesses
The paper’s foundational claims rest on several questionable assumptions and an insufficient decoupling of the observed phenomenon from known failure mechanisms. The primary weaknesses are:
-
Insufficient Differentiation from Activity-Induced Retention Degradation: The paper's most significant flaw is its failure to prove that "ColumnDisturb" is a fundamentally new physical phenomenon rather than a manifestation of accelerated retention failures under intense activity. The methodology for "Filtering Out Retention" (Section 3.2, p. 5) is inadequate. Profiling retention time in an idle state, even repeatedly, does not capture the worst-case retention behavior of a cell when the chip is under the thermal and electrical stress of a high-activity hammering test. It is well-established that DRAM cell retention is sensitive to temperature and voltage noise. The intense, localized activity of the test pattern will inevitably create thermal gradients and power supply fluctuations that are not present during an idle retention test. The paper does not provide sufficient evidence to rule out the simpler hypothesis: that "ColumnDisturb" is merely a new name for the well-known phenomenon of activity-induced retention degradation affecting a large number of "weak" but not-quite-failing cells.
-
Unsubstantiated Causal Mechanism: The central "Key Hypothesis" (Section 4.6, p. 8) that ColumnDisturb is caused by exacerbated subthreshold or dielectric leakage due to bitline voltage levels is purely speculative. The authors provide no device-level simulations, physical models, or direct measurements to substantiate this claim. The observed correlations (e.g., lower average column voltage leading to more bitflips) are consistent with this hypothesis, but they do not prove it. Other mechanisms, such as thermally-induced leakage, could produce similar macroscopic effects. Without stronger proof, the claim of a specific bitline-induced mechanism is an unsubstantiated leap. The paper's call for future device-level studies is an admission of this critical gap.
-
Unsupported Claims of Worsening with Technology Scaling: The conclusion in Observation 2 (p. 6) that vulnerability worsens with technology scaling is based on the weak proxy of die revision codes (Footnote 3, p. 4). This is a well-known heuristic in the research community, but it is not a rigorous method. Die revisions can denote metallization changes, minor circuit fixes, or other alterations that do not necessarily correspond to a shrink of the fundamental DRAM cell process technology. To make such a strong claim, the authors would need to provide direct evidence of a process shrink (e.g., from physical analysis) or a much more robust dataset that unequivocally links specific die revisions to known technology nodes. As it stands, this conclusion is not adequately supported.
-
Overstatement of Immediate System Impact: The critical claim in Observation 3 (p. 6) that ColumnDisturb induces bitflips within the nominal tREFW is based on results from "a single 16Gb F-die Micron module." This is a classic example of generalizing from an outlier. To justify the urgent tone and the broad implications claimed for current systems, the authors must demonstrate that this behavior is prevalent across a statistically significant portion of the tested modules. Without this, the finding appears to be a corner-case behavior of a particularly weak module rather than a widespread, imminent threat.
Questions to Address In Rebuttal
The authors must provide satisfactory answers to the following questions to validate the paper's core contributions:
-
On Decoupling from Retention: How can the authors definitively decouple the observed bitflips from activity-induced retention degradation? The current filtering methodology, which tests retention in an idle state, seems insufficient. What experiments can be performed to prove that the failure mechanism is distinct from retention loss accelerated by the thermal and electrical stress of the test itself?
-
On the Physical Mechanism: Given that the proposed bitline-voltage-induced leakage is a hypothesis, what alternative physical mechanisms (e.g., localized thermal effects, substrate noise, power supply droop) were considered and ruled out? What evidence allows the authors to definitively reject these alternative explanations?
-
On Technology Scaling: The conclusion regarding technology scaling hinges on the assumption that die revision letters directly correlate with process node shrinks. What concrete evidence supports this assumption for the specific chips tested? Without this, how can the claim be considered valid?
-
On Prevalence within tREFW: Regarding Observation 3, what percentage of all tested modules (not just cells or chips) exhibited ColumnDisturb bitflips within the nominal 64ms tREFW? The impact on current systems is minimal if this is an outlier affecting less than 1% of modules. Please provide a distribution.
-
On the Blast Radius: The claim of disturbing rows across three subarrays is based on the open-bitline architecture. How have the authors verified that there are no other long-range coupling mechanisms at play? Could the observed spatial distribution of errors be an artifact of the physical layout of power delivery or other shared resources that are stressed by the aggressor row activation?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces and experimentally characterizes a novel and widespread read disturbance phenomenon in commodity DRAM, which the authors term "ColumnDisturb." Unlike the well-studied RowHammer and RowPress phenomena that cause bitflips in physically adjacent rows within a single subarray, ColumnDisturb is a column-based effect. By repeatedly activating (or keeping active) a single aggressor row, the authors demonstrate that it is possible to induce bitflips in cells that share the same columns (bitlines) across multiple physically adjacent subarrays.
The core contribution is the discovery, comprehensive characterization (across 220 chips from three major manufacturers), and system-level impact analysis of this fundamentally new disturbance mechanism. The authors convincingly show that ColumnDisturb has a much larger "blast radius" than RowHammer, affecting thousands of rows simultaneously. Critically, they demonstrate that this phenomenon undermines the foundational assumptions of existing retention-aware heterogeneous refresh mechanisms, potentially negating their performance and energy benefits.
Strengths
-
Fundamental and Significant Discovery: The paper's primary strength is the identification of a new, qualitatively different hardware failure mechanism. The academic and industrial communities have spent the better part of a decade focused on row-adjacency-based disturbances. This work compellingly argues that our mental model of read disturbance is incomplete. By shifting the focus from horizontal (row-to-row) coupling to vertical (column-based) coupling, the authors open up a new and important avenue of inquiry in memory reliability and security. The clear visual distinction in Figure 1 (page 2) and the supporting data in Figure 2 (page 2) immediately establish the novelty and credibility of the finding.
-
Exceptional Experimental Rigor: The experimental methodology is exhaustive and represents a high standard for systems research. The characterization across 216 DDR4 and 4 HBM2 chips, spanning all three major manufacturers and multiple die revisions, provides strong evidence that ColumnDisturb is a widespread and systematic issue, not an anomaly. The detailed analysis under varying conditions (temperature, data patterns, timing parameters) provides a rich dataset that will be invaluable for future work by device physicists, security researchers, and system architects.
-
Crucial Contextualization and Impact Analysis: This is not merely a paper about a new type of bitflip; it is a paper about the systemic consequences of that bitflip. The most insightful part of the work is the analysis in Section 6.2, where the authors evaluate the impact of ColumnDisturb on a state-of-the-art retention-aware refresh mechanism (RAIDR). By showing that ColumnDisturb can completely diminish the benefits of such schemes (as quantified in Figures 22 and 23), the authors connect their low-level discovery to high-level system performance and energy goals. This bridges the gap between device characterization and computer architecture, making the work highly relevant to this conference's audience. It effectively shows that a cell's "strength" cannot be defined by its retention time alone, a critical insight that challenges a large body of prior work.
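To illustrate why this insight matters, a toy refresh-cost model is sketched below; the bin fractions and refresh intervals are hypothetical and serve only to show how column-based victims erode retention-only binning.

```python
# Toy model of why ColumnDisturb undermines retention-aware refresh (RAIDR-style binning).
# The bin fractions and rates are hypothetical, chosen only to illustrate the point that
# victims identified by retention profiling alone are an incomplete set.
def refresh_ops_per_s(weak_fraction: float, rows: int = 1 << 20,
                      fast_rate_hz: float = 1 / 0.064,   # 64 ms refresh for "weak" rows
                      slow_rate_hz: float = 1 / 0.256) -> float:  # 256 ms for the rest
    weak = int(rows * weak_fraction)
    return weak * fast_rate_hz + (rows - weak) * slow_rate_hz

baseline   = refresh_ops_per_s(weak_fraction=1.0)    # refresh everything at 64 ms
raidr_like = refresh_ops_per_s(weak_fraction=0.01)   # only retention-weak rows at 64 ms
# If column-sharing victims force, say, a third of rows back into the fast bin,
# most of the savings vanish:
disturbed  = refresh_ops_per_s(weak_fraction=0.34)
print(f"savings vs. baseline: retention-only binning {1 - raidr_like / baseline:.0%}, "
      f"with ColumnDisturb victims {1 - disturbed / baseline:.0%}")
```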
Weaknesses
My concerns are not with the quality of the work presented, but rather with the natural limits of a discovery-focused paper. These are areas that represent exciting opportunities for follow-up research.
-
Lack of a Definitive Physical Explanation: The authors provide a well-reasoned "Key Hypothesis" (page 8) that ColumnDisturb is caused by exacerbated subthreshold leakage of the access transistor and/or dielectric leakage between the capacitor and the bitline, driven by the voltage difference. While plausible and consistent with the data, this remains a hypothesis. The paper stops short of a definitive device-level physical model or simulation that could confirm the root cause. This is understandable given the scope, but it is the most critical piece missing from a complete scientific understanding of the phenomenon.
-
Preliminary Nature of Proposed Mitigations: The proposed solutions in Section 6.1, particularly Proactively Refreshing ColumnDisturb Victim Rows (PRVR), are presented as high-level ideas rather than fully architected and evaluated mechanisms. The analysis is largely analytical and serves to demonstrate the inefficiency of simply increasing the global refresh rate. While this effectively highlights the problem's difficulty, a more developed mitigation strategy would have strengthened the paper's contribution to building robust future systems.
Questions to Address In Rebuttal
The authors have presented a fascinating and important piece of work. I would be very interested to hear their thoughts on the following points to better understand the broader context and future trajectory of this research:
-
On the Physical Mechanism: The paper hypothesizes that voltage differences across the access transistor or capacitor dielectric are the root cause. Could the authors elaborate on why they favor these mechanisms over other potential ones, such as charge sharing phenomena or disturbances propagated through the substrate or shared sense amplifiers in a way not previously understood? Are there any experiments they have considered (or could suggest) that might further pinpoint the exact physical cause?
-
Beyond Retention-Aware Refresh: The paper brilliantly demonstrates the impact on retention-aware refresh optimizations. Have the authors considered other classes of system-level or circuit-level optimizations that might be inadvertently compromised by ColumnDisturb? For example, could techniques that rely on data compression within DRAM, or certain processing-in-memory (PIM) operations that might stress bitlines, be similarly vulnerable?
-
Security Implications: The discovery of RowHammer quickly led to a new class of security exploits. While this paper focuses on reliability, the ability to flip bits in thousands of rows by accessing a single aggressor row seems ripe for exploitation. Could the authors comment on the potential for ColumnDisturb to be "weaponized" into a security attack? Would such an attack be fundamentally easier or harder to mount than a traditional RowHammer attack, considering its broad but potentially less targeted nature?
-
Future Technology Scaling: The paper provides strong evidence that ColumnDisturb worsens with technology scaling in DDR4 (Observation 2, page 6). How do the authors project this phenomenon will manifest in newer memory technologies like DDR5, LPDDR5, and emerging 3D-stacked DRAM architectures? Do they foresee any fundamental changes in device structure that might mitigate or exacerbate this effect?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present "ColumnDisturb," which they claim is a novel, column-based read disturbance phenomenon in commodity DDR4 and HBM2 DRAM chips. The core idea is that activating an aggressor row perturbs the shared columns (bitlines), inducing bitflips in thousands of victim rows across multiple physically adjacent subarrays. This stands in contrast to well-known phenomena like RowHammer and RowPress, which are row-based and affect only a few physically adjacent rows within a single subarray. The paper provides an extensive experimental characterization of this phenomenon across various parameters (technology scaling, temperature, data patterns) and evaluates its implications, particularly for retention-aware refresh mechanisms. The authors also propose and evaluate mitigation strategies.
Strengths
-
Identification of a New Phenomenon in Commodity DRAM: The central and most significant contribution is the experimental identification and characterization of a read disturbance phenomenon that is conceptually distinct from the well-known row-based disturbances. The paper does a commendable job of differentiating ColumnDisturb from RowHammer, RowPress, and simple retention failures through careful experimental design (Section 3.2, pages 4-5). The observation that a single aggressor row can induce failures in up to 3072 rows across three subarrays (Observation 4, page 6) is a powerful demonstration of this distinction. If this phenomenon is indeed as widespread as the 216 tested chips suggest, its discovery is a novel and important contribution to the field of memory reliability.
-
Rigorous Phenomenological Characterization: The novelty is not merely in the initial discovery but in the comprehensive characterization presented in Sections 4 and 5. The analysis of how the phenomenon is affected by scaling (Observation 2, page 6), data patterns (Section 4.4, page 7), and average bitline voltage (Section 4.6, page 8) provides a strong foundation for future work. This detailed data substantiates the claim that this is a repeatable and distinct physical effect, not an experimental artifact.
-
Novel Implications for Existing Systems: The paper uncovers a novel failure mechanism that has profound and previously unconsidered implications for existing and proposed technologies. The analysis in Section 6.2 (page 13) showing how ColumnDisturb can severely degrade or even completely negate the benefits of retention-aware refresh mechanisms (like RAIDR) is a novel insight. This challenges the fundamental assumptions of a large body of prior work that only considers retention failures when defining "weak" rows.
Weaknesses
-
Qualification of Novelty Regarding Bitline Disturbance: The paper's claim to be the first work to demonstrate a column-based disturbance needs careful qualification. The concept of bitline-induced disturbance is not entirely new in the broader context of DRAM research. As the authors themselves acknowledge in Section 7 (page 15), prior work on emerging 4F2 VCT DRAM architectures has identified vulnerabilities to disturbances from the bitline with a "hammering-like access pattern" [37-39, 129, 145]. While the authors correctly argue that the device physics and architecture (4F2 VCT vs. commodity 6F2) are different, the conceptual principle of bitline voltage stress inducing errors in physically separate cells is overlapping. Therefore, the core novelty of this work is being the first to discover, demonstrate, and characterize this class of phenomenon in widespread, commodity 6F2 DRAM, rather than inventing the concept of bitline disturbance itself. This distinction should be made clearer.
-
Hypothetical Causal Mechanism: While the paper provides a strong phenomenological characterization and a compelling hypothesis linking the effect to average bitline voltage (Key Hypothesis, Section 4.6, page 8), the explanation of the underlying device physics remains hypothetical. The authors suggest subthreshold leakage or dielectric leakage as potential causes but do not provide definitive evidence to confirm the mechanism or distinguish between these possibilities. The novelty lies in the observation, but a truly complete contribution would require a deeper, device-level validation of the proposed cause. This is a common challenge in experimental papers but remains a limitation on the fundamental novelty of the explanation.
-
Incremental Novelty of Proposed Solutions: The proposed mitigation strategy, Proactively Refreshing ColumnDisturb Victim Rows (PRVR), described in Section 6.1 (page 13), is a logical but not fundamentally novel application of proactive refresh principles. The core idea is to identify and refresh victims before they fail. This concept is the basis for most RowHammer mitigations. The novelty here is in the engineering adaptation to the unique victim profile of ColumnDisturb (thousands of rows across subarrays). This is an important engineering contribution, but it does not represent a new algorithmic or architectural concept for mitigation.
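A generic sketch of what such an adaptation might look like is given below; the subarray geometry follows the paper's reported ~3-subarray blast radius, while the activation threshold and tracking policy are purely illustrative and are not the authors' mechanism.

```python
# Generic sketch of a PRVR-style targeted proactive refresh adapted to a column-based
# victim profile (the aggressor's subarray plus its two neighbors). The threshold and
# tracking policy are illustrative placeholders, not the authors' design.
ROWS_PER_SUBARRAY = 1024
ACT_THRESHOLD = 4096            # hypothetical activation count before victims are refreshed


def victim_rows(aggressor_row: int) -> range:
    sub = aggressor_row // ROWS_PER_SUBARRAY
    lo = max(sub - 1, 0) * ROWS_PER_SUBARRAY
    hi = (sub + 2) * ROWS_PER_SUBARRAY          # up to 3 * 1024 = 3072 column-sharing rows
    return range(lo, hi)


def on_activation(act_counts: dict[int, int], row: int, refresh) -> None:
    act_counts[row] = act_counts.get(row, 0) + 1
    if act_counts[row] >= ACT_THRESHOLD:
        # Unlike a RowHammer TRR-style refresh of a few neighbors, the refresh here
        # must cover thousands of rows spread across adjacent subarrays.
        for v in victim_rows(row):
            refresh(v)
        act_counts[row] = 0
```

The engineering difference from RowHammer mitigations is the size and location of the victim set, which is exactly the point on which the rebuttal should concentrate.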
Questions to Address In Rebuttal
-
Regarding Prior Art on Bitline Disturbance: Please elaborate further on the fundamental differences between ColumnDisturb and the bitline disturbances observed in 4F2 VCT DRAM [37-39]. Beyond the device architecture, is the conceptual principle of bitline voltage stress causing errors in non-accessed cells fundamentally new, or is this work's key contribution the first demonstration that this principle applies to and is widespread in commodity DRAM?
-
Regarding the Physical Mechanism: The "Key Hypothesis" in Section 4.6 (page 8) is critical to understanding the novelty of the underlying effect. What evidence, beyond the correlation with average bitline voltage, can you provide to support your hypothesized mechanisms (subthreshold/dielectric leakage)? Are there any experiments you could propose or conduct (e.g., by varying timing parameters in a specific way) that could provide stronger evidence favoring one hypothesis over the other?
-
Regarding the Novelty of PRVR: The PRVR mechanism appears to be a targeted proactive refresh scheme. Could you please contrast its core algorithmic novelty against prior work in targeted refresh for other phenomena like RowHammer (e.g., ProTRR [140]), beyond the obvious difference in the scale and location of victim rows? Is there a novel tracking or scheduling component that is not simply an adaptation of existing ideas?
SuperSFQ: A Hardware Design to Realize High-Frequency Superconducting Processors
Abstract
Superconducting computing using single flux quantum (SFQ) technology has been recognized as a promising post-Moore’s law era technology thanks to its extremely low power and high performance. Therefore, many researchers have proposed various SFQ-based ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose "SuperSFQ," a design methodology intended to overcome the frequency limitations of conventional clocking schemes in Single Flux Quantum (SFQ) circuits, particularly those with feedback loops. The methodology is composed of three parts: (1) "SuperClocking," a scheme that uses different clock signals for feedforward and feedback paths to break synchronous timing constraints; (2) an "SFQ Feedback Synchronizer" to manage the resulting timing unreliability at the feedback interface; and (3) two architectural guidelines, "Loop Alignment" and "Alternating Synchronizer," to correct functional failures introduced by their solution. The authors claim that applying this methodology to a general-purpose SFQ CPU and other benchmark circuits results in dramatic frequency and performance improvements (up to 88.5x higher frequency) with what they characterize as a modest overhead (34.4% JJ).
However, the paper's extraordinary claims rest on a methodologically flawed validation strategy. The core results are not derived from a full-scale, verifiable simulation but from a combination of analysis on a severely scaled-down 4-bit model and isolated partial simulations. The proposed architectural solutions, while intended to fix functional bugs, introduce significant complexity and their own unsubstantiated assumptions, undermining the robustness of the entire framework.
Strengths
-
Problem Identification: The paper provides a clear and accurate analysis of the limitations of existing SFQ clocking schemes (H-tree, concurrent-flow, counter-flow), correctly identifying feedback loops as the primary performance bottleneck in concurrent-flow designs (Section 2.2, page 4-5). This diagnosis of the problem is sound.
-
Conceptual Framework: The core idea of treating the feedforward-to-feedback path interface as an asynchronous boundary is a conceptually valid direction for investigation. The decomposition of the problem into a clocking scheme, a synchronizer, and architectural rules is logical.
Weaknesses
-
Fundamentally Inadequate Validation: The paper’s claims of performance are not substantiated by rigorous, scalable evidence. The authors concede that "analog simulations cannot support the large number of JJs in the 32-bit SuperCore" (Section 6.1, page 10). Their validation relies on two insufficient proxies:
- A 4-bit SuperCore: A 4-bit datapath is a toy model. It is not representative of a 32-bit processor. Critical physical design issues such as clock distribution skew, jitter accumulation over long paths, and power grid noise do not scale linearly. Extrapolating performance and reliability from a 4-bit design to a 32-bit, ~1M JJ processor is an unjustifiable leap of faith.
- Partial Simulation: Simulating only the "feedback path of 32-bit SuperCore" while emulating feedforward paths with RTL is highly suspect. The interface between the analog simulation (JoSIM) and the RTL emulation is not described in sufficient detail. It is unclear how the precise, analog-level timing jitter and bias fluctuation from the feedforward path—the very phenomena this paper is about—are accurately modeled and injected into the simulation of the feedback path. This approach decouples components that are intrinsically linked in a real circuit, invalidating the results.
-
Unjustified Assumptions in Synchronizer Design: The proposed SFQ Feedback Synchronizer is built on a critical, and questionable, design choice. The authors state they "exclude the handshaking to achieve a high frequency" (Section 4.3, page 7), arguing it is unnecessary in a single clock domain. This argument is specious. SuperClocking explicitly creates an asynchronous interface where the arrival time of feedback data is non-deterministic. This is a classic clock-domain crossing scenario where flow control (i.e., handshaking) is essential to prevent data overflow and loss. The authors provide no formal proof that overflow is impossible, especially in scenarios with high-frequency burst data in the feedback loop.
-
Unproven Correctness of Architectural Solutions: The architectural guidelines proposed to fix the functional flaws of SuperClocking introduce their own unsubstantiated claims of correctness.
- In the Alternating Synchronizer, the authors claim an "empty cycle always exists" to reset the arbiter's state, which is necessary to prevent collisions (Section 5.2.2, page 9). This is an extremely strong claim presented as an observation ("we observe that...") rather than a formal proof. A rigorous proof by induction or a formal model is required to guarantee this condition holds for all possible instruction sequences and timing violations. Without it, the arbiter’s correctness is not guaranteed.
- The Loop Alignment guideline appears to contradict the paper's primary motivation. By forcing the feedforward and feedback paths to "share identical clock" (Section 5.1.2, page 8), it seems to re-introduce the very timing dependencies that SuperClocking was designed to break. The paper fails to analyze whether this re-coupling constrains the feedforward path's frequency, potentially negating the benefits of SuperClocking.
-
Understated Overhead and Complexity: The paper claims "only 34.4% Josephson junction overhead" (Abstract, page 1). However, the breakdown in Figure 17(c) (page 12) shows that "Loop Alignment" accounts for 21.1% of the total JJ count, making it the single largest contributor to the overhead. This is not a minor tweak; it is a significant architectural modification that fundamentally alters the pipeline structure by parallelizing stages. The complexity cost of this and the Alternating Synchronizer is non-trivial and is not adequately captured by a simple JJ count.
Questions to Address In Rebuttal
-
Provide a rigorous justification for why results from a 4-bit processor model can be reliably extrapolated to a 32-bit processor, especially concerning timing jitter accumulation and clock network integrity.
-
Detail the methodology for interfacing the RTL emulation of the feedforward path with the analog simulation of the feedback path. Specifically, how are analog effects like thermal jitter and bias-dependent delay from the emulated path modeled and accurately injected into the JoSIM simulation?
-
Present a formal proof that data overflow cannot occur in the SFQ Feedback Synchronizer after the handshaking mechanism was removed. The proof must hold under bursty, high-frequency feedback conditions.
-
Provide a formal proof for the claim that an empty cycle "always exists" between collision events in the Alternating Synchronizer (Section 5.2.2). An observational argument based on simulation is insufficient.
-
Quantify the timing impact of Loop Alignment on the feedforward path. Does forcing paths to share a clock not create new critical paths that limit the maximum achievable frequency, thereby reducing the claimed benefits of SuperClocking?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper addresses a fundamental and well-recognized bottleneck in the field of superconducting computing: the inability of existing clocking schemes to enable high-frequency operation in general-purpose circuits, particularly those containing feedback loops. The authors propose "SuperSFQ," a comprehensive hardware design methodology that co-designs the clocking scheme, circuitry, and architecture to unlock the multi-GHz potential of single-flux-quantum (SFQ) technology.
The core contribution is a three-part solution:
1. SuperClocking: A novel clocking scheme that strategically breaks the synchronous clocking convention. It allows feedback signals to be triggered by an earlier, independent clock pulse, effectively removing the long latency of the feedback path from the critical path of the main circuit.
2. SFQ Feedback Synchronizer: To manage the reliability issues (timing violations from jitter and bias fluctuation) introduced by SuperClocking, the authors develop a specialized, low-overhead synchronizer circuit, adapted from CMOS multi-flop principles but optimized for SFQ by removing handshaking and using concurrent-flow clocking internally.
3. Architectural Guidelines: Recognizing that resolving timing violations does not guarantee functional correctness, the authors propose two architectural patterns—"Loop Alignment" and "Alternating Synchronizer"—to handle data mismatch and data collision issues that arise in specific types of feedback loops common in processors (e.g., write-back, data forwarding).
The authors validate their methodology through extensive simulation on the latest general-purpose SFQ processor design (SuperCore) and a wide range of benchmark circuits, demonstrating frequency improvements of over 60x compared to conventional SFQ designs with a manageable hardware overhead.
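An illustrative timing model, with assumed latencies that are not taken from the paper, shows why decoupling the feedback path translates into such large frequency gains.

```python
# Illustrative timing model of the feedback bottleneck under concurrent-flow clocking.
# Both latencies are assumed values chosen for illustration, not the paper's numbers.
FEEDFORWARD_STAGE_PS = 20     # assumed per-stage logic delay plus clock-skew margin
FEEDBACK_PATH_PS = 1200       # assumed latency of a long feedback loop (e.g., write-back)

conventional_ghz = 1e3 / max(FEEDFORWARD_STAGE_PS, FEEDBACK_PATH_PS)  # loop bounds the clock
superclocked_ghz = 1e3 / FEEDFORWARD_STAGE_PS   # loop retimed by an earlier clock + synchronizer
print(f"conventional concurrent-flow: {conventional_ghz:.1f} GHz; "
      f"feedback decoupled: {superclocked_ghz:.1f} GHz "
      f"({superclocked_ghz / conventional_ghz:.0f}x)")
```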
Strengths
-
Addresses a Foundational Bottleneck: This work does not present an incremental improvement; it tackles what is arguably the primary obstacle preventing SFQ technology from realizing its theoretical performance potential in complex, general-purpose processors. The analysis in Section 2.2 (page 4) provides a clear and compelling diagnosis of why existing clocking schemes (H-tree, concurrent-flow, counter-flow) are fundamentally inadequate for designs with feedback loops, which are ubiquitous in any non-trivial architecture. By solving this problem, the paper unlocks a path forward for the entire field.
-
An Elegant and Pragmatic Core Idea: The central concept of SuperClocking is intellectually satisfying. Rather than forcing a globally synchronous design, the authors adopt a more nuanced approach that is effectively "asynchronous-where-it-hurts, synchronous-where-it-matters." This mirrors the successful Globally Asynchronous, Locally Synchronous (GALS) paradigm in the CMOS world but is applied here in a novel and SFQ-specific context. Decoupling the feedback path timing from the feedforward path is a powerful insight that elegantly sidesteps the core limitation of concurrent-flow clocking.
-
Holistic, System-Level Co-Design: The most significant strength of this paper is its completeness. The authors do not stop at the clever clocking scheme. They follow the thread of consequences from the circuit level to the architectural level. They anticipate the reliability problems created by their own solution and design a custom circuit (the SFQ Feedback Synchronizer, Section 4, page 6) to solve it. They then anticipate the functional correctness problems and propose architectural patterns (Loop Alignment and Alternating Synchronizer, Section 5, page 7) to solve those. This demonstrates a deep, systems-level understanding and transforms a simple circuit trick into a robust and genuinely usable design methodology.
-
Excellent Contextualization and Demonstrated Impact: The paper does an outstanding job of grounding its contribution in the real world. By targeting the state-of-the-art SuperCore processor, the work is immediately relevant. Furthermore, the evaluation across 48 standard benchmark circuits demonstrates generality. Most importantly, the comparison against a modern out-of-order CMOS processor (SonicBOOM) in Section 8.1 (page 13) provides the crucial link to the broader computer architecture community. It shows that with this methodology, SFQ processors can achieve comparable end-to-end performance to high-performance CMOS designs (despite a much simpler in-order core), making a compelling case for the technology's relevance.
Weaknesses
My critiques are not focused on fatal flaws but rather on opportunities to further strengthen the paper's intellectual positioning and explore the boundaries of the proposed methodology.
-
Implicit Connection to Asynchronous Design Paradigms: While the work is brilliant, it could be better situated within the broader history of asynchronous and semi-synchronous design. The paper frames its contribution almost exclusively against synchronous SFQ schemes. However, the core idea of using synchronizers to bridge timing domains is the cornerstone of GALS design. A more explicit discussion of how SuperSFQ relates to, borrows from, or differs from these well-established paradigms would not diminish the contribution but rather place it on a firmer academic foundation.
-
Scalability of the Architectural Guidelines: The guidelines for Loop Alignment and the Alternating Synchronizer are presented clearly and work well for the cases shown. However, the process described for handling nested and intertwined loops (Section 5.1.3, page 8) appears to be a manual, architectural refactoring. For future architectures of even greater complexity, this manual intervention could become a significant design burden. The paper would benefit from a discussion on the potential for automating this analysis and transformation within an EDA tool flow. Is this methodology fundamentally reliant on clever architects, or can it be systematized?
-
Potential for Overhead Growth in Pathological Cases: The reported 34.4% JJ overhead for SuperCore is very reasonable for the massive performance gain. However, the paper notes that Loop Alignment can require converting feedforward data into feedback data, thereby increasing the bit-width of the synchronizers. In an architecture with a very high density of complex, matching-required loops, it is conceivable that the overhead from these widened synchronizers and complex arbiters could become more substantial. A brief discussion of the potential worst-case overhead would add valuable nuance.
Questions to Address In Rebuttal
-
Could the authors elaborate on the relationship between SuperClocking and the established Globally Asynchronous, Locally Synchronous (GALS) design paradigm? To what extent can SuperSFQ be considered an SFQ-specific implementation of GALS principles for managing long-latency feedback paths?
-
The proposed architectural guidelines are demonstrated effectively on SuperCore. As designs become more complex, does the application of Loop Alignment and Alternating Synchronizers remain a manual process for the architect, or do the authors foresee a path toward automating this analysis and transformation within a design tool flow?
-
The JJ overhead is shown to be manageable for the evaluated benchmarks. However, could the authors comment on how the overhead of the synchronizers and arbiters might scale in architectures with a higher density of "consecutive" and "matching-required" loops, where the bit-width of the synchronizers might increase significantly due to the Loop Alignment process?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents "SuperSFQ," a hardware design methodology aimed at overcoming the frequency limitations of Single Flux Quantum (SFQ) circuits, particularly in designs with feedback loops, such as general-purpose processors. The authors identify that existing synchronous clocking schemes, especially concurrent-flow clocking, are bottlenecked by the long latency of feedback paths.
Their proposed solution consists of three main components:
1. SuperClocking: A new clocking scheme that breaks the synchronous convention by triggering feedback data with a different, earlier clock pulse than the feedforward data, effectively treating the feedback path as an asynchronous channel.
2. SFQ Feedback Synchronizer: A specialized synchronizer circuit placed at the receiving end of the feedback path to resolve the resulting timing unreliability and re-synchronize the feedback data to the main pipeline clock.
3. Architectural Guidelines: Two specific architectural patterns, "Loop Alignment" and "Alternating Synchronizer," to ensure functional correctness (i.e., preventing data mismatch and data collision) in architectures that adopt SuperClocking.
The authors claim this co-design of clocking, circuitry, and architecture unlocks significant frequency improvements, which they validate through simulation on benchmark circuits and a general-purpose SFQ CPU design.
Strengths
- The paper correctly identifies a critical and well-known bottleneck in high-performance SFQ design: the timing closure of feedback loops. The analysis in Section 2 (pages 3-5) is a clear and accurate summary of the state of the art and its limitations.
- The work is holistic. The authors did not just propose a clocking scheme but also anticipated and addressed the subsequent reliability (timing) and correctness (functional) issues. This systematic approach of identifying a problem and providing a complete set of solutions (circuit-level and architectural) is commendable.
- The paper does a good job of differentiating its work from the most closely related prior art in SFQ, namely "dual clocking" (Section 8.2.1, page 13), correctly pointing out the limitations of that approach for general-purpose designs with complex, intertwined loops.
Weaknesses
The primary weakness of this paper lies in the fundamental novelty of its core ideas. While the application and integration of these ideas into a complete SFQ design methodology is well-executed, the underlying concepts themselves are adaptations of well-established principles from the broader field of digital logic and asynchronous design.
-
"SuperClocking" is functionally a GALS approach: The core idea of SuperClocking—decoupling the timing of a specific path (the feedback loop) from the main synchronous domain—is conceptually identical to a Globally Asynchronous, Locally Synchronous (GALS) design paradigm. In GALS, different synchronous islands communicate via asynchronous channels, using synchronizers at the boundaries. SuperClocking effectively treats the main feedforward pipeline as one synchronous island and the feedback path as an asynchronous wrapper channel that delivers data back to the same island. The novelty is therefore not in the invention of this technique but in its specific application to solve the feedback problem in concurrent-flow SFQ circuits.
-
The "SFQ Feedback Synchronizer" is an optimization of a known circuit: The paper acknowledges that its synchronizer is based on the multi-flop FIFO synchronizer (Section 4.1, page 6), a standard circuit for cross-clock-domain communication. The authors’ contributions are two optimizations: (a) removing the handshaking logic and (b) using concurrent-flow clocking within the DFF chain itself. These are clever, domain-specific optimizations that improve performance for this particular use case, but they do not represent a fundamentally new synchronizer topology. The novelty is incremental, not foundational.
-
The "Architectural Guidelines" are applications of standard design patterns:
- The Alternating Synchronizer (Section 5.2, page 8) is a classic ping-pong buffer architecture (a minimal behavioral sketch of the generic pattern follows this list). Interleaving data streams between two parallel resources to handle back-to-back inputs is a textbook technique used in everything from I/O controllers to network switches. The implementation using SFQ logic is specific to this paper, but the architectural pattern is not novel.
- Loop Alignment (Section 5.1, page 7) is an architectural refactoring to manage a data dependency hazard created by the SuperClocking scheme. Re-architecting pipelines to ensure data arrives at the correct time and place is a standard part of computer architecture. The novelty is in identifying this specific hazard and proposing a solution, but the act of pipeline restructuring itself is not a new concept.
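The sketch below models the generic ping-pong pattern referenced above at the behavioral level only; it is not the paper's SFQ arbiter circuit.

```python
# Behavioral sketch of the generic ping-pong (2-way interleaved) buffering pattern.
# It models the architectural idea only, not the paper's SFQ arbiter or DFF circuits.
class PingPongSynchronizer:
    def __init__(self):
        self.slots = [None, None]
        self.wr = 0             # producer alternates between the two buffers
        self.rd = 0             # consumer drains them in the same order

    def push(self, token) -> None:
        # Back-to-back feedback tokens land in alternating slots, so a new token can
        # be accepted while the previous one is still being synchronized/consumed.
        self.slots[self.wr] = token
        self.wr ^= 1

    def pop(self):
        token, self.slots[self.rd] = self.slots[self.rd], None
        self.rd ^= 1
        return token
```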
In essence, the paper's contribution is a significant engineering achievement that cleverly combines and adapts existing design principles to the unique constraints of SFQ technology. However, it does not introduce a fundamentally new theory of clocking or synchronization.
Questions to Address In Rebuttal
-
The core concept of SuperClocking appears to be an application of the GALS design style to a single feedback loop. Could the authors please clarify what makes this approach fundamentally novel beyond the application of known asynchronous principles to the SFQ domain? Is there a key insight that is unique to SFQ physics or circuits that makes this more than a direct translation of a known concept?
-
The Alternating Synchronizer presented in Figure 13 (page 9) is an implementation of a 2-way interleaved (ping-pong) synchronizer. Given that this is a standard architectural pattern, please justify the claim of novelty. Is the novelty in the specific SFQ arbiter circuit design, or in the application of the pattern itself?
-
The paper's greatest strength appears to be the co-design and the holistic integration of multiple known concepts into a working system that achieves impressive results. Would the authors agree that the primary contribution is this novel synthesis of techniques, rather than the novelty of the individual component ideas themselves? If so, the paper might be strengthened by framing its contribution more explicitly in this light.
Characterizing and Optimizing Realistic Workloads on a Commercial Compute-in-SRAM Device
Abstract
Compute- in-SRAM architectures offer a promising approach to achieving higher performance and energy efficiency across a range of data-intensive applications. However, prior evaluations have largely relied on simulators or small prototypes, limiting the ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present a performance characterization of a commercial compute-in-SRAM device, the GSI APU. They propose an analytical performance model and three optimization strategies (communication-aware reduction mapping, coalesced DMA, broadcast-friendly layouts) aimed at mitigating data movement bottlenecks. These are evaluated on the Phoenix benchmark suite and a Retrieval-Augmented Generation (RAG) workload. The central claims are that these optimizations yield significant speedups (e.g., up to 6.6x in RAG retrieval over CPU) and that the system can match the RAG performance of an NVIDIA A6000 GPU while being substantially more energy efficient. However, the work's conclusions rest on a precarious foundation of a partially simulated system, questionable baseline comparisons, and optimizations that may not generalize beyond the target device's idiosyncratic architecture.
Strengths
- Use of Commercial Hardware: The study is grounded in a real, commercial compute-in-SRAM device. This is a commendable departure from the purely simulation-based or small-prototype studies that are common in this field, providing valuable data on the characteristics and challenges of a production system.
- Relevant Workload: The choice of Retrieval-Augmented Generation (RAG) is timely and complex. It stresses memory bandwidth, compute, and data layout, making it a suitable test case for evaluating the claimed benefits of compute-in-SRAM architectures.
- Systematic Optimization Breakdown: The paper does a clear job of isolating and evaluating its proposed optimizations. Figure 12 provides a lucid breakdown of how each optimization contributes to the final performance of the binary matrix multiplication kernel, which helps in understanding their individual effects.
Weaknesses
-
The RAG Evaluation is Fundamentally Flawed by a Hybrid Real-Simulated Methodology: The paper's headline claim regarding RAG performance matching an A6000 GPU is severely undermined by the experimental setup. The authors concede in Section 5.3.1 (Page 11) that the GSI APU's actual DDR bandwidth (23.8 GB/s) would be a "bottleneck." To circumvent this, they model the off-chip memory using a simulated HBM2e system. This is a critical methodological flaw. The paper is no longer "Characterizing... a Commercial... Device" but rather characterizing a hypothetical system that does not exist. The performance and energy results for the most significant workload (RAG) are therefore speculative. The interaction latency and energy between the real APU's memory controller and the simulated HBM are not detailed, leaving the accuracy of this hybrid model in serious doubt.
-
Insufficient Detail and Rigor in Baseline Comparisons: The claims of superiority depend entirely on the fairness of the baselines, which are not sufficiently established.
- GPU Baseline: The paper claims to match the performance of an NVIDIA A6000 GPU on RAG. However, there is a stark lack of detail on the GPU implementation. Was an optimized library like FAISS-GPU used? If so, which index was employed? A brute-force inner product search on a high-end GPU can be heavily optimized with CUDA. Without a detailed description of the GPU software configuration and optimization level, the claim of "matching performance" is unsubstantiated. It is possible the presented APU system is being compared against a suboptimal GPU implementation.
- CPU Baseline: While the use of FAISS with AVX512 and OpenMP (Section 5.3.2, Page 11) is a respectable starting point, the term "optimized CPU baseline" is used without detailing what specific optimizations were performed beyond using the library as-is.
-
Overstated Generality of Optimizations: The paper presents its three optimizations as general principles for compute-in-SRAM. However, their efficacy appears deeply tied to the unique and arguably peculiar architecture of the GSI APU.
- The core "communication-aware reduction mapping" optimization hinges on the fact that intra-VR reductions are significantly more expensive than inter-VR reductions on this specific device (Section 2.1.2, Page 4). This is a microarchitectural artifact of the GSI APU's ultra-long vector design, not a fundamental property of all C-SRAM systems.
- Similarly, coalescing DMA via "subgroup copy" (Section 4.3, Page 8) relies on a specific hardware feature.
- Consequently, the paper demonstrates clever engineering for one specific device but fails to provide convincing evidence that these are broadly applicable architectural principles. The conclusions are over-generalized from a single, atypical data point.
-
The Analytical Framework Lacks True Predictive Power: The proposed analytical framework (Section 3, Page 5) appears to be more of an empirical curve-fitting exercise than a first-principles model.
- Equation 1, which models the latency of subgroup reductions, is a cubic polynomial whose coefficients are themselves logarithmic functions of group size. The justification for this specific functional form is superficial ("multi-level shifting, alignment, and accumulation"). This is an observation, not an explanation. An insightful model would derive this complexity from architectural primitives.
- The model is validated in Table 7 by showing it can reproduce the performance of the same device from which its parameters were measured. This demonstrates descriptive accuracy but provides no evidence for its claimed utility in "architectural design space exploration" (Section 3.1, Page 5), which requires predictive power for architectures with different parameters.
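For concreteness, the shape of the model being criticized can be written schematically as follows; the variable and coefficient symbols are placeholders, and this is not the paper's actual Equation 1.

```latex
% Schematic of the criticized model shape only: placeholder symbols, not the
% paper's Equation 1. n is the swept size parameter, g the group size.
T_{\mathrm{reduce}}(n, g) \;\approx\; \sum_{k=0}^{3} c_k(g)\, n^{k},
\qquad c_k(g) = \alpha_k + \beta_k \log_2 g
```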
Questions to Address In Rebuttal
-
Please provide a rigorous justification for using a simulated HBM memory system for the RAG evaluation. How can the paper's central claims about RAG performance and energy on a "commercial device" be considered valid when the most critical system component for this memory-bound problem is hypothetical? Provide details on how the simulation was integrated with the real hardware to ensure model fidelity.
-
Provide explicit details of the GPU software stack used for the RAG baseline. Specify the exact library (e.g., FAISS-GPU), version, index type (e.g., IndexFlatIP), and any custom CUDA kernel development or tuning performed. This is essential to validate the claim of matching GPU performance.
-
Elaborate on how the proposed optimizations, particularly the reliance on the cost disparity between intra-VR and inter-VR operations, can be generalized to other compute-in-SRAM architectures (e.g., bit-serial, associative, or different vector lengths) that do not share the GSI APU's specific microarchitecture.
-
Provide a more fundamental, first-principles derivation for the cubic complexity of the reduction cost model in Equation 1. What specific sequence of micro-operations leads to this complexity, and why should we expect this to hold for other designs?
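As background for Question 2, an exact-retrieval GPU baseline is, at its core, a brute-force inner-product scan followed by top-k selection. The unoptimized CUDA sketch below shows only the scoring step; the names are illustrative, and a tuned baseline (e.g., FAISS-GPU with a flat inner-product index) would add tiling, batching, and fused k-selection on top of this.

```cuda
// Illustrative brute-force inner-product scoring: one thread per database
// vector, each computing <query, db[i]>. Top-k selection is omitted; a real
// baseline would tile db into shared memory and fuse the k-selection.
__global__ void score_ip(const float* __restrict__ db,     // [n_db * dim]
                         const float* __restrict__ query,  // [dim]
                         float* __restrict__ scores,       // [n_db]
                         int n_db, int dim) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_db) return;
    float s = 0.f;
    for (int d = 0; d < dim; ++d)
        s += db[i * (long long)dim + d] * query[d];
    scores[i] = s;
}
// Launch sketch: score_ip<<<(n_db + 255) / 256, 256>>>(db, q, scores, n_db, dim);
```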
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents a comprehensive characterization and optimization study of a commercial Compute-in-SRAM (CiSRAM) accelerator, the GSI APU, on realistic, data-intensive workloads. The authors' core contribution is bridging the significant gap between the theoretical promise of CiSRAM architectures, which have largely been studied via simulation, and their practical viability. They achieve this through a three-pronged approach: (1) developing an analytical performance model to expose architectural trade-offs, (2) proposing a set of architecture-aware optimizations focused on data layout and movement, and (3) demonstrating the system's effectiveness on a modern, high-impact workload—Retrieval-Augmented Generation (RAG) for large language models. The key result is that their optimized CiSRAM system can match the retrieval latency of a high-end NVIDIA A6000 GPU while consuming over 50x less energy, providing a compelling real-world data point for the future of memory-centric computing.
Strengths
This is an excellent and timely paper that the community should pay close attention to. Its primary strengths are:
-
Grounded in Reality with a Commercial Device: The most significant strength of this work is its use of a real, commercial CiSRAM chip. For years, the architecture community has seen promising simulation-based results for compute-in-memory (e.g., CAPE [11], Compute Caches [1]). This paper provides a much-needed anchor to reality, revealing the practical challenges (like the asymmetry between inter- and intra-vector operations, as discussed in Section 2.1) and immense potential of these architectures. This is a crucial step in maturing the field from academic exploration to practical system design.
-
High-Impact and Well-Chosen Workload: The choice to focus on Retrieval-Augmented Generation (RAG) is superb. RAG is a critical component of modern AI systems, and its core, the nearest-neighbor search, is fundamentally a data-bound problem—a perfect match for a memory-centric accelerator. By connecting CiSRAM to the LLM ecosystem, the authors make their work immediately relevant to one of the most active areas of research and industry today. The end-to-end evaluation in Section 5.3 (page 11) is particularly compelling.
-
Systematic and Principled Optimization Strategy: The authors don't simply present a heroic hand-tuned result. They build a case for their optimizations systematically. The analytical framework (Section 3, page 5) provides a clear model for reasoning about performance, and the three proposed optimizations (Communication-Aware Reduction Mapping, DMA Coalescing, and Broadcast-Friendly Layouts in Section 4, page 7) directly address the key bottlenecks identified in the architecture. The breakdown in Figure 12 (page 10) clearly illustrates the contribution of each optimization, which is excellent scientific practice.
-
Exceptional Energy Efficiency Results: The headline result—matching an A6000 GPU's performance on RAG retrieval with 54.4×–117.9× lower energy consumption (Section 5.3.5, page 12)—is staggering. This isn't an incremental improvement; it's a step-function change in efficiency. This single result provides a powerful argument for pursuing specialized CiSRAM hardware for data-intensive search and comparison tasks, especially in power-constrained environments like the edge.
-
Strong Potential for Broader Impact: This paper serves as a foundational case study for a whole class of emerging architectures. The lessons learned about data layout, communication patterns, and the programming model are likely to be applicable to future CiSRAM and PIM designs. It essentially provides a "playbook" for how to extract performance from these unconventional systems.
Weaknesses
While this is a strong paper, there are areas where its context and implications could be explored further. My points are not meant to detract from the quality of the work but rather to frame its limitations and suggest avenues for future inquiry.
-
Generalizability of the Optimizations: The proposed optimizations are highly effective but also highly tailored to the specific microarchitecture of the GSI APU—namely, its extremely long vector registers (32K elements) and the significant performance delta between inter- and intra-VR operations. It is not immediately clear how these specific techniques would translate to other CiSRAM designs that might feature different vector lengths, memory bank organizations, or reduction network capabilities. The contribution could be strengthened by a discussion on the principles that would generalize versus the implementation details that are device-specific.
-
Reliance on a Simulated Memory System for RAG: The authors are transparent about modeling the off-chip memory system (HBM2e) with Ramulator for the RAG experiments (Section 5.3.1, page 11). While this is a reasonable and necessary choice to avoid having the low-end DDR on the evaluation board become an unfair bottleneck, it does mean the end-to-end results are a hybrid of real measurement and simulation. This is a minor weakness, but it's important to acknowledge that the system-level performance is projected, not fully measured.
-
Programmability Remains a Major Hurdle: The paper demonstrates what is possible with careful, expert-driven optimization. However, it implicitly highlights the immense programmability challenge of such architectures. The required transformations (e.g., redesigning data layouts for broadcasting, as shown in Figure 11 on page 10) are non-trivial and seem unlikely to be discovered by a conventional compiler. The paper would benefit from a discussion of the path from this manual effort to a more accessible programming model.
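To illustrate the kind of manual layout rewrite the last point refers to, the familiar AoS-to-SoA transformation is sketched below in generic CUDA C++. The struct and field names are hypothetical and are not taken from the paper's Figure 11; the only point is that the hardware's access granularity dictates the layout.

```cuda
// Array-of-Structs vs Struct-of-Arrays: the same logical data, laid out so
// that threads (or vector lanes) touching field .x together read contiguous
// memory. Field names are placeholders.
struct PointAoS { float x, y, z; };

struct PointsSoA {          // broadcast/coalescing-friendly layout
    float* x;               // all x components contiguous
    float* y;
    float* z;
};

// AoS: consecutive threads reading .x stride by sizeof(PointAoS),
// so loads are poorly coalesced.
__global__ void scale_x_aos(PointAoS* p, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i].x *= a;
}

// SoA: consecutive threads read consecutive floats, ideally one
// coalesced transaction per warp.
__global__ void scale_x_soa(PointsSoA p, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p.x[i] *= a;
}
```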
Questions to Address In Rebuttal
I am strongly in favor of accepting this paper. The following questions are intended to help the authors strengthen the final version and to clarify the broader context of their work.
-
On Generalizability: Your analytical framework and optimizations are deeply tied to the GSI APU's architecture. Could you elaborate on which parts of your framework you believe are fundamental to most CiSRAM vector architectures, and which are specific to the GSI device? For instance, if a future CiSRAM device had hardware support for efficient intra-vector reductions, how would your optimization strategy change?
-
On the Path to Automation: The optimizations presented required significant manual effort and deep architectural knowledge. What do you see as the key challenges and opportunities in building a compiler or library ecosystem that could automate these data layout and loop mapping transformations for CiSRAM targets, making them accessible to non-expert programmers?
-
On Future Workloads: Based on your deep characterization of this device, what other application domains beyond RAG and the Phoenix suite do you believe are the most promising "killer apps" for this style of architecture? Specifically, what workload characteristics (e.g., data types, memory access patterns, compute kernels) make an application a good or bad fit for this platform?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents a performance characterization and optimization study of the GSI APU, a commercial Compute-in-SRAM (CiS) device. The authors make three primary claims of novelty: (1) a comprehensive evaluation of this device on realistic, large-scale workloads (Phoenix benchmarks, Retrieval-Augmented Generation); (2) an analytical performance framework for this class of architecture; and (3) a set of three optimizations—communication-aware reduction mapping, coalesced DMA, and broadcast-friendly data layouts—that significantly improve performance.
My analysis concludes that the primary novel contribution of this work lies in the first claim: the rigorous, end-to-end experimental characterization of a commercial CiS accelerator on complex, modern workloads. This provides valuable, and to my knowledge, first-of-its-kind data on the practical viability of such architectures. However, the secondary claims regarding the novelty of the analytical framework and, most notably, the proposed optimizations are significantly overstated. These optimizations are direct applications of well-established, decades-old principles from the field of parallel computing, particularly from the GPU domain. The contribution is in the application and tuning of these principles to a new hardware target, not in the invention of the principles themselves.
Strengths
The core strength and genuine novelty of this paper is its experimental contribution. While prior work has microbenchmarked the GSI APU on smaller kernels ([18], [19], [33]), this paper is the first I am aware of to conduct a thorough, end-to-end evaluation on workloads as complex and data-intensive as RAG over 200GB corpora. This is not a simulation or a small prototype evaluation; it is an empirical study on commercial hardware. The findings, such as matching an NVIDIA A6000 GPU in RAG latency while using over 50x less energy (Section 5.3.5, page 12), provide a critical data point for the architecture community on the potential of CiS. This characterization is a valuable and novel contribution.
Weaknesses
My primary concern is the lack of novelty in the "proposed optimizations" detailed in Section 4 (page 7). The paper presents these as new contributions, but they are functionally and conceptually analogous to standard optimization techniques for parallel architectures.
-
Communication-Aware Reduction Mapping: The core idea presented in Section 4.2 is to map a reduction from an expensive communication domain (intra-VR spatial reduction) to a cheaper one (inter-VR temporal reduction via element-wise operations). This is a fundamental principle of parallel algorithm design. For any architecture with a non-uniform memory/communication hierarchy, programmers and compilers seek to map computation to minimize costly data movement. This is conceptually identical to optimizing reductions on a GPU by favoring warp-level or shared-memory-based reductions over more expensive global atomic operations. The principle is not new; its application to the APU's specific VR structure is an implementation detail.
-
Coalesced DMA: The technique described in Section 4.3 and depicted in Figure 10 is a direct parallel to "coalesced memory access," a foundational optimization for GPUs since their earliest programming models. The goal of combining multiple small, disparate memory accesses into a single, large, contiguous transaction to maximize memory bus utilization is textbook parallel programming. The paper even uses the standard term "coalescing." While the use of the APU's subgroup copy primitive is specific to this hardware, the underlying optimization strategy is not novel.
-
Broadcast-Friendly Data Layout: The transformation described in Section 4.4, which reorganizes data to make elements accessed together contiguous in memory (Figure 11), is a classic data layout optimization. This is analogous to Array-of-Structs (AoS) vs. Struct-of-Arrays (SoA) transformations used to optimize for SIMD/SIMT execution. The goal is to align the data structure in memory with the hardware's natural access granularity. Again, this is a well-known principle, not a new one.
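For concreteness, the GPU analogue invoked in the first item of this list, favoring warp-level and shared-memory reductions over per-element global atomics, looks like the generic sketch below. This is textbook reference code, not a claim about how the APU mapping works.

```cuda
// Hierarchical sum reduction: intra-warp shuffle -> per-block shared memory
// -> one global atomic per block. This replaces per-element global atomics,
// which is the "cheaper communication domain" idea in GPU terms.
// *out must be zero-initialized by the caller before launch.
__global__ void reduce_sum(const float* __restrict__ in, float* out, int n) {
    __shared__ float warp_sums[32];                 // one slot per warp
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = (i < n) ? in[i] : 0.f;

    // 1) intra-warp reduction via shuffles (cheapest domain)
    for (int off = 16; off > 0; off >>= 1)
        v += __shfl_down_sync(0xffffffff, v, off);

    // 2) lane 0 of each warp deposits its partial sum in shared memory
    int lane = threadIdx.x & 31, warp = threadIdx.x >> 5;
    if (lane == 0) warp_sums[warp] = v;
    __syncthreads();

    // 3) warp 0 reduces the per-warp partials, then one atomic per block
    if (warp == 0) {
        int nwarps = (blockDim.x + 31) >> 5;
        v = (lane < nwarps) ? warp_sums[lane] : 0.f;
        for (int off = 16; off > 0; off >>= 1)
            v += __shfl_down_sync(0xffffffff, v, off);
        if (lane == 0) atomicAdd(out, v);
    }
}
```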
The analytical framework (Section 3) is a useful engineering contribution for modeling this specific device. However, the methodology—profiling latencies of primitive operations (Tables 4 and 5) and composing them into a performance model—is a standard approach for building bottom-up performance estimators. It does not represent a novel modeling paradigm.
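The modeling style in question, measuring primitive costs and composing them, can be captured in a few lines. Everything below (the structure, the names, the no-overlap assumption) is a hypothetical illustration rather than the paper's framework.

```cuda
// Hypothetical bottom-up latency estimator: per-primitive costs are
// microbenchmarked once, then composed under a no-overlap assumption.
// Descriptive of the measured device, not predictive of other designs,
// which is exactly the limitation raised above.
struct PrimitiveCosts {              // measured on the target device (us)
    double dma_in, multiply, reduce, dma_out;
};

double predict_kernel_us(const PrimitiveCosts& c,
                         int tiles, int reductions_per_tile) {
    double per_tile = c.dma_in + c.multiply
                    + reductions_per_tile * c.reduce + c.dma_out;
    return tiles * per_tile;         // assumes phases do not overlap
}
```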
Questions to Address In Rebuttal
-
The paper presents three core optimizations as novel contributions. Please explicitly differentiate these from their well-established analogues in the parallel computing literature, particularly GPU optimizations (e.g., hierarchical reduction strategies, coalesced memory access, and AoS/SoA data layout transformations). What, precisely, is the novel conceptual leap beyond applying known principles to a new microarchitecture?
-
In Section 4.2, the concept of mapping a spatial reduction to a temporal one is described. This sounds like a form of loop transformation to change the order of operations and improve data locality. Could the authors frame this contribution in the context of established compiler transformation theory and clarify what makes their mapping scheme fundamentally new?
-
Regarding the analytical framework, is the claimed novelty in the methodology of building the model, or in the specific model parameters derived for the GSI APU? If the latter, the contribution should be framed as a specific device model rather than a general framework.
C3ache: Towards Hierarchical Cache-Centric Computing for Sparse Matrix Multiplication on GPGPUs
Abstract
Sparse matrix multiplications (SPMMs) are fundamental kernels in various domains and are in high demand for execution on general-purpose graphics processing units (GPGPUs). However, it is a challenge to efficiently execute SPMMs across varying sparsity ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose C3ache, a hierarchical in-cache computing architecture for Sparse Matrix Multiplications (SPMMs) on GPGPUs. The proposal is a full-stack solution encompassing a novel hybrid dataflow (combining outer-product and Gustavson's), a hardware-level restructuring of the GPGPU L1/L2 cache hierarchy into specialized Processing-In-Memory (PIM) units, and a new "memory-aware" compressed format, C3SR. The stated goal is to mitigate the memory access overhead that dominates SPMM performance, particularly in high-sparsity regimes, where traditional processor-centric optimizations like Sparse Tensor Cores (STCs) are claimed to be suboptimal.
Strengths
- Problem Identification: The paper correctly identifies that memory access, not computation, is the primary bottleneck for SPMM on GPGPUs, especially as sparsity increases. Figure 1 provides a clear, albeit expected, illustration of this problem.
- Architectural Scope: The work is comprehensive in its scope, addressing the problem from the data format (C3SR) up through the memory subsystem (modified L1/L2) and the execution model (hybrid dataflow). This end-to-end consideration is commendable.
- Conceptual Decoupling: The core idea of decoupling the SPMM into a multiplication phase (mapped to L1 PIMs) and a merging phase (mapped to L2 PIMs) is logical. It attempts to map computational characteristics (high data reuse in multiplication, high synchronization in merging) to appropriate levels of the memory hierarchy.
Weaknesses
-
Unsubstantiated Claims Regarding the Hybrid Dataflow: The authors claim the outer-product dataflow achieves "perfect data reuse" (Section 3.2, page 5). This is a significant overstatement. While input reuse is high, this dataflow generates a massive number of partial products, creating a well-known output locality and memory bandwidth problem, which is the very reason it is seldom used in practice. The paper presents its in-cache merging solution as a fix, but it frames the initial dataflow choice with an unjustifiably positive claim.
-
Insufficiently Rigorous Baselines and Evaluation: The experimental evaluation (Section 4, page 9) is not convincing.
- The "CUDA Cores" baseline is implemented on the authors' own RISC-V GPGPU simulator. The performance of such a baseline is entirely dependent on the quality of its software kernel. It is not clear if this represents a naive implementation or a highly optimized one comparable to state-of-the-art libraries like NVIDIA's cuSPARSE. Without this context, the reported 7.22x speedup is difficult to interpret.
- The comparison against Sparse Tensor Cores (STC) appears to be a strawman. The authors use a specific pruning strategy (VectorSparse) and data format (BSR) to "maximize the performance of STC" (Section 5, page 10). This may not be the optimal configuration for all workloads. More importantly, the evaluation is against a simulated model of STC from a 2019 paper [61], not against a modern hardware implementation like that in the NVIDIA A100/H100, which has its own highly co-designed software stack.
- The comparison against the NVIDIA A100 in Figure 12 is fundamentally flawed. Comparing a scaled-down simulator running at a "normalized frequency" against a real, highly optimized commercial product is not a valid methodology for claiming superior performance.
-
The C3SR Format Trades One Problem for Another: The proposed C3SR format (Section 3.4, page 7) aims to reduce memory transactions by packing multiple short rows into a single cache line. While this may reduce the volume of data fetched (as shown in Figure 14), it introduces non-trivial on-the-fly decoding complexity. The hardware must now parse row start flags, offsets, and indices within a cache line before computation can begin. The paper provides no analysis of the latency or area/power overhead of this decoding logic. It is plausible that the saved memory latency is nullified by the increased decoding latency, especially for a hardware implementation.
-
Generality Claims Are Not Supported by Evidence: The claim that C3ache "incurs virtually no performance loss" on other kernels (Figure 13, page 12) is based on a tiny set of benchmarks. A 0.2% average difference is suspiciously low. Partitioning cache ways for PIM functionality (Pways) fundamentally reduces the effective associativity and capacity for all other applications, which should increase conflict misses and degrade performance. A comprehensive evaluation using a standard benchmark suite (e.g., Rodinia) is required to substantiate this claim. The fused SGEMM+SPMM execution example is anecdotal and lacks the detail needed for proper scrutiny.
Questions to Address In Rebuttal
-
Regarding Baselines: Please justify the choice of baselines. How does C3ache compare against a highly optimized library like cuSPARSE running on a contemporary GPU (e.g., NVIDIA A100/H100) for the workloads in Table 4? This comparison to real, state-of-the-art hardware and software is essential for validating your claims.
-
Regarding C3SR Overhead: Please quantify the hardware decoding overhead (latency in cycles, area in µm²) of the C3SR format. How does this added latency compare to the memory access latency it aims to save? Provide a sensitivity analysis showing at what level of sparsity this trade-off becomes beneficial.
-
Regarding Generality: The generality analysis in Figure 13 is insufficient. Please provide a more comprehensive evaluation using a standard GPGPU benchmark suite (e.g., Rodinia, PolyBench) to demonstrate the performance impact of the partitioned cache on non-SPMM kernels that are memory-intensive.
-
Regarding Architectural Cost: The area breakdown in Table 6 focuses on the MSPM macro. What is the total area and power overhead of the entire C3ache modification—including the Stream-aware Request Management Unit (SRMU), the modified cache controllers, and the local adder layer—relative to a standard GPGPU cache hierarchy of equivalent size?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper proposes C3ache, a novel hierarchical in-cache computing architecture designed to accelerate sparse matrix multiplication (SPMM) on GPGPUs. The core contribution is a profound re-imagining of the GPGPU's cache hierarchy as an active computational resource. The authors astutely identify that SPMM can be decoupled into two distinct computational phases: a massively parallel multiplication phase and a contention-heavy merging (reduction) phase. C3ache elegantly maps these phases onto a restructured hierarchy: the distributed L1 caches across Streaming Multiprocessors (SMs) are transformed into parallel "Multiplication PIM" units, while the shared L2 cache is converted into an "In-situ Merging PIM" unit.
This central hardware concept is supported by a holistic co-design, including: (1) a hybrid outer-product/Gustavson dataflow that maximizes data reuse while managing intermediate data, (2) a novel C3SR sparse format that aligns data granularity with cache lines to reduce memory transactions, and (3) a custom multi-precision PIM macro (MSPM). The experimental results are compelling, demonstrating significant speedups (avg. 7.22x over CUDA cores) and dramatic reductions in energy-delay product (EDP) and memory traffic, particularly for the highly sparse matrices where conventional GPGPUs falter.
Strengths
-
Elegant Problem-Architecture Mapping: The paper's most significant strength is its insightful mapping of the SPMM algorithm's structure directly onto the GPGPU's physical memory hierarchy. Recognizing that the distributed L1 caches are ideal for the "embarrassingly parallel" multiplication step and the shared L2 is the natural locus for the global merging step is a powerful conceptual leap. This moves beyond simply adding another accelerator and instead leverages the billions of transistors already dedicated to on-chip memory, directly addressing the data movement bottleneck that plagues this class of problem.
-
A Holistic, Systems-Level Approach: The authors present a commendably complete vision. The work is not just a hardware proposal; it is a co-designed system. The creation of the hybrid dataflow (Section 3.2, page 5), the memory-aware C3SR format (Section 3.4, page 7), and the corresponding ISA extensions and programming model (Figure 5f, page 6) shows a deep understanding of the problem. This holistic approach makes the proposal far more credible and practical than a point solution focused on only one aspect of the system.
-
Addressing a Key GPGPU Weakness: GPGPUs excel at dense, regular computation but often see their efficiency plummet for memory-bound, irregular workloads like high-sparsity SPMM. Processor-centric solutions like Sparse Tensor Cores (STCs) are an improvement but, as the authors correctly identify, they are still fundamentally limited by the von Neumann bottleneck. C3ache directly targets this fundamental weakness by turning the memory system into the solution. This work provides a compelling architectural direction for making GPGPUs more effective for a broader range of scientific computing and graph analytics workloads, which are often characterized by high sparsity.
-
Strong Contextualization and Motivation: The paper does an excellent job of positioning itself within the broader landscape. The analysis of different dataflows (Section 2.3, page 4) and the clear motivation provided in the introduction (Figure 1, page 2) effectively establish the "why" behind their work. This places C3ache as a logical and innovative next step in the evolution of GPGPU architecture, moving from processor-centric to memory-centric acceleration.
Weaknesses
While the core idea is powerful, the paper could be strengthened by addressing the following points in more detail:
-
Generality and System-Wide Overheads: The paper briefly touches upon generality in Figure 13 (page 12), showing minimal performance impact on other kernels. However, this analysis feels somewhat superficial. Modifying cache way logic, adding a Stream-aware Request Management Unit (SRMU), and implementing PIM functionality in SRAM arrays inevitably introduces area, power, and potentially timing complexity. A more thorough analysis of the potential impact on the GPGPU's critical path frequency and static power consumption for workloads that do not use C3ache would make the proposal more robust. Is the 0.2% performance difference truly representative of the system-wide cost?
-
Scalability of the L2 Merge Network: The paper describes the L2 merging PIM as leveraging a "peripheral local adder layer" (page 7). As the number of SMs in future GPGPUs continues to scale (e.g., beyond the 108 SMs of the A100), this shared merging resource could become a significant point of contention. The paper would benefit from a more detailed architectural description and scalability analysis of this merge network. At what scale of SM count would this L2 adder layer become the new system bottleneck, replacing the previous DRAM access bottleneck?
-
Practicality of the C3SR Format: The proposed C3SR format is cleverly designed to align with cache lines. However, its multi-level indirect indexing (Cache Line Offset, Row Offset, Row Indices) introduces a degree of software complexity. More importantly, sparse formats often involve a trade-off between compression efficiency and decoding overhead during execution. A discussion of the preprocessing time required to convert standard formats (like COO or CSR) to C3SR, and how C3SR performs on matrices with more structured sparsity patterns (where formats like BSR excel), would provide a more complete picture of its practical utility.
Questions to Address In Rebuttal
-
Regarding the system's generality, could the authors provide a more detailed breakdown of the area and static power overhead of the C3ache modifications? Furthermore, could you elaborate on why these changes are believed to have a negligible impact on the GPGPU's maximum clock frequency when running conventional, non-PIM workloads?
-
Could you provide more architectural detail on the L2 "in-situ merging PIM"? Specifically, what is the structure and bandwidth of the "local adder layer," and how is contention managed as dozens of SMs attempt to write and accumulate partial products concurrently? How does this mechanism scale compared to the L2 crossbar's native bandwidth?
-
The C3SR format is presented as a key enabler. Could you discuss the preprocessing overhead required to generate this format? Additionally, could you comment on its performance characteristics on matrices with structured or block-sparsity, where its row-packing strategy might be less advantageous compared to formats like BSR?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper, "C3ache," proposes a hierarchical, cache-centric computing architecture to accelerate Sparse Matrix Multiplications (SpMMs) on GPGPUs. The authors identify memory access as the primary bottleneck for SpMMs, especially at high sparsity, and argue that processor-centric solutions like Sparse Tensor Cores (STCs) provide diminishing returns.
The core novel claim of this work is the co-design of a new GPGPU memory hierarchy, a hybrid computational dataflow, and a supporting sparse data format. Specifically, the contribution is a synthesis of three main ideas:
1. A Hierarchical In-Cache Computing Architecture: The L1 and L2 caches of a GPGPU are fundamentally restructured. The L1 cache is transformed into a large-scale parallel "Multiplication PIM" unit, while the L2 cache is converted into an "In-situ Merging PIM" unit.
2. A Hybrid Dataflow: The SpMM computation is decoupled into a multiplication phase (based on outer-product) and a merging phase (based on Gustavson's dataflow). These phases are then architecturally mapped onto the L1 and L2 PIM units, respectively.
3. A Memory-Aware Compressed Format (C3SR): A new sparse format is proposed to coalesce multiple short sparse rows into a single cache line, aligning data fetching with the memory subsystem's granularity and the proposed hardware.
While the constituent technologies (in-cache computing, hybrid dataflows, custom sparse formats) are not new in isolation, their specific synthesis into a functionally partitioned, hierarchical PIM system for GPGPUs appears to be the central claim to novelty.
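For readers who want the dataflow terminology made concrete, a minimal row-wise (Gustavson-style) sparse-by-sparse multiplication with a dense accumulator is sketched below. This is generic reference code over standard CSR, not the paper's C3SR format or its hybrid L1/L2 mapping.

```cuda
#include <vector>

// Minimal CSR container: row_ptr has n_rows+1 entries; col/val hold nonzeros.
struct CSR {
    int n_rows = 0, n_cols = 0;
    std::vector<int> row_ptr, col;
    std::vector<float> val;
};

// Gustavson (row-wise) SpGEMM: C(i,:) = sum_k A(i,k) * B(k,:), accumulated in
// a dense scratch row, so merging happens per output row. The outer-product
// alternative instead emits all partial products A(:,k)*B(k,:) eagerly and
// merges them in a separate phase.
CSR spgemm_gustavson(const CSR& A, const CSR& B) {
    CSR C; C.n_rows = A.n_rows; C.n_cols = B.n_cols;
    C.row_ptr.assign(A.n_rows + 1, 0);
    std::vector<float> acc(B.n_cols, 0.f);      // dense accumulator row
    std::vector<char>  used(B.n_cols, 0);
    for (int i = 0; i < A.n_rows; ++i) {
        std::vector<int> touched;
        for (int p = A.row_ptr[i]; p < A.row_ptr[i + 1]; ++p) {
            int k = A.col[p]; float a = A.val[p];
            for (int q = B.row_ptr[k]; q < B.row_ptr[k + 1]; ++q) {
                int j = B.col[q];
                if (!used[j]) { used[j] = 1; touched.push_back(j); }
                acc[j] += a * B.val[q];          // merge happens here
            }
        }
        for (int j : touched) {                  // flush and reset scratch row
            C.col.push_back(j); C.val.push_back(acc[j]);
            acc[j] = 0.f; used[j] = 0;
        }                                        // columns left unsorted
        C.row_ptr[i + 1] = (int)C.col.size();
    }
    return C;
}
```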
Strengths
-
Novelty in Architectural Synthesis: The primary strength of this paper lies in its novel system-level vision. While prior art has extensively explored in-cache computing (e.g., [16, 17]), these works typically focus on transforming a single level of cache (often L1 or LLC) into a monolithic compute fabric. The C3ache proposal to create a functionally specialized hierarchy—where L1 PIMs perform parallel multiplication and L2 PIMs perform atomic merging/reduction—is a conceptually new architecture for GPGPUs. This tight coupling of algorithmic phases to distinct hierarchical hardware levels is the paper's most significant novel contribution.
-
Novelty of the C3SR Format: The proposed C3SR format, detailed in Section 3.4 (page 8), presents a non-trivial advancement over standard formats like CSR. The core idea of dynamically compressing multiple logical rows into a single physical cache line to maximize memory transaction efficiency is a clever software/hardware co-design. While related to blocked formats (e.g., BSR), its explicit goal of aligning with the cache-line quantum and enabling the proposed dataflow is a distinct and novel approach.
-
Boldness of the Proposal: The authors propose a fundamental redesign of the GPGPU's cache and memory subsystem. This represents a significant departure from more incremental approaches (e.g., adding specialized functional units alongside the existing hierarchy). Such a holistic architectural transformation, while complex, is precisely the kind of high-risk, high-reward idea that pushes the field forward.
Weaknesses
-
Insufficient Differentiation from Prior Art on Hierarchical PIM: The paper claims to be the "first hierarchical in-cache computing architecture" (page 2) for this purpose. However, the concept of multi-level or hierarchical processing-in-memory is not entirely unexplored. The authors need to more rigorously defend this claim by positioning their L1-multiply/L2-merge functional split against other systems that may use PIM at different levels of the memory hierarchy (e.g., logic in DRAM controllers vs. logic in LLC). The novelty is in the specifics of the functional partitioning, which should be emphasized more clearly.
-
Incremental Novelty of the Hybrid Dataflow: The paper's description of its hybrid dataflow (Section 3.2, page 5) as a combination of outer-product and Gustavson's is accurate. However, the core idea of decoupling SpMM into a partial-product generation phase and a reduction/accumulation phase is a well-established pattern in hardware accelerators. For instance, OuterSPACE [37] is built on the outer-product, which inherently separates these two phases. The true novelty here is not the decoupling itself, but rather the mapping of these phases onto their proposed hierarchical PIM architecture. The paper's narrative could be sharpened to make this distinction clear.
-
Overstated Novelty of the PIM Macro (MSPM): The design of the MSPM macro (Section 3.5, page 8), while a necessary and detailed engineering effort, is built upon established techniques. The use of exponent-mantissa pipelining for floating-point PIM and in-memory Booth encoding are known concepts in the PIM/in-cache computing literature. The novelty is in the specific implementation and integration, but it does not represent a fundamental conceptual breakthrough in PIM circuit design.
Questions to Address In Rebuttal
-
The claim of proposing the "first hierarchical in-cache computing architecture" is a strong one. Can the authors provide a more detailed comparison to any prior work that has proposed distinct computational functionalities in different levels of the cache/memory hierarchy and articulate why C3ache's specific L1-multiply/L2-merge split is fundamentally different and novel?
-
Regarding the hybrid dataflow, please clarify the novelty beyond the architectural mapping. Is the claim that the combination of outer-product for multiplication and Gustavson's for merging is itself a new algorithmic formulation for SpMM, or is the novelty exclusively in how this known computational pattern is mapped to the new cache hierarchy?
-
The C3SR format's novelty hinges on coalescing rows to the granularity of a cache line. Are there conceptually analogous data packing schemes from other domains (e.g., network packet payload optimization, file systems) that achieve a similar goal? Please elaborate on what makes C3SR's approach uniquely suited for SpMM on PIM architectures.
-
The proposed architecture is a highly specialized solution for SpMM. From a novelty perspective, how generalizable is the core architectural idea? Could the proposed L1-multiply/L2-merge hierarchy be a novel template for other algorithms with similar computation patterns (e.g., certain graph algorithms, database join operations), or is its novelty strictly confined to the SpMM domain?
Leveraging Chiplet-Locality for Efficient Memory Mapping in Multi-Chip Module GPUs
Abstract
While the multi-chip module (MCM) design allows GPUs to scale compute and memory capabilities through multi-chip integration, it introduces memory system non-uniformity, particularly when a thread accesses resources in remote chiplets. In this work, we ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose CLAP (Chiplet-Locality Aware Page Placement), a memory management mechanism for multi-chip module (MCM) GPUs that aims to select an optimal effective page size for different data structures. The work is predicated on a property the authors term "chiplet-locality," which they define as the tendency for contiguous virtual pages to be accessed by the same chiplet. CLAP works by first profiling a fraction of memory allocations using small pages to identify this locality. It then uses a tree-based analysis to determine the largest granularity of contiguous pages that maintain locality and maps the remainder of the data structure using this effective size. These physically contiguous regions are then intended to be covered by a single, coalesced TLB entry, thereby achieving the translation efficiency of large pages while preserving the placement granularity of small pages.
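To make the "suitable page size" question concrete, the sketch below shows one naive way to compute the largest aligned group of base pages whose profiled first-touch owners agree. It is a conceptual illustration only, not the paper's tree-based analysis, its profiling threshold, or its hardware Remote Tracker.

```cuda
#include <vector>

// owner[p] = chiplet that first touched virtual page p during profiling.
// Returns the largest power-of-two group size (in pages) such that every
// aligned group of that size was touched by a single chiplet. Conceptual
// sketch only; names and structure are hypothetical.
int largest_uniform_group(const std::vector<int>& owner) {
    int best = 1;
    for (int g = 2; g <= (int)owner.size(); g *= 2) {
        bool uniform = true;
        for (size_t base = 0; base + g <= owner.size() && uniform; base += g)
            for (int k = 1; k < g; ++k)
                if (owner[base + k] != owner[base]) { uniform = false; break; }
        if (!uniform) break;   // chiplet-locality no longer holds at size g
        best = g;
    }
    return best;               // effective page size = best * base_page_size
}
```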
Strengths
-
Problem Motivation: The paper does an adequate job of motivating the core problem. The trade-off between address translation overhead (favoring large pages) and data placement locality (favoring small pages) in an MCM context is a genuine challenge. The preliminary data in Figure 1 and Figure 2 (page 2) effectively illustrates that a one-size-fits-all approach to paging is suboptimal.
-
Core Concept: The high-level idea of creating physically contiguous page-like regions to match an application's access granularity is a logical approach to bypassing the need for extensive hardware support for numerous, fixed intermediate page sizes.
-
Evaluation Scope: The authors compare their proposal against a reasonable set of baselines, including static small and large pages, an idealized C-NUMA implementation, and other prior work in MCM GPU optimization.
Weaknesses
My primary concerns with this submission center on the foundational premise of "chiplet-locality," the introduction of several unscrutinized "magic numbers" in the methodology, and the potential oversimplification of hardware overheads.
-
The Foundational Premise of "Chiplet-Locality" is Circular: The paper's central claim rests on the existence of "chiplet-locality" as an intrinsic workload property. However, the evidence presented seems to be an artifact of the experimental setup. In Section 3.1 (page 4), the authors state their baseline uses a First-Touch (FT) policy, which inherently places data pages on the chiplet of the thread that first requests it. The analysis in Figure 10 (page 6), which shows very high chiplet-locality, is performed after this locality-aware policy has already been applied. Therefore, the paper is not measuring an intrinsic property of the application, but rather the effectiveness of the baseline FT policy. The claim that this is a fundamental characteristic of GPU workloads is not rigorously substantiated and appears to be a self-fulfilling prophecy.
-
Arbitrary Methodological Thresholds: The CLAP mechanism is governed by several key parameters that lack rigorous justification.
- The Partial Memory Mapping (PMM) threshold is set to 20% (Section 4.2, page 8). The authors claim this is a "conservative choice empirically derived," but provide no sensitivity analysis to support this. For applications with distinct execution phases, the access patterns in the first 20% of page faults may be entirely unrepresentative of the subsequent 80%.
- The Opportunistic Large Paging (OLP) mechanism is disabled if more than 5% of VA blocks release their reservations. This is another "magic number" presented without analysis. A robust system should not depend on such finely-tuned, unexplained constants.
-
Understated Hardware Complexity and Assumptions:
- The Remote Tracker (RT) mechanism requires commandeering an "unused bit of the last-level page table entry (PTE)" to store an allocation ID (Section 4.3, page 8). While the authors claim modern PTEs have reserved bits (page 9), these are often targets for other system software or future hardware features. Assuming exclusive access to these bits for a single optimization is a significant architectural imposition.
- The claimed area overhead for the TLB coalescing logic (0.0003% of the die area, Section 4.6, page 11) seems exceptionally low. While the logic itself may be simple, its integration into the critical path of the TLB/MMU, including control logic and potential timing implications, is non-trivial. This figure is presented without sufficient breakdown to be credible.
-
Evaluation Concerns: The "Ideal C-NUMA" baseline assumes zero latency for page migrations and related operations (Section 5, page 12). While this establishes an upper bound, it also creates a strawman. The primary advantage of a proactive scheme like CLAP should be its ability to avoid the high, non-zero costs of a reactive scheme. By idealizing the baseline, the paper obscures the true magnitude of this benefit and presents a comparison that is arguably flattering to the proposed work.
Questions to Address In Rebuttal
The authors must address the following points to substantiate their claims:
-
On "Chiplet-Locality": Please clarify the methodology used to generate the data in Figure 10. Can you demonstrate that "chiplet-locality" is an intrinsic property independent of the initial page placement policy? Specifically, what would the results of that analysis be if the initial PMM phase used a chiplet-agnostic policy, such as round-robin or random page placement?
-
On Methodological Robustness: Can you provide a detailed sensitivity analysis for the PMM threshold (e.g., varying from 5% to 50%) and the OLP disable threshold? How does the system's performance change, and how do you justify that your chosen values are optimal or robust across diverse workloads, particularly those with dynamic behavior?
-
On Application Dynamics: The CLAP mechanism is fundamentally proactive. How does it handle workloads where data access patterns change significantly after the initial PMM phase is complete (e.g., in different kernel invocations)? The "CLAP+migration" extension presented in Figure 20 (page 14) suggests this is a known limitation. Does this imply that for dynamic applications, the core benefit of CLAP is voided, requiring a full fallback to a reactive migration scheme?
-
On Hardware Feasibility: Please provide a more thorough justification for the claimed hardware overheads. Regarding the RT, what are the known or anticipated competing uses for the reserved PTE bits you intend to use? For the TLB coalescing logic, can you provide a more detailed breakdown of the components included in the 0.0024mm² area estimate and discuss its impact on TLB access latency?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper addresses a fundamental tension in the memory systems of modern Multi-Chip Module (MCM) GPUs: the trade-off between address translation efficiency and memory access locality. The authors correctly identify that large pages, while beneficial for reducing TLB misses, can lead to poor data placement and increased remote memory traffic across chiplets. Conversely, small pages allow for fine-grained, locality-aware placement but suffer from higher translation overhead.
The core contribution is the identification and exploitation of a workload property the authors term "chiplet-locality"—the tendency for groups of virtually contiguous pages to be accessed predominantly by a single chiplet. Building on this insight, the paper proposes CLAP (Chiplet-Locality Aware Page Placement), a mechanism that proactively profiles data structures to determine their characteristic chiplet-locality granularity. CLAP then maps these page groups to physically contiguous frames on the appropriate chiplet, effectively creating "large page-like regions." These regions reap the address translation benefits of large pages (via a proposed TLB coalescing mechanism) without sacrificing the fine-grained placement necessary for high memory locality.
Strengths
The true strength of this paper lies in its elegant synthesis of ideas from operating systems, computer architecture, and parallel programming to solve a timely and important problem.
-
Excellent Problem Formulation and Contextualization: The paper does a superb job of positioning its work. It correctly identifies the rise of MCM GPUs as a pivotal shift in high-performance computing and clearly articulates the resulting NUMA-like challenges. The introductory analysis in Section 1 (page 1) and the motivational study in Section 3 (page 4) are compelling, effectively demonstrating that a one-size-fits-all paging policy is suboptimal. The paper situates itself perfectly at the intersection of physical data placement strategies (e.g., [13], [47]) and address translation optimizations (e.g., [32], [87]), arguing convincingly that these two aspects must be considered jointly.
-
The "Chiplet-Locality" Insight: While the underlying principle of spatial locality is well-known, its formalization as "chiplet-locality" in the context of MCM GPUs is a valuable conceptual contribution. It connects the structured parallelism of the GPU programming model (e.g., threadblocks) to the physical hierarchy of the hardware (chiplets). By observing that this locality has a consistent, per-data-structure granularity, the authors uncover a predictable behavior that is ripe for optimization. The quantification of this property in Figure 10 (page 6) provides a solid empirical foundation for the entire approach.
-
A Proactive, Low-Overhead Design: The proposed CLAP mechanism is a clever alternative to reactive, migration-based schemes like C-NUMA [28, 34], which are often ill-suited to the GPU's execution model and incur high overheads (e.g., TLB shootdowns). By using a brief, low-overhead profiling phase (PMM) at the beginning of an allocation's lifecycle, CLAP makes a one-time, predictive decision. This proactive approach avoids the complexities and performance penalties of continuous runtime monitoring and data migration, making it a much more natural fit for GPU systems.
-
Bridging the Gap Between Page Sizes: The solution of creating physically contiguous regions of small pages is elegant. It circumvents the need for complex hardware support for a multitude of arbitrary page sizes. Instead, it leverages the existing 64KB page infrastructure and relies on a well-defined TLB coalescing mechanism [86] to achieve the performance benefits of larger, intermediate page sizes. This makes the proposal practical and more easily adoptable.
Weaknesses
The weaknesses are not in the core idea, which is sound, but in the assumptions about workload behavior and the full implications of the proposed hardware.
-
Static Workload Assumption: The core design of CLAP seems best suited for applications where memory access patterns are established early and remain stable. The initial profiling during the PMM phase determines the mapping for the lifetime of the data structure. While the authors present an extension using page migration for dynamic scenarios (Section 5.2, page 14), this feels like an add-on rather than a fundamental part of the design. The effectiveness of CLAP could diminish in workloads with significant phase changes or highly dynamic memory allocation patterns where the initial profile quickly becomes stale.
-
Complexity of the Remote Tracker (RT): The paper presents the RT as a simple, low-area hardware addition. However, any modification to the GMMU and the page walk pipeline warrants scrutiny. The claim that the RT is "not on the critical path of memory accesses" (page 9) is asserted but could benefit from a more detailed analysis. For latency-critical applications, even minor delays or resource contention within the GMMU could have a performance impact.
-
Interaction with System-Level Schedulers: The concept of chiplet-locality relies on a relatively stable mapping of threadblocks to chiplets, as provided by the First-Touch or Static-Analysis policies. The paper does not explore how CLAP would interact with more dynamic, system-level schedulers that might perform load balancing by migrating threadblocks (or entire kernel grids) between chiplets. In such a scenario, the chiplet predicted to be the primary accessor during the PMM phase may no longer be correct later in the execution.
Questions to Address In Rebuttal
-
On Dynamic Behavior: The CLAP+migration experiment is promising. Could the authors elaborate on the criteria and overhead for triggering a re-evaluation of a data structure's mapping? For instance, would the Remote Tracker need to be extended to continuously monitor for pattern shifts post-PMM, and how would the system decide that the cost of migration is worth the benefit?
-
On the Remote Tracker's Criticality: Could the authors provide a more detailed breakdown of the interaction between a page walk and the RT? Is the RT lookup and update fully pipelined and/or asynchronous with the page walk's primary function of fetching a PTE, ensuring zero impact on translation latency?
-
On Broader Applicability: How does the efficacy of CLAP depend on the application having sufficient parallelism to saturate all chiplets? In cases where a workload only utilizes a subset of chiplets, would CLAP's analysis still hold, or would the concept of a "preferred" chiplet become less meaningful? Furthermore, how would CLAP handle data structures that are intentionally shared and frequently accessed by all chiplets (beyond the matrix-B GEMM example, which has a predictable broadcast pattern)?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper introduces CLAP (Chiplet-Locality Aware Page Placement), a hardware/software co-design to manage memory pages in Multi-Chip Module (MCM) GPUs. The central problem addressed is the well-known trade-off between large pages (which reduce TLB overhead but can cause remote accesses and poor locality) and small pages (which improve locality at the cost of higher TLB pressure).
The authors' proposed solution is to determine a "suitable page size" on a per-data-structure basis. This is achieved by:
1. Profiling a small fraction (20%) of a data structure's pages using a first-touch policy with small base pages (64KB).
2. Using a new hardware unit, the Remote Tracker (RT), to monitor the locality of these initial placements.
3. Employing a driver-level tree-based analysis (MMA) to identify the largest granularity of virtually contiguous pages that are consistently accessed by the same chiplet—a property the authors term "chiplet-locality."
4. Mapping the remainder of the data structure by creating physically contiguous regions of this "suitable size" from base pages.
5. Leveraging a TLB coalescing mechanism to treat these physically contiguous base pages as a single, larger effective page in the TLB.
The core novel claim appears to be the synthesis of these components into a proactive, predictive system that structures physical memory to create "synthetic" large pages tailored to the observed access patterns of GPU applications.
Strengths
The primary strength of this work lies in its cohesive, full-stack approach to a fundamental problem. While the constituent parts of the solution are not entirely new in isolation, their combination and application to the MCM-GPU domain are well-conceived.
The most compelling aspect is the idea of proactively creating physical contiguity based on a profile. Instead of relying on reactive migration (like C-NUMA) or hoping for incidental contiguity (as in traditional memory allocators), CLAP deliberately engineers the memory layout to maximize the effectiveness of a TLB coalescing mechanism. This proactive stance, guided by the "chiplet-locality" heuristic, is an elegant way to get the benefits of large pages without paying their full locality penalty.
The concept of "chiplet-locality" itself, while an intuitive extension of spatial locality in parallel workloads, is framed and quantified in a useful manner, providing a clear target for the optimization.
Weaknesses
My main concern with this paper is the degree of novelty of its core technical components. When deconstructed, the mechanism appears to be a clever recombination of pre-existing concepts.
-
Dynamic Page Granularity: The idea of dynamically adjusting page sizes based on access patterns is not new. The most direct prior art is C-NUMA [28, 34], which dynamically promotes and demotes pages between base and huge sizes in response to traffic. While the authors correctly differentiate CLAP as proactive and migration-averse, the fundamental goal of matching page granularity to access locality is the same. The paper needs to more strongly argue why its proactive, profile-then-map approach is a significant conceptual leap forward rather than an implementation choice.
-
TLB Coalescing: The hardware support for merging TLB entries for contiguous memory regions is a well-established technique. The paper's mechanism, described in Section 4.6, is functionally very similar to prior work like CoLT [86]. The authors' contribution here seems to be the implementation and integration, but not the invention of the core concept.
-
Profiling for Placement: Using a sampling/profiling phase to guide data placement is a standard technique in systems research. For example, GRIT [104] (cited by the authors) also profiles page accesses to guide migration decisions in multi-GPU systems. The use of a small hardware tracker is a common way to reduce software profiling overhead.
The novelty, therefore, rests entirely on the synthesis. The paper would be stronger if it explicitly framed its contribution as such: a novel co-design that synergistically combines known techniques (profiling, dynamic sizing, TLB coalescing) in a way that is uniquely suited for the predictable, parallel access patterns of MCM-GPUs. As written, it sometimes reads as if these individual mechanisms are novel contributions in their own right.
Questions to Address In Rebuttal
-
The authors define "chiplet-locality" as the core property they leverage. How is this phenomenon fundamentally different from the well-understood concept of spatial locality exhibited by blocks of threads in a GPU programming model? Please justify why coining this new term and building a mechanism to measure it represents a novel insight, rather than an application of known locality principles to a new hardware topology.
-
The proactive "profile-then-map" strategy is positioned as superior to C-NUMA's reactive migration-based approach. However, this assumes that access patterns are static after the initial profiling phase. Could the authors comment on workloads where the access pattern evolves over time? In such cases, would CLAP's static decision (made after the 20% PMM phase) become suboptimal, and would a reactive approach like C-NUMA prove more robust?
-
The proposed solution constructs physically contiguous pages to enable TLB coalescing. This requirement for physical contiguity could increase memory fragmentation, especially for data structures with fine-grained or irregular chiplet-locality. How does CLAP compare to a system that uses a more flexible, scatter-gather style of TLB entry (e.g., using segment registers or block-based PTEs) that does not require physical contiguity? Is the added complexity of managing physical contiguity justified over alternative hardware designs that achieve similar TLB reach?
-
The complexity vs. benefit trade-off needs further justification. The performance benefits of CLAP are primarily realized by reducing TLB misses. If we consider an alternative path, such as significantly increasing the size and sophistication of the page walk caches or using speculative page walkers [85], could similar performance gains be achieved without the added complexity in the memory manager and the requirement for physical contiguity? Please defend the novelty of your approach in the context of these alternative solutions to the same root problem.
Security and Performance Implications of GPU Cache Eviction Priority Hints
Abstract
NVIDIA provides cache eviction priority hints such as evict_first and evict_last on recent GPUs. These hints allow users to specify the eviction priority that should be used for individual cache lines to improve cache utilization. However, NVIDIA does not ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
This paper presents a reverse engineering study of NVIDIA's evict_first and evict_last cache eviction priority hints. The authors conduct a series of microbenchmarks to characterize the behavior of these hints, particularly the rules governing their interaction and capacity within an L2 cache set. Based on these findings, they construct two security attacks: a high-bandwidth covert channel using evict_first and a performance degradation attack using evict_last. Finally, they analyze the performance implications for application developers, demonstrating that improper use of evict_last can lead to performance degradation due to cache thrashing, and identify an optimal usage threshold.
Strengths
- Systematic Characterization: The paper's core strength lies in its methodical reverse engineering of the cache hint behaviors in Section 4. The experiments designed to deduce the rules for eviction, update, and interaction are logically structured.
- Quantitative Performance Analysis: The analysis in Section 6, particularly the correlation between the microbenchmark results (Figure 8) and the real-world application performance (Figure 9), provides compelling evidence for the authors' claims about the performance pitfalls of misusing the evict_last hint. The alignment of the optimal data size (3.75 MB) with the 12/16 threshold for a 5 MB cache is a strong result (the arithmetic is spelled out after this list).
- Novel Security Vectors: The paper successfully demonstrates that these undocumented features introduce new, non-trivial security vulnerabilities. The Load+Load covert channel, in particular, shows a significant bandwidth improvement over existing public methods on the same hardware.
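For reference, the alignment highlighted above checks out under the stated assumptions (a 5 MB L2 and the 12-out-of-16-ways evict_last limit reported by the paper):

\[
\frac{12}{16} \times 5\ \text{MB} = 0.75 \times 5\ \text{MB} = 3.75\ \text{MB}
\]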
Weaknesses
My primary concerns with this paper relate to the generalizability of its core findings, the rigor of its security attack evaluations, and several unsupported or under-supported claims.
-
Limited Architectural Scope and Unexplored Variables: The reverse engineering results, while detailed, appear to be predominantly from a single platform (RTX 3080 with a specific driver). The authors themselves identify a critical dependency on the driver version for evict_last behavior (Table 2, p. 7), which changes the maximum resident lines from 12 to 3. This is a massive change that fundamentally alters the system's behavior. However, the implications of this are not propagated through the rest of the paper. All subsequent security and performance analyses seem to assume the older, 12-line behavior. This raises serious questions about the relevance and applicability of the results to current or future systems. Are the evict_first behaviors (e.g., the 11-line limit) also driver-dependent? This is not addressed.
-
Insufficient Evidence for Key "Takeaways": Several of the fundamental behavioral claims ("Takeaways") are presented without sufficient visual or tabular evidence, forcing the reader to trust the authors' narrative.
- Takeaway 5 (p. 6): The claim that a regular load/store does not change the status of an evict_last line is a critical distinction from evict_first, but it is asserted without a corresponding figure or data table.
- Takeaway 8 (p. 7): The complex eviction logic described when both evict_last and evict_normal lines are present is based on an experiment that is described but not shown. Given its importance to the performance thrashing argument, this omission is significant.
- Takeaway 10 (p. 8): The claim of an evict_last line being removed after ~10^8 cycles is based on a coarse-grained experiment shown in Table 3. The threshold for "active use" by another process is not defined, making the condition for this eviction ambiguous and difficult to reproduce.
-
Flawed "Stealth" Argument for the Degradation Attack: The claim that the
evict_last-based attack is "more stealthy" is not well-substantiated. The comparison in Section 5.2 (p. 10) is contrived. The authors force the baseline "scanning" attack to have a "similar idle rate" to their new attack, which is not how a real adversary would operate. An adversary using scanning would maximize access frequency to maximize impact, not to match the idle time of another attack. A more meaningful metric would be "performance degradation per attacker L2 transaction" or an analysis of detectability using performance counters. As presented, the stealthiness claim is a product of a biased experimental setup rather than an intrinsic property of the attack. -
Oversimplification of Covert Channel Resilience: In Section 5.1.3 (p. 9), the noise tolerance evaluation (Table 7) shows the channel becomes "unusable" (50% error) under a common workload like Vector-Add. This suggests the channel is extremely fragile in the presence of any significant L2 contention, a critical limitation that is understated in the abstract and conclusion.
Questions to Address In Rebuttal
-
Regarding Takeaway 9 and the driver-dependent limit for evict_last lines (12 vs. 3): Please provide the performance analysis corresponding to Figure 9 for a system with the newer driver (3-line limit). Does the optimal pinned data size shrink to 3/16 of the L2 cache size as your theory would predict? How does the efficacy of the degradation attack in Table 8 change with this 3-line limit?
-
Please provide the data/figures to substantiate the claims made in Takeaway 5 (update policy of evict_last) and Takeaway 8 (interaction logic between evict_last and evict_normal lines).
Please justify the "stealthiness" claim of the performance degradation attack with a more robust metric than the current comparison, which seems to handicap the baseline scanning attack. For example, show a comparison where both attacks are configured to achieve the same level of performance degradation and then compare the required access rates or resulting signatures in performance counters.
-
For the evict_last timeout mechanism described in Takeaway 10, what specific operations and frequency constitute "actively using the L2 cache" by another process? Please provide a more precise characterization of this condition.
-
Are the behavioral limits you discovered for evict_first (e.g., the maximum of 11 lines in Takeaway 2) consistent across the other GPU architectures and driver versions listed in Table 1? If this was not tested, the claims regarding evict_first must be scoped explicitly to the tested configuration.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents a comprehensive reverse engineering of NVIDIA's undocumented evict_first and evict_last cache eviction priority hints. The authors meticulously characterize the microarchitectural behavior of these hints in both single- and multi-process environments, revealing specific, non-obvious rules governing their operation (e.g., the maximum number of lines of a given type allowed per cache set).
Building on these findings, the paper makes a dual contribution. From a security perspective, it demonstrates that these hints introduce potent new attack primitives. The evict_first hint is leveraged to create "Load+Load," a highly efficient cache covert channel that significantly outperforms existing GPU-based channels. The evict_last hint is shown to enable a stealthy and effective performance degradation attack in multi-tenant scenarios. From a performance perspective, the paper provides a crucial, empirically-derived "user manual" for these hints, showing that naively using evict_last on too much data can paradoxically lead to cache thrashing and severe performance degradation, contrary to its intended purpose.
Strengths
The core strength of this work lies in its positioning at the intersection of microarchitectural analysis, systems security, and high-performance computing. It bridges these communities in a compelling way.
-
Novel Attack Surface Identification: The paper moves beyond analyzing incidental microarchitectural behaviors (like standard LRU policies) and instead targets an explicit, programmer-facing, yet undocumented, control mechanism. This is a valuable perspective, highlighting that features designed for optimization can become a potent and easily exploitable attack surface. The "Load+Load" channel (Section 5.1, page 8) is an elegant demonstration of this, simplifying the logic of contention-based channels down to a conflict over a single, specially-designated slot.
-
Dual Contribution to Security and Performance: The work is not just an attack paper. The performance implications discussed in Section 6 (page 11) are equally significant. The discovery that evict_last has a hard limit (12 or 3 lines per set, depending on the driver) before it induces thrashing is a critical finding for any developer seeking to use this feature for performance tuning. This transforms the work from a niche security paper into a broadly relevant study for the GPU computing community.
-
Thorough and Systematic Reverse Engineering: The methodology used to deduce the cache behaviors is sound and the results are presented clearly as a series of "Takeaways" throughout Section 4 (pages 4-8). The investigation across different driver versions (Takeaway 9, page 7) is particularly insightful, revealing that this behavior is not static, which adds an important dimension to the findings.
Weaknesses
The weaknesses of the paper are primarily in its framing and the depth of its contextual analysis, rather than in its technical execution. The core ideas are strong, but could be situated more powerfully.
-
Limited Contextualization within Hardware Security Trends: While the paper compares its channel to prior work, it misses an opportunity to connect to the broader narrative in hardware security over the last decade. The central theme here—performance optimization features creating unforeseen security vulnerabilities—is the very story of speculative execution attacks like Spectre. While the mechanism is different, drawing this thematic parallel would elevate the paper's significance and place it within a larger, well-understood intellectual framework.
-
Superficial Discussion of Countermeasures: The countermeasures section (Section 7.2, page 12) is brief and feels like an afterthought. The ideas (noise injection, remapping hints) are reasonable starting points, but lack depth. A more substantive discussion would explore the implementation challenges and, crucially, the performance impact of these defenses. For example, if the OS driver starts randomly ignoring evict_first hints, what is the performance cost for legitimate applications that rely on them?
-
Unexplored Implications of Driver-Dependent Behavior: The finding that the evict_last capacity changes dramatically with the GPU driver version is fascinating but underexplored. Does this suggest the policy is implemented in mutable microcode or managed by the driver software itself? This has profound implications for both attackers (who must now be driver-aware) and defenders (who might be able to patch policies). The paper presents the observation but does not delve into what it might signify about the underlying hardware-software interface.
Questions to Address In Rebuttal
The authors have presented a compelling piece of work. I would encourage them to consider the following points to further strengthen their contribution:
-
The discovery that the behavior of evict_last is dependent on the driver version (Takeaway 9, page 7) is one of the most intriguing findings. Could you speculate on the implementation? Does this suggest the replacement policy logic is partially implemented in software or updatable firmware/microcode? What are the broader implications of such a design for security analysis?
-
The "Load+Load" covert channel is very efficient because it creates contention on a single, privileged slot within a cache set, rather than requiring the sender to evict an entire set as in traditional Prime+Probe. Can you elaborate on how this fundamental difference in mechanism might affect the feasibility of detection or mitigation strategies compared to classic set-contention channels?
-
In your view, what is the fundamental trade-off that a vendor like NVIDIA must navigate when designing and documenting features like these? Your work suggests that full disclosure could aid attackers, but non-disclosure harms performance-oriented programmers and leaves security holes undiscovered. How can your findings inform future best practices for hardware feature design and documentation?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents a reverse engineering study of NVIDIA's evict_first and evict_last cache eviction priority hints. The authors claim four primary novel contributions based on their findings: 1) a detailed microarchitectural model of how these hints behave, including specific quantitative limits on their use within a single L2 cache set; 2) a new, high-bandwidth covert channel, named Load+Load, that exploits the behavior of the evict_first hint; 3) a new, stealthy performance degradation attack that leverages the "pinning" capability of the evict_last hint; and 4) a counter-intuitive performance analysis showing that over-utilization of the evict_last hint leads to cache thrashing and performance degradation, contrary to its intended purpose.
Strengths
The core novelty of this work lies in its systematic deconstruction of a previously undocumented hardware feature and the subsequent demonstration of new security and performance phenomena that arise from it. My analysis confirms the following as genuinely novel contributions:
-
Novel Experimental Insights: While the methodology of microarchitectural reverse engineering is well-established (e.g., Jia et al., arXiv:1804.06826; Zhang et al., USENIX Security '24), the subject of this investigation—the eviction priority hints—has not been previously characterized at this level of detail. Prior works like Zhuang et al. (OSDI '24) [57] and Jain et al. (MICRO '24) [15] use these hints but treat them as an opaque primitive. This paper's core contribution is revealing the underlying rules, such as the finding in Section 4.1.3 that a maximum of 11 evict_first lines can coexist in an empty set, or the critical threshold of 12 (or 3, depending on the driver) evict_last lines before thrashing occurs (Section 4.2.5, Table 2). These findings (summarized in Takeaways 1-10) represent new knowledge about the hardware.
-
Novel Covert Channel Mechanism: The Load+Load channel described in Section 5.1 is not merely an incremental improvement over existing GPU covert channels. Its novelty lies in the mechanism. Prior conflict-based channels like GPU Prime+Probe (Dutta et al., ISCA '23 [8]) require the sender to evict a line by filling an entire cache set (e.g., all 16 ways). In contrast, Load+Load exploits the newly discovered property that (in a full set) there is effectively a single, contended "slot" for an evict_first line. This allows for a conflict to be created with a single load instruction, which is a fundamentally more efficient mechanism. The delta over prior art is significant, moving from set-level contention to way-level (or slot-level) contention (a toy model of this single-slot contention is sketched after this list).
-
Novel Denial-of-Service Attack Vector: The performance degradation attack in Section 5.2 introduces a novel element of stealth. The concept of cache-based DoS is not new, but existing methods rely on high-frequency "scanning" to generate contention. The authors' method, which uses evict_last to "pin" cache lines, requires only sporadic refreshes (on the order of 10⁸ cycles, per Section 4.2.6). This low-activity profile makes the attack qualitatively different and harder to detect than traditional cache thrashing attacks. The novelty is the exploitation of a persistence mechanism rather than a contention mechanism for DoS.
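To illustrate the single-slot mechanism referenced above, the following Python model is a reviewer's abstraction rather than the authors' code (and deliberately not GPU code): it assumes only the full-set behavior reported in the paper, namely that a full 16-way set retains at most one evict_first line, so the sender's single evict_first load displaces the receiver's evict_first line and the receiver decodes the bit from a hit or miss on its next timed access.

```python
from dataclasses import dataclass, field

@dataclass
class L2Set:
    """Toy model of one 16-way L2 set under the reported full-set rule:
    at most one resident line holds evict_first status, so any new
    evict_first fill displaces the previous one (the contended 'slot')."""
    ways: int = 16
    normal: set = field(default_factory=set)   # evict_normal occupants
    ef_line: int | None = None                 # the single evict_first line

    def load_evict_first(self, addr):
        """Load with the evict_first hint; returns True on a hit."""
        hit = addr == self.ef_line or addr in self.normal
        if not hit:
            self.ef_line = addr                # new fill steals the slot
        return hit

cache_set = L2Set(normal=set(range(100, 115)))   # 15 normal lines: set is full
for bit in [1, 0, 1, 1, 0]:
    cache_set.load_evict_first(0xA)              # receiver primes its line
    if bit:
        cache_set.load_evict_first(0xB)          # sender: one load encodes '1'
    decoded = 0 if cache_set.load_evict_first(0xA) else 1   # miss => '1'
    print(bit, decoded)                          # bits are recovered exactly
```

The contrast with Prime+Probe is visible in the sender's cost: one load per symbol instead of refilling all 16 ways of the set.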
Weaknesses
While the primary contributions are novel, some of the secondary claims in the paper represent incremental advancements rather than new concepts.
-
Incremental Fingerprinting Attack: The "Efficient application fingerprinting attack" described in Section 7.3 is presented as a consequence of the evict_last hint. However, the core idea is functionally identical to Prime+Probe. The only difference is that the attacker pins n cache lines with evict_last and then performs Prime+Probe on the remaining 16-n lines. This is an optimization that reduces the number of memory accesses required for the "probe" step, but it is not a new side-channel attack primitive. The conceptual framework remains unchanged from prior art. The delta is one of efficiency, not of kind.
-
Re-application of Existing Detection Concepts: The "Cache eviction hints detection attack" (Section 7.3, page 13) is an interesting observation but its novelty as an attack is limited. The mechanism relies on observing which cache line gets evicted (the overall LRU line vs. the LRU within a subset of lines) to infer the victim's use of a specific instruction hint. This is a specific application of the general principle of using cache replacement state to infer program behavior, a concept that underlies numerous existing side-channel attacks. The novelty is in what is being inferred, not in the method of inference.
Questions to Address In Rebuttal
-
Regarding the application fingerprinting attack (Section 7.3): Please clarify what, if any, is the fundamental novelty of this attack beyond being a performance optimization of the established Prime+Probe technique. Is there a new type of information leaked that was not accessible before?
-
The paper's findings show a stark difference in the maximum number of evict_last lines (12 vs. 3) depending on the driver version (Takeaway 9, Table 2). Your work is framed as reverse engineering the microarchitecture. Is this limit a configurable hardware feature being set differently by the driver, or is it a policy purely enforced in software by the driver's compiler/runtime? The novelty of this finding as a hardware insight depends on this distinction. Please clarify your assessment.
COSMOS: RL-Enhanced Locality-Aware Counter Cache Optimization for Secure Memory
Abstract
Secure memory systems employing AES-CTR encryption face significant performance challenges due to high counter (CTR) cache miss rates, especially in applications with irregular memory access patterns. These high miss rates increase memory traffic and ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose COSMOS, a scheme to optimize secure memory performance for applications with irregular access patterns. The system uses two distinct reinforcement learning (RL) predictors: one to predict whether data resides on-chip or off-chip after an L1 miss to enable early counter (CTR) access, and a second to predict the locality of CTRs to inform a locality-centric CTR cache replacement policy. The authors claim a 25% performance improvement over the MorphCtr baseline with what they describe as "minimal" hardware overhead. While the problem is well-motivated, the proposed solution rests on a complex interplay of components whose underlying assumptions and evaluations appear to have significant flaws.
Strengths
- Problem Motivation: The paper does a competent job in Section 3 of demonstrating the limitations of existing approaches. The analysis showing the ineffectiveness of simply scaling the CTR cache (Figure 3) and the failure of conventional prefetchers and replacement policies for this specific problem (Figure 5) provides a solid foundation for exploring a new solution.
- Ablation Study: The evaluation methodology includes an analysis of COSMOS-DP (data predictor only) and COSMOS-CP (CTR predictor only) against the full COSMOS system. This is a methodologically sound practice that helps isolate the source of performance gains, as shown in Figures 10 and 11.
Weaknesses
- Fundamentally Flawed Reward Mechanism for CTR Locality: The entire RL-based CTR locality predictor is predicated on a weak and questionable proxy for "ground truth." The "observable" for locality is a hit within the CTR Evaluation Table (CET), a small, 8192-entry LRU-managed buffer (Section 4.1.1, page 6). An LRU buffer is not a reliable oracle for locality. A CTR access that misses in the CET simply because its previous access was pushed out by other traffic is not evidence of "bad locality." This design choice conflates cache contention with inherent data locality, fundamentally undermining the integrity of the reward signal and, by extension, the learned policy.
- Unjustified and Potentially Overfitted Hyperparameters: The reward values and hyperparameters presented in Table 1 (page 9) appear arbitrary and lack justification. For instance, why is the reward for a correct off-chip data prediction (RD_mo) set to 12, while the penalty for an incorrect one (RD_mi) is -30? Without a sensitivity analysis, these values suggest a high degree of manual tuning on a specific benchmark (DFS). The authors admit the system needs re-tuning for different workload domains (Section 4.5), and the weaker performance on MLP (Figure 8) and machine learning workloads (Figure 17) confirms that the chosen parameters are not general but are instead overfitted to graph algorithms. (A minimal sketch of the kind of update these constants drive appears after this list.)
- Understated Hardware Overhead and Complexity: The claim of "minimal hardware overhead" is misleading. The proposed design adds 147KB of SRAM and consumes an additional 206.65 mW (Section 4.6, page 9). In the context of a memory controller, where every square millimeter and milliwatt is critical, this is a significant cost. Comparing the 147KB overhead to an 8MB LLC is an irrelevant comparison; the relevant context is other on-MC structures, where this size is substantial.
- Oversimplification of Speculative Memory Access: The data location predictor triggers a speculative DRAM access for predicted off-chip requests (Section 4.4, page 8). The paper glosses over the practical complexities of this mechanism. An incorrect prediction (which occurs ~15% of the time per Figure 12) results in a speculative DRAM access that must be "halted" (Algorithm 3, line 11). Halting an already-issued DRAM command is non-trivial and incurs both latency and power penalties. The paper fails to quantify the wasted memory bandwidth and power from these mis-speculations, which could easily erode the claimed performance gains.
- Unfair Comparison to State-of-the-Art: The comparison with EMCC is methodologically unsound. The authors state they implemented an "ideal EMCC implementation" and followed its "original flow [65] while excluding additional overheads" (Section 6.2, page 12). This means they are comparing their detailed, overhead-inclusive COSMOS implementation against an idealized, best-case version of a competing work. A rigorous comparison requires modeling competitors with the same level of detail and realistic overheads. This choice artificially inflates the reported 10% performance gain over EMCC. The comparison to RMCC is purely textual and therefore unsubstantiated.
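To ground the hyperparameter concern above, the sketch below shows the kind of tabular update the data location predictor presumably performs. Only the two reward constants RD_mo = 12 and RD_mi = -30 are quoted from the paper's Table 1; the state feature, learning rate, exploration rate, and the rewards for on-chip predictions are assumptions made purely for illustration.

```python
import random

RD_MO, RD_MI = 12, -30        # off-chip prediction rewards quoted from Table 1
R_ON_OK, R_ON_BAD = 10, -10   # assumed values for on-chip predictions
ALPHA, EPSILON = 0.1, 0.05    # assumed learning and exploration rates

q = {}                        # (state, action) -> Q value

def predict(state, explore=True):
    """Epsilon-greedy on-chip vs. off-chip prediction for an L1-miss state."""
    if explore and random.random() < EPSILON:
        return random.choice(('on', 'off'))
    return max(('on', 'off'), key=lambda a: q.get((state, a), 0.0))

def update(state, action, went_off_chip):
    """Bandit-style update once the true data location is known."""
    if action == 'off':
        reward = RD_MO if went_off_chip else RD_MI
    else:
        reward = R_ON_BAD if went_off_chip else R_ON_OK
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + ALPHA * (reward - old)

# Toy trace: pages in the upper half of a small range always miss on-chip caches.
random.seed(0)
for _ in range(2000):
    page = random.randrange(256)
    state = page >> 4                    # assumed state feature: page-region bits
    action = predict(state)
    update(state, action, went_off_chip=(page >= 128))
print(predict(0, explore=False), predict(15, explore=False))   # -> on off
```

The point of the sketch is only that the asymmetric constants directly shape the learned policy, which is why the requested sensitivity sweep matters.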
Questions to Address In Rebuttal
- Please justify the use of a small, LRU-managed CET as a reliable ground truth for CTR locality. How does this mechanism distinguish true lack of reuse from simple capacity- or conflict-induced eviction from the CET itself?
- Provide a sensitivity analysis for the reward values and hyperparameters listed in Table 1. How much does performance degrade if, for example, all positive rewards are set to +10 and all negative rewards to -10? This is necessary to demonstrate that the system is robust and not just tuned to a single data point.
- Detail the precise timing and power model for a mispredicted off-chip access. What is the latency cost of issuing and then canceling a DRAM request? What is the energy cost? How does this overhead affect the overall performance gain, especially for workloads with lower prediction accuracy?
- The claim of outperforming EMCC by 10% is predicated on a comparison against an idealized model. Please provide a comparison where the overheads of EMCC (e.g., L2 pipeline modifications, potential NoC traffic) are modeled with the same fidelity as the overheads for COSMOS.
- Justify the design decision to use two separate RL agents. Could a single agent with a broader action space (e.g., {on-chip, off-chip-good-locality, off-chip-bad-locality}) not accomplish the same task with potentially less hardware overhead from duplicated Q-tables?
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents COSMOS, a novel scheme for optimizing the performance of secure memory systems that use AES-CTR encryption. The authors correctly identify a critical performance bottleneck: the high miss rate of the counter (CTR) cache, particularly for applications with irregular memory access patterns like graph algorithms. A CTR cache miss is extremely costly, as it requires not only a DRAM access but also a traversal of a Merkle Tree for integrity verification.
The core contribution of COSMOS is a sophisticated, dual-predictor framework based on Reinforcement Learning (RL). Instead of relying on static heuristics or complex hardware modifications, COSMOS decomposes the problem into two cooperative tasks:
1. An RL-based Data Location Predictor that, after an L1 cache miss, speculatively determines if data is on-chip or off-chip. For predicted off-chip accesses, it initiates an early, parallel fetch of the corresponding CTR, effectively moving the CTR access point earlier in the memory hierarchy without structural changes.
2. An RL-based CTR Locality Predictor that classifies accessed CTRs as having "good" or "bad" locality. This prediction informs a new Locality-Centric CTR (LCR-CTR) cache, which uses a novel replacement policy to preferentially retain CTRs predicted to have good locality (a toy illustration of such a victim-selection policy is sketched after this summary).
The synergy between these two components is key: the first predictor populates the CTR cache with a stream of accesses that have better locality than the post-LLC stream, and the second predictor intelligently manages this population to maximize hit rates. The authors demonstrate a significant 25% performance improvement over the state-of-the-art MorphCtr baseline for irregular workloads, with a modest hardware overhead.
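As a reviewer's toy illustration of the victim-selection idea behind the LCR-CTR cache (the entry layout and the fallback rule below are assumptions, not the paper's exact policy):

```python
def choose_victim(set_entries):
    """Evict the least-recently-used CTR among those predicted to have bad
    locality; fall back to plain LRU when every resident CTR looks 'good'."""
    bad = [e for e in set_entries if not e['good_locality']]
    pool = bad if bad else set_entries
    return min(pool, key=lambda e: e['last_use'])

ways = [
    {'ctr': 0xA, 'good_locality': True,  'last_use': 5},
    {'ctr': 0xB, 'good_locality': False, 'last_use': 9},
    {'ctr': 0xC, 'good_locality': False, 'last_use': 2},
    {'ctr': 0xD, 'good_locality': True,  'last_use': 1},   # oldest, but protected
]
print(hex(choose_victim(ways)['ctr']))   # -> 0xc: LRU among the 'bad-locality' CTRs
```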
Strengths
-
Elegant Problem Decomposition and Novel Solution: The paper's primary strength lies in its insightful decomposition of the CTR cache problem into two distinct but related sub-problems: access timing and cache residency. Designing two specialized, cooperating RL agents to tackle these is a novel and powerful architectural pattern. This moves beyond simply applying a known technique (RL) and represents a thoughtful co-design of algorithm and architecture.
-
Addresses a Critical and Timely Problem: The performance overhead of secure memory is a well-known barrier to its widespread adoption in high-performance domains. This work tackles the central bottleneck—CTR management—for a particularly challenging and important class of workloads (irregular access patterns). As data-centric and graph-based computing grows, this problem only becomes more relevant.
-
Strong Contextual Positioning and Evaluation: The authors have done an excellent job placing their work in the context of prior art. They build upon the lineage of CTR optimization (SplitCTR, MorphCtr) and convincingly argue why their learning-based approach surpasses both simple hardware scaling and more recent architectural changes like EMCC. The evaluation is thorough, featuring:
- A compelling ablation study (COSMOS-DP vs. COSMOS-CP, page 11) that clearly demonstrates the individual and combined contributions of the two RL predictors.
- An analysis of the system's robustness by testing it on regular ML workloads, where it correctly shows minimal impact (neither helping nor harming significantly), thereby defining its application scope.
- A practical consideration of hardware overhead, keeping the design within a plausible on-chip budget.
-
High Potential for Impact: COSMOS presents a new paradigm for managing security-related metadata. Instead of relying on static structures and policies, it introduces an adaptive, learning-based approach. This concept is powerful and could inspire future work on dynamically managing other overheads in secure and reliable systems. The demonstration that RL can outperform complex, hand-tuned heuristics in this challenging, irregular domain is a significant result for the broader computer architecture community.
Weaknesses
-
Dependency on Hyperparameter Tuning: The performance of any RL system is sensitive to its hyperparameters. The authors perform a one-time tuning for the "irregular memory access" domain using DFS (Section 4.5, page 9). While they show good generalization to BFS and even an MLP, this approach might be brittle. The true strength of online learning is adaptation to dynamic phase changes within an application or across diverse workloads run in succession. The current evaluation doesn't fully explore this dynamic adaptability, which is central to the promise of RL.
-
Complexity of a Dual-Agent System: While the dual-predictor system is elegant, it introduces interaction complexities. The paper notes a "beneficial side effect" where mispredictions from the data location predictor helpfully populate the CTR cache with high-locality entries (Section 6.1.2, page 11). This suggests the interaction dynamics are not fully characterized. There could be scenarios with negative interference, where the learning process of one agent might transiently degrade the performance of the other, leading to instability.
-
Limited Scope of Optimization: The work is explicitly focused on optimizing the CTR cache miss rate. The authors acknowledge that for workloads with high temporal locality, the re-encryption overhead (triggered by CTR overflow) becomes the dominant bottleneck, and COSMOS offers little help. While this is a fair limitation, it's worth emphasizing that this is a solution for one specific, albeit important, performance pathology in secure memory systems.
Questions to Address In Rebuttal
-
Regarding hyperparameter sensitivity: How does the system perform during the initial "warm-up" phase of learning? If an application exhibits a dramatic phase change (e.g., from a graph traversal phase to a dense matrix operation phase), how quickly does the online learning framework adapt, and what is the performance penalty during this adaptation period compared to a statically-tuned system?
-
Regarding the interaction between the two RL agents: The observation that incorrect off-chip predictions from the data location predictor are beneficial is fascinating. Was this an intended part of the design, or a fortunate emergent property? Could you elaborate on the potential for negative interference between the two agents? For example, could a poorly trained data location predictor pollute the CTR cache with a stream of accesses that confuses the locality predictor's learning process?
-
Looking at the broader vision: The dual-predictor RL pattern is a powerful concept. Do the authors see this architectural pattern being applicable to other coupled optimization problems in computer architecture? For instance, could a similar framework be used to co-manage LLC prefetching (predicting what to fetch) and cache replacement (predicting how long to keep it)?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors propose COSMOS, a system designed to mitigate performance overheads in secure memory systems using AES-CTR encryption. The central claim of novelty rests on a dual-predictor architecture powered by Reinforcement Learning (RL). The first RL predictor, the "data location predictor," speculates whether a memory access following an L1 cache miss will be serviced on-chip (L2/LLC) or off-chip (DRAM). An off-chip prediction triggers an early, speculative access to the counter (CTR) cache, aiming to hide the on-chip cache lookup latency from the critical path of a CTR access. The second RL predictor, the "CTR locality predictor," assesses the reuse potential of CTRs. This prediction informs a specialized replacement policy in a "locality-centric CTR cache" (LCR-CTR), which prioritizes retaining CTRs predicted to have high locality. The authors claim this combined approach significantly improves performance for applications with irregular memory access patterns by reducing CTR cache misses.
Strengths
The primary strength of this work lies in its novel synthesis and application of existing machine learning concepts to the specific, and challenging, domain of CTR cache management. While ML-based memory management is an established field, its application to the architectural side-effects of secure memory mechanisms like AES-CTR is less explored.
The decomposition of the problem into two distinct prediction tasks (location and locality) and the deployment of specialized RL agents for each is a clever system-level design. This demonstrates a clear understanding of the problem's bottlenecks: (1) the latency introduced by accessing the CTR cache late in the memory pipeline and (2) the poor locality of accesses that populate the CTR cache in the first place. The paper successfully identifies that simply scaling the CTR cache is ineffective (Figure 3, page 4) and that enabling earlier access is key (Figure 4, page 5), providing a solid motivation for the proposed architecture.
Weaknesses
My primary concern is the degree of fundamental novelty in the core architectural primitives presented. When deconstructed, the constituent components of COSMOS appear to be adaptations of previously proposed ideas, and the "delta" over this prior art is not sufficiently established.
-
The Data Location Predictor: The concept of an early-pipeline predictor that determines if a memory access will ultimately require DRAM service is not new. This idea is functionally identical to the "off-chip load prediction" proposed in Hermes (Bera et al., MICRO 2022) [6]. Hermes uses a perceptron-based predictor after the L1D cache to identify long-latency loads and initiate actions to accelerate them. COSMOS's data location predictor solves the exact same binary classification problem (on-chip vs. off-chip) at the exact same pipeline stage (after an L1 miss) for the exact same purpose (to initiate a speculative, long-latency-related action early). The only significant difference is the choice of model: RL in COSMOS versus a perceptron in Hermes. The paper fails to provide a compelling argument for why RL is fundamentally better or more novel than a perceptron for this specific, well-defined prediction task. The novelty here appears to be algorithmic substitution rather than a new architectural concept.
-
The CTR Locality Predictor and LCR-CTR Cache: The use of a predictor to learn the reuse characteristics of cache blocks and guide a replacement policy is a well-established research direction. The state-of-the-art in cache replacement has moved towards learning-based approaches that predict reuse. For instance, SHiP (Cui et al., MICRO 2011) [9] uses signature-based prediction, Mockingjay (Shah et al., HPCA 2022) [50] learns reuse distances to mimic Belady's MIN, and other works have applied deep learning and imitation learning directly to the replacement problem. The CTR locality predictor is another instance of this general principle. While its application to a CTR cache is specific, the core idea of "predicting locality to improve replacement" is not fundamentally new. The LCR-CTR cache is a standard cache augmented with metadata bits provided by this predictor, a common implementation pattern for learning-based policies. The novelty is in the adaptation of this concept to CTRs, not the concept itself.
-
Holistic RL-based Management: The idea of a holistic, learning-based framework for cache management has also been explored. For example, CHROME (Lu et al., HPCA 2024) [35] uses a single online RL agent to make concurrent decisions about cache replacement, bypass, and prefetching. While COSMOS uses two separate agents, the overarching concept of leveraging RL for fine-grained cache control is part of the current art.
In summary, the novelty of COSMOS seems to be in the engineering of a system that combines and adapts existing predictive techniques for a new problem domain. This is a valuable contribution, but it falls short of introducing a fundamentally new architectural mechanism. The work would be stronger if it explicitly acknowledged the strong conceptual overlap with prior art like Hermes and provided a deeper analysis of why its specific algorithmic choices represent a significant advancement.
Questions to Address In Rebuttal
-
Please clarify the novelty of the RL-based data location predictor over the perceptron-based off-chip load predictor in Hermes [6]. Given that both address the same problem at the same architectural point, what is the key insight that makes the RL-based approach a novel contribution beyond a different choice of algorithm? Are there characteristics of the on-chip/off-chip access stream that make it uniquely suited to RL over simpler predictors?
-
The CTR locality predictor aims to solve a reuse prediction problem. How does this problem fundamentally differ from the reuse prediction problem in conventional data caches, which has been addressed by policies like Mockingjay [50] and other ML-based approaches? What unique challenges of CTR locality justify the complexity of an RL agent over adapting these existing state-of-the-art reuse predictors?
-
The system's main benefit comes from the synergy of the two predictors. Could a simpler, less novel system (e.g., using the Hermes predictor for location and a simpler heuristic for replacement) achieve a significant fraction of the performance gains? This would help isolate the performance benefits derived specifically from the novel aspects of your design versus those derived from already-established concepts.
CryptoBTB: A Secure Hierarchical BTB for Diverse Instruction Footprint Workloads
Abstract
Timing attacks leveraging shared resources on a CPU are a growing concern. Branch Target Buffer (BTB), a crucial component of high-performance processors, is shared among threads and privileged spaces. Recently, researchers discovered numerous ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose CryptoBTB, a hardware security enhancement for hierarchical Branch Target Buffers (BTBs). The design aims to mitigate both conflict-based and collision-based side-channel attacks. The core mechanism involves randomizing the BTB index using region-based cryptographic pads (RCPs), which are cached to reduce encryption latency. To manage the overhead of frequent key updates required for security, the design introduces a "lazy remapping" scheme that utilizes a shadow tag array to preserve old mappings temporarily. The authors claim their solution incurs a low performance overhead of 4.27% and a hardware overhead of 33%, significantly outperforming the prior state-of-the-art, HyBP.
Strengths
- The fundamental idea of decoupling index encryption from the index itself by using cached cryptographic pads is a reasonable approach to addressing the high latency of on-the-fly encryption in the processor frontend.
- The paper correctly identifies significant weaknesses in the prior art (HyBP), namely the intra-region collisions and BTB underutilization that lead to its high performance overhead. The analysis presented in Figure 8 provides a clear illustration of this problem.
- The performance evaluation is comprehensive in its breadth, covering a wide range of workloads from SPEC2017, CVP, and IPC benchmark suites, providing a good overview of the design's performance characteristics under normal (non-adversarial) conditions.
Weaknesses
My primary concerns with this submission center on the insufficiency of its security analysis and the limitations of its evaluation methodology, which call into question the central claims of security.
-
Inadequate Security Validation: The paper's claims of security are not sufficiently substantiated. The security analysis in Section 6 is purely theoretical and descriptive. A paper proposing a security architecture must go beyond asserting security; it must demonstrate it.
- Lack of Empirical Attack Evaluation: The authors have not implemented or simulated any known BTB attacks (e.g., variants of those in [12, 13, 32, 77]) against their own proposed architecture. Without this, the claim that CryptoBTB is secure remains an unproven hypothesis.
- Introduction of New Attack Surfaces: The design introduces several new stateful structures: the RCPB hierarchy, the shadow tag array, and the global phase counter. The security analysis of these components is superficial. For instance, the discussion on hits in the shadow tag array (Section 6.4.1, Page 9) concludes that its "use remains secure" without a rigorous argument. Any new state that is updated based on speculative or non-speculative execution is a potential source for a new side channel. The authors have not analyzed timing variations resulting from hits/misses in these new structures. Speculative access to the L1RCPB (Section 6.4.2, Page 9) is another clear example of a potential new channel that is dismissed too quickly.
-
Fundamentally Flawed Evaluation Methodology for Security Claims: The choice of the ChampSim simulator (commit 2b8d3fc), as noted by the authors themselves in Section 7, is a critical flaw. The authors state, "ChampSim does not simulate the wrong path." Security vulnerabilities like Spectre, and many side channels in general, fundamentally rely on the transient execution of instructions on a mis-speculated path. A simulator that abstracts away this behavior is incapable of providing meaningful evidence about a design's security against speculation-based attacks. While the authors place Spectre out of scope, the BTB itself is a core component of speculative execution, and its interaction with mis-speculation cannot be ignored.
-
Unjustified Threat Model and Security Boundaries: The threat model presented in Section 3 is narrowly defined.
- The explicit exclusion of Spectre and Meltdown is problematic. While a design does not need to solve all problems, a "Secure BTB" should be analyzed in the context of other known speculative execution attacks. The authors fail to discuss how CryptoBTB might interact with or be subverted by a Spectre-style gadget that precedes a branch accessing the BTB.
- The security of the random number generator used for key updates and the key management protocol is assumed but not detailed. A full system's security depends on the strength and implementation of these components.
-
Understated Complexity and Overhead: The hardware overhead of 33% (Table 2, Page 12) is substantial. This includes multiple new caches (RCPBs, MSB Tag Caches) and a complete duplication of the L1BTB tag array. The cost of frequent context switches, which requires flushing multiple structures and resetting state (Section 5.6, Page 8), is also non-trivial. The results in Figure 14 show a noticeable performance degradation (~8% for server workloads) even at a 16-million-instruction interval, which is significant.
Questions to Address In Rebuttal
The authors must address the following points to make a convincing case for this paper's acceptance:
-
Can you provide empirical evidence of CryptoBTB's security by implementing and evaluating at least one known conflict-based and one collision-based BTB attack against your design? A theoretical discussion is insufficient.
-
How can the security claims be substantiated given that the chosen simulator (ChampSim) does not model wrong-path execution, which is the root cause of many microarchitectural side channels? Please justify why this methodological limitation does not invalidate your security conclusions.
-
Please provide a more rigorous security analysis of the new architectural components. Specifically, how do you prove that the timing of accesses to the shadow tag array and the RCPB hierarchy (especially under speculation) does not leak information about previous mappings or execution history?
-
Please justify the remapping interval calculation for the L1BTB (Section 6.1.1, Page 9). The formula from [57] was derived for traditional caches. What evidence suggests that BTB access patterns are sufficiently similar to cache access patterns for this formula to hold and provide the claimed security guarantees?
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
The authors present CryptoBTB, a novel architecture for securing hierarchical Branch Target Buffers (BTBs) against conflict-based and collision-based side-channel attacks. The core contribution is a low-latency index randomization scheme that decouples the encryption from the index itself. This is achieved by generating a "Region Cryptographic Pad" (RCP) for a given address space region (defined by the upper bits of the PC) and XORing it with the BTB index. These pads are cached to leverage spatial locality, minimizing latency. To address the overhead of frequent key updates required for security, the paper introduces a "lazy remapping" mechanism that preserves old BTB entries temporarily via a shadow tag array. The evaluation shows that CryptoBTB incurs a modest 4.27% performance overhead compared to an insecure baseline, a significant improvement over the 31.89% overhead of the prior state-of-the-art, HyBP.
Strengths
This paper presents a well-motivated and architecturally elegant solution to a critical problem in microarchitectural security. Its primary strengths are:
-
Identifies and Solves the Core Flaw in Prior Art: The paper does an excellent job of contextualizing its work against HyBP [79]. It correctly identifies that HyBP's method of using encrypted indices as pads leads to internal collisions and BTB underutilization (Section 4, Figure 8). CryptoBTB’s region-based pad approach is a direct and effective fix, ensuring a one-to-one mapping of original indices to encrypted indices within a region, thereby preserving the BTB's effective capacity. This is a crucial insight that drives the performance gains.
-
Clever Amortization of Cryptographic Latency: The central idea of using a single cryptographic pad for an entire memory region is a very strong architectural contribution. It correctly identifies that the BTB is too latency-sensitive for per-access cryptography, a lesson learned from years of secure cache research. By leveraging the spatial locality of instruction fetches, the RCPB cache effectively amortizes the cost of the block cipher computation across many accesses, making strong cryptography practical for the processor front-end.
-
Pragmatic Approach to Remapping Overhead: Security against conflict-based attacks requires periodic remapping (key changes). A naive flush on every key update would be prohibitively expensive for a structure like the BTB. The proposed lazy remapping scheme (Section 5.2, page 6) is a sophisticated and practical solution. By maintaining access to entries from the previous mapping epoch via a shadow tag array, it smooths the performance impact of remapping, transforming a hard flush into a gradual, on-demand update process.
-
Strong Connection to Modern Architectural Trends: The design is explicitly tailored for a hierarchical, exclusive BTB, which reflects the organization of front-ends in many modern high-performance processors. This grounding in contemporary design choices makes the proposal highly relevant and credible.
Weaknesses
While the core idea is strong, there are areas where the paper could be strengthened by broadening its context or exploring the implications of its design more deeply.
-
Complexity and Potential for New Side Channels: The lazy remapping mechanism, while effective, introduces significant complexity. It involves a global phase counter, primary and shadow tag arrays, and logic to handle hits in either structure. The security analysis in Section 6.4.1 (page 9) argues that the shadow tag array's use is secure because an attacker would need to repopulate the eviction set anyway. However, it does not consider whether the timing difference between a primary hit (fast) and a shadow tag hit (one cycle penalty) could itself constitute a side channel, potentially leaking information about the timing of key updates or the age of a victim's BTB entries.
-
A Narrowed View of BTB Security: The paper explicitly scopes out speculative execution vulnerabilities (e.g., Spectre-BTB variants) in its threat model (Section 3, page 4). This is a standard practice to manage complexity, but it leaves an important question unanswered. Many defenses against such attacks involve altering the timing of BTB updates (e.g., only updating at commit). It is not immediately clear how CryptoBTB's intricate state (especially the lazy update mechanism) would interact with these orthogonal security schemes. A brief discussion of compatibility would place the work in a more complete security context.
-
Practicality of the RCPB Hierarchy: The RCPB is a new structure added to the critical path of instruction fetch. While the paper analyzes the average performance impact, it gives less attention to worst-case scenarios. An L1RCPB miss followed by an L2RCPB miss forces a multi-cycle stall for a block cipher computation. For workloads with poor spatial locality (e.g., frequent jumps between distant code regions), this could introduce high-latency events that are averaged out in the overall IPC numbers but could be detrimental to real-time or quality-of-service-sensitive applications.
Questions to Address In Rebuttal
-
Regarding the lazy remapping mechanism: Can the authors provide a more detailed security argument concerning the timing difference between a primary tag hit and a shadow tag hit? Could an attacker exploit this one-cycle penalty to infer when a key update has occurred or to gain information about the state of the victim's remapping process?
-
How does the CryptoBTB design envision co-existing with defenses against speculative execution attacks that target the BTB? For instance, if BTB updates are buffered and only committed non-speculatively, how would this interact with the lazy remapping and the need to update both the primary and shadow entries? Is the design fundamentally compatible with such delayed-update policies?
-
Could the authors comment on the tail latency implications of the RCPB hierarchy? While the average performance hit from RCPB misses is low, what is the frequency and performance impact of a full pipeline stall due to a miss in the entire RCPB hierarchy? Are there specific workload characteristics that would make this worst-case scenario more common?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper proposes CryptoBTB, a secure hierarchical Branch Target Buffer (BTB) designed to mitigate both conflict-based and collision-based side-channel attacks. The core of the proposal rests on two primary ideas. First, it introduces a low-latency index randomization scheme that uses "region-based cryptographic pads." Unlike prior work (specifically HyBP [79]), where pads are indexed by lower-order PC bits leading to collisions, CryptoBTB generates a single pad for a large address "region" (defined by the upper bits of the PC). This pad is XORed with the original BTB index, effectively scattering entries while preserving the uniqueness of indices within the same region. These pads are cached in a dedicated structure (RCPB) to exploit spatial locality. Second, to address the high performance overhead of frequent re-keying required for small L1 structures, the paper proposes a "lazy remapping" mechanism. This involves a shadow tag array that allows the BTB to temporarily service requests using the previous mapping, while entries are lazily migrated to the new mapping upon use.
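To fix ideas about the region-pad construction and the intra-region bijection it preserves, here is a reviewer's sketch. The index width, the way the region is carved out of the PC, and the use of SHA-256 as a stand-in for the block cipher are all illustrative assumptions; only the one-pad-per-region XOR structure follows the paper's description.

```python
import hashlib

INDEX_BITS = 9                      # assumed BTB index width for illustration

def region_pad(key: bytes, region: int) -> int:
    """Stand-in PRF for the block cipher: one pad per (key, region) pair."""
    digest = hashlib.sha256(key + region.to_bytes(8, 'little')).digest()
    return int.from_bytes(digest[:2], 'little') & ((1 << INDEX_BITS) - 1)

def randomized_index(key: bytes, pc: int) -> int:
    region = pc >> (INDEX_BITS + 2)                 # upper PC bits define the region
    index = (pc >> 2) & ((1 << INDEX_BITS) - 1)     # original set index
    return index ^ region_pad(key, region)          # XOR with the region's pad

key = b'epoch-key-0'
pcs = [0x40001000 + 4 * i for i in range(512)]      # 512 branches in one region
mapped = {randomized_index(key, pc) for pc in pcs}
print(len(mapped))   # -> 512: indices within a region stay distinct after randomization
```

Because all indices in a region share one pad, the XOR acts as a bijection on the index space within that region, which is exactly the property whose absence causes HyBP's intra-region collisions and BTB underutilization.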
Strengths
The primary strength of this paper lies in its identification and solution of a critical flaw in the immediate prior art, coupled with a novel mechanism to overcome a long-standing performance challenge in this domain.
-
Novel Solution to HyBP's Collision Problem: The most significant novel contribution is the shift from an index-keyed pad (as in HyBP's "code book") to a region-keyed pad. While the cryptographic primitive (XORing with an encrypted nonce) is standard, its architectural application here is new and insightful. By using the upper PC bits (the region) as the input to the cipher, the authors ensure that distinct original indices within that region remain distinct after randomization. This directly resolves the collision and BTB underutilization problems that plague HyBP, as detailed in Section 2.3 and demonstrated in Figure 8 (page 6). This is a clear and elegant improvement over the state-of-the-art.
-
Novel Lazy Remapping Architecture: The second novel contribution is the "lazy remapping" scheme (Section 5.2, page 6). The need for frequent re-keying to secure small structures like an L1 BTB typically incurs prohibitive flush-on-update costs. The proposed solution—combining a shadow tag array, a preceding key buffer (Keyprev), and a global phase versioning system—is a novel architectural construct for this problem space. It amortizes the cost of re-keying by allowing old, valid entries to persist and be used, which is critical for performance. This mechanism is what makes frequent remapping feasible and distinguishes the work from prior secure cache/BTB schemes that rely on costly flushes.
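A minimal sketch of the lazy-remapping lookup path follows, assuming a direct-mapped toy BTB and a simple multiplicative hash in place of the pad-based indexing; associativity, the RCPB, and the precise phase/Keyprev bookkeeping are deliberately omitted.

```python
class LazyBTB:
    """Toy direct-mapped BTB with lazy remapping: entries inserted under the
    previous key stay reachable through a shadow tag array and are migrated
    to the new mapping on first use, so a key update needs no full flush."""

    def __init__(self, sets=64):
        self.sets = sets
        self.key, self.key_prev = 1, None
        self.primary = {}      # index -> (branch PC, target) under the current key
        self.shadow = {}       # index -> (branch PC, target) under the previous key

    def _index(self, pc, key):
        return (pc ^ key * 0x9E3779B1) % self.sets   # stand-in for pad-XOR indexing

    def insert(self, pc, target):
        self.primary[self._index(pc, self.key)] = (pc, target)

    def rekey(self, new_key):
        self.key_prev, self.key = self.key, new_key
        self.shadow, self.primary = self.primary, {}  # old mapping stays reachable

    def lookup(self, pc):
        entry = self.primary.get(self._index(pc, self.key))
        if entry and entry[0] == pc:
            return entry[1]
        if self.key_prev is not None:
            idx = self._index(pc, self.key_prev)
            entry = self.shadow.get(idx)
            if entry and entry[0] == pc:
                del self.shadow[idx]
                self.insert(pc, entry[1])             # lazy migration on use
                return entry[1]
        return None                                    # miss

btb = LazyBTB()
btb.insert(0x4000, 0x5000)
btb.rekey(2)
print(hex(btb.lookup(0x4000) or 0))       # -> 0x5000, served via the shadow tags
print(len(btb.primary), len(btb.shadow))  # -> 1 0: the entry migrated on first use
```

The migrate-on-use step is also where the rebuttal question below about simpler alternatives (a victim-buffer-like structure, partial flushes) would bite.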
Weaknesses
The paper's claims of novelty are generally well-supported, but the presentation could be strengthened by more clearly situating its building blocks within the broader context of computer architecture and cryptography, as some of the underlying concepts are not new in isolation.
-
Component-Level Novelty vs. System-Level Novelty: The paper's novelty is primarily in the combination and application of existing concepts to solve a new problem. The use of a cryptographic pad generated by Encrypt(key, nonce) is functionally equivalent to a stream cipher or counter-mode encryption. The idea of caching cryptographic material (the RCPB) is a standard performance optimization. The use of shadow structures or versioning to manage state transitions is also a known architectural pattern. The paper should be more explicit that its novelty lies not in these individual components, but in their synthesis into a coherent system that solves the specific performance and security challenges of hierarchical BTBs, a claim which I believe is valid.
Limited Exploration of the Design Space for Regions: The paper defines a "region" as the address space covered by the Full-Tag (Section 5.1, page 5). This is a static and straightforward definition. However, there is a rich design space here. The security and performance of the scheme are tied to this definition. For instance, could regions be defined differently (e.g., dynamically, or based on process IDs) to offer different trade-offs? The novelty of the contribution would be enhanced by a discussion of why this specific definition was chosen over potential alternatives.
Questions to Address In Rebuttal
-
The core cryptographic operation,
New Index = Index ⊕ Encrypt(Key, Region), is a standard cryptographic construction. Can the authors further clarify why applying this specific construction is non-obvious in the BTB context and how it fundamentally differs from tweakable block ciphers or other similar primitives that have been proposed for randomizing storage structures? Please focus the answer on the architectural implications.
-
The lazy remapping mechanism adds considerable complexity (shadow tag array, dual key storage, phase counters, parallel lookups). Could a simpler scheme have achieved a significant fraction of the benefit? For example, instead of a full shadow array, could a small victim-buffer-like structure hold recently re-mapped entries, or could a policy of only flushing a subset of the ways on a key update have been effective? Please justify the choice for this specific, and complex, implementation.
-
The security of the L1BTB relies on a remapping interval of ~18k accesses (Section 6.1.1, page 9) to prevent eviction set discovery. This number is derived from a formula in [57] for caches. Given that BTB access patterns can be more structured than general cache access patterns, is there a risk that this interval is not conservative enough? How sensitive is the security of the scheme to this parameter?
Efficient Security Support for CXL Memory through Adaptive Incremental Offloaded (Re-)Encryption
Abstract
Current DRAM technologies face critical scaling limitations, significantly impacting the expansion of memory bandwidth and capacity required by modern data-intensive applications. Compute eXpress Link (CXL) emerges as a promising technology to address ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
This paper proposes AIORE, a hardware framework to mitigate the performance overhead of securing CXL-attached memory. The core idea is to dynamically and adaptively select between CTR and XTS encryption on a per-page basis, driven by an access-frequency "hotness" tracker. The design introduces mechanisms for incremental and offloaded re-encryption to manage the state transitions between these modes and handle counter overflows. The authors claim this complex system significantly reduces security overhead compared to a state-of-the-art XTS TEE baseline. However, the work introduces significant system complexity and, most critically, dismisses a fundamental security vulnerability—timing side channels—that its very design creates. The robustness of its core heuristic-driven mechanisms under diverse workload conditions is also not sufficiently established.
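To make the adaptive mechanism concrete, the following is a minimal sketch of a hotness-driven per-page mode selector. The initial thresholds (16 and 8) and the 95% target mirror the values reported in this review, but the promotion/demotion and adjustment logic shown here is an illustrative assumption, not the authors' exact design.

```python
# Minimal sketch of a per-page hotness-driven mode selector (assumed logic).
from collections import defaultdict
from enum import Enum

class Mode(Enum):
    XTS = "xts"   # metadata-free, used for cold pages
    CTR = "ctr"   # counter-based, used for hot pages

class HotnessTracker:
    def __init__(self, hot_threshold=16, cold_threshold=8):
        self.hot_threshold = hot_threshold
        self.cold_threshold = cold_threshold
        self.access_counts = defaultdict(int)
        self.mode = defaultdict(lambda: Mode.XTS)   # pages start cold

    def on_access(self, pfn: int) -> Mode:
        self.access_counts[pfn] += 1
        if self.mode[pfn] is Mode.XTS and self.access_counts[pfn] >= self.hot_threshold:
            self.mode[pfn] = Mode.CTR      # would trigger an XTS -> CTR re-encryption
        return self.mode[pfn]

    def on_epoch_end(self, counter_cache_hit_rate: float, target: float = 0.95):
        # Demote pages that stayed cold this epoch back to XTS.
        for pfn, mode in list(self.mode.items()):
            if mode is Mode.CTR and self.access_counts[pfn] < self.cold_threshold:
                self.mode[pfn] = Mode.XTS  # would trigger a CTR -> XTS re-encryption
        # Dynamic adjustment toward the counter-cache hit-rate target.
        if counter_cache_hit_rate < target:
            self.hot_threshold *= 2        # demand more evidence before promotion
        elif self.hot_threshold > 2:
            self.hot_threshold //= 2
        self.access_counts.clear()

tracker = HotnessTracker()
for _ in range(20):
    tracker.on_access(pfn=0x1234)
assert tracker.mode[0x1234] is Mode.CTR
```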
Strengths
-
Problem Motivation: The paper provides a solid analysis of the performance trade-offs between XTS and CTR mode encryption in the context of CXL memory (Section 3.1). The identification of counter overflow as a key bottleneck for split-counter CTR schemes is accurate and serves as a strong motivation for a hybrid approach.
-
Ablation Study: The ablation study presented in Section 6.2 (Figure 18) is a methodologically sound way to decompose the performance contributions of the different components of AIORE. This provides clear insight into which parts of the complex design contribute most to the claimed performance improvements.
-
Comprehensive Baseline Comparison: The evaluation in Section 6 compares AIORE against a reasonably comprehensive set of seven alternative schemes, including standard TEE modes and recent academic proposals. This provides a broad context for the performance results.
Weaknesses
-
Critical Security Flaw: Dismissal of Timing Channels: The most significant weakness of this work is its handling of the security model. The adaptive encryption mechanism, by its very nature, causes memory access latencies to become data-dependent. A "hot" page (CTR mode) will have a different access latency profile than a "cold" page (XTS mode). This directly creates a timing side channel that leaks information about the application's memory access patterns—specifically, which pages are frequently accessed. The authors explicitly acknowledge this and dismiss it, stating in Section 4.8: "...industry TEEs and CXL IDE exclude timing-based channels from their threat models. Our design follows the same assumptions." This is an unacceptable justification. A proposal that introduces a new information leakage channel cannot simply inherit the threat model of prior work that did not have this vulnerability. The work fundamentally trades performance for confidentiality, but presents it as a pure performance optimization.
-
Unsubstantiated Claims Regarding Offload Impact: The mechanism for offloaded re-encryption (Section 4.4) involves blocking host access to a page if it arrives while the offloaded task is in process. The authors claim this "marginally impacts performance, as the offloaded pages are those that are infrequently used." (Section 4.4, page 9). This is an unsubstantiated assertion. No data is provided to quantify the frequency or duration of these stalls. A workload could easily exhibit behavior where a page is "cold" for a period, gets offloaded for re-encryption, and then immediately becomes "hot," leading to costly stalls on the critical path.
-
Fragility of Heuristic-Based Hotness Tracking: The entire system hinges on the Page Hotness Tracker (Section 4.5), which is a heuristic-driven mechanism. It relies on fixed initial thresholds (16 and 8) and a dynamic adjustment policy targeting a 95% counter cache hit rate. This raises several questions of robustness:
- Why is 95% the optimal target hit rate? No justification is provided. A different target might be better for different workloads.
- How does the system behave under workloads with rapid and frequent phase changes? The tracker may constantly be making suboptimal decisions, triggering expensive re-encryptions that negate any performance benefit. The evaluation on SPEC and graph benchmarks with stable execution phases may not expose this fragility.
-
Unanalyzed Resource Contention: The offloaded re-encryption tasks utilize the Memory Encryption Engines (MEEs) on the CXL memory device (Section 4.4, Figure 14). These are the same hardware resources required to service normal, on-demand memory read/write requests from the host. The paper provides no analysis of the resource contention between these background re-encryption tasks and foreground critical-path memory accesses. This contention could easily degrade overall system performance, an effect not captured in the evaluation.
-
Motivation Based on Static Analysis: The core motivation for a dynamic adaptive system is supported by Figure 10, which shows the results of statically partitioning pages. While this demonstrates the potential of a hybrid approach, it does not prove that a complex dynamic system is superior to a simpler, profile-guided static partitioning scheme. The overhead and complexity of the dynamic tracking and transition machinery may outweigh its benefits over a less complex alternative.
Questions to Address In Rebuttal
-
On the Timing Channel: Please provide a rigorous justification for excluding the timing channel introduced by AIORE from the threat model. Given that the mechanism directly leaks page access frequency, how can the confidentiality claims of the underlying TEE be maintained? Provide a quantitative analysis of the information leaked (e.g., in bits per access) through this channel.
-
On Hotness Tracker Robustness: Please provide a sensitivity analysis of the hotness/coldness thresholds and the 95% hit rate target. Furthermore, how does AIORE perform on workloads specifically designed to have rapid and frequent phase changes, which would stress-test the adaptability of the tracker and potentially induce re-encryption thrashing?
-
On Offload-Induced Stalls: Quantify the "marginal" performance impact of blocking accesses to pages undergoing offloaded re-encryption, as claimed in Section 4.4. What is the measured frequency and average duration of these stalls across the evaluated benchmarks?
-
On Device-Side MEE Contention: Your design places background re-encryption work on the same device-side MEEs that service foreground requests. Please provide an analysis of the performance impact of this resource contention. How much are foreground memory accesses delayed due to the MEEs being occupied by offloaded re-encryption tasks?
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper addresses the significant performance overhead associated with securing expanded memory in Compute eXpress Link (CXL) environments, a critical challenge for the adoption of CXL in multi-tenant public clouds. The current state-of-the-art, combining Trusted Execution Environments (TEEs) with CXL's Integrity and Data Encryption (IDE) standard, relies on XTS encryption, which introduces substantial latency on the critical path of memory reads.
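To illustrate why the choice of mode sits on the read critical path, consider a toy latency model. The cycle counts below are assumptions for illustration; the structural point is that CTR can overlap keystream generation with the data fetch, while XTS serializes decryption behind it.

```python
# Toy latency model for the XTS-vs-CTR trade-off (cycle counts are assumed).
T_MEM = 100        # assumed cycles to fetch a cache line from CXL memory
T_AES = 40         # assumed cycles for the block cipher
T_CTR_META = 20    # assumed cycles to fetch the counter (on a counter-cache hit)

def read_latency_xts() -> int:
    # Decryption is serialized after the ciphertext arrives.
    return T_MEM + T_AES

def read_latency_ctr(counter_cache_hit: bool) -> int:
    # The keystream pad is computed from the counter while the data is in
    # flight; the final XOR is negligible. A counter-cache miss puts the
    # metadata walk on the critical path instead.
    meta = T_CTR_META if counter_cache_hit else T_MEM + T_CTR_META
    return max(T_MEM, meta + T_AES)

print(read_latency_xts())            # 140
print(read_latency_ctr(True))        # 100 -- hides the cipher latency
print(read_latency_ctr(False))       # 160 -- why hot pages with high hit rates matter
```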
The authors propose AIORE (Adaptive Incremental Offloaded (Re-)Encryption), a novel framework that intelligently combines two encryption modes: the fast, counter-based CTR mode for frequently accessed ("hot") memory pages and the metadata-free XTS mode for less-used ("cold") pages. The core contribution is not merely the hybrid approach, but the elegant, three-part mechanism for managing it: 1. Adaptive Encryption: A page hotness tracker dynamically monitors access patterns and triggers transitions between encryption modes to optimize for latency and counter cache usage. 2. Incremental Re-encryption: The expensive process of re-encrypting a page during a mode transition is performed incrementally, piggybacking on existing program reads and writes to hide the latency from the critical path. 3. Offloaded Re-encryption: Incomplete re-encryption tasks for pages that are no longer being accessed are offloaded to the CXL memory device itself, freeing up the host and preventing stalls.
Through simulation with Gem5, the authors demonstrate that AIORE reduces the security overhead of CXL memory to an average of 3.7%, a significant improvement over the baseline's ~10% overhead and other counter-based schemes.
Strengths
-
Timely and High-Impact Problem: The paper tackles a problem of immediate and practical importance. As CXL moves from a specification to a deployed technology, ensuring its security without compromising its primary benefit—high-performance memory expansion—is paramount. This work is situated directly at the intersection of computer architecture, security, and systems, making it highly relevant to the community.
-
A Cohesive, System-Level Solution: The true strength of AIORE lies in its synthesis of multiple architectural techniques into a single, elegant framework. Rather than proposing a single-point optimization, the authors have designed a complete system that addresses the full lifecycle of a hybrid encryption policy: when to switch (adaptive), how to switch without stalling (incremental), and how to handle edge cases efficiently (offloaded). This holistic approach is commendable.
-
Excellent Motivation and Contextualization: The authors do an exceptional job in Section 3 (page 4) of analyzing the existing design space. The critical path diagrams in Figure 6 clearly illustrate the trade-offs between XTS and CTR modes. Furthermore, their analysis of prior work, particularly the critique of Counter Light [82] for its reliance on a non-standard ECC bus and a less-optimal, bandwidth-based switching policy, builds a very strong justification for their design choices.
-
Strong and Illuminating Evaluation: The experimental methodology is sound. The comparison against seven other schemes, including the established baseline and recent academic proposals, provides a robust benchmark of AIORE's performance. The ablation study presented in Section 6.2 (Figure 18, page 12) is particularly valuable, as it clearly quantifies the performance contribution of each of AIORE's three core ideas, validating the design's integrity.
Weaknesses
While the core idea and its evaluation are strong, the paper could be improved by addressing the following points, which are more about depth and future implications than fundamental flaws.
-
Implementation Complexity: AIORE introduces several new hardware components and state management mechanisms (Page Hotness Tracker, IREBC, IREBB, and the coordination logic). While conceptually sound, a brief discussion of the practical hardware implementation complexity and area/power overhead would strengthen the paper's claims of feasibility. The current area overhead analysis in Section 4.6 (page 9) is a good start, but a qualitative discussion of design complexity would add value.
-
Robustness of the Adaptive Policy: The adaptive policy hinges on hot/cold thresholds that are dynamically tuned to maintain a target counter cache hit rate (95%). This seems reasonable for the evaluated workloads, but it may be less effective in scenarios with very rapid and dramatic application phase changes. The system could potentially oscillate or lag in its response, leading to suboptimal performance. A discussion of the policy's robustness under more adversarial or dynamic workload patterns would be insightful.
-
Limited Exploration of the CXL Design Space: The paper primarily models CXL as a direct-attached memory expander (CXL.mem). The AIORE framework, particularly the offloading component, seems perfectly suited for the richer, switched-fabric topologies enabled by CXL 2.0/3.0, which involve memory pooling and sharing. Positioning AIORE within this broader, more disaggregated future would elevate the work's perceived impact and foresight.
Questions to Address In Rebuttal
-
Regarding implementation complexity: Can the authors comment on the feasibility of integrating the Page Hotness Tracker and the Incremental Re-Encryption Bitmap Cache (IREBC) into a modern CXL root complex? Are there particular challenges in verifying the correctness of the intricate state transitions, especially the handoff between the incremental and offloaded stages?
-
Regarding the adaptive policy: The framework uses a fixed 95% counter cache hit rate as its optimization target. Have the authors considered the sensitivity to this target value or how the system might behave in workloads with rapid phase changes where the set of "hot" pages changes dramatically in a short time?
-
This work provides an excellent solution for securing CXL memory. Could the authors comment on whether the core AIORE framework—adaptive policy, incremental state change, and offloaded management—could be generalized to solve other problems in disaggregated memory systems? For example, could a similar approach be used for managing data placement in tiered memory (DRAM vs. SCM) or for applying different compression algorithms to hot/cold pages?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents AIORE (Adaptive Incremental Offloaded Re-Encryption), an architectural framework designed to mitigate the performance overhead of securing CXL-attached memory. The authors identify the key performance bottleneck as the static use of XTS encryption, which is on the critical path for memory reads.
The core novelty claim is a three-part strategy applied in concert: 1. Adaptive Encryption: A per-page, dynamic selection between Counter (CTR) mode encryption for "hot" (frequently accessed) pages and XTS mode for "cold" pages. This selection is driven by a hardware "Page Hotness Tracker." 2. Incremental Re-Encryption: When a page's mode is switched (e.g., from XTS to CTR) or a CTR counter overflows, the required full-page re-encryption is performed incrementally. Instead of stalling the processor to re-encrypt all 64 cache lines, the re-encryption occurs on a line-by-line basis as the program naturally reads or writes to them. 3. Offloaded Re-Encryption: To handle cases where a page in transition is not fully accessed by the program in a timely manner, the task of completing the re-encryption is offloaded from the host CPU's critical path to the CXL memory device's controller.
The authors claim this combination significantly reduces security overhead compared to existing static XTS or CTR-based solutions for CXL.
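As a concrete illustration of the incremental-plus-offloaded idea summarized above, the sketch below tracks per-line conversion state for a single 4 KiB page (64 cache lines) and hands any unfinished remainder to a device-side engine. The bitmap layout and the offload interface are assumptions made for clarity, not the paper's IREBC/IREBB protocol.

```python
# Illustrative per-page re-encryption state during a mode transition (assumed layout).
LINES_PER_PAGE = 64

class PageTransition:
    def __init__(self, pfn: int):
        self.pfn = pfn
        self.converted = 0          # 64-bit bitmap: 1 = line already re-encrypted

    def on_host_access(self, line: int) -> bool:
        """Piggyback conversion on a program read/write; returns True if this
        access had to perform the old-mode decrypt + new-mode encrypt."""
        mask = 1 << line
        if self.converted & mask:
            return False            # already in the new mode, normal access
        self.converted |= mask      # convert this line on the triggering access
        return True

    def remaining_lines(self):
        return [i for i in range(LINES_PER_PAGE) if not (self.converted >> i) & 1]

    def offload(self):
        """Hand the unfinished remainder to a device-side engine."""
        return {"pfn": self.pfn, "pending": self.remaining_lines()}

t = PageTransition(pfn=0x2000)
for line in (0, 1, 5, 5):
    t.on_host_access(line)
print(len(t.offload()["pending"]))   # 61 lines left for the offloaded engine
```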
Strengths
The primary strength of this work lies not in the invention of a new cryptographic primitive, but in the novel and sophisticated synthesis of several architectural concepts to create a highly efficient system.
-
Novel Re-Encryption Mechanism: The combination of incremental and offloaded re-encryption is the most genuinely new contribution. Standard split-counter schemes [87] suffer from high-latency, blocking re-encryption on overflow. The idea of piggybacking re-encryption on inherent program accesses to amortize the cost, and then offloading the remainder of the work, is a clever and previously unexplored mechanism in this context. It directly addresses the primary performance pathology of using compact, high-hit-rate counters.
-
Strategic Application of Hybrid Encryption: While hybrid security mechanisms are not new in themselves (see Weaknesses), the authors' proposal to use access hotness as the selection criteria between CTR and XTS is a logical and well-justified policy. It correctly identifies that CTR's benefits are maximized for hot data where counter caching is effective, while XTS is superior for cold data where maintaining counter state is pure overhead. This is a clear improvement over prior work like Counter Light [82], which uses a less direct proxy (bandwidth utilization) for its switching policy.
Weaknesses
While the overall system design is novel, a breakdown of its constituent parts reveals that many of the underlying concepts have precedents in prior art. The novelty is in the integration, not the individual ideas.
-
Conceptual Overlap with Prior Art:
- Hybrid Encryption: The concept of switching between different encryption modes to optimize performance is not fundamentally new. The paper itself discusses Counter Light [82], which proposes a hybrid CTR/XTS scheme for TEEs. The delta here is the policy (hotness vs. bandwidth) and the re-encryption mechanism, but the core idea of a hybrid approach is established.
- Offloading to "Smart" Memory: The idea of offloading security or management tasks to a compute-capable memory controller is an emerging theme, especially with CXL. Toleo [26], cited by the authors, proposes using trusted components in CXL memory to manage freshness, which is a form of offloaded security processing. AIORE's contribution is the specific application of offloading to the re-encryption problem.
- Hotness-Aware Optimization: Optimizing system policies based on data access "hotness" is a classic technique in computer architecture, applied to everything from cache replacement to data migration. Applying it to select an encryption mode is a new application, but not a new principle.
-
Significant Architectural Complexity: The proposed solution introduces a substantial number of new hardware components and state-tracking mechanisms. This includes the Page Hotness Tracker, the Incremental Re-Encryption Bitmap Cache (IREBC) on the host, the Incremental Re-Encryption Bitmap Buffer (IREBB) on the CXL device, and modifications to page table entries. While the authors' evaluation shows a clear performance benefit, this benefit comes at the cost of considerable design and verification complexity. The performance gain must be weighed against this implementation burden. The paper analyzes area overhead, but not the complexity of the control logic required to manage the incremental and offloaded states concurrently with normal memory accesses.
Questions to Address In Rebuttal
-
The novelty of this work rests heavily on the synergy between its components. Could the authors please clarify the relationship between their incremental re-encryption and analogous concepts in other fields, such as incremental garbage collection or lazy data migration? Acknowledging such parallels would help situate the novelty of their specific implementation more precisely.
-
The offloading mechanism relies on a new communication protocol between the host and the CXL device to transfer re-encryption state (PFN, bitmap, counter state). As detailed in Section 4.4 and Figure 14, this introduces a new class of control messages. Does this mechanism introduce any new, subtle side-channels related to the timing or frequency of these offload and completion messages that are not covered by the baseline threat model discussed in Section 2.5 and 4.2.1?
-
The performance of AIORE seems critically dependent on the accuracy of the Page Hotness Tracker. The current policy for adjusting the hotness threshold is based on maintaining a target counter cache hit rate (95%). How robust is the system if this heuristic is inaccurate, for example, during rapid phase changes in an application's memory access patterns? Could this lead to pathological behavior, such as "thrashing" pages between CTR and XTS modes?
Citadel: Rethinking Memory Allocation to Safeguard Against Inter-Domain Rowhammer Exploits
Abstract
Rowhammer is a hardware security vulnerability at the heart of every DRAM-based memory system. Despite its discovery a decade ago, comprehensive defenses in current systems remain elusive, while the probability of successful attacks grows with DRAM ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose Citadel, a memory allocator designed to mitigate inter-domain Rowhammer exploits by physically isolating security domains. The core idea is a two-level allocation scheme that uses coarse-grained "chunks" for large domains to amortize guard row overhead, and fine-grained, high-overhead "zonelets" (akin to ZebRAM) for small domains. The paper claims this design supports thousands of domains with a modest 7.2% average memory overhead and no performance loss.
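The capacity-loss trade-off behind the two-level design can be seen with simple arithmetic. In the sketch below, the guard-row budget and zone sizes are illustrative assumptions; the 50-67% striping figure matches what the reviews attribute to ZebRAM, while the zone numbers are purely illustrative.

```python
# Back-of-the-envelope comparison of the two allocation primitives (assumed sizes).
def zonelet_overhead(n_guard: int) -> float:
    # Striping: every data row is bounded by n_guard guard rows,
    # so each data row "pays" for n_guard guards.
    return n_guard / (1 + n_guard)

def zone_overhead(rows_in_zone: int, n_guard: int) -> float:
    # Coarse zones: guard rows sit only at the two boundaries of the zone.
    return (2 * n_guard) / (rows_in_zone + 2 * n_guard)

print(f"zonelet, N_G=1: {zonelet_overhead(1):.0%}")            # 50% -- ZebRAM-like
print(f"zonelet, N_G=2: {zonelet_overhead(2):.0%}")            # 67%
print(f"zone of 128 rows, N_G=2: {zone_overhead(128, 2):.1%}") # ~3% -- amortized
print(f"zone of  16 rows, N_G=2: {zone_overhead(16, 2):.1%}")  # 20% -- small zones hurt
```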
While the conceptual approach of balancing allocation granularities is sound, the paper's central claims are built on a fragile foundation. The evaluation is conducted almost exclusively under a simplified, best-case DRAM addressing model that does not reflect the complexity of many modern systems. The authors' own analysis in Section 8.5 reveals that under more realistic conditions, the overhead would likely approach 25%, a figure that undermines the paper's primary contribution. Furthermore, the claim of "no performance loss" is unsubstantiated, and the decision to completely disable inter-process memory sharing severely curtails the system's practical applicability.
Strengths
-
The fundamental concept of a two-level allocation strategy to balance the trade-offs between capacity loss (from guard rows) and memory stranding (from coarse-grained reservations) is a logical and interesting direction for research in this area. It correctly identifies the primary weaknesses of prior art like ZebRAM and Siloz.
-
The paper's design is more flexible than its predecessors, offering a mechanism to support domains of highly variable sizes, from single kernel pages to large multi-gigabyte applications.
Weaknesses
My analysis reveals several critical weaknesses that question the validity and practicality of the presented results.
-
The Evaluation Rests on an Unrealistic Model: The entire quantitative evaluation (Figures 9-12) is performed using a simple DRAM mapping model. The paper relegates the far more common and complex mappings (involving scrambling, mirroring, etc.) to a brief, analytical discussion in Section 8.5. In that section, the authors concede that zone expansion becomes less effective, and capacity loss is "more likely to dominate and approach memory overheads of 25%." This is a fatal flaw. The headline claim of 7.2% overhead is not representative of many real-world systems, and the paper provides no empirical data to validate its performance under these more challenging, realistic conditions. The work has been evaluated in its best-case scenario, not its typical one.
-
Unsupported and Misleading Performance Claims: The Abstract claims "no performance loss," yet Figure 10 (page 10) shows performance variations from -13% to +15%. More troublingly, the authors report a 4.1% average speedup but state they "could not pinpoint the gain's root cause." Unexplained performance improvements are as suspect as unexplained slowdowns and often point to uncontrolled variables or artifacts in the experimental setup. A rigorous paper cannot make a strong performance claim on the back of an unexplained result. The claim should be, at best, "negligible average performance impact," but even that is questionable without understanding the source of the variance.
-
Disabling Core OS Functionality: The authors disable inter-process memory sharing to "simplify our implementation" (Section 6.3, page 9). They justify this by showing their chosen workloads make little use of it (Section 8.6, page 12). This is a textbook case of shaping the experiment to hide a system's deficiency. Memory sharing via copy-on-write (CoW) after
fork() is a cornerstone of Unix-like operating systems, and features like Kernel Samepage Merging (KSM) are critical in virtualization environments. A memory allocator that cannot support this without massive memory duplication is not a practical solution for general-purpose servers.
-
Security Guarantees Are Brittle: The security model is predicated on the assumption that
NG guard rows are sufficient to stop any attack (Section 4.7, page 5). While the authors acknowledge attacks like Half-Double, the landscape of Rowhammer is constantly evolving. Claiming the design offers "complete protection coverage against future unknown RH attacks" (Section 1, page 1) is a dangerously strong and unsubstantiated claim for a software-only defense that relies on a fixed, small number of guard rows.
Re-introduction of "Impractical" Overheads: The authors rightly criticize ZebRAM's 50-67% capacity loss as "impractical." However, their "zonelet" primitive for small domains uses the exact same striping mechanism and incurs the same high overhead (Section 4.6, page 5). While the two-level design mitigates this at a system level, it does not solve the underlying problem. For workloads dominated by a vast number of small domains (a plausible scenario in microservice architectures or with per-page-table isolation), the average overhead would trend towards ZebRAM's impractical levels, a scenario not adequately stressed in the evaluation.
Questions to Address In Rebuttal
-
The discrepancy between the empirically-measured 7.2% overhead under a simplified model and the analytically-derived ~25% overhead under a complex model is the most significant issue. Can you provide empirical results from a full evaluation on a system with complex DRAM address mappings to show the true overhead of Citadel?
-
Please provide a clear, evidence-based explanation for the 4.1% average speedup. If the cause remains unknown, on what basis can you claim your system has "no performance loss" rather than "unpredictable performance impact"?
-
The decision to disable memory sharing is a major limitation. Please provide a quantitative analysis of the memory overhead Citadel would incur on a workload that heavily utilizes
fork() and CoW (e.g., a pre-forking web server like Apache under load) or KSM. How can Citadel be considered a general-purpose solution without supporting this fundamental OS feature efficiently?
-
Given that your system's headline 7.2% overhead is only achievable under an idealized mapping, and that you disable a key OS feature, how do you justify the claim that Citadel is "readily deployable across legacy, contemporary, and future platforms"?
-
What is the quantified duration of the system's window of vulnerability during the bootstrapping process described in Section 6.4 (page 9)?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
The paper presents Citadel, a novel memory allocator for the Linux kernel designed to provide robust protection against inter-domain Rowhammer (RH) exploits. The work correctly identifies the fundamental challenge in software-based RH isolation: a difficult trade-off between the high memory capacity loss of fine-grained, row-by-row isolation schemes (e.g., ZebRAM) and the excessive memory stranding and limited domain scalability of coarse-grained, subarray-level schemes (e.g., Siloz).
Citadel's core contribution is a practical and elegant synthesis of these two approaches. It introduces a two-level allocation strategy that reflects the typical memory usage patterns in modern systems. For the numerous, small-footprint domains (like background processes or individual page table pages), it uses fine-grained "zonelets." For the few, large-footprint application domains, it uses coarser "reservation chunks." This hybrid approach allows Citadel to amortize the cost of guard rows effectively, supporting thousands of security domains of arbitrary size. The authors implement their design in the Linux kernel and demonstrate through a comprehensive evaluation that it incurs a modest 7.2% average memory overhead and no performance degradation, while successfully supporting complex workload mixes that prior solutions cannot handle.
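A minimal sketch of the placement decision described above follows; the chunk size, row capacity, and small-domain threshold are illustrative assumptions rather than Citadel's actual parameters.

```python
# Sketch of the two-level placement decision: small domains -> striped zonelets,
# large domains -> chunk-granular zones (all sizes below are assumed).
ROWS_PER_CHUNK = 64          # assumed "reservation chunk" granularity
KB_PER_ROW = 8               # assumed row capacity
CHUNK_KB = ROWS_PER_CHUNK * KB_PER_ROW

def place_domain(footprint_kb: int) -> dict:
    if footprint_kb < CHUNK_KB // 2:
        # High per-row guard overhead, but no stranding for tiny domains
        # (page-table pages, daemons).
        return {"primitive": "zonelet", "chunks": 0}
    # Round the footprint up to whole chunks; guard rows are placed only at the
    # zone boundaries, so their cost is amortized over the whole zone.
    chunks = -(-footprint_kb // CHUNK_KB)      # ceiling division
    stranded_kb = chunks * CHUNK_KB - footprint_kb
    return {"primitive": "zone", "chunks": chunks, "stranded_kb": stranded_kb}

print(place_domain(16))        # a page-table page -> zonelet
print(place_domain(4 * 1024))  # a 4 MiB service  -> 8-chunk zone, no stranding
```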
Strengths
-
Elegant and Principled Core Idea: The central contribution of this work is its recognition that a one-size-fits-all approach to RH isolation is inefficient. The dual-granularity design based on "zonelets" and "chunks" is a clear and powerful idea that directly addresses the primary weaknesses of its predecessors. The paper's problem formulation, especially as illustrated in Figure 1 (Page 2), is exceptionally clear and provides a strong motivation for the proposed solution.
-
Excellent Contextualization and Positioning: The authors have done a superb job of placing their work within the broader landscape of RH mitigations. They clearly articulate the limitations of both hardware defenses and existing software isolation schemes, positioning Citadel as the logical and necessary next step in this line of research. The work serves as a perfect synthesis of the ideas presented in ZebRAM [41] and Siloz [45], combining their respective strengths to create a more general and practical system.
-
Thorough and Realistic Evaluation: The experimental methodology is a significant strength. The creation of 11 diverse workload mixes (Table 3, Page 10) that include not only main applications but also background processes and emulated per-page-table domains demonstrates a deep understanding of real-world system behavior. Evaluating at the scale of a 128GB server with up to ~57K domains shows that the solution is designed for contemporary challenges, such as high core counts and the need for fine-grained process isolation. The main results in Figure 9 (Page 10) are compelling and clearly show the benefits of Citadel.
-
Pragmatic and Well-Considered Design: The implementation as a Linux memory allocator shows a commitment to practical application. The design thoughtfully considers complex, real-world issues that are often overlooked in purely academic proposals. This includes the bootstrapping process (Section 6.4, Page 9), integration with kernel subsystems, and most importantly, the implications of complex internal DRAM addressing schemes like row scrambling and mirroring (Section 6.1, Page 7). This foresight significantly strengthens the paper's credibility.
Weaknesses
-
The "Oracle" Problem of DRAM Address Mappings: The most significant dependency of this work, shared by all similar software-only spatial isolation schemes, is the requirement of knowing the physical-to-DRAM address mapping. The authors acknowledge this in Section 6.1.4 (Page 8), but the practicality of this step for widespread deployment remains a major hurdle. While reverse-engineering is possible, it is a non-trivial, system-specific process. This dependency moves Citadel from a "drop-in software patch" to a solution requiring significant, expert-level per-system calibration, which could limit its adoption.
-
Overhead Under Complex Mappings: The paper's headline result of ~7% memory overhead is based on a simple DRAM mapping. The authors' own analysis in Section 8.5 (Page 12) projects that for more complex (and common) mappings, the overhead could approach 25% due to reduced opportunities for zone expansion. This is a crucial point that significantly tempers the paper's main claims. While the honesty is commendable, the core evaluation does not use what might be the more common case, potentially presenting a best-case scenario as the primary result.
-
Handling of Shared Memory: The prototype simplifies its implementation by disabling inter-process memory sharing (Section 6.3, Page 9). The authors justify this by noting the low sharing factor in their chosen workloads. However, in other important scenarios, such as virtualization or environments with high degrees of library sharing or copy-on-write forks, this could be a significant limitation, leading to either security vulnerabilities or increased memory pressure. The proposed solution of placing shared pages in zonelets is plausible but unevaluated.
Questions to Address In Rebuttal
-
Regarding the dependency on DRAM address mapping: Could the authors elaborate on the practical path to deployment for Citadel? Would this involve creating a community-maintained database of DIMM mappings, or do they envision an automated profiling tool that a system administrator could run? How robust is the system to partially incorrect mapping information?
-
The analytical projection of ~25% overhead for complex DRAM mappings is a significant departure from the 7.2% demonstrated. Can you provide more intuition on why the overhead increases so dramatically? Is this a hard ceiling, or could further allocator optimizations (e.g., more sophisticated placement algorithms) mitigate this? It would strengthen the paper immensely if you could provide even a single data point from an experiment on a real system known to have complex mappings to validate this analytical model.
-
Could you elaborate on the design for safely handling shared memory? You propose placing shared pages in zonelets, which seems to imply that processes sharing a page must be co-located in the same data row. How would this impact the allocator's flexibility and potentially fragment the address space for shared pages? Are there fundamental challenges beyond mere implementation complexity?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents Citadel, a memory allocator designed to provide software-based isolation against inter-domain Rowhammer (RH) exploits. The authors identify a key trade-off in prior art: fine-grained, row-level isolation schemes (e.g., ZebRAM) incur prohibitive capacity loss due to guard rows, while coarse-grained, subarray-level schemes (e.g., Siloz) suffer from a limited number of domains and excessive memory stranding.
The authors claim novelty in their two-level allocator design that aims to resolve this trade-off. The core mechanism is a new software primitive called a "reservation chunk," a configurable group of contiguous global rows. For small memory footprints (e.g., page tables, daemons), Citadel employs "zonelets," which are regions of memory striped with guard rows, functionally similar to ZebRAM. For larger footprints, it allocates memory in "zones" composed of one or more reservation chunks, amortizing the guard row cost by only placing them at the boundaries of a zone. This hybrid approach claims to support thousands of variably sized domains with modest memory overhead.
Strengths
The primary strength of this work lies not in the invention of a single new mechanism from first principles, but in the novel synthesis of existing concepts into a new, and demonstrably more practical, point in the design space for RH mitigation.
-
A Novel Intermediate Granularity: The "reservation chunk" primitive (Section 4.3, page 4) is a well-defined software abstraction that sits neatly between the hardware-dictated granularities of a single row (ZebRAM) and an entire subarray (Siloz). While creating software-defined memory chunks is not new in general memory allocation, its specific application here—as a tunable unit to balance guard-row loss against stranding for RH protection—is a novel contribution.
-
A Novel Hybrid Allocation Strategy: The two-level design (zones and zonelets, described in Sections 4.3-4.6, pages 4-5) is the paper's most significant novel idea. Prior solutions have been monolithic, applying a single strategy (either fine-grained or coarse-grained) to the entire memory space. Citadel is the first system I am aware of to propose a dynamic, hybrid approach that applies a high-cost, high-granularity strategy only where necessary (for small domains) and a low-cost, lower-granularity strategy for the bulk of memory. This targeted application of different techniques is a genuinely new approach in this specific problem domain.
-
Significant Delta Over Prior Art: The quantitative "delta" between Citadel and the closest prior works is substantial. By moving from the monolithic approaches of ZebRAM and Siloz to a hybrid model, the authors convert what were largely academic curiosities (due to extreme overheads or limitations) into a potentially deployable system. The results in Figure 9 (page 10), which show Citadel succeeding on workloads where both prior systems fail, underscore that the novelty is not merely incremental but enabling.
Weaknesses
The novelty of the work is based on combination and refinement, not fundamental invention. Therefore, the paper's claims must be carefully scoped.
-
Constituent Components are Not New: The core mechanisms used by Citadel are, in isolation, well-established. The use of guard rows for spatial isolation is the central idea of ZebRAM [41] and GuardION [68]. The concept of isolating domains in larger, physically distinct regions is the core of Siloz [45]. The "zonelet" primitive is, functionally, a reimplementation of ZebRAM's striping within a bounded region. The novelty is exclusively in the combination and the management logic, a point that could be stated more explicitly.
-
Novelty is Predicated on Complex Engineering: The entire premise of Citadel, like Siloz, relies on the ability to know or reverse-engineer the physical DRAM address mapping (Section 6.1, page 7). This makes the solution an engineering construct built atop another complex, and potentially fragile, engineering construct. While practical, it detracts from the conceptual purity of the novel allocator design, tying its fate to the continued feasibility of address mapping discovery. The paper's own analysis of complex DRAM mappings in Section 8.5 (page 12) reveals that the overhead can jump to 25%, significantly eroding the benefit that makes the design so compelling. This suggests the novelty may be less robust than presented.
Questions to Address In Rebuttal
-
The core components of your system—guard rows and coarse-grained isolation—are directly inherited from ZebRAM and Siloz, respectively. The novelty appears to be in the synthesis and the management layer that decides which strategy to apply. Could the authors clarify if there is any other element of Citadel they consider fundamentally new, beyond this novel synthesis?
-
The "reservation chunk" is proposed as a key primitive. The effectiveness of this primitive relies on the ability to form contiguous "zones" to amortize guard row overhead. Your own analysis in Section 8.5 (page 12) shows that prevalent, complex DRAM addressing schemes can increase overhead to 25% due to reduced opportunities for zone expansion. Does this not significantly weaken the novelty of your contribution by constraining its effectiveness to simpler, and perhaps less common, memory systems? How does the core idea remain novel if its primary benefit is so sensitive to underlying hardware complexities that are outside the allocator's control?
EcoCore: Dynamic Core Management for Improving Energy Efficiency in Latency-Critical Applications
Abstract
Modern data centers face increasing pressure to improve energy efficiency while guaranteeing Service Level Objectives (SLOs) for Latency-Critical (LC) applications. Resource management in public cloud environments, typically operating at the node or ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose EcoCore, a dynamic core management system for latency-critical (LC) applications. The central claim is that jointly managing core allocation for both application threads (T) and network packet processing (P), along with adaptively tuning packet processing intervals (I), leads to significant energy efficiency gains without violating Service Level Objectives (SLOs). The system relies on a lightweight, tree-based regression model to predict performance and energy consumption, guiding a greedy exploration policy to select configurations. The authors evaluate EcoCore against several baselines, including static allocation and state-of-the-art policies like CARB and Peafowl, across multiple workloads and platforms, including AWS.
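A sketch of the kind of model-guided greedy search described in the summary appears below. The predictor stub, neighbour set, and scoring form (including the SLO guard and the weight w) are assumptions for illustration; EcoCore's actual predictor is an online-trained tree regressor and its policy details differ.

```python
# Illustrative greedy exploration over (T, P, I) configurations (assumed model and policy).
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    app_cores: int        # T
    net_cores: int        # P
    interval_us: int      # packet processing interval (I)

def predict(cfg: Config) -> tuple:
    """Stand-in for the learned model: returns (predicted_p99_us, predicted_watts)."""
    p99 = 400 / cfg.app_cores + 150 / cfg.net_cores + 0.5 * cfg.interval_us
    watts = 4.0 * (cfg.app_cores + cfg.net_cores) + 20.0 / cfg.interval_us
    return p99, watts

def score(cfg: Config, slo_us: float, w: float = 0.5) -> float:
    p99, watts = predict(cfg)
    if p99 > slo_us:
        return float("inf")                    # SLO violations are never chosen
    return w * watts + (1 - w) * (p99 / slo_us)

def greedy_step(cfg: Config, slo_us: float) -> Config:
    neighbours = [
        Config(max(1, cfg.app_cores + dt), max(1, cfg.net_cores + dp),
               max(10, cfg.interval_us + di))
        for dt in (-1, 0, 1) for dp in (-1, 0, 1) for di in (-20, 0, 20)
    ]
    return min(neighbours, key=lambda c: score(c, slo_us))

cfg = Config(app_cores=8, net_cores=4, interval_us=50)
for _ in range(5):
    cfg = greedy_step(cfg, slo_us=500.0)
print(cfg)   # fewer cores / longer interval while the predicted P99 stays under the SLO
```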
Strengths
-
Sound Motivation: The paper correctly identifies a gap in existing core management literature. The initial investigation in Section 3 provides a clear and compelling motivation for considering network packet processing not as a secondary effect but as a primary factor in core idleness and energy consumption. The insight that co-managing intervals and core counts can unlock further savings (Insight-3, page 5) is a valid hypothesis.
-
Comprehensive Workloads: The evaluation is conducted across four distinct and relevant LC applications (memcached, nginx, grpc-bench, mongodb), lending some generality to the findings.
Weaknesses
My primary concerns with this paper lie in the methodological rigor of the evaluation and the robustness of the proposed control system. The claims of superiority are not, in my view, substantiated with the necessary level of evidence.
-
Unsupported Energy Claims in Cloud Environments: The energy efficiency results from the AWS evaluation (Section 5.3, page 10) are methodologically unsound. The authors state that direct measurement is not permitted and instead rely on a "state-based power model" using generic power values for C-states (e.g., CC0=4W, CC6=0.1W). This is a critical flaw for several reasons:
- Hardware Abstraction: These power values are not specific to the AWS m5zn instances used. Actual power draw is highly dependent on the specific CPU microarchitecture, platform, and uncore components (e.g., memory controller, LLC), which are not accounted for.
- Ignoring Active Power: The model only seems to account for idle state power, but changing the number of active cores and packet processing intensity fundamentally alters active power consumption (P-states, memory traffic, etc.), which is ignored.
- Unverifiable Claims: The headline claim of "additional energy savings of up to 35.8%" (Abstract, page 1) is derived from this model, not measurement. It is an estimation based on a series of unverified assumptions, not an empirical result. As such, it cannot be accepted as a factual finding.
-
Sub-Optimal Exploration Strategy: The Explorer component (Section 4.3, page 7) employs a greedy tree-based search to navigate the configuration space. This is a heuristic approach that provides no guarantee of finding an optimal, or even near-optimal, solution.
- The paper claims to solve the problem of a "vast search space," but the greedy approach simply prunes that space aggressively. It is highly susceptible to converging to a local minimum. For example, a configuration that requires simultaneously increasing T while decreasing I might never be reached if each individual step appears suboptimal to the scoring function.
- The "Dynamic Scaling Unit Management" policy (Equation 4, page 8) appears overly simplistic and potentially unstable. The
floor(P99/SLO) logic could cause large, jerky changes in core counts, leading to performance oscillations, especially under bursty workloads.
-
Questionable Fairness of Baselines: The comparison to Peafowl (Section 5.1, page 9) is suspect. The authors state they "re-implemented Peafowl as a user-space daemon" to broaden its applicability. The original Peafowl was a more integrated system. It is not clear that this re-implementation is faithful to the original or if it performs optimally. Any performance deficit observed could be an artifact of this specific implementation rather than a fundamental limitation of the Peafowl approach. A robust comparison requires either using the original authors' artifact or providing a detailed validation of the re-implementation.
-
Lack of Statistical Rigor: The majority of the results presented in the figures (e.g., Figure 13, Figure 14) lack error bars or confidence intervals. Given the inherent variability in network workloads and system performance, reporting only mean or point values is insufficient. The claimed improvements (e.g., 11.7% on average) may not be statistically significant if the run-to-run variance is high. The absence of this analysis prevents a rigorous assessment of the results.
Questions to Address In Rebuttal
The authors must address the following points directly to salvage this submission:
-
On the AWS Power Model: Please provide a sensitivity analysis for your power model. How do the claimed energy savings in Section 5.3 change if the power values for each C-state are incorrect by ±25% or ±50%? Better yet, can you justify why this abstract model is a valid proxy for real energy consumption on the complex, multi-tenant hardware used by AWS? Without this, all claims related to the AWS experiment should be removed.
-
On the Exploration Heuristic: How can you be sure your greedy explorer does not get trapped in a poor local minimum? Please provide evidence, perhaps from an exhaustive search over a smaller, tractable sub-space, that demonstrates how close your heuristic's chosen configurations are to the true optimal configurations.
-
On the Peafowl Baseline: Please provide a detailed validation of your Peafowl re-implementation. How does its performance (latency, throughput) compare to the results published in the original Peafowl paper under similar conditions? Without this, the comparison is not credible.
-
On the Scoring Function: The scoring function weight
w is a critical hyperparameter, stated to be between 0.4 and 0.6. How was this range determined? Please provide data showing the system's performance and energy savings when w is varied outside this range (e.g., 0.2, 0.8) to demonstrate the sensitivity of the system to this choice.
-
On Stability: The dynamic scaling unit (Equation 4) appears reactive and potentially unstable. Can you provide a plot showing the number of cores (T and P) and the chosen interval over a long period of time for a stable workload? This would demonstrate whether the system converges to a steady state or oscillates continuously.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents EcoCore, a dynamic core management system designed to improve the energy efficiency of dedicated instances running latency-critical (LC) applications without violating their Service Level Objectives (SLOs). The authors' core thesis is that existing dynamic resource managers are fundamentally incomplete because they overlook the significant impact of network packet processing on both performance and energy consumption. They argue that to effectively manage energy, one must treat application threads and network packet processing as two distinct, yet interconnected, workloads.
EcoCore advances this thesis by proposing a system that co-manages three distinct control knobs: 1) the number of cores allocated to application threads, 2) the number of cores allocated to network packet processing (via RSS/XPS), and 3) the network packet processing interval (i.e., interrupt coalescence). The system uses a lightweight, online-trained predictive model to navigate the vast configuration space, identifying settings that reduce energy consumption by maximizing core residency in deep sleep states (C-states), while ensuring the tail latency remains below the SLO. The authors validate EcoCore through extensive experiments, including on a 64-core server and the AWS public cloud, demonstrating significant energy savings (avg. 11.7%, up to 20.3%) over state-of-the-art approaches.
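The role of deep C-state residency can be illustrated with a small energy calculation. The per-state power values below are assumptions of the kind a state-based model would use (they mirror figures quoted elsewhere in these reviews) and are not measurements of the evaluated hardware.

```python
# Illustrative arithmetic: why deep C-state residency drives per-core energy (assumed values).
def core_energy_joules(residency: dict, seconds: float) -> float:
    power_w = {"C0": 4.0, "C1": 1.5, "C6": 0.1}    # assumed per-state power
    assert abs(sum(residency.values()) - 1.0) < 1e-6
    return sum(residency[s] * power_w[s] for s in residency) * seconds

# Frequent NIC interrupts keep a core shallowly idle...
interrupt_bound = {"C0": 0.30, "C1": 0.60, "C6": 0.10}
# ...while coalescing packets and consolidating work lets it reach C6.
consolidated   = {"C0": 0.30, "C1": 0.10, "C6": 0.60}

e1 = core_energy_joules(interrupt_bound, seconds=1.0)    # ~2.11 J
e2 = core_energy_joules(consolidated, seconds=1.0)       # ~1.41 J
print(f"savings per core-second: {(e1 - e2) / e1:.0%}")  # ~33%
```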
Strengths
-
A Salient and Well-Motivated Core Insight: The paper's primary strength lies in its clear-eyed identification of a critical gap in the field. While the challenges of managing LC workloads are well-documented, the community has largely bifurcated into two camps: those focusing on co-location for utilization (e.g., Parties, Heracles) and those focusing on application-level core scaling for energy (e.g., CARB). This paper compellingly argues that for dedicated LC instances—a common and important deployment model—network packet processing is not a secondary effect but a primary driver of energy inefficiency. The motivational experiments in Section 3 (pages 3-5) are excellent, clearly demonstrating how frequent network interrupts prevent cores from entering deep sleep states and how uni-dimensional policies fail to capture this.
-
Synthesizing Disparate Control Dimensions: The true novelty of EcoCore is its holistic approach. It effectively synthesizes concepts from two previously separate domains: systems-level core allocation and network-stack tuning. By creating a unified control plane for application cores, packet cores, and packet intervals, the authors have framed a more complete and accurate model of the problem space. This multi-dimensional control allows EcoCore to find optimization points that are inaccessible to other systems, such as trading a slightly shorter packet interval (worse for energy) to reduce latency, thereby creating enough SLO headroom to shut down an entire application core (a major energy win). This is a sophisticated and powerful perspective.
-
Strong and Practical Evaluation: The experimental validation is thorough and convincing. The authors not only test against multiple representative LC applications but also demonstrate scalability on a 64-core NUMA system and, most importantly, practicality in a public cloud environment (AWS). The AWS evaluation (Section 5.3, page 10), though reliant on a power model, is critical for demonstrating the work's relevance beyond the lab. It shows that even with the abstractions and constraints of virtualization, the core principles hold true. The dynamic load analysis (Figure 19, page 11) is particularly strong, as it shows the system adapting its multi-dimensional policy in real-time.
-
Pragmatic Design: The system is designed for real-world adoption. By not requiring any application modifications and leveraging standard kernel interfaces like cgroups and
ethtool, the authors have significantly lowered the barrier to entry. This positions EcoCore not as a radical, disruptive technology but as an intelligent control layer that could plausibly be integrated into cloud management platforms or run as a sidecar daemon by sophisticated users.
Weaknesses
-
Insufficient Contextualization with Kernel-Bypass/User-Space Networking: While the related work section (Section 6, page 13) mentions user-space networking stacks (e.g., IX, mTCP), the paper misses an opportunity to deeply contrast its philosophical approach. User-space networking often achieves low latency at the cost of energy efficiency (e.g., via busy-polling), representing one end of the design spectrum. EcoCore represents the other: working with the kernel to maximize efficiency while maintaining "good enough" performance. A more explicit discussion of this trade-off would better situate EcoCore within the broader landscape of high-performance networking and clarify the specific niche it aims to fill. Is EcoCore the answer for mainstream LC apps, while user-space stacks are for ultra-low-latency financial trading?
-
The Controller's Complexity vs. Contribution: The paper presents a control loop with an online-trained regression model and a greedy search explorer. While this is a sound engineering approach, the evaluation doesn't fully explore whether this level of complexity is necessary. The core contribution is identifying the three control knobs; the mechanism to tune them is secondary. A comparison against a simpler, well-tuned heuristic controller (e.g., a simple feedback loop based on PID principles) would strengthen the claim that the predictive model is essential and not just an implementation detail.
-
Potential for Interaction with Cloud Provider Policies: The AWS experiments are a strength, but they implicitly assume the hypervisor is a static environment. In reality, cloud providers have their own complex, host-level resource managers that may perform CPU frequency scaling, power capping, or transparent VM migration. There is a potential for adversarial interactions where EcoCore's guest-level decisions fight against the provider's host-level policies. Acknowledging and briefly discussing this limitation would add nuance and demonstrate a broader systems awareness.
Questions to Address In Rebuttal
-
Could the authors please elaborate on the fundamental trade-offs between EcoCore's kernel-integrated approach and the kernel-bypass philosophy of user-space networking stacks, particularly from the perspective of the energy/performance spectrum?
-
The paper's central contribution is the insight to co-manage the three specified dimensions. How critical is the chosen machine learning-based predictor and explorer to the system's success? Could a significant portion of the benefits be achieved with a simpler, non-learning-based heuristic controller, and if not, why?
-
The scoring function weight
w (Section 4.3, page 7) is a key parameter that balances energy and latency priorities. The paper states it is set "between 0.4 and 0.6 depending on the application." Could you provide more insight into how this value is determined? Is it set manually per application, or is there a methodology to derive it?
-
Regarding the public cloud deployment, have the authors considered the potential for negative interactions between EcoCore's guest-level resource management and the opaque, host-level management policies of the cloud provider? Does this represent a potential threat to the stability or effectiveness of the proposed system in a production environment?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper proposes EcoCore, a dynamic core management system designed to improve energy efficiency for latency-critical (LC) applications without violating Service Level Objectives (SLOs). The authors identify that prior work in dynamic core allocation often overlooks the energy impact of network packet processing. The central claim of novelty is a system that jointly and dynamically manages three parameters: the number of cores allocated to application threads (T), the number of cores for network packet processing (P), and the network packet processing interval (ITR). This three-dimensional optimization is guided by a lightweight, online-trained predictive model that estimates latency and energy, coupled with a greedy, tree-based search policy to navigate the vast configuration space.
Strengths
The primary strength of this paper lies in its identification and synthesis of a new, multi-dimensional optimization space.
-
Novelty in Synthesis: While the individual components of the proposed solution have antecedents in prior art, the holistic co-management of application cores, packet processing cores, and packet processing intervals within a single, unified framework appears to be novel. Prior works have typically focused on a subset of these dimensions:
- CARB [60] focuses on application core allocation (T).
- Peafowl [4] considers application and packet reception cores (T and a part of P).
- DynSleep [14] introduced the concept of delaying packet delivery to extend sleep states, which is conceptually identical to managing the packet processing interval (ITR).
- NMAP [31] managed packet processing modes but not dynamic core allocation.
EcoCore's contribution is the integration of these three control knobs into a single policy, arguing—with convincing empirical evidence (e.g., Figure 6, page 5)—that optimizing them jointly unlocks energy savings that are inaccessible when managing them in isolation.
-
Clear Delineation of the Problem Space: The authors do an excellent job in Section 3 (pages 3-4) of demonstrating empirically why both packet processing core allocation (P) and interval (ITR) are first-order factors in energy consumption, not just application core allocation (T). This foundational analysis crisply motivates the need for the proposed multi-dimensional approach, establishing the conceptual ground on which their novel synthesis is built.
Weaknesses
My critique focuses on the degree of novelty of the constituent parts and the positioning against the most relevant prior art.
-
Constituent Ideas are Evolutionary, Not Revolutionary: The core ideas underpinning EcoCore are extensions of existing concepts.
- The idea of managing the packet processing interval for energy savings is not new. As the authors cite, DynSleep [14] proposed delaying packet delivery for exactly this reason. EcoCore's novelty here is the generalization of this concept to modern NICs with parallel packet processing capabilities (RSS/XPS) using standard kernel interfaces (`ethtool`), whereas DynSleep was more limited. This is a significant engineering advancement but an evolutionary conceptual step.
- The co-management of application and network processing resources for performance and energy has also been explored. Peafowl [4] managed application and packet reception cores. More pointedly, IX [9] proposed a system that integrates packet processing and application threads, using dynamic core allocation to balance performance and energy. The core concept of treating network processing as a resource to be co-managed with the application for energy efficiency is therefore not de novo.
-
Insufficient Comparison with the Closest Prior Art: The experimental evaluation (Section 5.2, page 9) is missing a direct comparison to what I consider one of the most conceptually similar systems: IX [9]. While IX is mentioned in the Related Work section (Section 6, page 13), the authors dismiss it as a polling-based userspace network stack. However, IX explicitly addresses the co-allocation of cores between application and packet processing to manage energy. The goal is identical, even if the mechanism (polling vs. interrupt moderation) differs. A lack of quantitative comparison against such a closely related predecessor weakens the paper's claims about its advancement over the state-of-the-art. The current evaluation compares EcoCore against systems that manage a strict subset of its dimensions (e.g., CARB, Peafowl), making its victory somewhat predictable.
-
The Optimization Mechanism is Standard: The novelty of the paper lies in the policy space (what to control), not the mechanism (how to control it). The use of a tree-based greedy search guided by a Gradient Boosting Regressor is a standard machine-learning-in-systems approach. While effective, it does not represent a novel algorithmic contribution to optimization or system control.
Questions to Address In Rebuttal
-
The core novelty appears to be the synthesis of three previously disparate control knobs (T, P, ITR). Can the authors clarify if their contribution is primarily this integration, or if there is a more fundamental insight beyond demonstrating that "managing more things is better"? Please explicitly differentiate the conceptual advance over a hypothetical system that combines the ideas from DynSleep [14] and Peafowl [4].
-
The most critical question: Why was IX [9] not included in the experimental evaluation in Section 5? Given that IX also performs dynamic core allocation for both application logic and integrated packet processing with an explicit goal of energy efficiency, it seems to be the most relevant baseline for evaluating the co-management of T and P. A convincing rebuttal must justify this omission or acknowledge it as a limitation.
-
The proposed Explorer uses a greedy search, which does not guarantee optimality. Given the added complexity of a third optimization dimension (ITR), how can the authors provide confidence that their method finds solutions close to the true optimum? Was any form of offline analysis or exhaustive search on a constrained problem space performed to quantify the optimality gap of the greedy approach? This is important for assessing whether the added complexity of the third dimension is being effectively harnessed.
Flexing RISC-V Instruction Subset Processors to Extreme Edge
Abstract
This paper presents an automated approach for designing processors that support a subset of the RISC-V instruction set architecture (ISA) for a new class of applications at Extreme Edge. The electronics used in extreme edge applications must be area and ...
Reviews
Review 1
Review Form:
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper presents a methodology for automatically generating customized, single-cycle RISC-V processors, termed RISSPs (RISC-V Instruction Subset Processors). The core idea is to treat each instruction as a discrete, pre-verified hardware block. For a given application, the required instruction blocks are selected from a library and stitched together, with a standard synthesis tool performing the final optimization. The authors evaluate this approach for a newly defined application class called "Extreme Edge," implementing the resulting processors as flexible integrated circuits (FlexICs). The paper claims significant area and power reductions compared to a full-ISA processor generated with the same methodology, and superior energy efficiency compared to the Serv processor. A Generative AI-based tool is proposed to handle software updates for these subset processors.
While the paper addresses an interesting application space, the work suffers from questionable novelty, a flawed experimental evaluation based on weak and inappropriate baselines, and a proposed software update solution that is critically underdeveloped and unreliable.
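As an illustration of the flow described in the summary, the sketch below selects one pre-verified RTL block per required instruction and emits a synthesis filelist plus a trivial top-level wrapper. The library layout, file naming, and wrapper interface are invented for illustration and are not the authors' RISSP generator.

```python
# Hypothetical sketch of the "select blocks and stitch" flow summarized above.

from pathlib import Path

def generate_rissp(required_opcodes, library_dir, out_dir):
    library, out = Path(library_dir), Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    # 1. Select one pre-verified RTL block per required instruction.
    blocks = []
    for op in sorted(required_opcodes):
        rtl = library / f"{op}_block.v"          # e.g. addi_block.v (assumed naming)
        if not rtl.exists():
            raise ValueError(f"no verified block for instruction '{op}'")
        blocks.append(rtl)

    # 2. Emit a synthesis filelist; the synthesis tool is then responsible for
    #    merging redundant logic (e.g. adders shared across blocks).
    (out / "rissp.f").write_text("\n".join(str(b) for b in blocks) + "\n")

    # 3. Emit a simplistic top-level wrapper instantiating each block.
    insts = "\n".join(
        f"  {b.stem} u_{b.stem} (.clk(clk), .instr(instr), .bus(bus));"
        for b in blocks)
    (out / "rissp_top.v").write_text(
        "module rissp_top(input clk, input [31:0] instr, inout [255:0] bus);\n"
        f"{insts}\nendmodule\n")

# Example: generate_rissp({"addi", "add", "lw", "sw", "beq"}, "blocks", "build")
```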
Strengths
- Physical Implementation: The physical layout and analysis of the processors on a 0.6µm IGZO-based FlexIC process (Section 4.3, Figure 10) is a commendable strength. It grounds the synthesis results in a real-world technology, demonstrating that the proposed designs are physically realizable within the target domain.
- Verification-Centric Approach: The concept of a pre-verified, instruction-level hardware library (Section 3.4.1) is methodologically sound for reducing the verification effort of the final integrated core. By ensuring correctness at the block level, the authors rightly simplify a major bottleneck in processor design.
Weaknesses
-
Misleading Novelty Claims: The proposed methodology is presented as a novel approach but is, in essence, a simplified version of a standard additive Application-Specific Instruction Processor (ASIP) design flow. The core steps—profiling an application to identify a required instruction subset (Step 1) and composing hardware blocks for that subset (Step 2)—are foundational to ASIP design. Attributing "Redundancy removal by Synthesis tools" (Figure 2) as a step in their methodology is misleading; this is a standard feature of any synthesis tool, not a contribution of their flow. The work appears to re-brand established techniques without sufficient acknowledgment or differentiation.
-
Fundamentally Flawed Performance and Efficiency Comparisons: The central claims of the paper rest on two deeply problematic comparisons:
- Weak Internal Baseline: All percentage savings (e.g., "8-to-43% reduction in area") are relative to "RISSP-RV32E," a full ISA implementation generated by the authors' own flow. The quality and efficiency of this baseline are never established. Without benchmarking this baseline against other well-regarded, area-optimized RISC-V cores (e.g., PicoRV32), there is no way to know if the claimed savings are meaningful or simply the result of trimming down an inefficient baseline design.
- Inappropriate External Baseline: The energy efficiency comparison to Serv ("~40 times more energy efficient," Section 4.2.4, Figure 9) is invalid. Serv is a bit-serial processor with a high Cycles Per Instruction (CPI) of ~32, whereas the authors' RISSPs are single-cycle (CPI=1). Comparing Energy Per Instruction (EPI) is an apples-to-oranges comparison, as a single instruction on the RISSP accomplishes significantly more work than a single, multi-cycle instruction on Serv. A valid comparison would require measuring total energy to complete a specific task, not comparing a manipulated per-instruction metric.
-
Vague Application Space and Unsubstantiated Generalizations: The concept of "Extreme Edge" is poorly defined and supported by only three exemplars (armpit, af_detect, xgboost). The conclusion that applications in this domain only use "24-86% of the full RISC-V ISA" (Section 4.1) is a generalization based on an insufficient and potentially cherry-picked sample. The analysis in Figure 5 only shows the number of distinct instructions, which is a poor proxy for determining an optimal instruction subset. It completely ignores dynamic instruction frequency, which is critical for performance and energy optimization.
-
Unreliable and Undeveloped AI-based Retargeting: The proposed solution for software updates via a Generative AI tool (Section 5) is technologically immature and introduces unacceptable reliability risks. Relying on an LLM like ChatGPT to correctly rewrite assembly code is fundamentally unsound for hardware deployment. The verification method described—"functionally verified... with custom test cases"—is wholly inadequate for proving the correctness of instruction semantics across all corner cases and input values. This approach ignores decades of research in formal methods and compiler verification, replacing it with a probabilistic tool. Furthermore, the admitted code size increases of up to 36% (Figure 12) represent a severe penalty that is largely downplayed.
Questions to Address In Rebuttal
-
Please clarify the novelty of the RISSP generation methodology in contrast to existing additive ASIP design flows. Specifically, what part of the flow, beyond using pre-verified blocks, is a novel contribution?
-
The "RISSP-RV32E" baseline is central to your PPA savings claims. Please provide characterization data (e.g., gate count, max frequency) for this baseline against at least two well-known, open-source, area-optimized 32-bit RISC-V cores to validate its competitiveness.
-
Please justify the EPI comparison against the bit-serial Serv processor. To support the "40 times more energy efficient" claim, provide a full-task energy consumption comparison (running an identical benchmark to completion) between your smallest RISSP and Serv.
-
The proposed AI-based code retargeting (Section 5) lacks the rigor required for hardware. What formal methods, if any, are used to guarantee that the LLM-generated macros are semantically equivalent to the original instructions for all possible operand values and machine states? How do you manage verification for complex instructions with subtle side effects?
-
The instruction subset selection is based on a static count of distinct instructions. Have you performed a dynamic instruction mix analysis for your benchmarks? Please provide data on instruction frequency to demonstrate that the statically-chosen subsets are indeed optimal from a performance and energy perspective.
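As a point of reference for the last question, a dynamic instruction-mix analysis can be a few lines of post-processing over an ISA-simulator trace. The sketch below assumes a hypothetical whitespace-separated trace format with the mnemonic in a fixed column; it is not tied to any tool the authors use.

```python
# Hypothetical sketch: count executed mnemonics from an ISA-simulator trace
# (one instruction per line, mnemonic in a fixed column).

from collections import Counter

def dynamic_instruction_mix(trace_path, mnemonic_column=2):
    counts = Counter()
    with open(trace_path) as trace:
        for line in trace:
            fields = line.split()
            if len(fields) > mnemonic_column:
                counts[fields[mnemonic_column]] += 1
    total = sum(counts.values())
    return {op: n / total for op, n in counts.most_common()}

# A heavily skewed mix (a few opcodes dominating execution) would argue for
# optimizing those blocks for energy, not merely including every statically
# present opcode.
```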
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents a compelling, holistic vision for computing in the "Extreme Edge" domain, a class of applications characterized by extreme cost sensitivity, conformability, and often disposability (e.g., smart labels, wearable patches). The authors argue that conventional silicon is ill-suited for this domain and propose flexible electronics (FlexICs) as the enabling technology.
The core contribution is not merely a small RISC-V processor, but a complete, automated methodology for generating application-specific RISC-V Instruction Subset Processors (RISSPs). This methodology is built upon a novel concept: a library of pre-verified, discrete hardware blocks, where each block implements a single RISC-V instruction. A custom processor is automatically constructed by identifying the instructions required by a target application, pulling the corresponding blocks from the library, and stitching them together. This "compose-from-verified" approach fundamentally reduces design and verification time. The paper evaluates this by generating RISSPs for several applications, demonstrating significant area and power savings on FlexIC technology compared to a full-ISA processor and superior energy efficiency compared to the state-of-the-art small core, Serv. Finally, the paper proposes a forward-looking Generative AI-based solution to handle software updates for long-lasting applications on these constrained hardware targets.
Strengths
-
Excellent Problem Formulation and Contextualization: The paper does a superb job of defining and motivating the "Extreme Edge" computing paradigm (Section 1, page 1). By classifying applications into short-lived and long-lasting categories and identifying their unique requirements (ultra-low cost, conformability, sustainability), the authors establish a clear and convincing need for a new approach to hardware design. This work is not a solution in search of a problem; it is a direct and well-argued answer to a nascent but potentially enormous market.
-
Novel and Pragmatic Design Methodology: The central idea of an "instruction hardware block" library (Section 3.1, page 4) is powerful. It shifts the processor design paradigm from a monolithic "design-then-verify" cycle to a modular "compose-from-verified" flow. This has the potential to democratize custom hardware design, much like standard cell libraries did for logic design. By integrating formal verification at the block level (Step 0, Figure 2), the methodology significantly lowers the barrier to creating reliable, bespoke processors, which is perfectly aligned with the need for rapid, low-cost customization enabled by FlexIC technology.
-
A True System-Level Contribution: The paper's strength lies in its synthesis of ideas across multiple domains. It seamlessly connects an application domain (Extreme Edge), a manufacturing technology (FlexICs), a processor architecture (RISC-V), and a design automation methodology (the RISSP generator). This holistic perspective is rare and highly valuable. The authors have not just designed a processor; they have architected a complete workflow from application concept to physical implementation for a new class of electronics.
-
Forward-Thinking Approach to Software Evolution: The software update problem for subset ISAs is a well-known challenge. The proposed Generative AI-based code retargeting framework (Section 5, page 10) is a creative and highly relevant solution. Instead of attempting complex compiler modifications, the authors propose a post-compilation transformation step using LLMs. This is a pragmatic acknowledgment of the challenges of toolchain modification and an insightful application of modern AI techniques to solve a classic computer architecture problem. It points toward a future where software can be fluidly adapted to constrained, bespoke hardware.
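To make the retargeting idea concrete, the sketch below shows the kind of instruction-to-macro rewriting such a framework targets, written as a fixed, hand-coded expansion of `mul` into RV32I base instructions rather than an LLM output. The free scratch register, the clobbering of source registers, and the label naming are assumptions for illustration only.

```python
# Illustration of instruction-to-macro rewriting: replace `mul rd, rs1, rs2`
# with a shift-and-add loop using only RV32I base instructions (correct for
# the low 32 bits of the product). Assumes a free scratch register t0 and
# that rs1/rs2 may be clobbered; labels must be uniquified per expansion site.

def expand_mul(rd, rs1, rs2, label="__mul0"):
    return [
        f"  li   {rd}, 0",
        f"{label}_loop:",
        f"  andi t0, {rs2}, 1",
        f"  beq  t0, zero, {label}_skip",
        f"  add  {rd}, {rd}, {rs1}",
        f"{label}_skip:",
        f"  slli {rs1}, {rs1}, 1",
        f"  srli {rs2}, {rs2}, 1",
        f"  bne  {rs2}, zero, {label}_loop",
    ]

print("\n".join(expand_mul("a0", "a1", "a2")))
```

A one-instruction `mul` becomes roughly eight static instructions plus loop iterations at runtime, which is the source of the code-size and dynamic-instruction overheads raised in the questions below.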
Weaknesses
My criticisms are not of the work's core validity, but rather of its unaddressed scope and future implications, which I encourage the authors to consider.
-
Limited Microarchitectural Exploration: The methodology is demonstrated on single-cycle, non-pipelined processors. This is perfectly adequate for the target kHz-range performance. However, the paper does not discuss how the "instruction hardware block" concept might scale to more complex microarchitectures. Would a pipelined implementation require fundamentally different, stage-specific blocks? The beauty of the current approach is its simplicity; its extensibility to higher-performance designs, which future Extreme Edge applications may require, is an open question.
-
The Broader Tooling Ecosystem: While the paper elegantly sidesteps the need for a custom compiler, a processor is more than its RTL. The debugging and performance profiling experience on a RISSP would be non-standard. For example, a debugger stepping through code would encounter only valid instructions, but the developer might not immediately know the complete set of supported opcodes. The work could be strengthened by briefly discussing the implications for the broader software development and debug ecosystem.
-
Comparison with Configurable Cores: The comparison against a full-ISA baseline generated by the same methodology and the bit-serial Serv core is well-justified. However, the RISC-V landscape is rich with configurable open-source cores (e.g., VexRiscv, PicoRV32) that allow features/instructions to be enabled/disabled at synthesis time via configuration flags. A discussion of how the proposed bottom-up, block-stitching approach compares philosophically and practically (in terms of final PPA and design effort) to these top-down, configurable cores would add valuable context.
Questions to Address In Rebuttal
-
Regarding the Generative AI retargeting framework (Section 5), could the authors comment on the potential overhead? Specifically, what is the typical code size expansion and performance penalty observed when translating complex instructions into macros of simpler ones? While the feasibility is demonstrated, understanding the trade-offs is crucial.
-
Could you elaborate on the process and effort required to add a custom instruction to the pre-verified hardware library? While standard RISC-V instructions have clear semantics for verification, defining and formally verifying novel, application-specific instructions seems like it would remain a significant, non-recurring engineering effort for users.
-
How does the methodology envision handling shared hardware resources that are more complex than a simple ALU (e.g., a multiplier/divider unit) that might be used by several instructions? The current approach lets the synthesis tool find and optimize shared logic, but would it be more efficient to have a library of shared "functional unit blocks" in addition to "instruction blocks"?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents a methodology for automatically generating RISC-V Instruction Subset Processors (RISSPs) tailored for "Extreme Edge" applications, particularly on flexible substrates (FlexICs). The central thesis is a design automation flow that treats individual RISC-V instructions as discrete, formally pre-verified hardware blocks. These blocks are selected based on application analysis and "stitched" together to form a custom processor core (ModularEX). The authors claim this methodology reduces design and verification time by building a processor from a library of trusted components, offloading complex optimization to standard synthesis tools. A secondary contribution is a Generative AI-based framework for retargeting application code to these subset processors, aiming to circumvent the need for custom compiler backends.
My review focuses exclusively on the novelty of these contributions relative to the state of the art in processor design and automation.
Strengths
The paper's novelty does not lie in the concept of application-specific or subset processors, which is a well-established field (e.g., ASIPs). Instead, the novelty is found in the specific methodology proposed:
-
Novelty in Microarchitectural Abstraction: The core novel idea is the "instruction-as-a-block" microarchitectural template. Traditional processor design focuses on creating a unified, shared datapath (ALU, shifters, register ports) that is controlled by a decoder. This work proposes a conceptually different approach: decomposing the processor into independent, self-contained hardware modules, one for each instruction. While modular design itself is not new, applying it at the fine granularity of an instruction is a distinct and novel take. This shifts the design paradigm from optimizing a shared datapath to composing pre-verified, standalone functional units.
-
Verification-Centric Generation Flow: The tight integration of formal verification at the instruction-block level (Step 0, Figure 2, page 3) is a significant and novel aspect of the methodology. In most design flows, verification follows design. Here, verification precedes composition. By creating a library of formally verified blocks, the methodology transforms the processor generation problem into a system integration problem, where the components are already trusted. This "correct-by-construction" philosophy, applied at this scale, represents a novel approach to reducing the verification burden of custom processors.
-
Creative Application of Generative AI: The use of an LLM to solve the software retargeting problem (Section 5, page 10) is a timely and novel contribution. The classic challenge for any non-standard ISA (including subsets) is the software toolchain. Modifying a compiler backend (like GCC or LLVM) is a monumental task. The proposed solution—using an LLM to translate unsupported instructions into macros of supported ones—is an inventive workaround that leverages a cutting-edge technology to solve a decades-old problem in hardware specialization.
Weaknesses
The primary weaknesses relate to the depth of the novelty claim and its positioning against conceptually adjacent prior art.
-
Insufficient Differentiation from High-Level Synthesis (HLS): The proposed flow is conceptually similar to HLS-based processor generation. One could describe the semantic behavior of each instruction in a high-level language (like C++/SystemC), use an HLS tool to generate RTL for each, and then compose them. The paper fails to articulate the novel delta between its "pre-verified RTL block" approach and an HLS-based flow. Is the key difference simply the choice of RTL as the source language? A more rigorous comparison is needed to cement the novelty of the proposed methodology over existing hardware generation techniques.
-
Implicit Trust in Synthesis as a "Magic Bullet": The methodology's elegance hinges on the assumption that a standard synthesis tool can effectively identify and merge redundant logic from the collection of disparate instruction blocks to form an efficient, shared datapath. While the results are promising, the paper treats this critical step as a black box. The novelty of the decomposition strategy is undermined if the synthesis tool is simply reconstructing a traditional datapath that a human would have designed in the first place. The work would be stronger if it analyzed the synthesized netlist to show how a shared ALU, for example, was formed from the adders present in the `add`, `addi`, `sub`, and branch instruction blocks. Without this, it's unclear if the methodology is truly novel or just a circuitous route to a conventional result.
-
Novelty of GenAI Approach Tainted by Practicality Concerns: The GenAI retargeting framework, while novel, demonstrates a significant flaw: a code size increase of up to 36% (`af_detect` in Figure 12, page 11). For the target domain of extreme edge devices, memory is often the dominant constraint on cost, area, and power. A 36% increase in code size is likely a non-starter for many real-world applications. Furthermore, the paper does not quantify the dynamic instruction count overhead, which directly impacts performance and energy consumption. The novelty of the approach is diminished if its practical application is limited by such substantial overheads.
Questions to Address In Rebuttal
-
Please clarify the novelty of the "instruction-as-a-block" methodology compared to established High-Level Synthesis (HLS) flows for processor generation. What fundamental advantage does designing and maintaining a library of instruction-level RTL blocks offer over describing instruction semantics in a higher-level language and using HLS to generate the hardware?
-
Can the authors provide a post-synthesis analysis of one of the RISSP designs? Specifically, can you demonstrate that the synthesis tool successfully identified common sub-structures (e.g., an adder, a comparator) across multiple independent instruction hardware blocks and merged them into a single, shared resource, akin to a traditional ALU? This analysis is critical to validating that the proposed decomposition/recomposition flow is an efficient and novel design method.
-
The Generative AI code retargeting resulted in a 36% code size increase for one application. Given that the target domain is highly sensitive to memory footprint, how do the authors justify this approach as a practical solution? Was the impact on runtime (i.e., total dynamic instructions executed) evaluated? A novel solution must also be viable; please address this practicality gap.
ReGate: Enabling Power Gating in Neural Processing Units
Abstract
The energy efficiency of neural processing units (NPU) plays a critical role in developing sustainable data centers. Our study with different generations of NPU chips reveals that 30%–72% of their energy consumption is contributed by static power ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present ReGate, a hardware-software co-designed system for enabling fine-grained power gating in Neural Processing Units (NPUs) to combat static power dissipation. The paper first motivates the work with a characterization study, using a proprietary simulator, that identifies significant static power consumption across various NPU components (SAs, VUs, SRAM, etc.). The proposed solution, ReGate, applies different power-gating strategies to different components: a hardware-managed, cycle-level approach for Systolic Arrays (SAs); hardware-based idle detection for HBM and ICI controllers; and a software-managed approach for Vector Units (VUs) and SRAM, enabled by an ISA extension (setpm). The authors claim that ReGate can reduce NPU energy consumption by up to 32.8% (15.5% on average) with negligible performance overhead (<0.5%) and modest hardware area overhead (<3.3%).
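For intuition about the systolic-array mechanism, the sketch below is a behavioral model of the diagonal wake-up wavefront: each PE is switched on a fixed number of cycles before the data wavefront reaches it. The array size, wake-up latency, and state encoding are assumptions, and only the wake-up edge is modeled, not re-gating after the tile drains; this is not the proposed RTL.

```python
# Behavioral sketch (not RTL) of a diagonal wake-up wavefront: the power-on
# signal travels with the dataflow, so PE (r, c) is woken WAKE_CYCLES before
# cycle r + c, when its first operands arrive, hiding the wake-up latency.

N = 4            # PEs per dimension (real arrays are e.g. 128x128)
WAKE_CYCLES = 2  # assumed power-gate wake-up latency

def pe_state(r, c, cycle):
    first_use = r + c                        # diagonal dataflow arrival cycle
    wake_at = max(0, first_use - WAKE_CYCLES)
    if cycle < wake_at:
        return "gated"
    if cycle < first_use:
        return "waking"
    return "active"

for cycle in range(2 * N):
    grid = [" ".join(pe_state(r, c, cycle)[0] for c in range(N)) for r in range(N)]
    print(f"cycle {cycle:2d}: " + " | ".join(grid))
```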
Strengths
- Problem Motivation: The paper correctly identifies static power as a growing contributor to overall energy consumption in modern accelerators, a problem that warrants investigation.
- Systematic Approach: The authors attempt a comprehensive, chip-wide solution by analyzing and proposing distinct power-gating strategies for each major NPU component. This is a more systematic approach than a single-point solution.
- HW/SW Co-design Principle: The fundamental idea of leveraging a hardware-software co-design, where the predictable nature of ML workloads is exploited by the compiler, is sound in principle.
Weaknesses
The paper's claims, while significant, rest on a foundation that is not as solid as it appears. Several methodological and logical issues undermine the credibility of the results.
-
Over-reliance on a Non-Public, Unverifiable Simulator: The entire motivation (Section 3) and evaluation (Section 6) are based on a "production-level NPU simulator." While the authors present a validation against real TPU hardware (Figure 16, page 9), this validation is insufficient.
- Plotting simulated vs. profiled execution time on a log-log scale with high R² values can mask significant absolute and relative errors, especially for shorter-running operators.
- The power model is based on McPAT and NeuroMeter, which are themselves models with inherent assumptions. The claim that the estimated idle/TDP power is "within 9%/5% for TPUv2" is a single data point and does not constitute a thorough validation of the power model's accuracy across different components and workloads.
- Without public access to the simulator or a more transparent and exhaustive validation methodology, the paper's core results are fundamentally irreproducible and their accuracy is questionable. All subsequent claims of energy savings are derivatives of this black-box model.
-
Insufficient Justification for Design Choices (HW vs. SW): The central design decision is how to partition management between hardware and software. The justification provided is qualitative and weak.
- For Vector Units (VUs), the authors claim hardware idle detection is ineffective because idle periods "vary significantly" (Section 4.1, page 7). This is a surprising argument, as sophisticated hardware predictors for generic CPUs have long dealt with variable idleness. Why is a simple idle-detection state machine the only hardware approach considered? A quantitative comparison against a more advanced hardware predictor is needed to justify the compiler-based approach as strictly superior.
- Conversely, for ICI and HBM, a simple idle-detection mechanism is deemed "sufficient" due to long idle intervals. This logic is inconsistent. If long idle intervals make hardware detection easy, they should also make compiler detection trivial and perfect. The rationale for the specific HW/SW split feels ad-hoc rather than rigorously derived.
-
Questionable Novelty and Comparison to Prior Art: While the integration is comprehensive, the novelty of the individual techniques is not well-established.
- The spatially-aware, cycle-level power gating of the SA (Section 4.1, page 6) is presented as a key contribution. However, the concept of propagating an activation signal diagonally is reminiscent of wavefront computations and is not fundamentally novel. The paper mentions UPTPU [61] in Related Work but fails to adequately differentiate its core SA gating mechanism from it in the main design section. UPTPU also uses zero-weight detection for power gating.
- Compiler-directed power management via ISA extensions is a well-established field for VLIW and other architectures [28, 72]. The paper does not clearly articulate how `setpm` and the associated compiler analysis are fundamentally different from this body of work.
-
Optimistic Performance and Overhead Claims:
- The claim of <0.5% performance overhead (Section 6.4, page 11) is extremely aggressive. This relies on the compiler's perfect ability to schedule `setpm` instructions to hide all wake-up latency. This seems unlikely in complex, fused operator graphs where dependencies may constrain scheduling freedom. The evaluation does not sufficiently stress-test scenarios where such optimistic scheduling is impossible.
- The hardware area overhead of 3.3% (Section 4.4, page 9) seems low for adding power-gating transistors and control logic to every PE in a 128x128 SA and to every 4KB segment of a 128MB SRAM. Did the synthesis account for the routing complexity and potential timing impact of the additional control signals (`row_on`, `col_on`, `PE_on`)? A breakdown of the 3.3% figure across the different components would add credibility.
Questions to Address In Rebuttal
-
Simulator Fidelity: Can you provide validation data beyond R² values on log-log plots? Specifically, what is the distribution of Mean Absolute Percentage Error (MAPE) for the execution time and power consumption of individual operators across the benchmark suite? How was the power model for power-gated states (e.g., 3% of active leakage) validated?
-
SA Gating Novelty: Please explicitly contrast your diagonal `PE_on` propagation scheme with the mechanism in UPTPU [61]. Is the primary contribution the propagation method to hide latency, or the row/column zero-detection logic? If the latter, how does it improve upon prior work?
-
Compiler Robustness: The software-managed approach hinges on static analysis of the computation graph. How would ReGate handle emerging ML models with dynamic properties, such as Mixture-of-Experts (MoE) with dynamic routing or adaptive computation based on input? The dismissal in Section 4.3 (page 8) that these still consist of "static subgraphs" is insufficient. What happens at the boundaries of these subgraphs?
-
Justification of HW/SW Split: Please provide a quantitative argument for why a compiler-managed approach for VUs is superior to an advanced hardware-based idle predictor (e.g., one that tracks instruction history or queue occupancy). What is the break-even point in terms of idle period variability where software becomes the only viable option?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper addresses the critical and increasingly relevant problem of static power consumption in Neural Processing Units (NPUs), which are now foundational to modern datacenters. The authors first present a compelling characterization study showing that static power accounts for a staggering 30-72% of energy consumption in modern NPUs, largely due to the underutilization of specialized hardware components for any given workload.
The core contribution is ReGate, a systematic and holistic hardware/software co-design for enabling fine-grained power gating across the entire NPU chip. Rather than a one-size-fits-all solution, ReGate proposes component-specific strategies: a novel, dataflow-aware, cycle-level gating mechanism for Systolic Arrays (SAs); hardware-based idle detection for components with long idle periods like interconnects (ICI) and memory controllers (HBM); and a compiler-driven approach for Vector Units (VUs) and SRAM, enabled by a thoughtful extension to the NPU's instruction set architecture (ISA). The evaluation, conducted on a production-level simulator, demonstrates an average energy reduction of 15.5% (up to 32.8%) with negligible performance overhead.
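To illustrate the software-managed path, the sketch below shows a compiler-style pass that scans a static per-unit schedule for idle gaps and brackets sufficiently long gaps with power-down and power-up directives. The `setpm` mnemonic is taken from the paper, but the schedule representation, break-even threshold, and wake-up latency are illustrative assumptions, not the paper's compiler.

```python
# Sketch of a compiler-style pass in the spirit of the software-managed path:
# bracket idle gaps longer than a break-even threshold with power directives.

def insert_power_directives(ops, unit, break_even_cycles, wake_cycles):
    """ops: list of (start_cycle, end_cycle, text) for one unit, sorted by start."""
    out = []
    for prev, nxt in zip(ops, ops[1:]):
        out.append(prev)
        gap = nxt[0] - prev[1]
        if gap > break_even_cycles + wake_cycles:
            # Power down after the last use; wake early enough to hide latency.
            out.append((prev[1], prev[1], f"setpm {unit}, OFF"))
            out.append((nxt[0] - wake_cycles, nxt[0] - wake_cycles, f"setpm {unit}, ON"))
    out.append(ops[-1])
    return out

vu_schedule = [(0, 40, "vu.op softmax"), (300, 360, "vu.op layernorm")]
for start, _end, text in insert_power_directives(vu_schedule, "VU",
                                                 break_even_cycles=100,
                                                 wake_cycles=20):
    print(start, text)
```

The break-even threshold is the point at which leakage saved during the gap exceeds the energy spent switching the power gates, which is exactly where operator fusion decisions can conflict with gating opportunities (see the questions below).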
Strengths
-
A Systematic and Holistic Approach: The paper's primary strength is its comprehensive methodology. Instead of focusing on a single component, the authors analyze the utilization patterns of every major functional block of an NPU (SA, VU, SRAM, HBM, ICI). This leads to a well-reasoned, heterogeneous power management strategy that applies the right tool for the right job. This systemic view is precisely what is needed for complex systems like modern accelerators and is a significant step beyond piecemeal solutions.
-
Excellent Problem Motivation and Characterization: The work is exceptionally well-motivated. The characterization study in Section 3 (pages 3-5) is not just a preamble but a foundational piece of the research. Figures 3, 4, and 5 (page 4) provide clear, quantitative evidence of both temporal and spatial underutilization, making a powerful case for the necessity and potential of fine-grained power gating. This grounding in empirical data gives the proposed solution significant credibility.
-
Elegant Technical Solutions: The proposed mechanism for spatially power-gating the systolic array (Section 4.1, pages 6-7) is particularly clever. By propagating the power-on signal along with the natural diagonal dataflow, it avoids the massive overhead of individual idle-detection logic for each Processing Element (PE) and elegantly masks most of the wake-up latency. This demonstrates a deep understanding of the underlying dataflow architecture.
-
Pragmatic Hardware/Software Co-design: The decision to manage VUs and SRAM via software (compiler) is insightful. The authors correctly identify that the deterministic nature of ML computation graphs makes the compiler the ideal agent for orchestrating power states, as it has a global view that hardware's local, reactive mechanisms lack. The `setpm` ISA extension (Figure 14, page 7) is a clean and effective interface to expose this control. This co-design philosophy is a hallmark of mature architectural research.
Connecting to Broader Scientific Context: The inclusion of a carbon efficiency analysis (Section 6.6, page 12) is commendable. It elevates the paper's contribution from a purely technical exercise in power reduction to a meaningful statement on sustainable computing. By showing how ReGate can extend the optimal device lifespan (Figure 25, page 13), the work connects directly to pressing, real-world concerns about the environmental footprint of AI, making the research more impactful.
Weaknesses
While this is a strong paper, there are areas where its context and potential could be further explored:
-
Generalizability Beyond TPU-like Architectures: The design and evaluation are heavily centered on a TPU-like, weight-stationary systolic array architecture. While this is a prevalent design, the AI accelerator landscape is diversifying. It would strengthen the paper to include a discussion on how the principles of ReGate would apply to other architectures, such as output-stationary SAs, more MIMD-style accelerators (e.g., Graphcore IPU), or emerging designs that heavily rely on near-data processing.
-
Interaction with Other Power Management Techniques: The paper focuses exclusively on static power reduction via power gating. However, datacenters also employ techniques like Dynamic Voltage and Frequency Scaling (DVFS) and clock gating, which primarily target dynamic power. A discussion of how ReGate would interact with these orthogonal techniques would be valuable. For instance, would a decision made by the ReGate compiler conflict with or complement a decision made by a system-level DVFS governor?
-
Compiler Complexity and Trade-offs: The paper suggests the compiler implementation is straightforward, but adding a power-management pass creates a new set of optimization constraints. For example, a performance-centric compiler might fuse two small operators to hide latency. However, this fusion could eliminate an idle period that ReGate would have used to power-gate a functional unit. The paper would benefit from a deeper discussion of these potential conflicts and the new trade-off space the compiler must navigate.
Questions to Address In Rebuttal
-
The proposed SA power gating mechanism is elegantly tied to the weight-stationary, diagonal dataflow of a TPU. Could the authors elaborate on how the core principles of their component-aware approach might be adapted for accelerators with fundamentally different dataflows or architectures, such as those that are not systolic-array-based?
-
ReGate targets static power, while DVFS and clock gating target dynamic power. Can the authors comment on whether these techniques are purely complementary? Are there scenarios where optimizing for one via ReGate might lead to a suboptimal state for the other (e.g., frequent power-gating/ungating creating transient power demands that challenge DVFS)?
-
Could the authors provide more insight into the potential for negative interactions between the `setpm` instruction placement and standard compiler optimizations like operator fusion or instruction scheduling? How does the ReGate compiler pass resolve a situation where the best decision for performance (e.g., fusion) is the worst decision for power savings (e.g., eliminating an idle gap)?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents ReGate, a hardware/software co-designed system for enabling fine-grained power gating in Neural Processing Units (NPUs). The authors identify static power as a significant contributor to energy consumption in modern NPUs and propose a set of techniques to mitigate it by power gating idle components. The core claims of novelty appear to be a combination of: 1) a dataflow-aware, cycle-level power-gating mechanism for individual Processing Elements (PEs) within a systolic array (SA); 2) a new ISA instruction (setpm) to allow software control over the power states of various NPU components; and 3) a compiler-based approach that leverages this ISA extension to manage power for Vector Units (VUs) and on-chip SRAM.
While the overall goal of power gating NPUs is not new, the paper's primary novel contribution is the specific architectural mechanism for managing PEs in the systolic array. Other aspects of the system, such as compiler-directed power management and simple idle detection for peripherals, are adaptations of well-established techniques from the broader processor architecture literature applied to the NPU domain.
Strengths
The single most significant and novel contribution of this work is the hardware mechanism for spatially and temporally power-gating the systolic array, detailed in Section 4.1 (page 6).
-
Novel Systolic Array Gating Mechanism: The proposed technique of propagating a `PE_on` signal diagonally along with the dataflow (Figure 13) is a clever architectural solution. It elegantly sidesteps the need for complex and potentially slow idle-detection logic within each of the hundreds or thousands of PEs. By overlapping the PE wake-up with the computation wavefront, it effectively hides the latency, which is a critical barrier for fine-grained gating. This dataflow-aware propagation is a genuinely new mechanism for this context.
Meaningful Delta from Prior Art: The paper's SA gating method presents a significant advancement over the closest prior art, UPTPU [61]. As the authors note in Section 7 (page 13), UPTPU relies on non-volatile memory (STT-MRAM) to achieve its goals. ReGate’s mechanism is implemented in standard CMOS logic, making it far more practical and broadly applicable to conventional NPU designs without introducing exotic manufacturing dependencies. This distinction represents a tangible and important novel step.
Weaknesses
The primary weakness of the paper from a novelty perspective is that a significant portion of the proposed "ReGate" system is built upon existing and well-known concepts. While the integration is sound, the novelty of these individual components is minimal to non-existent.
-
Application of Standard Idle Detection: The use of hardware-based idle-detection for the Inter-Chip Interconnect (ICI) and HBM controllers is a standard, textbook approach to power management for I/O and memory interfaces that experience long idle intervals. The paper makes no claim of a novel detection algorithm here.
-
ISA Extension for Power Management is Not a New Concept: The introduction of the `setpm` instruction (Section 4.2, page 7) to expose power states to software is an implementation of a long-standing idea. Architectures like ARM (WFI/WFE instructions) and Intel (MWAIT) have provided ISA-level hooks for power management for decades. While the specific VLIW encoding is new, the core concept of software-initiated power state transitions via an instruction is not a novel research contribution.
Compiler-Directed Power Gating is Prior Art: The software strategy of having a compiler analyze a static dataflow graph to insert power-down/power-up instructions (Section 4.3, page 8) has been extensively explored in the context of VLIW and DSP processors since the early 2000s (e.g., [28, 72], which the authors cite). The deterministic nature of ML graphs makes NPUs an ideal target for this known technique, but the technique itself is not new. The contribution here is one of application and engineering, not fundamental invention.
-
Segmented SRAM Power Gating: The concept of partitioning a memory array (cache or scratchpad) and gating unused segments is also a well-established technique, as seen in prior work on drowsy caches [27] and dynamic cache resizing [65]. ReGate applies this known hardware technique and exposes it to the compiler, which is a logical but not fundamentally new step.
Questions to Address In Rebuttal
-
The core novelty rests on the systolic array power-gating mechanism. Beyond UPTPU [61], can the authors elaborate on how their dataflow-propagated wake-up signal is fundamentally different from other wavefront or data-driven clock/power gating schemes that may exist in the broader literature on massively parallel or dataflow architectures, even outside the specific NPU domain?
-
The paper presents the software-managed power gating (Section 4.3) as a key contribution. Given the extensive prior art on compiler-directed power management for statically-scheduled architectures [28, 72], could the authors precisely identify the novel aspect of their compiler analysis itself? Is there a new analysis or optimization algorithm, or is the novelty simply its application to the NPU software stack?
-
The proposed SA mechanism introduces additional control logic and per-row input queues. For workloads that achieve near-100% spatial and temporal SA utilization, does this new hardware introduce a non-trivial static or dynamic power overhead of its own? A discussion on the trade-off—where the complexity of the novel solution might negate its benefit—would be valuable.
Multi-Dimensional ML-Pipeline Optimization in Cost-Effective Disaggregated Datacenter
Abstract
Machine learning (ML) pipelines deployed in datacenters are becoming increasingly complex and resource intensive, requiring careful optimizations to meet performance and latency requirements. Deployment in NUMA architectures with heterogeneous memory ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper proposes an auto-tuning framework for optimizing multi-stage ML pipelines on NUMA systems, including those with emulated CXL memory. The framework utilizes an eBPF-based monitor to collect performance data with low overhead and a user-space core that employs Bayesian Optimization (BO) to navigate the configuration space of thread counts and memory interleaving ratios. The stated goal is to maximize throughput under SLA latency constraints, with a secondary phase using Pareto analysis for power efficiency. The authors claim significant throughput improvements (up to 48%) and search cost reductions (up to 77%) over existing methods.
However, the work is predicated on a fundamentally flawed CXL emulation methodology and relies on modeled, rather than measured, power data. These weaknesses significantly undermine the credibility of the paper's core performance and efficiency claims, particularly those related to CXL environments.
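For concreteness, the sketch below shows one shape an SLA-constrained Bayesian optimization loop over two of the knobs named in the summary (threads and local-DRAM:CXL interleave ratio) could take, using a Gaussian-process surrogate and an upper-confidence-bound acquisition masked by a predicted-latency constraint. The `measure()` stand-in, candidate grid, and acquisition constants are assumptions, not the paper's optimization core.

```python
# Minimal sketch of SLA-constrained Bayesian optimization over two knobs.

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def measure(threads, interleave_pct):
    # Stand-in for running the pipeline and reading QPS / p99 latency.
    qps = 1000 + 40 * threads - 2 * abs(interleave_pct - 60)
    p99_ms = 5 + 0.1 * threads + 0.02 * abs(interleave_pct - 60)
    return qps, p99_ms

def bo_tune(sla_ms=8.0, iterations=20, seed=0):
    rng = np.random.default_rng(seed)
    candidates = np.array([(t, r) for t in range(1, 33) for r in range(0, 101, 10)])
    X, qps_obs, lat_obs = [], [], []

    for i in range(iterations):
        if i < 5:  # bootstrap the surrogates with random configurations
            cfg = candidates[rng.integers(len(candidates))]
        else:
            gp_qps = GaussianProcessRegressor().fit(np.array(X), np.array(qps_obs))
            gp_lat = GaussianProcessRegressor().fit(np.array(X), np.array(lat_obs))
            mu_q, sd_q = gp_qps.predict(candidates, return_std=True)
            mu_l, _ = gp_lat.predict(candidates, return_std=True)
            # Upper-confidence-bound acquisition, masked by the SLA constraint.
            ucb = np.where(mu_l <= sla_ms, mu_q + 1.5 * sd_q, -np.inf)
            cfg = candidates[int(np.argmax(ucb))]
        qps, lat = measure(*cfg)
        X.append(cfg); qps_obs.append(qps); lat_obs.append(lat)

    feasible = [(q, x) for x, q, l in zip(X, qps_obs, lat_obs) if l <= sla_ms]
    return max(feasible, key=lambda t: t[0])[1] if feasible else None

print(bo_tune())
```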
Strengths
- The problem statement is well-defined and highly relevant to current trends in datacenter architecture and ML deployment.
- The use of eBPF for performance monitoring (Section 3.1.2) is a methodologically sound choice for achieving low-overhead, in-kernel data collection, avoiding the significant performance degradation common with traditional daemon-based tools.
- The evaluation includes comparisons against relevant and recent academic work, specifically TPP [40] and Caption [69], which is commendable.
Weaknesses
-
Invalid CXL Emulation: The paper's entire premise of optimizing for CXL-based disaggregated memory is built on a weak foundation. In Section 2.5, the authors state they "emulated a CXL... memory pool... by disabling all local sockets and treating remote DRAM as a dedicated memory pool." This is an oversimplification that borders on being incorrect. A remote NUMA link is not CXL. This emulation completely ignores critical CXL-specific characteristics such as protocol overhead from the CXL.mem transaction layer, the behavior of the device-side CXL controller and host-side Home Agent (HA), and potential coherency traffic differences. The authors' unsubstantiated claim of a "discrepancy within less than 5% of those observed in our emulation" (Section 2.5.1, Page 4) is presented without any supporting data, experimental setup, or reference to the "actual CXL hardware" used for this validation. Without this evidence, all results pertaining to CXL are speculative at best.
-
Power Claims are Not Empirically Validated: The power optimization phase (Section 4.2) and the associated power savings claims (up to 14.3% in the abstract) are not based on hardware measurements. The authors explicitly state they "estimate the CXL power using two different models" (Section 4, Page 9), dubbed "CXL-preferred" and "DRAM-preferred." These models are based on high-level assumptions about DDR4/DDR5 refresh rates and supply voltages. Consequently, the Pareto frontiers shown in Figure 11 are the result of a simulation, not an empirical observation of a real system. Such claims of power savings are meaningless without physical measurement.
-
Insufficient Justification for Bayesian Optimization: While BO is a powerful technique, the authors fail to demonstrate that its complexity is warranted for this problem. The primary claim is a 77% reduction in search cost (Abstract). This is compared to a pseudo-exhaustive search, which is a strawman baseline. The more relevant comparison is the time-to-solution versus a simpler, robust heuristic. In Figure 6, Caption [69] appears to find a strong configuration very quickly (at low "Search Cost") before its performance degrades. Why does Caption degrade? The paper offers no analysis, simply stating it "falls to local minima." This is an insufficient explanation. The authors must provide a rigorous analysis of why the simpler heuristic fails and demonstrate that BO's overhead is justified by consistently finding a superior solution within a practical time budget.
-
Questionable Baseline Behavior: The performance of the Caption baseline in Figure 6 is highly suspect. The algorithm is designed to incrementally adjust allocation and hill-climb towards a better configuration. The consistent and sharp decline in throughput as search cost increases suggests either a flawed implementation of the baseline or an experimental artifact that the authors have not investigated or explained. A system designed to improve performance should not actively make it worse over time unless it is unstable, which would be a critical finding in itself.
-
Ambiguity in Overhead Measurement: The claim of "over 5× less overhead than traditional Linux daemon-based tools" and the specific figure of 4.7% overhead for the eBPF monitor (Section 4.1.1, Page 9) lacks context. Was this 4.7% overhead measured on an idle system or under the full load of the benchmark workloads? System monitoring overhead is often non-linear and can become significantly more pronounced under high resource contention. The paper must clarify the conditions under which this overhead was measured to validate its "low-overhead" claim.
Questions to Address In Rebuttal
-
Please provide the complete data and methodology for the validation of your CXL emulation against "actual CXL hardware," as mentioned in Section 2.5.1. This should include the specific hardware used, the workloads run, and the performance metrics that showed a <5% discrepancy. Without this, how can any of the CXL-related conclusions be trusted?
-
Given that the power analysis is based entirely on a model, please justify how you can make concrete claims of power savings (e.g., "7.3% power by sacrificing around 4.5% in QPS" in Section 4.2). At a minimum, the authors should rephrase these as purely theoretical findings and explicitly state they are not based on hardware measurements.
-
Provide a detailed analysis explaining the performance degradation of the Caption baseline in Figure 6. Why does a hill-climbing algorithm consistently result in a worse configuration over time? Is this a limitation of the original algorithm or an artifact of your implementation/environment?
-
Please clarify the exact system load conditions under which the 4.7% eBPF monitor overhead was measured.
-
The comparison with TPP (Section 4.1.4) shows significant latency improvements. TPP is primarily designed for transparent capacity tiering. Were your experiments configured in a way that memory capacity was a bottleneck? If not, please justify why TPP is an appropriate performance baseline for a bandwidth-tuning framework.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents a novel, adaptive auto-tuning framework for optimizing multi-stage Machine Learning (ML) inference pipelines in datacenters with disaggregated memory architectures, specifically those enabled by Compute eXpress Link (CXL). The core problem addressed is the combinatorial explosion of configuration parameters—such as memory allocation ratios between local DRAM and CXL, and thread-to-core mappings—that arises in these new, heterogeneous systems. The authors propose a holistic, two-phase optimization approach. Phase 1 uses Bayesian Optimization (BO) to navigate the high-dimensional search space and maximize system throughput under Service Level Agreement (SLA) latency constraints. Phase 2 further refines the set of high-performing configurations by using Pareto optimality to identify the most power-efficient options. A key technical contribution is the use of an eBPF-based kernel module for low-overhead, vendor-agnostic performance monitoring, which feeds the user-space optimization core. The experimental results, conducted on a range of ML workloads, demonstrate significant improvements over default and state-of-the-art configurations, achieving up to a 48% throughput increase while simultaneously reducing search costs and enabling substantial power savings.
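The Phase 2 selection described above reduces to standard Pareto filtering over (throughput, power) pairs: a configuration is kept only if no other configuration offers both higher throughput and lower power. A minimal sketch with made-up sample points follows.

```python
# Sketch of Pareto filtering over (QPS, watts) pairs; sample data is invented.

def pareto_front(points):
    """points: list of (qps, watts, config); returns the non-dominated subset."""
    front = []
    for qps, watts, cfg in points:
        dominated = any(q2 >= qps and w2 <= watts and (q2 > qps or w2 < watts)
                        for q2, w2, _ in points)
        if not dominated:
            front.append((qps, watts, cfg))
    return sorted(front)

samples = [(1200, 310, "A"), (1150, 280, "B"), (1180, 300, "C"), (1100, 290, "D")]
print(pareto_front(samples))   # D is dominated by B and is dropped
```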
Strengths
The primary strength of this work lies in its timely and insightful synthesis of several key trends in modern computing. It provides a cohesive and practical solution to a critical, emerging problem.
-
High Relevance to an Emerging Hardware Paradigm: The paper is exceptionally well-timed. As CXL-enabled servers move from research prototypes to production deployments, the industry will urgently need intelligent management systems. This work directly addresses the fundamental challenge that CXL introduces: a vastly expanded and more complex memory hierarchy. It moves beyond simple page placement heuristics seen in prior work (e.g., "Caption" [69], "TPP" [40]) by embracing the multi-dimensional nature of the problem, making it a forward-looking and highly relevant contribution.
-
Holistic and Principled System Design: The authors have designed a system that is both powerful and pragmatic. The choice of eBPF for monitoring is astute, providing a low-overhead, transparent, and—most importantly—vendor-agnostic solution that overcomes the limitations of prior platform-specific tools (as noted in Challenge C2, page 2). The coupling of this monitor with a Bayesian Optimization core is a natural fit; BO is an ideal tool for optimizing expensive-to-evaluate black-box functions, which perfectly describes tuning a live datacenter workload. The two-phase approach, separating performance and power optimization, is also a very practical design choice, reflecting real-world operational priorities.
-
Connecting Research to Real-World Economics: A significant strength is the paper's grounding in the total cost of ownership (TCO) of datacenter operations. The analysis extends beyond raw performance metrics (QPS, latency) to include power consumption (Section 4.2, page 11) and a CAPEX/OPEX comparison against GPU-based solutions (Section 4.3, page 11). This demonstrates a deep understanding of what matters in practice and elevates the work from a purely academic exercise to a solution with a clear value proposition for datacenter operators.
-
Strong Empirical Foundation: The characterization study in Section 2.5 (page 3) effectively motivates the entire work. Figure 2, in particular, clearly illustrates that no single static memory configuration is optimal across all loads, justifying the need for a dynamic, adaptive tuner. The subsequent evaluation is comprehensive, using a diverse set of modern ML pipeline benchmarks and comparing against multiple relevant baselines, including a state-of-the-art CXL management system. The results convincingly demonstrate the superiority of the proposed BO-based approach.
Weaknesses
The weaknesses identified are not fundamental flaws but rather areas where the current implementation and evaluation could be expanded to address the full complexity of production environments.
-
Oversimplified Pipeline Stage Management: The paper describes a mechanism for managing pipeline stages using POSIX semaphores and tracking
kmalloc events with eBPF to detect the end of a stage's memory allocation (Section 3.1.3, page 6). While clever, this approach seems potentially fragile. It assumes a well-behaved application structure where stages are cleanly separated and their memory allocation phases are distinct. It is unclear how this would generalize to more complex, asynchronous pipelines or those written in higher-level languages where memory management is less explicit. (A sketch of this style of allocation monitoring follows after this list.)
-
Unexplored Dynamics of Optimization Convergence: The framework's value proposition depends on its ability to find an optimal configuration in a reasonable amount of time. The authors report search times of "5 to 17 minutes" (page 10), which is excellent for a stable workload. However, datacenter load is often dynamic, with characteristics that can change on shorter timescales. The paper does not explore the system's reactivity. For instance, if a workload's input data characteristics shift dramatically, how quickly can the framework abandon its current model and re-converge on a new optimum? The current evaluation focuses on a static optimization problem rather than a continuous, dynamic one.
-
Limited Scope (Single-Tenant Focus): The current evaluation is conducted in a single-application, single-tenant context. While the authors briefly discuss extending the framework to multi-tenancy in the Discussion (Section 5.4, page 12), this is a non-trivial extension. In a multi-tenant environment, the optimization for one workload could negatively impact another (the "noisy neighbor" problem). The BO's objective function would need to be reformulated to account for fairness, QoS guarantees for multiple tenants, and global resource contention, which presents a far more complex optimization landscape.
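As a concrete, and necessarily speculative, illustration of the allocation-tracking idea questioned above, the sketch below uses the BCC Python bindings with a kprobe on __kmalloc to count per-process allocation bytes; a stage boundary could then be inferred when the allocation rate collapses. The hook choice, map layout, and thresholding are my assumptions; the paper does not specify its exact probes.

```python
# Hedged sketch of kmalloc-rate monitoring (assumes the BCC toolchain and a kernel
# that still exposes a __kmalloc symbol). Not the paper's implementation.
from bcc import BPF
import time

PROG = r"""
#include <uapi/linux/ptrace.h>

BPF_HASH(alloc_bytes, u32, u64);   // pid -> bytes requested via __kmalloc

int trace_kmalloc(struct pt_regs *ctx, size_t size) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 zero = 0, *val = alloc_bytes.lookup_or_try_init(&pid, &zero);
    if (val) { __sync_fetch_and_add(val, size); }
    return 0;
}
"""

b = BPF(text=PROG)
b.attach_kprobe(event="__kmalloc", fn_name="trace_kmalloc")

# Poll the map once per second; a stage boundary could be flagged when the
# per-pid delta stays below a threshold for a few consecutive intervals.
prev = {}
for _ in range(5):
    time.sleep(1)
    for pid, total in b["alloc_bytes"].items():
        delta = total.value - prev.get(pid.value, 0)
        prev[pid.value] = total.value
        if delta:
            print(f"pid {pid.value}: +{delta} bytes this interval")
```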
Questions to Address In Rebuttal
-
Regarding the pipeline stage management mechanism: Could the authors elaborate on the robustness of using
kmalloc tracking and semaphores? Have they considered alternative pipelines where memory allocation is interleaved with computation, and if so, how would the framework handle such cases? What are the practical limitations on the types of application pipelines the system can currently support?
-
Regarding the real-time adaptability of the system: Can the authors comment on the trade-off between the exploration budget (search cost) of the Bayesian Optimization and the framework's ability to react to dynamic changes in workload behavior? For example, how would the system perform if the workload trace from Figure 7 experienced a sudden, sustained spike, fundamentally changing the latency-throughput curve?
-
While multi-tenancy is noted as future work, could the authors provide a more concrete vision for this extension? Specifically, how might the Bayesian Optimization framework be adapted? Would it involve a multi-objective optimization problem (e.g., balancing the throughput of all tenants), or would it require adding fairness constraints directly into the acquisition function? How would the system ensure performance isolation while optimizing for global efficiency?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents a framework for optimizing the performance and power consumption of multi-stage Machine Learning (ML) pipelines on NUMA systems with heterogeneous memory, specifically including CXL-attached memory. The core of the proposed solution is an adaptive auto-tuner that operates in two phases. The first phase uses Bayesian Optimization (BO) to maximize throughput by exploring a multi-dimensional configuration space of per-stage memory allocation ratios and thread counts. The second phase uses Pareto optimality to select a configuration from the high-performing set that also minimizes power consumption. A key element of the framework's architecture is its use of an eBPF-based kernel module for low-overhead, hardware-agnostic performance monitoring, which feeds metrics into the user-space optimization core.
Strengths
The primary novel contribution of this work lies in the architectural synthesis of its components to create a practical, low-overhead, and portable auto-tuning system. My analysis identifies two key areas of novelty:
-
The Monitoring Mechanism as a Basis for Auto-Tuning: The decision to use an eBPF-based kernel module (Section 3.1, page 5) for real-time monitoring is a distinct and valuable departure from prior art. Many related works, such as Caption [69] and other vendor-specific studies [81, 85], rely on periodically sampling hardware performance counters (e.g., Intel PCM) or daemon-based tools. These approaches suffer from either a lack of portability across different CPU vendors or significant overhead due to context switching, as the authors correctly identify (Section 1, C2, page 2). By integrating monitoring directly and efficiently into the kernel via eBPF, the authors present a genuinely novel approach in this specific context that addresses both portability and overhead, which their own measurements confirm (Section 4.1.1, page 9).
-
Sophistication of the Search Strategy: The work moves beyond the simpler search heuristics seen in closely related work. For instance, Caption [69] employs a binary search-like algorithm to tune a single dimension (the CXL memory ratio). This paper's use of Bayesian Optimization to navigate a multi-dimensional space (per-stage memory ratios and thread counts) is a significant step up in capability. While applying BO to systems optimization is not new in a general sense, its application to this specific, complex problem of per-stage ML pipeline tuning on CXL is a novel application that demonstrably reduces search cost compared to exhaustive methods and finds better optima than simpler heuristics.
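As a minimal illustration of the multi-dimensional search this strength describes, the sketch below defines a per-stage memory-ratio plus thread-count space and hands it to scikit-optimize's gp_minimize. The stage names, bounds, and the simulated_qps stand-in are assumptions; the paper's real objective is a live throughput measurement under an SLA constraint.

```python
# Sketch only: Bayesian optimization over per-stage CXL memory ratios and a thread count,
# assuming scikit-optimize. The objective is a toy stand-in for a live measurement.
from skopt import gp_minimize
from skopt.space import Real, Integer

space = [
    Real(0.0, 1.0, name="stage1_cxl_ratio"),
    Real(0.0, 1.0, name="stage2_cxl_ratio"),
    Real(0.0, 1.0, name="stage3_cxl_ratio"),
    Integer(4, 32, name="threads"),
]

def simulated_qps(r1, r2, r3, threads):
    # Toy model so the sketch runs: penalize pushing hot stages to CXL too aggressively.
    return 1000 * (1 - 0.4 * r1 - 0.2 * r2 - 0.1 * r3) * min(threads, 16) / 16

def objective(params):
    r1, r2, r3, threads = params
    # In the real system this would run the pipeline and return negative throughput.
    return -simulated_qps(r1, r2, r3, threads)

result = gp_minimize(objective, space, n_calls=25, random_state=0)
print("best config:", result.x, "estimated QPS:", -result.fun)
```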
Weaknesses
While the overall framework is novel in its composition, its individual building blocks are well-established. My primary critique is that the paper presents a significant engineering contribution by cleverly integrating existing technologies, but it does not introduce a fundamentally new optimization algorithm or a new theoretical insight into system performance.
-
Component-level Novelty is Limited: Bayesian Optimization is a standard technique for black-box function optimization. Pareto optimality is a classic method for multi-objective decision-making. Using eBPF for system tracing and monitoring is also a widely adopted practice. The novelty here is exclusively in the combination and application of these tools to the problem of ML pipeline tuning on disaggregated memory. The paper should be careful not to overstate the novelty of the constituent parts.
-
Incremental Advance over Conceptually Similar Ideas: The core idea of a closed-loop system that monitors performance and adjusts resource allocation is not new. The delta here is in the "how": using eBPF instead of Perf/PCM, and BO instead of hill-climbing/binary-search. While the results show this delta is impactful (e.g., Figure 6, page 10), the conceptual framework of "monitor -> decide -> actuate" is familiar. The paper's contribution is a much more effective implementation of this concept, not the invention of the concept itself.
-
Scoped Novelty: The paper's novelty is scoped almost exclusively to CPU-based inference on NUMA/CXL systems. This is a relevant domain, but the broader trend in large-scale ML involves heterogeneous systems with hardware accelerators (GPUs, TPUs). While the authors suggest a potential extension to XPUs (Section 5.2, page 12), this remains speculative. The demonstrated novelty is confined to a specific, albeit important, hardware class.
Questions to Address In Rebuttal
To strengthen the paper's claims of novelty, I would ask the authors to address the following:
-
Clarifying the Conceptual Delta: The paper successfully argues that its framework is superior to prior art like Caption [69] and TPP [40]. However, can the authors articulate the single most significant conceptual leap this work makes? Is it the move from single-dimensional to multi-dimensional search, the adoption of kernel-level monitoring for this specific feedback loop, or another factor? A more precise articulation of the core inventive step, beyond just being a more effective integration, would be valuable.
-
Justification of Bayesian Optimization: The rationale for choosing Bayesian Optimization is that it is well-suited for expensive black-box functions. The paper's own results in Figure 6 (page 10) show that a Genetic Algorithm (GA) also achieves strong performance (84-89% of optimal), sometimes appearing more stable than PSO. Could the authors provide a more rigorous justification for why the additional complexity and specific modeling assumptions of a Gaussian Process in BO are fundamentally better for this problem space than other global search metaheuristics? Is the 5-10% performance delta over GA worth the implementation and computational overhead of BO?
-
eBPF Hooking Strategy: The use of eBPF is the strongest novel component. Could the authors provide more technical detail on the specific kernel events, tracepoints, or kprobes they hook into to monitor the start and end of a stage's execution and memory allocation (
kmalloc events mentioned in Section 3.1.3)? Discussing the robustness of these hooks across different kernel versions would also strengthen the claim of portability.
CrossBit: Bitwise Computing in NAND Flash Memory with Inter-Bitline Data Communication
Abstract
In-flash processing (IFP), which involves performing data computation inside NAND flash memory, holds high potential for improving the performance and energy efficiency of data-intensive applications by minimizing data movement. Recent research has ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
This paper introduces CrossBit, an in-flash processing (IFP) architecture designed to enable both intra- and inter-bitline bitwise computations within NAND flash memory. The authors propose a hierarchical architecture of "local" and "global" computing modules based on dynamic logic to facilitate these operations. A key contribution is an in-flash error correction code (IF-ECC) based on a Hamming code, which purports to enable reliable computation on multi-level cell (MLC) flash, thereby increasing bit-density over prior single-level cell (SLC)-based IFP systems. The architecture is evaluated on fundamental database queries and the Star Schema Benchmark (SSB).
While the paper addresses a critical limitation of existing IFP work—the lack of flexible inter-bitline communication—its central claims regarding reliability, performance, and practicality rest on a series of questionable assumptions and methodological weaknesses that are not sufficiently substantiated by the provided evidence.
Strengths
- The work correctly identifies a fundamental and well-known shortcoming of prior IFP architectures (e.g., ParaBit, Flash-Cosmos), namely their restriction to intra-bitline operations, which severely limits their application domain.
- The architectural proposal to handle MLC reliability within the flash die is a necessary direction for IFP research to be viable, as the capacity benefits of MLC are a primary motivator for moving computation closer to storage.
- The use of existing dynamic circuit structures within the page buffer as a foundation for the computing units is a practical design consideration aimed at minimizing area overhead.
Weaknesses
My analysis reveals several critical flaws that undermine the validity of the paper's conclusions.
-
Insufficient ECC for MLC Reliability: The cornerstone of the MLC reliability claim is the proposed IF-ECC, which is based on a Hamming code (Section 6.3, Page 7). Hamming codes are single-error correcting (SEC) codes. It is a significant and unsubstantiated leap to assume that a simple SEC code is adequate for MLC NAND flash, which is known to suffer from multi-bit errors, disturbance, and high raw bit error rates (RBER) that increase significantly with P/E cycles and data retention time. Modern SSDs employ far more powerful codes like LDPC or BCH for a reason. The paper dismisses these as "not as cost-efficient" (Page 7) without providing any quantitative analysis of the trade-off or, more importantly, a characterization of the error patterns that IF-ECC would face. The BER evaluation in Figure 12 (Page 9) relies on a Gaussian distribution model from a 2017 source [11], which may not be representative of contemporary, high-density 3D NAND structures. This entire premise seems fundamentally unsound.
-
Inaccurate and Overstated Bit-Density Claim: The abstract and introduction prominently claim a "1.8x increase in bit-density by using MLC compared to previous IFP designs which uses SLC only" (Page 1). A simple calculation refutes this. The paper states that each local group uses six additional bitlines for parity on what appears to be a 32-bit data word (Section 6.3). This results in a code rate of 32/(32+6) = 32/38 ≈ 0.84. Using MLC (2 bits/cell) yields an effective bit density of 2 * (32/38) ≈ 1.68 bits/cell. This represents a 1.68x improvement over SLC (1 bit/cell), not 1.8x. This is a significant quantitative error that calls into question the rigor of the entire evaluation.
-
Results Depend on Manual, Non-Generalizable Optimizations: The programming methodology relies on manually converting logic into the proposed
prims and then applying several complex, hand-tuned optimizations (Section 5.2, Page 6). The authors explicitly state, "we conducted manual conversion, deferring development of the logic optimizer to future work" (Section 5.1, Page 6). This is a critical weakness. The presented speedups are an artifact of expert-level, manual optimization and cannot be considered representative of what a general-purpose compiler could achieve for arbitrary workloads. The results are therefore a best-case, not a typical-case, scenario.
-
Oversimplification of Database Query Processing: The paper claims to accelerate the "full set of end-to-end database queries from... SSB" (Abstract, Page 1). However, the description of the Join query (Section 7.3, Page 8) reveals that it relies on a "widely-used partitioned join algorithm" where partitioning is handled by the SSD controller. This offloads significant logical complexity from the flash chip, weakening the central claim of in-flash processing for one of the most challenging database operations. The paper fails to provide a clear, step-by-step breakdown of how a truly complex SQL query (e.g., involving aggregations, sorting, and multi-table joins) is fully mapped to and executed by the CrossBit primitives.
-
Questionable Circuit-Level and Timing Assumptions: The evaluation fixes the basic operation time to 7.3 ns, which is the "maximum among all basic operations" (Section 8.1, Page 9). This single, fixed value is likely optimistic. A rigorous analysis would require demonstrating robustness across all process, voltage, and temperature (PVT) corners. Furthermore, the reliance on dynamic logic raises concerns about noise immunity and charge leakage, which are not thoroughly analyzed beyond a single Monte-Carlo simulation plot (Figure 11, Page 9) with unspecified variation parameters.
Questions to Address In Rebuttal
The authors must provide clear and convincing answers to the following questions:
- Provide empirical data or rigorous simulation results to justify that a single-error correcting Hamming code is sufficient to meet the JEDEC UBER standard for MLC NAND flash across its expected lifetime, considering multi-bit error events and retention-induced errors.
- Provide a precise, step-by-step calculation that substantiates the claimed 1.8x bit-density improvement. If the figure is incorrect, all related performance-per-area and efficiency claims must be re-evaluated and corrected. (A worked check of this arithmetic follows after this list.)
- How do the authors justify presenting performance results that are contingent on manual code optimization? What is the expected performance degradation for code generated by an automated (and currently non-existent) compiler compared to the hand-tuned results shown in the paper?
- For a complex SSB query (e.g., Q2.1 or Q3.1), provide a detailed breakdown of the execution flow. Specifically, what percentage of the total operational latency is spent on (a) in-flash computation using CrossBit, (b) data movement and logic in the SSD controller, and (c) host-level processing?
- Was the 7.3 ns cycle time for basic operations validated as the worst-case latency across standard PVT corners for the target 150nm process technology? Please provide evidence for this claim.
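To make the arithmetic behind Questions 1 and 2 easy to verify, the short script below reproduces the reviewer's calculation: the parity-bit count of a single-error-correcting Hamming code over 32 data bits and the resulting effective MLC bit density. It checks the numbers only; it says nothing about whether SEC correction strength suffices for MLC error rates.

```python
# Worked check: Hamming(SEC) parity bits for a 32-bit word and the implied MLC density.
def hamming_parity_bits(k):
    r = 0
    while 2 ** r < k + r + 1:   # standard SEC condition: 2^r >= k + r + 1
        r += 1
    return r

k = 32                          # data bits per local group
r = hamming_parity_bits(k)      # -> 6 parity bitlines
code_rate = k / (k + r)         # -> 32/38 ~= 0.842
mlc_density = 2 * code_rate     # 2 bits/cell * code rate ~= 1.68 bits/cell

print(r, round(code_rate, 3), round(mlc_density, 2))  # 6 0.842 1.68
```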
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces CrossBit, a novel in-flash processing (IFP) architecture that enables generic, flexible data communication between different bitlines within a NAND flash page buffer. The authors identify a critical limitation in prior IFP works (e.g., ParaBit, Flash-Cosmos), which were largely restricted to intra-bitline computations. This restriction made them inefficient for complex operations and, crucially, unable to support the error correction required for high-density multi-level cell (MLC) memory, limiting their practical capacity.
CrossBit's core contribution is a hierarchical, dynamic logic-based interconnect that allows for Boolean-complete operations across arbitrary bitlines with minimal (2.2%) area overhead. The authors demonstrate that this new capability unlocks two significant applications:
1. In-Flash ECC (IF-ECC): An in-flash Hamming code implementation that corrects errors within the page buffer, enabling the use of MLC NAND for IFP and achieving a 1.8x bit-density improvement over prior SLC-based designs.
2. Efficient Database Query Acceleration: A significant speedup on fundamental and end-to-end database queries (e.g., pattern match, range, join) that were previously bottlenecked by data movement or cell write overheads in other IFP architectures.
The work is thoroughly evaluated through circuit simulation and system-level modeling, showing substantial performance and energy improvements over the state-of-the-art.
Strengths
This paper makes a significant and timely contribution to the field of in-memory and near-data processing. Its primary strengths are:
-
Breaking the MLC Barrier for IFP: The most impactful contribution of this work is enabling reliable computation in MLC NAND flash. The field of IFP has long been constrained by the poor raw bit error rate (RBER) of dense flash technologies. By introducing a practical mechanism for in-flash error correction (IF-ECC, Section 6.3, page 7), CrossBit fundamentally changes the value proposition of IFP. It transforms it from a niche technology limited to low-density, high-cost SLC to one that can leverage the high capacity and cost-effectiveness of modern NAND. The resulting 1.8x increase in bit density (Figure 13a, page 9) is a compelling and crucial result for the viability of this research direction.
-
An Elegant and Practical Architectural Design: The hardware proposal is both clever and pragmatic. Instead of adding complex, static logic gates, the authors extend the existing dynamic circuit-based interconnects (DC-interconnect) already present in modern page buffers (Section 4.2, page 4). The hierarchical design (local and global modules) thoughtfully balances parallelism and hardware cost. This approach demonstrates a deep understanding of memory circuit design constraints, leading to a proposal with a commendably low area overhead of 2.2% (Section 8.5, page 12), making it plausible for industrial adoption.
-
Generalizing Inter-Bitline Communication: This work can be seen as the logical and necessary evolution of prior art. While architectures like ParaBit and Flash-Cosmos established the potential of bulk bitwise IFP, and Ares-Flash introduced a limited form of inter-bitline communication (unidirectional shifts), CrossBit provides the generic communication fabric that was missing. This generalization is precisely what enables complex logic for applications like ECC and database queries, moving beyond the simple arithmetic acceleration of its predecessors.
-
Comprehensive and Visionary Evaluation: The evaluation is thorough, covering circuit-level fidelity (HSPICE), system-level performance (MQSim), reliability (BER analysis), and end-to-end application benchmarks (Star Schema Benchmark). Furthermore, the authors display excellent foresight by evaluating a hybrid architecture that combines the strengths of CrossBit (generic communication) and Ares-Flash (fast, parallel shifts) in Section 8.4 (Figures 18 and 19). This shows a mature understanding of the research landscape, positioning their work not merely as a competitor but as a complementary technology that can be integrated to build even more powerful systems.
Weaknesses
While the core idea is strong and well-executed, the paper could be strengthened by addressing the following points, which relate more to the system-level implications and future work than to flaws in the current proposal.
-
The Software and Programmability Challenge: The paper presents a powerful new hardware capability but gives less attention to how it would be programmed and utilized by developers. The authors propose a set of primitive functions (
prim_OR, prim_AND) and mention manual conversion from higher-level logic (Section 5.1, page 6). This is a significant gap between the hardware's potential and its practical usability. For this technology to have a real-world impact, a compiler or automated logic synthesis toolchain is essential. The lack of a clear path from a high-level language (e.g., SQL) to the CrossBit control signals is the most significant hurdle to its adoption. (A hypothetical lowering sketch follows after this list.)
-
Justification of ECC Choice and Scalability: The choice of a Hamming code for IF-ECC is a pragmatic one that serves as an excellent proof-of-concept. However, commercial SSDs rely on much stronger codes like BCH and LDPC to ensure data reliability over the device's lifetime, especially for denser TLC/QLC flash. While the authors correctly note the high complexity of these codes, the paper would benefit from a more detailed discussion on the trade-offs. Is the proposed Hamming code sufficient to meet enterprise-grade reliability standards over many years and P/E cycles, or is it a stepping stone towards a future, more complex in-flash solution?
-
Sensitivity to Data Layout: The impressive performance gains are, as the authors acknowledge in the Discussion (Section 9, page 12), highly dependent on a data layout that aligns with the hardware's strengths (i.e., columnar storage). While columnar databases are increasingly popular, many legacy and general-purpose systems still rely on row-major layouts. The paper would be more complete if it quantified the performance overhead of performing an initial in-flash data transposition for workloads that do not use an optimal layout. This would provide a clearer picture of the architecture's effectiveness in a broader range of scenarios.
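To illustrate the kind of lowering path whose absence the first weakness highlights, the sketch below shows how a simple SQL WHERE predicate might map onto bitwise primitives named after the paper's prim_* functions. The Python stand-ins, the prim_not helper, and the bitmap operand model are hypothetical; the actual CrossBit operand model and control-signal generation are not specified at this level.

```python
# Hypothetical lowering sketch: prim_and / prim_or / prim_not are pure-Python stand-ins
# operating on bit-vectors of per-row match flags, not the real in-flash primitives.
def prim_and(a, b): return [x & y for x, y in zip(a, b)]
def prim_or(a, b):  return [x | y for x, y in zip(a, b)]
def prim_not(a):    return [1 - x for x in a]   # assumed helper, not named in the paper

# Suppose earlier in-flash comparisons produced per-row match bitmaps for
# "price > 100" and "region = 'EU'". The clause
#   WHERE price > 100 AND NOT region = 'EU'
# then lowers to two primitive calls over those bitmaps:
price_gt_100 = [1, 0, 1, 1, 0]
region_is_eu = [0, 0, 1, 0, 1]

selected = prim_and(price_gt_100, prim_not(region_is_eu))
print(selected)  # [1, 0, 0, 1, 0]
```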
Questions to Address In Rebuttal
-
Regarding programmability: Could the authors elaborate on the path from a high-level query (e.g., a SQL SELECT statement with a WHERE clause) to the CrossBit primitives and control signals? What are the key challenges in developing a compiler to automate this process?
-
Regarding the choice of ECC: The Hamming code implementation is a key enabler for MLC. Could the authors comment on the feasibility of implementing a more robust block code, such as a shortened BCH code, within the CrossBit framework? Would the overhead in terms of latency and control signal complexity be prohibitive, or is there a potential path forward?
-
Regarding data layout: The evaluation rightly focuses on columnar layouts where CrossBit excels. Could the authors provide an estimate or analysis of the performance impact if a one-time, in-flash transposition from a row-major format is required before query processing? How would this overhead affect the overall speedup compared to a host-based approach?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper introduces CrossBit, an in-flash processing (IFP) architecture for NAND flash memory. The authors claim its primary novelty is the ability to perform flexible, Boolean-complete, inter-bitline computations, a capability they argue is absent or severely limited in prior work. This is achieved through a hierarchical architecture featuring a novel "local module" that connects the L2 latches of 32 neighboring bitline buffers via a shared dynamic circuit-based interconnect (DC-interconnect). This mechanism is then leveraged to implement two key applications: 1) an in-flash error correction code (IF-ECC) scheme that enables the use of multi-level cell (MLC) NAND for IFP, and 2) the acceleration of fundamental database queries.
My review will focus exclusively on the novelty of the core architectural mechanism for inter-bitline communication and its qualitative difference from the established prior art.
Strengths
The core novelty of this paper lies in its specific mechanism for achieving general-purpose inter-bitline computation, which represents a qualitative leap over the closest prior art.
-
Genuinely Novel Inter-Bitline Compute Fabric: The central claim to novelty rests on the design of the local and global inter-bitline computing modules (Section 4.2.2 and 4.2.3). I have analyzed this against the closest prior art, AresFlash [13]. AresFlash introduced inter-bitline communication, but its mechanism is fundamentally a shift register—a unidirectional, point-to-point data transfer between adjacent bitline buffers, optimized for arithmetic operations. CrossBit proposes a fundamentally different and more general architecture: a shared bus (
Shared_node) within a local group of bitlines that can perform a multi-input logical NOR. This elevates the communication from a simple shift to a true, Boolean-complete computation fabric. The shift from a specialized data-path (AresFlash) to a general-purpose, shared compute resource (CrossBit) is a significant and novel architectural contribution in the context of NAND flash PIM.
-
Novel Application Enabled by the Architecture: The proposed IF-ECC is a direct and powerful application of the novel inter-bitline mechanism. While the concept of ECC is not new, and Hamming codes are textbook material, performing ECC inside the flash array without serializing data to an external ECC engine was not feasible with prior intra-bitline-only architectures (e.g., ParaBit, Flash-Cosmos). CrossBit's ability to perform the necessary XOR operations (composed from its NOR primitives) across different bitlines (data and parity bits) is what makes IF-ECC possible. This, in turn, solves a critical roadblock for IFP: the inability to use high-density MLC flash reliably. This is not merely an optimization; it is an enabling technology for a whole new class of IFP systems.
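As a small illustration of the composition this strength refers to, the sketch below builds XOR, the core of Hamming parity generation and checking, purely from a two-input NOR primitive. The five-gate construction is the generic textbook one, not CrossBit's actual mapping onto its Shared_node fabric.

```python
# XOR composed only from NOR, verified over the full truth table.
def nor(a, b):
    return 1 - (a | b)

def xor_from_nor(a, b):
    not_a_or_b = nor(a, b)                   # ~(a | b)
    a_and_b    = nor(nor(a, a), nor(b, b))   # a & b, via De Morgan
    return nor(not_a_or_b, a_and_b)          # ~(~(a|b) | (a&b)) = a XOR b

assert all(xor_from_nor(a, b) == (a ^ b) for a in (0, 1) for b in (0, 1))
print([xor_from_nor(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 0]
```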
Weaknesses
While the core contribution is novel, several supporting components of the architecture are either incremental advancements or applications of well-known design principles. The novelty is concentrated and specific, not system-wide.
-
Incremental Intra-Bitline Mechanism: The intra-bitline computing module described in Section 4.2.1 is presented as building upon existing page buffer structures. The authors note it is "conceptually similar to ParaBit [24]" but leverages bidirectional communication between L1 and L2 latches. This is an incremental engineering improvement over prior art, not a fundamental conceptual leap. The true novelty begins when data leaves a single bitline's latch hierarchy.
-
Application of Standard Design Patterns: The hierarchical organization—dividing the bitlines into "local groups" and using a "global module" to aggregate results—is a standard and well-understood technique for managing complexity and parallelism in large-scale circuit design. While its application here is sound, the pattern itself is not novel. The novelty resides entirely within the circuit implementation of the local and global modules, not in the hierarchical strategy.
-
Repurposing of Existing Circuit Techniques: The use of dynamic circuits for the DC-interconnect is correctly identified by the authors as a means to achieve high area efficiency (Section 2.1). This is a well-known technique in memory peripheral design. The novelty is not the use of dynamic logic itself, but its specific repurposing to create a shared computational bus connecting multiple bitline buffers.
Questions to Address In Rebuttal
The authors should clarify the following points to better delineate the boundaries and practicalities of their novel contribution:
-
Scalability of the Local Module: The local group size is fixed at 32 bitlines. This choice appears to be a trade-off between parallelism and the physical limitations (e.g., capacitance, delay, noise immunity) of the shared
Shared_node in the dynamic interconnect. Could the authors elaborate on the limiting factors of this shared bus? What prevents scaling this to 64 or 128 bitlines, and what would be the performance and reliability implications?
-
Qualitative Comparison to an Enhanced AresFlash: The primary delta over AresFlash is generality (Boolean-complete NOR vs. specialized shift). Imagine an enhanced version of AresFlash that supports bidirectional shifting. While still not Boolean-complete, it would be more flexible. Could the authors provide a more direct comparison of their NOR-based fabric against such a hypothetical, more powerful shift-based architecture? For which class of algorithms is the full Boolean completeness offered by CrossBit essential, beyond what a bidirectional shift-and-add unit could accomplish?
-
Programmability and Logic Synthesis: The paper discusses a programming method using
prims and notes that logic conversion was done manually, deferring an automated optimizer to future work (Section 5.1). The novelty of an architecture is intertwined with its usability. How complex is this manual mapping for non-trivial functions like the ones used in IF-ECC? Does this manual effort present a significant barrier to adoption, and what are the conceptual challenges to building a compiler that can effectively target this unique NOR-based fabric?
Nexus Machine: An Energy-Efficient Active Message Inspired Reconfigurable Architecture
Abstract
Modern reconfigurable architectures are increasingly favored for resource-constrained edge devices as they balance high performance, energy efficiency, and programmability well. However, their proficiency in handling regular compute patterns constrains ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present "Nexus Machine," a reconfigurable architecture designed to accelerate irregular workloads by leveraging an active message (AM) paradigm. The core novelty claimed is "in-network computing," where AMs carrying instructions and data can be opportunistically executed on idle Processing Elements (PEs) encountered en-route to their final destination. The paper argues this approach mitigates the load imbalance and underutilization common in traditional CGRAs and data-local architectures when handling sparse applications. While the ambition is noted, the paper's central claims are built on a foundation that lacks sufficient rigor, and the evaluation methodology appears to overstate the benefits while under-reporting the significant overheads and potential failure modes of such a dynamic system.
Strengths
- The paper correctly identifies a critical and persistent challenge in reconfigurable computing: the inefficient handling of irregular control flow and memory access patterns found in sparse workloads.
- The motivation to move from a static dataflow model (Generic CGRA) or a purely data-local model (TIA) towards a more dynamic execution paradigm is logical. Distributing tensors across PEs is a known strategy to alleviate memory bank conflicts.
- The authors compare their architecture against a reasonable set of baselines, including systolic, CGRA, and triggered-instruction architectures, providing a basis for performance comparison.
Weaknesses
My analysis reveals several critical weaknesses that undermine the paper's conclusions:
-
The Core "In-Network Computing" Mechanism is Under-specified and Potentially Flawed: The central premise of opportunistic execution on any idle PE is presented as a panacea for load imbalance. However, the mechanism is not detailed. What is the hardware cost and latency penalty for a message to query a PE's status, be accepted for execution, and then be re-injected into the network? This process is non-trivial. The paper provides no analysis of how this mechanism behaves under moderate to high network congestion, where few PEs might be idle, and the cost of routing and querying could easily overwhelm the computational benefit. The high "In-network Compute (%)" shown in Figure 11 seems implausible without a corresponding analysis of fabric load.
-
Unconvincing Deadlock Avoidance Strategy: The introduction of a new class of "algorithmic-level deadlock" due to message re-injection (Section 3.4, page 8) is a significant concern. The proposed solution—a combination of static acyclic data placement and a "lightweight runtime timeout (1024 cycles)"—is alarming. A timeout is not a proof of correctness but an admission of a potential, unresolvable failure mode. It suggests that the static analysis cannot guarantee deadlock freedom. This is unacceptable for a fundamental architectural guarantee. The paper provides no data on how often this timeout could be triggered or if it might prematurely terminate legitimate long-running computations.
-
Ad-Hoc and Inflexible Active Message Format: The AM format detailed in Figure 7 (Section 3.2, page 6) specifies exactly three destinations (R1, R2, R3). The justification that this is "based on our workload analysis (as SDDMM has three inputs)" is a textbook case of designing an architecture for a benchmark. This raises serious questions about the generality of Nexus Machine. How does it handle workloads with two, four, or more tensor inputs? Does this require a different, wider message format, or does it incur a significant performance penalty from serialization? The architecture's fundamental data-carrying mechanism appears brittle and over-fitted to the chosen applications.
-
Questionable Scalability Claims: The claim of "near linear scaling" in large arrays (up to 128x128 PEs) presented in Figure 18 (Section 5.5, page 12) is highly suspect. A dynamically routed, message-passing system like this is fundamentally bound by network diameter and congestion. The paper hand-waves this away by stating that "idle PEs along the path can perform computations," but provides no supporting data on average message latency, hop count, or the actual rate of successful en-route executions in these large-scale configurations. Without this data, the scalability claim is unsubstantiated.
-
Superficial Overhead Analysis: The area and power overhead analysis (Section 5.3, page 11) is presented as "moderate." However, an additional 12% routing area and a 17% power increase over a Generic CGRA are significant costs for resource-constrained edge devices. Furthermore, the 1KB AM Queue per PE is a substantial SRAM budget dedicated solely to holding in-flight messages, which may be underutilized in many workloads, representing a static power drain and area cost. The comparison in Table 2 (page 13) against prior work is also weak, comparing post-synthesis numbers for Nexus against post-P&R numbers for Pipestitch [50], which is not a methodologically sound comparison.
Questions to Address In Rebuttal
The authors must provide clear, data-driven answers to the following questions to make their case credible:
- Quantify the cycle overhead for a single instance of "opportunistic execution." This must include the cost of the PE availability check, message arbitration and decoding at the intermediate PE, ALU execution, and message re-assembly and re-injection. How does this overhead compare to simply routing the message to its destination? (A toy model of this decision loop follows after this list.)
- Regarding the algorithmic deadlock (Section 3.4): Can you provide a formal proof that your static data placement guarantees acyclic dependencies for all possible workloads, rendering the timeout mechanism purely a safeguard? If not, under what specific conditions can a deadlock occur, and what is the justification for the 1024-cycle value?
- Justify the fixed 3-destination active message format. Provide a quantitative analysis of the performance impact on kernels that do not naturally map to a 3-input structure (e.g., element-wise addition of two tensors, or a 4-input tensor operation).
- Provide detailed network statistics for the 64x64 and 128x128 scalability experiments (Figure 18). Specifically, show the distribution of message hop counts, average message latency, and the percentage of messages that successfully find an idle PE for en-route execution as a function of distance from the source.
- Re-evaluate the SOTA comparison in Table 2 using an apples-to-apples methodology (e.g., all results post-synthesis or all post-P&R on the same technology node). Justify why a 17% power increase over a baseline CGRA is an acceptable trade-off at the edge.
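To make the first question concrete, the toy model below walks an instruction-carrying message along a path of PEs and lets the first idle PE execute it, falling back to the destination otherwise. Every structure here (the PE class, the message fields, the 60% busy probability, one hop per PE) is an assumption for illustration; it says nothing about the real microarchitecture's query, arbitration, or re-injection costs, which is exactly what the question asks the authors to quantify.

```python
# Toy model of en-route "opportunistic execution"; not the paper's microarchitecture.
import random

class PE:
    def __init__(self, pe_id):
        self.pe_id = pe_id
        self.busy = False

    def try_execute(self, msg):
        if self.busy:
            return False
        self.busy = True                      # occupy the ALU for this message
        msg["result"] = msg["op"](*msg["operands"])
        msg["executed_at"] = self.pe_id
        return True

def route(msg, path):
    """Walk the message along its path; return (hops_taken, executed_en_route)."""
    for hops, pe in enumerate(path, start=1):
        if pe.try_execute(msg):
            return hops, pe is not path[-1]
    # No idle PE found: simplification -- force the destination to take it.
    path[-1].busy = False
    path[-1].try_execute(msg)
    return len(path), False

random.seed(0)
fabric = [PE(i) for i in range(16)]
for pe in fabric:
    pe.busy = random.random() < 0.6           # arbitrary 60% fabric load

msg = {"op": lambda a, b: a + b, "operands": (3, 4)}
hops, en_route = route(msg, fabric[2:9])      # a 7-hop path through the mesh
print(msg["result"], "at PE", msg["executed_at"], "after", hops, "hops; en-route:", en_route)
```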
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces the Nexus Machine, a reconfigurable architecture that masterfully blends the principles of Coarse-Grained Reconfigurable Arrays (CGRAs) with the classic Active Message (AM) paradigm from parallel computing. The core problem it addresses is the profound inefficiency of traditional CGRAs when executing irregular workloads, such as sparse matrix and graph computations. These workloads suffer from unpredictable data access and control flow, leading to severe load imbalance and underutilization of the processing fabric.
The authors' central contribution is a novel execution model they term "In-Network Computing" or "Opportunistic Execution." Instead of data flowing through a fabric of statically configured processing elements (PEs), Nexus Machine sends active messages containing both instructions and operands. Crucially, these messages can be executed en-route by any idle PE they traverse on their path to a destination. This dynamic, opportunistic execution acts as a powerful hardware-level load balancing mechanism, distributing computational load across the fabric in response to runtime conditions. The paper provides a detailed microarchitecture, a corresponding compiler flow, and a thorough evaluation demonstrating significant performance and utilization gains over state-of-the-art baselines.
Strengths
-
Elegant Synthesis of Classic and Modern Paradigms: The true strength of this paper lies in its insightful re-imagination of the Active Message model, a concept with deep roots in multicomputer history (e.g., J-Machine [10], CM-5 [12]), for the modern on-chip, reconfigurable fabric. It takes the principle of "sending computation to the data" and evolves it into "computation can happen anywhere along the path to the data." This is an elegant and powerful conceptual leap that directly addresses a fundamental weakness of statically-scheduled CGRAs.
-
A Compelling Solution to a Critical Problem: The inability to handle irregularity is arguably the single greatest barrier to the widespread adoption of CGRAs beyond niche DSP applications. This paper doesn't just chip away at the problem; it offers a foundational architectural solution. The visual comparison in Figure 3 (page 3) is particularly effective, clearly contrasting the bank conflicts of generic CGRAs and the static load imbalance of Triggered Instruction architectures with the balanced utilization achieved by Nexus Machine's dynamic approach.
-
Strong Contextualization and Positioning: The authors demonstrate a solid understanding of the research landscape. They correctly position their work against both traditional CGRAs and more recent dataflow-inspired designs like TIA. Furthermore, by framing their approach within the broader history of Active Messages, they provide a strong intellectual foundation for their architecture. It does not feel like an isolated invention but rather a thoughtful evolution of established principles.
-
Thorough and Convincing Evaluation: The experimental methodology is comprehensive. The choice of baselines is appropriate, covering systolic arrays, a generic CGRA, and the closely related TIA model. The breadth of workloads, spanning sparse, dense, and graph computations, effectively showcases the architecture's versatility. The ablation study in Figure 10 (page 9) and the analysis of network overhead in Section 5.3 (page 11) provide valuable insights into the architectural trade-offs.
Weaknesses
While the core idea is excellent, the paper could benefit from a deeper discussion of its broader implications and challenges.
-
The Compiler and Programmer's View: The paper presents a compiler flow (Section 3.6, page 8), but the implications of the highly dynamic and opportunistic execution model for the programmer and compiler writer are vast. Performance becomes non-deterministic. How does one debug a program where an instruction may execute on one of several PEs depending on runtime congestion? While the architecture may abstract this away, the lack of performance predictability is a significant challenge for software development that warrants more discussion.
-
Interaction of Static Placement and Dynamic Execution: The system relies on a static compiler pass to partition and place tensors, followed by a dynamic runtime execution. There seems to be a fascinating tension here. How robust is the dynamic load balancing to a sub-optimal initial data placement? If the compiler places two communicating data chunks at opposite corners of a large array, the system can still function via long-distance messages, but does this create new bottlenecks? A deeper analysis of this interplay would strengthen the paper.
-
Scalability and Network Dynamics: The paper shows impressive scalability (Figure 18, page 12). However, in very large fabrics, the NoC itself becomes a more complex system. While en-route execution mitigates latency, it doesn't eliminate it. For a 128x128 array, a message might traverse hundreds of PEs. This could lead to second-order effects, such as messages for one kernel interfering with another, or the creation of "traffic jams" that the turn-model routing and congestion control may not fully resolve. The paper could benefit from a qualitative discussion of these large-scale network phenomena.
Questions to Address In Rebuttal
-
The Active Message paradigm in its original context was a powerful tool for more than just offloading computation; it was used for fine-grained synchronization, remote procedure calls, and distributed data structures. Does the Nexus Machine's AM format and microarchitecture have the potential to support these more complex patterns? Could this framework be extended to handle, for instance, dynamic task spawning or synchronization primitives directly in the fabric?
-
Regarding the interplay between the compiler and runtime: How sensitive is the overall system performance to the quality of the initial data partitioning (Algorithm 1, page 9)? Could a simpler partitioning scheme (e.g., uniform block partitioning) be compensated for by the dynamic load balancing, or is the dissimilarity-aware mapping critical to preventing network saturation?
-
The concept of "In-Network Computing" is very powerful. Looking beyond CGRAs, could this principle of opportunistic execution by idle nodes be applied to other parallel computing domains, such as disaggregated datacenters or multi-chiplet processors, where resource utilization and load balancing are also critical challenges?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper introduces the "Nexus Machine," a coarse-grained reconfigurable architecture (CGRA) designed to efficiently execute irregular workloads. The authors identify load imbalance and memory bank conflicts as key challenges for traditional CGRAs when handling sparse applications.
The central claim of novelty lies in the synthesis of three concepts:
1. A reconfigurable fabric of Processing Elements (PEs).
2. An execution model inspired by Active Messages (AM), where instructions and operands travel together in a single network packet.
3. A mechanism for "In-Network Computing," which the authors term "Opportunistic Execution," allowing an active message to be executed by any idle PE it encounters along its route to a final destination.
This third element is presented as the primary novel contribution, designed to dynamically mitigate load imbalance by utilizing the entire fabric's compute resources, rather than restricting computation to source or destination PEs.
Strengths
-
Novel Execution Model for CGRAs: The core concept of opportunistic, en-route execution of instruction-bearing messages on a CGRA fabric is genuinely novel. While prior work has explored load balancing and active messages separately, their integration in this specific manner to address workload irregularity in CGRAs appears to be a new contribution. The mechanism directly converts idle PEs across the fabric into a distributed, opportunistic compute resource, which is a powerful idea.
-
Clear Distinction from Dominant Paradigms: The paper effectively differentiates its contribution from established CGRA models. It is not a spatial architecture with static data paths, nor is it a simple triggered-instruction architecture (TIA) where messages merely activate pre-loaded instructions. The key delta is that the instruction itself is mobile and can be executed dynamically at an intermediate location.
-
Principled Differentiation from Prior Load-Balancing Techniques: The proposed mechanism is conceptually distinct from traditional network load-balancing schemes like Valiant's algorithm [54]. Valiant routing randomizes the path of a packet to avoid hotspots but performs no computation at the intermediate node. Nexus Machine leverages the intermediate node for computation, fundamentally changing the role of the network fabric from pure communication to a hybrid communication-computation substrate. This is a significant conceptual advance.
Weaknesses
-
Insufficient Differentiation from Task-Based In-Network Execution: The concept of executing work within the network has conceptual overlap with prior architectures like Dalorex [43]. Dalorex spawns fine-grained tasks that are handled by in-order cores distributed across the fabric. While the authors briefly mention Dalorex in the related work (Section 6.1, page 13), they do not sufficiently articulate the novelty of their instruction-level "opportunistic execution" relative to Dalorex's task-level model. The delta appears to be one of granularity (a single instruction vs. a task/thread), but the conceptual foundation of using network-traversing work to find idle compute resources is similar. The paper would be stronger if it provided a direct, quantitative comparison or a clearer architectural argument for why the instruction-level approach is superior for their target (edge) domain.
-
Overstated Novelty of "Active Message" Adaptation: The authors state their definition of AM "diverges significantly" from the original concept (Section 2.1, page 2). While the application to a CGRA and the multi-destination format are notable engineering contributions, the fundamental idea—a message containing a handler (opcode) and arguments that triggers computation upon arrival—remains intact. The primary novelty is not in the AM format itself, but in where it can be executed (i.e., en-route). The framing could be more precise to credit the novelty to the execution model rather than a reinvention of active messages.
-
Vague Description of the Core Novel Mechanism: The paper's most novel component—the decision logic for opportunistic execution—is not detailed sufficiently. Figure 8 shows the microarchitecture, but the control logic that enables a router and PE to identify a passing message, assess the ALU's availability, hijack the message for execution, and re-inject a potentially "morphed" message into the network is abstracted away. For a contribution centered on this mechanism, the implementation details are critical. Is this check performed in a single cycle? Does it add latency to messages that are simply passing through? The feasibility and cost of this core mechanism are central to its claimed novelty and practical value.
Questions to Address In Rebuttal
-
Regarding Dalorex: Please provide a more detailed comparison to the Dalorex architecture. What are the specific architectural trade-offs between your instruction-level "opportunistic execution" and their task-based data-local execution model? Why is your approach fundamentally different or better suited for resource-constrained edge devices?
-
Control Logic for En-Route Execution: Could you elaborate on the microarchitectural implementation of the opportunistic execution decision? Specifically, how does an intermediate PE determine if a traversing message is a candidate for execution, check for ALU availability, and execute the instruction without disrupting the network pipeline or introducing significant latency? Is the "idle" state of an ALU broadcast or polled?
-
Clarification of Message "Morphing": The term "morphing" is used to describe how active messages change based on dynamic control flow (Abstract, page 1). Could you provide a concrete example beyond Figure 5 of what information in the active message packet (Figure 7) is modified after an en-route execution? For instance, is the destination list (R1, R2, R3) updated, or is the Result field populated, or both?
-
Scalability of the Novel Mechanism: Your scalability results in Figure 18 (page 12) show near-linear scaling. However, in very large arrays (e.g., 128x128), the network diameter is large. Does the latency of a long message traversal begin to outweigh the benefit of finding an idle PE for a single instruction's execution? At what network scale or workload characteristic does the overhead of your opportunistic execution mechanism start to diminish its returns?
Crane: Inter-Layer Scheduling Framework for DNN Inference and Training Co-Support on Tiled Architecture
Abstract
Tiled architectures have emerged as a compelling platform for scaling deep neural network (DNN) execution, offering both compute density and communication efficiency. To harness their full potential, effective inter-layer scheduling is crucial for ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present Crane, a framework for inter-layer scheduling on tiled architectures that aims to provide co-support for both DNN inference and training. The central thesis is that prior works are limited due to an inadequate scheduling representation. The paper proposes a hierarchical, table-format representation (ScT and MeT) to capture four key design factors: execution scheme, fusion, recomputation, and batch splitting. This representation is used to formulate the scheduling task as a Mixed-Integer Linear Programming (MILP) problem. The evaluation claims significant improvements in Energy-Delay Product (EDP) and scheduling speed over existing frameworks like SET, Tangram, and TileFlow.
However, the work's core claims of providing a "thoroughly" explored, mathematically structured optimization appear to be undermined by the methodology itself, which is a complex multi-stage heuristic that uses MILP as a component rather than constituting a single, comprehensive optimization. The scalability of the approach is not sufficiently demonstrated, and the evaluation relies on questionable "hypothetical" baselines for its most significant claims in the training domain.
Strengths
- Expressive Representation: The proposed table-based representation using a Scheduling Table (ScT) and Memory Table (MeT) is a clear and systematic way to model the complex state space of inter-layer scheduling, including memory lifetime and sub-batch progress (Section 5, pages 6-7). This is a well-conceived abstraction.
- Unified Scope: The ambition to unify the scheduling of inference and training workloads under a single framework that considers execution schemes, fusion, recomputation, and batch splitting (E+F+R+B) is commendable. The paper correctly identifies this as a gap in prior research.
- Formal Problem Components: The use of MILP to solve sub-problems within the scheduling flow is a valid approach for finding locally optimal solutions for workload and memory configurations, given a fixed block structure and sub-batch size.
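To ground what "MILP as a component" can look like at this granularity, the toy below encodes a single fuse-or-spill decision per layer under an on-chip buffer budget using PuLP. It is an illustrative reconstruction under assumed sizes and traffic numbers, not Crane's actual ScT/MeT-based formulation.

```python
# Toy MILP in the spirit of the per-block sub-problems: choose, for each layer, whether
# to keep its activation on-chip (fused) subject to a buffer budget, minimizing off-chip
# traffic. Sizes, traffic numbers, and the PuLP usage are assumptions for the sketch.
from pulp import LpProblem, LpVariable, LpMinimize, lpSum, PULP_CBC_CMD

layers = ["conv1", "conv2", "conv3", "fc"]
act_mb    = {"conv1": 6, "conv2": 4, "conv3": 3, "fc": 1}   # activation size if kept on-chip
traffic   = {"conv1": 12, "conv2": 8, "conv3": 6, "fc": 2}  # off-chip traffic if NOT fused
BUFFER_MB = 8                                               # on-chip buffer budget

prob = LpProblem("toy_fusion_milp", LpMinimize)
fuse = {l: LpVariable(f"fuse_{l}", cat="Binary") for l in layers}

# Objective: total off-chip traffic for activations that are not fused.
prob += lpSum(traffic[l] * (1 - fuse[l]) for l in layers)

# Constraint: fused activations must fit in the on-chip buffer together.
prob += lpSum(act_mb[l] * fuse[l] for l in layers) <= BUFFER_MB

prob.solve(PULP_CBC_CMD(msg=False))
print({l: int(fuse[l].value()) for l in layers})
```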
Weaknesses
-
Contradiction in Optimization Claims: The paper repeatedly frames its approach as a "thoroughly" explored, "mathematically structured optimization" in contrast to "heuristic search algorithms" (Abstract, Table 1). However, the overall optimization process described in Section 5.5 (Figure 7) and Section 6 is fundamentally a high-level heuristic search. The process involves:
- Enumerating candidates for sub-batch size (
BSsub). - Pruning the search space to the top
K1candidates based on aCost_compMILP. - Further pruning to the top
K2candidates based on aCost_trafficMILP. - An iterative, gradual partitioning for the hierarchical structure that terminates based on a pre-set threshold
θ(Section 6, Step-2).
This is not a single, thorough optimization. It is a meta-heuristic that relies on pruning and arbitrary thresholds (
K1,K2,θ), which directly contradicts the paper's central criticism of prior work. The claim of avoiding "incomplete and inefficient exploration" is therefore unsubstantiated. - Enumerating candidates for sub-batch size (
-
Unproven Universality and Unanalyzed Scalability: Theorem 1 (page 5) claims universality for any DAG-structured model. The provided "proof" is a high-level, informal sketch. It fails to formally address the potential complexity and depth of the required hierarchy for highly branched models, nor does it prove that the pipeline-derived state basis is sufficient for all valid execution schedules. Furthermore, the paper provides no analysis of the scalability of its core MILP formulation. The state space
SBfor a composite block ofNsub-blocks is2N-1. The runtime of the MILP solver is highly dependent onN. The reported scheduling speedups are based on models of moderate depth, but there is no evidence to suggest this approach would remain tractable for significantly deeper or more complex models. -
Weak and Unverifiable Baselines in Training Evaluation: The most dramatic results (e.g., up to 21.01x EDP reduction) are reported in the training evaluation (Section 7.3). However, these results are derived from comparisons against "hypothetical training-oriented SET and Tangram" (page 11). These baselines are not established works; they are the authors' own modifications of inference-only schedulers. The validity of these comparisons is highly suspect. There is no way to verify if these hypothetical baselines were implemented in a fair, robust, or optimized manner. The critical result that SET and Tangram produce an "out-of-memory error" for OPT-6.7B (Figure 11b) is entirely dependent on the authors' unverified implementation choices for these baselines. This is not a scientifically rigorous comparison.
-
Heuristic Simplifications Within the Formalism: The method for handling recomputation (Section 5.3, page 8) appears to be a heuristic choice rather than an optimally derived one. The process is split into two distinct steps: a "Backward Pass-only Pre-processing" step followed by a "Forward Recomputation-then-Backward Pass" step. This split simplifies the dependency graph by ensuring recomputed activations are always available before they are needed. However, it is not proven that this decoupling is optimal. A truly optimal solution might interleave recomputation and backward propagation on a finer granularity. This simplification appears to be a concession to tractability that undermines the claim of optimality.
Questions to Address In Rebuttal
- Please reconcile the claims of "thorough" exploration and moving beyond "heuristic search" with the methodology described in Section 5.5 and 6, which explicitly relies on heuristic pruning (
K1,K2) and convergence thresholds (θ). Is the overall framework not, in fact, a meta-heuristic? - Provide empirical data or a formal analysis on the scalability of the MILP solver runtime as the number of sub-blocks
Nin a composite block increases. At what level of model complexity does Crane's scheduling time become prohibitive? - Regarding the training evaluation (Section 7.3), can the authors provide a detailed description of the implementation of the "hypothetical" SET and Tangram baselines? Specifically, how was recomputation support added, and what steps were taken to ensure this was a fair and competitive implementation rather than a strawman?
- Justify why the two-step recomputation process (Section 5.3) is optimal. Have alternative, more coupled strategies for recomputation and backward pass scheduling been considered? If so, why were they discarded? If not, how can the current approach be considered part of a "thorough" optimization?
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents Crane, a novel framework for inter-layer scheduling of DNN workloads on tiled architectures. The authors identify the root cause of limitations in prior work—such as incomplete optimization, scheduling inflexibility, and inefficient search—as the absence of a sufficiently powerful and structured scheduling representation.
The core contribution of this work is not merely a new scheduler, but a new representational framework for the scheduling problem itself. Crane introduces a hierarchical, table-based abstraction (ScT for scheduling, MeT for memory) that is expressive enough to unify the four key design factors: execution scheme (E), layer fusion (F), recomputation (R), and batch splitting (B). This unified representation is mathematically structured, allowing the authors to reformulate the scheduling problem as a Mixed-Integer Linear Program (MILP). This shift from heuristic search (e.g., Simulated Annealing) to a formal optimization method enables a more complete and efficient exploration of the design space, uniquely providing co-support for both DNN inference and training workloads within a single framework.
Strengths
-
A Fundamental Shift in Problem Representation: The paper's primary strength lies in its diagnosis of the core problem. Instead of treating the symptoms (e.g., slow heuristic search), the authors correctly identify the disease: an inadequate representation. The proposed hierarchical table-format is an elegant solution. It successfully captures the complex interplay between computation, data movement, and memory lifetime across layers, which has been a significant challenge in this domain. The three principles laid out in Section 3.1 (page 3)—rich expressiveness, topological flexibility, and mathematical structuredness—are precisely the right goals for such a representation, and the paper delivers on them.
-
Unification of Inference and Training Scheduling: This is a major step forward for the field. Historically, tools have been designed for either inference (e.g., SET, Tangram) or training (e.g., Checkmate), because the optimization objectives and constraints (especially memory pressure and the need for recomputation) are so different. By creating a representation that natively incorporates recomputation (R) as a first-class citizen alongside fusion and execution schemes, Crane provides a single, coherent framework for both domains. This is not just a matter of convenience; it allows for the discovery of scheduling strategies that holistically optimize the entire compute lifecycle and simplifies the software stack for hardware vendors and researchers. -
Methodological Advancement from Heuristics to Formal Optimization: The move from sampling-based heuristics (like those in SET and TileFlow) to a MILP formulation is a significant increase in methodological rigor. While heuristics can be effective, they offer no guarantee of optimality and often suffer from slow convergence. By structuring the problem for a MILP solver, Crane can perform a more thorough search of the scheduling space. The impressive empirical results, showing a 2.82× scheduling speedup over SET despite exploring a larger search space, validate that this structured approach is not just theoretically sound but also practically efficient for the problem scale considered.
-
Strong Connection to the Broader Landscape: The authors do a commendable job of situating their work. Table 1 (page 2) provides a clear and concise summary of the state-of-the-art, highlighting the specific gaps Crane aims to fill. The framework builds upon a rich history of work in both intra-layer (which it pragmatically accepts as a plug-in) and inter-layer scheduling, while charting a new path forward. This work synthesizes ideas from compiler theory (hierarchical representations), optimization (MILP), and computer architecture (tiled accelerators) into a cohesive whole.
Weaknesses
My critiques are less about flaws and more about the boundaries and future implications of this approach.
-
Scalability of the MILP Formulation: The primary concern with any MILP-based solution is its scalability. While the results are excellent for the models tested, MILP problems are NP-hard in the general case. The paper’s hierarchical approach is a clever and necessary method for managing this complexity, effectively acting as a problem decomposition heuristic. However, it's unclear how this approach will scale to the next generation of truly gargantuan models (e.g., Mixture-of-Experts with complex routing, or models with thousands of layers). The runtime of the optimization process itself could become a bottleneck as the number of blocks, layers within blocks, and states grows.
-
The Optimality of the Hierarchy: Theorem 1 (page 5) establishes the universality of the representation—that any DAG-based schedule can be represented. However, the process of finding the optimal hierarchical block structure itself seems non-trivial. The gradual, cost-driven partitioning process described in Section 6 (page 10) is a heuristic. This introduces a potential weakness: the quality of the final schedule is dependent on the quality of this initial, high-level partitioning. An early suboptimal decision in structuring the blocks could prune away the true globally optimal solution.
-
Decoupling of Inter- and Intra-Layer Scheduling: The decision to treat the intra-layer scheduler as a "plug-in" that provides cost estimates (α, β, γ in Eqs. 24 & 25, page 10) is pragmatic. However, it abstracts away the deep coupling between these two problems. For instance, an inter-layer decision to fuse two layers might drastically change the optimal intra-layer tiling strategy and dataflow for that fused operator. A simple cost model might not capture these non-linear effects, potentially leading the inter-layer optimizer astray.
Questions to Address In Rebuttal
-
Regarding the scalability of the MILP solver: Could the authors comment on the growth of the MILP problem size and solver runtime as a function of model depth and width? Have they identified any "cliffs" where the problem becomes intractable, and does the hierarchical pruning strategy effectively mitigate this for the foreseeable future of model sizes?
-
Concerning the hierarchical partitioning: How sensitive is the final solution quality to the heuristic partitioning strategy described in Section 6? For example, if you start with a different initial block structure (e.g., one layer per block vs. the paper's dependency-based grouping), how much does the final EDP change? This would help clarify whether the framework consistently finds a near-optimal solution regardless of this initial step.
-
On the interaction between scheduling layers: Could the authors elaborate on the fidelity of the intra-layer cost model? Is a single-pass cost estimation sufficient, or have they considered an iterative approach where Crane's inter-layer decisions are used to refine the intra-layer schedules, which in turn provide more accurate costs back to Crane?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper introduces Crane, an inter-layer scheduling framework for DNNs on tiled architectures. The authors identify the core problem in prior work as the lack of a unified, expressive representation for scheduling decisions. Their central claim to novelty is the creation of a hierarchical table-format representation that co-optimizes four key design factors: execution scheme (E), layer fusion (F), recomputation (R), and batch splitting (B).
This representation consists of two main components: 1. A Scheduling Table (ScT) that tracks the cumulative number of sub-batches processed by each layer/block over time. 2. A Memory Table (MeT) that tracks the lifetime of intermediate data in on-chip (SRAM) and off-chip (DRAM) memory by recording the lower bound of the valid sub-batch range.
The authors argue that this specific, structured representation is novel because it is the first to unify all four design factors, and its mathematical structure allows the complex scheduling problem to be formulated as a Mixed-Integer Linear Program (MILP), which can be solved more efficiently and thoroughly than prior heuristic-based methods.
Strengths
The primary strength of this work lies in the conceptual novelty of its core representation.
-
Unified Representation for Disparate Factors: The most significant novel contribution is the joint representation of execution scheme, fusion, and recomputation within a single, co-optimizable mathematical structure. Prior art has treated these as largely separate problems. For instance, frameworks like SET [5] and Tangram [9] focus on execution schemes and fusion, while Checkmate [15] is purpose-built for recomputation. These approaches are fundamentally incompatible. Crane's ScT/MeT formulation provides a common, low-level language to describe the effects of all these decisions, which appears to be a genuinely new approach.
-
Novel Abstraction of Scheduling States: The paper's formalization of execution states as being derived from a canonical pipeline pattern (Section 4, pages 4-5) is a clever and powerful abstraction. This provides a regular, predictable structure to the state space, in contrast to more ad-hoc or constrained representations like the ratio-tree in SET [5]. The claim of universality for any DAG-structured model (Theorem 1, page 5) is a strong theoretical underpinning that distinguishes this work from prior schedulers that are often limited to linear chains of computation.
-
Representation-Driven Optimization: The novelty is not simply the use of an MILP solver, which has appeared in prior work (e.g., Checkmate [15]). Rather, the novelty is in designing a representation specifically to enable a clean MILP formulation. The use of cumulative sub-batch counts (ScT) and memory range bounds (MeT) translates complex temporal dependencies and memory lifetime constraints into a set of linear inequalities. This tight coupling of representation and optimization is a significant conceptual advance over prior works that relied on heuristic search over less structured representations.
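To make the flavor of this claim concrete, the following is a minimal, self-contained sketch of how cumulative-count tables can turn producer-consumer dependencies and buffer capacity into affine checks. It is an invented illustration, not the paper's actual ScT/MeT formulation: the layer names, table entries, and the particular constraint set are all hypothetical.

```python
# Toy illustration: cumulative sub-batch counts as linear constraints.
# ScT[layer][t] = number of sub-batches the layer has finished by time step t.
# All names and numbers below are invented for illustration only.

NUM_STEPS = 6
TOTAL_SUBBATCHES = 4

# A hypothetical schedule for a two-layer producer -> consumer chain.
ScT = {
    "producer": [1, 2, 3, 4, 4, 4],
    "consumer": [0, 1, 2, 3, 4, 4],
}

BUFFER_CAPACITY = 2  # sub-batches of intermediate data that fit on chip


def check_schedule(sct):
    for t in range(NUM_STEPS):
        p, c = sct["producer"][t], sct["consumer"][t]
        # Dependency: the consumer can only use what has been produced.
        assert c <= p, f"dependency violated at step {t}"
        # Memory: live intermediates = produced but not yet consumed.
        assert p - c <= BUFFER_CAPACITY, f"buffer overflow at step {t}"
        # Monotonicity: cumulative counts never decrease.
        if t > 0:
            assert sct["producer"][t] >= sct["producer"][t - 1]
            assert sct["consumer"][t] >= sct["consumer"][t - 1]
    # Completion: everything is processed by the last step.
    assert sct["consumer"][-1] == TOTAL_SUBBATCHES
    return True


if __name__ == "__main__":
    print("schedule feasible:", check_schedule(ScT))
```

Once the table entries become decision variables rather than fixed numbers, each of these checks is a linear inequality, which is the sense in which the representation is credited with "mathematical structuredness".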
Weaknesses
While the combination of ideas is novel, the individual components are built upon existing concepts. The paper could do a better job of delineating the precise boundary of its novelty.
-
Incremental Novelty of Components: Hierarchical problem decomposition is a standard technique. The use of tables to track system state is fundamental. MILP is a known tool for optimization problems. The core novelty rests entirely on the specific semantics of the ScT and MeT tables. The authors present this as a revolutionary framework, but it can also be viewed as a very clever, new encoding scheme for a known solver.
-
Unclear Scalability of the Core Formulation: The reliance on MILP, even with a well-structured problem, raises questions about scalability that are not fully addressed. While the hierarchical search strategy (Section 6, page 10) is a practical solution to prune the search space, it is a heuristic layered on top of the core MILP formulation. What are the scaling properties of the MILP itself for a single, very large, "flat" block with many sub-components? The novelty of the MILP formulation is only valuable insofar as it remains tractable for problems of interest. Does the representation itself offer any intrinsic scaling advantages over, for example, the ILP formulation in Checkmate [15], especially given that Crane's problem space is significantly larger?
-
Complexity of the Recomputation Model: The model for recomputation (Section 5.3, page 8) requires a two-step process: a backward-only pass followed by a combined recomputation-and-backward pass. This appears to be an artifact of the representation's design. A truly universal and fully expressive representation might be expected to model the bi-directional dataflow of training in a single, unified pass. This two-step process, while functional, suggests a potential limitation in the expressive power of the novel table format and feels less elegant than the forward-pass model.
Questions to Address In Rebuttal
-
Prior work like SET [5] also uses a hierarchical (tree-based) representation. Please clarify the fundamental conceptual novelty of your pipeline-derived hierarchical state abstraction beyond the stated benefit of "flexibility." What does your abstraction allow that is fundamentally impossible to represent in a ratio-tree or a similar hierarchical graph?
-
The core claim is that the table format enables a more thorough and efficient search via MILP. Given that MILP solvers can have exponential worst-case complexity, what specific properties of the ScT/MeT constraint formulation (Eqs. 1-12, pages 6-8) make it particularly tractable for this expanded search space (E+F+R+B) compared to prior, more limited ILP/MILP formulations like Checkmate [15]?
-
Regarding the recomputation model in Section 5.3, is the two-step backward process (pre-processing pass and recomputation pass) a necessary consequence of the ScT/MeT representation, or is it a heuristic choice to simplify the formulation? Could the tables, as defined, represent a single, interleaved recomputation-and-backward process, and if not, what does this imply about the representation's completeness for training workloads?
Elk: Exploring the Efficiency of Inter-core Connected AI Chips with Deep Learning Compiler Techniques
Abstract
To meet the increasing demand of deep learning (DL) models, AI chips are employing both off-chip memory (e.g., HBM) and high-bandwidth low-latency interconnect for direct inter-core data exchange. However, it is not easy to explore the efficiency of these ...
Reviews
Review 1
Review Form:
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present ELK, a compiler framework designed to optimize the performance of deep learning models on Inter-core Connected AI (ICCA) chips equipped with off-chip High Bandwidth Memory (HBM). The paper identifies a key performance challenge as the three-way contention between on-chip compute, inter-core communication, and off-chip I/O. ELK's proposed solution involves an inductive operator scheduling policy, a cost-aware memory allocation algorithm, and a preload order permutation strategy to navigate these tradeoffs. The evaluation, conducted on an emulation framework and a simulator, claims to achieve up to 94.84% of an ideal roofline performance.
While the problem is well-defined and relevant, the work suffers from significant internal contradictions regarding its optimality claims and relies on an evaluation methodology with questionable fidelity, particularly in its baselines and HBM emulation.
Strengths
- Problem Formulation: The paper does a commendable job of identifying and articulating the fundamental performance tradeoffs on ICCA chips with HBM. The characterization of contentions in on-chip memory space, interconnect bandwidth, and memory access (Section 2.3, page 3) provides a clear motivation and foundation for the work.
- Holistic Approach: The framework attempts to co-optimize multiple interdependent factors (tile sizes, preload depth, data placement, preload order), which is the correct direction for tackling this complex problem. Structuring these factors as a searchable compiler space is a logical approach.
- Design Space Exploration: The sensitivity analysis presented in Section 6.4, which explores the impact of varying network topologies and bandwidths, provides useful, quantified insights for architects of future ICCA systems.
Weaknesses
- Contradictory Claims of Optimality: The paper makes strong but unsubstantiated claims of optimality. Section 4.2 (page 6) states that the proposed scheduling algorithm "provably finds the end-to-end plan with the shortest total time". However, this claim is directly contradicted by the core memory allocation algorithm described in Section 4.3 (page 7), which employs a greedy heuristic to select from Pareto-optimal plans ("selects the most 'cost-effective' operator"). Furthermore, the preload order permutation in Section 4.4 (page 8) relies on heuristics to prune the search space, such as only reordering "HBM-heavy" operators. An algorithm that relies on greedy heuristics in its critical sub-problems is not globally optimal. The claims of optimality are misleading and must be rectified.
- Flawed "Ideal" Baseline: The "Ideal" baseline used for comparison is defined as a theoretical machine with separate, contention-free interconnects for preload and execution, and no on-chip memory capacity constraints (Section 6.1, page 10). This is a physically unrealizable architecture. Evaluating against this unrealistic ideal serves to inflate the reported performance figures (e.g., "achieves 94.84% of the ideal performance"). The purpose of a scheduler on a real machine is to manage contention, not to perform on a magical machine where none exists. A rigorous evaluation requires a comparison against a much tighter, more realistic upper bound, such as a solution from a constraint solver for a small model or an oracle with perfect future knowledge on the emulated hardware.
- Low-Fidelity HBM Emulation: The experimental methodology for emulating HBM access is a significant concern. The authors state they use a single core on their IPU-POD4 to act as an "HBM controller" that broadcasts data to all other cores (Section 5, page 9). This one-to-many broadcast pattern is not representative of a real system with multiple HBM controllers connected to the interconnect fabric, as depicted in Figure 1. A real system would generate complex many-to-many traffic patterns, leading to different and likely more severe contention scenarios. This simplification calls into question the validity of the interconnect utilization and contention results, which are central to the paper's claims.
- Insufficient Justification of Novelty: The core algorithmic components are not sufficiently differentiated from standard practices. The intra-operator tradeoff analysis involves generating a Pareto-optimal curve of plans, and the inter-operator tradeoff is resolved with a greedy heuristic. This general approach is common in resource allocation problems. The paper fails to adequately argue for the specific novelty of its cost-aware allocation algorithm beyond its application to this specific hardware context.
Questions to Address In Rebuttal
- Please clarify the central claim of your paper. Is the ELK framework optimal or heuristic? You must reconcile the claim of a "provably" optimal scheduling algorithm in Section 4.2 with the explicit use of greedy heuristics for memory allocation in Section 4.3 and for search space pruning in Section 4.4.
- Please justify the selection of your "Ideal" baseline. Given that it models a physically impossible machine, how can it serve as a meaningful benchmark for a scheduler whose primary role is to manage contention? Could you provide results against a more grounded upper bound, even for a smaller problem, to demonstrate the true quality of your scheduler?
- Can you provide a more detailed defense of your HBM emulation methodology? Specifically, how does a single-source broadcast traffic pattern accurately model the interconnect contention generated by multiple, distributed HBM controllers? What is the potential impact on your results if a more realistic many-to-many traffic pattern was modeled instead?
- The decision to only reorder "HBM-heavy" operators is a critical heuristic for tractability. What analysis was performed to validate this heuristic? Please provide data or a strong argument to show that the performance potential from reordering operators with smaller memory footprints is negligible.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces ELK, a compiler framework designed to optimize the performance of Deep Learning (DL) models on a class of hardware I will refer to as Inter-Core Connected AI (ICCA) chips. The central thesis is that the performance of these architectures, which feature both large, distributed on-chip SRAM and high-bandwidth off-chip memory (HBM), is governed by a fundamental three-way trade-off between on-chip computation, inter-core communication, and off-chip I/O.
The authors' core contribution is to formalize this trade-off space and develop a compiler that systematically explores it. ELK employs several novel techniques, including a two-level inductive scheduling policy to manage the overlap between execution and preloading, a cost-aware memory allocation algorithm to optimally partition on-chip SRAM, and a preload order permutation strategy to minimize memory pressure and interconnect contention. The work is evaluated through a robust framework that includes an emulator built on real Graphcore IPU hardware and a flexible simulator for architectural design space exploration. The results demonstrate that ELK can achieve performance close to 95% of a theoretical roofline, significantly outperforming more straightforward compilation strategies.
Strengths
-
Excellent Problem Formulation: The paper does a superb job of identifying and articulating the core challenge. The "fundamental tussle" among compute, communication, and I/O is not new in systems research, but its specific manifestation on ICCA chips is a timely and important problem. Figure 2 provides a wonderfully clear and concise conceptual diagram of the resource contentions that motivate the entire work. This clear framing elevates the paper, making its contributions easy to understand and appreciate.
-
Systematic and Principled Approach: ELK is not merely a collection of ad-hoc heuristics. The authors have constructed a principled framework that maps the high-level performance factors to concrete, tunable compiler parameters (as shown in Figure 4). The multi-stage optimization process, from inductive scheduling (§4.2) to cost-aware allocation (§4.3) and finally preload reordering (§4.4), represents a comprehensive and well-reasoned strategy for navigating the complex optimization space.
-
Significant Contribution to Architectural Exploration: Perhaps the most impactful aspect of this work is its utility beyond just being a compiler. The framework presented, particularly the simulator, serves as a powerful tool for architectural design space exploration (§6.4). By allowing researchers to quantify the impact of scaling HBM bandwidth, interconnect topology (mesh vs. all-to-all), and core count, ELK provides a methodology for co-designing future AI hardware and the software that will run on it. This connects the domains of compilers and computer architecture in a very meaningful way.
-
Novel and Insightful Techniques: The concept of preload order permutation (§4.4) is particularly insightful. The realization that the order of data preloading—not just the amount—can be manipulated to manage the lifetime of large tensors in on-chip memory (Figure 13) is a subtle but powerful optimization. It shows a deep understanding of the system's temporal behavior. A toy illustration of this effect appears after this list.
-
Strong and Convincing Evaluation: The decision to build both an emulator on real hardware and a configurable simulator provides the best of both worlds: the credibility of real-system measurements and the flexibility for what-if analysis. The performance breakdown in Figure 18(a) is especially compelling, as it visually demonstrates how ELK achieves its speedup by converting idle "preload" and "interconnect" time into productive "overlapped" time.
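The toy calculation below (referenced in the fourth strength above) illustrates why preload order alone can change the peak on-chip footprint: an early preload of a large tensor keeps it resident across more steps before it is consumed. The tensor sizes, consumption steps, and the one-preload-per-step model are invented for illustration and are not ELK's permutation algorithm or cost model.

```python
# Toy illustration of how preload order affects peak on-chip memory.
# Tensor sizes (KB), consumption steps, and the one-preload-per-step model
# are invented for illustration; this is not ELK's permutation algorithm.

tensor_size = {"A": 64, "B": 256, "C": 32}
consumed_at_step = {"A": 1, "C": 2, "B": 3}  # layers execute A, then C, then B


def peak_footprint(preload_order):
    """Preload one tensor per step; a tensor is freed right after the step
    that consumes it. Assumes every tensor is preloaded no later than the
    step that consumes it (asserted below)."""
    resident, peak = set(), 0
    last_step = max(consumed_at_step.values())
    for step in range(last_step + 1):
        if step < len(preload_order):
            resident.add(preload_order[step])        # preload this step
        consumers = [t for t, s in consumed_at_step.items() if s == step]
        assert all(t in resident for t in consumers), "infeasible preload order"
        peak = max(peak, sum(tensor_size[t] for t in resident))
        for t in consumers:                          # consumed, so freed
            resident.discard(t)
    return peak


if __name__ == "__main__":
    print("large tensor B preloaded first:", peak_footprint(["B", "A", "C"]), "KB peak")
    print("large tensor B preloaded last: ", peak_footprint(["A", "C", "B"]), "KB peak")
```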
Weaknesses
While the work is strong, its placement in the broader landscape could be strengthened, and some practical limitations could be addressed more directly.
-
Portability of the Cost Model: The framework's effectiveness hinges on the accuracy of its performance cost models. While the authors demonstrate impressive accuracy for the IPU architecture (Figure 12), the effort required to port this model to a fundamentally different ICCA chip (e.g., SambaNova's RDU with its spatial dataflow model, or Cerebras's WSE) is non-trivial. The paper could benefit from a discussion on what architectural features are most critical to model and the potential challenges in adapting ELK to hardware with different programming or execution paradigms.
-
Limited Discussion on Dynamic Workloads: The optimization process described is entirely static and performed ahead-of-time (AOT). This is well-suited for models like BERT or GPT where the computation graph is fixed. However, an increasingly important class of models, such as Mixture-of-Experts (MoE), involves input-dependent dataflow. The authors briefly mention MoE in the discussion (§7), but a deeper analysis of how ELK's static assumptions would break down and how its principles might be adapted for just-in-time (JIT) compilation or runtime scheduling would broaden the work's context.
-
Connection to Classical HPC and Distributed Systems: The problem of overlapping I/O, communication, and computation is a cornerstone of high-performance computing (HPC). While the domain is different (intra-chip vs. inter-node), many of the underlying principles are analogous. The paper could be strengthened by explicitly connecting its techniques to concepts from the broader parallel computing literature, such as communication-avoiding algorithms, asynchronous I/O scheduling, or task-graph-based execution models (e.g., Legion, Dask). Doing so would help contextualize ELK's contributions for a wider audience.
Questions to Address In Rebuttal
-
Regarding the cost model's portability: Could the authors elaborate on the process of creating a cost model for a new ICCA architecture? What are the key architectural parameters that must be captured, and how much of the existing modeling infrastructure could be reused versus needing a complete rewrite?
-
Regarding dynamic workloads like Mixture-of-Experts (MoE): As discussed in Section 7, MoE requires loading different "expert" weights at runtime. How would the inductive scheduling approach handle this uncertainty? Would a hybrid online/offline strategy be necessary, where ELK pre-optimizes plans for each expert, and a runtime system selects them?
-
Could the authors comment on the relationship between their work and classical communication-avoiding algorithms in HPC? ELK's strategy of broadcasting shared data during preload to reduce inter-core traffic during execution (§3.3) seems spiritually similar. Are there deeper parallels or lessons from that domain that could inform future iterations of ELK?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents ELK, a compiler framework designed to optimize the performance of deep learning models on Inter-core Connected AI (ICCA) chips that are augmented with off-chip HBM. The authors identify a core performance challenge as a "fundamental tussle" among three factors: per-core compute, inter-core communication, and off-chip I/O. The central claim of the paper is that ELK is the first framework to jointly and holistically optimize these three competing factors. The proposed novel contributions are a set of compiler techniques to navigate this trade-off space: (1) a two-level inductive operator scheduling policy to decide on the number of operators to preload, (2) a cost-aware memory allocation algorithm to partition on-chip SRAM between executing and preloading operators, and (3) a preload order permutation scheme to decouple preload order from execution order, aiming to reduce resource contention. The work is evaluated using an emulator built on real hardware and a custom simulator.
My analysis concludes that while the high-level concepts leveraged by the authors (e.g., overlapping operations, scheduling heuristics, Pareto-optimality) are well-established in the broader field of compilers and high-performance computing, their specific synthesis, formulation, and application to the unique resource constraints of ICCA chips represent a novel and meaningful contribution.
Strengths
-
Novel Problem Formulation: The primary strength of this work is its clear identification and formalization of the coupled resource contention problem on ICCA chips with HBM (as illustrated in Figure 2, page 2). Prior work has typically focused on subsets of this problem (e.g., compute/communication in T10 [34], or compute/I/O in traditional GPU compilers [74]). The articulation of the three-way trade-off and its mapping to concrete compiler parameters is a novel conceptual contribution that frames the problem space effectively.
-
Novel Combination and Application of Techniques: The paper introduces several techniques that, while inspired by existing principles, are applied in a novel context.
- The Preload Order Permutation (Section 4.4, page 8) is the most distinct idea. Decoupling the preloading sequence from the execution sequence for entire DL operators to specifically manage on-chip memory lifetimes and interconnect traffic "rush hours" is a new approach in this domain. Standard prefetching schemes operate at a much finer granularity.
- The Cost-Aware Memory Allocation (Section 4.3, page 6) synthesizes two known ideas—Pareto-optimal plan generation and greedy selection—into a novel heuristic for inter-operator resource partitioning. The idea of selecting plans from multiple operators' Pareto curves simultaneously to satisfy a global memory budget is a non-trivial and new formulation. A toy sketch of this selection pattern appears after this list.
-
Systematic Exploration Framework: The ELK framework itself represents a novel system for systematically navigating the identified trade-off space. The way it structures the problem into a series of nested search and optimization stages (Figure 9, page 5) is a new and coherent methodology for this specific hardware architecture.
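The snippet below is the toy sketch referenced above: each operator exposes a small Pareto curve of (memory, time) plans, and a greedy loop repeatedly applies the most "cost-effective" upgrade that still fits a shared memory budget. The operators, plan points, and the gain metric are invented; this is not ELK's actual allocation algorithm or cost model, only an illustration of the selection pattern.

```python
# Toy greedy selection from per-operator Pareto curves under a memory budget.
# Plans are (on_chip_bytes, cycles) pairs, sorted by increasing memory and
# decreasing time. All numbers and the cost-effectiveness metric are invented.

pareto = {
    "matmul": [(1_000, 900), (4_000, 500), (8_000, 420)],
    "softmax": [(500, 300), (2_000, 180)],
    "layernorm": [(400, 250), (1_500, 200), (3_000, 170)],
}
BUDGET = 9_000  # bytes of on-chip memory shared by all operators


def greedy_allocate(pareto, budget):
    choice = {op: 0 for op in pareto}          # start at the cheapest plan
    used = sum(pareto[op][0][0] for op in pareto)
    assert used <= budget, "even minimal plans do not fit"
    while True:
        best = None
        for op, idx in choice.items():
            if idx + 1 >= len(pareto[op]):
                continue
            mem0, t0 = pareto[op][idx]
            mem1, t1 = pareto[op][idx + 1]
            extra_mem, saved = mem1 - mem0, t0 - t1
            if used + extra_mem > budget or saved <= 0:
                continue
            gain = saved / extra_mem           # "cost-effectiveness" of upgrade
            if best is None or gain > best[0]:
                best = (gain, op, extra_mem)
        if best is None:
            return choice, used
        _, op, extra_mem = best
        choice[op] += 1
        used += extra_mem


if __name__ == "__main__":
    choice, used = greedy_allocate(pareto, BUDGET)
    total_cycles = sum(pareto[op][i][1] for op, i in choice.items())
    print(choice, "memory used:", used, "total cycles:", total_cycles)
```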
Weaknesses
My critique is centered on the degree of fundamental novelty of the underlying algorithmic building blocks. The paper could be strengthened by more explicitly positioning its work against broader prior art in scheduling and compiler optimization, beyond the immediate scope of DL compilers.
-
Component Techniques Are Adaptations of Prior Concepts:
- Inductive Operator Scheduling (Section 4.2, page 6): The core idea of solving a scheduling problem by working backward from the final state and making locally optimal choices is a classic dynamic programming pattern. While the authors' formalization in Theorem 4.2 is useful, the algorithmic pattern itself is not fundamentally new. The novelty is in its application, not its invention.
- Use of Pareto-Optimal Curves (Section 4.3, page 7): The use of Pareto frontiers to represent time-space trade-offs is a standard technique in compiler auto-tuning (e.g., Ansor [70]) and hardware design space exploration. The paper claims novelty in its use for inter-operator trade-offs, which is fair, but the foundational tool is not new.
-
Complexity vs. Benefit: The proposed solution is highly complex, involving a multi-stage search that includes exhaustive search over preload numbers, Pareto curve generation for every operator, a greedy heuristic for memory allocation, and a pruned search over permutations. The performance gains (e.g., 1.37x over the Static baseline) are significant but not orders of magnitude. A critical analysis is needed to ascertain if a simpler set of heuristics could have achieved a substantial fraction of these gains. The paper does not explore this, making it difficult to assess if the full novel complexity is justified. The novelty comes at the cost of a sophisticated and potentially fragile compilation pipeline.
Questions to Address In Rebuttal
-
The concept of generating multiple implementation plans for an operation and selecting one based on system-wide constraints is central to auto-tuning compilers like Ansor [70] and Halide. How does ELK's cost-aware memory allocation (Section 4.3) differ fundamentally from the search and cost-modeling strategies in these frameworks, beyond the specific context of ICCA preloading? Please clarify the key "delta" in the allocation algorithm itself.
-
The "Preload Order Permutation" (Section 4.4) can be viewed as a form of coarse-grained, software-managed, out-of-order preloading. Could the authors contrast their approach with established concepts in software prefetching and out-of-order execution, highlighting precisely what is novel about their method at the algorithmic level, rather than just the application target?
-
The entire framework relies on a series of heuristics (greedy allocation, permutation pruning based on HBM-heavy operators). How sensitive are the final results to these heuristics? For example, what is the performance impact if the cost model for inter-core transfer (Figure 12, page 7) has a 10-20% error rate, and how might that affect the greedy choices made during memory allocation? This would help clarify whether the novel framework is robust or brittle.
TAIDL: Tensor Accelerator ISA Definition Language with Auto-generation of Scalable Test Oracles
Abstract
With the increasing importance of deep learning workloads, many hardware accelerators have been proposed in both academia and industry. However, software tooling for the vast majority of them does not exist compared to the software ecosystem and ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present TAIDL, a domain-specific language for defining the instruction set architectures (ISAs) of tensor accelerators. The central idea is to leverage a high-level tensor intermediate representation, XLA-HLO, to describe the operational semantics of instructions. From a TAIDL specification, the system can automatically generate a Python-based "test oracle" which, under the hood, compiles the operations into an XLA computational graph for execution on multi-core CPUs or GPUs. The authors claim this approach improves productivity and yields test oracles that are orders of magnitude faster and more scalable than existing instruction-level functional simulators like Gemmini Spike and Intel SDE.
However, the fundamental premise of using a high-level compiler IR to define a low-level ISA is questionable. The work appears to conflate a high-level functional model with a precise, bit-accurate ISA specification. Furthermore, the empirical evaluation, while showing dramatic speedups, rests on an inequitable comparison between a JIT-compiled, parallelized tensor program and traditional serial instruction interpreters, making the performance claims misleading.
Strengths
- Problem Motivation: The paper correctly identifies a critical gap in the academic hardware accelerator community: the lack of robust, well-defined hardware-software interfaces and accessible correctness-testing tools (Section 1.1, Table 1). The motivation is clear and compelling.
- Productivity Goal: The goal of automating the generation of functional simulators ("test oracles") is a laudable one. Reducing the significant, repetitive engineering effort required to build such tools for each new accelerator design is a valuable research direction.
- Open Source: The authors have made their implementation publicly available, which is a positive contribution to the community, enabling reproducibility and further investigation.
Weaknesses
-
Fundamental Mischaracterization of an "ISA Definition Language": The core weakness of this paper is its premise. An ISA is a low-level contract specifying precise, bit-level behavior. TAIDL, by using XLA-HLO as its computational primitive, is not defining an ISA in the rigorous sense that languages like Sail [52] do. It is defining a high-level functional behavior. This abstraction is problematic:
- It cannot naturally express novel data types, non-standard floating-point formats, or specific rounding/saturation behaviors not already supported by XLA. The paper's proposed workaround—using custom_call to an external C function (Section 5.5)—is an admission of the language's limitation and defeats the purpose of a self-contained specification.
- Subtle side effects or interactions between instructions that are not representable as a clean dataflow graph of tensor operations would be difficult, if not impossible, to model. TAIDL appears best suited for accelerators whose ISAs map cleanly to an existing compiler IR, which questions its utility for defining genuinely novel architectures.
-
Misleading Performance Evaluation: The scalability analysis in Section 7 is fundamentally flawed due to an apples-to-oranges comparison.
- The TAIDL-TO "oracle" is not an instruction-level simulator. It is a JIT compiler that transforms a sequence of high-level API calls into an optimized XLA graph, which is then executed by a highly-optimized, parallel backend (e.g., on an A100 GPU).
- In contrast, Gemmini Spike and Intel SDE are true instruction-level simulators/emulators that process one instruction at a time, often in a single-threaded manner.
- The "orders of magnitude" speedup reported in Figures 19 and 20 is therefore not a measure of a better simulation technique, but rather a demonstration that running a compiled, parallelized tensor program is faster than interpreting a sequence of instructions serially. This outcome is expected and does not validate the claims about TAIDL's superiority as a simulation methodology. A fair comparison would be against a hand-optimized C++ functional model of the accelerator, compiled with aggressive optimizations.
-
The "Oracle" is Not a Golden Reference: A test oracle, by definition, should serve as an independent, trusted source of truth. The TAIDL-TO artifact is generated and executed via the complex XLA compiler toolchain. This introduces a significant risk of common-mode failures. A bug or semantic interpretation within the XLA compiler could manifest in both the code being tested (e.g., a compiler stack targeting the accelerator) and the TAIDL-generated oracle, thereby masking the bug entirely. This approach lacks the semantic independence required for a trustworthy verification tool.
-
Overstated Expressivity: The examples provided (AMX, TPUv1, Gemmini) are for accelerators with relatively well-structured, data-parallel semantics that map nicely to XLA-HLO operators. The paper does not provide convincing evidence that TAIDL could handle ISAs with more irregular control flow, complex state management, or fine-grained bit-manipulation instructions that are common in hardware but do not have a clean high-level tensor abstraction. The inclusion of IF and REPEAT blocks (Section 4.5) operating on control registers is a minimal step and does not address complex, data-dependent control flow at the instruction level.
Questions to Address In Rebuttal
- Please defend the characterization of TAIDL as an ISA Definition Language rather than a High-Level Functional Modeling Framework. How would TAIDL precisely specify the semantics of an instruction that implements a novel 8-bit floating-point format with a non-standard rounding mode, without resorting to an external custom_call?
- How do you justify the performance comparison in Section 7? Acknowledge that you are comparing a JIT-compiled, parallel XLA graph against serial instruction interpreters. What insights do these results provide beyond the trivial conclusion that compiled, parallel code runs faster than interpreted, serial code?
- The term "test oracle" implies a golden reference. Given that your oracle is dependent on the large and complex XLA compiler stack, how do you mitigate the risk of common-mode failures where bugs in the underlying XLA implementation could mask bugs in the software being tested?
- The transform algorithm in Figure 16 appears to perform constant propagation on control registers (state) and unroll loops. What happens if an instruction's behavior depends on a value in a tensor buffer (not a control register)? Does TAIDL support this, and if so, how does the transformation to a static XLA graph handle such data-dependent control flow?
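For illustration of what the last question is probing, here is a minimal, hypothetical sketch of unrolling a REPEAT block whose trip count is held in a control register with a statically known value. The instruction encoding, register names, and transform code are invented and are not TAIDL's transform algorithm; the sketch only shows why a trip count living in a tensor buffer, whose value is unknown until simulation time, could not be flattened the same way.

```python
# Hypothetical sketch of unrolling a REPEAT block whose trip count lives in a
# control register with a statically known value. This is an illustration of
# the idea raised in the question above, not TAIDL's transform algorithm.

control_regs = {"reg_iters": 3}  # known constant at transform time

program = [
    ("load_tile", "buf0"),
    ("repeat", "reg_iters", [("macc", "buf0"), ("shift", "buf0")]),
    ("store_tile", "buf0"),
]


def unroll(program, regs):
    """Constant-propagate control registers and flatten REPEAT bodies."""
    flat = []
    for instr in program:
        if instr[0] == "repeat":
            _, reg, body = instr
            trip_count = regs[reg]      # resolvable: value is in a register
            for _ in range(trip_count):
                flat.extend(unroll(body, regs))
        else:
            flat.append(instr)
    return flat


if __name__ == "__main__":
    for op in unroll(program, control_regs):
        print(op)
    # If the trip count instead lived in a tensor buffer, its value would only
    # be known at simulation time, so the program could not be fully flattened
    # ahead of time; that is exactly what the question above probes.
```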
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces TAIDL, a domain-specific language for defining the instruction set architectures (ISAs) of tensor accelerators. The authors identify a critical and widening gap in the hardware-software ecosystem: while numerous novel accelerators are proposed, they almost universally lack the software tooling (specifically, well-defined ISA semantics and fast functional simulators) necessary for the software community to build compilers and applications for them.
The core contribution is twofold: 1) The TAIDL language itself, which uniquely leverages a high-level tensor IR, XLA-HLO, as its semantic foundation for describing instruction behavior. 2) A novel methodology to automatically generate fast, scalable "test oracles" (functional simulators) from TAIDL definitions. This is achieved by transforming a sequence of TAIDL instructions into a single XLA-HLO computation graph, which can then be compiled by mature tensor compilers (like XLA) to run efficiently on multi-core CPUs or GPUs.
The authors demonstrate TAIDL's expressivity by modeling diverse, real-world accelerators (Google TPU, Intel AMX, Gemmini). Crucially, they show that the auto-generated test oracles are orders of magnitude faster than established, hand-crafted simulators like Gemmini Spike and Intel SDE, making them practical for large-scale software testing and even end-to-end model simulation.
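To convey the shape of the oracle-generation idea described above, here is a minimal, hypothetical sketch in plain Python/NumPy: each "instruction" is a pure function over architectural state expressed with tensor operations, and an instruction trace folds into a single composite computation. The instruction names, state layout, and composition API are invented; the real flow lowers to XLA-HLO and is compiled by XLA rather than executed with NumPy.

```python
# Hypothetical sketch of "instruction semantics as tensor functions".
# Instruction names, shapes, and the composition API are invented; the real
# TAIDL flow lowers to XLA-HLO and compiles via XLA rather than using NumPy.
import numpy as np


def mat_mul_acc(state, a, b):
    """Accumulate a tile matmul into an accumulator register file."""
    state = dict(state)
    state["acc"] = state["acc"] + a @ b
    return state


def relu_store(state):
    """Apply an activation and move the result to an output buffer."""
    state = dict(state)
    state["out"] = np.maximum(state["acc"], 0.0)
    return state


def run_program(instructions, state):
    """Fold a whole instruction trace into one composite tensor computation.
    A tensor compiler could trace/JIT this composite instead of interpreting
    the instructions one at a time."""
    for instr, args in instructions:
        state = instr(state, *args)
    return state


if __name__ == "__main__":
    a = np.ones((4, 4), dtype=np.float32)
    b = -np.eye(4, dtype=np.float32)
    init = {"acc": np.zeros((4, 4), dtype=np.float32), "out": None}
    trace = [(mat_mul_acc, (a, b)), (mat_mul_acc, (a, b)), (relu_store, ())]
    final = run_program(trace, init)
    print(final["out"])
```

The performance argument in the reviews follows from this structure: once the whole trace is one tensor program, a mature compiler can fuse and parallelize it instead of interpreting instructions one at a time.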
Strengths
The primary strength of this work lies in its elegant and highly effective synthesis of ideas from different domains to solve a pressing, real-world problem.
-
Problem Significance and Framing: The paper correctly identifies a major bottleneck in hardware innovation. The "disconnect between the software and hardware research" (Section 1.1, Page 1) is a well-known but undertreated problem. The authors' analysis in Table 1 (Page 2) clearly motivates the need for a solution that provides both programmability (via a clear ISA) and testability (via fast oracles). This work is not a solution in search of a problem; it is a direct and thoughtful response to a critical community need.
-
The Core Insight: Semantics-as-IR: The most profound contribution is the decision to use a high-level, functional tensor IR (XLA-HLO) to define instruction semantics. Traditional ISA specification languages (e.g., Sail, as mentioned in Section 9) operate at a scalar or bit-vector level. By elevating the semantic definition to the level of tensor operations, the authors unlock the ability to leverage the entire, highly-optimized ML compiler ecosystem for simulation. This is a paradigm shift from writing simulators to compiling them, and it is the key insight that enables the impressive performance results.
-
Synergistic and Practical Toolchain: The proposed workflow (visualized well in Figure 14, Page 9) is exceptionally clever. Instead of building a new simulation engine from scratch, the authors' transformation algorithm effectively retargets the simulation task to the XLA compiler. This allows the generated oracles to automatically benefit from decades of compiler research in optimization (fusion, layout changes) and parallelization (multi-threading, GPU offload). This synergy makes the approach both powerful and practical.
-
Compelling Empirical Validation: The performance evaluation in Section 7 is not just incremental; it demonstrates a transformative improvement. The speedups of 1200x to 5600x over Gemmini Spike (Figure 19, Page 11) and significant gains over the industrial-grade Intel SDE (Figure 20, Page 12) are dramatic. These results elevate the tool from a theoretical concept to something genuinely usable for interactive development cycles and testing large, complex kernels, as shown in the I-BERT case study (Section 8.2).
Weaknesses
The weaknesses of the paper are less about flaws in the existing work and more about the inherent limitations of the chosen approach and unaddressed future challenges.
-
The XLA Abstraction Leash: The paper's greatest strength is also its most significant potential limitation. By tying TAIDL's semantics to XLA-HLO, the language is fundamentally constrained by what can be cleanly expressed in XLA-HLO. While the authors discuss forward-compatibility for custom datatypes (Section 5.5), it is less clear how TAIDL would handle accelerator features that are philosophically misaligned with the XLA model. Examples could include complex, low-level memory dependencies, explicit cache management instructions, or novel synchronization primitives that don't map to standard tensor operations. The expressivity might break down for ISAs that are not "HLO-like."
-
Scope of a "Test Oracle": The work focuses exclusively on the functional simulation of the data path. A complete software stack also needs to interact with the accelerator's control plane (e.g., command submission, interrupt handling, synchronization with the host). While TAIDL models some state with "control registers," it does not seem equipped to describe the dynamic, asynchronous interactions between the accelerator and a host system. This limits its utility for developing drivers or runtimes, which are also critical parts of the software ecosystem.
-
The Authoring Effort is Opaque: The paper excellently demonstrates the benefits for the user of a test oracle (the software programmer). However, it does not quantify the effort for the creator of the TAIDL specification (the hardware architect). Translating the intricate microarchitectural behavior of a new accelerator into a sequence of pure XLA-HLO operators may be a significant conceptual and engineering challenge in itself. The paper would be stronger if it acknowledged and discussed this potential shift in the "tooling burden" from writing a C++ simulator to writing a complex TAIDL-HLO specification.
Questions to Address In Rebuttal
-
Could the authors elaborate on the "escape hatches" for accelerator features that are difficult to model in XLA-HLO? For instance, how would TAIDL model an instruction that performs a scatter operation with data-dependent addressing, or a hardware feature that exposes explicit control over a software-managed cache? Is the expectation that these would be modeled via
custom_call (Section 5.5), and what are the performance implications of that for the generated oracle? -
The current focus is on functional correctness of kernel execution. What is the team's vision for extending the TAIDL framework to model the broader system context? Specifically, how might you model the host-accelerator interface, command queues, and memory coherence, which are essential for testing the correctness of the runtime and driver software, not just the compiled kernels?
-
While the benefits of having a TAIDL specification are clear, what is the anticipated learning curve and effort for a hardware architect to write a specification for a novel, complex accelerator? Is there a risk that precisely defining complex hardware behavior using only the constrained vocabulary of HLO operators is as difficult as writing a traditional simulator?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper introduces TAIDL, a domain-specific language for defining the Instruction Set Architectures (ISAs) of tensor accelerators. The authors' central novel claim is not the language itself, but the methodology for automatically generating fast and scalable test oracles (functional simulators) from TAIDL definitions. This is achieved by a unique approach: expressing instruction semantics using a high-level tensor Intermediate Representation (XLA-HLO) and then leveraging a production-grade tensor compiler (XLA) to generate highly optimized, parallel simulator code that can execute on multi-core CPUs and GPUs. The authors demonstrate this by modeling ISAs like Intel AMX and Gemmini, and show that their generated oracles significantly outperform existing, hand-crafted instruction-level simulators.
Strengths
The primary strength of this paper lies in the novel synthesis of two established fields: ISA specification and tensor compilation. The core innovative idea can be summarized as "semantic piggybacking" on a mature compiler ecosystem. To my knowledge, this is a new approach for generating functional simulators.
-
Novel Compilation Methodology: Prior work on ISA specification languages, such as Sail [8, 52] or ILA [56], primarily focuses on formal correctness and typically generates C code or theorem prover inputs for emulation and verification. These generated artifacts are often serial and not designed for high-performance simulation of large workloads. This paper's core insight—to define semantics in an IR that is the input to an optimizing compiler rather than a low-level language that is the output of a simple transpiler—is the key novelty. This choice directly enables the generation of highly scalable oracles, a claim well-supported by the performance results in Section 7 (pages 11-12).
-
Pragmatic Abstraction Choice: The decision to use XLA-HLO as the semantic foundation is a clever and domain-appropriate one. Tensor accelerators are, by definition, designed to execute operations that map well to a tensor IR. By staying at this high level of abstraction, the authors avoid the notoriously difficult problem of compiling low-level, bit-precise semantic definitions into efficient, parallel code. They effectively offload this complexity to the XLA compiler team at Google, which is a significant and pragmatic engineering choice that enables their results.
-
Significant Delta Over Prior Art: The "delta" between this work and the closest prior art (e.g., generating C emulators from Sail) is the performance and scalability of the resulting artifact. The orders-of-magnitude speedup shown in Figures 19 and 20 is not a marginal improvement; it represents a qualitative shift in what is practical for pre-silicon software testing, enabling full end-to-end model simulation (Section 8.2, page 13) where it was previously infeasible.
Weaknesses
My critique is focused on the boundaries and generalizability of the claimed novelty. While the core idea is new and effective, its scope may be narrower than implied.
-
Limited Generality of the Novel Approach: The central novelty is critically dependent on the semantics of the target ISA being easily expressible in a tensor IR. This works exceptionally well for tensor accelerators whose instructions are coarse-grained operations like matrix multiplies or convolutions. However, this approach would likely fail or be exceedingly cumbersome for general-purpose ISAs or even accelerators with fine-grained, scalar control logic or complex bit-level manipulation instructions (e.g., cryptography or networking accelerators). The paper acknowledges backward compatibility with scalar and bit-vector representations (Section 5.4, page 8), but using a tensor compiler to simulate complex bit-fiddling seems highly inefficient and misaligned. The novelty, therefore, seems confined to a specific, albeit important, class of architectures.
-
The DSL Contribution is Secondary: The paper presents TAIDL as a new language. However, based on the examples provided (e.g., Figure 2b, page 3), the language itself appears to be largely syntactic sugar—a thin, user-friendly wrapper—around XLA-HLO operations. The fundamental innovation lies in the compilation pipeline, not in the language constructs of TAIDL itself. The paper does not sufficiently argue for the novelty of the language design independent of its compilation target. An alternative approach of a Python library for programmatically constructing XLA-HLO graphs could have achieved similar results, questioning the necessity of a new DSL.
Questions to Address In Rebuttal
-
The reliance on XLA-HLO seems to tightly couple the approach to accelerators whose semantics are naturally expressed as coarse-grained tensor operations. How would the authors' framework handle instructions with complex, non-tensor semantics, such as intricate bit-level manipulations (e.g., a Galois Field multiply instruction) or stateful control flow logic not easily captured by XLA's
select or while operators? Is the claimed novelty fundamentally restricted to the domain of DNN accelerators? -
The TAIDL language appears to be a direct mapping to XLA-HLO constructs. What is the fundamental novel contribution of the language design itself, separate from the novel compilation methodology? Could the same result have been achieved by providing a Python library that directly constructs XLA-HLO graphs, and what is the distinct advantage of introducing a new DSL that justifies its novelty?
-
The framework's novelty and performance are tied to the capabilities of the XLA ecosystem. What happens if future accelerators introduce computational paradigms (e.g., sparsity patterns, dynamic data structures) that are not well-supported or efficiently compiled by XLA-HLO? Does this dependency represent a long-term limitation to the approach's novelty and applicability?
OmniSim: Simulating Hardware with C Speed and RTL Accuracy for High-Level Synthesis Designs
Abstract
High-Level Synthesis (HLS) is increasingly popular for hardware design using C/C++ instead of Register-Transfer Level (RTL). To express concurrent hardware behavior in a sequential language like C/C++, HLS tools introduce constructs such as infinite loops ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present OmniSim, a C-level High-Level Synthesis (HLS) simulation framework. It purports to achieve RTL-level accuracy at near-C speeds for a class of "complex dataflow designs" (termed Type B and C) that are purportedly unsupported by existing commercial and academic tools. The central mechanism involves a multi-threaded simulation model where individual dataflow modules are run in separate "Func Sim" threads. Their execution is orchestrated by a centralized "Perf Sim" thread that resolves hardware timing dependencies (specifically, non-blocking FIFO accesses) by consulting shared FIFO timing tables. The authors evaluate this framework on a set of self-developed benchmarks and claim significant speedups over C/RTL co-simulation and the prior state-of-the-art, LightningSim.
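To make the orchestration pattern described in this summary concrete, the sketch below shows worker threads that ask a central thread to resolve whether a non-blocking FIFO access succeeds at a given cycle. The request protocol, timing table, and module behavior are invented for illustration; this is not OmniSim's implementation.

```python
# Schematic sketch of the "worker threads + central timing resolver" pattern
# described above. The request protocol, timing table, and module behavior are
# invented for illustration; this is not OmniSim's implementation.
import queue
import threading

# Cycle at which each FIFO first has data available (invented timing table).
fifo_ready_cycle = {"fifo_a": 5, "fifo_b": 12}

requests = queue.Queue()          # Func Sim threads -> Perf Sim thread


def func_sim(module_name, fifo, issue_cycle):
    """A dataflow module asking whether its non-blocking read succeeds."""
    reply = queue.Queue(maxsize=1)
    requests.put((module_name, fifo, issue_cycle, reply))
    success = reply.get()         # wait for the central thread's verdict
    print(f"{module_name}: non-blocking read of {fifo} at cycle "
          f"{issue_cycle} -> {'data' if success else 'empty'}")


def perf_sim(num_requests):
    """Central thread: resolves each request against the FIFO timing table."""
    for _ in range(num_requests):
        module, fifo, cycle, reply = requests.get()
        reply.put(cycle >= fifo_ready_cycle[fifo])


if __name__ == "__main__":
    workers = [
        threading.Thread(target=func_sim, args=("moduleA", "fifo_a", 3)),
        threading.Thread(target=func_sim, args=("moduleB", "fifo_a", 7)),
        threading.Thread(target=func_sim, args=("moduleC", "fifo_b", 9)),
    ]
    central = threading.Thread(target=perf_sim, args=(len(workers),))
    central.start()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    central.join()
```

The sketch also makes the serialization concern raised in the weaknesses below visible: every query from every worker passes through the single resolver loop.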
Strengths
-
Problem Taxonomy: The paper's primary strength is the proposed taxonomy of dataflow designs into Type A, B, and C based on module dependency, FIFO access type, and program behavior (Sec. 3, p. 4). This classification provides a useful conceptual framework for discussing the limitations of existing HLS simulation methodologies.
-
Problem Identification: The authors correctly identify a significant and well-known limitation in existing HLS C-level simulation flows, namely their inability to correctly model the functional and performance implications of non-blocking I/O and cyclic dependencies. Commercial tool manuals, as cited by the authors (p. 2), explicitly warn against this.
-
Core Insight: The fundamental insight that functionality and performance simulations are inseparable for Type C designs is sound. The recognition that the functional outcome of a non-blocking access depends on precise cycle timing is a correct diagnosis of the problem.
Weaknesses
-
Evaluation on Self-Curated Benchmarks: The most significant methodological flaw lies in the evaluation of Type B and C designs (Sec. 8.1, p. 10). The authors evaluate their tool on a benchmark suite of their own creation, designed specifically to exhibit the features that OmniSim is built to handle. This introduces a high risk of confirmation bias. The benchmarks, such as
fig4_ex2, fig4_ex3, etc., appear to be small, synthetic kernels. Without evaluation on large-scale, pre-existing, and independently developed hardware designs (e.g., from open-source networking, video processing, or accelerator projects), the claims of general applicability are unsubstantiated. The work fails to demonstrate that its solution is robust beyond the tailored test cases. -
Insufficient Proof of "RTL Accuracy": The paper's core claim of "RTL accuracy" relies on C/RTL co-simulation as the ground truth "oracle". For Type C designs, where behavior is cycle-dependent, this is problematic. A simple match of final output values and total cycle counts (as shown in Table 3 and Fig. 8a, p. 11) is insufficient proof of true cycle-accurate behavioral equivalence. Minor discrepancies in OmniSim's timing model could lead to a different, yet functionally valid, execution path that coincidentally produces the same final result. The paper provides no evidence, such as cycle-by-cycle event trace comparisons, to prove that OmniSim's internal state and module interactions precisely mirror the RTL simulation throughout the entire execution. A sketch of the kind of trace comparison being asked for appears after this list.
-
Unexamined Scalability of the Orchestration Mechanism: The proposed multi-thread orchestration, with a single, central Perf Sim thread processing a global request queue (Fig. 7, p. 8), is a potential architectural bottleneck. As the number of concurrent dataflow modules (and thus Func Sim threads) and the frequency of their communication increase, this single thread is likely to become a serialization point, negating the benefits of parallel simulation. The paper presents a "multicore" benchmark with 34 modules, but provides no sensitivity analysis of how simulation performance degrades as module count or communication density increases. The scalability of the core mechanism is asserted but not proven.
-
Ambiguous Source of Speedup Over LightningSim: The authors claim a 1.26x geomean speedup over LightningSimV2 on its own benchmark suite, which consists of Type A designs (Table 5, p. 12). For these designs, the complex orchestration mechanism of OmniSim should, in theory, represent pure overhead compared to LightningSim's simpler, decoupled approach. The paper attributes the speedup to its "multithreaded architecture" but fails to provide a convincing rationale. It is more plausible that the performance gains stem from unrelated implementation-level optimizations (e.g., the improved graph representation mentioned in Sec. 7.3.1, p. 10) rather than a fundamental architectural advantage for this class of designs. This confounds the evaluation of the paper's core contribution with general implementation quality.
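To illustrate the serialization concern raised in the third weakness, a back-of-the-envelope model (the reviewer's own sketch with made-up costs, not derived from the paper's measurements) shows how a single orchestrator thread caps achievable speedup:

    # Back-of-envelope serialization model (illustration only): a single
    # orchestrator thread caps the speedup of the parallel simulation.
    def projected_speedup(module_work_s, events, per_event_s, threads):
        serial = events * per_event_s               # orchestrated work, serialized
        sequential = module_work_s + serial         # one-thread simulation time
        parallel = module_work_s / threads + serial # idealized multi-thread time
        return sequential / parallel

    # Example: 100 s of module work, 1e6 non-blocking queries at 20 us each.
    for n in (2, 8, 32, 128):
        print(n, round(projected_speedup(100.0, 1_000_000, 20e-6, n), 2))
    # Speedup flattens near (100 + 20) / 20 = 6x regardless of thread count.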
Questions to Address In Rebuttal
- Benchmark Representativeness: Can the authors provide evidence that the custom Type B and C benchmarks are representative of real-world, complex HLS designs? To substantiate the claims of general utility, can the authors demonstrate OmniSim's correctness and performance on at least one large, publicly available, complex dataflow HLS project that was not developed by the authors?
- Verification of Accuracy: Beyond matching final cycle counts and output values, what specific steps were taken to verify that the sequence of all FIFO events and state transitions in OmniSim perfectly matches the C/RTL co-simulation on a cycle-by-cycle basis for a complex Type C design like fig4_ex5 or branch? Please provide data (e.g., from a trace diff) to support this.
- Orchestration Scalability: Please provide data showing how OmniSim's simulation time scales as a function of (a) the number of concurrent modules and (b) the rate of non-blocking FIFO accesses. At what point does the central Perf Sim thread become a bottleneck, and what is the performance characteristic of this bottleneck?
- Incremental Simulation Efficacy: The incremental simulation capability (Sec. 7.2, p. 10) is a key feature. In a realistic design space exploration (DSE) scenario sweeping through dozens of FIFO size configurations for a complex design, what percentage of incremental runs succeed versus fail due to a "constraint violation" that forces a full re-simulation? An assertion of capability is insufficient; data on its practical effectiveness is required.
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces OmniSim, a simulation framework for High-Level Synthesis (HLS) that aims to provide the cycle-accuracy of RTL simulation at speeds approaching native C execution. The authors identify a critical gap in existing HLS flows: the inability to correctly and efficiently simulate complex "dataflow" designs at the C-level, particularly those involving non-blocking FIFO accesses, cyclic dependencies, and data-dependent control flow (which they classify as Type B and C designs).
The core contribution is a novel simulation methodology that tightly couples functionality and performance simulation. It employs a multi-threaded architecture where dedicated "Func Sim" threads simulate individual hardware modules, and a central "Perf Sim" thread orchestrates their execution. This orchestration is key; it maintains hardware-accurate timing information in shared "FIFO tables," allowing functional threads to query the exact state of the system at a specific hardware cycle, thereby resolving the ambiguity that plagues traditional C simulation and OS-based thread scheduling. The authors demonstrate that OmniSim can successfully simulate a suite of designs that are explicitly unsupported by existing commercial and academic tools, achieving significant speedups over the traditional C/RTL co-simulation flow and even outperforming the state-of-the-art LightningSim on its own benchmarks.
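The query/resolve pattern described above can be illustrated with a conceptual Python sketch (the reviewer's own illustration, not the authors' implementation; names and the timing table are hypothetical): module threads submit non-blocking FIFO queries to a central orchestrator, which answers them against a modeled hardware timeline rather than the host OS's thread schedule.

    # Conceptual sketch of the query/resolve orchestration (illustrative only).
    import queue
    import threading

    requests = queue.Queue()
    fifo_write_cycle = {"ch0": 5}   # modeled cycle at which data lands in ch0

    def perf_sim():
        # Central orchestrator: serially resolves each query against the
        # shared FIFO timing table, so answers depend only on modeled cycles.
        while True:
            req = requests.get()
            if req is None:
                break
            fifo, cycle, reply = req
            reply.put(cycle >= fifo_write_cycle.get(fifo, float("inf")))

    def func_sim(name, fifo, cycle, results):
        # Module thread: issues a non-blocking read "at" a modeled cycle and
        # blocks until the orchestrator reports whether data was present.
        reply = queue.Queue()
        requests.put((fifo, cycle, reply))
        results[name] = reply.get()

    results = {}
    orchestrator = threading.Thread(target=perf_sim)
    orchestrator.start()
    workers = [threading.Thread(target=func_sim, args=(f"m{c}", "ch0", c, results))
               for c in (3, 6)]
    for w in workers: w.start()
    for w in workers: w.join()
    requests.put(None)
    orchestrator.join()
    print(results)   # e.g. {'m3': False, 'm6': True}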
Strengths
- Excellent Problem Formulation and Taxonomy: The paper’s most significant intellectual contribution, beyond the tool itself, is the clear taxonomy of dataflow designs into Type A, B, and C (Section 3, pages 3-5). This classification provides a principled and much-needed framework for understanding why C-level simulation is so challenging. It brilliantly articulates the progression from concurrency-independence (Type A) to concurrency-dependence (Type B) and finally to full cycle-dependence for functionality (Type C). This framing contextualizes the entire field of HLS simulation and makes the need for a solution like OmniSim self-evident.
- Elegant and Novel Simulation Architecture: The proposed solution—a centrally orchestrated, multi-threaded simulation model—is an elegant answer to the challenges laid out in the taxonomy. Instead of treating functionality and performance as decoupled phases (as in prior work like LightningSim), OmniSim recognizes their inherent coupling in complex designs. The concept of a Perf Sim thread serving as a "source of truth" for hardware timing, which Func Sim threads can query on demand (Figures 6 and 7, pages 7-8), is a powerful abstraction that resolves the fundamental mismatch between software scheduling and hardware timing.
- Compelling and Direct Evaluation: The evaluation is highly effective because it directly targets the claimed contribution. By developing a benchmark suite of previously "unsimulatable" designs (Table 4, page 11) and showing that OmniSim produces correct results where standard C-sim fails or crashes (Table 3, page 11), the authors provide irrefutable evidence of their system's extended capability. Furthermore, demonstrating a significant performance advantage over LightningSim on its own established benchmarks (Table 5, page 12) is a masterstroke; it proves that OmniSim's more general and powerful architecture does not come at the cost of performance for simpler designs—in fact, it enhances it.
- Practical, Real-World Relevance: The work addresses a well-known and painful bottleneck for HLS practitioners. The productivity gains promised by HLS are often squandered in slow, cumbersome RTL verification loops. By enabling rapid and accurate design space exploration before RTL generation for a wider class of designs, OmniSim has the potential to fundamentally improve the HLS workflow. Features like deadlock detection (Section 7.1, page 10) and efficient incremental simulation (Section 7.2, page 10) further underscore the authors' focus on practical utility.
Weaknesses
- Benchmark Representativeness: The authors rightly note that finding real-world, open-source Type B and C designs is difficult precisely because poor tool support discourages their creation. While the handcrafted benchmarks are perfectly suited to demonstrate the mechanism and prove correctness, the paper would be strengthened by a discussion of how these patterns manifest in larger, more systemic applications. The multicore example is a good start, but the work's impact will ultimately be judged by its ability to handle, for instance, a complete network-on-chip with adaptive routing or a processor with a non-trivial out-of-order execution core.
- Scalability of the Centralized Orchestrator: The Perf Sim thread acts as a centralized bottleneck by design, processing requests from a single queue to ensure correctness. While this is clearly effective for the designs presented, it raises questions about scalability. In a hypothetical future design with hundreds or thousands of concurrently active dataflow modules, could this centralized Perf Sim thread become the limiting factor for simulation performance? The paper does not explore the performance limits of this architectural choice.
- Positioning Relative to Discrete-Event Simulation: The underlying mechanism of OmniSim bears a strong resemblance to established principles of Parallel Discrete-Event Simulation (PDES). The Perf Sim thread acts as a global event queue manager, and the Func Sim threads are logical processes. While the application to HLS C-level simulation is novel and highly valuable, framing the work within this broader simulation literature could provide deeper theoretical grounding and potentially suggest further optimizations (e.g., optimistic execution).
Questions to Address In Rebuttal
- Regarding the scalability of the Perf Sim thread: Have the authors characterized the overhead of the query/resolve mechanism? At what number of concurrent modules or frequency of non-blocking accesses do you project the centralized orchestration becoming a performance bottleneck, and are there potential paths to parallelizing the Perf Sim thread's logic in the future?
- The ability to simulate Type C designs opens the door to using HLS for more dynamic hardware architectures. Could the authors elaborate on a specific, large-scale application (e.g., a cache coherence protocol, a network router) that is currently infeasible to design in HLS due to verification challenges, and walk through how OmniSim would specifically enable its development?
- The quality of debugging is paramount for productivity. When OmniSim detects a functional error or a design deadlock, what kind of feedback and state information does it provide to the designer? How does this compare to the (often cryptic) experience of debugging a hung C/RTL co-simulation?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present OmniSim, a framework for High-Level Synthesis (HLS) simulation that aims to provide RTL-level accuracy at near-C speeds, specifically for complex dataflow designs involving non-blocking (NB) FIFO accesses and cyclic dependencies. The central novelty claim lies in its simulation architecture: a tightly-coupled, multi-threaded model where functionality threads (representing HLS modules) are orchestrated by a dedicated performance simulation thread that maintains cycle-accurate hardware state via a set of "FIFO tables." This architecture is designed to correctly resolve the functional behavior of designs where correctness is dependent on the precise cycle of an operation (e.g., an NB FIFO access), a problem explicitly identified as a limitation in prior state-of-the-art simulators like LightningSim and commercial HLS tools.
My analysis concludes that this architectural approach, while having conceptual parallels to classic discrete-event simulation, is a novel and significant contribution in the specific context of HLS C-level simulation. The key innovation is the mechanism for on-demand, cycle-accurate state querying from concurrent software threads to correctly model hardware concurrency, which has not been demonstrated in prior HLS simulation literature.
Strengths
- Core Architectural Novelty: The primary strength of this paper is the novelty of the core simulation engine. The state-of-the-art, exemplified by LightningSim [19, 20], relies on a fully decoupled, two-phase approach (trace generation then performance analysis). This is fundamentally incapable of simulating designs where functionality depends on runtime performance (i.e., cycle timing). OmniSim's proposal to "flexibly couple" these phases via a dedicated orchestrator thread (the "Perf Sim thread," detailed in Sections 5 and 6) is a genuinely new architecture in this space. It moves beyond post-hoc trace analysis to a live, interactive simulation model.
- Novel Problem Formulation: The taxonomy of dataflow designs into Type A, B, and C (Section 3, Figure 4) is a valuable and novel contribution in its own right. It provides a clear, systematic framework for understanding why certain HLS designs are hard to simulate at the C-level. To my knowledge, such a formal classification and its direct mapping to simulation requirements (concurrency- and cycle-dependence) has not been previously published. This taxonomy effectively carves out the design space where OmniSim's novelty is required.
- Mechanism for Resolving Cycle-Dependent Functionality: The specific mechanism of using a centralized Perf Sim thread to maintain FIFO R/W Tables and resolve queries from Func Sim threads (Figure 7) is the concrete implementation of the architectural novelty. This is not merely an application of multi-threading; it is a sophisticated orchestration that correctly serializes access to shared state (FIFO status) based on a modeled hardware timeline, not the host OS's arbitrary thread scheduling. This solution directly addresses the core challenge illustrated in Figure 2.
Weaknesses
My concerns are not with a lack of novelty, but rather with the paper's positioning of its novel ideas relative to broader, established concepts in computer science.
- Absence of Context within Classical Simulation Paradigms: The paper does not explicitly position its multi-threaded orchestration mechanism within the broader context of classical simulation paradigms, such as Discrete-Event Simulation (DES). The proposed Perf Sim thread acts as a central event scheduler, processing a queue of requests and advancing a logical clock (cycle count), which is a core concept in DES. While the application to HLS is novel and the specific implementation with live software threads is unique, the lack of discussion on these conceptual underpinnings is a missed opportunity. Situating OmniSim within this established theoretical framework would strengthen the paper's contribution by highlighting how it specializes or adapts these general principles for the HLS domain.
- Scalability of the Centralized Orchestrator: The reliance on a single, centralized Perf Sim thread to resolve all queries raises questions about the scalability of this novel approach. For designs with hundreds or thousands of concurrent HLS modules (e.g., large-scale network-on-chip simulations), this single thread could become a significant performance bottleneck, as it must serially process requests from all other threads. While the approach is novel, its practical limits are not explored. The novelty of the solution introduces a new potential performance characteristic that warrants analysis.
- Incremental Simulation Novelty Overstated: The paper claims "incremental simulation" as a feature (Section 7.2). While the technique of memoizing query outcomes ("constraints") and re-validating them is clever, its novelty is a delta on top of the incremental simulation capability already present in LightningSim. LightningSim's ability to reuse a simulation graph for new FIFO sizes is the foundational idea. OmniSim's contribution is to make this work for its more complex, coupled simulation model. This is an important engineering extension, but it is not as fundamentally novel as the core simulation architecture.
Questions to Address In Rebuttal
- Please discuss the relationship between OmniSim's orchestration model and established paradigms like Discrete-Event Simulation (DES). How does your contribution differ from or specialize these general concepts for the HLS domain?
- The Perf Sim thread appears to be a serialization point in your novel architecture. Could you provide an analysis or data on its potential to become a performance bottleneck for designs with a very large number of concurrent modules and high-frequency non-blocking accesses?
- The concept of "FIFO tables" (Section 6.2, structure D in Figure 7) is central to your implementation. How novel is this specific data structure for tracking hardware state in a software simulation? Are there analogous structures in other concurrent system simulators (e.g., in architectural simulation or parallel system modeling) that you can compare it against to better highlight the novelty of your specific formulation?
LEGOSim: A Unified Parallel Simulation Framework for Multi-chiplet Heterogeneous Integration
Abstract
The rise of multi-chiplet integration challenges existing simulators like gem5 [55] and GPGPU-Sim [45] for efficiently simulating heterogeneous multi-chiplet systems, due to their inability to modularly integrate heterogeneous chiplets and high ...
Reviews
Review 1
Review Form:
Reviewer: The Guardian (Adversarial Skeptic)
Summary
This paper introduces LEGOSim, a parallel simulation framework for heterogeneous multi-chiplet systems. The authors identify two key challenges: the difficulty of modularly integrating diverse simulators and the performance overhead of existing synchronization schemes. To address this, they propose a framework built on three core ideas: 1) a Unified Integration Interface (UII) for integrating existing simulators ("simlets") with supposedly "minimal modifications"; 2) an "on-demand" synchronization protocol managed by a central Global Manager (GM) to reduce overhead; and 3) a three-stage decoupled simulation methodology to handle inter-chiplet network latencies.
While the paper addresses a relevant and challenging problem, its central claims rest on a methodologically flawed validation approach, unsubstantiated assertions about performance, and an overstatement of the framework's modularity. The reliance on an optimistic rollback mechanism for correctness—whose overhead is never quantified—casts significant doubt on the reported performance benefits.
Strengths
- Problem Relevance: The work tackles a critical and timely problem in computer architecture. The need for a flexible and fast simulation framework for multi-chiplet systems is undeniable.
- Conceptual Approach to Synchronization: The idea of "on-demand" synchronization, which avoids unnecessary global barriers, is a sound principle for reducing simulation overhead compared to per-cycle or fixed time-quantum methods.
- Inclusion of Artifact: The authors provide access to the source code via GitHub and Zenodo, which is commendable and allows for verification of their implementation.
Weaknesses
- Fundamental Methodological Flaw in Core Simulation Loop: The paper's three-stage simulation process (Section 3.2, page 5) is critically flawed. In Stage 1, simulation proceeds with zero-load latency estimates. In Stage 3, it re-runs with accurate latencies from a separate NoI simulation. The authors acknowledge that this can lead to timing violations and reordering of memory accesses. Their proposed solution is an "optimistic [27] execution approach" using "checkpointing and rollback to resolve conflicts" (Section 3.2, page 6). This is a massive red flag. The authors then claim that "such violations are rare." This is an extraordinary claim that requires extraordinary evidence, yet no data is provided anywhere in the paper to substantiate this. The frequency of rollbacks, the overhead of checkpointing, and the performance penalty of re-simulation are entirely ignored. Without this data, the performance results presented are meaningless, as they may not account for the potentially crippling cost of ensuring correctness (the kind of rollback accounting required is sketched after this list).
- Weak and Indirect Accuracy Validation: The validation in Section 5.2 (page 9) is unconvincing. The authors compare LEGOSim's results for the SIMBA and CiM-based architectures against performance numbers reported in the original papers [69, 14]. This is not a rigorous validation. It is an indirect comparison susceptible to countless confounding variables, including minor differences in architectural configuration, workload inputs, and internal simulator assumptions. A proper validation requires a direct, head-to-head comparison against a "golden reference" simulator (e.g., a sequential, cycle-accurate gem5 model) running the exact same configuration and workload. The comparison in Figure 7 is a step in this direction, but it lacks the necessary details on the gem5 configuration to be verifiable.
- "Minimal Modification" Claim is Overstated: The UII, presented in Section 4 (page 7-8), is described as enabling integration with "minimal modifications." The evidence provided contradicts this. Integrating Sniper required inserting Sleep() calls to manage its non-cycle-driven model. Integrating Scale-Sim involved writing to and reading from files. Integrating GPGPU-Sim required wrapping calls with cudaMemcpy(). These are significant, non-trivial, and highly simulator-specific engineering tasks. This is not a unified interface but a set of bespoke integration strategies. The claim of "minimal modification" is misleading.
- Unaddressed Central Bottleneck: The paper's own scalability analysis in Section 5.4 (Figure 12, page 10) demonstrates that the centralized Global Manager (GM) becomes a performance bottleneck under high inter-chiplet traffic volumes. The authors acknowledge this and propose a "distributed management scheme" as a solution. However, this solution is neither designed, implemented, nor evaluated. It is pure hand-waving. A framework that claims to be scalable must provide a scalable solution to its core coordination mechanism, not just mention one as future work.
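To make the rollback concern concrete, this is the kind of accounting the evaluation should report; the counters and costs in this Python sketch are entirely hypothetical and serve only to show how even a "rare" violation rate can dominate runtime:

    # Illustrative accounting (hypothetical numbers, not from the paper):
    # the cost of optimistic execution that the evaluation must report.
    def effective_time(base_time_s, n_checkpoints, checkpoint_cost_s,
                       n_rollbacks, reexec_cost_s):
        overhead = n_checkpoints * checkpoint_cost_s + n_rollbacks * reexec_cost_s
        return base_time_s + overhead, overhead / base_time_s

    # Example: a 300 s run, 40,000 checkpointed epochs, and a claimed-"rare"
    # 0.5% violation rate triggering 200 rollbacks.
    total, rel = effective_time(base_time_s=300.0, n_checkpoints=40_000,
                                checkpoint_cost_s=0.002, n_rollbacks=200,
                                reexec_cost_s=0.5)
    print(f"effective runtime: {total:.0f} s (+{rel:.0%})")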
Questions to Address In Rebuttal
- Provide quantitative data on the optimistic execution mechanism. For the benchmarks presented, what is the precise frequency of timing violations that trigger a rollback? What is the performance overhead (in cycles or wall-clock time) of the checkpointing mechanism and the cost of re-executing from a checkpoint?
- Justify the choice of validating against published numbers instead of a direct, controlled comparison against a golden-reference simulator. For the gem5 comparison in Figure 7, please provide the exact configuration scripts and command lines used for both LEGOSim and gem5 to ensure the comparison is on an identical architectural model.
- Please provide a concrete example of integrating a new, third-party, event-driven simulator. Detail the specific lines of code and internal simulator logic that must be modified to conform to the UII, to give a more realistic picture of the integration effort beyond the "minimal" claim.
- How would the proposed "distributed management scheme" maintain global causal consistency? A distributed system of managers introduces its own complex synchronization and communication overhead. Please provide a preliminary design and analysis showing how this scheme would not simply shift the bottleneck from computation to inter-manager communication.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces LEGOSim, a parallel simulation framework designed to address the increasingly critical challenge of modeling heterogeneous multi-chiplet systems. The work identifies two primary shortcomings in the current simulation landscape: the lack of modular flexibility in monolithic simulators (e.g., gem5) and the prohibitive synchronization overhead of existing parallel simulation techniques.
The core contribution of LEGOSim is a pragmatic meta-framework that integrates existing, specialized simulators (termed "simlets") as independent processes. This is enabled by two key innovations:
- A Unified Integration Interface (UII), which provides a standardized API for communication and control, aiming to minimize the modifications required to plug in existing simulators like Sniper, GPGPU-Sim, etc.
- An on-demand synchronization protocol built upon a three-stage decoupled simulation methodology. This intelligently triggers synchronization only when inter-chiplet communication occurs, drastically reducing overhead compared to per-cycle or fixed time-quantum approaches while preserving accuracy.
The authors validate LEGOSim's fidelity against published results for real architectures and demonstrate its utility through several case studies exploring the design space of interconnect topologies, memory protocols, and on-chip buffer sizes. The work is positioned as an open-source tool to facilitate community-wide research in the burgeoning field of chiplet-based design.
Strengths
The primary strength of this work is its timeliness and direct relevance to a critical, emergent problem in computer architecture. As the industry pivots from monolithic SoCs to heterogeneous, chiplet-based integration (e.g., AMD's Zen, Intel's Ponte Vecchio), the need for fast, accurate, and flexible pre-silicon evaluation tools has become paramount. This paper is not an academic exercise; it is building a necessary piece of infrastructure for the next generation of hardware design.
The conceptual approach is both elegant and pragmatic. Rather than building another monolithic simulator from scratch, LEGOSim cleverly leverages the vast ecosystem of existing, highly-detailed simulators. This "standing on the shoulders of giants" approach is powerful. The core idea of treating simulators as composable "LEGO bricks" is not new in spirit (SST and SimBricks come to mind), but the specific implementation here is compelling.
The on-demand synchronization mechanism (detailed in Section 3.2, p. 5) is the technical heart of the paper and represents a significant contribution. It provides a sophisticated middle ground in the classic speed-vs-accuracy trade-off. By tying synchronization to actual communication events, it avoids the brute-force overhead of per-cycle locks while mitigating the accuracy loss of coarse-grained time quanta. The three-stage simulation flow (Figure 4, p. 5)—estimating latency, simulating the interconnect in isolation, and then re-simulating with accurate latencies—is a well-reasoned methodology borrowed from optimistic simulation principles and applied effectively here.
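To make the three-stage flow concrete, the following schematic Python sketch (the reviewer's own illustration with a toy hotspot NoI model; the framework's actual interfaces are far more detailed) traces a few inter-chiplet messages through the three stages and counts how many arrivals are reordered once contended latencies replace the zero-load estimates:

    # Schematic of the three-stage decoupled flow (illustration only).
    ZERO_LOAD_LATENCY = 10  # cycles, the optimistic stage-1 estimate

    def stage1_run(messages):
        # Stage 1: run simlets with zero-load inter-chiplet latency and
        # record the resulting traffic trace.
        return [(src, dst, t_send, t_send + ZERO_LOAD_LATENCY)
                for (src, dst, t_send) in messages]

    def stage2_noi(trace):
        # Stage 2: replay the trace through a standalone NoI model to get
        # contended latencies (toy model: the "hbm" chiplet is a hotspot).
        return {(src, dst, t): 40 if dst == "hbm" else 8
                for (src, dst, t, _) in trace}

    def stage3_rerun(trace, latencies):
        # Stage 3: re-run with accurate latencies and flag messages whose
        # corrected arrival order differs, i.e. candidates for rollback.
        corrected = [(src, dst, t, t + latencies[(src, dst, t)])
                     for (src, dst, t, _) in trace]
        old_order = [m[:3] for m in sorted(trace, key=lambda m: m[3])]
        new_order = [m[:3] for m in sorted(corrected, key=lambda m: m[3])]
        reordered = sum(a != b for a, b in zip(old_order, new_order))
        return corrected, reordered

    msgs = [("cpu", "hbm", 0), ("cpu", "gpu", 2), ("gpu", "dram", 3)]
    trace = stage1_run(msgs)
    final, n_reordered = stage3_rerun(trace, stage2_noi(trace))
    print("corrected arrivals:", [m[3] for m in final], "reorderings:", n_reordered)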
Finally, the thoroughness of the evaluation and the inclusion of diverse case studies (Section 6, p. 10-13) are major strengths. The authors don't just present a framework; they demonstrate its concrete value in solving real-world design space exploration problems, from analyzing HBM3 vs. DDR5 (Section 6.4) to selecting an interconnect topology (Section 6.3). This showcases the tool's utility and significantly bolsters the paper's impact. The decision to open-source the framework is commendable and will be a great service to the research community.
Weaknesses
While the work is strong, its positioning within the broader context of parallel discrete event simulation (PDES) and existing modular frameworks could be sharpened.
- Positioning Relative to Existing Modular Frameworks: The paper mentions SST and SimBricks but dismisses them somewhat cursorily (Section 2.1, p. 2). A more detailed, principled comparison would be beneficial. For example, SST is also a parallel framework designed for integrating diverse simulation components. What are the fundamental architectural differences in LEGOSim's UII and Global Manager (GM) that enable it to have lower overhead or require fewer code modifications? A deeper discussion of the trade-offs (e.g., SST's distributed scheduler vs. LEGOSim's centralized GM) would help readers better situate this work.
- Scalability of the Global Manager: The current architecture relies on a centralized Global Manager (GM) to coordinate all inter-simlet communication and synchronization. This introduces a potential single-point-of-failure and a scalability bottleneck. The authors acknowledge this and suggest a distributed alternative in their scalability analysis (Section 5.4, p. 10), which is a good first step. However, the potential limitations of the current, foundational architecture should be discussed more upfront. The framework's performance is fundamentally tied to the efficiency of this centralized component.
- Overhead of Optimistic Execution: The paper notes that timing violations discovered in Stage 3 of the simulation are handled via "checkpointing and rollback" (Section 3.2, p. 6), a classic technique in optimistic PDES. The authors assert that such violations are "rare." This claim is critical to the framework's overall performance, as rollbacks can be extremely costly. However, this assertion is not backed by data. The frequency of rollbacks is highly dependent on the communication patterns and timing characteristics of the workload. Some quantitative evidence on how often these rollbacks occur in their experiments would significantly strengthen the claims of efficiency.
Questions to Address In Rebuttal
- The Global Manager (GM) is a central coordinator, which raises concerns about scalability. While you demonstrate a distributed GM can improve performance in Section 5.4, could you elaborate on the performance limits of the baseline single-GM architecture? Specifically, at what number of simlets or inter-chiplet communication rate does the GM itself become the primary performance bottleneck?
- Your on-demand synchronization relies on a three-stage process, with the potential for timing violations in Stage 3 to be corrected by checkpointing and rollbacks. You state in Section 3.2 that these violations are "rare." Could you provide quantitative data from your experiments on the frequency of these rollback events and their associated performance overhead? How does this frequency change with workloads that have more irregular or bursty communication patterns?
- Could you provide a more detailed, qualitative comparison of your Unified Integration Interface (UII) to the component interface of a framework like SST? What specific design choices in the UII make the process of integrating an existing simulator like gem5 or Sniper fundamentally simpler or require less code modification than doing so within the SST framework?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper introduces LEGOSim, a parallel simulation framework for heterogeneous multi-chiplet systems. The authors identify two key challenges with existing simulators: a lack of modular integration flexibility and inefficient synchronization mechanisms (per-cycle and time-quantum). The proposed solution consists of two primary components: a Unified Integration Interface (UII) to facilitate the modular integration of diverse simulators ("simlets"), and an "on-demand" (OD) synchronization protocol, managed by a central Global Manager (GM), to reduce simulation overhead.
The core novel claim appears to be the OD synchronization protocol, which is realized through a three-stage decoupled simulation flow. While the framework is well-engineered and addresses a timely problem, its foundational concepts are not entirely new. The idea of synchronizing only when communication occurs is a cornerstone of Parallel Discrete Event Simulation (PDES). The contribution of this paper lies not in inventing this concept, but in its specific application and methodological refinement for the multi-chiplet domain.
Strengths
- A Novel Simulation Methodology: The most significant novel contribution is the three-stage decoupled simulation workflow described in Section 3.2 and Figure 4 (page 5). This flow—(1) initial simulation with zero-load latency to generate traffic traces, (2) offline NoI simulation on those traces to get accurate latencies, and (3) a final, corrected simulation run—is a clever and practical methodology. It effectively decouples the timing of inter-chiplet communication from the execution of the chiplet simulators themselves, which is the key enabler for the proposed on-demand synchronization. This workflow represents a tangible advancement over prior art that attempts to resolve latencies online.
- Well-Considered Abstraction for Integration (UII): While component-based simulation frameworks with defined interfaces are not new (e.g., SST [63]), the UII presented in Section 4 (page 7-8) is thoughtfully designed for the specific challenges of integrating highly disparate simulators. The authors' consideration for handling cycle-accurate (gem5), non-cycle-driven (Sniper), and abstracted DSA simulators within a single API demonstrates a high degree of engineering novelty and provides a valuable blueprint for future work.
Weaknesses
- Overstated Novelty of the Core Synchronization Concept: The paper presents "on-demand synchronization" as a new idea that stands in contrast to per-cycle and time-quantum methods. However, this is a well-established concept in the PDES community, often referred to as event-driven synchronization. The work of Fujimoto [27] and others on conservative and optimistic synchronization protocols laid this foundation decades ago. Frameworks like SST [63] are also built on an event-driven core. The authors fail to position their work within this broader context, giving the impression that the idea of synchronizing only on interactions is novel, when in fact it is the application methodology (the three-stage flow) that is new. The contribution is one of methodology, not of a fundamental synchronization principle.
- Misleading "Formal Analysis for Validation": Section 3.3 (page 6) is titled "Formal Analysis for Validation." This is a mischaracterization. The section presents a standard queuing theory model (G/G/1) for estimating NoI latency (a generic form of such an estimate is sketched after this list). This model is used to parameterize the simulation, not to validate the correctness of the LEGOSim framework or its synchronization protocol. A formal validation would involve proofs of correctness, such as demonstrating the absence of causality errors or deadlocks in the synchronization mechanism. The current section does not provide this.
- Unexamined Scalability of the Central Global Manager: The entire OD synchronization scheme hinges on a centralized Global Manager (GM) that arbitrates all inter-simlet requests. This design creates an obvious potential for a serial bottleneck, limiting the scalability of the parallel simulation. The authors briefly acknowledge this in Section 5.4 (page 10), suggesting a distributed management scheme as a remedy, but this is presented as an afterthought. A core contribution of the paper is simulation performance, yet the performance limitations of its central component are not analyzed. The problem of creating a correct, deadlock-free distributed time management system is a major research challenge in PDES and cannot be waved away as a simple extension.
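For reference, the kind of queueing estimate Section 3.3 appears to provide is illustrated below with Kingman's classic G/G/1 waiting-time approximation; this is a generic textbook form, not necessarily the paper's exact model:

    # Generic G/G/1 mean-wait approximation (Kingman's formula) of the kind a
    # queueing-based NoI latency estimate relies on; the paper's exact model
    # may differ, so treat this purely as a reference point.
    def kingman_wait(utilization, ca2, cs2, mean_service):
        # W ~ rho/(1-rho) * (ca^2 + cs^2)/2 * E[S]
        assert 0 <= utilization < 1, "queue must be stable"
        return (utilization / (1 - utilization)) * ((ca2 + cs2) / 2) * mean_service

    # Example: 70% link utilization, moderately bursty arrivals, and a
    # near-deterministic packet service time of 4 cycles.
    latency = 4 + kingman_wait(utilization=0.7, ca2=1.5, cs2=0.1, mean_service=4)
    print(f"estimated per-packet NoI latency: {latency:.1f} cycles")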
Questions to Address In Rebuttal
- Please clarify the novelty of your on-demand synchronization scheme with respect to established conservative Parallel Discrete Event Simulation (PDES) protocols. How does your Global Manager-based approach fundamentally differ from the time-management kernels in existing modular frameworks like SST [63]? Is the primary novelty the three-stage simulation flow, and if so, should the paper's contributions be re-framed to emphasize this methodological aspect rather than the general concept of on-demand synchronization?
- Could you justify the title of Section 3.3, "Formal Analysis for Validation"? As the section describes a latency model rather than a formal proof of the simulator's correctness, please explain what is being formally validated.
- The reliance on a single, centralized Global Manager (GM) appears to be a critical scalability bottleneck. Beyond the brief mention in Section 5.4, have you analyzed the performance impact of this centralized design as the number of simlets and the frequency of their communication increases? What are the specific challenges (e.g., ensuring global event ordering, deadlock avoidance) in implementing the proposed distributed management scheme, and how would that affect the framework's complexity and correctness?
LLMulator: Generalizable Cost Modeling for Dataflow Accelerators with Input-Adaptive Control Flow
Abstract
Precise and rapid performance prediction for dataflow-based accelerators is essential for efficient hardware design and design space exploration. However, existing methods often fall short due to limited generalization across hardware architectures, ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
This paper introduces LLMulator, a framework that uses pre-trained Large Language Models (LLMs) for performance prediction of dataflow accelerators. The authors claim to address three key generalization challenges: across diverse applications, hardware configurations, and input-dependent control flows. The proposed method consists of three primary components: (1) a numeric modeling technique that tokenizes numbers and predicts performance values digit-by-digit; (2) a dynamic calibration mechanism using reinforcement learning (DPO) to refine predictions based on runtime feedback; and (3) a progressive data synthesis framework to generate training data. The authors evaluate their framework against several baselines (TLP, GNNHLS, Tenset-MLP) and claim state-of-the-art accuracy.
While the paper tackles a relevant problem, the methodology presents several unexamined complexities, the evaluation contains significant methodological gaps, and the claims of generalization are not fully substantiated by the provided evidence.
Strengths
- Well-Motivated Problem: The paper correctly identifies the limitations of existing performance prediction methods, particularly regarding generalization to unseen applications, hardware, and dynamic inputs.
- Realistic Ground Truth Generation: The use of an open-source toolchain (SiliconCompiler, Bambu HLS, OpenROAD) for profiling and generating ground truth data (Section 7.1) is a strong point, lending credibility to the dataset used for training and evaluation.
- Comprehensive Structure: The framework is structured logically to address the three identified challenges (application, input, and hardware generalization), which provides a clear narrative for the contributions.
Weaknesses
- Unjustified Complexity of Numeric Modeling: The proposed "decoupled numerical modeling" (Section 4.2), which predicts performance digit-by-digit via classification, is presented as a key innovation. However, the paper fails to convincingly argue why this complex mechanism is superior to a standard regression head with a proper loss function (e.g., log-scale MSE) that is less sensitive to extreme value ranges. The only evidence is an ablation ("NoEnc" in Table 3), which conflates the input encoding with the output modeling. A direct comparison of the output mechanism alone is required to justify this added complexity (a sketch of the requested regression baseline's loss follows this list).
- Misleading Comparison with Rule-Based Models: The comparison against Timeloop in Figure 11 is fundamentally flawed. Timeloop is a specialized analytical model for regular, loop-nest-based tensor computations. The workloads from Table 2, especially those from NLP, contain complex control flow and heterogeneous operator graphs that fall far outside Timeloop's intended domain. To claim superiority by evaluating a general-purpose model against a specialist on a generalist's turf is a classic apples-to-oranges fallacy. This comparison does not validate the model's accuracy but rather highlights a misapplication of the baseline tool.
- Opaque and Potentially Biased Dataset Synthesis: The "progressive data generation framework" (Section 6) is described at a high level but lacks the detail required for reproducibility and critical assessment. The process, especially the LLM-based self-augmentation, is a black box. There is no analysis to demonstrate that the synthetic data is representative of real-world hardware design patterns or that it does not simply learn the biases of the generation scripts themselves. The claim of "producing datasets aligned closely with realistic hardware implementations" is an assertion without statistical evidence comparing the properties of the generated dataset to a corpus of real-world designs.
- Unexamined Trade-offs of Dynamic Calibration: The dynamic calibration mechanism (Section 5) reports a reduction in MAPE for cycle prediction from 28.9% to 16.4% (Table 3). However, several critical details are omitted:
  - Overhead: The computational cost and latency of the DPO update step are not quantified. How many profiling runs and update iterations are needed to achieve this improvement?
  - Stability: Reinforcement learning methods can be unstable. There is no analysis of the convergence properties or the risk of overfitting to the most recent profiling data in the replay buffer.
  - Accuracy: A final MAPE of 16.4% on dynamic cycles is still a significant error. It is not clear that this level of accuracy is sufficient for reliable design space exploration, especially given the added complexity and runtime overhead.
- Prohibitive Inference Latency for Practical Use: Table 4 shows that LLMulator's inference time is approximately 1.01s per prediction on Polybench, an order of magnitude slower than GNNHLS (0.11s). The authors dismiss this as "acceptable compared to the longer synthesis times," but this ignores the primary use case: large-scale design space exploration (DSE), where millions of design points must be evaluated quickly. A 10x slowdown makes the proposed tool impractical for this critical task. The acceleration techniques in Section 5.3 show only marginal improvements (Table 5) and do not close this gap.
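As a reference point for the ablation requested in the first weakness, a log-scale MSE regression loss is a one-liner; the sketch below is framework-agnostic and purely illustrative:

    # Minimal log-scale MSE loss of the kind the requested regression
    # baseline would use (illustrative; any backbone or regression head works).
    import math

    def log_mse(predicted_cycles, true_cycles):
        # Compare in log space so errors on 1e3-cycle and 1e8-cycle kernels
        # contribute on a comparable scale, unlike plain MSE on raw values.
        errs = [(math.log10(max(p, 1.0)) - math.log10(max(t, 1.0))) ** 2
                for p, t in zip(predicted_cycles, true_cycles)]
        return sum(errs) / len(errs)

    print(log_mse([1.2e4, 9.0e7], [1.0e4, 1.0e8]))  # small, scale-balanced loss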
Questions to Address In Rebuttal
- Regarding Numeric Modeling: Can you provide an ablation study that isolates the output modeling strategy? Specifically, please compare the performance of your digit-by-digit classification approach against a standard regression head (using both MSE and log-scale MSE loss) on the same LLM backbone and input encoding, to prove its superiority.
- Regarding Dynamic Calibration: Please quantify the full runtime overhead of the dynamic calibration process. How many profiling runs and DPO updates were required to reduce the MAPE from 28.9% to 16.4%? Furthermore, please justify why a final MAPE of 16.4% on cycles is a strong result for a system with this much complexity, and discuss its practical utility in DSE.
- Regarding the Timeloop Comparison: Please justify the comparison in Figure 11. Alternatively, provide a new comparison against Timeloop on a workload for which it is explicitly designed (e.g., a pure GEMM operator with varying tiling and mapping strategies) to demonstrate where LLMulator offers a genuine advantage.
- Regarding Dataset Synthesis: Please provide a detailed characterization of the synthetic dataset. This should include statistical distributions of key features (e.g., loop nesting depth, array access patterns, operator types) and a comparison of these distributions against a well-known corpus of real-world HLS benchmarks to substantiate the claim of realism.
- Regarding Model Choice and Scalability: The choice of a small 1B parameter model seems arbitrary. How do the results and, critically, the 10x inference latency penalty, scale with larger, more capable models (e.g., 7B, 13B)? Is it possible that a larger model could achieve better accuracy without the complex, bespoke numeric modeling and dynamic calibration frameworks?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces LLMulator, a comprehensive framework for performance prediction of dataflow accelerators that aims to overcome the critical generalization limitations of prior art. The authors correctly identify three major axes of failure for existing models: generalization to unseen applications (especially at numerical extremes), adaptation to input-dependent control flow, and transference to diverse hardware configurations.
To address this, LLMulator is not a single model but a holistic system built on three innovative pillars:
1. A progressive numeric modeling technique that treats numerical values in code and performance metrics as sequences of digits, using a categorical classification approach. This is designed to improve accuracy on out-of-distribution numerical ranges and uniquely provides prediction confidence scores (a minimal decoding sketch follows this summary).
2. An input-adaptive dynamic calibration framework based on reinforcement learning (Direct Preference Optimization), which refines performance predictions at runtime by incorporating feedback from live profiling, thereby handling dynamic control flow.
3. A progressive data synthesis framework that systematically generates a diverse and realistic dataset covering software, hardware, and mapping variations, including the generation of intermediate compiler reasoning steps to enhance model learning.
The authors conduct an extensive evaluation, demonstrating that LLMulator achieves a state-of-the-art mean absolute percentage error (MAPE) of 12.2%, significantly outperforming established baselines like TLP and GNNHLS.
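As a minimal, framework-agnostic illustration of pillar 1 (not the authors' code; the logits are made-up placeholders), digit-by-digit decoding treats each position as a 10-way classification whose winning probability doubles as a confidence score:

    # Minimal sketch of digit-wise decoding with per-digit confidence.
    import math

    def softmax(logits):
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        s = sum(exps)
        return [e / s for e in exps]

    def decode_digits(per_digit_logits):
        # Each position is a 10-way classification over '0'..'9'; the chosen
        # class's probability serves as a confidence score for that digit.
        digits, confidences = [], []
        for logits in per_digit_logits:
            probs = softmax(logits)
            d = max(range(10), key=lambda i: probs[i])
            digits.append(str(d))
            confidences.append(probs[d])
        return int("".join(digits)), confidences

    # Three-digit example: confident leading digits, ambiguous last digit.
    fake_logits = [[0, 9, 0, 0, 0, 0, 0, 0, 0, 0],      # strongly '1'
                   [0, 0, 0, 0, 0, 0, 0, 0, 8, 0],      # strongly '8'
                   [0, 0, 1.2, 0, 1.0, 0, 0, 0, 0, 0]]  # '2' vs '4' ambiguous
    value, conf = decode_digits(fake_logits)
    print(value, [round(c, 2) for c in conf])  # e.g. 182 with low last-digit confidence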
Strengths
The primary strength of this work lies in its insightful diagnosis of the core problems with applying machine learning to systems modeling and its subsequent development of a sophisticated, multi-pronged solution. This is not a naive application of a large language model (LLM); it is a well-engineered system that leverages an LLM's strengths while meticulously compensating for its weaknesses.
- A Paradigm Shift in Numerical Modeling: The most significant conceptual contribution is the shift away from standard regression for performance prediction. The progressive numeric modeling (Section 4, page 4) is a brilliant insight. By treating performance prediction as a digit-by-digit classification task, the authors elegantly solve the "edge value" problem where regression models using normalization fail catastrophically. Furthermore, this approach's ability to output a confidence distribution (logits) for each digit is a profoundly practical feature for designers, who need to understand not just the prediction, but the model's certainty. This connects to a broader push for interpretability and uncertainty quantification in ML for science and engineering.
- Bridging Static Analysis and Dynamic Execution: The dynamic calibration framework (Section 5, page 6) is a very clever solution to the long-standing problem of input-dependent behavior. Most performance models are entirely static. By creating a closed loop with a profiler and adapting the concept of Direct Preference Optimization (DPO) from the RLHF literature, the authors have created a model that learns from execution. This is a powerful idea that bridges the gap between fast, static cost models and the ground truth of dynamic execution. It has the potential to make ML-based prediction far more trustworthy in iterative design space exploration (DSE) loops (a minimal sketch of the underlying preference loss follows this list).
- Mature Approach to the "Data Problem": Data-driven methods in computer architecture are often hamstrung by the lack of large, diverse, high-quality datasets. The authors' progressive data synthesizer (Section 6, page 8) is a comprehensive and well-thought-out solution. Moving from general AST-based generation to dataflow-specific and finally LLM-augmented code is a robust strategy. The inclusion of "Reasoning Data Formatting" (Figure 9, page 9), which is conceptually parallel to the "Chain-of-Thought" prompting in NLP, is an excellent cross-pollination of ideas, helping the model learn the why (intermediate compiler features) behind the what (final performance).
- Strong Empirical Validation: The experimental results are thorough and convincing. The ablation studies in Table 3 (page 11) and Table 7 (page 12) are particularly strong, as they clearly tie the significant performance gains back to each of the specific contributions (numeric encoding, dynamic calibration, and data synthesis). The comparison against both other ML models (TLP, GNNHLS) and an analytical model (Timeloop, Figure 11, page 10) firmly establishes the work's superiority and breadth.
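For readers unfamiliar with DPO, the sketch below shows the generic preference loss as it would apply to calibration, with the profiled ground truth treated as the preferred output and the stale prediction as the rejected one; the log-probabilities are placeholders, and this is the standard objective rather than the authors' exact training code:

    # Generic DPO preference loss applied to calibration (illustration only).
    import math

    def dpo_loss(logp_chosen, logp_rejected,
                 ref_logp_chosen, ref_logp_rejected, beta=0.1):
        # Preferred = measured value, rejected = the model's stale prediction.
        margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
        return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))  # -log(sigmoid)

    # Example pair: after profiling, the policy should assign more probability
    # to the measured cycle count than to its earlier, inaccurate prediction.
    print(round(dpo_loss(logp_chosen=-12.0, logp_rejected=-9.5,
                         ref_logp_chosen=-13.0, ref_logp_rejected=-9.0), 4))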
Weaknesses
While the core ideas are excellent, the work could be strengthened by addressing the following points, which are less about fundamental flaws and more about the practical implications and boundaries of the proposed system.
- Practicality of the Calibration Loop: The dynamic calibration loop is a powerful concept, but its real-world utility depends heavily on its overhead. The paper reports runtime latency (Table 4, page 11), showing LLMulator is an order of magnitude slower than GNNHLS. While the authors argue this is acceptable compared to full synthesis, the cost of the profiling step within the DPO loop is a critical factor for DSE. A full profiling run for every prediction update may be prohibitively slow. The paper could benefit from a discussion of the trade-offs here: how many DPO iterations are needed for convergence, and what is the total time cost (prediction + profiling) in a realistic DSE scenario?
- System Complexity and Reproducibility: LLMulator is a complex system with many interacting components (three different data generators, a static LLM, a dynamic DPO updater, parsers, profilers). This complexity raises questions about its robustness and the engineering effort required to retarget it to a completely new class of accelerator (e.g., analog in-memory computing, neuromorphic). The paper presents it as a general framework, but its effectiveness is likely tied to the specific rules and templates within the data synthesizer.
- Characterization of Failure Modes: The authors commendably note that workloads like jacobi-2d exhibit higher errors due to complexity (Section 7.2, page 10). This is a crucial finding that deserves more exploration. This work sits at the intersection of program analysis and LLM semantics. What are the fundamental limits? Is the problem related to non-local code interactions, complex data structures, or aliasing that breaks the LLM's semantic understanding? A deeper analysis of these failure modes would be a valuable contribution to the broader field of using LLMs for code analysis.
Questions to Address In Rebuttal
- Could the authors elaborate on the practical cost of the dynamic calibration loop? Specifically, in a design space exploration scenario, what is the expected wall-clock time for the model to adapt to a significant change in input data distribution, including the necessary profiling runs?
- Regarding the data synthesizer: how much domain-specific expertise is required to retarget it for a novel accelerator architecture that has fundamentally different primitives than the dataflow style explored here? For instance, how would the AST-based and dataflow-specific generators be adapted?
- The sensitivity study on model size (Table 10, page 12) is very interesting, showing larger models improve accuracy. Does this suggest that the remaining prediction errors are not due to issues with the framework's structure (e.g., numeric encoding) but are rather bounded by the base LLM's reasoning capability? Could you speculate on whether even larger, frontier models could overcome the issues seen in complex workloads like jacobi-2d?
- The confidence scores from the numeric output model are a fantastic feature. Have you explored using these scores to guide the DSE process itself? For instance, could the system actively trigger a more expensive, accurate simulation only when the LLMulator's prediction confidence is low?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents LLMulator, a framework for performance prediction of dataflow accelerators. The authors identify three key generalization challenges—across applications, inputs, and hardware configurations—and propose a tripartite solution. The core of the authors' novel claim rests on the synthesis and application of several recent machine learning techniques to the domain of hardware cost modeling. These contributions are: (1) a "progressive numeric modeling" method that treats performance prediction as a digit-by-digit classification task rather than direct regression; (2) a "dynamic calibration" framework using Direct Preference Optimization (DPO) to refine predictions based on live execution feedback for input-dependent control flows; and (3) a "progressive data generation" pipeline that combines syntactic generation, domain-specific templates, and LLM-based augmentation to create a comprehensive training dataset.
While the constituent concepts (e.g., number-aware tokenization, reinforcement learning from feedback, curriculum-based data generation) exist in prior art within the broader machine learning literature, their specific adaptation, integration, and application to solve the multi-faceted generalization problem in dataflow accelerator modeling represents a novel engineering contribution. The core novelty is not in the invention of a new algorithm from first principles, but in the sophisticated scaffolding built around a pre-trained LLM to imbue it with capabilities it natively lacks.
Strengths
The primary strength of this work lies in its novel approaches to adapting large language models for a task they are not inherently designed for: precise, generalizable numerical prediction in a specialized domain.
- Novel Prediction Formulation: The shift from a standard regression model (as seen in prior work like TLP [89]) to a progressive, digit-wise categorical classification model (Section 4.2, page 6) is a significant conceptual novelty. This directly addresses the known failure mode of regression models on out-of-distribution or "edge" values by breaking the problem into a sequence of smaller, bounded classification tasks. This allows for confidence estimation at each digit, a feature absent in direct regression approaches.
- Novel Application of Reinforcement Learning: The use of DPO, a technique from the very recent LLM alignment literature [66], for dynamic, input-aware calibration of a performance model (Section 5, page 6) is a novel domain transfer. Prior work on input-adaptive performance modeling has typically relied on extracting static features or using more traditional online learning methods. Applying a preference-based RL method to learn from "ground-truth is better than prediction" pairs is a genuinely new mechanism for this problem space.
- Novel Data Synthesis Pipeline: The multi-stage data generation framework (Section 6, page 8) is a more principled and sophisticated approach than prior methods. While individual components like AST-based generation [4] and using intermediate compiler representations [80] exist, the progressive pipeline that starts with general syntactic structures, specializes them for dataflow, and then uses an LLM for semantic diversification is a novel and powerful concept. The inclusion of intermediate reasoning data (Figure 9) inspired by chain-of-thought prompting [78] is a clever way to guide the model, moving beyond simple input-output pairs.
Weaknesses
The weaknesses of the paper, from a novelty perspective, are that many of the underlying mechanisms are imported from the ML/NLP fields, and the gains must be weighed against the significant increase in system complexity.
- Derived, Not Foundational, Novelty: The core ideas are clever adaptations, not fundamental inventions. For example, improving LLM numeracy by isolating digits during tokenization (Section 4.1) is a known technique to improve model arithmetic [16]. The "chain-of-thought" style reasoning in the dataset is borrowed directly from LLM prompting research [78]. The novelty is purely in the application to hardware cost modeling. An expert in LLMs would not find the techniques themselves new, only their target domain.
- Complexity vs. Benefit Trade-off: The proposed solution is substantially more complex than prior art. It combines a fine-tuned LLM, a reinforcement learning loop with a replay buffer, and a complex multi-stage data synthesizer. The end result, as shown in Table 3, is a reduction in MAPE from 20.0% (TLP) and 28.9% (GNNHLS) to 12.2%. While this is a clear improvement, the "delta" is not an order-of-magnitude leap. A critical assessment must question whether the marginal accuracy improvement justifies the massive increase in framework complexity, training cost, and inference latency (Table 4 shows LLMulator is ~10x slower than GNNHLS). The novelty is high, but its efficiency is questionable.
Questions to Address In Rebuttal
-
On Numeric Modeling: The digit-wise categorical output is a novel formulation. However, other methods of converting regression to classification exist, such as quantizing the entire output range into a set of bins. Could the authors clarify why the progressive, digit-by-digit approach is fundamentally superior to a simpler quantization scheme for this problem? Does the sequential dependency modeled between digits provide a crucial inductive bias?
-
On Dynamic Calibration: The choice of DPO is contemporary and interesting. However, it is one of many possible online learning or RL algorithms. Could the authors justify why DPO is uniquely suited for this task compared to, for instance, a simpler online fine-tuning approach on new data points or other RL algorithms like PPO? Is the "preference" aspect of DPO critical, or is it simply a convenient and effective implementation of RL from feedback?
-
On Data Synthesis: The progressive data synthesizer is presented as a key contribution. To substantiate this claim, a more detailed ablation is needed. Table 7 ablates the entire synthesizer, but not its individual stages. Can the authors provide evidence on the marginal contribution of each stage? Specifically, how much does the final, LLM-based generation stage (Section 6.1, page 8) improve performance over a dataset generated only by the AST-based and dataflow-specific stages? This would clarify the novelty and utility of incorporating LLM-based self-augmentation.
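For reference on question 1 above, the two output formulations being contrasted can be sketched roughly as follows; the base-10 digit alphabet, the bin edges, and the midpoint decoding are illustrative assumptions, not the paper's implementation.

```python
def digits_to_value(digit_classes):
    """Decode a digit-wise categorical prediction (most-significant digit
    first) into an integer, e.g. [1, 2, 8] -> 128. Stands in for the
    progressive digit-by-digit formulation discussed above."""
    value = 0
    for d in digit_classes:
        value = value * 10 + d
    return value

def bin_to_value(bin_index, bin_edges):
    """Alternative raised in the question: quantize the output range into
    bins, predict a bin index, and decode to the bin midpoint."""
    lo, hi = bin_edges[bin_index], bin_edges[bin_index + 1]
    return (lo + hi) / 2

print(digits_to_value([1, 2, 8]))            # 128
print(bin_to_value(2, [0, 100, 200, 400]))   # 300.0
```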
Swift and Trustworthy Large-Scale GPU Simulation with Fine-Grained Error Modeling and Hierarchical Clustering
Abstract
Kernel-level sampling is an effective technique for running large-scale GPU workloads on cycle-level simulators by selecting a representative subset of kernels, thereby significantly reducing simulation complexity and runtime. However, in large-scale GPU ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose STEM+ROOT, a kernel-level sampling methodology for GPU simulation that uses the distribution of kernel execution times as its primary signature. The method employs statistical techniques (the Central Limit Theorem) to determine sample sizes (STEM) and hierarchical clustering to group kernels with similar runtime behaviors (ROOT). The stated goal is to reduce simulation error and profiling overhead compared to prior work that relies on microarchitecture-independent features like instruction counts or Basic Block Vectors (BBVs). While the approach demonstrates impressive speedups and low profiling overhead, its core premise is built on a fundamentally hardware-dependent signature, and it fails to adequately address critical confounding factors such as inter-kernel architectural state, casting significant doubt on the trustworthiness and generality of its results.
Strengths
-
Problem Motivation: The paper effectively uses Figure 1 (Page 2) to illustrate the runtime heterogeneity of identical kernels in modern ML workloads. This observation clearly motivates the need for a more nuanced sampling approach than what is currently practiced.
-
Profiling Scalability: The primary practical contribution of this work is the significant reduction in profiling overhead. As demonstrated in Table 5 (Page 11), the proposed method is orders of magnitude faster to profile than instruction-level or BBV-based techniques (e.g., 1.33-5.53x overhead vs. 293-3704x for PKA/Sieve on CASIO). This is a clear and measurable advantage.
-
Statistical Formulation: The application of the Central Limit Theorem and KKT optimization to determine sample sizes (Section 3.2, 3.3) provides a formal, mathematical basis for the sampling strategy. This is a more rigorous foundation than the heuristic-based or fixed-threshold approaches used in some prior work.
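To make the flavor of such a CLT-based bound concrete, here is a minimal sketch in the standard relative-error form; the z value, error bound, and per-cluster statistics are illustrative and this is not necessarily the paper's exact STEM derivation.

```python
import math

def samples_needed(mean, std, eps=0.05, z=1.96):
    """CLT-style bound: number of i.i.d. draws so that the sample mean falls
    within relative error eps of the true mean with ~95% confidence (z=1.96)."""
    if mean <= 0:
        return 1
    cv = std / mean                              # coefficient of variation
    return max(1, math.ceil((z * cv / eps) ** 2))

# Illustrative per-cluster kernel runtime statistics (mean, std) in microseconds.
for name, (mu, sigma) in {"stable_kernel": (120.0, 6.0),
                          "heterogeneous_kernel": (950.0, 400.0)}.items():
    print(name, samples_needed(mu, sigma))
```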
Weaknesses
-
Fundamental Reliance on a Hardware-Dependent Signature: The central thesis—that kernel execution time distribution is a robust signature—is fundamentally flawed. Execution time is an outcome of the interaction between a program and a specific microarchitecture, not an intrinsic property of the workload. The authors' claim that statistical properties like the Coefficient of Variation (CoV) are "relatively hardware-agnostic" (Section 2.3, Page 3) is a strong, unsubstantiated assertion. Their own cross-GPU experiment in Section 5.4 (Page 10, Figure 13) directly contradicts this claim of robustness: using a profile from an H100 to sample for an H200 (a minor architectural evolution) results in an average error of 5.46%. This is a significant error rate that undermines the method's utility for design space exploration (DSE), which is a primary use case for cycle-level simulation.
-
Neglect of Inter-Kernel Architectural State (Warmup): The methodology implicitly assumes that each kernel invocation is an independent event, with no state carried over from previous kernels. In Section 6.2 (Page 12), the authors admit this limitation, stating their method "assumes ideal warmup of cache and hardware states." They attempt to dismiss this concern by asserting that inter-kernel cache reuse is a "negligible fraction" in their benchmarks. This is a dangerous and unproven generalization. For many real-world applications, performance is critically dependent on the state of caches (L1/L2), TLBs, and memory controllers left by preceding kernels. By ignoring this, the "sampled simulation" is not a simulation of a subsampled workload but rather a simulation of a collection of disconnected kernels, which is a different and potentially irrelevant experiment. The "extreme-case experiment" of flushing the L2 cache is insufficient evidence, as it doesn't model the complex, state-dependent patterns of realistic cache reuse.
-
Validity of Statistical Assumptions: The entire statistical model in Section 3.2 (Page 4) rests on the assumption that samples are independent and identically distributed (i.i.d.). The authors claim that "random sampling with replacement satisfies the i.i.d. assumption." While the draws are independent, the underlying distribution they are drawn from may not be identical throughout the workload's execution. Many programs exhibit distinct phases of behavior. Randomly sampling across the entire execution trace may not correctly represent the duration and characteristics of these phases, violating the spirit, if not the letter, of the i.i.d. assumption and potentially leading to biased estimates.
-
Inconsistent and Potentially Overstated Accuracy: The results show a stark contrast between the near-zero error (0.36%) on the CASIO suite (Table 3, Page 8), the more realistic error on Rodinia (0.93%), and the significant error in the cross-GPU experiment (5.46%). The exceptional performance on CASIO suggests that these workloads may represent a best-case scenario for this methodology—perhaps consisting of many short, independent kernels where statistical averaging works perfectly. This raises concerns that the method has been over-tuned or is being evaluated on workloads that perfectly fit its assumptions, rather than proving its general applicability. The headline claims of accuracy seem to be based on this best-case result.
Questions to Address In Rebuttal
-
Please reconcile the claim that execution time is a "robust signature" for DSE with the 5.46% error observed in the H100-to-H200 experiment (Figure 13). If a simple generational update in memory bandwidth causes such a significant error, why should this method be trusted for exploring more substantial architectural changes (e.g., different cache hierarchies, prefetchers, or memory controllers)?
-
The dismissal of inter-kernel state as "negligible" (Section 6.2) is insufficient. Please provide a quantitative analysis of the error introduced by the ideal warmup assumption on at least one benchmark known to exhibit significant inter-kernel data locality. Without this, the accuracy claims of the paper are suspect.
-
How does the methodology guarantee an unbiased estimate in workloads with distinct, long-running execution phases? Please justify the i.i.d. assumption in this context and explain how random sampling across the entire trace avoids misrepresenting the proportional impact of these different phases.
-
The recursive splitting condition in ROOT (Section 3.4, Page 6) relies on comparing T_old and T_new, which are estimates of simulation time based on sample means. How robust is the clustering outcome to the statistical noise inherent in these estimates? Could an initial sampling error lead to a suboptimal clustering decision, which in turn amplifies the final error?
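For concreteness, the T_old-versus-T_new test being questioned can be sketched as follows; the median split, the CLT-style cost estimate, and the single-step (non-recursive) form are simplifying assumptions, not the paper's exact ROOT algorithm.

```python
import math
import statistics

def est_sim_time(times, eps=0.05, z=1.96):
    """Estimated detailed-simulation work for one cluster: (CLT-style sample
    count for relative error eps) * (mean kernel time). Illustrative stand-in
    for the paper's T estimate."""
    mu = statistics.fmean(times)
    sigma = statistics.pstdev(times)
    n = max(1, math.ceil((z * sigma / (eps * mu)) ** 2)) if mu > 0 else 1
    return min(n, len(times)) * mu

def split_if_cheaper(times):
    """Split a cluster at its median only if the two halves are estimated to
    be cheaper to sample than the original cluster (T_new < T_old)."""
    cut = statistics.median(times)
    left = [t for t in times if t <= cut]
    right = [t for t in times if t > cut]
    if not left or not right:
        return [times]
    t_old = est_sim_time(times)
    t_new = est_sim_time(left) + est_sim_time(right)
    return [left, right] if t_new < t_old else [times]

print(len(split_if_cheaper([100.0] * 50 + [900.0] * 50)))   # bimodal -> 2 clusters
```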
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper addresses the critical and increasingly challenging problem of slow cycle-level simulation for large-scale GPU workloads, particularly those from the machine learning and large language model domains. The authors propose STEM+ROOT, a novel kernel-level sampling methodology designed to be both swift and trustworthy.
The core contribution is the insight that the statistical distribution of kernel execution times serves as a powerful and highly scalable "signature" for characterizing workload behavior. This represents a departure from prior works that rely on static, microarchitecture-independent features like instruction counts or Basic Block Vectors (BBVs).
The proposed method consists of two main components:
1. STEM: A statistical framework based on the Central Limit Theorem that models sampling error and uses a KKT solver to determine the optimal number of samples required from each kernel cluster to meet a user-defined error bound.
2. ROOT: A fine-grained hierarchical clustering algorithm that recursively partitions invocations of the same kernel based on their execution times. This effectively separates kernels that share the same code but exhibit different runtime behaviors due to varying inputs or system states.
The evaluation demonstrates that this approach significantly reduces sampling error (by 27-81x) and drastically cuts profiling overhead (by 7-600x) compared to state-of-the-art methods, making it practical for the massive workloads that are currently intractable for detailed simulation.
Strengths
This work has several significant strengths, primarily rooted in the elegance and practicality of its central idea.
-
A Powerful and Pragmatic Core Idea: The fundamental contribution—using the execution time distribution as a kernel signature—is excellent. While prior art has focused on creating microarchitecture-agnostic signatures from static code properties, this work makes a compelling case that a dynamic, hardware-dependent signal is actually more useful in practice. It correctly identifies that for modern ML workloads, runtime context (e.g., tensor sparsity, memory locality) often matters more than static code structure. This is a fresh and valuable perspective in the field of workload sampling.
-
Principled Statistical Foundation: The methodology isn't just a collection of heuristics; it is grounded in sound statistical principles. The use of the Central Limit Theorem to model the sampling distribution of the mean execution time and provide theoretical error bounds (Section 3.2, page 4) gives the "Trustworthy" claim in the title real substance. This allows a user to explicitly trade off simulation speed for accuracy via the ε parameter, a crucial feature for flexible design space exploration.
-
Exceptional Scalability and Practicality: The most significant practical advantage of this work is its scalability. By relying on lightweight kernel-level timing information (which can be gathered by profilers like NSYS with low overhead), it sidesteps the crippling profiling costs of methods that require per-warp instruction-level instrumentation (PKA, Sieve) or complex BBV comparisons (Photon). The results in Table 5 (page 11), showing orders of magnitude lower overhead, are dramatic and convincing. This is the key that unlocks the ability to simulate and analyze the gargantuan workloads from suites like Huggingface.
-
Effective Handling of Runtime Heterogeneity: The ROOT clustering mechanism is a clever solution to a well-known problem. Kernels like sgemm can have vastly different performance characteristics depending on the context in which they are called. As shown in Figure 1 (page 2), simply grouping by kernel name is insufficient. ROOT's recursive, goal-oriented approach—to keep splitting clusters as long as it reduces the estimated simulation time—is an elegant way to avoid the need to pre-specify the number of clusters (k) and ensures that the partitioning is done for a practical purpose.
Weaknesses
The paper's weaknesses are less about flaws in the methodology and more about its inherent trade-offs and scope, which could be discussed more explicitly.
-
Hardware-Signature Dependency: The primary conceptual weakness is that the signature (execution time) is intrinsically tied to the hardware on which it was profiled. While the authors acknowledge this limitation in Section 6.1 (page 11) and show promising results in their DSE and cross-GPU experiments (Section 5.4, page 10), it remains a fundamental trade-off. The paper argues this is a feature, not a bug, because it captures sensitivity to the microarchitecture. However, it relies on the assumption that the relative behavior (e.g., the shape of the distribution, the number of peaks) remains largely consistent across different architectures. This assumption may not hold for more radical architectural changes.
-
Neglect of Inter-Kernel Effects: Like most kernel-level sampling techniques, this method treats kernel invocations as largely independent events. It does not explicitly model state that is carried across kernel boundaries, most notably the state of the L2 cache and memory subsystem. The authors provide a reasonable defense in Section 6.2 (page 12), arguing that kernel working sets are often large enough to flush the cache, but this is an assumption about the workload character. This could be a source of error for workloads with tight producer-consumer relationships between kernels operating on the same data.
-
Limited Scope to Single-GPU Workloads: The current work is confined to single-GPU simulation. While this is the foundation upon which multi-GPU simulation is built, the most challenging and interesting systems-level problems in modern ML training involve communication and synchronization across multiple GPUs. The authors correctly identify this as future work, but it is a significant limitation on the immediate applicability of the technique to state-of-the-art distributed training scenarios.
Questions to Address In Rebuttal
I would appreciate it if the authors could use the rebuttal to clarify the following points, which would strengthen the paper and help position its contribution even more clearly.
-
Regarding the hardware-dependency of the signature: Could you elaborate on the potential "failure modes" of your approach? For instance, consider a scenario where a microarchitectural change (e.g., a new cache replacement policy) inverts the performance of two distinct runtime phases of the same kernel. A phase that was fast on Hardware A becomes slow on Hardware B, and vice versa. Would the samples selected based on the Hardware A profile still be representative, or would this lead to significant error?
-
Regarding inter-kernel effects: Your analysis in Section 6.2 is based on flushing the L2 cache. Have you considered workloads where the key interaction is not just cache reuse, but perhaps DRAM row buffer locality or memory controller state that persists across kernels? How sensitive do you believe your methodology is to these more subtle cross-kernel performance dependencies?
-
This is a more philosophical question about the nature of your signature. Do you view the execution time distribution primarily as a direct signal, or as an inexpensive proxy for underlying microarchitectural behaviors (e.g., a wide distribution implies memory-boundness, a narrow peak implies compute-boundness)? Have you considered a hybrid approach where the fast, scalable clustering of ROOT is used to identify behavioral groups, and then a tiny number of kernels from each group are profiled in detail (e.g., with NCU) to "label" the clusters with microarchitectural explanations? This seems like a promising direction for future work that connects your high-level signature back to low-level hardware behavior.
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper proposes STEM+ROOT, a kernel-level sampling methodology for accelerating large-scale GPU simulations. The authors identify that existing sampling techniques, which rely on static or microarchitectural signatures (e.g., instruction counts, basic block vectors), fail to capture the runtime heterogeneity of identical kernels in modern workloads.
The claimed novel contributions are:
1. The use of the kernel execution time distribution as the primary signature for identifying and clustering kernels with different runtime behaviors.
2. A statistical framework, STEM, based on the Central Limit Theorem (CLT) and KKT optimization, to derive the optimal number of samples required from each cluster to meet a user-defined error bound.
3. A hierarchical clustering algorithm, ROOT, which recursively partitions kernel clusters and uses the STEM model as a termination condition, thereby avoiding the need to pre-specify the number of clusters (k).
Strengths
The primary strength of this work lies in its core conceptual contribution: the shift from cause-based signatures to an effect-based signature. The vast body of prior art in workload sampling, from SimPoint [9] to Photon [21], has focused on abstracting the cause of a program's behavior—its instruction mix and control flow, represented by Basic Block Vectors (BBVs) or other instruction-level metrics. The fundamental insight of this paper is to instead use the ultimate effect of this behavior—the kernel's execution time—as the signature. This is a genuinely novel perspective in the domain of GPU simulation sampling.
This conceptual shift has several significant and novel consequences:
- Principled Handling of Heterogeneity: It elegantly captures dynamic, input-dependent runtime effects that are invisible to static analysis. The histograms in Figure 1 (page 2) clearly show that execution time is a powerful discriminator where instruction-based signatures would fail.
- A More Rigorous Sampling Framework (STEM): While the use of statistics in simulation sampling is not new (e.g., SMARTS [43]), the direct application of CLT to derive a closed-form solution for the required sample size based on a desired error bound is a more principled approach than the heuristics used in prior GPU work (e.g., "sample the first kernel" in PKA [2] or a fixed similarity threshold in Photon [21]).
- Adaptive Clustering (ROOT): Prior methods like PKA [2] rely on k-means, which requires a priori selection of k. The proposed ROOT framework, which uses the estimated simulation time savings as a stopping criterion for its hierarchical splitting, is a novel and domain-specific adaptation of clustering that is tightly integrated with the end goal.
The combination of these ideas results in a framework that is not only novel in its approach but also demonstrably less complex in its data collection requirements (profiling overhead in Table 5, page 11) than the prior art it seeks to replace.
Weaknesses
My critique is focused exclusively on the boundaries and robustness of the novel claims.
-
Relationship to and Differentiation from Sieve [24]: The closest prior art is Sieve, which the authors cite. Sieve uses the coefficient of variation (CoV, or σ/μ) of execution times to hand-tune its sampling strategy. The authors must be more explicit about the delta. While this paper elevates the execution time distribution from a tuning heuristic to the core principle, one could argue it is a formalization of an existing idea rather than a completely de novo one. The paper's contribution is clearly significant, but its novelty would be better framed by directly contrasting its systematic, automated framework against Sieve's ad-hoc use of the same underlying signal.
-
Hardware-Dependence of the Signature: The most significant weakness of the novel idea is its reliance on a hardware-dependent signature (execution time). The entire field has historically pursued hardware-independent signatures to ensure portability of sampling data across different architectural simulations. This paper makes a conscious, and seemingly successful, trade-off. However, the limits of this approach are not fully explored. The DSE experiments in Section 5.4 (page 10) show robustness to small changes (cache size, SM count), but this may not hold for more fundamental architectural shifts. For example, sampling on a pre-Tensor Core GPU and simulating a post-Tensor Core GPU would likely produce entirely different execution time distributions for GEMM kernels, potentially invalidating the sampling choices. The novelty of the approach is tied to this trade-off, and its limitations define the scope of the contribution.
Questions to Address In Rebuttal
-
Please explicitly articulate the conceptual leap from Sieve's [24] use of the CoV of execution times for tuning to your use of the full execution time distribution for both clustering (ROOT) and sample size determination (STEM). Is this a formalization of a known heuristic, or a fundamentally different approach?
-
The core novel idea is to use a hardware-dependent signal. Consider an architectural design space exploration scenario where a new hardware feature (e.g., a new memory prefetcher, or specialized execution units like Tensor Cores) is being evaluated. This feature could fundamentally alter the execution time distributions of certain kernels, creating new performance peaks or shifting existing ones. How does your methodology, which relies on a profile from a baseline machine without this feature, guarantee representative sampling for a simulation with this feature? The claims of robustness in Section 5.4 seem limited to scaling existing resources rather than introducing new functional capabilities.
-
The ROOT methodology employs hierarchical clustering by recursively applying a k-means-like partitioning (with k=2). Given that the goal is to isolate peaks in a distribution, have you considered density-based clustering algorithms (e.g., DBSCAN) that could identify an arbitrary number of clusters (peaks) in a single pass? Would such an approach be a more direct implementation of your core insight, or does the recursive binary splitting offer a benefit I am overlooking?
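For reference, a density-based one-pass alternative of the kind this question alludes to might look like the sketch below; the synthetic 1-D execution times and the eps/min_samples settings are invented and would need per-workload tuning.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Synthetic kernel execution times (microseconds) with two latency "peaks".
rng = np.random.default_rng(0)
times = np.concatenate([rng.normal(120, 5, 500), rng.normal(900, 40, 200)])

# One-pass, density-based clustering: the number of peaks is discovered
# automatically instead of via recursive k=2 splits; label -1 marks outliers.
labels = DBSCAN(eps=25.0, min_samples=5).fit_predict(times.reshape(-1, 1))
print(sorted(set(labels)))
```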
Understanding and Mitigating Covert Channel and Side Channel Vulnerabilities Introduced by RowHammer Defenses
Abstract
DRAM chips are increasingly vulnerable to read disturbance phenomena (e.g., RowHammer and RowPress), where repeatedly accessing or keeping open a DRAM row causes bitflips in nearby rows, due to DRAM density scaling. Attackers can exploit RowHammer ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
This paper introduces "LeakyHammer," a purported new class of timing channels that exploit the observable latency variations caused by RowHammer defenses. The authors construct two covert channels based on the PRAC and RFM defense mechanisms, reporting capacities of 39.0 Kbps and 48.7 Kbps, respectively. They also present a proof-of-concept website fingerprinting side-channel attack. To address these vulnerabilities, three countermeasures are proposed and evaluated: Fixed-Rate RFM (FR-RFM), Randomly Initialized Activation Counters (RIAC), and Bank-Level PRAC. The central thesis is that the very mechanisms designed to secure DRAM introduce new, exploitable timing vulnerabilities.
While the premise—that security mechanisms can have unintended side effects—is plausible, the work as presented relies exclusively on a heavily idealized simulation environment. This raises significant questions about the external validity and practical relevance of the claimed attacks and the true cost-benefit analysis of the proposed mitigations.
Strengths
- Plausible Core Idea: The fundamental observation that RowHammer defenses' preventive actions (e.g., back-offs, managed refreshes) introduce deterministic, high-latency events is sound and worthy of investigation.
- Systematic Analysis of Defenses: The paper provides a structured analysis of two emerging industry-standard defenses (PRAC and RFM), examining their operational parameters and how they can be triggered to create a timing signal.
- Exploration of Security-Performance Trade-offs: The evaluation of countermeasures, particularly the data presented in Figure 13 (page 13), highlights the critical and often severe performance overhead required to fundamentally mitigate these timing channels, which is an important data point for the community.
Weaknesses
- Complete Lack of Real-World Validation: The entire study is conducted within the gem5/Ramulator 2.0 simulation framework. Claims of building high-capacity covert channels are not substantiated by any proof-of-concept on real hardware. Real systems exhibit numerous sources of non-deterministic noise (e.g., OS scheduler jitter, interrupts, complex memory controller interactions, thermal throttling) that are not adequately modeled by the paper's synthetic noise generation (Section 6.3, page 6). The statement in Section 5.1 that the authors "faithfully and rigorously model various noise sources" is a strong and unsupported claim. Without a demonstration on a physical system, the reported channel capacities are speculative at best.
- Flawed Website Fingerprinting Methodology: The side-channel attack (Section 8, page 8) does not involve running a real web browser within the simulated environment. Instead, the authors use an external tool (Intel Pin) to generate memory traces, which are then used to drive the simulation. This methodology is fundamentally flawed as it decouples the application's behavior from the system's response. The timing of the simulated memory system has no feedback effect on the execution of the traced application. This is not a faithful representation of a real attack and calls the entire side-channel result (Figure 10, page 9) into question.
- Insufficient Novelty and Questionable Efficacy of Countermeasures:
- FR-RFM: The concept of performing security-critical operations at a fixed rate, independent of application behavior, is a standard principle for eliminating timing channels (i.e., a constant-time approach). This is not a novel contribution. Furthermore, the authors' own results (Figure 13) show this approach incurs a prohibitive 18.2x performance overhead for highly vulnerable DRAM (NRH=64), rendering it impractical in the very scenarios where it is most needed.
- RIAC: Randomizing activation counters (Section 11.2, page 12) is a form of noise injection. It does not eliminate the channel but merely reduces its capacity and reliability. The paper quantifies this as an 86% reduction (Section 11.4), not 100%. From a security perspective, a noisy channel is still a channel. An attacker could employ more sophisticated signal processing or error-correcting codes to recover the signal. This is a mitigation, not a solution. (A back-of-the-envelope capacity calculation follows this list.)
- Bank-Level PRAC: This proposal only reduces the attack's scope from the channel level to the bank level. It does not eliminate the vulnerability but rather constrains it, effectively reverting it to a threat model similar to existing same-bank channels like DRAMA [162]. To present this as a countermeasure against the fundamental LeakyHammer vulnerability is an overstatement.
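For context on the "noisy channel is still a channel" point above, a back-of-the-envelope calculation treating the randomized channel as a binary symmetric channel with an assumed crossover probability p; the probabilities are illustrative, not measured from the paper.

```python
import math

def bsc_capacity(p):
    """Capacity in bits per transmitted bit of a binary symmetric channel
    with crossover probability p: C = 1 - H(p)."""
    if p <= 0.0 or p >= 1.0:
        return 1.0
    h = -p * math.log2(p) - (1 - p) * math.log2(1 - p)
    return 1.0 - h

# Even substantial randomization-induced bit-flip rates leave capacity that
# error-correcting codes can approach.
for p in (0.05, 0.15, 0.25):
    print(p, round(bsc_capacity(p), 3))
```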
Questions to Address In Rebuttal
- The primary weakness of this work is its sole reliance on simulation. Please justify why a demonstration on real hardware, even one with a lower channel capacity, was not performed. What specific, insurmountable challenges prevent a real-world proof-of-concept?
- Regarding the website fingerprinting attack (Section 8): Please address the methodological disconnect of using externally generated traces to drive a simulation. How can the results be considered valid when the application's execution path is not influenced by the memory latency events it is supposedly generating?
- The RIAC countermeasure reduces channel capacity but does not eliminate it. Can you provide a more formal argument for why this should be considered a secure mitigation? Is it not plausible that an attacker could overcome this added noise with a longer observation window or more robust encoding schemes?
- Given that your own evaluation shows FR-RFM incurs a catastrophic 18.2x performance overhead at NRH=64 (Figure 13), in what practical scenario do you envision this being a viable solution for the future, highly-vulnerable systems the paper purports to protect?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces "LeakyHammer," a novel and compelling class of timing attacks that exploit the very mechanisms designed to defend against RowHammer. The core insight is that the preventive actions taken by state-of-the-art RowHammer defenses—specifically PRAC and RFM—introduce large, predictable, and easily measurable latency variations into the memory subsystem. The authors demonstrate that these defense-induced latencies can be intentionally triggered by an attacker to create high-throughput covert channels and information-leaking side channels.
The work presents two concrete covert channel attacks exploiting PRAC and RFM, achieving capacities of 39.0 Kbps and 48.7 Kbps, respectively. Furthermore, it demonstrates a proof-of-concept website fingerprinting side channel attack. Crucially, the paper does not stop at identifying the vulnerability; it proposes and evaluates three countermeasures (FR-RFM, RIAC, and Bank-Level PRAC), providing a thoughtful analysis of the inherent trade-off between security and performance in mitigating this new threat.
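As a point of reference for how such a channel is typically received, a schematic window-based decoder over a recorded latency trace might look as follows; the window length and spike threshold are invented placeholders, not values from the paper.

```python
def decode_bits(trace, window_ns=50_000, threshold_ns=700):
    """Window-based covert-channel receiver: bucket timestamped access
    latencies into fixed windows; a window containing at least one
    defense-induced latency spike above the threshold decodes to 1.
    trace: list of (timestamp_ns, latency_ns) pairs, sorted by timestamp."""
    if not trace:
        return []
    t0 = trace[0][0]
    n_windows = (trace[-1][0] - t0) // window_ns + 1
    bits = [0] * n_windows
    for ts, lat in trace:
        if lat > threshold_ns:
            bits[(ts - t0) // window_ns] = 1
    return bits

# Illustrative trace: one spike (~1400 ns, PRAC back-off scale) in the second window.
print(decode_bits([(0, 90), (10_000, 95), (60_000, 1400), (110_000, 92)]))  # [0, 1, 0]
```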
Strengths
-
Fundamental and Conceptually Significant Contribution: The primary strength of this paper is its core thesis: that security mechanisms themselves can become a new attack surface. This is a fundamentally important observation in the field of systems security. By demonstrating that RowHammer defenses create new vulnerabilities, the authors force the community to adopt a more holistic view of security design, considering not just the problem being solved but also the potential side effects of the solution. This work is an excellent case study in the complex interplay of security, performance, and system architecture.
-
Excellent Contextualization and Positioning: The authors do a superb job of placing their work within the broader landscape of DRAM-based timing channels. The comparison against DRAMA [162] in Section 9 (page 10) is particularly insightful. They correctly identify that LeakyHammer's attack scope (channel-level for PRAC, bank-group for RFM) makes it fundamentally more difficult to mitigate with conventional isolation techniques like bank partitioning, which are effective against prior bank-local attacks. This demonstrates a deep understanding of the field and clearly articulates the novelty and increased threat posed by their findings.
-
Comprehensive and Rigorous Evaluation: The paper is impressively thorough. The authors don't just propose a theoretical attack; they implement and evaluate it rigorously. They build two distinct attacks on two different, highly relevant industry defense standards. They analyze the channel capacity under varying levels of application-induced and synthetic noise (as shown in Figures 4 and 7 on pages 6 and 8), demonstrating the robustness of the channels. The inclusion of a concrete side-channel attack (website fingerprinting) makes the threat tangible and relatable.
-
Forward-Looking Analysis of Countermeasures: A major strength is the exploration of the solution space in Section 11 (page 12). Proposing and evaluating FR-RFM, RIAC, and Bank-Level PRAC moves the paper from simply "problem finding" to "solution-oriented research." The conclusion that completely eliminating the channel (via FR-RFM) incurs significant performance penalties in highly vulnerable systems (i.e., at very low Nʀʜ) is a key, sobering takeaway for the architecture community. It highlights that there is no easy fix and sets the stage for future research in this area.
Weaknesses
My criticisms are less about flaws and more about opportunities to further broaden the paper's impact.
-
Potential for Broader Side-Channel Implications: The website fingerprinting attack (Section 8, page 8) serves as an effective proof-of-concept, but it feels like the tip of the iceberg. The signal generated by LeakyHammer—a large, deterministic, channel-wide latency spike—is remarkably clean. This could be a powerful primitive for leaking far more sensitive information. For example, could this channel be used to infer control flow or data-dependent access patterns within secure enclaves (like SGX or SEV) where other channels might be noisy or mitigated? A broader discussion on the potential reach of this side channel would strengthen the paper's impact.
-
Reliance on a Simulated Environment: The work is conducted within the gem5/Ramulator 2.0 simulation framework. While this is a standard and accepted methodology for architectural research, the claims would be undeniably amplified by a demonstration on a real hardware prototype, perhaps an FPGA-based system or an early-access platform supporting PRAC/RFM. This is, of course, a high bar, but it represents the next logical step in validating this important line of inquiry. This is not a reason to reject the paper, but rather a point to consider for future work.
Questions to Address In Rebuttal
-
Robustness of FR-RFM: The Fixed-Rate RFM (FR-RFM) countermeasure (Section 11.1, page 12) is elegant because it decouples preventive actions from application behavior. However, its security rests on the precision of its timing. In a real system with OS scheduler jitter, interrupts, and other microarchitectural noise, how perfectly periodic can the RFM commands truly be? Could a sophisticated attacker, by carefully observing minor deviations from the expected fixed interval, still infer information about system load or the memory access patterns of other processes that might delay an RFM command?
-
Generality of LeakyHammer: You focus on PRAC and RFM, which are the most relevant industry standards. However, you position LeakyHammer as a "new class of attacks." To strengthen this claim, could you briefly speculate on how these principles might apply to other proposed RowHammer defenses from the academic literature? For example, defenses that rely on throttling (e.g., BlockHammer [233]) or dynamic row remapping also introduce observable, pattern-dependent, high-latency events. Would they be similarly vulnerable, and would the resulting channel characteristics be different?
-
Signal-to-Noise Ratio in the Wild: Your noise analysis is well-structured. However, the signal from a PRAC back-off (~1400 ns) is an order of magnitude larger than typical memory latencies. This suggests an extremely high signal-to-noise ratio. In a worst-case scenario, such as a heavily oversubscribed multi-tenant cloud server running diverse, memory-intensive workloads, do you foresee any conditions where system noise could realistically drown out the LeakyHammer signal, or is the signal fundamentally too strong to hide?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present an analysis of timing vulnerabilities introduced by modern RowHammer defenses, specifically PRAC and RFM. They identify that the preventive actions of these defenses—which are high-latency and can be intentionally triggered by specific memory access patterns—create a new timing side channel. They formalize this vulnerability class as "LeakyHammer." The paper demonstrates the viability of this channel by constructing two covert channels with capacities of 39.0 Kbps and 48.7 Kbps, respectively, and a proof-of-concept website fingerprinting attack. Finally, they propose and evaluate three countermeasures: Fixed-Rate RFM (FR-RFM), Randomly Initialized Activation Counters (RIAC), and Bank-Level PRAC.
My primary task is to assess the novelty of this contribution. While the paper is well-executed, the core insight—that RowHammer mitigation mechanisms can be exploited for timing attacks—is not unique to this work. The authors themselves acknowledge concurrent discovery of this phenomenon, which significantly impacts the novelty claim.
Strengths
-
Conceptual Framing: The paper's core conceptual contribution is the framing of "LeakyHammer" as a distinct class of attacks based on two fundamental properties of RowHammer defenses: observable latency and deterministic triggerability (Section 4, page 3). This provides a useful abstraction for reasoning about vulnerabilities in future defenses, which is a novel conceptual synthesis.
-
Novelty in Countermeasures: The proposed countermeasures, while based on established security principles, represent a novel application of these principles to this newly highlighted problem domain.
- FR-RFM (Section 11.1, page 12) applies the "constant-time" principle to decouple defense actions from application behavior.
- RIAC (Section 11.2, page 12) applies the principle of randomization to inject noise and reduce channel reliability.
- Bank-Level PRAC (Section 11.3, page 12) is a novel architectural refinement that proposes reducing the scope of the observable event, thereby limiting the attack's reach. This is a tangible, though incremental, novel design suggestion.
-
Comprehensive Scope: The paper analyzes two distinct, state-of-the-art industry defenses (PRAC and RFM) within a single framework. While the individual attacks on these mechanisms have been concurrently discovered, evaluating them together provides a breadth that is likely absent in the more narrowly focused concurrent works.
Weaknesses
-
Attenuated Novelty of the Core Discovery: The central weakness of this paper, from a novelty perspective, is that the core idea is not unique. The authors explicitly state in their Related Work section (Section 13, page 14): "The idea of exploiting industry solutions to RowHammer was developed independently and concurrently by three recent works [151, 199, 223]." This admission confirms that the fundamental vulnerability was discovered simultaneously by multiple independent research groups. Therefore, the claim of being the "first analysis" is contentious and hinges on publication timing, not on the originality of the discovery itself. The works they cite [223] and [151, 199] appear to cover the exploitation of PRAC and RFM, respectively, which are the exact mechanisms this paper builds its attacks upon.
-
Application of Existing Attack Paradigms: The specific attacks demonstrated—covert channels and website fingerprinting—are standard applications of a newly found timing primitive. The methodologies used to build the covert channel (window-based transmission) and the side channel (collecting a trace and applying ML classifiers, Section 8, page 8) are functionally identical to how countless other cache- and memory-based timing channels have been exploited in prior work. The novelty lies in the primitive, which as established above, is a concurrent discovery, not in the exploitation technique.
-
Countermeasure Principles are Not Fundamentally New: The intellectual underpinnings of the proposed countermeasures are well-established in the security community. Decoupling security triggers from program behavior (constant-time) and adding randomness to thresholds are standard tools in the side-channel defense playbook. While their application here is novel and the engineering evaluation is valuable, they do not represent a fundamentally new defense paradigm.
Questions to Address In Rebuttal
-
Given the acknowledged concurrent works [151, 199, 223], please articulate the precise technical novelty of your attack methodology on PRAC and RFM. What core technical insights or exploitation techniques does your work present that are demonstrably absent in the concurrent works? Simply stating that this paper covers both is an argument of scope, not of novel insight.
-
The proposed countermeasures FR-RFM and RIAC are direct applications of the established principles of constant-time design and randomization. Please clarify the novelty of these proposals beyond this application. Was there a non-obvious technical challenge in applying these principles to RowHammer defenses that required a novel solution, or is the contribution simply the suggestion to apply them?
-
The website fingerprinting attack appears to be a straightforward application of the discovered LeakyHammer timing primitive using standard trace-collection and classification methods. Could the authors elaborate on what, if anything, is novel about the fingerprinting methodology itself, as distinct from the novelty of the underlying channel?
DRAM Fault Classification through Large-Scale Field Monitoring for Robust Memory RAS Management
Abstract
As DRAM technology scales down, maintaining prior levels of reliability becomes increasingly challenging due to heightened susceptibility to faults. This growing concern underscores the need for effective in-field fault monitoring and management. ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
This paper presents a DRAM fault classification methodology based on the underlying physical and architectural hierarchies of DDR4 and DDR5 devices. Using a large-scale dataset from a major hyperscaler, the authors classify faults into spatially- and temporally-defined categories. They claim that the vast majority (>98%) of faults are tightly coupled with intra-bank structures, specifically what they define as "bounded faults" within 2x2 MAT regions. The paper further proposes an emulation model to extend their analysis to DDR5 devices with in-DRAM ECC (IDECC) and suggests a framework (DFA) for mapping classified faults to appropriate RAS actions.
While the scale of the dataset is commendable, the work rests on a series of strong, insufficiently validated assumptions. The core claims regarding DDR5 fault behavior are derived from an unproven emulation model rather than direct field observation, and the robustness of the classification methodology to real-world noise and variations in data collection is not convincingly demonstrated.
Strengths
- Dataset Scale: The primary strength is the access to and analysis of a large-scale, real-world dataset from Microsoft Azure servers, covering millions of RDIMMs and hundreds of billions of device-hours. This provides a valuable, contemporary view of in-field failures.
- Architectural Grounding: The effort to tie observed error patterns back to specific DRAM architectural components (SWDs, MATs, etc.) is a fundamentally sound and important direction for fault analysis, moving beyond simple address-based clustering.
- Actionable Framework: The explicit goal of mapping fault classifications to specific RAS actions (Table 5, page 11) provides a clear practical motivation for the work, connecting low-level fault diagnosis to system-level reliability management.
Weaknesses
-
The Foundational Flaw of Pseudo-DDR5 Emulation: The paper's entire analysis of DDR5 reliability (Section 5.3, page 9) is built on a "pseudo-DDR5" dataset. This dataset is not composed of real DDR5 field errors but is generated by taking DDR4 error logs and applying a transform to simulate the effect of IDECC. This approach is critically flawed. It operates on the unsubstantiated assumption that the underlying raw fault mechanisms and distributions of DDR4 and DDR5 devices from different technology nodes are identical. This is highly unlikely; new technology nodes introduce new failure modes. The reported high cosine similarity of 0.98 (page 10) between the pseudo-DDR5 and real-DDR5 results is not independent validation; it merely shows that the authors' model produces results similar to a small set of real data, without proving the model's general validity or correctness. All conclusions regarding DDR5 in this paper are therefore speculative.
-
Overstated Certainty of "Bounded Faults": The paper makes the exceptionally strong claim that among architecturally-aligned faults, "bounded faults represent 100.0%" (Section 5.2, page 8). Such a perfectly round number in a real-world field study is a significant red flag. This suggests a potential definitional circularity, where the classification algorithm is designed in such a way that it cannot produce any other outcome. The paper fails to discuss the algorithm's sensitivity or how it handles edge cases. For instance, how would a fault affecting a 2x3 MAT region be classified? Is it forced into the "bounded" category, or is it discarded as an anomaly, thereby artificially inflating the 100% figure? The methodology lacks the nuance expected for messy, real-world data.
-
Insufficient Scrutiny of Log Collection Granularity: The authors claim that reducing log resolution from 10µs to 1s has a "minimal effect on RAS action suggestion" (Section 6.3, page 11 and Table 6, page 12). This conclusion is superficial and potentially dangerous. Their analysis only considers the final RAS action, ignoring the impact on the intermediate, and crucial, temporal classification (Figure 8, page 7). A burst of transient faults occurring within a 1-second window would be indistinguishable from a single intermittent event, leading to fundamentally different temporal classifications. Misclassifying transient faults as intermittent or permanent could trigger unnecessary and costly RAS actions like page offlining or DIMM replacement. The analysis provided is inadequate to support the broad claim that polling-based collection is sufficient.
-
The Non-Verifiable "DFA" Framework: The proposed DRAM Fault Analyzer (DFA) is described as a method for sharing domain knowledge by "converting the information into compiled object files" (Section 6.1, page 11). This is antithetical to scientific principles of transparency and reproducibility. A black-box object file is a commercial product, not a verifiable research contribution. It prevents the community from inspecting, validating, or building upon the classification logic. Presenting this as a core part of the solution undermines the scientific credibility of the work.
-
Ambiguity in Inter-bank Fault Analysis: The paper asserts that inter-bank faults are rare and primarily environmental. However, the proposed classification is heavily biased towards intra-bank structures. It is plausible that complex faults spanning multiple banks or involving shared command/address logic are being misclassified as multiple, unrelated intra-bank MRMpost faults, thereby undercounting their true prevalence and misunderstanding their root cause.
Questions to Address In Rebuttal
-
Provide a rigorous validation of the pseudo-DDR5 emulation model. How do the authors justify the core assumption that DDR4 raw fault distributions are a suitable proxy for DDR5 raw faults, beyond simply applying an IDECC filter? Have you compared the emulated results against any ground-truth DDR5 fault data from manufacturing or stress tests to validate the model's assumptions about raw error patterns?
-
Clarify the precise algorithm for classifying a fault as "bounded." What is the handling mechanism for error patterns that fall marginally outside the 2x2 MAT boundary (e.g., affecting an adjacent row or column)? How sensitive is the reported 100.0% figure to the parameters of your classification algorithm? (An illustrative boundary check is sketched after this list.)
-
Provide a more detailed analysis showing how log sampling frequency impacts the temporal classification of faults (i.e., the distribution of transient, sporadic, and intermittent faults). How many distinct error events are conflated or lost when moving from 10µs to 1s resolution, and how does this skew the fault distributions that underpin your RAS recommendations?
-
The paper argues that faults spanning multiple banks are rare. Could the intra-bank-focused classification scheme systematically misinterpret a single, complex inter-bank fault as several distinct MRMpost or other post-classified faults, leading to an incorrect diagnosis of the root cause?
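As referenced in question 2 above, a toy version of the kind of boundary check whose edge-case handling is being questioned; the MAT dimensions and the bounding-box reading of "within a 2x2 MAT region" are assumptions, not the paper's algorithm.

```python
def is_bounded(error_cells, mat_rows=512, mat_cols=512):
    """Toy 'bounded fault' check: do all erroneous (row, col) cells fall
    inside a 2x2 bounding box of MAT indices? MAT dimensions are invented,
    and the edge/adjacency subtleties raised above are deliberately ignored."""
    mats = {(r // mat_rows, c // mat_cols) for r, c in error_cells}
    rows = [m[0] for m in mats]
    cols = [m[1] for m in mats]
    return max(rows) - min(rows) <= 1 and max(cols) - min(cols) <= 1

print(is_bounded([(10, 20), (600, 20), (600, 700)]))   # within 2x2 MATs -> True
print(is_bounded([(10, 20), (2000, 20)]))              # rows 3 MATs apart -> False
```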
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents a significant synthesis of deep, proprietary DRAM architectural knowledge with large-scale field data from a major cloud provider to create a highly precise and actionable fault classification methodology. The core contribution is a hierarchical classification system that maps observed error patterns not just to addresses, but to the underlying physical structures within the DRAM die (MATs, SWDs, MWLs, etc.). The authors use this framework to analyze millions of DDR4 and DDR5 RDIMMs, revealing that over 98% of faults are tightly coupled with intra-bank structures, particularly a "2x2 MAT" region which they define as a "bounded fault." This insight allows them to propose a concrete framework, the DRAM Fault Analyzer (DFA), which translates specific fault classifications into optimal system-level RAS (Reliability, Availability, Serviceability) actions. The work also provides a novel and important analysis of how in-DRAM ECC (IDECC) in DDR5 obscures fault signatures and how to account for this distortion.
Strengths
This work's primary strength lies in its successful bridging of two worlds that are often disconnected in academic literature: the esoteric, low-level physics of DRAM devices and the high-level, practical needs of large-scale system reliability management.
-
Unprecedented Architectural Depth: The level of detail provided on internal DRAM organization (Section 3, pages 3-5), particularly the distinction between "Device A" and "Device B" burst-aligned architectures and the visualization of SWD/MWL structures, is rarely seen in a public forum. This knowledge, clearly stemming from the deep industry collaboration with SK hynix, provides the "ground truth" that elevates this study far beyond traditional, address-based error clustering. It moves the field from correlation to a more causal understanding of fault patterns.
-
Massive-Scale, Real-World Validation: The study is grounded in an enormous dataset of 8.3 million DDR4 and 0.8 million DDR5 RDIMMs from Microsoft's fleet (Table 1, page 8). This is not a simulation or a small-scale experiment; it is an analysis of memory reliability "in the wild." This scale gives immense credibility to the central findings, such as the overwhelming prevalence (100% of clustered faults) of "bounded faults" within 2x2 MATs.
-
Actionable Engineering Contribution: The paper does not stop at analysis. The proposed DRAM Fault Analyzer (DFA) framework and the explicit mapping of fault types to specific RAS actions (sPPR, Poff, BnkSpr, RMV in Table 5, page 11) represent a concrete, deployable solution. This work provides a clear roadmap for hyperscalers and system designers to move from reactive to proactive and highly targeted memory fault management, which has been a long-standing goal in the field.
-
Pioneering DDR5/IDECC Analysis: The problem of fault analysis in the face of on-die error correction is critical for modern and future memory systems. The authors' "Pseudo-DDR5" emulation methodology (Figure 12, page 9) is a clever and insightful approach to quantify the "filtering" effect of IDECC. The finding that IDECC effectively masks the vast majority of simple faults while preserving the signature of more dangerous unbounded faults is a crucial piece of knowledge for the industry.
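To illustrate the kind of transform such a pseudo-DDR5 emulation implies, a deliberately crude sketch follows; a single-error-correcting on-die code is assumed, mis-correction and the real codeword geometry are ignored, and the names and values are invented.

```python
def idecc_filter(codeword_errors, correctable_bits=1):
    """Crude stand-in for the IDECC step of a pseudo-DDR5 emulation: drop
    codewords whose raw error count is within the assumed on-die correction
    capability (SEC -> one bit), and pass the rest through as host-visible."""
    return {cw: bits for cw, bits in codeword_errors.items()
            if len(bits) > correctable_bits}

# Illustrative raw (pre-IDECC) error bit positions derived from DDR4 logs.
raw = {"cw0": [5], "cw1": [12, 77], "cw2": [3, 40, 41]}
print(idecc_filter(raw))   # only cw1 and cw2 remain visible to the host
```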
Weaknesses
The weaknesses of the paper are primarily related to its boundaries and the generalizability of its deep, specific knowledge.
-
Vendor Specificity: The architectural models that form the foundation of the classification are highly detailed but are implicitly tied to a single vendor's designs (presumably SK hynix). While the general principles of DRAM organization are standard, the specific implementations of WL/SWD sharing, redundancy, and MAT layout can differ significantly between manufacturers (e.g., Samsung, Micron). The paper does not discuss how this vendor-specificity might impact the classification accuracy or the framework's portability, which is a key question for heterogeneous cloud environments.
-
Fidelity of the "Pseudo-DDR5" Model: While the emulation is a strong point, it rests on the assumption that the fundamental fault mechanisms present in the source DDR4 population are a sufficient proxy for those in a newer DDR5 process node. It is possible that new process technologies introduce novel failure modes (e.g., related to different materials, smaller feature sizes, or new circuit designs) that would not be present in the DDR4 data, potentially skewing the emulated DDR5 fault distribution.
-
Gap Between Manifestation and Root Cause: The paper brilliantly connects error signatures to their location within the architectural hierarchy (fault manifestation). However, it stops short of deeply investigating the underlying physical phenomena (root cause). While the temporal classification (Section 4.3, page 7) provides hints (e.g., permanent vs. intermittent), the work could be strengthened by connecting these classified faults to known DRAM failure physics like Variable Retention Time (VRT), dielectric breakdown, or process variation. This would complete the chain from physics to system-level action.
Questions to Address In Rebuttal
-
The detailed architectural models in Section 3 are a key strength. However, they appear to be specific to one vendor's designs. Could the authors comment on the generalizability of their classification framework to DRAM from other major manufacturers? For example, would the definition and prevalence of "bounded faults" hold if the underlying MAT and SWD layouts were fundamentally different?
-
The pseudo-DDR5 analysis is an insightful approach. Can the authors discuss any potential new failure modes in the DDR5 process node that might not be captured by emulating on a DDR4 fault population? How might the emergence of such new modes affect the conclusion that IDECC is an effective "filter" for uncorrectable errors?
-
The paper excels at mapping error signatures to architectural components. Does this framework provide any new insights into the underlying physical root causes of these faults (e.g., VRT, process-related defects)? Could this detailed classification be used to feedback information to the manufacturing and testing process itself, beyond in-field RAS management?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present a detailed, microarchitecture-aware DRAM fault classification methodology. This classification is derived by mapping physical fault addresses to the underlying hierarchical structures within the DRAM device, such as MATs, SWDs, and MWLs. The central thesis is that the vast majority of faults are "bounded," confined to a 2x2 MAT structure, and exhibit predictable patterns. The authors validate this methodology through a large-scale field study of DDR4 and DDR5 RDIMMs, proposing a fault analyzer (DFA) that maps their classified faults to specific RAS actions. A key part of their analysis involves a novel emulation method to study the impact of in-DRAM ECC (IDECC) on fault manifestation in DDR5 devices.
Strengths
-
Granularity of Classification: The proposed fault taxonomy is exceptionally detailed. The sub-classification of MROW faults into categories like MWLb, EAWLb, and EEWLb (Section 4.1, Figure 7, page 6) based on specific Row Address (RA) intervals and adjacency is a level of granularity that is not commonly explored in prior large-scale field studies.
-
Operational Definition of "Bounded Faults": The paper provides a concrete, architectural definition for "bounded faults" as those confined within an adjacent 2x2 MAT region (Section 4.1, Figure 6, page 6). This moves beyond generic terms like "clustered faults" and provides a precise, testable hypothesis that is a strong point of the work.
-
Novel Methodology for IDECC Analysis: The "pseudo-DDR5" data generation approach (Section 5.3, Figure 12, page 9) is a clever and novel method to study the impact of IDECC. By transforming real DDR4 fault data based on the known bounded error characteristics of DDR5 IDECC, the authors circumvent the difficulty of collecting a sufficiently large and diverse DDR5 fault dataset while still providing valuable, well-grounded insights into how IDECC masks and transforms fault signatures.
Weaknesses
The primary weakness of this paper, from a novelty perspective, is that its foundational premise—classifying DRAM faults based on underlying device architecture—is not new. The core contribution is a significant refinement and empirical validation of this existing idea, rather than the introduction of a new paradigm.
-
Overlap with Prior Art in Architecture-Aware Classification: The concept of tying memory errors to the physical DRAM structures has been established in prior work.
- Li et al. ("From Correctable Memory Errors to Uncorrectable Memory Errors: What Error Bits Tell," SC'22, cited as [28]) performed a similar analysis, correlating multi-bit error patterns to the internal DRAM organization, including bit-lines, word-lines, and banks.
- Jung & Erez ("Predicting future-system reliability with a component-level dram fault model," MICRO'23, cited as [21]) explicitly proposed a component-level DRAM fault model that distinguishes between failures in cells, wordlines, and bitlines for reliability prediction.
While the present work offers a more detailed taxonomy and a larger dataset, the fundamental idea of "DRAM fault classification through architectural mapping" is part of the existing body of knowledge. The authors should more explicitly position their work as a refinement that introduces a more granular hierarchy (e.g., the 2x2 MAT "bounded fault" model) rather than presenting the general approach as a novel introduction.
-
Limited Generality of the Architectural Model: The novelty and impact of the specific fault patterns (e.g., the RA intervals for EAWLb and EEWLb faults) are contingent on the universality of the DRAM architectures described in Section 3 (pages 3-5). The paper details "Device A" and "Device B" for DDR4 and a JEDEC standard-aligned model for DDR5. However, DRAM internal layouts, especially for peripheral circuits like row/column decoders and SWD placement, are highly proprietary and can vary significantly between vendors and even across technology nodes from the same vendor. The work does not sufficiently argue that its detailed classification is a fundamental property of all DRAM rather than a specific feature of the devices under study. This potentially limits the novelty of the findings to a specific subset of the market.
Questions to Address In Rebuttal
-
The core idea of mapping errors to the DRAM microarchitecture has been explored previously (e.g., [21], [28]). Beyond providing a higher level of granularity and a larger dataset, what is the fundamental conceptual difference in your classification methodology that distinguishes it as a novel contribution over this prior art? Please be specific.
-
Your detailed fault signatures, particularly the MROW sub-types in Figure 7, appear tightly coupled to the specific zigzagged SWD layout and MWL addressing shown in Figure 2. How confident are the authors that these specific "bounded fault" patterns and their corresponding RA signatures are fundamental properties of modern DRAM, as opposed to artifacts of the specific SK hynix and/or Microsoft fleet device architectures studied? In other words, how would your classification change if a different vendor used a fundamentally different row decoder or MAT adjacency layout?
-
The "pseudo-DDR5" emulation method is an interesting contribution. What steps were taken to validate that this software-based transformation accurately reflects the real-world fault masking and potential mis-correction behaviors of a hardware IDECC implementation across various fault types? For example, how does the model account for faults within the IDECC logic itself or complex interactions that might not be captured by a simple bounded-error assumption?
SymbFuzz: Symbolic Execution Guided Hardware Fuzzing
Abstract
Modern hardware incorporates reusable designs to reduce cost and time to market, inadvertently increasing exposure to security vulnerabilities. While formal verification and simulation-based approaches have been traditionally utilized to mitigate these ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present SymbFuzz, a hybrid hardware fuzzing framework that combines the industry-standard Universal Verification Methodology (UVM) with symbolic execution. The proposed methodology aims to improve coverage and bug detection over existing hardware fuzzers by using an SMT solver to generate constraints that guide the fuzzer toward unexplored states, particularly when coverage stagnates. The framework is evaluated on several RISC-V processor cores, including a version of OpenTitan, where it reportedly finds 14 bugs, including one novel vulnerability, and achieves higher coverage than state-of-the-art fuzzers.
Strengths
-
Practical UVM Integration: The paper's most credible contribution is the direct integration of a fuzzer with the UVM framework. By generating constraints for the UVM sequencer, the approach bypasses the need for software surrogate models and could theoretically be integrated into existing commercial verification flows. This demonstrates sound engineering.
-
Assertion-Based Bug Detection: The use of security properties and assertions rather than relying solely on a golden reference model (GRM) is appropriate for security verification. As the authors correctly note (Section 5.2, page 11), many security flaws (e.g., side-channel leaks) do not manifest as functional incorrectness and would be missed by GRM-based differential testing.
Weaknesses
The paper's claims of superiority are built upon a methodology with significant underspecified components and an evaluation that lacks the necessary rigor to be conclusive.
-
Critically Underspecified Methodology: The core mechanisms are described at a high level but lack the technical depth required for reproducibility or to assess their robustness.
- Checkpointing Mechanism: The authors claim to use a "lightweight snapshot mechanism that saves only the essential transaction history and architectural state" (Section 3, page 3). The definition of "essential" is entirely omitted. How is this state identified automatically and reliably? An incomplete state restoration could lead to divergent, non-deterministic behavior, invalidating any subsequent exploration from that checkpoint. This is a fundamental flaw that undermines the entire partial-reset premise.
- Stagnation Heuristic: The invocation of the expensive symbolic execution engine is triggered when NoIncrement(Coverage) > Th (Algorithm 1, page 4). The parameter Th is set based on "pilot runs on smaller designs" (Section 5, page 7). This is an ad-hoc, non-generalizable heuristic. A core parameter of the system should not be based on empirical tuning without a formal justification for its transferability to new, more complex designs.
- CFG Generation and Scalability: The entire approach hinges on a Control Flow Graph (CFG) generated by Pyverilog. The authors fail to address the well-known limitations of parsing tools like Pyverilog on complex, multi-file, industrial-grade SystemVerilog designs. More importantly, the proposed coverage metric—the Cartesian product of all conditional outcomes (Section 4.6, page 6)—is combinatorially explosive and computationally intractable for all but the simplest control structures. The proposed solution, to simply fuzz non-zero values, is a weak fallback that abandons the guided approach precisely when it is most needed.
-
Unconvincing and Potentially Misleading Evaluation: The experimental results are presented in a way that exaggerates impact while using weak baselines and questionable comparisons.
- Weak Baseline Comparison: The claim of a "6.8× faster convergence" (Abstract, page 1) is made against "traditional UVM random testing." This is a meaningless comparison. Any guided fuzzer is expected to outperform pure random testing. The only valid baseline for a novel fuzzer is other state-of-the-art fuzzers, against which the claimed improvement is a much more modest 6-15% (Figure 4a, page 11).
- Questionable "New Bug" Claim: The centerpiece result, Bug #01 (Table 1, page 8), is presented as a novel discovery in OpenTitan. However, the authors state this bug was in an IP generated by OpenTitan's reggen tool and "was not part of the HACK@DAC'24 competition" (Section 5.1, page 7). This implies the other fuzzers were not evaluated on the exact same design. Without confirmation that all tools were run on identical RTL, the comparison is invalid. It is possible SymbFuzz found this bug simply because it was the only tool run against the code containing it.
- Unsupported Scalability Claims: The authors claim SymbFuzz "scales efficiently" (Section 5.5.2, page 12). The data in Table 3 suggests otherwise. The OpenTitan design (1.5M LoC) required 4x the latency and generated 3x the constraints of the next largest design (~200k LoC). This indicates a super-linear, and potentially exponential, increase in cost with design size, casting serious doubt on its applicability to next-generation multi-million gate SoCs.
-
Contradictory Performance Metrics: In Section 5.2 (page 11), the paper states SymbFuzz is "33% and 54% more efficient than DifuzzRTL and HWFP, respectively," yet also reports that it uses "4% more memory than DifuzzRTL." The term "efficient" is never defined. If it refers to CPU time, how can a more complex, SMT-solver-based approach be faster while also consuming more memory? This lack of clarity suggests a potential inconsistency in measurement or reporting.
Questions to Address In Rebuttal
The authors must address the following points directly and with specific evidence from their work:
-
Checkpointing Integrity: Provide a detailed, algorithmic description of how "essential" state is automatically identified for checkpointing. How do you formally guarantee that un-captured, non-architectural state (e.g., in pipeline registers, arbiters, or stateful peripherals) does not corrupt the design's behavior after a partial reset?
-
Stagnation Threshold Th: Justify the choice of Th. How sensitive are the final coverage results to this parameter? Provide data showing that the chosen Th is not simply over-fitted to the benchmark suite and can be expected to work on unseen designs.
-
Bug #01 Comparison: Please confirm the exact git commit hash of the OpenTitan repository used for all fuzzing comparisons. Were RFuzz, DifuzzRTL, and HWFP run on precisely the same version of the RTL that contained the reggen-generated bug (Bug #01)? If not, the claim of superior bug-finding ability for this case is unsubstantiated.
-
Computational Cost Breakdown: What percentage of the total execution time is spent in the SMT solver versus simulation? Provide this breakdown for the OpenTitan benchmark. This is critical for understanding the true cost of the symbolic execution component.
-
Scalability Projection: Given the data in Table 3, what is the projected latency and number of generated constraints for a design of 5 million LoC? How do you defend the claim of "efficient scaling" when the provided data suggests a steep increase in computational cost with design size?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents SymbFuzz, a hybrid hardware fuzzing framework that integrates symbolic execution guidance into the industry-standard Universal Verification Methodology (UVM). The core idea is to leverage a traditional coverage-guided fuzzer for broad state-space exploration and, upon detecting stagnation, employ an SMT-based symbolic execution engine to generate new input constraints that steer the fuzzer toward uncovered paths.
The authors' primary contribution is not the invention of hybrid fuzzing itself, but its novel and pragmatic adaptation to the specific challenges and workflows of pre-silicon RTL verification. Key to their approach are: 1) a tight integration with UVM, where the solver generates UVM constraints directly for the testbench sequencer; 2) a Control Flow Graph (CFG) based model of the hardware state, focused on control registers; and 3) a checkpointing mechanism to avoid costly full resets during simulation. The framework is evaluated on several RISC-V processor cores, most notably a buggy version of the OpenTitan SoC, where it successfully identified all previously known bugs plus 14 new ones, including a vulnerability significant enough to be registered in the CWE database.
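As a rough illustration of the stagnation-triggered control loop described above, the sketch below models the hybrid flow in plain Python; the threshold, the toy device-under-test, and the placeholder solver function are all illustrative and are not SymbFuzz's actual components or API.

```python
# Toy model of stagnation-triggered hybrid fuzzing: random mutation until
# coverage stops improving, then a "solver" produces a stimulus for an
# uncovered branch. All names and values here are placeholders.

import random

TH = 50  # stagnation threshold: iterations allowed without new coverage

def run_one_test(stimulus: int) -> set[str]:
    """Stand-in for simulating one transaction and reporting hit coverage points."""
    cov = {"reset_path"}
    if stimulus & 0xFF == 0x7F:
        cov.add("deep_branch")   # branch that random mutation rarely reaches
    return cov

def solve_for_uncovered_branch() -> int:
    """Placeholder for the SMT step: return a stimulus satisfying the branch
    condition (low byte == 0x7F); a real flow would query a solver such as Z3."""
    return (random.getrandbits(24) << 8) | 0x7F

covered: set[str] = set()
stagnation = 0
seed = random.getrandbits(32)

for _ in range(10_000):
    new_points = run_one_test(seed) - covered
    if new_points:
        covered |= new_points
        stagnation = 0
    else:
        stagnation += 1
    if stagnation > TH:                        # NoIncrement(Coverage) > Th
        seed = solve_for_uncovered_branch()    # guided escape from the plateau
        stagnation = 0
    else:
        seed ^= 1 << random.randrange(32)      # cheap random bit-flip mutation

print("covered points:", sorted(covered))
```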
Strengths
-
Excellent Synthesis of Academic Concepts and Industrial Practice: The most significant strength of this work is its successful translation of the hybrid fuzzing paradigm, well-established in software security (e.g., Driller, QSYM), into the rigid and complex world of industrial hardware verification. The integration with UVM (Section 4.1, page 4) is the key insight. Instead of reinventing the entire test generation and simulation environment, the authors cleverly use symbolic execution as an "oracle" that feeds constraints back into the existing, trusted UVM machinery. This pragmatic approach dramatically lowers the barrier to adoption and makes the work immediately relevant to verification engineers.
-
Addressing a Core Problem in Hardware Verification: The paper sits squarely at the intersection of two critical challenges in modern SoC design: the scalability limits of formal verification and the shallowness of unguided, simulation-based methods like constrained-random testing. SymbFuzz offers a compelling "best of both worlds" solution, using the speed of simulation for general exploration and the analytical power of formal methods for targeted, deep exploration of hard-to-reach corner cases. The 6.8x faster convergence compared to traditional UVM random testing (Section 5.3, page 11) is a powerful testament to this approach's value.
-
Strong and Convincing Empirical Validation: The results presented are impressive and serve as strong evidence for the framework's efficacy. Finding a large number of new bugs is a good outcome, but discovering a previously unknown, CWE-worthy vulnerability in a well-scrutinized open-source design like OpenTitan (Bug #01, Table 1, page 8) elevates the work significantly. It proves that the technique is not merely finding trivial bugs but is capable of uncovering subtle and impactful security flaws that other methods have missed.
-
Thoughtful Hardware-Specific Design Choices: The authors demonstrate a clear understanding of the hardware verification domain. The development of a checkpoint and partial-reset mechanism (Section 4.5, page 5) directly addresses a major performance bottleneck in hardware simulation (lengthy reset and initialization sequences). This shows the work is not a simple port of a software technique but a carefully engineered solution tailored to its target domain.
Weaknesses
-
Understated Connection to the Broader Hybrid Fuzzing Lineage: While the work is novel in the hardware context, it would be strengthened by more explicitly positioning itself as the hardware counterpart to the rich history of hybrid fuzzers in the software domain. Acknowledging this intellectual lineage would allow the authors to better highlight the non-trivial contributions required for this adaptation—such as modeling hardware state, handling concurrency implicitly through the simulator, and the aforementioned UVM integration, which have no direct software equivalents.
-
Limited Discussion on the Scalability of the Static Analysis Phase: The paper demonstrates scalability on cores up to ~200k LoC (Table 3, page 12). However, modern SoCs are orders of magnitude larger. The framework relies on an initial static analysis phase to generate the CFG and dependency equations. It would be beneficial to include a discussion on how this analysis phase scales with design size and complexity. For instance, what are the computational and memory bottlenecks—CFG generation, dependency analysis, or the SMT solving itself—when faced with a multi-million gate design composed of numerous interacting IPs?
-
Implicit Manual Effort in Property Definition: The bug detection mechanism relies on a predefined set of security properties written as SystemVerilog assertions (e.g., Listing 5, 7, 9). While effective, the paper does not discuss the effort required to create these properties. For the framework to be truly scalable across new and varied designs, the effort of property specification is a critical factor. A brief discussion on whether these properties are highly generic or require significant design-specific expertise to write would add important context.
Questions to Address In Rebuttal
-
The core concept of using a solver to overcome fuzzer roadblocks is central to hybrid fuzzers in software. Could you elaborate on the most significant unexpected challenges you faced when translating this paradigm from the software domain (sequential programs, single address space) to the hardware RTL/UVM domain (massive concurrency, complex state, cycle-accurate simulation)?
-
Regarding scalability, your results on cores like CVA6 are promising. If you were to apply SymbFuzz to a full commercial SoC with dozens of IPs, what part of your toolchain—CFG generation, dependency equation solving, or runtime coverage tracking—do you anticipate would become the primary performance bottleneck?
-
Your bug-finding capability is impressive and hinges on security properties. For a verification team adopting SymbFuzz for a completely new IP, what is the expected workflow for developing the necessary properties? Are the properties you used for OpenTitan generalizable (e.g., patterns for information leakage), or do they require deep, IP-specific expert knowledge to formulate?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents SymbFuzz, a hybrid hardware fuzzing framework that uses symbolic execution to guide test generation towards uncovered states. The authors' central claim to novelty rests on being the first such framework to be implemented on and deeply integrated with the industry-standard Universal Verification Methodology (UVM). The proposed system builds a Control Flow Graph (CFG) from the RTL, identifies checkpoints at high-fanout nodes, and when coverage stagnates, uses an SMT solver on dependency equations to generate new input constraints for the UVM sequencer. This allows the fuzzer to escape local coverage maxima and explore hard-to-reach states.
While the fundamental concept of symbolic-execution-guided fuzzing is well-established in the software domain, this paper's primary novel contribution lies in the architectural adaptation and practical implementation of this concept within a commercial-grade hardware verification environment. The specific mechanisms for state restoration (via input-sequence replay) and closed-loop guidance (via UVM sequencer constraints) are presented as novel solutions tailored for the hardware verification domain.
Strengths
-
Novelty in Integration and Practicality: The most significant and undeniable novel contribution is the integration of a hybrid fuzzer into the UVM framework (Section 4.1, page 4). Prior academic hardware fuzzers (e.g., RFuzz, DifuzzRTL, HWFP) are typically standalone tools built on simulators like Verilator. By architecting their solution around UVM components (sequencer, driver, monitor), the authors present a genuinely new pathway for transitioning advanced fuzzing research into industrial practice. This is not merely a reimplementation; it requires a novel design to bridge the gap between symbolic constraint solving and UVM's transaction-based stimulus generation.
-
Domain-Specific State Management Mechanism: The checkpointing mechanism (Section 4.5, page 5) is a novel implementation of state-saving for RTL simulation. It eschews heavyweight, full-state snapshots (common in software via fork-servers) in favor of a lightweight approach: saving the input vector sequence that leads to a checkpoint and re-simulating from reset to restore that state. While re-simulation is not a new idea, its systematic application based on CFG analysis to accelerate path exploration in hardware fuzzing is a novel refinement.
-
Generality over Prior Hybrid Hardware Fuzzers: The paper effectively distinguishes its approach from the closest prior art, HypFuzz [15]. By operating directly on RTL inputs rather than ISA-level binaries, SymbFuzz offers a more general solution applicable to any hardware IP, not just processor cores. This extension of scope is a meaningful "delta" and represents a novel advancement in the applicability of hybrid fuzzing for hardware.
Weaknesses
-
The Core Algorithm is Not Fundamentally New: The central weakness from a novelty perspective is that the underlying algorithm—using symbolic execution to solve path constraints when random mutation gets stuck—is the canonical definition of hybrid or concolic fuzzing. This technique has a long and storied history in software verification, with seminal works like Driller [54], SAGE, and KLEE [12] establishing the paradigm over a decade ago. The paper should more explicitly frame its contribution as a novel adaptation and engineering of this known paradigm to a new domain, rather than implying the hybrid loop itself is a new invention.
-
Incremental Advancement in Guidance: While the implementation is novel, the guidance strategy itself is a straightforward application of symbolic execution. The process of identifying an uncovered branch, tracing its path condition backward, and solving for inputs is conceptually identical to existing work. The novelty is in the plumbing (getting constraints into UVM), not in the guidance logic itself.
-
Overstated Novelty of CFG-based Coverage: The use of a CFG to model hardware states and track coverage is a logical step, but not a profound conceptual leap. Static analysis to extract state machines or flow graphs from RTL is a standard technique in formal verification and synthesis tools. While its application to guide a fuzzer is appropriate, it doesn't represent a fundamentally new analysis technique.
Questions to Address In Rebuttal
-
The core concept of hybrid fuzzing is well-established in software. Beyond the significant challenge of UVM integration, what were the fundamental hardware-specific challenges (e.g., dealing with 4-state logic, timing semantics, complex clocking domains) that required a genuinely novel algorithmic solution, rather than a direct porting of software-based concolic execution principles?
-
The paper effectively differentiates itself from HypFuzz [15] on implementation (RTL vs. ISA) and oracle (assertions vs. differential). Could the authors elaborate on the conceptual novelty of their guidance mechanism compared to HypFuzz? Does HypFuzz's "formal-assisted" search employ a similar symbolic constraint-solving loop, and if so, what is the key conceptual—not just implementation—difference in how SymbFuzz performs its guided search?
-
The dependency equation generation (Section 4.4.2, page 5) appears critical to the guidance mechanism. The paper presents a toy example. How does this analysis scale to large, industrial designs where a single output may depend on thousands of internal registers and inputs over many cycles? Is the novelty of the approach constrained by the scalability of this preliminary analysis, and is the analysis itself novel compared to techniques used in, for example, information flow tracking (IFT) static analysis?
X-SET: An Efficient Graph Pattern Matching Accelerator With Order-Aware Parallel Intersection Units
Abstract
Graph Pattern Matching (GPM) is a critical task in a wide range of graph analytics applications, such as social network analysis and cybersecurity. Despite its importance, GPM remains challenging to accelerate due to its inherently irregular control flow ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present X-SET, a GPM accelerator architecture featuring two primary contributions: an "Order-Aware Set Intersection Unit" (SIU) based on a bitonic merger, and a "Barrier-Free" out-of-order task scheduler for DFS-based GPM. The paper claims significant improvements in performance and compute density over state-of-the-art CPU, GPU, and specialized accelerator baselines. While the paper identifies legitimate bottlenecks in GPM acceleration, its core claims rest on a questionable evaluation methodology and an understated presentation of the hardware complexity and its associated trade-offs. The work, in its current form, makes several logical leaps that are not sufficiently substantiated by the provided evidence.
Strengths
- The paper correctly identifies two well-known and critical bottlenecks in hardware-accelerated GPM: the efficiency of the core set intersection operation and the underutilization of parallel units due to rigid DFS scheduling.
- The application of a bitonic merger network to the problem of intersecting two sorted sets is a technically sound approach to increasing throughput compared to simple merge-based units.
- The integration into a RISC-V SoC via the RoCC interface demonstrates a plausible path to a complete system, moving beyond purely theoretical accelerator models.
Weaknesses
-
Indirect and Non-Rigorous Accelerator Comparisons: The central performance claims against other accelerators (FlexMiner, FINGERS, Shogun) are critically flawed. As stated in Section 7.1.3, performance data for FINGERS and Shogun are not derived from direct simulation or implementation but are calculated based on their reported relative speedup over FlexMiner. This transitive comparison is scientifically unsound. It inherits and potentially amplifies any methodological discrepancies, architectural assumptions, or benchmark variations from three separate papers. A rigorous comparison requires simulating all architectures under identical conditions. Without this, the speedup claims in Figure 13 are unsubstantiated.
-
Understated Scheduler Complexity and Questionable Area Claims: The proposed "Barrier-Free" scheduler, detailed in Section 6 and Figure 10, involves numerous complex hardware structures (Task Tree, Task Sets, Fast Spawning Registers, Candidate Buffers, etc.) for fine-grained dependency tracking. However, the area breakdown in Table 4 reports a control logic area (0.044 mm²) that is significantly smaller than that of FINGERS (0.069 mm²), which employs a simpler pseudo-DFS scheduler. This discrepancy is highly suspicious. Either the area accounting is not apples-to-apples, or crucial components of the scheduler have been omitted from the "Control" category. The complexity depicted in Figure 10 does not align with the small area reported.
-
Unexamined Impact of Out-of-Order Execution on Cache Locality: The paper claims the barrier-free scheduler improves hardware utilization by executing ready tasks from different levels of the search tree (Section 3.2). However, it fails to address the obvious and detrimental consequence: destroying data locality. Executing tasks from disparate parts of the search tree will inevitably lead to cache thrashing in both the private and shared caches, increasing memory traffic and potentially negating the gains from improved SIU utilization. The claim in Section 6.2 that a "Depth-First policy is used for final selection to minimize the memory footprint" is a weak and unsubstantiated countermeasure. The paper provides no data on cache miss rates or memory bandwidth utilization to prove this is not a performance-limiting factor.
-
The "Order-Aware" SIU's Practicality is Ambiguous: The theoretical O(N log N) hardware complexity of the SIU is presented as a clear win. However, this asymptotic advantage comes at the cost of a deep, multi-stage pipeline (MIN, CAS, Merge stages described in Section 5). For the short intersection lists common in sparse graphs or early in the search tree, the high pipeline latency of this unit could easily make it slower than a simple, low-latency merge queue. The end-to-end results in Figure 14 do not sufficiently decouple this latency vs. throughput trade-off. The authors show their design wins on average, but they do not provide the analysis required to understand the performance crossover point or the conditions under which their complex design is actually a liability.
-
Marginal and Conditional Superiority Over GPU Baselines: The performance comparison against the GPU baseline GLUMIN (Figure 12c) shows a geometric mean speedup of only 1.05×. This is a marginal improvement at best. Furthermore, the authors concede in Section 7.2.2 that "the performance advantage of X-SET diminishes for larger graphs" and for graphs where vertex degrees are within the GPU's warp-level LUT generation limits. This is a crucial admission: the claimed superiority over a modern GPU is conditional and disappears on the very workloads (large, complex graphs) that are often the motivation for building accelerators in the first place.
Questions to Address In Rebuttal
- Please provide a rigorous justification for using indirect, transitive comparisons for the accelerator baselines in Figure 13. How can the authors guarantee that the experimental conditions (e.g., compiler flags, library versions, detailed architectural parameters) across the three source papers are consistent enough to make this comparison valid?
- Provide a detailed, component-by-component area breakdown of the 0.044 mm² "Control" unit. Specifically, what is the area of the Task Tree management logic, the Candidate Buffers, the Fast Spawning Registers, and the dispatch logic shown in Figure 10? How does this truly compare to the specific scheduler logic area in FINGERS and Shogun, not their entire "control" unit?
- Present data on L1 and L2 cache miss rates and DRAM bandwidth utilization for X-SET compared to a baseline using a simple in-order DFS scheduler. This evidence is necessary to substantiate the claim that your out-of-order scheduler does not cause prohibitive memory system performance degradation.
- Provide a microbenchmark analysis that directly compares your Order-Aware SIU against a simple merge-based SIU. The analysis should plot latency and sustained throughput as a function of input set length to clearly demonstrate the performance trade-offs and identify the crossover point where your design becomes superior.
- The 142.9× improvement in compute density over FINGERS is an extreme outlier. Please specify the exact benchmark (pattern and graph combination) that produces this result and provide the raw cycle counts, area numbers, and a detailed analysis explaining why this specific case is so favorable to your architecture and so unfavorable to the baseline.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents X-SET, a hardware accelerator for Graph Pattern Matching (GPM) integrated into a RISC-V SoC. The authors identify and tackle two widely acknowledged bottlenecks in GPM acceleration: inefficient set intersection and rigid, synchronization-heavy task scheduling. The core of their contribution is twofold. First, they introduce an "Order-Aware Set Intersection Unit" (SIU) that cleverly adapts a classic bitonic merger network to perform highly parallel set intersections. By leveraging the inherent sorted nature of adjacency lists, this unit reduces hardware complexity from O(N²) to a more scalable O(N log N) while achieving high throughput. Second, they propose a "Barrier-Free Task Scheduler" that decouples task execution from the strict hierarchical levels of the DFS search tree, enabling asynchronous, out-of-order execution that maximizes the utilization of their parallel SIUs. The synergy between these two innovations is the key to the impressive performance gains reported, with the paper claiming a 6.4x geometric mean speedup over other state-of-the-art GPM accelerators.
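To make the order-aware intersection idea concrete, here is a minimal software model of the construction the reviews describe (concatenate one ascending list with the other reversed to obtain a bitonic sequence, run a bitonic merger, then select adjacent duplicates); the padding value and the toy adjacency lists are illustrative, and the hardware pipelining is not modeled.

```python
# Software model of order-aware set intersection via a bitonic merger network.
# Assumes both neighbor lists are sorted, duplicate-free, and non-negative,
# as is typical for CSR adjacency lists.

def bitonic_merge(seq: list[int], lo: int, n: int) -> None:
    """Sort an n-element bitonic subsequence ascending (n is a power of two);
    this mirrors the compare-exchange structure of Batcher's merger."""
    if n > 1:
        half = n // 2
        for i in range(lo, lo + half):          # one column of compare-exchange
            if seq[i] > seq[i + half]:
                seq[i], seq[i + half] = seq[i + half], seq[i]
        bitonic_merge(seq, lo, half)
        bitonic_merge(seq, lo + half, half)

def intersect_sorted(a: list[int], b: list[int]) -> list[int]:
    seq = a + b[::-1]                           # ascending then descending: bitonic
    size = 1
    while size < len(seq):
        size *= 2
    seq += [-1] * (size - len(seq))             # pad the descending tail (IDs >= 0)
    bitonic_merge(seq, 0, size)
    # Elements present in both sets end up adjacent after merging.
    return [seq[i] for i in range(size - 1) if seq[i] == seq[i + 1] and seq[i] != -1]

print(intersect_sorted([1, 4, 7, 9, 12], [2, 4, 9, 10]))  # [4, 9]
```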
Strengths
-
Elegant Repurposing of a Classic Parallel Primitive: The single most compelling idea in this paper is the application of a bitonic merger network to the problem of set intersection (Section 3.1, page 4). This is a wonderful example of cross-pollination, taking a well-understood algorithm from the domain of parallel sorting and applying it with great effect to graph analytics. By transforming the intersection of two sorted sets into a sorting problem on a derived bitonic sequence, the authors achieve a design that sits in a beautiful "sweet spot": it is far more parallel than a simple merge-based approach and significantly more area-efficient than a fully combinatoric systolic array. The theoretical complexity reduction shown in Table 1 (page 2) is a fundamental contribution.
-
Addressing the System-Level Bottleneck: A powerful execution unit is often wasted if the system cannot keep it fed. The authors correctly identify that rigid, level-by-level DFS scheduling is a primary cause of underutilization in irregular graph workloads. Their barrier-free, dependency-aware scheduler (Section 3.2, page 4 and Section 6, page 7) is a crucial system-level contribution that unlocks the potential of their efficient SIUs. It moves GPM accelerator design away from simplistic synchronous parallelism towards a more sophisticated, dynamic model reminiscent of dataflow architectures or modern task-based software runtimes.
-
Strong Synergy Between Contributions: The two core ideas are not independent; they are highly synergistic. The high-throughput SIU creates a demand for a constant stream of ready-to-execute tasks, which the barrier-free scheduler provides. Conversely, the scheduler's ability to dispatch tasks from anywhere in the search tree would be less impactful without powerful, parallel execution units to send them to. This holistic approach to accelerator design is a major strength.
-
Practical and Reproducible Design: The integration into a RISC-V SoC via the RoCC interface (Section 4.1, page 5) grounds the work in reality, demonstrating a clear path from research concept to a practical system component. Furthermore, the commitment to providing an open-source artifact, including Chisel RTL and a SystemC simulator, is commendable and significantly increases the potential impact of this work as a platform for future research.
Weaknesses
While the work is strong, its context could be broadened and some assumptions could be more critically examined.
-
Focus on Single-Node Scalability: The architecture is centered around a shared memory hierarchy (specifically, the shared L2 cache). This is a well-established model for a single SoC, but it inherently limits the scalability to a single node. The broader landscape of graph processing is increasingly concerned with distributed, multi-node systems. The current design does not naturally extend to such a paradigm, as the centralized task dependency management and shared cache would become significant bottlenecks.
-
Implicit Assumption of Data Layout: The efficiency of the Order-Aware SIU is predicated on the input neighbor lists being sorted. This is a common convention but not a universal guarantee. The paper does not discuss the potential system-level overhead of enforcing this property if the input graph data is not already sorted. In a real-world data pipeline, this pre-processing step could have non-trivial performance implications.
-
Limited Discussion on Energy Efficiency: The paper provides power breakdown data for the SIU designs (Figure 15, page 11) but lacks a broader, end-to-end discussion of energy efficiency (e.g., performance-per-watt) in its main comparisons, especially against the GPU baseline. For a hardware accelerator, energy efficiency is often a primary motivator, and a more explicit analysis would strengthen the claims of superiority.
Questions to Address In Rebuttal
-
The bitonic merger is a fantastic choice for the SIU. Could the authors comment on where else this "order-aware" hardware philosophy might be applied? For instance, could similar principles be used to accelerate other graph kernels like graph joins or merge-path operations in graph traversal algorithms?
-
The barrier-free scheduler is designed for the specific dependency structure of a GPM DFS tree. How generalizable is this hardware scheduling approach? Could it be adapted to accelerate other irregular, tree-based search or recursive algorithms (e.g., in constraint satisfaction or AI planning)?
-
Regarding the assumption of sorted adjacency lists: Could the authors provide an estimate or a qualitative argument on the system-level performance impact if a data-ingest stage had to explicitly sort neighbor lists before they could be processed by X-SET? How would this affect the end-to-end speedup on graphs that are not natively stored in this format?
-
While the performance speedups are impressive, could you provide more insight into the performance-per-watt of X-SET compared to the GPU baseline? A key motivation for accelerators is often improved energy efficiency, and quantifying this would provide a more complete picture of the benefits.
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present X-SET, a hardware accelerator for Graph Pattern Matching (GPM). The paper's novelty claims are centered on two primary contributions: (1) an "Order-Aware Set Intersection Unit" (SIU) based on a bitonic merger network, which claims to reduce hardware complexity for parallel intersection from O(N²) to O(N log N); and (2) a "Barrier-Free Task Scheduler" designed to enable asynchronous, out-of-order execution of tasks in a DFS-based search tree.
My analysis concludes that the first contribution—the Order-Aware SIU—represents a genuinely novel architectural synthesis for the GPM acceleration domain, providing a clear asymptotic advantage over prior art. The second contribution—the barrier-free scheduler—is an incremental but significant refinement of ideas recently introduced in the literature, specifically by Shogun [49]. While not a revolutionary concept, its implementation appears more aggressive and holistically integrated with the high-throughput SIU, justifying its inclusion as a key contribution.
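As a rough software analogy for the difference between level-synchronous and barrier-free dispatch, the toy below keeps a single pool of ready tasks from any tree depth and drains it in SIU-sized batches; the task encoding, expansion function, and SIU count are illustrative and do not model X-SET's Task Tree hardware, candidate buffers, or memory system.

```python
# Toy analogy of barrier-free task dispatch over a DFS search tree.

from collections import deque

def expand(task):
    """Placeholder extension step: a task spawns child tasks one level deeper."""
    depth, vertex = task
    if depth == 3:                        # pattern of size 4 -> leaf = one match
        return []
    return [(depth + 1, 10 * vertex + i) for i in range(2)]

def barrier_free_run(roots, num_sius=4):
    """Dispatch any ready task to any free SIU, regardless of its depth."""
    ready = deque(roots)
    matches = 0
    while ready:
        # Model one issue slot: up to num_sius tasks execute "in parallel".
        batch = [ready.popleft() for _ in range(min(num_sius, len(ready)))]
        for task in batch:
            children = expand(task)
            if not children:
                matches += 1
            ready.extendleft(children)    # depth-first bias limits the footprint
        # Note: no per-level barrier; deep and shallow tasks mix in the pool.
    return matches

print(barrier_free_run([(0, v) for v in range(3)]))  # 3 roots x 2^3 leaves = 24
```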
Strengths
The primary strength of this paper lies in the novelty of its Order-Aware Set Intersection Unit (SIU), detailed in Section 3.1 (page 4).
-
Novel Application of a Classic Algorithm for Architectural Benefit: The underlying mechanism, the bitonic sorter/merger, is a classic algorithm (Batcher, 1968). However, the authors' insight is to recognize that by concatenating one sorted set with the reverse of another, they can form a bitonic sequence. Applying a bitonic merger network to this constructed sequence is a clever and non-obvious method to architect a parallel set intersection unit.
-
Clear Asymptotic Improvement over Prior Art: This architectural choice provides a tangible improvement in hardware complexity over direct GPM accelerator predecessors. It stands in stark contrast to the two established approaches: the simple merge-queue (e.g., FINGERS [11]), which is fundamentally sequential (O(1) throughput, O(1) comparators), and the systolic array (e.g., DIMMining [15]), which is massively parallel but requires O(N²) comparators for an N-wide input. The proposed O(N log N) hardware complexity for N-wide throughput occupies a novel and highly efficient design point between these two extremes. This is a significant architectural contribution.
-
Evolutionary Advancement in Scheduling: The concept of a barrier-free, out-of-order scheduler is an important step forward. The authors correctly identify the limitations of strictly level-synchronous DFS schedulers and the windowed "pseudo-DFS" approach from FINGERS [11]. The proposed design appears to be a more fully realized dataflow-like execution model than what was proposed in Shogun [49], which the authors note retains some synchronization barriers in its "locality-aware mode." The detailed design in Section 6 (page 7) with its Task Tree and dynamic dispatching demonstrates a meaningful step beyond prior work.
Weaknesses
My concerns relate primarily to the framing of the novelty and the depth of comparison with the closest conceptual predecessors.
-
Scheduler Novelty is Incremental: The claim of a "barrier-free" or "out-of-order" scheduler for GPM is not entirely new. Shogun [49] explicitly introduced an "incremental task scheduler for out-of-order execution to reduce inter-depth barriers." The authors acknowledge Shogun but could do more to delineate the precise architectural deltas. The novelty here is in the degree and implementation, not the fundamental concept of breaking inter-depth barriers. The paper should frame this contribution more precisely as a "fully asynchronous" or "less constrained" scheduler to more accurately reflect its relationship to prior art.
-
Lack of Context from Adjacent Fields: The core idea of the SIU is, at a high level, a hardware-accelerated sort-merge operation. The database hardware community has explored hardware for sorting and joining for decades. While the specific application to a streaming GPM accelerator is novel, the paper would be strengthened by acknowledging this conceptual lineage and clarifying how their design differs from, for instance, hardware sort-merge join units. This is not to diminish the contribution, but to properly place it within the broader landscape of ideas.
-
Complexity vs. Benefit Trade-off in Scheduling: The proposed scheduler architecture (Figure 10, page 8) is considerably complex, involving a Task Tree, Task Sets, Frames, Fast Spawning Registers, and Candidate Buffers. While the ablation study (Figure 16) shows this is effective, the paper does not discuss potential failure modes or scenarios where the overhead of managing this complex scheduler state could itself become a bottleneck, for example in patterns with extremely sparse but deep search trees.
Questions to Address In Rebuttal
-
Regarding the novelty of the scheduler, please provide a detailed, point-by-point architectural comparison with the scheduler in Shogun [49]. Beyond Shogun's "locality-aware mode," what are the fundamental differences in the dependency tracking mechanisms, task spawning logic, and resource management that make your scheduler qualitatively different and more efficient?
-
Could the authors comment on the relationship between their Order-Aware SIU and hardware implementations of sort-merge join algorithms from the database accelerator literature? While the application to GPM is novel, clarifying the conceptual lineage would strengthen the paper's contribution.
-
The proposed barrier-free scheduler introduces significant state and management logic. Under what conditions (e.g., specific graph topologies or query pattern structures) might the overhead of managing the Task Tree become a significant fraction of the execution time? How does the design ensure that the scheduler itself does not become the bottleneck?
FALA: Locality-Aware PIM-Host Cooperation for Graph Processing with Fine-Grained Column Access
Abstract
Graph processing is fundamental and critical to various domains, such as social networks and recommendation systems. However, its irregular memory access patterns incur significant memory bottlenecks on modern DRAM architectures, optimized for sequential ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose FALA, a hybrid host-PIM architecture for accelerating graph processing. The design couples a host-side accelerator featuring a fine-grained cache with a lightweight, per-bank PIM unit in DRAM. The core contributions are: (1) FALA-PIM, an in-memory unit to offload Reduce operations; (2) FALA-CAS, a mechanism for fine-grained (8B), non-contiguous column access within a DRAM row; and (3) a locality-aware scheduling policy, driven by a Row Locality Monitor, that directs memory requests to either the host cache or the PIM unit based on their access history. The paper claims this synergistic design overcomes the limitations of prior cache-only and PIM-only approaches.
Strengths
The paper correctly identifies the fundamental tension in graph processing acceleration between exploiting data locality via caches and leveraging high internal memory bandwidth via PIM. The motivation to create a hybrid architecture that dynamically arbitrates between these two execution modes is well-founded. The concept of coalescing multiple fine-grained requests destined for the same DRAM row in the PIM-Cache Cooperation Unit is a technically sound method for improving memory transaction efficiency.
Weaknesses
Despite a promising premise, the work suffers from several critical weaknesses in its core mechanisms and evaluation that call its claimed benefits into question.
-
The "Mat Conflict" Problem Is a Fundamental Flaw, Not a Minor Issue. The central novelty of the hardware, FALA-CAS, hinges on its ability to efficiently gather non-contiguous columns. However, the design is crippled by "mat conflicts," which occur when multiple requests target the same
MatGroup. The paper's own analysis in Section 7.7 (Page 12, Figure 16b) reveals that for most datasets, conflict-free operations constitute a minority (e.g., ~40% or less). This means that for the majority of operations, the access latency is doubled (requiring 2xtCCDL). The paper downplays this severe penalty, but it effectively negates much of the theoretical benefit of the fine-grained access mechanism. The claim that "FALA would not suffer from severe conflict in practical" (Section 7.7, Page 12) is a direct contradiction of the presented data. -
The Locality Monitor's Granularity Is Excessively Coarse. The decision to track locality at the DRAM row level (Section 5.2, Page 6) is a significant design flaw. A single DRAM row (e.g., 1KB in HBM2) can contain properties for dozens or hundreds of distinct vertices. Classifying an entire row as "high locality" because of frequent access to one or two vertices within it will inevitably lead to fetching large amounts of un-reused, "cold" data into the host's fine-grained cache. This cache pollution directly undermines the goal of efficient cache utilization and contradicts the abstract's claim of minimizing "unnecessary memory access." The justification for avoiding a finer-grained monitor—metadata overhead—is insufficient without a rigorous analysis showing that this coarse-grained approach is indeed the optimal trade-off.
-
The Hardware Overhead Analysis is Unconvincing and Indirect. The authors claim the modifications for FALA-CAS are "feasible" with minimal overhead (Section 4.2, Page 5). The evidence provided in Section 7.4 (Page 11) is weak. It relies on comparing synthesized logic transistor counts to a commercial PIM product [48] that implements a far more complex GEMV engine. This is not a valid apples-to-apples comparison for estimating the area impact of adding custom decoders and routing within a highly constrained DRAM bank layout. Such a modification is non-trivial, and the paper fails to provide sufficient evidence (e.g., layout-level considerations) to support its claims of low overhead.
-
Key Design Problems are Left Unaddressed. The paper identifies "type conflicts" (Section 5.2, Page 8) where pending cache requests are flushed because a new PIM request evicts the corresponding locality entry. The authors state this occurs for 6.32% of requests (Section 7, Page 9). This is not a negligible rate, and such pipeline flushes can introduce significant performance variability. Yet, the paper offers no concrete solution within the evaluated architecture, instead relegating a potential fix to future "Extensions" (Section 9, Page 13). A robust design should not defer the handling of such a fundamental operational hazard.
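To ground the latency concern raised in Weakness 1, the following back-of-the-envelope model assumes, as described above, that a conflict-free fine-grained access costs one tCCD_L slot while a mat conflict serializes into two; the conflict-free fractions swept are illustrative, with 40% matching the figure cited from Section 7.7.

```python
# Reviewer's toy expected-latency model for FALA-CAS mat conflicts.

def effective_latency_factor(conflict_free_fraction: float) -> float:
    """Average column-access cost in units of tCCD_L: p*1 + (1-p)*2 = 2 - p."""
    p = conflict_free_fraction
    return 2.0 - p

for p in (1.0, 0.6, 0.4, 0.2):
    factor = effective_latency_factor(p)
    print(f"conflict-free {p:.0%}: {factor:.2f}x tCCD_L per access, "
          f"{1.0 / factor:.0%} of ideal column bandwidth")
```

Under this model, a 40% conflict-free rate already costs 1.6× tCCD_L per access on average, i.e., only 62.5% of the ideal fine-grained column bandwidth; the rebuttal should show how the end-to-end results absorb this penalty.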
Questions to Address In Rebuttal
The authors must address the following points to substantiate their claims:
- Given that your own data in Figure 16b shows mat conflicts occur in the majority of FALA-CAS operations, provide a quantitative analysis of the effective memory bandwidth of your design. How does this effective bandwidth, which must account for the frequent 2× tCCDL latency penalty, compare to a simpler baseline that does not attempt non-contiguous access?
- Defend the choice of a DRAM row as the unit of locality tracking. Provide a sensitivity analysis or comparative data showing that the cache pollution induced by this coarse granularity is less detrimental to performance than the metadata overhead required by a finer-grained (e.g., 64B or 128B chunk) locality monitor.
- The area estimation for FALA-CAS is insufficient. Please provide a more direct and rigorous analysis of the hardware overhead. At a minimum, this should include a block-level diagram illustrating the new decoders, multiplexers, and control logic within the DRAM bank and discuss the routing implications on the critical data paths.
- Quantify the performance penalty (in terms of cycles lost or percentage slowdown) incurred by the 6.32% of requests that result in a "type conflict" flush. Explain why this issue was not resolved in the primary design instead of being postponed as a potential extension.
- The claimed 7.15x speedup over a high-end RTX A6000 GPU (Section 7.6) is extraordinary. Provide a detailed breakdown of the execution time on the GPU, identifying the specific bottlenecks (e.g., memory divergence, kernel launch overhead, thread scheduling) that account for its relatively poor performance and explain precisely how the FALA-L architecture alleviates these specific bottlenecks to achieve such a dramatic improvement.
Review 2
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents FALA, a hybrid hardware architecture for accelerating graph processing that synergistically combines a host-side accelerator with a Processing-In-Memory (PIM) system. The core contribution is a co-designed system that dynamically partitions graph computations based on data locality. This is achieved through three key innovations: 1) a lightweight PIM unit (FALA-PIM) designed specifically for the low-arithmetic-intensity Reduce phase of vertex-centric processing; 2) a fine-grained, non-contiguous column access mechanism in DRAM (FALA-CAS) that mitigates the classic overfetching problem when accessing small vertex properties; and 3) a novel Row Locality Monitor that guides the decision of whether to process data in PIM (for low-locality data) or on the host's cache (for high-locality data). By intelligently integrating these components, FALA aims to harness the high internal bandwidth of PIM for random accesses and the data reuse benefits of on-chip caches for frequent accesses, addressing a fundamental tension in graph accelerator design.
Strengths
The primary strength of this paper is its elegant synthesis of several key ideas in the accelerator community into a single, cohesive, and well-motivated system. The authors have correctly identified the core trade-offs in the field and have proposed a solution that doesn't just pick a side but seeks a pragmatic middle ground.
-
Excellent Problem Framing and Motivation: The paper does a superb job of contextualizing its work. The introduction (Section 1, page 1) and motivational study (Section 3, page 3) clearly delineate the landscape into cache-based, PIM-based, and existing hybrid approaches. The spider chart in Figure 2 is particularly effective at visually summarizing the design space and positioning FALA as a holistic solution.
-
Pragmatic Co-Design: FALA is a strong example of hardware-software co-design. Instead of treating the memory system as a black box, the authors propose targeted modifications (FALA-CAS) that directly enable their higher-level scheduling policy (the Row Locality Monitor). The decision to track locality at the DRAM row level is a clever insight, as it aligns the monitoring granularity with the physical structure of the memory, making it both effective and resource-efficient.
-
Addresses a Fundamental Bottleneck: The mismatch between DRAM burst sizes (e.g., 32B) and the granularity of graph data (e.g., 4B/8B vertex properties) is a well-known, foundational problem. While prior work like Piccolo [78] has addressed this with in-memory scatter-gather, FALA integrates this concept directly with a PIM execution model. This allows FALA to not only reduce data movement for cache-based execution but also make its PIM operations significantly more efficient.
-
Comprehensive Evaluation: The evaluation is thorough and convincing. The choice of baselines—GraphDyns (cache), GraphPIM (PIM), PEI (hybrid), and Piccolo (fine-grained cache)—covers the design space comprehensively. The ablation study (Figure 10, page 10) is crucial, as it clearly isolates the performance contribution of each of FALA's components (PIM, CAS, and Locality-aware cooperation), substantiating the authors' design claims.
Weaknesses
The weaknesses of the paper are less about fundamental flaws and more about the scope and potential limitations of the proposed approach.
-
Focus on the Reduce Phase: The architecture is heavily optimized for offloading the Reduce operation. While the paper correctly identifies this as a memory-bound phase with low arithmetic intensity, the performance of the overall system is still coupled to the host's ability to execute the Process and Apply phases. In workloads where these phases also present bottlenecks, the benefits of FALA might be less pronounced.
Sensitivity to Data Layout: The Row Locality Monitor operates at the granularity of a DRAM row. This is pragmatic, but it implicitly assumes that vertices with high locality are spatially clustered into the same rows. If a graph layout interleaves "hot" and "cold" vertices within the same DRAM row, the monitor might make suboptimal decisions, either polluting the cache with low-reuse data or unnecessarily offloading high-reuse data to PIM. The impact of data layout on the monitor's effectiveness could be explored more deeply.
-
Required DRAM Modifications: While the proposed hardware changes for FALA-CAS are presented as feasible, they still require non-trivial modifications to the DRAM core (specifically, the column decoding logic). This places the solution in the category of "custom memory" rather than a purely host-side accelerator that can leverage commodity DIMMs. While this is necessary to achieve the claimed benefits, it is a significant barrier to adoption that should be acknowledged.
Questions to Address In Rebuttal
-
The effectiveness of the Row Locality Monitor seems dependent on the spatial locality of vertex data within DRAM rows. Could the authors comment on how sensitive their mechanism is to different graph partitioning and data layout schemes? For example, how would FALA perform if high-degree and low-degree vertices were randomly interleaved in the address space?
-
The FALA architecture is presented in the context of graph processing. However, the core mechanisms—fine-grained column access and locality-based offloading—seem applicable to other domains with irregular memory access patterns (e.g., sparse linear algebra, database pointer-chasing, N-body simulations). Could the authors elaborate on the potential for generalizing FALA beyond graph analytics?
-
The PIM-Cache Cooperation Unit (Section 5.2, page 6) is central to coalescing requests and managing conflicts. The paper mentions handling "type conflicts" by flushing pending requests. Could you provide more detail on the frequency of such conflicts and the performance impact of the flush-based resolution? Have you considered more complex resolution strategies, and what would be the associated hardware cost?
Review 3
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper proposes FALA, a hybrid host-PIM architecture for accelerating graph processing. The central claim of novelty rests on a synergistic co-design of three main components: (1) a lightweight, in-DRAM processing unit (FALA-PIM) dedicated to the Reduce phase of graph algorithms; (2) a fine-grained column access mechanism (FALA-CAS) that allows non-contiguous 8B data access within a single DRAM burst; and (3) a locality-aware cooperation scheme, orchestrated by a Row Locality Monitor and a PIM-Cache Cooperation Unit, that dynamically decides whether to execute a Reduce operation in PIM or to fetch the data to a host-side fine-grained cache.
The core argument is that by tracking access locality at the DRAM row granularity, FALA can make more intelligent decisions than prior work, directing low-locality requests to PIM to leverage internal bandwidth and high-locality requests to the host cache to leverage data reuse, all while minimizing data movement via the fine-grained access mechanism.
Strengths
The primary strength of this paper is its novel architectural synthesis. While the individual concepts of PIM for graph processing, fine-grained DRAM access, and locality-based scheduling have been explored before, FALA integrates them into a coherent and compelling system.
The most significant novel concept is the use of a DRAM row-level locality monitor (Section 5.2, page 6) to arbitrate between two distinct execution pathways (PIM vs. host cache). Prior locality-aware PIM architectures, such as PEI [2], operate at the granularity of a traditional cache line. The authors correctly identify that this is a mismatch for sparse graph workloads, where a single hot vertex can cause an entire cache line of mostly cold data to be wastefully fetched. Tracking locality at the row level and using this information to guide execution is a clear conceptual advance.
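To illustrate the dispatch decision this row-level monitor enables, a toy software model follows; the row size, hotness threshold, table capacity, and LRU replacement used here are illustrative choices, not FALA's actual parameters or hardware organization.

```python
# Toy model of row-granularity locality monitoring and PIM/cache dispatch.

from collections import OrderedDict

ROW_BYTES = 1024        # assumed DRAM row size
HOT_THRESHOLD = 2       # accesses before a row counts as high-locality
MONITOR_ENTRIES = 4096  # assumed capacity of the locality table

monitor = OrderedDict()  # row address -> access count, LRU-ordered

def dispatch(addr: int) -> str:
    """Return 'cache' for rows observed to be reused, 'pim' otherwise."""
    row = addr // ROW_BYTES
    count = monitor.pop(row, 0) + 1
    monitor[row] = count                  # re-insert as most recently used
    if len(monitor) > MONITOR_ENTRIES:
        monitor.popitem(last=False)       # evict the least recently used row
    return "cache" if count >= HOT_THRESHOLD else "pim"

print(dispatch(0x12345))   # pim   (first touch of this row)
print(dispatch(0x12340))   # cache (same 1KB row, now considered reused)
print(dispatch(0xABCDE0))  # pim   (cold row elsewhere)
```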
This central mechanism is made effective by the other components. FALA-CAS, while conceptually similar to the scatter-gather support in Piccolo [78], is uniquely positioned to service both the host's fine-grained cache misses and the PIM unit's data requests from the same underlying hardware modification. This dual-purpose integration is a clever piece of co-design.
Weaknesses
While the synergistic combination is novel, the constituent components are evolutionary developments of prior concepts. It is crucial to frame the contribution accurately.
-
Fine-Grained Column Access (FALA-CAS): The idea of enabling sub-burst, non-contiguous access within DRAM is not fundamentally new. Piccolo [78] introduced in-memory scatter-gather to build full cache lines from sparse vertex data, which is functionally very similar. O'Connor et al.'s "Fine-Grained DRAM" [62] also proposed mechanisms for more granular access to improve energy efficiency. The FALA-CAS implementation using odd/even MatGroups (Section 4.2, page 5) is a specific hardware proposal, but the conceptual groundwork has been laid by others. The novelty is in its application as a shared resource for a hybrid PIM/cache system.
-
PIM for Reduce Operations (FALA-PIM): Offloading simple, associative operations to near-bank logic is a well-established pattern in PIM research for graphs. Works like Tesseract [1] and GraphPIM [58] previously proposed PIM units to handle updates or atomic operations, which are functionally equivalent to the Reduce phase described here. FALA-PIM's design is described as "lightweight," but this is more of an implementation goal than a conceptual innovation.
-
Complexity of the Solution: The proposed solution requires non-trivial modifications to both the host memory controller (PIM-Cache Cooperation Unit, Row Locality Monitor) and the DRAM die itself (FALA-PIM logic, modified column decoders for FALA-CAS). While the performance gains of 1.56x over the state-of-the-art (Piccolo) are significant, the complexity cost is also high. The novelty must be weighed against this implementation burden.
Questions to Address In Rebuttal
The authors should use the rebuttal to sharpen their claims of novelty by differentiating their work more precisely from the closest prior art.
-
FALA vs. Piccolo [78]: The paper positions Piccolo as a key baseline. However, Piccolo's in-memory scatter-gather is also a fine-grained column access mechanism. Please clarify the fundamental architectural difference between FALA-CAS and Piccolo's mechanism. Beyond that, the core difference appears to be FALA's explicit PIM path. Could the Piccolo architecture not be augmented with a similar locality monitor to achieve a similar result? What is fundamentally novel about FALA's integration that prevents this?
-
FALA vs. PEI [2]: The key differentiator from PEI seems to be the locality monitoring granularity (DRAM row vs. cache line). Please provide data on the practical difference this makes. For example, what percentage of DRAM rows flagged as "high-locality" contain a data sparsity level (e.g., <25% of resident vertex properties are actually hot) that would make a conventional cache-line-based fetch inefficient? This would quantify the direct benefit of your core novel mechanism.
-
Robustness of Row-Level Locality: The entire decision-making process hinges on the Row Locality Monitor. What is the sensitivity of the system's performance to the size and organization (e.g., associativity) of this monitor? Real-world graphs have complex locality patterns; what happens when the working set of "hot" DRAM rows exceeds the monitor's capacity, leading to thrashing?
-
Scope of the Contribution: The architecture is heavily optimized around offloading the Reduce operation. While this is a known bottleneck, how does the FALA architecture benefit graph algorithms where the Process or Apply phases are more dominant, or where the Reduce operation is non-associative or too complex for the lightweight FALA-PIM unit? Does the novel contribution become less impactful in those scenarios?
Rethinking Tiling and Dataflow for SpMM Acceleration: A Graph Transformation Framework
Abstract
Sparse Matrix Dense Matrix Multiplication (SpMM) is a fundamental computation kernel across various domains, including scientific computing, machine learning, and graph processing. Despite extensive research, existing approaches optimize SpMM using loop ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present "Aquila," a comprehensive framework for Sparse Matrix-Dense Matrix Multiplication (SpMM) acceleration. The central thesis is that traditional matrix-based representations are fundamentally flawed for handling irregular sparsity. To address this, they propose a multi-stage solution involving: (1) reformulating SpMM as a graph problem, (2) a runtime, non-contiguous tiling strategy based on "vertex decomposition" and "adaptive depth traversal" (ADT), (3) a novel "Pull-after-Push" (PaP) hybrid dataflow, and (4) a corresponding "Bidirectional Fiber Tree" (BFT) data format. The authors claim this full-stack approach yields significant speedups and energy efficiency improvements over several state-of-the-art accelerators. While the ambition is noted, the work introduces substantial complexity and runtime overheads, and its core contributions appear to be insufficiently distinguished from established principles in graph partitioning, raising questions about the necessity of the proposed hardware.
Strengths
- Problem Identification: The paper correctly identifies a key limitation of conventional SpMM accelerators: the reliance on rigid, coordinate-aligned tiling schemes that fail to capture the true data locality inherent in the connectivity patterns of sparse matrices.
- Full-Stack Approach: The authors have undertaken a comprehensive design, considering the problem from the theoretical level of graph abstraction down to the microarchitecture of processing elements and a custom data format.
- Scope of Evaluation: The evaluation is conducted across a diverse set of datasets with varying sparsity patterns and dimensions, which is a necessary condition for validating a general-purpose SpMM solution.
Weaknesses
-
Unjustified Runtime Complexity: The core of the proposed method is the Non-Contiguous Tiling (NCT) Engine, which performs graph traversal and partitioning (vertex decomposition, ADT) at runtime. This is a fundamentally expensive operation. The authors' own analysis (Figure 18, page 11) reveals that this "preprocessing" consumes up to 14.74% of the total execution time for the MIN dataset. For many latency-critical applications, this level of front-loaded overhead is unacceptable. The paper fails to provide a convincing argument for why this complex, power-hungry runtime engine is superior to using well-established, offline graph partitioning algorithms (e.g., METIS, k-way partitioning) that could achieve similar or better tile quality without any runtime cost.
-
Questionable Novelty of Tiling Heuristics: The proposed "Adaptive Depth Traversal" is presented as a novel technique. However, it appears to be a constrained breadth-first/depth-first search heuristic aimed at maximizing two-hop path locality (Section 3.1.2, page 4). The fundamental principles are not new and are related to community detection and graph partitioning. The paper lacks a direct, quantitative comparison against existing, highly-optimized sparse matrix reordering and partitioning libraries. Without this comparison, the claims of superiority for ADT are unsubstantiated.
-
Introduction of New, Unanalyzed Bottlenecks: The "vertex decomposition" strategy splits high-degree vertices into multiple "child" vertices, deferring their final accumulation to a centralized "Child-Parent Aggregator" (CPA) unit (Section 5.3, page 9). The authors claim this "decouples irregular inter-tile reductions," but it appears to merely shift the synchronization bottleneck from the PEs to the CPA. The paper provides no analysis of the potential for contention within the CPA. When multiple PEs finish processing child vertices of the same parent simultaneously, the CPA's Parent-Child Table and internal buffers will become a point of contention, effectively serializing the final aggregation step. This critical performance characteristic is ignored.
-
Understated Data Format Overheads: The proposed Bidirectional Fiber Tree (BFT) format is critical for the PaP dataflow. However, the overhead analysis (Section 4.1, page 6) admits to a worst-case storage overhead of "approximately 33% relative to CSR or CSC." This is a massive increase in memory footprint and, consequently, off-chip memory bandwidth requirements—often the primary bottleneck in SpMM. This cost is not adequately justified. The benefits of the PaP dataflow must be monumental to compensate for a 33% increase in data movement from DRAM.
-
Weak Ablation Study Baseline: The ablation study presented in Figure 22 (page 12) measures the contributions of "Tiling only" and "PaP only" against a baseline of "simple index-based tiling with pull-based dataflow." This is a strawman argument. A pull-based (inner-product or row-wise) dataflow is known to have poor input matrix reuse, as the paper itself states in Section 1. A more rigorous ablation would have compared the proposed tiling against a state-of-the-art tiling scheme (e.g., HotTiles) and the PaP dataflow against a state-of-the-art push-based (outer-product) dataflow to fairly assess the incremental benefit of each component. The current baseline likely inflates the reported contributions.
Questions to Address In Rebuttal
- Please quantify the cycle latency and energy consumption of the NCT Engine for each benchmark. Provide a clear justification for why incurring this substantial runtime overhead is superior to applying an offline, state-of-the-art graph partitioner to the sparse matrix before execution, which would require no specialized hardware.
- Provide a direct, quantitative comparison of the tiling quality (measured by metrics such as inter-tile edges/communication volume and intra-tile density) generated by your runtime ADT versus an offline tool like METIS.
- Detail the contention model for the Child-Parent Aggregator (CPA). What is the maximum throughput of the Parent-Child Table and its associated buffers? At what rate of incoming child-vertex results from the PE array does the CPA become a system bottleneck?
- Provide a detailed analysis of the BFT format's storage overhead across all evaluated datasets, not just the worst-case theoretical bound. How does the resulting increase in off-chip memory bandwidth demand impact overall system performance and energy, particularly for memory-bandwidth-bound scenarios?
- Please justify the choice of baseline in the ablation study (Figure 22). How would the standalone performance of "Tiling only" and "PaP only" compare against more competitive baselines, such as a state-of-the-art adaptive tiling method or a pure outer-product dataflow, respectively?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents a fundamental rethinking of hardware acceleration for Sparse Matrix-Dense Matrix Multiplication (SpMM). The authors argue that the primary limitations of existing accelerators stem from their reliance on a matrix-coordinate-based view, which fails to capture the intrinsic data dependencies in unstructured sparse patterns. The core contribution is a paradigm shift: reformulating SpMM as a graph optimization problem from the ground up. This graph-centric perspective informs a novel set of techniques spanning theory, algorithm, and architecture.
The key contributions are: 1. A theoretical framework for non-contiguous tiling based on graph vertex decomposition and adaptive depth traversal (ADT). This allows the accelerator to group computationally related nonzeros regardless of their row/column indices, breaking from the rigid grid structure of traditional tiling. 2. A Pull-after-Push (PaP) dataflow that naturally emerges from this graph traversal model. This hybrid dataflow dynamically switches between push (column-wise) and pull (row-wise) operations to simultaneously maximize data reuse for both the input dense matrix (B) and the output matrix (C). 3. A specialized Bidirectional Fiber Tree (BFT) data format and a hardware Non-Contiguous Tiling (NCT) Engine that can implement these concepts at runtime.
The authors embody these ideas in a proposed accelerator, Aquila, which demonstrates significant speedups (2.7x-4.3x) and energy efficiency improvements over several state-of-the-art SpMM and GNN accelerators across a diverse set of sparse datasets.
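As a reference point for the comments below, the push-then-pull interplay can be captured in a few lines of Python. This is a functional model only (plain Python dicts, no tiling, no BFT format) that reflects this reviewer's reading of the dataflow, not the authors' implementation; all names are illustrative.

```python
# Functional model of one pull-after-push step for C = A * B with sparse A.
# Push: one column of A scatters a scaled row of B into partial sums of C.
# Pull: each row of C touched by the push gathers its remaining nonzeros so
# the row is finished while it is still resident on chip.

def pap_step(col_j, A_cols, A_rows, B, C, done_rows):
    touched = []
    for i, a_ij in A_cols.get(col_j, []):            # push phase
        if i in done_rows:
            continue
        C[i] = [c + a_ij * b for c, b in zip(C[i], B[col_j])]
        touched.append(i)
    for i in touched:                                 # pull phase
        for k, a_ik in A_rows[i]:
            if k == col_j:
                continue                              # already pushed above
            C[i] = [c + a_ik * b for c, b in zip(C[i], B[k])]
        done_rows.add(i)

# Toy 2x3 sparse A and 3x2 dense B.
A_cols = {0: [(0, 2.0)], 1: [(0, 1.0), (1, 3.0)], 2: [(1, 4.0)]}
A_rows = {0: [(0, 2.0), (1, 1.0)], 1: [(1, 3.0), (2, 4.0)]}
B = [[1.0, 1.0], [2.0, 0.5], [0.0, 1.0]]
C, done = [[0.0, 0.0], [0.0, 0.0]], set()
pap_step(0, A_cols, A_rows, B, C, done)
print(C, done)   # row 0 of C is complete: [2*1 + 1*2, 2*1 + 1*0.5] = [4.0, 2.5]
```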
Strengths
-
A Compelling Conceptual Leap: The paper's primary strength is its successful re-conceptualization of the SpMM problem. Instead of treating graph analysis as a preprocessing step to regularize a matrix for a traditional accelerator (as seen in works like I-GCN), this work embeds graph traversal as the central organizing principle of the hardware itself. The idea of "non-contiguous tiling" (Section 3.1, p. 3; Figure 4, p. 4) is particularly powerful, as it offers a principled, bottom-up approach to discovering data locality that is simply inaccessible to conventional loop-based transformations. This feels less like an incremental improvement and more like a genuinely new perspective on the problem.
-
Holistic, Full-Stack Design: The work is impressive in its completeness. The authors have developed a cohesive solution that connects high-level theory to low-level implementation. The chain of logic—from the graph abstraction, to the vertex decomposition algorithm, to the PaP dataflow, to the BFT data format needed to support it, and finally to the NCT engine and PE microarchitecture—is clear and well-justified. This end-to-end thinking strengthens the credibility of the proposal significantly.
-
Elegant Unification of Competing Dataflows: The push-pull dichotomy has long defined the trade-off space for SpMM dataflows. The proposed PaP dataflow (Section 4, p. 5) is an elegant solution that is not merely an ad-hoc hybrid but a direct consequence of the graph traversal model. By first pushing from a source vertex (column) and then pulling to complete target vertices (rows), the dataflow naturally maximizes reuse opportunities as they are discovered, rather than being statically chosen based on loop order. This connects the work to the broader context of dataflow design in accelerators like Eyeriss, MAERI, and SIGMA, but provides a novel, sparsity-driven motivation.
-
Broad and Relevant Evaluation: The authors evaluate their design across a wide spectrum of application domains, including scientific computing, graph analytics, GNNs, and even sparse attention patterns from Transformers (Table 1, p. 9). This demonstrates the generality and potential impact of their approach. By showing strong performance on both "traditional" sparse matrices and those emerging from machine learning, the work positions Aquila not as a niche GNN accelerator, but as a truly general-purpose sparse computing engine.
Weaknesses
-
Overhead of Dynamicism: The NCT Engine performs sophisticated graph traversal, vertex decomposition, and conflict management at runtime. While the results are impressive, the paper could benefit from a deeper analysis of the performance and energy overhead of this online processing. The preprocessing analysis in Section 6.7 (p. 11) shows it's a small fraction of total time, but it's not clear how this overhead scales with graph density or size. Is there a "complexity cliff" where the dynamic scheduling cost begins to erode the benefits, for instance, in matrices that are nearly dense or, conversely, extremely sparse? The work would be stronger if it better characterized the boundaries of its own effectiveness.
-
Contextualization Against Offline Software Optimizations: The paper provides a strong comparison against other hardware accelerators. However, a crucial point of comparison is missing: how does Aquila's online, hardware-based graph partitioning compare to a state-of-the-art offline, software-based graph partitioning (e.g., using METIS or other hypergraph partitioners) followed by execution on a more traditional accelerator like Sextans or SPADE? This would help disentangle the benefits of the graph-centric approach itself from the benefits of implementing it dynamically in hardware. It's plausible that for static graphs, an offline approach could achieve similar results with a much simpler hardware design.
-
Practicality of the Bidirectional Fiber Tree (BFT) Format: The BFT format is key to enabling the PaP dataflow. While conceptually sound, its practical implications deserve more discussion. How is this format generated from standard formats like CSR? Is this a one-time cost performed on the host, or does it need to be done dynamically? The worst-case storage overhead is mentioned (Section 4.1, p. 6), but an empirical analysis of its overhead on the evaluated datasets would provide valuable context for architects considering such a format.
Questions to Address In Rebuttal
-
Regarding the runtime overhead of the NCT Engine: Can the authors provide more insight into how the preprocessing time (shown in Figure 18) scales with key graph properties like average degree and diameter? Is there a point at which the graph traversal becomes a bottleneck, and could this limit the scalability of the architecture to a larger number of PEs?
-
To better isolate the contribution, could the authors comment on the performance of a baseline system that combines a sophisticated offline software reordering/partitioning algorithm (designed to maximize inter- and intra-block reuse) with a state-of-the-art matrix-based accelerator like Sextans? How would Aquila's dynamic, hardware-based approach compare against such a software-hardware co-optimized baseline?
-
Could the authors elaborate on the generation of the Bidirectional Fiber Tree (BFT) format? What is the computational cost of converting from a standard format like CSR to BFT, and where is this conversion assumed to take place in the execution pipeline?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents Aquila, a framework and accelerator for Sparse Matrix-Dense Matrix Multiplication (SpMM) that reframes the problem from a traditional linear algebra perspective to one of graph optimization. The authors claim novelty in three primary areas: (1) A graph-based tiling method called "non-contiguous tiling," enabled by "vertex decomposition" and "adaptive depth traversal (ADT)," which clusters non-adjacent matrix elements based on graph connectivity. (2) A hybrid "pull-after-push (PaP)" dataflow designed to maximize reuse of both input and output dense matrices. (3) A "bidirectional fiber tree (BFT)" data format created to efficiently support the PaP dataflow. The authors implement these ideas in a detailed accelerator architecture and demonstrate significant performance gains over prior work.
My review focuses exclusively on the novelty of these core contributions by situating them within the landscape of prior art. While the performance results are strong, the justification for publication hinges on whether the underlying ideas represent a genuine conceptual advance.
Strengths
The primary strength of this work is its holistic and consistent application of the graph paradigm to co-design the algorithm, dataflow, and data format for SpMM acceleration.
-
A Co-Designed Framework: The most novel aspect is the tight integration of the three core ideas. While individual elements may have conceptual predecessors, the design of a specific dataflow (PaP) that requires a custom data format (BFT), which in turn processes tiles generated by a novel graph-based tiling algorithm (ADT), is a comprehensive and original contribution. The framework as a whole is novel.
-
Non-Contiguous Tiling as a Runtime Primitive: The concept of grouping matrix elements by connectivity rather than by coordinate is the paper's most significant novel claim. While related to offline graph partitioning and clustering, the proposal of ADT as a lightweight, streaming, hardware-based algorithm for generating these tiles at runtime is new. This moves partitioning from a pre-processing step into a dynamic part of the execution pipeline, which is a novel approach for SpMM accelerators.
-
Pull-after-Push (PaP) Dataflow Mechanism: The idea of a hybrid push/pull dataflow is not fundamentally new. However, the specific mechanism proposed here—a tightly-coupled traversal that first pushes contributions from a column and then immediately pulls remaining contributions for the affected rows—is a novel and specific control policy. It is more nuanced than prior work that might switch modes based on coarser granularities like vertex degree or execution phase.
Weaknesses
The paper's primary weakness is that it sometimes overstates the novelty of its foundational concepts and does not sufficiently differentiate its specific algorithms from conceptually similar prior art.
-
Conceptual Overlap with Graph Clustering in GNN Accelerators: The proposed non-contiguous tiling via ADT bears a strong resemblance to graph clustering techniques used to improve locality in GNN accelerators. Specifically, the "islandization" technique from I-GCN [15] also aims to identify and process densely connected subgraphs ("islands") to improve data reuse. While the authors cite I-GCN, the paper does not provide a deep enough analysis of the algorithmic delta between ADT and islandization. Both appear to be forms of localized graph traversal to find communities. The novelty seems more in the specific hardware implementation (the NCT Engine in Section 5.1, page 7) than in a fundamentally new graph theory principle.
-
Vertex Decomposition is a Known Technique: The use of "vertex decomposition" (Section 3.1.1, page 4) to split high-degree vertices is a well-established technique in graph processing systems to handle workload imbalance in power-law graphs. The contribution here is the application of this known technique to the specific problem of managing reuse for the dense B and C matrices in SpMM hardware. The paper should be more precise and frame this as an application of a known method rather than a novel technique in itself.
-
Framing of Graph Representation: The paper frames the shift to a graph representation of SpMM as a key insight. However, this is the standard model for Graph Neural Networks and many graph processing frameworks where SpMM (or its variant, SpMSpV) is the core computational kernel. The novelty is not in the representation but in using that representation to derive a new hardware tiling and dataflow strategy.
Questions to Address In Rebuttal
-
ADT vs. Islandization: Please provide a detailed algorithmic comparison between the proposed Adaptive Depth Traversal (ADT) and the "islandization" technique in I-GCN [15]. Beyond implementation differences, what are the fundamental trade-offs in terms of locality discovery, workload balance, and overhead? What makes ADT a distinct and superior approach?
-
Novelty of Vertex Decomposition: Could the authors please clarify the novelty of vertex decomposition? Please situate this technique relative to prior work in graph processing systems that use node splitting or virtualization to handle "supernodes" (high-degree vertices).
-
Justification for Runtime Tiling Complexity: The NCT Engine (Figure 10, page 7) introduces significant hardware complexity to perform graph traversal and tiling at runtime. What is the performance impact of using a standard, offline graph partitioner (e.g., METIS) to generate non-contiguous tiles as a pre-processing step, and then feeding those to the rest of the Aquila architecture? Please justify why the complexity of a runtime, online approach is necessary and superior to a simpler offline one.
-
PaP Dataflow Precedents: Are the authors aware of any prior work in hybrid GNN accelerators or graph processing systems that employ a similarly fine-grained, coordinated switch between push and pull traversals within the processing of a single task or subgraph? The authors should further solidify the novelty claim for the PaP control policy.
Boosting Task Scheduling Data Locality with Low-latency, HW-accelerated Label Propagation
Abstract
Task Scheduling is a popular technique for exploiting parallelism in modern computing systems. In particular, HW-accelerated Task Scheduling has been shown to be effective at improving the performance of fine-grained workloads by dynamically assigning ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose Task-LP, a hardware accelerator for the Label Propagation (LP) algorithm, integrated into a RISC-V many-core system to improve data locality for task-based parallelism. The system, prototyped on an FPGA, dynamically builds a task dependency graph, uses the Task-LP accelerator to find community structures (clusters), and then leverages a Task Placement Engine (TPE) to guide task scheduling. The TPE attempts to balance data locality with core utilization through a relaxation policy. The evaluation is conducted using synthetic graph benchmarks and a suite of HPC applications.
While the paper presents a comprehensive hardware implementation and evaluation, I find its central claims are predicated on a series of questionable methodological choices and an optimistic interpretation of mixed results. The work is not yet rigorous enough to substantiate its conclusions.
Strengths
- Complete System Prototype: The primary strength of this work is the implementation and integration of a full hardware/software system on an FPGA. This demonstrates considerable engineering effort, moving beyond pure simulation to a tangible prototype.
- Identification of a Core Problem: The paper correctly identifies and articulates the fundamental tension between maximizing data locality and maintaining high core utilization (i.e., load balancing) in task-based systems. The proposed Rlen parameter is a direct, albeit simplistic, mechanism to address this trade-off.
- High-Speed Clustering Core: The raw cycle counts for the Task-LP hardware module to reach convergence are impressively low (Section 7.3, pg. 11), demonstrating that a hardware implementation of this specific algorithm can indeed be very fast.
Weaknesses
My concerns with this paper are significant and center on the validity of the experimental methodology and the interpretation of the results.
-
Fundamentally Flawed Performance Baseline: The headline claim of being up to "581× faster than an equivalent software implementation" (Abstract, pg. 2) is based on a deeply unfair comparison. The authors compare cycle counts from their specialized hardware accelerator against a C-based igraph implementation running on a high-end AMD 7950x3D processor (Section 7.3, pg. 11). This is an apples-to-oranges comparison. A scientifically valid baseline would be a software implementation of LP running on the same RISC-V cores within their own system. Without this comparison, the 581x figure is meaningless and serves only to inflate the contribution's perceived performance. It is highly probable that the data movement and general-purpose nature of the x86 system account for a vast majority of this difference, not just the algorithm's execution.
-
Unrealistic Experimental Platform: The FPGA evaluation platform runs the RISC-V cores at 30 MHz while the HBM operates at 225 MHz (Table 2, pg. 9). The authors state they "compensate" for this unrealistic clock ratio by enforcing a fixed 220-cycle memory latency. This is a gross oversimplification of a modern memory subsystem. It ignores the complexities of cache hierarchies, memory controllers, contention, and interconnects, which do not behave as a fixed-latency black box. All performance results (speedups, task cycles) are therefore built on an unsound foundation, and it is unclear if they would translate to a real-world ASIC system where clock ratios are far more balanced.
-
Unsupported and Overstated Performance Claims: The paper highlights "up to 1.50×" speedups. However, the data presented in Figure 6a (pg. 10) shows a far more complex reality. While some benchmarks benefit, cholesky experiences a slowdown, and jacobi shows negligible improvement. A single negative result or a null result is often more informative than several positive ones. The paper fails to provide a convincing root-cause analysis for the slowdown, which is a critical omission. Relying on "up to" metrics without providing geometric means and a thorough analysis of failures constitutes cherry-picking.
-
Unaddressed Architectural Scalability: The core data structure for Task-LP is an adjacency matrix, which has a memory and logic complexity of Ω(N²), where N is the number of in-flight tasks (Nnodes). The authors dismiss this concern by stating a 128-node matrix is only 2 KiB (Section 4.3, pg. 7). This misses the point. The area synthesis results in Figure 3 (pg. 7) show area growing at ≈N^1.58, which is super-linear. This suggests the architecture will not scale efficiently to the larger context windows needed for more complex applications. The separate x86 scalability study in Section 7.5 (pg. 11) feels like an admission of this limitation; if the proposed hardware architecture were truly scalable, this ad-hoc software study on a different platform would be unnecessary. (A back-of-the-envelope check of these scaling figures follows after this list.)
-
The Rlen Parameter is a Tacked-on Fix, Not a Principled Solution: The mixed utilization-locality policy hinges entirely on the Rlen parameter (Section 5, pg. 8). The default value of 256 is presented without sufficient justification. Figures 7i and 7j (pg. 12) show that performance and utilization are highly sensitive to this value for synthetic workloads. However, no such sensitivity analysis is provided for the HPC benchmarks. Is Rlen=256 optimal for Cholesky? For Jacobi? Or is it a "one-size-fits-all" value that happens to work for some cases? Without this analysis, the TPE is not a robust solution but rather a heuristic that requires manual tuning.
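The scaling figures in the scalability point above are easy to sanity-check. The short calculation below uses only the numbers reported in the paper (a 2 KiB, 128-node adjacency matrix and ≈N^1.58 area growth); the extrapolation to larger context windows is this reviewer's own illustration, not the authors'.

```python
# Back-of-the-envelope check of the scalability point above. The 2 KiB figure
# at N = 128 and the ~N^1.58 area exponent come from the paper; extrapolating
# them to larger context windows is this reviewer's own illustration.
def adjacency_kib(n):
    return n * n / 8 / 1024          # one bit per (task, task) entry

for n in (128, 512, 1024):
    area_vs_128 = (n / 128) ** 1.58  # area relative to the evaluated 128-node unit
    print(f"N={n:4d}: adjacency matrix = {adjacency_kib(n):6.1f} KiB, area ~{area_vs_128:4.1f}x")
```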
Questions to Address In Rebuttal
The authors must address the following points directly to convince me of the validity of this work:
-
Please provide a performance comparison (in cycles) of your Task-LP hardware against a carefully written software implementation of Label Propagation compiled for and executed on one of the 30 MHz RISC-V cores in your system. This is the only baseline against which claims of "HW acceleration" can be fairly judged.
-
Can you provide a rigorous justification for why a fixed 220-cycle memory latency is a valid model for a real memory subsystem, given the 1:7.5 core-to-memory clock ratio on your platform? How would your results change if you modeled contention or other more realistic effects?
-
Provide a detailed, quantitative root-cause analysis for the performance degradation observed in the Cholesky benchmark (Figure 6a). What specific mechanism in your proposed architecture causes the application to run slower than the baseline?
-
What is the practical upper limit for Nnodes that your hardware architecture can support before the N^1.58 area scaling makes it prohibitively expensive? How does this limit impact the applicability of your approach to applications with thousands of in-flight tasks?
-
Please provide a sensitivity analysis for the Rlen parameter for the HPC benchmarks, similar to what was done in Figure 7i. How was the value of 256 chosen for the results in Figure 6a, and what is the performance impact of varying it for each benchmark?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces a novel hardware-centric approach to a long-standing problem in parallel computing: improving data locality in dynamic, fine-grained task-scheduling systems. The authors' core contribution is the design and evaluation of Task-LP, a low-latency hardware accelerator that performs graph clustering using the Label Propagation (LP) algorithm. This accelerator is integrated into a comprehensive scheduling system, the Task Placement Engine (TPE), which runs alongside a 24-core RISC-V system on an FPGA.
The system dynamically constructs a task dependency graph, uses Task-LP to identify "communities" of tasks that share data, and then uses these communities as strong hints for scheduling tasks to specific cores. Crucially, the system employs a mixed policy that can relax these locality hints to prioritize core utilization when necessary, striking a pragmatic balance between locality and parallelism. The work demonstrates significant performance improvements (up to 1.50x speedup) on both synthetic and HPC benchmarks, driven by substantial reductions in L1 cache misses.
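For context, the label propagation kernel being accelerated requires very little work per iteration, which is what makes a few-hundred-cycle hardware implementation plausible. The sketch below is a plain synchronous variant written by this reviewer purely for illustration; it is not the authors' hardware algorithm, and the tie-breaking and convergence handling are assumptions.

```python
# Minimal synchronous label propagation over a task dependency graph.
# Every task starts in its own community; each iteration a task adopts the most
# frequent label among its neighbours, until no label changes.
from collections import Counter

def label_propagation(adj, max_iters=10):
    labels = {v: v for v in adj}                  # each node seeded with its own label
    for _ in range(max_iters):
        changed = False
        for v, neighbours in adj.items():
            if not neighbours:
                continue
            best = Counter(labels[u] for u in neighbours).most_common(1)[0][0]
            if best != labels[v]:
                labels[v], changed = best, True
        if not changed:
            break
    return labels

# Two loosely coupled groups of tasks, {0, 1, 2} and {3, 4, 5}.
deps = {0: [1, 2], 1: [0, 2], 2: [0, 1], 3: [4, 5], 4: [3, 5], 5: [3, 4]}
print(label_propagation(deps))    # tasks in the same group converge to one label
```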
Strengths
-
High Conceptual Novelty and Significance: The central idea of embedding a sophisticated graph clustering algorithm directly into the hardware scheduling pipeline is highly novel and compelling. For decades, the field has wrestled with the trade-off between simple, fast, but locality-oblivious schedulers (e.g., random work-stealing) and complex software-based approaches that are too slow for fine-grained tasks. This work charts a third path: making a powerful heuristic fast enough (under 300 cycles per clustering, as noted in the Abstract on page 2) to be used "in the loop." This opens a new, promising design space for intelligent, data-aware hardware schedulers.
-
Excellent System-Level Integration and Evaluation: This is not a paper based on high-level simulation. The authors have implemented their entire system on an FPGA, featuring 24 RISC-V cores running Linux, a hardware dependency manager (Picos), and their new TPE and Task-LP modules (Figure 2, page 5). This full-system evaluation, complete with real-world HPC benchmarks (Section 7.2, page 10), lends enormous credibility to the results and demonstrates the practical feasibility of the approach.
-
Pragmatic and Thoughtful Design: The decision to use clustering results as a "hint" rather than a hard constraint is a mark of mature engineering. The "mixed utilization-locality policy" (Section 5, page 8), controlled by the Rlen parameter, directly addresses the fundamental tension between keeping cores busy and keeping data local. This prevents the pathological cases where a strict locality policy would lead to massive core idleness, making the system robust.
-
Connecting Disparate Fields: The work does an excellent job of synthesizing concepts from multiple domains: computer architecture (hardware accelerators), parallel programming models (task-based runtimes), and network science/algorithms (community detection). It successfully shows how a technique from the latter (Label Propagation) can be architecturally accelerated to solve a key problem in the former two.
Weaknesses
While the core idea and execution are strong, the work's primary weaknesses relate to the boundaries of its exploration and its potential limitations at a larger scale.
-
The "Sliding Window" Problem: The system's effectiveness is constrained by its context size of 128 in-flight tasks (Section 4.1.9, page 7). While the label-freezing mechanism is a clever way to preserve history, the accelerator can only directly optimize for locality within this relatively small, sliding window. In very large applications with sprawling task graphs, critical data dependencies may span far beyond this window, limiting the potential for global locality optimization. The paper acknowledges this, but the full impact on a different class of massively parallel applications remains an open question.
-
Scalability of a Centralized Resource: The evaluation is performed on a 24-core system. The TPE and Task-LP appear to be a centralized resource that all cores interact with. As core counts scale into the hundreds or thousands, this centralized unit could become a serial bottleneck for task submissions, dependency updates, and ready-task dispatch. While the ASIC area analysis (Section 4.3, page 7) is promising, a more detailed analysis of the throughput and contention on the scheduler's interfaces in a many-core context would strengthen the scalability claims.
-
Limited Exploration of the Heuristic Design Space: The authors make a convincing case for Label Propagation as a hardware-friendly algorithm. However, this work essentially opens up the entire field of "hardware-accelerated scheduling heuristics." It would be insightful to discuss where LP sits on the spectrum of complexity versus benefit. Are there even simpler, structural heuristics (e.g., based on parent/grandparent location) that could be implemented with even lower overhead and achieve, say, 80% of the benefit? The comparison with Immediate Successor (IS) is a good start, but the broader design space is vast.
Questions to Address In Rebuttal
-
Regarding the 128-node context window: Could the authors elaborate on the types of task graphs or application structures where this limitation would be most pronounced? For instance, would breadth-first-like task-spawning patterns be more negatively affected than depth-first ones?
-
Could the authors comment on the scalability of their centralized design to systems with hundreds of cores? Specifically, what are the anticipated bottlenecks in the communication paths to and from the Task Placement Engine, and are there architectural pathways (e.g., hierarchical or distributed TPEs) to mitigate them?
-
The relaxation cycle length (Rlen) is presented as a static parameter. Have the authors considered a dynamic control system where the TPE could adjust Rlen at runtime based on metrics like average core idleness or observed L1 miss rates, thereby making the locality/utilization trade-off adaptive? (A toy illustration of such a controller follows below.)
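To illustrate the question above, one possible adaptive controller is sketched here. This is purely the reviewer's illustration of the suggestion; the target idleness, step size, and bounds are invented, and the paper itself uses a static Rlen of 256.

```python
# Illustrative adaptive controller for the relaxation cycle length: shorten
# R_len (relax locality hints sooner) when cores sit idle, lengthen it when
# utilisation is healthy. The target, step, and bounds are invented; the paper
# itself uses a static R_len of 256.
def adjust_rlen(rlen, idle_fraction, target_idle=0.05,
                step=32, rlen_min=32, rlen_max=1024):
    if idle_fraction > target_idle:    # cores starving: trade locality for work
        return max(rlen_min, rlen - step)
    return min(rlen_max, rlen + step)  # utilisation fine: favour locality

rlen = 256
for idle in (0.02, 0.12, 0.20, 0.03):  # idleness sampled each scheduling epoch
    rlen = adjust_rlen(rlen, idle)
    print(f"idle={idle:.2f} -> R_len={rlen}")
```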
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper proposes a hardware-accelerated, low-latency label propagation (LP) engine, named Task-LP, integrated into a dynamic task scheduling system for multi-core processors. The core idea is to perform online graph clustering on the dynamic task dependency graph to generate data locality hints. These hints are then consumed by a Task Placement Engine (TPE), which balances the goal of locality-aware placement against the need to maintain high core utilization. The authors claim this approach significantly improves performance for fine-grained task workloads by reducing cache misses. The entire system is implemented and evaluated on an FPGA-based RISC-V multi-core platform.
The primary novel claim is the synthesis of three concepts: (1) a specific graph clustering algorithm (Label Propagation), (2) its full acceleration in custom hardware to achieve convergence in hundreds of cycles, and (3) its direct integration into the critical path of a dynamic task scheduler to make online placement decisions.
Strengths
-
Novelty of Synthesis: The central contribution—the integration of a hardware-accelerated graph clustering algorithm directly into the critical path of a dynamic task scheduler—is, to my knowledge, novel. While prior art exists for hardware-assisted scheduling and for locality-aware software scheduling, none have proposed using an algorithm as complex as label propagation, enabled by bespoke hardware, for online, fine-grained task placement.
-
Enabling Contribution: The work convincingly demonstrates that hardware acceleration is the key enabling factor that makes this idea practical. The comparison in Section 7.3 (page 11) showing that the hardware implementation is up to 581x faster than an equivalent software implementation on a modern CPU is a powerful justification. It effectively argues that this class of sophisticated, locality-aware scheduling is infeasible without a hardware-centric approach, thereby carving out a new design space.
-
Clear Demarcation from Prior Art: The authors do a commendable job of positioning their work relative to existing techniques. They clearly distinguish their dynamic, graph-based approach from static, user-annotated systems (Section 1.2, page 3) and from simpler dynamic heuristics like Immediate Successor (IS), against which they directly compare (Section 7, page 10, Figures 7g, 7h). The novelty is not in "locality awareness" as a concept, but in the mechanism used to achieve it.
Weaknesses
-
Component-level Novelty is Limited: While the synthesis of these components is novel, the individual building blocks are well-established.
- Hardware Task Scheduling: Systems like Picos [54, 71, 84] and Carbon [45], cited by the authors, represent prior art in hardware acceleration for task management (e.g., dependency tracking, queueing).
- Label Propagation Algorithm: The algorithm itself is not new, as acknowledged by the authors' citation of the original work [64]. Its properties are well-understood in the context of software-based graph analytics.
- Locality-Aware Scheduling: The principle of co-locating dependent tasks is a foundational concept in parallel computing. The novelty here is strictly in the implementation method, not the overarching goal.
-
Narrow Exploration of the Novel Idea: The paper's exploration of hardware-accelerated clustering is confined exclusively to the Label Propagation algorithm. While LP is a reasonable choice for its implementation simplicity, the work does not explore or justify why other, potentially more effective, community detection algorithms were not considered for hardware acceleration. This limits the generality of the paper's conclusion; it is a case study on HW-accelerated LP, not a broader investigation into HW-accelerated clustering for scheduling.
-
Inherent Scalability Limitation of the Chosen Implementation: The novelty of the approach is constrained by an implementation choice that has poor scaling properties. The reliance on a full N-by-N adjacency matrix (Section 4.2.1, page 7) for the task graph representation has a memory complexity of O(N^2). This forces the authors to limit the context window to 128 nodes, which may be insufficient for applications with larger-scale parallelism where more distant task relationships are important. While this is an implementation detail, it fundamentally constrains the practical application of this novel approach.
Questions to Address In Rebuttal
-
On the Choice of Algorithm: Could the authors elaborate on the choice of Label Propagation beyond its conceptual simplicity? Have other community detection algorithms (e.g., Louvain, Girvan-Newman) been considered? If so, what makes them less suitable for a low-latency hardware implementation in this specific context of task scheduling? A brief analysis of the trade-offs would strengthen the justification for this core design choice.
-
On the Scalability of the Approach: The N^2 complexity of the adjacency matrix imposes a hard limit on the number of in-flight tasks (128). How do the authors envision this novel approach scaling to future systems that may need to manage thousands of in-flight tasks? Would a different graph representation (e.g., adjacency list) be feasible in hardware without sacrificing the low-latency goal that makes this work novel in the first place?
-
On the Novelty of the TPE Policy: The Task Placement Engine (TPE) uses a relaxation policy (controlled by R_len, as described in Section 5, page 8) to balance locality and core idleness. Is this policy itself a novel contribution, or is it an adaptation of existing work-stealing or load-balancing techniques? Clarifying the novelty of the interplay between the clustering hints and the final scheduling policy is important.
BitL: A Hybrid Bit-Serial and Parallel Deep Learning Accelerator for Critical Path Reduction
Abstract
As deep neural networks (DNNs) advance, their computational demands have grown immensely. In this context, previous research introduced bit-wise computation to enhance silicon efficiency, along with skipping unnecessary zero-bit calculations. However, we ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose BitL, a hybrid bit-serial and bit-parallel DNN accelerator designed to mitigate the critical path problem inherent in existing zero-bit skipping architectures. The core idea is to dynamically switch the computation direction between column-wise (bit-serial) and row-wise (bit-parallel) on a cycle-by-cycle basis. This direction is guided by offline, pre-computed metadata derived from an A* search on the weight sparsity patterns. A further optimization, "dynamic pivoting," allows PEs that would otherwise be idle to switch their computation direction independently to process available bits, aiming to maximize hardware utilization.
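To make the critical-path framing concrete, the sketch below uses a deliberately simplified cost model assumed by this reviewer: each cycle a PE group may retire the remaining essential bits of either one whole weight (a bit-parallel step) or one whole bit position across the group (a bit-serial step), and a greedy policy clears whichever direction is denser. It is a toy reconstruction, not the A*-based metadata generation or the actual BitL datapath.

```python
# Toy cost model (this reviewer's, not the paper's): a PE group holds a binary
# weight-bit matrix; each cycle it may clear the remaining '1' bits of either
# one whole row (bit-parallel step) or one whole bit-position column
# (bit-serial step). The greedy hybrid always clears the densest direction.

def cycles(matrix, policy):
    m = [row[:] for row in matrix]
    n_rows, n_cols = len(m), len(m[0])
    used = 0
    while any(any(r) for r in m):
        row_load = [sum(r) for r in m]
        col_load = [sum(m[i][j] for i in range(n_rows)) for j in range(n_cols)]
        pick_col = policy == "column" or (policy == "hybrid" and max(col_load) >= max(row_load))
        if pick_col:
            j = col_load.index(max(col_load))
            for i in range(n_rows):
                m[i][j] = 0
        else:
            m[row_load.index(max(row_load))] = [0] * n_cols
        used += 1
    return used

# One dense weight (row 0) and one dense bit position (column 0): the classic
# critical-path case for both unidirectional schemes.
bits = [[1, 1, 1, 1],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0]]
for p in ("column", "row", "hybrid"):
    print(p, cycles(bits, p))   # column: 4 cycles, row: 4 cycles, hybrid: 2 cycles
```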
Strengths
-
Problem Formulation: The paper correctly identifies a fundamental limitation in unidirectional zero-bit skipping accelerators. The motivating examples in Figure 1 (page 1) clearly and effectively illustrate how both purely bit-serial and purely bit-parallel approaches can suffer from critical path bottlenecks due to dense '1's in a single weight or a single bit-position, respectively.
-
Logical Solution Concept: The proposed hybrid execution model is a logical and direct response to the identified problem. Dynamically choosing the path of least resistance (i.e., the sparser dimension) on a per-cycle basis is a sound theoretical approach to reducing the total cycle count.
-
RTL-level Evaluation: The authors have gone beyond high-level simulation by implementing their design in Verilog and synthesizing it for the 45nm technology node (Section 5.1.2, page 9). This provides more credible area and power estimates than purely architectural simulators, assuming the implementation is sound.
Weaknesses
My analysis reveals several critical weaknesses in the methodology and reporting that question the validity and practical significance of the presented results.
-
Unjustified Simplification of Sparsity: The entire optimization strategy hinges on an offline analysis of weight sparsity only (Section 4.1, page 7). This is a critical oversimplification. The performance of a DNN accelerator is a function of both weight and activation sparsity. By generating static metadata based solely on weights, the paper ignores the dynamic nature of activations, which can create entirely different critical paths at runtime. A scenario with sparse weights but dense activations could render the pre-computed path suboptimal. The evaluation is therefore incomplete as it does not model this crucial interaction.
-
Suspicious Hardware Cost Reporting: The reported hardware area for the control logic is highly questionable. Table 3 (page 10) reports the 'Ctrl' area for BitL as 7,754.43 µm², which is substantially smaller than that of Bitlet (16,230.32 µm²) and BBS (11,520.53 µm²). This is counter-intuitive. BitL's control logic must manage metadata decoding, dynamic selection between two data paths (row/column), and the complex "dynamic pivot" mechanism for each PE. This functionality is demonstrably more complex than the control logic of Bitlet or BBS. The authors' claim of "simplifying the control logic" (Section 5.4, page 11) is unsubstantiated and lacks the technical detail to be credible. It is more likely that the area accounting is flawed or that the baselines were not implemented efficiently.
-
Underestimated Overhead of Data Transposition: The architecture relies on a "Wire Parser" to transpose the weight matrix from a column-wise format to a row-wise format for bit-parallel execution (Figure 8, page 8). For a 16x16 sub-tile, this is a 256-bit matrix transposition. Such a network incurs significant routing overhead in terms of both area and power, which is not explicitly broken out or adequately discussed. Furthermore, the "dynamic pivot" implies that each PE must have a data path to access both its assigned row and column slices. The paper dismisses this as "carefully localized input" (page 8), but the physical implementation of such dual-access wiring for every PE is non-trivial and its cost appears to be unaccounted for.
-
Inconsistent and Potentially Inflated Performance Claims: The abstract claims a "1.24× improvement over recent zero-bit skipping accelerators." The most recent and highest-performing baseline presented is BBS. According to Figure 10 (page 10), BitL achieves an average speedup of 1.74x over Stripes, while BBS achieves a speedup of approximately 1.49x (based on the 49% improvement mentioned on page 10). The calculated average improvement of BitL over BBS is therefore 1.74 / 1.49 ≈ 1.17x. The 1.24x figure is not substantiated by the average results and suggests cherry-picking of a specific model's best-case result, which is misleading.
Questions to Address In Rebuttal
-
Regarding Metadata and Sparsity: Please quantify the absolute metadata storage overhead and the A* search-based generation time for the largest evaluated models (e.g., LLaMA-2-7B). More importantly, justify the design decision to ignore activation sparsity. Provide data or a strong argument as to why a static, weight-only optimal path remains effective in the presence of dynamic activation patterns.
-
Regarding Hardware Implementation: Provide a detailed architectural diagram of the "Direction Controller" and the PE's datapath, clearly showing how both row and column data are routed to it to enable dynamic pivoting. Please provide a rigorous justification for the control area of BitL being reported as significantly smaller than that of simpler baseline architectures like Bitlet and BBS (Table 3). Were the baseline architectures re-implemented for this work? If so, how can their optimality be assured?
-
Regarding the Wire Parser: What is the specific implementation of the "Wire Parser" (e.g., a Benes network, a crossbar)? Please provide a post-synthesis area and power breakdown for this block specifically, as it appears to be a major source of overhead that is currently obscured within the "Calculator" unit's area.
-
Regarding Performance Claims: Please explicitly state which baseline, model, and conditions correspond to the "1.24× improvement" claim made in the abstract. If this is not an average figure, the abstract should be revised to reflect the average improvement, which is calculated to be substantially lower (~1.17x over BBS).
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces BitL, a deep learning accelerator designed to overcome a fundamental performance limitation in existing zero-bit skipping architectures. The authors identify that both purely bit-serial (column-wise) and bit-parallel (row-wise) computation schemes can suffer from "critical paths"—rows or columns with a high density of non-zero bits that bottleneck the entire computation group and diminish the benefits of sparsity.
The core contribution is a novel hybrid computation model that dynamically switches between column-wise and row-wise processing on a cycle-by-cycle basis. This allows the accelerator to always choose the more computationally sparse direction, effectively navigating around potential bottlenecks. The execution path is determined by a light-weight offline analysis using an A* search algorithm. This core idea is further refined with a "dynamic pivoting" mechanism that allows individual processing elements (PEs) to switch their orientation independently to utilize cycles where they would otherwise be idle. The work is evaluated extensively, demonstrating significant throughput and energy efficiency gains over a strong set of prior bit-wise accelerators.
Strengths
-
Elegant and Fundamental Core Idea: The paper's primary strength lies in its central thesis. Rather than proposing another incremental improvement within the established paradigms of bit-serial or bit-parallel processing, it unifies them. The recognition that these are two orthogonal ways of traversing a bit-matrix, and that the optimal traversal is data-dependent, is a powerful insight. The hybrid approach is an elegant and direct solution to the critical path problem, which is compellingly illustrated in Figure 1 (page 1).
-
Excellent Problem Motivation and Contextualization: The authors have done a superb job of positioning their work within the broader research landscape. Section 2 ("Background and Related Works") provides a clear narrative of the evolution from bit-parallel designs to bit-serial methods and their subsequent refinements (e.g., Stripes, Pragmatic, Bitlet, BBS). This contextualization makes it easy to understand the specific limitations of prior art and appreciate the novelty of BitL's approach. The analysis in Section 3.1, particularly Figure 3, provides convincing empirical evidence that the critical path problem is not a contrived corner case but a tangible issue in modern DNNs.
-
Demonstrated Generality and Robustness: A key success of this architecture is its applicability across a wide variety of models. The evaluation includes classic CNNs, modern ConvNets, Vision Transformers, and even Large Language Models (LLMs). The fact that the performance gains are consistent across these diverse architectures (Figure 10, page 10) underscores that BitL exploits a fundamental property of data sparsity, not a quirk of a specific model family. The compatibility with standard pruning (Figure 13, page 12) is also a significant practical advantage over architectures that require specialized co-designed pruning methods.
-
Holistic Design: The work is not just a high-level concept; it is a well-considered design. The authors present a software strategy (A* search for pathfinding), a hardware microarchitecture for the PEs and Core (Figure 8, page 8), and a dynamic mechanism (pivoting) to handle fine-grained inefficiencies. This end-to-end thinking strengthens the credibility of the proposal.
Weaknesses
While the work is strong, its primary trade-off lies in the shift of complexity.
-
Reliance on Offline Preprocessing: The dynamic nature of the hardware is guided by a static, offline analysis (Section 4.1). While this is a pragmatic choice that simplifies the runtime hardware, it introduces a preprocessing dependency. The A* search, while more efficient than BFS, still represents a computational cost that must be paid once per model. For scenarios involving rapid model iteration, fine-tuning, or future on-device learning, this offline step could become a bottleneck. The significance of this weakness depends heavily on the target application domain.
-
Increased Datapath and Control Complexity: To facilitate the hybrid execution, the hardware is inherently more complex than a purely unidirectional design. The PE requires data feeds for both rows and columns, a "Wire Parser" to transpose data for row-wise execution, and more sophisticated control logic in the Direction Controller to manage metadata and dynamic pivoting. While the authors' area results in Table 3 (page 10) are competitive (impressively, BitL is smaller than Bitlet and BBS), a qualitative discussion on the design and verification complexity of this flexible datapath would be valuable.
Questions to Address In Rebuttal
-
Scalability of Preprocessing: Could the authors comment on the scalability of the offline A* search? For the LLMs evaluated (e.g., LLaMA-2-7B), what was the approximate preprocessing time and memory overhead? How is this expected to scale for models in the 70B or 100B+ parameter range?
-
Broader Context of Hybrid Processing: The paper frames its novelty within the lineage of bit-serial accelerators. However, the idea of hybrid processing exists more broadly (e.g., architectures that use different compute units for dense vs. sparse tensors, or software that dispatches kernels based on sparsity). Could the authors contextualize their fine-grained, cycle-by-cycle hybridism against these coarser-grained hybrid strategies in the wider field of sparse computation?
-
Robustness to Non-Standard Sparsity: The motivation relies on the observed "Gaussian-like" distribution of bit patterns (Figure 3b, page 4). How would BitL's performance be affected by highly structured or unusual sparsity patterns that might arise from techniques like structured pruning or aggressive quantization schemes (e.g., ternary/binary networks)? Does the A* search still find effective paths, or could such patterns create scenarios that are dense in both directions?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper proposes BitL, a DNN accelerator architecture designed to mitigate the "critical path" problem inherent in existing bit-wise computation schemes. The authors correctly identify that both purely bit-serial (column-wise) and purely bit-parallel (row-wise) zero-skipping accelerators can be bottlenecked by a single dense row or column of '1's, respectively.
The core claim of novelty rests on a hybrid computation model that dynamically switches between column-wise (bit-serial) and row-wise (bit-parallel) processing on a cycle-by-cycle basis within a sub-tile. This processing path is pre-determined by an offline A* search algorithm to find the minimal number of cycles. A secondary novel contribution is the "dynamic pivoting" mechanism, where individual Processing Elements (PEs) can autonomously switch their processing direction mid-computation if their assigned path runs out of work, thus improving hardware utilization.
Strengths
-
Fundamentally Novel Dataflow: The central idea of a hybrid bit-serial and bit-parallel dataflow that can be reconfigured cycle-by-cycle is, to my knowledge, novel in the context of DNN accelerators. Prior art has focused on optimizing within a unidirectional framework. For instance, Pragmatic [1] and Laconic [27] improve bit-serial processing, Bitlet [21] optimizes bit-parallel processing, and the recent BBS [3] introduces bi-directional sparsity definition (skipping 0s or 1s) but maintains a unidirectional computation (column-by-column). BitL introduces a second degree of freedom in the computation itself, which is a significant conceptual departure.
-
Clear Problem Identification: The paper does an excellent job of articulating a genuine and previously under-appreciated limitation of bit-wise accelerators. The "critical path" problem, clearly illustrated in Figure 1 (page 1), is a fundamental performance ceiling that cannot be overcome by simply improving the efficiency of a single processing direction. Identifying and targeting this specific problem is a strength.
-
Elegant Secondary Optimization: The "dynamic pivoting" mechanism (Section 3.3, page 5) is a clever and novel solution to a problem created by the primary contribution itself—namely, idle PEs resulting from overlapping computation regions. Making this a local, autonomous decision at the PE level (as described in Algorithm 2, page 7) is an elegant way to handle this issue without complex global control.
Weaknesses
-
The "Delta" over SOTA is Modest for the Added Complexity: While the core idea is novel, the empirical performance benefit over the most recent state-of-the-art, BBS [3], is not overwhelming in the standard case. As per Figure 10 (page 10), the average speedup of BitL is 1.74x (vs. Stripes), while BBS is 1.69x. This is a marginal ~3% improvement. The paper's novelty is therefore more architectural than performance-based in this scenario. The added complexity—an offline A* search and more complex datapath/control logic—must be weighed against this modest gain.
-
Reliance on Offline Analysis: The entire scheme is predicated on a software-based pre-analysis using an A* search to determine the optimal path. This moves a significant amount of complexity from runtime hardware to offline software. This is a valid trade-off, but it is not without cost. The overhead of this search is not quantified. If weights are updated (e.g., in continual learning scenarios) or if different quantization schemes are used, this analysis must be re-run. This dependency reduces the dynamism of the solution compared to fully online schemes.
-
Hardware Complexity is Understated: The architecture requires each PE to have access to both a row-slice and a column-slice of the weight sub-tile. This is implemented via a "Wire Parser" (Figure 8, page 8), which appears to be a form of on-the-fly matrix transpose or a complex routing network. A 16x16 sub-tile requires routing from 16 row buffers and 16 column buffers to 16 PEs. The complexity, area, and potential timing delay of this routing network seem non-trivial and may be more significant than the component-level area breakdown in Table 3 suggests. This is a key piece of "new" hardware whose cost-benefit is central to the paper's claims.
Questions to Address In Rebuttal
-
On the Offline Search: Could the authors quantify the computational overhead of the A* search algorithm? Please provide the time taken to generate the metadata for a representative model like VGG or Swin-B. Is this search practical for very large models or in scenarios where weights might be frequently updated?
-
On the Wire Parser Implementation: Please provide more detail on the implementation of the "Wire Parser." Is it a full 16x16 crossbar for bit-level transposition? What is its specific contribution to the critical path delay and area of the overall BitL Core?
-
On Justifying Novelty vs. Performance: Given the modest average performance gain over BBS [3] in the non-pruned case, can the authors provide a more compelling argument for their contribution beyond the architectural novelty? For example, can you characterize the specific sparsity patterns where BitL shows a disproportionately large advantage over BBS, thereby demonstrating its unique value in handling corner cases that cripple unidirectional schemes? The strong results with pruning (Figure 13) are a good start, but a more fundamental analysis would strengthen the paper.
HiPACK: Efficient Sub-8-Bit Direct Convolution with SIMD and Bitwise Management
Abstract
Quantized Deep Neural Networks (DNNs) have progressed to utilize sub-8-bit data types, achieving notable reductions in both memory usage and computational expenses. Nevertheless, the efficient execution of sub-8-bit convolution operations remains ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper presents HiPACK, a collection of software techniques aimed at accelerating sub-8-bit direct convolution on SIMD architectures, specifically ARM NEON. The core contributions are: (1) decoupling the unpacking of packed multiplication results from the multiplication itself to enable SIMD parallelism, (2) rescheduling accumulation across input channels to occur before unpacking, (3) optimizing the segment bitwidth (g) to maximize accumulations before overflow, and (4) a Dual Interleaved Register (DIR) mechanism to further extend the accumulation capacity. The authors claim significant speedups over state-of-the-art libraries like QNNPACK and ARMNN.
While the proposed bit-manipulation techniques are technically sound and demonstrate a clear understanding of the microarchitectural challenges, the experimental evaluation appears to rely on idealized scenarios and potentially weak baseline comparisons, which may substantially inflate the reported performance gains and obscure the method's practical limitations.
Strengths
- Correct Problem Identification: The authors correctly identify the fundamental data dependencies in prior packing-based convolution methods (HiKonv) as the primary inhibitor to SIMD vectorization (Section 3, page 4). The analysis of sequential unpacking and inter-result dependency is precise.
- Sound Technical Contributions: The core ideas of rescheduling accumulation to occur on packed intermediates (Section 4.2) and the DIR mechanism for doubling accumulation space (Section 5.2) are clever and directly address the identified bottlenecks. These represent valid, well-reasoned micro-optimizations.
- Thorough Ablation Study: The paper includes an ablation study (Section 6.4, Table 5, Figure 16) that breaks down the performance contribution of each optimization. This is commendable and provides insight into the relative importance of each technique.
Weaknesses
-
Highly Idealized Kernel Benchmarks: The headline performance numbers (e.g., 169.16 GOPS in Figure 9) are derived from a single convolution layer with extremely large dimensions (1024 input channels, 1024 output channels). This configuration is maximally compute-bound, minimizing the relative impact of memory access, instruction cache misses, and other system-level overheads. This represents a best-case scenario that is not representative of the diverse layer parameters found in typical DNNs. The "up to" performance claims are therefore potentially misleading.
-
Questionable "Base" Comparison in Ablation Study: The ablation study (Table 5, page 12) reports that the "Base" implementation achieves only 2.93 GOPS. The paper defines this base case vaguely as including "common data blocking and input reordering techniques." For a 64-bit ARM Cortex-A72, a performance of ~3 GOPS for a 3x3 convolution is exceptionally low, suggesting it may be an unvectorized, naive implementation. This establishes a strawman baseline, making the subsequent 15.52x relative speedup appear far more significant than it might be against a reasonably optimized, but non-HiPACK, kernel.
-
Selective and Incompletely Characterized End-to-End Evaluation: The authors state in Section 6.3 that for end-to-end model evaluation, "only the 3 × 3 parallel convolution operations...are replaced." Modern architectures (e.g., MobileNets) rely heavily on 1x1 convolutions, for which HiPACK's own limitations section (5.3) admits its methods degenerate and offer no benefit over simpler packing schemes. The paper fails to quantify what percentage of the total model MACs in ResNet-18/34 are actually covered by their optimized kernels. Without this critical context, the end-to-end speedup claims (e.g., up to 1.7x over QNNPACK in Table 2) are uninterpretable and may represent a significant speedup on only a small fraction of the total workload.
-
Overstatement of General Kernel Support: The method for handling n x n kernels (Section 5.3) is a standard tiling approach that decomposes the problem into multiple calls to an n x 3 kernel. This is not a novel contribution. More importantly, the performance penalty for kernels with dimensions not divisible by 3 (e.g., 5x5, 7x7) is evident in Figure 14 but is not sufficiently discussed. The zero-padding required introduces computational waste, a practical limitation that is understated.
-
Unconvincing Comparison to Prior Art: The reported performance of the authors' re-implemented HiKonv is below 0.3 GOPS (Section 6.2.1). This is orders of magnitude lower than any other method and seems implausible for any serious implementation. While HiKonv was not designed for SIMD, this result suggests the baseline implementation was not competitive, thus inflating the relative gain of HiPACK.
Questions to Address In Rebuttal
-
Please provide a detailed specification for the "Base" implementation used in the ablation study (Section 6.4, Table 5). Specifically, was this baseline vectorized using NEON intrinsics? What compiler optimizations were enabled? Please justify why ~3 GOPS is a fair and representative starting point for a direct convolution kernel on this hardware.
-
For the end-to-end model results presented in Table 2 and Table 3, please quantify the percentage of total model Giga-MACs that are accounted for by the 3x3 convolutions accelerated by HiPACK for each model (VGG16, ResNet18, ResNet34, UNet).
-
The performance of n x n kernels where n is not a multiple of 3 shows a notable decrease in efficiency (Figure 14). Can the authors provide a quantitative analysis of the overhead incurred by the zero-padding strategy for 5x5 and 7x7 kernels?
-
Can the authors justify the reported performance of <0.3 GOPS for their HiKonv implementation? Is this a faithful and optimized C++ representation of the principles in the original paper [20], or a simplified, non-vectorized port that does not represent a competitive baseline?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents HiPACK, a software methodology for accelerating sub-8-bit direct convolution on modern SIMD architectures, specifically ARM NEON. The authors identify a critical bottleneck in existing "packing-based" convolution methods: while theoretically efficient at reducing multiplication operations, these techniques introduce sequential data dependencies during the unpacking of results, rendering them incompatible with SIMD parallelism.
The core contribution is a suite of systematic optimizations to resolve these dependencies. The authors propose (1) decoupling the unpacking phase from multiplication by caching intermediate packed results in SIMD registers, (2) rescheduling the accumulation of partial sums across input channels to occur before unpacking, drastically reducing the number of unpacking operations, and (3) a novel Dual Interleaved Register (DIR) mechanism that cleverly splits intermediate values into high and low bits to effectively double the accumulation capacity before overflow. These techniques collectively transform a theoretically powerful but impractical algorithm (packing-for-convolution) into a highly efficient, parallelizable kernel. The empirical results are strong, showing significant speedups (up to 1.7x) over state-of-the-art libraries like QNNPACK on real hardware.
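To fix intuition about what "accumulating on packed intermediates before unpacking" buys, here is a minimal unsigned-integer sketch (reviewer's own; HiPACK's NEON kernels, sign handling, segment overlap for convolution, and the exact segment width g all differ):

```python
B = 2                         # operand bitwidth (unsigned 2-bit weights/activations)
C = 4                         # input channels accumulated before any unpacking
G = 2 * B + C.bit_length()    # segment width: product bits plus headroom for C adds
MASK = (1 << G) - 1

def pack(ws):                 # pack several B-bit weights into one wide integer
    packed = 0
    for i, w in enumerate(ws):
        packed |= w << (i * G)
    return packed

weights = [[3, 1, 2], [0, 2, 3], [1, 1, 0], [2, 3, 1]]   # one weight triple per channel
acts    = [2, 3, 1, 2]                                    # one activation per channel

packed_acc = 0
for a, ws in zip(acts, weights):
    packed_acc += a * pack(ws)     # ONE multiply yields three partial products, and the
                                   # accumulation happens on the still-packed result

unpacked  = [(packed_acc >> (i * G)) & MASK for i in range(3)]              # unpack once
reference = [sum(a * ws[i] for a, ws in zip(acts, weights)) for i in range(3)]
assert unpacked == reference       # -> [11, 15, 15] for these toy values
```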
Strengths
-
Excellent Problem Formulation and Contextualization: The paper does a superb job of situating its work within the broader landscape of DNN acceleration. The introduction (Section 1, page 1) clearly delineates the two dominant approaches—"bitwidth extension" and "data packing"—and articulates their respective trade-offs (bit-space inefficiency vs. data type incompatibility). The analysis in Section 3, particularly Figure 5 (page 4), provides a crystal-clear illustration of the sequential dependency issue that plagues prior work like HiKonv, which serves as the direct motivation for this research.
-
An Elegant System of Solutions: The proposed optimizations are not just a collection of disconnected tricks; they form a coherent and logical system. Decoupling the multiplication and unpacking (Section 4.1) enables parallelism, rescheduling the accumulation (Section 4.2) dramatically reduces redundant work, and the DIR mechanism (Section 5.2) pushes the limits of this approach by maximizing in-register computation. This demonstrates a deep understanding of both the algorithm and the underlying hardware constraints. It is a classic example of applying HPC principles (e.g., operation rescheduling, maximizing register reuse) to the domain of neural network inference.
-
Connects a Missing Link in the Literature: This work serves as a crucial bridge between the theoretical promise of packing-based convolution (e.g., HiKonv [20]) and its practical implementation on commodity hardware. While prior works proposed the mathematical foundation for packing multiple operations into a single multiplication, this paper provides the architectural and algorithmic insights necessary to make it truly fast. It effectively "unlocks" the potential of this entire class of algorithms for SIMD processors.
-
Forward-Looking Implications: I was particularly impressed by the "Implications to Future Architecture" section (Section 6.5, page 12). By demonstrating the significant software gains achievable through complex bit-wise management, the authors make a compelling case for future hardware support, such as native sub-byte vector lanes and richer in-register bit-manipulation primitives. This elevates the paper from a simple software optimization study to a piece of work that can inform the next generation of processor design for AI workloads.
Weaknesses
While the core ideas are strong, the paper could be improved by broadening its contextual analysis in a few areas:
-
Limited Discussion on Alternative Convolution Algorithms: The paper focuses exclusively on optimizing direct convolution. It dismisses other methods like Winograd or FFT-based convolution early on (Section 2, page 2). While the reasons given (simplicity, accuracy) are valid, the context would be richer with a more nuanced discussion. For instance, at what point (in terms of kernel size, precision, and hardware support) does the overhead of Winograd's data transformations become justifiable again, even in a sub-8-bit regime? A brief analysis of this trade-off would strengthen the paper's positioning.
-
Implicit Assumptions about Kernel Structure: The methodology is heavily optimized around a base unit of an n x 3 kernel, which is then tiled to support larger n x n kernels (Section 5.3, page 7). This is a pragmatic choice for ARM architectures. However, the paper could benefit from a short discussion on the generality of the approach. How do the packing density and efficiency change for other kernel aspect ratios? This would help the reader understand the boundaries of the method's effectiveness.
Questions to Address In Rebuttal
-
The performance gains over direct convolution methods are clear. Could the authors elaborate on the performance crossover point with Winograd-based methods? Given that ARMNN uses Winograd for FP32, is there a scenario in the sub-8-bit space where a highly optimized Winograd kernel might outperform HiPACK, and if so, what are its characteristics (e.g., large channel counts, specific kernel sizes)?
-
The presented techniques are intricate and rely on careful bit-level management. Could you comment on the implementation complexity and the effort required to generalize HiPACK to support a new target bit-width (e.g., 6-bit) or a different SIMD architecture (e.g., x86 AVX2/AVX-512)?
-
The Dual Interleaved Register (DIR) mechanism is a clever trick for extending accumulation depth. Does this approach introduce any measurable overhead in terms of instruction count for the splitting/merging process, and how does this overhead scale as you further partition the g-bit segment? Is there a point of diminishing returns?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper introduces HiPACK, a set of techniques designed to make sub-8-bit direct convolution efficient on modern SIMD architectures like ARM NEON. The authors identify that prior "packing-for-efficient-convolution" methods, most notably HiKonv [20], are fundamentally incompatible with SIMD parallelism due to sequential dependencies in the unpacking and accumulation stages.
The central claim of novelty rests on a series of architectural and algorithmic re-orderings to break these dependencies. The core contributions are:
1. Decoupling Multiplication from Unpacking: Instead of a tight multiply -> unpack -> accumulate loop for each output, HiPACK performs a block of wide SIMD multiplications first, caching the packed, un-interpreted results directly in SIMD registers.
2. Rescheduled Accumulation: The accumulation of values from different input channels is performed on the packed intermediate results before the final unpacking step, significantly reducing the number of expensive bit-extraction operations.
3. Dual Interleaved Registers (DIR): A micro-architectural technique to effectively double the accumulation bit-depth for intermediate results without sacrificing packing density, by splitting segments into low and high bits stored in separate register files.
The authors demonstrate substantial performance improvements over existing frameworks, including QNNPACK and ARMNN.
Strengths
The primary strength of this work lies in its genuinely novel approach to solving a well-defined problem.
-
Addressing a Known Limitation in Prior Art: The paper correctly identifies the critical bottleneck in the HiKonv [20] theoretical framework: its sequential nature makes it impractical on parallel SIMD hardware. The core contribution of HiPACK—decoupling and rescheduling—is a direct, non-obvious, and effective solution to this specific limitation. This is not an incremental improvement but a fundamental rethinking of the data flow.
-
Novelty in Algorithmic Reordering: The concept of "Rescheduled Accumulation" (Section 4.2, Page 5) is a significant and novel contribution. The mathematical formulation in Equation (12), which moves the UP() (Unpack) function outside the summation (UP(ΣΣ ...) vs. the implicit ΣΣ UP(...)), is a clear and elegant representation of this new approach. This reordering is the key enabler for reducing the unpacking overhead, which is the main performance limiter.
-
Clever Micro-architectural Technique: The Dual Interleaved Registers (DIR) mechanism (Section 5.2, Page 6) is a novel bit-manipulation strategy tailored for this problem. While using shifts and masks is standard, the specific application of splitting segments into interleaved registers to extend accumulation capacity without altering the packing format is a new and insightful trick. It demonstrates a deep understanding of the interplay between logical operations and register file constraints.
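One plausible reading of DIR, reconstructed purely from the description above (the paper's actual register layout, segment width, and split point may differ): each g-bit segment of a packed product is split into low and high halves that are accumulated in two separate packed registers, so each half regains headroom, and the halves are recombined per segment only after all accumulations.

```python
G = 8                                  # segment width in the packed format
H = G // 2                             # split point between low and high halves
SEG_MASK, HALF_MASK, N_SEG = (1 << G) - 1, (1 << H) - 1, 3

def split_segments(packed):
    lo = hi = 0
    for i in range(N_SEG):
        seg = (packed >> (i * G)) & SEG_MASK
        lo |= (seg & HALF_MASK) << (i * G)   # low half parked in a full g-bit slot
        hi |= (seg >> H) << (i * G)          # high half parked in a full g-bit slot
    return lo, hi

# toy packed products, three 8-bit segments each (values chosen so the per-segment sums
# exceed 2**G, i.e. accumulating the raw packed products would spill across segments)
products = [0xB39AC7, 0xA18FD2, 0x97C3B4, 0xC5B19E]

acc_lo = acc_hi = 0
for p in products:
    lo, hi = split_segments(p)
    acc_lo += lo
    acc_hi += hi

results = []
for i in range(N_SEG):
    lo = (acc_lo >> (i * G)) & SEG_MASK
    hi = (acc_hi >> (i * G)) & SEG_MASK
    results.append(lo + (hi << H))           # recombine one segment at a time

reference = [sum((p >> (i * G)) & SEG_MASK for p in products) for i in range(N_SEG)]
assert results == reference                  # -> [747, 669, 688], all larger than 255
```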
Weaknesses
While the core ideas are novel, their context and scope should be carefully considered.
-
Derivative Foundation: The novelty of HiPACK is predicated on making the pre-existing concept from HiKonv [20] practical. The foundational idea of packing multiple low-bit values, performing a single wide multiplication, and segmenting the results is not new. The contribution is the "how" (making it SIMD-parallel), not the "what" (the packing scheme itself). The paper is transparent about this, but it is an important distinction.
-
Incremental Nature of Supporting Optimizations: The optimization of the segment bitwidth (Section 5.1, Page 6) is an incremental improvement over the analysis in HiKonv. The new formulation (Equation 15) is a necessary consequence of introducing the block size B2 from the rescheduled accumulation, but it is not a standalone novel concept. It is a refinement, not a breakthrough.
-
Narrow Scope of Applicability: The proposed techniques are highly specialized for direct convolution with kernels of size 3x3 and larger, where partial products overlap. The authors acknowledge in Section 5.3 (Page 7) that for 1x1 or 2x2 kernels, the method degenerates to a simpler packing scheme like ULPPACK [28]. This specificity limits the generality of the novel contributions, though the targeted domain is admittedly very important.
Questions to Address In Rebuttal
-
Generality of the Core Idea: The core novelty is decoupling and rescheduling. Could this principle be applied to other domains beyond sub-8-bit convolution where data is packed to exploit wide multipliers? For instance, could similar techniques accelerate polynomial multiplication or signal processing filters on SIMD hardware?
-
On the DIR Mechanism: The Dual Interleaved Registers (DIR) trick is clever. Is this a one-off solution for the specific bit-width constraints encountered in this problem, or does it represent a more generalizable pattern for managing intermediate precision in packed-data SIMD algorithms? Please elaborate on its potential applicability elsewhere.
-
Clarification of Prior Art Limitations: The performance comparison against HiKonv [20] is stated as "95 ~ 181x" (Page 8), but this is against a single-thread, ALU-based implementation. While HiKonv is not SIMD-compatible, a multi-threaded ALU implementation could have been a fairer, albeit still slower, baseline. Could the authors clarify precisely which sequential dependency in HiKonv's logic prevents even a trivial data-parallel implementation across multiple cores without the proposed decoupling?
Recommendation (Based purely on novelty): Accept.
The paper presents a clear, significant, and well-motivated set of novel techniques. It identifies a specific, unresolved problem in prior art (the SIMD-incompatibility of HiKonv-style packing) and proposes a non-obvious and highly effective solution through algorithmic reordering and clever bitwise management. While building on existing concepts, the "delta" over prior art is substantial and enables a new level of performance. The complexity of the solution is justified by the significant empirical gains. This work represents a genuine advancement in the field of high-performance, low-precision computing.
MCBP: A Memory-Compute Efficient LLM Inference Accelerator Leveraging Bit-Slice-enabled Sparsity and Repetitiveness
Abstract
Large language models (LLMs) face significant inference latency due to inefficiencies in GEMM operations, weight access, and KV cache access, especially in real-time scenarios. This highlights the need for a versatile compute-memory efficient accelerator. ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose MCBP, a hardware accelerator for Large Language Model (LLM) inference that aims to improve memory and compute efficiency. The core contribution is a suite of three bit-level optimization techniques: 1) BS-repetitiveness-enabled computation reduction (BRCR) to eliminate redundant GEMM operations by identifying and merging computations on identical bit-slice column vectors; 2) BS-sparsity-enabled two-state coding (BSTC) to compress sparse, high-order bit-slices of weights; and 3) Bit-grained progressive prediction (BGPP) to reduce KV cache access via an early-termination prediction mechanism. These techniques are implemented in a custom accelerator architecture and evaluated against a GPU and other SOTA accelerators, with the authors claiming significant speedup and energy efficiency improvements.
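For readers unfamiliar with the premise, the bit-level structure these techniques rely on is easy to reproduce on synthetic data. The sketch below (reviewer's own; the Gaussian weight distribution, sign-magnitude decomposition, and group size m=4 are assumptions taken from the paper's description) measures per-plane sparsity (the basis of BSTC) and repetition of bit-slice column vectors within small row groups (the basis of BRCR):

```python
import numpy as np

rng = np.random.default_rng(0)
W = np.clip(np.round(rng.normal(0.0, 8.0, size=(128, 128))), -127, 127).astype(np.int8)

mag = np.abs(W).astype(np.uint8)                 # sign-magnitude view of the weights
planes = [((mag >> b) & 1) for b in range(7)]    # bit-plane b of the magnitudes

for b, plane in enumerate(planes):               # high-order planes are nearly all zero
    print(f"bit {b}: {100 * (plane == 0).mean():.1f}% zeros")

m = 4                                            # BRCR-style row-group size (assumed)
plane0 = planes[0]
unique_ratio = []
for r in range(0, plane0.shape[0], m):
    cols = {tuple(plane0[r:r + m, c]) for c in range(plane0.shape[1])}
    unique_ratio.append(len(cols) / plane0.shape[1])
print(f"unique m-bit slice columns per group: {100 * np.mean(unique_ratio):.1f}%")
# at most 2**m = 16 distinct 4-bit columns exist, so most of the 128 columns per group
# are repeats whose partial results a BRCR-style scheme could merge and reuse
```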
Strengths
- The fundamental observation that decomposing value-level matrices into bit-slice matrices can expose significant, previously unexploited structure (sparsity and repetitiveness) is sound. The illustration in Figure 4 provides a clear, albeit simplified, motivation for this approach.
- The work attempts to address the three primary bottlenecks in LLM inference (GEMM, weight access, KV cache access) within a single, unified co-design framework. This holistic approach is commendable, as optimizing one aspect in isolation often exposes another as the new bottleneck.
- The paper provides detailed architectural designs for its key components, including the CAM-based BRCR unit (Figure 14) and the clock-gated BGPP unit (Figure 16), demonstrating a considerable implementation effort.
Weaknesses
My primary concerns with this paper lie in the rigor of its experimental evaluation, the justification for key design choices, and the potential overstatement of its claims.
-
Baseline Comparisons are Potentially Unfair:
- SOTA Accelerators: The authors state that FuseKNA and Bitwave were "adapted from convolution to GEMV using im2col" (Section 5.1, Page 10). This adaptation is non-trivial. It is unclear if these adapted baselines represent a fair, performant implementation or a "strawman" version that is suboptimal for GEMM workloads. Without validation against the original authors' performance models or a more rigorous adaptation process, the comparisons in Figure 23 are suspect.
- GPU Baseline: The claimed 9.43x speedup and 31.1x energy efficiency gain over an NVIDIA A100 GPU are exceptionally high. While a specialized ASIC is expected to outperform a general-purpose GPU, this magnitude raises questions. The paper does not specify whether the TensorRT-LLM baseline was configured to leverage the A100's native sparsity support (e.g., 2:4 structured sparsity). If the comparison is between a sparsity-aware accelerator (MCBP) and a purely dense GPU execution, the reported gains are significantly inflated and misleading.
-
Over-reliance on Empirically Chosen "Magic Numbers":
- The entire BRCR mechanism's effectiveness hinges on the choice of group size m. While the authors provide a design space exploration in Figure 18 that justifies m=4, the analysis is performed against "dense models." It is not clear if this choice remains optimal under the aggressive sparsity introduced by their other techniques.
- The BGPP prediction mechanism (Section 3.3, Page 7) depends critically on radius (empirically set to 3) and the parameter α_r. The paper provides a sensitivity analysis for α_r in Figure 24(a), but this is done post-facto to define "Standard" and "Aggressive" modes. The choice of radius=3 is presented without any justification or sensitivity analysis, making the results difficult to trust or generalize.
-
Hardware Claims Lack Sufficient Substantiation:
- The paper claims the CAM-based fast match unit can "identify identical elements in one cycle" (Section 4.3, Page 8). For a 1GHz clock frequency, a parallel search and bitmap generation across a non-trivial number of entries is a significant circuit design challenge. This claim requires circuit-level evidence, such as a critical path analysis, to be credible.
- There is a fundamental tension between the reported system-level energy efficiency gains and the accelerator's own power breakdown. Figure 22(b) shows that off-chip DRAM access still accounts for 47.6% of total system power. It is difficult to reconcile this with a 31.1x system-level energy efficiency improvement over a highly optimized platform like the A100. Such a massive gain would imply the baseline's memory access is catastrophically inefficient, or the comparison methodology is flawed. The authors must provide a far more detailed energy breakdown of both systems to substantiate this claim.
-
Inconsistent Narrative on Contribution: The paper claims that the bit-reordering overhead is minimal (3% in Figure 23). However, the complexity of the proposed memory layout (Figure 13 and 15(c)) suggests that the overhead might manifest in ways not captured by a simple energy bucket, such as increased address generation complexity or potential bank conflicts in the memory controller, which do not appear to be modeled.
Questions to Address In Rebuttal
- Please clarify the implementation details of the adapted baselines (FuseKNA, Bitwave, etc.). How can the authors assure the reviewers that these adaptations represent a fair, best-effort implementation of the original architectures for a GEMM workload?
- Did the TensorRT-LLM baseline on the A100 GPU utilize any form of structured or sparse tensor cores? If not, please re-evaluate against a sparsity-enabled GPU baseline or explicitly state that the comparison is against dense execution and temper the claims accordingly.
- The BGPP mechanism relies on an empirically set radius=3. Please provide a sensitivity analysis for this parameter, similar to the one performed for α_r, and justify why this value is optimal or robust across different models and tasks.
- Please provide circuit-level justification (e.g., critical path analysis, post-layout simulation data for the cell) to support the claim that the CAM-based search unit achieves its function in a single 1GHz clock cycle.
- Please provide a detailed, side-by-side energy consumption breakdown (in joules per inference, broken down by compute, on-chip memory, and off-chip DRAM access) for both MCBP and the A100 baseline. This is necessary to explain how a 31.1x system-level efficiency gain is possible when off-chip DRAM, a component common to both systems, constitutes nearly 50% of MCBP's own power consumption.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces MCBP, an algorithm-hardware co-designed accelerator for large language model (LLM) inference. The work's core contribution is a paradigm shift in how optimization opportunities are identified and exploited in quantized neural networks. Instead of operating at the conventional value-level, the authors propose decomposing weight and activation matrices into their constituent bit-slices. They compellingly argue and demonstrate that this bit-level view uncovers a vast amount of previously obscured structure—namely, extreme sparsity in high-order bits and significant repetition among bit-slice vectors.
Based on this foundational insight, the authors develop a holistic, three-pronged strategy to tackle the primary bottlenecks in LLM inference:
1. Computation (GEMM): A technique called BS-Repetitiveness-enabled Computation Reduction (BRCR) identifies and merges redundant computations arising from repeated bit-slice vectors.
2. Weight Access: A BS-Sparsity-enabled Two-state Coding (BSTC) scheme compresses weights by exploiting the high sparsity found in individual bit planes.
3. KV Cache Access: A Bit-grained Progressive Prediction (BGPP) method refines the existing top-k attention mechanism by performing early termination at the bit-level, reducing memory traffic.
These algorithmic innovations are supported by a custom accelerator architecture, resulting in significant reported gains in performance and energy efficiency over both GPUs and prior state-of-the-art accelerators.
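As a point of reference for the BGPP discussion below, a generic progressive bit-level top-k filter can be sketched as follows (reviewer's reconstruction, not the paper's algorithm; the bound, bit ordering, and thresholding in MCBP may differ). Scores are refined one key bit-plane at a time from the MSB down, and keys whose optimistic upper bound falls below the k-th best pessimistic lower bound are terminated early:

```python
import numpy as np

rng = np.random.default_rng(1)
q = rng.integers(-8, 8, size=64)                # int4 query
K = rng.integers(-8, 8, size=(256, 64))         # 256 int4 keys
k_top, BITS = 16, 4

sign, mag = np.sign(K), np.abs(K)
partial = np.zeros(K.shape[0])                  # scores refined bit-plane by bit-plane
alive = np.ones(K.shape[0], dtype=bool)
slack = np.sum(np.abs(q))                       # |q|_1 bounds each key's unseen remainder

for b in range(BITS - 1, -1, -1):               # MSB -> LSB
    plane = ((mag >> b) & 1) * sign             # signed contribution of this bit-plane
    partial[alive] += (1 << b) * plane[alive] @ q
    remaining = slack * ((1 << b) - 1)          # max effect of the still-unseen low bits
    lower, upper = partial - remaining, partial + remaining
    kth_lower = np.sort(lower[alive])[-k_top]   # pessimistic score of the current k-th best
    alive &= upper >= kth_lower                 # early-terminate keys that cannot make top-k
    print(f"after bit {b}: {int(alive.sum())} candidate keys remain")

exact = K @ q
assert set(np.where(alive)[0]) >= set(np.argsort(exact)[-k_top:])   # no true top-k key lost
```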
Strengths
-
Powerful and Unifying Core Insight: The single most important strength of this paper is its central thesis: that the bit-level representation of quantized data holds more exploitable structure than the value-level representation. The illustration in Figure 4 (page 4) is particularly effective at demonstrating how sparsity and repetition dramatically increase after bit-slice decomposition. This is not just an incremental improvement; it is a fundamentally different and potentially more fruitful perspective for hardware acceleration in the era of quantized LLMs.
-
Holistic Problem-Solving: The current landscape of Transformer accelerators often features point solutions that address one bottleneck (e.g., attention computation, weight sparsity) in isolation. MCBP stands out by using its core bit-level insight as a unifying principle to build a comprehensive solution that simultaneously addresses GEMM, weight memory, and KV cache access—the three major latency contributors identified in Figure 1 (page 2). This demonstrates a strong systems-level approach to design.
-
Excellent Connection between Observation, Algorithm, and Hardware: The authors do a commendable job of connecting their high-level observations to concrete implementations. For instance, the abstract problem of "finding repetitive bit-slice vectors" (BRCR) is translated into a practical CAM-based fast-match unit (Section 4.3, page 8). Similarly, the inefficiencies of value-level top-k prediction are addressed with a dedicated, threshold-aware, clock-gated bit-serial unit (BGPP, Section 4.5, page 9). This tight algorithm-hardware co-design is a hallmark of high-quality accelerator research.
-
Broad Contextualization and Empirical Rigor: The work is well-positioned within the literature. The authors effectively contrast their approach with value-level methods, providing a clear rationale for their design choices. The evaluation in Section 5 (pages 10-13) is extensive, covering multiple models, diverse benchmarks, and detailed comparisons against a wide range of academic and commercial baselines. The ablation study in Figure 19 (page 10) and the breakdown of gains in Figure 21 (page 12) provide clear evidence for the efficacy of each proposed component.
Weaknesses
My critiques are less about flaws and more about the boundaries and broader context of the proposed ideas.
-
Dependence on Specific Data Formats: The proposed techniques, particularly BSTC, are designed around the sign-magnitude (SM) format to maximize sparsity in the most significant bits (Section 3.2, page 6). While this is a clever choice, the broader field of quantization is exploring many different formats (e.g., non-uniform, logarithmic, block floating-point). The paper could be strengthened by a discussion on how the core principles of bit-level repetition and sparsity might adapt, or fail to adapt, to these alternative quantization schemes.
-
Understated Connection to the Bit-Serial Computing Lineage: The paper rightly positions itself against contemporary Transformer accelerators. However, the idea of processing data one bit at a time has a long history, from early DSPs to more recent deep learning accelerators like Stripes and Bit-pragmatic. While MCBP's novelty lies in exploiting inter-vector repetition and its holistic application to LLMs, a more explicit discussion of how it builds upon or diverges from this established lineage of bit-serial architectures would help to better contextualize its specific contributions for readers familiar with that domain.
-
Potential Control Complexity: While the data path for each unit is well-described, processing data at the bit-slice level can introduce significant control and data management complexity (e.g., scheduling, managing metadata for compression, orchestrating the multi-round BGPP). The paper asserts that these overheads are managed, but a deeper discussion on the complexity and scalability of the main controller and scheduling logic would be valuable.
Questions to Address In Rebuttal
-
The effectiveness of BSTC and BRCR seems tied to the statistical properties of quantized weights in the sign-magnitude format. Could the authors comment on the applicability of their approach to other popular quantization schemes, such as symmetric two's complement or more advanced non-uniform formats? How robust is the bit-level structural advantage across these different representations?
-
The ablation study in Figure 24b (page 13) shows that the CAM unit for BRCR adds considerable area and power overhead compared to a baseline systolic array. Could you further elaborate on the trade-offs here? Specifically, at what level of bit-slice repetition does the computational savings from BRCR begin to outweigh the static and dynamic power of the CAM-based search and merge logic?
-
Could you further clarify the key distinction between MCBP's approach and prior bit-serial accelerators (e.g., for CNNs)? My understanding is that the primary novelty is the exploitation of repetition across different bit-slice column vectors via BRCR, rather than just exploiting sparsity within a vector. Is this interpretation correct, and are there other key differentiators?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present MCBP, an algorithm-hardware co-designed accelerator for Large Language Model (LLM) inference. The central thesis is that existing value-level optimizations miss fine-grained opportunities, and a unified bit-level approach can simultaneously address the three primary inference bottlenecks: GEMM computation, weight access, and KV cache access. To this end, the paper proposes three techniques: 1) BS-repetitiveness-enabled Computation Reduction (BRCR) to reuse computation for identical bit-slice vectors, 2) BS-sparsity-enabled Two-state Coding (BSTC) to compress sparse high-order bit-slices of weights, and 3) Bit-grained Progressive Prediction (BGPP) for early-termination in attention score calculation.
While the paper presents a comprehensive and well-engineered system, my analysis finds that the core concepts underpinning its main contributions have strong precedents in prior work. The novelty appears to lie not in the introduction of fundamentally new mechanisms, but in the synthesis and application of known bit-level optimization principles to the specific context of modern LLMs.
Strengths
- Holistic, Unified Framework: The primary strength of this work is its ambition to tackle all three major LLM bottlenecks (GEMM, weight memory, KV cache) under a single, consistent bit-level optimization philosophy. This unified approach is a noteworthy engineering achievement.
- Bit-Grained Prediction (BGPP): Among the three proposed techniques, BGPP (Section 3.3, page 7) demonstrates the most significant conceptual delta over prior art. While value-level top-k prediction is established, shifting this prediction to a progressive, bit-serial paradigm to enable early termination is a clever and specific optimization.
- Thorough Co-design: The authors have clearly considered the interplay between their proposed algorithms and the underlying hardware, with custom units for CAM-based matching (Figure 14), lightweight codecs (Figure 15), and the BGPP filter (Figure 16).
Weaknesses
My primary concern is the degree of conceptual novelty in the core mechanisms, particularly BRCR and BSTC, which appear to be adaptations of previously published ideas.
-
BRCR is Conceptually Analogous to Prior Work on Computational Reuse: The core idea of BRCR—identifying repeated vectors (in this case, bit-slice columns) and reusing their computational results (merged activations)—is functionally identical to the "weight repetition" technique pioneered in UCNN (Hegde et al., ISCA 2018) [31]. UCNN identified repeating weight filters in CNNs to avoid redundant MAC operations. The authors of MCBP even acknowledge this line of work (Section 1, page 2) but differentiate by highlighting the challenges of applying it to large LLM matrices. This frames the contribution as an engineering and scaling solution, not as the invention of a new computational paradigm. The grouping strategy to increase repetition probability is a standard technique used in dictionary-based compression and is a logical extension of the core reuse concept.
-
BSTC Leverages Known Properties and Standard Encoding: The principles behind BSTC are not new.
- Exploiting Bit-Level Sparsity: A long line of work on bit-serial accelerators, such as Bit-Pragmatic (Albericio et al., MICRO 2017) [2] and Laconic (Sharify et al., ISCA 2019) [78], is built entirely on exploiting bit-level sparsity (i.e., skipping operations on zero-bits).
- High-Order Bit Sparsity: The observation that high-order bits are sparser in quantized weights is a well-known statistical property of values drawn from a near-Gaussian distribution. This is an observation of a natural phenomenon, not a novel insight.
- Encoding Scheme: The proposed "two-state coding" (zero vs. non-zero) is one of the most basic forms of data compression, functionally similar to a bitmap or a simplified run-length encoding scheme. There is no novel algorithmic contribution in the coding scheme itself. The novelty is its application in a co-designed pipeline, which is an integration effort.
-
BGPP is an Incremental Refinement: While BGPP is the most novel component, it is still an incremental refinement of existing ideas. The framework of using a low-overhead pre-computation to prune attention is well-established by SOTA accelerators like SpAtten [94] and FACT [72]. MCBP’s contribution is to change the granularity of this prediction from value-level (e.g., 4-bit INT) to bit-level. This allows for earlier termination, which is a clever optimization. However, it builds directly upon the established top-k prediction paradigm rather than proposing a new one.
In summary, the paper's contribution seems to be the meticulous engineering and synthesis of these three concepts into a single accelerator. While the final system is novel in its specific combination, the foundational building blocks are largely drawn from the existing intellectual landscape of accelerator design.
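To make the "functionally similar to a bitmap" point in Weakness 2 concrete, a minimal two-state (zero vs. non-zero) coding of a sparse high-order bit-plane can be sketched as follows (reviewer's own; the paper's code-word granularity and metadata layout may differ):

```python
import numpy as np

rng = np.random.default_rng(2)
plane = (rng.random((128, 128)) < 0.02).astype(np.uint8)   # ~2% ones, like a high-order plane

def encode(plane, word=16):
    words = plane.reshape(plane.shape[0], -1, word)   # split each row into 16-bit words
    nonzero = words.any(axis=2)                       # one flag bit per word: zero / non-zero
    payload = words[nonzero]                          # only the non-zero words are stored
    return nonzero, payload

flags, payload = encode(plane)
raw_bits, coded_bits = plane.size, flags.size + payload.size
print(f"raw: {raw_bits} bits, coded: {coded_bits} bits, ratio: {raw_bits / coded_bits:.1f}x")
```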
Questions to Address In Rebuttal
-
On BRCR: Please explicitly differentiate the core algorithmic mechanism of BRCR from the weight repetition technique in UCNN [31]. Beyond the application domain (LLMs vs. CNNs) and the data granularity (bit-slice vectors vs. value-level filters), what is the fundamental conceptual novelty that makes BRCR a new technique for computational reuse?
-
On BSTC: The paper claims BSTC as a key innovation. Given that exploiting bit-level sparsity is foundational to prior bit-serial accelerators and the two-state encoding is a standard compression primitive, please clarify what makes the BSTC algorithm itself novel, as distinct from its tight integration with the MCBP hardware pipeline.
-
On Overall Contribution: The paper's strength appears to be the successful integration of multiple known principles into a single, high-performance system for a new problem domain (LLMs). Is it the authors' position that this act of synthesis and co-design constitutes the primary novel contribution, or is there a single, underlying theoretical or architectural concept presented here that is fundamentally new and has not appeared in prior literature? Please identify it.
PolymorPIC: Embedding Polymorphic Processing-in-Cache in RISC-V based Processor for Full-stack Efficient AI Inference
Abstract
The growing demand for neural network (NN) driven applications in AIoT devices necessitates efficient matrix multiplication (MM) acceleration. While domain-specific accelerators (DSAs) for NN are widely used, their large area overhead of dedicated buffers ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper presents PolymorPIC, a Processing-in-Cache (PIC) architecture integrated into the Last-Level Cache (LLC) of a RISC-V System-on-Chip (SoC). The stated goal is to accelerate neural network inference, specifically bit-serial matrix multiplication (BSMM), while addressing system-level challenges such as coherence, programmability, and area efficiency, which the authors claim are overlooked by prior work. The proposed solution involves reconfigurable Homogeneous Memory Arrays (HMAs) that can function as cache, compute units, or buffers. The authors introduce a coherence protocol (DCF and PAI) to manage the transition between cache and PIC modes, claiming it is "processor-safe." The system is implemented on an FPGA with a Linux OS, and evaluation results are presented based on ASIC synthesis, claiming significant improvements in area and energy efficiency over baseline processors and competing accelerators like Gemmini.
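For reference, the bit-serial matrix multiplication (BSMM) the HMAs are meant to accelerate reduces to the following functional model (reviewer's sketch of the arithmetic only, not the in-SRAM datapath; unsigned 4-bit operands are an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.integers(0, 16, size=(8, 32))            # unsigned 4-bit activations
W = rng.integers(0, 16, size=(32, 8))            # unsigned 4-bit weights
BITS = 4

acc = np.zeros((A.shape[0], W.shape[1]), dtype=np.int64)
for b in range(BITS):
    plane = (W >> b) & 1                          # one weight bit-plane
    acc += (A @ plane) << b                       # per pass: a binary matmul plus a shift

assert np.array_equal(acc, A @ W.astype(np.int64))
```

The arithmetic itself is trivial; the design effort lies in performing the per-plane binary multiply-accumulate inside standard SRAM arrays while remaining coherent with the host, which is exactly where the concerns below apply.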
While the ambition to create a full-stack, processor-safe PIC system is commendable, the manuscript suffers from significant methodological weaknesses in its evaluation and makes several unsubstantiated or overstated claims that undermine its conclusions. The core contributions, particularly regarding coherence and performance, require much stronger validation.
Strengths
- System-Level Integration: The effort to integrate a PIC architecture into a full SoC stack, including a RISC-V core (BOOM), standard bus protocols (TileLink), and verification with a running operating system (Linux), is a notable strength. This moves beyond circuit-level proofs-of-concept common in this domain.
- Use of Standard-Cell SRAM: The decision to leverage standard SRAM arrays generated by a memory compiler (as mentioned in Section 3.2 and 7.1.4) is a practical approach that avoids the area and compatibility issues associated with the custom bit-cells used in prior works like Neural Cache [25].
- Focus on Coherence: The paper correctly identifies cache coherence as a critical system-level barrier for PIC adoption. The explicit attempt to design mechanisms for allocation, isolation, and release (Section 5) is a necessary and important research direction.
Weaknesses
-
Misleading Performance Metrics and Unfair Baselines: The headline performance claims are built on questionable comparisons.
- The claim of a 1543.8x energy efficiency improvement (Abstract) is made against a general-purpose OoO core (BOOM). This is a classic accelerator-vs-CPU comparison that, while technically correct, is largely meaningless. Any dedicated hardware will show orders-of-magnitude improvement over a general-purpose core for its target workload. This number is inflammatory and does not provide a useful comparison against state-of-the-art accelerators.
- The comparison against Gemmini appears to be deliberately handicapped. Section 7.1.3 states the total buffer capacity for the NPU was constrained to match the cache size of PolymorPIC. Gemmini's systolic array architecture is highly sensitive to buffer size for hiding memory latency; forcing a 1MB buffer configuration for an 8x8 or 16x16 array may not be optimal and could cripple the baseline's performance. More critically, the "area efficiency" metric (GOPS/mm²) in Figure 15 is highly suspect. PolymorPIC's absolute throughput is clearly lower than Gemmini16 (Figure 16). The only way PolymorPIC can claim higher "area efficiency" is if the normalization area includes the entire processor, memory, and peripherals, thereby diluting the large area of the Gemmini NPU across the entire SoC. Efficiency should be judged on the area of the added accelerator components, not the whole chip. This is an apples-to-oranges comparison.
-
Insufficiently Validated Coherence Mechanism: The proposed "processor-safe" coherence strategy is not rigorously proven.
- The Direct Cache Flush (DCF) mechanism (Section 5.1, Figure 8) requires the Switch Controller to iterate through every set in the LLC to query the directory for a specific way. For a 1MB, 16-way cache with 1024 sets (as per their BOOM-S configuration), this is 1024 queries to the directory per way being switched. While better than the naive approach, the absolute latency of this operation is not quantified, nor is its impact on the memory subsystem's availability for other processor requests (a rough cost sketch follows this item).
- The PIC Array Isolation (PAI) mechanism (Section 5.2) relies on a single PIC_mode flag per way to make it "invisible" to the MESI replacement policy. This seems overly simplistic. It does not address more complex coherence scenarios, such as snoops from other cores in a multi-core system, or handling of speculative memory accesses by an OoO core that might target an address mapped to an isolated way. The claim of being "processor-safe" is not substantiated beyond a trivial case.
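A rough cost sketch of the DCF walk described in the first bullet (reviewer's own back-of-the-envelope model; the per-query and writeback cycle counts are assumed, not taken from the paper):

```python
N_SETS, DIR_QUERY_CYC, WB_CYC = 1024, 2, 40   # assumed figures for the 1 MB, 16-way LLC

def dcf_cycles(dirty_fraction):
    # every set pays a directory query for the targeted way; dirty lines also pay a writeback
    return N_SETS * (DIR_QUERY_CYC + dirty_fraction * WB_CYC)

for frac in (0.0, 0.25, 0.5):
    print(f"dirty={frac:.0%}: ~{dcf_cycles(frac):,.0f} cycles per way switched")
```

Even under these charitable assumptions, the allocation latency reaches the order of ten thousand cycles per way once dirty lines are common, which is why the absolute figure and its effect on concurrent processor traffic need to be reported.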
-
"Full-Stack Verification" is Superficial: The claim of being "successfully end-to-end verified on a validation platform with an operating system running" is an overstatement based on the evidence provided.
- The evaluation in Section 7.4 involves running a single SPEC2017 benchmark in parallel with the BSMM computation. This is a controlled "hero run" scenario. It does not stress the system with realistic OS-level complexities like high interrupt frequency, context switching during PIC operations, or contention on the memory bus from multiple I/O devices. The interaction between the OS scheduler and the PIC resource management is not explored. Therefore, the robustness of the system in a real-world, dynamic environment remains unproven.
-
Novelty of Scheduling is Unclear: Section 6.3 discusses scheduling using standard Input/Weight Stationary (I/WS) and Output Stationary (OS) dataflows. While the analysis in Figure 17 is useful, these dataflows are not novel; they are standard practice in the design of NN accelerators. The paper fails to articulate a specific, novel scheduling algorithm or contribution beyond applying existing concepts to their HMA architecture. The software stack's role in making these scheduling decisions is not detailed sufficiently.
Questions to Address In Rebuttal
-
Regarding Performance Claims:
- Please clarify the exact methodology for calculating "area efficiency" in Figure 15. Is the normalization area (the denominator in GOPS/mm²) the area of the accelerator-specific add-ons only, or the total SoC area? If the latter, provide a strong justification for why this is a fair metric for comparing a tightly integrated PIC unit with a loosely coupled co-processor like Gemmini.
- Provide evidence that the buffer size and configuration used for the Gemmini baseline are representative of an optimized system, rather than a configuration constrained to match PolymorPIC's LLC size.
- Please re-frame the 1543.8x energy efficiency claim in the context of accelerator-vs-accelerator comparisons, as the current framing against a CPU baseline is not insightful.
-
Regarding the Coherence Mechanism:
- What is the absolute cycle latency of the DCF operation for allocating a single way (e.g., in the 1MB LLC configuration), and how does this scale with the number of sets and ways?
- Elaborate on the PAI mechanism's handling of complex coherence scenarios. For example, in a multi-core BOOM configuration, if Core 1 has isolated Way 5 for PIC, what happens when Core 2 issues a read request that would map to Way 5 and misses in all other ways?
- How does the system handle an interrupt or a high-priority preemption request that occurs mid-way through a DCF or PIC execution phase? Is the state machine gracefully pausable and resumable?
-
Regarding System Verification:
- Beyond the single parallel SPEC run shown in Figure 22, what further stress tests were conducted to validate the "processor-safe" claim under more dynamic OS conditions (e.g., heavy I/O, frequent context switching, virtual memory pressure leading to page swaps)?
-
Regarding Scheduling:
- Please clarify the precise division of labor between the hardware PIC scheduler and the software stack (Table 1). Who makes the high-level decision between I/WS and OS dataflows for a given layer, and based on what cost model? What novel scheduling algorithm, if any, is being proposed?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces PolymorPIC, a full-stack architecture that integrates polymorphic processing-in-cache (PIC) capabilities into the Last-Level Cache (LLC) of a RISC-V System-on-Chip (SoC). The core contribution is not merely the act of computation within a cache, but the holistic, system-level design that makes this feasible, programmable, and safe to operate alongside a host processor running a full operating system.
The authors achieve this through three key thrusts: 1. An architecture based on reconfigurable Homogeneous Memory Arrays (HMAs) that can dynamically function as standard cache, data buffers, or bit-serial compute units, thereby eliminating the large, dedicated buffers typical of standalone Neural Processing Units (NPUs). 2. A practical cache coherence strategy that allows for the safe partitioning and isolation of cache ways for computation without halting the host processor, a concept they term "processor-safe PIC." 3. A complete end-to-end implementation and verification on an FPGA, including custom ISA extensions, a C-level software stack, and an evaluation that demonstrates significant gains in area and energy efficiency for AI inference workloads.
In essence, this work presents a compelling blueprint for transforming the LLC from a passive memory component into an active, efficient, and reusable compute resource for edge AI.
Strengths
-
Excellent System-Level Synthesis: The primary strength of this paper is its successful synthesis of ideas from multiple domains—processing-in-memory, bit-serial accelerators, and cache management—into a single, coherent, and demonstrably working system. Many prior works have focused on novel PIC circuits or high-level concepts, but this paper provides the crucial "glue" by addressing system-level challenges like coherence, programmability, and OS compatibility. The full-stack approach, from hardware RTL to a C-function library (Section 4.2), is a significant achievement and makes the contribution far more tangible and impactful.
-
The "Processor-Safe" Coherence Model: The proposed coherence strategy (detailed in Section 3.3 and Section 5) is a cornerstone of this work's practical significance. By enabling parts of the cache to be used for acceleration while the rest remains available to the host CPU, PolymorPIC overcomes a major limitation of earlier designs that often required commandeering the entire cache, stalling the processor. This allows for genuine parallelism between general-purpose tasks and AI acceleration (as shown in Figure 22, page 13), which is critical for responsive AIoT devices.
-
Elegant Architectural Unification (HMAs): The concept of Homogeneous Memory Arrays (HMAs) is an elegant solution to the area-inefficiency of specialized accelerators. As shown conceptually in Figure 4 (page 4), unifying the functions of compute engine, local/global buffer, and standard cache into a single, reconfigurable SRAM structure is the key enabler for the impressive area efficiency reported. This hardware is not idle during non-AI workloads; it simply serves its primary function as an LLC, making it a highly cost-effective design point.
-
Comprehensive and Rigorous Evaluation: The evaluation in Section 7 is thorough and well-contextualized. The authors compare PolymorPIC against a wide spectrum of relevant designs, including CPU-only, vector processors (Hwacha), mainstream NPUs (Gemmini), standalone bit-serial accelerators (ANT, BBS), and other PIC designs (MAICC, Duality Cache). This broad comparison effectively situates their work in the current landscape and provides convincing evidence for their claims of superior area and energy efficiency for edge-class SoCs.
Weaknesses
While this is a strong paper, there are areas where the positioning and future implications could be discussed more deeply.
-
Trade-off Between Efficiency and Peak Throughput: The results, particularly in Figure 16 (page 11), show that while PolymorPIC is highly efficient, its absolute throughput is lower than some dedicated, highly-optimized NPU designs like BBS. This is an expected and perfectly acceptable trade-off—the core value proposition is efficiency, not raw performance leadership. However, the paper could benefit from a more explicit discussion of this trade-off and the specific application domains where PolymorPIC's design point (maximum efficiency for a given area/power budget) is more valuable than maximum possible throughput.
-
Scalability to Multi-Core Systems: The paper presents a compelling solution for a single-core SoC. The coherence model works cleanly by partitioning ways in the L2 cache, which is private or last-level for that core. The challenges would multiply in a multi-core system sharing a last-level PIC-enabled cache (e.g., L3). Issues of inter-core synchronization, managing PIC resource allocation between cores, and maintaining coherence for shared data would become significantly more complex. While beyond the scope of this paper's implementation, a brief discussion of these future challenges would strengthen its contextualization.
Questions to Address In Rebuttal
-
Positioning vs. Throughput-Optimized NPUs: The paper convincingly demonstrates superior area and energy efficiency. Could you please explicitly frame the application space for PolymorPIC in contrast to throughput-focused designs like BBS? Is the target primarily area- and power-constrained edge devices where "good enough" performance with maximum efficiency is the goal, rather than applications requiring the highest possible inference rate?
-
Path to Multi-Core Scalability: The "processor-safe" coherence mechanism is a key strength for the presented single-core system. Could you elaborate on the primary challenges and your envisioned architectural solutions for extending this model to a multi-core SoC where multiple cores share the PolymorPIC-enabled LLC? For instance, how would requests for PIC allocation from different cores be arbitrated, and how would you manage snoop traffic related to the PIC-isolated ways?
-
Developer Experience and Programmability: The software stack (Section 4.2.2, page 6) is a vital part of the full-stack claim. From a programmer's perspective, how much manual optimization is required to efficiently map a new NN model onto PolymorPIC? Specifically, how does the effort of managing data tiling, choosing a dataflow (I/WS vs. OS), and configuring the Mat-CUs/Mat-SBs compare to using a more conventional NPU with a mature compiler toolchain that automates much of this process?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents "PolymorPIC," a polymorphic processing-in-cache (PIC) architecture integrated into the last-level cache (LLC) of a RISC-V System-on-Chip (SoC). The stated goal is to accelerate neural network inference, specifically bit-serial matrix multiplication (BSMM), with high area and energy efficiency. The core architectural proposal is a "Homogeneous Memory Array" (HMA) concept, where standard SRAM arrays can be dynamically reconfigured at runtime to function as a normal cache, a computational unit (Mat-CU), or a data buffer (Mat-SB). To enable this dynamic switching in a multi-tasking environment, the authors propose a specific coherence strategy involving a "Direct Cache Flush" (DCF) for rapid allocation and "PIC Array Isolation" (PAI) for processor-safe execution. The authors claim this is the first full-stack, end-to-end verified PIC-enabled SoC that can run a full operating system.
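To make the mode-switching flow concrete, the following is a minimal Python model of the behavior as I read it from the summary above; the class and method names are my own illustrative shorthand, not the authors' RTL or C-function library.

```python
# Toy model of the allocation flow (hypothetical names):
#   Direct Cache Flush (DCF): write back and invalidate lines by (way, set) index.
#   PIC Array Isolation (PAI): mark a way so normal fills can no longer use it.

class ToyLLC:
    def __init__(self, num_ways, num_sets):
        self.valid = [[False] * num_sets for _ in range(num_ways)]
        self.dirty = [[False] * num_sets for _ in range(num_ways)]
        self.isolated = [False] * num_ways  # per-way "invisible to replacement" bit

    def writeback(self, way, s):
        pass  # stand-in for the memory-side write in this sketch

    def direct_flush_way(self, way):
        # DCF: walk cache indices directly instead of iterating over addresses.
        for s in range(len(self.valid[way])):
            if self.valid[way][s] and self.dirty[way][s]:
                self.writeback(way, s)
            self.valid[way][s] = False
            self.dirty[way][s] = False

    def allocate_way_for_pic(self, way):
        self.direct_flush_way(way)   # make the way clean and empty
        self.isolated[way] = True    # PAI: hide it from the replacement policy

    def release_way(self, way):
        self.isolated[way] = False   # return the arrays to normal LLC duty

    def candidate_victim_ways(self):
        # Host-side fills may only evict from ways that are not PIC-isolated.
        return [w for w, iso in enumerate(self.isolated) if not iso]
```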
Strengths
The primary novelty of this work lies not in a single groundbreaking idea, but in the specific synthesis of several known concepts into a coherent, practical, and fully-realized system.
-
Architectural Novelty in Practicality: While PIC is a well-explored field, many seminal works rely on custom SRAM bit-cells (e.g., Neural Cache [25], Duality Cache [29]) which suffer from poor area density and portability. The core novel contribution of PolymorPIC is the architectural design that enables BSMM acceleration using standard, compiler-generated SRAM arrays (Section 3.2, page 4). This HMA concept, which repurposes entire digital sub-arrays for compute or buffering, represents a significant step towards making PIC architectures manufacturable and integrable with standard design flows. The "delta" over prior art here is the move from bit-cell-level modification to array-level functional polymorphism.
-
System-Level Coherence as a Novel Contribution: Most prior PIC literature focuses on the accelerator microarchitecture and often hand-waves the complexities of system integration. This paper’s explicit focus on a "processor-safe" coherence mechanism is a novel contribution in the context of PIC design. While the constituent ideas (direct flushing via way/set ID, using a directory bit for way isolation) are not fundamentally new concepts in cache design, their specific application and combination to solve the rapid, safe mode-switching problem for PIC is new. The PAI mechanism (Section 5.2, page 7), in particular, provides an elegant, low-overhead solution to a critical system-level problem that has been a barrier to the practical deployment of PIC.
-
Demonstration of a Full-Stack System: The claim of being the "first full-stack PIC-enabled SoC, whose functionality is successfully end-to-end verified on a validation platform with an operating system running" (Section 2, page 2) is a substantial novelty claim. Many academic accelerators are evaluated in simulation or as standalone hardware kernels. Demonstrating a PIC architecture that coexists with an operating system on a RISC-V core, handles virtual memory, and can run SPEC benchmarks in parallel with PIC computations (Section 7.4, page 13) elevates this work from a pure architectural proposal to a demonstrated system. This end-to-end integration is a significant and novel engineering achievement in this domain.
Weaknesses
The paper's claims of novelty could be tempered by a more explicit acknowledgment of the conceptual heritage of its components.
-
Incremental Novelty of Individual Components: The paper presents its components as largely new inventions, whereas they are more accurately described as novel applications or optimizations of existing concepts.
- Bit-Serial MM: This is a well-established technique for efficient NN acceleration, as acknowledged by the authors' citation of BISMO [63] and others. The novelty is purely in its implementation venue (the LLC).
- PIC Array Isolation (PAI): This mechanism is functionally analogous to cache way-partitioning or way-locking, techniques that have existed for years to provide quality-of-service or security isolation in caches. PAI uses a directory bit to make a way "invisible" to the replacement policy; this is conceptually identical to how partitioning is often implemented. The novelty is the purpose (enabling safe in-cache computation) rather than the mechanism.
- Direct Cache Flush (DCF): This is a logical optimization of a standard cache flush. Instead of iterating through memory addresses to find cache entries, it directly targets cache indices (wayID, setID). This is a straightforward microarchitectural enhancement, not a fundamentally new coherence protocol. Its novelty is limited.
-
Overstated Terminology: The term "Homogeneous Memory Array" (HMA) is presented as a new architectural primitive. However, it could be argued that this is a new name for a reconfigurable SRAM macro-cell that has been augmented with minimal compute logic. The novelty lies in the specific configuration and control logic, not necessarily in the invention of a new class of memory array. The paper would be stronger if it positioned HMA as a specific, novel implementation of a reconfigurable memory architecture rather than a new fundamental concept.
-
Complexity vs. Benefit Justification: The paper demonstrates significant efficiency gains. However, the added complexity to the LLC—including MAC units, adders, registers, and control FSMs within each Mat (Figure 9, page 7)—is not trivial. This fundamentally alters the design and validation of what is typically a simple memory structure. While the results are compelling against an NPU like Gemmini, the comparison is with a specific type of NPU. The novelty of the trade-off (complex cache vs. dedicated accelerator) is interesting but may not be universally superior, and the paper should be careful not to overgeneralize its benefits. The true innovation is proving this trade-off point is viable, but the complexity cost should be more critically analyzed.
Questions to Address In Rebuttal
-
Please explicitly differentiate the proposed "PIC Array Isolation" (PAI) mechanism from prior art in cache way-partitioning and way-locking (e.g., Intel CAT, or academic proposals for security/QoS). Is the implementation fundamentally different, or is the novelty purely in its application to enable processor-safe PIC?
-
The "Homogeneous Memory Array" (HMA) is a central concept. Can the authors clarify how HMA differs conceptually from a standard SRAM array augmented with peripheral compute logic and a reconfigurable datapath? Is the homogeneity claim based on the reuse of the memory bit-cells themselves, or the uniform structure of the augmented Mats?
-
The work claims to be the "first full-stack" PIC-enabled SoC. To substantiate this novelty claim, could the authors detail one or two specific system-level challenges (e.g., related to the OS scheduler, memory management unit interactions beyond page table walking, or interrupt handling) that they had to solve for this integration, which were not addressed by prior PIC simulation-based studies?
MHE-TPE: Multi-Operand High-Radix Encoder for Mixed-Precision Fixed-Point Tensor Processing Engines
Abstract
Fixed-point general matrix multiplication (GEMM) is pivotal in AI-accelerated computing for data centers and edge devices in GPU and NPU tensor processing engines (TPEs). This work exposes two critical limitations in typical spatial mixed-precision TPEs: ❶...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present MHE-TPE, a tensor processing engine architecture for mixed-precision fixed-point GEMM. The work identifies two valid issues in existing spatial accelerators: redundant partial product (PP) reductions and imbalanced compute density scaling with lower precision. To address this, the paper proposes a multi-operand high-radix encoder (MHE) to halve the number of PPs in vector-inner products and a spatiotemporal mapping strategy to achieve balanced throughput scaling. While the core concept of joint operand encoding is intriguing, the paper's claims of superiority are predicated on a flawed comparative analysis and a concerning lack of discussion regarding significant architectural overheads and practical limitations. The evaluation, while using standard tools, fails to establish the true cost-benefit of the proposed complexity.
Strengths
- The paper correctly identifies and clearly articulates two critical and persistent challenges in the design of mixed-precision spatial accelerators: redundancy in PP reduction (Section 2.1) and the failure of real-world hardware to achieve theoretical throughput gains at lower precisions (Section 2.2).
- The core architectural idea of jointly encoding two multiplicands to operate on two multipliers (the MHE concept, Section 3.1) is a novel approach to PP reduction at the vector level, moving beyond the single-multiplier scope of traditional Booth encoding.
- The experimental methodology is based on a full RTL-to-GDS flow using industry-standard tools on a reasonably current process node (UMC 22nm), lending a degree of credibility to the reported area, power, and timing figures for the implemented components.
Weaknesses
-
Fundamentally Flawed Baseline Comparison: The central claim of superior efficiency rests on the comparisons in Table 8. The authors compare their flexible, mixed-precision MHE-TPE against a "Systolic Array (TPU-like)" baseline. However, the numbers provided for this baseline are for dedicated, single-precision hardware. A reconfigurable architecture will necessarily incur overhead for its flexibility that a dedicated design does not. This comparison is misleading and inflates the perceived benefits of MHE-TPE. A rigorous evaluation would require constructing and synthesizing a comparable reconfigurable baseline (e.g., using tiled low-precision multipliers with a reconfigurable reduction network) on the same UMC 22nm process. Without this, the claims of superior TOPS/W and TOPS/mm² are unsubstantiated.
-
Understated Architectural Overheads and Complexity: The paper glosses over several sources of significant overhead:
- Dual-Clock Domain: The use of a 4x fast clock for MHE/MHD and a slow clock for the compressor trees (Section 3.2.3, Page 6) is a complex design choice. The costs associated with robust cross-clock domain (CDC) synchronization logic (arbiters, synchronizer flops, gray code counters, etc.) in terms of area, power, and potential timing verification challenges are non-trivial. These costs do not appear to be explicitly broken out in the area/power analysis in Table 5.
- VPP Pre-computation Latency: The VPP LUT generation requires 6 cycles of pre-computation (Section 3.2.2, Page 6). The paper frames this within a WS dataflow, assuming high data reuse. However, this preprocessing overhead could severely degrade performance for GEMM workloads with a small K-dimension, where the reuse of Matrix B is limited. The paper fails to provide any sensitivity analysis of performance with respect to the K-dimension.
- Control Fabric: The spatiotemporal mapping methodology requires a sophisticated control fabric to manage the temporal iteration over Matrix A bit-slices and the spatial allocation of Matrix B bit-slices across different TPE Tiles. The area and power of this control logic are not detailed.
-
Inflexible Architectural Constraints: The design appears to be rigidly tied to a 4-bit spatial slicing of Matrix B. As stated in Section 4.7 (Page 13), the VPP LUT bit-width is constrained to 6 bits to support this 4-bit slicing. This imposes a fundamental architectural limitation. The entire premise of spatial scaling for precision relies on this partitioning. The paper does not discuss the ramifications of this constraint or how the architecture would adapt if, for example, 8-bit or 6-bit native slices were more efficient for a given workload.
-
Proposed Solutions for Low Utilization Are Not Implemented: In the analysis of CNN workloads (Section 4.6.2, Page 12), the authors note that utilization drops to 60.88% in deeper layers. They propose a "transposed dataflow layout" as a "viable optimization strategy." This is a purely hypothetical solution. There is no evidence presented that the MHE-TPE hardware, with its specific systolic-like dataflows, can actually support such a transposition efficiently, nor is the area/power overhead of the necessary routing/multiplexing logic for such a dataflow accounted for. Proposing a software fix for a hardware limitation without showing hardware support is a critical weakness.
Questions to Address In Rebuttal
- Can the authors provide a direct, apples-to-apples comparison against a baseline mixed-precision accelerator (e.g., a tiled INT4 multiplier array with a reconfigurable reduction tree) designed and synthesized using the same UMC 22nm process and toolchain?
- Please provide a detailed breakdown of the area and power overhead specifically for the cross-clock domain synchronization logic required by your dual-clock design. How much does this contribute to the total TPE Tile area and power?
- How does the overall performance (effective TOPS) of the MHE-TPE array degrade as the matrix dimension K decreases (e.g., K = 64, 32, 16)? At what point does the 6-cycle VPP LUT pre-computation overhead negate the benefits of PP reduction?
- Regarding the "transposed dataflow layout" proposed to fix low utilization in CNNs: Does the hardware described in Figure 8 contain the necessary interconnects (e.g., crossbars, routing muxes) to implement this dataflow? If so, what is their area and power cost? If not, the claim of this being a viable solution for your architecture is unsubstantiated.
- The architecture is built around a 4-bit spatial slice for Matrix B. What would be the architectural implications and required redesign effort to support a native 8-bit spatial slice? Does this hardcoded 4-bit granularity represent a fundamental performance limiter for future workloads?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces MHE-TPE, a novel architecture for mixed-precision fixed-point tensor processing. The authors identify two fundamental limitations in conventional spatial architectures like systolic arrays: 1) redundant computation in the reduction of partial products (PPs) across both spatial and temporal domains, and 2) imbalanced computational density, where throughput gains from lower precision fail to reach theoretical limits (e.g., INT4 delivering only 2x, not 4x, the throughput of INT8).
The core contribution is a paradigm shift from optimizing single scalar multiplications to optimizing vector inner products directly. This is achieved through a Multi-operand High-Radix Encoder (MHE) that processes pairs of multiplicands (A_{m,2k}, A_{m,2k+1}) to generate selection signals for a pre-computed Vector Partial Product (VPP) Lookup Table. This VPP LUT stores linear combinations of corresponding multiplier pairs (B_{2k,n}, B_{2k+1,n}), effectively halving the number of terms that must be summed in the subsequent reduction tree. To address the mixed-precision scaling problem, the authors propose an elegant spatiotemporal mapping strategy: higher precision in the multiplicand matrix (A) is handled by iterating temporally over bit-slices, while higher precision in the multiplier matrix (B) is handled by distributing bit-slices spatially across adjacent processing tiles. This decoupling allows for balanced, near-theoretical throughput scaling across a wide range of precisions (INT2 to INT32). The authors provide a detailed microarchitectural design and extensive experimental validation, including synthesis results and performance on modern DNN workloads like Llama3 and ResNet50.
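To see why pairing operands halves the reduction work, consider the following functional, non-bit-exact Python sketch. It assumes plain radix-4 Booth digits for each multiplicand and a naive dictionary of linear combinations of the multiplier pair; how the actual design compresses this into an 8-entry VPP LUT with a dedicated selection encoder is exactly the part the sketch does not reproduce.

```python
def booth_radix4_digits(a, nbits=8):
    """Radix-4 Booth digits d_i in {-2,-1,0,1,2} with a = sum_i d_i * 4**i
    (nbits assumed even; a interpreted as a signed nbits-bit value)."""
    bits = [(a >> i) & 1 for i in range(nbits)]
    bits = [0] + bits                      # prepend the implicit a_{-1} = 0
    return [bits[2 * k] + bits[2 * k + 1] - 2 * bits[2 * k + 2]
            for k in range(nbits // 2)]

def dot_product_with_pairwise_lut(A, B, nbits=8):
    """Compute sum_k A[k]*B[k], emitting ONE reduction term per digit position
    per operand PAIR (instead of one per operand), i.e. half as many terms."""
    assert len(A) == len(B) and len(A) % 2 == 0
    total, terms = 0, 0
    for k in range(0, len(A), 2):
        b0, b1 = B[k], B[k + 1]
        # "VPP"-style table: pre-computed linear combinations of the B pair.
        vpp = {(d0, d1): d0 * b0 + d1 * b1
               for d0 in (-2, -1, 0, 1, 2) for d1 in (-2, -1, 0, 1, 2)}
        da = booth_radix4_digits(A[k], nbits)
        db = booth_radix4_digits(A[k + 1], nbits)
        for i in range(nbits // 2):
            total += vpp[(da[i], db[i])] * (4 ** i)
            terms += 1
    return total, terms

# dot_product_with_pairwise_lut([3, -2, 7, 5], [5, 7, -1, 2]) returns (4, 8):
# the correct dot product with 8 reduction terms instead of 16.
```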
Strengths
The primary strength of this work lies in its conceptual elegance and its direct, effective solution to well-known, practical problems in accelerator design.
-
A Novel Architectural Primitive: The concept of a multi-operand encoder for vector inner products is a significant step beyond traditional Booth encoding. Instead of viewing GEMM as a grid of independent MAC operations, the authors reconsider the dot product as the fundamental unit to be optimized. By co-encoding pairs of operands, they fuse two multiplications and an addition into a single lookup and scaling operation before the main reduction network. This is a powerful and generalizable idea that reduces hardware complexity in the most critical component—the reduction tree.
-
Solves the Mixed-Precision Scaling Problem: The spatiotemporal mapping scheme is a standout contribution. The challenge of achieving balanced performance scaling with mixed precision is a major issue in both commercial and academic accelerators, which often rely on composing smaller multipliers, leading to overhead and inefficiency. The authors' method of mapping Matrix A's precision to the time domain and Matrix B's precision to the spatial domain is a clean, scalable solution. It ensures that hardware resources are used efficiently, delivering the 4x density for INT4 vs. INT8 that theory promises but practice rarely delivers. This is a significant practical achievement.
-
Broad Applicability and Generality: Unlike many contemporary works that rely on specific data properties like bit-level sparsity (e.g., Stripes, Laconic) or activation distributions (e.g., LUT-based approaches like LUTein), the MHE-TPE approach is fundamentally mathematical. It exploits the algebraic structure of the dot product itself. This makes the architecture broadly applicable to any dense or sparse GEMM computation without depending on favorable data statistics, enhancing its value as a general-purpose building block.
-
Thorough and Convincing Evaluation: The experimental methodology is comprehensive. The authors analyze the architecture from the component level (Table 4) to the full array (Table 6), exploring a wide design space (Table 5). The analysis under different process corners and conditions (Table 7, Figure 11) and, most importantly, the evaluation on relevant, modern workloads like Llama3 and ResNet50 (Figures 13 & 14) provide strong evidence for the practicality and effectiveness of their design. This level of detail builds significant confidence in the reported results.
Weaknesses
The paper's core ideas are strong, and the weaknesses are relatively minor, mostly relating to the exploration of design-space boundaries and the presentation of complex ideas.
-
Implicit Design Constraints of the VPP LUT: The VPP LUT is central to the design. The paper explains that it stores 8 pre-computed values derived from a 4-bit slice of two B operands. While the spatial mapping of 4-bit slices is well-justified for scaling Matrix B's precision, the paper could benefit from a deeper discussion on why a 4-bit slice is the optimal design point for the LUT itself. What are the area/power/timing implications of designing a VPP LUT based on, for example, 2-bit or 8-bit slices of B? This would help clarify the trade-offs that led to the current design.
-
Clarity of the Array-Level Dataflow: While the component-level diagrams (Figures 6 & 7) are clear, the paper could improve the intuitive leap to the full array-level spatiotemporal mapping (Figure 10). A more explicit, step-by-step example walking through a simple mixed-precision GEMM (e.g., INT8 x INT4) showing how the data for Matrix A iterates temporally and how the high/low nibbles of Matrix B are placed in different tiles would significantly aid reader comprehension.
-
Overhead of Inter-Tile Communication and Control: The design relies on a Local Reduce Module (LRM) for cross-tile accumulation and systolic broadcast of selection signals. While the components are evaluated, the complexity and overhead of the control logic and routing required to manage this dynamic, precision-dependent dataflow are not explicitly broken out. In a physical implementation, this control fabric can contribute non-trivially to area and power.
Questions to Address In Rebuttal
-
Could the authors elaborate on the design trade-offs of the VPP LUT's input bit-width? Is the choice of processing 4-bit slices of Matrix B fundamental to the architecture's efficiency, or could it be adapted for wider slices (e.g., 8-bit)? If so, how would this impact the LUT's complexity and the overall TPE design?
-
The spatiotemporal mapping is a powerful concept. Could the authors comment on the complexity of the control logic required to manage this mapping? Specifically, how does the LRM handle the synchronization and bit-shifting required for different precision combinations (e.g., INT2 A × INT32 B vs. INT32 A × INT2 B)?
-
To provide a more direct comparison, could the authors provide an estimate of the area and power savings of their MHE-TPE tile (e.g., for an M=32, K=32 INT8 computation) compared to a hypothetical baseline tile implemented in the same UMC 22nm process, but constructed by composing four INT4 standard Booth multipliers with a conventional reduction tree, as critiqued in Section 2.2? This would help to precisely quantify the benefits of the proposed approach against its most direct alternative.
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors propose a novel microarchitecture, the Multi-operand High-Radix Encoder Tensor Processing Engine (MHE-TPE), designed to mitigate two key inefficiencies in mixed-precision spatial GEMM accelerators: redundant partial product (PP) reduction and imbalanced computational density scaling. The central claim to novelty lies in the "Multi-operand High-radix Encoder" (MHE). This mechanism performs a joint encoding of two multiplicand operands to generate a single selection signal. This signal is then used to retrieve a "Vector Partial Product" (VPP) from a small, pre-computed lookup table (LUT) containing linear combinations of the corresponding two multiplier operands.
By structuring the computation around 2-element vector inner products (M(A_{2k})B_{2k} + M(A_{2k+1})B_{2k+1}), this approach effectively halves the number of PPs that need to be reduced in the subsequent hardware stages. The paper builds a complete architectural framework around this core idea, featuring a three-stage pipeline (encoding, VPP generation, reduction) and a spatiotemporal mapping strategy that assigns multiplicand (Matrix A) precision to the temporal domain and multiplier (Matrix B) precision to the spatial domain.
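The spatiotemporal mapping itself is easy to illustrate. The toy Python model below (my simplification: unsigned operands, no signed-digit handling, and no model of the LRM's cross-tile accumulation) shows how bit-slices of A visited over time and bit-slices of B spread across tiles recombine through shifts into the full-precision product.

```python
def sliced_product(a, b, a_slice_bits=4, b_slice_bits=4):
    """Toy model of the spatiotemporal mapping: slices of `a` are visited over
    time steps, slices of `b` live in different tiles, and every (time, tile)
    pair contributes one narrow product, shifted into its weight position."""
    def slices(x, w):
        n = max(1, (x.bit_length() + w - 1) // w)
        return [(x >> (i * w)) & ((1 << w) - 1) for i in range(n)]
    acc = 0
    for i, a_s in enumerate(slices(a, a_slice_bits)):      # temporal loop
        for j, b_s in enumerate(slices(b, b_slice_bits)):  # spatial (tile) loop
            acc += (a_s * b_s) << (i * a_slice_bits + j * b_slice_bits)
    return acc

assert sliced_product(0xBEEF, 0xCAFE) == 0xBEEF * 0xCAFE   # 16-bit x 16-bit via 4-bit slices
```

In the architecture as described, the temporal loop costs extra passes while the spatial loop costs extra tiles, which is why throughput can scale with either operand's precision independently.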
Strengths
The primary strength of this work is that its core contribution—the multi-operand encoder (MHE)—appears to be genuinely novel. My analysis of prior art confirms the following:
-
Novel Extension of Booth Encoding: While high-radix encoding (e.g., Radix-4, Radix-8 Booth) is a well-established technique for reducing PPs from a single multiplicand, the proposed MHE extends this concept into a vector dimension. It is, to my knowledge, the first architecture to propose a joint, simultaneous encoding of multiple multiplicand operands to reduce the number of PPs for a vector inner product. This is not merely a higher-radix encoder but a "higher-dimension" encoder.
-
Unique Application of LUTs: The use of LUTs in accelerators is not new (e.g., LUT-TensorCore [32], LUTein [16]). However, the implementation here is distinct. Prior work typically uses single-operand lookups to store pre-computed results (e.g., activation values or partial MBE terms). In contrast, the VPP LUT in this work (Figure 5, page 4) stores pre-computed linear combinations of two distinct multiplier operands (B_{2k}, B_{2k+1}). The lookup is indexed by a signal derived from two distinct multiplicand operands. This joint operand treatment at the hardware encoding level is a significant conceptual departure from existing LUT-based designs.
Coherent Architectural Framework: The proposed three-stage computational paradigm and the spatiotemporal mapping strategy (Figure 10, page 8) are logical and novel consequences of the core MHE encoding scheme. By decoupling the reduction dimension (K) from the physical compressor tree size, the architecture achieves a more balanced and predictable scaling of throughput with precision, which is a non-trivial architectural innovation.
Weaknesses
My critique is focused on the contextualization of the novelty and its inherent trade-offs, rather than the novelty itself.
-
Insufficient Differentiation from Fused Vector Operations: While the hardware implementation is novel, the functional goal—replacing two multiplications and an addition with a single, more complex operation—is conceptually related to fused vector operations or custom instructions in DSPs. The paper would be stronger if it more explicitly differentiated the MHE concept from this broader domain, highlighting why a hardware-level joint encoding approach is fundamentally different and more efficient than a higher-level instruction fusion.
-
Framing of the Problem: The motivation presented in Section 2 (page 3) identifies redundancy in standard MBE-based multipliers. This redundancy, arising from the limited codomain of the MBE function ({-2, -1, 0, 1, 2}), is a known phenomenon. The novelty is not in observing this redundancy, but in the specific mechanism proposed to exploit it at the vector level. The paper could sharpen its contribution by stating this more directly, framing the work as a novel exploitation of a known property, rather than the discovery of the property itself.
Implicit Complexity Trade-offs: The MHE introduces new hardware components: the VPP Select Encoder, the VPP LUT, and the pre-computation adder (Section 3.1, page 4). This represents a non-trivial area, power, and latency cost paid upfront to simplify the downstream reduction tree. The dual-clock scheme (Section 3.2.3, page 6) further adds design complexity. While the results demonstrate a net benefit for the chosen configurations, a more detailed analysis of the break-even point (e.g., for what reduction dimension K does this approach become superior to a standard design?) would better contextualize the practical domain of this novel contribution.
Questions to Address In Rebuttal
-
The proposed MHE is based on a 2-element vector inner product. What are the theoretical and practical barriers to extending this to a 3- or 4-element vector inner product? Would the VPP LUT size (currently 8 entries) and the complexity of the VPP Select Encoder grow exponentially, rendering the novelty of this approach fundamentally limited to the 2-element case?
-
The architecture's mixed-precision capability for Matrix B relies on spatial mapping of 4-bit slices across different TPE Tiles (Section 3.3.2, page 7). This fixes the internal VPP LUT data path to a width compatible with 4-bit inputs (+2 expansion bits). Does this design choice create a structural inflexibility compared to conventional bit-serial or bit-sliced architectures that can be reconfigured for, say, 1-bit or 2-bit operations on Matrix B without stranding hardware resources? How does this impact the generality of the proposed "unified hardware"?
-
The VPP LUT must be populated during a "preprocessing phase" before computation can begin. For workloads with very low data reuse (i.e., a small M dimension in a GEMM), could the latency and energy overhead of this preprocessing phase negate the benefits gained from the reduced PP count? Please provide an estimate of the M dimension at which the MHE-TPE begins to outperform a conventional systolic array of the same area.
SuperMesh: Energy-Efficient Collective Communications for Accelerators
Abstract
Chiplet-based Deep Neural Network (DNN) accelerators are a promising approach to meet the scalability demands of modern DNN models. Such accelerators usually utilize 2D mesh topologies. However, state-of-the-art collective communication algorithms often ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present "SuperMesh," a modification to the standard 2D-mesh topology for chiplet-based accelerators, intended to improve the performance and energy efficiency of collective communication operations. The proposed modification involves adding short, bidirectional links exclusively between adjacent nodes on the periphery of the mesh. Two variants are proposed: SUPERMESHBI, which adds links parallel to all peripheral links, and SUPERMESHALTER, which adds them alternately. The authors claim that these minimal additions resolve the well-known communication bottleneck at border nodes. They co-design pipelined AllReduce, ReduceScatter, and AllGather algorithms that leverage these new links to form four disjoint communication trees, as opposed to the three trees possible in a standard mesh. The paper claims significant speedups for collectives (up to 1.33x for AllReduce and 2.22x for ReduceScatter/AllGather) and improved energy efficiency compared to baseline mesh topologies.
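As a quick sanity check on the peripheral-degree argument, the snippet below counts link endpoints per node for a plain mesh and for my reading of the BI variant (every boundary-ring link gets one parallel copy; the ALTER variant, which adds only alternating links, is not modeled):

```python
def node_degrees(n, supermesh_bi=False):
    """Link endpoints per node in an n x n mesh; optionally add one parallel
    link for every edge whose two endpoints both lie on the periphery (my
    reading of the BI variant; ALTER would add only every other one)."""
    links = []
    for x in range(n):
        for y in range(n):
            if x + 1 < n:
                links.append(((x, y), (x + 1, y)))
            if y + 1 < n:
                links.append(((x, y), (x, y + 1)))
    if supermesh_bi:
        border = lambda p: p[0] in (0, n - 1) or p[1] in (0, n - 1)
        links += [e for e in links if border(e[0]) and border(e[1])]
    deg = {(x, y): 0 for x in range(n) for y in range(n)}
    for u, v in links:
        deg[u] += 1
        deg[v] += 1
    return deg

mesh, sm_bi = node_degrees(4), node_degrees(4, supermesh_bi=True)
print(min(mesh.values()), min(sm_bi.values()))  # corner degree: 2 -> 4
```

Under this construction, corner nodes go from degree 2 to 4 and non-corner border nodes from 3 to 5, which is consistent with the later observation that some SUPERMESHBI routers grow to six ports once the local port is counted.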
Strengths
- The paper correctly identifies a well-established and significant performance limiter in large-scale 2D-mesh interconnects: the reduced connectivity of border and corner nodes, which creates bottlenecks during collective communication operations.
- The proposed topological modification is conceptually simple and localized to the periphery, which plausibly preserves the scalability and regularity advantages of the core mesh structure.
- The evaluation includes a reasonable set of baseline algorithms (TTO, TACOS, MultiTree) and collectives (AR, RS, AG), demonstrating a breadth of analysis.
Weaknesses
The paper's claims rest on a foundation that appears methodologically questionable and contains several unsupported assertions.
-
Outdated Technological Assumptions: The entire energy and power analysis (Section 6.4, page 11) is predicated on DSENT simulations using a 32nm process node. This technology is over a decade old and is not representative of modern chiplet-based accelerators, which are fabricated on 7nm, 5nm, or even more advanced nodes. Power characteristics, particularly the ratio of static to dynamic power, differ dramatically across process generations. Therefore, any conclusions regarding energy efficiency (e.g., claims of consuming 0.72-0.84x the energy of mesh) are suspect and cannot be credibly extrapolated to current or future hardware.
-
Unsubstantiated Claims Against Prior Art: In Section 2.3 (page 2), the authors dismiss the ARIES interconnect by claiming that "over 84% of its added links could be removed without affecting collective communication performance." This is a remarkably specific and strong claim used to position their work favorably against a relevant alternative. However, the paper provides no data, analysis, or citation to support this figure. Without evidence, this stands as an unsubstantiated assertion designed to marginalize a competitor's design.
-
Heuristic and Potentially Sub-Optimal Algorithm Design: The co-designed collective algorithms rely on a heuristic approach for tree formation (Algorithm 1, page 8) and a specific, patterned strategy for root selection (Figure 8, page 8). The paper provides no formal analysis or proof that this greedy, BFS-based method consistently produces optimally balanced trees. The performance gains could be an artifact of a specific, favorable root selection that may not hold under different conditions or for other collective patterns. The lack of robustness analysis for the tree-generation algorithm is a significant omission.
-
Conflation of Throughput and Latency Metrics: The paper's primary motivation is to improve the throughput of large-data collective operations, which is a bandwidth-bound problem. However, Section 6.11 (page 13) presents results on average packet latency for point-to-point traffic patterns (uniform and tornado). While these results show a latency reduction, they are largely orthogonal to the core thesis. Performance in these traffic patterns does not necessarily correlate with performance for bandwidth-intensive collectives. This section feels extraneous and distracts from the central claims, potentially creating a misleading impression of the topology's general-purpose benefits.
-
Incomplete Scalability Analysis: While Figure 11 (page 9) shows normalized runtime scaling up to 256 nodes, the analysis is incomplete. As the mesh size (N x N) increases, the ratio of border nodes (4N-4) to total nodes (N^2) decreases. Consequently, the relative contribution of the authors' peripheral modifications should diminish with scale. The paper fails to discuss this fundamental scaling property and at what point the benefits of SuperMesh would become negligible compared to core mesh bottlenecks in extremely large systems.
Questions to Address In Rebuttal
- Please provide a rigorous justification for using a 32nm process node for the energy analysis. Furthermore, provide a sensitivity analysis showing how the claimed energy efficiency benefits (Figure 14 and 15) would change when modeled with a more contemporary technology (e.g., 7nm), where static power is a different component of the total power budget.
- Provide the complete data and methodology to substantiate the claim from Section 2.3 that "over 84% of [ARIES's] added links could be removed without affecting collective communication performance." If this analysis cannot be provided, the claim should be retracted.
- The tree generation in Algorithm 1 is heuristic. Can the authors provide a formal analysis of its properties? Specifically, how does the algorithm ensure tree height balance, and how sensitive are the final performance results to the root selection strategy presented in Figure 8 versus a random or alternative root selection?
- Please clarify the relevance of the point-to-point latency results in Section 6.11 to the paper's central thesis on improving collective communication throughput. Why were these unicast traffic patterns chosen for evaluation instead of analyzing latency characteristics within the pipelined collective operations themselves?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents "SuperMesh," a novel set of topologies and co-designed collective communication algorithms for chiplet-based DNN accelerators. The core contribution is a minimalist and targeted approach to solving the well-documented communication bottleneck at the borders of conventional 2D mesh networks. Instead of proposing radical topological changes or adding power-hungry long-range links, the authors augment the standard mesh by adding short, bidirectional links only between adjacent nodes along the periphery. They propose two variants: SUPERMESHBI (links added to all peripheral nodes) and SUPERMESHALTER (alternating links). To leverage this modest hardware change, they extend the pipelined, disjoint-tree approach for AllReduce (AR) to utilize four trees instead of the typical three, and critically, they adapt this highly efficient pipelined paradigm to ReduceScatter (RS) and AllGather (AG) operations, which are often overlooked by specialized AR optimizations. The work argues that this targeted, "less is more" philosophy yields significant performance and energy-efficiency gains with minimal design overhead.
Strengths
-
Elegant Problem-Solution Fit: The paper's primary strength lies in its diagnosis of the problem and the elegance of its solution. The authors correctly identify that for collective communications on a mesh, the performance-limiting factor is not the bisection bandwidth or average latency, but the reduced connectivity of border and corner nodes (as illustrated in Figure 2, page 2). The SuperMesh topology is a direct, precise, and minimal remedy for this specific ailment. It avoids the high costs (energy, latency, design complexity) of more "brute-force" solutions like Folded Torus or the potential overkill of uniform augmentation schemes like ARIES.
-
Pragmatism and Real-World Applicability: The proposed modifications are highly pragmatic. By adding only short, local links, the design adheres to the physical constraints of interposer-based chiplet systems where long D2D links are undesirable. The fact that the internal mesh structure remains untouched makes this an easily adoptable, almost "drop-in" enhancement for existing and future mesh-based accelerator designs. This practicality is a significant advantage over more academically novel but physically challenging topologies.
-
Strong Hardware-Software Co-Design: This is not merely a paper about a new topology. The co-designed collective algorithms presented in Section 4 (page 6) are essential to the work's success. The ability to form four disjoint trees for pipelined AllReduce fully utilizes the enhanced connectivity and directly translates the hardware modification into performance. More importantly, the novel adaptation of this pipelined methodology to ReduceScatter and AllGather is a substantial contribution, addressing the needs of modern training paradigms like ZeRO that rely heavily on these collectives.
-
Comprehensive Contextualization and Evaluation: The authors do a commendable job of positioning their work relative to a wide spectrum of existing interconnect research. The comparative analysis in Figure 1 (page 2) and later in Section 6.8 (page 12) effectively demonstrates the trade-offs between mesh, torus, butterfly, and Kite topologies, making a strong case for their approach. The evaluation against multiple state-of-the-art collective algorithms (TACOS, TTO, MultiTree) further solidifies their claims.
Weaknesses
While the core idea is strong, the paper could be improved by exploring its implications more broadly.
-
Understated Philosophical Contribution: The authors correctly claim that schemes like ARIES are inefficient for this problem, stating that "over 84% of its added links could be removed" (page 2). This is a powerful insight. However, the paper frames its contribution primarily as a new topology rather than a new design principle: that for collectives on a mesh, targeted, non-uniform augmentation is superior to uniform augmentation. Elevating this principle and contrasting it more directly with the philosophy behind ARIES would better highlight the work's conceptual novelty.
-
Limited Discussion of Physical Design Overhead: The paper claims "negligible area and power cost," but this is assessed at an architectural level. A more detailed discussion on the physical implementation would be beneficial. For example, the SUPERMESHBI variant requires some border nodes to have six-port routers (page 6). This has implications for router design, area, and power that are not fully explored. Furthermore, routing these additional peripheral links on a dense interposer, even if short, could present challenges that warrant discussion.
-
Narrow Focus on Collective Communication: The work is laser-focused on optimizing collective communication, which is its primary goal. However, large-scale accelerators also handle point-to-point traffic. While Section 6.11 (page 13) briefly shows a benefit for latency under uniform and tornado traffic, a deeper analysis would be welcome. How would a congestion-aware routing algorithm leverage the additional peripheral paths for mixed workloads? Could the extra links create routing complexities or deadlocks for non-collective traffic if not managed carefully? A more holistic view of the network's performance would strengthen the paper.
Questions to Address In Rebuttal
-
The comparison to ARIES is a key part of your motivation. Could you elaborate on your claim that 84% of ARIES's added links are unnecessary for collective performance? Does this suggest that SuperMesh captures nearly all the collective performance benefits of a fully-connected-row/column mesh like ARIES, but with a fraction of the hardware cost?
-
Regarding the physical design, the SUPERMESHBI variant requires some routers to be upgraded from 5-port to 6-port. Could you provide an estimate of the area and static power overhead of this change? How does this impact the overall claim of minimal overhead, especially if a significant portion of nodes are border nodes in smaller mesh configurations?
-
Your work brilliantly optimizes for collective throughput. In a realistic scenario with mixed workloads (e.g., parallel execution of multiple models, or complex dataflow patterns involving both collectives and point-to-point messages), could the peripheral links in SuperMesh be leveraged to offload congestion from the core mesh? Or do you see the primary benefit remaining strictly within the context of large-scale collective operations?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present "SuperMesh," a modification to 2D-mesh topologies for chiplet-based accelerators, designed to improve the performance and energy efficiency of collective communication operations. The core idea is that for collective communication, the primary bottleneck is the reduced connectivity of border nodes, not internal congestion. To address this, the authors propose adding short, bidirectional links parallel to the existing links, but only at the periphery of the mesh. Two variants are proposed: SUPERMESH_BI (adds links to all peripheral segments) and SUPERMESH_ALTER (alternates link additions). To leverage this new hardware, the authors co-design pipelined collective algorithms (AllReduce, ReduceScatter, AllGather) that are adaptations of the pipelined tree approach, now capable of forming four disjoint trees instead of the three possible on a standard mesh.
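Both this review and Review 1 refer to a BFS-based tree-formation heuristic (Algorithm 1). The generic Python sketch below captures the flavor of such a heuristic, a greedy root-by-root BFS that claims each physical link for at most one tree; the authors' actual balancing and tie-breaking rules are not reproduced here.

```python
from collections import deque

def greedy_link_disjoint_trees(links, roots):
    """Greedy BFS sketch: grow one tree per root, claiming each physical link
    (including parallel copies) for at most one tree. Trees are returned as
    child -> parent maps; they may fail to span all nodes if links run out."""
    incident = {}
    for link_id, (u, v) in enumerate(links):
        incident.setdefault(u, []).append((link_id, v))
        incident.setdefault(v, []).append((link_id, u))
    claimed = set()
    trees = []
    for root in roots:
        parent = {root: None}
        frontier = deque([root])
        while frontier:
            u = frontier.popleft()
            for link_id, v in incident.get(u, []):
                if link_id in claimed or v in parent:
                    continue
                claimed.add(link_id)      # this link now belongs to this tree only
                parent[v] = u
                frontier.append(v)
        trees.append(parent)
    return trees
```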
Strengths
The primary strength of this work lies in its novel and targeted approach to a well-known problem. The central claim of novelty is twofold: the specific topological modification and the co-designed algorithms that exploit it.
-
Precise and Justified Problem Formulation: The authors' key insight—that the bottleneck for collective communication is peripheral link scarcity due to node degree, rather than the internal congestion that plagues point-to-point traffic—is a clear and novel framing of the problem. This distinction correctly separates their work from a large body of prior art on express links for general-purpose NoCs (e.g., MECS [17], Adapt-NoC [84]), which are designed to reduce average packet latency, not maximize collective throughput.
-
Elegant and Minimalist Hardware Solution: The proposed solution is compelling in its simplicity. Instead of introducing complex, long-range links (as in Folded Torus [74] or Kite [5]) or uniformly augmenting the entire mesh (as in ARIES [81]), the authors propose a minimal, localized change. Adding short, parallel links only at the periphery is practical for interposer-based designs where link length is a critical constraint. This represents a significant "delta" from prior work. The claim in Section 2 (page 2) that over 84% of ARIES's added links could be removed without impacting collective performance is a powerful justification for this targeted approach.
-
Demonstrated Synergy of Hardware and Software: The novelty is not just in adding links, but in showing that this minimal change enables a qualitative shift in algorithmic capability. Specifically, it overcomes the fundamental limitation of pipelined AllReduce (TTO [35]) on a 2D-mesh, which cannot form more than three disjoint trees. The ability to form four trees using all N nodes is a direct and significant consequence of the SuperMesh topology.
Weaknesses
My critique is focused on the precise boundaries of the claimed novelty and whether the contributions are as fundamental as portrayed.
-
Incremental Nature of the Algorithmic "Co-Design": The novelty of the proposed collective algorithms is overstated. The paper presents them as "co-designed," but the core algorithmic paradigm is a direct extension of the existing pipelined tree concept from TTO [35]. The fundamental innovation was the pipelined execution over disjoint trees. The authors' algorithm (Algorithm 1, page 8) essentially applies this existing concept to a new graph that now supports four trees instead of three. While necessary for the paper, this is more of an adaptation than a novel algorithmic contribution in its own right. The true novelty lies in the hardware topology that enables this adaptation.
-
The Core Idea is an Optimization, Not a New Paradigm: The act of adding parallel links to a network is not, in itself, a new idea. The novelty here is the targeted placement of these links. While I acknowledge this is a clever and effective optimization, it must be viewed as such. It is an evolutionary improvement on the 2D-mesh for a specific workload, not a revolutionary new class of interconnect.
-
Limited Scope of Novelty: The entire contribution is predicated on the dominance of collective communication. While this is true for many distributed training workloads, the paper provides limited evidence that the topology does not regress performance for other important traffic patterns. The brief analysis in Section 6.11 (page 13) shows an improvement for uniform and tornado traffic, but this seems to be a secondary effect. The novelty is confined to a specific problem domain, which limits its foundational impact.
Questions to Address In Rebuttal
-
The closest and most relevant prior art appears to be ARIES [81], which adds bypass links uniformly. Your most compelling argument against it is the claim that "over 84% of its added links could be removed without affecting collective communication performance" (Section 2, page 2). Is this claim based on a rigorous simulation of ARIES using your pipelined collective algorithms, or is it an analytical argument based on link counting? Please provide clear evidence for this specific number, as it is central to justifying your targeted approach over a more general one.
-
Could the authors clarify the novelty of their collective algorithm beyond being an adaptation of the TTO/pipelined tree concept [35]? Is there a more fundamental algorithmic insight—perhaps in the tree formation or scheduling—that was required to efficiently utilize the SUPERMESH_ALTER variant, where peripheral connectivity is irregular?
-
The core proposal is to add more physical paths at the periphery. A conceptually similar alternative would be to add more bandwidth to existing peripheral links (e.g., by doubling their width or clock frequency). While this may have its own implementation challenges, it would also address the peripheral bottleneck. Can you argue why your topological solution is fundamentally superior to a non-topological, bandwidth-focused solution targeting the same physical locations in the mesh? This would help solidify the novelty of the topological contribution itself.
SMX: Heterogeneous Architecture for Universal Sequence Alignment Acceleration
Abstract
Sequence alignment is a fundamental building block for critical applications across multiple fields, such as computational biology and information retrieval. The rapid advancement of genome sequencing technologies and breakthrough generative AI tools, ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper presents SMX, a heterogeneous architecture combining an ISA extension (SMX-1D) and a dedicated coprocessor (SMX-2D) to accelerate sequence alignment. The authors claim this approach provides both the flexibility of general-purpose cores and the efficiency of specialized hardware, enabling acceleration across diverse alignment models (DNA, protein, ASCII) and algorithms (banded, Xdrop, Hirschberg). However, the work suffers from significant methodological weaknesses in its evaluation. The performance claims rely on comparisons against a questionable baseline and a deeply flawed analysis of state-of-the-art competitors. The architectural justification for the dual-component design is not sufficiently supported by evidence, and the physical design analysis rests on questionable assumptions. Consequently, the claimed performance advantages are not convincingly substantiated.
Strengths
- Comprehensive Design: The paper details a complete architectural proposal from the ISA level (SMX-1D) to a coprocessor microarchitecture (SMX-2D) and its integration with a host CPU.
- Physical Implementation: The authors provide an RTL-level implementation and physical design results in a 22nm process (Figure 13, page 12). This demonstrates a degree of engineering effort beyond high-level simulation, lending some credibility to the area and frequency claims, assuming the underlying process is acceptable.
- Exploration of Differential Encoding: The work correctly identifies and builds upon differential encoding as a key optimization for reducing data width, which is a sound architectural principle for this domain.
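For context on the property being exploited, the toy Python code below (standard unit-cost edit distance, not the SMX datapath) records the vertical cell-to-cell differences of the DP matrix; for this cost model they are always in {-1, 0, +1}, so a two-bit delta suffices where a full score counter would otherwise be stored.

```python
def edit_distance_with_vertical_deltas(q, r):
    """Plain Levenshtein DP that also records D[i][j] - D[i-1][j]. For unit
    edit costs these deltas are bounded to {-1, 0, +1}, which is the property
    differential encodings rely on to shrink per-cell storage width."""
    m = len(r)
    prev = list(range(m + 1))
    delta_rows = []
    for i in range(1, len(q) + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cur[j] = min(prev[j - 1] + (q[i - 1] != r[j - 1]),  # match/mismatch
                         prev[j] + 1,                            # deletion
                         cur[j - 1] + 1)                         # insertion
        delta_rows.append([cur[j] - prev[j] for j in range(m + 1)])
        prev = cur
    assert all(d in (-1, 0, 1) for row in delta_rows for d in row)
    return prev[m], delta_rows

score, deltas = edit_distance_with_vertical_deltas("kitten", "sitting")
print(score)   # 3
```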
Weaknesses
-
Fundamentally Flawed State-of-the-Art (SotA) Comparison: The comparisons presented in Section 11 and Figure 14 (page 12) are misleading and do not constitute a fair or rigorous evaluation.
- GACT (Darwin): The authors claim GACT achieves "zero recall" on ONT sequences, a damning assertion used to dismiss its performance advantage. This is an extraordinary claim that suggests either a fundamental flaw in GACT or, more likely, a suboptimal configuration by the authors. A heuristic's failure is often a matter of parameter tuning, which is not discussed. The authors then pivot to comparing their own banded algorithm against GACT's windowed heuristic—an apples-to-oranges comparison, as these are different algorithmic trade-offs. The fact remains that on the task GACT was designed for, SMX is 2.4x slower.
- DPX (on CPU SIMD): The evaluation of DPX is a strawman argument. DPX is an ISA extension for NVIDIA's massively parallel GPU architecture, designed to leverage its specific execution model and memory hierarchy. Implementing its logic on a CPU's limited-width SIMD unit is not a faithful representation of the architecture and is guaranteed to perform poorly. This comparison is invalid and serves only to inflate SMX's relative performance.
- CUDASW++ (GPU): The comparison against an NVIDIA H100 GPU is based on a projection of a "72-core SMX-enhanced Grace CPU". This is not a real system. Such a projection is fraught with unstated assumptions about memory bandwidth, interconnect performance, and software overhead. Comparing a speculative, simulated system against a real-world, highly-optimized software library running on state-of-the-art hardware is not a scientifically valid methodology.
-
Insufficiently Justified Baseline: The primary performance evaluation in Figure 9 (page 10) uses the KSW2 implementation from Minimap2 as the sole "SIMD" baseline for all use cases (DNA-edit, DNA-gap, Protein, ASCII). While KSW2 is a respectable implementation, it is not the universally acknowledged state-of-the-art for all these scenarios. Other libraries, such as Parasail or SSW, offer highly optimized SIMD kernels, particularly for protein alignment with substitution matrices. By selecting a single baseline, the authors have likely inflated their speedup claims against software that is not optimally tuned for every specific task they evaluate.
-
Weak Justification for the Heterogeneous Approach: The paper's core thesis is that the combination of SMX-1D and SMX-2D is superior. However, the necessity of the SMX-1D ISA extension is not proven. Its primary stated role in the combined system is to handle traceback re-computation within tiles (Figure 8, page 9). The paper fails to provide an ablation study that quantifies the performance impact of this task. It is unclear if a simple, scalar CPU implementation of re-computation would create a significant bottleneck. Without this analysis, the added complexity and area of the SMX-1D unit is not justified over simply using the CPU for irregular tasks and a more powerful SMX-2D coprocessor. The synergy is asserted, not demonstrated.
-
Questionable Physical Design Assumptions: The physical analysis in Section 10 (page 12) is based on a 22nm technology node, which is significantly outdated. Area comparisons to accelerators or processors built on modern 5nm/4nm nodes are therefore difficult to interpret. More critically, the power consumption figure (0.342 mW) is derived from an assumed "20% gate activity factor." This value seems arbitrarily low for a systolic array (SMX-Engine) designed for high-throughput computation, which would be expected to have very high activity when utilized. This assumption requires strong justification, as it likely leads to an underestimation of the true power draw.
Questions to Address In Rebuttal
- Regarding the GACT Comparison: Please provide a thorough justification for GACT's "zero recall." Did you attempt to tune its heuristic parameters for the ONT dataset? Please defend the decision to compare different algorithms (SMX with banding vs. GACT with windowing) rather than reporting that SMX is 2.4x slower on an identical task.
- Regarding the DPX Comparison: How can implementing a GPU-specific ISA, designed for a massively parallel SIMT architecture, on a CPU's narrow SIMD unit be considered a fair or representative evaluation of DPX's capabilities?
- Regarding the Baseline Selection: Please justify the exclusive use of KSW2 as the SIMD baseline across all evaluated alignment models. Provide data showing how KSW2 compares to other state-of-the-art SIMD libraries (e.g., Parasail) for protein alignment, and explain why it was still considered a suitable baseline if it is not the top performer.
- Regarding Architectural Justification: Please provide an ablation study quantifying the performance bottleneck of traceback re-computation when performed on the host CPU versus the SMX-1D extension. This data is critical to justify the area and design complexity of including SMX-1D.
- Regarding Physical Design: Please provide a rationale for the assumed 20% gate activity factor used for power estimation. How does this value compare to activity factors observed in similar high-throughput systolic array designs operating at near-peak performance?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces SMX, a heterogeneous architecture for accelerating sequence alignment. The core contribution is not merely another accelerator, but a thoughtfully architected system that recognizes and addresses the dual nature of modern alignment algorithms. The authors propose a co-design of two specialized components: (1) SMX-1D, an ISA extension integrated into a general-purpose core to handle irregular, sequential, and control-heavy tasks like traceback and heuristic evaluation; and (2) SMX-2D, a dedicated coprocessor designed as a 2D systolic array to accelerate the regular, parallel, and compute-intensive task of DP-matrix calculation.
This approach aims to bridge the well-known gap between flexible general-purpose processors (which are slow) and efficient but rigid domain-specific accelerators (which are inflexible). By architecturally partitioning the problem, SMX seeks to achieve the best of both worlds: high performance on the bulk of the computation via the coprocessor, while retaining the programmability of the host CPU to implement a wide variety of complex heuristics and algorithms. The work is supported by extensive cycle-accurate simulations and a physical design implementation, demonstrating significant speedups over state-of-the-art software and favorable performance-per-area compared to other hardware accelerators.
Strengths
-
Core Architectural Insight: The primary strength of this paper is its fundamental design philosophy. The authors correctly identify that practical sequence alignment is not a single, monolithic task. It is a composite of highly regular DP computation and highly irregular control flow. The decision to map this algorithmic duality onto a hardware duality (SMX-2D for compute, SMX-1D for control) is elegant and powerful. This positions the work not as just another data point on the accelerator spectrum, but as a novel and compelling design pattern for this problem domain.
-
Excellent Problem Contextualization: The paper does an outstanding job of situating itself within the broader landscape. The introduction and motivation sections (Sections 1 and 3) clearly articulate the exponential growth of sequence data and the limitations of existing solutions—from the overhead of CPUs to the inflexibility of ASICs like GenASM [15] and Darwin [101]. The analysis in Figure 2 (page 2), showing the trade-offs between computation, memory, and accuracy for different algorithms, effectively frames the need for a solution that is both fast and flexible.
-
Demonstrated Flexibility and Versatility: A key claim of the paper is its ability to accelerate a variety of alignment tasks, and the evaluation strongly supports this. The experiments cover DNA, protein, and ASCII alignment, and critically, they implement and accelerate complex, practical algorithms like Hirschberg and Xdrop. The performance comparison in Section 11 (page 12), where SMX is configured to emulate the behavior of other specialized systems, is a particularly strong piece of evidence for its flexibility. It shows that SMX can perform reasonably well even on tasks for which systems like GACT are purpose-built, while also being able to execute algorithms (like full-recall Hirschberg) that those systems cannot.
-
Grounded in Reality: The work is not purely theoretical. The inclusion of a physical design implementation in a 22nm technology node (Section 10, page 12) provides concrete area and power estimates. This demonstrates that the proposed architecture is practical and not excessively costly, with the entire SMX system adding about 30% to a single-issue in-order core—a reasonable overhead for the massive speedups achieved.
Weaknesses
-
Understated Software and Programmability Challenge: While the hardware architecture is detailed, the paper gives little insight into the programming model. How does a developer leverage SMX? Is there a high-level API or library that abstracts the offload to SMX-2D and the use of SMX-1D instructions? Or must the programmer manage this complex interplay manually? The success of a heterogeneous system heavily depends on its usability. Without a clear software story, the barrier to adoption for the bioinformatics community could be high. This feels like a significant missing piece in the overall system design.
-
The Limits of "Universality": The title claims "Universal" acceleration. The paper impressively supports various characters, substitution matrices, and several key algorithmic heuristics. However, a major class of alignment models not discussed is affine gap penalties, which are ubiquitous in bioinformatics. The recurrence relations in Equation 2 (page 3) and their differential forms in Equations 5 and 6 (page 5) are for linear gap penalties. It is unclear if the SMX-PE microarchitecture (Figure 5, page 6) can be extended to handle the three-way DP-matrix dependency required for affine gaps without a significant redesign and area increase. Clarifying the bounds of this universality would strengthen the paper.
-
Opportunity for a Deeper Discussion on Design Trade-offs: The paper presents itself as a "case study" exploring the frontier between flexibility and efficiency. While the results demonstrate this trade-off in action (e.g., the comparison with GACT in Section 11), the discussion could be more explicit. What specific design choices in the SMX-engine or SMX-worker were made to prioritize flexibility (e.g., support for variable EW) over raw, single-purpose performance? Explicating these engineering trade-offs would provide deeper insights for future architects in this and other domains.
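For reference, the three-way dependency raised in the second weakness above follows from the standard Gotoh-style affine-gap recurrences (textbook material, not taken from the paper), with substitution score s(a_i, b_j), gap-open penalty g_o, and gap-extend penalty g_e:

```latex
\begin{aligned}
M_{i,j} &= \max\bigl(M_{i-1,j-1},\; I_{i-1,j-1},\; D_{i-1,j-1}\bigr) + s(a_i, b_j) \\
I_{i,j} &= \max\bigl(M_{i,j-1} - g_o,\; I_{i,j-1} - g_e\bigr) \\
D_{i,j} &= \max\bigl(M_{i-1,j} - g_o,\; D_{i-1,j} - g_e\bigr)
\end{aligned}
```

Each cell now depends on three coupled matrices (M, I, D) rather than one, which is the dependency an extended SMX-PE datapath would have to track.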
Questions to Address In Rebuttal
-
Programming Model: Could the authors elaborate on the software interface for SMX? What level of abstraction is envisioned for a bioinformatician to program this system? Is there a compiler or library support to manage the coordination between the CPU, SMX-1D instructions, and SMX-2D offloads?
-
Support for Affine Gap Penalties: The support for various models is a key strength. Can the authors comment on the feasibility of extending SMX to support affine gap penalties? Would this require a fundamental change to the SMX-PE design, or could it be handled with the existing hardware, perhaps at a lower performance?
-
Applicability to Other Domains: The architectural pattern of separating regular bulk computation from irregular control flow seems broadly applicable. Beyond sequence alignment, do the authors see the SMX-1D/SMX-2D design pattern being useful for other bioinformatics or computational science problems that exhibit a similar dual nature (e.g., Hidden Markov Models, graph analytics algorithms)?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present SMX, a heterogeneous architecture for accelerating sequence alignment. The core proposal is a co-designed system comprising two main components: (1) SMX-1D, an ISA extension designed to accelerate irregular, sequential tasks such as traceback and pre/post-processing, and (2) SMX-2D, a dedicated coprocessor structured as a systolic array for accelerating the highly parallel, regular computation of DP-matrix blocks. The authors claim this heterogeneous approach provides a novel balance between the flexibility of general-purpose cores and the high efficiency of domain-specific accelerators, making it suitable for a "universal" set of sequence alignment tasks.
The central novelty claim is not in the individual components, but in their synergistic integration and the specific division of labor they enable. While prior art contains both ISA extensions and standalone accelerators for this problem, this paper appears to be the first to propose their tight co-design, where the ISA-extended core handles the control-intensive and irregular parts of the algorithm (like traceback re-computation) that typically hamstring standalone DSAs.
Strengths
The primary strength of this work lies in its core architectural concept: the synergistic co-design of an ISA extension and a dedicated coprocessor to tackle different facets of the same core problem. This division of labor, where the SMX-2D handles the bulk, regular computation and the SMX-1D provides the host CPU with the necessary tools to efficiently manage the irregular components (Figure 8, page 9), is a novel and compelling approach to the classic flexibility-vs-efficiency trade-off in domain-specific acceleration.
While the underlying ideas are not entirely new, the paper introduces several well-motivated incremental novelties:
- The use of a runtime-configurable, narrow-width differential encoding (Section 4.1, page 5) is a practical and well-executed enhancement over prior work that used fixed 8-bit encoding (e.g., Minimap2 [63], Suzuki and Kasahara [99]). Shifting the values to be non-negative simplifies the hardware, which is a clever microarchitectural optimization (Figure 5, page 6).
- The tight coupling that allows the core (using SMX-1D) to selectively recompute inner DP-elements for a traceback path through a tile previously computed by the coprocessor (SMX-2D) is an elegant solution. It avoids the massive area/storage cost for traceback logic seen in other DSAs (e.g., GACT and GenASM, as noted in Section 3) by effectively leveraging the now-accelerated host core.
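As an aside on why differential encoding admits narrow bit-widths: the sketch below uses a generic linear-gap Needleman-Wunsch with made-up scores (it is not the paper's Equations 5-6 or its exact encoding) to show that cell-to-cell differences are bounded by the scoring parameters alone, independent of sequence length:

```python
# Generic linear-gap Needleman-Wunsch; horizontal cell-to-cell differences are
# bounded by the scoring parameters, so a few bits per cell (plus a constant
# shift to make them non-negative) suffice to encode a row.
MATCH, MISMATCH, GAP = 2, -3, -4   # illustrative scores

def nw_matrix(a: str, b: str):
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        H[i][0] = H[i - 1][0] + GAP
    for j in range(1, len(b) + 1):
        H[0][j] = H[0][j - 1] + GAP
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = MATCH if a[i - 1] == b[j - 1] else MISMATCH
            H[i][j] = max(H[i - 1][j - 1] + s, H[i - 1][j] + GAP, H[i][j - 1] + GAP)
    return H

H = nw_matrix("GATTACA", "GCATGCT")
diffs = [H[i][j] - H[i][j - 1] for i in range(len(H)) for j in range(1, len(H[0]))]
print(min(diffs), max(diffs))  # stays within [GAP, MATCH - GAP], here [-4, 6]
```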
Weaknesses
My main concern is that while the integrated system is novel, the constituent components are largely evolutionary, not revolutionary. The paper should be more precise in delineating its contributions from the vast body of existing work.
-
Component-Level Novelty is Incremental: The paper rightly positions itself against prior work, but the novelty of the components themselves is limited. The SMX-2D coprocessor is fundamentally a systolic array for DP-matrix computation, a well-established pattern for this problem dating back decades (e.g., Yu et al. [109]). Its main distinction is the implementation of the authors' specific encoding scheme. Similarly, the SMX-1D ISA extension builds upon the conceptual foundations of prior art in both SIMD-style DP computation (e.g., Farrar [34]) and specialized ISA extensions for DP (e.g., GMX [30], NVIDIA DPX [85]). The contribution is in the refinement and combination, not in a new fundamental mechanism.
-
"Universal" Claim Overstated: The title and abstract claim "Universal Sequence Alignment Acceleration." However, the presented recurrence relations (Eq. 2, page 3) and subsequent hardware design appear to only support a linear gap penalty model (a constant penalty for each indel). Most state-of-the-art, high-sensitivity alignment tools for both genomics (e.g., WFA) and proteomics (e.g., BLAST) rely on more complex scoring like affine gap penalties (separate penalties for opening and extending a gap). It is not at all obvious that the proposed differential encoding scheme and the simple SMX-PE microarchitecture can be extended to support affine gaps without significant architectural changes that would likely invalidate the current design's simplicity and efficiency. This is a critical omission that significantly curtails the claim of "universality."
-
Lack of Comparison to Functionally Similar Hybrid Systems: While the paper compares against pure ISA extensions and pure DSAs, it does not discuss prior art in hybrid CPU+FPGA systems for bioinformatics. Such systems often implement a similar division of labor, with the CPU handling control flow and the FPGA fabric accelerating the core DP kernel. While SMX is a more tightly integrated ASIC-based solution, the conceptual parallel is strong and should be acknowledged and discussed.
Questions to Address In Rebuttal
The authors should address the following points to strengthen their novelty claim and clarify the scope of their contribution:
-
Novelty of the SMX-PE: Please position the novelty of the SMX Processing Element (SMX-PE) microarchitecture itself more clearly. Beyond implementing the proposed non-negative differential recurrence relations, what is fundamentally new in its design compared to prior systolic cells for Smith-Waterman or Needleman-Wunsch?
-
Support for Affine Gap Penalties: This is the most critical point. Can the SMX architecture, and specifically the differential encoding scheme and SMX-PE datapath, support affine gap penalties? If so, please provide the modified recurrence relations and a sketch of the required hardware changes and their area/latency impact. If not, the claim of "universality" must be significantly tempered, and the paper should explicitly state this limitation and justify why the linear gap model is sufficient for the target applications.
-
Generalizability of the Architectural Pattern: The core architectural idea—an ISA extension for irregular "glue" logic coupled with a coprocessor for bulk computation—seems broadly applicable to other algorithms with a similar structure (e.g., other DP problems like Viterbi decoding or Hidden Markov Models). Could the authors comment on the potential of this SMX pattern beyond sequence alignment? Acknowledging this could strengthen the case that the core contribution is a novel architectural pattern, not just a point solution.
MINDFUL: Safe, Implantable, Large-Scale Brain-Computer Interfaces from a System-Level Design Perspective
Abstract
Brain-computer interface (BCI) technology is among the fastest-growing fields in research and development. On the application side, BCIs provide a deeper understanding of brain function, inspire the creation of complex computational models, and hold ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present an analytical framework, MINDFUL, intended to model the system-level design trade-offs for future large-scale, implantable Brain-Computer Interface (BCI) Systems-on-Chip (SoCs). The paper develops first-order equations to estimate power consumption, area, and data throughput as a function of the number of neural interface channels. Using this framework, the authors analyze two primary design paradigms: communication-centric and computation-centric. Their analysis concludes that naively scaling current communication-centric designs is infeasible due to power safety limits. Furthermore, they claim that integrating modern, computationally-intensive Deep Neural Networks (DNNs) on-implant is also unviable without significant, multi-faceted optimizations that drastically reduce the computational workload.
Strengths
- Problem Formulation: The paper correctly identifies a critical and forward-looking challenge in the BCI field: the impending collision between the desire for higher channel counts and sophisticated on-chip processing, and the strict, non-negotiable safety constraints of implantable devices.
- System-Level Scope: The work attempts to create a unified view across multiple layers of the system stack, from the neural interface (NI) and wireless communication to on-chip computation. This holistic perspective is valuable for the computer architecture community.
- Pragmatic Lower-Bound Analysis: In its analysis of on-implant computation (Section 5.3), the framework’s focus on Multiply-Accumulate (MAC) operations as the primary driver of power consumption provides a reasonable, if simplistic, lower bound for evaluating DNN feasibility.
Weaknesses
The paper’s conclusions are entirely dependent on its analytical framework, which is built upon a series of oversimplified assumptions and questionable methodological choices. These foundational weaknesses call the validity of the paper's primary claims into serious question.
-
Critically Oversimplified Power and Thermal Model: The entire analysis is predicated on a maximum power density limit of 40 mW/cm² (Section 3.2, page 4), assuming uniform power consumption and heat dissipation across the chip surface. This is a fundamentally flawed premise for a modern SoC. Any non-trivial computation, especially the DNN acceleration discussed, will create significant thermal hotspots. The authors’ justification—that silicon’s thermal conductivity spreads heat rapidly—is insufficient to dismiss the well-established problem of localized heating. A single hotspot exceeding the thermal limit renders the device unsafe, a possibility this model completely ignores. This invalidates the concept of a single, uniform
P_budget.
-
Arbitrary and Inconsistent Scaling Methodology: The method for scaling existing SoC designs to a 1024-channel baseline is inconsistent and lacks rigor (Section 4.1, page 5). For instance, the WIMAGINE device [80] is subjected to a "50x reduction in both power and area" to model a "more evolved design." This is not an extrapolation based on a predictive model; it is a post-hoc normalization based on conjecture. Similarly, HALO [110] is "scaled down" to fit within the power budget. This manipulation of baseline data points means the framework is not predicting future scaling behavior but is instead being calibrated with manually altered data, which undermines its scientific validity.
-
Unjustified DNN Complexity Scaling: The framework assumes that DNN model complexity scales linearly with the number of input channels, governed by the parameter
a (Section 5.3, page 9). This is a simplistic heuristic that lacks empirical or theoretical justification. The relationship between input dimensionality and optimal network size is highly dependent on the specific task, data statistics, and network architecture. By enforcing a rigid linear scaling, the authors pre-ordain their pessimistic conclusion that DNN integration is infeasible. A more robust model would consider sub-linear scaling possibilities or alternative architectures that are less sensitive to input size.
-
Lack of Sensitivity Analysis: The paper presents its conclusions as definitive, yet they are derived from a set of fixed, point-estimate parameters (e.g.,
P_MAC from a single 45nm synthesis, a 15% QAM efficiency baseline, the 40 mW/cm² limit). For a paper whose primary contribution is a model, the absence of any sensitivity analysis is a major omission. How would the conclusions change if the power density limit were 30 mW/cm² or 50 mW/cm²? What is the impact of a 2x variation in P_MAC due to different technology nodes or circuit design? Without this analysis, it is impossible to gauge the robustness of the findings.
-
Oversimplified Optimization Analysis: The analysis of combined optimizations in Section 6 (page 11) applies each strategy sequentially. This approach fails to capture the complex, non-additive interactions between them. For example, applying "Channel Density" reduces
A_SoC, which in turn lowers the overall P_budget, creating a negative feedback loop that makes subsequent optimizations harder to fit. The paper acknowledges this effect but the presentation of results in Figure 12 as sequential, independent improvements is misleading.
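A minimal sketch of the feedback effect just described, assuming the paper's uniform-power-density model P_budget = ρ_max · A_SoC with the stated 40 mW/cm² limit (the area values below are hypothetical):

```python
# Under a uniform power-density limit, the budget shrinks with the chip:
# an area "optimization" simultaneously tightens the power constraint.
RHO_MAX_MW_PER_CM2 = 40.0  # stated safety limit

def power_budget_mw(area_cm2: float) -> float:
    """Power budget implied by a uniform power-density limit."""
    return RHO_MAX_MW_PER_CM2 * area_cm2

for area in (1.0, 0.5, 0.25):  # hypothetical SoC areas after density optimizations
    print(f"A_SoC = {area:.2f} cm^2 -> P_budget = {power_budget_mw(area):.1f} mW")
```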
Questions to Address In Rebuttal
- Please provide a detailed justification for the uniform heat dissipation model. How would the presence of thermal hotspots, which are inevitable in any accelerator-rich SoC, alter the calculation of the
P_budget and the overall feasibility conclusions of the paper?
- The scaling of existing SoCs to the 1024-channel baseline appears arbitrary. What is the specific, evidence-based justification for the 50x power and area reduction applied to the WIMAGINE [80] design? Without this, the baseline data points used for all subsequent analysis lack credibility.
- The linear scaling of DNN complexity with input channel count (Section 5.3) is a critical assumption that directly drives the paper's conclusions. What evidence supports this specific scaling law over other plausible, potentially sub-linear relationships?
- Given that the paper's contribution is a modeling framework, why was a sensitivity analysis of key parameters (e.g., power density limit,
P_MAC, QAM efficiency) not performed? How can the reader trust the robustness of the conclusions without understanding their sensitivity to these foundational assumptions?
- The analysis in Section 5.3 focuses exclusively on conventional DNNs (MLPs, CNNs). While the related work briefly mentions Spiking Neural Networks (SNNs) [54], they are not included in the analytical framework. Given that SNNs are often proposed for their potential power efficiency in edge applications, their omission from a forward-looking BCI architecture study is a significant gap. Please justify this exclusion.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper, "MINDFUL," presents a much-needed, system-level analytical framework for evaluating the design space of future large-scale, implantable Brain-Computer Interfaces (BCIs). The authors distill the complex BCI System-on-Chip (SoC) into three core components: data acquisition (neural interface), on-chip computation, and wireless communication. The central thesis is that the scaling of these systems is fundamentally constrained by a non-negotiable safety limit on power density to prevent thermal damage to brain tissue.
Using this framework, the authors project the feasibility of scaling existing BCI SoC designs. Their analysis quantitatively demonstrates a critical tension: the desire to incorporate more channels and more powerful deep neural network (DNN) computations is on a collision course with the hard safety-related power budget. The paper systematically explores different design strategies—from communication-centric approaches with advanced modulation to computation-centric approaches with on-implant DNNs and hybrid partitioned models—concluding that a naive "more of everything" approach is infeasible. The work serves as both a design guide and a call for a more holistic, co-designed approach to creating the next generation of safe and scalable BCIs.
Strengths
-
Holistic, System-Level Perspective: The most significant contribution of this work is its panoramic view. Instead of focusing on a single component in isolation (e.g., a better amplifier, a more efficient MAC unit), the authors create a framework that binds the entire system together. The clear dichotomy between "communication-centric" and "computation-centric" dataflows (Fig. 3, page 4) provides an excellent conceptual lens through which to analyze the entire design space. This perspective is invaluable for a field that requires deep collaboration between disparate disciplines like neuroengineering, computer architecture, and wireless communications.
-
Bridging a Critical Knowledge Gap: This paper serves as an essential bridge for the computer architecture community. It masterfully translates a fundamental biological constraint (thermal safety) into concrete architectural metrics (power budget, power density). For architects accustomed to designing within the thermal envelopes of consumer electronics (TDP), the stringent and absolute limits of implantable devices represent a paradigm shift. This paper provides the foundational "rules of the game" for any architect wishing to contribute to this impactful field.
-
Grounded, Quantitative Projections: While the framework is based on first-order approximations, its strength lies in grounding the discussion in concrete numbers. By starting with real, published SoC designs (Table 1, page 5) and systematically extrapolating their performance, the authors move the conversation from qualitative hand-waving to quantitative trade-off analysis. The plots showing power consumption relative to the power budget (e.g., Fig. 5, page 7 and Fig. 10, page 10) are particularly effective at illustrating the looming "power wall."
-
Identification of a Key Conflict: The paper's conclusion that modern, cutting-edge DNN models are largely incompatible with safe, scalable on-implant integration is a stark and important finding. This highlights a significant disconnect between the BCI software/algorithm community, which is rapidly developing larger and more complex models, and the hardware community, which must operate under unforgiving physical constraints. This work provides the quantitative evidence needed to encourage more hardware-aware algorithm design and co-design efforts. The analysis in Section 5.3 and Section 6 makes this point compellingly.
Weaknesses
While the paper's vision and approach are its greatest strengths, they also lead to some inherent limitations, which should be viewed not as fatal flaws but as avenues for future work.
-
Reliance on First-Order Approximations: The analytical model necessarily simplifies complex realities. For example, power for the sensing front-end is assumed to scale linearly with channel count (Equation 5, page 6), which may neglect overheads or non-linear effects in routing and clock distribution at very large scales. Similarly, the thermal model assumes uniform heat dissipation, whereas a large DNN accelerator could create significant on-chip hotspots that violate local temperature limits even if the average power density is safe. The authors acknowledge this, but the implications of these simplifications could be more deeply explored.
-
Abstracted Computation Costs: The analysis of DNN power consumption is heavily centered on the cost of MAC operations (Section 5.3, pages 8-9). While the authors compellingly argue this is the dominant factor (Fig. 9), this abstracts away the potentially significant costs of on-chip data movement, memory access, and control logic, which are known bottlenecks in conventional DNN accelerators. In a power-starved environment like an implant, these "secondary" costs could become primary obstacles.
-
Limited Exploration of Alternative Architectures: The paper focuses on traditional synchronous digital logic for implementing DNNs. The BCI field has a strong intellectual connection to neuromorphic computing and Spiking Neural Networks (SNNs), which are often proposed specifically for their potential for extreme power efficiency. While mentioned briefly in the related work section, an analysis of how SNNs might alter the fundamental trade-offs within the MINDFUL framework would have been a powerful addition, further strengthening the paper's contextual analysis.
Questions to Address In Rebuttal
-
On Model Sensitivity: The conclusions hinge on a set power density limit (40 mW/cm²) and a set of scaling assumptions. How sensitive are the primary conclusions (e.g., the infeasibility of integrating full DNNs beyond ~1800 channels) to these parameters? For instance, if future biocompatibility research were to allow for a slightly higher power density, or if a breakthrough in analog front-end design dramatically lowered the per-channel sensing power, how would the crossover point between communication-centric and computation-centric designs shift?
-
On the Role of Alternative Computational Models: The focus on DNNs is well-justified given current trends. However, could the authors comment on how their framework could be adapted to evaluate SNNs? Given that SNNs trade computational structure for event-driven, sparse activity, how might this affect the analysis, particularly the balance between sensing, computation, and communication power?
-
On the Full System and Closed-Loop Operation: The analysis focuses on the data-out pathway (implant to wearable). Many advanced BCI applications, particularly therapeutic ones, will require a closed-loop system with a data-in pathway for stimulation or model updates. Could the authors speculate on how the power and bandwidth constraints of this return channel would fit into their framework and potentially impact the overall system design?
-
On Application Diversity: The DNN analysis uses models for speech synthesis (page 10). How might the architectural requirements and power trade-offs differ for other flagship BCI applications, such as the continuous, low-latency decoding required for motor prosthesis control? Would this favor different points in the design space explored in Section 6?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present MINDFUL, an analytical framework intended to model the system-level design trade-offs for future large-scale, implantable Brain-Computer Interfaces (BCIs). The core of the work is a set of first-order equations that project the power consumption, area, and throughput of an implanted System-on-Chip (SoC) by separately modeling its key subsystems: neural interface (sensing), wireless communication, and on-chip computation. Using this framework, the authors analyze the feasibility of different design strategies ("communication-centric" vs. "computation-centric") as the number of recording channels scales beyond the current state-of-the-art. The principal conclusion drawn from the framework is that a significant and growing gap exists between the computational demands of modern deep learning-based BCI applications and the stringent power and area constraints imposed by safe, in-vivo operation.
Strengths
The primary strength of this work—and indeed, its only claim to novelty—is the synthesis of disparate design considerations into a single, cohesive analytical model. While prior works have analyzed individual components of a BCI system in isolation (e.g., the power efficiency of a neural amplifier, the energy-per-bit of a wireless link, or the operations-per-watt of a DNN accelerator), this paper attempts to unify these into a single parametric framework. This holistic, system-level perspective is valuable for illustrating high-level trends and bottlenecks. The paper is well-structured and clearly articulates the assumptions underpinning its model.
Weaknesses
My evaluation is based on a single criterion: is the core idea genuinely new? In this case, the contribution is a methodological framework, not a new device or algorithm. While the specific application of this type of framework to projecting future BCI designs is a contribution, the framework itself is constructed from well-established, non-novel components and concepts.
-
Lack of Fundamental Novelty in the Methodology: The core idea of creating a system-level model based on first-order approximations to explore a design space is a standard engineering practice, not a novel research contribution. The equations presented in Section 4 are straightforward extrapolations based on linear or square-root scaling laws and basic communication theory principles (e.g., Equation 9, which links communication power to throughput and energy-per-bit). These are foundational concepts, not new theoretical insights.
-
Significant Overlap with Prior System-Level Analyses: The central theme of evaluating the trade-offs between on-implant computation and communication is not new. Even-Chen et al. ("Power-Saving Design Opportunities for Wireless Intracortical Brain-Computer Interfaces," Nature Biomedical Engineering, 2020) [35] presented a comprehensive analysis of this exact problem. They explored the power trade-offs between data compression, feature extraction, and wireless transmission for intracortical BCIs. While the MINDFUL paper formalizes this analysis with a specific set of equations and applies it to a broader set of published SoCs, the conceptual groundwork and the identification of the core problem are largely pre-existing. The "delta" between this work and Even-Chen et al. is one of formalization and scope, not fundamental concept.
-
Component Models are Not Novel: The analysis of individual subsystems relies on existing knowledge.
- DNN Power Modeling: The methodology of estimating DNN power by focusing on the number of multiply-accumulate (MAC) operations (Section 5.3) is a common first-order approximation used in countless hardware accelerator papers. It is a known lower bound that omits significant power contributors like memory access and data movement, a point the authors do not sufficiently address.
- Communication Modeling: The analysis of On-Off Keying (OOK) and Quadrature Amplitude Modulation (QAM) in the context of their power/throughput trade-offs is textbook material from wireless communication theory (e.g., Goldsmith, "Wireless Communications," 2005) [44]. Its application to biomedical implants has also been extensively studied.
The novelty of this paper is therefore confined to the act of "gluing together" these pre-existing models and applying them to a forward-looking BCI scaling problem. While the resulting insights are useful for the community, the intellectual contribution from a novelty standpoint is limited. It is a work of synthesis and projection, not invention.
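For concreteness, the two first-order component models referred to in the sub-points above can be written generically as follows (the symbol names are ours, not necessarily the paper's: N_MAC MACs per inference, E_MAC energy per MAC, f_inf inference rate, E_bit energy per transmitted bit, R data rate):

```latex
P_{\text{compute}} \;\approx\; N_{\text{MAC}} \cdot E_{\text{MAC}} \cdot f_{\text{inf}},
\qquad
P_{\text{comm}} \;\approx\; E_{\text{bit}} \cdot R .
```

Both are standard first-order approximations, which is precisely the point: neither component model originates with this paper.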
Questions to Address In Rebuttal
-
The core premise of your work bears a strong conceptual resemblance to the system-level analysis presented by Even-Chen et al. [35]. Please articulate precisely what novel conceptual advance your MINDFUL framework provides over the analysis and conclusions in that prior work. Simply stating that your model is equation-based is insufficient; you must demonstrate how this formalization leads to fundamentally new insights that were not, or could not be, derived from the prior analysis.
-
Your DNN power model (Section 5.3) is a first-order lower bound based exclusively on the power of MAC units. In many modern DNN accelerators, power consumption from memory access (e.g., weight fetching from ROMs) and interconnect can be as, or more, significant than the arithmetic units themselves. How would the inclusion of these second-order, but potentially dominant, power effects change your conclusions regarding the feasibility of on-implant DNNs (Fig. 10)? Is it not possible that your already pessimistic conclusions are, in fact, overly optimistic?
-
The framework's value is in its predictive power. However, it relies on simple scaling laws (e.g., linear scaling of sensing power) that may not hold for large-scale systems where second-order effects like clock distribution, routing congestion, and thermal hotspots become dominant. Please defend the validity of using such a simplified model to make predictions about systems that are an order of magnitude larger than those from which the model's parameters are derived. What is the confidence interval on your projections?
DS-TIDE: Harnessing Dynamical Systems for Efficient Time-Independent Differential Equation Solving
Abstract
Time-Independent Differential Equations (TIDEs) are central to modeling equilibrium behavior across a wide range of scientific and engineering domains. Conventional numerical solvers offer reliable solutions but incur significant computational costs due ...
Reviews
Review 1
Review Form:
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose DS-TIDE, a conceptual hardware solver for Time-Independent Differential Equations (TIDEs) based on the principles of dynamical systems. The core contributions are twofold: 1) A "Heterogeneous Dynamics with Temporal Layering" (HDTL) architecture, which decomposes the problem into conditioning, solving, and decoding stages to enhance expressivity. 2) A "Self-adaptive Dynamics" mechanism for on-device parameter alignment, intended to overcome the limitations of offline training and hardware mismatch. The paper claims, based on software simulation, significant efficiency improvements (~10^3x) and accuracy that is competitive with or superior to state-of-the-art numerical and machine learning solvers.
Strengths
- The fundamental concept of leveraging the natural equilibrium-seeking behavior of a physical dynamical system to solve differential equations is elegant.
- The proposed on-device alignment mechanism directly addresses a critical and well-known weakness of previous analog and DS-based computing approaches: the performance degradation due to the mismatch between offline-trained models and physical hardware realities.
- The HDTL framework is a logical extension of simpler DS models, providing a structured approach to increase the system's representational capacity for more complex equations.
Weaknesses
My primary concerns with this manuscript center on the validity of its core claims, which are predicated on a fundamentally flawed evaluation methodology and contain significant, unsubstantiated theoretical leaps.
-
Over-reliance on Idealized Simulation: The paper's entire set of empirical results is derived from a software simulator that, by the authors' own admission (Section 4.1, page 8), assumes "ideal circuit components and interconnects with perfect links." This is a fatal flaw for a paper proposing a hardware architecture. Real-world analog circuits are dominated by non-idealities such as device mismatch, process variations, thermal noise, parasitic capacitance, and non-linear component behavior. By ignoring these first-order effects, the reported accuracy results in Table 1 and the robustness analysis in Figure 7 are rendered meaningless as predictors of physical hardware performance. A claim of sub-1% error is untenable without accounting for the very physical phenomena that this hardware would face.
-
Unjustified Approximations in the Alignment Algorithm: The proposed on-device alignment mechanism is the paper's cornerstone, yet its derivation is not rigorous. In Section 3.3.1, the authors introduce the Adjoint Sensitivity Method (Eq. 13-14) as the mathematically sound approach, only to immediately discard it as "infeasible." They then propose an "adjoint-free" version (Eq. 15-16) "inspired by Feedback Alignment methods." This is a critical leap of faith. The manuscript provides no theoretical justification, proof of convergence, or even an ablation study to demonstrate that this hardware-friendly approximation is a valid or stable substitute for the rigorous method across the problem space. The claim that it provides "correct directional guidance" is an assertion, not a proven fact.
-
Absence of Hardware Cost Analysis: The paper claims exceptional efficiency but provides no analysis of the cost required to achieve it. The architecture for on-device alignment, depicted in Figure 5, appears highly complex, involving additional multipliers, control registers, and intricate signal routing for each weight parameter. What is the area and power overhead of this alignment circuitry relative to the core solving circuitry? Without this analysis, the performance claims in Table 2 (Solving Latency) are incomplete. A 1µs solve time is not impressive if the required chip area or static power consumption is orders of magnitude larger than a comparable digital solution.
-
Misleading Performance Comparisons: The latency comparison in Table 2 pits a conceptual, specialized ASIC against general-purpose GPUs executing ML models. This is not a fair comparison. The proper baseline would be a custom digital hardware accelerator implementing a numerical solver, or at the very least, a detailed analysis normalizing performance by silicon area and power (e.g., solutions/second/mm²/Watt). As presented, the speedup numbers are impressive but potentially misleading.
Questions to Address In Rebuttal
The authors must address the following points to substantiate their claims:
- Regarding the Simulation: How can the accuracy and robustness claims be considered reliable for a hardware architecture without modeling the dominant non-ideal effects, specifically device mismatch and thermal noise? Please justify the use of an ideal simulator or provide new results from a simulation that incorporates these effects.
- Regarding the Alignment Algorithm: Please provide either a theoretical proof or extensive empirical evidence (e.g., comparison against the true adjoint method in simulation) that the proposed "adjoint-free" alignment method (Eq. 15-16) reliably converges to high-quality solutions and does not get stuck in poor local minima for the class of problems studied.
- Regarding Hardware Costs: What is the estimated area and power overhead of the on-device alignment circuitry (Figure 5) compared to the core data path? How does the total area/power budget scale with the problem size (N and M)?
- Regarding Noise Analysis: In Section 4.5, the noise voltage is given in absolute terms (µV). To make this meaningful, what is the nominal signal voltage range of the node states (h, y) in the system? Is the injected noise level of 480 µV a small or large perturbation relative to the signal?
- Regarding Performance Baselines: To provide a fair assessment of efficiency, how would DS-TIDE's projected performance, when normalized by area and power, compare to a dedicated digital accelerator implementing an iterative numerical method (e.g., SOR) for the same TIDE?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces DS-TIDE, a novel hardware solver for Time-Independent Differential Equations (TIDEs) that leverages the physics of dynamical systems (DS). The central thesis is that the natural evolution of a physical DS towards a low-energy equilibrium state is computationally analogous to the process of solving a TIDE. The authors propose two key innovations to make this concept practical: 1) Heterogeneous Dynamics with Temporal Layering (HDTL), a three-stage (conditioning, solving, decoding) process that enhances the system's expressive power to model complex equations, and 2) a Self-Adaptive On-Device Alignment mechanism that allows the hardware to rapidly adapt its intrinsic dynamics to solve a wide variety of TIDEs without offline training or reconfiguration. The work presents simulation results demonstrating competitive accuracy with state-of-the-art numerical and machine learning solvers, while reporting a revolutionary efficiency improvement of approximately 1000x.
Strengths
-
Ambitious and Foundational Contribution: The paper's core contribution is not an incremental improvement but a conceptual leap. It seeks to revive and modernize the principles of analog computing for the crucial domain of scientific computation. By mapping the problem of TIDE solving onto the natural physics of a CMOS-compatible device, the work moves beyond simply accelerating existing iterative algorithms and instead proposes a new computational paradigm.
-
Elegant Problem-Solver Alignment: The fundamental insight—that a DS naturally "solves" for its equilibrium state in the same way a TIDE describes an equilibrium physical system—is exceptionally powerful. The discussion of "natural alignment" in the introduction (Section 1, page 1) and Figure 1 (page 2) effectively communicates this core strength. This approach bypasses the costly discretization and iterative steps that dominate both classical numerical solvers and the training phase of ML models.
-
Thoughtful Solutions to Key Architectural Challenges: The work does an excellent job of identifying and addressing the two primary obstacles that have historically limited such approaches:
- Expressivity: Simple dynamical systems can't capture the complexity of many real-world TIDEs. The proposed HDTL framework (Section 3.2, page 5) is a clever, structured approach to this, analogous to creating depth and feature hierarchy in a neural network, but implemented in the temporal domain.
- Versatility: A fixed piece of hardware is typically not adaptable. The on-device alignment mechanism (Section 3.3, page 6) is the paper's most significant technical innovation. By developing a hardware-friendly, adjoint-free update rule, it effectively enables "on-the-fly training," making the specialized hardware remarkably general-purpose within its domain.
-
Compelling Empirical Validation: The experimental results are striking. The authors compare DS-TIDE against strong baselines, including the Fourier Neural Operator (FNO), on a canonical set of benchmark TIDEs. The consistent achievement of competitive accuracy (Table 1, page 9) coupled with orders-of-magnitude reduction in latency (Table 2, page 10) and alignment time (Figure 6, page 10) makes a very strong case for the potential of this approach.
-
Excellent Contextualization: The paper is well-situated within the broader landscape. It correctly identifies its intellectual lineage from classical analog computers (e.g., Bush's differential analyzer), its relationship to modern DS-based processors like Ising machines, and its positioning as an alternative to both traditional HPC and ML-based surrogate modeling. The related work section is thorough and demonstrates a mature understanding of the field.
Weaknesses
My critiques are less about flaws in the work presented and more about the practical and theoretical boundaries that need to be explored for this idea to realize its full potential.
-
The Simulation-to-Silicon Gap: The study relies on a FEA software simulator which, by the authors' own admission, assumes ideal circuit components. The true test of any analog computing paradigm lies in its resilience to the trifles of the physical world: device mismatch, thermal noise, process variations, and limited precision. While the robustness analysis in Section 4.5 (page 10) is a good first step, it cannot fully capture the correlated, systematic errors present in a physical chip. The phenomenal accuracy reported might be challenging to maintain in practice.
-
Uncertain Scalability to Large-Scale Problems: The experiments are conducted on 1D and 2D problems with up to ~2000 grid points. Many real-world scientific and engineering problems involve 3D domains with millions or billions of degrees of freedom. While the paper mentions scale-up and scale-out strategies from prior work (Section 4.4, page 10), the practical challenges of interconnect, power delivery, and maintaining global convergence across multiple chips for a tightly coupled analog system are immense and remain unaddressed.
-
Defining the Boundaries of Applicability: The paper successfully demonstrates DS-TIDE on a set of five important TIDEs. However, the theoretical framework for determining which classes of differential equations can be successfully mapped onto the HDTL structure is not fully developed. It is unclear how the system would handle equations with very high stiffness, sharp discontinuities, or highly complex, non-local boundary conditions. Understanding these limitations is crucial for positioning DS-TIDE as a practical tool for scientists and engineers.
Questions to Address In Rebuttal
-
Hardware Realism: Could the authors elaborate on the path from the current FEA simulation to a physical prototype? What do they anticipate as the most significant challenges in hardware implementation, particularly with respect to maintaining precision in the face of device mismatch and noise? For instance, how precise do the programmable resistors and multipliers for J, P, and Q need to be?
-
Scalability Bottlenecks: Beyond the existing O(MN) complexity analysis, what do the authors foresee as the primary physical bottleneck for scaling this architecture to problems with millions of grid points? Is it the number of nodes, the density of the coupling interconnect, I/O for programming and readout, or power consumption?
-
Generality and Limitations: Could the authors comment on the theoretical limitations of the HDTL framework? Are there known classes of TIDEs (e.g., those with chaotic behavior in their transient phase, even if the final state is stable) that would be fundamentally difficult to map onto the proposed system dynamics? How might complex boundary conditions (e.g., Neumann or Robin) be implemented in this framework?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents DS-TIDE, a CMOS-based hardware solver for Time-Independent Differential Equations (TIDEs). The authors' primary claims to novelty are twofold: 1) an architectural pattern called Heterogeneous Dynamics with Temporal Layering (HDTL), which decomposes the solving process into three distinct, temporally sequential stages (conditioning, solving, decoding) governed by different dynamics; and 2) a mechanism for on-device, self-adaptive alignment that allows the hardware's intrinsic dynamics to be rapidly configured to solve a diverse range of TIDEs without offline training or hardware redesign.
While the general concept of using physical systems to solve differential equations is not new, the specific architectural implementation and, most notably, the on-device alignment mechanism, represent a significant and novel contribution to the field of DS-based processors. The work successfully extends this class of accelerators from their prior domain of combinatorial optimization and graph learning into the more general and complex domain of DE solving.
Strengths
The primary strength of this paper lies in its novel approach to achieving versatility in a fixed analog hardware solver.
-
Novelty of On-Device Alignment: The most significant contribution is the on-device alignment mechanism (Section 3.3, page 5-7). Previous DS-based processors, such as BRIM [1] and DS-GL [47], relied on offline training to determine the system's coupling parameters (the
J and h matrices). This created a rigid system tuned for a specific task and suffered from the inevitable mismatch between software simulation and physical hardware. DS-TIDE's proposal to perform this alignment on the device itself using an adjoint-free, feedback-alignment-inspired method is a genuine innovation. This mechanism, which physically updates circuit parameters based on error signals, effectively makes the hardware programmable and adaptive in a way prior art in this specific lineage of processors is not. This is the key enabler for the claimed versatility.
-
Architectural Novelty of HDTL: The HDTL concept (Section 3.2, page 5) is a novel architectural pattern for DS-based computation. The idea of temporally staging a computation, where the equilibrium state of one stage becomes the initial condition for the next, all within a continuous-time evolution, is a clever way to increase the system's expressive power. This contrasts with prior analog DE solvers that typically map a single, homogeneous iterative update rule (e.g., a finite-difference stencil) onto hardware [48, 62]. The three-stage conditioning-solving-decoding pipeline allows for a more abstract, end-to-end learning of the solution operator and appears to be a new way of structuring computation for this class of machine.
-
Significant Delta from Prior Art: The combination of these two ideas creates a substantial "delta" from the closest prior work. Compared to DS-GL [47], this work tackles a new, more complex problem domain (general TIDEs vs. graph learning) and replaces the offline training bottleneck with a fast, on-device process. Compared to classical analog computers [8, 41], this work introduces a learning-based, adaptive element that was absent in those early, fixed-function machines.
Weaknesses
The paper's primary weakness is its framing, which at times overstates the novelty of the general concept while potentially under-selling the specifics of its true contributions.
-
Positioning Relative to Foundational Work: The paper positions itself as leveraging the "intrinsic connection between Dynamical Systems (DS) and Differential Equations (DE)" (Abstract, page 1) as if this were a new insight. This connection is the foundational principle of analog computing, dating back nearly a century to Vannevar Bush's Differential Analyzer [8] and Shannon's theoretical formulation [41]. The authors should more carefully frame their contribution not as discovering this connection, but as proposing a new, highly efficient, and adaptive CMOS-based implementation of it. The current framing risks appearing unaware of this deep history.
-
Over-reliance on Analogy: The analogy of the system to an "infinitely deep neural network temporally unrolled" (Abstract, page 1) is evocative but lacks rigor. A continuous-time dynamical system is not equivalent to an infinitely deep discrete-layer network. While it serves as a useful mental model for expressivity, this claim is unsubstantiated and should be tempered. The novelty lies in the physical system's dynamics, not in its tenuous equivalence to a conceptual NN model. The strength of the contribution does not require this analogy.
-
Insufficient Detail on the Alignment Approximation: The paper acknowledges its on-device alignment algorithm is an approximation inspired by the Adjoint Sensitivity Method [37] and Feedback Alignment [34]. The crucial step is the "adjoint-free" approximation where static error signals (
δh in Eq. 16) are used in place of evolving adjoint state variables (a(t) in Eq. 13). This is a major design choice that enables hardware feasibility, but its theoretical implications are not discussed. How does this approximation affect convergence guarantees or the quality of the final solution? A more thorough analysis of this trade-off would strengthen the paper's core contribution.
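To clarify the two ingredients being contrasted, a generic sketch follows; it uses standard adjoint-sensitivity and feedback-alignment formulations with conventional symbols and is not the paper's exact Eqs. 13-16:

```latex
% Adjoint sensitivity for a dynamical system dh/dt = f(h, \theta) with loss L:
% the adjoint state a(t) is integrated backward from a(T) = \partial L / \partial h(T),
\frac{da(t)}{dt} = -\left(\frac{\partial f}{\partial h}\right)^{\!\top} a(t),
% and the parameter gradient is accumulated along the trajectory as
\frac{dL}{d\theta} \;\propto\; \int a(t)^{\top}\,\frac{\partial f}{\partial \theta}\,dt .

% "Adjoint-free" surrogate in the spirit of feedback alignment: a static
% projection of the output error through a fixed matrix B replaces a(t):
\delta h = B\,(\hat{y} - y).
```

The open question is whether replacing the time-evolving adjoint with a static error projection still yields descent directions for the full class of TIDEs targeted.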
Questions to Address In Rebuttal
-
Please clarify your work's novelty with respect to the long history of analog differential equation solvers, particularly the foundational work by Bush [8]. What is the key conceptual departure of DS-TIDE from a classical analog computer, beyond the obvious implementation technology (CMOS vs. mechanical)?
-
Can the authors provide a more formal justification for the claim that the system is "analogous to an infinitely deep neural network"? Alternatively, would you be willing to rephrase this to more accurately describe a continuous-time dynamical system, to avoid making an unsubstantiated equivalence?
-
The novelty of the on-device alignment rests heavily on the "adjoint-free" approximation. What are the theoretical or empirical consequences of this simplification compared to the rigorous Adjoint Sensitivity Method [37]? Does this approximation limit the class of TIDEs that DS-TIDE can learn to solve accurately?
Towards Closing the Performance Gap for Cryptographic Kernels Between CPUs and Specialized Hardware
Abstract
Specialized hardware like application-specific integrated circuits (ASICs) remains the primary accelerator type for cryptographic kernels based on large integer arithmetic. Prior work has shown that commodity and server-class GPUs can achieve near-ASIC ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present an effort to improve the performance of cryptographic kernels (NTT, BLAS) on x86 CPUs to better compete with specialized hardware like ASICs. The work consists of two main parts: first, a hand-optimized AVX-512 implementation of 128-bit integer arithmetic for these kernels; and second, a proposal for a small ISA extension, MQX, containing three new instructions for multi-word arithmetic. The performance of this hypothetical MQX extension is then evaluated using a novel modeling technique called "Performance Projection using Proxy ISA" (PISA), and its potential is further extrapolated to a full multi-core system using a "speed-of-light" roofline model.
Strengths
The primary strength of this paper is the development and benchmarking of a high-performance AVX-512 implementation for 128-bit cryptographic kernels. The results presented in Section 5.3 and 5.4 for the AVX-512 baseline are based on real-world measurements on contemporary hardware. These results (e.g., a 38x speedup for NTTs over OpenFHE on a single core) are significant and provide a valuable, updated baseline for the community against which future ASIC, GPU, and CPU solutions should be compared. This part of the work is a solid engineering contribution.
Weaknesses
The paper's conclusions regarding the potential of CPUs to "close the gap" hinge entirely on the proposed MQX extension, and the evaluation of MQX is predicated on a chain of questionable assumptions and a flawed modeling methodology.
-
The PISA Methodology Is Fundamentally Unsound: The core of the MQX evaluation rests on the PISA model (Section 4.2), which projects the performance of hypothetical MQX instructions by mapping them to existing AVX-512 instructions. This mapping is not technically justified.
- _mm512_mul_epi64 -> _mm512_mullo_epi64: A widening 64x64 multiply produces a 128-bit result. The proxy, mullo, produces only the lower 64 bits. It is microarchitecturally naive to assume that an execution unit producing twice the amount of data would have identical latency, throughput, and execution port requirements as one producing half the data. This assumption is unsubstantiated.
- _mm512_adc_epi64 -> _mm512_mask_add_epi64: An add-with-carry operation requires a dedicated data path for propagating carry bits between lanes or from a flag register. A masked add operation uses the mask for predication (to write-enable lanes), not as an arithmetic input. These are fundamentally different operations. The authors' justification—that scalar ADD and ADC have similar performance—is irrelevant and does not scale to a 512-bit wide SIMD unit with 8 independent carry operations. A scalar sketch of this functional gap is given below.
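To make the functional gap concrete, a minimal per-lane sketch in plain Python (an illustration of the scalar semantics only; this is not the paper's code and not Intel intrinsics):

```python
# Per-64-bit-lane semantics: proposed MQX instructions vs. their AVX-512 proxies.
MASK64 = (1 << 64) - 1

def widening_mul_64x64(a: int, b: int) -> tuple[int, int]:
    """Proposed semantics: full 64x64 -> 128-bit product, returned as (hi, lo)."""
    p = a * b
    return (p >> 64) & MASK64, p & MASK64

def mullo_64x64(a: int, b: int) -> int:
    """Proxy semantics (multiply-low): only the low 64 bits of the product survive."""
    return (a * b) & MASK64

def add_with_carry(a: int, b: int, carry_in: int) -> tuple[int, int]:
    """Proposed semantics: the carry is an arithmetic input and output."""
    s = a + b + carry_in
    return s & MASK64, s >> 64  # (sum, carry_out)

def masked_add(a: int, b: int, write_enable: bool) -> int:
    """Proxy semantics (masked add): the mask only gates the write; no carry chain."""
    return (a + b) & MASK64 if write_enable else a

a, b = (1 << 63) + 5, (1 << 62) + 9
print(widening_mul_64x64(a, b))      # non-zero high half ...
print(mullo_64x64(a, b))             # ... silently dropped by the proxy
print(add_with_carry(MASK64, 1, 0))  # (0, 1): a carry-out is produced and must propagate
print(masked_add(MASK64, 1, True))   # 0: the carry is lost
```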
-
The "Validation" of PISA Is a Circular Argument: In Section 5.2, the authors attempt to validate PISA by using it to predict the performance of an existing AVX2 instruction (
_mm256_mul_epu32) with another existing instruction as a proxy. While the error is low, this only proves that PISA works when the target and proxy instructions are structurally and functionally very similar. It provides zero evidence that the model is accurate when mapping a hypothetical, functionally distinct instruction (like a widening multiply or a true vector ADC) to a proxy with different semantics and hardware requirements. -
The Speed-of-Light (SOL) Analysis Is Overly Optimistic and Misleading: The roofline analysis in Section 6 extrapolates single-core performance to a 192-core system using a model that assumes perfect, linear scaling (Equation 13). This is an unrealistic best-case scenario that ignores critical system-level bottlenecks that would dominate at that scale, such as memory bandwidth limitations, cache coherency overhead, and NUMA effects. The authors even observe their kernel becoming memory-bound on a single core at NTT size 2¹⁶ (Section 5.4), which contradicts the assumption that performance would continue to scale linearly with more cores. Presenting these highly idealized "MQX-SOL" numbers on the same plots (Figure 1, Figure 7) as measured hardware creates a misleading visual comparison.
-
Hardware Implementation Claims Are Unsubstantiated: The paper repeatedly claims that MQX requires "minimal proposed hardware modifications" (Abstract) and "minimize[s] the required engineering effort" (Section 4.1). These are strong claims made without any supporting evidence from hardware design, such as proposed circuit diagrams, area estimates, or timing analysis. The fact that similar instructions existed in a past, in-order architecture (Larrabee) does not mean they are trivial to integrate into a modern, complex out-of-order server core.
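For concreteness, the semantic gap behind the second proxy mapping can be modeled in a few lines of scalar code. This is a minimal sketch of the two per-lane behaviors; the function names are hypothetical stand-ins, not intrinsics from any header or code from the paper, and it assumes a compiler with __int128 support (GCC/Clang).

```cpp
#include <array>
#include <cstdint>

using Lanes = std::array<uint64_t, 8>;
using Mask  = uint8_t;  // one bit per 64-bit lane, as in an AVX-512 k-register

// Hypothetical MQX-style semantics: a per-lane add whose carry-in comes from a
// mask bit and whose carry-out is written back to a mask. The mask is an
// arithmetic input and output.
Lanes vec_adc_epi64(Lanes a, Lanes b, Mask carry_in, Mask& carry_out) {
    Lanes r{};
    carry_out = 0;
    for (int i = 0; i < 8; ++i) {
        unsigned __int128 s = (unsigned __int128)a[i] + b[i] + ((carry_in >> i) & 1u);
        r[i] = (uint64_t)s;
        carry_out |= (Mask)(((uint64_t)(s >> 64) & 1u) << i);  // carry escapes the lane
    }
    return r;
}

// AVX-512 masked-add semantics (the chosen proxy): the mask only selects which
// lanes are written; it never feeds the adder and no carry is produced.
Lanes vec_mask_add_epi64(Lanes src, Mask k, Lanes a, Lanes b) {
    Lanes r = src;
    for (int i = 0; i < 8; ++i)
        if ((k >> i) & 1u) r[i] = a[i] + b[i];  // wrap-around; carry discarded
    return r;
}
```

The first routine consumes and produces the mask as an arithmetic carry in every lane; the second is an ordinary lane-wise add with predicated writeback. Nothing in the proxy exercises the carry data path whose cost is actually in question.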
Questions to Address In Rebuttal
-
Please provide a detailed microarchitectural justification for the specific proxy instruction choices in PISA. How can a widening multiply that produces a 128-bit result be reasonably expected to have the same performance characteristics as a multiply-low that produces a 64-bit result?
-
Beyond the weak analogy to scalar instructions, what evidence supports the assumption that a vector add-with-carry (adc_epi64) instruction would map to the same execution ports and have the same performance as a masked add (mask_add_epi64)?
-
The validation of PISA in Section 5.2 uses existing instructions. How does this experiment validate the model's predictive power for the proposed MQX instructions, which are functionally different from their chosen proxies in ways that the validation experiment does not capture?
-
Please defend the use of a perfect linear scaling model for the SOL analysis in Section 6. Given that your own results show the NTT kernel can become memory-bound on a single core, why should we believe that performance would continue to scale linearly up to 192 cores without being throttled by shared resources like the memory controller?
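As a point of reference for the last question, the correction being asked for amounts to capping the linear term with a shared-bandwidth bound, in the spirit of a roofline model rather than the paper's Equation 13. The sketch below is generic, and every parameter value is a placeholder, not a measurement from the paper.

```cpp
#include <algorithm>
#include <cstdio>

// Generic roofline-style projection: per-core throughput scales with core
// count only until shared memory bandwidth saturates. All values are
// placeholders for illustration, not parameters or results from the paper.
double projected_gops(int cores,
                      double gops_per_core,    // measured single-core rate
                      double bytes_per_op,     // kernel memory traffic per op
                      double mem_bw_gbs) {     // shared DRAM bandwidth, GB/s
    double compute_bound = cores * gops_per_core;
    double memory_bound  = mem_bw_gbs / bytes_per_op;
    return std::min(compute_bound, memory_bound);
}

int main() {
    const int core_counts[] = {1, 48, 96, 192};
    for (int c : core_counts)
        std::printf("%3d cores -> %6.1f Gop/s (hypothetical inputs)\n",
                    c, projected_gops(c, 2.0, 4.0, 300.0));
    return 0;
}
```

Any projection of this shape saturates once cores times per-core traffic exceeds the shared bandwidth, which is exactly the regime the single-core 2¹⁶ NTT result already hints at.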
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper addresses the significant performance gap between general-purpose CPUs and specialized hardware (ASICs/GPUs) for cryptographic kernels, particularly the Number Theoretic Transform (NTT) and BLAS operations central to Fully Homomorphic Encryption (FHE). The authors present a multi-pronged approach to narrowing this gap. First, they develop highly-optimized implementations using existing SIMD instruction sets (AVX-512), demonstrating substantial speedups over state-of-the-art libraries. Finding this insufficient to close the gap entirely, they propose a minimal and targeted ISA extension called Multi-word Extension (MQX), consisting of just three new instructions for multi-word arithmetic (widening multiply, add-with-carry, subtract-with-borrow).
The core of the paper lies not just in this proposal, but in its pragmatic evaluation. The authors introduce a clever performance modeling methodology, "Performance Projection using Proxy ISA (PISA)," which estimates the performance of the proposed MQX instructions by mapping them to existing, structurally similar AVX-512 instructions on real hardware. This avoids the inaccuracies of cycle-level simulation for modern proprietary microarchitectures. Finally, using a speed-of-light projection, they argue that a top-tier server CPU equipped with MQX could achieve performance comparable to state-of-the-art ASICs, potentially making CPUs a viable platform for demanding cryptographic workloads and eliminating the data-transfer bottlenecks associated with accelerators.
Strengths
The primary strength of this work is its compelling and well-constructed argument for reconsidering the role of CPUs in the era of specialized cryptographic accelerators. It successfully connects a high-level system challenge to low-level microarchitectural solutions.
-
A Pragmatic and Feasible ISA Extension: The MQX proposal is compelling precisely because it is so minimal. Rather than designing a complex co-processor, the authors identify the three most critical missing primitives for large integer arithmetic. Crucially, as noted in Section 4.1 (page 6), they ground their proposal in historical precedent by drawing direct parallels to scalar x86 instructions (ADC/SBB) and prior vector ISAs like Intel's Larrabee New Instructions (LRBni). This is a masterstroke of contextualization, transforming a hypothetical proposal into a plausible and de-risked engineering path for hardware vendors.
-
Innovative and Credible Performance Modeling (PISA): The PISA methodology (Section 4.2, page 7) is a significant contribution in its own right. The authors rightly identify the challenge of evaluating new ISA extensions without access to proprietary design details. PISA offers an elegant middle-ground between high-level analytical models and full simulation. By grounding projections in the measured performance of existing proxy instructions and validating the methodology with known instructions (Section 5.2, page 9), the authors build a strong foundation of credibility for their subsequent MQX performance claims. This is a technique that could be adopted by other researchers in the architecture community.
-
Holistic, Multi-Layered Approach: The paper tells a complete and convincing story. It begins with state-of-the-art software optimization (the AVX-512 implementation), demonstrates its limits, proposes a targeted hardware fix (MQX), and then projects the system-level impact (the speed-of-light analysis in Section 6, page 11). This layered approach makes the final conclusion—that CPUs can approach ASIC performance—feel earned and well-supported, rather than speculative.
-
Significant Contextual Placement: This work fits beautifully within the broader landscape of computer architecture. It directly engages with the long-standing debate on general-purpose vs. specialized computing. By targeting FHE, a workload often considered the exclusive domain of accelerators, the paper makes a powerful case for the continued relevance and adaptability of the modern CPU. The potential to eliminate the host-accelerator data transfer bottleneck is a critical system-level advantage that is correctly highlighted.
Weaknesses
The paper's weaknesses are minor and largely related to the inherent limitations of its forward-looking proposal, rather than flaws in its execution or reasoning.
-
Optimism of the Speed-of-Light Model: The speed-of-light (SOL) analysis in Section 6 is a powerful tool for illustrating potential, but it is, by the authors' own admission, an optimistic upper bound. Achieving near-linear scaling across 192 cores for a memory-intensive kernel like NTT is a formidable software and system challenge, involving NUMA effects, cache coherence, and thread synchronization overheads that are not captured in the simple scaling model of Equation 13. While the authors wisely temper their claims in the "Towards realizing SOL performance" paragraph (page 12), the headline results in Figure 7 might be interpreted as more achievable than they are in practice.
-
Limited Exploration of Power and Area: The design philosophy of MQX is to minimize engineering effort, which implicitly addresses area and design cost. However, the paper does not include any quantitative estimates of the silicon area or power impact of adding the proposed MQX execution units (e.g., 8 parallel 64x64->128-bit multipliers). While a detailed analysis is beyond the scope of this work, some high-level discussion or citation of related work on the cost of such units would strengthen the argument for hardware feasibility.
Questions to Address In Rebuttal
-
The SOL projection in Section 6 assumes that the single-core implementation can be scaled across a multi-core chip. However, the performance of the single-core MQX implementation itself begins to degrade for larger NTT sizes (e.g., N=2^16 on Intel Xeon, Figure 5a), which you hypothesize is due to L2 cache capacity. On a large multi-core processor where L3 cache is a shared resource, could contention for shared caches and memory bandwidth become a dominant bottleneck well before linear scaling is achieved, thus making the SOL projections even more optimistic for larger problem sizes?
-
The PISA methodology relies on mapping a proposed instruction to a single "most structurally similar" existing instruction. In the case of _mm512_mul_epi64, the proxy is _mm512_mullo_epi64. Could the authors elaborate on why this is a reasonable proxy? A full 64x64->128 widening multiplier is more complex than a 64x64->64 multiply-low unit. Is the assumption that the critical path latency and port usage would be similar, or that a modern SIMD multiplier already computes the full result internally and simply discards the high part for mullo? Some justification here would further bolster confidence in PISA (a generic decomposition of the widening product is sketched after this list).
-
Your work compellingly argues for bringing multi-word arithmetic support to CPU vector units. Beyond FHE, what other application domains do you see as major beneficiaries of the MQX extension? Could this be a general-purpose feature that also accelerates other forms of cryptography, high-precision scientific computing, or even large number factorization?
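To make the second question concrete, the textbook decomposition of a full 64x64->128 product from 32-bit partial products is shown below. This is a generic illustration (not code from the paper) of the extra work a widening multiply implies beyond a multiply-low.

```cpp
#include <cstdint>

// Full 64x64 -> 128-bit product built from four 32x32 -> 64-bit partial
// products (generic textbook decomposition). A multiply-low unit only needs
// the wrap-around sum that forms 'lo'; a widening multiply must also form and
// write back 'hi'.
void mul_64x64_128(uint64_t a, uint64_t b, uint64_t& hi, uint64_t& lo) {
    uint64_t a_lo = (uint32_t)a, a_hi = a >> 32;
    uint64_t b_lo = (uint32_t)b, b_hi = b >> 32;

    uint64_t p_ll = a_lo * b_lo;
    uint64_t p_lh = a_lo * b_hi;
    uint64_t p_hl = a_hi * b_lo;
    uint64_t p_hh = a_hi * b_hi;

    uint64_t mid  = p_lh + (p_ll >> 32);    // cannot overflow 64 bits
    uint64_t mid2 = p_hl + (uint32_t)mid;   // carries into the high half

    lo = (mid2 << 32) | (uint32_t)p_ll;
    hi = p_hh + (mid >> 32) + (mid2 >> 32);
}
```

A multiply-low unit only has to deliver lo; a widening multiply must also produce hi and write back a second full result, which is why the proxy choice deserves the justification requested above.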
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper presents an investigation into closing the performance gap between general-purpose CPUs and specialized hardware for cryptographic kernels, specifically NTT and BLAS operations on large integers. The authors' contribution can be deconstructed into three parts: (1) an optimized software implementation using existing scalar and SIMD instructions (AVX-512), (2) a proposed ISA extension named Multi-word Extension (MQX) to further accelerate these kernels, and (3) a performance modeling methodology termed Performance Projection using Proxy ISA (PISA) to estimate the gains from MQX.
My analysis concludes that while the engineering effort is sound and the performance results are compelling, the core ideas presented as novel are, in fact, derivative of prior, well-established architectural concepts. The central proposal, MQX, is a direct adaptation of SIMD instructions proposed and documented over a decade ago for Intel's Larrabee and Knights Corner architectures. The PISA methodology is a pragmatic formalization of a common-sense modeling technique (proxy-based estimation) rather than a fundamentally new evaluation framework. The paper's primary contribution is therefore not one of invention, but of demonstrating the significant value of re-introducing and adapting old ideas to a new data type (64-bit) and a modern problem domain (cryptography).
Strengths
- Excellent Problem-to-Solution Mapping: The paper correctly identifies the specific arithmetic operations (multiplication with wide results, addition/subtraction with carry propagation) that are bottlenecks for large integer arithmetic in existing SIMD instruction sets.
- High-Impact, Low-Complexity Proposal: The proposed MQX instructions, despite their lack of novelty, are shown to yield substantial performance improvements (a further 2.1x-3.7x over AVX-512 on page 10, Figure 5). The authors correctly argue that the hardware implementation cost would be minimal, as the scalar analogues are fundamental to the x86 ISA. This represents a very favorable complexity-vs-benefit trade-off.
- Commendable Transparency: The authors are transparent in citing the prior art that their work is based on. In Section 4.1 (page 6), they explicitly draw parallels between MQX and the SIMD instructions found in Intel's Larrabee New Instructions (LRBni) [38] and Knights Corner (KNC) intrinsics [23]. This transparency is laudable, but it also serves to undermine the claim of novelty.
Weaknesses
-
Fundamental Lack of Novelty in the MQX ISA Extension: The core proposal of this paper is the MQX instruction set. However, these instructions are conceptually identical to prior art.
- Vector Add-with-Carry and Subtract-with-Borrow: The proposed _mm512_adc_epi64 and _mm512_sbb_epi64 are vector versions of the x86 ADC/SBB instructions. As the authors themselves note, vector versions of these operations (vadcpi, vsbbpi) were key features of the Larrabee ISA (LRBni) [38] from 2008 and were later documented as intrinsics for the Knights Corner (MIC) architecture [23]. The only "delta" here is the extension from 32-bit elements to 64-bit elements and the use of AVX-512 mask registers for carry/borrow flags instead of a dedicated flag register. This is an evolutionary, data-width-extension change, not a novel conceptual contribution (the multi-word carry chain these instructions target is sketched after this list).
- Widening Multiplication: The proposed _mm512_mul_epi64 to produce a 128-bit result from two 64-bit inputs is an extension of the fundamental MUL instruction in x86. Again, vector widening multiplies for smaller data types (e.g., 32-bit) existed in KNC. The concept is not new.
-
Limited Novelty of the "PISA" Methodology: The authors introduce PISA as their method for performance modeling (Section 4.2, page 7). While giving it an acronym lends it an air of novelty, the underlying technique is simply proxy-based performance estimation. The practice of using an existing, structurally similar instruction to model the performance characteristics (latency, port usage) of a hypothetical instruction is a standard and long-standing technique in both industry and academic architecture exploration, especially when cycle-accurate simulators are unavailable or lack support for new ISAs. The novelty is in the application, not the method itself.
-
Framing of the Contribution: The paper is framed as a proposal for a new ISA extension. A more accurate framing would be an argument for the reinstatement and adaptation of previously abandoned, but highly effective, SIMD concepts for modern cryptographic workloads. The current framing overstates the inventive aspect of the work.
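For reference, the operation these instructions target is the ordinary multi-word carry chain, sketched below in scalar form. This is a generic illustration under no assumptions from the paper; the vector forms simply apply the same recurrence to eight independent big integers per instruction.

```cpp
#include <cstddef>
#include <cstdint>

// Schoolbook multi-word addition: the carry produced by each 64-bit limb feeds
// the next limb's add. This is the recurrence that scalar ADC implements and
// that a vector add-with-carry would apply to 8 big integers at once. Limb
// count and layout are arbitrary here.
void bigint_add(const uint64_t* a, const uint64_t* b, uint64_t* out, size_t limbs) {
    unsigned carry = 0;
    for (size_t i = 0; i < limbs; ++i) {
        uint64_t s1 = a[i] + carry;
        unsigned c1 = (s1 < carry);   // overflow from adding the carry-in
        uint64_t s2 = s1 + b[i];
        unsigned c2 = (s2 < b[i]);    // overflow from adding the limb
        out[i] = s2;
        carry = c1 | c2;
    }
}
```

Without a carry-aware vector instruction, each lane typically reconstructs the carry with an extra unsigned compare and add, which is the overhead the 32-bit LRBni/KNC forms already removed; this is why the 64-bit MQX forms read as a width extension of that concept rather than a new primitive.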
Questions to Address In Rebuttal
-
Beyond the change in operand width from 32-bit to 64-bit and the use of mask registers for carry propagation, what are the fundamental conceptual differences between the proposed MQX instructions (adc_epi64, sbb_epi64) and the vadcpi/vsbbpi instructions from Larrabee New Instructions [38]? Please justify why this adaptation constitutes a novel architectural proposal rather than an extension of prior art.
-
The PISA methodology (Section 4.2, page 7) is presented as a novel contribution. Can the authors provide citations to prior work in performance modeling and differentiate PISA from the general, well-established practice of proxy-based estimation, where an existing instruction is used to model a proposed one that maps to similar hardware resources?
-
The sensitivity analysis in Section 5.5 (page 11) suggests that replacing a full widening multiply with a multiply-high instruction (+Mh,C) results in only a minor performance degradation. Given that a multiply-high instruction is likely cheaper to implement than a full widening multiply that must write back two full vectors, why did the authors choose to propose the more complex instruction in the main MQX design? Does this choice represent the optimal trade-off between implementation cost and performance benefit?
HAWK: Fully Homomorphic Encryption Accelerator with Fixed-Word Key Decomposition Switching
Abstract
Fully Homomorphic Encryption (FHE) allows for direct computation on encrypted data, preserving privacy while enabling outsourced processing. Despite its compelling advantages, FHE schemes come with a significant performance penalty. Although recent ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present HAWK, a hardware accelerator for Fully Homomorphic Encryption (FHE) based on the CKKS scheme. The paper's primary assertion is that the recently proposed Key Decomposition Switching (KDS) method [36], despite its superior asymptotic complexity, is impractical for existing fixed-word-length hardware accelerators. The authors identify challenges including high computational cost at small word lengths, substantial memory expansion of the evaluation key, and the need for a hardware-unfriendly rounding operation. To address these, they propose the Fixed-Word KDS (FW-KDS) method, which unifies the operational word length. They augment this with several optimizations, including a "Half-RConv" strategy to reduce ring conversions in H-IDFT and a novel rounding circuit claimed to be error-free. The resulting HAWK architecture is presented as an evolution of the SHARP accelerator, reportedly achieving up to a 1.45x performance improvement with a 4% area increase.
Strengths
- Problem Identification: The paper provides a thorough and valuable analysis of the practical implementation costs of the KDS method in Section 3. The empirical data presented in Figures 3 and 4 (page 5) effectively illustrates the performance crossover points with HKS and the significant evaluation key bloat, which are non-obvious from the original theoretical presentation of KDS. This analysis is a solid contribution in its own right.
- Clear Baseline: The work is evaluated against SHARP [34], a state-of-the-art FHE accelerator. Using a strong, recent, and relevant baseline allows for a more credible assessment of the claimed performance improvements.
- Multi-faceted Approach: The authors address the identified problems from multiple angles: algorithmic modification (FW-KDS), procedural optimization (Half-RConv), and microarchitectural design (the new rounding circuit). This demonstrates a comprehensive approach to co-design.
Weaknesses
My primary concerns with this submission revolve around the framing of the core problem, the attribution of performance gains, and the rigor of key claims.
-
Potentially Circular Problem Framing: The central premise is that KDS is "incompatible" with fixed-word accelerators. However, the theoretical advantage of KDS stems precisely from its ability to use a converted modulus H composed of larger primes h_i than the ciphertext modulus primes q_i. The proposed "Fixed-Word" KDS forces wd(h_i) = wd(q_i), constraining KDS to an operating point that the authors' own analysis in Figure 3 (page 5) demonstrates is computationally inferior to HKS. The paper then presents optimizations to mitigate the performance loss of this self-imposed constraint. This framing is questionable; it appears less like an inherent flaw in KDS and more like a problem created by forcing the algorithm into an unsuitable architectural model.
-
Conflated and Misleading Attribution of Performance Gains: The headline claim is a 1.45x speedup, and the paper is titled and framed around "Fixed-Word Key Decomposition Switching." However, the ablation study in Figure 12(b) (page 12) tells a different story. The transition from the HKS baseline to using FW-KDS yields a performance improvement of approximately 1.11x-1.15x. The subsequent application of the "Half-RConv" technique provides the remaining, and more substantial, jump to the final ~1.45x performance, i.e., roughly a further 1.45/1.13 ≈ 1.28x on its own. This suggests that Half-RConv is the dominant optimization, yet it is presented as a secondary improvement. The paper's focus on FW-KDS as the main contributor is therefore misleading.
-
Insufficient Justification for the "Error-Free" Rounding Circuit: In Section 6.3 (page 11), the authors claim their new rounding circuit "completely eliminates the rounding computational error." This is a very strong claim. The proof relies on Lemma 6.1, which requires the rounding offset to be less than 1/4. The authors assert that their modulus selection H > 4 * 2 * B_d guarantees this condition. However, the derivation in Section 7.7 (page 14) that q_i > dnum * 2N is sufficient to meet this larger bound is terse. A more rigorous and formal proof is required to substantiate the claim that this condition holds universally for all valid CKKS parameter sets and that the rounding is truly "error-free" in all cases, rather than just having an error bounded below the precision of the system (the standard form of the required bound is sketched after this list).
-
Ambiguous Cost Analysis: The abstract claims a "4% area increase," which seems modest for the added capabilities and memory. Table 2 (page 12) shows the on-chip memory capacity of HAWK is 212 MB (192+20), while for SHARP it is 198 MB (180+18). This constitutes a 7.1% increase in on-chip memory. Given that on-chip SRAM is a dominant contributor to the area and power budget of FHE accelerators, it is unclear how a 7.1% increase in memory results in only a 4.25% (186.4/178.8) increase in total chip area. The authors must provide a detailed area breakdown of the logic components (NTTU, EWU, EBConvU) and memory for both HAWK and their SHARP baseline to validate this claim. Without it, the 4% figure is not credible.
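For reference, the argument the rounding weakness asks to see made rigorous usually takes the following form, stated here generically in terms of the quantities quoted above rather than as a reproduction of the paper's Lemma 6.1:

```latex
% Generic exact-rounding bound, in the terms quoted in the review
% (offset < 1/4, H > 4 * 2 * B_d); a sketch, not the paper's Lemma 6.1.
A = kH + r, \qquad |r| \le 2B_d, \qquad H > 4 \cdot 2B_d
\;\Longrightarrow\;
\left| \frac{A}{H} - k \right| = \frac{|r|}{H} < \frac{2B_d}{8B_d} = \frac{1}{4}
\;\Longrightarrow\;
\left\lfloor A/H \right\rceil = k .
```

A complete proof would therefore have to establish both that the residue bound 2B_d holds and that H > 8B_d is satisfied for every admissible parameter set, not only the configurations evaluated.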
Questions to Address In Rebuttal
The authors must provide clear and concise answers to the following questions:
- Your core premise is the impracticality of KDS on fixed-word hardware. Please justify why forcing KDS into a suboptimal, fixed-word configuration (FW-KDS) and then optimizing it is a superior approach to designing a more flexible architecture that could properly leverage the variable-word-length nature of the original KDS method. Is the problem truly with KDS, or with the inflexibility of the target architecture?
- According to your ablation study (Figure 12(b)), the Half-RConv optimization appears to contribute more to the final performance gain than the FW-KDS method. Please justify the paper's title and narrative focus on FW-KDS. Provide a performance breakdown for the bootstrapping workload that precisely isolates the speedup from FW-KDS alone versus Half-RConv alone.
- Regarding the rounding circuit in Section 6.3, please provide a formal proof that your modulus selection strategy guarantees that the rounding offset | [A]_H / H | is strictly less than 1/4 for all valid and secure CKKS parameter configurations, not just the specific ones tested in this work.
- Please reconcile the discrepancy between the 7.1% increase in on-chip memory capacity and the claimed 4% total area increase over the SHARP baseline. Provide a detailed area breakdown for all major components (functional units, memory, NoC, etc.) for both your HAWK design and the SHARP baseline to substantiate your overall area claim.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents HAWK, an FHE accelerator designed around a novel algorithm-hardware co-design principle. The work's central focus is to make the recent and theoretically promising Key Decomposition Switching (KDS) algorithm practical for hardware implementation. The authors first provide an insightful analysis (Section 3) identifying the key barriers to deploying KDS on existing fixed-word FHE accelerators: (1) its requirement for divergent word-lengths, (2) a significant inflation of evaluation key memory, and (3) a hardware-unfriendly rounding operation.
The core contribution is a new method, Fixed-Word Key Decomposition Switching (FW-KDS), which elegantly resolves the primary challenge by constructing the temporary conversion modulus H from the set of existing RNS primes q_i. This unifies the operational word length, making KDS compatible with established accelerator designs. Building on this foundation, the authors introduce further algorithmic and architectural co-optimizations, including a technique to halve the expensive ring conversions in bootstrapping (Half-RConv) and a novel, hardware-efficient circuit for the exact base conversion that completely eliminates rounding errors. The resulting system, HAWK, demonstrates up to a 1.45x performance improvement over a state-of-the-art baseline with a negligible area increase, effectively bridging the gap between a new cryptographic theory and its practical hardware realization.
Strengths
-
Excellent Problem Identification and Positioning: The paper's primary strength lies in its clear identification of a critical gap in the field. The KDS algorithm [36] was a recent advance from the cryptography community with superior asymptotic complexity, yet its practical utility for the hardware architecture community was unproven and, as the authors show, deeply problematic. This work serves as an essential bridge between these two domains. The analysis in Section 3 is one of the most valuable parts of the paper, as it clearly articulates the practical hurdles that a purely theoretical analysis would miss.
-
Elegant and Impactful Core Idea: The proposed FW-KDS method is a beautifully simple and effective solution to the main challenge of divergent word lengths. The idea of reusing existing RNS primes for the converted modulus H (Section 4.1) is not a brute-force fix but an insightful co-design choice that unlocks a cascade of subsequent benefits, including reduced computational complexity (Section 4.3) and smaller memory footprints (Section 4.4). This represents a significant conceptual advance over the original KDS formulation for any fixed-word implementation.
-
Strong System-Level Co-Design: This work goes far beyond a simple algorithmic tweak. The authors have considered the full system impact of their choices. The "Half-RConv" optimization (Section 5.1.1) is a clever insight into the dataflow of iterative bootstrapping routines, showing a deep understanding of the application structure. Furthermore, the development of a new hardware-friendly rounding method (Section 6.3, Figure 11e) demonstrates a commitment to solving the problem across the entire stack, from algorithm to circuit level. This holistic approach is commendable.
-
Opening a New Design Space: Prior to this work, FHE accelerator research had largely converged on optimizing the Hybrid Key-Switching (HKS) method. HAWK fundamentally challenges this consensus by demonstrating a practical and superior alternative. It effectively provides a blueprint for a new generation of FHE accelerators built on a more efficient primitive. This has the potential to shift the focus of future research and unlock further performance gains.
Weaknesses
While the paper is strong, its context and implications could be further enriched.
-
Under-explored Trade-off Characterization: The paper pragmatically settles on a hybrid strategy: using FW-KDS for high and medium levels and falling back to HKS for low levels. This is a sound engineering decision, but the underlying trade-offs could be more deeply explored. Figure 14b provides a glimpse, but a more explicit characterization of the performance/memory crossover point between HKS and FW-KDS as a function of key FHE parameters (e.g., polynomial degree N, number of levels L, RNS prime size log q_i) would be invaluable for future designers seeking to apply this work.
-
Interaction with Orthogonal Optimizations: The FHE literature contains other powerful optimization techniques, most notably hoisting [27], which trades a significant increase in memory for a large reduction in H-(I)DFT computation. The authors rightly follow the path of recent accelerators like ARK [35] and SHARP [34] in forgoing hoisting to manage on-chip memory. However, the paper would be more complete with a short discussion on how FW-KDS might interact with hoisting in a different design context (e.g., a GPU-based system with vast memory). Does the reduced key size of FW-KDS make hoisting more tenable? Or do its benefits primarily apply to memory-constrained systems?
-
Generalizability Beyond CKKS: The discussion in Section 8.2 touches upon adapting the method to other RLWE-based schemes like BGV and BFV. This is a crucial point for assessing the work's broad impact. This section could be strengthened by briefly explaining why the key-switching mechanisms are similar enough to expect this adaptation to be successful, thereby giving the reader more confidence in the claimed generalizability.
Questions to Address In Rebuttal
-
The proposed exact rounding method in Section 6.3 is a novel and important contribution for making KDS practical. Could you quantify the area and/or power savings of your proposed circuit (Figure 11e) compared to the "traditional" high-precision implementation (Figure 11d) it replaces? This would more concretely establish the benefit of this specific optimization.
-
Regarding the dual-mode HKS/FW-KDS strategy, could the authors elaborate on the crossover point where FW-KDS becomes more efficient than HKS? How sensitive is this point to FHE parameters like N, L, or the decomposition base alpha? A clearer characterization of this trade-off space would be valuable.
-
This work successfully demonstrates a new, superior key-switching primitive for accelerators, which likely shifts the overall system bottlenecks. After applying your (M)FW-KDS optimizations, what becomes the next dominant bottleneck in the bootstrapping process? Is it still the (I)NTT operations, or has the problem shifted more towards data movement or other element-wise operations within the EWU?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents HAWK, a hardware accelerator for Fully Homomorphic Encryption (FHE) that centers on a novel adaptation of the recently proposed Key Decomposition Switching (KDS) algorithm. The authors identify that the original KDS method is incompatible with mainstream fixed-word-length FHE accelerators due to its requirement for large, variable-width moduli (hi) to achieve its theoretical efficiency.
The central claim of novelty is the Fixed-Word Key Decomposition Switching (FW-KDS) method. This method modifies the KDS algorithm by constructing the temporary conversion modulus H using primes from the existing FHE parameter set (qi), thereby forcing all operations into a single, fixed word length. The paper further proposes two supporting contributions: a hardware-friendly integer-based method for the exact rounding operation required by KDS, and a dataflow optimization named Half-RConv to reduce redundant computations during bootstrapping. The authors claim this is the first work to successfully deploy the KDS method in a fixed-word hardware accelerator.
Strengths
-
Core Contribution's Novelty: The primary contribution, FW-KDS, appears to be genuinely novel. The KDS algorithm (Kim et al., CRYPTO 2023 [36]) is very recent, and to my knowledge, no prior work has systematically analyzed its hardware implementation challenges, let alone proposed a concrete algorithmic modification to resolve the critical word-length incompatibility issue. While the problem of fixed vs. variable word length is well-understood in the FHE accelerator community (e.g., SHARP [34]), the authors' proposed solution—constructing the temporary modulus H from the existing prime set D (Section 4.1, page 6)—is a simple but clever algorithmic trick that directly addresses this problem. This is a clear and novel advancement over the prior art.
-
Novelty in Supporting Contributions: The paper contains secondary contributions that are also novel in their specific context:
- Hardware-Friendly Rounding (Section 6.3, page 11): The standard method for exact base conversion relies on high-precision arithmetic that is hostile to hardware. The authors replace this with a technique based on Lemma 6.1, which maps the problem to a fixed-point multiplication and bit truncation. While using fixed-point tricks to implement rounding is a known concept in DSP (the general trick is sketched after this list), its application to solve the specific ExactBConv problem from Halevi et al. [24] within an FHE accelerator is a new and valuable piece of algorithm-hardware co-design. It eliminates a significant implementation barrier for KDS.
- Half-RConv Optimization (Section 5.1.1, page 8): This dataflow optimization, which defers the ring conversion for one of the key-switch outputs, is a novel observation specific to the iterative structure of bootstrapping's giant-step when combined with a KDS-like method. It arises directly from the introduction of the temporary ring RH, a feature absent in prior HKS-based accelerators. Therefore, this optimization is a novel contribution in the context of accelerating KDS.
-
Systematic Problem Formulation: The paper does an excellent job of deconstructing the KDS algorithm and identifying its practical weaknesses for hardware (Section 3, pages 4-6). This clear motivation, which precedes the solution, establishes that the authors are not just porting an algorithm but are thoughtfully re-engineering it based on a deep understanding of hardware constraints.
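As background for the rounding point above, the general DSP-style trick being referenced looks roughly as follows. This is a hedged sketch of the generic reciprocal-multiply idea only, not the HAWK circuit; the struct and parameter names are invented for illustration, and it assumes a compiler with __int128.

```cpp
#include <cstdint>

// Generic fixed-point rounding trick: compute round(a / h) with a precomputed
// scaled reciprocal and a shift instead of a wide division. Illustration of
// the general technique only; not the HAWK design or its parameter choices.
struct FixedPointDivider {
    uint64_t recip;  // floor(2^k / h), precomputed once per modulus h
    unsigned k;      // fixed-point scale; must be large enough that the
                     // reciprocal truncation error cannot change the result
};

uint64_t round_div(uint64_t a, const FixedPointDivider& d) {
    unsigned __int128 p = (unsigned __int128)a * d.recip;  // a/h in k-bit fixed point
    p += (unsigned __int128)1 << (d.k - 1);                // add 0.5 ulp to round
    return (uint64_t)(p >> d.k);                           // truncate back to integer
}
```

The value of the paper's contribution, as noted above, is in showing that this class of trick can be made exact for the specific ExactBConv setting rather than in the trick itself.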
Weaknesses
-
Incremental Nature of the Core Idea: While novel, the core idea of FW-KDS is ultimately an engineering compromise born of a hardware constraint. The authors' own analysis (Figure 3, page 5) clearly shows that their proposed FW-KDS (with wd(hi) forced to 36 bits) is computationally inferior to the original KDS algorithm running at its optimal, larger word length (e.g., 64 bits). The novelty lies in making KDS practical for fixed-word hardware, but it does so by sacrificing some of the algorithm's intrinsic asymptotic advantage. The paper should be more upfront that FW-KDS is a "hardware-constrained KDS" rather than an unequivocally superior version.
-
Limited Scope of Novelty for Rounding Technique: The mathematical basis for the rounding optimization (Lemma 6.1, page 11) is a known principle in number theory and fixed-point arithmetic: if a value is known to be very close to an integer, its integer part can be found efficiently. The contribution is the application of this principle, not the invention of it. The framing could be more precise to reflect this.
-
The Delta Over Prior Art is Primarily an Integration Effort: The paper's main achievement is integrating the ideas of KDS into a state-of-the-art accelerator architecture template (largely derived from SHARP [34]). The architectural changes themselves (e.g., the EBConvU) are direct consequences of supporting the new algorithm, rather than fundamental architectural innovations. The novelty is therefore more algorithmic and co-design-focused than purely architectural.
Questions to Address In Rebuttal
-
The fundamental premise of FW-KDS is the constraint of a fixed-word architecture. How does the performance of HAWK using FW-KDS compare to a hypothetical, albeit more complex, accelerator designed with native support for multiple word lengths (e.g., 36-bit and 64-bit ALUs) that could execute the original KDS algorithm optimally? This would help quantify the performance cost of the fixed-word design choice.
-
The proposed rounding method relies on the condition H > 4 * 2Bd. In Section 7.7 (page 14), you argue this condition is easily met. Can you provide a more rigorous analysis? Are there any FHE parameter sets for different security levels (e.g., 192-bit) where this constraint becomes difficult to satisfy or forces a sub-optimal choice for the modulus H, thereby negatively impacting performance?
-
The Half-RConv optimization is presented as a key contribution. Could the authors clarify if the novelty lies in the observation that this specific dataflow is amenable to optimization, or if the technique of deferring basis conversion in this manner is itself a fundamentally new concept?
ShadowBinding: Realizing Effective Microarchitectures for In-Core Secure Speculation Schemes
Abstract
Secure speculation schemes have shown great promise in the war against speculative side-channel attacks and will be a key building block for developing secure, high-performance architectures moving forward. As the field matures, the need for rigorous ...
Reviews
Review 1
Review Form:
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present "ShadowBinding," an evaluation of two in-core secure speculation schemes, Speculative Taint Tracking (STT) and Non-Speculative Data Access (NDA), on the RISC-V BOOM out-of-order core. The paper identifies a critical timing dependency chain in the originally proposed rename-stage implementation of STT (termed STT-Rename). To address this, the authors propose a new microarchitecture, STT-Issue, which delays taint computation to the issue stage. The authors provide RTL implementations and evaluate the schemes on FPGA, reporting on IPC, timing, area, and power. They conclude that the performance cost of these schemes is higher than previously estimated, challenge the superiority of STT over NDA when timing is considered, and argue that performance degradation will be worse on higher-performance cores.
Strengths
- The work is grounded in an RTL implementation on a synthesizable core (BOOM), providing a more concrete basis for microarchitectural analysis than purely abstract, high-level simulation.
- The identification of the single-cycle, width-dependent dependency chain in rename-stage taint tracking (STT-Rename, Section 4.1, page 4) is a precise and valuable microarchitectural insight. The analysis in Figure 3 is clear and highlights a fundamental scalability problem.
- The paper provides a multi-faceted evaluation, considering not just IPC but also timing from synthesis, area, and power (Section 8, pages 8-10). This is a necessary step beyond the typical IPC-only analysis common in the literature.
Weaknesses
This paper's conclusions, while provocative, are built upon a foundation with significant methodological and analytical weaknesses. The claims of high performance costs for state-of-the-art cores are not adequately substantiated by the evidence provided.
-
Questionable Proxy for High-Performance Cores: The central thesis rests on extrapolating results from the BOOM core to "leading processor core designs." This is a significant analytical leap. The authors' own data (Table 1, page 7) shows the highest-performing BOOM configuration (Mega) achieves an IPC of 1.27 on SPEC2017, whereas a contemporary Intel core (Redwood Cove) achieves 2.03. Furthermore, the authors concede in Section 9.6 (page 12) that "The BOOM is less optimized than leading commercial cores." Using a mid-range academic core to make sweeping claims about the performance of highly optimized, industrial cores is unconvincing. The observed trends may be artifacts of BOOM's specific microarchitecture (e.g., its "naïve memory-retry policy") rather than fundamental properties of the secure schemes themselves.
-
Unreliable Timing and Performance Extrapolation: The timing analysis is based on FPGA synthesis. It is well-established that FPGA and ASIC synthesis produce vastly different timing results, due to different cell libraries, routing constraints, and physical design flows. Claiming that the timing trends observed on an FPGA (Figure 10, page 10) will hold for a high-frequency ASIC design is speculative at best. This weakness is compounded by the performance extrapolation in the abstract (page 1) and Section 9.5 (page 12). The "linear extrapolation" is acknowledged as unlikely, and the "less pessimistic estimate with only halved growth" is an arbitrary assumption with no theoretical or empirical justification. These figures appear to be sensationalized rather than rigorously derived.
-
Insufficient Novelty of STT-Issue: The proposal to delay taint tracking to a later pipeline stage (STT-Issue) is not entirely novel. The authors themselves cite Jauch et al. [21], who also perform taint tracking later in the pipeline. The paper attempts to differentiate its work by stating Jauch et al. taint "post register-read instead of post-issue" (Section 10, page 12), but fails to provide a detailed analysis of why this specific difference leads to the claimed benefits. The contribution of STT-Issue appears to be more of an implementation choice than a fundamental new concept, and its novelty is overstated.
-
Inconsistent Argumentation: The paper criticizes prior work for relying on architectural simulators (Section 9.4, page 11) and making potentially unrepresentative assumptions (e.g., L1 cache latency). However, this work commits similar errors by using a non-representative core (BOOM) and an unreliable timing methodology (FPGA synthesis) to extrapolate results to a target domain (high-performance ASICs) where they may not apply. This is a case of criticizing others for a class of error one also commits.
Questions to Address In Rebuttal
-
Please provide a compelling justification for using the BOOM core as a valid proxy for "leading processor core designs," especially when its baseline performance is substantially lower and its microarchitecture is admittedly less optimized. How can the authors be certain that their observed IPC and timing trends are not artifacts of BOOM's specific limitations?
-
The paper's conclusions about total performance (IPC x Timing) hinge on timing results from FPGA synthesis. Please defend the validity of these timing results and their applicability to high-frequency, commercial ASIC designs. What evidence suggests that the critical paths identified on the FPGA would remain the critical paths in an ASIC implementation?
-
The performance loss projections for a Redwood Cove-class core are presented prominently. What is the basis for the "halved growth" assumption used in the "less pessimistic estimate"? Without a formal model or supporting data, this appears to be unfounded speculation. Please justify this calculation or remove it.
-
Please provide a more detailed microarchitectural comparison between STT-Issue and the secure speculation implementation by Jauch et al. [21]. Clarify precisely what makes STT-Issue novel and why the specific design choice of tainting at issue, rather than post-register-read, is fundamentally better.
-
The analysis of the exchange2 benchmark in Section 9.2 (page 11) is interesting, suggesting STT-Rename fundamentally harms store-to-load forwarding by delaying store address generation. Is this an inherent flaw of the STT-Rename concept, or could it be an artifact of the specific partial-issue mechanism in your BOOM implementation?
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents a rigorous, hardware-level evaluation of two state-of-the-art in-core secure speculation schemes, Speculative Taint Tracking (STT) and Non-Speculative Data Access (NDA). The authors move beyond abstract architectural simulation by implementing these schemes in RTL on the RISC-V BOOM core and evaluating them on FPGAs.
The core contribution is twofold. First, the authors uncover a fundamental microarchitectural flaw in the originally proposed concept for STT—a performance-limiting dependency chain in the rename stage—and propose a novel, more scalable alternative, "STT-Issue." Second, and more broadly, their work serves as a crucial reality check for the field, demonstrating that the true performance cost of these schemes (considering both IPC and timing degradation) is far higher than previously estimated, potentially exceeding 30% for high-performance cores. This challenges the prevailing sentiment that in-core Spectre mitigations can be a low-cost solution and acts as a "call to arms" for more deeply integrated and realistic evaluations of security mechanisms.
Strengths
-
Methodological Grounding and Realism: The paper's greatest strength lies in its departure from purely simulator-based analysis. By implementing the schemes in synthesizable RTL and evaluating them on an FPGA platform (BOOM/FireSim), the authors provide a much-needed bridge between architectural theory and microarchitectural reality. This allows them to uncover timing, area, and power implications (Section 8.2, 8.4) that are simply invisible at a higher level of abstraction. This work sets a higher bar for how such schemes should be evaluated going forward.
-
Novel Microarchitectural Insight and Contribution: The identification of the single-cycle dependency chain in rename-based taint tracking for STT is a significant and non-obvious finding (Section 4.1, page 4). It is a perfect example of a problem that only becomes apparent through concrete implementation. The proposed STT-Issue microarchitecture, which delays tainting to the issue stage, is an elegant solution that directly addresses this scaling bottleneck. This is a solid, self-contained microarchitectural contribution that advances the state of the art for STT.
-
Significant and Impactful Results: The central conclusion—that in-core schemes are much more expensive than the community has been led to believe—is of paramount importance. The data presented in Figure 1 (page 2) and Figure 9 (page 10) compellingly shows that as core parallelism (and thus baseline performance) increases, the relative overhead of these security features becomes dramatically worse. This finding has the potential to redirect research efforts, perhaps encouraging the community to reconsider the trade-offs of out-of-core mechanisms or more tightly integrated hardware/software co-designs, which may have been prematurely dismissed as too complex.
-
Holistic Performance Perspective: The paper correctly argues that performance is a product of both IPC and clock frequency (Timing). By analyzing these two factors separately and then combining them, the authors provide a complete picture of the performance impact. Their analysis highlights how a scheme like NDA, while having a worse IPC impact, can be superior overall due to its design simplicity, which translates into a negligible timing penalty. This is a crucial insight for architects who must make practical design decisions.
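Stated compactly, the decomposition the paper uses is the product below; the numbers in the example are purely illustrative and are not results from the paper.

```latex
% Overall slowdown as the product of the two separately analyzed factors;
% the numeric example is illustrative only, not a result from the paper.
\frac{\mathrm{Perf}_{\mathrm{secure}}}{\mathrm{Perf}_{\mathrm{base}}}
  = \frac{\mathrm{IPC}_{\mathrm{secure}}}{\mathrm{IPC}_{\mathrm{base}}}
    \times \frac{f_{\mathrm{secure}}}{f_{\mathrm{base}}},
\qquad \text{e.g.}\quad 0.90 \times 0.93 \approx 0.84 .
```

A scheme that wins on the first factor can still lose overall once the second factor is included, which is exactly the NDA-versus-STT outcome the paper highlights.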
Weaknesses
-
Limited Scope of Evaluated Platform: While the use of BOOM is a major strength, it is still an academic, open-source core. The absolute performance and the specific critical paths identified may not perfectly map to the highly-optimized, proprietary designs from major vendors like Intel or AMD. This is an inherent limitation of academic hardware research, but the authors could be more explicit in framing their results as indicative of trends that are likely to be exacerbated in more complex commercial designs, rather than as precise predictions of overhead.
-
Hand-wavy Extrapolation: The linear extrapolation of performance loss for a Redwood Cove-class processor mentioned in the abstract and Section 9.5 is speculative. Performance scaling is notoriously non-linear, and while the trend is clear and alarming, presenting a specific number like "49.5%" might overstate the predictive power of the model. It would be stronger to focus on the demonstrated trend itself without making such a precise, and likely inaccurate, projection.
-
Threat Model Nuances: The paper focuses on C- and D-shadows (control and data speculation), which are indeed the most prominent. However, a brief discussion on the implications of extending their microarchitectures to handle more complex shadows (e.g., M- and E-shadows for memory consistency and exceptions) would strengthen the paper's completeness. Would the dependency chain in STT-Rename become even worse? Would the simplicity of NDA continue to be an advantage?
Questions to Address In Rebuttal
-
The paper's "call to arms" is one of its most compelling aspects. If the authors are correct that the performance cost of state-of-the-art in-core schemes is prohibitively high, what do they believe is the most promising path forward for the research community? Should we focus on optimizing in-core schemes (e.g., via techniques like Doppelganger Loads [29] or ReCon [2]), pivot back to out-of-core solutions (e.g., InvisiSpec [56]), or invest more heavily in HW/SW co-design?
-
Your proposed STT-Issue design elegantly removes the critical dependency chain from the rename stage but introduces new complexity into the issue stage, including the taint unit and back-propagation of YRoT information to the issue queue (Figure 4, page 5). Could you elaborate on the scalability of this back-propagation network and the potential increase in issue queue port complexity, especially in a very wide and deeply out-of-order machine?
-
Your analysis shows that the simpler design (NDA) ultimately provides better overall performance than the more complex one (STT) on your widest core configuration due to timing advantages. This is a classic architectural lesson. Do you see this as a generalizable principle for secure microarchitecture—that is, should the community prioritize schemes with minimal structural disruption over those that appear to offer better IPC in abstract models?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents microarchitectural designs and a high-fidelity performance evaluation for two existing in-core secure speculation schemes: Speculative Taint Tracking (STT) [58] and Non-Speculative Data Access (NDA) [55]. The paper's novel contributions are not new security schemes, but rather:
- The identification of a previously unarticulated, fundamental performance limitation in STT: a single-cycle, width-dependent dependency chain when taint tracking is performed at the register rename stage (termed STT-Rename).
- A novel microarchitectural proposal, "STT-Issue," which mitigates this limitation by delaying taint computation until the instruction issue stage, thereby breaking the dependency chain.
- A novel empirical finding, derived from an RTL-based evaluation methodology, that the true performance cost (IPC × Timing) of these in-core schemes is substantially higher than reported in prior simulator-based work.
While the proposed STT-Issue architecture is a novel solution to a newly identified problem, the paper's own results suggest that a less-novel design (NDA) ultimately provides better performance on wider cores. The primary value of this work lies in its rigorous analysis and the new insights it provides into the scaling limitations of existing schemes.
Strengths
-
Identification of a Novel Problem: The paper's most salient contribution is the detailed characterization of the critical dependency chain inherent to performing taint tracking at the rename stage (Section 4.1, page 4, Figures 2b and 3). The original STT paper [58] suggested compatibility with register renaming but did not analyze the microarchitectural implications of resolving same-cycle dependencies for taint propagation. This paper is the first to articulate that taint tracking dependencies are fundamentally different from register dependencies and to demonstrate the resulting critical path that scales linearly with processor width. This is a significant, novel insight into the prior art.
-
A Novel Microarchitectural Solution (STT-Issue): The proposed STT-Issue architecture (Section 4.3, page 5) is a novel and logical solution to the dependency chain problem identified in STT-Rename. While the use of issue-stage replay mechanisms is not new in itself [24], its specific application to decouple taint computation for speculative security is a novel contribution. It represents a clear advancement over the naive STT-Rename approach.
-
Novelty in Methodology and Empirical Findings: The evaluation on a synthesized RTL design (BOOM) provides a level of fidelity that is absent from prior work in this specific area, which has relied on architectural simulators (e.g., gem5). The resulting conclusion—that the combined impact of IPC loss and timing degradation leads to a performance overhead of over 20-35% (Figure 1, page 2)—is a novel and crucial finding. This challenges the entire premise of prior work suggesting these schemes have modest overheads. This empirical contribution is arguably as important as the architectural one.
Weaknesses
-
Limited Novelty in the NDA Microarchitecture: The proposed microarchitecture for NDA (Section 5, page 6) appears to be a direct and straightforward implementation of the requirements laid out in the original paper [55]. The core idea is to decouple the data writeback from the readiness broadcast for speculative loads (Figure 5b, page 6). While this is a necessary engineering step for a concrete implementation, it does not introduce a new microarchitectural principle. The contribution here is one of "realization" rather than "innovation," and the delta from the original concept is minimal.
-
Novelty vs. Efficacy Trade-off: A critical weakness stems from the paper's own results. The novel and more complex STT-Issue architecture is ultimately outperformed by the simpler, less-novel NDA implementation on the highest-performance "Mega" core configuration (Table 3, page 9). The performance loss for STT-Issue is 27% versus 22% for NDA. This demonstrates a case where the introduction of a novel microarchitectural technique does not lead to a superior result compared to a more direct implementation of an existing idea. This undermines the practical value of the novel STT-Issue contribution, suggesting that the complexity it introduces is not justified by the final performance outcome. The innovation solves a problem but leads to a suboptimal design point.
Questions to Address In Rebuttal
-
Regarding the NDA microarchitecture, please clarify the novel microarchitectural principle beyond a direct hardware implementation of the scheme's requirements as described in [55]. What is the conceptual delta that future designers could learn from and apply elsewhere?
-
Your results show that the less-novel NDA outperforms the novel STT-Issue on the Mega core configuration due to better timing. Does this not suggest that the pursuit of novel complexity in STT-Issue is a less fruitful path than refining simpler, existing schemes? Please justify the value of the STT-Issue novelty when a less-complex, known approach yields superior performance on high-performance cores.
-
Given that your novel analysis uncovers a fundamental scaling problem with rename-stage taint tracking, and your proposed novel solution (STT-Issue) still underperforms NDA, what is the key takeaway for future novel scheme design? Should the community conclude that taint-tracking approaches like STT are a dead end for wide-issue cores, or is there another novel architectural insight needed to make them viable?
SmartPIR: A Private Information Retrieval System using Computational Storage Devices
Abstract
Fully Homomorphic Encryption-based Private Information Retrieval systems provide strong privacy by enabling encrypted queries on databases hosted by untrusted servers. However, adoption is limited by system-level bottlenecks, including severe I/O ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present SmartPIR, a system for Private Information Retrieval (PIR) that leverages Computational Storage Devices (CSDs) to mitigate I/O bottlenecks. The core contributions are a protocol and architecture co-design, featuring a "zero-skipping encoding" (ZSE) to handle variable-length data efficiently and an in-storage FHE engine implemented on FPGAs. The system claims significant (10²×~10³×) speedups over CPU-based PIR schemes.
While the motivation is sound and the use of CSDs is pertinent, the paper's central claims rest on a foundation of weak baselines, an insufficient security analysis, and a glossing over of critical implementation details. The work demonstrates a functional prototype but fails to rigorously substantiate its performance advantages and security guarantees against realistic alternatives and threats.
Strengths
- Problem Identification: The paper correctly identifies two critical, practical bottlenecks in FHE-based PIR systems: the I/O overhead from full database scans and the computational redundancy introduced by padding variable-length data entries. This is a well-articulated motivation.
- Prototype Implementation: The authors have implemented a full-stack prototype on commercially available CSDs (Samsung SmartSSDs). This is a non-trivial engineering effort and provides a valuable data point on the practical feasibility of such systems, moving beyond pure simulation.
- Core Concept: The high-level idea of offloading FHE operations to CSDs to co-locate computation and data is a logical and promising direction for alleviating the identified I/O bottleneck.
Weaknesses
-
Grossly Exaggerated Performance Claims: The headline claim of a "10²×~10³× speedup" (Abstract, page 1) is deeply misleading. A close reading of the evaluation in Section 8.2.1 and Figure 8 (page 9) reveals this claim is derived from a comparison against a single-threaded CPU baseline (MulPIR+CPU). Against more credible, parallelized baselines such as MulPIR+CPUs (48 threads) and MulPIR+A100 (a high-end GPU), the speedups are far more modest, typically in the range of 5× to 47×. The abstract and introduction must be revised to reflect the performance against state-of-the-art, parallelized implementations, not a strawman baseline.
-
Insufficient Security Analysis: The security analysis in Section 5.4 (page 7) is cursory and unconvincing. The paper claims "indistinguishable computation" because the skipping pattern is dictated by the static database layout. This argument completely ignores timing side channels. An adversary observing the total execution time could potentially infer properties of the query index if different query paths, despite being functionally identical from the server's perspective, result in different numbers of skipped (i.e., fast-path) vs. non-skipped (slow-path) operations. The execution time of a query is an observable trace. Without a formal proof or at least a rigorous argument that execution time is independent of the query index, the security claim is unsubstantiated.
-
Unfair and Opaque Baselines:
- The comparison to INSPIRE (Section 8.2.1, page 10) is an apples-to-oranges comparison. INSPIRE is a simulated architecture, and the authors dismiss its superior performance on the uniform (Uni) dataset by claiming it relies on "idealized compute and I/O capabilities." This is a weak defense. The authors should instead explain why their real-world implementation is fundamentally less efficient in this best-case scenario for traditional PIR.
- A crucial baseline is missing: a direct implementation of the baseline protocol (MulPIR) on the same CSD hardware. Without this, it is impossible to disentangle the performance gains from the SmartPIR protocol (i.e., ZSE) from the gains simply obtained by moving any PIR computation to the CSDs. The authors' "co-design" claim hinges on this, but they provide no evidence to support it.
-
Unaddressed Protocol Overheads: The ZSE protocol (Section 5.1, page 5) requires a "bitwidth-aware sorting of the entries" as a preprocessing step. The cost of this sorting is never analyzed or mentioned in the evaluation. For dynamic databases where entries are frequently added, updated, or deleted, this could introduce significant and recurring overhead, undermining the system's practicality. The paper presents the database as a static artifact, which is not realistic.
-
Missing Hardware Implementation Details: The claim of a "resource-efficient FPGA circuit design" is not sufficiently supported. Table 1 (page 8) shows very high resource utilization (e.g., 85% of BRAMs, 68% of LUTs). High utilization often leads to significant challenges in timing closure and typically necessitates lower clock frequencies. The authors fail to report the clock frequency of their FPGA design, a critical metric for evaluating the efficiency of any hardware accelerator. Without this information, the performance results lack crucial context.
Questions to Address In Rebuttal
- Please justify the 10²×~10³× speedup claim in the abstract. Provide a revised claim based on a comparison to the multi-threaded CPU (MulPIR+CPUs) or GPU (MulPIR+A100) baselines, which represent the current state-of-the-art.
- Provide a rigorous security argument for why the data-dependent "zero-skipping" in your ZSE protocol does not leak information about the client's query index through timing side channels. A simple assertion that the pattern is static is insufficient (a toy timing model follows this list).
- What is the computational cost of the initial "bitwidth-aware sorting" required by ZSE? How does your system handle database updates (insertions/deletions), and what is the amortized performance cost of maintaining the sorted structure?
- What is the operating clock frequency of your FPGA design on the Kintex UltraScale+ KU15P? How does this compare to other published FPGA-based FHE accelerators?
- Please explain the significant performance deficit of SmartPIR compared to the simulated INSPIRE on the Uni dataset. What specific architectural or protocol-level limitations cause this gap?
- To properly evaluate the benefit of the SmartPIR protocol itself, please provide performance data for the baseline MulPIR protocol implemented on your CSD hardware, or provide a compelling argument for why this comparison is not necessary.
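To make the timing-side-channel question above concrete, the following is a toy model of the property a rigorous argument would need to establish; the constants and names are this reviewer's illustrative assumptions, not SmartPIR's implementation.

```python
# Toy timing model for the side-channel concern: total server time as a
# function of the zero-skipping pattern. Constants and names are illustrative.
T_SKIP = 1.0      # assumed cost of a skipped (all-zero) plaintext, arbitrary units
T_COMPUTE = 25.0  # assumed cost of a homomorphic op on a non-zero plaintext

def server_time(zero_pattern):
    """zero_pattern[i] is True iff plaintext column i is all-zero padding.
    The total depends only on this pattern; a sound argument must show the
    pattern is fixed by the database layout and never by the query index."""
    return sum(T_SKIP if is_zero else T_COMPUTE for is_zero in zero_pattern)

layout = [False, False, True, True, False, True]  # determined by DB contents only
print(server_time(layout))  # identical for every query index over this layout
```

If the authors can show that the skipping pattern, and therefore this sum, is invariant across query indices for a fixed database, the timing argument becomes straightforward; otherwise the claim of indistinguishable computation does not hold.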
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents SmartPIR, a full-stack Private Information Retrieval (PIR) system that tackles key performance and scalability bottlenecks in existing Fully Homomorphic Encryption (FHE)-based schemes. The authors correctly identify that as FHE computation is accelerated by hardware like GPUs, the system-level bottleneck shifts to I/O, as the entire database must be read from storage for every query. Furthermore, they highlight the significant computational overhead incurred when handling real-world, variable-length data, which requires padding records to a uniform maximum size.
The core contribution of SmartPIR is a holistic protocol and architecture co-design that moves the computation to the data using commercial Computational Storage Devices (CSDs). This is achieved through three key innovations:
1. An in-storage computing framework that offloads FHE operations to FPGAs embedded within CSDs, fundamentally eliminating the host-storage I/O bottleneck.
2. A novel "Zero-Skipping Encoding" (ZSE) protocol that intelligently handles variable-length data by segregating payload from padding, allowing the system to skip redundant computations on zero-padded data without compromising privacy.
3. A resource-efficient hardware design and load-aware scheduler optimized for the constrained environment of CSDs, enabling high throughput and near-linear scalability across an array of devices.
The authors demonstrate the efficacy of their system with an implementation on commercial SmartSSDs, showing speedups of 100-1000x over state-of-the-art CPU-based PIR systems.
Strengths
-
Excellent Problem Identification and Contextualization: The paper's primary strength lies in its clear-eyed view of the evolving landscape of PIR systems. The authors astutely recognize that the success of prior work in accelerating FHE computation has created a new, dominant bottleneck: data movement. The motivation presented in Section 3 (page 4), particularly with the compelling data in Figure 2, perfectly frames this "Amdahl's Law" moment for the field. This work represents a logical and necessary next step, shifting the focus from purely computational optimization to system-level, I/O-bound optimization.
-
A Powerful Synthesis of Ideas: This is not merely a paper about accelerating FHE on an FPGA. It is a compelling example of co-design, where insights from cryptography (PIR protocols), computer architecture (CSDs, FPGA design), and systems (scheduling, data layout) are synthesized into a single, cohesive solution. Marrying the CSD paradigm with a protocol explicitly designed to be hardware-friendly (ZSE) is the paper's most significant achievement.
-
Addressing a Critical Practical Challenge: The problem of variable-length data is a persistent thorn in the side of many cryptographic protocols, which often assume uniform data structures for simplicity. SmartPIR's ZSE scheme (Section 5.1, page 5) is an elegant solution that directly addresses this practical issue, yielding massive performance gains on realistic, skewed datasets. This significantly enhances the applicability of PIR to real-world use cases like text databases, logs, or user records.
-
Strong Empirical Evidence on Real Hardware: The evaluation is conducted on commercially available SmartSSDs, not a simulator. This lends immense credibility to the performance claims. The demonstrated near-linear scalability with an increasing number of CSDs (Figure 10, page 11) validates the architectural design and its promise for building large-scale, private systems. The comparison against strong baselines, including a high-end A100 GPU, effectively underscores the superiority of the in-storage computing approach for this I/O-bound problem.
Weaknesses
My criticisms are less about fundamental flaws and more about opportunities to further explore the implications of this excellent work.
-
Limited Discussion on Dynamic Data Management: The paper rightly positions itself as superior to LWE-based schemes like SimplePIR for dynamic datasets that require frequent updates (Section 8.5, page 12). However, the cost of updates within the SmartPIR framework itself is not fully explored. The ZSE scheme relies on sorting and partitioning the database based on entry length. A single record update, insertion, or deletion could potentially disrupt this layout, necessitating a costly data re-organization across the CSDs. A discussion of the amortization cost of such updates would provide a more complete picture of the system's performance in truly dynamic environments.
-
Generality of the Hardware Accelerator: The hardware engine is thoughtfully designed with three stages (PMAC, MAC, RAC) that are tightly coupled to the SmartPIR protocol's workflow (Figure 6, page 7). While this specialization is key to its efficiency, it raises questions about its flexibility. How much effort would be required to adapt this hardware to support a different FHE-based protocol (e.g., one with a different retrieval structure or higher multiplicative depth)? A brief discussion on the accelerator's programmability or adaptability would help contextualize it within the broader FHE acceleration landscape.
Questions to Address In Rebuttal
-
Could the authors elaborate on the operational cost of database updates? Specifically, if a new record is inserted or an existing one's length changes significantly, does this trigger a resort and re-partitioning of a large portion of the database shard? How is this re-organization managed, and what is its performance impact?
-
The cost and energy efficiency analysis in Section 8.4 (page 11) is very compelling. Given that CSDs are still a relatively niche technology compared to commodity SSDs and GPUs, could the authors comment on the trajectory of this technology? How does the anticipated maturation and commoditization of CSDs affect the long-term viability and accessibility of the SmartPIR approach?
-
The paper presents a significant leap forward for RLWE-based PIR. How do the authors see this work influencing the broader field of private computation? For example, could the principles of co-designing protocols for in-storage computation be applied to other FHE-heavy applications, such as private machine learning inference or genomic analysis?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents SmartPIR, a Private Information Retrieval (PIR) system designed to run on Computational Storage Devices (CSDs). The authors identify two primary bottlenecks in existing Fully Homomorphic Encryption (FHE)-based PIR systems: (1) I/O overhead, which becomes dominant when computation is accelerated, and (2) computational inefficiency when handling variable-length data due to mandatory padding.
To address this, the authors propose a co-design of a new protocol and an in-storage architecture. The core claims to novelty appear to be:
1. An in-storage computing framework that offloads FHE operations to an array of commercial CSDs, thereby eliminating the host-SSD I/O bottleneck.
2. A "Zero-Skipping Encoding" (ZSE) protocol that structurally separates payload data from padding, allowing the system to skip redundant computations on the padding.
3. A set of system and architectural optimizations, including a "folding retrieval" mechanism, a resource-efficient FPGA design, and a load-aware scheduler, to make the system scalable and efficient on resource-constrained CSDs.
The paper demonstrates substantial performance gains over CPU- and GPU-based implementations of state-of-the-art PIR schemes.
Strengths
The most significant novel contribution of this work is the Zero-Skipping Encoding (ZSE) protocol (Section 5.1, Page 5). The problem of handling variable-length data in PIR has historically been addressed by brute-force padding, which, as the authors correctly identify, leads to massive computational overhead. The ZSE scheme, with its "horizontal encoding" that segregates padding into distinct, all-zero plaintext vectors, is a genuinely new protocol-level idea for mitigating this issue. This allows for a data-aware computation flow where zero-vectors can be provably skipped without leaking information about the query, directly tackling a fundamental inefficiency in prior art.
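To make the encoding concrete, the following is a minimal sketch of one plausible reading of this horizontal scheme; the group size, chunking, and the comparison against an unsorted layout are this reviewer's assumptions for illustration, not the paper's implementation. It also shows why the length-aware sort matters for how many vectors become skippable.

```python
GROUP = 4  # assumed number of entry slots packed per plaintext vector

def encode(entries, sort_first):
    """Pack entries into groups of GROUP; each chunk position contributes one
    plaintext vector per group. Returns (total vectors, all-zero vectors that
    could be skipped). Illustrative only; not SmartPIR's actual layout."""
    if sort_first:
        entries = sorted(entries, key=len, reverse=True)  # length-aware sort
    max_len = max(len(e) for e in entries)
    total = skippable = 0
    for g in range(0, len(entries), GROUP):
        group = entries[g:g + GROUP]
        for j in range(max_len):
            total += 1
            if all(j >= len(e) for e in group):  # column holds only padding
                skippable += 1
    return total, skippable

entries = [[1] * n for n in (16, 2, 15, 1, 14, 2, 13, 1)]
print(encode(entries, sort_first=False))  # (32, 2): long and short entries mixed
print(encode(entries, sort_first=True))   # (32, 14): short entries cluster together
```

Even in this toy form, the dependence on the initial sort is visible, which is why the preprocessing and update costs raised elsewhere in these reviews deserve quantification.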
The second key strength is the practical realization of in-storage PIR on commercial hardware. While the conceptual groundwork for in-storage PIR acceleration was laid by prior work (e.g., INSPIRE [34]), that work was based on a simulated architecture. This paper presents a full-stack implementation on commercially available SmartSSDs. Moving a complex system like FHE-based PIR from simulation to a functioning hardware prototype is a non-trivial contribution that validates the real-world viability of the in-storage processing paradigm for this domain.
Weaknesses
My primary concern relates to the positioning of the work and the novelty of several constituent components.
-
The Foundational Idea of In-Storage PIR Is Not New: The central premise of offloading PIR computation to near-storage processors to alleviate the I/O bottleneck was previously proposed and explored in INSPIRE [34]. The authors cite INSPIRE as a baseline but do not sufficiently acknowledge that the core concept of an "in-storage private information retrieval" system is prior art. The novelty here lies in the implementation on real hardware and the co-designed protocol for variable-length data, not in the foundational idea of moving PIR computation to storage. The authors should frame their contribution more precisely as the first practical and variable-length-aware implementation of this concept.
-
Several "Novel" Optimizations Are Adaptations of Known Techniques: The paper presents several architectural and protocol-level optimizations that, while effective, are adaptations of existing ideas rather than fundamental innovations.
- Folding Retrieval (Section 5.2, Page 6): The authors explicitly state they "leverage the idea of folding [6, 7]". Their contribution is the application of this known query-size reduction technique to the specific 3D data structure produced by their ZSE scheme. This is an engineering adaptation, not a new retrieval concept.
- KeySwitch Hoisting (Section 6.1.2, Page 7): Deferring expensive operations like KeySwitch until after a batch of cheaper operations (like Mult and Add) have accumulated is a known optimization pattern in FHE accelerator design to reduce overhead. While its application here is sound, it does not represent a novel architectural principle.
- Load-Aware Scheduling (Section 6.2, Page 8): The use of a greedy heuristic to solve a knapsack-like load balancing problem is a standard systems design technique (a minimal sketch of such a heuristic follows below). The novelty is in the metric being balanced (the count of non-zero plaintexts, a direct result of ZSE), but the scheduling algorithm itself is not new.
The paper would be stronger if it more clearly delineated between genuinely new ideas (like ZSE) and the clever engineering that involved adapting existing techniques to their specific system.
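For concreteness, a minimal sketch of the kind of greedy placement heuristic the Load-Aware Scheduling bullet refers to is given below; the partition weights, device count, and function names are illustrative assumptions, not SmartPIR's scheduler.

```python
import heapq

def greedy_balance(nonzero_counts, num_csds):
    """Longest-processing-time greedy placement: assign each partition, weighted
    by its count of non-zero plaintexts, to the currently least-loaded CSD.
    Illustrative only; not the paper's implementation."""
    heap = [(0, csd) for csd in range(num_csds)]  # (current_load, csd_id)
    heapq.heapify(heap)
    placement = {}
    for part_id, weight in sorted(enumerate(nonzero_counts),
                                  key=lambda kv: kv[1], reverse=True):
        load, csd = heapq.heappop(heap)
        placement[part_id] = csd
        heapq.heappush(heap, (load + weight, csd))
    return placement

# Example: eight partitions with skewed non-zero plaintext counts, four CSDs.
print(greedy_balance([40, 35, 30, 12, 9, 7, 3, 1], num_csds=4))
```

The point stands: the algorithmic skeleton is textbook load balancing; the contribution is the ZSE-derived weight it balances.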
Questions to Address In Rebuttal
-
The concept of in-storage PIR was proposed in INSPIRE [34]. Please clarify the conceptual novelty of your work beyond the (admittedly significant) engineering effort of implementing it on real CSDs. Is the primary novelty the ZSE protocol's ability to efficiently handle variable-length data, a specific weakness that INSPIRE did not address?
-
The Zero-Skipping Encoding (ZSE) appears to be the main novel protocol contribution. Could the authors elaborate on any potential performance or security trade-offs introduced by this "horizontal encoding" approach? For instance, does spreading a single database entry across multiple plaintext vectors increase the baseline complexity for retrieving a single, fully-packed entry compared to traditional encoding schemes?
-
The paper presents several architectural techniques (e.g., KeySwitch Hoisting, modular resource reuse). Could the authors please contextualize these with respect to prior work in the FHE hardware acceleration community? Are there specific constraints of the CSD's FPGA that make the application of these known techniques non-trivial and thus a contribution in its own right?
Beyond Page Migration: Enhancing Tiered Memory Performance via Integrated Last-Level Cache Management and Page Migration
Abstract
Emerging memory interconnect technologies, such as Compute Express Link (CXL), enable scalable memory expansion by integrating heterogeneous memory components like local DRAM and CXL-attached DRAM. These tiered memory systems offer potential benefits in ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose TierTune, a framework that integrates LLC partitioning with page migration to manage tiered memory systems. The central thesis is that traditional migration-only policies are too slow to react to workload dynamics and that the metric used by prior work (LLC miss latency) is flawed. TierTune uses L1 miss latency to dynamically partition the LLC between near and far memory tiers, supposedly offering a rapid response to traffic imbalances. This is supplemented by a conservative page migration policy for persistent imbalances. The authors claim a 19.6% average performance improvement over a state-of-the-art policy (Memtis+Colloid) in a simulated environment.
While the motivation is clear, the work rests on a foundation of questionable methodological choices and unsubstantiated claims that undermine the validity of its conclusions.
Strengths
-
Strong Motivating Example: The analysis in Section 3.3 (page 5), particularly Figure 3, presents a compelling case against relying solely on LLC miss latency as a balancing metric. The demonstration that IPC can improve while LLC miss latency spikes due to prefetching is a valid and important insight that correctly challenges the premise of prior work like Colloid.
-
Conceptually Sound Hybrid Approach: The high-level idea of using a fast, low-overhead mechanism (LLC partitioning) for transient imbalances and a slower, heavyweight mechanism (page migration) for persistent ones is logical. Decoupling these two responses is a sensible architectural principle.
Weaknesses
-
Fundamentally Flawed Experimental Methodology: The study's conclusions are derived from a simulation model that is not representative of the systems it aims to improve. In Section 5.1 (page 8), the authors state they scale down the modeled 128-core system with 8 memory channels per node to a 16-core system with one memory channel per node. This is an invalid simplification. The entire premise of this paper is managing memory traffic and bandwidth contention in many-core systems. A single-channel memory subsystem presents a fundamentally different bottleneck and queuing behavior than an 8-channel one. Any conclusions drawn about balancing traffic between near and far memory are suspect, as the contention point in the simulation is an artificial single-channel bottleneck, not the multi-channel contention the paper claims to address. The authors' claim of "negligible differences" between the full-scale and scaled-down models is extraordinary and requires far more rigorous proof than is offered.
-
Unjustified Hardware Modifications: The core of TierTune relies on measuring per-node L1 miss latency. As described in Section 4.1 (page 6), this requires a "minor architectural modification by adding destination bits to the MSHR." This proposal is not a pure software or system policy; it is a hardware/software co-design. The cost, complexity, and feasibility of this hardware change are not analyzed. The claim that it is "minor" is unsubstantiated. This significantly weakens the paper's practical implications, as it cannot be implemented on existing commodity hardware.
-
Oversimplified Migration Model: The simulation models page migration with a fixed bandwidth of 1 GB/s (Section 5.1, page 8). In a real system, migration is not a magical background process with reserved bandwidth. It is executed by kernel threads on host cores, consuming CPU cycles and contending with the application for memory bandwidth. This simplified model artificially reduces the cost of migration, potentially skewing the comparison against migration-heavy baselines.
-
Lack of Parameter Sensitivity Analysis: The control logic for TierTune is critically dependent on an undefined threshold. Algorithm 1 (page 7) hinges on the condition L_near ≈ L_far (balanced). What defines "approximately equal"? Is it a 5% difference? 10%? The performance of the system, particularly the trade-off between cache partitioning and page migration, will be highly sensitive to this threshold, yet no analysis is provided (see the control-loop sketch after this list). Similarly, the decision to enforce a minimum of two LLC ways per partition is asserted without justification.
-
Unsupported Claims of Extensibility: Section 4.4 (page 7) makes broad claims about supporting multi-tenant and multi-node systems. The multi-node extension is described in a single, hand-wavy paragraph referencing "diffusion-based load balance" without any implementation details or evaluation. This amounts to speculation and should be removed from the paper if not properly evaluated.
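To make the threshold dependence flagged in the fourth weakness concrete, here is a minimal sketch of a per-interval control step; the tolerance, way counts, and hold/resume heuristic are this reviewer's assumptions, not TierTune's Algorithm 1.

```python
# Hypothetical per-interval control step: both the way reallocation and the
# hold/resume decision pivot on an unspecified tolerance EPSILON.
EPSILON = 0.10   # the undefined "approximately equal" threshold (assumed)
MIN_WAYS = 2     # minimum LLC ways per partition, asserted in the paper
TOTAL_WAYS = 16  # assumed LLC associativity

def control_step(l1_lat_near, l1_lat_far, ways_near):
    imbalance = (l1_lat_far - l1_lat_near) / max(l1_lat_near, 1e-9)
    if abs(imbalance) <= EPSILON:
        return ways_near, "hold"  # considered balanced: freeze page migration
    if imbalance > 0:
        # Far tier is slower: shrink the near partition to cache more far-tier data.
        ways_near = max(MIN_WAYS, ways_near - 1)
    else:
        ways_near = min(TOTAL_WAYS - MIN_WAYS, ways_near + 1)
    # Assumed heuristic: resume migration only for large, persistent imbalances.
    return ways_near, "resume" if abs(imbalance) > 2 * EPSILON else "hold"

print(control_step(l1_lat_near=80.0, l1_lat_far=130.0, ways_near=8))
```

Every branch above is gated by EPSILON (and the ad hoc 2×EPSILON resume condition), which is precisely why a sensitivity study over this parameter is required.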
Questions to Address In Rebuttal
-
Provide rigorous data to justify the claim that a 16-core, 1-channel system is representative of a 128-core, 8-channel system for bandwidth-contention-sensitive workloads. How can latency and bandwidth balancing results from such a drastically scaled-down model be considered valid?
-
The proposed L1 miss latency metric avoids the prefetching issue seen with LLC miss latency. However, what are its blind spots? Can you describe a workload scenario where L1 miss latency would be a misleading indicator of memory system pressure, and explain how TierTune would behave?
-
Justify the assertion that modifying MSHRs in every core is a "minor" hardware change. Please provide an area/power/complexity analysis, or cite processor design literature that supports the feasibility of such a modification for tracking memory destinations.
-
The entire control system relies on a threshold to determine if latencies are balanced. What is this threshold in your experiments, and how does the system's performance change as you vary it from, for example, 1% to 25%?
-
Please provide a quantitative evaluation for your proposed multi-node diffusion-based algorithm or remove the claim from Section 4.4. A textual description is insufficient for a conference paper.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper addresses the increasingly critical challenge of memory management in tiered memory systems, particularly those enabled by new interconnects like CXL. The authors argue that existing approaches, which rely primarily on hotness- or latency-based page migration, are fundamentally limited. These methods are either too slow to react to dynamic workload changes, or they inadvertently create new bottlenecks by concentrating traffic on near memory, failing to leverage the aggregate bandwidth of the system.
The core contribution is TierTune, a hybrid framework that synergistically integrates two control mechanisms operating at different timescales. For rapid, fine-grained adjustments, TierTune employs dynamic Last-Level Cache (LLC) partitioning, a hardware-level technique, to quickly balance memory traffic between near and far tiers. This fast path is guided by L1 miss latency, which the authors convincingly argue is a more accurate performance proxy than the commonly used LLC miss latency. For persistent, large-scale imbalances that exceed the corrective capacity of the LLC, TierTune uses a coordinated, selective page migration policy as a slower, coarse-grained mechanism. This two-level approach aims to provide both rapid responsiveness and long-term stability, improving performance while minimizing migration overhead.
Strengths
-
Elegant Problem Decomposition and a Compelling Core Idea: The paper's primary strength lies in its clear diagnosis of the problem and the elegance of its proposed solution. The motivation (Section 3) is exceptionally well-argued. By identifying the separate issues of slow convergence in migration-only systems (Insight #2, page 5) and the inadequacy of LLC miss latency as a metric (Insight #3, page 5), the authors build a powerful case for a new approach. The resulting two-timescale control system—using fast, lightweight cache management for tactical adjustments and heavyweight page migration for strategic ones—is an intuitive and powerful concept. It is an excellent example of a hardware/software co-design that leverages the distinct strengths of each layer.
-
Excellent Contextualization and Positioning: The work is well-situated within the current research landscape. It correctly identifies the evolution from simple hotness-based policies (e.g., TPP, Memtis) to more sophisticated latency-aware ones (e.g., Colloid) and clearly articulates the remaining gaps. The insight to repurpose a known technique—cache partitioning (like Intel CAT)—for a new purpose (dynamic traffic balancing in tiered memory) is clever and highly practical. This is not a "blue-sky" proposal but one grounded in existing hardware capabilities, which significantly increases its potential impact.
-
Significant and Well-Supported Performance Improvements: The experimental evaluation is thorough, using a robust simulation infrastructure and a diverse set of modern, memory-intensive workloads. The demonstrated 19.6% average performance improvement over a state-of-the-art baseline (Memtis+Colloid) is substantial. The analysis goes beyond simple performance numbers, effectively showing why TierTune works by breaking down bandwidth utilization (Figure 7, page 10) and migration counts (Figure 8, page 10). The dramatic reduction in page migrations is a key result, as it directly translates to lower system overhead and energy consumption.
Weaknesses
While the core idea is strong, the paper could be improved by addressing the following points, which seem less developed than the central thesis:
-
Underdeveloped Analysis of Second-Order Effects: The paper successfully argues that L1 miss latency is a better metric because it accounts for caching and prefetching. However, it does not explore the second-order interaction between its own mechanism (LLC partitioning) and hardware prefetchers. Could aggressively partitioning the LLC confuse stream or stride prefetchers that rely on observing access patterns within the LLC? It's possible that in some cases, the mechanisms could work against each other. A deeper analysis of this potential interaction would strengthen the paper.
-
Oversimplification of the Multi-Node Extension: The extension of TierTune to systems with more than two memory tiers (i.e., multiple CXL nodes) is discussed briefly in Section 4.4 (page 7). The proposed diffusion-based algorithm, which performs pairwise balancing between adjacent nodes, is a reasonable starting point. However, this model can suffer from slow convergence or instability in complex topologies. The brief treatment in the paper does not sufficiently address the complexities of global optimization versus local, greedy adjustments in a many-node system. This part feels more like a sketch of an idea than a fully vetted design.
-
Ambiguity in Monitoring Overhead: The paper asserts that the required hardware modifications for per-node L1 miss latency monitoring are "minor" and "lightweight" (Section 4.1, page 6). While this is plausible, the argument would be more convincing with a more concrete analysis. A brief discussion of the potential area, power, and complexity costs of implementing these per-core latency monitors and address comparators would add valuable credibility to the claim of practicality.
Questions to Address In Rebuttal
-
Regarding the interaction with hardware prefetchers: The paper astutely uses L1 miss latency to account for prefetching effects when measuring performance. However, could the LLC partitioning mechanism itself negatively interact with hardware prefetchers, for example, by reducing the cache capacity available to a given memory tier and thereby disrupting the patterns a prefetcher needs to see?
-
The proposed diffusion-based balancing for multi-node systems seems plausible but could face stability or convergence issues in systems with many tiers. Have the authors considered alternative global balancing strategies or analyzed the conditions under which the proposed pairwise approach might perform sub-optimally?
-
Could the authors elaborate on how the TierTune framework would adapt to heterogeneous far memory tiers? For instance, in a system with both a low-latency CXL-DRAM tier and a higher-latency CXL-attached Storage Class Memory (SCM) tier, would the core balancing algorithm remain effective, or would it require new heuristics to manage the more pronounced performance asymmetry?
-
The 50 ms interval for latency monitoring and cache partitioning adjustments was determined empirically. Could you discuss the sensitivity of the system to this parameter? How does performance degrade if the interval is too long (slow response) or too short (high overhead/instability)? This would help in understanding the tuning robustness of the system.
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The central thesis of this paper is the introduction of a two-level, coordinated control system named TierTune for managing data in tiered memory systems. The authors claim novelty in integrating a fast, fine-grained control mechanism (dynamic Last-Level Cache partitioning) with a conventional slow, coarse-grained mechanism (OS-level page migration). The fast path is designed to handle transient latency imbalances, while the slow path addresses persistent traffic skew. A key element of this proposed system is the use of L1 miss latency as the control metric, which the authors argue is superior to the LLC miss latency used in prior work.
After a thorough review of prior art, I find the core conceptual framework—the synergistic and coordinated co-design of dynamic LLC partitioning and page migration specifically for inter-tier traffic balancing—to be genuinely novel. While the constituent technologies (cache partitioning, page migration) are well-established, their integration into a hierarchical control system with a well-defined hand-off mechanism represents a new and significant contribution to the field of memory management.
Strengths
-
Novel Conceptual Synthesis: The primary strength of this work is its novel synthesis of two existing techniques. Prior art has extensively explored page migration for tiered memory (e.g., TPP [45], Memtis [37], Colloid [62]) and cache partitioning for QoS and inter-application performance isolation (e.g., PARTIES [12]). However, this paper is the first I am aware of to propose using dynamic LLC partitioning as a rapid-response mechanism to balance memory traffic for a single application across memory tiers. The concept of partitioning the LLC into "near" and "far" sections that are dynamically resized based on latency feedback is a fundamentally new approach to this problem.
-
Novel Insight into Control Metrics: The paper makes a compelling and novel argument for the inadequacy of LLC miss latency as a control metric for traffic balancing (Section 3.3, page 5). The identification that prefetching can obscure true performance by increasing LLC misses while improving IPC is a critical insight. Proposing L1 miss latency as a superior alternative that inherently captures the effects of the entire cache hierarchy is a strong, original contribution that corrects a deficiency in the most relevant prior work, Colloid [62].
-
Well-Defined and Minimalist Integration: The proposed hold/resume signaling mechanism (Algorithm 1, page 7) for coordinating the hardware cache allocator and the OS migration policy is an elegant and practical design. It formalizes the two-level control loop without introducing excessive complexity. The architectural modifications required are presented as minimal (modifying MSHRs and LLC way-masking), which increases the plausibility of adoption compared to proposals requiring extensive hardware changes.
Weaknesses
-
Limited Novelty of Constituent Parts: To be precise, the novelty here lies in the synthesis, not the components. The paper relies on commodity cache partitioning technology (Intel CAT) and standard OS page migration mechanisms. The innovation is the control algorithm and coordination strategy that orchestrates them. This is not a weakness per se, but the contribution should be understood as a new system design and algorithm rather than the invention of new underlying hardware primitives.
-
Simplicity of the Coordination Protocol: The hold/resume signal is a binary protocol. This may be insufficient for more complex scenarios where cache partitioning can alleviate, but not entirely solve, a latency imbalance. A more sophisticated protocol could, for instance, communicate the degree of remaining imbalance to the OS to better guide the magnitude or urgency of page migration, potentially avoiding oscillations or suboptimal performance in edge cases.
-
Superficial Treatment of Multi-Node Extension: The proposed extension to multi-node systems (Section 4.4, page 7) relies on a diffusion-based load balancing algorithm [43], a concept that has been known for decades. While its application here is new, the description is brief and lacks evaluation. This part of the design feels more like a conceptual sketch than a fully-fledged novel contribution and detracts from the rigor of the core two-tier proposal.
Questions to Address In Rebuttal
The authors should use the rebuttal to clarify the precise boundaries of their novel contributions.
-
On the Novelty of Integration: Please articulate the key difference between TierTune and a hypothetical system where an off-the-shelf QoS cache partitioner (like PARTIES) is simply run on top of a latency-balancing migration policy (like Colloid). What specific design choices in your integrated approach (e.g., the L1 miss latency metric, the per-tier partitioning goal, the hold/resume signal) enable it to succeed where such a loosely-coupled combination would fail?
-
On the Robustness of the hold/resume Signal: The binary hold/resume signal implies a crisp decision boundary. What happens when the system operates near this boundary, where LLC partitioning provides a partial but incomplete solution? Could this lead to oscillations between holding and resuming migration? Please provide some intuition on the stability of your control system.
-
On the Multi-Node Design: The proposed diffusion-based balancing for multi-node systems is a classical approach. Given that this mechanism was not evaluated, could you justify why this decades-old method is the right choice compared to more modern centralized or hierarchical resource management schemes for complex memory topologies? What new research challenges do you foresee in scaling your coordination mechanism beyond a simple pairwise diffusion?
Learning to Walk: Architecting Learned Virtual Memory Translation
Abstract
The rise in memory demands of emerging datacenter applications has placed virtual memory translation in the spotlight, exposing it as a significant performance bottleneck. To address this problem, this paper introduces Learned Virtual Memory (LVM), a page ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper presents Learned Virtual Memory (LVM), a page table architecture that replaces conventional data structures (radix trees, hash tables) with a learned index model. The central thesis is that application virtual address (VA) spaces exhibit sufficient regularity to be modeled by a hierarchy of simple linear functions, enabling single-access translation in the common case. The authors propose a complete system, including a cost model for building the index, mechanisms for handling dynamic updates, support for multiple page sizes, and hardware/OS implementation details.
However, the work's conclusions rest on a fragile foundation: the assumption of well-behaved, regular VA spaces. The evaluation, while broad, appears to lack rigorous stress testing against pathological but plausible workloads. The paper's claims of near-ideal performance are predicated on empirically-tuned parameters and a downplaying of the overheads associated with model maintenance, particularly in dynamic, fragmented scenarios that are common in real-world systems.
Strengths
- Practical Hardware Model: The proposal for a fixed-point arithmetic-based page walker is a credible and necessary step for moving learned indexes from software theory to hardware reality. The RTL synthesis results (Section 7.4) suggest the core walker component is feasible in terms of area and latency.
- Compact Index Representation: The resulting learned index sizes are impressively small (avg. 112 bytes for 4KB pages, Table 2), which directly supports the claim of high cacheability in the proposed LVM Walk Cache (LWC). This is a distinct advantage over prior learned index proposals that require megabytes of storage.
- Principled Design for Physical Fragmentation: The design explicitly acknowledges the reality of physical memory fragmentation (Figure 3) and addresses it by using multiple, small, non-contiguous gapped page tables (Section 4.2.2). This is a pragmatic design choice.
Weaknesses
- Insufficient Justification of VA Regularity: The entire premise of LVM hinges on the "significant regularity" of VA spaces, which is substantiated primarily by the "gap coverage" metric (Section 3.1, Figure 2). This metric is simplistic and potentially misleading. It only measures the prevalence of contiguous page allocations (VPN_next - VPN_current = 1). It completely fails to characterize the structure of the non-contiguous portions of the address space, which could be sparse, clustered, or follow other patterns that are hostile to a simple linear model (a toy illustration of this blind spot follows this list). The paper lacks any evaluation against a workload specifically designed to have a pathological VA layout (e.g., heavy use of sparse mmap with MAP_FIXED), which is essential for any fundamental change to virtual memory.
- Empirically-Tuned "Magic Numbers": The cost model, which is critical for constructing an efficient index, relies on tunable weights (x1, x2, x3) that are "empirically set" to 10, 5, and 200, respectively (Section 5.1). This is a significant methodological flaw. Without a sensitivity analysis, it is impossible to know whether these values are overfitted to the specific workloads evaluated. The performance of the system could degrade dramatically with different workloads, or even different phases of the same workload, if these parameters are not robust. The paper provides no theoretical or systematic justification for these values.
- Oversimplification of Dynamic Update Overheads: The paper hand-waves away the cost of model maintenance.
- The "rescaling" mechanism for out-of-bounds insertions (Section 4.3.4) is only efficient for contiguous growth at the edge of a range. It is not a general solution for arbitrary insertions.
- The paper states that a full model rebuild is the fallback, but claims this is "rare" and "quite small" (Section 4.3.4, Section 7.3). The evidence provided is an average of 2-3 retrains for the entire application run on their chosen benchmarks. This is not convincing. A workload with frequent, disjoint mmap/munmap calls could easily trigger a cascade of expensive rebuilds. The claimed 1.17% OS overhead seems entirely too low for a system that must actively manage, train, and potentially rebuild a complex data structure.
- Misleading "Ideal" Baseline: The paper repeatedly claims LVM is "within 1% of an ideal page table" (Abstract, Section 7.1). This "ideal" baseline is defined as a structure that always performs a single memory access for translation. This is a strawman. It's an unachievable hardware-only ideal that completely ignores the software overhead of building and maintaining the data structure that enables this single access. LVM has non-zero overhead for training, insertions, and rebuilds, which this baseline conveniently ignores. The comparison is therefore not on equal footing.
- Unsubstantiated Fragmentation Robustness: In Section 7.3, the authors evaluate performance under simulated memory fragmentation and make the strong claim that "LVM's performance remains the same." This is highly unlikely. While the system can function by creating more, smaller leaf tables, this must have second-order effects that are not analyzed. For instance, it increases the number of leaf nodes, which could increase the size and pressure on the upper levels of the learned index. It also scatters PTEs that would otherwise be contiguous, potentially harming the effectiveness of hardware prefetchers for page table data. This claim requires a much more nuanced evaluation.
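As referenced in the first weakness above, a toy version of the gap-coverage computation illustrates the metric's blind spot; the layouts below are fabricated examples, not workloads from the paper.

```python
# Illustrative only: "gap coverage" counts unit-stride neighbours, so two very
# different layouts can score identically while being structurally dissimilar.
def gap_coverage(vpns):
    """Fraction of consecutive allocated VPNs with VPN_next - VPN_current == 1."""
    vpns = sorted(vpns)
    deltas = [b - a for a, b in zip(vpns, vpns[1:])]
    return sum(1 for d in deltas if d == 1) / len(deltas)

layout_a = [0, 1, 2, 3, 8, 9, 10, 11]                                # small gap
layout_b = [0, 1, 2, 3, 7_340_032, 7_340_033, 7_340_034, 7_340_035]  # huge jump
print(gap_coverage(layout_a), gap_coverage(layout_b))  # both ~0.857
```

Both layouts report the same coverage even though their non-contiguous structure is completely different, which is exactly the information the metric discards.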
Questions to Address In Rebuttal
- Provide an evaluation against a workload specifically designed to be pathological for LVM's linear models. For example, a program that allocates memory in a sparse, pseudo-random pattern across its VA space. How does the index size, depth, collision rate, and OS management overhead change in this adversarial scenario?
- Provide a sensitivity analysis for the cost model parameters (x1, x2, x3) and the depth limit (d_limit). How does end-to-end performance vary as these parameters are changed by ±50%? This is necessary to demonstrate that the results are not an artifact of parameter overfitting.
- Clarify the precise conditions that trigger a full model rebuild versus a partial retraining or rescaling. Quantify the tail-latency impact of these rebuilds (e.g., P99, P99.9 latency) on a high-throughput workload like memcached, rather than just reporting the average time.
- Provide a detailed breakdown of the claimed 1.17% average OS overhead. How much of this is initial model training versus ongoing costs for insertions and rescaling? How does this overhead scale with the frequency of mmap/munmap operations?
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents Learned Virtual Memory (LVM), a novel architecture for virtual memory address translation that aims to replace traditional radix page tables. The core contribution is a complete, practical system design that adapts the concept of "learned index structures," originating from the database community, to the strict constraints of a hardware memory management unit (MMU).
The authors first establish a crucial premise: application virtual address spaces are highly regular and sequential (Section 3.1, Page 3), making them amenable to prediction. LVM leverages this regularity by building a hierarchical index of simple, hardware-friendly linear models (y = ax + b) that predict the physical location of a page table entry (PTE) for a given virtual page number (VPN).
This is not merely a theoretical application; the authors present a holistic architecture that addresses the key challenges that have previously made this approach impractical. This includes: a principled cost model to optimize the index structure, "gapped page tables" and a rescaling mechanism to support dynamic insertions efficiently, a fixed-point arithmetic implementation to eliminate the need for floating-point hardware in the page walk pipeline, and an elegant method of supporting multiple page sizes by representing them as different slopes within the same linear model. The proposed system is a hardware/OS co-design, evaluated through a Linux prototype, RTL synthesis, and full-system simulation, demonstrating significant performance gains that approach an idealized single-access page table.
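To ground the mechanisms summarized above, the following is a minimal illustrative sketch, with an assumed fixed-point precision and hypothetical helper names rather than the paper's exact model, of how a leaf-level linear model with fixed-point coefficients can predict a PTE slot, and how a different page size can be expressed simply as a different slope.

```python
FRAC_BITS = 16  # assumed fixed-point precision; not taken from the paper

def make_model(slope, intercept):
    """Encode y = slope * x + intercept with integer (fixed-point) arithmetic,
    so the hardware walker needs no floating-point unit."""
    a_fx = int(slope * (1 << FRAC_BITS))
    b_fx = int(intercept * (1 << FRAC_BITS))
    def predict(vpn):
        return (a_fx * vpn + b_fx) >> FRAC_BITS  # predicted PTE slot
    return predict

# A densely mapped 4 KiB region: one PTE per VPN, so the slope is ~1.
predict_4k = make_model(slope=1.0, intercept=-0x1000)
# A 2 MiB-page region covers 512 VPNs per entry, so the slope is ~1/512.
predict_2m = make_model(slope=1.0 / 512, intercept=0)

print(predict_4k(0x1003))   # -> 3: fourth PTE of the 4 KiB region
print(predict_2m(0x4A000))  # -> 592: entry index for a 2 MiB mapping
```

The appeal described in the summary is that a single structure and a single arithmetic pipeline serve both cases; only the learned coefficients differ.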
Strengths
The primary strength of this paper is its successful synthesis of ideas from different domains (databases, systems, and architecture) into a cohesive and compelling solution for a classic, high-impact problem.
-
High Potential Impact & Novelty: The work addresses the virtual memory translation bottleneck, a fundamental performance limiter in modern systems. For decades, the field has been dominated by the trade-offs between radix and hashed page tables. LVM offers a genuinely new path forward. By replacing a rigid, one-size-fits-all data structure with one that learns and adapts to the application's own memory layout, it represents a potential paradigm shift in MMU design.
-
Holistic and Practical System Design: This is the paper's most impressive quality. The authors don't just propose an idea; they meticulously engineer a full solution. They correctly identify why prior attempts or naive applications of learned indexes would fail—model size, floating-point math, handling dynamic updates, physical memory fragmentation—and present a specific, well-reasoned solution for each. The integration of a cost model (Section 4.2.3, Page 6), support for dynamic insertions (Section 4.3.4, Page 7), and the clever use of fixed-point arithmetic (Section 4.5, Page 8) elevates this from an academic curiosity to a plausible architectural proposal.
-
Strong Empirical Grounding: The design of LVM is not based on assumption, but on solid data. The analysis of virtual address space regularity across a wide range of workloads (Figure 2, Page 3) provides a strong justification for the entire approach. Similarly, their study of physical memory contiguity in Meta's datacenters (Figure 3, Page 4) directly motivates their design choice to use small, non-contiguous leaf page tables, grounding the work in real-world operational constraints.
-
Bridging Research Communities: This work forms a critical bridge between the systems software/database community, which pioneered learned indexes, and the computer architecture community. It effectively translates a powerful software concept into a viable hardware architecture, a process that often fails due to the immense gap in constraints and requirements. This paper could inspire a new wave of research applying lightweight, data-aware techniques to other hardware structures.
Weaknesses
My criticisms are not of the core idea, which is excellent, but rather of areas where the implications of this new paradigm could be explored more deeply.
-
Resilience to Pathological Cases: The system's efficacy is predicated on the observed regularity of virtual address spaces. While the authors state that their cost model bounds the index depth and ensures a minimum coverage-per-byte, the paper would be strengthened by an explicit stress test against a deliberately pathological workload—one with a highly fragmented and randomized virtual address space. Demonstrating that LVM "fails gracefully" and maintains a bounded worst-case performance (e.g., no worse than a radix walk) would significantly bolster confidence in its robustness.
-
OS Implementation Complexity: LVM is an OS/hardware co-design, and it appears to shift a significant amount of complexity into the operating system. The OS is now responsible for training models, running the cost model, and managing rescalings and rebuilds. While the performance overhead is shown to be low (Section 7.3, Page 12), the paper does not fully address the software engineering complexity. This new logic resides in one of the most critical and difficult-to-debug parts of the kernel. A discussion of the implementation challenges and the increase in the kernel's trusted computing base would be valuable.
-
Security Implications: The paper addresses Address Space Layout Randomization (ASLR) by having the OS expose the randomized base addresses to hardware (Section 5.2, Page 9). This is a critical point that feels somewhat glossed over. Introducing a predictive, data-driven model into the address translation path, a component that is fundamental to memory protection, warrants a more thorough security analysis. Could the predictive nature of the LVM walker, its timing variations based on model traversal, or the new OS-hardware interface introduce novel side-channels or vulnerabilities?
Questions to Address In Rebuttal
-
Could the authors elaborate on the behavior of LVM under a truly adversarial virtual address allocation pattern (e.g., one that maximizes fragmentation)? Specifically, how do the cost model's constraints (d_limit, coverage-per-byte) prevent a "performance cliff" and ensure a bounded worst-case lookup latency?
-
While the runtime overhead of the OS components is shown to be minimal, could the authors comment on the software engineering and maintenance complexity of integrating LVM's training and management logic into a production OS kernel like Linux?
-
The proposed method for handling ASLR involves new communication between the OS and the MMU. Have the authors considered the security implications of this design? Could the predictive nature of LVM and its variable-time lookups potentially leak information about the address space layout, thus undermining ASLR's protections?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The paper introduces Learned Virtual Memory (LVM), a new page table architecture that replaces conventional radix or hash-based structures with a learned index. The authors' central claim is that this design can achieve near-ideal, single-access address translation by learning the distribution of an application's virtual address space (VAS). The novelty of the work does not lie in the general idea of applying machine learning to address translation, which has been superficially explored before. Rather, the primary novel contribution is the holistic and practical co-design of a learned index specifically for the rigid constraints of hardware virtual memory. The authors propose several novel techniques to overcome the well-known barriers that have made prior learned index structures unsuitable for this domain, including a cost model to ensure a compact index, a mechanism to handle physical memory fragmentation, an efficient update strategy that avoids expensive retraining, and an elegant method for supporting multiple page sizes.
Strengths
The primary strength of this paper is its high degree of architectural and algorithmic novelty in adapting a concept from one domain (databases) to another (computer architecture) by solving a series of difficult, domain-specific challenges.
-
Significant Delta Over Prior Art: The most relevant prior art, Margaritov et al. [58], proposed using a neural network to predict translation locations but was acknowledged as impractical due to high computational latency (120 cycles) and a lack of support for insertions. LVM's contribution is a complete departure from this, architecting a system from the ground up for hardware viability. The use of simple linear models, a fast fixed-point hardware path, and robust mechanisms for dynamic updates represents a significant and novel advancement that transforms a theoretical idea into a plausible architecture.
-
Novel Solution to Memory Fragmentation: A key insight and novel contribution is the explicit handling of physical memory fragmentation. Existing learned indexes typically assume large, contiguous memory allocations for their data arrays, a condition known to be unrealistic in datacenter servers [95]. LVM’s design of using per-leaf-node gapped page tables (GPTs) that are small and can be allocated non-contiguously is a novel adaptation that directly addresses this critical implementation barrier (Section 4.2.2, page 6).
-
Novel Mechanism for Dynamic Updates: The "rescaling" mechanism for handling out-of-bounds insertions (Section 4.3.4, page 7) is a genuinely new technique in the context of learned indexes. It cleverly exploits the common pattern of VAS expansion at the edges (e.g., heap growth) to allow for insertions without retraining the model or re-inserting existing keys. This is a crucial piece of novelty that makes the system practical for dynamic workloads.
-
Elegant and Novel Multi-Page Size Support: The approach to supporting multiple page sizes (Section 4.4, page 8) is particularly innovative. Instead of using separate structures for each page size as in prior work like ECPT [77], LVM represents different page sizes as linear functions with varying slopes within a single index structure. This is a simple, elegant, and to my knowledge, entirely new idea that reduces complexity and overhead.
Weaknesses
My criticisms are not about a lack of novelty, but rather about fully exploring the boundaries and implications of the novel ideas presented.
-
Unclear Robustness at the "Novelty Cliff": The entire premise of LVM is predicated on the regularity of application virtual address spaces (Section 3.1). The paper demonstrates this holds for a range of common workloads. However, the most critical test for a novel data structure is its behavior under pathological or adversarial conditions. While the authors claim the cost model provides safeguards against index bloat (Section 4.2.3), the paper lacks a clear analysis of the performance cliff. If an application were to use an allocator that deliberately randomizes its VAS, how gracefully does LVM degrade? Does its performance fall below that of a simple radix table? This is a crucial aspect of understanding the limits of the proposed novelty.
-
Potential for Unforeseen System Interactions: The introduction of a learned, OS-managed component into the critical path of the hardware MMU creates a new, complex OS-hardware contract. The novelty of this interaction may introduce unforeseen failure modes. For instance, a bug in the OS's training logic or cost model could generate a suboptimal or incorrect index, severely degrading performance in a way that would not be possible with a deterministic radix structure. The paper does not sufficiently discuss the new reliability or security challenges this novel interface might create.
Questions to Address In Rebuttal
-
Could the authors elaborate on the performance of LVM under a deliberately fragmented or pseudo-random virtual address space allocation pattern? While the cost model provides safeguards (Section 4.2.3), what is the performance cliff, and how does it compare to the predictable worst-case of a radix walk?
-
The rescaling mechanism for out-of-bounds inserts (Section 4.3.4) is a key novel contribution. What are its limits? At what point does the error from the original linear model become too large for the expanded key range, forcing a more expensive retrain? Is there a risk of "accuracy debt" accumulating over many rescalings that eventually degrades prediction quality?
-
The tight coupling where the OS generates models for the hardware to execute is novel and powerful. Have the authors considered the security implications of this new contract? Could a compromised OS kernel feed the MMU walker malicious models that, for example, create exploitable timing side-channels by deliberately inducing collisions for specific address ranges, or create predictable physical address mappings that undermine ASLR-like defenses at a lower level?
Delegato: Locality-Aware Atomic Memory Operations on Chiplets
Abstract
The irruption of chiplet-based architectures has been a game changer, enabling higher transistor integration and core counts in a single socket. However, chiplets impose higher and non-uniform memory access (NUMA) latencies than monolithic integration. ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose "Delegato," a mechanism to improve the performance of Atomic Memory Operations (AMOs) in chiplet-based architectures. The proposal consists of two new "far AMO" types (delegated and migrating) to supplement existing near and centralized AMOs. These new types allow the directory to choose a more optimal execution location for an AMO. To guide this choice, the paper introduces a tracing mechanism to convey reuse information from private L2 caches to the directory, which feeds a simple predictor. The authors claim that Delegato improves performance by 1.07x over a centralized AMO baseline and 1.13x over the state-of-the-art predictor, DynAMO.
While the problem of AMO performance on NUMA systems is well-established, this paper's solution rests on a simplistic prediction heuristic, an evaluation that appears tuned to flatter the proposal, and an underestimation of the practical implementation complexity. The claims of superiority are not convincingly supported when the details are scrutinized.
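For concreteness, here is a toy sketch of how a directory-side predictor fed by a per-line reuse bit might choose among the proposed AMO types; the saturating-counter state, thresholds, and naming are this reviewer's assumptions, not Delegato's actual design.

```python
# Toy directory-side predictor: one 2-bit saturating counter per line, updated
# by the reuse bit traced from the owner's private cache. All details assumed.
from collections import defaultdict

class ToyDelegationPredictor:
    def __init__(self):
        self.counters = defaultdict(lambda: 1)  # start weakly "not reused"

    def update(self, line_addr, reuse_bit):
        """Saturating update using the reuse information reported to the directory."""
        c = self.counters[line_addr]
        self.counters[line_addr] = min(3, c + 1) if reuse_bit else max(0, c - 1)

    def choose(self, line_addr):
        """High counter: keep executing at the current owner (delegated AMO);
        low counter: move the line toward the requester (migrating AMO)."""
        return "delegated" if self.counters[line_addr] >= 2 else "migrating"

pred = ToyDelegationPredictor()
pred.update(0xBEEF00, reuse_bit=True)
print(pred.choose(0xBEEF00))  # "delegated" after one observed reuse at the owner
```

The weaknesses discussed below should be read against a mechanism of roughly this simplicity: a single bit per transaction is the only information the directory ever receives.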
Strengths
- The paper correctly identifies a relevant and challenging problem: the high latency of AMOs in chiplet systems due to expensive cross-chiplet communication.
- The exploration of alternative execution locations for far AMOs beyond a single, centralized point is a logical direction for investigation.
Weaknesses
-
Simplistic and Potentially Inaccurate Tracing Mechanism: The core of the predictor's intelligence relies on Delegato, which uses a single reuse_bit to signal usage from the private cache to the directory (Section 5.2, page 6). This is an exceptionally coarse heuristic. It only captures reuse that occurs between two consecutive delegate transactions. It cannot capture more complex temporal patterns, distinguish between frequent and infrequent reuse, or handle cases where a line is used for non-AMO purposes between AMOs. The entire premise of making accurate predictions rests on this fragile, low-information signal, which is a fundamental weakness.
-
Flawed and Self-Serving Baseline Comparison: The paper claims a 1.13x speedup over the "state-of-the-art" predictor, DynAMO [102], which is the authors' own prior work. This comparison is problematic. DynAMO is an L1-based predictor that decides if an AMO should be sent far, whereas Delegato is a directory-level predictor that decides where a far AMO should execute. These are not mutually exclusive. A rigorous comparison would have been to augment the baseline DynAMO to enable it to issue the newly proposed delegated/migrating AMOs, thereby isolating the performance contribution of the prediction logic itself. As presented, the comparison conflates the benefits of the new AMO types with the benefits of the predictor. Furthermore, the authors admit in Section 6.4 (page 10) that combining Delegato with an L1 predictor could fix performance degradation, which concedes that the presented comparison is incomplete.
-
Unconvincing and Potentially Biased Evaluation:
- The choice of a 50 ns cross-chiplet latency (Table 3, page 7) is extremely high for a modern interconnect and heavily penalizes any data movement, thereby creating an environment where a mechanism like Delegato is destined to show benefit. The sensitivity study in Section 6.6 (page 10) only considers a 2 ns latency, the opposite extreme. The lack of analysis for more moderate and realistic latencies (e.g., 10-20 ns) casts doubt on the robustness of the conclusions.
- The geomean results mask significant performance regressions. In Figure 9 (page 9), Delegato is shown to be slower than the baseline near AMOs for the FMM, BAR, and LFQ benchmarks. A proposed optimization that results in a slowdown for multiple applications cannot be considered a clear success. The paper fails to adequately diagnose or discuss the cause of these regressions.
- The CAS Counter microbenchmark (Figure 8, page 8) is a "best-case" scenario of pure, high-contention updates to a single address. While illustrative, its performance is not representative of real applications and overstates the benefits of the Pinned Owner policy.
-
Understated Implementation Complexity and Lack of Rigor:
- The paper proposes adding ALUs to L2 caches (Section 4.1, page 5) and new, complex transaction types (SnpAMO) to the coherence protocol. The verification and validation effort for such changes is monumental, yet it is not discussed. This is a critical omission for a hardware proposal.
- The suggestion to reuse the DataPull field in the AMBA CHI protocol for the reuse_bit (Section 6.8, page 11) is an ad-hoc solution that demonstrates a lack of implementation rigor. Such a modification would likely be non-compliant with the standard and create conflicts with features that legitimately use that field, like Stashing. A serious proposal must detail a compliant extension to the protocol.
Questions to Address In Rebuttal
-
Please provide a quantitative analysis to justify that a single reuse_bit is a sufficient and accurate signal for predicting AMO behavior. How does this compare to more established reuse prediction mechanisms (e.g., counters, RRIP)?
-
Can the authors justify the fairness of the comparison against DynAMO? Why was the baseline DynAMO not modified to take advantage of the new delegated and migrating primitives? Such a study would provide a true apples-to-apples comparison of the predictor's efficacy.
-
The performance results are highly sensitive to the 50 ns interconnect latency. Please provide a sensitivity analysis across a more realistic range of latencies (e.g., 10 ns, 20 ns, 30 ns) to demonstrate the robustness of Delegato's performance claims. Furthermore, please provide a detailed analysis of the performance degradation observed in FMM, BAR, and LFQ. What is the specific mechanism in Delegato that causes this harm?
-
Please elaborate on the verification and validation challenges for the proposed protocol changes. Furthermore, is your proposed reuse of the DataPull field compliant with the AMBA 5 CHI specification? If not, what would a compliant implementation entail?
-
The predictor's state machine (Figure 6b, page 7) migrates a cache line after only two consecutive requests from the same core. What is the rate of "ping-ponging" induced by this aggressive policy, where the line is migrated away from an active owner only to be immediately requested back? Please quantify the performance impact of these incorrect migrations.
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper addresses the growing performance challenge of Atomic Memory Operations (AMOs) in modern chiplet-based architectures. The authors identify the limitations of the traditional binary choice between 'near' (local execution, causing costly data movement) and 'far' (centralized execution, causing serialization) AMOs. The core contribution is the introduction of two new types of far AMOs—'delegated' and 'migrating'—which allow for remote execution without a single point of centralization, by sending the operation directly to the cache line's current owner or migrating ownership to the requester on demand.
They complement this expanded architectural capability with 'Delegato,' a hardware tracing and prediction mechanism. Delegato enables the directory to dynamically select the optimal AMO type (from the newly expanded set of centralized, delegated, or migrating) based on observed data access patterns. This fundamentally expands the design space for handling atomic operations in NUMA systems, moving beyond simple prediction to intelligent operational dispatch.
Strengths
-
Novel and Significant Conceptual Contribution: The paper's primary strength lies in its conceptual reframing of remote AMO execution. Instead of treating "far" AMOs as a monolithic, centralized action, the authors decompose the problem and propose a more flexible, decentralized approach. Proposing new coherence primitives (delegated, migrating) is a significant architectural contribution that moves beyond the state-of-the-art, which has largely focused on building better predictors for the two existing modalities (e.g., DynAMO [102]). This work correctly identifies that the palette of available operations was itself a limitation.
-
Excellent Problem Motivation and Contextualization: The motivation is exceptionally well-grounded in a pressing industry trend: the shift to chiplet architectures. The analysis in Section 1 and Figure 1 (p. 2) clearly demonstrates that existing solutions are insufficient as systems scale and NUMA factors become more pronounced. By showing cases where centralized AMOs actually regress in performance on a dual-chiplet system, the authors create a compelling narrative for why a new approach is necessary. This work fits perfectly at the intersection of cache coherence research and next-generation system design.
-
A Pragmatic Hardware-Software Bridge: This work can be seen as a hardware realization of principles previously explored in software. The concept of "delegation" echoes software-based techniques where data is partitioned and operations are routed to the thread that "owns" the data to avoid costly synchronization (as noted in Section 2.2, p. 3). Delegato provides a transparent, hardware-managed mechanism to achieve a similar outcome without burdening the programmer, which is a powerful and highly desirable direction for architectural innovation.
-
Thorough and Convincing Evaluation: The evaluation is comprehensive. The authors not only propose the new primitives but also explore the design space of static policies (Section 4.2, p. 5) before introducing their dynamic predictor. The benchmark suite is well-chosen, including classic parallel kernels, graph analytics, and, importantly, modern lock-free data structures, which are notoriously sensitive to AMO performance. The comparison against both a baseline and a state-of-the-art predictor (DynAMO) provides strong evidence of the proposal's efficacy.
Weaknesses
-
Inherent Implementation Complexity: The primary weakness, inherent to the proposal's strength, is its complexity. Introducing new coherence transactions (SnpAMO), requiring ALUs at the L2 cache level for delegation (as discussed in Section 6.8, p. 11), and adding two new predictor tables represents a non-trivial increase in design and verification effort for a CPU core and its uncore components. While the authors perform a reasonable area analysis, the cost of validating such a fundamental change to the coherence protocol is significant and could be a barrier to adoption.
-
Potential for Negative Interactions with Other Optimizations: The paper evaluates Delegato in a relatively clean environment. However, its interaction with other advanced coherence and prefetching mechanisms is not explored. For instance, how would Delegato's decisions (which implicitly influence cache line placement) interact with a sophisticated, learning-based data prefetcher that is also trying to manage where data should reside? There is a risk of the two mechanisms working at cross-purposes, leading to performance oscillations or degradation that is not captured in this study.
-
Simplicity of the Predictor: While the proposed Delegato predictor is effective and simple (which is a strength), its state machine (Figure 6b, p. 7) is based on relatively simple heuristics (e.g., two consecutive requests from the same core trigger a migration). This leaves open the question of whether more sophisticated prediction techniques (e.g., incorporating stream/stride information, or machine learning-based predictors) could yield further gains, or if the problem space is such that these simple heuristics capture the majority of the available benefit.
Questions to Address In Rebuttal
-
On Complexity and Verification: Could the authors comment on the anticipated verification challenges of introducing new snoop and request types like SnpAMO into a complex, industry-standard coherence protocol like AMBA CHI? Is there a simplified path to implementation, perhaps by overloading existing message fields or transaction types, that could reduce this burden while still capturing some of the benefits?
-
On Software vs. Hardware Delegation: The paper rightly positions itself as a hardware alternative to software delegation (Section 2.2). Could you elaborate on the scenarios where you believe a hardware-only approach like Delegato provides a decisive advantage over programmer-driven directives (e.g., __builtin_prefetch-style hints for AMOs) or compiler analysis that could achieve a similar effect by pinning data to a specific NUMA node and routing AMOs there?
-
On the Tracing Mechanism's Richness: The Delegato tracing mechanism relies on piggybacking a single 'reuse_bit' on SnpResp messages. This is elegantly lightweight. Have the authors considered the potential benefits of a richer feedback channel? For example, would providing a reuse counter or other metadata from the private cache to the directory allow for more nuanced predictions (e.g., distinguishing between a line that was used once vs. ten times), and what would the associated network and storage overheads be? A toy contrast of the two feedback granularities is sketched after this list.
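As referenced in the last question, here is a toy contrast of the two feedback granularities; the field widths and function names are invented for illustration only.

```python
# Toy comparison of 1-bit reuse feedback vs. a small saturating counter,
# hypothetically piggybacked on snoop responses. Field widths are made up.

def one_bit_feedback(reuse_events: int) -> int:
    """What a single reuse bit can convey: 'used at all' vs. 'not used'."""
    return 1 if reuse_events > 0 else 0

def counter_feedback(reuse_events: int, bits: int = 2) -> int:
    """A small saturating counter: distinguishes light from heavy reuse."""
    return min(reuse_events, (1 << bits) - 1)

for events in (0, 1, 3, 10):
    print(f"{events:2d} reuse events -> bit={one_bit_feedback(events)}, "
          f"2-bit counter={counter_feedback(events)}")
# The extra resolution costs extra response-message wires and wider predictor
# entries, which is exactly the overhead trade-off the question raises.
```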
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper addresses the performance challenges of Atomic Memory Operations (AMOs) in chiplet-based architectures, which suffer from high inter-chiplet communication latencies. The authors propose two new types of "far" AMOs: "delegated" and "migrating". Delegated AMOs forward the operation to the current owner of the cache line for local execution. Migrating AMOs transfer ownership to the requester, similar to a near AMO, but are initiated by the directory.
The central novel claim appears to be a tracing mechanism called "Delegato," which leverages the delegated AMO message path to piggyback a "reuse bit" from the owner back to the directory. This bit informs a directory-side predictor about the owner's local use of the line, allowing for more intelligent decisions about whether to keep the line with the current owner (delegate), transfer it to the requester (migrate), or handle it centrally.
While the packaging and evaluation are comprehensive, the novelty of the core primitives is questionable. The "migrating" AMO is functionally a relabeling of existing ownership transfer mechanisms under a new policy. The "delegated" AMO is a specific implementation of request forwarding, a known concept. The most significant novel contribution is the "Delegato" tracing mechanism, which provides an elegant, low-cost method for conveying reuse information back to the directory. The work's overall contribution is an evolutionary, not revolutionary, step in coherence design.
Strengths
-
The "Delegato" Tracing Mechanism: The core innovation of this paper lies in the design of Delegato (Section 5.2, page 6). The problem of an "information outage" at the directory is well-established in coherence prediction literature. The proposed solution—using the
SnpRespmessage of a delegated AMO to carry a single bit of reuse information—is a clever and low-overhead mechanism. It elegantly couples the proposed delegated AMO primitive with the predictor's need for feedback, creating a "heartbeat" to confirm the utility of the line's current placement. This specific mechanism for feedback appears to be new. -
Formalization of an Owner-Executed AMO: While the concept of forwarding requests is not new, the paper does a good job of formalizing the "Delegated AMO" within a modern, complex coherence protocol (AMBA CHI). Defining the specific transaction flows (SnpAMO) and state transitions (Figure 4, page 4) required to implement this is a valuable, albeit implementation-focused, contribution.
Weaknesses
-
Overstated Novelty of AMO Primitives: The paper presents "delegated" and "migrating" AMOs as two "new types" of far AMOs. This claim requires significant qualification.
- Migrating AMOs: As described, a migrating AMO is simply a directory policy that chooses to resolve a far AMO request by initiating an ownership transfer to the requester. The underlying mechanism—transferring a cache line to a new owner—is fundamental to all modern coherence protocols. Calling this a "new type" of AMO is misleading; it is a new policy for when to trigger an existing action. The state-of-the-art predictor, DynAMO [102], already makes a similar decision (near vs. far), albeit at the L1 cache. This is a shift in where the decision is made, not the introduction of a new primitive.
- Delegated AMOs: The concept of forwarding a request to the node holding the most up-to-date data (the owner) is a well-known pattern in distributed systems and cache coherence. For example, the idea of data forwarding (e.g., Koufaty and Torrellas, 1998, [63]) and producer-initiated communication (e.g., Goodman et al., 1989, [40]) involves sending data directly between caches. The delta here is forwarding the operation instead of just the data. While this specific formalization as an AMO primitive is notable, the conceptual foundation is not entirely new, and the paper should position its contribution more precisely against this backdrop.
-
Complexity vs. Benefit Trade-off: The proposed solution introduces non-trivial complexity. Specifically, it requires placing ALUs in the L2 caches to handle delegated AMOs (mentioned in Section 4.2, page 5) and adds new message types to the coherence protocol. The evaluation shows that Delegato achieves a 1.07x speedup over centralized AMOs and a 1.13x speedup over the state-of-the-art DynAMO predictor (Figure 9, page 9). While positive, these gains are modest. An innovator must ask if adding hardware ALUs to another level of the cache hierarchy and extending the coherence protocol is justified for an average 7-13% performance improvement. The case for this trade-off is not overwhelmingly strong.
Questions to Address In Rebuttal
-
Regarding "Migrating AMOs," please clarify why this should be considered a novel primitive rather than a new directory-level policy. Functionally, how does the resulting ownership transfer differ from the one initiated by a conventional
ReadUniquerequest in a near AMO transaction, other than the point of initiation? -
Regarding "Delegated AMOs," can the authors please elaborate on the delta between this proposal and prior art in request forwarding within coherence protocols? Please situate this contribution in the context of owner-driven resolution and remote invocation concepts from the broader literature.
-
The proposed design necessitates ALUs at the L2 cache, adding area and complexity. Given the reported average speedups of 1.07x–1.13x, please provide a stronger argument for why this architectural modification is more compelling than a less invasive, purely policy-based optimization at the directory that does not require new hardware at the L2.
Re-architecting End-host Networking with CXL: Coherence, Memory, and Offloading
Abstract
The traditional Network Interface Controller (NIC) suffers from the inherent inefficiency of the PCIe interconnect with two key limitations. First, since it allows the NIC to transfer packets to the host CPU memory only through DMA, it incurs high latency,...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The paper presents CXL-NIC, a Network Interface Controller architecture built on Compute Express Link (CXL) to address the performance limitations of traditional PCIe-based NICs. The authors propose two designs: a Type-1 CXL-NIC using CXL.cache to replace DMA/MMIO operations, and a Type-2 CXL-NIC that adds coherent on-device memory via CXL.mem. The central thesis is that by leveraging CXL's coherence and memory semantics, significant reductions in packet and application processing latency can be achieved. The designs are prototyped on an FPGA and evaluated against a commercial PCIe-based SmartNIC.
While the premise of using CXL to improve NIC performance is sound, the work is undermined by a significant methodological flaw: the experimental evaluation compares an FPGA-based CXL prototype against a commercial, ASIC-based PCIe SmartNIC. This comparison introduces numerous confounding variables, making it impossible to attribute the observed performance differences solely to the CXL interconnect. Consequently, the paper's primary claims of latency reduction are not adequately substantiated by the provided evidence.
Strengths
- Detailed Protocol-Level Optimizations: The paper demonstrates a strong command of the CXL.cache protocol. Section 4.3 provides a granular analysis of how different request types (e.g., CS-read for prefetching, CO-read for polling, NC-write for packet transfers) can be strategically employed to optimize different stages of the networking datapath; a compact summary of this mapping is sketched after this list. This exploration is valuable for the community.
- Insightful Analysis of On-Device Memory: The evaluation of packet buffer placement for the Type-2 device (Section 7.2, Figure 13) yields an important, if negative, result: naively placing packet buffers in NIC memory degrades performance due to remote access latency and coherence overhead. This is a crucial finding that cautions against simplistic architectural assumptions.
- Coherent Problem Formulation: The paper correctly identifies the fundamental bottlenecks in the PCIe-based CPU-NIC datapath (Section 3.1) and logically proposes CXL as a potential solution. The motivation is clear and well-grounded in the limitations of existing interconnects.
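As referenced above, a compact restatement of the request-type mapping the paper describes. The dictionary form and comments are this reviewer's shorthand, not an excerpt from the paper.

```python
# Reviewer's shorthand for the mapping described in Sections 4.3 and 5.3.
CXL_CACHE_USAGE = {
    "descriptor prefetching": "CS-read",
    "tail-pointer polling":   "CO-read",
    "packet transfers":       "NC-write",
    "push into host LLC":     "NC-P",   # non-cacheable push, DDIO-like delivery
}

for stage, request in CXL_CACHE_USAGE.items():
    print(f"{stage:24s} -> {request}")
```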
Weaknesses
- Fundamentally Unsound Experimental Comparison: The paper's primary claims rest on a comparison between two vastly different hardware platforms: an Intel Agilex-7 FPGA running at 400 MHz and an NVIDIA BlueField-3 (BF-3) ASIC SmartNIC with ARM cores running at 2.5GHz. This is not a valid apples-to-apples comparison. The observed latency differences could stem from a multitude of factors unrelated to the CXL protocol itself, including:
- The internal architecture of the NICs (FPGA logic vs. hardened ASIC blocks).
- The performance of the on-board memory controllers and subsystems.
- The compute capabilities used for evaluation (FPGA logic vs. ARM cores for the KVS workload). The authors explicitly state they could not create a PCIe baseline on the same FPGA (Section 6, page 9), which confirms this is a critical, unaddressed confounder that invalidates the main quantitative claims.
- Conflation of Protocol Benefits and Implementation Limitations: The throughput evaluation (Section 7.2, Figures 11 and 12) is bound by the 400 MHz clock frequency of the FPGA and the single CXL request per cycle limit of the IP. The results demonstrate the efficiency of their design relative to their implementation's theoretical peak, not the absolute throughput capability of a CXL-based NIC architecture. The claims should be heavily qualified to reflect that these are artifacts of a slow prototype, not fundamental characteristics of the CXL approach.
- Obscured Absolute Performance: The majority of the key results are presented as normalized figures relative to the BF-3 baseline (e.g., Figures 10, 13, 14, 15). This prevents a clear assessment of the absolute performance of the CXL-NIC. A 49% reduction in tail latency (Section 7.1, page 10) is meaningless without knowing the absolute baseline latency. The proposed CXL-NIC could still be substantially slower than the commercial ASIC in absolute terms.
- Weak Comparison to CC-NIC: The comparison against the state-of-the-art CC-NIC (Figure 14) is based on an emulation of the UPI protocol using CXL.cache primitives (CS-read and CO-write). This is a questionable methodology. UPI and CXL have different underlying coherence semantics, link-layer protocols, and performance characteristics. The authors provide no evidence that this emulation is a faithful or accurate representation of a real UPI-based NIC, rendering the 37% claimed improvement unreliable.
- Uncontrolled Variables in Application Study: The KVS application evaluation (Section 5.4 and Figure 15) compares a handler implemented in FPGA logic on the CXL-NIC against a software handler running on the BF-3's ARM cores. The 39% tail latency reduction cannot be uniquely attributed to CXL. It is highly likely that a hardware-accelerated FSM on an FPGA is simply faster at this specific task than general-purpose ARM cores running software, irrespective of the host interconnect. The experiment fails to isolate the variable of interest.
Questions to Address In Rebuttal
- The central weakness of this paper is the comparison between a 400 MHz FPGA prototype and a commercial ASIC SmartNIC. How can the authors justify that the observed performance differences (e.g., in Figures 10 and 15) are due to the CXL vs. PCIe interconnect, and not the vast architectural, clock speed, and compute substrate differences between the two devices?
- Please provide absolute latency numbers (in microseconds or nanoseconds) for the key results presented in Figures 10, 14, and 15. This is necessary to evaluate whether the proposed CXL-NIC is performant in an absolute sense, not just relative to a potentially mismatched baseline.
- Please justify the methodology for emulating a UPI-based CC-NIC using CXL.cache primitives (Section 7.2, page 12). How does this emulation account for the architectural and protocol-level differences between UPI and CXL, and why should this be considered a valid point of comparison?
- In Section 7.2, the paper presents a "latency-optimal configuration." How was the vast design space of data placement and request type combinations (per Figure 8) explored to rigorously justify this claim of optimality? What evidence supports that this specific configuration is globally optimal, rather than just the best among a small, tested subset?
- For the KVS evaluation (Figure 15), how have the authors isolated the performance impact of the CXL interconnect from the performance impact of implementing the KVS handler in FPGA hardware logic versus software on ARM cores? Without this isolation, the claim that CXL is the source of the benefit is unsupported.
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents a comprehensive re-architecting of the end-host network interface controller (NIC) using the emerging Compute Express Link (CXL) standard. The authors identify the long-standing performance bottlenecks of the traditional PCIe interconnect—namely high latency for small packet operations due to DMA/MMIO overhead and the lack of hardware cache coherence between the CPU and the NIC.
The core contribution is the design and evaluation of "CXL-NIC," a novel NIC architecture that systematically replaces legacy PCIe mechanisms with CXL's more efficient, cache-coherent protocols. The work is presented in two stages: first, a Type-1 CXL-NIC that leverages CXL.cache to create a low-latency, coherent datapath for control and data between the host CPU and the NIC. Second, this design is extended to a Type-2 CXL-NIC, which introduces coherent on-NIC memory via CXL.mem, enabling flexible data placement and new opportunities for near-data processing. The authors demonstrate the power of this new architecture with a compelling networking-application co-acceleration case study of a Key-Value Store (KVS). An FPGA-based prototype shows significant latency reductions compared to both a commodity PCIe SmartNIC and a prior academic coherent NIC design.
Strengths
-
Excellent Problem-Solution Fit and Timeliness: The work tackles a classic and persistent problem in systems architecture: the host I/O bottleneck. The application of CXL to this domain is not just novel but exceptionally well-suited. While much of the early CXL discourse has focused on memory expansion (CXL.mem), this paper provides one of the first in-depth, systems-level explorations of CXL.cache for a "killer application." It moves the conversation from characterizing a new interconnect to architecting a new class of device with it.
-
Systematic and Principled Design Exploration: The paper's strength lies in its methodical approach. The progression from a Type-1 device (focusing on coherence and datapath) to a Type-2 device (adding memory and offload) is logical and allows for a clear separation of concerns. The detailed analysis in Section 4.3 on leveraging specific CXL cache line state control requests (e.g., CS-read for prefetching, CO-read for polling, NC-write for packet data) is particularly insightful. It demonstrates a deep understanding of the protocol and moves beyond a simple "replace DMA with CXL" narrative to a nuanced, optimized design.
-
Strong Grounding in the Literature and Context: This work is well-positioned within the broader landscape of high-performance networking and interconnect research. It implicitly builds on the legacy of projects that sought tighter CPU-I/O integration (e.g., DDIO, on-chip NICs) but demonstrates a more practical path forward using an open standard. The direct comparison to CC-NIC (Section 7.2, page 12, Figure 14), which used a proprietary interconnect (UPI), is crucial. It effectively argues that CXL provides a standardized way to achieve the benefits of cache coherence that were previously confined to specialized, proprietary systems.
-
Compelling Application-Level Demonstration: The KVS co-acceleration use case (Section 5.4, page 9) is a powerful demonstration of the architecture's potential. By hosting "hot" data in the NIC's coherent memory and using CXL.cache to handle misses by fetching from host memory, the authors showcase a seamless, hardware-managed tiered memory system spanning the host and the device. This is a glimpse into the future of tightly integrated heterogeneous computing and elevates the paper beyond just being a networking study.
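A minimal software caricature of the tiered lookup pattern this use case implies is shown below; all names are hypothetical and the real datapath is a hardware pipeline, but the control flow conveys why the coherent host-memory fallback matters.

```python
# Toy model of the tiered KVS lookup the review describes: hot entries live in
# NIC-attached coherent memory; misses fall back to host memory over CXL.cache.
# All names are hypothetical; the actual design is a hardware pipeline.

nic_memory = {"hot_key": "cached_value"}                             # on-NIC tier
host_memory = {"hot_key": "cached_value", "cold_key": "far_value"}   # host DRAM tier

stats = {"nic_hits": 0, "host_fetches": 0}

def kvs_get(key):
    if key in nic_memory:              # near lookup, no host round trip
        stats["nic_hits"] += 1
        return nic_memory[key]
    stats["host_fetches"] += 1         # miss: coherent fetch from host memory
    value = host_memory[key]
    nic_memory[key] = value            # optionally promote to the hot tier
    return value

print(kvs_get("hot_key"), kvs_get("cold_key"), stats)
```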
Weaknesses
As a contextual analyst, I view these less as flaws and more as areas ripe for future discussion and exploration.
-
The Inevitable FPGA vs. ASIC Question: The evaluation is commendably performed on real hardware, which is a significant strength. However, the performance is necessarily limited by the FPGA's clock frequency and the maturity of the CXL IP. While the relative gains are clear, the absolute performance numbers may not fully represent the potential of an ASIC implementation. The discussion could benefit from some thoughtful speculation on how these architectural benefits would scale at higher line rates (400G+) and with the lower latencies of an ASIC design.
-
Software and Programmability Implications: The paper proposes a "CXL-NIC DPDK" framework, which mirrors the familiar DPDK model. This is a practical choice for evaluation. However, the paradigm shift from explicit DMA management to an implicitly coherent, NUMA-like memory model is profound. Does this genuinely simplify the programming model for application developers in the long run? A deeper discussion on the software abstractions needed to manage this new hardware—particularly around NUMA-awareness, data placement policies, and debugging—would add significant value.
-
Limited Scope of Application Co-Design: The KVS example is excellent but stands alone. The true power of this architecture lies in its generality. A brief discussion on other application domains that could be similarly transformed (e.g., distributed databases, HPC communication libraries like MPI, AI/ML inference serving) would help to broaden the paper's perceived impact.
Questions to Address In Rebuttal
-
The discussion in Section 8 regarding the lack of atomic operations support in CXL 1.1 is critical for multi-queue, multi-threaded scenarios. Could the authors elaborate on the practical scalability limitations this imposes on the current prototype? Are there software-based workarounds (e.g., delegating synchronization points to a single hardware engine) that could mitigate this until CXL 2.0 atomics are widely available?
-
The intelligent use of the NC-P (non-cacheable push) operation (Section 5.3, page 8) to inject data into the host LLC is a fascinating parallel to Intel's DDIO. The proposed "adaptive push-write gating" and "post-push write-back" mechanisms seem to address the classic "LLC pollution" problem. Could you comment on the complexity of tuning these mechanisms? How sensitive are they to workload characteristics, and do you envision this being managed by the driver or a higher-level runtime system? A toy version of the kind of feedback control being asked about is sketched after this list.
-
Your latency-optimal configuration (Section 7.2, page 11) involves a hybrid memory layout (some structures on the NIC, some on the host). This co-optimization is a key result. How did you arrive at this specific configuration? Does this suggest that a "one-size-fits-all" data placement strategy is suboptimal, and that future systems will require runtime profiling and dynamic data migration between host and NIC memory to achieve the best performance?
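As referenced in question 2, a toy version of the feedback control being asked about follows. The threshold, the monitored signal, and the fallback choice are invented for illustration and may not match the paper's gating or write-back logic.

```python
# Toy feedback controller for gating NC-P pushes into the host LLC.
# The "llc_miss_rate" signal, threshold, and policy are invented for illustration.

class PushGate:
    def __init__(self, miss_threshold: float = 0.15):
        self.miss_threshold = miss_threshold
        self.push_enabled = True

    def update(self, llc_miss_rate: float) -> None:
        # If pushes start evicting useful lines (miss rate climbs), fall back to
        # plain NC-write into host DRAM; re-enable pushing once pressure subsides.
        self.push_enabled = llc_miss_rate < self.miss_threshold

    def choose_write(self) -> str:
        return "NC-P (push to LLC)" if self.push_enabled else "NC-write (to DRAM)"

gate = PushGate()
for miss_rate in (0.05, 0.30, 0.10):
    gate.update(miss_rate)
    print(f"miss rate {miss_rate:.2f} -> {gate.choose_write()}")
```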
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents CXL-NIC, a Network Interface Controller architecture built on the Compute Express Link (CXL) standard. The authors aim to overcome the well-documented limitations of PCIe for low-latency networking by leveraging CXL's coherence and memory semantics. The work proposes a Type-1 CXL-NIC using CXL.cache to replace inefficient DMA/MMIO operations, and extends this to a Type-2 design that utilizes CXL.mem for coherent on-NIC memory, enabling flexible data placement and application offloading. The core claims of novelty appear to be: (1) a set of datapath optimizations using specific CXL.cache line state controls (e.g., CO-read for polling, CS-read for prefetching), and (2) an intelligent mechanism for pushing data into the host LLC using NC-P with feedback control.
While the paper presents a timely and well-executed systems study on real hardware, its claims of conceptual novelty are overstated. Many of the core architectural ideas are adaptations of patterns previously established in the literature on proprietary coherent interconnects. The primary contribution of this work is not the invention of new coherent networking patterns, but rather the mapping of these known patterns onto the CXL standard's specific primitives and the first comprehensive performance evaluation of such a system on a real CXL-enabled FPGA platform.
Strengths
-
First-of-a-Kind Implementation: This work appears to be one of the first, if not the first, to design, implement, and evaluate a full-featured NIC on a real CXL hardware platform (Intel Agilex-7). Moving from simulation to a real-world prototype is a significant and valuable contribution to the community.
-
Thorough Exploration of CXL Primitives: The paper does an excellent job dissecting the CXL.cache protocol and mapping its specific request types (NC-write, CS-read, CO-read, NC-P) to different networking operations (packet transfer, descriptor prefetch, tail pointer polling). This detailed analysis in Sections 4.3 and 5.3 is valuable for future work in this area.
-
Demonstration of CXL's Potential: The experimental results effectively demonstrate the performance benefits of using a coherent interconnect like CXL over traditional PCIe, providing concrete data to support the ongoing industry shift.
Weaknesses
The central weakness of this paper is the limited conceptual novelty of its core architectural ideas. A search of prior art reveals significant conceptual overlap, primarily with work on NICs attached via other coherent fabrics.
-
Coherent Polling is Not New: The "event-driven Tx datapath" described in Section 4.3, where the NIC uses a CO-read request to poll a tail pointer in host memory, is the central optimization for the Tx path. This mechanism is functionally identical to the "inline signaling" technique proposed and implemented in CC-NIC [48]. In CC-NIC, the NIC also polls a descriptor location in host memory using a coherent read, leveraging the UPI protocol to remain quiet until the host CPU writes to that location. The authors of this paper even state they "adopt the inline signaling technique from CC-NIC [48]". This is therefore an adaptation of a known technique to a new protocol, not a novel architectural concept; a toy model of the pattern appears at the end of this section.
-
Data Pushing to LLC is Conceptually Similar to DDIO: The "Intelligent Usage of NC-P" (Section 5.3) to push data directly into the host LLC is presented as a key CXL-specific optimization. However, this is conceptually the same goal as Intel's Direct Data I/O (DDIO). The "leaky DMA" problem mentioned is a well-known issue with DDIO. The proposed solutions—"Adaptive push-write gating" and "Post-push write-back"—are simple, reactive control heuristics (a configurable flag and a delayed write). While these mechanisms are new in their specific implementation, they represent incremental control knobs on an existing concept rather than a fundamentally new architectural approach to I/O data delivery.
-
Coherent On-NIC Memory is an Established Concept: The idea of using coherent, device-attached memory for networking structures (Section 5) has been explored before. CC-NIC [48] already proposed a "cache-coherent interface to the NIC" with the possibility of writer-homed buffers. Furthermore, platforms like Enzian [7] and work like Dagger [26] have extensively explored tightly coupling FPGAs to CPUs via proprietary coherent links for accelerating RPCs and other network functions. The novel element here is the use of the CXL.mem standard, but the architectural pattern of offloading state to coherent device memory is not new.
The delta between this work and CC-NIC [48] is primarily the interconnect (open-standard CXL vs. proprietary UPI) and the specific primitives used. For instance, CXL's NC-write avoids the need for CLFLUSH that CC-NIC required. This is an important finding about the CXL protocol's advantages, but it is not a novel architectural invention by the authors.
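For clarity, the polling pattern at issue can be modeled in a few lines. This is a software caricature written by this reviewer, not the NIC's logic, but it shows why coherent ("inline signaling") polling keeps the link quiet compared to uncached polling.

```python
# Toy model of coherent polling: the NIC keeps the polled line in its cache and
# only refetches it after the host's write invalidates the cached copy.

class CoherentLine:
    def __init__(self, value: int = 0):
        self.value = value
        self.nic_copy = value
        self.nic_copy_valid = False
        self.interconnect_reads = 0       # proxy for cross-link traffic

    def host_write(self, v: int) -> None:
        self.value = v
        self.nic_copy_valid = False       # coherence invalidation reaches the NIC

    def nic_poll(self) -> int:
        if self.nic_copy_valid:
            return self.nic_copy          # served locally; the link stays quiet
        self.nic_copy = self.value        # refill after invalidation (CO-read style)
        self.nic_copy_valid = True
        self.interconnect_reads += 1
        return self.nic_copy

tail = CoherentLine()
for _ in range(1000):
    tail.nic_poll()                       # busy polling, but only one link crossing
tail.host_write(42)                       # host publishes a new descriptor tail
tail.nic_poll()
print(tail.interconnect_reads)            # 2, versus ~1001 for naive uncached polling
```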
Questions to Address In Rebuttal
-
Please articulate the fundamental conceptual novelty of your proposed datapath optimizations over those presented in CC-NIC [48]. Beyond the fact that CXL provides different primitives than UPI (e.g., NC-write), what new architectural principle or communication pattern is being introduced here for the first time?
-
The proposed NC-P control mechanisms in Section 5.3 are presented as a key contribution. Can you argue why these simple, heuristic-based controls should be considered a significant research contribution, as opposed to a straightforward engineering solution to the known problems of DDIO-like mechanisms?
-
Given the significant conceptual overlap with prior work, would it be more accurate to frame the primary contribution of this paper as the first end-to-end design and characterization of a CXL-based NIC on real hardware, providing a roadmap for mapping known coherent communication patterns to the CXL standard?
GCC: A 3DGS Inference Architecture with Gaussian-Wise and Cross-Stage Conditional Processing
Abstract
3D Gaussian Splatting (3DGS) has emerged as a leading neural rendering technique for high-fidelity view synthesis, prompting the development of dedicated 3DGS accelerators for resource-constrained platforms. The conventional decoupled preprocessing-...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present GCC, a hardware accelerator architecture for 3D Gaussian Splatting (3DGS) inference. The central thesis is that the conventional "preprocess-then-render" tile-wise dataflow is fundamentally inefficient. To address this, they propose a "Gaussian-wise" dataflow, which processes one Gaussian completely before moving to the next, coupled with "cross-stage conditional processing" to skip work dynamically. While the motivation is sound, the submission suffers from several methodological and analytical weaknesses that call its central claims of superior efficiency into question. The proposed dataflow appears to trade one set of problems (redundant loads) for another, more severe set (serialization, complex control flow, and scalability issues), and the evaluation fails to adequately quantify these trade-offs.
Strengths
-
Problem Identification: The authors correctly identify and quantify two well-known inefficiencies in the standard 3DGS rendering pipeline: the significant fraction of preprocessed Gaussians that are ultimately discarded (Figure 2a) and the repeated loading of the same Gaussian's data for different tiles (Figure 2b). The motivation for a new dataflow is, therefore, well-grounded.
-
Co-design Principle: The work attempts a full-stack co-design, proposing a novel dataflow and a hardware architecture tailored to it. This holistic approach is commendable in principle.
Weaknesses
My primary concerns with this work are threefold: 1) the proposed Gaussian-wise blending introduces a critical performance pathology that is not analyzed, 2) the cost of the proposed boundary identification method is not justified against simpler alternatives, and 3) the evaluation framework contains significant omissions that weaken the conclusions.
-
Unanalyzed Performance Bottleneck in Blending Stage: The fundamental premise of a Gaussian-wise pipeline is that each Gaussian is rendered to completion across all pixels it covers. This necessitates random-access writes to an on-chip Image Buffer. The paper acknowledges in Section 4.5 (page 9) that strict back-to-front blending order must be maintained at the block level. The authors state: "If a later Gaussian in the sorted sequence attempts to access a block whose previous Gaussian has not yet completed processing, the pipeline stalls..." This is a critical admission. In any reasonably complex scene, multiple Gaussians will overlap within the same pixel block. This creates a high probability of data hazards, leading to frequent pipeline stalls. The paper provides zero quantitative analysis of the frequency or duration of these stalls. Without this data, the claimed performance improvements are unsubstantiated, as this serialization hazard could easily negate any gains from reduced data loading. (A toy model of the kind of stall accounting being requested follows this list.)
-
Insufficient Justification for Alpha-based Boundary Identification: The authors propose a runtime, breadth-first search (BFS) algorithm (Algorithm 1, page 6) to identify the exact pixel footprint of each Gaussian. While Table 1 (page 6) shows this method processes fewer pixels than AABB or OBB methods, this comparison is misleading. It compares the number of pixels rendered, not the total work done. The proposed BFS-style algorithm introduces significant control flow overhead (queue management, neighbor checking, visited map updates) that must be executed for every single rendered Gaussian. The paper fails to provide a cycle-level or runtime cost comparison between their complex identification algorithm and simply iterating through all pixels within a conservative OBB. It is entirely possible that the simpler data path of the OBB method, despite performing more alpha calculations, is faster overall due to the absence of this complex control logic. The claim of efficiency is therefore unsupported.
-
Flawed and Incomplete Evaluation:
- The "Compatibility Mode" (Section 4.6, page 9) is presented as a feature, but it is an admission of a fundamental scaling limitation. The Gaussian-wise approach requires the entire image's transmittance and color buffers to be on-chip, which is infeasible for high resolutions. The proposed solution is to tile the image into sub-views, which partially re-introduces the very tile-wise processing paradigm the authors claim is inefficient. Figure 6 (page 7) shows that as the sub-view size decreases, the number of redundant Gaussian invocations increases. The evaluation does not analyze the performance impact of this mode on a high-resolution target (e.g., 1080p or 4K), which would require significant tiling and likely erode the claimed benefits over GSCore.
- The GPU comparison in Section 6 (page 12) is weak. The authors state that their dataflow performs poorly on GPUs due to data races requiring atomic operations. This is not a justification for custom hardware; it is strong evidence that the proposed dataflow is inherently hostile to massively parallel architectures. An effective dataflow should be scalable, not fundamentally serial in its memory access pattern. This result undermines the generality and soundness of the proposed dataflow itself.
-
Redundant Processing Stages: The dataflow in Figure 3 (page 5) includes a global "Gaussian Grouping by Depth" (Stage I) followed by an "intra-group depth sorting" in Stage III. If a global sort order is established in Stage I, the necessity of a second, intra-group sort is unclear and seems redundant. This adds complexity and cost without clear justification.
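To be explicit about the data requested in weakness 1, a toy per-block scoreboard of the sort that could produce it is shown below. Footprints, per-block costs, and issue timing are synthetic and chosen only to illustrate the accounting, not to estimate the real stall rate.

```python
# Toy model of the block-level ordering hazard: a Gaussian may not blend into a
# block until the previous Gaussian that touched that block has finished there.
import random

random.seed(0)
NUM_BLOCKS, NUM_GAUSSIANS = 64, 500
block_free_at = [0] * NUM_BLOCKS          # cycle at which each block becomes free
total_cycles, stall_cycles, clock = 0, 0, 0

for g in range(NUM_GAUSSIANS):
    footprint = random.sample(range(NUM_BLOCKS), k=random.randint(1, 6))
    for blk in footprint:
        start = max(clock, block_free_at[blk])   # wait if a prior Gaussian still blends here
        stall_cycles += start - clock
        cost = random.randint(1, 4)              # cycles to blend this Gaussian into the block
        block_free_at[blk] = start + cost
        total_cycles += cost
    clock += 1                                    # idealized: issue one Gaussian per cycle

print(f"stall cycles: {stall_cycles}, useful blend cycles: {total_cycles}")
```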
Questions to Address In Rebuttal
-
Provide a quantitative analysis of the pipeline stalls mentioned in Section 4.5. For each benchmark, please report the total number of stall cycles as a percentage of total execution cycles. How does this scale with scene complexity (i.e., the number of overlapping Gaussians)?
-
Provide a direct, cycle-level comparison of the proposed Alpha-based Boundary Identification algorithm versus a simpler OBB-based approach. The comparison should not be based on the number of pixels processed (as in Table 1), but on the total execution time (or cycles) for the entire identification-plus-blending stage for a single Gaussian.
-
The "Compatibility Mode" is critical for practical deployment. Provide a performance and energy breakdown for rendering a standard 1920x1080 frame, which would require your architecture to process 120 sub-views of 128x128. How do the total number of Gaussian loads and overall performance in this mode compare to the GSCore baseline at the same resolution?
-
Please justify why the poor performance of your dataflow on a GPU (Section 6) should be interpreted as a need for custom hardware, rather than as an inherent parallelism flaw in the Gaussian-wise processing model itself.
-
Clarify the necessity of the second sorting stage (intra-group sorting in Stage III). Given the global depth grouping in Stage I, what specific ordering problem does this second sort solve, and what is its computational cost?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper introduces GCC, a novel hardware accelerator architecture for 3D Gaussian Splatting (3DGS) inference. The core contribution is a fundamental rethinking of the 3DGS dataflow, moving away from the conventional, GPU-inspired "preprocess-then-render" and tile-wise paradigm. Instead, the authors propose a "Gaussian-wise" rendering pipeline, augmented by "cross-stage conditional processing." In this new dataflow, Gaussians are sorted globally by depth once per frame and then processed sequentially. The entire pipeline—from 3D-to-2D projection to final color blending—is executed for a single Gaussian before moving to the next. This structure allows the system to dynamically skip all processing for Gaussians that are occluded or otherwise do not contribute to the final image, addressing major inefficiencies in the standard approach. The paper also introduces a novel alpha-based boundary identification method to more accurately define a Gaussian's region of influence, further reducing computational waste. The proposed architecture is evaluated against the state-of-the-art GSCore accelerator, demonstrating a significant 5.24× speedup and 3.35× energy efficiency improvement.
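In loop form, the dataflow inversion at the heart of the paper can be summarized as follows. This schematic contrast is written by the reviewer for exposition and is not the authors' kernel code.

```python
# Schematic contrast of the two dataflows (exposition only; not the paper's kernels).

loads = {"tile_wise": 0, "gaussian_wise": 0}

def tile_wise(tiles_to_gaussians):
    # Conventional flow: rasterize tile by tile; a Gaussian overlapping T tiles
    # is fetched and set up T times.
    for tile, gaussians in tiles_to_gaussians.items():
        for g in gaussians:
            loads["tile_wise"] += 1            # per-tile reload of the same Gaussian

def gaussian_wise(gaussians_sorted_by_depth, occluded):
    # Proposed flow: one pass in depth order; each Gaussian is loaded once and
    # skipped entirely if it no longer contributes (cross-stage conditional skip).
    for g in gaussians_sorted_by_depth:
        if g in occluded:
            continue
        loads["gaussian_wise"] += 1

tiles_to_gaussians = {0: ["g0", "g1"], 1: ["g0", "g2"], 2: ["g0"]}
tile_wise(tiles_to_gaussians)
gaussian_wise(["g0", "g1", "g2"], occluded={"g2"})
print(loads)   # tile-wise reloads g0 three times; Gaussian-wise loads each at most once
```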
Strengths
-
Fundamental Dataflow Innovation: The paper's primary strength is its departure from merely optimizing the existing 3DGS pipeline. The authors correctly identify that the standard tile-wise dataflow, while a natural fit for traditional GPU rasterization of triangles, introduces systemic redundancies for 3DGS primitives. The proposed Gaussian-wise dataflow is a paradigm shift that directly attacks the root causes of inefficiency: redundant preprocessing of unused Gaussians and repeated memory access for Gaussians spanning multiple tiles. The illustration in Figure 1 (page 3) provides an exceptionally clear and compelling argument for this new approach.
-
Excellent Contextualization and Problem Motivation: The work is situated perfectly within the current landscape of neural rendering acceleration. It correctly identifies 3DGS as a key algorithm for real-time rendering on edge devices and GSCore [19] as the relevant state-of-the-art hardware baseline. The analysis in Section 2.2 and the motivating data in Figure 2 (page 4) crisply define the problem and quantify the opportunity for improvement, making the authors' proposed solution feel both necessary and impactful.
-
Elegant Algorithm-Architecture Co-design: GCC is a strong example of co-design. The Gaussian-wise dataflow enables the possibility of cross-stage conditional processing. This, in turn, motivates the need for a more efficient way to determine a Gaussian's screen-space footprint. The proposed "alpha-based boundary identification" (Section 3, page 6) is a clever algorithmic solution that replaces coarse bounding boxes (AABB/OBB) with a more accurate, dynamically computed region. This synergy between the high-level dataflow and the low-level compute kernels is a hallmark of excellent architecture research.
-
Significant and Well-Supported Results: The performance and efficiency gains over GSCore are substantial and position GCC as a new state-of-the-art. The authors not only present strong headline numbers but also provide insightful breakdown analyses (Figure 11, page 10) and sensitivity studies (Section 5.4, page 11) that attribute these gains to their specific innovations. The comparison with GPU implementations in Section 6 further strengthens the paper by demonstrating why custom hardware is necessary to fully exploit the potential of this new dataflow.
Weaknesses
While the core idea is powerful, the paper could benefit from a deeper exploration of its own architectural trade-offs and limitations.
-
Scalability of the Global Sorting Stage: The entire Gaussian-wise dataflow hinges on the ability to perform an initial global sort of all Gaussians by depth (Stage I, page 5). While this is amortized over a frame, the complexity of this sort grows with scene size. The paper does not analyze the potential bottlenecks (memory bandwidth, compute latency) of this stage for truly massive scenes containing tens or hundreds of millions of Gaussians. It is a critical preprocessing step that could become a new limiter at scale.
-
Pragmatism vs. Purity in Compatibility Mode: The "Compatibility Mode" (Section 4.6, page 9) is a necessary and practical solution for handling large-resolution images on memory-constrained devices. However, it essentially reintroduces a form of tiling, which the paper's core premise argues against. While the authors show the overhead is minimal at a 128x128 tile size, this feels like a compromise of the central design philosophy. A more detailed analysis of the performance trade-offs as this sub-view size changes would be valuable to understand the robustness of the architecture.
-
Complexity of Runtime Boundary Identification: The alpha-based boundary identification algorithm (Algorithm 1, page 6) is more accurate but also appears more complex than a simple geometric bounding box test. It involves an iterative, breadth-first search. The paper does not provide a detailed analysis of the hardware cost or cycle latency of this runtime search process compared to the simpler methods it replaces. It's possible that for some distributions of Gaussians, the overhead of this search could erode some of the gains from reduced blending computations.
Questions to Address In Rebuttal
-
Could the authors elaborate on the scalability of the initial global depth grouping stage (Stage I)? What are the projected memory traffic and latency characteristics for this stage on scenes that are an order of magnitude larger than those in the benchmarks (e.g., 50M+ Gaussians)? At what point does this upfront sorting cost begin to dominate the per-frame latency?
-
Regarding the Compatibility Mode, could you provide data showing how performance and memory access overhead scale as the sub-view size is reduced below 128x128 (e.g., to 64x64 or 32x32)? Understanding this curve would help clarify the robustness of the architecture for even more severely resource-constrained systems.
-
The alpha-based boundary identification is an elegant technique. Could you provide a brief analysis of the average number of iterations or pixels evaluated by Algorithm 1 per Gaussian? How does the cycle cost of this dynamic search in the Alpha Unit compare to the cost of a simpler OBB intersection test followed by processing more redundant pixels, as might be done in a baseline system?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present GCC, a hardware architecture and dataflow for 3D Gaussian Splatting (3DGS) inference. The central claim of novelty rests on a fundamental restructuring of the standard 3DGS processing pipeline. Instead of the conventional two-stage "preprocess-then-render" dataflow with tile-wise rendering, GCC introduces a "Gaussian-wise" rendering approach coupled with "cross-stage conditional processing." This new dataflow processes Gaussians sequentially in depth order, completing all operations for one Gaussian before starting the next. This structure enables the early termination of both preprocessing and rendering for Gaussians that are determined to be occluded or visually insignificant, thereby addressing the redundancy issues of repeated data loading and extraneous preprocessing inherent in prior art. A supporting algorithmic contribution is an alpha-based boundary identification method to dynamically compute a tight processing region for each Gaussian.
Strengths
The primary strength of this work is its core conceptual novelty in redefining the 3DGS accelerator dataflow. My analysis confirms that this contribution is significant and well-motivated.
-
Novel Dataflow Inversion: The most significant novel contribution is the shift from a tile-wise to a Gaussian-wise rendering paradigm within the context of 3DGS acceleration. Prior dedicated hardware, notably the baseline GSCore [19], adheres to the GPU-style tile-based approach. By processing one Gaussian completely at a time (as detailed in Section 3, page 4 and Figure 1), the authors eliminate the "Challenge 2" they identify: the repeated loading of a single Gaussian's parameters for every tile it overlaps. This is a clean and fundamental departure from the established methodology in this specific domain.
-
Cross-Stage Conditional Processing: This concept is inextricably linked to the Gaussian-wise dataflow and represents a crucial element of the work's novelty. Standard pipelines decouple preprocessing from rendering, leading to wasted computation on Gaussians that are later discarded during alpha blending ("Challenge 1," page 4). GCC's proposed interleaving of these stages allows rendering-dependent information (i.e., per-pixel accumulated transmittance) to gate the execution of the entire processing chain for subsequent Gaussians. This is a genuinely new mechanism for 3DGS accelerators that directly prunes the pipeline at a much earlier stage than previously possible, saving both data movement and computation.
-
Principled, Dynamic Boundary Identification: The proposed alpha-based boundary identification method (Section 3, page 6, "Alpha-based Gaussian Boundary Identification") is a strong supporting contribution. While prior work uses static approximations like the 3σ rule (AABBs) or slightly better OBBs, this work derives the exact elliptical support boundary as a function of the Gaussian's opacity ω (Equation 8, page 5). The runtime breadth-first traversal (Algorithm 1) to identify the minimal pixel set is a specific and novel implementation of this principle, moving beyond coarse bounding boxes to a fine-grained, accurate evaluation region. This is essential for making the Gaussian-wise splatting approach computationally efficient. A compact reconstruction of the traversal idea is sketched just below.
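As noted above, here is a compact reconstruction of the traversal idea under simplifying assumptions (a 2D Gaussian with a symmetric inverse covariance, a 1/255 contribution threshold, and a hard search radius); Algorithm 1's exact bookkeeping may differ.

```python
# Flood-fill over pixels whose alpha exceeds the contribution threshold, starting
# at the projected center. Reviewer's reconstruction, not the paper's Algorithm 1.
from collections import deque
import math

def gaussian_footprint(center, inv_cov, opacity, alpha_min=1 / 255, limit=64):
    """Collect pixels where alpha = opacity * exp(-0.5 * d^T * inv_cov * d) >= alpha_min."""
    cx, cy = center
    (a, b), (c, d) = inv_cov

    def alpha(px, py):
        dx, dy = px - cx, py - cy
        power = 0.5 * (a * dx * dx + (b + c) * dx * dy + d * dy * dy)
        return opacity * math.exp(-power)

    start = (round(cx), round(cy))
    seen, queue, footprint = {start}, deque([start]), []
    while queue:
        px, py = queue.popleft()
        if alpha(px, py) < alpha_min or abs(px - cx) > limit or abs(py - cy) > limit:
            continue                                  # below threshold: stop expanding here
        footprint.append((px, py))
        for nxt in ((px + 1, py), (px - 1, py), (px, py + 1), (px, py - 1)):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return footprint

pixels = gaussian_footprint(center=(10.0, 10.0),
                            inv_cov=((0.2, 0.0), (0.0, 0.05)),
                            opacity=0.8)
print(len(pixels), "pixels inside the alpha-defined boundary")
```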
Weaknesses
While the application of the core ideas to 3DGS acceleration is novel, the underlying concepts have precedents in the broader field of computer graphics. The manuscript would be stronger if it more clearly situated its contributions relative to this foundational prior art.
-
Conceptual Precedent of Primitive-Wise Rendering: The concept of "Gaussian-wise" rendering is functionally analogous to classic immediate-mode or primitive-order rasterization, where each primitive (e.g., a triangle) is fully processed and rasterized before moving to the next. Modern tile-based rendering was developed specifically to overcome the memory bandwidth issues of this classic approach. The authors have effectively re-introduced a primitive-wise model for 3DGS, arguing its benefits in this new context. While its application here is novel and the performance justification is strong, the paper should acknowledge this conceptual lineage to more precisely define its delta over foundational graphics architectures.
-
Scalability of Global Sorting: The proposed dataflow is predicated on a global, front-to-back ordering of all Gaussians. This is accomplished in "Stage I: Gaussian Grouping by Depth" (page 5). This initial global sorting step could represent a new scalability bottleneck, especially for scenes containing tens of millions or billions of Gaussians. The paper describes a hierarchical grouping scheme (Section 4.2, page 7) but does not provide a rigorous analysis of this stage's performance characteristics or its asymptotic complexity as scene size increases. The novelty of the rendering pipeline could be undermined if this new preprocessing step does not scale effectively.
-
Complexity of Irregular Memory Access: A key reason for the prevalence of tile-based rendering is its highly coherent access to on-chip tile memory. The proposed Gaussian-wise approach, by contrast, splats elliptical footprints onto a global image buffer, creating a scattered and potentially irregular memory write pattern. While the authors mention processing in pixel blocks (8x8 PEs) to manage this (Section 4.4, page 8), the paper lacks a detailed analysis of the memory subsystem's design for handling this irregularity. The trade-off between reducing repeated DRAM loads (a clear win) and introducing less coherent on-chip memory accesses (a potential loss) is not fully explored.
Questions to Address In Rebuttal
-
Could the authors please elaborate on the novelty of "Gaussian-wise" rendering in the context of historical, primitive-order rasterization pipelines? Please clarify how your approach differs from, or is specifically adapted for, the unique properties of 3DGS compared to traditional triangle rasterization.
-
The enabling step for your dataflow is the global depth sorting of Gaussians. Can you provide an analysis of the performance and scalability of this sorting stage? How does its runtime cost scale with the number of Gaussians in a scene, and at what point, if any, might it become the dominant performance bottleneck?
-
The transition to Gaussian-wise rendering changes the memory access pattern on the Image Buffer from the coherent accesses of a tile-based system to more scattered writes. Could you provide more detail on how your memory hierarchy (e.g., buffer banking, queuing) is designed to mitigate the potential performance penalties associated with this irregularity, and quantify any remaining overhead?
RTGS: Real-Time 3D Gaussian Splatting SLAM via Multi-Level Redundancy Reduction
Abstract
3D Gaussian Splatting (3DGS) based Simultaneous Localization and Mapping (SLAM) systems can largely benefit from 3DGS’s state-of-the-art rendering efficiency and accuracy, but have not yet been adopted in resource-constrained edge devices due to ...
Reviews
Review 1
Review Form:
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors present RTGS, a co-designed algorithm and hardware framework to accelerate 3D Gaussian Splatting-based SLAM systems for real-time performance on edge devices. The core thesis is that significant redundancies exist at multiple levels of the 3DGS-SLAM pipeline (Gaussian, pixel, workload, memory access). The authors propose algorithmic solutions (adaptive pruning, dynamic downsampling) and a hardware plug-in architecture (featuring a Workload Scheduling Unit, R&B Buffer, and Gradient Merging Unit) to address these redundancies. The system is evaluated against several 3DGS-SLAM algorithms and datasets, claiming real-time performance (≥30 FPS) with "negligible quality loss."
Strengths
- Comprehensive Scope: The work commendably attempts to address performance bottlenecks across the entire 3DGS-SLAM pipeline, from algorithm to hardware architecture. This multi-level approach is ambitious.
- Hardware-Level Optimizations: The proposed R&B Buffer for reusing intermediate rendering data during backpropagation is a clever and well-justified optimization. Similarly, the design of the Gradient Merging Unit (GMU) to handle sparse gradient aggregation appears to be a technically sound approach to mitigating the known bottleneck of atomic operations.
- Extensive Profiling: The authors have conducted detailed profiling (Section 3) to motivate their design choices. This analysis provides a useful, if not entirely novel, breakdown of where latency is concentrated in the 3DGS-SLAM pipeline.
Weaknesses
The paper's claims of robustness and efficiency rest on a foundation of questionable heuristics and an evaluation that lacks sufficient rigor.
-
Algorithmic Justification is Heuristic and Lacks Principled Analysis: The core algorithmic contributions are based on "magic numbers" and ad-hoc rules that are not adequately justified.
- Adaptive Pruning (Section 4.1): The importance score in Eq. 7 uses a weighting factor λ, which is not defined or analyzed. The pruning interval K is adjusted based on a tile-Gaussian intersection change ratio exceeding a 5% threshold. Why 5%? This appears to be an empirically tuned value that may not generalize. A robust method should not rely on such arbitrary thresholds.
- Dynamic Downsampling (Section 4.2): The resolution scaling for non-keyframes starts at (1/16)R_o and increases by a factor of m = 2. The choice of these specific values is not justified. A sensitivity analysis is required to demonstrate that these are optimal choices and not simply values that worked for the selected test cases.
-
The "Negligible Quality Loss" Claim is Unsubstantiated and Contradicted by Data: The abstract makes a strong claim of "negligible quality loss," but the evidence is weak and, in some cases, contradictory.
- The authors state in Section 4.2 that with their downsampling, "both ATE and PSNR remain within a 10% variance". A 10% degradation in Absolute Trajectory Error (ATE) is far from negligible in any serious robotics or AR/VR application and could represent a critical failure.
- Table 6 presents several instances where ATE improves after applying the RTGS optimizations (e.g., GS-SLAM on ScanNet, ATE drops from 2.85 to 2.76). This is a highly counter-intuitive result. The authors provide no explanation for why removing information (pruning Gaussians, downsampling pixels) would lead to a more accurate trajectory. This suggests either an issue with the evaluation methodology or that the baseline implementations are suboptimal. Extraordinary claims require extraordinary evidence, which is absent here.
- PSNR, the measure of rendering quality, consistently drops across all experiments in Table 6. While the drops are small, they are not zero, which again challenges the term "negligible."
-
The Pruning Strategy's Robustness is Questionable: The ablation study on pruning ratio (Figure 14a) reveals a critical weakness. The authors observe that ATE increases sharply beyond a 50% pruning ratio and therefore "cap the pruning ratio at 50%". This is not a strength but an admission that the proposed importance score (Eq. 7) is not a reliable measure of a Gaussian's true contribution. A truly robust importance metric would naturally preserve critical Gaussians even at high pruning ratios; the need for an external, arbitrary cap implies the metric is flawed.
-
Hardware Evaluation Oversimplifies Critical Aspects:
- GPU Integration Model (Section 5.5): The proposed programming interface and synchronization via shared-memory flags is a high-level abstraction. It completely ignores the significant real-world overheads of polling, cache coherency traffic, and scheduler contention that would arise from such tight coupling between the SMs and an external accelerator. The performance model appears overly optimistic.
- Ablation Study of Speedups (Figure 17b): The overall speedup is presented as a product of independent factors from each optimization. This assumes the contributions are orthogonal, which is highly unlikely. For instance, Adaptive Pruning reduces the total workload, which in turn affects the severity of workload imbalance that the WSU must solve. The speedup from the WSU is therefore dependent on the pruning ratio. The analysis does not account for these interactions, making the reported breakdown misleading.
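To make the orthogonality concern concrete, consider the following toy calculation (all numbers invented for illustration, not taken from the paper): when pruning shrinks the very imbalance that workload scheduling would otherwise have to hide, the individually measured speedups no longer multiply.

```python
# Toy model (invented numbers): two GPU lanes each own one pixel's workload.
# Without balancing, frame time = the heavier lane; with balancing, work is split evenly.

def frame_time(work, balanced):
    return sum(work) / len(work) if balanced else max(work)

baseline      = [12.0, 4.0]   # per-pixel work before any optimization
after_pruning = [6.0, 4.0]    # pruning mostly removes work from the heavy pixel

t_base        = frame_time(baseline, balanced=False)        # 12.0
s_wsu_alone   = t_base / frame_time(baseline, True)          # 1.5x from balancing alone
s_prune_alone = t_base / frame_time(after_pruning, False)    # 2.0x from pruning alone
s_product     = s_wsu_alone * s_prune_alone                  # 3.0x if assumed independent
s_actual      = t_base / frame_time(after_pruning, True)     # 2.4x when actually combined

print(s_wsu_alone, s_prune_alone, s_product, s_actual)       # 1.5 2.0 3.0 2.4
```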
Questions to Address In Rebuttal
- Please provide a rigorous justification for the choice of hyperparameters in your algorithms: the pruning score weight λ, the 5% change-ratio threshold for adapting K, and the (1/16) and m = 2 constants for dynamic downsampling. A sensitivity analysis showing the impact of these parameters on both performance and accuracy is expected.
- The claim of "negligible quality loss" requires stronger defense. Please specifically address: (a) why a potential 10% increase in ATE should be considered negligible for SLAM applications, and (b) the mechanism by which removing information leads to improved ATE in several cases reported in Table 6. This counter-intuitive result must be explained.
- If the proposed Gaussian importance score (Eq. 7) is robust, why is it necessary to enforce a hard 50% cap on the pruning ratio to prevent accuracy degradation? Does this not indicate a fundamental limitation of the metric itself?
- How do you validate the assumption that the speedup contributions from your various hardware and software techniques (as shown in Figure 17b) are independent and can be multiplied? Please provide evidence that the speedup from one component (e.g., WSU) is not dependent on the operation of another (e.g., Adaptive Pruning).
- What are the estimated latency and energy overheads of the synchronization mechanism between the GPU SMs and the RTGS plug-in? The current model seems to assume these are zero.
Review 2
Review Form:
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents RTGS, a holistic algorithm-hardware co-design framework aimed at enabling real-time performance for 3D Gaussian Splatting (3DGS) based SLAM systems on resource-constrained edge devices. The core problem addressed is the significant computational and memory overhead of existing 3DGS-SLAM methods, which prevents them from achieving the ≥30 FPS threshold required for interactive applications.
The authors' central contribution is a systematic, multi-level approach to identifying and eliminating redundancies throughout the SLAM pipeline. On the algorithm side, they introduce an adaptive Gaussian pruning method that reuses existing backpropagation gradients to identify unimportant Gaussians, and a dynamic down-sampling technique for non-keyframes that leverages the SLAM system's own keyframe identification logic. On the hardware side, they propose a dedicated GPU plug-in featuring several novel units: a Workload Scheduling Unit (WSU) to balance load across pixels, a Rendering and Backpropagation (R&B) Buffer to reuse intermediate data between pipeline stages, and a Gradient Merging Unit (GMU) to accelerate gradient aggregation without costly atomic operations.
By tackling inefficiencies at the object, pixel, execution, and pipeline levels simultaneously, RTGS demonstrates the ability to significantly accelerate existing 3DGS-SLAM algorithms, achieving real-time performance and substantial energy efficiency gains with negligible impact on accuracy.
Strengths
-
Holistic, System-Level Contribution: The most significant strength of this work is its comprehensive, system-level perspective. Rather than proposing a single point-solution, the authors have conducted a thorough analysis of the entire 3DGS-SLAM pipeline (Section 3, pages 3-5), identified multiple, distinct bottlenecks (Observations 1-6), and engineered a set of synergistic solutions. This algorithm-hardware co-design philosophy is powerful and leads to a much more impactful result than a purely algorithmic or purely architectural approach would have.
-
High Potential for Impact: This work addresses a critical and timely problem. 3DGS has emerged as a leading representation for scene rendering, but its application in robotics and AR/VR is gated by performance. By demonstrating a clear path to real-time execution on edge platforms, this paper could unlock the widespread adoption of 3DGS for a new class of applications, from on-device photorealistic mapping for AR glasses to more capable autonomous robots. It effectively transforms 3DGS-SLAM from a near-real-time curiosity into a practical engineering solution.
-
Insightful and "Low-Overhead" Redundancy Reduction: A key insight of the paper is that the process of identifying and eliminating redundancy must itself be low-cost. The proposed methods are elegant in this regard. For example, using existing gradients for pruning (Section 4.1, page 5) and reusing keyframe decisions for downsampling (Section 4.2, page 6) avoids the need for expensive, orthogonal analysis steps. Similarly, the hardware's use of inter-iteration similarity for scheduling (Section 5.2, page 8) is a clever way to amortize the cost of workload analysis. This design principle is what makes the proposed speedups practically achievable.
-
Excellent Contextualization and Motivation: The paper does a superb job of positioning itself within the broader landscape. The introduction clearly traces the evolution of SLAM representations, and Table 1 (page 2) provides a concise and effective comparison against related hardware acceleration works (GauSPU, GSArch, etc.), clearly articulating the novelty of RTGS's more comprehensive approach. The detailed profiling results in Section 3 serve as a strong, data-driven motivation for every subsequent design decision.
Weaknesses
As a reviewer focused on synthesis and potential, the weaknesses noted here are less about flaws in the execution and more about the scope and future challenges of the work.
-
Specificity of the Hardware Solution: The proposed hardware architecture is tightly coupled to the specific pipeline structure of 3DGS (e.g., projection, sorting, alpha blending). While the authors suggest in the conclusion (Section 8, page 12) that the co-design could be applied to other differentiable renderers like NvDiffRec or Pulsar, this claim is not substantiated. It is unclear how concepts like the R&B Buffer or the WSU's pairwise pixel scheduling would map to fundamentally different rendering paradigms, such as volumetric rendering in NeRF, which have different memory access patterns and pipeline bottlenecks.
-
Focus on Pipeline Acceleration, Not SLAM Fundamentals: This is a choice of scope, not a flaw, but it is important to recognize. The work brilliantly accelerates the per-frame processing of existing 3DGS-SLAM algorithms. However, it does not engage with or propose improvements for other fundamental SLAM challenges like robust loop closure, long-term map management, or relocalization in the context of 3D Gaussian representations. The overall system's robustness and accuracy are still fundamentally limited by the base algorithm it accelerates.
Questions to Address In Rebuttal
-
The claim that the RTGS co-design techniques are applicable to other differentiable rendering systems is intriguing. Could the authors elaborate on this? For example, how would the concept of the R&B Buffer, which reuses data between forward rendering and backpropagation, be adapted for a NeRF-style volumetric renderer where the "forward pass" involves ray marching and querying an MLP?
-
A full SLAM system requires more than just fast tracking and mapping; it needs robust long-term operation. Does the adaptive Gaussian pruning or dynamic resolution scaling for non-keyframes introduce any potential risks for long-term map consistency or the ability to perform successful loop closures later in a trajectory? For instance, could aggressive pruning remove Gaussians that, while unimportant for the current frame, are critical for recognizing a previously visited location?
-
With the core rendering and backpropagation pipeline so effectively accelerated by the RTGS plug-in, what do the authors foresee as the next major system-level bottleneck? Will the workload shift to the "classic" GPU components responsible for preprocessing and sorting, or will the system become limited by off-chip memory bandwidth despite the on-chip optimizations?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
The authors present RTGS, an algorithm-hardware co-design framework intended to accelerate 3D Gaussian Splatting-based SLAM (3DGS-SLAM) to real-time performance on edge devices. The core thesis is that significant performance gains can be unlocked by systematically identifying and reducing redundancies at multiple levels of the SLAM pipeline, from individual Gaussians and pixels to entire frames and iterations.
The claimed novelty is not a single, groundbreaking algorithm or architectural principle. Rather, it lies in the comprehensive synthesis and co-design of several known optimization techniques, specifically tailored to the unique workload characteristics of 3DGS-SLAM. The authors propose algorithmic modifications (adaptive pruning, dynamic downsampling) and a corresponding hardware plug-in with specialized units (WSU, R&B Buffer, GMU) to implement these optimizations with minimal overhead. While the resulting system demonstrates a significant performance leap, an analysis of the individual components reveals that most are adaptations of well-established concepts from adjacent fields.
Strengths
The primary strength and most novel aspect of this work is its holistic, multi-level co-design approach. The authors correctly identify that a single-point optimization is insufficient. The main contributions that can be considered novel in their application context are:
- Exploitation of Inter-Iteration Similarity: The insight that workload distributions are highly similar across optimization iterations within the same frame (Observation 6, page 5) is a key enabler. Using this temporal coherence to inform the pixel-level pairwise scheduling in the Workload Scheduling Unit (WSU) is a clever, application-specific optimization that reduces scheduling overhead. A sketch of one plausible form of such a pairing heuristic follows this list.
- Overhead-Aware Optimizations: The paper demonstrates a keen awareness of the cost of optimization itself. For instance, the adaptive Gaussian pruning reuses gradients already computed for backpropagation (Section 4.1, page 5), and the R&B Buffer reuses intermediate rendering values for the backward pass (Section 5.2, page 7). This focus on minimizing the meta-cost of redundancy reduction is a noteworthy engineering contribution.
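As referenced above, the following sketch illustrates one plausible form of a previous-iteration-guided pairing heuristic. The specific pairing rule and the use of per-pixel Gaussian counts as the cost proxy are assumptions made purely for illustration; this is not claimed to be the WSU's actual policy.

```python
# Illustration only: balance pixel work by pairing, using last iteration's per-pixel
# Gaussian counts as a proxy for this iteration's cost (temporal-coherence assumption).
# The heaviest remaining pixel is teamed with the lightest, so each pair has similar total work.

def pair_pixels_by_previous_cost(prev_cost_per_pixel):
    order = sorted(range(len(prev_cost_per_pixel)),
                   key=lambda p: prev_cost_per_pixel[p])
    pairs = []
    lo, hi = 0, len(order) - 1
    while lo < hi:
        pairs.append((order[hi], order[lo]))   # heavy pixel paired with a light one
        lo, hi = lo + 1, hi - 1
    if lo == hi:
        pairs.append((order[lo],))             # odd pixel out runs alone
    return pairs

prev_cost = [40, 3, 22, 5, 31, 8]              # hypothetical per-pixel costs from iteration t-1
print(pair_pixels_by_previous_cost(prev_cost)) # -> [(0, 1), (4, 3), (2, 5)]
```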
Weaknesses
My primary concern is the conceptual novelty of the individual techniques employed. While the authors have engineered a complex and effective system, the foundational ideas are largely derivative of prior art.
- Gaussian Pruning: The use of gradient magnitudes to determine importance is a standard technique in neural network pruning. The paper’s contribution (Section 4.1, page 5) is the application of this principle to 3D Gaussians within a SLAM loop, combined with a progressive masking strategy. While effective, this is an application of a known method, not the invention of a new one. The paper itself cites Taming 3DGS [29] as a prior method for pruning, with the main delta being that RTGS's method is better suited for the low-iteration count of SLAM.
- Dynamic Downsampling: Adaptive resolution based on frame content or importance is a classic technique in real-time graphics and video compression. The idea of treating keyframes and non-keyframes differently is the cornerstone of modern SLAM. The proposed heuristic of progressively scaling resolution based on distance from the last keyframe (Section 4.2, page 6) is a specific implementation choice, not a new paradigm.
- Workload Balancing: The problem of workload imbalance in rendering is decades old. Tile-based rendering and various dynamic scheduling strategies exist to combat this. The authors themselves acknowledge in Table 1 (page 2) that GauSPU [49] and MetaSapiens [23] address this at the tile level. The contribution of RTGS is to move this to a finer, pixel-level granularity, which is an incremental, albeit logical, refinement.
- Hardware Specialization: The R&B Buffer is a form of specialized cache or memoization for the forward/backward pass, a common pattern in deep learning accelerators. Similarly, the Gradient Merging Unit (GMU) is a hardware implementation of a reduction tree to mitigate atomic operation hazards, a problem and solution-pattern seen in prior work like DISTWAR [5] and SIGMA [36], which the authors cite. The novelty is the direct mapping of these architectural patterns to the 3DGS workload, not the patterns themselves.
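As a software analogy of the reduction pattern just described (not the GMU's actual microarchitecture), the sketch below aggregates per-pixel gradient contributions by Gaussian ID using a sort followed by a segmented sum. This is the standard way to avoid atomic-add contention, and a hardware reduction tree exploits the same structure.

```python
import numpy as np

def merge_gradients(gaussian_ids, grad_contribs, num_gaussians):
    """Aggregate per-(pixel, Gaussian) gradient contributions into per-Gaussian gradients
    without atomic adds: sort contributions by Gaussian ID, then reduce each segment."""
    order = np.argsort(gaussian_ids, kind="stable")
    ids_sorted = gaussian_ids[order]
    grads_sorted = grad_contribs[order]
    merged = np.zeros((num_gaussians,) + grad_contribs.shape[1:], dtype=grad_contribs.dtype)
    # Each run of equal IDs is summed in one pass; this plays the role of the reduction tree.
    seg_starts = np.flatnonzero(np.r_[True, ids_sorted[1:] != ids_sorted[:-1]])
    merged[ids_sorted[seg_starts]] = np.add.reduceat(grads_sorted, seg_starts, axis=0)
    return merged

ids = np.array([2, 0, 2, 1, 0, 2])                       # owning Gaussian of each contribution
contribs = np.array([[1.0], [0.5], [2.0], [4.0], [0.5], [1.0]])
print(merge_gradients(ids, contribs, num_gaussians=3))   # per-Gaussian sums: 1.0, 4.0, 4.0
```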
In summary, the work is a strong piece of engineering that combines existing ideas into a new context. However, it lacks a central, fundamentally new concept. The novelty is in the sum of its parts, not in any individual part.
Questions to Address In Rebuttal
The authors should clarify the "delta" between their work and prior art with greater precision.
- The WSU's scheduling is guided by the previous iteration. Given the high-speed, parallel nature of GPUs, how does this differ conceptually from existing dynamic scheduling and work-stealing techniques that react to queue lengths and other real-time load metrics? Is the benefit purely from avoiding the overhead of on-the-fly analysis, and if so, how does this hold up if scene or camera motion becomes highly erratic?
- The paper argues that the combination of these techniques requires a hardware-algorithm co-design. Could the core algorithmic ideas (gradient reuse for pruning, progressive downsampling) be implemented efficiently in software (e.g., via CUDA) on a future-generation GPU with more flexible atomic operations or scheduling primitives? Please justify why a dedicated hardware plug-in is essential, rather than just beneficial.
- Regarding the GMU, the paper argues that the sparsity pattern in SLAM is different from other workloads. Can you quantify this difference and explain precisely how the proposed reduction tree architecture is uniquely suited to this pattern in a way that prior work on sparse aggregation (e.g., SIGMA [36]) is not?
REACT3D: Real-time Edge Accelerator for Incremental Training in 3D Gaussian Splatting based SLAM Systems
Abstract
3D Gaussian Splatting (3DGS) has emerged as a promising approach for high-fidelity scene reconstruction and has been widely adopted in Simultaneous Localization and Mapping (SLAM) systems. 3DGS SLAM requires incremental training and rendering of Gaussians ...
Reviews
Review 1
Review Form
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose REACT3D, a hardware accelerator for incremental training in 3D Gaussian Splatting (3DGS) based SLAM systems, targeting real-time performance on edge devices. The work is motivated by a performance analysis of GPU-based 3DGS SLAM, identifying algorithmic redundancy, unnecessary loss computation, and a dual-index memory access bottleneck as key challenges. To address these, the paper introduces an algorithm-architecture co-design featuring: 1) a spatial consistency and convergence-aware sparsification algorithm based on optical flow, 2) a pixel block-wise fine-grained dataflow that fuses rendering and gradient calculation, and 3) a CAM-based buffer to resolve irregular memory access.
While the paper presents a comprehensive co-design effort and identifies relevant bottlenecks, its central claims of achieving real-time performance and outperforming prior work rest on a series of strong, under-justified assumptions and a critically flawed evaluation methodology concerning competitor analysis. The robustness of the core algorithmic contribution—sparsification—is not sufficiently proven.
Strengths
-
Problem Analysis: The initial workload characterization presented in Section 3 and Figure 1 provides a clear and logical breakdown of the 3DGS SLAM pipeline bottlenecks on edge GPUs. The identification of the dual-index access problem (Section 3.4) as a root cause for inefficiencies in sorting and gradient merging is a particularly sharp insight.
-
Architectural Cohesion: The proposed hardware architecture in Section 5 is well-structured and directly maps onto the identified problems. The design of the CAM-based Dual-index Gaussian Buffer (CDGB) is a direct, if conventional, architectural response to the dual-index access problem. The concept of a fused rendering dataflow (Section 5.2) is a logical extension of Insight 2.
-
Ablation Study: The inclusion of a cumulative ablation study (Section 6.5.4, Figure 18) is commendable practice. It provides a clear, step-by-step view of the purported performance contribution of each proposed optimization within the authors' own framework.
Weaknesses
-
Fundamentally Flawed Baseline Comparison: The performance comparison against prior accelerators GSArch [16] and GauSPU [42] is not based on direct implementation but on "performance models based on their hardware architectures" (Section 6.1, page 10). This is unacceptable. Such models are highly susceptible to implementation bias and inaccurate assumptions. Without a rigorous validation of these models against the original papers' results, or a re-implementation within a common simulation framework, the performance claims in Figure 13 and energy claims in Figure 14 are unsubstantiated and cannot be trusted.
-
Fragility of Sparsification Method: The core algorithmic novelty rests on a sparsification scheme that uses Lucas-Kanade optical flow (Section 4.1). Optical flow is notoriously brittle and fails under common SLAM conditions such as rapid motion, textureless surfaces, motion blur, and illumination changes. The paper hand-waves this critical issue away by mentioning a "dynamic validation scheme" that discards masks if "spatial distance between frames is too large, or when the predicted sparsity rate is excessively high." This is ad-hoc and insufficient. The thresholds for these conditions are not specified, nor is the performance impact of frequent reversions to a full forward pass.
-
Unsupported Algorithmic "Magic Numbers": The "Convergence-aware Adaptive Thresholding" (Section 4.2) sets the threshold as a "fixed fraction a% of the maximum loss value." The value of a is a critical hyperparameter for the entire system's performance-accuracy trade-off, yet it is never stated or justified. This suggests that the parameter may have been tuned specifically for the chosen datasets, questioning the generalizability of the results.
Overstated and Unsubstantiated Claims:
- The paper claims to be the "first hardware design that meets the real-time requirements" (Abstract). Yet, in their own results (Section 6.3), the system achieves 29.15 FPS on the of2 scene, falling short of their own 30 FPS target. This is a minor miss, but it contradicts the absolute nature of their claim.
- The authors claim the overhead of the CAM-based buffer's write operation "can be concealed by overlapping its execution with longer-latency computing stages" (Section 5.4, page 9). This is a classic assumption that may not hold. The paper provides no cycle-level analysis or evidence to prove that this latency is always hidden and never stalls the pipeline.
-
Insufficient Evaluation Scope: The evaluation is conducted on only nine sequences from two well-behaved indoor datasets (TUM and Replica). This is an inadequate sample to validate a system intended for general SLAM. The method's robustness is not tested against more challenging scenarios common in robotics, such as large-scale environments or scenes with dynamic objects.
Questions to Address In Rebuttal
-
Baseline Models: Provide a detailed justification for using performance models for GSArch and GauSPU instead of a direct re-implementation. You must provide a thorough validation of your models against the results reported in the original papers, including an analysis of any discrepancies and assumptions made. Without this, the comparative performance claims are invalid.
-
Sparsification Robustness: Quantify the failure rate of the optical flow prediction in your test sequences. How often is the "dynamic validation scheme" triggered, and what is the performance penalty in those instances? Provide a sensitivity analysis on the (unspecified) thresholds for this validation scheme.
-
Hyperparameter Justification: State the exact value of the adaptive thresholding parameter
a(from Section 4.2) used in your experiments. Provide a clear justification for its selection, including a sensitivity analysis showing how PSNR and FPS vary with different values ofa. -
Latency Concealment: Provide concrete, cycle-level data from your simulator to substantiate the claim that the CDGB write latency is fully concealed. Show pipeline diagrams for cases with both long and short compute stages to prove that stalls do not occur.
-
Generalization: Justify the limited selection of datasets. Why were more challenging sequences, for instance from datasets like EuRoC MAV or KITTI, not included to properly stress-test the system's robustness, particularly the optical flow-based sparsification?
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents REACT3D, a domain-specific accelerator for incremental training in 3D Gaussian Splatting (3DGS) based SLAM systems, designed for real-time performance on edge devices. The core contribution is a holistic, algorithm-architecture co-design that addresses the unique challenges of the continuous, streaming data inherent to SLAM. The authors identify that naive application of 3DGS training is too slow for the ~30 FPS requirement of real-time SLAM. To solve this, they introduce two key innovations: 1) an algorithmic technique called "spatial consistency and convergence aware sparsification" that leverages optical flow to predict and prune well-optimized regions in the scene, drastically reducing redundant computation; and 2) a specialized hardware architecture featuring a fused rendering dataflow to eliminate pipeline stalls and a novel CAM-based Dual-index Gaussian Buffer (CDGB) to resolve irregular memory access patterns. By tackling the problem at both levels, the authors claim to be the first to achieve real-time (>30 FPS) high-fidelity mapping for 3DGS SLAM on an edge platform, demonstrating a 12.10x speedup over a high-end embedded GPU.
Strengths
-
Excellent Problem Formulation and Significance: The authors have identified a critical and timely problem. The integration of high-fidelity explicit representations like 3DGS into SLAM is a major trend, but the computational cost has been a prohibitive barrier for deployment on resource-constrained platforms like robots and AR/VR headsets. This work directly confronts this bottleneck. By framing the goal as crossing the ~30 FPS real-time threshold, the authors provide a clear and compelling motivation. The potential impact of enabling real-time, dense, high-fidelity mapping on edge devices is substantial, potentially unlocking the next generation of autonomous navigation and spatial computing applications.
-
Insightful Algorithm-Architecture Co-Design: The most impressive aspect of this work is the synergy between the algorithmic and architectural contributions. The key insight that SLAM keyframes are temporally coherent is not new, but its application here is masterful. The proposed sparsification method (Section 4, page 5) exploits this coherence to prune the workload. This algorithmic pruning is what makes the subsequent hardware acceleration so effective. This is a textbook example of strong systems research, where understanding the structure of the data and the application informs the design of the underlying hardware.
-
Thorough Workload Analysis: The paper is grounded in a solid analysis of the 3DGS SLAM pipeline on existing hardware (Figure 1, page 2). The breakdown of the pipeline into stages and the identification of non-obvious, cross-stage problems—namely algorithmic redundancy, unnecessary loss computation, and the "dual-index access" bottleneck—is highly insightful. This detailed upfront analysis provides a strong justification for every subsequent design choice, from the fused dataflow to the specialized CDGB.
-
Novel and Well-Justified Architectural Solutions: The architectural proposals are not just generic compute units; they are tailored to the specific bottlenecks identified. The CAM-based Dual-index Gaussian Buffer (CDGB, Section 5.4, page 9) is a particularly clever solution to the tricky problem of resolving memory access patterns that switch between being indexed by Gaussian ID (Gid) and Pixel ID (Pid). This is a subtle but critical performance issue that a more generic architecture would handle poorly. Similarly, the fused forward-and-backward dataflow (Section 5.2.3, page 7), which eliminates the explicit loss computation stage by leveraging the properties of the L1 loss function, is an elegant pipeline optimization that improves both utilization and data reuse.
Weaknesses
While this is a strong paper, its significance is framed by its current scope. My concerns are less about flaws in the existing work and more about its generalizability and the robustness of its core assumptions.
-
Limited Evaluation Scope (Static, Indoor Scenes): The work is evaluated on the TUM and Replica datasets, which consist of well-behaved, indoor, and largely static environments. The core algorithmic contribution—sparsification based on temporal consistency and optical flow—is likely to be most effective in these scenarios. The real-world challenges for SLAM, however, often involve dynamic objects (e.g., people walking), significant lighting changes, and large, less-constrained outdoor spaces. The paper acknowledges this as future work (Section 7.2, page 13), but the potential fragility of the core assumptions in these more challenging settings is a notable limitation on the claimed real-world applicability.
-
Potential Fragility of Optical Flow: The entire sparsification scheme hinges on the Lucas-Kanade (LK) optical flow algorithm (Section 4.1, page 5) to establish correspondence between frames. While LK is efficient, it is known to struggle with large displacements, textureless surfaces, and illumination changes. The paper mentions a "dynamic validation scheme" to fall back to a dense pass, but the impact of these fallbacks on average performance is not deeply analyzed. A high rate of optical flow failure could significantly degrade performance, potentially pushing the system back below the real-time threshold in more complex scenes.
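To make the failure mode and the fallback concrete, the sketch below shows the general shape of flow-based mask propagation with a validation check. The flow field is assumed to be precomputed (e.g., by LK), and the specific checks and thresholds are invented for illustration; they are not REACT3D's actual heuristics.

```python
import numpy as np

# Illustration only: propagate a keyframe's "converged" mask to the current frame using a
# dense flow field, and fall back to a full (dense) pass when the flow looks unreliable.
MAX_MEDIAN_FLOW_PX = 8.0       # hypothetical: large apparent motion -> distrust the mask
MAX_PREDICTED_SPARSITY = 0.9   # hypothetical: pruning almost everything is suspicious

def propagate_sparse_mask(prev_mask, flow):
    """prev_mask: HxW bool (True = region considered converged, can be skipped).
    flow: HxWx2 displacement from the previous keyframe to the current frame, in pixels."""
    h, w = prev_mask.shape
    ys, xs = np.nonzero(prev_mask)
    xs_new = np.clip(np.round(xs + flow[ys, xs, 0]).astype(int), 0, w - 1)
    ys_new = np.clip(np.round(ys + flow[ys, xs, 1]).astype(int), 0, h - 1)
    mask = np.zeros_like(prev_mask)
    mask[ys_new, xs_new] = True
    return mask

def mask_for_current_frame(prev_mask, flow):
    median_flow = np.median(np.linalg.norm(flow, axis=-1))
    predicted = propagate_sparse_mask(prev_mask, flow)
    if median_flow > MAX_MEDIAN_FLOW_PX or predicted.mean() > MAX_PREDICTED_SPARSITY:
        return np.zeros_like(prev_mask)   # discard the mask: revert to a dense pass
    return predicted

prev_mask = np.zeros((4, 4), dtype=bool); prev_mask[1, 1] = True
flow = np.zeros((4, 4, 2)); flow[1, 1] = [1.0, 0.0]          # that region moved one pixel right
print(mask_for_current_frame(prev_mask, flow)[1, 2])          # True: the mask followed the motion
```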
Questions to Address In Rebuttal
-
Could the authors elaborate on the robustness of the sparsification method? Specifically, regarding the "dynamic validation scheme" (Section 4.1, page 6), what are the precise heuristics for discarding a predicted sparse mask (e.g., frame-to-frame distance, sparsity rate threshold)? In the evaluated datasets, how frequently did this fallback to a dense pass occur, and what was its impact on the average frame rate?
-
While dynamic scenes are noted as future work, could the authors speculate on how the REACT3D framework might be adapted to handle them? For example, could the system integrate with a dynamic object segmentation model to exclude those regions from the sparsification process? Would this create new architectural bottlenecks?
-
The performance of the system seems to be tied to the gradual convergence of the scene representation. In scenarios involving rapid exploration of new areas or loop closures in SLAM, the scene representation changes dramatically. How would the proposed convergence-aware thresholding and sparsification handle these less gradual, more disruptive updates to the map?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper introduces REACT3D, a hardware accelerator for incremental training in 3D Gaussian Splatting (3DGS) based SLAM systems, targeting real-time performance on edge devices. The authors identify key bottlenecks in the 3DGS training pipeline—namely redundant computation, explicit loss calculation stalls, and memory access irregularities from dual-indexing—and propose a set of co-designed algorithmic and architectural solutions.
The core claims to novelty rest on three pillars: 1. An inter-frame sparsification algorithm that uses optical flow to propagate sparse masks across a sliding window of keyframes, guided by a convergence-aware adaptive threshold. 2. A fused, fine-grained rendering dataflow that eliminates the explicit loss computation stage by analytically deriving gradients for the L1 loss function. 3. A Content Addressable Memory (CAM)-based buffer (CDGB) designed to resolve the "dual-index" bottleneck in the sorting and gradient merging stages.
My analysis concludes that while the fundamental components used (optical flow, kernel fusion, CAMs) are not new in themselves, their specific application and synthesis to solve the profiled bottlenecks of incremental 3DGS SLAM training represent a significant and novel contribution to the field of domain-specific architecture.
Strengths
-
Novelty in Sparsification Strategy: The proposed "spatial consistency and convergence aware sparsification" (Section 4, page 5) is a genuinely novel approach in the context of 3DGS acceleration. Prior works like GauSPU [42] and GSArch [16] focus on intra-frame sparsification (static block masks or gradient pruning). REACT3D's method of using optical flow to propagate sparse masks across historical keyframes leverages the temporal nature of the SLAM problem, a dimension overlooked by previous hardware efforts. This is a conceptually significant delta, moving from a static, per-frame view of redundancy to a dynamic, inter-frame perspective.
-
Elegant Solution to the Dual-Index Problem: The identification of the "dual-index access" issue (Key Insight 3, page 5) is sharp, and the proposed solution—a CAM-based Dual-index Gaussian Buffer (CDGB, Section 5.4, page 9)—is a clever mapping of a known architectural primitive to a new problem domain. While CAMs are standard components in networking and cache design, their application to resolve the gather/scatter conflict inherent in the sorting (group by Pid) and gradient merging (group by Gid) stages of 3DGS is, to my knowledge, entirely new. It elegantly sidesteps the need for costly software atomics or inefficient full data re-sorting.
-
Effective Architectural-Algorithmic Co-design: The fused rendering dataflow (Section 5.2.3, page 7) is a strong example of co-design. The insight that SLAM systems can forgo the complex D-SSIM loss in favor of a simpler L1 loss (Key Insight 2, page 4) enables a key architectural innovation: eliminating the loss computation stage entirely. By directly calculating the trivial L1 gradient within the forward engine and pipelining it to the backward engine, the design avoids a major synchronization point that plagues GPU implementations (Figure 4, page 4). While kernel fusion is a known optimization paradigm, this specific, state-aware, tightly-coupled pipeline for 3DGS rendering is a novel dataflow design.
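The enabling identity is worth spelling out: for an L1 photometric loss, the per-pixel gradient is simply the sign of the residual that is already available while rendering, which is why the explicit loss stage can be folded away. The sketch below states that identity in isolation; it is not the paper's fused pipeline.

```python
import numpy as np

def l1_loss_and_grad(rendered, target):
    """For L = mean(|rendered - target|), dL/drendered = sign(rendered - target) / N.
    Because the gradient needs nothing beyond the residual already computed while rendering,
    a fused pipeline can forward it directly to backpropagation and skip an explicit loss stage."""
    residual = rendered - target
    return np.abs(residual).mean(), np.sign(residual) / residual.size

rendered = np.array([0.2, 0.8, 0.5])
target   = np.array([0.4, 0.6, 0.5])
loss, grad = l1_loss_and_grad(rendered, target)
print(loss, grad)   # 0.1333...  [-1/3, 1/3, 0]
```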
Weaknesses
-
The "Novelty Delta" of Primitives vs. Application: The paper's primary novelty lies in the synthesis and application of existing concepts. The components themselves—LK optical flow, CAMs, and fused pipelines—are well-established. The authors should be more precise in their claims to distinguish between the invention of new computational primitives and the novel application of existing ones to a new, complex workflow. The current framing could be misinterpreted as claiming the invention of these underlying technologies.
-
Justification of Complexity for the CAM-based Buffer: The introduction of a large CAM (64KB, as per Table 2, page 11) is a non-trivial architectural cost in terms of area and power. The performance benefit is cited as an average of 1.29x for the accelerator (Section 6.5.3, page 12). While impactful, it is unclear if this gain could have been approached by a less exotic, highly-optimized hardware sorting network combined with a serialized hardware accumulator for the gradient merge. The paper lacks a direct comparison to such an alternative, making it difficult to fully assess if the novelty of the CAM solution is justified by its cost-benefit trade-off versus more conventional accelerator designs.
-
Fragility of the Sparsification Approach: The novelty of the optical flow-based sparsification is tempered by its reliance on an algorithm with well-known failure modes, such as large displacements, occlusions, and significant lighting changes—all plausible in SLAM scenarios. The paper acknowledges this with a simple fallback mechanism (Section 4.1, page 6), where the predicted mask is discarded. This fallback negates the performance benefit. The work would be stronger if it characterized the frequency of this fallback on challenging datasets or proposed a more resilient mechanism for mask propagation. As it stands, the core algorithmic novelty has a clear Achilles' heel.
Questions to Address In Rebuttal
-
Regarding the CAM-based Dual-index Gaussian Buffer (CDGB): Could the authors provide a more rigorous justification for using a CAM over a more conventional architecture? For instance, what is the estimated performance and area/power cost of an alternative design using a dedicated hardware radix sorter for the sorting stage and a conflict-free banking scheme or serialized accumulator for the gradient merging stage? This would help quantify the "delta" in efficiency that this novel component provides.
-
Regarding the optical-flow sparsification: The proposed fallback to dense processing during large frame-to-frame changes seems pragmatic but potentially costly. In the provided TUM and Replica dataset experiments, what percentage of keyframes triggered this fallback mechanism? How does this affect the average end-to-end FPS in more dynamic sequences not included in the evaluation, where camera motion might be more erratic?
-
To clarify the novelty positioning: Would the authors agree that the paper's primary contribution is the novel synthesis of established architectural and algorithmic concepts (CAMs, optical flow, dataflow fusion) to create the first system that solves the specific performance challenges of incremental 3DGS SLAM on the edge? Positioning the work this way would more accurately reflect its relationship to prior art.
PointISA: ISA-Extensions for Efficient Point Cloud Analytics via Architecture and Algorithm Co-Design
Abstract
Point cloud analytics plays a crucial role in spatial machine vision for applications like autonomous driving, robotics and AR/VR. Recently, numerous domain-specific accelerators have been proposed to meet the stringent real-time and energy-efficiency ...
Reviews
Review 1
Review Form:
Reviewer: The Guardian (Adversarial Skeptic)
Summary
The authors propose PointISA, an instruction set architecture (ISA) extension for accelerating point cloud analytics. The central thesis is that existing domain-specific accelerators are inefficient due to a mismatch between their rigid, kernel-independent hardware and the diverse, evolving nature of point cloud workloads. The proposed solution involves: 1) new ISA instructions for common primitives like Euclidean distance and sorting; 2) a unified systolic array to execute these instructions and standard matrix multiplications; and 3) "co-designed" algorithms that restructure computations into a parallel "multiple-points-to-multiple-points" (MP2MP) pattern. The authors claim an average 5.4× speedup and 4.9× power efficiency improvement over a baseline CPU with a negligible 0.9% area overhead.
Strengths
- Problem Formulation: The paper correctly identifies a valid and significant problem in the domain: the poor resource utilization of coarse-grained, heterogeneous accelerators for point cloud tasks (Observation I, Section 3.1, Page 5). The analysis of PointAcc's utilization is a strong motivator.
- Co-Design Principle: The core idea of tightly coupling algorithmic transformations with microarchitectural features is a sound engineering principle. The identification of the SP2MP vs. MP2MP parallelism mismatch (Observation III, Section 3.1, Page 5) provides a clear target for optimization.
- Unified Architecture: The decision to augment an existing hardware structure (the SME systolic array) rather than adding entirely separate functional units is a sensible approach to minimizing area overhead.
Weaknesses
My primary concerns with this work center on the questionable strength of the baseline, unsubstantiated claims of algorithmic equivalence, and a lack of sufficient detail in the hardware implementation.
-
Weakness of the Baseline Comparison: The performance claims are entirely dependent on the quality of the baseline implementation. The authors state they developed "self-optimized kernels based on the Point Cloud Library (PCL)[38] reference implementation" (Section 7.1, Page 11). PCL is a functional reference, not a performance-tuned library. A state-of-the-art baseline would involve expert-level optimization using ARM SVE/SME intrinsics, potentially employing techniques like optimized bitonic sorting for the top-K operations and aggressive data layout transformations. Without a comparison to such a highly-tuned software baseline, the reported 5.4x speedup is likely inflated and does not convincingly prove the necessity of new hardware instructions over superior software engineering.
-
Unsubstantiated Claim of Algorithmic Correctness: The paper makes the critical claim that the proposed "lazy" algorithms maintain "algorithmic correctness" (Abstract, Page 1). Farthest Point Sampling (FPS) is a deterministic, greedy algorithm in which the selection at step i depends on the exact set of points chosen in steps 0 to i-1. The proposed Lazy FPS (Algorithm 1, Page 9) appears to alter this fundamental dependency by using potentially stale distance information for candidate selection (lines 7-9). This fundamentally changes the algorithm. The authors provide no proof, or even a rigorous argument, that their modified algorithm produces a set of sampled points identical to that of the canonical FPS algorithm. At best, it is a heuristic approximation; this must be stated clearly, and its impact on downstream task accuracy must be evaluated. The claim of maintaining correctness is a significant logical flaw.
Insufficient Microarchitectural Detail: The mechanism for multi-dimensional sorting on the systolic array is described at a very high level (Section 5.3, Page 8; Figure 9, Page 9). A systolic array is naturally suited for regular, data-flow computations like matrix multiplication. Sorting is a comparison- and routing-intensive operation. The paper does not adequately explain how the inter-PE data paths and control logic manage the complex data exchange required for a two-stage sort. The claim of a "speedup of 33% compared to the traditional systolic sort architecture[48]" is presented without sufficient evidence or analysis of the underlying data flow and control.
-
Optimistic Overhead Analysis: The claimed 0.9% area overhead relative to an ARM Cortex-A72 processor (Section 7.3, Page 12) is highly suspect. Such comparisons are notoriously difficult to make fairly. The authors must provide the exact configuration of the baseline A72, clarify whether its area figure (derived from McPAT and scaled) is for the same 28nm technology node, and specify what is included (e.g., L1/L2 caches, FPU). A 0.547 mm² systolic array, even if reusing some SME components, is a non-trivial piece of hardware, and a sub-1% overhead figure requires much stronger justification.
Questions to Address In Rebuttal
-
Baseline Strength: Can you provide evidence that your "self-optimized" SVE/SME baseline is competitive with a state-of-the-art, manually-tuned software implementation? Please provide a micro-benchmark comparison for a core operation (e.g., finding the top-K smallest values among N distances) between your proposed
HSTWIinstruction and a highly optimized SVE/SME software routine. -
Algorithmic Equivalence: Please provide a formal proof that your Lazy FPS algorithm is guaranteed to produce an output identical to the standard greedy FPS algorithm. If it is not identical, please re-characterize it as a heuristic and provide an empirical evaluation of its deviation from standard FPS and the resulting impact on the final accuracy of a downstream task like classification on ModelNet40.
-
Hardware Sorting Mechanism: Please provide a more detailed microarchitectural diagram and a cycle-by-cycle example of how the
HSTWIinstruction executes on the 8x8 PE array for a small example (e.g., sorting 8 elements in a row). The explanation should clarify the data movement between PEs and the control signals involved. -
Area Overhead Calculation: Please provide the full configuration details of the baseline ARM Cortex-A72 processor used for the 0.9% area overhead comparison, including its technology node, cache sizes, and the source of the area data, to validate the comparison's fairness.
Review 2
Review Form
Reviewer: The Synthesizer (Contextual Analyst)
Summary
This paper presents PointISA, a novel set of Instruction Set Architecture (ISA) extensions designed to accelerate point cloud analytics. The authors identify a critical weakness in current domain-specific accelerators: they often use heterogeneous, kernel-independent hardware units, leading to low resource utilization due to the diverse and irregular nature of point cloud workloads.
The core contribution is a holistic co-design philosophy. Instead of just proposing new instructions, the authors couple a small, powerful set of primitives (primarily for Euclidean distance and sorting) with a significant algorithmic transformation. They redesign key algorithms like Farthest Point Sampling (FPS) and k-Nearest Neighbors (kNN) from their conventional single-point-to-multiple-point (SP2MP) pattern into a more hardware-friendly multiple-point-to-multiple-point (MP2MP) "lazy" computation pattern. This co-design is implemented on a versatile systolic array that cleverly extends existing CPU matrix engines, resulting in significant performance and efficiency gains (5.4x speedup, 4.9x power efficiency) with a remarkably low area overhead (0.9%).
Strengths
This is a well-executed and insightful piece of work that makes a compelling case for its approach. Its primary strengths are:
-
Excellent Problem Diagnosis and Motivation: The paper's framing in Section 3.1 is exceptionally clear. The three observations—low utilization of kernel-independent hardware, the lack of a "one-size-fits-all" algorithm for kernels like FPS/kNN, and the mismatched parallelism of SP2MP patterns—provide a robust and convincing foundation for the proposed solution. This work is not a solution in search of a problem; it is a direct and well-reasoned answer to a tangible issue in the field of 3D vision acceleration.
-
Elegant Co-Design Philosophy: The true novelty here is not just the ISA extensions themselves, but the tight integration of architecture and algorithm. The transformation of FPS and kNN into "lazy" MP2MP variants (Section 6, page 9) is the key insight that unlocks the hardware's potential. Many papers propose accelerators, but few demonstrate this level of thoughtful restructuring of the core algorithms to match the hardware's capabilities. The sensitivity analysis in Figure 17 (page 13), which cleanly separates the gains from the algorithm and the architecture, is a testament to this strong co-design methodology.
-
Pragmatic and Area-Efficient Hardware Design: The decision to augment an existing matrix extension (like ARM SME) rather than building a completely separate co-processor is both clever and practical. It directly addresses the cost sensitivity of embedded and mobile SoCs. The resulting 0.9% area overhead is a powerful argument for its real-world viability, positioning PointISA not as a standalone accelerator but as a low-cost, high-impact feature for a general-purpose CPU.
-
Flexibility and Future-Proofing: By offloading complex control flow (e.g., tree traversal, branching) to the host CPU and accelerating only the core computational primitives, PointISA offers a degree of flexibility that rigid, fixed-function accelerators lack. As point cloud analytics evolve—for instance, with a greater emphasis on high-dimensional feature spaces where tree-based methods falter—this ISA-based approach can adapt through software updates, a significant advantage over hardware that is hard-wired for a specific algorithm.
Weaknesses
While the core idea is strong, the paper could be improved by exploring the boundaries of its proposed solution more thoroughly.
-
Limited Scope of Primitives: The authors effectively abstract the dominant kernels (FPS, kNN) into distance computation and sorting. This is a powerful abstraction, but it raises the question of what lies beyond it. The paper would be strengthened by a discussion of other important point cloud operations that may not fit this model cleanly, such as voxelization, octree/KD-tree construction itself, or graph-based operations that require more complex connectivity information than simple sorting can provide. This would help contextualize PointISA as one part of a larger solution rather than a panacea.
-
Incomplete Comparison with Tree-Based Approaches: The paper rightly argues that tree-based methods are complex and can be inefficient in high-dimensional spaces. However, for the purely geometric (low-dimensional) tasks where they excel, a direct, quantitative comparison is missing. The baseline appears to be a brute-force software implementation. A more compelling comparison would involve pitting PointISA against a highly optimized software implementation of a tree-based algorithm (e.g., from the Point Cloud Library) running on the same baseline CPU. This would more clearly define the performance crossover point and the specific domains where PointISA offers the most benefit.
-
Programmability and Compiler Support: The paper mentions the use of LLVM intrinsics, which is the correct approach. However, the algorithmic transformation to the "lazy" MP2MP pattern is non-trivial and requires significant programmer effort to reason about batching and deferred updates. A brief discussion on the programmability challenges and the potential for future compiler techniques to automatically identify and transform standard SP2MP loops into the optimized MP2MP form would greatly enhance the work's practical significance.
Questions to Address In Rebuttal
-
The proposed primitives (distance and sorting) are shown to be highly effective for FPS and kNN. Can the authors elaborate on the applicability of PointISA to other emerging point cloud kernels? For example, could it accelerate parts of hash-based or voxel-based methods like those used in Minkowski Engine or SPVNAS?
-
For low-dimensional, geometry-based tasks, tree-based algorithms are often the state-of-the-art in software. How does the performance of PointISA on a geometric FPS or kNN task (e.g., on the ModelNet40 dataset) compare against a highly optimized tree-based software library running on your baseline ARM core?
-
The lazy, MP2MP algorithm transformation is key to this work's success. Could you comment on the software complexity involved? What is the level of programmer effort required to implement these patterns using your intrinsics, and do you foresee a path to automating this transformation within a compiler framework?
-
The analysis in Figure 14 (page 12) shows a significant reduction in L1 cache accesses but minimal change for L2/DRAM. As point clouds continue to grow in size, could L2/DRAM bandwidth become the next bottleneck? Does the MP2MP pattern offer any inherent advantages for prefetching or memory-level parallelism that could mitigate this?
Review 3
Review Form
Reviewer: The Innovator (Novelty Specialist)
Summary
This paper presents PointISA, a set of Instruction Set Architecture (ISA) extensions for accelerating point cloud analytics. The authors identify that existing domain-specific accelerators are often kernel-independent and suffer from low utilization and inflexibility. Their proposed solution is a tightly-coupled, in-CPU accelerator based on three core ideas: 1) A small set of new instructions designed to abstract common point cloud primitives like Euclidean distance (SEDS) and sorting with index tracking (HSTWI, etc.). 2) A unified and versatile systolic array architecture capable of executing both the new instructions and standard matrix multiplication. 3) An algorithm-architecture co-design strategy that transforms inherently sequential algorithms like Farthest Point Sampling (FPS) and k-Nearest Neighbors (kNN) into a parallel Multiple-Points-to-Multiple-Points (MP2MP) pattern, which the authors term a "lazy" computation approach.
My evaluation focuses exclusively on whether this set of ideas constitutes a genuinely novel contribution to the field.
Strengths
The primary novelty of this work lies in the holistic co-design of the ISA, microarchitecture, and algorithms for this specific domain. While individual components have conceptual precedents, their synthesis into a single, cohesive framework is original.
- Novelty in Application Domain for an ISA Extension: While the concept of ISA extensions for accelerating specific domains is well established (e.g., graphics, AI), its application to the fundamental geometric primitives of point cloud analytics (specifically FPS and kNN) is a novel direction. Prior work like K-D Bonsai [8] proposed ISA extensions for point cloud compression via tree traversal, but PointISA targets the core arithmetic bottlenecks of the analytical pipeline itself. The abstraction of operations like distance computation and sorted selection into primitives such as SEDS and HSTWI (Table 2, Page 7) is a clean and novel ISA design.
- Significant Algorithmic Novelty: The most compelling novel contribution is the "lazy" algorithmic transformation, particularly for Farthest Point Sampling (Section 6.1, Page 9). FPS is classically a pathologically sequential algorithm. The authors' idea of deferring distance updates and sampling multiple points in a batch is a non-trivial algorithmic modification that fundamentally changes the computation pattern from SP2MP to MP2MP. This reconstruction of the algorithm specifically to saturate the proposed hardware is a strong example of co-design and represents a novel approach to parallelizing FPS.
- Novelty in Architectural Philosophy: The paper's core argument against loosely-coupled, kernel-independent accelerators in favor of a tightly-integrated CPU extension is a novel philosophical stance in the context of point cloud hardware. This approach directly addresses the cited weaknesses of prior work (e.g., low utilization in PointAcc [26]), and the design of a single, versatile systolic array to handle disparate tasks (distance, sorting, GEMM) is a novel microarchitectural implementation of this philosophy.
Weaknesses
My concerns are primarily related to overstating the novelty of certain components when viewed in isolation.
- Limited Novelty of the kNN Algorithm Transformation: The "lazy" kNN search algorithm (Section 6.2, Page 10), where queries are batched and processed at each KD-tree node, is conceptually very similar to established techniques for improving efficiency in parallel tree-based searches. Batching queries at nodes to improve data locality and to amortize traversal costs is a known pattern in high-performance computing; a generic sketch of this standard pattern follows this list. The authors should more clearly articulate the delta between their method and prior art in the parallel kNN search literature. The novelty appears to lie more in the mapping of this known pattern onto their specific ISA than in a fundamentally new algorithm.
- Conceptual Precedents for Versatile Systolic Arrays: The idea of a "unified" or "versatile" systolic array is not entirely new. Research exists on making systolic arrays more flexible to support operations beyond standard GEMM. The paper presents its architecture (Section 5, Page 8) as novel but does not sufficiently contrast it with other attempts at creating flexible systolic arrays. The core additions, a subtractor and a comparator, are logical but incremental extensions; a toy behavioral sketch of such a multi-mode processing element also follows this list. The novelty is in the specific data paths for sorting, but the broader concept of a multi-purpose systolic array has been explored before.
- The ISA Extension Concept Itself: To be clear, the specific instructions are novel. However, the overarching idea of using ISA extensions to offload work from a general-purpose core is the foundational principle behind SIMD, vector processing, and modern AI extensions like ARM SME and Intel AMX, which the paper builds upon. The contribution is thus a novel instance of a well-known design pattern, not the invention of the pattern itself. This is a minor point but important for contextualizing the work's inventiveness.
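For reference, the established pattern alluded to under the first weakness, batching queries and re-partitioning them between children at every node so that each subtree is visited once per batch rather than once per query, is sketched below for the simpler 1-NN case. This is generic, prior-art-style code written for illustration (the types Node and Best and the functions build and search_batch are invented here), not a rendering of the paper's Algorithm 2.

```cpp
// Generic sketch of the established "batch the queries at each node" pattern for
// tree search (shown for 1-NN to stay short); not the paper's Algorithm 2.
#include <algorithm>
#include <array>
#include <limits>
#include <memory>
#include <vector>

using Point = std::array<float, 3>;

struct Node {
    Point pt;                 // splitting point stored at the node
    int axis = 0;             // splitting dimension
    std::unique_ptr<Node> left, right;
};

static float sq_dist(const Point &a, const Point &b) {
    float s = 0.f;
    for (int d = 0; d < 3; ++d) { float t = a[d] - b[d]; s += t * t; }
    return s;
}

// Standard median-split kd-tree build (reorders the input range in place).
std::unique_ptr<Node> build(std::vector<Point> &pts, int lo, int hi, int depth) {
    if (lo >= hi) return nullptr;
    int axis = depth % 3, mid = (lo + hi) / 2;
    std::nth_element(pts.begin() + lo, pts.begin() + mid, pts.begin() + hi,
                     [axis](const Point &a, const Point &b) { return a[axis] < b[axis]; });
    auto n = std::make_unique<Node>();
    n->pt = pts[mid];
    n->axis = axis;
    n->left = build(pts, lo, mid, depth + 1);
    n->right = build(pts, mid + 1, hi, depth + 1);
    return n;
}

struct Best { float d2 = std::numeric_limits<float>::max(); Point p{}; };

// Batched traversal: all queries that reach this node are processed together, so the
// node's splitting point is loaded once per batch instead of once per query, and the
// queries are re-partitioned between the children for the next level.
void search_batch(const Node *n, const std::vector<Point> &queries,
                  const std::vector<int> &active, std::vector<Best> &best) {
    if (!n || active.empty()) return;
    std::vector<int> go_left, go_right;
    for (int qi : active) {
        float d2 = sq_dist(queries[qi], n->pt);       // amortized node visit
        if (d2 < best[qi].d2) { best[qi].d2 = d2; best[qi].p = n->pt; }
        float diff = queries[qi][n->axis] - n->pt[n->axis];
        // Always descend the near side; keep the far side only if the splitting
        // plane is closer than the current best (the usual backtracking test).
        (diff < 0 ? go_left : go_right).push_back(qi);
        if (diff * diff < best[qi].d2) (diff < 0 ? go_right : go_left).push_back(qi);
    }
    search_batch(n->left.get(), queries, go_left, best);
    search_batch(n->right.get(), queries, go_right, best);
}
```

Calling build over the point set and then search_batch with the root, the query array, and the full index set answers the whole batch; the delta the authors need to articulate is what their dispatching mechanism adds beyond this baseline pattern.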
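To illustrate why the datapath additions read as incremental, a toy behavioral model of a single multi-mode processing element is shown below: multiply-accumulate for GEMM, subtract-square-accumulate for SEDS-style distances, and compare-exchange with index tracking for HSTWI-style sorting. The model is the reviewer's abstraction, not the paper's design, and it deliberately omits the parts where the real novelty would have to live: how the modes share registers, systolic links, and control.

```cpp
// Toy behavioral model of a multi-mode processing element (PE): one accumulator,
// three op classes. It abstracts away the shared datapaths, systolic links, and
// control logic, which is exactly where the claimed architectural novelty must live.
#include <limits>
#include <utility>

enum class PEOp { MulAcc, SubSqAcc, CmpExchange };

struct PE {
    float acc = 0.f;                                        // GEMM partial sum or partial squared distance
    float key = std::numeric_limits<float>::infinity();     // held key in sorting mode (starts "empty")
    int   idx = -1;                                         // index tracked alongside the key

    // GEMM mode: classic multiply-accumulate.
    void mul_acc(float a, float b) { acc += a * b; }

    // Distance mode: the incremental addition is a subtractor feeding the multiplier.
    void sub_sq_acc(float a, float b) { float d = a - b; acc += d * d; }

    // Sorting mode: compare-exchange that keeps the smaller key and forwards the
    // larger one (with its index) to the neighboring PE; needs a comparator and muxes.
    std::pair<float, int> cmp_exchange(float in_key, int in_idx) {
        if (in_key < key) { std::swap(in_key, key); std::swap(in_idx, idx); }
        return {in_key, in_idx};   // value passed on to the next PE in the array
    }
};
```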
Questions to Address In Rebuttal
The authors should focus their rebuttal on clarifying the precise boundaries of their novel contributions.
- Regarding the Lazy FPS Algorithm: Can the authors provide citations to the closest prior art in parallel or batched FPS algorithms and explicitly detail the conceptual difference? While I find this contribution strong, a more thorough comparison is needed to solidify its novelty.
- Regarding the Lazy kNN Algorithm: Please clarify how the proposed query dispatching and batching mechanism (Algorithm 2, Page 10) differs fundamentally from existing parallel KD-tree search algorithms that also aim to maximize data locality and hardware utilization by processing multiple queries concurrently within tree nodes.
- Regarding Hardware Versatility: The claim of a "versatile systolic array" could be strengthened. Can you contrast your design with prior academic or industrial work on multi-function systolic arrays? What is the key architectural insight, beyond adding subtractor/comparator units, that enables efficient execution of such a diverse instruction set (SEDS, HSTWI, GEMM)?
- Regarding Extensibility: The paper argues that abstracting primitives leads to flexibility. How would the proposed PointISA instructions support other common but structurally different point cloud algorithms, such as DBSCAN clustering or octree-based neighbor search, which do not map as cleanly to distance/sort primitives? This would test the robustness of the novel abstraction.